Principle of RegMOD method
Genes or proteins are linked to form a network by interactions, which facilitates biological functions. From the perspective of information flow, the information of biological systems contained in genes is transferred to proteins or other molecules for executing biological functions. In this paper, we do not distinguish gene with its product protein as used in the seminal paper of Ideker et al. [13]. The nodes in the interaction network represent genes and links between genes represent interactions including protein-protein interaction, protein-DNA interaction or other functional linkages.
Based on the assumption that interacting genes have similar active scores which measure the extent to which genes respond to a specific disease. The active score of a gene is defined by a nonlinear active scoring function which considers the cooperations among genes. To estimate these underlying active scores defined by the active scoring function, a kind of observed active scores defined as differentially expressed levels of genes is calculated from the case-control microarray data. The goal is to estimate the underlying active scores for each genes which approach to the observed active scores and capture the cooperative pattern simultaneously. This is formulated as a typical regression problem by fitting the active scoring function to the observed active scores. In this paper, we use the support vector regression method with a diffusion kernel to evaluate each gene's underlying active score. This procedure can be regarded as a smoothing process that gives each gene a new active score. It can eliminate acute changes among neighboring genes in the network and infer underlying active scores of genes whose expression level could not be measured (Figure 1). It can significantly improve the accuracy and robustness of the predicted activity of genes and reduce the effect of noise and incompleteness of the high-throughput data. Finally, the induced subnetworks of significantly scored genes form the active modules which are expected to be related with a specific phenotype or disease. Particularly, prioritizing genes according to their active scores can provide the order of gene's association level with a specific phenotype or disease.
Regression Model with Diffusion Kernel
Genes interacting with each other can be expressed as a network. The gene expression data provides valuable information on gene's activity, which can be represented by weights of genes and interactions in the network. Formally, an interaction network with weighted nodes and weighted edges can be expressed as G = (V, E, S, C), where node set V represents genes, edge set E represents interactions, S represents the scores or weights of nodes, i.e. the active level of each node, and C represents the weight of edges. Since coexpressed genes are more likely to function together, the edge weight is defined as absolute Pearson correlation coefficient of expression profile of two node genes as following
where y and z are two expression profiles,
and
are mean expression values. For a node subset V' ⊆ V, G' = (V', E', S', C'), where
,
and
, is the induced subnetwork of V' from G.
To model the relationship between active scores and genes, we construct a function in which underlying active score f
i
of gene i is taken as a response variable and genes x1, x2,..., x
n
are viewed as explain variables, i.e.,
where x
i
, i = 1,..., n is the attribute vector of each gene which contains the biological information. Clearly, each gene's active score is affected by other genes. The relationships between genes are involved in this nonlinear function rather than an additive function which assumes non-cooperativity among genes. In other words, in contrast to the existing methods, the proposed model can represent cooperative effect among genes.
By a nonlinear mapping ϕ as in [22, 23], we transform the relationship of genes and active scores from the input space of x into a feature space (such as reproducing kernel Hilbert space). Then, the nonlinear function f in the input space can be expressed as a linear function in the feature space
Based on the representation theorem [22, 23], u can be expressed as
in the reproducing kernel Hilbert space which has good theoretical properties. Using the kernel trick as mentioned in [22, 23], the inner products in the feature space can be calculated by a kernel function in the input space, i.e. k
ij
= ⟨ϕ(x
i
), ϕ(x
j
)⟩, i, j = 1, 2,..., n. Then
where k
ij
is the kernel function which is the similarity of gene i and gene j. Thus, it is unnecessary to know neither the mapping ϕ nor the feature space exactly due to the kernel trick. In other words, the transformed nonlinear function can be numerically evaluated simply by a linear mode, thereby greatly simplifying computation. Evaluating the kernel function on all pairs of genes yields a kernel matrix K = (k
ij
), i, j = 1, 2,..., n, which is symmetric and positive semi-definite. There are many kernels such as Gaussian kernel, polynomial kernel and spline kernel to describe the similarity of continuous variables. To model similarity of nodes in a network which has a discrete structure, the following diffusion graph kernel from [24] which simulates the heat transduction in the network is adopted [23],
where τ controls the magnitude of the diffusion and L is the graph Laplacian matrix
where
is the degree of node i. For edge-unweighted case, C
ij
= 1 if node i and j are adjacent and C
ij
= 0 otherwise. Note that even for an edge-weighted network, its diffusion kernel can be calculated in the same way. k
ij
describing the similarity of two nodes increases with the shortest-path distance and the number of all possible paths between them [24]. Thus, with diffusion kernel, f is calculated based on nodes' topological similarity in the network.
In the following part, we introduce how to define the observed active score for each gene. For a microarray dataset including two classes of samples, e.g. case and control, the observed active score of each gene is evaluated by the Signal to Noise Ratio (SNR). For the i th gene, its expression profile can be divided into two classes and its SNR active score can be calculated as
where μi 1and μi 2are the means of the expression levels of gene i in sample set 1 and sample set 2 respectively, and σi 1and σi 2are the standard deviations of gene i in sample set 1 and sample set 2 respectively. There are only a few number of samples contained in the microarray data which can hardly meet the statistical significance, therefore this simple metric SNR is more compatible and has been successfully used to detect disease related genes and gene sets [11, 25]. Other statistics such as t-statistics and statistics used in SAM [6] have also been tested. SNR performances better than other statistics, thus we adopt this metric for further analysis (See Figure S2 in the Additional file 1).
The SNR score of each gene w
i
is regarded as the observed value of underlying active score f
i
. Due to the noise and incompleteness of the microarray data, the observed active score may not well reflect the fact that interacting genes have similar active scores. Thus, using f to fit the observed score can predict the underlying active score which captures the cooperative patterns among genes and is most close to the observation. In practice, the fitting process is to estimate the parameters β in function f towards w
i
. This forms a nonlinear regression problem, which can be solved by the following support vector regression (SVR) [22] model,
where
is the regularization term and C* is a regularization constant, they control the complexity of f and prevent from overfitting. This problem is solved in its dual space by converting the primal problem to its dual problem which is a convex quadratic programming:
where α
i
and
are Lagrange multipliers, β
i
= (α
i
-
), ε is a small constant. Thus, the globally optimal solution can be obtained even for a large scale problem. In this study, the widely used software LIBSVM from [26] was employed to solve the SVR problem.
After f is fitted, each gene gets a new active score f
i
which is regarded as the underlying active score. Since the observed active score is noisy and has many outliers, the proposed method RegMOD can smooth out the outliers and estimate the underlying active score by integrating the network topological similarity between genes. A gene with a low w score but functioning as a bridge to connect high weighted genes will get a newly high f score. Conversely, a gene with a high w score but interacting with low score genes will get a newly low f score. The Matlab code of RegMOD is available in the Additional file 2.
Based on the estimated underlying active score, we extract the active modules from the network. The high scoring region in the network constitutes active modules which associate to the specific phenotype or disease. To extract these modules, the high scoring active gene set including up-regulated gene set VU
f
and down-regulated gene set VD
f
, are selected as follows,
where
is the median value of f and σ is the median absolute deviation of f, θ represents the fold change apart from the median. Genes with f value higher than the fold change are considered significantly active. In this paper, we found that θ = 6 is reasonable for obtaining biologically significant results in both practical applications (See Figure S3 in the Additional file 1). Then, we can obtain the induced subnetwork GU
f
(GD
f
) of VU
f
(VD
f
) from the interaction network. The connected components of GU
f
(GD
f
) correspond to the up-regulated (down-regulated) modules.
Performance measurements
To evaluate the performance of the accuracy of the identified active modules in the simulated dataset, the recall and precision (denoted as r and p respectively) are applied and defined as below:
The F-measure which is the harmonic mean of precision and recall is used to measure total accuracy:
Large F-measure value indicates good performance of the results. For each identified module by one method, we calculate the r, p and F-measure. The recall-precision plot and box-plot of F-measure are implemented to compare the performance of different methods.
To measure the topological similarity of two genes within a module, the following index is adopted:
where k
ij
is the element of the diffusion kernel matrix K. The logarithmic transformation could facilitate to visualize subtle changes. Other logarithm base, such as 2 or e, could also be used, which do not affect the results. The high value of similarity indicates tight or close relationship between two genes.
Prioritizing genes according to the absolute active scores can generate a gene ranking list. The distribution of disease related genes in this ranking list is used for evaluating the active score's biological meaning. The plot of coverage of disease genes versus the rank of genes is used (See Figure 2G and 3F, Figure S2, S3, S4 and S5 in the Additional file 1), i.e. each point in the figure represents the number of disease related genes in the set of genes with higher rank than a value.