- Research article
- Open Access
Regularized estimation of large-scale gene association networks using graphical Gaussian models
- Nicole Krämer^{1}Email author,
- Juliane Schäfer^{2, 3} and
- Anne-Laure Boulesteix^{4, 5}
https://doi.org/10.1186/1471-2105-10-384
© Krämer et al; licensee BioMed Central Ltd. 2009
- Received: 5 May 2009
- Accepted: 24 November 2009
- Published: 24 November 2009
Abstract
Background
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
Results
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
Conclusion
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression based approaches when using subsampling.
Keywords
- Mean Square Error
- Partial Little Square
- Partial Correlation
- Lasso
- Partial Little Square Regression
Background
Besides Bayesian networks [1], auto-regressive models [2], and state-space models [3], graphical Gaussian models (GGMs) are a popular method for modeling genetic networks based on microarray transcriptome data. In the GGM methodology [4], which is considered in the present article, networks are represented as undirected graphs. Each vertex represents a gene, and an edge connects two genes if they are partially correlated. In contrast to correlation, which measures both direct and indirect interactions between pairs of variables, partial correlation measures the strength of direct interaction only. Since investigators are primarily interested in direct gene interactions, the GGM framework is attractive for modeling of regulatory networks: several recent methodological articles report successful applications of GGMs to the estimation of genetic networks from microarray data [5–10]. These approaches are used in numerous applied studies, e.g., for estimating Arabidopsis gene networks [11] or for the study of genetically mediated cortical networks [12].
Nonetheless, reconstructing GGMs from high-dimensional microarray data remains a difficult task. The standard estimation of partial correlations involves either the inversion of the sample covariance matrix, or the estimation of p least squares regression problems, where p is the number of genes. If the number n of observations (arrays) is much smaller than the number p of variables (genes), these approaches are inappropriate. Suitable alternatives are based either on regularized estimation of the (inverse) covariance matrix, or on regularized high-dimensional regression. The present paper focuses on the latter approach, and presents a comparative study on the use of various approaches to high-dimensional regression for covariance selection. The chosen methods are extensively compared in simulations and real data studies. Since for real data the ground truth (i.e. the true underlying network) is unknown, our performance analysis focuses on the similarities and differences between the investigated methods. In particular, we examine the connectivity and size of the resulting graphs, as the differences between the estimated networks. Moreover, we compare the stability of the methods with respect to subsampling and with respect to violations of i.i.d. assumptions.
In the remainder of this section, we give a brief overview of graphical Gaussian modeling in the classical setting with n > p. Subsequently, we discuss the case of high-dimensional data in the "Methods" section.
Gene Regulatory Networks and Graphical Gaussian Models
In the context of gene regulatory networks, each of the p genes is represented by a random variable X_{ i }(i = 1,..., p). For each pair of genes (i, j), we are interested in their partial correlation ρ_{ ij }with respect to all other genes, i.e. with respect to the set of random variables Ƶ_{\ij}= {X_{1},...,X_{ p }}\ {X_{ i }, X_{ j }}.
Given n observations (arrays) x_{1},..., x_{ n }∈ ℝ^{ p }of the set of p genes, the standard unbiased plug-in estimate for the partial correlation coefficients ρ_{ ij }in the case n > p can be formulated in two equivalent ways [4], as outlined below.
Notations
Formulation 1: Inversion of the Covariance Matrix
Formulation 2: Least Squares Regression
In the n > p setting, the two regression coefficients and always have the same sign. Hence, is well-defined. Moreover, it can be shown that both formulations 1 and 2 are equivalent [4] in the sense that they always yield the same estimate. In the n ≥ p setting, a test of the null hypothesis ρ_{ ij }= 0 is available using results on the distribution of .
In microarray data, the number n of samples is typically very small as compared to the number p of considered genes. Hence, the above framework is inappropriate for two reasons. Firstly, the standard estimate of the partial correlation matrix given by Eqs. (3) and (7) is not appropriate when n <p: in formulation 1, the estimated covariance matrix is typically ill-conditioned or even singular, and its generalized (Moore-Penrose) inverse has large mean squared error [6]. In formulation 2, the least squares criterion (5) is ill-posed and leads to overfitting. Hence, an alternative regularized estimate of the partial correlation matrix has to be used in the context of GGMs with high-dimensional data. The two formulations 1 and 2 lead to two different strategies for the regularized estimation of the partial correlations in the p ≫ n setting, which are reviewed in the Methods section.
Secondly, the testing approach mentioned above breaks down in the p ≫ n setting, since the sampling distribution of estimates under the null hypothesis of zero partial correlation is unknown. Two alternatives have been proposed in order to assess statistical significance: (i) methods based on sparse estimates of the partial correlation matrix that do not require separate testing, and (ii) methods based on empirical null modeling and (local) false discovery rate multiple testing [7, 13, 14].
Methods
This section reviews the available strategies for estimating GGMs in the p ≫ n setting: biased large-scale covariance estimation and regularized regression including our two novel variants (Ridge Regression and Adaptive Lasso).
Regularized Estimation of the (Inverse) Covariance Matrix
where the factor λ ∈ [0, 1] controls the shrinkage intensity. Let us assume a parametrization of covariances in terms of correlations and variances, whereas shrinkage is applied to the correlations and diagonal entries are left intact, i.e. the estimator does not shrink the variances. For correlation shrinkage, we consider the identity matrix as the most commonly employed shrinkage target. Notice that the optimal shrinkage intensity λ can be determined analytically and be estimated from the data. Thus, the resulting correlation shrinkage estimator is positive definite, and favorable properties carry over to derived quantities, such as sample partial correlations. Subsequently, model selection of the gene association network can be achieved using empirical null modeling and (local) false discovery rate multiple testing [7, 13, 14].
Estimates of the inverse covariance matrix can also be obtained using bootstrap aggregating (bagging) as a technique for variance reduction [15]. In some implicit way, the bootstrap procedure presumably helps to regularize the problem. However, bagging schemes are inferior to the shrinkage estimator [6], and computationally much more expensive. A recent extension using the augmented bootstrap [16] is in fact closely related to the shrinkage estimator [17, 18] and is expected to perform similarly.
In this paper, we use the correlation shrinkage based approach as a reference method in comparison with the regression based approaches to covariance selection.
Finally, recent novel approaches are to be noted that are based on, ℓ_{1} regularized maximum likelihood estimation in graphical Gaussian models [9, 19–21]. Corresponding inverse covariance estimates exploit the sparsity in the graphical structure and conduct parameter estimation and model selection simultaneously. However, despite recent advances in semidefinite programming computation remains challenging in practice due to the high-dimensionality and positive definiteness constraint [22].
Regularized Regression
Here, the strategy is to replace the least squares estimator in (6) by some regularized estimator of the regression coefficients that can be used in formula (7) to obtain estimators of the partial correlations. More formally, we define the following class of estimates of the partial correlations.
and 0 otherwise.
This definition ensures that the estimated partial correlation coefficients are always well-defined and that they lie in the interval [-1, 1]. Again, we can roughly distinguish between regression methods that require testing to construct the undirected graphs, and sparse regression methods.
In the rest of this subsection, we discuss two regularized regression methods (PLS and the Lasso) that have been proposed for the estimation of large-scale GGMs in the literature. Furthermore, we propose two additional attractive methods (ridge regression and the adaptive Lasso).
Partial Least Squares
Tenenhaus et. al. [23] suggest Partial Least Squares (PLS) regression [24, 25] as a plug-in for Def. 1. PLS is a method for supervised dimensionality reduction. It has its seed in the chemometrics community, but its success has lead to applications in various other scientific fields, e.g. in chemo- and bioinformatics [26, 27].
While the original formulation of PLS scales with the number p of variables, it is also possible to represent the algorithm in a way that it only scales with the number n of observations [28, 29]. This leads to a dramatic decrease in computation time for p ≫ n. Note that the number of PLS components is a model parameter that has to be optimized for each of the p regression models (4). The standard model selection techniques are cross-validation or information criteria based on degrees of freedom [30]. In the context of gene regulatory networks, Tenenhaus et.al. [23] propose to use the same number of components m for all p regression models. They observe empirically that the partial correlation coefficients (Def. 1) obtained from PLS regression reach a plateau when the number of PLS components m increases, and suggest a heuristic procedure to choose the smallest m for which the plateau is reached. However, in our experiments, we use the theoretically well-funded and popular cross-validation technique with k folds.
As the PLS coefficients are not sparse, the obtained partial correlations are in general non-zero. Thus, a statistical testing procedure has to be used to determine which edges are significant. (Alternatively, one might also use a sparsification of PLS as proposed by Chun & Keles [31].) In the present article, we use large-scale simultaneous hypothesis testing with local false discovery rate (fdr) level 0.2, in order to identify unusual outliers among the estimated partial correlations.
For the sake of completeness, let us mention in this section a variant of the PLS approach described above, which was recently suggested by Pihur et al. [10]. Instead of estimating the partial correlation using Eq. (7), they propose an alternative measure of correlation strength which is very similar to the PLS-based partial correlation coefficient except that, roughly speaking, the square root of the product of and is replaced by their sum. We remark that Pihur et. al. do not optimize the number of PLS components m and recommend to use m ≈ 3.
Ridge Regression
where λ > 0 denotes the penalty parameter. This leads to a reduction of variance and thus avoids overfitting. The solution obtained by ridge regression depends on the penalty parameter λ. In our paper, we use standard k-fold cross-validation to select the optimal amount of penalization λ. As ridge regression does not lead to sparse solutions, we use large-scale false discovery rate multiple testing [14] to test for significant edges, as described above in the subsection on PLS. Again, we adopt a level of 0.2.
The Lasso
where λ > 0 is the regularization parameter. With the ℓ_{1}-penalty, many estimated regression coefficients will be equal to 0. As a result, with variable selection in mind, the Lasso has a major advantage: a sparse estimator of the matrix of partial correlations is yielded and a graph can be obtained by assigning an edge between two genes if and only if ≠ 0. The choice of the penalty λ has to be determined for each of the p high-dimensional regressions successively. Again, this can be done using some cross-validation scheme or information criteria. Meinshausen & Bühlmann [33] motivate a choice of the penalty parameter that aims at controlling the probability of falsely connecting two nodes in the graph, i.e. that is a choice tailored to the graph structure. However, experiments [6] indicate that this approach leads to graphs that are too dense, i.e. too many edges are selected. Therefore, in this paper, we use the oracle penalty for optimal prediction that is determined using k-fold cross-validation.
The two-stage adaptive Lasso
Note that the amount of penalization in both the initial stage Lasso and the second stage Lasso with penalty weights is determined via k-fold cross-validation. The adaptive Lasso will be at least as sparse as the Lasso. For graphical Gaussian modeling, the adaptive Lasso estimates are used in Def. 1, and two genes are connected if and only if the partial correlation coefficient ≠ 0. We remark that for model selection, the optimal weights have to be determined in each of the k cross-validation splits. As the optimal weights themselves are determined via k-fold cross-validation, this implies that a lasso fit has to be computed k^{2} times! This leads to high computational costs.
Results
In this section, we perform extensive experiments to compare regression-based methods for reconstructing gene regulatory networks. We consider the recently proposed techniques PLS regression and Lasso regression, and the two additional methods, ridge regression and adaptive Lasso regression, that have not been applied in practice for this purpose before.
Overview of the methods
Simulations
The performance of the proposed methods is assessed in a simulation study with a set-up similar to [6]. The number of variables is fixed at p = 100, and various sample sizes ranging from 25 to 200 in steps of 25 are investigated. We consider two different scenarios. First, we simulate networks with varying degree of density and no network topology, and second, we investigate sparse networks with different network topologies (see additional file 1 for an illustration and below for a detailed explanation). These scenarios correspond to particular choices of the partial correlation matrix P(see below). For all experiments, a total of 20 replications are performed for each sample size to average out variability due to random sampling. For each replication, the data are drawn randomly from a multivariate normal distribution with correlation structure derived from P.
Varying degree of density
non-zero entries are constructed by first drawing the non-zero entries from a uniform distribution on [-1, 1] and then rescaling the non-diagonal entries to ensure that we obtain a feasible partial correlation matrix (for more details, see the R-package GeneNet[38]). Hence, the range of the non-zero partial correlations depend on the density of the network. If the network is rather dense, the absolute values of the non-zero partial correlation coefficients are very small compared to a sparse network. This is illustrated in the additional file 2. Here, we plot the histogram of the non-zero partial correlations for a random matrix P of density d. It is important to note that due to the small values, the reconstruction of the network becomes more delicate for a higher degree of density: it is more difficult to select the correct non-zero entries if their true vales are close to zero. We remark that this effect cannot be entirely eliminated by a more clever simulation design, and that the simulation of partial correlation matrices is far from trivial [39].
For each generated data set, P is then estimated based on PLS regression, ridge regression, the Lasso, the adaptive Lasso and the shrinkage covariance estimator, successively. For all regression-based methods, k = 5-fold cross-validation is used to optimize the model parameters, i.e. the number of components m for PLS and the penalty λ for ridge regression, the Lasso and the two-stage adaptive Lasso, respectively. For the Lasso and the adaptive Lasso, we follow the parametrization implemented in the lars package [40], based on the ratio of the ℓ_{1}-norm of the Lasso and the ℓ_{1}-norm of the least squares estimates. Specifically, the regularization parameter is chosen from an equidistant sequence between 0 and 1 of length 1000.
Furthermore, we normalize this parameter to avoid the peaking phenomenon at n = p (see [41] for details). For ridge regression, we consider a logarithmically spaced sequence l_{1},..., l_{1000} ranging from 10^{-10} to 10^{-1}. The candidate penalty parameters are then defined as λ_{ s }= l_{ s }n p (with s = 1,..., 1000). Finally, the range of the number of PLS components is from 1 to 15.
For sparse networks, the two sparse estimates based on the Lasso and the adaptive Lasso, respectively, yield a lower MSE compared to the three other methods that are not sparse and are likely to contain many non-zero but non-significant (small) entries, which ultimately lead to a higher MSE. This effect vanishes for higher degrees of density. A notable exception is PLS. For denser networks, the MSE becomes larger. These networks correspond to small absolute values of the entries in P. Therefore, we conjecture that PLS is not able to shrink the regression coefficients enough, as the regularization parameter m (number of components) is discrete. This is in contrast to the four other methods. Note however that for the reconstruction of the underlying networked topology the MSE is only of secondary interest.
For each investigated sample size, the resulting number of selected edges is displayed in the upper right panel of Figures 1, 2, 3, 4 and 5, while the horizontal line is the number of true edges. For sparse networks, the Lasso with its regularization parameter chosen to be prediction optimal tends to select too many edges. PLS, ridge regression and the approach based on shrinkage covariance estimation are in contrast far more conservative and rather select too few edges, even in the n > p case. The adaptive Lasso is less conservative and appears to be a promising alternative. Again, these differences vanish for higher degrees of densities. As remarked above, the reconstruction task becomes more difficult for higher degrees of density. This explains the low number of selected edges for higher degrees of density.
respectively. The panels illustrate that for sparse networks, the Lasso's comparatively high power comes at the prize of rather low true discovery rate. Again, the power decreases with the increase in density of the network. In many practical applications, we argue that it might be more valuable to report more stable results with fewer false positives.
However, it is to be noted that the non-sparse methods using fdr-based procedures for edge selection involve an arbitrary parameter: the fdr threshold (here 0.2). These methods can thus be made more or less sparse by changing the threshold value. To investigate the relative accuracy of the non-sparse methods independently of the particular fdr threshold, the same simulations are subsequently performed with other thresholds. In order to evaluate the ability of the three methods to detect non-zero partial correlations, their sensitivity and specificity are computed for these different fdr thresholds and displayed graphically in form of ROC curves (see additional files 3, 4, 5, 6, 7, 8, 9, 10, 11, 12). PLS and ridge regression yield very similar results. They slightly outperform the approach based on shrinkage covariance estimation. The sensitivity and specificity of the Lasso and the adaptive Lasso, which do not depend on a particular threshold, are depicted as single points. They are above the ROC curves of the three non-sparse methods, indicating good performance - especially for the adaptive Lasso.
Different network topologies
Real Data Study
Size of the data sets
data set | arrays | genes | time series | repeated measurements | size of full graph | Availability |
---|---|---|---|---|---|---|
ecoli1 | 23 | 100 | yes | no | 4 950 | R package plsgenomics[55] |
ecoli2 | 9 | 102 | yes | no | 5 151 | R package GeneNet[38] |
ara | 22 | 800 | yes | no | 319 600 | R package GeneNet |
t.cell10 | 100 | 58 | yes | yes | 1 653 | R package longitudinal[56] |
t.cell34 | 340 | 58 | yes | yes | 1 653 | R package longitudinal |
west | 49 | 3883 | no | no | 7 536 903 |
In real world scenarios, the ground truth, i.e. the true underlying network, is hardly ever known, and the performance of different methods cannot be determined in terms of MSE, power and tdr as in the simulation study. Nevertheless, it is possible to compare the performance of the different methods quantitatively. In particular, we investigate the size and the connectivity of the estimated graphs, their overlap the type of interaction between genes and their stability.
In a nutshell, the Lasso and adaptive Lasso select less edges than the other methods for all data sets except for the two data sets t.cell10 and t.cell34 with repeated measurements. With these two data sets, Lasso and adaptive Lasso yield complex graphs with as much as over 50% non-zero edges (t.cell34 data). This behavior is likely to be due to the longitudinal structure of the data that is not explicitly considered, since the standard Lasso regression method assumes independent observations. In contrast, longitudinal structures may be handled in an implicit way by methods using an fdr-based assessment, where the distribution under the null hypothesis is estimated from the data. To gather further evidence for our hypothesis, we average over the 10 replications in the two respective data sets. This leads to 10 observations for t.cell10 and 34 observations for t.cell34. On the averaged data, both Lasso and adaptive Lasso indeed select far less edges: For the averaged t.cell10, we obtain: 4.2% (Lasso), 2.0% (adaptive Lasso), 12.2% (PLS), 0.2% (Ridge), 0.2% (shrinkage). For the averaged t.cell34, we obtain 12.3% (Lasso), 4.8% (adaptive Lasso), 11.9% (PLS), 2.7% (Ridge), 0.1% (shrinkage).
PLS reconstructs very dense networks for five out of the six data sets (ecoli1, ara, t.cell10, t.cell34 and west). In combination with the high MSE that we observed in the simulations, we conjecture that PLS in combination with cross-validation is not the most reliable method for the reconstruction of networks. We believe that other model selection strategies or the incorporation of sparse PLS [31] are necessary in order to improve the performance of PLS.
Among the three methods with fdr-based assessment of the edges, i.e PLS, ridge regression and shrinkage covariance estimation, the latter procedure seems to be most conservative, whereas PLS identifies the highest number of edges. This result is consistent for all six real data sets and yields a refinement of the results presented in the simulation study, where these three methods performed similarly.
Overlap of the estimated graphs
data set | pls | ridge | lasso | adalasso | shrink | |
---|---|---|---|---|---|---|
ecoli1 | pls | 1.000 | 0.096 | 0.156 | 0.127 | 0.045 |
ridge | 0.686 | 1.000 | 0.600 | 0.457 | 0.390 | |
lasso | 0.496 | 0.267 | 1.000 | 0.581 | 0.165 | |
adalasso | 0.597 | 0.302 | 0.862 | 1.000 | 0.189 | |
shrink | 0.405 | 0.488 | 0.464 | 0.357 | 1.000 | |
% selected | 0.162 | 0.018 | 0.052 | 0.036 | 0.017 | |
ecoli2 | pls | 1.000 | 0.593 | 0.263 | 0.156 | 0.305 |
ridge | 0.651 | 1.000 | 0.309 | 0.197 | 0.388 | |
lasso | 0.297 | 0.318 | 1.000 | 0.520 | 0.311 | |
adalasso | 0.310 | 0.357 | 0.917 | 1.000 | 0.381 | |
shrink | 0.408 | 0.472 | 0.368 | 0.256 | 1.000 | |
% selected | 0.032 | 0.030 | 0.029 | 0.020 | 0.024 | |
ara | pls | 1.000 | 0.064 | 0.025 | 0.017 | 0.035 |
ridge | 0.590 | 1.000 | 0.151 | 0.108 | 0.377 | |
lasso | 0.535 | 0.352 | 1.000 | 0.579 | 0.361 | |
adalasso | 0.556 | 0.386 | 0.887 | 1.000 | 0.409 | |
shrink | 0.335 | 0.392 | 0.161 | 0.119 | 1.000 | |
% selected | 0.126 | 0.018 | 0.006 | 0.004 | 0.014 | |
t.cell10 | pls | 1.000 | 0.314 | 0.993 | 0.985 | 0.131 |
ridge | 0.956 | 1.000 | 1.000 | 1.000 | 0.422 | |
lasso | 0.141 | 0.047 | 1.000 | 0.795 | 0.020 | |
adalasso | 0.170 | 0.057 | 0.965 | 1.000 | 0.024 | |
shrink | 0.947 | 1.000 | 1.000 | 1.000 | 1.000 | |
% selected | 0.109 | 0.027 | 0.575 | 0.417 | 0.011 | |
t.cell34 | pls | 1.000 | 0.053 | 0.762 | 0.670 | 0.031 |
ridge | 1.000 | 1.000 | 1.000 | 1.000 | 0.583 | |
lasso | 0.345 | 0.024 | 1.000 | 0.643 | 0.014 | |
adalasso | 0.433 | 0.034 | 0.917 | 1.000 | 0.020 | |
shrink | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
% selected | 0.134 | 0.005 | 0.284 | 0.221 | 0.004 | |
west | pls | 1.000 | 0.089 | 0.017 | 0.008 | 0.041 |
ridge | 0.667 | 1.000 | 0.118 | 0.062 | 0.236 | |
lasso | 0.643 | 0.611 | 1.000 | 0.407 | 0.404 | |
adalasso | 0.673 | 0.694 | 0.884 | 1.000 | 0.458 | |
shrink | 0.632 | 0.488 | 0.161 | 0.084 | 1.000 | |
% selected | 0.086 | 0.011 | 0.002 | 0.001 | 0.006 |
Percentage of positive correlations
ecoli1 | ecoli2 | ara | t.cell10 | t.cell34 | west | |
---|---|---|---|---|---|---|
lasso | 61.2% | 79.6% | 63.6% | 59.0% | 65.7% | 77.2% |
adalasso | 61.9% | 81.9% | 65.3% | 61.0% | 67.2% | 78.2% |
pls | 57.5% | 77.8% | 54.7% | 60.0% | 66.1% | 56.1% |
ridge | 58.9% | 75.0% | 55.6% | 62.2% | 77.8% | 71.1% |
shrinkage | 50.0% | 70.4% | 55.3% | 73.7% | 71.4% | 72.0% |
The score is always ≤ 1, and the higher the value of κ, the more stable the methods are with respect to subsampling.
Stability of the Methods
data set | measure | pls | ridge | lasso | adalasso | shrink |
---|---|---|---|---|---|---|
ecoli1 | κ | 0.630 | 0.510 | 0.550 | 0.550 | 0.593 |
ranking of κ | 1 | 5 | 3.5 | 3.5 | 2 | |
ecoli2 | κ | 0.242 | 0.280 | 0.469 | 0.450 | 0.486 |
ranking of κ | 5 | 4 | 2 | 3 | 1 | |
t.cell10 | κ | 0.656 | 0.797 | 0.670 | 0.674 | 0.742 |
ranking of κ | 5 | 1 | 4 | 3 | 2 | |
t.cell34 | κ | 0.655 | 0.555 | 0.625 | 0.619 | 0.702 |
ranking of κ | 2 | 5 | 3 | 4 | 1 | |
mean ranking of κ | 3.25 | 3.75 | 3.125 | 3.375 | 1.5 |
Finally, the considered methods differ quite dramatically with respect to their run-time. As an illustration, we compared the run-time on the west data set, which contains 3883 genes. The approach based on shrinkage covariance estimation is by far the most efficient one (≈ 2 min), and all other methods scale within several hours: PLS ≈ 7.5 hours, ridge regression ≈ 10 hours, the Lasso ≈ 17 hours, and the adaptive Lasso ≈ 3.5 days. This can be seen as a major drawback of the methods relying on cross-validation schemes, especially the Lasso-based methods. While Ridge Regression and PLS allow a representation that only scales in the number of observations, Lasso and adaptive Lasso scale in the number of variables. Furthermore, adaptive Lasso requires nested cross-validation. Partial relief can be found in a parallel implementation. Alternatively, for high-dimensional data, one might consider to approximate the Lasso-based networks by first constructing a mildly sparse network without cross-validation (for example using the method described in [33]), and then to refine this network by running the (adaptive) Lasso with cross-validation.
Discussion
In this paper, we proposed and compared different methods to estimate partial correlation coefficients based on regularized regression techniques with applications to genetic networks. It is remarkable that while we focus on the framework of graphical Gaussian models (and do not consider alternative frameworks as e.g. Bayesian networks), the investigated methods nevertheless show clear differences. Hence, the employed regularization technique for graphical Gaussian models has a considerable effect.
In a simulation study, we assessed the performance of the considered methods in terms of estimation accuracy (MSE) and in terms of reverse engineering of the true underlying networked topology. As a result, the investigated non-sparse methods (PLS, ridge regression, and the approach based on shrinkage covariance estimation that served as a reference method) were found to perform similarly. It is to be noted that these methods have fdr-based significance testing in common. They are rather conservative with respect to the inclusion of edges when used with the standard fdr threshold 0.2. The Lasso tends to produce too "dense" structures, while the adaptive Lasso compensates for that by selecting edges in a two-step approach, therefore leading to sparser graphs. The latter two-stage approach is able to select relevant edges, even for small samples, while at the same time preventing to be too dense. For denser networks, the performances of the five methods are very similar. On real world data, the behavior of the non-sparse methods is again similar, except that PLS is less conservative than ridge regression and the approach using a shrinkage covariance estimator. A remarkable difference with respect to the different data sets is the behavior of the Lasso and the adaptive Lasso on the t.cell data sets. In contrast to the four other data sets, the t.cell data include replications, thus violating the assumption of independent samples. Consequently, the (adaptive) Lasso does not handle the underlying data structure correctly, while empirical null modeling seems to account for the decreased "effective" sample size in an implicit way.
Note that all investigated methods require the specification of tuning parameters that need to be optimized based on the available data. The choice of the model selection criterion itself strongly influences the results of the methods [50], especially for small n. As an example, the model selection procedure introduces a substantial amount of variation for the Lasso and the adaptive Lasso. In the real world study, we estimate the two graphs on two different random cross-validation splits, which leads to an overlap of only 88.4% on the west data, although the adaptive Lasso graph is defined as a subgraph of the Lasso graph. Hence, tuning parameters should be given much attention in future research when new methods are developed. Moreover, setting the parameters to fixed values without proper selection procedure (such as cross-validation) and just because they "yield nice results" is an incorrect and biased strategy which may favor the proposed novel method. Furthermore, from a computational point of view, a major strength of the shrinkage approach is that the optimal amount of regularization can be estimated from the data using an analytic formula, thus making time-consuming cross-validation procedures unnecessary.
We want to emphasize that there are interesting alternatives for the detection of significant edges that do not depend on sparsity penalties or testing based on local false discovery rates. For instance, Reverter & Chan [51] propose information theoretic measures for the reconstruction of gene co-expression networks. The comparative performance of these methods and their connections to the approaches investigated above may be explored in future research.
Finally, the methods discussed in this paper can potentially be used for detecting causal interactions [52, 53]. For instance, in the presence of longitudinal data, Arnold. et.al. [53] propose to identify the direction of interactions between variables by investigating partial correlations between time-shifted copies of the variables. Amongst others, they propose to estimate these partial correlations using Lasso regression, but other regression methods might be promising alternatives.
Conclusion
Comparison of the investigated methods
lasso | adaptive lasso | PLS | Ridge Regression | shrinkage | |
---|---|---|---|---|---|
properties | |||||
testing needed | no | no | yes | yes | yes |
run-time | high | very high | medium | medium | low |
simulation results | |||||
density | very dense | between lasso & non-sparse methods | very sparse | very sparse | very sparse |
mean-squared-error | low | low | high | medium | medium |
real-world-study | |||||
density | too dense on t.cell | too dense on t.cell | very dense | - | - |
repeated measurements | too dense | too dense | dense | - | - |
stability under resampling | medium | medium | medium | medium | good |
Performance
In the simulation, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. Both sparse and non-sparse methods can deal with cluster topologies in the network.
For PLS, we observe both a high MSE in the simulations and a high percentage of selected edges in some of the real data. In our opinion, this is an indication that PLS itself might not be too well-suited for the reconstruction of networks. The reasons are that PLS is not sparse by design, and that it does not shrink arbitrarily close to zero. Therefore, we suggest to incorporate sparse versions of PLS instead in future research.
On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. For data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions.
Stability
We compared the stability of the methods under two aspects. All regression-based methods are less stable than the shrinkage approach over different subsamples of the data, and within the regression-based approaches, there is no clear difference between sparse and non-sparse methods. However, the two sparse regression methods seem to be unstable with respect to violations of the i.i.d assumption of the samples.
Runtime
The computational load for the Lasso and in particular for the adaptive Lasso is considerable. For very high-dimensional data, this can constitute a severe limitation. The runtime might be decreased by applying parallel computation techniques or by preselecting a coarse network topology that does not rely on cross-validation. While PLS and Ridge Regression are slower than shrinkage, both of them are fairly fast to compute, as they allow a kernel representation, i.e. most of the computation scales in the number of samples and not in the number of variables.
Available Software
The regularized estimation of partial correlations and the construction of gene association networks with (adaptive) Lasso, ridge regression and PLS are implemented in the R package parcor[37] which is available from the CRAN repository http://cran.r-project.org/. The package relies heavily on the lars package [40]. For assigning statistical significane to the edges in the network, we use the fdrtool package [54]. We also provide an executable sheet for the simulations (additional file 13) and the real-world data (additional file 14).
Declarations
Acknowledgements
NK was supported by the BMBF grant FKZ 01-IS07007A (ReMind), and the FP7-ICT Programme of the European Community, under the PASCAL2 Network of Excellence, ICT-216886. Financial support for JS from DSM Nutritional Products Ltd. (Basel, Switzerland) is gratefully acknowledged. ALB was supported by the LMU-innovativ Project BioMed-S: Analysis and Modelling of Complex Systems in Biology and Medicine. We thank Lukas Meier and Mikio L. Braun for constructive comments on model selection, and Animesh Acharjee for helpful feedback on our R package "parcor".
Authors’ Affiliations
References
- Friedman N: Inferring Cellular Networks using Probabilistic Graphical Models. Science 2004, 303(5659):799–805. 10.1126/science.1094068View ArticlePubMedGoogle Scholar
- Yeung MKS, Tegnér J, Collins JJ: Reverse Engineering Gene Networks using Singular Value Decomposition and Robust Regression. Proceedings of the National Academy of Sciences 2002, 99(9):6163–6168. 10.1073/pnas.092576199View ArticleGoogle Scholar
- Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild D, Falciani F: Modeling T-cell Activation using Gene Expression Profiling and State-Space Models. Bioinformatics 2004, 20: 1361–1372. 10.1093/bioinformatics/bth093View ArticlePubMedGoogle Scholar
- Whittaker J: Graphical Models in Applied Multivariate Statistics. Wiley New York; 1990.Google Scholar
- Dobra A, Hans C, Jones B, Nevins J, Yao G, West M: Sparse Graphical Models for Exploring Gene Expression Data. Journal of Multivariate Analysis 2004, 90: 196–212. 10.1016/j.jmva.2004.02.009View ArticleGoogle Scholar
- Schäfer J, Strimmer K: A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology 2005, 4: 32. 10.2202/1544-6115.1175View ArticleGoogle Scholar
- Schäfer J, Strimmer K: An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks. Bioinformatics 2005, 21: 754–764. 10.1093/bioinformatics/bti062View ArticlePubMedGoogle Scholar
- Li H, Gui J: Gradient Directed Regularization for Sparse Gaussian Concentration Graphs, with Applications to Inference of Genetic Networks. Biostatistics 2008, 7(2):302–317. 10.1093/biostatistics/kxj008View ArticleGoogle Scholar
- Yuan M, Lin Y: Model Selection and Estimation in the Gaussian Graphical Model. Biometrika 2007, 94: 19–35. 10.1093/biomet/asm018View ArticleGoogle Scholar
- Pihur V, Datta S, Datta S: Reconstruction of Genetic Association Networks from Microarray Data. Bioinformatics 2008, 24(4):561–568. 10.1093/bioinformatics/btm640View ArticlePubMedGoogle Scholar
- Ma S, Gong Q, Bohnert HJ: An Arabidopsis Gene Network Based on the Graphical Gaussian Model. Genome Research 2007, 17: 1614–1625. 10.1101/gr.6911207PubMed CentralView ArticlePubMedGoogle Scholar
- Schmitt JE, Lenroot RK, Wallace GL, Ordaz S, Taylor KN, Kabani N, Greenstein D, Lerch JP, Kendler KS, Neale MC, Giedd JN: Identification of Genetically Mediated Cortical Networks: A Multivariate Study of Pediatric Twins and Siblings. Cerebral Cortex 2008, 18(8):1737–1747. 10.1093/cercor/bhm211PubMed CentralView ArticlePubMedGoogle Scholar
- Efron B: Large-Scale Simultaneous Hypothesis Testing: the Choice of a Null Hypothesis. Journal of the American Statistical Association 2004, 99: 96–104. 10.1198/016214504000000089View ArticleGoogle Scholar
- Strimmer K: A Unified Approach to False Discovery Rate Estimation. BMC Bioinformatics 2008, 9: 303. 10.1186/1471-2105-9-303PubMed CentralView ArticlePubMedGoogle Scholar
- Breiman L: Bagging predictors. Machine Learning 1996, 24: 123–140.Google Scholar
- Tyekucheva S, Chiaromonte F: Augmenting the Bootstrap to Analyze High Dimensional Genomic Data. TEST 2008, 17: 1–18. 10.1007/s11749-008-0098-6View ArticleGoogle Scholar
- Strimmer K: Comments on: Augmenting the Bootstrap to Analyze High Dimensional Genomic Data. TEST 2008, 17: 25–27. 10.1007/s11749-008-0101-2View ArticleGoogle Scholar
- Schäfer J: Comments on: Augmenting the Bootstrap to Analyze High Dimensional Genomic Data. TEST 2008, 17: 28–30. 10.1007/s11749-008-0102-1View ArticleGoogle Scholar
- d'Aspremont A, Banerjee O, Ghaoui LE: First-Order Methods for Sparse Covariance Selection. SIAM Journal on Matrix Analysis and its Applications 2008, 30: 56–66. 10.1137/060670985View ArticleGoogle Scholar
- Rothman A, Bickel P, Levina E, Zhu J: Sparse Permutation Invariant Covariance Estimation. Electronic Journal of Statistics 2008, 2: 494–515. 10.1214/08-EJS176View ArticleGoogle Scholar
- Witten D, Tibshirani R: Covariance-regularized regression and and classification for high-dimensional problems. Journal of Royal Statistical Society, Series B 2009, 71(3):615–636. 10.1111/j.1467-9868.2009.00699.xView ArticleGoogle Scholar
- Yuan M: Efficient Computation of ℓ_{1}Regularized Estimates in Gaussian Graphical Models. Journal of Computational and Graphical Statistics 2008, 17(4):809–826. 10.1198/106186008X382692View ArticleGoogle Scholar
- Tenenhaus A, Guillemot V, Gidrol X, Frouin V: Gene Association Networks from Microarray Data using a Regularized Estimation of Partial Correlation based on PLS Regression. IEEE Transactions on Computational Biology and Bioinformatics 2008. [http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.87]Google Scholar
- Wold H: Path Models with Latent Variables: The NIPALS Approach. In Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building. Edited by: HMB et al. Academic Press; 1975:307–357.View ArticleGoogle Scholar
- Wold S, Ruhe H, Wold H, Dunn WJ III: The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal of Scientific and Statistical Computations 1984, 5: 735–743. 10.1137/0905052View ArticleGoogle Scholar
- Saigo H, Krämer N, Tsuda K: Partial Least Squares Regression for Graph Mining. 14th International Conference on Knowledge Discovery and Data Mining (KDD2008) 2008, 578–586.Google Scholar
- Boulesteix AL, Strimmer K: Partial Least Squares: a Versatile Tool for the Analysis of High-Dimensional Genomic Data. Briefings in Bioinformatics 2007, 8: 32–44. 10.1093/bib/bbl016View ArticlePubMedGoogle Scholar
- Rosipal R, Trejo L: Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research 2001, 2: 97–123.Google Scholar
- Rosipal R, Krämer N: Overview and Recent Advances in Partial Least Squares. In Subspace, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science. Springer; 2006:34–51.View ArticleGoogle Scholar
- Krämer N, Braun ML: Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection. In Proceedings of the 24th International Conference on Machine Learning Edited by: Ghahramani Z. 2007, 441–448.Google Scholar
- Chun H, Keles S: Sparse partial least squares for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society 2009, 182(1):79–90.Google Scholar
- Hoerl A, Kennard R: Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 2000, 42: 80–86. 10.2307/1271436View ArticleGoogle Scholar
- Meinshausen N, Bühlmann P: High Dimensional Graphs and Variable Selection with the Lasso. Annals of Statistics 2006, 34(3):1436–1462. 10.1214/009053606000000281View ArticleGoogle Scholar
- Tibshirani R: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B 1996, 58: 267–288.Google Scholar
- Zhou S, Geer S, Bühlmann P: Adaptive Lasso for High Dimensional Regression and Gaussian Graphical Modeling. 2009, in press. arXiv:0903.2515v1Google Scholar
- Zou H: The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association 2006, 101(476):1418–1429. 10.1198/016214506000000735View ArticleGoogle Scholar
- Krämer N, Schäfer J: parcor: estimation of partial correlations based on regularized regression. 2009. [R package version 0.1] [R package version 0.1]Google Scholar
- Schäfer J, Opgen-Rhein R, Strimmer K: Reverse Engineering Genetic Networks using the GeneNet Package. R News 2006, 5/6: 50–53.Google Scholar
- Ruschhaupt M: Erzeugung von positiv definiten Matrizen mit Nebenbedingungen zur Validierung von Netzwerkalgorithmen für Microarray-Daten. PhD thesis. University of Munich; 2008.Google Scholar
- Hastie T, Efron B: lars: Least Angle Regression, Lasso and Forward Stagewise. 2007. [R package version 0.9–7] [R package version 0.9-7]Google Scholar
- Krämer N: On the Peaking Phenomenon in Model Selection for the Lasso. 2009, in press.http://arxiv.org/abs/0904.4416Google Scholar
- Kao K, Yang Y, Boscolo R, Sabatti C, Roychowdhury V, Liao J: Transcriptome-based Determination of Multiple Transcription Regulator Activities in Escherichia Coli by Using Network Component Analysis. Proceedings of the National Academy of Sciences 2004, 101(2):641–646. 10.1073/pnas.0305287101View ArticleGoogle Scholar
- Schmidt-Heck W, Guthke R, Toepfer S, Reischer H, Duerrschmid K, Bayer K: Reverse engineering of the stress response during expression of a recombinant protein. EUNITE 2004 European Symposium on Intelligent Technologies, Hybrid Systems and their Implementation on Smart Adaptive Systems 2004, 407–441.Google Scholar
- Smith S, Fulton D, Chia T, Thorneycroft D, Chapple A, Dunstan H, Hylton C, Zeeman S, Smith A: Diurnal Changes in the Transcriptome Encoding Enzymes of Starch Metabolism Provide Evidence for Both Transcriptional and Posttranscriptional Regulation of Starch Metabolism in Arabidopsis Leaves 1. Plant Physiology 2004, 136: 2687–2699. 10.1104/pp.104.044347PubMed CentralView ArticlePubMedGoogle Scholar
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson J Jr, Marks J, Nevins J: Predicting the Clinical Status of Human Breast Cancer by using Gene Expression Profiles. Proceedings of the National Academy of Sciences 2001, 98(2):11462–11467. 10.1073/pnas.201162998View ArticleGoogle Scholar
- Boulesteix AL, Slawski M: Stability and aggregation of ranked gene lists. Briefings in Bioinformatics 2009, 10(5):556–68. 10.1093/bib/bbp034View ArticlePubMedGoogle Scholar
- Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507. 10.1093/bioinformatics/btm344View ArticlePubMedGoogle Scholar
- Scutari M: Structure variability in Bayesian networks. 2009, in press.http://arxiv.org/abs/0909.1685Google Scholar
- Fleiss J: Measuring nominal scale agreement among many raters. Psychological Bulletin 1971, 76(5):378–382. 10.1037/h0031619View ArticleGoogle Scholar
- Boulesteix AL, Kondylis A, Krämer N: Comment on: Augmenting the bootstrap to analyze high dimensional genomic data. TEST 2008, 17: 31–35. 10.1007/s11749-008-0103-0View ArticleGoogle Scholar
- Reverter A, Chan E: Combining Partial Correlation and an Information Theory Approach to the Reversed-engineering of Gene Co-expression Networks. Bioinformatics 2008, 24(21):2491–2497. 10.1093/bioinformatics/btn482View ArticlePubMedGoogle Scholar
- Pellet JP, Elisseeff A: A Partial Correlation-Based Algorithm for Causal Structure Discovery with Continuous Variables. In Advances in Intelligent Data Analysis VII, 7th International Symposium on Intelligent Data Analysis Edited by: Berthold MR, Shawe-Taylor J, Lavrac N. 2007, 229–239.View ArticleGoogle Scholar
- Arnold A, Liu Y, Abe N: Temporal Causal Modeling with Graphical Granger Methods. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2007:66–75.Google Scholar
- Strimmer K: fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics 2008, 24: 1461–1462. 10.1093/bioinformatics/btn209View ArticlePubMedGoogle Scholar
- Boulesteix AL, Lambert-Lacroix S, Peyre J, Strimmer K: plsgenomics: PLS analyses for genomics. 2007. [R package version 1.2–2] [R package version 1.2-2]Google Scholar
- Opgen-Rhein R, Strimmer K: longitudinal: Analysis of Multiple Time Course Data. 2008. [R package version 1.1.4] [R package version 1.1.4]Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.