Stability of building gene regulatory networks with sparse autoregressive models
- Jagath C Rajapakse^{1, 2, 3}Email author and
- Piyushkumar A Mundra^{1}
https://doi.org/10.1186/1471-2105-12-S13-S17
© Rajapakse and Mundra; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Abstract
Background
Biological networks are constantly subjected to random perturbations, and efficient feedback and compensatory mechanisms exist to maintain their stability. There is an increased interest in building gene regulatory networks (GRNs) from temporal gene expression data because of their numerous applications in life sciences. However, because of the limited number of time points at which gene expressions can be gathered in practice, computational techniques of building GRN often lead to inaccuracies and instabilities. This paper investigates the stability of sparse auto-regressive models of building GRN from gene expression data.
Results
Criteria for evaluating the stability of estimating GRN structure are proposed. Thereby, stability of multivariate vector autoregressive (MVAR) methods - ridge, lasso, and elastic-net - of building GRN were studied by simulating temporal gene expression datasets on scale-free topologies as well as on real data gathered over Hela cell-cycle. Effects of the number of time points on the stability of constructing GRN are investigated. When the number of time points are relatively low compared to the size of network, both accuracy and stability are adversely affected. At least, the number of time points equal to the number of genes in the network are needed to achieve decent accuracy and stability of the networks. Our results on synthetic data indicate that the stability of lasso and elastic-net MVAR methods are comparable, and their accuracies are much higher than the ridge MVAR. As the size of the network grows, the number of time points required to achieve acceptable accuracy and stability are much less relative to the number of genes in the network. The effects of false negatives are easier to improve by increasing the number time points than those due to false positives. Application to HeLa cell-cycle gene expression dataset shows that biologically stable GRN can be obtained by introducing perturbations to the data.
Conclusions
Accuracy and stability of building GRN are crucial for investigation of gene regulations. Sparse MVAR techniques such as lasso and elastic-net provide accurate and stable methods for building even GRN of small size. The effect of false negatives is corrected much easier with the increased number of time points than those due to false positives. With real data, we demonstrate how stable networks can be derived by introducing random perturbation to data.
Keywords
Background
Biological networks are constantly perturbed randomly and there exist efficient and compensatory mechanisms to withstand such instabilities. Constructing gene regulatory networks (GRN) from time-series gene expression data plays a vital role in understanding complex biological mechanisms and in the development of novel drugs. Though microarrays allow measurement of thousands of genes simultaneous, gene expressions in practice can be gathered only over a few time points due to high cost and time involved, and limitations of the experiments. This makes building GRN inherently an ill-posed problem in practice, leading such networks unstable and irreproducible. Moreover, variable and complex nature of biological sources, and measurement noise and artifacts, add to the challenges of constructing accurate and stable GRN.
A wide range of techniques for inferring GRN from microarray datasets have been proposed in the literature [1–4], including Boolean networks [5], linear networks [6, 7] differential equation models [8], and stochastic methods [9–12]. Despite the plethora of such techniques, stability of building GRN from gene expression data has not been addressed by the research community so far. In this paper, we investigate the stability of building GRN by using sparse linear models with simulations and real data.
The linear multivariate vector autoregression (MVAR) provides a simple and efficient technique to estimate regulatory relationships among genes. However, due to less number of time points compared to the number of genes whose expressions involved in gene expression datasets, several penalized MVAR techniques using regularization [3, 7, 13–16] and priors [17] have been considered for building GRN. The sparse MVAR techniques such as lasso and elastic-net uses a penalty function to drive small regulatory coefficient to zero, thereby producing computationally stable and sparse yet biologically plausible GRNs. Recently, their efficacy in building GRN have been demonstrated [6, 7].
A good technique of building GRN should not only be accurate but also be reproducible and stable; biologists will then be able to test complex hypothesis with confidence on in silico networks. Stability means that the network construction is robust to changes of network topology and parameters, and biological and instrumental noise. In this paper, we first introduce novel criteria for evaluating stability of building GRN at the level of connections and networks. MVAR models of ridge, lasso, and elastic-net penalty are evaluated with respect to their accuracies and stabilities by using synthetic gene expression datasets. In particular, we investigate how many time points of gene expressions are needed for a network of given topology and size. Using a real data set gathered in HeLa cell cycle [18], we demonstrate how random perturbations in the data could be induced to determine stable genes in the network.
Methods
The above matrix cannot be determined if the number of genes are larger than number of samples (that is, T <<I).
where λ and α are regularization parameters. When α = 0 the loss function represents ridge MVAR, α = 1 represents lasso, and α ∈ (0,1) represents an elastic-net penalty. In other words, lasso uses L_{1}-norm, ridge regression uses L_{2}-norm, and elastic-net uses both L_{1}-norm and L_{2}-norm in its penalty term. The penalty terms attempt to drive small regression coefficients to zero, making networks sparse and computationally well-posed. The lasso penalty is more capable of driving regression coefficients to zero than the ridge penalty.
Synthetic data
where d_{i} denotes the number of adjacent edges not initiated by gene i itself (which approximates to the in-degree of gene i). The parameters γ denotes the power of preferential attachment and b the attractiveness of the gene with no adjacent edges.
Parameters used for generation of synthetic network
Parameter | Values |
---|---|
Number of Bootstraps (B) | 100 |
Number of Genes | 10, 50, or 100 |
Number of time points | 10, 30, 50, 70 |
Regression coefficient cut-off for edge detection (∈) | 0.0001 |
Real data
To investigate to the stability of real biological networks, we use lasso and elastic-net methods on Human cancer cell Line (HeLa) cell-cycle gene expression dataset [18]. The dataset used in this study is described as experiment 3 (http://genome-www.stanford.edu/Human-CellCycle/Hela/) and contains 48 time points where gene expressions were measured at the interval of one hour and synchronized by double thymidine block. Based on relevance to cell cycle and tumor development, 91 genes were selected. This dataset was previously used to demonstrate the efficacy of sparse lasso MVAR techniques in building GRN [6]. In order to investigate stability random perturbations, we added noise perturbations at each time point by adding random samples from Gaussian noise N(0, (σ + δ)^{2}); we define where σ is the standard deviation of residuals of real data and δ is a perturbation constant. For real dataset, the standard deviation of residuals was, σ = 0.225 and σ = 0.23, respectively, for models built with lasso and elastic-net MVAR models. By generating 100 gene expression datasets for each of δ such that SNR ∈ [0.01, 4], the stability of GRN built were studied against noise perturbations.
Stability of structure
where d denotes the Hamming distance and |s^{ b }| denotes the number of connections in the network s^{ b }. In (7), the stability ρ ∈ [0, 1], where higher values denote high stability of the GRN inference algorithm. Here, stability is measured with respect to the whole network and the Hamming distance takes into account both the presence and absence of a regulatory connection.
Implementation
Scale-free network topologies were generated using igraph R package [21]. The stability of networks are studied with penalized regression methods. Each of these MVAR method has regularization parameters. Although computationally expensive, we used the leave-one-out cross-validation to estimate the λ for ridge MVAR and α and/or λ for sparse models. For ridge estimate, the λ was selected from the set of {0.001, 0.1, 1, 10, 100}. For lasso and elastic net, the glmnet package [22] is used which could generate the whole solution path for λ for a given α value (α = 1 for lasso). Hence, the α values are selected from a set of {0.1, 0.2, …, 0.9} for elastic-net model.
Here, t(d) is t-distribution with d degree of freedom (dof); σ^{2} is an estimate of the error of variance ; and w_{ k } is the j-th diagonal element of ((Z′Z + λW^{–})^{–1}Z′Z(Z′Z + λW^{–})^{–1}) where W^{ – } denotes generalized inverse of W, a diagonal matrix with diagonal elements |β_{ k }|. To achieve closed form solution, this matrix is estimated using ridge coefficients using the tuned λ for lasso penalty [24]. For lasso, the number of nonzero coefficients is an unbiased estimate for the degrees of freedom [25]. The dof for elastic-net were also assumed to be the number of nonzero coefficients. Only the statistically significant edges were evaluated for stability analysis of sparse MVAR on real datasets.
In this paper, we use fast coordinate descent algorithm for optimizing lasso and elastic net loss function. The codes are provided with glmnet MATLAB package. Ridge regression is performed using standard MATLAB function.
Results on synthetic dataset
To study the effects of the number of time points on the stability, GRN consisting of I ∈ {10, 50, 100} number of genes were simulated using scale-free networks with topologies having power-law coefficient γ ∈ [ 2.10, 2.30 ]. Thus, temporal gene expressions for T ∈ {10, 30, 50, 70} number of time points for a given network topology were generated. For given I, γ, and T, B = 100 number of bootstrapped datasets were generated by randomly initiating gene expressions at the first time point.
Performances of lasso, elastic-net (EN), and ridge MVAR models on synthetic datasets with varying numbers of genes and time points.
Number of genes | Number of time points | Method | True positives | False positives | Precision | Recall | F-measure | Stability |
---|---|---|---|---|---|---|---|---|
EN | 5.86 | 7.94 | 0.47 | 0.65 | 0.53 | 0.41 | ||
10 | Lasso | 5.19 | 4.88 | 0.57 | 0.58 | 0.55 | 0.46 | |
Ridge | 8.99 | 90.70 | 0.09 | 1.00 | 0.16 | 0.99 | ||
EN | 8.66 | 8.02 | 0.56 | 0.96 | 0.69 | 0.64 | ||
30 | Lasso | 8.54 | 5.79 | 0.63 | 0.95 | 0.75 | 0.65 | |
Ridge | 8.99 | 90.55 | 0.09 | 1.00 | 0.16 | 0.99 | ||
EN | 8.99 | 5.91 | 0.64 | 1.00 | 0.77 | 0.71 | ||
10 | 50 | Lasso | 8.99 | 4.12 | 0.71 | 1.00 | 0.82 | 0.75 |
Ridge | 8.98 | 90.53 | 0.09 | 1.00 | 0.16 | 0.99 | ||
EN | 9.00 | 5.12 | 0.67 | 1.00 | 0.79 | 0.72 | ||
70 | Lasso | 8.99 | 3.92 | 0.72 | 1.00 | 0.83 | 0.76 | |
Ridge | 8.99 | 90.35 | 0.09 | 1.00 | 0.16 | 0.99 | ||
EN | 17.29 | 131.50 | 0.12 | 0.35 | 0.18 | 0.15 | ||
10 | Lasso | 13.53 | 42.81 | 0.25 | 0.28 | 0.26 | 0.15 | |
Ridge | 48.90 | 2429.90 | 0.02 | 1.00 | 0.04 | 0.99 | ||
EN | 32.73 | 95.75 | 0.26 | 0.67 | 0.37 | 0.27 | ||
30 | Lasso | 32.37 | 72.41 | 0.31 | 0.66 | 0.42 | 0.31 | |
Ridge | 48.90 | 2431.70 | 0.02 | 1.00 | 0.04 | 0.99 | ||
EN | 39.39 | 77.88 | 0.34 | 0.80 | 0.48 | 0.35 | ||
50 | 50 | Lasso | 39.15 | 0.39 | 0.80 | 0.82 | 0.52 | 0.38 |
Ridge | 48.9 | 2436.20 | 0.02 | 1.00 | 0.04 | 0.99 | ||
EN | 48.82 | 98.11 | 0.34 | 1.00 | 0.50 | 0.41 | ||
70 | Lasso | 48.82 | 78.38 | 0.39 | 1.00 | 0.56 | 0.45 | |
Ridge | 49.00 | 2432.90 | 0.02 | 1.00 | 0.04 | 0.99 | ||
EN | 22.33 | 259.46 | 0.08 | 0.23 | 0.12 | 0.08 | ||
10 | Lasso | 14.91 | 69.95 | 0.18 | 0.15 | 0.16 | 0.08 | |
Ridge | 98.31 | 9759.90 | 0.01 | 1.00 | 0.02 | 0.98 | ||
EN | 58.67 | 214.17 | 0.22 | 0.59 | 0.32 | 0.20 | ||
30 | Lasso | 56.37 | 153.76 | 0.27 | 0.57 | 0.37 | 0.22 | |
Ridge | 98.71 | 9777.50 | 0.01 | 1.00 | 0.02 | 0.99 | ||
EN | 79.10 | 190.56 | 0.30 | 0.80 | 0.43 | 0.29 | ||
100 | 50 | Lasso | 78.60 | 164.55 | 0.33 | 0.79 | 0.46 | 0.32 |
Ridge | 98.80 | 9792.20 | 0.01 | 1.00 | 0.02 | 0.99 | ||
EN | 87.62 | 160.01 | 0.36 | 0.89 | 0.51 | 0.36 | ||
70 | Lasso | 87.54 | 144.63 | 0.38 | 0.88 | 0.53 | 0.38 | |
Ridge | 98.81 | 9805 | 0.01 | 1.00 | 0.02 | 0.99 |
Results on real dataset
Discussion
Novel measures of evaluating stability of building GRN were introduced. Thereby, stability and accuracy of sparse MVAR models in building GRN structure were studied using synthetic datasets. The results suggest that to achieve decent accuracy and stability with sparse MVAR methods, at least a number of time points equal to the number of genes are required. But as the network size grows, the number of time points required is less compared to the number of genes in the network. It is easier to ameliorate the effects of false negatives than the false positives by increasing the number of time samples of gene expressions. The results indicates that lasso MVAR and elastic-net perform equally on datasets in general, though lasso handles false positives better. However, elastic-net performed better than lasso on real dataset. This is because elastic-net penalty has an ability to predict regulatory relationship between highly correlated genes while lasso will only predict one of them [31]. As genes are highly correlated, elastic-net or their improved versions proves better making inferences on large scale GRNs [7]. In simulated datasets, correlations among gene expression were not simulated.
To the best of our knowledge, this is the first study introducing stability criteria or studying methods building GRN. Our investigations here were focused on the stability of sparse MVAR models. Our work could be extended for other approaches as well. For small networks, sparse linear models to build networks were stable and accurate. Furthermore, effects of the number of time points on the stability were studied by experimenting on scale-free networks of different topologies. As the network size grows, the number of time points required for building GRN is less than the number of genes in the GRN. Stability is inherent problem in practice as real datasets consist of a large number of genes and short time-series. With an application of real dataset, we demonstrate how stable GRN can be derived by introducing perturbations to the gene expression datasets. Only a few statistically significant edges and associated gene hubs are stable and can withstand small amount of perturbation. Biologically, more stable genes have preserve more significant roles of the biological process of the network. This research emphasizes the need for building GRN that are accurate, stable, and reproducible, so that the structures derived are robust against noise and perturbation of data. In addition by perturbing gene expressions, more accurate and stable core genes and subnetworks can be inferred from temporal gene expression data.
Declarations
Acknowledgements
We thank Prof Andre Fujita for providing the reduced HeLa cell cycle dataset. This work was partly supported by a ARC 9/10 grant to J. C. Rajapakse by Ministry of Education, Singapore.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 13, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S13.
Authors’ Affiliations
References
- Huang Y, Tienda-Luna I, Wang Y: Reverse engineering gene regulatory networks. IEEE Signal Processing Magazine 2009, 26: 76–97.PubMed CentralView ArticlePubMedGoogle Scholar
- Chao S, Hua J, Jung S: Inference of gene regulatory networks using time-series data: a survey. A Survey. Current Genomics 2009, 10: 416–429.View ArticleGoogle Scholar
- Fogelberg C, Palade V: Machine learning and genetic regulatory networks: a review and a roadmap. In Foundations of Computational Intelligence. Edited by: Hassanien AE, Abraham A, Vasilakos A, Pedrycz W. Stoneham: Butterworth-Heinemann, Springer Verlag; 2009.Google Scholar
- Kim S, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics 2003, 4(3):228–235. 10.1093/bib/4.3.228View ArticlePubMedGoogle Scholar
- Li P, Zhang C, Perkins E, Gong P, Deng Y: Comparison of probabilistic Boolean network and dynamic Bayesian network approaches for inferring gene regulatory networks. BMC Bioinformatics 2007, 8(Suppl 7):S13. 10.1186/1471-2105-8-S7-S13PubMed CentralView ArticlePubMedGoogle Scholar
- Fujita A, Sato J, Garay-Malpartida H, Yamaguchi R, Miyano S, Sogayar M, Ferreira C: Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Systems Biology 2007, 1: 39. 10.1186/1752-0509-1-39PubMed CentralView ArticlePubMedGoogle Scholar
- Shimamura T, Imoto S, Yamaguchi R, Fujita A, Nagasaki M, Miyano S: Recursive regularization for inferring gene networks from time-course gene expression profiles. BMC Systems Biology 2009, 3: 41. 10.1186/1752-0509-3-41PubMed CentralView ArticlePubMedGoogle Scholar
- de Jong H: Modeling and simulation of genetic regulatory systems : a literature review. Journal of Computational Biology 2002, 9: 67–103. 10.1089/10665270252833208View ArticlePubMedGoogle Scholar
- Friedman N, Linia M, Nachman I, Peér D: Using Bayesian networks to analyze expression data. Journal of Computational Biology 2000, 7(3–4):601–620. 10.1089/106652700750050961View ArticlePubMedGoogle Scholar
- Ferrazzi F, Sebastiani P, Ramoni M, Bellazzi R: Bayesian approaches to reverse engineer cellular systems: a simulation study on nonlinear Gaussian networks. BMC Bioinformatics 2007, 8(Suppl 5):S2. 10.1186/1471-2105-8-S5-S2PubMed CentralView ArticlePubMedGoogle Scholar
- Murphy K, Mian S: Modelling gene expression data using dynamic bayesian networks. Tech. rep 1999.Google Scholar
- Chaturvedi I, Rajapakse JC: Building gene networks with time-delayed regulations. Pattern Recognition Letters 2010, 31: 2133–2137. 10.1016/j.patrec.2010.03.002View ArticleGoogle Scholar
- Zhao W, Serpedin E, Dougherty E: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 2006, 22(17):2129–2135. 10.1093/bioinformatics/btl364View ArticlePubMedGoogle Scholar
- Menéndez P, Kourmpetis Y, ter Braak C, van Eeuwijk F: Gene regulatory networks from multifactorial perturbations using Graphical Lasso: application to the DREAM4 Challenge. PLOS One 2010, 5(12):e14147. 10.1371/journal.pone.0014147PubMed CentralView ArticlePubMedGoogle Scholar
- Grzegorczyk M, Husmeier D: Improvements in the reconstruction of time-varying gene regulatory networks: dynamic programming and regularization by information sharing among genes. Bioinformatics 2011, 27(5):693–699. 10.1093/bioinformatics/btq711View ArticlePubMedGoogle Scholar
- Fujita A, Sato J, Garay-Malpartida H, Morettin P, Sogayar M, Ferreira C: Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics 2007, 23(13):1623–1630. 10.1093/bioinformatics/btm151View ArticlePubMedGoogle Scholar
- Geier F, Timmer J, Fleck C: Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge. BMC Systems Biology 2007, 1: 11. 10.1186/1752-0509-1-11PubMed CentralView ArticlePubMedGoogle Scholar
- Whitfield M, Sherlock G, Saldanha A, Murray J, Ball C, Alexander K, Matese J, Perou C, Hurt M, Brown P, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Molecular Biology of the Cell 2002, 13(6):1977–2000. 10.1091/mbc.02-02-0030.PubMed CentralView ArticlePubMedGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer; 2001.View ArticleGoogle Scholar
- Barabasi A, Albert R: Emergence of scaling in random networks. Science 1999, 286: 509–512. 10.1126/science.286.5439.509View ArticlePubMedGoogle Scholar
- Csardi G, Nepusz T: igraph: Network analysis and visualization.[http://cran.r-project.org/web/packages/igraph/index.html]
- Friedman J, Hastie T, Tibshirani R: glmnet: Lasso and elastic-net regularized generalized linear models.[http://cran.r-project.org/web/packages/glmnet/index.html]
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of Royal Statistical Society: Series B 1995, 57: 289–300.Google Scholar
- Tibshirani R: Regression shrinkage and selection vias the Lasso. Journal of Royal Statistical Society: Series B 1996, 58: 267–288.Google Scholar
- Zou H, Hastie T, Tibshirani R: On the degrees of freedom of the lasso. Annals of Statistics 2007, 35(5):2173–2192. 10.1214/009053607000000127View ArticleGoogle Scholar
- Bretz J, Garcia J, Huang X, Kang L, Zhang Y, Toellner KM, Chen-Kiang S: Noxa mediates p18INK4c cell-cycle control of homeostasis in B cells and plasma cell precursors. Blood 2011, 117(7):2179–2188. 10.1182/blood-2010-06-288027PubMed CentralView ArticlePubMedGoogle Scholar
- Petrucci M, Ricciardi M, Ariola C, Gregorj C, Ribersani M, Savino R, Ciliberto G, Tafuri A: Cell cycle regulation and induction of apoptosis by IL-6 variants on the multiple myeloma cell line XG-1. Annals of Hematology 1999, 78: 13–18. 10.1007/s002770050465View ArticlePubMedGoogle Scholar
- Lawler J: Thrombospondin-1 as an endogenous inhibitor of angiogenesis and tumor growth. J. cellular and molecular medicine 2002, 6: 1–12. 10.1111/j.1582-4934.2002.tb00307.xView ArticleGoogle Scholar
- Uren A, Beilharz T, O’connell M, Bugg S, Driel RV, Vaux D, Lithgow T: Role for yeast inhibitor of apoptosis (IAP)-like proteins in cell division. Proceedings of National Academy of Science 1999, 96: 10170–10175. 10.1073/pnas.96.18.10170View ArticleGoogle Scholar
- Amati B, Alevizopoulos K, Vlach J: Myc and the cell cycle. Frontiers in Bioscience 1998, 3: d250–268.PubMedGoogle Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 2005, 67(2):301–320. 10.1111/j.1467-9868.2005.00503.xView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.