Stability of building gene regulatory networks with sparse autoregressive models

Background Biological networks are constantly subjected to random perturbations, and efficient feedback and compensatory mechanisms exist to maintain their stability. There is an increased interest in building gene regulatory networks (GRNs) from temporal gene expression data because of their numerous applications in life sciences. However, because of the limited number of time points at which gene expressions can be gathered in practice, computational techniques of building GRN often lead to inaccuracies and instabilities. This paper investigates the stability of sparse auto-regressive models of building GRN from gene expression data. Results Criteria for evaluating the stability of estimating GRN structure are proposed. Thereby, stability of multivariate vector autoregressive (MVAR) methods - ridge, lasso, and elastic-net - of building GRN were studied by simulating temporal gene expression datasets on scale-free topologies as well as on real data gathered over Hela cell-cycle. Effects of the number of time points on the stability of constructing GRN are investigated. When the number of time points are relatively low compared to the size of network, both accuracy and stability are adversely affected. At least, the number of time points equal to the number of genes in the network are needed to achieve decent accuracy and stability of the networks. Our results on synthetic data indicate that the stability of lasso and elastic-net MVAR methods are comparable, and their accuracies are much higher than the ridge MVAR. As the size of the network grows, the number of time points required to achieve acceptable accuracy and stability are much less relative to the number of genes in the network. The effects of false negatives are easier to improve by increasing the number time points than those due to false positives. Application to HeLa cell-cycle gene expression dataset shows that biologically stable GRN can be obtained by introducing perturbations to the data. Conclusions Accuracy and stability of building GRN are crucial for investigation of gene regulations. Sparse MVAR techniques such as lasso and elastic-net provide accurate and stable methods for building even GRN of small size. The effect of false negatives is corrected much easier with the increased number of time points than those due to false positives. With real data, we demonstrate how stable networks can be derived by introducing random perturbation to data.


Background
Biological networks are constantly perturbed randomly and there exist efficient and compensatory mechanisms to withstand such instabilities. Constructing gene regulatory networks (GRN) from time-series gene expression data plays a vital role in understanding complex biological mechanisms and in the development of novel drugs. Though microarrays allow measurement of thousands of genes simultaneous, gene expressions in practice can be gathered only over a few time points due to high cost and time involved, and limitations of the experiments. This makes building GRN inherently an ill-posed problem in practice, leading such networks unstable and irreproducible. Moreover, variable and complex nature of biological sources, and measurement noise and artifacts, add to the challenges of constructing accurate and stable GRN.
A wide range of techniques for inferring GRN from microarray datasets have been proposed in the literature [1][2][3][4], including Boolean networks [5], linear networks [6,7] differential equation models [8], and stochastic methods [9][10][11][12]. Despite the plethora of such techniques, stability of building GRN from gene expression data has not been addressed by the research community so far. In this paper, we investigate the stability of building GRN by using sparse linear models with simulations and real data.
The linear multivariate vector autoregression (MVAR) provides a simple and efficient technique to estimate regulatory relationships among genes. However, due to less number of time points compared to the number of genes whose expressions involved in gene expression datasets, several penalized MVAR techniques using regularization [3,7,[13][14][15][16] and priors [17] have been considered for building GRN. The sparse MVAR techniques such as lasso and elastic-net uses a penalty function to drive small regulatory coefficient to zero, thereby producing computationally stable and sparse yet biologically plausible GRNs. Recently, their efficacy in building GRN have been demonstrated [6,7].
A good technique of building GRN should not only be accurate but also be reproducible and stable; biologists will then be able to test complex hypothesis with confidence on in silico networks. Stability means that the network construction is robust to changes of network topology and parameters, and biological and instrumental noise. In this paper, we first introduce novel criteria for evaluating stability of building GRN at the level of connections and networks. MVAR models of ridge, lasso, and elastic-net penalty are evaluated with respect to their accuracies and stabilities by using synthetic gene expression datasets. In particular, we investigate how many time points of gene expressions are needed for a network of given topology and size. Using a real data set gathered in HeLa cell cycle [18], we demonstrate how random perturbations in the data could be induced to determine stable genes in the network.

Methods
Suppose dataset X = {x i (t)} I × T consists of time-series of expressions of I genes, taken over equally spaced T time points, where x i (t) denotes the expression of gene i at time t. Let vector x t x t i i I ( ) ( ( )) = =1 represent expressions of all genes at time t. Consider a network of I genes, represented by an r-order multivariate vector autoregressive (MVAR) model: i j I I denotes the matrix of regression coefficients corresponding to a model of order τ, and e e ( ) ( ( )) t t i i I = =1 denotes residuals that are assumed to be i.i.d. and zero mean Gaussian. The regression coefficient b t i j , represents the interaction between genes i and j. In general, the model has I 2 r coefficients. This scope of this work is restricted to first-order systems (i. e., r = 1). Suppose a vector of gene expressions at time t is denoted by row vector y x t t i i I = = ( ( )) 1 and let z t = y t-1 denote the vector of gene expressions at the previous time-point, b = [b 1 , b 2 , … b I ] T a matrix of size I × I of regression coefficients, and ε t = [ε 1 (t), ε 2 (t),…ε I (t)] the corresponding innovations. The multivariate model can be written in standard multivariate vector autoregressive (MVAR) form: In vector form, (2) becomes: where t-th row of Y, Z, and E, are y t , z t , and ε t , respectively, and there are T -1 samples; Y is a (T -1) × I matrix, Z a (T -1) × I matrix, b a I × I matrix, and E a (T -1) × I matrix. MVAR coefficients are estimated using standard least squares as: The above matrix cannot be determined if the number of genes are larger than number of samples (that is, T <<I).
To handle such instances, penalized regression methods such as ridge regression, lasso, or elastic net regression penalties have been proposed [19]. The general penalized regression loss function is given by: where λ and a are regularization parameters. When a = 0 the loss function represents ridge MVAR, a = 1 represents lasso, and a (0,1) represents an elastic-net penalty. In other words, lasso uses L 1 -norm, ridge regression uses L 2 -norm, and elastic-net uses both L 1norm and L 2 -norm in its penalty term. The penalty terms attempt to drive small regression coefficients to zero, making networks sparse and computationally wellposed. The lasso penalty is more capable of driving regression coefficients to zero than the ridge penalty.

Synthetic data
Most of the real world networks, such as biological networks, are scale-free. In this study, GRN of various sizes and topology are built with scale-free networks using Barabasi-Albert model [20]. First, assuming that large GRNs adhere to the topology of scale-free networks, synthetic structures of GRN were built by initiating a small number of nodes. New nodes and edges were added with preferential attachment as the probability of addition of new nodes to the existing node is not uniform. A node with high number of edges attracts higher number of new nodes compared to a node with no connection. This phenonmena in fact leads to power-law distribution: the probability p i of preferential attachment of a new gene to existing gene i is given by: where d i denotes the number of adjacent edges not initiated by gene i itself (which approximates to the indegree of gene i). The parameters γ denotes the power of preferential attachment and b the attractiveness of the gene with no adjacent edges.
Time-series datasets of gene expressions were generated for a specific network topology with the preferential attachment γ = 1.2. For a given network topology, the coefficients corresponding to no interactions among genes were set to zero. For other connections, true MVAR coefficients were obtained by drawing samples from a uniform distribution on the interval [-1,-0.8] and [0. 8,1] such that the number of positive and negative coefficients are approximately equal. The initial (t = 0) gene expression values were drawn from a uniform distribution on the interval [10,15]. For successive time points, expression were generated using MVAR model with added i.i.d. Gaussian random noise Σ = I [7]. Parameters used for generating synthetic networks are given in Table 1. For a given network and number of time points, we generated 100 time-series datasets by randomly initializing the gene expressions.

Real data
To investigate to the stability of real biological networks, we use lasso and elastic-net methods on Human cancer cell Line (HeLa) cell-cycle gene expression dataset [18].
The dataset used in this study is described as experiment 3 (http://genome-www.stanford.edu/Human-CellCycle/ Hela/) and contains 48 time points where gene expressions were measured at the interval of one hour and synchronized by double thymidine block. Based on relevance to cell cycle and tumor development, 91 genes were selected. This dataset was previously used to demonstrate the efficacy of sparse lasso MVAR techniques in building GRN [6]. In order to investigate stability random perturbations, we added noise perturbations at each time point by adding random samples from Gaussian noise N(0, ) where s is the standard deviation of residuals of real data and δ is a perturbation constant. For real dataset, the standard deviation of residuals was, s = 0.225 and s = 0.23, respectively, for models built with lasso and elastic-net MVAR models. By generating 100 gene expression datasets for each of δ such that SNR [0.01, 4], the stability of GRN built were studied against noise perturbations.

Stability of structure
In this section, we present a Hamming distance based criteria for evaluating the stability of building GRN. Let represents the presence of a regulatory interaction between genes i and j, and otherwise 0. Consider two GRNs s b and s b′ ; the similarity of the two networks is obtained by the average Hamming distance over all the regulatory connections: where d denotes the Hamming distance and |s b | denotes the number of connections in the network s b . In (7), the stability r [0, 1], where higher values denote high stability of the GRN inference algorithm. Here, stability is measured with respect to the whole network and the Hamming distance takes into account both the presence and absence of a regulatory connection.
Using the above measures, the stability of complete GRNs is obtained by averaging over B number of structures obtained using a particular method. The average of pair-wise stability, such as denoted in (7), is then used as the overall stability performance of the given method. This is given by: The stability of an independent edge is equally important. With a small perturbation, an edge is either established between two genes i and j, or is not detected. Hence, based on the number of times a regulatory relationship occurs between the two specific genes, the stability of the edge can be computed. Such edge stability is defined as:

Implementation
Scale-free network topologies were generated using igraph R package [21]. The stability of networks are studied with penalized regression methods. Each of these MVAR method has regularization parameters. Although computationally expensive, we used the leave-one-out cross-validation to estimate the λ for ridge MVAR and a and/or λ for sparse models. For ridge estimate, the λ was selected from the set of {0.001, 0.1, 1, 10, 100}. For lasso and elastic net, the glmnet package [22] is used which could generate the whole solution path for λ for a given a value (a = 1 for lasso). Hence, the a values are selected from a set of {0.1, 0.2, …, 0.9} for elastic-net model.
The presence of statistically significant edges were determined using t-distribution over regression coefficients corrected for multiple comparisons using false discovery rate (FDR) [23]. In this method, let P (1) ≤ P (2) ≤ … ≤ P (n) denotes sorted p-values where n is the number of hypotheses. If k* is the largest n* for which P n n q n ( *) * ≤ then we reject all the hypotheses n* = 1, 2, …, k*. Here, q is level of significance. For a given regression, the p-value for a regression coefficient b could be approximated using: Here, t(d) is t-distribution with d degree of freedom (dof); σ 2 is an estimate of the error of variance ; and w k is the j-th diagonal element of ((Z′Z + lW -) -1 Z′Z(Z′Z + lW -) -1 ) where Wdenotes generalized inverse of W, a diagonal matrix with diagonal elements |b k |. To achieve closed form solution, this matrix is estimated using ridge coefficients using the tuned λ for lasso penalty [24]. For lasso, the number of nonzero coefficients is an unbiased estimate for the degrees of freedom [25]. The dof for elastic-net were also assumed to be the number of nonzero coefficients. Only the statistically significant edges were evaluated for stability analysis of sparse MVAR on real datasets.
In this paper, we use fast coordinate descent algorithm for optimizing lasso and elastic net loss function. The codes are provided with glmnet MATLAB package. Ridge regression is performed using standard MATLAB function.

Results on synthetic dataset
To study the effects of the number of time points on the stability, GRN consisting of I {10, 50, 100} number of genes were simulated using scale-free networks with topologies having power-law coefficient γ [2.10, 2.30 ]. Thus, temporal gene expressions for T {10, 30, 50, 70} number of time points for a given network topology were generated. For given I, γ, and T, B = 100 number of bootstrapped datasets were generated by randomly initiating gene expressions at the first time point. Table 2 shows accuracies and stabilities of sparse linear models at different numbers of genes and time points. True positives (TP), false positives (FP), and Fmeasure (which is equal to 2 × × +

Precision Recall
Precision Recall , where Precision = TP/(TP+FP) and Recall = TP/(TP + FN) were used to indicate the accuracy. The performances measures reported are the averages over 100 bootstrap datasets. As seen, the accuracies of elastic-net and lasso were good and they need only a number of time points approximately equaling the number of genes to achieve good accuracy in small networks (for example, 10 genes). But as the network size grows, the number of time points required were relatively less compared to the size of the network. Ridge shows relatively poor accuracy, so is not considered further for evaluation. If the number of time points are small, false negatives were easier to correct by increasing the number of time points than false positives. Figure 1 shows the stability of lasso and elastic-net MVAR methods of building GRN of different sizes with different number of time points. The accuracy and stability were similar for the two methods. To achieve decent accuracy, the number of time points needed were at least about the size of the network. But to achieve good stability, much more time points are needed. At the individual edge level, figure 2 and 3 represent distribution of stability of edges with increasing number of genes and time points. Increase in number of time samples reduces the number of edges of low stability.

Results on real dataset
Figures 4a and 5a show statistically significant connections of GRN obtained using elastic-net and lasso MVAR, respectively, for Hela cell-cycle dataset. Figures 4b and 5b show the stable networks whose edges that have stability values greater than 0.5 after perturbations at SNR=4.0. As seen, the network derived using elastic-net penalty has higher number of edges compared to GRNs derived using lasso. This is because lasso can only select a very few number of edges in the presence of correlated gene expressions. As seen, after perturbation, the number of in-degree or out-degree of many critical hubs are reduced while few new edges are also detected. Elastic-net produced a more stable network than the lasso. Both the networks were able to detect many important biological hubs, such as Noxa, IL-6, c-myc, IAP, TSP1 Noxa, induction of G1 arrest by p18 bypasses a homeostatic cell-cycle checkpoint in intermediate Plasma cells for their differentiation [26]. IL-6 plays importance role in induction of apoptosis and cell cycle regulation [27]. This is also shown by large number of out-degree edges. Thrombospondin-1 (TSP1) curtails tumor growth and acts as an inhibitor of angiogenesis [28]. Inhibitors of apoptosis (IAPs) have important role in cell division and regulates apoptosis [29]. c-Myc oncoprotein prevents cell cycle arrest in response to growth-inhibitory signals, differentiation stimuli, or mitogen withdrawal [30]. Hubs such as P53, NFkB, FGFR3, etc. were not detected in GRN obtained using Lasso penalty, but were detected with elastic-net penalty. However, many edges and hubs are not recognized when perturbations were induced (for example, at SNR = 4.0) and edge stability is set to a thresold of 0.5. Genes like Noxa, NFkB, P53 etc. are severely affected in GRNs obtained using both sparse MVAR methods. Biologically, this could mean that these genes are less important for the biological processes underlying the network, compared to more stable genes. The results indicate that random perturbation of data indirectly helps the process of building accurate and stable GRNs. Figure 6 shows stability of individual connections produced by lasso and elastic-net methods. As seen, lasso produces less number of unstable edges than elastic-net. Figure 7 shows the effect of SNR on stability and F-measure. With increase in SNR level, both the stability and Fmeasures improves. Generally, the GRNs obtained using elastic-net penalty have higher stability and F-measure compared to GRNs derived using lasso. To compute the F-measure, it is assumed that the GRN derived on real data using lasso (or elastic-net) MVAR is true.

Discussion
Novel measures of evaluating stability of building GRN were introduced. Thereby, stability and accuracy of sparse MVAR models in building GRN structure were studied using synthetic datasets. The results suggest that to achieve decent accuracy and stability with sparse MVAR methods, at least a number of time points equal to the number of genes are required. But as the network size grows, the number of time points required is less compared to the number of genes in the network. It is easier to ameliorate the effects of false negatives than the false positives by increasing the number of time samples of gene expressions. The results indicates that lasso MVAR and elastic-net perform equally on datasets in general, though lasso handles false positives better. However, elastic-net performed better than lasso on real dataset. This is because elastic-net penalty has an ability to predict regulatory relationship between highly correlated genes while lasso will only predict one of them [31]. As genes are highly correlated, elastic-net or their improved versions proves better making inferences on large scale GRNs [7]. In simulated datasets, correlations among gene expression were not simulated.
To the best of our knowledge, this is the first study introducing stability criteria or studying methods building GRN. Our investigations here were focused on the stability of sparse MVAR models. Our work could be extended for other approaches as well. For small networks, sparse linear models to build networks were stable and accurate. Furthermore, effects of the number of time points on the stability were studied by experimenting on scale-free networks of different topologies. As the network size grows, the number of time points required for building GRN is less than the number of genes in the GRN. Stability is inherent problem in practice as real datasets consist of a large number of genes and short time-series. With an application of real dataset, we demonstrate how stable GRN can be derived by introducing perturbations to the gene expression datasets. Only a few statistically significant edges and associated gene hubs are stable and can withstand small amount of perturbation. Biologically, more stable genes have preserve more significant roles of the biological process of the network. This research emphasizes the need for building GRN that are accurate, stable, and reproducible, so that the structures derived are robust against noise and perturbation of data. In addition by perturbing gene expressions, more accurate and stable core genes and subnetworks can be inferred from temporal gene expression data.