Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process
© Opgen-Rhein and Strimmer; licensee BioMed Central Ltd. 2007
Published: 03 May 2007
Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data.
We present a novel and highly efficient approach to estimate a VAR network. This proceeds in two steps: (i) improved estimation of VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations this approach outperformed for small sample size all other considered approaches in terms of true discovery rate (number of correctly identified edges relative to the significant edges). Moreover, the analysis of expression time series data from Arabidopsis thaliana resulted in a biologically sensible network.
Statistical learning of large-scale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics.
The method is implemented in R code that is available from the authors on request.
The vector autoregressive regression (VAR) model is an approach to describe the interaction of variables through time in a complex multivariate system. It is very popular in economics  but with few exceptions  it has not been widely used in systems biology, where it could be employed to model genetic networks or metabolic interactions. One possible reason for this is that while the statistical properties of the VAR model are well explored , its estimation from sparse data and subsequent model selection is very challenging due to the large number of parameters involved .
In this paper we develop a procedure for effectively learning the VAR model from small sample genomic data. In particular, we describe a novel model selection procedure for learning causal VAR networks from time course data with only a few time points, and no or little replication. This procedure is based on regularized estimation of VAR coefficients, followed by subsequent simultaneous significance testing of the corresponding partial correlation coefficients.
Once the VAR model has been learned from the data, it allows to elucidate possible underlying causal mechanisms by inspecting the Granger causality graph implied by the non-zero VAR coefficients.
The remainder of the paper is organized as follows. In the next section we first give the definition of a vector autoregressive process and recall the standard estimation. Subsequently, we describe our approach to regularized inference and to network model selection. This is followed by computer simulations comparing a variety of alternative approaches. Finally, we analyze data from an Arabidopsis thaliana expression time course experiment.
Vector autoregressive model
We consider vector-valued time series data x(t) = (x1(t),...,x p (t)). Each component of this row vector corresponds to a variable of interest, e.g., the expression level of a specific gene or the concentration of some metabolite in dependence of time. The vector autoregressive model specifies that the value of x(t) is a linear combination of those of earlier time points, plus noise,
In this formula m is the order of the VAR process, L the time lag, and c a 1 × p vector of means. The errors ∈ i are assumed to have zero mean and a p × p positive definite covariance matrix Σ. The matrices B i with dimension p × p represent the dynamical structure and thus contain the information relevant for reading off the causal relationships.
The autoregressive model has the form of a standard regression problem. Therefore, estimation of the matrices B i is straightforward. A special case considered in this paper is when both m and L are set to 1. Then the above equation reduces to the VAR(1) process
x(t + 1) = c+ x(t)B+ ε. (2)
We now denote the centered matrices of observations corresponding to x(t + 1) and x(t) by X f ("future") and X p ("past"), respectively, i.e. and . In this notation the ordinary least squares (OLS) estimate can be written as
This is also the maximum likelihood (ML) estimate assuming the normal distribution. The coefficients of higher-order VAR models may be obtained in a corresponding fashion .
Small sample estimation using James-Stein-type shrinkage
Genomic time course data contain only few time points (typically around n = 10) and often little or no replication – hence the above restriction on VAR(1) models with unit lag. Furthermore, it is known that for small sample size the least squares and maximum likelihood methods lead to statistically inefficient estimators. Therefore, application of the VAR model to genomics data requires some form of regularization. For instance, a full Bayesian approach could be used. However, for the VAR model the choice of a suitable prior is difficult .
Here, as a both computationally and statistically efficient alternative, we propose to apply James-Stein-type shrinkage, a method related to empirical Bayes [5, 6]. This procedure has the advantage that it is computationally as simple as OLS, yet still produces efficient estimates for small samples.
Following [6, 7] we now review how an unconstrained covariance matrix may be estimated using shrinkage. In the next section we then show how this result may be used to obtain shrinkage estimates of VAR coefficients.
Assuming centered data X for p variables (columns) the unbiased empirical estimator of the covariance matrix is . For small number of observations S is known to be inefficient and also ill-conditioned (singular!) for n <p. A more efficient estimator may be furnished by shrinking the empirical correlations r ij towards zero and the empirical variances v i against their median. This leads to the following expressions for the components of a shrinkage estimate S*:
The particular choice of the shrinkage intensities and is aimed at minimizing the overall mean squared error.
Shrinkage estimation of VAR coefficients
We combine the centered observations X p and X f into a joint matrix Φ = [X p X f ]. Note that F contains twice as many columns as either X p or X f .
Next, we consider the (n - 1) multiple of the empirical covariance matrix, S= Φ T Φ, noting that S contains the two submatrices S1 = X p T X p and S2 = X p T X f . This allows to write the OLS estimate of VAR coefficients as OLS = (S1)-1S2.
We replace the empirical covariance matrix S by a shrinkage estimate.
From S* we determine the submatrices and which in turn allow to compute the estimates
By decomposing S* using the SVD or Cholesky algorithm it is possible to reconstruct pseudodata matrices and . The above algorithm may be interpreted as OLS or normal-distribution ML based on these pseudodata.
VAR network model selection
The network representing potential directed causal influences is given by the non-zero entries in the matrix of VAR coefficient. For an extensive discussion of the meaning and interpretation of the implied Granger (non)-causality we refer to .
As Shrink is an estimate it is unlikely that any of its components are exactly zero. Therefore, we need to statistically test whether the entries of Shrink are vanishing. However, instead of inspecting regression coefficients directly, it is preferably to test the corresponding partial correlation coefficients: this facilitates small-sample testing and additionally allows to accommodate for dependencies among the estimated coefficients .
Specifically, consider in the VAR(1) model the multiple regression that connects the first variable x1(t + 1) at time point t + 1 with all variables x1(t),...,x p (t) at the previous time point t,
If in this equation the roles of x k (t) and x1(t + 1) are reversed,
the partial correlation between the two variables is the geometric mean of the corresponding regression coefficients, times their sign, i.e. .
Once the partial correlations in the VAR model are computed, we use the "local fdr" approach of  to identify significant partial correlations, similar to the network model selection for graphical Gaussian models (GGMs) of . Note that unlike in a GGM the edges in a VAR network are directed.
We point out that recently two papers have appeared describing related strategies for VAR model selection. As in our algorithm the strategies pursued in both  and  also consist in choosing the VAR network by selecting the appropriate underlying partial correlations. However, the key advantage of our variant of VAR network search is that it is specifically designed to meet small sample requirements, by using shrinkage estimators of regression coefficients and partial correlation, and due to the adaptive nature (i.e. the automatic estimation of the empirical null) of the "local fdr" algorithm .
Results and discussion
In a comparative simulation study we investigated the power of diverse approaches to recovering the true VAR network. We simulated VAR(1) data of different sample size, with n varying between 5 and 200, for 100 randomly generated true networks with 200 edges and p = 100 nodes. The 200 nonzero regression coefficients were drawn uniformly from the intervals [-1; -0.2] and [0.2; 1].
In addition to the shrinkage procedure we estimated regression coefficients by ordinary least squares (OLS) and by ridge regression (RR). All these three regression strategies were applied in conjunction with the above VAR model selection based on partial correlations, with a cutoff value for the "local fdr" statistic set at 0.2 – the recommendation of . As a fourth method we employed L1 regression  (LASSO) to estimate VAR regression coefficients. Note that in the latter instance there is no need for additional model selection, as the LASSO method combines shrinkage and model selection and automatically sets many regression coefficients identically to zero.
In the simulations we ran OLS only for n > 100, as for small sample size the corresponding empirical covariance matrix is singular and consequently the OLS regression is ill-posed. The penalty for the LASSO regression was chosen as in . The regularization parameter in RR was determined by generalized cross validation . Unfortunately, even GCV turned out to be computationally expensive, so that for RR we conducted only 10 repetitions, rather than the 100 considered for the other methods.
The relative performance of the four approaches to VAR estimation can be further explained by considering the relative amount of true and false positive edges (Figure 1, middle and right box). The shrinkage method generally produces very few false positives. In contrast, the RR and LASSO methods lead to a large number of false edges, especially for small sample size. This is particularly pronounced for the LASSO regression, as can be seen in the differently scaled inlay plot contained in the right box of Figure 1, indicating that the penalty applied in the L1 regression may not be sufficient in this situation. In terms of the number of correctly identified edges the RR and shrinkage approach are the two top performing methods. However, even though RR finds a considerable number of true edges even at very small sample size, this has little impact on its true discovery rate because of the high number of false positives.
In summary, the simulation results suggest to apply for small sample size the James-Stein-type shrinkage procedure, and for n > p the traditional OLS approach.
Analysis of a microarray time course data set
For further illustration we applied the VAR shrinkage approach to a real world data example. Specifically, we reanalyzed expression time series resulting from an experiment investigating the impact of the diurnal cycle on the starch metabolism of Arabidopsis thaliana .
We downloaded the calibrated signal intensities for 22,814 probes and 11 time points for each of the two biological replicates from experiment no. 60 of the NASCArrays repository . After log-transforming the data we filtered out all genes containing missing values and whose maximum signal intensity value was lower than 5 on a log-base 2 scale. Subsequently, we applied the periodicity test of  to identify the probes associated with the day-night cycle. As a result, we obtained a subset of 800 genes that we further analyzed with the VAR approach.
We note that a tacit assumption of the VAR model is that time points are equidistant – see Eq. 1. This is not the case for the Arabidopsis thaliana data which were measured at 0, 1, 2, 4, 8, 12, 13, 14, 16, 20, and 24 hours after the start of the experiment. However, as the intensity of the biological reactions is likely to be higher at the change points from light to dark periods (time points 0 and 12), one may argue that assuming equidistant measurements is justifiable at least in terms of equal relative reaction rate.
A further implication of the VAR model (and indeed of many other graphical models) is that dependencies among genes are essentially linear. This can easily be checked by inspecting the pairwise scatter plots of the calibrated expression levels. For the 800 considered Arabidopsis thaliana genes we verified that the linearity assumption of the VAR model is indeed satisfied.
As the VAR network contains directed edges it is possible to distinguish genes that have mostly outgoing arcs, which could be indicative for key regulatory genes, from those with mostly ingoing arcs. In the graph of Figure 2 node 570, an AP2 transcription factor, and node 81, a gene involved in DNA-directed RNA polymerase, belong to the former category, whereas for instance node 558, a structural constituent of ribosome, seems to be part of the latter. Node 627 is another hub in the VAR network, which according to the annotation of  encodes a protein of unknown function. Another interesting aspect of the VAR network is the web of highly connected genes (encircled and colored yellow in the lower right corner of Figure 2) which we hypothesize to constitute some form of a functional module.
We have presented a novel algorithm for learning VAR causal networks. This is based on James-Stein-type shrinkage estimation of covariances between different time points of the conducted experiment, that in turn leads to improved estimates of the VAR regression coefficients. Subsequent VAR model selection is conducted by using "local fdr" multiple testing on the corresponding partial correlations.
We have shown that this approach is well suited for the small sample sizes encountered in genomics. In addition, the approach is computationally very efficient, as no computer intensive sampling or optimization is needed: the inference of the directed network for the Arabidopsis thaliana data with 640, 000 potentially directed edges takes about one minute on a standard desktop computer. While we have illustrated the approach by analyzing a microarray expression data set, it is by no means restricted to this kind of data – we expect that our VAR network approach performs equally well for similar high dimensional time series data from metabolomic or proteomic experiments.
The current algorithm employs a fixed "one step ahead" time lag. One strategy to generalization to arbitrary time lags may be to consider functional data – see, e.g., [20, 21]. This would have the additional benefit to suitable deal with non-equally spaced measurements, a common characteristic of many biological experiments.
We thank Papapit Ingkasuwan for pointing us to the Arabidopsis thaliana data set, and the referees for their valuable comments. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via an Emmy-Noether research grant to K.S.
This article has been published as part of BMC Bioinformatics Volume 8, Supplement 2, 2007: Probabilistic Modeling and Machine Learning in Structural and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S2.
- Sims CA: Macroeconomics and Reality. Econometrica 1980, 48: 1–48. 10.2307/1912017View ArticleGoogle Scholar
- Bay SD, Chrisman L, Pohorille A, Shrager J: Temporal Aggregation Bias and Inference of Causal Regulatory Networks. J Comput Biol 2004, 11: 971–985. 10.1089/cmb.2004.11.971View ArticlePubMedGoogle Scholar
- Lütkepohl H: Introduction to Multiple Time Series Analysis. Berlin: Springer; 1993.View ArticleGoogle Scholar
- Ni S, Sun D: Bayesian estimates for vector autoregressive models. J Business Economic Statist 2005, 23: 105–117. 10.1198/073500104000000622View ArticleGoogle Scholar
- Efron B, Morris CN: Stein's estimation rule and its competitors-an empirical Bayes approach. J Amer Statist Assoc 1973, 68: 117–130. 10.2307/2284155Google Scholar
- Opgen-Rhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Stat Appl Genet Mol Biol 2007, 6: 9.Google Scholar
- Schäfer J, Strimmer K: A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005, 4: 32.Google Scholar
- Granger CWJ: Testing for causality, a personal viewpoint. J Econom Dyn Control 1980, 2: 329–352. 10.1016/0165-1889(80)90069-XView ArticleGoogle Scholar
- Schäfer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 2005, 21: 754–764. 10.1093/bioinformatics/bti062View ArticlePubMedGoogle Scholar
- Whittaker J: Graphical Models in Applied Multivariate Statistics. New York: Wiley; 1990.Google Scholar
- Efron B: Local false discovery rates. In Tech rep. Department of Statistics, Stanford University; 2005.Google Scholar
- Demiralp S, Hoover KD: Searching for the causal structure of a vector autoregression. Oxford Bull Econom Statist 2003, 65: 745–767. 10.1046/j.0305-9049.2003.00087.xView ArticleGoogle Scholar
- Moneta A: Graphical models for structural vector autoregressions. In Technical Report. Laboratory of Economics and Management, Sant' Anna School of Advanced Studies, Pisa; 2004.Google Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J R Statist Soc B 1996, 58: 267–288.Google Scholar
- Meinshausen N, Bühlmann P: High-dimensional graphs and variable selection with the Lasso. Ann Statist 2006, 34: 1436–1462. 10.1214/009053606000000281View ArticleGoogle Scholar
- Golub G, Heath M, Wahba G: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 1979, 21: 215–223. 10.2307/1268518View ArticleGoogle Scholar
- Smith SM, Fulton DC, Chia T, Thorneycroft D, Chapple A, Dunstan H, Hylton C, Smith SCZAM: Diurnal changes in the transcriptom encoding enzymes of starch metabolism provide evidence for both transcriptionaland posttranscriptional regulation of starch metabolism inArabidopsis leaves. Plant Physiol 2004, 136: 2687–2699. 10.1104/pp.104.044347PubMed CentralView ArticlePubMedGoogle Scholar
- NASCArrays: the Nottingham Arabidopsis Stock Centre's microarray database[http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl]
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 2004, 20: 5–20. 10.1093/bioinformatics/btg364View ArticlePubMedGoogle Scholar
- Opgen-Rhein R, Strimmer K: Inferring gene dependency networks from genomic longitudinal data: a functional data approach. REVSTAT 2006, 4: 53–65.Google Scholar
- Opgen-Rhein R, Strimmer K: Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data. Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB 2006), Tampere 2006, 4: 73–76.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.