Volume 8 Supplement 2

## Probabilistic Modeling and Machine Learning in Structural and Systems Biology

- Research
- Open Access

# Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process

- Rainer Opgen-Rhein^{1} (Email author)
- Korbinian Strimmer^{2}

*BMC Bioinformatics* 2007, **8 (Suppl 2)**:S3

https://doi.org/10.1186/1471-2105-8-S2-S3

© Opgen-Rhein and Strimmer; licensee BioMed Central Ltd. 2007

**Published:** 3 May 2007

## Abstract

### Background

Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data.

### Results

We present a novel and highly efficient approach to estimating a VAR network. It proceeds in two steps: (i) improved estimation of the VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations with small sample sizes, this approach outperformed all other considered approaches in terms of true discovery rate (the number of correctly identified edges relative to all significant edges). Moreover, the analysis of expression time series data from *Arabidopsis thaliana* resulted in a biologically sensible network.

### Conclusion

Statistical learning of large-scale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics.

### Availability

The method is implemented in R code that is available from the authors on request.

## Keywords

- Ordinary Least Squares
- Lasso
- Ridge Regression
- Significant Edge
- Lasso Regression

## Background

The vector autoregressive (VAR) model is an approach to describing the interaction of variables through time in a complex multivariate system. It is very popular in economics [1], but with few exceptions [2] it has not been widely used in systems biology, where it could be employed to model genetic networks or metabolic interactions. One possible reason is that while the statistical properties of the VAR model are well explored [3], its estimation from sparse data and the subsequent model selection are very challenging due to the large number of parameters involved [4].

In this paper we develop a procedure for effectively learning the VAR model from small sample genomic data. In particular, we describe a novel model selection procedure for learning causal VAR networks from time course data with only a few time points, and no or little replication. This procedure is based on regularized estimation of VAR coefficients, followed by subsequent simultaneous significance testing of the corresponding partial correlation coefficients.

Once the VAR model has been learned from the data, it allows one to elucidate possible underlying causal mechanisms by inspecting the Granger causality graph implied by the non-zero VAR coefficients.

The remainder of the paper is organized as follows. In the next section we first give the definition of a vector autoregressive process and recall the standard estimation. Subsequently, we describe our approach to regularized inference and to network model selection. This is followed by computer simulations comparing a variety of alternative approaches. Finally, we analyze data from an *Arabidopsis thaliana* expression time course experiment.

## Methods

### Vector autoregressive model

We consider vector-valued time series data x(*t*) = (*x*_{1}(*t*),...,*x*_{*p*}(*t*)). Each component of this row vector corresponds to a variable of interest, e.g., the expression level of a specific gene or the concentration of some metabolite in dependence of time. The vector autoregressive model specifies that the value of x(*t*) is a linear combination of those of earlier time points, plus noise,

$x(t)=c+{\displaystyle \sum _{i=1}^{m}}x(t-iL){B}_{i}+\epsilon (t)$. (1)

In this formula *m* is the order of the VAR process, *L* the time lag, and c a 1 × *p* vector of means. The errors ε(*t*) are assumed to have zero mean and a *p* × *p* positive definite covariance matrix **Σ**. The matrices B_{*i*} with dimension *p* × *p* represent the dynamical structure and thus contain the information relevant for reading off the causal relationships.

The autoregressive model has the form of a standard regression problem. Therefore, estimation of the matrices B_{*i*} is straightforward. A special case considered in this paper is when both *m* and *L* are set to 1. Then the above equation reduces to the VAR(1) process

x(*t* + 1) = c + x(*t*)B + ε. (2)

We now denote the centered *matrices of observations* corresponding to x(*t* + 1) and x(*t*) by X_{*f*} ("future") and X_{*p*} ("past"), respectively, i.e. ${X}_{p}=\left[\begin{array}{c}x(1)\\ \dots \\ x(n-1)\end{array}\right]$ and ${X}_{f}=\left[\begin{array}{c}x(2)\\ \dots \\ x(n)\end{array}\right]$. In this notation the ordinary least squares (OLS) estimate can be written as

${\widehat{B}}^{\text{OLS}}={({X}_{p}^{T}{X}_{p})}^{-1}{X}_{p}^{T}{X}_{f}$. (3)

This is also the maximum likelihood (ML) estimate assuming the normal distribution. The coefficients of higher-order VAR models may be obtained in a corresponding fashion [3].
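For illustration, Eq. 3 translates into a few lines of code. This is a sketch with our own naming, not the authors' R implementation:

```python
import numpy as np

def var1_ols(x):
    """OLS estimate of the VAR(1) coefficient matrix B (Eq. 3).

    x : (n, p) array of observations, rows ordered in time.
    Returns the p x p matrix B in the model x(t+1) ~ c + x(t) B.
    """
    x = np.asarray(x, dtype=float)
    xp = x[:-1] - x[:-1].mean(axis=0)   # centered "past"   X_p
    xf = x[1:] - x[1:].mean(axis=0)     # centered "future" X_f
    # B_OLS = (X_p^T X_p)^{-1} X_p^T X_f; solve() avoids an explicit inverse
    return np.linalg.solve(xp.T @ xp, xp.T @ xf)
```

Note that for *n* < *p* the matrix X_{*p*}^{T}X_{*p*} is singular, so this estimator only exists for sufficiently long series.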

### Small sample estimation using James-Stein-type shrinkage

Genomic time course data contain only a few time points (typically around *n* = 10) and often little or no replication – hence the above restriction to VAR(1) models with unit lag. Furthermore, it is known that for small sample sizes the least squares and maximum likelihood methods lead to statistically inefficient estimators. Therefore, application of the VAR model to genomic data requires some form of regularization. For instance, a full Bayesian approach could be used. However, for the VAR model the choice of a suitable prior is difficult [4].

Here, as a both computationally and statistically efficient alternative, we propose to apply James-Stein-type shrinkage, a method related to empirical Bayes [5, 6]. This procedure has the advantage that it is computationally as simple as OLS, yet still produces efficient estimates for small samples.

Following [6, 7] we now review how an unconstrained covariance matrix may be estimated using shrinkage. In the next section we then show how this result may be used to obtain shrinkage estimates of VAR coefficients.

Assuming centered data X for *p* variables (columns) the unbiased empirical estimator of the covariance matrix is $S=\frac{1}{n-1}{X}^{T}X$. For a small number of observations S is known to be inefficient and also ill-conditioned (singular!) for *n* < *p*. A more efficient estimator may be furnished by shrinking the empirical correlations *r*_{*ij*} towards zero and the empirical variances *v*_{*i*} towards their median. This leads to the following expressions for the components of a shrinkage estimate S*:

${s}_{ij}^{\ast}={r}_{ij}^{\ast}\sqrt{{v}_{i}^{\ast}{v}_{j}^{\ast}}$

with

${r}_{ij}^{\ast}=(1-{\widehat{\lambda}}_{1}^{\ast}){r}_{ij}$ for *i* ≠ *j* (and ${r}_{ii}^{\ast}=1$)

and

${v}_{i}^{\ast}={\widehat{\lambda}}_{2}^{\ast}{v}_{\text{median}}+(1-{\widehat{\lambda}}_{2}^{\ast}){v}_{i}$.

The particular choice of the shrinkage intensities ${\widehat{\lambda}}_{1}^{\ast}$ and ${\widehat{\lambda}}_{2}^{\ast}$ is aimed at minimizing the overall mean squared error.
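As a sketch (not the authors' R code), the shrinkage covariance estimate S* can be computed as follows. The analytic shrinkage intensities of [6, 7] are beyond this illustration, so `lam1` and `lam2` are taken as given arguments here:

```python
import numpy as np

def shrink_cov(x, lam1, lam2):
    """Shrinkage covariance estimate S* (a sketch; the analytic
    intensities lam1 and lam2 of the paper are assumed given).

    Correlations are shrunk towards zero, variances towards their median.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0)
    s = xc.T @ xc / (x.shape[0] - 1)               # empirical covariance S
    v = np.diag(s)                                 # empirical variances
    r = s / np.sqrt(np.outer(v, v))                # empirical correlations
    r_star = (1.0 - lam1) * r                      # shrink correlations to 0
    np.fill_diagonal(r_star, 1.0)
    v_star = lam2 * np.median(v) + (1.0 - lam2) * v  # shrink variances to median
    return r_star * np.sqrt(np.outer(v_star, v_star))
```

With `lam1 = lam2 = 0` this reconstructs the empirical covariance; with positive `lam1` the estimate is well-conditioned even for *n* < *p*.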

### Shrinkage estimation of VAR coefficients

1. We combine the centered observations X_{*p*} and X_{*f*} into a joint matrix Φ = [X_{*p*} X_{*f*}]. Note that Φ contains twice as many columns as either X_{*p*} or X_{*f*}.
2. Next, we consider the (*n* - 1) multiple of the *empirical* covariance matrix, S = Φ^{T}Φ, noting that S contains the two submatrices S_{1} = X_{*p*}^{T}X_{*p*} and S_{2} = X_{*p*}^{T}X_{*f*}. This allows us to write the OLS estimate of the VAR coefficients as $\widehat{B}$^{OLS} = (S_{1})^{-1}S_{2}.
3. We replace the empirical covariance matrix S by a shrinkage estimate S*.
4. From S* we determine the submatrices ${S}_{1}^{\ast}$ and ${S}_{2}^{\ast}$, which in turn allow us to compute the estimate $\widehat{B}$^{Shrink} = (${S}_{1}^{\ast}$)^{-1}${S}_{2}^{\ast}$.

By decomposing S* using the SVD or Cholesky algorithm it is possible to reconstruct pseudodata matrices ${X}_{f}^{\ast}$ and ${X}_{p}^{\ast}$. The above algorithm may be interpreted as OLS or normal-distribution ML based on these pseudodata.
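The four steps above can be sketched in a few lines. This is a self-contained Python illustration under our own naming, not the authors' R implementation; the shrinkage intensities are again assumed given:

```python
import numpy as np

def var1_shrink(x, lam1, lam2):
    """Shrinkage estimate of the VAR(1) coefficient matrix (steps 1-4).

    A sketch, not the authors' R code: the analytic shrinkage
    intensities lam1 and lam2 are assumed given.
    """
    x = np.asarray(x, dtype=float)
    p = x.shape[1]
    xp = x[:-1] - x[:-1].mean(axis=0)            # X_p ("past")
    xf = x[1:] - x[1:].mean(axis=0)              # X_f ("future")
    phi = np.hstack([xp, xf])                    # step 1: Phi = [X_p  X_f]
    # step 2: empirical covariance of the joint matrix
    s = phi.T @ phi / (phi.shape[0] - 1)
    # step 3: shrink correlations towards 0, variances towards their median
    v = np.diag(s)
    r = s / np.sqrt(np.outer(v, v))
    r_star = (1.0 - lam1) * r
    np.fill_diagonal(r_star, 1.0)
    v_star = lam2 * np.median(v) + (1.0 - lam2) * v
    s_star = r_star * np.sqrt(np.outer(v_star, v_star))
    # step 4: B_shrink = (S_1*)^{-1} S_2* from the submatrices of S*
    return np.linalg.solve(s_star[:p, :p], s_star[:p, p:])
```

For `lam1 = lam2 = 0` this reduces to the OLS estimate of Eq. 3, while positive intensities keep the estimate well-defined even when *n* < *p*.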

### VAR network model selection

The network representing potential directed causal influences is given by the non-zero entries in the matrix of VAR coefficients. For an extensive discussion of the meaning and interpretation of the implied Granger (non-)causality we refer to [8].

As $\widehat{B}$^{Shrink} is an estimate, it is unlikely that any of its components are exactly zero. Therefore, we need to test statistically whether the entries of $\widehat{B}$^{Shrink} vanish. However, instead of inspecting the regression coefficients directly, it is preferable to test the corresponding partial correlation coefficients: this facilitates small-sample testing and additionally makes it possible to accommodate dependencies among the estimated coefficients [9].

Specifically, consider in the VAR(1) model the multiple regression that connects the first variable *x*_{1}(*t* + 1) at time point *t* + 1 with all variables *x*_{1}(*t*),...,*x*_{*p*}(*t*) at the previous time point *t*,

${x}_{1}(t+1)={\beta}_{1}^{0}+{\displaystyle \sum _{k=1}^{p}}{\beta}_{1}^{k}{x}_{k}(t)+{\epsilon}_{1}$.

If in this equation the roles of *x*_{*k*}(*t*) and *x*_{1}(*t* + 1) are reversed, i.e. *x*_{*k*}(*t*) is regressed on *x*_{1}(*t* + 1) and the remaining variables,

${x}_{k}(t)={\beta}_{k}^{0}+{\beta}_{k}^{1}{x}_{1}(t+1)+{\displaystyle \sum _{j\ne k}}{\beta}_{k}^{j}{x}_{j}(t)+{\epsilon}_{k}$,

the partial correlation between the two variables is the geometric mean of the corresponding regression coefficients, times their sign, i.e. $\sqrt{{\beta}_{1}^{k}{\beta}_{k}^{1}}\mathrm{sgn}\phantom{\rule{0.1em}{0ex}}({\beta}_{1}^{k})$ [10].
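The geometric-mean identity can be checked numerically. The following sketch (our own, with hypothetical names) computes the partial correlation between a response and one predictor from the two regression coefficients:

```python
import numpy as np

def pcor_via_betas(y, X, k):
    """Partial correlation between y and column k of X, given the
    remaining columns, via sgn(beta_1^k) * sqrt(beta_1^k * beta_k^1).
    A sketch of the identity used in the text, not the authors' code.
    """
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    # regression of y on all columns of X; keep coefficient of column k
    b_yx = np.linalg.lstsq(Xc, yc, rcond=None)[0][k]
    # reversed regression: column k of X on y and the other columns
    others = np.delete(Xc, k, axis=1)
    Z = np.column_stack([yc, others])
    b_xy = np.linalg.lstsq(Z, Xc[:, k], rcond=None)[0][0]
    return np.sign(b_yx) * np.sqrt(b_yx * b_xy)
```

The result agrees exactly with the usual precision-matrix formula for partial correlation, which is what makes the regression-based route convenient here.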

Once the partial correlations in the VAR model are computed, we use the *"local fdr"* approach of [11] to identify significant partial correlations, similar to the network model selection for graphical Gaussian models (GGMs) of [9]. Note that unlike in a GGM the edges in a VAR network are directed.

We point out that two papers describing related strategies for VAR model selection have appeared recently. As in our algorithm, the strategies pursued in [12] and [13] also consist in choosing the VAR network by selecting the appropriate underlying partial correlations. However, the key advantage of our variant of VAR network search is that it is specifically designed to meet small-sample requirements, by using shrinkage estimators of the regression coefficients and partial correlations, and through the adaptive nature (i.e. the automatic estimation of the empirical null) of the "local fdr" algorithm [11].

## Results and discussion

### Simulation study

In a comparative simulation study we investigated the power of diverse approaches to recovering the true VAR network. We simulated VAR(1) data of different sample size, with *n* varying between 5 and 200, for 100 randomly generated true networks with 200 edges and *p* = 100 nodes. The 200 nonzero regression coefficients were drawn uniformly from the intervals [-1; -0.2] and [0.2; 1].
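A minimal sketch of such a simulation setup follows (ours, not the authors' code; the rescaling for stationarity is our own assumption, as the paper does not state how stability of the random networks was ensured):

```python
import numpy as np

def random_var1(p, n_edges, rng):
    """Random sparse VAR(1) coefficient matrix as in the simulation:
    n_edges nonzero entries drawn uniformly from [-1, -0.2] u [0.2, 1].
    The stability rescaling is our addition, not stated in the paper.
    """
    B = np.zeros(p * p)
    idx = rng.choice(p * p, size=n_edges, replace=False)
    mag = rng.uniform(0.2, 1.0, size=n_edges)
    B[idx] = mag * rng.choice([-1.0, 1.0], size=n_edges)
    B = B.reshape(p, p)
    rho = np.max(np.abs(np.linalg.eigvals(B)))
    if rho >= 1.0:                       # keep the process stationary
        B *= 0.95 / rho
    return B

def simulate_var1(B, n, rng, sigma=1.0):
    """Simulate n observations of x(t+1) = x(t) B + eps (zero mean)."""
    p = B.shape[0]
    x = np.zeros((n, p))
    for t in range(n - 1):
        x[t + 1] = x[t] @ B + sigma * rng.standard_normal(p)
    return x
```

The sketch uses a smaller network than the *p* = 100, 200-edge setting of the study purely to keep the illustration light.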

In addition to the shrinkage procedure we estimated regression coefficients by ordinary least squares (OLS) and by ridge regression (RR). All three regression strategies were applied in conjunction with the above VAR model selection based on partial correlations, with the cutoff value for the "local fdr" statistic set at 0.2 – the recommendation of [11]. As a fourth method we employed L1-penalized regression [14] (LASSO) to estimate the VAR regression coefficients. Note that in the latter instance there is no need for additional model selection, as the LASSO method combines shrinkage and model selection and automatically sets many regression coefficients exactly to zero.

In the simulations we ran OLS only for *n* > 100, as for small sample size the corresponding empirical covariance matrix is singular and consequently the OLS regression is ill-posed. The penalty for the LASSO regression was chosen as in [15]. The regularization parameter in RR was determined by generalized cross validation [16]. Unfortunately, even GCV turned out to be computationally expensive, so that for RR we conducted only 10 repetitions, rather than the 100 considered for the other methods.

The relative performance of the four approaches to VAR estimation can be further explained by considering the relative numbers of true and false positive edges (Figure 1, middle and right box). The shrinkage method generally produces very few false positives. In contrast, the RR and LASSO methods lead to a large number of false edges, especially for small sample size. This is particularly pronounced for the LASSO regression, as can be seen in the differently scaled inset plot contained in the right box of Figure 1, indicating that the penalty applied in the L1 regression may not be sufficient in this situation. In terms of the number of correctly identified edges, RR and shrinkage are the two top-performing methods. However, even though RR finds a considerable number of true edges even at very small sample size, this has little impact on its true discovery rate because of the high number of false positives.

In summary, the simulation results suggest applying the James-Stein-type shrinkage procedure for small sample sizes, and the traditional OLS approach for *n* > *p*.

### Analysis of a microarray time course data set

For further illustration we applied the VAR shrinkage approach to a real world data example. Specifically, we reanalyzed expression time series resulting from an experiment investigating the impact of the diurnal cycle on the starch metabolism of *Arabidopsis thaliana* [17].

We downloaded the calibrated signal intensities for 22,814 probes and 11 time points for each of the two biological replicates from experiment no. 60 of the NASCArrays repository [18]. After log-transforming the data we filtered out all genes containing missing values and whose maximum signal intensity value was lower than 5 on a log-base 2 scale. Subsequently, we applied the periodicity test of [19] to identify the probes associated with the day-night cycle. As a result, we obtained a subset of 800 genes that we further analyzed with the VAR approach.
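The filtering steps can be sketched as follows (a Python illustration with hypothetical variable names; the periodicity test of [19] is a separate step not shown):

```python
import numpy as np

def filter_expression(intensities):
    """Preprocessing sketch for a (probes x time points) intensity
    matrix: log2-transform, then drop probes with missing values or
    a maximum below 5 on the log2 scale.
    """
    logx = np.log2(intensities)
    no_missing = ~np.isnan(logx).any(axis=1)
    # row maxima, treating missing values as -inf so they never win
    mx = np.max(np.where(np.isnan(logx), -np.inf, logx), axis=1)
    keep = no_missing & (mx >= 5.0)
    return logx[keep], keep
```

Applying the periodicity filter afterwards would reduce the kept probes to the day-night-cycle subset analyzed in the paper.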

We note that a tacit assumption of the VAR model is that time points are equidistant – see Eq. 1. This is not the case for the *Arabidopsis thaliana* data which were measured at 0, 1, 2, 4, 8, 12, 13, 14, 16, 20, and 24 hours after the start of the experiment. However, as the intensity of the biological reactions is likely to be higher at the change points from light to dark periods (time points 0 and 12), one may argue that assuming equidistant measurements is justifiable at least in terms of equal relative reaction rate.

A further implication of the VAR model (and indeed of many other graphical models) is that dependencies among genes are essentially linear. This can easily be checked by inspecting the pairwise scatter plots of the calibrated expression levels. For the 800 considered *Arabidopsis thaliana* genes we verified that the linearity assumption of the VAR model is indeed satisfied.

As the VAR network contains directed edges it is possible to distinguish genes that have mostly outgoing arcs, which could be indicative of key regulatory genes, from those with mostly ingoing arcs. In the graph of Figure 2 node 570, an AP2 transcription factor, and node 81, a gene involved in DNA-directed RNA polymerase, belong to the former category, whereas for instance node 558, a structural constituent of ribosome, seems to be part of the latter. Node 627 is another hub in the VAR network, which according to the annotation of [17] encodes a protein of unknown function. Another interesting aspect of the VAR network is the web of highly connected genes (encircled and colored yellow in the lower right corner of Figure 2) which we hypothesize to constitute some form of a functional module.

For comparison, we additionally computed a graphical Gaussian model (GGM) from the *Arabidopsis thaliana* data, shown in Figure 3. As expected, both graphs share the same structure (main hubs and the module of highly connected genes): if one node influences another at the next time point with a constant regression coefficient (VAR model), the two also tend to be significantly partially correlated at the same time point (GGM model). However, using a GGM it is not possible to infer the causal structure of the network.

## Conclusion

We have presented a novel algorithm for learning VAR causal networks. It is based on James-Stein-type shrinkage estimation of the covariances between different time points of the experiment, which in turn leads to improved estimates of the VAR regression coefficients. Subsequent VAR model selection is conducted by applying "local fdr" multiple testing to the corresponding partial correlations.

We have shown that this approach is well suited for the small sample sizes encountered in genomics. In addition, the approach is computationally very efficient, as no computer-intensive sampling or optimization is needed: the inference of the directed network for the *Arabidopsis thaliana* data, with 640,000 potentially directed edges, takes about one minute on a standard desktop computer. While we have illustrated the approach by analyzing a microarray expression data set, it is by no means restricted to this kind of data – we expect that our VAR network approach will perform equally well for similar high-dimensional time series data from metabolomic or proteomic experiments.

The current algorithm employs a fixed "one step ahead" time lag. One strategy for generalizing to arbitrary time lags may be to consider functional data – see, e.g., [20, 21]. This would have the additional benefit of dealing suitably with non-equally spaced measurements, a common characteristic of many biological experiments.

## Declarations

### Acknowledgements

We thank Papapit Ingkasuwan for pointing us to the *Arabidopsis thaliana* data set, and the referees for their valuable comments. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via an Emmy-Noether research grant to K.S.

This article has been published as part of *BMC Bioinformatics* Volume 8, Supplement 2, 2007: Probabilistic Modeling and Machine Learning in Structural and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S2.


## References

1. Sims CA: **Macroeconomics and Reality.** *Econometrica* 1980, **48:** 1–48. doi:10.2307/1912017
2. Bay SD, Chrisman L, Pohorille A, Shrager J: **Temporal Aggregation Bias and Inference of Causal Regulatory Networks.** *J Comput Biol* 2004, **11:** 971–985. doi:10.1089/cmb.2004.11.971
3. Lütkepohl H: *Introduction to Multiple Time Series Analysis*. Berlin: Springer; 1993.
4. Ni S, Sun D: **Bayesian estimates for vector autoregressive models.** *J Business Economic Statist* 2005, **23:** 105–117. doi:10.1198/073500104000000622
5. Efron B, Morris CN: **Stein's estimation rule and its competitors – an empirical Bayes approach.** *J Amer Statist Assoc* 1973, **68:** 117–130. doi:10.2307/2284155
6. Opgen-Rhein R, Strimmer K: **Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach.** *Stat Appl Genet Mol Biol* 2007, **6:** 9.
7. Schäfer J, Strimmer K: **A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.** *Stat Appl Genet Mol Biol* 2005, **4:** 32.
8. Granger CWJ: **Testing for causality, a personal viewpoint.** *J Econom Dyn Control* 1980, **2:** 329–352. doi:10.1016/0165-1889(80)90069-X
9. Schäfer J, Strimmer K: **An empirical Bayes approach to inferring large-scale gene association networks.** *Bioinformatics* 2005, **21:** 754–764. doi:10.1093/bioinformatics/bti062
10. Whittaker J: *Graphical Models in Applied Multivariate Statistics*. New York: Wiley; 1990.
11. Efron B: **Local false discovery rates.** Tech rep, Department of Statistics, Stanford University; 2005.
12. Demiralp S, Hoover KD: **Searching for the causal structure of a vector autoregression.** *Oxford Bull Econom Statist* 2003, **65:** 745–767. doi:10.1046/j.0305-9049.2003.00087.x
13. Moneta A: **Graphical models for structural vector autoregressions.** Technical Report, Laboratory of Economics and Management, Sant'Anna School of Advanced Studies, Pisa; 2004.
14. Tibshirani R: **Regression shrinkage and selection via the lasso.** *J R Statist Soc B* 1996, **58:** 267–288.
15. Meinshausen N, Bühlmann P: **High-dimensional graphs and variable selection with the Lasso.** *Ann Statist* 2006, **34:** 1436–1462. doi:10.1214/009053606000000281
16. Golub G, Heath M, Wahba G: **Generalized cross-validation as a method for choosing a good ridge parameter.** *Technometrics* 1979, **21:** 215–223. doi:10.2307/1268518
17. Smith SM, Fulton DC, Chia T, Thorneycroft D, Chapple A, Dunstan H, Hylton C, Zeeman SC, Smith AM: **Diurnal changes in the transcriptome encoding enzymes of starch metabolism provide evidence for both transcriptional and posttranscriptional regulation of starch metabolism in Arabidopsis leaves.** *Plant Physiol* 2004, **136:** 2687–2699. doi:10.1104/pp.104.044347
18. **NASCArrays: the Nottingham Arabidopsis Stock Centre's microarray database** [http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl]
19. Wichert S, Fokianos K, Strimmer K: **Identifying periodically expressed transcripts in microarray time series data.** *Bioinformatics* 2004, **20:** 5–20. doi:10.1093/bioinformatics/btg364
20. Opgen-Rhein R, Strimmer K: **Inferring gene dependency networks from genomic longitudinal data: a functional data approach.** *REVSTAT* 2006, **4:** 53–65.
21. Opgen-Rhein R, Strimmer K: **Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data.** *Proceedings of the 4th International Workshop on Computational Systems Biology (WCSB 2006), Tampere* 2006, **4:** 73–76.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.