Time-lagged Ordered Lasso for network inference

Background Accurate gene regulatory networks can be used to explain the emergence of different phenotypes, disease mechanisms, and other biological functions. Many methods have been proposed to infer networks from gene expression data but have been hampered by problems such as low sample size, inaccurate constraints, and incomplete characterizations of regulatory dynamics. Since expression regulation is dynamic, time-course data can be used to infer causality, but these datasets tend to be short or sparsely sampled. In addition, temporal methods typically assume that the expression of a gene at a time point depends on the expression of other genes at only the immediately preceding time point, while other methods include additional time points without any constraints to account for their temporal distance. These limitations can contribute to inaccurate networks with many missing and anomalous links. Results We adapted the time-lagged Ordered Lasso, a regularized regression method with temporal monotonicity constraints, for de novo reconstruction. We also developed a semi-supervised method that embeds prior network information into the Ordered Lasso to discover novel regulatory dependencies in existing pathways. R code is available at https://github.com/pn51/laggedOrderedLassoNetwork. Conclusions We evaluated these approaches on simulated data for a repressilator, time-course data from past DREAM challenges, and a HeLa cell cycle dataset to show that they can produce accurate networks subject to the dynamics and assumptions of the time-lagged Ordered Lasso regression. Electronic supplementary material The online version of this article (10.1186/s12859-018-2558-7) contains supplementary material, which is available to authorized users.


S-1 Time-course data and models
Despite the fact that gene expression regulation and other biological functions are dynamic, many existing approaches for GRN reconstruction attempt to predict novel regulatory connections by using static data that is assumed to be measured under steady-state conditions. Given the current advances in high-throughput sequencing techniques, time-course data can be measured at many time points and provide a more comprehensive picture of the dynamic landscape of gene regulation and its resulting biological processes. In particular, time-course data can be used to identify activated genes and changes in differential expression, detect periodicity in gene expression, infer causality between genes, and provide insight into other temporal aspects and mechanisms that cannot be ascertained from static data. However, appropriate modeling assumptions and method modifications are required to effectively leverage the properties that are unique to temporal data. For example, straightforward applications of many existing GRN reconstruction methods to time-course data will treat expression values at different time points as independent samples and disregard the one-way temporal ordering of the data.
One common approach that has been used to analyze the time-course expression data is to assume that it can be modeled with vector autoregression. More specifically, the expression of a gene at a time point is assumed to be linearly dependent on the past expression values of its regulators. For simplicity and computational feasibility, most methods assume that this dependency is short-lived, typically lasting only through the immediately preceding time point, so that the expression of a gene can be described by where x i (t) is the expression of gene i at time t, and w ji are weights indicating how much the expression level of gene j influences that of gene i over a time interval ∆t. However, depending on factors such as the temporal resolution of the data and delay of influence, the expression of a gene may depend on the expression of its regulators at more than one preceding time point, in which case the the expression of a gene can now described by where x j (t − k∆t) is the expression of gene j at the k th previous time point, w ji,k measures how much the expression of gene j at the k th previous time point influences that of gene i at the current time point, and l max is the model order or maximum lag of a regulator's expression that the expression of gene i may depend on. {x j (t − k∆t)} k=1,...,lmax are called the lagged features or variables of gene j. Differential equations have also been commonly used to model gene regulatory dynamics. With linear ordinary differential equations, the expression of a gene i can be described by where a ji is the rate of influence of the expression of gene j on that of gene i. In order to account for dependence on multiple lags, delay differential equations may be used instead.
Using a linear delay differential equation with constant, discrete delays, the expression of a gene i can now described by where a ji,k is the rate of influence of the expression of gene j at the k th previous time point on that of gene i at the current time point. Since microarray data can only collected at discrete time points, discretizing this model results in where w ji,k = a ji,k ∆t and ∆x i (t + ∆t) = x i (t + ∆t) − x i (t).

S-2 Regularization parameter selection
In both method variations, the regularization parameters need to be specified. In the de novo case, the true GRN is not known a priori, but general trends and heuristics may be useful in choosing these values. One important consideration is the tradeoff between the true and false positive rates as a function of λ. Larger values of λ will promote more regularization and force many of the model coefficients to zero, and applying feature selection will produce fewer predicted edges. Therefore, false positive rates will decrease with large λ, but the true positive rates may decrease as well. Similarly, smaller values of λ will repress regularization and produce more non-zero coefficients and, consequently, more predicted edges. Therefore, true positive rates will increase with small λ, but false positive rates may also increase. In addition to statistical measures of performance, we may appeal to model selection techniques that automatically choose λ based on the fit of a model to the expression data. Common approaches include cross-validation to simulate out-of-sample performance and information criteria such as AIC and BIC scores to assess model quality and overfitting. Similar principles may be applied in the semi-supervised case. In this case, we can compute true and false positive rates with respect to the input GRN. However, because the GRN is generally only partially known, these rates should only be interpreted as recovery rates of the input GRN. For smaller values of λ edge , more prior edges will be recovered by the method, and fewer anomalous edges will be proposed. Therefore, λ edge is a reflection of our confidence in the prior information about a network, with smaller values corresponding to a greater certainty that the known edges are true regulatory dependencies. Similarly, for larger values of λ non-edge , fewer prior non-edges will be predicted as edges. λ non-edge therefore reflects apprehension to the possibility of novel edges in the network, with larger values corresponding to larger levels of doubt.

S-4 DREAM network-level analysis
In Figure S-2, the AUCs for each combination of network, method, and model order l max ∈ {1, . . . , 4} are shown. At the scale of individual networks, a variety of AUC patterns can be observed that are largely inconsistent across all of the networks. For example, there is no prevalent relationship between the AUCs and l max for each method. However, the AUCs do appear to be fairly robust to the choice of l max for each method-network combination and also suggest that all methods are capable of predicting edges correctly at rates better than chance. In addition, it is easy to identify networks in which one method attains the highest AUCs for certain values of l max and the worst AUCs at other values, while in other networks, the same method may dominate or be dominated by the other methods at all evaluated l max . However, we do note that Granger causality appears to perform marginally better than the other methods on some of the DREAM networks, regardless of the choice of l max . Conditioning across the different challenges, network sizes, and other factors reveals other elements that may impact the prediction accuracy. In Figure S-3, the AUC differences for Granger causality with respect to the Ordered Lasso are shown. For the DREAM4 challenge datasets, network size appears to influence whether the Ordered Lasso or Granger causality will predict a more accurate network. More specifically, Granger causality attains higher AUCs than the Ordered Lasso does on the 100-node networks, regardless of the model order chosen. In contrast, on the 10-node networks, the Ordered Lasso generally performs better, and for some of these networks, larger differences in accuracy can be observed than in the 100-node networks. When considering the DREAM3 challenge datasets, the type of organism also appears to have an effect on prediction accuracy. For most of the 10-node DREAM3 networks, Granger causality tends to outperform the Ordered Lasso, regardless of organism type. However, based on the AUC differences of the 50-and 100-node networks, the Ordered Lasso tends to perform better than Granger causality on the yeast-based networks, whereas Granger causality tends to do better on the E. coli -based networks. This behavior may be due to the differences in the regulatory dynamics of gene expression between prokaryotic and eukaryotic organisms as well the structural differences in their networks. These observations therefore suggest that when attempting to reconstruct networks, no method may perform best on every network, and a preferred method may be dictated by a combination of factors, including network size and type of organism.

S-5 DREAM Granger causality and cross-correlation
In Sections 3.2 and S-4, we observed that Granger causality tended to perform better than the Ordered Lasso did on many of the DREAM networks. One possible reason for this is that many pairs of genes that correspond to true edges may be exhibiting high crosscorrelations in absolute value. For the networks in which there is a clear distinction between the distribution of edge and non-edge absolute cross-correlations, the lagged variables of a regulator of a target gene are able to strongly account for much of the evolution in expression of that gene. Recall that bivariate or pairwise Granger causality fits two autoregression models of order l max , (S-5) and compares the restricted model (Equation S-4) to the unrestricted model (Equation S-5) using an F -test to assess the explanatory gain of using the lagged values of x in predicting y.
In this context, the inclusion of the higher-order lags of a regulator to a target's unrestricted model will therefore tend to result in strong improvements over the restricted model due to the high absolute cross-correlations of the true edges.
In Figure  cross-correlations at various lags in several of the DREAM4 challenge networks in which Granger causality outperformed the Ordered Lasso. Overall, we see that edges tend to have higher absolute cross-correlation values than the non-edges do in these networks. As a result, the Granger causality AUCs should be fairly high for many of the DREAM networks. Furthermore, the absolute cross-correlations tend to be higher for the edges than for the nonedges across the different lags, which may explain some of the disparity between the Granger causality and Ordered Lasso AUCs; if the delay in influence of a regulator on a target is longer than one time point and therefore incompatible with the monotonicity constraint of the Ordered Lasso, then the Granger causality AUCs may end up being higher than the Ordered Lasso AUCs.

S-6 Model choices: dependent variables and inclusion of loops
In the main text, we assumed that the dynamics of gene expression can be modeled with Equation S-2, i.e., the expression of a gene is a linear function of the expression of its regulators at multiple preceding time points. In addition, we assumed that the expression model of a gene also included the lagged expression of the gene itself as covariates (loops), even if self-regulation was not evident; in certain cases, this may be a reasonable modeling choice, such as when the expression at a time point can be viewed as the result of the collection of perturbations from its regulators to its expression at previous time points. Since such modeling decisions may appear arbitary and other choices may potentially produce a more accurate network, we now analyze the differences in de novo reconstruction accuracy when using the change in expression (Equation S-3) rather than expression (Equation S-2) of a gene at a time point as the output variable to model the gene regulatory dynamics. In addition, we consider the effect of excluding loops from the gene expression models. In Figure S-5, densities are fit to the Ordered Lasso-based DREAM AUCs and shown for each combination of model output, loop existence, and model order l max ∈ {1, . . . , 4}, and in Table S Several trends in the AUC can be observed with respect to the modeling parameters. For example, given a fixed l max and loop inclusion choice, the AUC with the change in expression as the output tends to be less than the AUC with the expression as the output, suggesting that modeling the dynamics based on the change in expression may produce less accurate networks. In addition, the difference in the AUC appears to be larger in the non-loop case than in the loop case. Similarly, when l max and the model output choice are fixed, excluding loops from the models results in less accurate networks, with the effect being greater in the change-in-expression output models than in the expression output models. Lastly, a variation of the monotonic AUC property that we observed when varying the model order in the HeLa subnetwork application can also be seen in the DREAM AUC densities. In this case, when the expression of a gene is used as the model output, the AUCs tend to increase with l max . However, when the change in expression is used, the AUCs now decrease with l max . Altogether, these results impart several practical guidelines for predicting accurate networks with the Ordered Lasso. In general, using the expression to model the output tends to produce more accurate networks than using the change in expression. However, depending on the choice of the other parameters, the networks predicted by the latter approach may not be significantly different. Furthermore, using loops in the expression models to account for a gene's own variance in expression over time can also help the Ordered Lasso to predict accurate edges. Finally, the monotonic AUC property can be used to guide the choice of model order; if the expression (change in expression) is used as the output variable, then large (small) model orders should be used.