Nonparametric identification of regulatory interactions from spatial and temporal gene expression data
© Aswani et al. 2010
Received: 7 April 2010
Accepted: 4 August 2010
Published: 4 August 2010
Skip to main content
© Aswani et al. 2010
Received: 7 April 2010
Accepted: 4 August 2010
Published: 4 August 2010
The correlation between the expression levels of transcription factors and their target genes can be used to infer interactions within animal regulatory networks, but current methods are limited in their ability to make correct predictions.
Here we describe a novel approach which uses nonparametric statistics to generate ordinary differential equation (ODE) models from expression data. Compared to other dynamical methods, our approach requires minimal information about the mathematical structure of the ODE; it does not use qualitative descriptions of interactions within the network; and it employs new statistics to protect against over-fitting. It generates spatio-temporal maps of factor activity, highlighting the times and spatial locations at which different regulators might affect target gene expression levels. We identify an ODE model for eve mRNA pattern formation in the Drosophila melanogaster blastoderm and show that this reproduces the experimental patterns well. Compared to a non-dynamic, spatial-correlation model, our ODE gives 59% better agreement to the experimentally measured pattern. Our model suggests that protein factors frequently have the potential to behave as both an activator and inhibitor for the same cis-regulatory module depending on the factors' concentration, and implies different modes of activation and repression.
Our method provides an objective quantification of the regulatory potential of transcription factors in a network, is suitable for both low- and moderate-dimensional gene expression datasets, and includes improvements over existing dynamic and static models.
Inferring transcriptional regulatory networks in animals is challenging. For example, the large number of genes, the spatial and temporal complexity of expression patterns, and the presence of many redundant and indirect interactions all make it difficult to learn the network. In the long term, it will be necessary to use multiple data sets--including gene expression, genome wide in vivo DNA binding, and network perturbation data--to accurately represent all interactions. Combining multiple data classes in this way, however, is an open and challenging problem.
An alternative, intermediate approach is to use only gene expression data to infer regulatory networks. Here the relationships between the expression levels of one or more transcription factors and those of many putative target genes are used to predict which genes are the most likely targets of each factor. While much work has been done in this area, it is critical to understand the maximum amount of information that can be obtained about the network using this strategy.
Typical approaches for inferring regulatory networks have been to assume a model formulation and then fit the data to this formulation [1, 2]. Many models have been proposed, including coexpression networks [3–5], information-theoretic representations [6–8], regression onto dynamical systems [9–14], Bayesian networks [15–17], and other graphical models [18, 19], each of which has advantages and disadvantages. The primary differences between these models lie in the trade-off between statistical and interpretational issues. Techniques like Bayesian networks, graphical models, and information-theoretic models have protections against over-fitting (i.e., fitting models with many parameters to a small amount of experimental data); however, these techniques do not provide dynamical models which can generate new biological insights. On the other hand, techniques such as nonlinear regression and regression onto dynamical systems provide more biologically interpretable models, but sometimes suffer from inaccurate assumptions or over-fitting of the model to the data.
There is disagreement on the necessity of dynamical [9–13, 8, 15–19, 14] as opposed to static [3, 6, 7, 4, 5, 20–23] models. We feel that dynamical models are more philosophically pleasing because regulatory networks contain temporal characteristics: For example, a protein binds to DNA and initiates transcription, which eventually leads to transport of the mature mRNA to the cytoplasm. Yet the argument is often made that static models provide a quasi-steady-state interpretation of the network that may provide a sufficient approximation. Rigorous comparison of the two approaches, however, is lacking.
Dynamical modeling of animal regulatory networks has a long history [24, 25, 9, 26, 10, 27, 11, 28]. It is a powerful approach in which researchers hypothesize a set of nonlinear, differential equations to describe the network, but it generally requires significant prior knowledge about the network. If there is insufficient biological knowledge about the network, then the structure of the equations can be incorrectly chosen. And if the model is not carefully chosen, it will have a large number of parameters, possibly leading to weak biological effects being erroneously identified as strong effects. Furthermore, it is sometimes shown that a wide range of different parameter values can reproduce the biological behavior of the network, which could be taken as evidence for either network robustness or over-fitting .
The purpose of this paper is to describe a novel approach for inferring regulatory networks from expression data, and it provides a new way to trade off statistical issues and model interpretability. We generate a quasi-genetic, formal model of regulatory networks using nonparametric ordinary differential equations (ODEs) which are fit using the nonparametric exterior derivative estimator (NEDE) [29, 30]. For these reasons, we call our method and the resulting model the NODE (an amalgamation of NEDE and ODE) model. Our NODE model is similar to qualitative piece-wise linear network modeling and identification [13, 12, 14], and we extend these models by using identification techniques that have improved statistical properties and protect against over-fitting. The NEDE estimator adds constraints to the identification problem by learning correlations between factors, and these constraints protect the model from over-fitting and erroneously identifying weak biological effects as strong effects. Though we focus the discussion in this paper to temporal-spatial expression patterns, our NODE method can easily be used with time-series micro-array datasets. It is also scalable to a network sized on the order of hundreds of species.
We focus our modeling effort on the formation of eve mRNA stripes during Stage 5 of Drosophila melanogaster embryogenesis. We apply our technique to this portion of the regulatory network, and compare the performance of our method to that of other more commonly used models. We show that there are significant differences in the regulatory predictions made by the NODE model and other commonly used models, including the fact that our technique predicts that factors frequently have both positive and negative effects on the same targets, depending on the concentration of the factor. We also show that the NODE model performs better than a static, spatial-correlation model.
Our NODE model is a formalization of a quasi-genetic model that seeks to capture the total net effect of direct and indirect influence of each factor on a target gene, and it is generated by looking at the correlation between factor concentrations and the change in target mRNA concentration over time. This is done in small windows of neighboring cells on the embryo and at different time intervals during development. By looking at the change in target mRNA over time, we are able to generate a dynamic equation model that describes each factors' influence on each gene in space and time; tuning parameters for our method are selected in a data-driven manner using cross-validation (see Methods and Models for more details). In general, the model formally predicts repression in all cases where increases in the concentration of a factor leads to a decrease in the rate of change in target mRNA over time. Similarly, it formally predicts activation as all instances where increasing the concentration of a factor leads to an increase in the rate of change in target mRNA over time.
We assess the fit of our NODE model to the experimental data both qualitatively and quantitatively. Because we have an ODE model that describes the formation of the eve mRNA stripes, we can run a simulation of the model using only the experimentally measured eve concentration at the first time point of Stage 5 as the initial condition of the ODE. Only transcription factor protein and eve mRNA data from the first two time points was used to derive the NODE model for predicted regulatory interactions. By using this model along with the transcription factor protein expression data from all time points, we can then simulate the eve mRNA pattern for all six time points and then compare this to the experimentally measured eve pattern.
Such factor activity plots show the intensity and variation of predicted effects of factors at different locations on the embryo and at different time points. Our model is a formal, quasi-genetic ODE model. It is not a mechanistic model, because it cannot capture the various mechanisms involved in the regulation of eve mRNA. This, however, is a strength because of the flexibility gained by not having to make a priori assumptions on the regulatory mechanisms. This comes at the cost of not being able to identify which interactions are direct or indirect.
To aid understanding of our NODE model and help establish its utility, we compared it to a spatial-correlation model. Such models have also been used for identifying regulatory interactions from quantitative expression data [21–23, 20, 33], and are based on the descriptions of the relationship between transcription factor and target gene expression that have been most widely used by developmental biologists. These models are not dynamic and look at the correlation, at fixed time points, between factor concentrations and target mRNA concentrations. To make the result comparable to our NODE model, we consider a new variant of spatial-correlation models which looks separately at the correlation between factor levels and target mRNA levels in different, small regions of the embryo and at different stages of development.
This result fits with the idea that the NODE model is intrinsically more biologically realistic than a spatial-correlation model. As stated earlier, biological networks are marked by temporal effects. For instance, a protein binds to DNA which initiates transcription. This is not an instantaneous process, and there is some delay between when a factor initiates transcription and when the target mRNA is expressed. The spatial-correlation model does not model this notion of temporal effects, whereas the NODE model does.
In many cases it is known that individual gene expression stripes can be controlled via a single cis-regulatory module (CRM) and current computational models generally assume that a given factor acts only as an activator or a repressor on a given CRM (e.g. [26, 27, 34–36]). However, both our NODE model and our variant of the spatial correlation model frequently predict concentration dependent effects whereby, on and around the same expression stripe, a factor has both repressing and activating effects (see the yellow and light blue bars above the plots in Figures 6 and 7 and more generally Figures 4 and 5). For example, consistent with previous molecular genetic evidence, KR is predicted as a repressor of posterior eve stripe 2, but is also implied by the model to be as an activator just anterior of this in cells where KR concentrations are lower (Figure 7). This and the many other similar cases could represent spurious correlations, perhaps due to other factors having dominant effects on targets in cells where the factor under study is expressed at lower levels. However, there are a number of cases where factors, including KR, have been shown to switch from activating to repressing the same target as their concentrations increase [37, 38]. Thus, the predictions of both our NODE model and our variant of the spatial correlation model make it more obvious that gene regulation can involve multiple mechanisms of factor action that should be considered hence forth.
In some cases, the NODE model predicts factor activities that are closer to biological expectations than the spatial-correlation model. Figure 6 indicates that both models predict strong repression by GT in almost the same anterior portion of eve stripe 2 (regions where the blue lines are below 0). On the other hand, Figure 7 indicates that the spatial-correlation model predicts repression by KR mostly in the inter-stripe region between stripes 2 and 3, whereas the NODE model predicts repression by KR in the posterior half of stripe 2. Since it has been experimentally observed that the eve stripes narrow over time [31, 20], and the NODE model more accurately indicates narrowing of the stripes, this provides further support for the idea that the NODE model performs better.
Another significant difference between the two models is that the NODE model can distinguish between multiple regions of the embryo where target mRNA either increases or decreases over time, whereas spatial-correlation models, by definition, cannot. This allows the NODE model to provide more subtle distinctions of factor activity.
Type I Repression - At current factor concentrations, the target mRNA will decrease in concentration over time. An increase in factor concentration will lead to a faster rate of decrease in target mRNA amounts over time.
Type II Repression - At current factor concentrations, the target mRNA will increase in concentration over time. An increase in factor concentration will lead to a slower rate of increase in target mRNA amounts over time.
Type I Activation - At current factor concentrations, the target mRNA will increase in concentration over time. An increase in factor concentration will lead to a faster rate of increase in target mRNA amounts over time.
Type II Activation - At current factor concentrations, the target mRNA will decrease in concentration over time. An increase in factor concentration will lead to a slower rate of decrease in target mRNA amounts over time.
It is also instructive to compare our NODE model to existing dynamical models of spatial pattern regulation in Drosophila embryogenesis. There are dynamic models, some using nonlinear ODEs, that describe the developmental change in the expression of gap genes [26, 27, 35, 36] and the eve stripes . Some of these models only describe the network at the level of protein expression [26, 27] whereas others include more detailed processes such as protein binding [34–36]. Like our model, these models can replicate experimentally measured gene expression patterns.
The models in [26, 27] are similar to our work in some regards in that they concern the network at the expression level. However, they require significant biological knowledge in order to hypothesize the structural forms of their equations, which can be problematic because this limits their ability to provide new biological insights. For instance, an a priori biological assumption made by the models in [26, 27] is that factors do not have concentration-dependent effects. A factor always either represses, activates, or does not affect the target gene. Biological experiments [37, 38] and our models suggest that this is not always true.
The main disadvantage of the models in [34–36] is that they use in vitro data in fitting models for in vivo behavior. These models contain detailed predictions of the regulatory network such as levels of protein-DNA binding in vivo. This is problematic because the parameters of the models were calculated using only gene expression data and in vitro DNA binding data. No comparison was made between the models' inferences and actual measurements of in vivo DNA binding. Work by the BDTNP shows that there is no simple correlation between in vitro affinity and in vivo occupancy, even on highly bound functional targets . This suggests that the models in [34–36] are unlikely to be accurate and that more quantitative data, such as ChIP-chip or ChIP-seq binding data, needs to be used to calculate the model parameters.
We have described a novel approach for inferring interactions within animal regulatory networks. Our approach uses nonparametric statistics to generate ordinary differential equation (ODE) models from expression data, and it has certain statistical and mathematical advantages over existing approaches. It is able to generate spatio-temporal maps of factor activity, highlighting the times and spatial locations at which different regulators might affect target gene expression levels.
We identified an ODE model for eve mRNA pattern formation in the Drosophila blastoderm, and our model was able to reproduce the experimental patterns well. It gives 59% better agreement to the experimentally measured pattern, as compared to a spatial-correlation model. Our model suggests that protein factors frequently have the potential to behave as both an activator and inhibitor for the same cis-regulatory module depending on the factors' concentration, and implies different modes of activation and repression. This suggests further avenues of research.
Here, we describe our NODE technique which uses time-series data to generate a dynamical model. We assume that the rate-limiting species (i.e. transcription factor protein concentrations) which drive the behavior of the network have been measured, and we do not consider actions on faster time-scales (e.g. the dynamics of factors binding to target genes). Also, we assume that concentrations are large enough for the rates of interaction to be deterministic.
where x is a vector whose elements are the concentrations of the rate-limiting species. Nonlinear regression techniques [2, 1] start with a function with unknown coefficients, and then they regress the data onto this function. This is problematic because the relationships are highly nonlinear and one risks over-fitting the data by starting with a function with many unknown coefficients. In contrast, our NODE method does not make any assumptions on the functional form of f(x). We use nonparametric statistics to make local estimates of the ODE in Equation 1, and our tools can scale to networks with hundreds of species.
We focus our presentation on NODE models which describe the effect of five regulatory transcription factors on target eve mRNA, and we briefly comment on how this technique can be used with general, time-series data. The data set we use, code for our methods, and the models generated by our methods are publically available and can be downloaded from http://bdtnp.lbl.gov/Fly-Net/bioimaging.jsp?w=node.
We apply our technique to experimental data that has been collected and processed by the BDTNP [31, 20], where measurements of protein and mRNA concentrations are taken by analyzing images of many Drosophila embryos to create a virtual embryo. The virtual embryo consists of 6078 cells and is a computational, spatial decomposition which is determined by averaging the geometry and number of cells of different embryos [31, 20]. The virtual embryo has measurements of the concentration (averaged over the different embryos at fixed points in time) of various protein factors and target mRNAs at the cellular level for six different time points during Stage 5 of the Drosophila embryo. We denote the vector of factor concentrations as x[t, e] and the vector of target gene concentrations as y[t, e], where t = 1, ..., 6 is the time of the measurement and e = 1, ..., 6078 is an index which uniquely identifies each cell in the virtual embryo. Notation like x bcd [t, e] denotes the [bcd] concentration in cell e at time t.
Inputs: Factor concentrations x[t, e], target gene concentrations y[t, e]
Do a least-squares fit of the polynomial (where c 0,..., c r are coefficients and r is a tuning parameter) with the data points: x[t, e], for each t = 1, ..., 6
Do a least-squares fit of the polynomial (where k 0, ..., k r are coefficients and r is a tuning parameter) with the data points: y[t, e], for each t = 1, ..., 6
Presmoothed factor concentration data is given by , and time derivative of target gene data is given by
Define matrix Y with rows given by , for each t = 1, ..., 6 and e = 1, ..., 6078
Define matrix X [t, e] = [1 Ξ[t, e]], where first column is all one's and Ξ[t, e] is matrix with rows given by , for each u = 1, ..., 6 and v = 1, ..., 6078
for each u = 1, ..., 6 and v = 1, ..., 6078, where n[u, v] = ||x[u, v]-x[t, e]||2 is the Euclidean distance and h is a tunable parameter
Define matrix P [t, e] by making its columns be the (p - d) principal components of Ξ[t, e] T W [t, e]Ξ[t, e] with smallest eigenvalues, where p is number of factors (p = 5 for the NODE model of target eve mRNA) and d is a tuning parameter
Step 1 involves presmoothing the experimental data and computing its time derivatives. We prefer to do this with local polynomial regression (LPR)  because it suffers from fewer transient effects than digital filters [42, 41]. To simplify the presentation, Step 1.a describes polynomial regression (PR). LPR is a variant of PR which protects against over-smoothing the data, and it can be quickly computed by doing a weighted linear regression. More details on LPR can be found in .
This step is important because otherwise the NODE model will be statistically biased . However, caution must be used when deciding to presmooth certain data sets in which the measurements are very noisy and taken at a sparse grid of points in time. In such cases, there is a risk of smoothing out biologically-relevant, temporal trends in the data because of the sparsity of the temporal grid.
Step 3 computes the NODE model, and we make use of the NEDE estimator: a new statistical tool that protects against over-fitting . The computation in Step 3.a.ii determines a window of cells v at time u that have concentrations similar to cell e at time t. The size of this window is selected by the parameter h, and cells with highly (weakly) similar concentrations are weighted highly (weakly) in the estimation of the coefficients of the NODE model. Equation 2 uses the Epanechnikov kernel to do this weighting. Note that weights for cells with very different concentrations can be similar, because the Euclidean distances computed in Step 3.a.ii can be similar. This does not cause problems because the NEDE estimator has been proven to be statistically well-behaved in the presence of such weighting schemes [41, 29, 30].
Step 3.a.iv uses the NEDE estimator in Equation 3 to compute the coefficients of the NODE model. It protects against over-fitting by learning constraints which the data obeys (Step 3.a.iii), and then using these constraints to reduce the degrees of freedom in the regression. In general, the data points x [t, e] form a manifold, and the projection matrix P [t, e] in Equation 3 enforces that the regression coefficients lie close to the manifold. This methodology is motivated by differential geometry which says that the exterior derivative of a function on an embedded submanifold lies in the cotangent space [44, 29, 30]. The NEDE estimator can be calculated quickly on a computer because it is a convex optimization problem. Theoretical properties and a more detailed description of the NEDE estimator can be found in [29, 30].
Instead of using a single ODE model to describe the regulatory network, the NODE model uses a group of ODE models consisting of the first order Taylor expansion (i.e., linearization) of the ODE given in Equation 1. Each equation of the NODE model describes how the behaviour of the regulatory network changes if concentrations of the factors in cell e at time t are changed. It requires fewer assumptions or prior knowledge about the system, because it does not require knowing the mathematical structure of the single ODE model in Equation 1. The disadvantage of this approach is that it is more difficult to interpret a series of models. The full NODE model for formation of target eve mRNA is given by Equation 4, and there is a different equation for each cell e at time t. Though the NEDE estimator protects against over-fitting, some might feel that the NODE model over-fits. The predictive ability of the NODE model, as discussed in Results and Discussion, gives evidence that it does not over-fit. In that test, we used our algorithm on the first two time points of data, and we assumed that the model for cell e at times t = 3, 4, 5 was the same as the model for cell e at time t = 2.
Equation 4 is difficult to interpret because the coefficients vary depending on cell e at time t, due to the fact that each equation is a linearization that is valid for when factor concentrations are close to x[t, e]. The model describes how a change in factor concentrations in the presence of all factors (the right hand side of Equation 4), affects the change in time of eve (the left hand side of Equation 4). If d[eve]/dt is positive (negative), then eve concentration will increase (decrease) by the next instant of time. For example, suppose the concentrations of all species are kept fixed at except for the concentration of GT which is slightly increased from to . In this situation, the change in time of eve concentration will be given by d[eve]/dt = a gt,[t, e]Δgt + b[t, e]. The increase of GT concentration by Δgt leads to a change in the change in time of eve concentration by a gt,[t, e]Δgt amount, and the sign of a gt,[t, e] describes whether this change is positive or negative.
Because this equation describes relationships in the presence of all factors, this can lead to seemingly contradictory results, such as when one species is a putative activator (e.g., BCD protein upregulates eve mRNA), but increasing the concentration of the activator in the presence of the other species can have a slight repressive effect because of interactions between the activator and the other factor species (i.e., the described concentration-dependent effects). Such a situation leads to an odd result: The coefficient of the "activator" will be negative.
and this model looks for the correlation of eve mRNA with protein factor concentrations. Whereas Equation 4 is a dynamical model, the model in Equation 5 is a static model, because it does not describe the temporal changes in eve concentration. A comparison of the fits between these models can be seen in Results and Discussion. Note that the coefficients in Equation 5 are computed with the algorithm for our NODE technique, with the change that Y is a vector of eve concentrations.
The first term is the coefficient from Equation 4, and the second term in parenthesis is a measure of average GT concentration within cells whose factor concentrations are similar to cell e at time t. The second term in parenthesis in Equation 6 is a measure of average concentrations, because it is a measure of the mean difference from the baseline concentration of x[t, e]. To clarify the notation, suppose the i-th value of x[t, e] denotes: x gt [t, e], which is GT concentration. Then the term [Ξ[t, e] T W [t, e]Ξ[t, e]] gt denotes the i-th value along the diagonal of the matrix Ξ[t, e] T W [t, e]Ξ[t, e].
Mathematical definition of factor activity classification in the NODE model
Type I Repression
Type II Repression
Type I Activation
Type II Activation
An example of a window is shown in Figure 11. The NODE method uses Equation 2 to take the similarity of the cells into account when doing the regression procedure. The size of the window is determined by the parameter h which is chosen using cross-validation, and it changes for each cell e at time t. As explained earlier, the statistical tools are well-behaved when weighting of cells within this window is computed with Euclidean distance.
To check that window sizes selected in a data-driven manner were not too small and missing important features, we did a check in which we fixed the windows to surround cell e at time t with a circle of radius of three cells at time t. This size was chosen, because the eve stripes are about six cells wide at Stage 5. A circular window with this size would not miss important regulatory features of the network. The [Additional file 1] shows plots of factor activity as generated by our NODE method for both data-selected and fixed windows. A visual comparison of the factor activity plots generated by these two windows shows that the data-selected windows were able to identify the same features as the fixed, circular window.
where (a) ξ[n] for n = 1, ..., N is a user-selected set of linearization points of Equation 1, (b) A ξ[n] = Df(ξ[n]) and b ξ[n] = f(ξ[n]) are the coefficients of the model, and (c) Df is the gradient of f(x). The NODE technique is unchanged except e refers to different experiments (instead of different cells), and Equation 3 is applied column-wise to Y to give columns of the matrix of coefficients: [b ξ[n] T A ξ[n] T ] T .
This work is part of a broader collaboration by the BDTNP. We are grateful for the frequent advice, support, criticisms, and enthusiasm of its members. The gene expression data were funded by the U.S. National Institutes of Health (NIH) under grants GM704403 (to MDB). Computational analyses were funded by National Science Foundation (NSF) award #CCR-0225610 (ITR), which supports the CHESS at UC Berkeley (to CT) and NSF awards #DMS-0605236 and #DMS-0906808 (to PB). Work at Lawrence Berkeley National Laboratory was conducted under Department of Energy contract DE-AC02-05CH11231.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.