A fast and efficient gene-network reconstruction method from multiple over-expression experiments
© Stokić et al; licensee BioMed Central Ltd. 2009
Received: 7 April 2009
Accepted: 17 August 2009
Published: 17 August 2009
Reverse engineering of gene regulatory networks presents one of the big challenges in systems biology. Gene regulatory networks are usually inferred from a set of single-gene over-expressions and/or knockout experiments. Functional relationships between genes are retrieved either from the steady state gene expressions or from respective time series.
We present a novel algorithm for gene network reconstruction on the basis of steady-state gene-chip data from over-expression experiments. The algorithm is based on a straight forward solution of a linear gene-dynamics equation, where experimental data is fed in as a first predictor for the solution. We compare the algorithm's performance with the NIR algorithm, both on the well known E. coli experimental data and on in-silico experiments.
We show superiority of the proposed algorithm in the number of correctly reconstructed links and discuss computational time and robustness. The proposed algorithm is not limited by combinatorial explosion problems and can be used in principle for large networks.
Prediction of functional relationships between genes, starting from actual gene expression data, is one of the primary goals of systems biology. Despite large efforts in this direction [1, 2], either based on transcription factor – promoter interaction [3, 4], or on inferring gene networks [5–9], methods for reliable predictions of collective behavior of gene-activity are yet to be found. Some general facts about the topology of gene regulatory networks [10–12], statistics of gene expressions  or the dynamics of gene regulation  are becoming to be understood. This knowledge is far from sufficient to successfully reconstruct gene networks, but can be helpful in limiting the tremendous number of parameters involved in reconstruction. Even if the average degree of the gene regulatory network, i.e. the number of genes regulated by some gene on average, was known, noisy and limited data will always lead to severe problems. The degree k i of a node i in a network is defined as the number of links that emerge from -or point to- that node. The average degree is denoted by ⟨k⟩.
There are basically two types of reverse engineering approaches depending on the experimental setup, inferring the gene network from steady-state [8, 9] or from time-series [14, 15] experiments. By using steady-state experiments, one can not draw any conclusion about the dynamics of gene regulation. Conducting time-series experiments gives helpful insights into gene regulatory dynamics, but often with the price of getting redundant information. Further, due to costs full time-series data on gene expression are in general not available. As described in , one can further divide the reverse engineering methods into four categories: differential equation models [5, 6], boolean network models , Bayesian network models  and association networks . The reverse engineering methods based on differential equations further may rely on linear  or nonlinear differential equations . The latter models may possess potential in alleviating the effects of insufficient information fundamental in reverse engineering of genetic networks by obtaining information from combinatorial perturbations measurements on the system. In linear systems such information would be redundant.
How can gene regulatory network reconstruction methods be validated and compared? Neither a standardized biological benchmark, nor a consensus on what class of models to use for in-silico testing exists . The usual way is to validate a method either by applying it on a given experimental dataset or on in-silico datasets. In both cases one has to deal with different problems. Applying a method to an experimental dataset, poses the problem of comparing the reconstruction result with a network which is always just a consensus on how a biological network could look like, but never the exact gene regulatory network. On the other hand, when applying a reconstruction models to in-silico data, one has a perfect reference network, however the generated timeseries data is a result of a dynamical model of gene interaction, which cannot be shown to overlap with the real gene regulation dynamics.
In this paper we introduce a novel reverse engineering algorithm and validate it against biological and in silico data. The performance of the algorithm is compared against Network identification by multiple Regression (NIR)  for several reasons. i) NIR is considered a state of the art algorithm which plays a role comparable to a benchmark. To our best knowledge NIR has not been outperformed by other algorithms with respect to predicting the topology of genetic networks (real and in-silico) correctly -on average- so far. However, NIR has two drawbacks. a) The computation time increases quickly with the network size. b) NIR can not identify hubs in networks. We will briefly discuss this in a little more detail further below. Recently a reconstruction algorithms has been proposed by  which is clearly faster than NIR. Yet, its predictions do not better or completely equal the ones made by NIR. Another algorithm has been proposed recently , which correctly identifies the recA-hub in the E. coli SOS response network and reports a higher statistical significance of the reconstructed SOS network. Yet, the algorithm is restricted to a-cyclic networks. Both cases lack a comprehensive comparison with NIR and do not challenge its role as benchmark. ii) We decided to selected an E. coli dataset  for biological validation of our algorithm. This choice is satisfactory since the underlying SOS response network has been subject to over 30 years of research, which provides us with a well established consensus about the actual gene regulations going on in that particular sub-network. Moreover NIR has been applied to the same SOS response network of E. coli which makes it possible to compare the two algorithms. iii) Both, the proposed algorithm and NIR, build on the assumption that the dynamics of RNA concentrations defined by the underlying network is governed by a linear differential equation. For generating in silico data we therefore are using a linear gene regulation model proposed by the authors . The only nonlinearity in the model is introduced by the condition that RNA concentrations can not become negative. This model simultaneously captures a series of experimental facts, such as the distribution of genome wide gene-expression levels, multi-stability and periodicity. The proposed reconstruction algorithm does not rely on the detailed knowledge of the data generation model. The generated in silico data therefore is realistic and allows for an unbiased comparison of the proposed algorithm with NIR.
Although the proposed algorithm and NIR start from identical assumptions on RNA concentration dynamics and equally have to deal with insufficient information the reconstruction approaches are in fact quite different. The inference idea of the NIR algorithm is to reconstruct the network by using a least-squares multiple regression approach assuming the same fixed number of regulatory links for every gene, i.e. the average degree ⟨k⟩ of the network. Basically this means that promising ensembles of links are put to the test and the ensemble which gives the least squared error with the input data on basis of the underlying linear model is selected as the optimal solution which is calculated for all the ( ) possible combinations, where N is the size of the network. This combinatorial factor is responsible for a polynomial time characteristics of NIR, i.e. roughly T ∝ N⟨k⟩, which limits its applicability to relatively small networks with low average degree. Gene-networks are expected to have average degrees of about 4. For average degree ⟨k⟩ = 4 this implies that computation time of NIR grows quickly with at least a factor N4. Moreover, since NIR only considers networks where all nodes have identical degree there is no way NIR could detect hubs in networks. Clearly NIR will perform best on networks with sharply peaked degree distributions and perform bad on networks with broad degree distributions. The proposed algorithm however relies on the observation that for very sparse networks with no loops the information provided by experiments (perturbations and associated RNA concentrations) is sufficient to infer the network. The proposed approach tries to extrapolate the sparse network predictions to non sparse networks guided by the exact solutions of the linear differential equations. In contrast to NIR the number of regulatory links per gene are not fixed a priory which intuitively is much more realistic. While large networks of even thousands of genes do not actually pose a problem for the proposed algorithm in terms of computing time there are of course limitations to the network size in terms of available independent experimental information. Basically ( ) link weights have to be estimated from maximal N independent over expression experiments. Any reverse engineering algorithm relying on this limited information will necessarily approach the results of pure gambling with growing network size.
will be one-to-one related with the network topology. is the gene expression level of the i th gene when no perturbation has occurred (μ ≡ 0) and the gene expression level of the i th gene, where the j th gene has been over-expressed. It is also not hard to realize that for sparse adjacency matrices the relative response will in general provide a good first estimate, i.e. ∝ Y. Moreover, linearity assures that relative responses for short times will be small, |Y ij | ≪ 1 and therefore Q ~ ~ Y.
A0 is the first approximation of the gene regulatory network we want to reconstruct.
A note on fewer than N experiments
Here the first index in D ij , i runs in the domain M <i ≤ N, the second index, 1 ≤ j ≤ N.
Testing the method
where C is a K × K confusion matrix, or more precisely the element C kl is counting the number of cases where category k is predicted, but category l was present. In our case, K = 3 and the categories are: positive link, negative link and no link between any two genes. It is straight forward to see that p-values for any value of R K can be computed exactly in the same way as for the Pearson correlation coefficient, provided sample size is given.
It is important to stress the difference in measuring reconstruction performance in in-silico and biological experiments. While in biological networks, self-regulation is a part of the complete gene regulatory network, in numerically simulated gene regulation dynamics, self-regulation is often screened by negative self-degradation rates, which have to be imposed, in order to keep the dynamics sufficiently stable, see e.g. . To be as correct and conservative as possible, we therefore compare our reconstructed adjacency matrix only with the off-diagonal elements in the in-silico case. In the E. coli case we of course compare with the complete adjacency matrix.
Testing on the E. coli dataset
We use the wild-type E. coli strain MG1655 available at . The reason for testing our method on this particular dataset is the fact that the SOS response of the E. coli is well understood, and some consensus over the topology of its gene regulatory network is reached. Moreover it is possible to compare reconstruction success with other groups [8, 18, 26]. We test the performance by counting the fraction of the correctly reconstructed links of all three classes (positive, negative and zero), and with the extended Matthews correlation coefficient.
The pure-chance reconstruction threshold
A strong criterion of checking the performance of any reconstruction method we consider, is to compare it with a pure random-reconstruction. Several proposed gene network reconstruction algorithms can be shown to perform only slightly above pure-chance reconstruction. Random reconstruction can be performed in the following way. Suppose that ⟨k⟩ denotes the true average degree of the network, which may or may not be known, and k g denotes a guess on ⟨k⟩. Since we estimate that the directed network has L = Nk g links we take a fully connected network and assign a random order to all N(N - 1) links. Then we take a random number with three outcomes: + (positive weight), - (negative weight), and 0 (no link), and assume that there are as many positive as negative links. The distribution of these outcomes therefore is such that both + and - occur with probability w± = k g /2N, while the 0 appears with probability w0 = 1 - k g /N. The true probabilities, i.e. the probability of +, -, 0 if the true average ⟨k⟩ was known, however are, p± = ⟨k⟩/2N and p0 = 1 - ⟨k⟩/N. Now we pick one link after another in the given random order, and assign a random symbol, +, - or 0 and repeat this until L links have been assigned either + or -. Since 'throwing the dice' is an event independent of the network topology, one can simply compute (k g |⟨k⟩) = w± p± and (k|⟨k⟩) = w0p0.
If reconstruction is based on pure chance the expected K-category correlation will be R K = 0. This can be seen by inserting the confusion matrix C ij = w i p j , i and j indexing +, - or 0, into Equation (18).
Reconstruction on in-silico data
Reconstruction of the E. coli SOS network
The computational time needed to perform the NIR algorithm on this particular 9 node network is of order of magnitude of 1 minute, while our approach takes less than a second, both performed on a standard personal computer. The NIR algorithm is unable to cope with reconstruction of significantly larger gene regulatory networks, both from the time or memory consumption, while our method can easily deal with larger network sizes. Because of typically high levels of noise and uncertainty in biological data collected throughout actual experiments, the robustness of a method is of crucial importance. We tested both the NIR and our algorithm in the following way: We took the E. coli data and added Gaussian noise (the noise amplitude, i.e. the standard deviation of the noise, was chosen to be 1.5 percent of the average amplitude of the input data). We produced 100 perturbed datasets, reconstructed the network with both, the proposed algorithm and NIR, and counted the number of links that have changed with respect to the network reconstructed from the unperturbed data, i.e. links where either - → +, 0 or + → -, 0 or 0 → ±. While for NIR 32.7 percent of the links changed on average for the proposed algorithm only 21.3 percent of the links were classified differently. In absolute numbers: for NIR 26.59 links and for the proposed algorithm 17.25 links were classified differently.
From the right panel of Figure 2, showing the Mathews correlation coefficient for the in-silico experiments, it can clearly be seen that for the fraction of correctly reconstructed links our method performs about equally well than NIR for very sparse networks (⟨k⟩ model = 1) and outperforms it for more densely connected networks. Looking at the fractions of correctly reconstructed links, one notices a slightly better performance of our algorithm, while for the extended Matthews correlation coefficient the difference is much more notable. To understand this difference, one has to take a closer look at the type I and type II errors of both methods. While the NIR algorithm makes almost the same number of reconstruction errors of all types, there is a clear distinction in errors made by our reconstruction algorithm. The vast majority of errors are made by assuming that there is a link (positive or negative) between two genes, while in the real case there is none, and vice versa. Only a few mistakes are made where the real positive link is reconstructed as negative, or vice versa. This is an additional asset of the proposed reconstruction algorithm.
In Figure 3 one can easily notice that both reconstruction methods, the proposed one and NIR, applied to in-silico data have their maxima in performance when the input average degree equals to the true one, ⟨k⟩ = ⟨k⟩ model , which can be seen as an additional consistency check of the algorithm. On the other hand, after applying both reconstruction methods on E. coli data, just the proposed reconstruction algorithm shows its performance maximum at the ⟨k⟩ = ⟨k⟩E.colipoint, while the NIR method shows similarities in behavior to the pure-chance reconstruction.
We introduced a reverse engineering procedure for gene regulatory networks, applicable on an experimental setup where all the genes belonging to a genetic (sub)network are being over-expressed one after the other, after which gene-chip measurements in the steady state are taken. We showed the reconstruction performance on both in-silico and biological data. The method is applicable to large networks, both from the computational memory or computational time point of view, which might be a problem for algorithms limited by combinatorial explosions. However the increasing lack of independent experimental information with growing network size practically limits networks to sizes N ≤ 50. However, due to the superior time characteristics, large networks could in principle be decomposed into overlapping subnetworks. These subnetworks can be inferred by the proposed algorithm and then merged together in an adequately chosen post processing step.
Except from technical benefits, the philosophy of our reconstruction method complies perfectly with the biological goals of conducting over-expression experiments. In contrary to the NIR algorithm or similar reconstruction methods, where the final solution is a network, where every link has the same significance, our method ranks the reconstructed links by their influence, which might be a very important issue in experimental gene interaction-detection Instead of randomly picking the links out of a given reconstructed topology, here one can select interaction-links with the highest weights. This again ameliorates the consequences of not knowing the real network connectivity ⟨ ⟩ a priori. While selecting a good value for ⟨k⟩ is crucial for getting reliable networks, it will not influence the ordering of the links by importance in the proposed algorithm. In other words, no matter which ⟨k⟩ is taken, the set of ranking of reliable links will not change.
Another shortcoming of the NIR algorithm is the fact that the resulting network has a trivial, unrealistic degree distribution, a delta function, δ(k - k*). Thus, detecting genetic hubs, peripheral genes, or any other topologically important genes in the network is practically impossible. The proposed method does not a priori restrict the topology of the reconstructed network except for the average degree ⟨k⟩ which is important for the thresholding only.
For successful reconstruction the NIR algorithm needs as an external input information on external perturbation, which is in most realistic cases at best only approximately known. In our in-silico experiments we have provided the exact information for NIR; even then the NIR algorithm was outperformed.
Supported by WWTF LS129 and Austrian Science Fund FWF Project P19132.
- Gardner TS, Faith JJ: Reverse-engineering transcriptional control networks. Physics of Life Reviews 2005., 2(65–88):View ArticlePubMedGoogle Scholar
- Markowetz F, Spang R: Inferring cellular networks – a review. BMC Bioiformatics 2007, 8: S5. 10.1186/1471-2105-8-S6-S5View ArticleGoogle Scholar
- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet 2001, 27: 167. 10.1038/84792View ArticlePubMedGoogle Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281. 10.1038/10343View ArticlePubMedGoogle Scholar
- Chen T, He HL, Church GM: Modeling gene expression with differential equations. Pacific Symp Biocomp 1999, 29–40.Google Scholar
- D'Haeseleer P, Wen X, Fuhrman S, Somogi R: Linear modeling of mRNA expression levels during CNS development and injury. Pacific Symp Biocomp 1999, 4–41.Google Scholar
- Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symp Biocomp 1999, 17–28.Google Scholar
- Yamanaka T, Toyoshiba H, Sone H, Parham FM, Portier CJ: The TAO-Gen Algorithm for Identifying Gene Interaction Networks with Application to SOS Repair in E.coli. Toxicogenomics 2004, 112: 1614–1621.View ArticleGoogle Scholar
- Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. Science 2003, 301: 102. 10.1126/science.1081900View ArticlePubMedGoogle Scholar
- Maslov S, Sneppen K: Specificity and Stability in Topology of Protein Networks. Science 2002, 296: 910–913. 10.1126/science.1065103View ArticlePubMedGoogle Scholar
- Jeong H, Mason S, Barabási AL, Oltvai Z: Lethality and centrality in protein networks. Nature 2001, 411: 41. 10.1038/35075138View ArticlePubMedGoogle Scholar
- Jeong H, Tombor B, Albert B, Oltvai Z, Barabási AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651. 10.1038/35036627View ArticlePubMedGoogle Scholar
- Živković J, Tadić B, Wick N, Thurner S: Statistical Indicators of Collective Behaviour and Functional Clusters in Gene Networks of Yeast. Euro Phys J B 2006, 50: 255. 10.1140/epjb/e2006-00103-4View ArticleGoogle Scholar
- Yeung M, Tegner J, Collins J: Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci 2002, 99: 6163–6168. 10.1073/pnas.092576199PubMed CentralView ArticlePubMedGoogle Scholar
- Arkin A, Shen P, Ross J: A test case of correlation metric construction of a reaction pathway from measurements. Science 1997, 29: 1275–1279. 10.1126/science.277.5330.1275View ArticleGoogle Scholar
- de la Fuente A, Bing N, Hoeschele I, Mendes P: Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 2004, 20: 3565. 10.1093/bioinformatics/bth445View ArticlePubMedGoogle Scholar
- Nelander S, Wang W, Nilsson B, She QB, Pratilas C, Rosen N, Gennemark P, Sander C: Models from experiments: combinatorial drug perturbations of cancer cells. Mol Syst Biol 2008, 4: 216. 10.1038/msb.2008.53PubMed CentralView ArticlePubMedGoogle Scholar
- Farina L, Mogno I: A fast reconstruction algorithm for gene networks. e-print archive arXiv:q-bio/0401044v1 2004. [http://arxiv.org/abs/q-bio.QM/0401044]Google Scholar
- Stokić D, Hanel R, Thurner S: Inflation of the edge of chaos in a simple model of gene interaction networks. Phys Rev E 2008, 77: 061917. 10.1103/PhysRevE.77.061917View ArticleGoogle Scholar
- Courcellec J, Khodurskya A, Peterb B, Browna PO, Hanawaltd PC: Comparative Gene Expression Profiles Following UV Exposure in Wild-Type and SOS-Deficient Escherichia coli. Genetics 2001, 158: 41–64.Google Scholar
- Walker GC: The SOS response of Escherichia coli. American Society for Microbiology 1996, 1400–1416.Google Scholar
- Koch WH, Woodgate R: DNA Repair in Prokaryotes and Lower Eukaryotes. DNA Damage and Repair 1998, 1: 107–134. full_textView ArticleGoogle Scholar
- Fernandez de Henestrosa AR, Ogi T, Aoyagi S, Chafin D, Hayes JJ, Ohmori H, Woodgate R: Identification of additional genes belonging to the LexA regulon in Escherichia coli. Molecular Microbiology 2000, 35: 1560–1572. 10.1046/j.1365-2958.2000.01826.xView ArticlePubMedGoogle Scholar
- Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A: The EcoCyc Database. Nucl Acids Res 2002, 30: 56–58. 10.1093/nar/30.1.56PubMed CentralView ArticlePubMedGoogle Scholar
- Gorodkin J: Comparing two K-category assignments by a K-category correlation coefficient. Comp Biol and Chem 2004, 28: 367–374. 10.1016/j.compbiolchem.2004.09.006View ArticleGoogle Scholar
- Supper J, Spieth C, Zell A: Reconstructing Linear Gene Regulatory Networks. EvoBIO LNCS 2007, 4447: 270–279.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.