Reconstructing genome-wide regulatory network of E. coli using transcriptome data and predicted transcription factor activities
- Yao Fu^{1},
- Laura R Jarboe^{2} and
- Julie A Dickerson^{1, 3}
DOI: 10.1186/1471-2105-12-233
© Fu et al; licensee BioMed Central Ltd. 2011
Received: 15 November 2010
Accepted: 13 June 2011
Published: 13 June 2011
Abstract
Background
Gene regulatory networks play essential roles in living organisms: they control growth, keep internal metabolism running, and respond to external environmental changes. Understanding the connections and the activity levels of regulators is important for research on gene regulatory networks. While relevance score based algorithms that reconstruct gene regulatory networks from transcriptome data can infer genome-wide gene regulatory networks, they are prone to false positive results. Transcription factor activities (TFAs) quantitatively reflect the ability of the transcription factor to regulate target genes. However, classic relevance score based gene regulatory network reconstruction algorithms use models that do not include the TFA layer, thus missing a key regulatory element.
Results
This work integrates TFA prediction algorithms with relevance score based network reconstruction algorithms to reconstruct gene regulatory networks with improved accuracy over classic relevance score based algorithms. This method is called Gene expression and Transcription factor activity based Relevance Network (GTRNetwork). Different combinations of TFA prediction algorithms and relevance score functions were applied to find the most efficient combination. When the integrated GTRNetwork method was applied to E. coli data, the reconstructed genome-wide gene regulatory network predicted 381 new regulatory links. This reconstructed gene regulatory network, including the predicted new regulatory links, shows promising biological significance. Many of the new links are verified by known TF binding site information, and many other links can be verified from the literature and databases such as EcoCyc. The reconstructed gene regulatory network was applied to a recent transcriptome analysis of E. coli during isobutanol stress. In addition to the 16 significantly changed TFAs detected in the original paper, another 7 significantly changed TFAs were detected using our reconstructed network.
Conclusions
The GTRNetwork algorithm introduces the hidden TFA layer into classic relevance score based gene regulatory network reconstruction processes. Integrating the TFA biological information with regulatory network reconstruction algorithms significantly improves the detection of new links and reduces the rate of false positives. The application of GTRNetwork to E. coli gene transcriptome data gives a set of potential regulatory links with promising biological significance for isobutanol stress and other conditions.
Background
Gene regulatory networks play an essential role in controlling gene expression and ensuring that the right genes are expressed or silenced at the right time in the right place to make the organism function appropriately. Better understanding of gene regulatory structure aids biological researchers and biochemical engineers in obtaining more complete views of the complex gene expression and regulatory mechanisms in organisms.
Since TFA is governed by various complex molecular interactions, it is difficult to determine directly from experiments, especially if the activation mechanism is unknown. However, it is possible to computationally predict the change of TFAs relative to a reference state using transcriptome data and a known TF-gene network architecture [1, 2]. Network Component Analysis (NCA), developed by Liao et al., defines the problem of calculating TFAs as the optimization of a linear least squares matrix decomposition, which Liao et al. solve using an expectation maximization (EM) approach [3]. Fast Network Component Analysis (FastNCA) uses singular value decomposition (SVD) and a matrix projection technique to approximate the linear least squares matrix decomposition problem defined in NCA [4]. Similarly, Alter and Golub use SVD and pseudo-inverse projection, and integrate ChIP and microarray data to calculate the hidden TFA layer between TFs and genes [5]. ChIP data provides additional information on proteins' DNA binding occupancy. Gao et al. developed the MA-Networker algorithm, which combines microarray data for mRNA expression with transcription factor occupancy data and predicts TFAs from ChIP and transcriptome data using multivariate regression and backward variable selection [6]. With the predicted TFAs, Gao et al. calculate the TF-gene coupling factor using Pearson correlation [6]. Boulesteix et al. applied the statistically inspired modification of the partial least squares (SIMPLS) algorithm to find TFAs [7]. More complex models have also been applied to predict TFAs. For example, Nachman et al. use a Bayesian network approach to provide a probabilistic model for predicting TFAs [8]. The state-space model by Li et al. assumes that TFAs are affected by the TF gene expression of previous time points [9]. The probabilistic dynamical model by Sanguinetti et al. considers the possibility of the same TF having different activities on different target genes [10]. A Gaussian process model developed by Gao et al. uses a Bayesian marginalization approach to predict TFAs [11]. Besides predicting TFAs from gene expression data and TF network structures taken from experiments and the literature, many methods also use DNA sequence motif information (e.g., searches for DNA binding sites of TFs) to infer potential TF-gene links, which yields a more complete TF network structure and improves the prediction of TFAs [2]. However, compared to matrix decomposition and regression approaches, these complex models require more computational power. Thus, they either cannot handle large-scale TFA prediction or do so only by converting TFAs into binary values.
High-throughput technologies have led to many algorithms for the reconstruction of large scale gene regulatory networks [12]. For example, many sequence analysis approaches which identify potential TF binding sites have been developed [13]. However, many of the predicted potential TF binding sites are not functional (false positive predictions) [12]. With ChIP-chip technology, potential gene regulatory effects can be derived by identifying the portions of a genome that are bound by a particular TF in vivo [14]. Transcriptome data (also known as gene expression data) measured by genome-wide DNA microarrays are widely used for gene regulatory network reconstruction. For instance, Stuart et al. use correlation coefficients between mRNA levels of genes as relevance scores to reconstruct correlation networks [15]. Interacting genes are predicted by detecting correlation scores above a set threshold. Other algorithms such as RELNET (RELevance NETworks) [16] and ARACNE (Algorithm for the Reverse engineering of Accurate Cellular NEtworks) [17] use mutual information as the relevance score. The CLR (Context Likelihood Relatedness) algorithm [18] uses an adaptive background correction method on the relevance scores to improve precision. CLR significantly improved the performance of gene regulatory network reconstruction and has been widely adopted in more recent reconstruction algorithms. In the well-known Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenges [19], many winning algorithms are based on CLR. For example, the best performer in DREAM2 Challenge 5, synergy augmented CLR (SA-CLR), introduced three-way mutual information in place of the pairwise mutual information of the original CLR [20]. Madar et al. developed an ordinary differential equation (ODE) based dynamic model extension of CLR (mixed-CLR/tl (time-lagged) CLR integrated with Inferelator 1.0) that treats steady-state data and time-series data separately and performed outstandingly in the DREAM3 and DREAM4 100-gene in silico network challenges [21, 22]. Huynh-Thu et al. developed a regression and tree based algorithm to reconstruct gene regulatory networks, which was named best performer in the DREAM4 in silico Multifactorial challenge [23]. Pinna et al. developed a graph analysis based algorithm to predict directed gene regulatory networks from gene knockout experiments [24].
Many gene regulatory network reconstruction algorithms focus only on time series transcriptome data to develop dynamic models [25]. These include network identification by multiple regression [26], microarray network identification [27], multi-scale time-correlation estimation [28], time-series network identification [29], and directed information-based CLR [30]. Dynamic Bayesian network models use a Bayesian framework to reconstruct gene regulatory networks [31, 32].
Time-series based algorithms and dynamic Bayesian networks models can provide realistic models to reconstruct gene regulatory networks. However, due to a lack of closely spaced time-series data and computational power, these algorithms are difficult to apply on a genome-wide scale. Relevance score based algorithms are more efficient computationally and can integrate many different types of transcriptome data.
The standard simplified two-layer (TF-gene) model assumes a gene regulatory network model in which expressed TFs affect their target genes directly, despite the fact that TFA plays an important role in gene regulation. This simplification may lead to large false positive detection rates. Recently, the problem that TF gene expression does not necessarily correlate with target gene expression was noted in [33]. This discrepancy was addressed using a knowledge base representation of a TF expression by averaging the expressions of its target genes [33]. In our GTRNetwork model, we introduce a hidden layer of TFAs into relevance score approaches which connects TFs and their target genes. The three layer model (Figure 1) is more realistic than the two-layer model, and more biologically reasonable than the knowledge base representation model. The GTRNetwork model results in an approach to reconstruct large scale genome-wide gene regulatory networks that is both biologically more meaningful and computationally feasible.
Results
Selection of TFA prediction algorithms and network reconstruction algorithms
Different TFA prediction algorithms and network reconstruction algorithms affect the performance of the GTRNetwork method. In this research, the task is to reconstruct gene regulatory networks of E. coli at the whole-genome scale, which includes over 4000 genes and 160 TFs. Among TFA prediction algorithms, only those using matrix decomposition and regression approaches meet the computational and scale requirements of the GTRNetwork algorithm for a whole genome. The three major approaches to predicting TFAs are gNCA-r, which uses expectation maximization (EM) [3]; FastNCA, which uses singular value decomposition (SVD) [4]; and SIMPLS, which uses partial least squares (PLS) regression [7].
Regulatory network reconstruction algorithms that use TFAs and gene expression levels face scale and computational power requirements similar to those of the TFA prediction algorithms. The relevance scores are calculated with either Pearson correlation coefficients or adaptive partitioning mutual information (APMI) [35]. When the relevance score approach is applied to microarray experiments, different genes may have background noise with different patterns and scales. For example, relevance scores may fail to distinguish direct interactions from indirect influences when the experimental conditions are unevenly sampled, or when the microarray normalization fails to remove false background correlations [18]. Research by Faith et al. [18] showed that using a background correction in the relevance score based network reconstruction process reduces the false positive detection rate of regulatory links and significantly improves the performance of the network reconstruction. The Context Likelihood Relatedness (CLR) [18] algorithm provides background correction on relevance scores in GTRNetwork.
GTRNetwork Algorithm Testing
GTRNetwork Algorithm Combinations.
GTRNetwork Algorithm Variant | TFA prediction | Relevance score | CLR Background correction |
---|---|---|---|
E-A-C | EM | APMI | Yes |
E-A-N | EM | APMI | No |
E-C-C | EM | Cor | Yes |
E-C-N | EM | Cor | No |
P-A-C | PLS | APMI | Yes |
P-A-N | PLS | APMI | No |
P-C-C | PLS | Cor | Yes |
P-C-N | PLS | Cor | No |
S-A-C | SVD | APMI | Yes |
S-A-N | SVD | APMI | No |
S-C-C | SVD | Cor | Yes |
S-C-N | SVD | Cor | No |
N-A-C | None | APMI | Yes |
N-A-N | None | APMI | No |
N-C-C | None | Cor | Yes |
N-C-N | None | Cor | No |
In conclusion, the algorithms using EM-based or SVD-based TFA prediction methods along with the APMI relevance score gave the best performance. In general, using CLR background correction does not change performance significantly, but since CLR has low computational requirements (see the Discussion section) and has been shown to be helpful in gene regulatory network reconstruction algorithms [18], we suggest the use of CLR background correction in the GTRNetwork algorithm. Thus, the E-A-C combination (EM-based TFA prediction, APMI relevance score function, and CLR background correction) is used as the default GTRNetwork algorithm in the testing and application below.
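The three-letter variant labels in the combination table compose one choice per column. As an illustrative sketch (the dictionary and function names are hypothetical and not taken from the GTRNetwork source), the 16 variants can be enumerated like this:

```python
from itertools import product

# Letter codes as they appear in the combination table.
TFA_METHODS = {"E": "EM", "P": "PLS", "S": "SVD", "N": "None"}
RELEVANCE = {"A": "APMI", "C": "Cor"}
CLR = {"C": "Yes", "N": "No"}


def variant_table():
    """Return {label: (tfa_method, relevance_score, clr)} for every variant."""
    table = {}
    for t, r, c in product(TFA_METHODS, RELEVANCE, CLR):
        table[f"{t}-{r}-{c}"] = (TFA_METHODS[t], RELEVANCE[r], CLR[c])
    return table


variants = variant_table()
default_variant = variants["E-A-C"]  # the combination used as the default
```

The "N" rows correspond to running a relevance score directly on gene expression, without a TFA layer.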
Application of GTRNetwork Algorithm
According to the test results above, the E-A-C algorithm combination best fits the current known gene regulatory network topology from RegulonDB 7.0. This algorithm combination was applied using the full set of RegulonDB 7.0 TF-gene links as the initial network topology. The gene expression data of E. coli integrating 466 transcriptome experiment conditions on 4279 gene probes from the M3D database was used as the transcriptome data input. Resulting gene regulatory networks with sizes ranging from 100 links to 600 links were reconstructed. Different relevance score thresholds were set to reconstruct gene regulatory networks with different sizes. Higher thresholds result in smaller regulatory networks with fewer false positives. Lower thresholds give more complete networks, but with more false positives. A check operon step using operon information from RegulonDB 7.0 was applied to improve the sensitivity of the reconstructed regulatory networks. The complete detailed predicted results are shown in Additional file 2.
Literature validation of 12 predicted new links.
TF | Gene | Supporting Evidence |
---|---|---|
DicA | insD | TF binding site verified (RegulonDB) [34] |
DicA | intQ | TF binding site verified (RegulonDB) [34] |
DicA | ydfE | TF binding site verified (RegulonDB) [34] |
DcuR | pepE | Involved in anaerobic respiration related process (EcoCyc [37]) |
Fur | ybdB | ybdB (entH) is proposed to be regulated by Fur (EcoCyc [37]) |
Fur | yncE | yncE is de-repressed by Fur [41] |
IscR | fdx | Some evidence that the Fdx functions as an intermediate site for Fe-S cluster assembly [42] |
IscR | hscA | |
IscR | hscB | HscB is a co-chaperone that stimulates HscA (Hsc66) ATPase activity [44] |
IscR | iscX | Possibly involved in Fe-S cluster biogenesis [43] |
SgrR | sroA | TF binding site verified (RegulonDB) [34] |
Predicted Fur target genes
Gene | Gene function |
---|---|
efeU | Ferrous iron permease component of the EfeUOB ferrous iron transporter. |
ybdB (entH) | Thioesterase that is involved in the biosynthesis of enterobactin. |
bfd | Bacterioferritin-associated ferredoxin; predicted redox component complexing with Bfr in iron storage and mobility [2Fe-2S] |
bfr | Iron storage protein. |
efeB | Deferrochelatase, periplasmic; inactive acid inducible low-pH ferrous ion transporter EfeUOB; periplasmic acid peroxidase; heme cofactor. |
efeO | Inactive acid-inducible low-pH ferrous ion transporter EfeUOB; acid-inducible periplasmic protein. |
ybaN | Inner membrane protein, DUF454 family, function unknown. |
ydiE | Function unknown, hemin uptake protein HemP homolog |
yncE | Secreted protein, possible role in iron acquisition. |
yqjH | NADPH-dependent ferric reductase containing FAD, covalently bound to a cysteine sidechain. |
Although E. coli is very well characterized, many of its genes still have no known regulators. The GTRNetwork predictions help identify regulators for these genes. Among the 381 predicted links, there are 171 predicted target genes which previously had no known regulators (Additional File 2).
Significantly changed TFAs under the isobutanol condition predicted by the GTRNetwork reconstructed gene regulatory network. The reconstructed gene regulatory network includes the 381 potential new regulatory links; the 16 significantly changed TFAs predicted from the original RegulonDB data in Brynildsen's paper [39] are not included.
TF | Function | Target Genes |
---|---|---|
ArgR | Arginine catabolism | argA, gltF, argE, argH, rimP, rbfA, truB, rpsO, pnp, nusA, infB, hisP, gltD, gltB, carB, artP, artI, artQ, artM, artJ, hisJ, hisQ, metY, astE, astB, astD, astA, astC, hisM, argB, argC, argD, argF, argG, argI, argR, carA |
AscG | Arbutin-salicin-cellobiose transport and utilization | ascB, ascF, ascG, htpG, prpR, clpB, dnaJ, dnaK, tpke11, groL, groS, grpE, hslU, hslV, ybbN, lipB, ybeD, lnt, ybeX, ybeY, ybeZ |
CysB | Novobiocin resistance, sulfur utilization, and sulfonate-sulfur catabolism | tauA, tauB, tauC, ssuC, ssuD, ssuA, ssuE, hslJ, cbl, tauD, ssuB, cysP, cysU, cysW, cysN, cysM, cysK, cysJ, cysI, cysH, cysD, cysC, cysB, cysA, gsiA, gsiB, gsiC, gsiD, iaaA, yciW, ydjN, yeeD, yeeE |
Lrp | Leucine-responsive regulatory protein | lhgO, alaT, alaU, alaV, gltT, gltU, gltV, gltW, ileT, ileU, ileV, micF, rrfA, rrfB, rrfC, rrfD, rrfE, rrfG, rrfH, rrlA, rrlB, rrlC, rrlD, rrlE, rrlG, rrlH, rrsA, rrsB, rrsC, rrsD, rrsE, rrsG, rrsH, rrfF, thrV, csiD, ilvX, adhE, aroA, fimA, fimC, fimD, fimE, fimF, fimG, fimH, gabT, gcvH, gltB, gltD, ilvA, ilvD, ilvE, ilvH, ilvH, ilvI, ilvM, kbl, livF, livG, livH, livJ, livK, livM, lrp, lysU, malT, ompC, ompF, oppA, oppB, oppC, oppD, oppF, osmC, sdaA, serA, serC, tdh, argO, ilvL, gabD, gabP, osmY, hdeA, hdeB, yhiD, dadA, dadX, gcvT, gltF, stpA, gcvP, aidB, fimI, yeiL, yojI, gdhA, ilvG_1, ilvG_2, thrA, thrB, thrC, thrL |
MarA | Multiple antibiotic resistance | pqiB, pqiA, ybaO, nfsB, micF, slp, dctR, acrB, acrA, marB, marR, marA, inaA, rfaY, rfaZ, yhiD, hdeB, hdeA, rob, zwf, fumC, fpr, nfo, poxB, purA, putA, sodA, tolC, ygiA, ygiB, ygiC, ltaE, ybjT, talA, tktB, phr, ybgA, yhbW |
MetJ | Methionine biosynthesis and transport | metF, metK, metL, metR, yeiB, folE, ahpC, ahpF, metQ, metN, metI, metA, metB, metC, metE |
NadR | NAD biosynthesis | nadA, pnuC, pncB, nadB |
Discussion
In the Results section, the tests of algorithm combinations for GTRNetwork focused on finding the combination that gives the most precise prediction and maximum recall. The test results showed that introducing TFA significantly improved the precision and recall of relevance score based gene regulatory network reconstruction. The best combination of TFA prediction algorithm and relevance score function, in terms of precision and recall, depends on the size of the known initial TF-gene network topology.
Algorithm run time tests
Algorithm | PLS | EM | SVD | APMI | Correlation |
---|---|---|---|---|---|
Run time (seconds) | 2750 | 1750 | 6.2107 | 1740 | 1.4086 |
The algorithm combinations that use regression-based SIMPLS to predict TFAs are not as precise as the other combinations. However, SIMPLS does not have as many restrictions as the NCA algorithms, such as the non-redundancy and full column and row rank of the initial network topology. Thus, SIMPLS does not discard as much information while preprocessing data to fit the input criteria. Studies show that it can predict regulatory links that gNCA-r and FastNCA could not [7]. This property of SIMPLS is especially important when there are regulators or genes of interest that other TFA prediction algorithms would delete in order to fit the NCA criteria (details in the Methods section). There is no single optimal combination of algorithms for GTRNetwork; instead, users need to choose the appropriate combination based on their input data and other requirements.
The TFA prediction model does not need any biological knowledge on the detailed mechanisms of the activation of TFs. The model assumes that all of the complex effects that contribute to the change of TFA are included in the predicted TFAs and the control strengths. Thus, the GTRNetwork algorithm is not limited to prokaryotes, but can also be applied to eukaryotes. We plan to apply this method to eukaryotes such as yeast and plants in the near future.
Most relevance score based gene regulatory network reconstruction algorithms cannot identify the self regulation of TFs. Because gene expression data is used directly as the only input to represent both the regulators and the targets, the relevance score connecting a TF to its own gene is always high. In GTRNetwork, since the representation of the regulators (TFAs) and the representation of the targets (expression of genes, including TF genes) are well separated, the relevance score between the TF and its gene is meaningful, and the self regulation of TFs can also be identified. The prediction of self regulation of TFs improves interpretation of the cyclic structures of gene regulatory networks. Further analysis of the effect of feedforward and feedback loops is not carried out in this work but will be applied to the reconstructed networks in our future work.
TFA prediction methods are all based on a linear static model of experimental conditions, and treat dynamic time series data as static data of each time point. Thus, although time series transcriptome data can be used as an input of GTRNetwork, the algorithm does not take advantage of dependencies in time series data.
Conclusion
The GTRNetwork algorithm introduces the hidden TFA layer into classic gene regulatory network reconstruction algorithms. A comparison of several algorithmic variants showed that the E-A-C variant, which uses the EM-based TFA prediction method, adaptive partitioning mutual information as the relevance score function, and the CLR background correction method, best fits the currently known TF-gene regulatory networks from RegulonDB. Applying the E-A-C variant to E. coli data yields predictions with promising biological significance. It would be worthwhile to verify more of the predicted results biologically and to test alternative TFA prediction methods, such as SIMPLS-based methods, and alternative network reconstruction algorithms computationally. Application to other organisms such as yeast is also recommended for future research.
Methods
TFA prediction
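The definitions that follow describe a log-linear model of gene expression. As a sketch consistent with the stated matrix dimensions and with the standard NCA formulation [1] (not a verbatim copy of the typeset equations), the model can be written for gene i, condition j and TF k as:

```latex
\begin{aligned}
Er_{ij}(t) &= \prod_{k=1}^{L} TFAr_{kj}(t)^{\,CS_{ik}}, \\
\log\big([Er]\big) &= [CS]\,\log\big([TFAr]\big),
\end{aligned}
```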
where the N × M matrix [Er] is the relative gene expression level matrix, the L × M matrix [TFAr] is the relative transcription factor activity matrix, with elements Er_{ij}(t) = E_{ij}(t)/E_{ij}(0) and TFAr_{kj}(t) = TFA_{kj}(t)/TFA_{kj}(0), and the N × L matrix [CS] is the control strength matrix of transcription factors and genes. The gene expression model in Eq. (4) can be decomposed into matrix [CS] and matrix log([TFAr]) using different algorithms.
The relative gene expression level matrix [Er] can be obtained from transcriptome experiments such as DNA microarrays or RNAseq, and the control strength information must be initialized from the literature, e.g., RegulonDB [34], ChIP-on-chip experiments, or motif information (mNCA [40]). The initial matrix [CS] is converted from a known database of gene regulatory links between TFs and genes, e.g., RegulonDB [34]. Each row represents a gene and each column represents a TF. When there is a known regulatory link between gene i and TF j, CS_{ij} = 1; otherwise CS_{ij} = 0.
With the input of [Er] and [CS], transcription factor activities log([TFAr]) can be estimated. There are three major approaches to estimating log([TFAr]): the expectation maximization (EM) approach (e.g., gNCA-r) [3], the singular value decomposition (SVD) approach (e.g., FastNCA) [4], and the regression approach (e.g., SIMPLS) [7].
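The construction of the initial binary matrix can be sketched as follows. The function name and the toy gene and TF labels are hypothetical; a real run would read the link list from a database export such as RegulonDB.

```python
import numpy as np


def build_cs(genes, tfs, links):
    """Build the binary N x L matrix: rows are genes, columns are TFs."""
    gene_idx = {g: i for i, g in enumerate(genes)}
    tf_idx = {t: j for j, t in enumerate(tfs)}
    cs = np.zeros((len(genes), len(tfs)), dtype=int)
    for tf, gene in links:
        # Links whose TF or gene is missing from the data are skipped.
        if tf in tf_idx and gene in gene_idx:
            cs[gene_idx[gene], tf_idx[tf]] = 1
    return cs


# Toy input (made-up labels, not real regulatory links).
genes = ["geneA", "geneB", "geneC", "geneD"]
tfs = ["TF1", "TF2"]
links = [("TF1", "geneA"), ("TF1", "geneB"), ("TF2", "geneC"),
         ("TF2", "geneD"), ("TF3", "geneA")]  # TF3 is unknown and ignored
cs = build_cs(genes, tfs, links)
```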
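As a deliberately simplified illustration of the estimation step, and not an implementation of gNCA-r, FastNCA or SIMPLS, the sketch below holds [CS] fixed at its initial binary values and solves log([Er]) = [CS] log([TFAr]) for log([TFAr]) by ordinary least squares. The function name and the toy matrices are assumptions made for the example; NCA-type algorithms additionally re-estimate the nonzero control strengths.

```python
import numpy as np


def estimate_log_tfa(cs, log_er):
    """Least-squares solution X of cs @ X = log_er, with cs held fixed.

    cs:     N x L connectivity matrix (genes x TFs)
    log_er: N x M log expression ratios (genes x conditions)
    returns an L x M matrix of log TFA ratios
    """
    solution, *_ = np.linalg.lstsq(cs, log_er, rcond=None)
    return solution


# Toy check: generate expression from known activities and recover them.
cs_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_log_tfa = np.array([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]])
log_er_toy = cs_toy @ true_log_tfa
recovered = estimate_log_tfa(cs_toy, log_er_toy)
```

Because the toy data are noise-free and `cs_toy` has full column rank, the recovered activities match the generating ones; with real data the fit is only approximate.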
The NCA-based algorithms (gNCA-r and FastNCA) require the input to satisfy three criteria:
(i) The connectivity matrix [CS] must have full-column rank.
(ii) When a node in the regulatory layer is removed along with all of the output nodes Er_{i} connected to it, the resulting network must be characterized by a connectivity matrix that still has full-column rank. This condition implies that each column of [CS] must have at least L-1 zeros.
(iii) The matrix log([TFAr]) must have full row rank. In other words, each regulatory signal cannot be expressed as a linear combination of the other regulatory signals.
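The rank conditions above can be checked numerically. The following is a minimal sketch that assumes [CS] is stored genes-by-TFs; the function names are hypothetical, and the numerical rank comes from NumPy's default tolerance.

```python
import numpy as np


def cs_meets_criteria(cs):
    """Check criteria (i) and (ii) on an N x L connectivity matrix."""
    cs = np.asarray(cs, dtype=float)
    n_tfs = cs.shape[1]
    # (i) full column rank
    if np.linalg.matrix_rank(cs) < n_tfs:
        return False
    # (ii) drop each TF together with the genes it regulates; the
    # remaining matrix must still have full column rank (L - 1).
    for k in range(n_tfs):
        keep_rows = cs[:, k] == 0
        keep_cols = np.arange(n_tfs) != k
        reduced = cs[np.ix_(keep_rows, keep_cols)]
        if n_tfs > 1 and (reduced.size == 0
                          or np.linalg.matrix_rank(reduced) < n_tfs - 1):
            return False
    return True


def tfa_has_full_row_rank(log_tfa):
    """Check criterion (iii) on an L x M activity matrix."""
    log_tfa = np.asarray(log_tfa, dtype=float)
    return np.linalg.matrix_rank(log_tfa) == log_tfa.shape[0]


ok_cs = [[1, 0], [0, 1], [1, 1]]
bad_cs = [[1, 1], [1, 1], [1, 0]]  # every gene depends on the first TF
```

In `bad_cs`, removing the first TF removes every gene, so criterion (ii) fails even though the matrix itself has full column rank.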
Relevance Scores
Instead of calculating relevance scores between the expression levels of two genes, GTRNetwork calculates the relevance score between each TFA and each gene. Pearson correlation coefficient and mutual information are chosen as the relevance score functions:
where X_{ik} is the k-th observation of variable i, and S_{ij} is the Pearson correlation coefficient between variables i and j.
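For the Pearson correlation coefficient, the standard definition is given here in the notation of this section; the mean symbol \bar{X}_{i} is introduced for the example:

```latex
S_{ij} = \frac{\sum_{k} \left(X_{ik} - \bar{X}_{i}\right)\left(X_{jk} - \bar{X}_{j}\right)}
              {\sqrt{\sum_{k} \left(X_{ik} - \bar{X}_{i}\right)^{2}}\;
               \sqrt{\sum_{k} \left(X_{jk} - \bar{X}_{j}\right)^{2}}},
\qquad \bar{X}_{i} = \frac{1}{M}\sum_{k=1}^{M} X_{ik}
```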
Mutual Information
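For discrete variables, the standard definition of mutual information, written in the notation of this section (a sketch in which the sums run over the states of the two variables), is:

```latex
S_{ij} = \sum_{i}\sum_{j} p(i,j)\,\log\frac{p(i,j)}{p_{1}(i)\,p_{2}(j)}
```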
where p(i,j) is the joint probability of i and j, p_{1}(i) and p_{2}(j) are the marginal probabilities of i and j respectively, and S_{ij} is the mutual information score between variables i and j.
The Pearson Correlation (Eq. 4) performs extremely well in detecting linear relationships between two variables (genes in a set of microarray experiments), and Mutual Information (MI) (Eq. 5) has a relatively balanced performance in detecting both linear and non-linear relationships. However, most MI applications only work for discrete variables, and in this problem, both the gene expression ratio and TFA ratio are continuous variables. Adaptive partitioning [35] adjustments are applied to calculate mutual information between TFA ratios and gene expression ratios.
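Both score types can be sketched in a few lines. The mutual information estimate below uses a fixed equal-width histogram as a simple stand-in; it is not the adaptive partitioning estimator of [35], and the function names and bin count are assumptions for illustration.

```python
import numpy as np


def pearson_relevance(tfa, expr):
    """Pearson correlation between every TFA row and every gene row.

    tfa:  L x M matrix (TFs x conditions)
    expr: N x M matrix (genes x conditions)
    returns an L x N matrix; rows with zero variance are not handled.
    """
    a = np.asarray(tfa, dtype=float)
    b = np.asarray(expr, dtype=float)
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def binned_mi(x, y, bins=4):
    """Mutual information (nats) from a fixed-grid 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = p.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = p > 0                     # empty cells contribute nothing
    return float(np.sum(p[nonzero] * np.log(p[nonzero] / (px @ py)[nonzero])))


scores = pearson_relevance([[1, 2, 3, 4]], [[2, 4, 6, 8], [4, 3, 2, 1]])
mi_dependent = binned_mi([0, 0, 1, 1], [0, 0, 1, 1], bins=2)
mi_independent = binned_mi([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
```

Fixed bins are sensitive to the bin count and to outliers, which is one motivation for adaptive partitioning on continuous ratios.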
Background correction
In relevance score based network reconstruction approaches, there are tradeoffs between link detection sensitivity and the false positive detection rate [10]. One reason for false positive detections is the simplification inherent in the two-layer gene regulatory network model. Adding the TFA layer to the classic two-layer regulatory network model may solve this problem. Another reason is the noise in gene expression data and the different relatedness behaviors of TFs and genes. For example, the expression of some genes may be more stable than that of others and may not change much in response to different conditions; the relevance scores of these genes tend to be lower, so regulatory relationships between these genes and TFs are hard to detect. The same holds for TFAs. Thus, a background correction method such as context likelihood relatedness (CLR) [18] is needed.
By putting different thresholds on the matrix [Z] with elements Z_{ij}, gene regulatory networks with different sensitivities can be reconstructed by searching for gene regulatory links containing TF genes with a Z score larger than the threshold. Information on TF genes (which genes encode TFs) can be found in databases such as RegulonDB [34] and EcoCyc [37].
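A minimal sketch of a CLR-style background correction on a TF-by-gene relevance matrix [M] follows. It assumes the commonly used form of CLR [18]: each score is converted to a z-score against its row background and its column background, negative z-scores are clipped to zero, and the two are combined as the square root of the sum of squares. The function name and the zero-spread guard are assumptions for illustration.

```python
import numpy as np


def clr_z(m):
    """Combine row-wise and column-wise z-scores of a relevance matrix."""
    m = np.asarray(m, dtype=float)
    row_mu = m.mean(axis=1, keepdims=True)
    row_sd = m.std(axis=1, keepdims=True)
    col_mu = m.mean(axis=0, keepdims=True)
    col_sd = m.std(axis=0, keepdims=True)
    # Avoid division by zero for rows or columns with constant scores.
    row_sd[row_sd == 0] = 1.0
    col_sd[col_sd == 0] = 1.0
    z_row = np.maximum(0.0, (m - row_mu) / row_sd)
    z_col = np.maximum(0.0, (m - col_mu) / col_sd)
    return np.sqrt(z_row ** 2 + z_col ** 2)


# Toy matrix: one score stands out against both of its backgrounds.
m_toy = [[0.0, 0.0, 3.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
z_toy = clr_z(m_toy)
```

A score is kept only if it is high relative to what is typical for both its TF and its gene, which is how the correction suppresses uniformly elevated backgrounds.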
Integration of operon information
In the reconstructed gene regulatory network, when gene A is predicted to be regulated by some TFs, the other genes in the same operon as gene A are not always predicted to be regulated by the same TFs regulating gene A. However, in real gene regulatory networks, all the genes in the same operon tend to have similar behavior. The GTRNetwork algorithm uses an optional check operon step. When the operon information is available, the algorithm searches for genes in the same operon as the target gene and links these genes to the regulators of the target gene. This integration of operon information improves the detection sensitivity of regulatory links.
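The check operon step can be sketched as a set expansion. The function name, the data layout (a dictionary from operon name to member genes) and the toy labels are assumptions for the example.

```python
def expand_by_operon(links, operons):
    """Add, for each predicted (tf, gene) link, the gene's operon partners.

    links:   iterable of (tf, gene) pairs
    operons: dict mapping operon name -> list of member genes
    """
    gene_to_operon = {gene: name
                      for name, members in operons.items()
                      for gene in members}
    expanded = set(links)
    for tf, gene in links:
        name = gene_to_operon.get(gene)
        if name is not None:
            expanded.update((tf, partner) for partner in operons[name])
    return expanded


# Toy input with made-up labels.
toy_links = {("TF1", "geneA"), ("TF2", "geneX")}
toy_operons = {"op1": ["geneA", "geneB", "geneC"]}
toy_expanded = expand_by_operon(toy_links, toy_operons)
```

Genes that do not belong to any listed operon, such as `geneX` here, keep their original links unchanged.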
GTRNetwork algorithm
The GTRNetwork algorithm is implemented using Matlab and the source code is available at:
http://vrac.iastate.edu/~afu/GTRNetwork/GTRNetwork_1.2.1.zip.
Inputs:
a) Gene expression matrix [Er]
b) Initial TF-gene network topology in adjacency matrix [C]
c) Desired size of reconstructed regulatory network S
d) List of operons and the genes contained in them (Optional)
Outputs: A list of predicted regulatory links
1. Match the genes between the matrix [Er] and matrix [C]. Remove unmatched genes in [Er] and store the reduced matrix as [Er0].
2. If the TFA prediction algorithm is gNCA-r or FastNCA, check the three criteria described in the TFA prediction section and reduce the matrices [Er0] and [C0] to fit the criteria.
3. Apply the TFA prediction algorithm to predict the log_{2} ratio TFA matrix [TFA] from matrices [Er0] and [C0].
4. Calculate the relevance score matrix [M] between the TFAs and the expression levels of all genes from matrices [TFA] and [Er].
5. Calculate the joint statistical likelihood matrix [Z] of the relevance score matrix [M] using the CLR algorithm.
6. Set a threshold T for matrix [Z] so that there are S elements in [Z] greater than T. For all the TF-gene pairs having a Z score greater than T, construct a regulatory link.
7. If the operon list is available, check and add all genes in the same operon as the TF target genes to the regulatory target set of the TF.
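Step 6 amounts to keeping the S highest-scoring TF-gene pairs. A minimal sketch, assuming [Z] is stored TFs-by-genes and using hypothetical names throughout, is shown below; ties at the cutoff are broken arbitrarily.

```python
import numpy as np


def top_s_links(z, tfs, genes, s):
    """Return the s highest-scoring (tf, gene, score) triples from z."""
    z = np.asarray(z, dtype=float)
    s = min(s, z.size)
    # Sort all entries of the flattened matrix, largest first.
    order = np.argsort(z, axis=None)[::-1][:s]
    rows, cols = np.unravel_index(order, z.shape)
    return [(tfs[r], genes[c], float(z[r, c])) for r, c in zip(rows, cols)]


# Toy scores with made-up labels.
z_scores = [[0.1, 2.0, 0.5],
            [1.5, 0.2, 0.3]]
selected = top_s_links(z_scores, ["TF1", "TF2"], ["g1", "g2", "g3"], s=2)
```

Choosing S directly, rather than a raw threshold T, makes networks of a requested size comparable across different relevance score functions.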
Declarations
Acknowledgements
This material is based upon work supported by the National Science Foundation under Awards EEC-0813570, DBI-0543441, and IIS-0612240. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
- Liao JC, Boscolo R, Yang Y-L, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences of the United States of America 2003, 100: 15522–15527. 10.1073/pnas.2136632100
- Bussemaker HJ, Foat BC, Ward LD: Predictive Modeling of Genome-Wide mRNA Expression: From Modules to Molecules. Annual Review of Biophysics and Biomolecular Structure 2007, 36: 329–347. 10.1146/annurev.biophys.36.040306.132725
- Tran LM, Brynildsen MP, Kao KC, Suen JK, Liao JC: gNCA: A framework for determining transcription factor activity based on transcriptome: identifiability and numerical implementation. Metabolic Engineering 2005, 7: 128–141. 10.1016/j.ymben.2004.12.001
- Chang C, Ding Z, Hung YS, Fung PCW: Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data. Bioinformatics 2008, 24: 1349–1358. 10.1093/bioinformatics/btn131
- Alter O, Golub GH: Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proceedings of the National Academy of Sciences of the United States of America 2004, 101: 16577–16582. 10.1073/pnas.0406767101
- Gao F, Foat B, Bussemaker H: Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics 2004, 5: 31. 10.1186/1471-2105-5-31
- Boulesteix A-L, Strimmer K: Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theoretical Biology and Medical Modelling 2005, 2: 23. 10.1186/1742-4682-2-23
- Nachman I, Regev A, Friedman N: Inferring quantitative models of regulatory networks from expression data. Bioinformatics 2004, 20: i248–256. 10.1093/bioinformatics/bth941
- Li Z, Shaw SM, Yedwabnick MJ, Chan C: Using a state-space model with hidden variables to infer transcription factor activities. Bioinformatics 2006, 22: 747–754. 10.1093/bioinformatics/btk034
- Sanguinetti G, Rattray M, Lawrence ND: A probabilistic dynamical model for quantitative inference of the regulatory mechanism of transcription. Bioinformatics 2006, 22: 1753–1759. 10.1093/bioinformatics/btl154
- Gao P, Honkela A, Rattray M, Lawrence ND: Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities. Bioinformatics 2008, 24: i70–75. 10.1093/bioinformatics/btn278
- Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R: Gene regulatory network inference: Data integration in dynamic models--A review. Biosystems 2009, 96: 86–103. 10.1016/j.biosystems.2008.12.004
- Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B: A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Research 2006, 34: D95–97. 10.1093/nar/gkj115
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-Wide Location and Function of DNA Binding Proteins. Science 2000, 290: 2306–2309. 10.1126/science.290.5500.2306
- Stuart JM, Segal E, Koller D, Kim SK: A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 2003, 302: 249–255. 10.1126/science.1087447
- Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. 2000, 418–429.
- Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nat Genet 2005, 37: 382–390. 10.1038/ng1532
- Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biol 2007, 5: e8. 10.1371/journal.pbio.0050008
- Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 2010, 5: e9202. 10.1371/journal.pone.0009202
- Watkinson J, Liang KC, Wang X, Zheng T, Anastassiou D: Inference of regulatory gene interactions from expression data using three-way mutual information. Ann N Y Acad Sci 2009, 1158: 302–313. 10.1111/j.1749-6632.2008.03757.x
- Madar A, Greenfield A, Vanden-Eijnden E, Bonneau R: DREAM3: network inference using dynamic context likelihood of relatedness and the inferelator. PLoS One 2010, 5: e9803. 10.1371/journal.pone.0009803
- Greenfield A, Madar A, Ostrer H, Bonneau R: DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PLoS One 2010, 5: e13397. 10.1371/journal.pone.0013397
- Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P: Inferring regulatory networks from expression data using tree-based methods. PLoS One 2010, 5(9): e12776. [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0012776] 10.1371/journal.pone.0012776
- Pinna A, Soranzo N, de la Fuente A: From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS One 2010, 5: e12912. 10.1371/journal.pone.0012912
- Guthke R, Moller U, Hoffmann M, Thies F, Topfer S: Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics 2005, 21: 1626–1634. 10.1093/bioinformatics/bti226
- Gardner TS, di Bernardo D, Lorenz D, Collins JJ: Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling. Science 2003, 301: 102–105. 10.1126/science.1081900
- di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotech 2005, 23: 377–383. 10.1038/nbt1075
- Du P, Gong J, Wurtele E, Dickerson J: Modeling Gene Expression Networks using Fuzzy Logic. IEEE Transactions Systems, Man and Cybernetics, Part B 2005, 35: 1351–1359. 10.1109/TSMCB.2005.855590
- Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 2006, 22(7):815–822. [http://bioinformatics.oxfordjournals.org/content/22/7/815.abstract] 10.1093/bioinformatics/btl003
- Kaleta C, Gohler A, Schuster S, Jahreis K, Guthke R, Nikolajewa S: Integrative inference of gene-regulatory networks in Escherichia coli using information theoretic concepts and sequence analysis. BMC Syst Biol 2010, 4: 116. 10.1186/1752-0509-4-116
- van Berlo RJP, van Someren EP, Reinders MJT: Studying the Conditions for Learning Dynamic Bayesian Networks to Discover Genetic Regulatory Networks. SIMULATION 2003, 79(12):689–702. [http://sim.sagepub.com/content/79/12/689.abstract] 10.1177/0037549703040942
- Perrin B-E, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alche-Buc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics 2003, 19: ii138–148. 10.1093/bioinformatics/btg1071
- Seok J, Kaushal A, Davis RW, Xiao W: Knowledge-based analysis of microarrays for the discovery of transcriptional regulation relationships. BMC Bioinformatics 2010, 11(Suppl 1):S8. 10.1186/1471-2105-11-S1-S8
- Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, Garcia-Sotelo JS, Lopez-Fuentes A, Porron-Sotelo L, Alquicira-Hernandez S, Medina-Rivera A, Martinez-Flores I, Alquicira-Hernandez K, Martinez-Adame R, Bonavides-Martinez C, Miranda-Rios J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J: RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res 2011, 39: D98–105. 10.1093/nar/gkq1110
- Liang K-C, Wang X: Gene Regulatory Network Reconstruction Using Conditional Mutual Information. EURASIP Journal on Bioinformatics and Systems Biology 2008, 2008: 14 pages. [http://www.hindawi.com/journals/bsb/2008/253894/ref/Baba-Dikwa] 10.1155/2008/253894
- Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res 2008, 36: D866–870.
- Keseler IM, Bonavides-Martinez C, Collado-Vides J, Gama-Castro S, Gunsalus RP, Johnson DA, Krummenacker M, Nolan LM, Paley S, Paulsen IT, Peralta-Gil M, Santos-Zavaleta A, Shearer AG, Karp PD: EcoCyc: A comprehensive view of Escherichia coli biology. Nucleic Acids Research 2009, 37: D464–470. 10.1093/nar/gkn751
- Schwartz CJ, Giel JL, Patschkowski T, Luther C, Ruzicka FJ, Beinert H, Kiley PJ: IscR, an Fe-S cluster-containing transcription factor, represses expression of Escherichia coli genes encoding Fe-S cluster assembly proteins. Proceedings of the National Academy of Sciences of the United States of America 2001, 98: 14895–14900. 10.1073/pnas.251550898
- Brynildsen MP, Liao JC: An integrated network approach identifies the isobutanol response network of Escherichia coli. Mol Syst Biol 2009, 5: 277.
- Wang C, Xuan J, Chen L, Zhao P, Wang Y, Clarke R, Hoffman E: Motif-directed network component analysis for regulatory network inference. BMC Bioinformatics 2008, 9: S21.
- Baba-Dikwa A, Thompson D, Spencer NJ, Andrews SC, Watson KA: Overproduction, purification and preliminary X-ray diffraction analysis of YncE, an iron-regulated Sec-dependent periplasmic protein from Escherichia coli. Acta Cryst 2008, 64(Pt 10):966–969.
- Takahashi Y, Nakamura M: Functional Assignment of the ORF2-iscS-iscU-iscA-hscB-hscA-fdx-ORF3 Gene Cluster Involved in the Assembly of Fe-S Clusters in Escherichia coli. Journal of Biochemistry 1999, 126: 917–926.
- Tokumoto U, Takahashi Y: Genetic Analysis of the isc Operon in Escherichia coli Involved in the Biogenesis of Cellular Iron-Sulfur Proteins. Journal of Biochemistry 2001, 130: 63–71.
- Vickery LE, Silberg JJ, Ta DT: Hsc66 and Hsc20, a new heat shock cognate molecular chaperone system from Escherichia coli. Protein Sci 1997, 6(5):1047–1056. 10.1002/pro.5560060511
- The EcoGene Database of Escherichia coli Sequence and Function (Ecogene2.0) [http://www.ecogene.org]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.