A novel procedure for statistical inference and verification of gene regulatory subnetwork
© Gong et al.; licensee BioMed Central Ltd. 2015
Published: 23 April 2015
The reconstruction of gene regulatory network from time course microarray data can help us comprehensively understand the biological system and discover the pathogenesis of cancer and other diseases. But how to correctly and efficiently decifer the gene regulatory network from high-throughput gene expression data is a big challenge due to the relatively small amount of observations and curse of dimensionality. Computational biologists have developed many statistical inference and machine learning algorithms to analyze the microarray data. In the previous studies, the correctness of an inferred regulatory network is manually checked through comparing with public database or an existing model.
In this work, we present a novel procedure to automatically infer and verify gene regulatory networks from time series expression data. The dynamic Bayesian network, a statistical inference algorithm, is at first implemented to infer an optimal network from time series microarray data of S. cerevisiae, then, a weighted symbolic model checker is applied to automatically verify or falsify the inferred network through checking some desired temporal logic formulas abstracted from experiments or public database.
Our studies show that the marriage of statistical inference algorithm with model checking technique provides a more efficient way to automatically infer and verify the gene regulatory network from time series expression data than previous studies.
Advancement of DNA microarray technology and next generation sequencing technique have revolutionized the molecular biology, making it possible for biologists to measure and collect thousands of genes' expression levels simultaneously, efficiently and precisely in one experiment. Computational analysis of genome-wide transcriptomics data will help us understand the regulatory components and mechanisms underlying some diseases. These explosively growing amount of highdimensional gene expression data can be divided into two types: static and time series. The static expression data are assumed to be independently and identically distributed (IID), and many statistical inference algorithms [1–8] have been developed to identify key genetic signatures and signaling pathways that are frequently altered in some diseases. Gene regulatory network plays a critical role in the cell's proliferation and differentiation, so, a comprehensive understanding of gene regulatory network (GRN) and regulatory components will help discover some drug targeted genes in cancer and other diseases. Computational biologists have proposed a variety of methods, for example, the Boolean networks  and differential equations , to study the gene regulatory network. Friedman and other researchers [11, 12] developed and applied discrete and continuous Bayesian networks (BN) with linear regression and non-parametric regression to infer gene regulatory networks. The BN approach could identify the causal relationships between different genes to some degree. However, it cannot construct cyclic networks and this method is unable to handle the temporal aspect of time-series data. But the feedback loops (cyclic pathways) are prevalent in the gene regulatory networks and signaling pathways.
The time series gene expression data can provide abundant information regarding the dynamic and temporal behaviors of biological system, which can not be handled by Bayesian network method. Dynamic Bayesian network (DBN) [13–16] is a promising alternative which has been proposed to construct GRN with feedback loops from time-series expression data. DBN has attracted a lot of attention from numerous bioinformatics researchers, and different DBN based approaches and tools were developed to increase accuracy and reduce computational time. An extended expectation-maximisation (EM) algorithm  was proposed to estimate the parameters in the DBN model. However, the DBN method has some limitations, for example, it is very sensitive to the choice of data discretization. Moreover, the deduction of the "activation" or "inhibition" relationship between different genes is not easy and accurate, so, the inferred optimal network might not be a correct one. Recently, Liang et al.'s work  proposed a network and community identification (NCI) method to infer multiple signed subnetworks from gene expression data by incorporating community structure information.
Without verification or validation, the inferred regulatory networks can not help us correctly understand the mechanism in the cell cycle. Another limitation in the previous studies is, the correctness or verification of the inferred networks is manually checked by comparing with public database (KEGG, GO, GenMAPP, etc) or existing/known models. This verification procedure is only good for small and already-known network "inference" and "verification". However, the signaling pathway or regulatory network is complex due to the excessive number of components and interactions, it is not realistic and efficient to use traditional methods to manually verify or analyze large networks. An intelligent verification technique called Model Checking  has been successfully applied for the verification of complex systems, including the hardware (e.g., CPU) and software (aerospace control software) systems. Recently, we applied this technique to study some complex biological networks [20–25]. Model Checking is the process of determining whether or not a given system M satisfies a desired temporal logic formula ψ, denoted by M ╞ ψ. Our previous work proposed and applied different model checking techniques, including statistical model checking [21, 22], synchronous symbolic model checking [23–25], asynchronous model checking technique  and probabilistic model checking , to formally verify some given stochastic, boolean, and discrete-value models of signaling pathways in the cancer cells. The model checkers automatically and exhaustively search the state space to verify some desired temporal logic formula, and it can check up to 10100 possible states.
In this work, we proposed a novel inference and verification procedure, which marries the dynamic Bayesian network inference algorithm with a powerful model checking technique, to analyze time course microarray data. We will first briefly introduce the dynamic Bayesian network inference with Java objects (Banjo)  method developed in Hartemink's group  and apply it to infer optimal gene regulatory networks from time-series expression data. Then, we proposed a novel weighted symbolic model checking technique (weighted SMV) to automatically verify or falsify the inferred weighted networks or models through checking some temporal logic formulas abstracted from experiments. Finally, we apply Banjo and weighted SMV to analyze time-series microarray data and reconstruct gene regulatory subnetwork of yeast.
Dynamic Bayesian network inference
Probabilistic graphical model describes each node in the network by a random variable, and the directed edge represents a conditional dependence between two variables. Therefore, gene regulatory network can be graphically represented by a joint distribution of all random variables over time. The time series gene expression (microarray) data, which consists of p genes measured at n different time points, can be described by an n × p matrix X. If X i = (Xi1,..., X ip ) T is defined as a random variable vector (at time i), then, x i = (xi1,...,x ip ) T corresponds to the values of a vector of p genes' expression measured at time i = 1, 2,..., n; that is, x ij represents an observation value of the random variable X ij (the jth gene's expression measured at time i). We adopt some conventions used in Kim et al.'s work .
The BDe score function is based on the assumption that the microarray data D is a multinomial sample, that is, D|Θ ∼ Multinomial(Θ). BDe also assumes the parameters Θ are globally and locally independent and the priors of Θ, denoted by π(Θ|G, Λ), follow Dirichlet distribution with a hyperparameter vector Λ, that is, Θ|G ∼ Dirichlet(Λ), which is a conjugate prior of multinomial distribution. The optimal network is selected according to the BDe scores which are dependent on P (G, D). Next, search all the possible optimal networks. Banjo allows two different search strategies, including the greedy search and simulated annealing algorithm proposed by Heckerman , which can output top n directed networks with highest scores, and it can also retain and average some highest scoring networks to produce a weighted consensus network .
which is the probability that gene X ti takes a value of k given its parent gene Par(X ti ) takes a value of j; the cumulative distribution function , which describes the probability that gene X ti takes a value less than or equal to k given its parent gene takes a value of j; and a predefined voting system. If there is a high probability for the gene X ti to take a larger value given its parent's value increases, then, the voting system  in the Banjo will increase the positive vote by one; else, the negative vote will increase by one. If the influence score is close to 0, the sign of the edge can not be identified. Banjo can automatically implement the dynamic Bayesian network inference algorithm to search for high-scoring probabilistic graphical models, output the optimal networks and calculate the (signed) influence scores or weights. The interested reader could refer to [28, 32] for details.
The dynamic Bayesian network implemented with Banjo can infer the high-scoring gene regulatory networks based on the BDe metrics, however, this algorithm is sensitive to the data discretization methods. Moreover, in many cases, the inferred optimal network might not be a correct one based on different scoring functions. Which model is closest to the truth in the biological system? Previous studies validate the inferred network through manually comparing with the public database or known models. The manual verification method is not realistic for the large or unknown network verification. The most innovative aspect of the proposed procedure in Figure 2 is the marriage of dynamic Bayesian network inference algorithm with formal verification technique, called weighted symbolic model checking (Part 2 of pseudocode in Figure 2), which can automatically verify the network through checking some temporal logic formulas abstracted from the experiments or public database. Next, we will introduce a powerful model checking technique and apply it to formally verify the inferred regulatory networks.
Weighted symbolic model checking
A network or model can be described as a Kripke structure [19, 20] M = (S, s0, R, L), representing a finite-state concurrent system with the initial state s0 ∈ S, states transition relation R , and a labeling function L. Given a model or concurrent system, we expect it to satisfy some desired property. So, model checking, a formal verification technique , is the process of determining whether or not a given model M satisfies the desired property, which is expressed as a temporal logic formula ψ, denoted by M ╞ ψ. During formal verification, model checkers can search the state space of concurrent system exhaustively to find all states that satisfy the formula ψ. If the property is satisfied, model checker will output "True"; else, it will output "False" with a counterexample sequence that falsifies ψ. Model checking of hardware and software systems has been very successful in the past three decades. Recently, we proposed different (probabilistic, statistical, symbolic, synchronous and asynchronous) model checkers to formally investigate the complex signal transduction networks in the cancer cells [20–24, 26].
The desired properties describing some existing wet lab experimental results or phenomena are expressed in a high-level, expressive language - Computation Tree Logic (CTL) formula ψ. On the computation tree, the root represents an initial state, the branches and leaves represent all possible sequences of state transitions (paths) from the root . CTL formula ψ is composed of path quantifiers which describes the branching structure in the computation tree: A (for all paths), E (there exists some path); temporal operators describing properties on a path through the tree: X (next time), F (in the future), G (globally), U (until), R (release); and Boolean logic connectives (| (or), & (and), → (implies)). In the CTL formula, the temporal operator must be immediately preceded by a path quantifier . Similar to our previous work [19, 20, 26], we will use (AX, EX, AG, EG, AF, EF) to construct CTL formulas for the verification of gene regulatory network. For example, AG φ means φ is globally true on all paths; EF φ means φ holds at some state in the future on some path. More interesting CTL operators and formulas have been discussed in Clarke et al.'s book .
M, s ╞ ! ψ
iff M, s ╞ ψ does not hold;
M, s ╞ ψ1 &ψ2
iff M, s ╞ ψ1 and M, s ╞ ψ2;
M, s ╞ ψ1 | ψ2
iff M, s ╞ ψ1 or M, s ╞ ψ2;
M, π ╞ X ψ
iff M, π1 ╞ ψ;
M, π ╞ F ψ
iff there exists a k ≥ 0, such that M, π k ╞ ψ;
M, π ╞ G ψ
iff for all k ≥ 0, M, π k ╞ ψ;
M, s ╞ A φ
iff for every path π from s, M, π ╞ φ;
M, s ╞ E φ
iff there exists a path π from s, such that, M, π ╞ φ,
Symbolic Model Verifier (SMV)  is a popular formal verification tool encoded by ordered binary decision diagram , and the state transition relation is implicitly represented by a Boolean function. SMV can verify (output "True") or falsify (output "False" with a counterexample) a desired CTL formula ψ through automatically and exhaustively searching the state transition system M. Our recent studies [20, 23–25, 27] proposed both synchronous and asynchronous symbolic model checkers to study the Boolean and discrete value models of signaling pathways. These studies are based on the unweighted model checking, that is, the interaction between two nodes is represented by an unweighted edge.
Results and discussion
In this section, we will apply the dynamic Bayesian network inference and weighted symbolic model checking methods proposed in Figure 2 to infer and verify gene regulatory subnetworks from time series microarray data of yeast.
The time series microarray data of Saccharomyces cerevisiae collected by Spellman et al.  has been studied by many researchers using different inference algorithms. The data were measured and collected from the yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant, which contain 16, 25 and 14 time points. A full description and complete data sets are available at . The Banjo setting code, microarray data and weighted SMV code developed for this work are available at .
List of CTL formulas related to MAPK pathway in Figure 5 and verification results
Fus3 = 1 → AX(Dig1/2 = -1)
AG(Ste11 = 1 → AX(Ste7 = 1))
Msg5 = 1 → AF(Fus3 = -1 & Far1 = -1 & Dig1/2 = 1)
AG((Ste7 = 1 → AF(Fus3 = 1)) & (Fus3 = 1 → AF(Ste7 = 1)))
CTL formulas related to cell cycle subnetwork in Figure 6A and verification results
Fus3 = 1 → AX(Dig1/2 = −1)
AG(Ste11 = 1 → AX(Ste7 = 1))
Msg5 = 1 → AF(Fus3 < 0 & Far1 < 0 & Dig1/2 = 1)
AG((Ste7 = 1 → AF(Fus3 = 1)) & (Fus3 = 1 → AF(Ste7 = 1)))
AG((Swi4/6 = 1 → AF(Cdc6 ≥ 0)))
AG((Swi4/6 = 1 → AF(Cln12 ≥ 0)))
AG((Cdc45 = 1 | Msg5 = 1 → EF(Cdc28 = 1 & Mcm2/3/6 ≥ 0)))
EG((Swi4/6 = 1 → EF(Cdc28 ≤ 0)))
AG((Mcm2/3/6 = 1 → AF(Cln12 = 1)) & (Cln12 = 1 → AF(Mcm2/3/6 ≤ 0)))
A comprehensive understanding of the signaling pathways or gene regulatory networks will advance our knowledge in molecular biology. Network reconstruction from high-dimensional microarray data can help researchers to investigate the crosstalk of different pathways and develop effective multi-gene targeted treatments for some diseases, e.g., cancer and neurodegenerative diseases. Previous studies develop different statistical inference algorithms [13, 14, 39] to reconstruct gene regulatory network from time series expression data. The validation of inferred networks is implemented manually by comparing with public database or existing models, and, normally, a quantitative comparison is used to evaluate the superiority of a new approach [40, 41]. In this work, we proposed a novel procedure, which integrates the dynamic Bayesian network inference algorithm with formal verification technique implemented by Banjo and weighted SMV model checker respectively, to analyze time series gene expression data. The dynamic Bayesian network inference algorithm implemented by Banjo could infer optimal networks of highest scores with directed and weighted edges, however, this method is sensitive to the choice of data discretization methods. The weighted symbolic model checker will exhaustively search the state space to verify or falsify these network candidates through checking some desired temporal logic formulas. Compared with previous studies, the proposed procedure can automatically infer, verify or falsify a biological network based on existing experiments, so it has advantages in the large network inference and verification. The goodness of the verified network will be dependent on not only the learning scores, but also the number of verified temporal logic formulas. One of the key issues in the model checking procedure is the quantity and also quality of the desired temporal logic formulas, which can be abstracted directly from existing experimental results or public database. The more temporal properties we have, the more constrains we can impose on the inferred network candidates. Currently, the inferred regulatory networks are manually encoded into SMV program for model checking. Our future work will build a bioinformatics infrastructure which integrates statistical inference algorithms with different model checkers in a unified framework to automatically infer, encode network candidates into SMV program, and formally verify the inferred gene regulatory networks to select the best models.
HG would like to thank Dr. Hartemink for answering some questions related to Banjo. This work was partially supported by HG's new faculty start-up grant and President Research Fund award (230152) from the Saint Louis University.
Publication of this article was funded by the Saint Louis University to HG.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 7, 2015: Selected articles from The 11th Annual Biotechnology and Bioinformatics Symposium (BIOT-2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S7.
- Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21: 631-643. 10.1093/bioinformatics/bti033.View ArticlePubMedGoogle Scholar
- Luan Y, Li H: Group additive regression models for genomic data analysis. Biostatistics. 2008, 9: 100-113. 10.1093/biostatistics/kxm015.View ArticlePubMedGoogle Scholar
- Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25: 714-721. 10.1093/bioinformatics/btp041.PubMed CentralView ArticlePubMedGoogle Scholar
- Ma S, Song X, Huang J: Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics. 2007, 8: 60-76. 10.1186/1471-2105-8-60.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu TT, Wang S: Doubly regularized cox regression for high-dimensional survival data with group structures. Statisstics and Its Interface. 2013, 6: 175-186. 10.4310/SII.2013.v6.n2.a2.View ArticleGoogle Scholar
- Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J R Statist Soc B. 2006, 68: 49-67. 10.1111/j.1467-9868.2005.00532.x.View ArticleGoogle Scholar
- Wu TT, Gong H, Clarke EM: A transcriptome analysis by lasso penalized cox regression for pancreatic cancer survival. Journal of Bioinformatics and Computational Biology. 2011, 9: 63-10.1142/S0219720011005744.View ArticlePubMedGoogle Scholar
- Gong H, Wu TT, Clarke EM: Pathway-gene identification for pancreatic cancer survival via doubly regularized cox regression. BMC Systems Biology. 2014, 8:Google Scholar
- Akutsu T, Miyano S, Kuhara S: Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics. 2000, 16: 727-734. 10.1093/bioinformatics/16.8.727.View ArticlePubMedGoogle Scholar
- Chen T, He H, Church G: Moeling gene expression with differential equations. Pacific Symposium on Biocomputing. 1999, 29-40.Google Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using bn to analyze expression data. J Comp Biol. 2000, 7: 601-620. 10.1089/106652700750050961.View ArticleGoogle Scholar
- Imoto S, Goto T, Miyano S: Estimation of genetic networks and functional structures between genes by using bn and nonparametric regression. Pacific symposium on Biocomputing. 2002Google Scholar
- Kim S, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic bayesian networks. Briefings in Bioinformatics. 2003, 4: 228-235. 10.1093/bib/4.3.228.View ArticlePubMedGoogle Scholar
- Friedman N, Murphy K, Russell S: Learning the structure of dynamic probabilistic networks. Prpceedings of the 14th conference on the uncertainty in artificial intelligence. 1998Google Scholar
- Ong I, Glasner J, Page D: Modelling regulatoruypathways in e. coli from time series expression profiles. Bioinformatics. 2002, 18: 241-248.View ArticleGoogle Scholar
- Kim S, Imoto S, Miyano S: Dynamic bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. BioSystems. 2004, 75: 57-65. 10.1016/j.biosystems.2004.03.004.View ArticlePubMedGoogle Scholar
- Perrin B, Ralaivola L, Mazurie A, et al: Gene networks inference using dynamic bayesian networks. Bioinformatics. 2003, 74: 138-148.Google Scholar
- Liang X, Xia Z, Zhang L, Wu F: Inference of gene regulatory subnetworks from time course gene expression data. BMC Bioinformatics. 2012, 13: 3-10.1186/1471-2105-13-3.View ArticleGoogle Scholar
- Clarke EM, Grumberg O, Peled DA: Model Checking. 1999, MIT PressGoogle Scholar
- Gong H: Analysis of intercellular signal transduction in the tumor microenvironment. BMC Systems Biology. 2013, 7 (Suppl 3): S5-10.1186/1752-0509-7-S3-S5.PubMed CentralView ArticlePubMedGoogle Scholar
- Gong H, Zuliani P, Komuravelli A, Faeder JR, Clarke EM: Analysis and verification of the HMGB1 signaling pathway. BMC Bioinformatics. 2010, 11 (Suppl 7): S10-10.1186/1471-2105-11-S7-S10.PubMed CentralView ArticlePubMedGoogle Scholar
- Gong H, Zuliani P, Komuravelli A, Faeder J, Clarke E: Computational modeling and verification of signaling pathways in cancer. Proceedings of Algebraic and Numeric Biology, LNCS. 2012, 6479:Google Scholar
- Gong H, Wang Q, Zuliani P, Lotze MT, Faeder JR, Clarke EM: Symbolic model checking of the signaling pathway in pancreatic cancer. Proceedings of the International Conference on Bioinformatics and Computational Biology (BICoB). 2011Google Scholar
- Gong H, Zuliani P, Clarke E: Model checking of a diabetes-cancer model. 3rd International Symposium on Computational Models for Life Sciences. 2011Google Scholar
- Gong H, Wang Q, Zuliani P, Clarke E: Formal analysis for logical models of pancreatic cancer. 50th IEEE Conference on Decision and Control and European Control Conference. 2011Google Scholar
- Gong H, Feng L: Computational analysis of the roles of er-golgi network in the cell cycle. BMC Systems Biology. 2014, 8 (Suppl 4): S3-10.1186/1752-0509-8-S4-S3.View ArticleGoogle Scholar
- Gong H, Feng L: Probabilistic verification of er stress-induced signaling pathways. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine. 2014Google Scholar
- Sladeczek J, Hartemink A, Robinson J: Banjo: Bayesian network inference with java objects. User Guide. 2005Google Scholar
- Banjo Software. [http://www.cs.duke.edu/~amink/software/banjo/]
- Heckerman D, Geiger D, Chickering D: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning. 1995, 20 (3):Google Scholar
- Heckerman D: A tutorial on learning with bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research. 1996Google Scholar
- Yu J, Smith V, Wang P, Hartemink A, Jarvis E: Advances to bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004, 20: 3594-3603. 10.1093/bioinformatics/bth448.View ArticlePubMedGoogle Scholar
- McMillan KL: PhD Thesis: Symbolic Model Checking an Approach to the State Explosion Problem. 1992, Carnegie Mellon UniversityGoogle Scholar
- Bryant RE: Graph-based algorithms for boolean function manipulation. IEEE Tran. on Computers. 1986, 35 (8): 677-691.View ArticleGoogle Scholar
- Spellman P, Sherlock G, Zhang M, et al: Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hydridization. Mol Biol Cell. 1998, 9: 3273-3297. 10.1091/mbc.9.12.3273.PubMed CentralView ArticlePubMedGoogle Scholar
- S. Cerevisiae Expression Data by Spellman. [http://downloads.yeastgenome.org/expression/microarray/]
- Computer Code. [http://cs.slu.edu/~gong/Banjocode.zip]
- Chai L, Mohamad M, et al: A dynamic bayesian network-based model for inferring gene regulatory networks from gene expression data. International Journal of Bio-Science and Bio-Technology. 2014, 6: 41-52.View ArticleGoogle Scholar
- Novikov E, Barillot E: Regulatory network reconstruction using an integral additive model with flexible kernel functions. BMC Systems Biology. 2008, 2: 8-10.1186/1752-0509-2-8.PubMed CentralView ArticlePubMedGoogle Scholar
- R RS, Ventura D, Prince J: Controlling for confounding variables in ms-omics protocol: why modularity matters. Brief Bioinform. 2014, 15 (5): 768-70. 10.1093/bib/bbt049.View ArticleGoogle Scholar
- Smith1 R, Ventura D, Prince JT: Novel algorithms and the benefits of comparative validation. Bioinformatics. 2013, 29: 1583-1585. 10.1093/bioinformatics/btt176.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.