Mining and state-space modeling and verification of sub-networks from large-scale biomolecular networks
- Xiaohua Hu†^{1}Email author and
- Fang-Xiang Wu†^{2, 3}
DOI: 10.1186/1471-2105-8-324
© Hu and Wu; licensee BioMed Central Ltd. 2007
Received: 09 May 2007
Accepted: 31 August 2007
Published: 31 August 2007
Abstract
Background
Biomolecular networks dynamically respond to stimuli and implement cellular function. Understanding these dynamic changes is the key challenge for cell biologists. As biomolecular networks grow in size and complexity, the model of a biomolecular network must become more rigorous to keep track of all the components and their interactions. In general this presents the need for computer simulation to manipulate and understand the biomolecular network model.
Results
In this paper, we present a novel method to model the regulatory system which executes a cellular function and can be represented as a biomolecular network. Our method consists of two steps. First, a novel scale-free network clustering approach is applied to the large-scale biomolecular network to obtain various sub-networks. Second, a state-space model is generated for the sub-networks and simulated to predict their behavior in the cellular context. The modeling results represent hypotheses that are tested against high-throughput data sets (microarrays and/or genetic screens) for both the natural system and perturbations. Notably, the dynamic modeling component of this method depends on the automated network structure generation of the first component and the sub-network clustering, which are both essential to make the solution tractable.
Conclusion
Experimental results on time series gene expression data for the human cell cycle indicate our approach is promising for sub-network mining and simulation from large-scale biomolecular network.
Background
We are in the era of holistic biology. Massive amounts of biological data await interpretation. This calls for formal modeling and computational methods. In this paper, we present a method to model the regulatory system which executes a cellular function and can be represented as a biomolecular network. Understanding the biomolecular network implementing some cellular function goes beyond the old dogma of "one gene: one function": only through comprehensive system understanding can we predict the impact of genetic variation in the population, design effective disease therapeutics, and evaluate the potential side-effects of therapies. As biomolecular networks grow in size and complexity, the model of a biomolecular network must become more rigorous to keep track of all the components and their interactions. In general this presents the need for computer simulation to manipulate and understand the biomolecular network model. However, a major challenge of modeling the dynamics of a biomolecular network is that conventional methods based on physical and chemical principles (such as systems of differential equations) require data that are difficult to accurately and consistently measure using either conventional or high-throughput technologies, which characteristically yield noisy, semi-quantitative, and often relative data.
In this paper, we present a hybrid approach that combines data mining and state-space modeling to build and analyze the biomolecular network of a cellular process. Our method consists of two steps. First, a novel scale-free network clustering approach is applied to the large-scale biomolecular network to obtain various sub-networks. Second, a state-space model is generated for the sub-networks and simulated to predict their behavior in the cellular context. It integrates the process of obtaining network structure directly with state-space dynamic simulation robust to qualitative (molecular biology) and noisy quantitative (biochemical) data to iteratively test and refine hypothetical biomolecular networks. In the following, we review some related work in community structure analysis, and biomolecular networking modeling.
Community structure analysis
The study of community structure in a network is closely related to the graph partitioning in graph theory and computer science. It has also closely ties with the hierarchical clustering in sociology [1]. Recent years have witnessed an intensive activity in this field, partly due to the dramatic increase in the scale of networks being studied. Because communities are believed to play a central role in the functional properties of complex networks [1], the ability to detect communities in networks could have practical applications. Studying the community structure of biological networks is of particular interest and challenging, given the high data volume and the complex nature of interactions. In the context of biological networks, communities might represent structural or functional groupings. They can be synonymous with molecular modules, biochemical pathways, gene clusters, or protein complexes. Being able to identify the community structure in a biological network may help us to understand better the structure and dynamics of biological systems. Hashimoto and colleagues [2] have developed an approach to growing genetic regulatory networks from seed genes. Their work is based on probabilistic Boolean networks and sub-networks are constructed in the context of a directed graph using both the coefficient of determination and the Boolean function influence among genes. The similar approach is also taken by Flake and colleagues [3] to find highly topically related communities in the Web based on the self-organization of the network structure and on a maximum flow method. Related works also include those that predict co-complex proteins. Jansen and colleagues [4] use a procedure integrating different data sources to predict the membership of protein complexes for individual genes based on two assumptions: first, the function of any protein complex depends on the functions of its subunits; and second, all subunits of a protein complex share certain common properties. Bader and Hogue [5] report a molecular complex detection (MCODE) clustering algorithm to identify molecular complexes in a large protein interaction network. MCODE is based on local network density – a modified measure of the clustering coefficient. Bu and colleagues [6] use a spectral analysis method to identify the topological structures such as quasi-cliques and quasi-bipartites in a protein-protein interaction network. These topological structures are found to be biologically relevant functional groups. In our previous work, we developed a spectral-based clustering method using local density and vertex neighborhood to analyze the chromatin network [7, 8]. Two recent works along this line of research are based on the concept of network modularity introduced by Hartwell and colleagues [9]. The works by both Spirin and Mirny [10] and Rives and Galitski [11] use computational analyses to cluster the yeast PPI network and discover that molecular modules are densely connected with each other but sparsely connected with the rest of the network.
Biomolecular networking modeling
A variety of approaches to state models have been implemented for gene and protein networks, including among others, hidden Markov models [12, 13], Bayesian networks [14–16], linear networks [17–19], finite state [20], and probabilistic Boolean networks [21, 22]. These and other methods are based on either treating biological variables at the crudest resolution (on or off in Boolean networks, a few more levels possible for finite state models but with rapidly growing complexity) or as absolute physical quantities. Boolean networks [23] are computationally simple and do not depend on precise experimental data, and thus they are potentially suitable for handling both the complexity of biological networks and qualitative text-based data. However, Boolean models have been proven to lack the resolution needed to accurately model biomolecular interactions [24]. In contrast, various differential equation-based models [17, 18, 25] are computationally expensive and sensitive to imprecisely measured parameters (and virtually useless given purely qualitative data, i.e. from text-mining). Fuzzy logic [26] provides a mathematical framework that is compatible with poorly quantitative yet qualitatively significant data, but it tends to generate so many rules used to describe biological systems [27].
Results
Algorithm: SNBuilder(G, s, f, d)
Gene name | DMTF | BRCA1 | HIFX | HE | PPP2R4 | MYC | NR4A2 | F2 |
---|---|---|---|---|---|---|---|---|
Protein Name | dm | brca1 | h1 | he | ptpa | myc | not | F2 |
Gene name | PTEN | RRM2 | PLAT | TYR | CAD | CDK2 | CDK4 | EP300 |
Protein Name | pt | R2 | tpa | tyr | cad | cdk2 | Cdk4 | P300 |
In this paper, we employ the human cell cycle gene expression data [28] to construct the state-space model of the sub-network shown in Figure 1. There are independent data in [28] for five methods of cell cycle synchronization, and out of which two datasets "Thy-Thy3" and "Thy-Noc" are complete for the genes in the sub-network we studied. These two datasets are swapped and used as the training dataset and the testing dataset in the two experiments.
As the human cell cycle gene expression data are very noisy, some data preprocessing techniques are applied to the log-ratio gene expression data. Firstly a filter is applied to gene expression profiles one by one. At a given time point, the new expression value is the average of three raw values at the immediately previous, current and immediatly behind points. As the mean values and magnitudes for genes and microarrays mainly reflect the experimental procedure [29], then the expression profile of each gene is normalized to have the mean of zero and the standard deviation of one, and then for the expression values on each microarray as so to have the median of zero and the standard deviation of one. Such normalizations also make the PPCA simple [30].
Experiment 1
To inspect stability, robustness, and periodicity of the inferred gene networks, the eigenvalues of the state transition matrices A are calculated. The eigenvalues of matrix follow as: -0.0715, 0.2479, 0.9018, 0.6749 ± 0.3959i, 0.8125 ± 0.2924i, 1.0396 ± 0.1536i. All eigenvalues except for the last pair of matrix A lie inside the unit circle in the complex plane, and the last pair is very closed to the boundary of the unit circle. This means that the inferred network is almost stable and robust. Furthermore, the dominant eigenvalues of the inferred network are pairs of conjugate complex number: 1.0396 ± 0.1536i. Accordingly, this implies that the network behaves periodically [31]. This result is because the networks are inferred from cell-cycle regulated gene expression data.
Experiment 2
Discussion and conclusion
In this paper, we present a new method for adaptive modeling of biomolecular networks. The biomolecular network model of cell function comprises gene and protein expression, interaction, and regulation. The method iteratively mines and organizes quantitative and qualitative data to generate scalable hypothetical biomolecular network structures. The dynamics of these computational hypotheses are tested and refined through cycles of simulation of state-space model and laboratory experiments. We use state-space modeling methods previously developed for gene networks as a robust and general representation for heterogeneous quantitative, qualitative, and linguistic biomolecular data. While in the example here only microarray data are presented, the state-space framework of representing biomolecular expression states can be simply extended to protein and metabolite levels. This is a key point, because gene networks are an abstraction representing only one aspect of biomolecular networks, and they must be integrated with protein-protein interaction networks, and metabolite profiling to develop a comprehensive portrait of cellular function.
We present in this paper an efficient approach to growing a community from a given seed protein. It uses topological property of community structure of a network and takes advantage of local optimization in searching for the community comprising of the seed protein. Due to the complexity and modularity of biological networks, it is more desirable and computationally feasible to model and simulate a network of smaller size. Our approach builds a community of manageable size and scales well to large networks. Its usefulness is demonstrated by the experimental results that all the four communities identified reveal strong structural and functional relationships among member proteins. It provides a fast and accurate way to find a community comprising a protein or proteins with known functions or of interest. For those community members that are not known to be part of a protein complex or a functional category, their relationship to other community members may deserve further investigation which in turn may provide new insights.
Although we do not explicitly use our approach to the prediction of co-complexed proteins, the results of comparing with the PNR method developed by Asthana and colleagues [32] have shown that the communities identified by our approach do include the top ranked candidates of co-complexed proteins. Our approach does not consider the quality of data in our downloaded data set. By using the strong sense definition of community [33], we could to some degree reduce the noises. However, to improve the quality of an identified community, we have to take into account the quality of data and that is another part of our future work. One possible way is to use the probabilities assigned to individual protein pairs as used by Jasen et al [4], Radicchi et al [33], and Bader et al [5, 34].
In general, the state-space modeling method allows for inconsistencies and potentially noisy data to be identified and used to generate alternative computational hypotheses for biomolecular networks. The method is tractable and scalable because novel clustering methods are applied to adaptively extract biologically significant sub-networks for simulation and hypothesis testing. State space simulation of hypothetical biomolecular network models is compared with experimental data to select and refine plausible hypotheses. We combine the simulation result with the computationally derived meta-model to identify key genes whose perturbation would generate the data set that could most optimally differentiate between the alternative biomolecular network hypotheses. Consequently, by uniting the system identification and simulation components of the modeling procedure into an integrated method, we can develop a cyclical flow from modeling through experiments through updates to the global biological knowledge base. Such a flow is designed specifically to respond to the challenges of designing and interpreting high-throughput experiments, which can in the future evolve in concert with modeling and information management.
Compared to previous models such as Boolean network model [35, 23, 36] and difference/differential equation [17, 18], the proposed model (1) has the following characteristics. Firstly, gene expression profiles are the observation variables rather than the internal state variables. Secondly, from a biological angle, the model (1) can capture the fact that genes may be regulated by internal regulatory factor [37]. Thirdly, the model (1) takes the control portion of state-space model into consideration. However, the proposed approach does have some shortcomings, for example, the inherent linearity which can only capture the primary linear components of a biological system which may be nonlinear, and the ignorance to time delays in a biological system resulting, for example, from the time necessary for transcription, translation, and diffusion. These shortcomings will be address in the future work.
Methods
The data flow of our approach
Analyzing biomolecular network to identify sub-networks
The interpretation of large-scale protein network data depends on our ability to identify significant sub-structures (communities) in the data, a computationally intensive task. Many algorithms for detecting community structure in networks have been proposed. They can be roughly classified into two categories, divisive and agglomerative. The divisive approach takes the route of recursive removal of vertices (or edges) until the network is separated into its components or communities, whereas the agglomerative approach starts with isolated individual vertices and joins together small communities. One important algorithm is proposed by Girvan and Newman (the GN algorithm) [38]. The GN algorithm is based on the concept of betweenness, a quantitative measure of the number of shortest paths passing through a given vertex (or edge). The GN algorithm detects communities in a network by recursively removing these high betweenness vertices (or edges). It has produced good results and is well adopted by different authors in studying various networks [1]. However, it has a major disadvantage which is its computational cost. For sparse networks with n vertices, the GN algorithm is of O(n^{3}) time. Various alternative algorithms have been proposed [39–42], attempting to improve either the quality of the community structure or the computational efficiency. As discussed by Holme et al [43], edge-betweenness uses properties calculated from the whole graph, allowing information from non-local features to be used in the clustering. Edge-betweenness algorithm does not scale well to larger graphs, this method is currently most appropriate for studies focused on specific areas of the proteome.
The goal of our work here is to address a slightly different question about the community structure in a biomolecular network, i.e., what is the community to which a given protein (or proteins) belongs? We are motivated by two main factors. Firstly, due to the complexity and modularity of biological networks, it is more feasible computationally to study a community containing a small number of proteins of interest. Secondly, sometimes the whole community structure of the network may not be our primary concern. Rather, we may be more interested in finding the community which contains a protein (or proteins) of interest. Our aim is to discover relatively small sub-networks such that proteins inside the sub-network interact significantly and, meanwhile, they are not strongly influenced by proteins outside the sub-network. Sub-networks are constructed starting with a seed consisting of one or more proteins believed to be participated in a viable sub-network. Functionalities and regulatory relationships among seed proteins may be partially known or they may simply be of interest. Given the seed, we iteratively adjoin new proteins following an adapted definition of a community in a network. The sub-networks built from our models may provide valuable theoretical guidance to experiment.
The algorithm SNBuilder (sub-network builder)
We intuitively model the protein-protein interaction network as an undirected graph, where vertices represent proteins and edges represent interactions between pairs of proteins. An undirected graph, G = (V, E), is comprised of two sets, vertices V and edges E. An edge e is defined as a pair of vertices (u, v) denoting the direct connection between vertices u and v. The graphs we use in this paper are undirected, unweighted, and simple – meaning no self-loops or parallel edges.
For a subgraph G' ⊂ G and a vertex i belonging to G', we define the in-community degree for vertex i, ${k}_{i}^{in}$(G'), to be the number of edges connecting vertex i to other vertices belonging to G' and the out-community degree, ${k}_{i}^{out}$(G'), to be the number of edges connecting vertex i to other vertices that are in G but do not belong to G'.
In our algorithm, we adopt the quantitative definitions of community defined by Radicchi and colleagues [33], i.e. the subgraph G' is a community in a strong sense if ${k}_{i}^{in}$(G') > ${k}_{i}^{out}$(G') for each vertex i in G' and in a weak sense if the sum of all degrees within G' is greater than the sum of all degrees from G' to the rest of the graph.
Names of genes and proteins
1: | G(V, E) is the input graph with vertex set V and edge set E. |
---|---|
2: | s is the seed vertex; f is the affinity threshold; d is the distance threshold. |
3: | N ← {Adjacency list of s} ⋃ {s} |
4: | C ← FindCore(N) |
5: | C' ← ExpandCore(C, f, d) |
6: | return C' |
7: | FindCore(N) |
8: | for each v ∈ N |
9: | calculate ${k}_{v}^{in}$(N) |
10: | end for |
11: | K_{ min }← min {${k}_{v}^{in}$(N), v ∈ N} |
12: | K_{ max }← max {${k}_{v}^{in}$(N), v ∈ N} |
13: | if K_{ min }= K_{ max }or (${k}_{i}^{in}(N)={k}_{j}^{in}(N),\forall i,j\in N,i,j\ne s,i\ne j$) then return N |
14: | else return FindCore(N - {v}, ${k}_{v}^{in}$(N) = K_{ min }) |
15: | ExpandCore(C, f, d) |
16: | $D\to \underset{(v,w)\in E,v\in C,w\notin C}{\cup}\{v,w\}$ |
17: | C' ← C |
18: | for each t ∈ D, t ∉ C, and distance(t, s) <= d |
19: | calculate ${k}_{t}^{in}$(D) |
20: | calculate ${k}_{t}^{out}$(D) |
21: | if ${k}_{t}^{in}$ (D) > ${k}_{t}^{out}$ (D) or ${k}_{t}^{in}$ (D)/|D| > f then C' ← C' ∪ {t} |
22: | end for |
23: | if C' = C then return C |
24: | else return ExpandCore(C', f, d) |
The algorithm performs a breadth first expansion in the core expanding step. It first builds a candidate set containing the core and all vertices adjacent to each vertex in the core (line 16). A candidate vertex will then be added to the core if it meets one of the following conditions (line 21): 1) its in-community degree is greater than its out-community degree, i.e. the quantitative definition of community in a strong sense (${k}_{t}^{in}$ > ${k}_{t}^{out}$); 2) its affinity coefficient is greater than or equals to the affinity threshold f.
We define the affinity coefficient of a vertex to a network as the fraction of its in-community degree over the size of the network excluding the vertex itself (${k}_{t}^{in}$ (D)/(|D|-1)). We introduce the affinity coefficient and the affinity threshold f to provide a degree of relaxation when expanding the core, because it is too strict requiring every expanding vertex to be a strong sense community member. Even though a candidate vertex may not have an in-community degree larger than out-community degree, it may connect to all (or even most of) other members of the network, indicating a strong tie between the candidate vertex and the network. We use an affinity threshold f of 1 in our implementation, meaning that in order to be eligible to add to the core set, the candidate vertex has to connect to all other vertices in the core set. However, f may be relaxed to be less than 1 if necessary or so desired.
In addition, a distance parameter (d) is provided to restrict how far away a candidate vertex to the seed can be considered eligible for expansion. Quite often, a given seed may not always situate in the center of the resulting sub-network. The distance parameter serves as the shortest path threshold to ensure that all members of the obtained sub-network will be within specified proximity to the seed. A large enough value of d, such as one that is larger than the longest path from the seed to all other vertices in the network, will virtually lift this distance restriction.
The FindCore is a heuristic search for a maximum complete subgraph in the neighborhood N of seed s. Let K be the size of N, then the worst-case running time of FindCore is O(K^{2}). The ExpandCore part costs in worst-case approximately |V| + |E| + overhead. |V| accounts for the expanding of the core, at most all vertices in V, minus what are already in the core, would be included. |E| accounts for calculating the in- and out-degrees for the candidate vertices that are not in the core but in the neighborhood of the core. The overhead is caused by recalculating the in- and out-degrees of neighboring vertices every time the FindCore is recursively called. The number of these vertices is dependent on the size of the community we are building and the connectivity of the community to the rest of the network, but not the overall size of the network. For biological networks, the graphs we deal with are mostly sparse and small world, therefore, the running time of our algorithm is close to linear.
Evaluation of SNBuilder
Because there is no alternative approach to our method, we decide to compare the performance of our algorithm to the work on predicting protein complex membership by Asthana and colleagues [32]. Asthana and colleagues reported results of queries with four complexes using probabilistic network reliability (we will refer their work as PNR method in the following discussion). Four communities are identified by SNBuilder using one protein as seed from each of the query complexes used by the PNR method. (Because of the space limitation, we only discuss the comparison study of 1 out of the 4 complexes). The seed protein is selected randomly from the "core" protein set. The figures for visualizing the identified communities are created using Pajek [44]. The community figures are extracted from the network we build using the above mentioned data set with out-of-community connections omitted. The proteins in each community are annotated with a brief description obtained from the MIPS complex catalogue database. As a comparison, we use Complexpander, an implementation of the PNR method [32] to predict co-complex using the core protein set that contains the same seed protein used by SNBuilder. For all our queries when using Complexpander, we select the option to use the MIPS complex catalogue database. We record the ranking of the members in our identified communities that also appear in the co-complex candidate list predicted by Comlexpander.
The ARP2/ARP3 community
Protein | Alias | Description | Rank |
---|---|---|---|
YLR111w | YLR111w | hypothetical protein | |
YIL062c | ARC15*† | subunit of the Arp2/3 complex | 1 |
YLR370c | ARC18* | subunit of the Arp2/3 complex | 4 |
YKL013c | ARC19*† | subunit of the Arp2/3 complex | 3 |
YNR035c | ARC35* | subunit of the Arp2/3 complex | 5 |
YBR234c | ARC40*† | Arp2/3 protein complex subunit, 40 kilodalton | 6 |
YDL029w | ARP2*† | actin-like protein | 2 |
YJR065c | ARP3* | actin related protein | |
YJL095w | BCK1† | ser/thr protein kinase of the MEKK family | |
YPL084w | BRO1 | required for normal response to nutrient limitation | |
YBR023c | CHS3† | chitin synthase III | |
YNL298w | CLA4† | ser/thr protein kinase | |
YNL084c | END3† | required for endocytosis and cytoskeletal organization | |
YBR015c | MNN2 | type II membrane protein | |
YCR009c | RVS161† | protein involved in cell polarity development | |
YDR388w | RVS167† | reduced viability upon starvation protein | |
YFR040w | SAP155† | Sit4p-associated protein | |
YBL061c | SKT5† | protoplast regeneration and killer toxin resistance protein | |
YNL243w | SLA2† | cytoskeleton assembly control protein | |
YHR030c | SLT2† | ser/thr protein kinase of MAP kinase family |
State-pace based simulation of the biomolecular network
Mathematic model
The meaning of the variables follows as: in terms of linear system theory [45], equations (1) are called the state-space model of a dynamic system. The vector x(t) = [x_{1}(t) ⋯ x_{ n }(t)]^{ T }consists of the observation variables of the system and x_{ i }(t)(i = 1,⋯,n) represents the expression level of gene i at time point t, where n is the number of genes in the network. The vector z(t) = [z_{1}(t) ⋯ z_{ p }(t)]^{ T }consists of the internal state variables of the system and z_{ i }(t)(i = 1,⋯,p) represents the expression value of internal element (variable) i at time point t which directly regulates gene expression, where p is the number of the internal state variables. The vector u(t) = [u_{1}(t) ⋯ u_{ r }(t)]^{ T }represents the external input (control variable) of the internal state governing equation. The matrix A= [a_{ ij }]_{p×p}is the time translation matrix of the internal state variables or the state transition matrix. It provides key information on the influences of the internal variables on each other. The matrix B = [b_{ ik }]_{p×r}is the control matrix. The entries of the matrix reflect the strength of a control variable to an internal variable. The matrix C= [c_{ ik }]_{n×p}is the observation matrix which transfers the information from the internal state variables to the observation variables. The entries of the matrix encode information on the influences of the internal regulatory elements on the genes. Finally, the vectors n_{1}(t) and n_{2}(t) stand for system noise and observation noise. In model (1) the upper equation is called the internal state governing equation while the lower one is called the observation equation.
Parameter estimation
Let X be the gene expression data matrix with n rows and m columns, where n and m are the numbers of the genes and the measuring time points, respectively. The constructing of model (1) using microarray gene expression data X may be divided into three phases. Phase one identifies the internal state variables and their expression matrix, and estimates the elements of observation matrix C; Phase two defines the control internal variables; and Phase three estimates the elements of matrices A and B.
Internal variables and estimation of observation matrix
The internal states are latent variables in gene regulatory networks. They could be any unobserved molecules in cell which participate the process of gene regulation. From the biological viewpoint, it is reasonable to assume that the latent variables are some regulatory factors (protein) or missed genes. Many statistical methods [51] have been developed to find the expression of latent variables from the observation data. In this study, the maximum-likelihood algorithm for probabilistic principal component analysis PPCA [30] is employed to extract the internal variables from the observation data (time-course gene expression data). Using the PPCA model, it follows that
X= C·Z+ N (2)
when
C = U_{ p } (5)
where λ_{ j }(j = 1,⋯,p) are the first p largest eigenvalues of the sample variance matrix S, U_{ p }is a n×p matrix, each column of which is a corresponding eigenvector of S.
From Equation (4), the values of the maximum log-likelihood for the PPCA model increase with the increased numbers of internal state variables, p. The redundant internal state variables may result in a complicated model. Since the PPCA has a solid probabilistic foundation, some statistics-based criteria such as BIC and AIC can be used to determine the number of internal state variables [52, 53]. As the number of genes is small in a sub-network, in this paper, the AIC is adopted. For each model, the AIC is define
AIC(p) = 2·L_{ p }- 2·v_{ p } (6)
By this definition, the model with the largest AIC is chosen. After the transformation matrix C is determined, the expression profiles of internal variables accumulated in matrix Z can be calculated by formula
Z = C^{+}X (8)
Control variables, network structure, and control matrix
In state-space model (1), the control variables together with current internal states determine the next internal states. From the viewpoint of biology, the overall expression level of all genes in the network affects the internal (hide) variables [46, 47, 54]. In this study, we take u(t) = x(t) as the input of the internal state equation. Therefore, from the model (1), it follows that
x(t + 1) = CAz(t) + CBx(t) (9)
This equation quantitatively describes the regulatory relationships among genes through the matrix CB. On the other hand, using the algorithm SNBuilder, the subnetwork can be presented by a graph as shown in Figure 1. In this paper, the adjacent matrix of such a graph is called the structure matrix of the network as it qualitatively describes the regulatory relationships among genes, denoted by S. Therefore the structure of matrix CB should be the same as that of matrix S. That is, the (i, j)-th element of CB is nonzero (or zero) if the (i, j)-th element of S is nonzero (or zero).
It is nontrivial to find a control matrix B such that the structure of matrix CB is the same as that of matrix S. Considering that in reality the weak connections among genes may be ignored in the structure of the network, we reformulate the problem as follows: find a matrix B such the squared sum over the elements of CB corresponding to non-zeros in S is much larger than that over other elements.
Estimates of state transition matrix
With the calculated control matrix B and the profiles of internal variables Z, one can estimate the parameters of the state transition matrix in the internal state governing equation:
z(t + 1) = A·z(t) + Bu(t) + n_{1}(t)
where the time-variant vector v(t) has the same dimensions as the internal state vector z(t + 1) and is calculated by the following difference equation
v(t + 1) = A·v(t) + Bu(t) (11)
with the initial state value v(0) = z(t_{0}), and control values u(0),⋯,u(t).
For equally spaced measurements, the minimization of the cost function (10) can be solved by the least square method for the linear regression problem [55]. For unequally spaced measurements, the problem becomes nonlinear, and it is necessary to determine matrices A by using an optimization technique such as those in Chapter 10 of Press's text [56]. The matrix A contains p^{2} unknown elements while the matrix Z contains m·p known expression data points. If p <m, matrix A can be uniquely determined. Fortunately, using AIC the number of chosen internal variables p generally is less than the numbers of genes n and time points m. Therefore matrices A in model (1) could unambiguously be estimated from time-course gene expression data.
Model evaluation
In this study, the inferred gene regulatory networks will be evaluated in the following aspects: the prediction power, stability, robustness, periodicity, controllability, and observability.
The prediction power
where X(i,:) is the expression profile of gene i in the data matrix X·||X(i,:)|| is the Euclidean norm of the vector X(i,:). Intuitively, the smaller the prediction error, the stronger the prediction power is. The prediction error P_{ E }defined in (12) is invariant with respect to the scale of X. Therefore, it is more reasonable to evaluate the models using formulae (12) than the one defined by Wessels et al [57].
where cor(x,y) is the standard correlation of two vectors [51]. The large prediction correlation indicates the strong prediction power.
Stability
Due to the limited energy and storage within a cell, concentrations of gene expression products such as mRNA should remain bounded. All real-life gene networks are therefore stable. Consequently, the inferred gene network models should also be stable in order to be realistic. For our model, this is equivalent to the governing equation (9) being stable. It has been proven [45] that the equation (9) is stable if and only if all eigenvalues of the state transition matrix A lie inside the unit circle in the complex plane.
Periodicity
Certain biological processes are periodic. The cell-cycle and circadian clock, for example, repeat at well-defined and reliable time intervals. Studies have shown that gene regulatory networks associated with these periodic biological processes are themselves rhythmic [58–60]. Therefore, the inferred gene regulatory networks associated with these periodic biological processes should be periodic at its stable states. Accordingly, the periodicity of system (1) at its stable state is determined by its dominant eigenvalues of the state transition matrix A whose moduli are the largest.
Robustness
The robustness of a gene regulatory network is understood as its insensitivity to noise or disturbance. It is obvious that a real-life gene regulatory network has robustness [47, 58]. Therefore, the inferred gene regulatory network should be robust. The stability of a linear system implies robustness to a certain degree [46]. Note that the stability, robustness, and periodicity of system (1) are all related to the eigenvalues of the state transition matrix A.
Controllability
A dynamic control system is said to be controllable if any state could be transferred to any preset state by appropriate control actions [46]. For example, if a gene regulatory network is controllable, it can always be transferred to a normal state if the network malfunctions and deviates from the normal state. The inferred network should be controllable if a real-life gene regulatory network is. It has been proven that the linear system (1) is controllable [46] if and only if
rank([B, AB, ⋯, A^{ p }B]) ≥ p (14)
A control system is said to be directly controllable if rank(B) ≥ p. A directly controllable system can more easily transfer one state to the other one than a controllable system can. Because of the proposed modeling method in this study, all inferred gene regulatory networks are controllable.
Observability
A dynamic control system is said to be observable [46] if the internal state could be estimated by the observation data of the systems. For an observable gene regulatory network, one can always estimated its internal states from the observation data (i.e., time-course gene expression data) even though one can not directly "see" the behavior of the internal variables. Its inferred network should be observable if a real gene regulatory network is. Because of the proposed modeling method in this study, all inferred gene regulatory networks are observable.
Notes
Declarations
Acknowledgements
This research work of XH is supported in part from the NSF Career grant (NSF IIS 0448023). NSF CCF 0514679 and the research grant from PA Dept of Health. This research work of FXW is supported from the Natural Science and Engineering Research Council of Canada (NSERC).
Authors’ Affiliations
References
- Newman MEJ: The structure and function of complex networks. SIAM Review. 2003, 45: 167-256.View ArticleGoogle Scholar
- Hashimoto RF, Kim S, Shmulevich I, Zhang W, Bittner ML, Dougherty ER: Growing genetic regulatory networks from seed genes. Bioinformatics. 2004, 20: 1241-1247.View ArticlePubMedGoogle Scholar
- Flake GW, Lawrence SR, Giles CL, Coetzee FM: Self-organization and identification of web communities. IEEE Computer. 2002, 35: 66-71.View ArticleGoogle Scholar
- Jansen R, Lan N, Qian J, Gerstein M: Integration of genomic datasets to predict protein complexes in yeast. J Struct Functional Genomics. 2002, 2: 71-81.View ArticleGoogle Scholar
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2-PubMed CentralView ArticlePubMedGoogle Scholar
- Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, Li G, Chen R: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 2003, 31: 2443-2450.PubMed CentralView ArticlePubMedGoogle Scholar
- Hu X: Mining and analyzing scale-free protein-protein interaction network. International Journal of Bioinformatics Research and Application. 2005, 1: 81-101.View ArticleGoogle Scholar
- Hu X, Wu D: Data mining and predictive modeling of biomolecular network from biomedical literature databases. Accepted to be published in IEEE Transactions on Computational Biology and Bioinformatics. 2006Google Scholar
- Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature. 1999, 402: C47-C52.View ArticlePubMedGoogle Scholar
- Spirin V, Mirny L: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA. 2003, 100: 1128-1133.View ArticleGoogle Scholar
- Rives AW, Galitski T: Modular organization of cellular networks. Proc Natl Acad Sci USA. 2003, 100: 1128-1133.PubMed CentralView ArticlePubMedGoogle Scholar
- Rosales RA, Fill M, Escobar AL: Calcium regulation of single ryanodine receptor channel gating analyzed using HMM/MCMC statistical methods. J Gen Physiol. 2004, 121: 533-553.View ArticleGoogle Scholar
- Schliep A, Schonhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics. 2003, 19: i255-i263.View ArticlePubMedGoogle Scholar
- Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL: A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics. 2005, 21: 349-356.View ArticlePubMedGoogle Scholar
- Nachman I, Regev A, Friedman N: Inferring quantitative models of regulatory networks from expression data. Bioinformatics. 2004, 20: i248-i256.View ArticlePubMedGoogle Scholar
- Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL, Falciani F: Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics. 2004, 20: 1361-1372.View ArticlePubMedGoogle Scholar
- Chen T, He HL, Church GM: Modeling gene expression with differential equations. Pacific Symposium on Biocomputing. 1999, 4: 29-40.Google Scholar
- D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modeling of mRNA expression levels during CNS development and injury. Pacific Symposium of Biocomputing. 1999, 4: 41-52.Google Scholar
- Wu FX, Zhang WJ, Kusalik AJ: Modeling gene expression from microarray expression data with state-space equations. Pacific Symposium on Biocomputing. 2004, 9: 581-592.Google Scholar
- Laubenbacher R, Stigler B: A computational algebra approach to the reverse engineering of gene regulatory networks. J Theor Biol. 2004, 229: 523-537.View ArticlePubMedGoogle Scholar
- Perkins TJ, Hallett M, Glass L: Inferring models of gene expression dynamics. J Theor Biol. 2004, 230: 289-299.View ArticlePubMedGoogle Scholar
- Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002, 18: 261-274.View ArticlePubMedGoogle Scholar
- Liang S, Fuhrman S, Somogyi R: REVEAL, A general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing. 1998, 3: 18-29.Google Scholar
- Glass L, Kauffman SA: The logical analysis of continuous, nonlinear biochemical control networks. Journal of Theoretical Biology. 1973, 39: 103-129.View ArticlePubMedGoogle Scholar
- Tegner J, Yeung MK, Hasty J, Collins JJ: Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc Natl Acad Sci USA. 2003, 100: 5944-5949.PubMed CentralView ArticlePubMedGoogle Scholar
- Zadeh LA: Fuzzy set. Information and Control. 1965, 8: 338-353.View ArticleGoogle Scholar
- Hu X, Sokhansanj B, Wu D, Tang Y: A novel approach for mining and dynamic fuzzy simulation of biomolecular network. IEEE Transactions on Fuzzy Systems. 2007Google Scholar
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002, 13: 1977-2000.PubMed CentralView ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 14863-14868.Google Scholar
- Tipping ME, Bishop CM: Probabilistic principal component analysis, Series B. Journal of the Royal Statistical Society. 1999, 61: 611-622.View ArticleGoogle Scholar
- Durbin J, Koopman SJ: Time-series Analysis by State Space Model. 2001, New York, Oxford University PressGoogle Scholar
- Asthana S, King OD, Gibbons FD, Roth FP: Predicting protein complex membership using probabilistic network reliability. Genome Research. 2004, 14: 1170-1175.PubMed CentralView ArticlePubMedGoogle Scholar
- Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D: Defining and identifying communities in networks. Proc Natl Acad Sci USA. 2004, 101: 2658-2663.PubMed CentralView ArticlePubMedGoogle Scholar
- Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. NATURE BIOTECH. 2004, 22: 78-85.View ArticleGoogle Scholar
- Akutsu T, Miyano S, Kuhara S: Identification of gene networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing. 1999, 4: 17-28.Google Scholar
- Somogyi R, Sniegoski CA: Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation. Complexity. 1996, 1: 45-63.View ArticleGoogle Scholar
- Baldi P, Hatfield GW: DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. 2002, New York, Cambridge University PressView ArticleGoogle Scholar
- Girvan M, Newman MEJ: Community structure in social and biological networks. Proc Natl Acad Sci USA. 2002, 99: 7821-7826.PubMed CentralView ArticlePubMedGoogle Scholar
- Donetti L, Munoz MA: Detecting network communities: a new systematic and efficient algorithm. J Stat Mech. 2004, P10012-Google Scholar
- Newman MEJ: Fast algorithm for detecting community structure in networks. Phys Rev E. 2004, 69: 066133-View ArticleGoogle Scholar
- Newman MEJ, Girvan M: Finding and evaluating community structure in networks. Phys Rev E. 2004, 69: 026113-View ArticleGoogle Scholar
- White S, Smyth P: A spectral clustering approach to finding communities in graphs. SIAM International Conference on Data Mining. 2005, Newport Beach, CA, USAGoogle Scholar
- Holme P, Huss M, Jeong H: Sub-network hierarchies of biochemical pathways. Bioinformatics. 2003, 19: 532-538.View ArticlePubMedGoogle Scholar
- Batagelj V, Mrvar A: Pajek – Program for large network analysis. Connections. 1998, 21: 47-57.Google Scholar
- Machesky LM, Gould KL: The Arp2/3 complex: a multifunctional actin organizer. Curr Opin Cell Biol. 1999, 11: 117-121.View ArticlePubMedGoogle Scholar
- Chen CT: Linear System Theory and Design. 1999, New York, Oxford University Press, 3Google Scholar
- Alberts B, Johnson A, Lewis J, Raff M, Bray D, Hopkin K, Roberts K, Walter P: Essential Cell Biology. 1998, New York, Garland ScienceGoogle Scholar
- Liebler DC: Introduction to Proteomics. 2002, Totowa NJ, Humana PressGoogle Scholar
- Wu FX, Poirier GG, Zhang WJ: Inferring gene regulatory networks with time delays using a genetic algorithm. IEE Proc Systems Biology. 2005, 152: 67-74.View ArticleGoogle Scholar
- Wu FX, Zhang WJ, Kusalik AJ: State-space model with time delays for gene regulatory networks. Journal of Biological Systems. 2004, 12: 483-499.View ArticleGoogle Scholar
- Everitt BS, Dunn G: Applied Multivariate Data Analysis. 1992, New York, Oxford University PressGoogle Scholar
- Burnham KP, Anderson DR: Model selection and inference: a practical information-theoretic approach. 1998, New York: SpringerView ArticleGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Annals of Statistics. 1978, 6: 461-464.View ArticleGoogle Scholar
- Tozeren A, Byers SW: New Biology for Engineers and Computer Scientists. 2004, New Jersey, Pearson Education IncGoogle Scholar
- Harvey AC: Time Series Models. 1993, Cambridge MA, The MIT Press, 2Google Scholar
- Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical Recipes in C: The Art of Scientific Computing. 1992, Cambridge UK, Cambridge University Press, 2Google Scholar
- Wessels LFA, Van Someren EP, Reinders MJT: A comparison of genetic network models. Pacific Symposium on Biocomputing. 2001, 6: 508-519.Google Scholar
- Kauffman SA: The Origins of Order: Self-Organization and Selection in Evolution. 1993, Oxford, Oxford University PressGoogle Scholar
- Langmead CJ, Yan AK, McClung CR, Donald BR: Phase-independent rhythmic analysis of genome-wide expression patterns. Proceedings of the Sixth Annual International Conference on Research in Computational Molecular Biology. 2002, 205-215.Google Scholar
- Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics. 2004, Washington DC, USA, 20: 5-20.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.