Identification of highly synchronized subnetworks from gene expression data
© Gao and Wang; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
Skip to main content
© Gao and Wang; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
There has been a growing interest in identifying context-specific active protein-protein interaction (PPI) subnetworks through integration of PPI and time course gene expression data. However the interaction dynamics during the biological process under study has not been sufficiently considered previously.
Here we propose a topology-phase locking (TopoPL) based scoring metric for identifying active PPI subnetworks from time series expression data. First the temporal coordination in gene expression changes is evaluated through phase locking analysis; The results are subsequently integrated with PPI to define an activity score for each PPI subnetwork, based on individual member expression, as well topological characteristics of the PPI network and of the expression temporal coordination network; Lastly, the subnetworks with the top scores in the whole PPI network are identified through simulated annealing search.
Application of TopoPL to simulated data and to the yeast cell cycle data showed that it can more sensitively identify biologically meaningful subnetworks than the method that only utilizes the static PPI topology, or the additive scoring method. Using TopoPL we identified a core subnetwork with 49 genes important to yeast cell cycle. Interestingly, this core contains a protein complex known to be related to arrangement of ribosome subunits that exhibit extremely high gene expression synchronization.
Inclusion of interaction dynamics is important to the identification of relevant gene networks.
Life is a transient dynamic phenomenon. Biological functions and phenotypic traits, including disease traits, stem from the interactions across multiple scales in the living system. Therefore characterizing the condition-dependent interactions and emergent dynamics are important in the identification of relevant elements to a given biological process.
Recently, a number of computational methods have been developed to identify the condition specific protein-protein interaction (PPI) subnetworks, through integration of generic PPI data (typically obtained from an interactome database) and condition-specific gene expression data . For instance, by integrating yeast PPI networks with gene expression data, Han et al. showed that some modules are active only at specific times and locations . Qi et al. suggested that such approach enables the identification of subnetworks that are active under certain conditions . In a cell cycle study by de Lichtenberg et al, it was found that the cell cycle-regulated and constitutively expressed proteins form protein complexes at particular time points during the cell cycle . In these studies correlation in expression or similar measures are usually used to capture the condition specific gene interaction [3–9]. More recently, a number of studies focused on integration of PPI networks with time course expression data to identify subnetworks that exhibit meaningful dynamic changes in transcription. In a study of yeast metabolic oscillation by Tang et al , the active PPI network is first constructed for each time point (out of a total of 36 time points) through identification of interacting protein pairs whose corresponding genes exhibit a certain significant pattern in expression at that time point. Then Markov clustering algorithm is applied to create candidate functional module of each network. These modules were found to have much more significant biological meaning than those derived using static PPI networks only . In another study, Jin et al  defined a dynamic network module to be a set of proteins satisfying two conditions: (1) they form a connected component in the PPI network; and (2) their expression profiles exhibited time-shifted and local similarity patterns as evaluated using an time-warping dynamic programming algorithm. Using yeast as a model system and time course expression data from multiple experiments, they then showed that the majority of the identified dynamic modules are functionally homogeneous, and many of them shed light on the sequential ordering of the molecular events in the cellular system of yeast .
Understanding cellular physiology from a dynamic and systems perspective is obviously very important and valuable as demonstrated by these studies and many others . Incorporating time course data is a necessity along this direction. They not only capture how a whole system evolves over time, but also contain rich information regarding the coordination, namely, interaction, of the different elements in the system. The measurements from different time points are not independent of each other; this is in contrast to static measurements of different samples, or of the same sample under different conditions. However, most of the existing studies either construct active networks independently at each time point , or rely on pattern similarity measures to infer interaction which ignores the inter-time point dependence . Overlooking the interdependence among the time points not only loses sensitivity toward detecting relevant interactions but could also lead to erroneous predictions [11, 12].
In this study we investigate the application of an idea rooted in statistical physics and non-linear dynamics to characterize the state of gene interaction networks and use it to identify relevant subnetworks. We regard active subnetworks to be those showing high degree of differential expression, and high synchrony in expression changes (i.e., coordination in the timing of expression changes) among the members. The phase locking analysis will be utilized to evaluate expression synchrony, and to capture the dynamic interaction structure. Recently we found that the phase locking metric can identify interacting gene pairs more efficiently than correlation .
Previously, we proposed a Pathway Connectivity Index (PCI) to represent the activity of pre-defined pathways, such as those defined in KEGG and Biocarta. PCI utilizes expression information of all genes in a pathway, as well as the topological properties of its interaction networks. Its advantages have been demonstrated . This metric was later implemented in a software tool entitled Topological Analysis of Pathway-Phenotype Association (TAPPA). Here to capture contributions from topological characteristics of the dynamic interaction network, we integrate the phase locking analysis into PCI to define a novel metric: the Topology-Phase Locking (TopoPL) analysis . With both simulated and real yeast expression data during cell cycle, we will demonstrate the merits of TopoPL.
Simulation utilized the sample expression data gal80R given in Cytoscape (http://cytoscape.org/). There are 331 genes and 361 interactions in this network. Within it, we randomly selected subnetworks at three different sizes n (n = 40, 60, 80), as condition-responsive. In each responsive subnetwork m% (80%, 90%, 100%) of genes are defined to be active. The significance values of active genes were assigned randomly with top significance values in gal80R, and that of the other genes were randomly sampled from the rest of the significance values. The phase locking index λ (see 2.3) of the interactions in the predefined responsive subnetwork were sampled from , i.e. a normal distribution with μ = 0.8, σ = 0.5; while λ for the remaining edges were sampled from . The choice of these values was based on the distribution of the λ values of gene pairs in protein complexes and of randomly selected gene pairs. For protein complexes we used the MIPS annotation (http://mips.helmholtz-muenchen.de/genre/proj/yeast) edited by Gerstein Lab (http://www.gersteinlab.org/proj/bottleneck/mips.txt).
We used the average sensitivity, specificity and F score to measure the performance of TopoPL. The performance is also evaluated with Receiver Operating Characteristic (ROC) curve, a plot of the true positive rate against the false positive rate .
Gene expression data was downloaded from EMBL's Huber group (http://www.ebi.ac.uk/huber-srv/scercycle/). It is a time course study of yeast cell cycle, where cells were arrested using alpha factor or cdc28. The alpha factor dataset contains 41 time points and the cdc28 dataset contains 44 time points, both at 5-minute resolution. These datasets provide strand-specific profiles of temporal expression during the mitotic cell cycle of S. cerevisiae, monitored for more than three complete cell divisions . Yeast PPI data were downloaded from BioGRID (thebiogrid.org, version 3.1.69).
In a perfect locking , and when is randomly distributed. λ offers a new measure to infer potential interaction between gene pairs .
captures the dynamic topological property of the subnetwork, and hub genes (genes with high network degree) contribute more to this metric. can be regarded as the "activity measurement" of the interaction. Gene pairs with significant and synchronized expression changes, and whose gene products interact, contribute more to the activity of the subnetwork.
This metric is an improved version over the PCI that we previously proposed to identify active pathways from gene expression data : , where is normalized log expression measurement of gene i in sample s, and is the adjacency matrix of the PPI network of genes in the pathway. The merit of PCI has been demonstrated in previous works . To reduce the potential impact on the network measure from residual inter-sample and inter-array biases after normalization, here we adopted the non-parametric measure in place of . A similar metric to Eq. (5) was developed recently by us to predict candidate disease genes for type 1 diabetes, where is the z-score of disease relevance of gene i. There again we demonstrated the advantage of incorporating network structural information .
We implemented the searching procedure based on simulated annealing. The pseudocode of the algorithm is described below:
Input: the entire network ; a set of parameters for running simulated annealing: start temperature (= 1 in this study), end temperature (= 1e-8 in this study), number of iterations .
Output: the subnetwork with the highest score.
Steps: initialize each node with its expression significance score and each edge with its phase locking index; select the largest connected component (subnetwork) from top 10% significant nodes of ; calculate score of and obtain its score ; then run the following:
For i = 1 to N, Do
Calculate the current temperature ;
Exit loop if
Randomly pick a node
IF ( ), remove n from ;
ELSE add n to ;
Calculate score for the largest connected component of ;
IF Δ> 0, then ;
ELSE, accept with the probability ;
These steps can be iterated to identify subnetworks with the next highest scores and so on.
Top 10 GO Biological Processes terms significantly enriched in the subnetwork identified during yeast cell cycle.
ribonucleoprotein complex biogenesis
mitotic cell cycle
cell cycle process
cellular component biogenesis
Presently, there is no "gold standard" to evaluate the biological relevance of network modeling algorithms. Here we investigated the functional enrichment of the proteins in the identified subnetworks , and compared to that obtained using Additive and TAPPA. The p values (Bonferroni corrected) of the top 2 terms are 3.33E-13 and 6.5E-12 with TAPPA, and 3.05E-8 and 3.13E-8, with Additive, respectively. TAPPA's are slightly larger than TopoPL, but Additive gave much larger p values. This indicates that including interaction structure, especially its dynamics, improves the sensitivity at identifying biologically relevant gene subnetworks.
Top 30 genes with highest degrees or betweenness in the identified subnetwork.
Interestingly all six genes are annotated with GO:0042254 (ribosomal chaperone activity), it is defined as "A cellular process that results in the biosynthesis of constituent macromolecules, assembly, and arrangement of constituent parts of ribosome subunits; includes transport to the sites of protein synthesis".
Transcription factor binding sites overrepresented in genes of the identified subnetwork and of its core.
The identified subnetwork
Core of the identified subnetwork
All gene hits
All gene hits
FKH1 and MCM1 are well studied cell cycle related transcriptional factors . TOD6 (Pbf1) and DOT6 (Pbf2) as PAC-binding factors, important in the regulation of ribosome biogenesis. Existing ChIP-chip studies suggest that genes have the highest occupancy by TOD6 and DOT6 are highly enriched for the GO Biological Process ''ribosome biogenesis" .
A good algorithm should be efficient at uncovering the true biology underlying different datasets, which should be consistent. In this study, we identified 484 genes with the cdc28 dataset, and 524 genes with the alpha factor dataset. There are 156 (~31%) overlapping genes in them (p < 0.00001, Fisher Test). In contrast, there are only 87 (~17%) overlapping genes with the Additive method (alpha: 501 genes; cdc28: 509 genes), and 145 (~29%) with TAPPA (alpha: 499 genes; cdc28: 503 genes). This indicates that incorporating network structural and dynamic information can generate robust results.
TopoPL scoring method with a simulated annealing search was proposed in this study to identify active subnetworks during a biological process by integrating PPI with dynamic expression data. It incorporates both structural and dynamics information of gene interactions. When applied to the simulated data and the yeast cell cycle data, it yielded more consistent results from different experiments, and predicted more meaningful active network modules, than two alternative scoring methods that either ignores information of the network dynamics, or that of both the dynamics and structure.
This work was supported in part by National Institute of Diabetes and Digestive and Kidney Diseases Grant R01DK080100 (XW).
Publication of this article is supported by NIDDK, NIH (Grant No. R01DK080100)
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 9, 2013: Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.