Datasets
The DIP (Database of Interacting Proteins) database lists protein pairs that are known to interact with each other. The interaction indicates that two amino acid chains are experimentally identified to bind to each other. The database lists such pairs to aid those studying a particular PPI, but it also aids those investigating entire regulatory and signaling pathways, as well as those studying the organization and complexity of the PIN at the cellular level. The PPI data of S. cerevisiae used in this work is from DIP (http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM = 7/), updated on Oct. 10, 2010. The static yeast PPI network includes 4,950 distinct proteins and 21,788 interactions totally. As is customary, self interactions representing autoregulation or protein homodimerization are not included in the analysis. Furthermore, duplicated interactions are ignored.
Time course gene expression data and periodic transcripts data of S. cerevisiae are from [3], updated on Apr 14, 2011. Raw microarray data are also available from Gene Expression Omnibus, accession number GSE3431. The dataset, in the form of a 9,335 × 36 matrix, includes expression profiles of 9,335 probes under 36 different time points. We map probe sets to gene symbols according to the annotation file provided by Affymetrix and thus obtain 6,777 budding yeast S. cerevisiae gene products. The periodic transcripts file contains data for 3552 unique expressed genes that are periodic with at least 95% confidence, which corresponds to 3656 probes [3].
Gene ontologies and annotations used in GO enrichment analysis are downloaded from http://geneontology.org (http://www.geneontology.org/gene-associations/submission/), updated on July 24, 2010.
Reconstruction of the TC-PINs
Before that, the first issue, perhaps, is to determine the consistency of both datasets selected. Upon comparing the 4,950 proteins extracted from a static PPI network (http://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=7) with 6,777 gene products from gene-expressing profiles [3], we find that they share 4,858 proteins. Thus, the gene-expressing profiles can cover more than 98% of the proteins in the static PPI network. In other words, the result shows that it is reasonable to combine the two datasets.
Then, a bigger challenge is how to choose an appropriate cutoff threshold in order to filter the gene expression profiles and retain merely the most biologically significant gene products. This threshold application step is a major juncture in which errors can be introduced in the form of both false negatives and false positives. By setting this threshold too high, important gene products can be lost. Similarly, we must be sure to remove gene products that have no apparent biological significance. Some of the methods that have been applied to the threshold selection problem in various types of networks include using an arbitrary threshold [13], retaining only the top x percent of the strongest relationships [14], permutation testing [15] and filtering based upon control spot correlations [16] or the statistical significance of the relationships [17, 18].
Tu at al. [3] used a continuous culture system to reveal a robust, metabolic cycle in budding yeast. Each cycle was characterized by a reductive, nonrespiratory phase followed by an oxidative, respiratory phase wherein the synchronized culture rapidly consumed molecular oxygen. After performing microarray analysis of gene expression, they found that over half of yeast genes (~ 3552) exhibited periodic expression patterns at a confidence level of 95% over three consecutive yeast metabolic cycle (12 time intervals per cycle). 1023 periodic genes encoding ribosomal proteins, translation initiation factors, amino acid biosynthetic enzymes, small nuclear RNAs, RNA processing enzymes and proteins required for the uptake and metabolism of sulfur exhibit a similar expression pattern of peaking in the Ox(oxidative) metabolism. 977 periodic genes during the R/B (reductive/building) metabolism peak when cells begin to cease oxygen consumption. This set consists primarily of nuclear encoded mitochondrial genes as well as genes encoding histones, spindle pole components and proteins required for DNA replication and cell division. 1510 genes expressed maximally in the R/C (reductive/charging) metabolism encode proteins involved in nonrespiratory modes of metabolism and protein degradation. These periodic genes play an important role in yeast metabolic cycle, so they have biological significance. Moreover, Tu at al[3] also indicate that the average expression levels of periodic transcripts is 1.7-fold higher than that of non-periodic transcripts. After looking into the expressing peak value of every periodic gene during one cycle (12 time intervals), we discover that about 82% periodic genes have expression peak value more than 1.6.
Therefore, to select a large number of periodic genes, we take a similar tactic as used by Ala et al. [14] to determine a potential threshold value. That is, for every time point, we set a fix threshold to filter the transcripts. Only the transcripts whose expression levels are higher than the threshold value will be remained.
Our algorithm for reconstructing the TC-PINs is structured as sequence of following steps:
-
(i)
Filtering gene expressing profiles
The approach we use to filter the raw gene expressing data is by comparing the expression levels of genes at every time points to a fixed threshold, for example 0.7. How the cut-off is chosen is discussed in Effect of the threshold selection. After filtering the gene expression profiles, about 56.78% raw transcripts remain. 36 different gene product sets are obtained at 36 time points. Tu et al. [3] classified raw probes that had at least 3 (on average one per cycle) present (P) calls (as generated by Affymetrix GeneChip software) as expressed. According to this criterion, out of 9,335 probes queried by the YG_S98 array, 7,985 (about 86%) are expressed. Out of 6,555 probes querying unique, annotated open reading frames (ORFs), 6,209 are expressed. By filtering the gene expressing profiles, around 43% raw transcripts with low expressing levels are removed and the gene products that have no apparent biological significance are basically discarded.
-
(ii)
Reconstruction of the TC-PINs
If two interacting proteins in the static PPI network also present in the gene product set at a certain time point, the two proteins and their interaction form a part of a TC-PIN at the time point. The process is repeated until the TC-PIN is created. Similarly, 36 TC-PINs can be reconstructed.
Figure 1 shows how to reconstruct the TC-PINs. Our method mainly consists of two stages, screening gene expressing profiles and reconstructing the TC-PINs.
Identifying functional modules from the TC-PINs
The next urgent task is to identify meaningfully functional modules from the TC-PINs. So far, the Markov Cluster (MCL) [11] algorithm seems to be one of the most successful clustering procedures used in partitioning a PPI network into densely connected modules. In 2001, Enright et al. [11] used MCL to assign proteins into families based on precomputed sequence similarity information. Their results show that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. Brohee and Helden [19] applied four algorithms, Molecular Complex Detection (MCODE) [12], Super Paramagnetic Clustering (SPC) [20], Restricted Neighborhood Search Clustering (RNSC) [21] and Markov Clustering (MCL) to six protein interaction networks obtained from high-throughput experiments and compared the resulting clusters with the annotated complexes. They found that the analysis of high-throughput data supported the superiority of MCL for the extraction of complexes from interaction networks. Vlasblom and Wodak [22] found that the advantage of MCL over a number of other procedures which were specifically designed for partitioning protein interactions graphs was dramatic for unweighted protein interaction graphs. Their experimental results show that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the other algorithms. For MCL algorithm, the inflation parameter can be set as different values. Wu et al. [23] concluded that 1.9 was the best inflation parameter for the DIP data. Our experimental results show that the optimal inflation parameter is 2.0 when the MCL algorithm is applied to the yeast PPI network. MCL thus remains the method of choice for identifying protein functional modules from the TC-PINs. The following paragraph will briefly outline the principles of MCL. The MCL process consists of two operators called expansion and inflation. It involves changing the values of a transition matrix toward either 0 or 1 at each step in a random walk, until the stochastic condition is satisfied. The algorithm first adds self-loops to the input graph - by default, the loop weight for each node is assigned as the maximum weight of all edges connected to the node - and then translates this graph into a stochastic 'Markov' matrix. This matrix represents the transition probabilities between all pairs of nodes and the probability of a random walk of length n between any two nodes can be calculated by raising this matrix to the exponent n- a process referred to as expansion. The inflation step introduces a non-linearity into the process, in order to strengthen intra-cluster flow and weaken inter-cluster flow. Since greater path lengths are more common within clusters than between different clusters, the probabilities between nodes in the same module will typically be higher in expanded matrices. MCL further exaggerates this effect by taking entry wise exponents of the expanded matrix and then rescaling each column so that it remains stochastic. Iterating expansion and inflation will subdivide the PPI network into many segments as protein functional modules or complexes.
The MCL procedure is applied to create candidate functional modules from each TC-PIN. Then, a script program implemented using the Perl language is used to remove modules that include only one gene product or belong to another one. Redundant modules are also removed.
Evaluation metrics
Pseudorandom network
A problem of evaluation is that a certain proportion of interacting proteins can be assigned to the same modules by chance. In order to estimate the random expectation of correct grouping, NetworkAnalyzer (http://www.mpi-inf.mpg.de/) is used to preserve the connectivity of each node, while edges are reallocated at random to build a pseudorandom network of the same size (consisting of the same number of nodes and edges) as static yeast PPI networks.
Matching evaluation
Comparing predicted modules with known ones is a common method of evaluation. For many years, the yeast protein modules catalogued by the Munich Information Center of Protein Sequences (MIPS) database have been widely used to generate protein-protein interaction reference sets. Although this catalogue has served the community very well, it no longer reflects the current state of knowledge in the field. We derive 408 typical modules including two or more proteins each from the CYC2008 [24] as the benchmark module set and use the same scoring scheme used by [12] to determine how effectively a predicted module matches a known module. If two complexes overlap each other, they must share one or more proteins. The Overlap Score (OS ) of a predicted module vs. a benchmark module is then a measure of biological significance of the prediction, assuming that the benchmark set of modules is biologically relevant. The overlap score between a predicted and a known module is calculated by using
where, i refers to the number of proteins shared by a predicted module and a benchmark module, g is the number of proteins in the predicted module and h is the number of proteins in the benchmark module. If OS (Overlap Score) is 1, it means that a predicted module has the same proteins as a benchmark module. On the contrary, when OS equals to 0, there is not a shared protein between the predicted module and the benchmark module [12].
Defining the number of true positives (TP) as the number of MCL predicted modules with OS over a threshold value and the number of false positives (FP) as the total number of predicted MCL modules minus TP. The number of false negatives (FN) equals the number of known benchmark modules not matched by predicted modules. Sensitivity is defined as [TP/(TP+FN)] and specificity is defined as [TP/(TP+FP)] [12]. f-measure, or the harmonic mean of sensitivity and specificity, can then be used to evaluate the overall performance [8]:
Statistical evaluation
The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains: Cellular Component (CC), the parts of a cell or its extracellular environment; Molecular Function (MF ), the elemental activities of a gene product at the molecular level, such as binding or catalysis; and Biological Process (BP), operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs and organisms. The GO ontology is structured as a directed acyclic graph and each term has defined relationships to one or more other terms in the same domain and sometimes to other domains. The GO vocabulary is designed to be species-neutral and includes terms applicable to prokaryotes and eukaryotes, single and multicellular organisms.
A module is associated with a known function by determining whether the number of proteins known to be annotated with the function is enriched, as judged by the hypergeometric distribution. The P-value can be used to determine the probability that a given set of proteins is enriched by a given functional group by random chance. In [25], it is used as a criterion to assign each cluster to a known function. The smaller the P-value, the more evidence the clustering is not random. In terms of GO annotations, a group of genes with a smaller P-value is more significant than the one with a higher P-value.
Consider a cluster of size c, with m proteins sharing a particular annotation A. Also assume that there are N proteins in the PPI database, and M of them are known to have annotation A. Given that, the probability of observing m or more proteins that are annotated with A out of N proteins is:
Based on above formulation, a P-value is calculated for each of three ontologies. In the case of multiple annotations from the same ontology, the one with the smaller P-value is assigned to the cluster as functional annotation. That being said, the P-value without any restriction is not enough to label clusters as significant. Hence we use the recommended cutoff value of 0.01 [26] in order to select significant modules within each ontology.
A popular software package for evaluating the statistical significance of GO terms represented in a set of genes extracted from a population is GO::TermFinder, which calculates P-values using formula (3) [27]. GO::TermFinder accepts a list of genes of interest and returns a list of GO terms with which the genes are associated, with corresponding P-values and FDR values (if desired) associated with the enrichment of these terms in the gene list. In this research, the direct use of GO TermFinder is not convenient for analyzing GO enrichment of more than 2000 modules uncovered from the TC-PINs, because this software package can only handle one module at a time. Therefore, combined with the latest version of this toolkit [28], we have used the Perl language to develop a procedure that can automatically process a large number of functional modules in turn.