- Research article
- Open Access
Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP
© Hawkins et al; licensee BioMed Central Ltd. 2010
- Received: 15 October 2009
- Accepted: 19 May 2010
- Published: 19 May 2010
A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance.
Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences, Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (malaria). The number of annotated genes is increased by PFP to over 90% for all of the genomes. Using the large coverage of the function annotation, we introduced the functional similarity networks which represent the functional space of the proteomes. Four different functional similarity networks are constructed for each proteome: one for each single Gene Ontology (GO) category, i.e. Biological Process, Cellular Component, and Molecular Function, and one that considers overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks by showing a higher clustering degree exponent value and thus has a higher tendency to be hierarchical. In addition, examining function assignments to the protein-protein interaction network and local regions of genomes has identified numerous cases where subnetworks or local regions have functionally coherent proteins. These results will help interpret interactions of proteins and gene orders in a genome. Several examples of both analyses are highlighted.
The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on a researcher's ability to interpret the immense biological data that are being generated today. The newly introduced functional similarity networks of the three organisms show different network properties as compared with the protein-protein interaction networks.
- Gene Ontology
- Cluster Coefficient
- Similarity Threshold
- Function Assignment
The recent paradigm shift in molecular and systems biology to characterization of large sets of genes and proteins has been enabled by continual technological innovations, including fast sequencing technologies [1–3], arrays for measuring gene expression patterns, and high throughput screens that identify various types of molecular interactions [5–7]. Data sets produced by these new technologies have also spurred development of computational tools to assist in their analysis [8–10]. Of particular importance is function assignment to genes in a genome or any system of interest, as functional information is indispensable for both biological interpretation of the behavior of the system and generation of hypotheses for designing subsequent experiments. To this end, many function prediction methods have been developed recently to meet the urgent needs. They include those which employ information from sequence database search [13–17] more thoroughly than conventional homology searches [18, 19], those which use protein tertiary structure information [20–23], methods that consider conservation of gene locations in genome sequences [24, 25], and methods which utilize protein-protein interaction (PPI) data [26–28]. Please refer to recent reviews for thorough discussion of recent function prediction methods [9, 29].
We previously introduced PFP as a method for predicting Gene Ontology (GO) functional terms for individual protein sequences with empirically derived confidence scores [14, 31]. PFP has been shown to outperform other sequence-based methods [32–34] and has been enormously successful in international assessments of methods for function prediction (AFP-SIG '05 and Critical Assessment of Techniques for Protein Structure Prediction CASP7, the function prediction [FN] category). In the previous studies, we have demonstrated that PFP is superior to the other methods not only in terms of the accuracy of function assignment but also in its larger coverage for genome-scale annotation.
Enrichment of function annotation by PFP
Table 1. Number of protein genes with annotated/predicted function. Fields: predicted with high confidence (≥ 0.8); predicted with medium confidence (≥ 0.6); predicted with low confidence (≥ 0.4); previously annotated and predicted with high confidence (≥ 0.8). First organism listed: E. coli K-12.
Functional similarity network by PFP
Table 2. Size of the functional similarity networks. Fields: number of nodes a); functional similarity category; number of nodes with 2+ edges b) / edges. First organism listed: E. coli K-12.
Table 3. Network parameters of the functional similarity networks. Rows: degree exponent (γ); cluster coefficient <C(k)>; clustering degree exponent (β) c).
The functional similarity network using the 0.95 for the similarity threshold value
The middle rows in Table 3 show the clustering coefficients for the networks, computed as described in the Methods section. We found that the PPI and the functional similarity networks are clearly distinguished by the clustering coefficient, with the latter having larger modularity (i.e. larger clustering coefficient values). Single GO-score networks have larger modularity than the funSim networks. The malaria CC-score network has the largest clustering coefficient value (0.86), which is also evident from its appearance (Fig. 3C, the second network from the left). The funSim networks have slightly lower modularity than the single GO-score networks for the same reason that they have fewer hub proteins, i.e. their edges need to satisfy a stricter condition of functional similarity.
Effect of changing the similarity threshold value for connecting edges
In addition to the networks with the similarity threshold value of 0.95 discussed above, we examined networks built with a more permissive threshold value, 0.80, and a stricter threshold value, 0.99, to understand how the network structure changes. The total number of edges increases significantly with the more permissive threshold value (0.80) and decreases with the stricter threshold value (0.99) (Table 2). As the networks become denser with more edges (threshold value of 0.80), the number of highly connected nodes increases, which is reflected in a decrease in the degree exponent (γ). This trend is especially evident for the CC-score networks, for which the degree exponent values are too small for them to be power-law networks. The funSim networks of the three organisms consistently have high degree exponent values. The average clustering coefficient values (middle rows in Table 3) are less affected by the change of the similarity threshold value for drawing edges. Thus the networks are modular at all three similarity threshold values examined.
To summarize, both PPI and most of the newly described functional similarity networks are scale-free. The PPI network and functional similarity networks (namely, funSim, BP-score, CC-score, and MF-score networks) are distinguished by their modularity, with the latter networks showing significant modularity with high clustering coefficient values while the PPI network does not. Lastly, the funSim network is different from the single GO-score networks by exhibiting a higher tendency to be hierarchical (i.e. showing a higher β value). However, note that the β value of the funSim networks seems to be sensitive to the similarity threshold value: the β values of the E. coli and yeast funSim networks drop below 1.0 when the similarity threshold value is changed to 0.80 and 0.99.
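The two exponents discussed above can be illustrated with a small fitting routine. The sketch below assumes the usual definitions, P(k) ~ k^(-γ) for the degree distribution and C(k) ~ k^(-β) for the degree-dependent clustering coefficient, and estimates each exponent by ordinary least squares in log-log space. The fitting procedure, the function name `power_law_exponent`, and the synthetic data are illustrative assumptions, not the procedure used to produce Table 3.

```python
import math

def power_law_exponent(ks, values):
    """Fit values ~ k**(-a) by least squares on log-log axes and return a."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # the exponent is the negated log-log slope

# Synthetic data (hypothetical): exact power laws with known exponents.
degrees = [1, 2, 4, 8, 16]
degree_distribution = [k ** -2.5 for k in degrees]         # stands in for P(k)
clustering_by_degree = [0.9 * k ** -1.0 for k in degrees]  # stands in for C(k)

gamma = power_law_exponent(degrees, degree_distribution)
beta = power_law_exponent(degrees, clustering_by_degree)
```

On real networks the degree distribution is noisy at high k, so binning or a maximum-likelihood fit may be preferable to a plain regression.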
Annotating PPI subnetworks
Table 4. Enrichment of function annotation to subnetworks. Fields: # of subnetworks; # of edges; # of edges with common annotation a); # of edges with common annotation or prediction b); # of edges with common predicted annotation c); # of subnetworks with functionally enriched edges d).
Since malaria has the largest annotation enrichment among the three organisms (Fig. 2, right panel and Table 4), below we focus on annotations given to the malaria PPI network. Following previous work, we examine annotation given to subnetworks of the PPI network. A subnetwork is defined as all proteins connected to a common centroid protein together with the edges among them. The statistical significance of the number of edges in a subnetwork is tested by computing the connectivity coefficient (Eqn. 5) and comparing it with 100 randomized networks. Subnetworks with a p-value below 0.05 by the t-test (Eqn. 6) are identified as targets for discussion. We identified 155 subnetworks which hold 716 (97.3%) of the proteins in the entire PPI network.
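The subnetwork definition above (a centroid protein, its direct interaction partners, and all edges among them) can be sketched as follows. The adjacency-dictionary representation, the function name `centroid_subnetwork`, and the toy protein names are assumptions made for illustration; the connectivity coefficient and the randomization test (Eqns. 5 and 6) are not reproduced here.

```python
def centroid_subnetwork(adjacency, centroid):
    """Return (members, edges) for the subnetwork centered on `centroid`.

    adjacency: dict mapping each protein to the set of its interaction partners
               (undirected network, so every edge is stored in both directions).
    """
    members = {centroid} | adjacency[centroid]
    # Keep every edge whose two endpoints both belong to the subnetwork.
    edges = {frozenset((u, v))
             for u in members
             for v in adjacency[u]
             if v in members}
    return members, edges

# Toy PPI network (hypothetical): triangle A-B-C, plus the chain A-D-E.
ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

members_a, edges_a = centroid_subnetwork(ppi, "A")
members_d, edges_d = centroid_subnetwork(ppi, "D")
```

Counting `len(edges)` for the observed network and for degree-preserving randomizations would give the raw quantities a significance test needs.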
Annotations of highly interconnected PPI subnetworks in malaria.
Previous annotations (GO)
New annotations with PFP (GO)
chromatin assembly or disassembly (0006333)
Myosin I binding (0017024)
chromosome organization and biogenesis sensu Eukaryota (0007001)
Cytoskeleton-dependent intracellular transport (0030705)
Structural constituent of nuclear pore (0017056)
mitotic cell cycle (0000278)
Nuclear export (0051168)
Nuclear import (0051170)
Nucleic acid transport (0050567)
Nucleobase, nucleoside, nucleotide and nucleic acid transport (0015931)
Regulation of gluconeogenesis (0006111)
Glucosyltransferase activity (0046527)
Signal transducer activity (0004871)
Hydrolase activity (0016787)
Receptor binding (0005102)
Translation regulator activity (0045182)
Autotransporter activity (0015474)
Extracellular region (0005576)
Structural constituent of nuclear pore (0017056)
Negative regulation of lymphocyte activation (0051250)
Peroxisome degradation (0030242)
Microtubule cytoskeleton organization and biogenesis (0000226)
Protein catabolism (0030163)
Intermediate filament cytoskeleton (0045111)
Cellular protein metabolism (0044267)
Cell death (0008219)
Protein folding (0006457)
RNA localization (0006403)
Anterior/posterior axis specification (0009948)
Anterior/posterior pattern formation (0009952)
Cytoskeleton organization and biogenesis (0007010)
Myosin II (0016460)
Actin cytoskeleton (0015629)
Transferase activity (0016740)
ATP binding (0005524)
Cellular protein metabolism (0044267)
Macromolecule catabolism (0009057)
Catalytic activity (0003824)
Kinase activity (0016301)
Intermediate filament cytoskeleton (0045111)
Cytoskeleton-dependent intracellular transport (0030705)
Chromatin binding (0003682)
Adenyl nucleotide binding (0030554)
Chromatin assembly or disassembly (0006333)
Transcription coactivator activity (0003713)
Chromosome organization and biogenesis sensu Eukaryota (0007001)
RNA-mediated posttranscriptional gene silencing (0035194)
Translation regulator activity (0045182)
The next two examples are potentially more interesting, especially with regard to the known pathogenicity of the Malaria plasmodium. The group of 21 proteins centered on Q8I255 was previously annotated with terms directly related to pathogenesis (pathogenesis, extracellular, signal transduction) (Fig. 8C). After providing predicted annotations for 13 of those 21 proteins, several other functions that may be related to particular pathogenic mechanisms were revealed. Particularly interesting are the terms "translation regulator activity", "negative regulation of lymphocyte activation", "microtubule cytoskeleton organization and biogenesis", and "peroxisome degradation". Although the proteins in this subnetwork could already be associated with pathogenesis, new predicted annotations for uncharacterized proteins add direction for designing experiments to test for specific mechanisms that may be responsible for the pathogenic behavior. The interaction subnetwork around Malaria protein Q8I562 (Fig. 8D) also has some potential interest in the molecular mechanisms that contribute to apoptosis. Again, over half of the included proteins (14 of 25) were initially uncharacterized but could be assigned high confidence PFP predictions. Before taking new predictions into account, the cluster was annotated as being related to "cellular protein metabolism" and "protein folding". Several more interesting and specific functional terms were brought to light after including predictions. These terms are related to the cytoskeleton and protein/RNA transport and localization. Specifically, the terms "anterior/posterior pattern formation", "RNA localization", and "cell death" are closely related and signify that the protein interactions in this subnetwork are likely to be involved in the programmed re-organization of the cell leading to death, or apoptosis.
Identifying clusters of functionally related protein-coding genes in genomes
Summary of increase in annotation in genomic windows. Fields: total # of windows a); prior un-annotated windows b); prior un-annotated windows which are annotated by PFP; total # of GO terms added by PFP to prior un-annotated windows c); # of prior annotated windows d); # of prior annotated windows to which more GO terms are predicted by PFP; # of GO terms added to the prior annotated windows.
Examples of windows with newly annotated highly similar genes.
# of Proteins (Average FunSim)
New annotations (GO)
Regulation of biological process (0050789)
Intracellular membrane-bound organelle (0043231)
Membrane-bound organelle (0043227)
Intracellular organelle (0043229)
Establishment of localization (0051234)
chr07 798000 (bp)
rRNA processing (0006364)
Organelle lumen (0043233)
Membrane-enclosed lumen (0031974)
Chromatin silencing at telomere (0006348)
Telomeric heterochromatin formation (0031509)
Signal transducer activity (0004871)
Transmembrane receptor activity (0004888)
Receptor activity (0004872)
Phosphotransferase activity, alcohol group as acceptor (0016773)
Transferase activity, transferring phosphorus-containing groups (0016772)
NADPH regeneration (0006740)
NADPH metabolism (0006739)
Nicotinamide metabolism (0006769)
Pyridine nucleotide metabolism (0019362)
Oxidoreduction coenzyme metabolism (0006733)
Water-soluble vitamin metabolism (0006767)
Dopamine receptor activity (0004952)
Amine receptor activity (0008227)
Neurotransmitter receptor activity (0030594)
Dopamine binding (0035240)
Rhodopsin-like receptor activity (0001584)
Receptor activity (0004872)
Neurotransmitter binding (0042165)
G-protein coupled receptor activity (0004930)
RNA localization (0006403)
Mitotic cell cycle (0000278)
Negative regulation of transcription by carbon catabolites (0045013)
Regulation of transcription by carbon catabolites (0045990)
Response to nutrients (0007584)
Regulation of transcription by glucose (0046105)
Intracellular transport (0046907)
Establishment of localization in cell (0051649)
Autophagic vacuole fusion (0000046)
Organelle fusion (0048284)
Here, we also present several individual cases of new annotation to regions in the genomes of each of the three organisms. A summary of the new annotation is shown in Table 7. The 30 kb region of malaria chromosome 3 starting at position 906,000 contains six proteins with an average GO biological process similarity of 0.722. After annotating four of the five previously uncharacterized proteins coded here with high confidence predictions, we found that the proteins may share involvement in phosphorylation or dephosphorylation ("phosphotransferase activity" and "transferase activity, transferring phosphorus-containing groups"). This may indicate that these neighboring proteins are involved in a common signaling or metabolic pathway. Similarly, the region of malaria chromosome 7 starting at position 1,296,000 (five proteins, average biological process similarity of 0.891) was assigned several receptor-like activities. The overrepresented terms related to several types of receptor activity give a strong indication that this region contains proteins that form a complex or interact closely as part of a membrane signaling receptor. Membrane receptors and complexes of membrane proteins are well characterized as sharing genome proximity. The four proteins between positions 492,000 and 522,000 of the minus strand of malaria chromosome 10 (average biological process similarity of 0.860) were assigned several functional terms that all relate to the intrinsic cellular response to nutrients. The terms "intracellular transport", "response to nutrients", "negative regulation of transcription by carbon catabolites", and "mitotic cell cycle" could all indicate a common process involving metabolism and cellular signaling response to the presence of nutrients under particular conditions, perhaps akin to the well known lac operon in E. coli.
E. coli is one of the best-characterized model organisms in terms of coordinately regulated expression in the form of operons and regulons. As such, we would not expect to find many regions of the genome that could represent new examples of these molecular phenomena relating to specific pathways. However, we did find several examples, including the following two, where annotation of previously uncharacterized regions might indicate common involvement in processes. First, the 11 proteins within 10 kb of position 1,212,000 (average biological process similarity of 0.792) share broad annotations of "regulation of biological process" and "intracellular membrane-bound organelle". Second, the seven proteins within 10 kb of position 3,016,000 (average biological process similarity of 0.711) share similarly broad annotations of "transport" and "localization". In either case, these annotations might indicate involvement in a common complex or process in a particular membrane-bound organelle or localization pathway, and might be enough to warrant further investigation into the biological reason for the shared function.
Yeast is similarly well characterized, but we again found some examples of genomic windows where application of new high confidence predictions revealed a shared function or related functions. There are two particularly interesting examples. First, the 15 kb region of the plus strand of yeast chromosome 14 starting at position 141,000 contains three proteins (average cellular component similarity of 0.749) that share the annotations "chromatin silencing at telomere" and "telomeric heterochromatin formation". Second, the six proteins located in the 15 kb region of chromosome 15 starting at position 342,000 (average cellular component similarity of 0.760) share the related functions of "signal transducer activity" and "transmembrane receptor activity".
Details of individual protein interaction subnetworks and genomic windows, and previous and new annotations for each subnetwork and window can be found in the supplementary data.
In this analysis, we enriched functional annotation of the three genomes with PFP's high confidence predictions and represented the functional space occupied by the proteomes as functional similarity networks, where edges between proteins (nodes) denote significant functional similarity between them. To the best of our knowledge, this is the first time that the structure of functional space has been analyzed as a network. Taking advantage of PFP's large annotation coverage, more than 90% of proteins in each genome are included in the functional similarity network (Table 1). This is a significant enrichment especially for the malaria genome, as previously only 41.9% of proteins were annotated. We defined the functional similarity of proteins using their annotated GO terms rather than other possible functional similarity metrics, e.g. conventional sequence similarity, because GO terms can compare proteins in different aspects of function (i.e. in different GO categories and their combinations), which may be more relevant to protein activity in the cell. Moreover, proteins with high sequence similarity also show significant similarity in their annotated GO terms in the majority of cases, so proteins sharing GO term similarity can be considered a superset of those sharing sequence similarity [38, 50].
Our study revealed interesting characteristics of the functional similarity networks of the three organisms in contrast with the PPI networks. We analyzed the global topology of the functional similarity networks by computing the degree exponent, the clustering coefficient, and the clustering degree exponent of the networks (Table 3). In general, both functional similarity networks and PPI networks follow the power law, but they differ in that the former show network modularity while the latter do not. Among the four functional similarity networks constructed by considering individual GO-scores and the funSim score, the funSim score network is different from the others by exhibiting a higher tendency to be hierarchical (i.e. a higher clustering degree exponent value), similar to the metabolic pathway networks. However, the clustering degree exponent value seems to be sensitive to the similarity threshold value used to construct the networks, and the values for the E. coli and yeast funSim score networks drop below 1.0 when some similarity threshold values are used.
Unlike the current PPI network data, which provide a static view of protein interactions, the functional similarity networks change their topology as the similarity threshold value is changed. Functional similarity networks of a different similarity threshold value represent different levels of granularity of the gene function space in a genome. Investigation of the global and local structure properties of dynamically changing functional similarity networks is left as an important future work.
It should be noted that the currently available PPI networks have several limitations; they are usually incomplete and potentially include false positive and false negative interactions [51, 52]. However, we expect that such limitations will not strongly affect this work, since its focus is the construction of the functional similarity networks and the functional enrichment by PFP. We analyzed the PPI networks to contrast them with the newly introduced functional similarity networks. As future work, it may be interesting to compare the network properties of the functional similarity networks with other types of biological networks, such as gene regulatory networks [53, 54] or gene functional networks constructed by considering different types of experimental information.
Individual annotation of subnetworks in the PPI networks and of local genome windows identified numerous interesting cases where proteins in the subset show high coherence with other members. These results provide examples of how computational prediction can be utilized in interpreting or building hypotheses on the proteins sharing such functional association. Interestingly, there are several cases where proteins in a genome window are functionally coherent with PFP's assignment of broader, less-specific functional terms. These may not be regulons or operons, where functional roles of component genes are usually better defined. Rather, these local windows of genes may imply the existence of a new type of gene cluster in which genes are inter-related by a much broader, higher-level functional category.
Together with the introduction of the functional similarity networks and functional coherence of individual subsets of genes, we have demonstrated the usefulness of computational function prediction by PFP. The same methods can be applied to any biologically related group of proteins. High-throughput technologies such as microarrays and mass spectrometry that identify clusters of proteins linked by common expression patterns or conditions produce datasets that would also be relevant for such an application. In the end, as PFP is a sequence similarity-based prediction method, utilizing its high confidence predictions takes a minimal time and energy commitment (~1 day to run all uncharacterized proteins for P. falciparum) and can have a significant impact on a researcher's ability to interpret the complex datasets that have now become the norm.
We assigned function to previously uncharacterized protein genes in Escherichia coli K-12, Saccharomyces cerevisiae, and Plasmodium falciparum with high-confidence function prediction by the PFP method. Using the enriched function annotation, we introduced the functional similarity network, which provides an intuitive representation of the functional space of a proteome. Comparison with the PPI networks revealed distinct features of the functional similarity networks. In addition, PFP's function assignment identified functionally coherent subnetworks in the PPI networks and local regions in the genomes. Altogether, this work demonstrated the usefulness of computational function prediction by PFP.
The genome sequence and annotation data for Escherichia coli K-12, Saccharomyces cerevisiae, and Plasmodium falciparum were obtained from the website of the European Bioinformatics Institute (EBI). Annotations qualified as "previously known" were extracted from EBI's GOA proteome datasets http://www.ebi.ac.uk/GOA/. PPI data for E. coli were obtained from Arifuzzaman et al., for S. cerevisiae from MIPS, and for P. falciparum from the paper by LaCount et al. Genome position data were obtained from the website of the National Center for Biotechnology Information (NCBI) ftp://ftp.ncbi.nih.gov/genomes/.
Computing Clustering Coefficient
The clustering coefficient of a node is defined as (Eqn. 1)

$$C = \frac{2n}{k(k-1)},$$

where k is the number of neighboring nodes connected to the central node and n is the number of pairs of the neighboring nodes that are directly connected. To quantify the modularity of an entire network, the average clustering coefficient is computed [39, 40].
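As an illustration of this definition, the sketch below computes the standard local clustering coefficient, C = 2n / (k(k-1)), for each node of an undirected network and averages it over all nodes. The adjacency-dictionary representation, the convention that nodes with fewer than two neighbors get C = 0, and the toy graph are assumptions for the example.

```python
from itertools import combinations

def clustering_coefficient(adjacency, node):
    """Local clustering coefficient C = 2n / (k(k-1)) of one node."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases count as zero
    # n = number of neighbor pairs that are themselves directly connected
    n = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * n / (k * (k - 1))

def average_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    return sum(clustering_coefficient(adjacency, v) for v in adjacency) / len(adjacency)

# Toy undirected graph (hypothetical): triangle A-B-C plus a pendant node D.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

c_a = clustering_coefficient(toy, "A")   # one connected pair out of three
c_avg = average_clustering(toy)
```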
Function Prediction by PFP
GO functional terms were predicted for each sequence without any previously assigned GO terms from E. coli, S. cerevisiae, and P. falciparum using PFP under its optimal parameter settings, which are described below. Refer to the previous work for detailed analyses of the effect of using different parameter values. Only terms predicted with high confidence (≥ 0.8) were assigned to each query sequence. The detailed description of the algorithm as well as thorough benchmark results of PFP have been reported in the previous papers [13, 14]. Here we briefly overview the PFP algorithm for readers' convenience.
The score of a GO term is computed from the sequences retrieved by PSI-BLAST as (Eqns. 2 and 3)

$$s(f_a) = \sum_{i=1}^{N} \sum_{j=1}^{N_{func}(i)} \Big( \big( -\log(E\_value(i)) + b \big) \, P(f_a \mid f_j) \Big), \qquad P(f_a \mid f_j) = \frac{c(f_a, f_j) + \varepsilon}{c(f_j) + \mu \cdot \varepsilon},$$

where s(f_a) is the final score assigned to the GO term f_a, N is the number of similar sequences retrieved by PSI-BLAST, N_func(i) is the number of GO terms assigned to sequence i, E_value(i) is the E-value given to sequence i, and f_j is a GO term assigned to sequence i. P(f_a | f_j) takes into account the association of two GO terms, i.e. the co-occurrence of the two GO terms in the same sequences. It is the conditional probability that f_a is associated with f_j. c(f_a, f_j) is the number of times f_a and f_j are assigned simultaneously to a sequence in UniProt, c(f_j) is the total number of times f_j appears in UniProt, μ is the total number of unique GO terms considered in the associations, and ε is the pseudo-count, which is set to 0.05. Note that the conditional probability is asymmetric, i.e. P(f_a | f_j) ≠ P(f_j | f_a).
For running PSI-BLAST, the default E-value threshold for inclusion in multiple iterations (-h 0.005) is used and the maximum number of iterations is set to three (-j 3). By shifting the scoring space by a constant (b), individual annotations from weakly similar sequences (E-value > 1) can be considered and scored. Here we use b = log(125) to allow the use of sequence matches to an E-value of 125.
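The scoring scheme described above can be sketched in a few lines. The code assumes the score of a term f_a is the sum, over every retrieved sequence i and every term f_j annotated to it, of (-log(E_value(i)) + b) · P(f_a | f_j), with P(f_a | f_j) estimated as (c(f_a, f_j) + ε) / (c(f_j) + μ·ε). This form follows the symbol definitions in the text but is an assumption here, as are the natural logarithm, the function names, and the toy counts; it is not the PFP distribution code.

```python
import math

EPSILON = 0.05  # pseudo-count given in the text

def conditional_probability(f_a, f_j, joint_counts, term_counts, mu):
    """Smoothed estimate of P(f_a | f_j) from co-occurrence counts."""
    if f_a == f_j:
        joint = term_counts[f_j]  # a term always co-occurs with itself
    else:
        joint = joint_counts.get((f_a, f_j), 0)
    return (joint + EPSILON) / (term_counts[f_j] + mu * EPSILON)

def pfp_raw_scores(hits, joint_counts, term_counts, mu, b=math.log(125)):
    """hits: list of (e_value, [GO terms]); e_value must be positive."""
    scores = {term: 0.0 for term in term_counts}
    for e_value, hit_terms in hits:
        weight = -math.log(e_value) + b  # equals 0 for a hit with E-value 125
        for f_j in hit_terms:
            for f_a in scores:
                scores[f_a] += weight * conditional_probability(
                    f_a, f_j, joint_counts, term_counts, mu)
    return scores

# Toy example (hypothetical terms and counts).
term_counts = {"GO:A": 10, "GO:B": 5}
joint_counts = {("GO:B", "GO:A"): 4}   # c(f_a = GO:B, f_j = GO:A)
hits = [(1e-10, ["GO:A"]), (125.0, ["GO:B"])]
scores = pfp_raw_scores(hits, joint_counts, term_counts, mu=2)
```

In this toy case the second hit sits exactly at the E-value limit and contributes nothing, while GO:B still receives a score through its association with GO:A.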
Scores are also propagated from child terms to their parent terms in the GO hierarchy (Eqn. 4):

$$s(f_p) = \sum_{i=1}^{N_c} \left( s(f_{c_i}) \, \frac{c(f_{c_i})}{c(f_p)} \right),$$

where s(f_p) is the score of the parent term f_p, N_c is the number of child GO terms which belong to the parent term f_p, and s(f_ci) is the score of a child term f_ci. c(f_ci) and c(f_p) are the numbers of known genes which are annotated with function terms f_ci and f_p, respectively. The final raw score of a GO term is given by summing the score directly computed by Eqn. 2 and the scores from the ancestral score propagation by Eqn. 4.
Finally, for each predicted GO term, the p-value of the raw score is computed by using the term-specific raw score distribution obtained by running PFP on the benchmark dataset. Then, the expected accuracy is assigned to the prediction by referring to the correlation of the p-value and the actual accuracy computed for each GO term (see 6 in our previous paper).
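A minimal sketch of the parent-term propagation follows, assuming each child contributes its score weighted by the ratio of gene counts c(f_ci) / c(f_p), as the symbol definitions above suggest. The function name and the numbers are hypothetical.

```python
def propagated_parent_score(child_scores, child_counts, parent_count):
    """Sum of child scores, each weighted by c(child) / c(parent)."""
    return sum(child_scores[term] * child_counts[term] / parent_count
               for term in child_scores)

# Toy example (hypothetical): two child terms under one parent term.
child_scores = {"child_1": 10.0, "child_2": 4.0}
child_counts = {"child_1": 30, "child_2": 10}   # genes annotated with each child
parent_score = propagated_parent_score(child_scores, child_counts, parent_count=100)
```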
PPI network enrichment
To evaluate enrichment of annotations in the interaction network, we compared the number of fully (both interaction partners annotated) and partially (one of the interaction partners annotated) annotated interactions before and after application of PFP to unannotated proteins in the dataset (Fig. 2). We considered only GO predictions with high confidence for the node enrichment.
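The before/after comparison of interaction annotation can be sketched as a simple tally. The function name `classify_edges`, the category labels, and the toy proteins are illustrative assumptions.

```python
def classify_edges(edges, annotated):
    """Count interactions with both, one, or neither partner annotated."""
    labels = ("none", "partial", "full")
    counts = {label: 0 for label in labels}
    for a, b in edges:
        n_annotated = (a in annotated) + (b in annotated)
        counts[labels[n_annotated]] += 1
    return counts

# Toy PPI edges (hypothetical protein names).
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
previously_annotated = {"P1", "P2"}
high_confidence_predictions = {"P3"}   # proteins newly annotated by prediction

before = classify_edges(edges, previously_annotated)
after = classify_edges(edges, previously_annotated | high_confidence_predictions)
```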
Partitioning PPI subnetworks
where ν is the average value of the connectivity coefficient for the set of all subnetworks of the same centroid, and s is the variance of the connectivity coefficient values for the same set. This method of determining statistically significant subnetworks was used by LaCount et al. for the malaria interaction network.
Functional similarity network
max(GOscore) is set to 1 (maximum possible GOscore) and the range of the funSim score is [0,1]. To construct the function similarity networks for each organism, we performed an all-by-all pairwise comparison to find the funSim and category GOscore values for each unique protein pair.
In the functional similarity networks, pairs with a GOscore or funSim score of 0.95 or higher are connected by edges. The networks are visualized with Cytoscape.
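Network construction from the all-by-all comparison can be sketched as a thresholding step. The code assumes pairwise scores are already available in a dictionary keyed by protein pair; the function names and the toy scores are hypothetical.

```python
def build_similarity_network(pair_scores, threshold=0.95):
    """Connect every protein pair whose similarity score reaches the threshold."""
    adjacency = {}
    for (p, q), score in pair_scores.items():
        if score >= threshold:
            adjacency.setdefault(p, set()).add(q)
            adjacency.setdefault(q, set()).add(p)
    return adjacency

def edge_count(adjacency):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(partners) for partners in adjacency.values()) // 2

# Toy pairwise scores (hypothetical funSim or GOscore values).
pair_scores = {
    ("P1", "P2"): 0.97,
    ("P1", "P3"): 0.85,
    ("P2", "P3"): 0.95,
    ("P3", "P4"): 0.60,
}

net_095 = build_similarity_network(pair_scores)          # default threshold
net_080 = build_similarity_network(pair_scores, 0.80)    # more permissive
net_099 = build_similarity_network(pair_scores, 0.99)    # stricter
```

Proteins with no edge at a given threshold do not appear in the adjacency dictionary, so the node count also changes with the threshold.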
Identifying significant genomic windows
To identify functionally similar regions of a genome, we used a sliding window approach. For each organism we used a unique window size (10 kb for E. coli, 30 kb for P. falciparum, 15 kb for S. cerevisiae) and a slide value equal to 1/5 the window size. The window sizes were determined such that the number of genes for both strands in any window averaged between eight and ten. Genes included in the window were taken from the plus and minus strands individually and also from both strands together. Windows for which the category GO score was above 0.7 or the funSim was above 0.49 were analyzed for overrepresentation of GO functional terms by the method described below.
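The sliding window scan can be sketched as below. The code assumes a gene is assigned to a window by its start coordinate and that windows begin at position 0; how genes spanning a window boundary were handled is not stated in the text, so that choice, the function names, and the toy gene list are assumptions.

```python
def sliding_windows(genes, genome_length, window_size, step=None):
    """Yield (window_start, gene_names) for each window along a sequence.

    genes: list of (name, start_position) tuples for one strand or both.
    step defaults to one fifth of the window size, as described in the text.
    """
    if step is None:
        step = window_size // 5
    last_start = max(genome_length - window_size, 0)
    for start in range(0, last_start + 1, step):
        end = start + window_size
        yield start, [name for name, pos in genes if start <= pos < end]

def passes_similarity_filter(go_score, funsim):
    """Thresholds from the text: category GO score > 0.7 or funSim > 0.49."""
    return go_score > 0.7 or funsim > 0.49

# Toy example (hypothetical genes) with a 10 kb window as used for E. coli.
genes = [("g1", 500), ("g2", 2500), ("g3", 9000), ("g4", 11000)]
windows = list(sliding_windows(genes, genome_length=20000, window_size=10000))
```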
Identifying significantly overrepresented terms in groups of proteins
where q is the number of unique GO terms annotated to proteins in the cluster.
The annotation gain for a subset of proteins is calculated as the percentage increase in the number of unique new statistically overrepresented annotations as compared to the number of previously known annotations.
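The annotation gain can be written out as a short calculation. The sketch assumes "new" annotations are the overrepresented terms found after prediction that were not among the previously known terms, and that the gain is undefined when there were no previous annotations; the function name and the term labels are hypothetical.

```python
def annotation_gain(previous_terms, overrepresented_terms):
    """Percentage of unique new overrepresented terms relative to previous terms."""
    previous = set(previous_terms)
    if not previous:
        return None  # assumed convention: no baseline, gain undefined
    new_terms = set(overrepresented_terms) - previous
    return 100.0 * len(new_terms) / len(previous)

# Toy example (hypothetical term labels).
gain = annotation_gain({"term_a", "term_b"},
                       {"term_a", "term_c", "term_d", "term_e"})
```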
PFP is available as a web tool http://kiharalab.org/pfp and as a downloadable distribution as used in these analyses http://kiharalab.org/pfp/dist. In addition, the supplemental data including the function annotation by PFP to the three genomes and the PPI networks and networks statistics of the functional similarity networks and the PPI networks are available at our lab website http://kiharalab.org/func_network_suppl/.
This work was supported partially by grants from the National Institutes of Health (R01GM075004). MC is supported by a grant from Purdue Research Foundation. DK also acknowledges grants from NIH (U24GM077905) and National Science Foundation (DMS0604776, DMS0800568).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.