A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge
© Cho et al; licensee BioMed Central Ltd. 2008
Received: 08 April 2008
Accepted: 18 September 2008
Published: 18 September 2008
The functional characterization of newly discovered proteins has been a challenge in the post-genomic era. Protein-protein interactions provide insights into functional analysis because the function of an unknown protein can be postulated from its interactions with known proteins. Protein-protein interaction data sets have been enriched by high-throughput experimental methods. However, functional analysis using interaction data is limited in accuracy because the data contain experimentally generated false positives and interactions that lack functional linkage.
Protein-protein interaction data can be integrated with the functional knowledge existing in the Gene Ontology (GO) database. We apply similarity measures to assess the functional similarity between interacting proteins. We present a probabilistic framework for predicting functions of unknown proteins based on the functional similarity. We use the leave-one-out cross validation to compare the performance. The experimental results demonstrate that our algorithm performs better than other competing methods in terms of prediction accuracy. In particular, it handles the high false positive rates of current interaction data well.
Experimentally determined protein-protein interactions alone are too error-prone to uncover the functional associations among proteins. The performance of function prediction for uncharacterized proteins can be enhanced by integrating the multiple data sources available.
Since the completion of the sequencing of the human genome [1], discovering the underlying principles of interactions and the functional roles of proteins has been in the spotlight of the post-genomic era. The functional characterization of newly determined proteins has become one of the most crucial challenges. Initial efforts relied on sequence or structural homology searches using computational algorithms such as FASTA [2] and BLAST [3], and the task of predicting protein function is still progressing with a massive amount of data [4, 5].
The availability of complete genomes for a wide variety of organisms has shifted the function prediction problem from the single-gene level to the genome level. Several approaches have been introduced on the basis of correlated evolution mechanisms of genes. Conserved gene neighborhoods across different, distantly related genomes reveal potential functional linkages [6]. Domain fusion analysis infers pairs of proteins that interact with each other and thus perform a related function [7]. Phylogenetic profiles represent the pattern of presence or absence of a protein in a set of organisms. Two proteins are considered to be functionally linked if they have the same phylogenetic profile [8].
In recent years, the data generated by high-throughput techniques have facilitated functional classification. For example, microarrays monitor the expression levels of thousands of genes, and correlated expression profiles of genes can be interpreted as functional relatedness [9]. Protein-protein interaction data, enriched by high-throughput experiments including yeast two-hybrid systems [10] and mass spectrometry [11], have provided important clues to functional associations between proteins. Integrated protein interaction networks have been built from heterogeneous interaction data sources. Accordingly, numerous computational methods have been proposed for uncovering the functional information of uncharacterized proteins in these networks.
In this study, we explore an effective methodology for predicting the functions of unknown proteins using the connectivity of protein interaction networks. A number of approaches have previously been proposed to predict protein function from protein interaction networks. The neighbor counting method [12], also called the majority-rule based method, searches for the most common function among the neighbors, i.e., interacting partners, of an unknown protein. The commonality of the functions of neighbors can be statistically evaluated. Hishigaki et al. [13] used a chi-square formula to calculate the statistical significance of the functions of neighbors. Because neighbor counting and related methods focus on the direct neighborhood, they are problematic if an unknown protein has only a small number of annotated interacting partners or a large number of false positive interactions.
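The neighbor counting idea described above can be illustrated with a minimal Python sketch. The function name `neighbor_counting`, the toy protein identifiers, and the function labels are hypothetical and chosen only for illustration; the sketch is not the implementation used in the cited work.

```python
from collections import Counter

def neighbor_counting(protein, interactions, annotations, top_k=1):
    """Rank functions by how many direct neighbors of `protein` carry them."""
    neighbors = set()
    for a, b in interactions:
        if a == protein and b != protein:
            neighbors.add(b)
        elif b == protein and a != protein:
            neighbors.add(a)
    counts = Counter()
    for nb in neighbors:
        # Unannotated neighbors contribute nothing to the vote.
        counts.update(annotations.get(nb, ()))
    # Sort by descending count, then by name for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Toy data (hypothetical): P1 is the unknown protein.
interactions = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
annotations = {
    "P2": {"metabolism"},
    "P3": {"metabolism", "transport"},
    "P4": {"transport", "metabolism"},
}
predicted = neighbor_counting("P1", interactions, annotations, top_k=2)
```

With few annotated neighbors or many spurious edges, the vote rests on little or misleading evidence, which is the weakness noted above.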
Several graph-theoretic approaches have taken into account the global topology of protein interaction networks. Vazquez et al. [14] and Karaoz et al. [15] attempted to maximize the functional consistency among neighbors across the whole network. Nabieva et al. [16] applied the concept of functional flow, which is propagated from annotated proteins to unannotated proteins. Probabilistic approaches have also been suggested for function prediction. Deng et al. [17] introduced a statistical framework using the Markov Random Field (MRF) model with a Gibbs distribution. They used a quasi-likelihood approach to estimate the parameters of the MRF model. Lee et al. [18] developed a kernel logistic regression (KLR) method based on diffusion kernels that incorporates all indirect neighborhoods in the networks. Chua et al. [19] also considered indirect neighbors at path length 2. They computed a functional similarity score between two proteins, derived from the symmetric difference of their neighbors and the reliability of the data sources used. Kirac et al. [20] used the annotation patterns in the neighborhood for function prediction. They calculated the similarity between the annotation pattern of an unknown protein and the set of annotation patterns for each function.
Most of the previous approaches to predicting protein function from protein interaction networks are based on the assumption that two interacting proteins have a similar function or share functions. However, these approaches overestimate the tendency of functional links between proteins that interact with each other. In addition, it is recognized that current interaction data include a substantial number of false positives and false negatives, even though they have been curated using various computational techniques. This indicates that high-throughput experimental methods frequently generate spurious data and that the set of interactions accumulated on a large scale is still incomplete. Because current interaction data are not suitable for direct use in function prediction, the integration of functional knowledge elicited from other biological resources is necessary.
In this article, we propose a novel probabilistic approach for function prediction from protein interaction networks. We integrate the connectivity of protein interaction networks with the semantic knowledge in the Gene Ontology (GO) database for the purpose of improving the accuracy of function prediction. First, we measure the functional similarity between interacting proteins using GO. Next, we present a probabilistic framework to compute the confidence in function prediction for each unknown protein. Our experimental results show that this approach is robust to current erroneous data and thus predicts function more accurately than previous methods.
Integration of interactions with semantic knowledge
Semantic knowledge, also called ontology, is a representation in which concepts are described by their meanings and their relationships to each other [21]. It is typically represented as a hierarchical tree structure or a directed acyclic graph (DAG). The Gene Ontology (GO) [22] is currently one of the most widely used databases providing semantic knowledge related to biological processes and molecular functions. Because the GO terms and their annotations are continually updated as research progresses, they might include errors. Nevertheless, we employ GO for the integration with protein-protein interactions because of its comprehensiveness. The relationships between GO terms and the annotations can be exploited to measure the functional similarity between interacting proteins.
Recently, various measures of functional similarity between proteins using GO have been proposed [23–28]. Such measures can be classified into two distinct categories: structure-based approaches and annotation-based approaches. Suppose each protein is annotated on the most specific GO terms on the paths from the root term. Structure-based approaches utilize the path length or the common parent terms between the two GO terms on which the interacting proteins are annotated. Two interacting proteins are functionally more similar if the two GO terms whose annotations include each of them are closer to each other, or if the GO terms have a larger number of common parent terms in the GO structure. As a combination of the two factors, the path length from the root to the most specific common parent term can be taken into consideration. (See Methods for details of structure-based approaches.) However, as a weakness, these approaches depend on the critical assumption that all the edges in the structure represent the same specificity.
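The two structural ingredients mentioned above, the common parent terms and the depth of the most specific common parent, can be sketched in Python as follows. The toy DAG, the term names, and the helper names are hypothetical, and measuring depth as the longest path from the root is an assumption made here, not a detail confirmed by the text.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in the DAG, including `term` itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def depth(term, parents):
    """Longest path length from the root to `term` (assumed definition)."""
    direct = parents.get(term, ())
    if not direct:
        return 0
    return 1 + max(depth(p, parents) for p in direct)

def deepest_common_parent(t1, t2, parents):
    """Most specific term shared by the ancestor sets of t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    # Sorting first makes tie-breaking deterministic.
    return max(sorted(common), key=lambda t: depth(t, parents))

# Toy hierarchy (hypothetical): child -> list of parent terms.
parents = {
    "root": [],
    "A": ["root"], "B": ["root"],
    "A1": ["A"], "A2": ["A"], "B1": ["B"],
}
shared = ancestors("A1", parents) & ancestors("A2", parents)
close_pair = deepest_common_parent("A1", "A2", parents)
far_pair = deepest_common_parent("A1", "B1", parents)
```

Here the pair (A1, A2) shares a deeper common parent than (A1, B1), so a structure-based measure would rate it as more similar, provided every edge reflects the same change in specificity.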
Annotation-based approaches focus on the number of proteins annotated on GO terms. They suppose that GO annotations follow the transitivity property, i.e., if a protein is annotated on a term T, then it is also annotated on the more general terms on the paths from T to the root in the GO structure. Consequently, the root term carries the annotations of all annotated proteins. Under this property, the number of proteins annotated on a term T quantifies the specificity of T. The functional similarity between interacting proteins is then measured by the commonality of the annotations of the two terms on which they are annotated. The more annotations two terms share, the more similar they are. The measured similarity can also be normalized by the number of proteins annotated on the most specific individual terms whose annotations include each interacting protein. (See Methods for details of annotation-based approaches.) Annotation-based approaches generally perform better than structure-based approaches when the annotations in GO are fairly complete.
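The transitivity property can be made concrete with a short Python sketch that propagates direct annotations up a toy hierarchy and then reads off the annotation counts used as a specificity signal. The function name `propagate`, the terms, and the protein identifiers are hypothetical.

```python
def propagate(direct, parents):
    """Apply transitivity: a protein annotated on T is also annotated on
    every term on the paths from T to the root."""
    full = {term: set() for term in parents}
    for term, proteins in direct.items():
        stack = [term]
        visited = set()
        while stack:
            current = stack.pop()
            if current in visited:
                continue
            visited.add(current)
            full.setdefault(current, set()).update(proteins)
            stack.extend(parents.get(current, ()))
    return full

# Toy hierarchy and direct annotations (hypothetical).
parents = {
    "root": [],
    "A": ["root"], "B": ["root"],
    "A1": ["A"], "A2": ["A"], "B1": ["B"],
}
direct = {"A1": {"P1", "P2"}, "A2": {"P3"}, "B1": {"P4"}}
full = propagate(direct, parents)

# Fewer annotated proteins means a more specific term.
counts = {term: len(proteins) for term, proteins in full.items()}
```

In this toy example the root collects all four proteins, term A collects three, and the leaf terms keep only their direct annotations.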
Function prediction algorithm
where $P(f = 0)$ is the probability that $p_i$ does not have the function $f$.
$\mu_{f+}$ and $\sigma_{f+}$ are calculated from the functional similarity between $p_i$ and the proteins annotated on $f$. Similarly, $\mu_{f-}$ and $\sigma_{f-}$ are calculated from the functional similarity between $p_i$ and the proteins that are not in the annotation of $f$. $P(f = 1)$ is the ratio of the number of proteins having $f$ to the total number of known proteins, and $P(f = 0)$ is $1 - P(f = 1)$. As an alternative to Formula 7, we compute $\log(\lambda_f)$, which serves as the prediction confidence of assigning the function $f$ to $p_i$.
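The quantities described in this paragraph can be sketched in Python as follows. This is a hedged illustration only: it computes the prior $P(f = 1)$ as stated in the text and, as an assumption, takes the statistic for each group to be the mean similarity between the unknown protein and the proteins with or without $f$. It does not reproduce Formula 7, and all names and numbers are hypothetical.

```python
from statistics import mean

def function_prior(f, annotations):
    """P(f = 1): fraction of known (annotated) proteins that have f."""
    known = [p for p, funcs in annotations.items() if funcs]
    return sum(1 for p in known if f in annotations[p]) / len(known)

def group_means(f, annotations, sim_to_unknown):
    """Mean similarity of the unknown protein to proteins annotated on f
    and to proteins not annotated on f (assumed summary statistic)."""
    pos = [s for q, s in sim_to_unknown.items() if f in annotations.get(q, ())]
    neg = [s for q, s in sim_to_unknown.items() if f not in annotations.get(q, ())]
    return mean(pos), mean(neg)

# Toy data (hypothetical): four known proteins and their similarity
# to one unknown protein.
annotations = {"Q1": {"f"}, "Q2": {"f", "g"}, "Q3": {"g"}, "Q4": {"h"}}
sim_to_unknown = {"Q1": 0.75, "Q2": 0.5, "Q3": 0.25, "Q4": 0.0}

p_has = function_prior("f", annotations)  # P(f = 1)
p_not = 1 - p_has                         # P(f = 0)
mu_pos, mu_neg = group_means("f", annotations, sim_to_unknown)
```

A large gap between the two group statistics, as in this toy case, is the kind of evidence that would push the prediction confidence for $f$ upward.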
Functional linkage of protein-protein interactions
We assessed the tendency of functional linkage between interacting proteins in current interaction databases. We extracted the interaction data of Saccharomyces cerevisiae from three databases: 12352 interactions from MIPS [29], 17186 from DIP [30] and 56860 from BioGRID [31]. We then computed the functional similarity between interacting proteins using two selected similarity measures: a structure-based method and an annotation-based method. The functional hierarchy and annotations from FunCat [32] in MIPS were used for the similarity measures. We also calculated the functional consistency using the Jaccard index.
Next, we measured the functional similarity of each interacting protein pair using the annotation-based method in Formula 14. Figure 1(b) shows the distribution of interacting protein pairs with respect to their annotation-based functional similarity. Very few interacting pairs in the three databases are co-annotated on specific functional categories. Furthermore, a large fraction of the interacting pairs, 56% in MIPS and 58% in DIP and BioGRID, have a similarity of less than 0.2, i.e., they rarely have common annotations. This indicates that around 60% of the interacting proteins in the databases are not co-annotated on any specific functions.
Finally, we observed the functional consistency of each interacting protein pair using the Jaccard index. According to the transitivity property of functional annotations, if a protein is annotated on a function, then it also has the more general functions on the paths towards the root in the hierarchical structure. We calculated the ratio of the number of common functions to the number of all distinct functions of an interacting protein pair. Figure 1(c) shows the distribution of interacting protein pairs with respect to their functional consistency. The overall distribution pattern is similar to that in Figure 1(b). Only 18% of the interacting pairs in MIPS, 21% in DIP and 16% in BioGRID have a consistency greater than or equal to 0.4. On the other hand, 63% in MIPS and DIP and 65% in BioGRID have a consistency of less than 0.2, and their common functions are likely to be very general ones, located on the upper levels of the functional hierarchy. This result thus implies that more than 60% of the interacting proteins in the databases do not share any specific functions.
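The consistency computation described above can be sketched in Python. The sketch assumes that hierarchical category codes are dot-separated, so that each prefix of a code is one of its more general functions; the specific codes and helper names are hypothetical.

```python
def with_ancestors(codes):
    """Expand each dot-separated category code with all of its prefixes,
    mimicking the transitivity of annotations in a tree hierarchy."""
    expanded = set()
    for code in codes:
        parts = code.split(".")
        for i in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:i]))
    return expanded

def jaccard(functions_a, functions_b):
    """Common functions divided by all distinct functions of the pair."""
    union = functions_a | functions_b
    if not union:
        return 0.0
    return len(functions_a & functions_b) / len(union)

# Toy interacting pair (hypothetical category codes).
protein_a = with_ancestors({"01.01.03"})
protein_b = with_ancestors({"01.01.06", "02.07"})
consistency = jaccard(protein_a, protein_b)
```

The two proteins share only the general categories `01` and `01.01` out of six distinct categories, giving a consistency of 2/6. This mirrors the observation that low-consistency pairs tend to share only upper-level functions.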
Comparison of functional similarity measurements
We evaluated the effectiveness of the functional similarity measures in order to choose the best one for our function prediction approach. We used the full version of the protein-protein interaction data of Saccharomyces cerevisiae from DIP [30], which includes 4928 distinct proteins and 17186 interactions. The GO terms and their annotations were downloaded from the Gene Ontology database [22]. After selecting the GO terms that are annotated with two or more proteins in the biological process and molecular function categories, we obtained a total of 2456 GO terms. In both categories, the root terms have the annotations of 5871 proteins. As a structure-based similarity measure, we chose Formula 13, which is the most advanced metric in that category. As an annotation-based similarity measure, we used Formula 14 because a previous study has shown that it has the best performance. We compared the functional similarity of interacting proteins to their interaction reliability, which is measured based on connectivity [33].
Cross-validation of function prediction
where $|K_i|$ is the size of the set $K_i$ and $n$ is the total number of distinct proteins that are annotated on at least one functional category and have interaction evidence.
Comparison of prediction performance
We compared the prediction performance with two previous methods, which are currently the most competitive. One is the approach that weights the functional similarity of direct and indirect neighbors (level-2 neighborhood) [19], and the other is the prediction method based on the annotation patterns in the neighborhood [20]. The first method computes the likelihood that an unknown protein p has a function using the functional similarity weights between p and its level-1 or level-2 neighbors. The functional similarity weight of two proteins is calculated from the commonality of their neighbors in the protein interaction network. Since we used the same interaction data from DIP as input for all the methods, we did not consider the reliability of the data source. We used a threshold on the likelihood to generate the output set of predicted functions for each protein. We then obtained different output sets by varying the threshold. The sets of predicted functions become larger as the threshold is lowered.
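The thresholding step can be sketched in a few lines of Python. The scores and function labels below are hypothetical placeholders standing in for the likelihoods produced by any of the compared methods.

```python
def predicted_functions(scores, threshold):
    """Keep every function whose score reaches the threshold."""
    return {f for f, s in scores.items() if s >= threshold}

# Hypothetical per-function scores for one unknown protein.
scores = {"metabolism": 0.9, "transport": 0.6, "cell cycle": 0.3}

# Sweep the threshold from strict to permissive.
thresholds = [0.8, 0.5, 0.2]
outputs = [predicted_functions(scores, t) for t in thresholds]
sizes = [len(o) for o in outputs]
```

Each lower threshold yields a superset of the previous output, so sweeping it trades fewer, more confident predictions against larger, less selective sets.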
The second method constructs the set of annotation neighborhood patterns for each function, and computes the similarity between the annotation neighborhood pattern of an unknown protein and the set of annotation neighborhood patterns of each function. We used the annotations from the functional categories in MIPS since they are used as a ground truth in our experiment. We set the parameter d = 1 and did not consider the edge weights, i.e., assigned 1 to each edge weight. We used the similarity of annotation neighborhood patterns as a threshold.
Table: Function prediction accuracy comparison. Methods compared: functional similarity-based method, FS weighted averaging method, annotation pattern-based method, and neighborhood-based chi-square method.
Function prediction for unknown proteins
Table: Function prediction results for unknown proteins. Column recovered: ID in MIPS. Functional categories listed: C-compound and carbohydrate metabolism; nitrogen, sulfur or selenium metabolism; modification by phosphorylation/dephosphorylation; lysosomal and vacuolar protein degradation; homeostasis of cations; lipid, fatty acid and isoprenoid metabolism; mitotic cell cycle; nuclear or chromosomal cycle; metabolism of porphyrins.
Prediction of subcellular localization
Table: Localization prediction results for unknown proteins. Columns: predicted subcellular localization; ID in MIPS.
Through recent advances in high-throughput techniques, a significant amount of protein-protein interaction data has been accumulated. Protein function has been predicted from the interaction data because the evidence of interaction can be interpreted as a functional link. However, we observed that only a small fraction of the current interaction data from major interaction databases is related to functional linkage. The results in Figure 1 indicate that around 60% or more of the interacting protein pairs are not linked by similar functions. In other words, at most 40% of the protein pairs are linked by similar functions. This rate exceeds the potential false positive rate of experimentally determined interactions reported in [34]. This observation is also reflected in the limited accuracy of previous function prediction methods based on the connectivity of protein interaction networks, as shown in Figure 6 and Table 1.
Our function prediction algorithm uses a probabilistic formula derived from the functional similarity between proteins that interact with each other. The functional similarity can be measured using the structure or the annotations from the Gene Ontology (GO). Various functional similarity measures were evaluated in terms of functional co-occurrence and consistency of interacting pairs. The experimental results in Figures 2, 3 and 4 show that the annotation-based method assesses similarity better than the structure-based method. This indicates that the current GO structure by itself is an imperfect resource for identifying functional linkage. In our experiments, function prediction was conducted with yeast protein-protein interaction data. However, our probabilistic framework is also applicable to higher organisms because of its efficiency.
Functional characterization of the proteome is a central goal in the field of bioinformatics. Experimentally determined protein-protein interactions are crucial data sources for uncovering the functional knowledge of uncharacterized proteins. However, a pre-processing step that assesses the functional linkage of interacting proteins in current interaction data is required for predicting protein function successfully.
In this article, we presented a novel concept for integrating the connectivity of protein interaction networks with already published annotation data in the Gene Ontology (GO). Our results imply that function prediction from protein interaction networks can be calibrated by integrating the functional knowledge from GO. The prediction accuracy can be further improved by integrating the multiple available data sources that are relevant to functional linkage. Furthermore, developing effective integration models for the rapidly growing amount of heterogeneous biological data is a promising direction for future research in functional knowledge discovery.
Structure-based functional similarity measures
where $T_i$ and $T_j$ are the GO terms whose annotations include the interacting proteins $p_1$ and $p_2$, respectively, and depth denotes the maximum path length from the root term to a leaf.
where the two sets are the GO parent terms of $T_i$ and $T_j$, respectively.
Annotation-based functional similarity measures
In information theory, self-information is a measure of the information content associated with the outcome of a random variable. The amount of self-information contained in a probabilistic event c depends on the probability P(c) of the event. Specifically, the smaller the probability of the event, the larger the self-information gained when the event indeed occurs. The information content of a concept C is defined as the negative log likelihood of C, -log P(C). The similarity between two concepts is measured by their commonality, i.e., the information content of the most specific common concept [35].
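The information-content similarity described above can be sketched in Python. The toy hierarchy and annotation counts are hypothetical, P(C) is estimated as the fraction of all annotated proteins that are annotated on C, and the natural logarithm is an assumption since the text does not fix the base.

```python
import math

def ancestors(term, parents):
    """Return all ancestors of `term`, including `term` itself."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term, counts, total):
    """-log P(C), with P(C) estimated from annotation counts."""
    return -math.log(counts[term] / total)

def common_concept_similarity(t1, t2, parents, counts, total):
    """Information content of the most specific common concept."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(information_content(t, counts, total) for t in common)

# Toy hierarchy with propagated annotation counts (hypothetical).
parents = {
    "root": [],
    "A": ["root"], "B": ["root"],
    "A1": ["A"], "A2": ["A"], "B1": ["B"],
}
counts = {"root": 8, "A": 4, "B": 4, "A1": 2, "A2": 1, "B1": 2}
total = counts["root"]

sim_close = common_concept_similarity("A1", "A2", parents, counts, total)
sim_far = common_concept_similarity("A1", "B1", parents, counts, total)
```

Terms that only meet at the root get a similarity of zero because the root has probability 1, whereas terms meeting at a rarer concept score higher.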
This work was partly supported by NSF grant DBI-0234895 and NIH grant 1 P20 GM067650-01A1. MR gratefully acknowledges support from the National Multiple Sclerosis Society (RG3743).
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. doi:10.1038/35057062
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448. doi:10.1073/pnas.85.8.2444
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215(3):403–410.
- Friedberg I: Automated protein function prediction – the genomic challenge. Briefings in Bioinformatics 2006, 7(3):225–242. doi:10.1093/bib/bbl004
- Valencia A: Automatic annotation of protein function. Current Opinion in Structural Biology 2005, 15: 267–274. doi:10.1016/j.sbi.2005.05.010
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96: 2896–2901. doi:10.1073/pnas.96.6.2896
- Marcotte EM, Pellegrini M, Ng H-L, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285: 751–753. doi:10.1126/science.285.5428.751
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285–4288. doi:10.1073/pnas.96.8.4285
- Eisen MB, Spellman PT, Brown PO, Botstein D: Clustering analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. doi:10.1073/pnas.95.25.14863
- Parrish JR, Gulyas KD, Finley RL: Yeast two-hybrid contributions to interactome mapping. Current Opinion in Biotechnology 2006, 17: 387–393. doi:10.1016/j.copbio.2006.06.006
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. doi:10.1038/nature01511
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature Biotechnology 2000, 18: 1257–1261. doi:10.1038/82360
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 2001, 18: 523–531. doi:10.1002/yea.706
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nature Biotechnology 2003, 21(6):697–700. doi:10.1038/nbt825
- Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S: Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci USA 2004, 101(9):2888–2893. doi:10.1073/pnas.0307326101
- Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21: i302-i310. doi:10.1093/bioinformatics/bti1054
- Deng M, Zhang K, Mehta S, Chen T, Sun F: Prediction of protein function using protein-protein interaction data. Journal of Computational Biology 2003, 10(6):947–960. doi:10.1089/106652703322756168
- Lee H, Tu Z, Deng M, Sun F, Chen T: Diffusion kernel-based logistic regression models for protein function prediction. OMICS A Journal of Integrative Biology 2006, 10(1):40–55. doi:10.1089/omi.2006.10.40
- Chua HN, Sung W-K, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623–1630. doi:10.1093/bioinformatics/btl145
- Kirac M, Ozsoyoglu G: Protein function prediction based on patterns in biological networks. Proceedings of 12th International Conference on Research in Computational Molecular Biology (RECOMB) 2008, 197–213.
- Bard JBL, Rhee SY: Ontologies in biology: design, applications and future challenges. Nature Reviews: Genetics 2004, 5: 213–222. doi:10.1038/nrg1295
- The Gene Ontology Consortium: The Gene Ontology project in 2008. Nucleic Acids Research 2008, 36: D440-D444. doi:10.1093/nar/gkm883
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283. doi:10.1093/bioinformatics/btg153
- Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22(8):967–973. doi:10.1093/bioinformatics/btl042
- Cho Y-R, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007, 8:265.
- Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 23(10).
- Tao Y, Sam L, Li J, Friedman C, Lussier YA: Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics 2007, 23: i529-i538. doi:10.1093/bioinformatics/btm195
- Wu X, Zhu L, Guo J, Zhang D-Y, Lin K: Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations. Nucleic Acids Research 2006, 34(7):2137–2150. doi:10.1093/nar/gkl219
- Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer KFX, Munsterkotter M, Ruepp A, Spannagl M, Stumpflen V, Rattei T: MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Research 2008, 36: D196-D201. doi:10.1093/nar/gkm980
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Research 2004, 32: D449-D451. doi:10.1093/nar/gkh086
- Breitkreutz B-J, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V, Dolinski K, Tyers M: The BioGRID interaction database: 2008 update. Nucleic Acids Research 2008, 36: D637-D640. doi:10.1093/nar/gkm1001
- Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW: The FunCat: a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research 2004, 32(18):5539–5545. doi:10.1093/nar/gkh894
- Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA 2003, 100(8):4372–4376. doi:10.1073/pnas.0735871100
- Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data? Journal of Molecular Biology 2003, 327: 919–923. doi:10.1016/S0022-2836(03)00239-0
- Resnik P: Using information content to evaluate semantic similarity in a taxonomy. Proceedings of 14th International Joint Conference on Artificial Intelligence 1995, 448–453.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.