Inferring protein function by domain context similarities in protein-protein interaction networks
© Zhang et al; licensee BioMed Central Ltd. 2009
Received: 26 March 2009
Accepted: 2 December 2009
Published: 2 December 2009
Genome sequencing projects generate massive amounts of sequence data but there are still many proteins whose functions remain unknown. The availability of large scale protein-protein interaction data sets makes it possible to develop new function prediction methods based on protein-protein interaction (PPI) networks. Although several existing methods combine multiple information resources, there is no study that integrates protein domain information and PPI networks to predict protein functions.
The domain context similarity can be a useful index to predict protein function similarity. The prediction accuracy of our method in yeast is between 63% and 67%, and the method outperforms the other methods in terms of ROC curves.
This paper presents a novel protein function prediction method that combines protein domain composition information and PPI networks. Performance evaluations show that this method outperforms existing methods.
Genome sequencing projects are generating massive amounts of sequence data, and the functional annotation of these sequences has become one of the most challenging tasks, especially for the many proteins whose functions remain unknown. Traditional computational methods have used sequence features and machine learning algorithms to predict functions. In recent years, high-throughput technologies, such as yeast two-hybrid, have provided large scale protein-protein interaction data, making it possible to develop new function prediction methods based on protein-protein interaction (PPI) networks [1, 2].
Existing protein function prediction methods based on PPI can be categorized into two classes: direct methods, based only on the protein interactions, and module-assisted methods. Direct methods infer protein functions directly from interactions in the PPI networks, while module-assisted methods first try to find functional modules in the PPI networks and then assign protein functions based on the module functions.
Direct methods are based on the assumption that interacting proteins probably have identical or similar functions [4–7]. This assumption is supported by previous studies, which show that 70% to 80% of proteins share at least one identical function with their interacting partners. Schwikowski et al used a neighbor counting method to predict protein functions. They took up to the three most frequent functions of interacting partners as indicators of the function of each protein, which turned out to cover over 70% of the known functions. Hishigaki et al predicted protein functions by computing the Chi-square statistic as an indicator of functions that were significantly over-represented among neighboring proteins. Chua et al investigated the relationship between functional similarity and network distance. They used functional information from the direct (distance 1) and indirect (distance 2) neighbors of a protein, giving different weights to different network distances.
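The neighbor counting idea described above can be illustrated with a minimal sketch. It is not the implementation used in the cited study: the function name, the edge-list input format, the tie-breaking rule and the toy protein and function names are all assumptions made for illustration.

```python
from collections import Counter

def neighbor_counting(protein, interactions, annotations, top_k=3):
    """Return up to top_k functions that are most frequent among a protein's neighbors."""
    # Collect direct interaction partners from an undirected edge list.
    neighbors = set()
    for u, v in interactions:
        if u == protein and v != protein:
            neighbors.add(v)
        elif v == protein and u != protein:
            neighbors.add(u)
    # Count each function once per annotated neighbor.
    counts = Counter()
    for n in neighbors:
        counts.update(set(annotations.get(n, ())))
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [func for func, _ in ranked[:top_k]]

# Toy data (hypothetical protein and function names).
edges = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
known = {"P2": {"transport", "binding"}, "P3": {"transport"}, "P4": {"binding", "kinase"}}
predicted = neighbor_counting("P1", edges, known)
```

In this toy network, P1 has three annotated neighbors, and the functions are ranked by how many of those neighbors carry them.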
Vazquez et al assigned functions to proteins via an iterative algorithm that maximizes the number of edges connecting proteins with the same function. Other graph-based methods include those of Karaoz et al and Nabieva et al.
Instead of predicting individual protein functions, module-assisted methods first identify functional modules in PPI networks and then assign functions to the proteins according to the functions of the module members. These methods are based on previous observations that a group of cellular components and their interactions can usually be attributed to a specific function [3, 12, 13]. Module-assisted methods differ mainly in how they identify functional modules, which divides them into those based on network topology only and those that integrate multiple data sources. Network topology based methods include MCODE, a module-assisted method based on clustering coefficients, the clustering method of Rives et al and the hierarchical clustering method of Spirin et al. Among the methods that integrate multiple data sources, Ge et al showed that proteins having similar functions tend to have similar expression patterns, which can be used to predict protein functions. Ideker et al developed a framework to identify active sub-networks by detecting significant changes in expression over a particular set of conditions. Hanisch et al applied a co-clustering methodology that combined similarities in gene expression patterns and network topologies. Hierarchical clustering was then used to define functional modules.
Although several existing methods have combined multiple information resources, such as gene expression information, gene regulatory networks and PPI networks, none of them have yet integrated protein domain information and PPI networks to predict protein functions. This paper presents a novel protein function prediction method that uses protein domain composition and PPI networks. This paper first demonstrates that proteins having similar functions are often in similar domain contexts in PPI networks and then develops the protein function prediction method based on this observation. The method gives satisfactory results compared to several existing methods.
Yeast PPI network data were obtained from the DIP database. The network included 4,389 proteins and 14,338 protein-protein interactions. The yeast PPI network was chosen because it is comparatively more complete, with fewer missing interactions. Nearly 70% of the 6,375 ORFs of yeast are covered by the yeast PPI network, which is the highest coverage ratio among PPI networks of all organisms. In addition, the yeast PPI network is the one most frequently used in previous protein function prediction studies, which allows accuracy comparison with other methods.
The domain annotation information was retrieved from the Pfam database [20, 21]. The HMMER software package was used to annotate domains in the yeast ORFs. In total, 6,402 domains of 4,618 domain types were obtained from 3,901 proteins. The protein function annotation information was provided by the Gene Ontology database.
In the definition of the domain context similarity f, M is the number of domain types in the PPI network. Given proteins A and B, SA and SB are the sets of domain types found in A's neighbors and in B's neighbors, respectively. SA contains a domain types and SB contains b domain types. The intersection of SA and SB is S, which contains s domain types. C(M, s) denotes the binomial coefficient, that is, the number of ways to choose s items from M. A larger f indicates a greater probability that A and B share similar functions.
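The expression for f is not reproduced in this text, so the following is only a sketch under a stated assumption: that f is the negative logarithm of the probability that two random sets of a and b domain types, drawn from M types, share exactly s types. This form uses only the symbols defined above, but it may differ from the authors' formula. The function names, the edge-list input and the toy identifiers are hypothetical.

```python
from math import comb, exp, log

def domain_context(protein, interactions, domains):
    """Set of domain types found in the interaction partners of `protein`."""
    context = set()
    for u, v in interactions:
        if u == protein and v != protein:
            context |= set(domains.get(v, ()))
        elif v == protein and u != protein:
            context |= set(domains.get(u, ()))
    return context

def log_overlap_probability(M, a, b, s):
    """Natural log of P(two random sets of a and b types out of M share exactly s)."""
    # Sum of logs of exact integers avoids floating-point underflow for large M.
    return (log(comb(M, s)) + log(comb(M - s, a - s)) + log(comb(M - a, b - s))
            - log(comb(M, a)) - log(comb(M, b)))

def domain_context_similarity(M, S_A, S_B):
    """ASSUMED score: f = -log P(overlap == s); not confirmed by the source text."""
    s = len(S_A & S_B)
    return -log_overlap_probability(M, len(S_A), len(S_B), s)

# Toy network with hypothetical protein names and Pfam-style identifiers.
edges = [("A", "X"), ("A", "Y"), ("B", "Y"), ("B", "Z")]
domains = {"X": {"PF1", "PF2"}, "Y": {"PF3"}, "Z": {"PF2", "PF4"}}
S_A = domain_context("A", edges, domains)  # domain types of X and Y
S_B = domain_context("B", edges, domains)  # domain types of Y and Z
f = domain_context_similarity(10, S_A, S_B)
p_exact = exp(-f)  # probability of sharing exactly 2 of 10 types by chance
```

Under this assumption, a chance-level overlap gives a small f, while an overlap that is unlikely for random sets gives a large f. Note that an unusually small overlap would also score highly under this form, which is one reason to treat it as illustrative only.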
For each GO term, there is a positive data set composed of the proteins annotated with that term, and a negative data set composed of the proteins lacking that annotation. For example, GO:0009277 is used to describe 107 yeast proteins, so these proteins were treated as positive samples. Since some GO terms contain only a few proteins and other GO terms are too general, only GO terms containing 10-200 proteins were considered.
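The construction of the positive and negative data sets can be sketched as follows. The function name and the dictionary-based input format are assumptions, the GO identifiers in the toy data are made up, and treating the 10-200 size limits as inclusive is also an assumption.

```python
def build_datasets(go_annotations, all_proteins, min_size=10, max_size=200):
    """Map each retained GO term to a (positive set, negative set) pair."""
    universe = set(all_proteins)
    datasets = {}
    for term, members in go_annotations.items():
        positives = set(members) & universe
        # Skip terms that are too small or too general (limits assumed inclusive).
        if min_size <= len(positives) <= max_size:
            datasets[term] = (positives, universe - positives)
    return datasets

# Toy data: 250 hypothetical proteins and three made-up GO terms.
proteins = [f"P{i}" for i in range(250)]
go = {
    "GO:small": proteins[:5],    # too few proteins, dropped
    "GO:mid": proteins[:12],     # kept
    "GO:broad": proteins,        # too general, dropped
}
datasets = build_datasets(go, proteins)
```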
Given a protein P with unknown function, in order to examine its function with regard to each particular GO term, the domain context similarities, f, between P and each protein in both the positive and negative data sets were calculated. The function annotation of the protein with the highest f value was then assigned to P.
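This assignment rule behaves like a nearest-neighbor classifier applied to one GO term at a time. A minimal sketch is given below, assuming that each labelled protein is represented by its domain context and that the similarity measure is passed in as a function. The Jaccard index used in the example is only a stand-in for f, and all names are hypothetical.

```python
def predict_term(query_context, positives, negatives, similarity):
    """Return True if the most similar labelled protein is a positive sample.

    `positives` and `negatives` map protein names to domain contexts (sets).
    Returns None when no labelled protein is available.
    """
    best_score, best_label = None, None
    for label, group in ((True, positives), (False, negatives)):
        for context in group.values():
            score = similarity(query_context, context)
            # Strict comparison: on ties the first protein seen is kept.
            if best_score is None or score > best_score:
                best_score, best_label = score, label
    return best_label

def jaccard(x, y):
    """Stand-in similarity for the example only; it is NOT the measure f."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# Toy data with hypothetical protein names and Pfam-style identifiers.
positives = {"A": {"PF1", "PF2"}}
negatives = {"B": {"PF9"}}
has_term = predict_term({"PF1", "PF2", "PF3"}, positives, negatives, jaccard)
```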
Seven-fold cross validation, which has been widely used in previous research [23, 24], was used to evaluate the performance of our prediction. For every GO term, both the positive and negative data sets were randomly divided into seven equal parts. Each time, six positive parts and six negative parts were used as the training data set, while the remaining parts were used as the test data set. This procedure was repeated 7 times to ensure that every part was used once as the test data set for one GO term. The whole procedure was then repeated for every GO term. The final accuracy was the average of the evaluations.
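The splitting step of this cross validation can be sketched as follows. The function names, the fixed random seeds and the toy protein names are assumptions. When a set does not divide evenly by seven, this sketch lets fold sizes differ by one, which the text does not specify.

```python
import random

def split_into_folds(items, k=7, seed=0):
    """Shuffle items reproducibly and deal them into k folds of near-equal size."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def cross_validation_splits(positives, negatives, k=7, seed=0):
    """Yield ((train_pos, train_neg), (test_pos, test_neg)) for each of k rounds."""
    pos_folds = split_into_folds(positives, k, seed)
    neg_folds = split_into_folds(negatives, k, seed + 1)
    for i in range(k):
        # All folds except fold i form the training data for this round.
        train_pos = [p for j, fold in enumerate(pos_folds) if j != i for p in fold]
        train_neg = [n for j, fold in enumerate(neg_folds) if j != i for n in fold]
        yield (train_pos, train_neg), (pos_folds[i], neg_folds[i])

# Toy data: 14 positive and 70 negative proteins (hypothetical names).
positives = [f"P{i}" for i in range(14)]
negatives = [f"N{i}" for i in range(70)]
splits = list(cross_validation_splits(positives, negatives))
```

Each protein appears in exactly one test fold, so every part is used once for testing.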
Table: Prediction performance measurements (columns: number of proteins in each GO term, number of GO terms).
The prediction accuracies are between 63% and 67%. The results show that the method is reasonably robust to the number of proteins within one GO term. As the number of proteins increases from 10-30 to 100-200, the accuracy decreases only slightly, by about 4%. The decrease in accuracy as the number of proteins in the GO term increases may be attributed to the fact that functional annotations in larger GO terms are not as specific as in smaller GO terms. Fuzzy, general annotation information may affect the prediction performance. Further investigation is required to explain this observation. In addition, the recall is higher than the precision, demonstrating that false positive predictions are more common than false negative predictions.
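The link between recall, precision and the two error types can be checked with a short sketch. The exact metric definitions used in the evaluation are not given in this text, so the standard definitions are assumed here, and the counts in the example are made up for illustration rather than taken from the results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up counts: more false positives (30) than false negatives (10).
metrics = classification_metrics(tp=40, fp=30, tn=70, fn=10)
```

Precision and recall share the numerator tp, so recall exceeds precision exactly when there are fewer false negatives than false positives, as in these counts.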
A new prediction method for protein function based on protein-protein interaction and domain context was presented in this research. Domain context similarity in the protein-protein interaction network was defined and used as an index for prediction. The underlying principle of this method is that proteins tend to interact with each other via domain-domain interactions. High-quality domain-domain interaction information may therefore improve the prediction accuracy. Riley et al developed domain pair exclusion analysis (DPEA) to infer high-confidence domain interactions from protein interactions. In addition, DIMA [27, 28] collects known and predicted domain interactions, which may be helpful if this information were used in our method.
This research also suggests several directions for future work. First, domain context similarity measurements or prediction systems can be improved to reduce false positive predictions and boost accuracy. For example, a cutoff value for domain context similarity can be introduced to improve the accuracy and to deal with proteins that have multiple functions. Since the underlying rationale of this method is domain-domain interaction, high-quality domain interactions should contribute to the accuracy. As mentioned above, the recently developed methods for inferring domain interactions [26–28] can be used in future improvements of our algorithm. Second, as shown by Chua et al, functional similarities exist between proteins at network distances equal to or larger than 2, which may be useful information to include in function prediction. Third, other data resources, such as gene expression profiles and gene regulatory networks, could be combined with domain context information to predict functions. Different weights can be assigned to different types of information. Machine learning methods, such as SVM, can also take the information listed above as input features. Finally, since protein domains are conserved and can be easily detected in various organisms, this method should be promising for comparing protein functions across species.
The availability of large scale protein-protein interaction data sets makes it possible to predict protein functions based on protein-protein interaction (PPI) networks. Several existing methods combine multiple information resources to predict protein functions. We present a novel protein function prediction method that combines protein domain composition information and PPI networks. Performance evaluations show that this method outperforms existing methods. The results also characterize the relationship between domain context similarity and protein function similarity, and they suggest several directions for future research.
This work was supported by the Natural Science Grant in the Chinese 863 project (no. 2006AA020403), the 973 project (no. 2009CB918801) and the National Natural Science Grant (no. 30770498).
- Fields S: High-throughput two-hybrid analysis. The promise and the peril. FEBS J 2005, 272(21):5391–5399. 10.1111/j.1742-4658.2005.04973.x
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Sharan R, Ideker T, Kelley B, Shamir R, Karp RM: Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J Comput Biol 2005, 12(6):835–846. 10.1089/cmb.2005.12.835
- Deng M, Zhang K, Mehta S, Chen T, Sun F: Prediction of protein function using protein-protein interaction data. J Comput Biol 2003, 10(6):947–960.
- Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19(Suppl 1):i197–204. 10.1093/bioinformatics/btg1026
- Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, Kasif S: Whole-genome annotation by using evidence integration in functional-linkage networks. Proc Natl Acad Sci USA 2004, 101(9):2888–2893. 10.1073/pnas.0307326101
- Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005, 21(Suppl 1):i302–310. 10.1093/bioinformatics/bti1054
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18(12):1257–1261. 10.1038/82360
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 2001, 18(6):523–531. 10.1002/yea.706
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623–1630. 10.1093/bioinformatics/btl145
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 2003, 21(6):697–700. 10.1038/nbt825
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 2006, 7: 207. 10.1186/1471-2105-7-207
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4: 2. 10.1186/1471-2105-4-2
- Rives AW, Galitski T: Modular organization of cellular networks. Proc Natl Acad Sci USA 2003, 100(3):1128–1133. 10.1073/pnas.0237338100
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 2003, 100(21):12123–12128. 10.1073/pnas.2032324100
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29(4):482–486. 10.1038/ng776
- Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002, 18(Suppl 1):S233–240.
- Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological networks and gene expression data. Bioinformatics 2002, 18(Suppl 1):S145–154.
- Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30(1):303–305. 10.1093/nar/30.1.303
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2008, (36 Database):D281–288.
- Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28(3):405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556
- Guo J, Chen H, Sun Z, Lin Y: A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 2004, 54(4):738–743. 10.1002/prot.10634
- Hua S, Sun Z: A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 2001, 308(2):397–407. 10.1006/jmbi.2001.4580
- Deng M, Tu Z, Sun F, Chen T: Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics 2004, 20(6):895–902. 10.1093/bioinformatics/btg500
- Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6(10):R89. 10.1186/gb-2005-6-10-r89
- Pagel P, Oesterheld M, Stumpflen V, Frishman D: The DIMA web resource--exploring the protein domain network. Bioinformatics 2006, 22(8):997–998. 10.1093/bioinformatics/btl050
- Pagel P, Oesterheld M, Tovstukhina O, Strack N, Stumpflen V, Frishman D: DIMA 2.0--predicted and known domain interactions. Nucleic Acids Res 2008, (36 Database):D651–655.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.