Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions
© Ozawa et al; licensee BioMed Central Ltd. 2010
Received: 9 October 2009
Accepted: 28 June 2010
Published: 28 June 2010
High-throughput methods for detecting protein-protein interactions enable us to obtain large interaction networks and to computationally identify associations of proteins as protein complexes. Although there are methods that extract protein complexes as sets of proteins from interaction networks, the extracted complexes may include false positives, because these methods do not account for the structural limitations of the proteins and thus do not check whether the proteins in an extracted complex can simultaneously bind to each other. In addition, there have been few attempts to gain deeper insights into protein complexes, such as the topology of their protein-protein interactions or the domain-domain interactions that mediate them.
Here, we introduce a combinatorial approach for the prediction of protein complexes that focuses not only on determining the member proteins of complexes but also on the organization of the domain-domain interactions (DDIs) and protein-protein interactions (PPIs) within the complexes. Our method analyzes complex candidates predicted by existing methods. It searches for optimal combinations of domain-domain interactions in the candidates, based on the assumption that the proteins in a candidate can form a true protein complex if each of the domains is used by a single protein interaction. This optimization problem was mathematically formulated and solved using binary integer linear programming. Using publicly available sets of yeast protein-protein interactions and domain-domain interactions, we extracted protein complex candidates with an accuracy that is twice the average accuracy of the existing methods MCL, MCODE, and clustering coefficient. Although configuring the parameters of each algorithm slightly improved its precision, our method showed better precision for most values of the parameters.
Our combinatorial approach provides better accuracy for the prediction of protein complexes and also makes it possible to identify both the direct PPIs and the DDIs that mediate them in complexes.
Recently developed high-throughput methods for obtaining protein-protein interactions (PPIs), such as yeast two-hybrid screening and mass spectrometry, have provided a global view of the interaction network [1–5]. As a PPI network grows, it becomes increasingly important to detect functional modules in order to understand cellular organization and its dynamics. Protein complexes are clusters of multiple proteins, and they often play a crucial part in basal cellular mechanisms. Therefore, computational methods to predict protein complexes are becoming important.
There are four steps in characterizing a protein complex. The first step is to identify its member proteins. The second step is to determine its topology by identifying pairs of proteins that interact directly. The third step is to identify the DDIs that mediate these direct interactions, and the final step is to specify the complete 3D structure of the complex. Most of the previous research on computational prediction of protein complexes has focused on the first step, and various methods such as MCODE, MCL, and RNSC were developed, mainly based on graph theory [8–14]. The candidate complexes predicted by these first-step methods contain a non-negligible number of false positives. One reason for these errors is that all of these methods simply extract locally dense regions as protein complexes, on the assumption that proteins in complexes are highly interconnected, and do not consider the structural limitations of interactions in the complex. Therefore, methods focusing on the higher steps are also important for improving the accuracy of the predictions. However, there are few methods focusing on the second step [15–17], and there is no effective and comprehensive method for the third or fourth step [7, 15]. In the present report, we use a combinatorial approach focusing not only on the first step but also on the second and third steps. Our method analyzes complex candidates predicted by the first-step methods. It searches for optimal combinations of domain-domain interactions within the candidates, based on the assumption that proteins in the candidates can form a true protein complex if each of the domains is used by a single protein interaction [18, 19]. This optimization problem was mathematically formulated and solved via binary integer linear programming. As a result, our method can eliminate false positives produced by the first-step methods and predict the detailed DDI/PPI organization of the protein complexes (i.e., it can identify both the direct PPIs and the DDIs that mediate them in a given complex).
Overview of protein complex prediction
The results of existing methods for predicting protein complexes include significant numbers of false positives, because these methods merely extract locally dense regions of the network as protein complexes, assuming that all of the proteins in a complex are highly interconnected, without considering any structural limitations on interactions within the complex.
First, the dense regions in the PPI network are extracted by using existing graph algorithms, assuming that nodes and edges correspond to proteins and interactions, respectively. Since proteins participating in the complex are likely to interact with each other frequently, dense regions are likely to correspond to protein complexes [8, 9]. Thus, dense regions that are extracted by using existing methods can be assumed to be initial candidates for protein complexes.
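As an illustration of what "dense region" means in this first step, the following minimal sketch computes the local clustering coefficient of a node, the quantity underlying one of the first-step methods compared in this work. The toy network and the protein names `P1` to `P4` are hypothetical, and the code is not the implementation used for the reported results.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn a list of undirected PPIs into a dict: protein -> set of partners."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that also interact with each other."""
    neighbours = adj.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs to count
    links = sum(1 for u, v in combinations(sorted(neighbours), 2)
                if v in adj.get(u, set()))
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network: a triangle P1-P2-P3 with P4 attached to P3.
toy_ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
adj = build_adjacency(toy_ppis)
for protein in sorted(adj):
    print(protein, round(clustering_coefficient(adj, protein), 3))
```

A node with a high coefficient, together with its neighbours, is the kind of locally dense region that would be passed on as an initial candidate.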
Second, the initial candidates are verified by integrating the DDIs into the candidates by considering the physical binding domains based on two assumptions: (1) a protein participating in the candidate can bind to another protein within the same candidate if there is a potential DDI between these two proteins, and (2) each domain can participate in only one DDI at a time. The second assumption is based on the tendency of the binding interfaces to be exclusive [18, 19], since we roughly equate a single domain with a single binding interface. Although there are cases where a single domain binds multiple domains simultaneously, we can greatly simplify our calculations by discounting those cases.
The initial candidate will be accepted as a final complex prediction if three or more proteins in the candidate are predicted to be connected by DDIs. In this way, we can consider the physical bindings in the protein complex.
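The verification idea described above can be sketched on a toy example. The code below is a brute-force stand-in for the binary integer program, not the formulation used in this work: it enumerates every subset of potential DDIs, discards subsets in which a domain is used more than once, and keeps the subset that connects the most proteins. The objective (size of the largest connected group), the protein names, and the domain names `D1` and `D2` are all assumptions made for illustration.

```python
from itertools import combinations

def candidate_edges(proteins, ddis):
    """List potential DDIs between domain instances of different proteins.

    proteins: dict protein -> list of domain types (one entry per domain).
    ddis: set of (domain_type, domain_type) pairs known to interact.
    """
    edges = []
    for a, b in combinations(sorted(proteins), 2):
        for i, da in enumerate(proteins[a]):
            for j, db in enumerate(proteins[b]):
                if (da, db) in ddis or (db, da) in ddis:
                    edges.append(((a, i), (b, j)))
    return edges

def largest_component(proteins, chosen):
    """Largest set of proteins connected through the chosen DDIs."""
    adj = {p: set() for p in proteins}
    for (a, _), (b, _) in chosen:
        adj[a].add(b)
        adj[b].add(a)
    best, seen = set(), set()
    for start in proteins:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p not in comp:
                comp.add(p)
                stack.extend(adj[p] - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

def verify_candidate(proteins, ddis, min_size=3):
    """Exhaustive search; only practical for very small candidates."""
    edges = candidate_edges(proteins, ddis)
    best_comp, best_edges = set(), ()
    for r in range(len(edges) + 1):
        for chosen in combinations(edges, r):
            used = [domain for edge in chosen for domain in edge]
            if len(used) != len(set(used)):
                continue  # a domain would take part in two DDIs at once
            comp = largest_component(proteins, chosen)
            if len(comp) > len(best_comp):
                best_comp, best_edges = comp, chosen
    return len(best_comp) >= min_size, best_comp, best_edges

ddis = {("D1", "D2")}
chain = {"A": ["D1"], "B": ["D2", "D1"], "C": ["D2"]}  # B has two domains
hub = {"H": ["D1"], "X": ["D2"], "Y": ["D2"]}          # H has a single domain
print(verify_candidate(chain, ddis)[0], verify_candidate(hub, ddis)[0])
```

In the `chain` candidate, protein B can bind A and C through two different domains, so three proteins are connected and the candidate is accepted. In the `hub` candidate, H would have to use its single domain for two interactions at once, so at most two proteins can be connected and the candidate is rejected.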
Formulation of the second step as a binary integer program
PPI and DDI datasets
We downloaded 35,353 yeast PPIs from the BioGrid database. We used only the PPIs that were derived from mass spectrometry and two-hybrid experiments, since these PPIs represent physical interactions.
Table: Overview of PPI and DDI datasets. Entries: PPI (BioGrid Release 2.0.40); A: DDI (iPfam); B: DDI (InterDom v2.0: High Confidence); C: DDI (InterDom v2.0: All); DDI (A + B); DDI (A + C).
Known complex datasets
To evaluate the performance of protein complex prediction, we used 763 known yeast protein complexes from the EMBL database (http://yeast-complexes.embl.de/), because it is the most comprehensive yeast protein complex database.
The Gene Ontology annotations that were downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ on October 28, 2008, were used to find proteins whose functions are unclear. Proteins that have "GO:0003674" or "GO:0005554" for their ID are regarded as uncharacterized proteins, since they indicate "molecular function unknown" in the Gene Ontology annotations.
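The filter described above can be sketched as follows. This is a minimal illustration that assumes the annotations have already been parsed into a mapping from protein to GO IDs; the protein names `PROT_X` and `PROT_Y` are hypothetical, and treating a protein as uncharacterized when any of its IDs is in the "unknown" set is one possible reading of the rule.

```python
# GO IDs that the text treats as "molecular function unknown".
UNKNOWN_FUNCTION_IDS = {"GO:0003674", "GO:0005554"}

def is_uncharacterized(go_ids):
    """True if any annotation of the protein is one of the 'unknown' IDs."""
    return any(go_id in UNKNOWN_FUNCTION_IDS for go_id in go_ids)

# Hypothetical annotations: protein -> list of GO molecular function IDs.
annotations = {
    "PROT_X": ["GO:0003674"],  # only the 'unknown' placeholder
    "PROT_Y": ["GO:0005515"],  # protein binding
}
uncharacterized = sorted(p for p, ids in annotations.items()
                         if is_uncharacterized(ids))
print(uncharacterized)
```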
Parameters for prediction algorithms
Table: Configurable parameters for each algorithm and their optimal values. Entries: Node Score Cut-off.
Evaluation of the predictions
The overlapping criterion is computed from N_p and N_k, the sets of proteins in a predicted complex P and a known complex K, respectively. We regarded a predicted complex as matching a known complex if the overlapping criterion was greater than the threshold of 0.25, which was taken from the work of Chua et al. In this evaluation approach, multiple predicted complexes may match the same known complex.
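The matching procedure can be sketched as follows. The exact expression of the overlapping criterion is not reproduced in this text, so the sketch assumes the overlap score commonly used in this literature, |N_p ∩ N_k|² / (|N_p| × |N_k|). The definitions of precision (fraction of predicted complexes that match a known complex) and recall (fraction of known complexes matched by a prediction) are likewise assumptions, and the toy complexes are hypothetical.

```python
def overlap_score(predicted, known):
    """Assumed overlap score: |Np & Nk|^2 / (|Np| * |Nk|)."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0
    shared = len(p & k)
    return shared * shared / (len(p) * len(k))

def precision_recall(predicted_complexes, known_complexes, threshold=0.25):
    """Precision and recall under the assumed definitions stated above."""
    matched_predicted = sum(
        1 for p in predicted_complexes
        if any(overlap_score(p, k) > threshold for k in known_complexes))
    matched_known = sum(
        1 for k in known_complexes
        if any(overlap_score(p, k) > threshold for p in predicted_complexes))
    precision = (matched_predicted / len(predicted_complexes)
                 if predicted_complexes else 0.0)
    recall = matched_known / len(known_complexes) if known_complexes else 0.0
    return precision, recall

# Hypothetical toy data.
predicted = [{"A", "B", "C"}, {"X", "Y", "Z"}]
known = [{"A", "B", "C", "D"}, {"M", "N"}]
print(overlap_score(predicted[0], known[0]))  # 3 * 3 / (3 * 4)
print(precision_recall(predicted, known))
```

Because the comparison is strict, a pair scoring exactly 0.25 does not count as a match.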
To investigate the enriched level of specific protein function in each predicted complex, we calculated the ratio of the protein pairs that have the same function for each predicted result with our method. We regarded a protein pair as having the same function if their molecular functions are the same in the Gene Ontology annotations.
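A minimal sketch of this ratio is given below. It assumes that each protein's molecular functions are available as a set of GO IDs and that two proteins have "the same function" when these sets are non-empty and identical; the protein names and the placeholder labels `F1` and `F2` are hypothetical.

```python
from itertools import combinations

def same_function_ratio(members, functions):
    """Fraction of protein pairs in a complex with identical function sets.

    members: iterable of protein names.
    functions: dict protein -> frozenset of molecular function labels.
    """
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 0.0
    same = 0
    for a, b in pairs:
        fa, fb = functions.get(a), functions.get(b)
        if fa and fa == fb:  # unannotated proteins never count as "same"
            same += 1
    return same / len(pairs)

# Hypothetical complex: A and B share a function, C differs.
functions = {
    "A": frozenset({"F1"}),
    "B": frozenset({"F1"}),
    "C": frozenset({"F2"}),
}
print(same_function_ratio(["A", "B", "C"], functions))
```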
Protein complex prediction and its performance
Table: Summary of protein complex predictions. Entries: Original output of algorithm*; Verified with DDI (A); Verified with DDI (A + B); Verified with DDI (A + C).
As the complex size increases, the number of interactions among the member proteins in the complex may also increase. Such large complexes require many domains to bind to each other simultaneously if the binding capability of the domains is limited. In other words, a cluster that contains many proteins is unlikely to have all of the possible interactions simultaneously active, because the ability of each protein to bind is limited by its binding interfaces. Therefore, larger complexes are less likely to be verified by our approach, since it assumes that each domain can participate in only one DDI at a time. In fact, both the maximum size and the average size of the complexes predicted by our method are smaller than those predicted by existing methods (Table 1, Additional File 1), indicating that our approach is more successful with smaller complexes.
On average, our approach was twice as precise as the existing algorithms. Although configuring the parameters of each algorithm slightly improved its precision, most of the precision values remained lower than the precision of our approach. Our approach showed better precision with all parameter values except when the number of predicted candidates was 0. In contrast, the recall of our approach was lower than that of the existing algorithms (on average, 31% of the recall of the existing algorithms). Our approach drastically reduced the number of candidates (by 87%), whereas the reduction in recall was comparatively small (69%). Specifically, when our method was applied after MCL analysis (Inflation = 3.6), the recall reduction was only 80%, whereas the reduction rate of candidates was 98%. In addition, the recall of our approach improved as the number of DDIs used for the verification increased (Additional file 2). In contrast, the precision of our method was almost constant, regardless of the number of DDIs used.
Estimation of false negative rates of our method
Table: Number of false negatives for each method. Columns: Number of false negatives of our method; Number of false negatives of existing methods; α: Number of "net" false negatives caused by the assumption *; β: Number of true positives of existing methods. Rows: Clustering coefficient verified with DDI (A); MCL verified with DDI (A); MCODE verified with DDI (A); Clustering coefficient verified with DDI (A + B); MCL verified with DDI (A + B); MCODE verified with DDI (A + B); Clustering coefficient verified with DDI (A + C); MCL verified with DDI (A + C); MCODE verified with DDI (A + C).
Functional analysis of predicted protein complexes
Table: Complexes that contain an uncharacterized protein. Columns: Protein Complex Members; Function of other members.
APL1, APS2, APM4, APL3; protein transporter activity
SWM1, CDC27, CDC16, CDC23
SKP1, DAS1, YLR352W
PWP2, NOP14, MPP10
PWP2, UTP18, NOP58
We introduced a combinatorial approach for the prediction of protein complexes that focuses not only on determining the member proteins of complexes but also on the DDI/PPI organization of the complexes, by integrating our newly developed method with existing methods. Our method allows us to identify both the direct PPIs and the DDIs that mediate them in a given complex. As a result, our method can eliminate false positives produced by the first-step methods and can provide more accurate predictions. For efficient prediction, we also formalized the protein complex prediction problem, with its consideration of the physical binding domains, as a binary integer programming problem, so that heuristic approaches for integer programming can be applied if the computational complexity is problematic [28, 29]. Although the assumption that each binding domain is exclusive resulted in missing some of the complexes in which a single domain binds to multiple domains at the same time, the restriction allows for an efficient formulation of the problem. The rate of false negatives related to the assumption was at most 45.7%, but it was reduced to 20.5% with the largest DDI dataset.
Our approach predicted protein complexes with about twice the accuracy of the original output of the existing methods, and our approach always showed better precision for all of the values of the configurable parameters except for the point where the number of predicted candidates was 0. Also, our approach showed better concordance rate of the functions of the protein pairs compared to existing methods. Particularly with the optimized parameters, our approach showed more than 25% better performance than existing methods for the DDI dataset with the highest confidence.
Although the recall of our approach was lower than that of the existing methods, it improved as the number of DDIs used for verification increased. Thus, we believe that the recall of our approach will improve as the number of available DDIs increases. The number may be increased not only by biochemical experiments but also by computational predictions. For example, Guimarães et al. developed a prediction method for DDIs with a parsimony approach that economizes as much as possible on the use of DDIs [30]. They formulated the problem as a linear program for which the objective function is to minimize the number of DDIs necessary to justify the underlying PPIs. There are also other computational methods to predict DDIs that could enhance the results of our approach [30–33].
To predict protein complexes, several methods employ algorithms to detect densely connected regions in a PPI network. However, the average density of real protein complexes is not particularly high. For example, the density of protein complexes in yeast is around 0.55. Thus, the extraction of dense regions in the interaction network is not sufficient for accurate predictions of the protein complexes, and pre- or post-processing of the interaction network must be combined with these graph methods.
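To make the notion of density concrete, the sketch below uses the standard graph definition, the number of observed interactions among the members divided by the number of possible pairs, 2E / (n(n − 1)). That this is the definition behind the value quoted above is an assumption, and the toy complex is hypothetical.

```python
def complex_density(members, ppis):
    """Density of the subnetwork induced by `members`: 2E / (n * (n - 1))."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    # Count each undirected interaction between two distinct members once.
    internal = {frozenset((u, v)) for u, v in ppis
                if u in members and v in members and u != v}
    return 2.0 * len(internal) / (n * (n - 1))

# Hypothetical complex of four proteins connected as a chain: 3 of 6 pairs.
toy_ppis = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(complex_density(["A", "B", "C", "D"], toy_ppis))
```

A chain of four proteins already has a density of 0.5, which shows that a complex can be physically connected without being close to a clique.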
There are several methods to extract a high-confidence network from the PPI network by pre-processing [34, 35]. These methods should also be useful for predictions. For example, Chua et al. filtered a PPI network with a value called the FS weight prior to protein complex prediction and improved the accuracy of their predictions [13, 35]. Moschopoulos et al. developed a tool called GIBA that provides post-processing with individual filters or combinations of four different heuristic filters, and this also improved the accuracy of the predictions. Our method can likewise be used for post-processing, and it can be combined with other methods to predict protein complexes more accurately. A key difference between our method and these other methods is that our method not only improves the accuracy of the predictions but also reveals the organization of the protein complex, including the DDIs that mediate the PPIs. The complexes predicted by our method reflect the structural characteristics of complexes in the cell and may provide deeper insights into how proteins are organized to function in the cell.
We introduced a new approach for the prediction of protein complexes. It provides both accurate predictions of protein complexes and deeper insight into each protein complex by identifying the direct PPIs and DDIs that mediate the complexes.
We are grateful to Prof. Yasuhiro Naito and Prof. Akio Kanai for helpful discussions. This research was supported in part by a grant for the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and by research funds from the Yamagata Prefectural Government and Tsuruoka City, Japan.
1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. doi:10.1038/35001009
2. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98: 4569. doi:10.1073/pnas.061034498
3. Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. doi:10.1038/415141a
4. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180–183. doi:10.1038/415180a
5. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. doi:10.1038/nature04532
6. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MH, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440: 637–643. doi:10.1038/nature04670
7. Aloy P, Russell RB: Structural systems biology: modelling protein interactions. Nature Reviews Molecular Cell Biology 2006, 7: 188–197. doi:10.1038/nrm1859
8. Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4: 2. doi:10.1186/1471-2105-4-2
9. Przulj N, Wigle DA, Jurisica I: Functional topology in a network of protein interactions. Bioinformatics 2004, 20: 340–348. doi:10.1093/bioinformatics/btg415
10. Dongen Sv: Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht, Centre for Mathematics and Computer Science; 2000.
11. Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 2003, 100: 12123–12128. doi:10.1073/pnas.2032324100
12. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20: 3013–3020. doi:10.1093/bioinformatics/bth351
13. Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. Journal of Bioinformatics and Computational Biology 2008, 6(3): 435–466. doi:10.1142/S0219720008003497
14. Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7: 488. doi:10.1186/1471-2105-7-488
15. Aloy P, Bottcher B, Ceulemans H, Leutwein C, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes in yeast. Science 2004, 303: 2026–2029. doi:10.1126/science.1092645
16. Bernard A, Vaughn DS, Hartemink AJ: Reconstructing the topology of protein complexes. Lecture Notes in Computer Science 2007, 4453: 32.
17. Friedel CC, Zimmer R: Identifying the topology of protein complexes from affinity purification assays. Bioinformatics 2009, 25(16): 2140–2146. doi:10.1093/bioinformatics/btp353
18. Kim PM, Lu LJ, Xia Y, Gerstein MB: Relating three-dimensional structures to protein networks provides evolutionary insights. Science 2006, 314: 1938–1941. doi:10.1126/science.1136174
19. Sprinzak E, Altuvia Y, Margalit H: Characterization and prediction of protein-protein interactions within and between complexes. Proc Natl Acad Sci USA 2006, 103: 14718–14723. doi:10.1073/pnas.0603352103
20. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34: D535–D539. doi:10.1093/nar/gkj109
21. Finn RD, Marshall M, Bateman A: iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005, 21: 410–412. doi:10.1093/bioinformatics/bti011
22. Ng SK, Zhang Z, Tan SH, Lin K: InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res 2003, 31: 251. doi:10.1093/nar/gkg079
23. Saito R, Suzuki H, Hayashizaki Y: Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Res 2002, 30: 1163. doi:10.1093/nar/30.5.1163
24. Yeung BG, Phan HL, Payne GS: Adaptor complex-independent clathrin function in yeast. Molecular Biology of the Cell 1999, 10: 3643.
25. Pearse BM, Robinson MS: Purification and properties of 100-kd proteins from coated vesicles and their reconstitution with clathrin. EMBO J 1984, 3: 1951–1957.
26. Stevens TH, Forgac M: Structure, function and regulation of the vacuolar (H+)-ATPase. Annual Review of Cell and Developmental Biology 1997, 13: 779–808. doi:10.1146/annurev.cellbio.13.1.779
27. Forgac M: Structure and properties of the vacuolar (H+)-ATPases. Journal of Biological Chemistry 1999, 274: 12951–12954. doi:10.1074/jbc.274.19.12951
28. Glover F: Heuristics for integer programming using surrogate constraints. Decision Sciences 1977, 8: 156–166. doi:10.1111/j.1540-5915.1977.tb01074.x
29. Rodosek R, Wallace MG, Hajian MT: A new approach to integrating mixed integer programming and constraint logic programming. Annals of Operations Research 1999, 86: 63–87. doi:10.1023/A:1018904229454
30. Guimarães K, Jothi R, Zotenko E, Przytycka T: Predicting domain-domain interactions using a parsimony approach. Genome Biology 2006, 7: R104. doi:10.1186/gb-2006-7-11-r104
31. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12: 1540–1548. doi:10.1101/gr.153002
32. Lee H, Deng M, Sun F, Chen T: An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics 2006, 7: 269. doi:10.1186/1471-2105-7-269
33. Schelhorn SE, Lengauer T, Albrecht M: An integrative approach for predicting interactions of protein regions. Bioinformatics 2008, 24: i35. doi:10.1093/bioinformatics/btn290
34. Bader JS, Chaudhuri A, Rothberg JM, Chant J: Gaining confidence in high-throughput protein interaction networks. Nature Biotechnology 2003, 22: 78–85. doi:10.1038/nbt924
35. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22: 1623–1630. doi:10.1093/bioinformatics/btl145
36. Moschopoulos C, Pavlopoulos G, Schneider R, Likothanassis S, Kossida S: GIBA: a clustering tool for detecting protein complexes. BMC Bioinformatics 2009, 10: S11. doi:10.1186/1471-2105-10-S6-S11
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.