Protein complex detection in PPI networks based on data integration and supervised learning method
BMC Bioinformatics volume 16, Article number: S3 (2015)
Revealing protein complexes is important for understanding the principles of cellular organization and function. High-throughput experimental techniques have produced a large number of protein interactions, which makes it possible to predict protein complexes from protein-protein interaction (PPI) networks. However, the small number of known physical interactions may limit protein complex detection.
The new PPI networks are constructed by integrating PPI datasets with the large and readily available PPI data from biomedical literature, and then the less reliable PPIs are filtered out based on the semantic similarity and topological similarity of the two interacting proteins. Finally, the supervised learning protein complex detection (SLPC) method, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks.
The experimental results of SLPC on two different categories of yeast PPI networks demonstrate the effectiveness of the approach. Compared with the original PPI networks, the best average improvements are 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and maximum matching ratio (MMR), respectively. Compared with the denoising PPI networks, the best average improvements are 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively. Compared with ClusterONE, the state-of-the-art complex detection method, on the denoising extended PPI networks, the average improvements are 26.02 and 22.40 percentage units in F-score and MMR, respectively.
The experimental results show that the performance of SLPC improves substantially when newly available PPI data from biomedical literature are integrated into the original PPI networks and the denoising PPI networks. In addition, our protein complex detection method achieves better performance than ClusterONE.
Protein-protein interactions (PPIs) are fundamental to the biological processes within a cell. Beyond individual interactions, protein interaction graphs contain much more systematic information. Complex formation is one of the typical patterns in these graphs, and many cellular functions are performed by complexes containing multiple protein interaction partners. Many automatic approaches have been proposed to detect protein complexes from PPI networks, such as CMC [1], COACH [2], MCODE [3], MCL [4], CFinder [5], and ClusterONE [6]. However, most of these methods are based on unsupervised graph clustering and predict protein complexes only with pre-defined rules. In contrast, supervised learning methods [7, 8] can utilize the information in known complexes and may achieve better performance.
At present, a large number of PPI databases have been created. Gavin [9], Krogan [10] and DIP [11] are popular PPI datasets used by protein complex detection methods. However, these datasets are sparse, since the fraction of known true physical interactions is limited [12]. For example, the average numbers of interactions per protein are 6.98, 7.86, and 9.13 in DIP, Krogan, and Gavin, respectively. Nevertheless, large numbers of PPIs can be found in the rapidly growing biomedical literature. Furthermore, since these PPI data are provided by biomedical experts, they are relatively accurate. Integrating them with the existing PPI datasets may reduce the sparsity of the PPI networks and therefore improve complex detection performance.
In this paper, we present a complex detection approach based on data integration and supervised learning. In this approach, the new PPI networks are constructed by integrating PPI datasets with the PPI data extracted by PPIExtractor [13] from biomedical literature, and then the less reliable PPIs are filtered out based on the semantic similarity and topological similarity of the two interacting proteins. Finally, the supervised learning protein complex detection (SLPC) method, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks. The experimental results demonstrate that our approach outperforms ClusterONE, the state-of-the-art method.
Extracting PPI data with PPIExtractor
In our work, we use PPIExtractor [13] to extract protein interactions from biomedical literature and then integrate them into the PPI networks. PPIExtractor is a publicly available tool for extracting new PPI data from a large collection of biomedical literature. Experimental evaluations show that it achieves state-of-the-art performance on a DIP subset with respect to comparable evaluations.
PPIExtractor contains four modules: (i) Named Entity Recognition (NER) module which aims to identify the protein names in the biomedical literature; (ii) Normalization module which determines the unique identifier of proteins identified in NER module; (iii) PPI extraction module which extracts the PPI information in the biomedical literature; (iv) PPI visualization module which displays the extracted PPI information in the form of a graph. Figure 1 shows the architecture of PPIExtractor.
A total of 127,217 PubMed abstracts were downloaded from the PubMed website (http://www.ncbi.nlm.nih.gov/pubmed) with the query string "((Saccharomyces cerevisiae) OR yeast) AND protein", and PPIExtractor extracted a total of 126,165 protein interactions from these abstracts.
Most of the protein names in the PPI databases are systematic names for nuclear-encoded ORFs, which begin with the letter 'Y' (for 'Yeast'), while those in PubMed abstracts are not. We therefore built a yeast protein alias name list with about 6,000 entries from the UniProt website (http://www.uniprot.org/uniprot/?query=yeast&sort=score). The list is used to convert the protein names in PubMed abstracts to systematic names for nuclear-encoded ORFs.
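The name conversion described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' implementation: the function names, the case-insensitive matching, and the two-entry alias table are assumptions made for the example.

```python
def build_alias_index(alias_table):
    """Map each upper-cased alias (and each systematic name) to its systematic ORF name."""
    index = {}
    for systematic, aliases in alias_table.items():
        index[systematic.upper()] = systematic
        for alias in aliases:
            # Keep the first mapping if two ORFs happen to share an alias.
            index.setdefault(alias.upper(), systematic)
    return index


def normalize_pair(pair, index):
    """Convert an extracted (name, name) pair to sorted systematic names.

    Returns None when either name is unknown or both map to the same ORF.
    """
    a = index.get(pair[0].upper())
    b = index.get(pair[1].upper())
    if a is None or b is None or a == b:
        return None
    return tuple(sorted((a, b)))


# Illustrative entries only; the real list has about 6,000 entries.
ALIASES = {"YBR160W": ["CDC28", "CDK1"], "YPR119W": ["CLB2"]}
INDEX = build_alias_index(ALIASES)
```

With this table, `normalize_pair(("Cdc28", "Clb2"), INDEX)` yields the systematic pair, and a pair containing an unknown name is dropped.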
PPI datasets
Three yeast PPI datasets, DIP, Krogan and Gavin, are used in our work. The details of these PPI datasets are shown in Table 1. For each dataset, an original PPI network and denoising PPI networks are built to verify our method's effectiveness. The original PPI networks are the three yeast PPI datasets mentioned above, used as they are. The denoising PPI networks are the three filtered PPI datasets, in which low-reliability interactions are removed with different denoising thresholds. Protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates. Therefore, a method based on both the semantic and the topological similarity of the two proteins is applied in our work to measure the reliability of an interaction. GO (Gene Ontology [14]) annotations from SGD [15] are used in this measurement. In this method, a PPI's reliability is defined by formula (1):
Here |C(m, n)| denotes the number of terms in C(m, n), the set of GO terms whose annotations include both proteins m and n. |T_i(m, n)| denotes the number of proteins in T_i(m, n), the set of proteins annotated with GO term g_i, where g_i is a term whose annotation includes both m and n. T_max denotes the maximum number of annotated proteins over all GO terms. The specificity of a GO term can be quantified by the ratio of its annotation size (|T_i(m, n)|) to the maximum annotation size (T_max), i.e. a GO term is regarded as more specific if it has fewer annotated proteins. NE(m, n) denotes the number of neighbors that m and n share. Formula (1) expresses that the interaction between proteins m and n is more reliable if the GO terms they share are more specific, or if they have more common neighbors or more common GO terms. The details of the denoising PPI networks are shown in Table 2.
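The quantities that enter the reliability score can be computed as in the sketch below. It only derives the ingredients named in the text (shared GO terms, per-term specificity ratio, common neighbors); how formula (1) combines them is not reproduced here. The data structures, the function name and the placeholder GO identifiers are assumptions for illustration.

```python
def reliability_ingredients(m, n, go_annotations, neighbors):
    """Compute the quantities described for formula (1) for proteins m and n.

    go_annotations: dict mapping a GO term to the set of proteins annotated with it
    neighbors: dict mapping a protein to the set of its interaction partners
    """
    # C(m, n): GO terms whose annotation sets contain both proteins.
    shared_terms = [g for g, prots in go_annotations.items()
                    if m in prots and n in prots]
    # T_max: largest annotation set over all GO terms.
    t_max = max(len(prots) for prots in go_annotations.values())
    # |T_i(m, n)| / T_max for each shared term; a smaller ratio means a more specific term.
    ratio = {g: len(go_annotations[g]) / t_max for g in shared_terms}
    # NE(m, n): neighbors shared by m and n (excluding m and n themselves).
    common = (neighbors.get(m, set()) & neighbors.get(n, set())) - {m, n}
    return {
        "n_shared_terms": len(shared_terms),
        "specificity_ratio": ratio,
        "n_common_neighbors": len(common),
    }


# Toy data with placeholder GO identifiers.
GO = {"GO:a": {"P1", "P2", "P3", "P4"}, "GO:b": {"P1", "P2"}, "GO:c": {"P3"}}
NEIGHBORS = {"P1": {"P2", "P3", "P4"}, "P2": {"P1", "P3"},
             "P3": {"P1", "P2"}, "P4": {"P1"}}
RESULT = reliability_ingredients("P1", "P2", GO, NEIGHBORS)
```

In the toy data, P1 and P2 share two GO terms, of which "GO:b" is the more specific one (ratio 0.5 against 1.0), and they have one common neighbor.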
Integration of the extracted PPI data into the PPI networks
PPIExtractor assigns the PPIs extracted from the biomedical literature weights representing their reliability [13]. In our study, only PPIs with weights equal to or higher than an integrating threshold are integrated into the original PPI dataset. In addition, both proteins in a new PPI must already exist in the PPI dataset. The numbers of PPIs added into the original PPI networks with different integrating thresholds are shown in Table 3.
The weights of the PPIs added into the denoising PPI networks are higher than the integrating threshold -0.6. The reason is that our SLPC method has the best performance on the original PPI networks with the integrating threshold -0.6. Moreover, the PPIs integrated into the denoising PPI networks are also filtered with different denoising thresholds. The numbers of PPIs added with different denoising thresholds are shown in Table 4.
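The integration rule can be sketched as follows: a literature PPI is added only if its weight reaches the integrating threshold and both of its proteins already occur in the dataset. The function name, the edge representation and the toy weights are assumptions for illustration.

```python
def integrate(network_edges, extracted, threshold):
    """Return the extended edge set and the number of newly added PPIs.

    network_edges: iterable of (protein, protein) pairs from the original dataset
    extracted: dict mapping (protein, protein) to the weight assigned by the extractor
    threshold: integrating threshold; PPIs with weight >= threshold are candidates
    """
    # Proteins are taken from the original network only, so added edges
    # never introduce new nodes.
    proteins = {p for edge in network_edges for p in edge}
    merged = {frozenset(edge) for edge in network_edges}
    added = 0
    for (a, b), weight in extracted.items():
        if a == b or weight < threshold:
            continue
        edge = frozenset((a, b))
        if a in proteins and b in proteins and edge not in merged:
            merged.add(edge)
            added += 1
    return merged, added


NETWORK = [("A", "B"), ("B", "C")]
EXTRACTED = {("A", "C"): -0.2, ("A", "D"): 0.5, ("B", "C"): 0.9, ("C", "A"): -0.9}
EXTENDED, N_ADDED = integrate(NETWORK, EXTRACTED, threshold=-0.6)
```

Here only A-C is added: A-D involves a protein absent from the dataset, B-C is already present, and the second A-C record falls below the threshold.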
Protein complexes detection with SLPC
In our work, a supervised learning protein complex detection (SLPC) method is employed to predict protein complexes from PPI networks. Currently, most protein complex detection methods are unsupervised and do not utilize the information in known complexes. However, numerous protein complexes have already been identified, and they can serve as prior knowledge for complex detection methods. In previous work, we presented the SLPC method to predict protein complexes [8]. The SLPC method utilizes features including graph density, degree statistics, edge weight statistics, clustering coefficient, and topological change. Experimental evaluations show that SLPC can achieve better performance than other existing protein complex detection methods. The SLPC algorithm is shown in Table 5, and more details are provided in [8].
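To make the kind of feature vector concrete, the sketch below computes three of the listed features (graph density, degree statistics and clustering coefficient) for a candidate subgraph. It uses textbook definitions; the exact feature definitions used by SLPC are those given in the earlier paper, and the function and key names here are assumptions.

```python
from statistics import mean


def subgraph_features(nodes, edges):
    """Textbook topological features of the subgraph induced by a non-empty node set."""
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes and a != b:
            adj[a].add(b)
            adj[b].add(a)
    n = len(nodes)
    n_edges = sum(len(s) for s in adj.values()) // 2
    # Density: fraction of possible node pairs that are connected.
    density = 2 * n_edges / (n * (n - 1)) if n > 1 else 0.0
    degrees = [len(adj[v]) for v in nodes]

    def local_clustering(v):
        nb = list(adj[v])
        k = len(nb)
        if k < 2:
            return 0.0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        return 2 * links / (k * (k - 1))

    return {
        "density": density,
        "degree_mean": mean(degrees),
        "degree_max": max(degrees),
        "clustering": mean(local_clustering(v) for v in nodes),
    }


FEATURES = subgraph_features(
    {"A", "B", "C", "D"},
    [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")],
)
```

Feature dictionaries of this kind, computed for known complexes and for non-complex subgraphs, are the sort of input a supervised classifier can be trained on.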
Experiments and results
Gold standard protein complexes
We constructed the gold standard protein complexes by combining MIPS [18], Aloy [19] and SGD [20] with TAP06 [9]. Proteins absent from the corresponding PPI networks are filtered out from the gold standard. In addition, only protein complexes including at least two different proteins are retained, since research shows that most protein complexes include more than one protein. The details of the gold standard protein complexes of the original PPI networks and the denoising PPI networks are shown in Tables 6 and 7, respectively.
In our study, F-score, accuracy (Acc) and maximum matching ratio (MMR) are used as the evaluation metrics. The neighborhood affinity score NA(A, B), defined as follows, is used to evaluate the similarity of two protein complexes A and B:
If NA(A, B) is larger than or equal to 0.25, complexes A and B are regarded as matching.
F-score, a popular metric for evaluating complex detection methods, is used as the first measure to evaluate the performance.
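The matching rule can be sketched as below, assuming the usual definition of the neighborhood affinity score, NA(A, B) = |A ∩ B|^2 / (|A| × |B|), with complexes represented as sets of protein identifiers. The function names are illustrative.

```python
def neighborhood_affinity(a, b):
    """NA(A, B) = |A ∩ B|^2 / (|A| * |B|), assuming the usual definition."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    return overlap * overlap / (len(a) * len(b))


def is_match(a, b, threshold=0.25):
    """Two complexes match when their NA score reaches the threshold."""
    return neighborhood_affinity(a, b) >= threshold


HALF_OVERLAP = neighborhood_affinity({1, 2, 3, 4}, {3, 4, 5, 6})   # 2^2 / (4 * 4)
WEAK_OVERLAP = neighborhood_affinity({1, 2, 3, 4}, {4, 5, 6, 7})   # 1^2 / (4 * 4)
```

Two four-protein complexes sharing two proteins sit exactly at the 0.25 threshold and are therefore regarded as matching; sharing a single protein is not enough.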
where P and B are the predicted and gold standard complex sets, respectively; N_cb is the number of gold standard complexes matching at least one predicted complex; and N_cp is the number of predicted complexes matching at least one gold standard complex. The F-score is calculated as the harmonic mean of the precision and recall values.
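A minimal sketch of this measure follows, assuming precision = N_cp / |P| and recall = N_cb / |B|, and the usual neighborhood affinity definition with the 0.25 matching threshold. The function names and toy complexes are illustrative.

```python
def _na(a, b):
    """Neighborhood affinity |A ∩ B|^2 / (|A| * |B|), assuming the usual definition."""
    overlap = len(a & b)
    return overlap * overlap / (len(a) * len(b)) if a and b else 0.0


def f_score(predicted, gold, threshold=0.25):
    """Return (precision, recall, F-score) for two lists of protein sets."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    # N_cp: predicted complexes matching at least one gold standard complex.
    n_cp = sum(1 for p in predicted if any(_na(p, g) >= threshold for g in gold))
    # N_cb: gold standard complexes matching at least one predicted complex.
    n_cb = sum(1 for g in gold if any(_na(g, p) >= threshold for p in predicted))
    precision = n_cp / len(predicted)
    recall = n_cb / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


GOLD = [{1, 2, 3, 4}, {5, 6, 7}]
PREDICTED = [{1, 2, 3}, {8, 9}, {5, 6, 7, 10}]
PRECISION, RECALL, F = f_score(PREDICTED, GOLD)
```

In the toy data, two of the three predictions match a gold standard complex and both gold standard complexes are recovered, giving precision 2/3, recall 1 and an F-score of 0.8.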
The second measure we used is the geometric accuracy introduced by Brohée et al. [21], which is the geometric mean of the clustering-wise sensitivity (Sn) and the clustering-wise positive predictive value (PPV). A high Sn value indicates that the protein complex prediction has good coverage of the proteins in the gold standard complexes, and a high PPV value indicates that the predicted protein complexes are likely to be true protein complexes. Assume that the number of gold standard complexes is n and the number of predicted complexes is m, and let T_ij denote the number of proteins found both in gold standard complex i and in predicted complex j. Sn, PPV and Acc are defined as follows:
The third metric we used is the maximum matching ratio (MMR) [6], which is based on a maximal one-to-one mapping between gold standard complexes and predicted complexes.
where n denotes the number of gold standard complexes, m the number of predicted complexes, and j indexes the predicted complexes. MMR offers a natural, intuitive way to compare predicted complexes with a gold standard, and it explicitly penalizes cases in which a reference complex is split into two or more parts in the predicted set, as only one of its parts is allowed to match the correct reference [6].
The Acc measure explicitly penalizes predicted complexes that do not match any of the reference complexes. However, gold standard sets of protein complexes are often incomplete [22]. As a consequence, predicted complexes not matching any known reference complex may still exhibit high functional similarity or be highly co-localized, and therefore they could still be prospective candidates for further in-depth analysis. In other words, a predicted complex that does not match a reference complex is not necessarily an undesired result, and optimizing for the geometric accuracy measure might prevent us from detecting novel complexes in a PPI dataset. Therefore, in the performance comparison, the F-score and MMR are used as the main metrics, and Acc is used only as an auxiliary one.
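A sketch of these three quantities, assuming the definitions of Brohée and van Helden: Sn is the sum over gold standard complexes of max_j T_ij divided by the total size of the gold standard complexes, PPV is the sum over predicted complexes of max_i T_ij divided by the sum of all T_ij, and Acc is the square root of Sn × PPV. The function name and toy data are illustrative.

```python
import math


def geometric_accuracy(gold, predicted):
    """Return (Sn, PPV, Acc) for two lists of protein sets."""
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    # T[i][j]: proteins shared by gold standard complex i and predicted complex j.
    t = [[len(g & p) for p in predicted] for g in gold]
    n, m = len(gold), len(predicted)
    total_gold = sum(len(g) for g in gold)
    sn = sum(max(row) for row in t) / total_gold if total_gold else 0.0
    total_overlap = sum(sum(row) for row in t)
    best_per_prediction = sum(max(t[i][j] for i in range(n)) for j in range(m))
    ppv = best_per_prediction / total_overlap if total_overlap else 0.0
    return sn, ppv, math.sqrt(sn * ppv)


GOLD = [{1, 2, 3, 4}, {5, 6, 7}]
PREDICTED = [{1, 2, 3}, {8, 9}, {5, 6, 7, 10}]
SN, PPV, ACC = geometric_accuracy(GOLD, PREDICTED)
```

In the toy data, six of the seven gold standard proteins are covered (Sn = 6/7) and every overlapping prediction overlaps a single gold standard complex (PPV = 1).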
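The one-to-one mapping behind MMR can be sketched as below, assuming the definition used with ClusterONE: the total neighborhood affinity of a maximum-weight one-to-one matching between gold standard and predicted complexes, divided by the number of gold standard complexes. For brevity the matching is found by exhaustive search, which is only feasible for tiny inputs; a real evaluation would use an assignment solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True`.

```python
from itertools import permutations


def _na(a, b):
    """Neighborhood affinity |A ∩ B|^2 / (|A| * |B|), assuming the usual definition."""
    overlap = len(a & b)
    return overlap * overlap / (len(a) * len(b)) if a and b else 0.0


def mmr(gold, predicted):
    """Maximum matching ratio by brute force (toy-sized inputs only)."""
    n, m = len(gold), len(predicted)
    if n == 0 or m == 0:
        return 0.0
    w = [[_na(g, p) for p in predicted] for g in gold]
    # Weights are non-negative, so an optimal matching can assign every
    # element of the smaller side to a distinct element of the larger side.
    if n <= m:
        best = max(sum(w[i][perm[i]] for i in range(n))
                   for perm in permutations(range(m), n))
    else:
        best = max(sum(w[perm[j]][j] for j in range(m))
                   for perm in permutations(range(n), m))
    return best / n


GOLD = [{1, 2, 3, 4}, {5, 6, 7, 8}]
SPLIT_PREDICTION = [{1, 2}, {3, 4}, {5, 6, 7, 8}]
MMR_SPLIT = mmr(GOLD, SPLIT_PREDICTION)
```

The toy prediction splits the first gold standard complex into two halves. Only one half may be matched to it (NA = 0.5), so the MMR is (0.5 + 1.0) / 2 = 0.75 rather than 1, which illustrates the splitting penalty described above.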
The performances of SLPC on original PPI networks
First we tested SLPC on the three original PPI networks, i.e. DIP, Krogan and Gavin. The results for F-score, accuracy and MMR are shown in Tables 8, 9, and 10, respectively. It can be seen that the performance measured with these metrics keeps improving on these networks as the integrating threshold decreases from 0 to -0.6. With the threshold -0.6, SLPC achieves the highest average improvements over the three original PPI networks: 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and MMR, respectively. This shows that introducing PPIs extracted from literature into the original PPI datasets can boost the performance. The reason is that the new PPIs passing the integrating threshold are relatively reliable, and integrating them into the original PPI networks relieves the sparsity problem of the PPI networks. As shown in Table 11, in most cases the average size of the complexes predicted from the extended PPI networks is much closer to that of the gold standard protein complexes than the average size of those predicted from the original PPI networks; therefore, SLPC achieves better performance on the extended PPI networks than on the original PPI networks.
However, Tables 8 and 10 show that the F-score and MMR values begin to decline after they reach their highest values. The reason is that a lower integrating threshold introduces more unreliable PPIs and therefore deteriorates the performance of the SLPC algorithm.
The performances of SLPC on denoising PPI networks
Denoising PPI networks are the ones from which the less reliable PPIs are removed, as discussed in the Section PPI datasets. The denoising extended PPI networks are the ones into which the PPIs extracted from literature are integrated. More specifically, the new PPIs are also filtered with different denoising thresholds, like the PPIs in the original PPI networks, and then integrated into the corresponding denoising PPI networks.
The performance of SLPC on the denoising PPI networks is shown in Tables 12, 13 and 14. The performance of SLPC on the denoising extended PPI network is better than that on the corresponding denoising PPI network with any denoising threshold. With the denoising threshold 0.9, SLPC achieves the highest average improvements of 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively, on the denoising extended PPI networks over the denoising PPI networks. This shows, once again, that introducing the PPIs extracted from literature can boost the performance of complex detection methods.
In addition, Tables 12, 13 and 14 show that the performance of the SLPC algorithm on the denoising PPI networks and the denoising extended PPI networks begins to decline after reaching its highest values. The reason is that a higher denoising threshold filters more PPIs out of the original PPI networks, which may remove some real PPIs.
The performance of ClusterONE, the state-of-the-art complex detection method, is also tested (its parameters are set as described in [6]). With the denoising threshold 0.9, it achieves average improvements of 0.31, 0.40 and 1.29 percentage units in F-score, accuracy and MMR, respectively, on the denoising extended PPI networks over the denoising PPI networks. This indicates that introducing the PPIs extracted from literature can also boost the performance of ClusterONE. In addition, the experimental results show that SLPC achieves better performance than ClusterONE. With the denoising threshold 0.9, the average performance improvement of SLPC over ClusterONE is 26.02 and 22.40 percentage units in F-score and MMR, respectively.
Protein complexes, consisting of molecular aggregations of proteins assembled by multiple protein interactions, are among the fundamental units of macro-molecular organization and play crucial roles in integrating individual gene products to perform useful cellular functions. Large amounts of PPI data generated by high-throughput experimental techniques can be used to predict protein complexes from PPI networks. At the same time, numerous accurate PPIs can be found in the rapidly growing biomedical literature, since they are provided by biomedical experts. Integrating them with the existing PPI datasets may reduce the sparsity of the PPI networks and therefore improve complex detection performance.
In this paper, we present an approach that introduces PPIs from biomedical literature into existing PPI networks and applies a supervised learning method to protein complex detection. In this approach, the new PPI networks are constructed by integrating PPI datasets with the large and readily available PPI data from biomedical literature, and then the less reliable PPIs are filtered out based on the semantic similarity and topological similarity of the two interacting proteins. Finally, the supervised learning protein complex detection method, SLPC, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks.
The best average improvements of 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and MMR, respectively, are achieved on the original extended PPI networks. In addition, the best average improvements of 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively, are achieved on the denoising extended PPI networks. All these results show that introducing PPIs extracted from literature into the original PPI datasets can boost the performance significantly. The reason is that the sparsity problem of the PPI networks is alleviated by integrating PPI data from biomedical literature. The results also show that our method outperforms ClusterONE, the state-of-the-art method. This is because our method makes full use of the information in known complexes. To summarize, our complex detection method, based on supervised learning and the integration of PPI data from biomedical literature, can achieve better performance than other complex detection methods.
1. Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25: 1891-1897. 10.1093/bioinformatics/btp311.
2. Wu M, Li X, Kwoh CK, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009, 10: 169. 10.1186/1471-2105-10-169.
3. Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2. 10.1186/1471-2105-4-2.
4. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30: 1575-1584. 10.1093/nar/30.7.1575.
5. Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22: 1021-1023. 10.1093/bioinformatics/btl039.
6. Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012, 9: 471-472. 10.1038/nmeth.1938.
7. Qi YJ, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z: Protein complex identification by supervised graph local clustering. Bioinformatics. 2008, 24: i250-i258. 10.1093/bioinformatics/btn164.
8. Yu F, Yang Z, Tang N, Lin H, Wang J: Predicting protein complex in protein interaction network - a supervised learning based method. BMC Syst Biol. 2014, 8 (Suppl 3): S4. 10.1186/1752-0509-8-S3-S4.
9. Gavin AC, Aloy P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636. 10.1038/nature04532.
10. Krogan NJ, Cagney G, et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440: 637-643. 10.1038/nature04670.
11. Xenarios I, Salwinski L, et al: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002, 30: 303-305. 10.1093/nar/30.1.303.
12. Hart GT, Ramani AK, Marcotte EM: How complete are current yeast and human protein interaction networks? Genome Biol. 2006, 7: 120. 10.1186/gb-2006-7-11-120.
13. Yang Z, Zhao Z, Li Y, Hu Y, Lin H: PPIExtractor: A Protein Interaction Extraction and Visualization System for Biomedical Literature. NanoBioscience, IEEE Transactions. 2013, 12 (3): 173-181.
14. Ashburner M, Ball CA, et al: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
15. Dwight SS, Harris MA, et al: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 2002, 30: 69-72. 10.1093/nar/30.1.69.
16. Stelzl U, Worm U, et al: A human protein-protein interaction network: a resource for annotating the proteome. Cell. 2005, 122: 957-968. 10.1016/j.cell.2005.08.029.
17. Chen L, Shi X, et al: Identifying protein complexes using hybrid properties. J Proteome Res. 2009, 8: 5212-5218. 10.1021/pr900554a.
18. Mewes HW, Amid C, et al: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, 32: D41-D44. 10.1093/nar/gkh092.
19. Aloy P, Böttcher B, et al: Structure-based assembly of protein complexes in yeast. Science. 2004, 303: 2026-2029. 10.1126/science.1092645.
20. Dudley AM, Janse DM, et al: A global view of pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol. 2005, 1: E1-E11.
21. Brohée S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488. 10.1186/1471-2105-7-488.
22. Jansen R, Gerstein M: Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol. 2004, 7: 535-545. 10.1016/j.mib.2004.08.012.
This work is supported by grants from the Natural Science Foundation of China (grant no. 61070098, 61272373 and 61340020), Trans-Century Training Programme Foundation for the Talents by the Ministry of Education of China (grant no. NCET-13-0084) and the Fundamental Research Funds for the Central Universities (grant no. DUT13JB09 and DUT14YQ213). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publication of this article was funded by the following grants: the Natural Science Foundation of China (grant no. 61070098, 61272373 and 61340020), Trans-Century Training Programme Foundation for the Talents by the Ministry of Education of China (grant no. NCET-13-0084) and the Fundamental Research Funds for the Central Universities (grant no. DUT13JB09 and DUT14YQ213).
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 12, 2015: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S12.
The authors declare that they have no competing interests.
ZHY and FYY conceived of the study, carried out its design and drafted the manuscript. FYY performed the experiments. FYY, XHH, HFL, and JW participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.
Yu, F.Y., Yang, Z.H., Hu, X.H. et al. Protein complex detection in PPI networks based on data integration and supervised learning method. BMC Bioinformatics 16 (Suppl 12), S3 (2015). https://doi.org/10.1186/1471-2105-16-S12-S3
- Protein-protein interaction network
- Protein complexes
- Data integration
- Supervised learning