Protein complex detection using interaction reliability assessment and weighted clustering coefficient
- Nazar Zaki^{1},
- Dmitry Efimov^{2} and
- Jose Berengueres^{1}
DOI: 10.1186/1471-2105-14-163
© Zaki et al.; licensee BioMed Central Ltd. 2013
Received: 27 December 2012
Accepted: 9 May 2013
Published: 20 May 2013
Abstract
Background
Predicting protein complexes from protein-protein interaction data is becoming a fundamental problem in computational biology. Identifying and characterizing protein complexes is crucial to understanding the molecular events that occur under normal and abnormal physiological conditions. High-throughput experimental techniques have produced large datasets of experimentally detected protein-protein interactions. However, such experimental data are liable to contain a large number of spurious interactions. Therefore, it is essential to validate these interactions before exploiting them to predict protein complexes.
Results
In this paper, we propose a novel graph mining algorithm (PEWCC) to identify such protein complexes. Firstly, the algorithm assesses the reliability of the interaction data, then predicts protein complexes based on the concept of weighted clustering coefficient. To demonstrate the effectiveness of the proposed method, the performance of PEWCC was compared to several methods. PEWCC was able to detect more matched complexes than any of the state-of-the-art methods with higher quality scores.
Conclusions
The higher accuracy achieved by PEWCC in detecting protein complexes is a valid argument in favor of the proposed method. The datasets and programs are freely available at http://faculty.uaeu.ac.ae/nzaki/Research.htm.
Background
where $N_{\text{avg}}=\frac{\sum_{x\in V}\left|N_{x}\right|}{N}$ is the average number of neighbors in the network and N is the total number of nodes in the network.
In this paper, we propose a simple yet effective method for protein complex identification. We are aware of the fact that, in addition to improving graph mining techniques, it is necessary to obtain high quality benchmarks by assessing protein interaction reliability. Therefore, we propose a novel method for assessing the reliability of interaction data and detecting protein complexes. Unlike CMC, this method finds near-maximum cliques (maximal cliques without unreliable interactions). We employ the concept of weighted clustering coefficients as a measure to define which subgraph is the closest to the maximal clique. The clustering coefficient of a vertex in this case is the density of its neighborhood [19].
Methods
Computational approaches for detecting protein complexes from PPI data are a useful complement to experimental methods such as Tandem Affinity Purification (TAP) [20], which have their own limitations. Besides improvements in graph mining techniques, the accurate detection of protein complexes depends on the availability of high-quality benchmarks. The bottleneck of many computational methods remains the noise associated with protein interaction data. Therefore, a rigorous assessment of the reliability of protein interactions is essential. In this section, we introduce a novel method, PEWCC, which has two main steps: first, it assesses the reliability of the protein interaction data using the PE-measure; second, it detects protein complexes using the weighted clustering coefficient (WCC) [19, 21]. In the subsequent sections, we describe these two steps in detail.
Assessing the reliability of protein interactions
In this section we introduce the PE-measure, a new measure of the reliability of the interaction between a pair of proteins. The PE-measure enables us to reduce the level of noise associated with PPI networks and is defined as follows:
where the product is taken over all v_{ l } such that (v_{ l },v_{ i }) ∈ E and (v_{ l },v_{ j }) ∈ E.
Suppose we would like to determine the weight of the edge e_{1} (between protein 1 and protein 2). According to Equation (4), the probabilities that protein 3 and protein 4 do not “support” the edge e_{1} are (1−p_{1,3}·p_{2,3}) and (1−p_{1,4}·p_{2,4}), respectively. Thus, the probability that neither protein 3 nor protein 4 “supports” the edge e_{1} is (1−p_{1,3}·p_{2,3})·(1−p_{1,4}·p_{2,4}). Therefore, the probability that protein 1 and protein 2 interact (that is, that the edge is supported by at least one of protein 3 and protein 4) is the complementary probability 1−[(1−p_{1,3}·p_{2,3})·(1−p_{1,4}·p_{2,4})].
We start with the initial probability matrix P_{0} (where p_{1,3}, p_{2,3}, p_{2,4}, p_{1,4} and p_{3,5} are all equal to 0.5). In the first iteration (k = 1) the PE-measure of the edge e_{1} is $1-\left[(1-p_{1,3}\cdot p_{2,3})\cdot(1-p_{1,4}\cdot p_{2,4})\right]=\frac{7}{16}$. Similarly, the PE-measures of edges e_{2}, e_{3}, e_{4} and e_{5} are all equal to $\frac{1}{4}$, while the measure of edge e_{6} is equal to 0. All of the PE-measures are updated before the second iteration (k = 2) starts.
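The first iteration of this worked example can be reproduced with a short Python sketch. It assumes the update rule implied by the example, p_{ij} = 1 − ∏(1 − p_{il}·p_{jl}) with the product taken over the common neighbors v_{l} of v_{i} and v_{j}, and it assumes that the hypothetical network consists of the six edges (1,2), (1,3), (2,3), (1,4), (2,4) and (3,5), all initialised to 0.5. The function name `pe_iteration` and the edge list are illustrative choices, not part of the authors' released programs.

```python
def pe_iteration(p):
    """Perform one PE-measure update.

    `p` maps frozenset({i, j}) -> current probability of edge (i, j).
    Returns a new dict, so every edge is updated from the same old values.
    """
    # Build the neighbor sets from the edge list.
    neighbors = {}
    for edge in p:
        i, j = tuple(edge)
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)

    updated = {}
    for edge in p:
        i, j = tuple(edge)
        # Probability that no common neighbor "supports" the edge (i, j).
        not_supported = 1.0
        for l in neighbors[i] & neighbors[j]:
            not_supported *= 1.0 - p[frozenset((i, l))] * p[frozenset((j, l))]
        updated[edge] = 1.0 - not_supported
    return updated


# Assumed topology of the hypothetical network (e1 ... e6), all weights 0.5.
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 5)]
p0 = {frozenset(e): 0.5 for e in edges}
p1 = pe_iteration(p0)

for e in edges:
    print(e, p1[frozenset(e)])
```

Under these assumptions the edge (1,2) receives 7/16, the four edges that share exactly one common neighbor receive 1/4, and the edge (3,5), which has no common neighbor, receives 0, in agreement with the values quoted above. Calling `pe_iteration(p1)` would give the second iteration.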
where v_{ l } : (v_{ l },v_{ i }) ∈ E, N_{ i } is the number of neighbors of v_{ i } and i = 1,…,N. If the PE-measure p_{ il } is less than the average (w_{ avg })_{ i }, then the edge between proteins i and l is considered unreliable and is removed from the network.
Applying Equation (4) to the hypothetical network shown in Figure 2, we see that the edge e_{6} receives the lowest weight, equal to 0; it is therefore likely to be noise and should be removed from the network.
Detecting protein complexes using weighted clustering coefficient
In Figure 3 (a), if i = 1, then N_{1} = 5 (the central protein 1 itself is not counted) and N_{3cliques} = 7; therefore, according to Equation 6, ${c}_{1}=\frac{2\times 7}{{5}^{2}\times (5-1)}=0.14$. Based on the degree sequence, node 5 has only 2 outgoing connections and is therefore removed from the subgraph. In Figure 3 (b), the subgraph is reduced to 4 nodes with N_{3cliques} = 5, so c_{1} = 0.21. The degree sequence now contains a tie, so either node 3 or node 4 is removed at random. If node 3 is removed, as shown in Figure 3 (c), we are left with a subgraph of only 3 nodes, for which c_{1} = 0.33; the subgraph that contains the central protein 1 and the three nodes 2, 4 and 6 is therefore a valid core protein complex. Once the core protein complex is identified, we examine the main subgraph once again and re-join any protein that interacts with more than t % of the proteins in the core protein complex. With t = 50, protein 3 joins the subgraph, and the final predicted complex is shown in Figure 3 (d).
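The arithmetic of this example can be checked with the following Python sketch. It uses the formula shown in the worked example, c = 2·N_{3cliques} / (N²·(N−1)), where every edge between two neighbors of the central protein closes one 3-clique with it. Figure 3 is not reproduced here, so the edge list is a hypothetical topology chosen only because it yields the counts quoted in the text (7, 5 and 3 triangles); the helper names `weighted_cc` and `rejoin` are likewise illustrative.

```python
def weighted_cc(adj, members):
    """Weighted clustering coefficient of a central protein.

    `members` is the current set of neighbors of the central protein
    (the central protein itself is not included).
    """
    members = set(members)
    n = len(members)
    # Each edge between two neighbors forms a triangle with the center;
    # summing degrees inside `members` counts every such edge twice.
    n_3cliques = sum(len(adj[u] & members) for u in members) // 2
    return 2 * n_3cliques / (n ** 2 * (n - 1))


def rejoin(adj, core, candidates, t):
    """Candidates that interact with more than t % of the core proteins."""
    return {v for v in candidates
            if 100 * len(adj[v] & core) / len(core) > t}


# Hypothetical neighborhood of protein 1 (assumed, not taken from Figure 3).
edge_list = [(1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
             (2, 3), (2, 4), (2, 6), (3, 6), (4, 6), (3, 5), (4, 5)]
adj = {}
for u, v in edge_list:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

c_a = weighted_cc(adj, {2, 3, 4, 5, 6})   # all five neighbors
c_b = weighted_cc(adj, {2, 3, 4, 6})      # after removing node 5
c_c = weighted_cc(adj, {2, 4, 6})         # after removing node 3
rejoined = rejoin(adj, core={1, 2, 4, 6}, candidates={3, 5}, t=50)

print(round(c_a, 2), round(c_b, 2), round(c_c, 2), rejoined)
```

With this assumed topology, protein 3 interacts with three of the four core proteins (75 %) and rejoins, while protein 5 interacts with exactly half of them and does not, because the threshold is strict. The stopping rule of the algorithm is not modelled here; the sketch only evaluates the coefficient at each stage described above.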
Assessing the quality of predicted complexes
where K is a cluster and R is a reference complex. V_{ K } and V_{ R } are the sets of proteins in K and R, respectively. The complex K is defined to match the complex R if MatchScore(K,R) ≥ α, where α ∈ {0.25, 0.5, 0.6, 0.7, 0.8, 0.9} (several thresholds are used because different methods were evaluated with different values of α).
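The matching rule can be illustrated with a small Python sketch. The defining equation is not reproduced in the text above, so the sketch assumes the overlap score commonly used in this literature, |V_{K} ∩ V_{R}|² / (|V_{K}| · |V_{R}|); if the intended definition differs, only the body of `match_score` needs to change. The function names and the protein labels are hypothetical.

```python
def match_score(cluster, reference):
    """Assumed overlap score: |K ∩ R|^2 / (|K| * |R|)."""
    v_k, v_r = set(cluster), set(reference)
    return len(v_k & v_r) ** 2 / (len(v_k) * len(v_r))


def matches(cluster, reference, alpha):
    """True if the predicted cluster matches the reference complex."""
    return match_score(cluster, reference) >= alpha


# Hypothetical predicted cluster and reference complex.
predicted = {"A", "B", "C", "D"}
reference = {"B", "C", "D", "E", "F"}

score = match_score(predicted, reference)   # 3^2 / (4 * 5)
print(score, matches(predicted, reference, 0.25), matches(predicted, reference, 0.5))
```

In this example the two sets share three proteins, so the score is 9/20 = 0.45: the pair counts as a match at α = 0.25 but not at α = 0.5, which shows why results obtained with different thresholds are not directly comparable.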
Following Nepusz et al. [9], we also evaluated our method using the maximum matching ratio (MMR). The MMR measure is based on a maximal one-to-one mapping between predicted and reference complexes. The motivation for Nepusz et al. [9] to use the MMR is the fact that the PPV tends to be lower if there are substantial overlaps between the predicted complexes, which could limit the prediction accuracy when using overlapping clustering algorithms. The algorithm used to calculate the MMR is available in the supplementary material (Additional file 1).
The experiments were conducted on a PC with an Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz and 3 GB of RAM.
Results and discussion
Properties of the two PPI datasets used in the experimental work
Dataset | Proteins | Interactions | Network density | Clustering coefficient | Av. no. of neighbors | Isolated proteins |
---|---|---|---|---|---|---|
PPI-D1 | 3,869 | 19,165 | 0.002 | 0.157 | 8.957 | 8 |
PPI-D2 | 5,640 | 59,748 | 0.004 | 0.246 | 21.187 | 0 |
Three reference sets of protein complexes are used in these experiments. The first set (Cmplx-D1) comprises 162 hand-curated complexes from MIPS [30]. The second set (Cmplx-D2), which contains 63 complexes, was generated by Aloy et al. [31]. The third reference set (Cmplx-D3) of 203 complexes was developed by Nepusz et al. [9] and consists of the most recent version of the MIPS catalog of protein complexes. Both Cmplx-D1 and Cmplx-D2 were used by Liu et al. [10] to evaluate the performance of the CMC method. Only complexes containing 4 or more proteins were considered.
Performance comparison of PEWCC, CMC, ClusterONE, MCL, CFinder, and MCODE, with Acc(K,P) ≥ 0.5
Method | Cmplx-D1 Matched Cmplx | Cmplx-D1 Prec | Cmplx-D1 Rec | Cmplx-D1 F1 | Cmplx-D2 Matched Cmplx | Cmplx-D2 Prec | Cmplx-D2 Rec | Cmplx-D2 F1 |
---|---|---|---|---|---|---|---|---|
PEWCC | 58 | 0.435 | 0.469 | 0.451 | 61 | 0.468 | 0.910 | 0.618 |
CMC | 56 | 0.297 | 0.346 | 0.320 | 57 | 0.385 | 0.889 | 0.537 |
ClusterONE | 52 | 0.204 | 0.387 | 0.267 | 48 | 0.231 | 0.872 | 0.365 |
MCL | 51 | 0.353 | 0.315 | 0.333 | 52 | 0.448 | 0.825 | 0.581 |
MCODE | 39 | 0.330 | 0.241 | 0.279 | 34 | 0.386 | 0.540 | 0.450 |
CFinder | 46 | 0.379 | 0.284 | 0.325 | 43 | 0.463 | 0.683 | 0.552 |
As shown in Table 2, the proposed method detected more matched complexes than any of the other state-of-the-art methods and achieved a higher F1 value.
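The F1 column of Table 2 is consistent with the usual harmonic mean of precision and recall, F1 = 2·Prec·Rec / (Prec + Rec). The short Python sketch below recomputes it for three table entries; the function name `f1_score` is illustrative and the input values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (Prec, Rec) pairs copied from Table 2.
pewcc_d1 = round(f1_score(0.435, 0.469), 3)
pewcc_d2 = round(f1_score(0.468, 0.910), 3)
cmc_d1 = round(f1_score(0.297, 0.346), 3)

print(pewcc_d1, pewcc_d2, cmc_d1)
```

Because the harmonic mean is dominated by the smaller of the two values, a method with very high recall but low precision (as in the Cmplx-D2 columns) still obtains only a moderate F1.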
Performance of CMC and ClusterONE with and without a filtering method (AdjstCD or the PE-measure), with Acc(K,P) ≥ 0.5
Method | Clusters predicted | Matched Cmplx | Perc. of successful Cmplx | Rec | Prec | PPV | F1 |
---|---|---|---|---|---|---|---|
CMC | 133 | 45 | 28 | 0.217 | 0.263 | 0.172 | 0.238 |
ClusterONE | 498 | 77 | 47.5 | 0.372 | 0.118 | 0.301 | 0.180 |
AdjstCD+CMC | 127 | 75 | 46.3 | 0.362 | 0.455 | 0.277 | 0.404 |
AdjstCD+ClusterONE | 139 | 78 | 48.2 | 0.377 | 0.393 | 0.294 | 0.385 |
PE+CMC | 112 | 77 | 47.5 | 0.372 | 0.446 | 0.313 | 0.406 |
PE+ClusterONE | 110 | 81 | 50 | 0.391 | 0.464 | 0.318 | 0.424 |
PE+WCC | 128 | 89 | 54.9 | 0.435 | 0.469 | 0.262 | 0.451 |
For generalization purposes, PEWCC was further compared to several state-of-the-art methods using the protein interaction dataset PPI-D2 and the reference dataset Cmplx-D3. PPI-D2 and Cmplx-D3 were recently published and used to evaluate the performance of ClusterONE [9] in detecting protein complexes. In this case, more than one quality score was used to assess the performance of each algorithm: following [9], we report the fraction of matched complexes at an overlap score threshold of Acc(K,P) ≥ 0.25 and the geometric accuracy. The methods RNSC [4, 5] and RRW [3] were also included in the comparison. Please note that the RNSC algorithm does not take the weights of the PPI graph edges into consideration. A summary of the parameter settings for all the methods used in the comparison is available in the supplementary materials (Additional file 2).
Comparison of PEWCC with ClusterONE, RNSC, RRW, CMC, MCL and MCODE, where Acc(K,P) ≥ 0.25
Method | Clusters predicted | Matched Cmplx | Perc. of successful Cmplx | Sn | PPV | Acc | MMR |
---|---|---|---|---|---|---|---|
PEWCC | 468 | 122 | 60.1 | 0.551 | 0.430 | 0.491 | 0.348 |
ClusterONE | 473 | 88 | 43.3 | 0.454 | 0.427 | 0.440 | 0.195 |
RNSC | 209 | 79 | 38.9 | 0.399 | 0.441 | 0.419 | 0.192 |
RRW | 253 | 75 | 36.9 | 0.276 | 0.429 | 0.344 | 0.178 |
CMC | 73 | 53 | 26.1 | 0.323 | 0.404 | 0.487 | 0.176 |
MCL | 338 | 37 | 18.2 | 0.346 | 0.350 | 0.348 | 0.083 |
MCODE | 85 | 21 | 10.3 | 0.285 | 0.284 | 0.285 | 0.048 |
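As a reminder of how the geometric accuracy relates to the other columns, the sketch below assumes the definition used by Brohee and van Helden [22] and Nepusz et al. [9], namely the geometric mean of the clustering-wise sensitivity and the positive predictive value, Acc = √(Sn · PPV). The function name `geometric_accuracy` is illustrative, and the two input pairs are copied from the ClusterONE and RNSC rows of the table above.

```python
import math


def geometric_accuracy(sn, ppv):
    """Geometric mean of sensitivity (Sn) and positive predictive value (PPV)."""
    return math.sqrt(sn * ppv)


# (Sn, PPV) pairs copied from the ClusterONE and RNSC rows.
acc_clusterone = round(geometric_accuracy(0.454, 0.427), 3)
acc_rnsc = round(geometric_accuracy(0.399, 0.441), 3)

print(acc_clusterone, acc_rnsc)
```

The geometric mean rewards methods that balance the two quantities: a clustering that places every protein in one large cluster would obtain a high Sn but a low PPV, and therefore a low Acc.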
Conclusion
In this paper, we have presented a novel method (PEWCC) for detecting protein complexes from a PPI network of yeast. We have shown that our approach, which first assesses the quality of the interaction data and then detects protein complexes based on the concept of the weighted clustering coefficient, is more accurate than most of the well-known methods.
The noise associated with the PPI network and the focus on dense subgraphs have prevented researchers from creating an effective algorithm that is capable of identifying small complexes, and PEWCC is no exception. In fact, we are not aware of any method that can effectively detect small complexes (≤ 3 proteins) using only the topology of the PPI network. PEWCC stops when the neighborhood graph contains only 3 proteins, which prevents it from identifying small complexes (≤ 3 proteins). We found that the clustering coefficient is c_{ i } = 1 for dense graphs of size 3 (with 3 nodes and 3 edges) and c_{ i } = 0 for other subgraphs of size 3 (with 3 nodes and 2 edges). We are currently conducting a systematic study of nested complexes (the case where one complex is a sub-complex of a bigger one) in order to identify strategies that could improve the capability of PEWCC to identify small complexes.
The performance of PEWCC could also be tested after randomly removing edges from the original graph. However, we strongly believe that the main issue concerning PPI data is the noise associated with false interactions (edges). Many interactions are unreliable, and removing them using the PE-measure or AdjstCD improved the prediction accuracy. Moreover, if edges are removed uniformly over the PPI network, the PEWCC algorithm will still work, because it calculates relative density (the density of one subgraph with respect to another). That is, if we have two subgraphs G_{1} and G_{2} and the density of G_{1} is less than the density of G_{2}, then after the random deletion of some edges from G_{1} and G_{2}, the probability that the density of G_{1} is still less than the density of G_{2} remains very high.
In the future, we would like to compare the performance of the PE-measure with the recently published weighting scheme for noise reduction in PPI graphs by Kritikos et al. [32]. In this work, only the topological properties of PPI graphs were taken into consideration, although it has been shown that integrating additional biological knowledge helps weighting schemes to generate more reliable PPI graphs. Therefore, an interesting open challenge is to study the incorporation of additional biological knowledge about protein complexes. To this end, a probabilistic calculation of the affinity score between two proteins [33] could further improve the performance of the proposed method.
Furthermore, the idea of decomposing the PPI network into overlapping clusters will be explored as it shows great potential in recent works [9, 34-36].
Declarations
Acknowledgements
The authors would like to acknowledge the assistance provided by the Emirates Foundation (EF Grant Ref. No. 2010/116), the National Research Foundation (NRF Grant Ref. No. 21T021) and the Research Support and Sponsored Projects Office and the Faculty of Information Technology at the United Arab Emirates University (UAEU).
Authors’ Affiliations
References
- Zaki NM, Berengueres J, Efimov D: ProRank: A method for detecting protein complexes. Proceedings of the ACM Genetic and Evolutionary Computation Conference (GECCO). 2012, Philadelphia, 209-216.
- Dongen SM: Graph Clustering by Flow Simulation. 2000, Domplein 29, 3512 JE Utrecht, Netherlands: University of Utrecht
- Macropol K, Can T, Singh A: RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics. 2009, 10: 283-
- Andrew DK, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020.
- Przulj N, Jurisica I, Wigle DA: Functional topology in a network of protein interactions. Bioinformatics. 2004, 20 (3): 340-348.
- Leung H, Chin F, Xiang Q: Predicting protein complexes from ppi data: A core-attachment approach. J Comput Biol. 2009, 16 (2): 133-139.
- Zaki NM, Berengueres J, Efimov D: Detection of protein complexes using a protein ranking algorithm. Proteins: Struct, Funct, Bioinformatics. 2012, 80 (10): 2459-2468.
- Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. J Bioinformatics. 2006, 22 (8): 1021-1023.
- Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012, 9: 471-472.
- Guimei L, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25 (15): 1891-1897.
- Kuchaiev O, Rasajski M, Higham DJ, Przulj N: Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol. 2009, 5 (8): 454-
- Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data. J Mol Biol. 2003, 327: 919-923.
- Xiaoli L, Kwoh CK, See-Kiong N, Min Wu: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 1186, 10:
- Bader GD, Christopher WH: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2-
- Brun C, et al: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol. 2003, 5 (1): R6-
- Chua H, et al: Using indirect protein-protein interactions for protein complex prediction. J Bioinform Comput Biol. 2008, 6: 435-466.
- Hon NC, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006, 22 (13): 1623-1630.
- Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinform Comput Biol. 2008, 6 (3): 435-466.
- Watts DJ, Strogatz SH: Collective dynamics of ‘small-world’ networks. Nature. 1998, 393 (6684): 409-410.
- Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M: A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol. 1999, 17 (10): 1030-1032.
- Efimov D, Zaki NM, Berengueres J: Detecting protein complexes from noisy protein interaction data. Proceedings of the 11th International Workshop on Data Mining in Bioinformatics (BIOKDD’12), Beijing, China. 2012, New York: ACM, 1-7.
- Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488-
- Ho Y: Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415: 180-183.
- Gavin AC, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147.
- Gavin AC, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440: 631-636.
- Krogan NJ: Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature. 2006, 440: 637-643.
- Uetz P, et al: A comprehensive analysis of protein-protein interactions in saccharomyces cerevisiae. Nature. 1999, 403: 623-627.
- Ito T, et al: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci. 2001, 98: 4569-4574.
- Stark C, et al: Biogrid: a general repository for interaction datasets. Nucleic Acids Res. 2006, 34 (1): D535-D539.
- Mewes HW, et al: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, 32: 41-44.
- Aloy P, et al: Structure-based assembly of protein complexes in yeast. Science. 2004, 303: 2026-2029.
- Kritikos GD, Moschopoulos C, Vazirgiannis M, Kossida S: Noise reduction in protein-protein interaction graphs by the implementation of a novel weighting scheme. BMC Bioinformatics. 2011, 12: 239-
- Xie Z, Kwoh CK, Li XL, Wu M: Construction of co-complex score matrix for protein complex prediction from ap-ms data. Bioinformatics. 2011, 27: i159-i166.
- Tak Chien C, Young-Rae C: Accuracy improvement in protein complex prediction from protein interaction networks by refining cluster overlaps. Proteome Sci. 2012, 10: S3-
- Becker E, Robisson B, Charles E, Gunoche A, Brun C: Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics. 2012, 28 (1): 84-90.
- Zhang XF, Dai DQ, Ou-Yang L, Wu MY: Exploring overlapping functional units with various structure in protein interaction networks. PLoS ONE. 2011, 7 (8): e43092-
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.