A two-layer integration framework for protein complex detection
 Le OuYang^{1, 2, 5},
 Min Wu^{3},
 XiaoFei Zhang^{4},
 DaoQing Dai^{2},
 XiaoLi Li^{3} and
 Hong Yan^{5}
https://doi.org/10.1186/s12859-016-0939-3
© OuYang et al. 2016
Received: 12 August 2015
Accepted: 27 January 2016
Published: 24 February 2016
Abstract
Background
Protein complexes carry out nearly all signaling and functional processes within cells. The study of protein complexes is an effective strategy to analyze cellular functions and biological processes. With the increasing availability of proteomics data, various computational methods have recently been developed to predict protein complexes. However, different computational methods are based on their own assumptions and designed to work on different data sources, and various biological screening methods have their unique experiment conditions, and are often different in scale and noise level. Therefore, a single computational method on a specific data source is generally not able to generate comprehensive and reliable prediction results.
Results
In this paper, we develop a novel Two-layer INtegrative Complex Detection (TINCD) model to detect protein complexes, leveraging the information from both clustering results and raw data sources. In particular, we first integrate various clustering results to construct consensus matrices for proteins to measure their overall co-complex propensity. Second, we combine these consensus matrices with the co-complex score matrix derived from Tandem Affinity Purification/Mass Spectrometry (TAP) data and obtain an integrated co-complex similarity network via an unsupervised metric fusion method. Finally, a novel graph regularized doubly stochastic matrix decomposition model is proposed to detect overlapping protein complexes from the integrated similarity network.
Conclusions
Extensive experimental results demonstrate that TINCD performs much better than 21 state-of-the-art complex detection techniques, including ensemble clustering and data integration techniques.
Keywords
Protein complex; Protein interaction data; Co-complex matrix; Consensus matrix; Matrix fusion; Matrix decomposition
Background
Understanding the structural and functional architecture of the cell has been a fundamental task for systems biology [1]. As vital macromolecules, proteins do not act individually, but work by interacting with other partners [2]. Almost all of the functional processes within a cell are carried out by protein complexes, which are formed by interacting proteins [3]. Therefore, detecting protein complexes from protein-protein interaction (PPI) data is crucial for elucidating the modular structure within cells [4, 5]. In recent years, high-throughput screening (HTS) techniques have been designed to detect protein-protein interactions, e.g., yeast two-hybrid (Y2H) [6] and Tandem Affinity Purification/Mass Spectrometry (TAP) [7]. Such HTS techniques have already generated a large amount of PPI data, which facilitates the development of computational methods for protein complex detection [8–21].
Generally, computational methods for protein complex detection utilize two types of data, namely, the binary protein interaction data detected by HTS techniques such as the Y2H method, and the data for co-complex interactions among proteins [22, 23] from TAP experiments. Here, we denote these two types of data as PPI data and TAP data respectively. PPI data is usually modeled as a graph (i.e., a PPI network) where nodes represent proteins and edges represent protein interactions. A number of graph clustering algorithms have been proposed for detecting protein complexes from PPI networks, such as MCODE [9], CFinder [24], MCL [8], RNSC [25], COACH [26] and ClusterONE [15]. On the other hand, raw data from TAP experiments is a list of bait proteins along with the prey proteins that they pulled out (purification records), which can be modeled as a bipartite graph (in which the two node sets are composed of bait proteins and prey proteins, and the edges between the two node sets represent bait-prey connections). Several algorithms have been proposed to identify protein complexes from TAP data as well [27–31]. A common strategy is to first define affinity scores between proteins based on the purification records [5, 32, 33] and then convert the TAP data to a PPI network by thresholding, keeping only the reliable interactions for further analysis. Since converting the original TAP data into a binary PPI network not only introduces errors but also loses useful information in the raw data [23], an alternative strategy is to detect complexes from the TAP data directly, as in CACHET [31].
As diverse sources of protein interaction data are available, data integration becomes a common methodology to reduce the noise in PPI and TAP data (addressing the false positive issue) [34] and to improve the coverage (addressing the false negative issue) for protein complex detection. For example, DECAFF [35] exploited the Gene Ontology (GO) annotations to assess the reliability of PPI data and then detected protein complexes from the refined PPI data; MATISSE [36] and CMBI [37] integrated gene expression data with PPI data to increase the confidence of interactions for protein complex detection. InteHC [17] integrated four data sources (i.e., PPI data, gene expression profiles, GO terms and TAP data) and significantly improved complex detection. In particular, InteHC calculated a score matrix for each of the four data sources and took their weighted sum as an integrated matrix, which relies on additional supervision information to learn a weight for each data source. However, due to noise in different data sources, the direct fusion of several original datasets may exacerbate the problems of noise. Moreover, how to correctly estimate the co-complex relationships between proteins based on their functional annotations and gene expression profiles is still an open problem.
Nevertheless, with various methods proposed for protein complex detection, we are able to generate a series of clustering results for each type of data. Clearly, given one type of data, each method has its own advantages and limitations in capturing co-complex relationships between proteins [38]. Ensemble clustering, which exploits the complementary nature of individual methods by leveraging their clustering outputs, is thus promising for improving the detection of protein complexes [18, 39, 40]. In particular, ensemble clustering methods usually first reconstruct a consensus matrix (or consensus network), which shows the co-complex propensity among proteins, from a series of clustering results and then apply a specific algorithm [18, 41] to detect protein complexes from the consensus matrix. However, the consensus network, based solely on result-level integration (integrating the clustering results of different methods on a single type of data), may miss the underlying complex information which exists in other types of data. It is thus necessary to combine the consensus matrices derived from multiple types of data to generate a more comprehensive and reliable co-complex similarity matrix, which may facilitate the detection of protein complexes.
Unlike Y2H experiments, which tend to identify direct physical interactions, TAP experiments already provide useful information about protein complexes, and TAP data describe the co-complex propensity among proteins. However, as TAP data cannot be converted into co-complex interactions in a straightforward manner, several scoring methods have been proposed to estimate the affinity scores between proteins based on the purification records (e.g., bait-prey and prey-prey relationships) provided by TAP data, such as C2S scores [30]. As such, we are able to integrate heterogeneous matrices, i.e., the consensus matrices derived from different types of data and the co-complex score matrices derived from TAP data, to better understand the co-complex relationships among proteins. However, as these heterogeneous matrices may have different scales, noise rates and importance levels, focusing only on common patterns can miss valuable complementary information. It is challenging to merge them into a final co-complex matrix automatically in an unsupervised manner. In addition, once we obtain the integrated matrix, it is still difficult to design an efficient algorithm to detect overlapping complexes from it.
Methods
In this section, we describe our TINCD model, shown in Fig. 1, in detail. We first present the two-layer integration and then the graph regularized doubly stochastic matrix decomposition algorithm for protein complex detection.
First layer integration: result-level integration via ensemble clustering
As diverse types of data are available and various computational methods have been designed to detect protein complexes from them, we are thus able to generate a series of base clustering results (i.e., employing different methods on a particular type of data will generate multiple clustering results). A straightforward way to measure the cocomplex affinities among proteins is to build the consensus matrices by integrating the above abundant clustering results.
Suppose all the data sources used in this study cover N proteins and we have obtained \(n_p\) clustering results, generated by applying \(n_p\) different methods to a specific type of data. Here, a clustering result refers to a set of clusters generated by a certain method. A consensus matrix \(C^{(m)}\) is an N×N matrix whose entry \(C_{ij}^{(m)}\) is the number of clustering results in which proteins i and j are assigned to the same cluster, divided by the number of clustering results \(n_p\). As such, each entry \(C_{ij}^{(m)}\) estimates the probability that proteins i and j belong to the same complex. If protein i is not assigned to any cluster or is not included in the mth data source, the ith row and ith column of \(C^{(m)}\) are set to zero. Note that the coverage and quality of different data sources may differ. We thus build a consensus matrix independently for each type of data. In this study, we focus on two data sources, namely, PPI data and TAP data. The consensus matrices corresponding to these two types of data are denoted by \(C^{(1)}\) and \(C^{(2)}\) respectively (please refer to Fig. 1).
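The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `consensus_matrix`, the representation of a clustering result as a list of sets of protein indices, and the choice to zero the diagonal are all assumptions made for the example.

```python
import numpy as np

def consensus_matrix(clusterings, n_proteins):
    """Fraction of clustering results in which proteins i and j share a cluster."""
    C = np.zeros((n_proteins, n_proteins))
    for clusters in clusterings:                 # one clustering result = list of clusters
        together = np.zeros((n_proteins, n_proteins), dtype=bool)
        for cluster in clusters:                 # clusters may overlap
            idx = np.fromiter(cluster, dtype=int)
            together[np.ix_(idx, idx)] = True    # mark every pair inside this cluster
        C += together                            # count each pair at most once per result
    C /= len(clusterings)
    np.fill_diagonal(C, 0.0)                     # assumption: self-similarity is ignored
    return C

# Two hypothetical clustering results over five proteins; protein 4 is never clustered.
result_a = [{0, 1, 2}]
result_b = [{0, 1}, {2, 3}]
C = consensus_matrix([result_a, result_b], n_proteins=5)
print(C)
```

Proteins 0 and 1 share a cluster in both results, so their entry is 1; proteins 0 and 2 share a cluster in only one result, so their entry is 0.5; the row and column of the unassigned protein 4 stay at zero.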
Second layer integration: integrating heterogeneous co-complex matrices via similarity network fusion
Let \(C^{(m)}\) (m=1,…,M) denote all the consensus matrices from the ensemble clustering and the score matrix derived from the TAP data (in this study, M=3). All of these M matrices describe the co-complex similarities among proteins, but in different scales and with different noise rates. We next introduce the similarity network fusion (SNF) method [42] to integrate these M heterogeneous matrices.
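The fusion equations are not reproduced in this text, so the following is only a simplified sketch of the cross-diffusion idea commonly associated with SNF [42], not the procedure actually used in TINCD. It assumes the usual two ingredients: a row-normalized "full" kernel for each matrix and a sparse k-nearest-neighbour "local" kernel, with each full kernel repeatedly diffused through its own local kernel using the average of the other full kernels. The function names and the parameters `k` and `n_iter` are hypothetical, and the extra normalization steps found in published SNF implementations are omitted.

```python
import numpy as np

def full_kernel(W):
    """Off-diagonal entries of each row scaled to sum to 1/2, diagonal set to 1/2."""
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    row_sum = W.sum(axis=1, keepdims=True)
    P = np.divide(W, 2.0 * row_sum, out=np.zeros_like(W), where=row_sum > 0)
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Keep each row's k largest off-diagonal similarities and row-normalize them."""
    W = W.astype(float).copy()
    np.fill_diagonal(W, 0.0)
    S = np.zeros_like(W)
    nearest = np.argsort(-W, axis=1)[:, :k]      # indices of the k most similar proteins
    rows = np.arange(W.shape[0])[:, None]
    S[rows, nearest] = W[rows, nearest]
    row_sum = S.sum(axis=1, keepdims=True)
    return np.divide(S, row_sum, out=np.zeros_like(S), where=row_sum > 0)

def snf(matrices, k=3, n_iter=20):
    """Fuse M >= 2 similarity matrices by iterative cross-diffusion (simplified)."""
    P = [full_kernel(W) for W in matrices]
    S = [local_kernel(W, k) for W in matrices]
    M = len(matrices)
    for _ in range(n_iter):
        total = sum(P)
        # Each view is updated from the average of the *other* views.
        P = [S[v] @ ((total - P[v]) / (M - 1)) @ S[v].T for v in range(M)]
    fused = sum(P) / M
    return (fused + fused.T) / 2.0               # symmetrize the final network
```

Because both kernels are row-normalized, multiplying any one input matrix by a positive constant leaves the fused result unchanged in this sketch, which is one way to see why a fusion of this kind is insensitive to the differing scales of the input matrices.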
Detecting protein complexes via graph regularized doubly stochastic matrix decomposition model
In the above sections, we obtain the integrated similarity matrix W via a two-layer integration framework. Next, we present the graph regularized doubly stochastic matrix decomposition model to detect protein complexes from W.
Model formulation
where \(\theta_{ik}=P(k|i)\) is the probability of assigning protein i to the kth complex and \(\hat{W}_{ij}=\sum_{k=1}^{K} \frac{\theta_{ik}\theta_{jk}}{\sum_{z=1}^{N} \theta_{zk}}\).
where Tr(·) denotes the trace of a matrix and D is a diagonal matrix defined by \(D_{ii} = \sum_{j=1}^{N} W_{ij}\). By minimizing R, we encourage the co-complex relationships inherent in W to carry over to the estimate of θ.
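To make the approximation \(\hat{W}\) concrete, the sketch below evaluates the formula above for a small random θ and checks why the decomposition is called doubly stochastic: when every row of θ sums to one (as it must if \(\theta_{ik}=P(k|i)\)), every row and column of \(\hat{W}\) also sums to one. The function name and the random test matrix are illustrative assumptions.

```python
import numpy as np

def reconstruct_similarity(theta):
    """W_hat[i, j] = sum_k theta[i, k] * theta[j, k] / sum_z theta[z, k]."""
    column_mass = theta.sum(axis=0)              # sum_z theta[z, k], one value per complex
    return (theta / column_mass) @ theta.T

rng = np.random.default_rng(0)
theta = rng.random((5, 3))                       # 5 proteins, 3 candidate complexes
theta /= theta.sum(axis=1, keepdims=True)        # each row is a distribution P(k | i)

W_hat = reconstruct_similarity(theta)
print(W_hat.sum(axis=0))                         # column sums
print(W_hat.sum(axis=1))                         # row sums
```

The row sums follow directly from the formula: summing \(\hat{W}_{ij}\) over j cancels the denominator and leaves \(\sum_k \theta_{ik} = 1\); symmetry then gives the column sums.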
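The display equation defining R is missing from this text. Given the trace, the degree matrix D and the later reference to a Laplacian regularizer, a natural reading is the standard graph regularizer \(R = Tr(\theta^{T}(D-W)\theta)\); this form is an assumption, not a quotation. Under that assumption, the sketch below checks numerically that R equals half the weighted sum of squared differences between membership vectors, which is why minimizing it pushes proteins with high co-complex similarity towards similar memberships.

```python
import numpy as np

def laplacian_regularizer(theta, W):
    """Assumed form: R = Tr(theta^T (D - W) theta) with D_ii = sum_j W_ij."""
    D = np.diag(W.sum(axis=1))
    return np.trace(theta.T @ (D - W) @ theta)

def pairwise_penalty(theta, W):
    """Equivalent form for symmetric W: 0.5 * sum_ij W_ij * ||theta_i - theta_j||^2."""
    diff = theta[:, None, :] - theta[None, :, :]
    return 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))

rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2.0                              # similarity matrices are symmetric
theta = rng.random((5, 3))
print(laplacian_regularizer(theta, W), pairwise_penalty(theta, W))
```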
Graph regularized doubly stochastic matrix decomposition model
where λ≥0 is the trade-off parameter that controls the balance between the two factors.
Since the above objective function (9) is non-convex, we employ a relaxed Majorization-Minimization algorithm to find a good local minimum [43]. The update rule for θ is shown in Algorithm 1. Please refer to Additional file 1 for more details. Since the optimal solution \(\hat{\theta}_{ik}\) is a continuous value which describes the probability of assigning protein i to the predicted kth complex, we need to discretize \(\hat{\theta}\) into a final protein-complex assignment matrix \(\theta^{\star}\). In this study, to get overlapping protein complexes, for each protein i, we first sort \(\hat{\theta}_{ik}\), k=1,…,K, in descending order, and then retain the top \(K_i\) complexes, where \(K_i\) is chosen so that the gap between the \(K_i\)th and \((K_i+1)\)th elements is the largest. \(\theta^{\star}_{ik}=1\) if k belongs to the top \(K_i\) complexes, and \(\theta^{\star}_{ik}=0\) otherwise.
Here, \(\theta^{\star}_{ik} = 1\) means that protein i is assigned to the predicted kth complex, while \(\theta^{\star}_{ik} = 0\) indicates that protein i does not belong to the predicted kth complex. In this study, we only consider predicted complexes with at least three proteins [15].
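The largest-gap rule and the size filter can be sketched as follows. The function names and the toy membership matrix are illustrative assumptions; ties between equally large gaps are broken in favour of the smaller \(K_i\), a detail the text does not specify.

```python
import numpy as np

def discretize(theta_hat):
    """Binary protein-complex assignment using the largest-gap rule, row by row."""
    n, K = theta_hat.shape                        # requires K >= 2
    assignment = np.zeros((n, K), dtype=int)
    for i in range(n):
        order = np.argsort(-theta_hat[i])         # complexes by descending membership
        sorted_vals = theta_hat[i, order]
        gaps = sorted_vals[:-1] - sorted_vals[1:] # gap between rank r and rank r + 1
        K_i = int(np.argmax(gaps)) + 1            # keep everything above the largest gap
        assignment[i, order[:K_i]] = 1
    return assignment

def keep_large_complexes(assignment, min_size=3):
    """Drop predicted complexes (columns) with fewer than min_size proteins."""
    return assignment[:, assignment.sum(axis=0) >= min_size]

theta_hat = np.array([
    [0.45, 0.40, 0.10, 0.05],   # largest gap after rank 2: complexes 0 and 1
    [0.90, 0.05, 0.03, 0.02],   # largest gap after rank 1: complex 0 only
    [0.10, 0.10, 0.70, 0.10],   # largest gap after rank 1: complex 2 only
    [0.50, 0.45, 0.03, 0.02],   # largest gap after rank 2: complexes 0 and 1
])
theta_star = discretize(theta_hat)
print(theta_star)
print(keep_large_complexes(theta_star).shape)
```

In this toy example only complex 0 ends up with three member proteins, so it is the only column that survives the size filter.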
Results
In this section, we first introduce the experiment settings, i.e., experiment data and evaluation metrics. Then, we demonstrate an extensive comparison study between our proposed TINCD method and various existing approaches for protein complex detection.
Experiment data and evaluation metrics
In this study, two types of data (PPI data and TAP data) for yeast have been employed for evaluating the performance of various complex detection methods. The binary PPI data is downloaded from the DIP database [45] and contains 17,201 interactions among 4,930 proteins. In addition, we consolidate the data from both [5] and [46] as our TAP data, which consist of 6,498 purifications involving 2,996 bait proteins and 5,405 prey proteins. Overall, the PPI data and TAP data cover 5,929 proteins.
We employ 3 benchmark complex sets as gold standards to evaluate the complexes predicted by various methods, namely CYC2008 [47], MIPS [48] and SGD [49]. CYC2008, MIPS and SGD consist of 408, 203 and 323 complexes, respectively. For all the reference sets, to avoid selection bias, we filter out the proteins that are not involved in the input PPI and TAP data. Moreover, we only consider complexes with at least 3 proteins as suggested by Nepusz et al. [15].
Parameter settings
There are two parameters K and λ in our model, where K is the number of possible complexes and λ controls the effect of the Laplacian regularizer. Since we usually do not have any prior knowledge about the number of complexes in real-world situations, it is hard to decide the value of K. Fortunately, we have introduced a graph regularization term to force proteins with high co-complex similarity scores to have similar memberships. By controlling the effect of this regularization term, we may be able to filter out the irrelevant dimensions of θ. If so, we can fit our model with a large value of K, as our model is able to determine the number of complexes adaptively. To test how these two parameters affect the performance of our model, we have performed sensitivity studies. In particular, we consider all combinations of the following values: {1500,2000,2500} for K and {2^{−5},2^{−4},…,2^{7}} for λ, and assess how well the complexes predicted by our model match the reference sets.
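For reference, the search space described above is small enough to enumerate explicitly. The snippet below only builds the grid of (K, λ) pairs; the variable names are illustrative, and fitting and scoring the model at each point are left out.

```python
from itertools import product

K_values = [1500, 2000, 2500]
lambda_values = [2.0 ** e for e in range(-5, 8)]   # 2^-5, 2^-4, ..., 2^7

grid = list(product(K_values, lambda_values))      # every (K, lambda) combination
print(len(grid))                                   # 3 values of K x 13 values of lambda
```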
Similarity network fusion (SNF) vs. matrix averaging
In the experiments, the consensus matrices are built by integrating various base clustering results from PPI data and TAP data. In particular, 11 state-of-the-art approaches are applied to PPI data to generate complexes, including CFinder [24], CMC [50], COACH [26], ClusterONE [15], DPClus [51], IPCA [52], MCL [8], MCODE [9], RNSC [25], RRW [53] and SPICi [54]. In this study, optimal parameters are set for CFinder, CMC, COACH, DPClus, IPCA, MCL, MCODE, RRW and SPICi to generate their best results, while ClusterONE and RNSC use the default parameters set by their authors. For detailed parameter settings of these algorithms, please refer to Additional file 1. The consensus matrix based on these 11 base clustering solutions is denoted as P. We also collect the complexes predicted from TAP data by 5 existing methods, including BT [29], C2S [30], CACHET [31], Hart [27] and Pu [28]. Protein complexes predicted by these 5 methods are downloaded from http://www.ntu.edu.sg/home/zhengjie/data/InteHC/. The consensus matrix based on these 5 solutions for TAP data is denoted as T. In addition, P+T denotes the combination of the two consensus matrices P and T. SNF is then applied to integrate the C2S matrix with the consensus matrices (e.g., P, T and P+T). A natural alternative way to integrate these matrices is to take their average, and we denote this method as Matrix Averaging. Next, we take Matrix Averaging as the baseline and compare it with the SNF method.
Moreover, we have two observations by comparing the performance of different consensus matrices as shown in Fig. 3.
Firstly, when integrated with the C2S matrix via SNF, the consensus matrix P performs much better than T. For example, on the CYC2008 reference set, C2S+P and C2S+T obtain comparable Accuracy, while C2S+P has a higher FRAC than C2S+T (0.770 for C2S+P vs. 0.706 for C2S+T). The rationale behind this finding may be that T is redundant with C2S to some extent (both are derived from TAP data), while P complements C2S well (PPI and TAP) to achieve better performance.
Secondly, adding T to C2S+P yields C2S+P+T, which achieves better performance than C2S+P. Compared with C2S+P on CYC2008, the Accuracy of C2S+P+T increases by a relative 1.7 % from 0.763 to 0.776, while its FRAC increases by a relative 5.58 % from 0.770 to 0.813. As shown in Additional file 1, both Accuracy and FRAC of C2S+P+T also improve on the SGD benchmark complexes, i.e., the Accuracy improves by 4.1 % from 0.711 to 0.740 and the FRAC increases by 9.4 % from 0.678 to 0.742. Overall, C2S+P+T performs better than C2S+P and C2S+T, and TINCD hereafter refers to the clustering over C2S+P+T.
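The percentages quoted above are relative gains over the C2S+P score, not differences in percentage points. A short check of the arithmetic, using only the scores reported in the text (the helper name is illustrative):

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# (benchmark, metric, C2S+P score, C2S+P+T score) as reported in the text
reported = [
    ("CYC2008", "Accuracy", 0.763, 0.776),
    ("CYC2008", "FRAC", 0.770, 0.813),
    ("SGD", "Accuracy", 0.711, 0.740),
    ("SGD", "FRAC", 0.678, 0.742),
]
for benchmark, metric, before, after in reported:
    print(benchmark, metric, round(relative_gain(before, after), 2))
```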
Clustering the integrated matrix
Once we have obtained the integrated matrix (i.e., C2S+P+T), we can apply various clustering methods to generate protein complexes in our framework, e.g., Nonnegative Matrix Factorization (NMF) and Agglomerative Hierarchical Clustering (HC). Since the integrated matrix corresponds to a weighted network and only a few methods can deal with large-scale weighted networks, in this section we compare our proposed graph regularized doubly stochastic matrix decomposition model with NMF, HC, ClusterONE and SPICi. All four of these algorithms are able to detect complexes from weighted PPI networks directly and output the results in a reasonable time. In particular, NMF is a popular clustering algorithm which can be related to a generalized form of many clustering methods (e.g., kernel K-means clustering and spectral clustering) [55]. In this study, NMF is solved by DTU:Toolbox [56] via the multiplicative update method. HC first considers all singleton proteins as initial clusters and then iteratively merges the two clusters with the highest similarity. The algorithm terminates when the quality function of the detected clusters reaches its maximal value. Similar to [17], three quality functions are used to measure the quality of a set of clusters, and the corresponding results are denoted by HCQ1, HCQ2 and HCQ3 respectively. For more details about these three quality functions, please refer to [17]. For a fair comparison, parameters are tuned so that these algorithms generate their best results: for NMF, the number of clusters is chosen from 1000 to 2000 in increments of 100; for SPICi, the density threshold ranges from 0.1 to 1 in increments of 0.1; ClusterONE uses the default parameters set by its authors.
Comparisons with base clustering solutions
As introduced above, we collected 16 base solutions (11 for PPI data and 5 for TAP data) to generate protein complexes. Next, we compare TINCD with these 16 base solutions in terms of their Accuracy and FRAC over 3 benchmark complex sets.
Table 1 Comparison between TINCD and state-of-the-art methods with respect to CYC2008
Methods  No. of complexes  No. of covered proteins  Accuracy  FRAC 
TINCD  1562  5846  0.776  0.813 
ECBNMF  457  2105  0.751  0.677 
CMBI  618  1041  0.459  0.349 
InteHC  684  3400  0.748  0.634 
CFinder  245  2008  0.518  0.319 
CMC  562  1651  0.643  0.655 
COACH  746  1838  0.650  0.664 
ClusterONE  342  1366  0.584  0.438 
DPClus  651  2140  0.639  0.680 
IPCA  816  1621  0.617  0.575 
MCL  600  4101  0.644  0.536 
MCODE  108  666  0.485  0.311 
RNSC  541  2095  0.619  0.506 
RRW  248  1174  0.571  0.511 
SPICi  412  2113  0.607  0.502 
BT  409  1286  0.728  0.591 
C2S  1035  4500  0.761  0.664 
CACHET  449  964  0.674  0.553 
Hart  390  1307  0.720  0.600 
Pu  400  1504  0.732  0.579 
An observation in Table 1 is that 5 base solutions for TAP data are much better than those 11 base solutions for PPI data. The consensus matrix P generated by these weaker base solutions for PPI data, however, performs much better than T as shown in Fig. 3. This observation highlights once again that the consensus matrix P from PPI data is a good complement to C2S score matrix for protein complex detection.
Comparison with ensemble clustering
In Fig. 5, TINCD achieves higher Accuracy than ECBNMF (0.776 for TINCD vs. 0.751 for ECBNMF). In addition, TINCD achieves a FRAC of 0.813, which is a relative 20.09 % higher than that of ECBNMF (0.677). Hence, TINCD outperforms the ensemble clustering method ECBNMF in terms of both Accuracy and FRAC (similar results obtained with respect to the MIPS and SGD benchmarks are shown in Additional file 1: Table S1).
Comparison with data integration techniques
In addition to ensemble clustering techniques which integrate clustering results, another type of integrative techniques aims to integrate diverse data sources for protein complex detection. For example, CMBI integrates PPI data, gene expression profiles and essential protein information to detect protein complexes, while InteHC integrates PPI data, TAP data, gene expression profiles and gene ontology annotations for protein complex prediction. Next, we compare our TINCD with data integration techniques CMBI and InteHC. Protein complexes predicted by CMBI and InteHC are downloaded from http://www.ntu.edu.sg/home/zhengjie/data/InteHC/.
InteHC integrates various data sources and utilizes some supervision information to assign them different weights according to their importance. Among the raw data sources, TINCD integrates only the C2S scores with the consensus matrices, and does so in an unsupervised manner, which makes it more widely applicable. The overall better results achieved by TINCD in the more challenging unsupervised setting demonstrate that TINCD is able to achieve a better FRAC (through two-layer integration) while maintaining a high Accuracy. In the future, it would be promising to integrate more data sources (e.g., gene ontology annotations) into our TINCD framework.
A case study: the FBP degradation complex
Discussions and conclusions
In this work, we have proposed a novel two-layer integration framework, TINCD, to identify protein complexes. First, TINCD constructs consensus matrices for proteins and measures their co-complex propensity based on the complex knowledge discovered by various graph clustering results. Second, a similarity network fusion (SNF) strategy is employed by TINCD to combine the consensus matrices and the score matrix obtained from TAP data into a final co-complex score matrix. Finally, a novel graph regularized doubly stochastic matrix decomposition model is proposed to detect overlapping protein complexes from the final score matrix.

Our TINCD model, leveraging the information from both the clustering results and raw data sources, generates more comprehensive and reliable results.

The similarity network fusion (SNF) model, integrating heterogeneous matrices into a final co-complex score matrix, is free of scale and robust to the noise in the data.

The graph regularized doubly stochastic matrix decomposition model considers the sparse similarities and thus ensures relatively balanced clusters.

TINCD generates the overlapping protein complexes, which clearly reflect the biological reality on proteins’ multifunctional roles.

Finally, TINCD is unsupervised and is thus generic enough for the integration of different types of data sources.
The computational complexity for updating θ in Algorithm 1 is O(EK+NK), where E is the number of nonzero items in W, N is the number of proteins in the data and K is the predefined number of complexes. Therefore, the overall time cost of the graph regularized doubly stochastic matrix decomposition model is O((E+N)K·Iter), where Iter is the number of iterations. In the experiments, we implement the algorithm using Matlab on a laptop with Intel 2 CPU (2.10 GHz × 2) and 12 GB RAM. The time cost of calculating the final co-complex score matrix is at most 785 seconds (since the efficiency of SNF has been discussed in [42], we do not discuss its computational complexity here). Each update of θ costs at most 21 seconds, so the entire estimation takes less than 4,200 seconds (200 × 21) when the maximum number of iterations is set to 200. Admittedly, TINCD has a higher computational cost than some base solutions. However, we consider the running time of TINCD still affordable for the following reasons. First, our primary task is to predict protein complexes with better accuracy and coverage. To achieve this goal, we integrate multiple data sources for clustering, and a higher computational cost is a reasonable price to pay. Second, as discussed in [40], in the context of understanding and exploiting the structure of PPI networks, cluster analysis is usually used as an “offline” process to provide a comprehensive set of clustering results. It is thus acceptable that “offline” processes have longer running times. Third, although PPI data has indeed been growing in recent years, the computing power of hardware (e.g., multiple CPU cores) is also developing rapidly. Moreover, we can also consider parallelizing the integration process for speedup.
As future work, we plan to design an algorithm that can incorporate other data sources (e.g., functional or structural information of proteins) [34] in addition to protein interaction data for protein complex detection. We expect higher prediction accuracy from considering more features of proteins.
Availability of supporting data
The datasets supporting the results of this article are included within its additional files. All the experimental results and code can be downloaded from https://github.com/OylCityU/TINCD.
Declarations
Acknowledgements
This work is supported by the National Science Foundation of China [11171354, 61375033, 61532008 and 61402190 to LOY, DQD, XFZ], the Ministry of Education of China [20120171110016 to LOY, DQD, XFZ], the Natural Science Foundation of Guangdong Province [S2013020012796 to LOY, DQD, XFZ], the Self-determined Research Funds of CCNU from the colleges’ basic research and operation of MOE [CCNU15A05039 and CCNU15ZD011 to XFZ] and City University of Hong Kong (Project 9610034).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 Mitra K, Carvunis AR, Ramesh SK, Ideker T. Integrative approaches for finding modular structure in biological networks. Nat Rev Genet. 2013; 14(10):719–32.PubMed CentralView ArticlePubMedGoogle Scholar
 Li X, Wu M, Kwoh CK, Ng SK. Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 2010; 11(Suppl 1):3.View ArticleGoogle Scholar
 Clancy T, Hovig E. From proteomes to complexomes in the era of systems biology. Proteomics. 2014; 14(1):24–41.View ArticlePubMedGoogle Scholar
 Brohée S, Van Helden J. Evaluation of clustering algorithms for proteinprotein interaction networks. BMC Bioinformatics. 2006; 7(1):488.PubMed CentralView ArticlePubMedGoogle Scholar
 Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, SupertiFurga G. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631–6.View ArticlePubMedGoogle Scholar
 Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive twohybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001; 98(8):4569–574.PubMed CentralView ArticlePubMedGoogle Scholar
 Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, et al.Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002; 415(6868):141–7.View ArticlePubMedGoogle Scholar
 Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for largescale detection of protein families. Nucleic Acids Res. 2002; 30(7):1575–84.PubMed CentralView ArticlePubMedGoogle Scholar
 Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003; 4(1):2.PubMed CentralView ArticlePubMedGoogle Scholar
 Wang J, Li M, Deng Y, Pan Y. Recent advances in clustering methods for protein interaction networks. BMC Genomics. 2010; 11(Suppl 3)(Suppl 3):10.Google Scholar
 Wang J, Li M, Chen J, Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE/ACM Trans Comput Biol Bioinformatics (TCBB). 2011; 8(3):607–20.View ArticleGoogle Scholar
 Tang X, Wang J, Liu B, Li M, Chen G, Pan Y. A comparison of the functional modules identified from time course and static ppi network data. BMC Bioinformatics. 2011; 12(1):339.PubMed CentralView ArticlePubMedGoogle Scholar
 Li M, Wu X, Wang J, Pan Y. Towards the identification of protein complexes and functional modules by integrating ppi network and gene expression data. BMC Bioinformatics. 2012; 13(1):109.PubMed CentralView ArticlePubMedGoogle Scholar
 Becker E, Robisson B, Chapple CE, Guénoche A, Brun C. Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics. 2012; 28(1):84–90.PubMed CentralView ArticlePubMedGoogle Scholar
 Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in proteinprotein interaction networks. Nat Methods. 2012; 9(5):471–2.PubMed CentralView ArticlePubMedGoogle Scholar
 Zhang XF, Dai DQ, OuYang L, Wu MY. Exploring overlapping functional units with various structure in protein interaction networks. PLoS ONE. 2012; 7(8):43092.View ArticleGoogle Scholar
 Wu M, Xie Z, Li X, Kwoh CK, Zheng J. Identifying protein complexes from heterogeneous biological data. Proteins Struct Funct Bioinformatics. 2013; 81(11):2023–33. doi:10.1002/prot.24365.View ArticleGoogle Scholar
 OuYang L, Dai DQ, Zhang XF. Protein complex detection via weighted ensemble clustering based on bayesian nonnegative matrix factorization. PLoS ONE. 2013; 8(5):62158.
 OuYang L, Dai DQ, Li XL, Wu M, Zhang XF, Yang P. Detecting temporal protein complexes from dynamic protein-protein interaction networks. BMC Bioinformatics. 2014; 15(1):335.
 Zhang Y, Lin H, Yang Z, Wang J. Integrating experimental and literature protein-protein interaction data for protein complex prediction. BMC Genomics. 2015; 16(Suppl 2):4.
 OuYang L, Dai DQ, Zhang XF. Detecting protein complexes from signed protein-protein interaction networks. IEEE/ACM Trans Comput Biol Bioinformatics (TCBB). 2015; 12(6):1333–44. doi:10.1109/TCBB.2015.2401014.
 Rajagopala SV, Sikorski P, Kumar A, Mosca R, Vlasblom J, Arnold R, FrancaKoh J, Pakala SB, Phanse S, Ceol A, et al. The binary protein-protein interaction landscape of escherichia coli. Nat Biotechnol. 2014; 32(3):285–90.
 Teng B, Zhao C, Liu X, He Z. Network inference from ap-ms data: computational challenges and solutions. Brief Bioinformatics. 2014; bbu038. doi:10.1093/bib/bbu038. http://bib.oxfordjournals.org/content/early/2014/11/05/bib.bbu038.full.pdf+html.
 Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T. Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006; 22(8):1021–3.
 King A, Pržulj N, Jurisica I. Protein complex prediction via cost-based clustering. Bioinformatics. 2004; 20(17):3013–20.
 Wu M, Li X, Kwoh CK, Ng SK. A core-attachment based method to detect protein complexes in ppi networks. BMC Bioinformatics. 2009; 10(1):169.
 Hart GT, Lee I, Marcotte EM. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics. 2007; 8(1):236.
 Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak SJ. Identifying functional modules in the physical interactome of saccharomyces cerevisiae. Proteomics. 2007; 7(6):944–60.
 Friedel CC, Krumsiek J, Zimmer R. Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. J Comput Biol. 2009; 16(8):971–87.
 Xie Z, Kwoh CK, Li XL, Wu M. Construction of co-complex score matrix for protein complex prediction from ap-ms data. Bioinformatics. 2011; 27(13):159–66.
 Wu M, Li XL, Kwoh CK, Ng SK, Wong L. Discovery of protein complexes with core-attachment structures from tandem affinity purification (tap) data. J Comput Biol. 2012; 19(9):1027–42.
 Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FC, Weissman JS, Krogan NJ. Toward a comprehensive atlas of the physical interactome of saccharomyces cerevisiae. Mol Cell Proteomics. 2007; 6(3):439–50.
 Zhang B, Park BH, Karpinets T, Samatova NF. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics. 2008; 24(7):979–86.
 Wu M, Li X, Chua HN, Kwoh CK, Ng SK. Integrating diverse biological and computational sources for reliable protein-protein interactions. BMC Bioinformatics. 2010; 11(Suppl 7):8.
 Li XL, Foo CS, Ng SK. Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. In: International Conference on Computational Systems Bioinformatics (CSB). San Diego: World Scientific: 2007. p. 157–68.
 Ulitsky I, Shamir R. Identification of functional modules using network topology and high-throughput data. BMC Syst Biol. 2007; 1(1):8. doi:10.1186/1752-0509-1-8.
 Tang X, Wang J, Pan Y. Predicting protein complexes via the integration of multiple biological information. In: IEEE 6th International Conference on Systems Biology (ISB). Xian, China: IEEE: 2012. p. 174–9.
 Song J, Singh M. How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics. 2009; 25(23):3143–50.
 Asur S, Ucar D, Parthasarathy S. An ensemble framework for clustering protein–protein interaction networks. Bioinformatics. 2007; 23(13):29–40.
 Greene D, Cagney G, Krogan N, Cunningham P. Ensemble nonnegative matrix factorization methods for clustering protein-protein interactions. Bioinformatics. 2008; 24(15):1722–8. doi:10.1093/bioinformatics/btn286.
 Lancichinetti A, Fortunato S. Consensus clustering in complex networks. Sci Rep. 2012; 2:336.
 Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, HaibeKains B, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014; 11(3):333–7.
 Yang Z, Oja E. Clustering by low-rank doubly stochastic matrix decomposition. In: Proceedings of the 29th International Conference on Machine Learning (ICML12). Edinburgh, Scotland: JMLR: 2012. p. 831–8.
 Cai D, He X, Han J, Huang TS. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2011; 33(8):1548–60.
 Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004; 32(suppl 1):449–51.
 Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, PeregrínAlvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MHY, Butland G, AltafUl AM, Kanaya S, Shilatifard A, O’Shea E, Weissman JS, Ingles CJ, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF. Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature. 2006; 440(7084):637–43.
 Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825–31.
 Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, Warfsmann J, Ruepp A. Mips: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004; 32(suppl 1):41–4.
 Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al. Sgd: Saccharomyces genome database. Nucleic Acids Res. 1998; 26(1):73–9.
 Liu G, Wong L, Chua HN. Complex discovery from weighted ppi networks. Bioinformatics. 2009; 25(15):1891–7.
 AltafUlAmin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006; 7(1):207.
 Li M, Chen JE, Wang JX, Hu B, Chen G. Modifying the dpclus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics. 2008; 9(1):398.
 Macropol K, Can T, Singh AK. Rrw: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics. 2009; 10(1):283.
 Jiang P, Singh M. Spici: a fast clustering algorithm for large biological networks. Bioinformatics. 2010; 26(8):1105–11.
 Ding C, He X, Simon HD. On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proc. SIAM Data Mining Conf. California: SIAM: 2005. p. 606–10.
 Schmidt MN, Laurberg H. Nonnegative matrix factorization with gaussian process priors. Comput Intell Neurosci. 2008; 2008:3.