Detecting overlapping protein complexes based on a generative model with functional and topological properties
 XiaoFei Zhang^{1, 2, 3},
 DaoQing Dai^{2},
 Le OuYang^{2} and
 Hong Yan^{3}
https://doi.org/10.1186/1471-2105-15-186
© Zhang et al.; licensee BioMed Central Ltd. 2014
Received: 7 March 2014
Accepted: 9 June 2014
Published: 13 June 2014
Abstract
Background
Identification of protein complexes can help us better understand cellular mechanisms. With the increasing availability of large-scale protein-protein interaction (PPI) data, numerous computational approaches have been proposed to detect complexes from PPI networks. However, most current approaches do not consider overlaps among complexes or the functional annotation of individual proteins. Therefore, they may not reflect the biological reality faithfully or make full use of the available domain-specific knowledge.
Results
In this paper, we develop a Generative Model with Functional and Topological Properties (GMFTP) to describe the generative processes of the PPI network and the functional profile. The model provides a working mechanism for capturing the interaction structures and the functional patterns of proteins. By combining the functional and topological properties, we formulate the problem of identifying protein complexes as that of detecting a group of proteins which frequently interact with each other in the PPI network and have similar annotation patterns in the functional profile. Using the idea of link communities, our method naturally deals with overlaps among complexes. The benefits brought by the functional properties are demonstrated by real data analysis. The results, evaluated using four criteria with respect to two gold standards, show that GMFTP performs competitively against state-of-the-art approaches. The effectiveness of detecting overlapping complexes is also demonstrated by analyzing the topological and functional features of multi- and mono-group proteins.
Conclusions
Based on the results obtained in this study, GMFTP proves to be a powerful approach for the identification of overlapping protein complexes using both the PPI network and the functional profile. The software can be downloaded from http://mail.sysu.edu.cn/home/stsddq@mail.sysu.edu.cn/dai/others/GMFTP.zip.
Background
Detecting protein complexes, which is crucial for elucidating the structural and functional architecture of cells, has attracted a lot of attention in recent years. Well-known experimental methods such as tandem affinity purification with mass spectrometry [1] and protein-fragment complementation assay [2], although effective, suffer from low efficiency, low coverage, and bias [3].
Owing to the development of high-throughput techniques, a large number of physical protein-protein interactions (PPIs) have been generated and accumulated, which paves the way for establishing or reconstructing PPI networks [4, 5]. Two proteins that interact with each other in such a network provide evidence that they may belong to a common protein complex. This intuition inspires us to split the whole network into groups, with more links within each group and fewer links between different groups, to reveal its intrinsic structure and global organization in terms of protein complexes. Recently, numerous computational approaches relying on different strategies (e.g., graph clustering [6], community detection [7, 8]) have been proposed to detect complexes from the PPI network [3, 9–16]. However, those methods inevitably have shortcomings, since they use only the network topology.
To reduce the negative effect of spurious interactions, several researchers have tried to incorporate functional information into the complex detection process. These approaches fall mainly into two categories: preprocess-based [22–25] and postprocess-based [26, 27]. The main idea of the former category is to design a functional semantic similarity measure to weight the strengths of protein-protein interactions, and then use a graph clustering algorithm to detect complexes from the weighted PPI network. This requires clustering algorithms that can handle weighted networks. However, only a few network clustering algorithms can handle weights and overlaps simultaneously [17, 28–32]. Furthermore, their performance depends on how the semantic measure used to assign the weights is defined, which itself has many open problems [33]. The postprocess-based approaches use some metrics to quantify the functional homogeneity of complex candidates detected by graph clustering algorithms, and then discard candidates with low reliability. They do not make full use of the available functional annotations, since such information is excluded during the complex candidate detection process. Recently, Zhang et al. mapped the topological and functional features into a unified distance measure by constructing an ontology augmented network, but they did not address the overlap problem [34].
As an alternative, we couple the functional profile with the network topology to detect overlapping protein complexes. To this end, we resort to probabilistic models, which have been applied to analyze PPI networks [20, 21, 35–37]. Unlike previous models that account only for the generative process of the PPI network, we develop a new Generative Model with Functional and Topological Properties (GMFTP), which is governed by two latent variables. One is introduced to represent the degree to which proteins belong to complex(es). Following the idea of link communities [38, 39], we generate a complex type-related interaction between two proteins if they tend to belong to the same complex(es). This gives rise to overlaps in a natural way: a protein belongs to multiple complexes if it has more than one type of interaction. The other is used to represent the preferences of functions with which proteins in a complex associate. We generate an association between a protein and a function using these two model parameters. According to the introduced model, a complex is assumed to be a group of proteins which frequently interact with each other and have similar functional patterns. For a given PPI network and functional profile, we then transform the complex detection problem into a parameter estimation problem. We investigate the performance of our model using six yeast PPI networks and four categories of functional profiles. Experimental results show that the functional properties are able to improve the performance. Comparative experiments further demonstrate that our model not only performs better than the state-of-the-art approaches but is also capable of identifying proteins in multiple complexes.
Methods
A generative model with functional and topological properties
Before introducing our model, we first introduce some notation. We consider the functional and topological properties of N proteins. Each protein i has an annotation profile of fixed length C, F_{ i } = [F_{i1},…,F_{ i C }]^{ T } ∈ {0,1}^{ C }, where F_{ ic } = 1 if protein i is associated with function c, F_{ ic } = 0 otherwise, and C is the total number of functions considered. For convenience, we denote F = [F_{1},…,F_{ N }]^{ T } = [F_{ ic }] ∈ {0,1}^{N×C} as the functional profiles for all proteins. The PPI network is represented as an adjacency matrix A = [A_{ ij }] ∈ {0,1}^{N×N}, where A_{ ij } = 1 if proteins i and j are connected, A_{ ij } = 0 otherwise. We assume that there are K complexes. In the typical model-based clustering setting, the value of K is initially unknown and needs to be predetermined. Here we assume that the value is given first and address how to set it at the end of this section.
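The notation above can be made concrete with a small sketch that builds A and F from raw lists. The function name `build_inputs`, the toy identifiers, and the choice to ignore self-interactions are assumptions made for illustration only; they are not taken from the GMFTP software.

```python
def build_inputs(proteins, interactions, annotations, functions):
    """Return (A, F) as nested lists following the notation in the text.

    A is the N x N symmetric 0/1 adjacency matrix,
    F is the N x C 0/1 protein-function annotation matrix.
    """
    row = {p: i for i, p in enumerate(proteins)}
    col = {f: c for c, f in enumerate(functions)}
    n, c_total = len(proteins), len(functions)

    A = [[0] * n for _ in range(n)]
    for u, v in interactions:
        i, j = row[u], row[v]
        if i != j:                     # assumption: self-interactions are ignored
            A[i][j] = A[j][i] = 1      # undirected network, so A is symmetric

    F = [[0] * c_total for _ in range(n)]
    for p, f in annotations:
        F[row[p]][col[f]] = 1
    return A, F


# Toy data with hypothetical identifiers.
proteins = ["P1", "P2", "P3"]
functions = ["GO:a", "GO:b"]
A, F = build_inputs(
    proteins,
    interactions=[("P1", "P2"), ("P2", "P3")],
    annotations=[("P1", "GO:a"), ("P2", "GO:a"), ("P3", "GO:b")],
    functions=functions,
)
```

Here N = 3 and C = 2, so A is 3 × 3 and F is 3 × 2.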
GMFTP generates both the annotation F_{ ic } and the interaction A_{ ij } as follows. In a similar manner to that of [37, 39], a nonnegative parameter θ_{ ik } is introduced to represent the affinity of protein i belonging to complex k. A higher affinity score θ_{ ik } means that protein i is more likely to belong to complex k, and vice versa. Note that a protein may obtain high affinity scores on multiple complexes, so our model supports overlaps. Since proteins within the same complex(es) are typically associated with the same functions [40], for a given complex k, we introduce a nonnegative parameter ψ_{ kc } to represent the propensity that proteins in complex k are associated with function c. A higher score ψ_{ kc } means that proteins in complex k are more likely to be associated with function c, and vice versa. In effect, ψ_{ kc } represents the preferences of functions with which proteins in complex k are associated. We denote Θ = [θ_{ ik }] as the protein-complex affinity matrix and Ψ = [ψ_{ kc }] as the complex-function preference matrix.
By the definitions of θ_{ ik } and ψ_{ kc }, if protein i obtains a higher affinity score θ_{ ik } and complex k obtains a higher preference score ψ_{ kc }, protein i is more likely to be associated with function c, and vice versa. Then θ_{ ik }ψ_{ kc } can be taken as the likelihood that protein i is associated with function c in terms of complex k. Taking into account all the K complexes, we can assume ${\sum}_{k=1}^{K}{\theta}_{\mathit{\text{ik}}}{\psi}_{\mathit{\text{kc}}}$ to be the total likelihood that protein i is associated with function c. Then the association F_{ ic } between protein i and function c is independently generated by a Bernoulli distribution with success rate $\sigma \left({\sum}_{k=1}^{K}{\theta}_{\mathit{\text{ik}}}{\psi}_{\mathit{\text{kc}}}\right)$, where σ(x) = 1 − exp(−x) is a function which maps the input argument from [0,+∞) to [0,1), ensuring that the result is a valid probability.
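As a minimal sketch of this link function, the following computes σ(x) = 1 − exp(−x) and the resulting association probability for one protein and one function. The helper names `sigma` and `association_prob` and the toy numbers are illustrative assumptions.

```python
import math


def sigma(x):
    """Link function sigma(x) = 1 - exp(-x), mapping [0, inf) to [0, 1)."""
    return -math.expm1(-x)   # numerically stable form of 1 - exp(-x)


def association_prob(theta_i, psi_c):
    """P(F_ic = 1), given row i of Theta and column c of Psi (both length K)."""
    return sigma(sum(t * p for t, p in zip(theta_i, psi_c)))


# Protein i has affinity only to the second of K = 2 complexes,
# so the total likelihood is 0.0 * 1.5 + 2.0 * 0.5 = 1.0.
p = association_prob([0.0, 2.0], [1.5, 0.5])
```

A protein with zero affinity to every complex gets probability σ(0) = 0, and the probability approaches but never reaches 1 as the total likelihood grows.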
A protein complex in the PPI network is usually assumed to be a cohesively connected subnetwork which has many interactions within itself [41], hence two proteins which belong to the same complex(es) are likely to interact with each other. If two proteins i and j obtain high affinity scores θ_{ ik } and θ_{ jk }, they would be connected in complex k. We therefore assume that θ_{ ik }θ_{ jk } is the likelihood that proteins i and j are connected in terms of complex k, and that ${\sum}_{k=1}^{K}{\theta}_{\mathit{\text{ik}}}{\theta}_{\mathit{\text{jk}}}$ is the total likelihood that they interact in terms of all the K complexes. Then the interaction A_{ ij } between them is independently generated by a Bernoulli distribution with success probability $\sigma \left({\sum}_{k=1}^{K}{\theta}_{\mathit{\text{ik}}}{\theta}_{\mathit{\text{jk}}}\right)$. Here we use function σ(x) to map the likelihood to the probability.
It is well known that a protein usually belongs to one or several complexes, and that a protein complex tends to be responsible for (or be significantly enriched with) a given set of biological functions. This means that Θ and Ψ are essentially sparse. To model the sparsity property, we place an independent exponential prior with rate parameter λ over each element θ_{ ik } and ψ_{ kc }, which is similar to the sparsity-promoting prior in nonnegative sparse coding [42, 43]. The sparsity restriction may drive all elements in some columns of Θ and rows of Ψ to 0 simultaneously, and hence the corresponding irrelevant complexes will disappear automatically.
For a better understanding of our model, we illustrate the connection between the variables we use and the biology terms in Figure 2. Given hyperparameter λ, N proteins and C functional terms, the generative process of the functional profile and the PPI network with K complexes can be summarized as follows:

For each protein i and complex k, draw protein-complex affinity score θ_{ ik } ∼ Exp(λ) with probability: $P(\theta_{ik}\mid\lambda)=\lambda\exp\left(-\lambda\theta_{ik}\right),\quad\theta_{ik}\ge 0.$ (1)

For each complex k and function c, draw complex-function preference score ψ_{ kc } ∼ Exp(λ) with probability: $P(\psi_{kc}\mid\lambda)=\lambda\exp\left(-\lambda\psi_{kc}\right),\quad\psi_{kc}\ge 0.$ (2)

For each protein i and function c, sample their association value $F_{ic}\sim\text{Bernoulli}\left(\sigma\left(\sum_{k=1}^{K}\theta_{ik}\psi_{kc}\right)\right)$ with probability: $P(F_{ic}\mid\Theta,\Psi)=\left(\sigma\left(\sum_{k=1}^{K}\theta_{ik}\psi_{kc}\right)\right)^{F_{ic}}\left(1-\sigma\left(\sum_{k=1}^{K}\theta_{ik}\psi_{kc}\right)\right)^{1-F_{ic}}.$ (3)

For each pair of proteins i and j (i < j), sample their interaction value $A_{ij}\sim\text{Bernoulli}\left(\sigma\left(\sum_{k=1}^{K}\theta_{ik}\theta_{jk}\right)\right)$ with probability: $P(A_{ij}\mid\Theta)=\left(\sigma\left(\sum_{k=1}^{K}\theta_{ik}\theta_{jk}\right)\right)^{A_{ij}}\left(1-\sigma\left(\sum_{k=1}^{K}\theta_{ik}\theta_{jk}\right)\right)^{1-A_{ij}}.$ (4)
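The four generative steps can be sketched as a small forward sampler. This is an illustration of the process described above, not the authors' code; the function name `sample_gmftp`, the seed handling, and the toy sizes are assumptions.

```python
import math
import random


def sample_gmftp(n, c, k, lam, seed=0):
    """Draw (Theta, Psi, F, A) once from the generative process."""
    rng = random.Random(seed)

    def sigma(x):
        return -math.expm1(-x)   # 1 - exp(-x)

    # Steps 1 and 2: exponential priors with rate lam on every entry.
    theta = [[rng.expovariate(lam) for _ in range(k)] for _ in range(n)]
    psi = [[rng.expovariate(lam) for _ in range(c)] for _ in range(k)]

    # Step 3: Bernoulli protein-function associations.
    F = [[int(rng.random() < sigma(sum(theta[i][q] * psi[q][f] for q in range(k))))
          for f in range(c)] for i in range(n)]

    # Step 4: Bernoulli interactions for each pair i < j, mirrored to keep A symmetric.
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = sigma(sum(theta[i][q] * theta[j][q] for q in range(k)))
            A[i][j] = A[j][i] = int(rng.random() < p)
    return theta, psi, F, A


theta, psi, F, A = sample_gmftp(n=6, c=4, k=2, lam=4.0)
```

Since `random.expovariate` takes the rate (one over the mean), a larger λ yields smaller affinities on average and therefore sparser sampled networks and profiles.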
Model formulation and parameter estimation
Model formulation
In the previous section, we introduced a generative process of the functional profile and the PPI network. Each run of this process generates a sample of the protein-complex affinity parameter Θ, complex-function preference parameter Ψ, functional profile F and PPI network A. Given the hyperparameter λ, we can decompose the joint probability distribution over F, A, Θ, Ψ using the dependency relationships stated in the previous definition and encoded in Figure S1 (in Additional file 1) as follows:
and P(θ_{ ik }|λ), P(ψ_{ kc }|λ), P(F_{ ic }|Θ,Ψ), P(A_{ ij }|Θ) are defined in Equations (1)–(4), respectively. Considering the case that the functional profiles of some proteins are not available, we introduce S_{ i } to represent whether the functional profile of protein i is generated, where S_{ i } = 1 means the functional profile is generated, and S_{ i } = 0 otherwise.
where Θ ≥ 0 and Ψ ≥ 0 mean each element θ_{ ik } ≥ 0 and ψ_{ kc } ≥ 0.
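For orientation, the factorization implied by Equations (1)–(4) and the corresponding negative log-posterior can be sketched as follows. This is inferred from the definitions above, with the mask S_{ i } applied to the functional terms; the shorthand x_{ij} and y_{ic} is introduced here for readability, and the expression is not guaranteed to match the form or scaling of the paper's own numbered equations.

```latex
\begin{aligned}
P(F, A, \Theta, \Psi \mid \lambda)
  &= \prod_{i=1}^{N}\prod_{k=1}^{K} P(\theta_{ik}\mid\lambda)
     \prod_{k=1}^{K}\prod_{c=1}^{C} P(\psi_{kc}\mid\lambda)
     \prod_{i=1}^{N}\prod_{c=1}^{C} P(F_{ic}\mid\Theta,\Psi)^{S_i}
     \prod_{i<j} P(A_{ij}\mid\Theta), \\
-\log P(F, A, \Theta, \Psi \mid \lambda)
  &= -\sum_{i<j}\Big[A_{ij}\log\big(1-e^{-x_{ij}}\big) - (1-A_{ij})\,x_{ij}\Big] \\
  &\quad -\sum_{i=1}^{N} S_i \sum_{c=1}^{C}\Big[F_{ic}\log\big(1-e^{-y_{ic}}\big) - (1-F_{ic})\,y_{ic}\Big] \\
  &\quad + \lambda\sum_{i,k}\theta_{ik} + \lambda\sum_{k,c}\psi_{kc} + \text{const},
\end{aligned}
\qquad
x_{ij}=\sum_{k=1}^{K}\theta_{ik}\theta_{jk},\quad
y_{ic}=\sum_{k=1}^{K}\theta_{ik}\psi_{kc}.
```

The simplification uses 1 − σ(x) = exp(−x), so each absent interaction or annotation contributes a term linear in x_{ij} or y_{ic}, and the exponential priors contribute an L1-type penalty weighted by λ.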
Parameter estimation
Due to the limitation in space, we describe the details of the two updating formulae in Additional file 1.
Once Θ and Ψ are initialized, we update them according to Equations (11) and (12) alternately until a stopping criterion is satisfied. Since the objective function in Equation (10) is not convex, the final estimators of Θ and Ψ depend on their initial values. To mitigate the risk of converging to a poor local minimum, we repeat the entire updating procedure 100 times with random restarts and choose the result that gives the lowest value of the objective function as the final estimator. In our implementation, the iteration process is conducted until the relative change in objective value is less than 10^{-6}. To avoid the case that this process converges too slowly and requires excessive computing time, we also stop it if the number of iterations reaches 400.
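The restart and stopping logic described above can be sketched independently of the update rules themselves. In the sketch below, `init`, `update` and `objective` are hypothetical callables supplied by the caller; the toy problem uses a plain gradient step on a one-dimensional nonconvex function as a stand-in, not the updates of Equations (11) and (12).

```python
import random


def fit_with_restarts(init, update, objective, n_restarts=100,
                      max_iter=400, tol=1e-6, seed=0):
    """Run an iterative update from several random starts; keep the best result."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_restarts):
        params = init(rng)
        value = objective(params)
        for _ in range(max_iter):              # hard cap on iterations
            params = update(params)
            new_value = objective(params)
            rel_change = abs(value - new_value) / max(abs(value), 1e-12)
            value = new_value
            if rel_change < tol:               # relative change in objective
                break
        if value < best_value:                 # lowest objective wins
            best_params, best_value = params, value
    return best_params, best_value


# Toy nonconvex objective with two minima (stand-in for the real model).
def toy_objective(x):
    return (x * x - 1.0) ** 2 + 0.3 * x


def toy_update(x, step=0.05):
    gradient = 4.0 * x * (x * x - 1.0) + 0.3
    return x - step * gradient


best_x, best_val = fit_with_restarts(
    init=lambda rng: rng.uniform(-2.0, 2.0),
    update=toy_update,
    objective=toy_objective,
    n_restarts=20,
)
```

In the toy problem, starts in the right-hand basin end near x ≈ 0.96 with a higher objective, while starts in the left-hand basin reach the lower minimum near x ≈ −1.04; keeping the best of several restarts selects the latter.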
Protein complex detection
Here ${\mathrm{\Theta}}^{\star}=[{\theta}_{\mathit{\text{ik}}}^{\star}]$ is the protein-complex membership indication matrix in which ${\theta}_{\mathit{\text{ik}}}^{\star}=1$ represents that protein i is in the detected complex k and ${\theta}_{\mathit{\text{ik}}}^{\star}=0$ represents that protein i is not in complex k. We set τ = 0.2 experimentally, so that a protein cannot belong to more than 5 predicted complexes in our algorithm. Because the estimate may be a local minimum, a detected complex candidate may be composed of several isolated subnetworks. In this case, each connected subnetwork is regarded as a complex. We discard detected complexes which include fewer than three proteins.
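The post-processing steps can be sketched as follows. The thresholding rule is an assumption: each protein's affinities are normalized to sum to 1 and compared with τ, which is one rule consistent with the statement that τ = 0.2 allows at most 5 memberships per protein. The function name `extract_complexes` and the toy data are illustrative.

```python
def extract_complexes(theta, A, tau=0.2, min_size=3):
    """Threshold affinities, split candidates into connected parts, drop small ones."""
    n, k = len(theta), len(theta[0])
    complexes = []
    for q in range(k):
        # Assumed rule: protein i joins complex q if its normalized affinity >= tau.
        members = set()
        for i in range(n):
            total = sum(theta[i])
            if total > 0 and theta[i][q] / total >= tau:
                members.add(i)

        # Each connected subnetwork of a candidate is treated as its own complex.
        unvisited = set(members)
        while unvisited:
            start = unvisited.pop()
            component, stack = {start}, [start]
            while stack:
                u = stack.pop()
                for v in list(unvisited):
                    if A[u][v]:
                        unvisited.remove(v)
                        component.add(v)
                        stack.append(v)
            if len(component) >= min_size:     # discard complexes with < 3 proteins
                complexes.append(sorted(component))
    return complexes


# Toy example: complex 0 holds two disconnected triangles/paths; complex 1 holds one protein.
theta = [[1.0, 0.0]] * 6 + [[0.0, 1.0]]
A = [[0] * 7 for _ in range(7)]
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]:
    A[u][v] = A[v][u] = 1
complexes = extract_complexes(theta, A)
```

The single candidate for complex 0 is split into the two connected groups {0, 1, 2} and {3, 4, 5}, and the singleton candidate for complex 1 is discarded.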
Results
Data sets and evaluation methods
Two experimental yeast PPI data sets [4, 5], a combined computational interaction map [45], the yeast interactions derived from DIP [46] and the ones derived from BioGRID [47] are used to test the performance. We refer to them as the Gavin, Krogan, Collins, DIP and BioGRID data sets. The Krogan data set is used in two variants: the core data set (referred to as Krogan core) and the extended data set (referred to as Krogan extended). The Collins, Gavin, Krogan core and Krogan extended data sets include edge weights. We derive two variants of these four networks: a weighted version which includes the weights and an unweighted version which ignores the weights. As DIP (version April 6, 2013) and BioGRID (version 3.1.77) provide weights for only a low proportion of the interactions, we treat them as unweighted, following the method in [17]. The Gene Ontology (April 6, 2013) is used as the data source of functional properties [48]. Four categories of functional profiles (BP, CC, MF and total) are derived from the annotations of the three individual subontologies (biological process, cellular component, and molecular function) and the comprehensive annotation which concatenates those of all three subontologies. The gold standards of yeast protein complexes are derived from CYC2008 [19] and SGD [49]. For details, see Additional file 1.
We use four independent quality criteria, accuracy (ACC) [3], fraction of matched complexes (FRAC), maximum matching ratio (MMR) [17] and precision-recall score (PR) [40], to evaluate the detected complexes. The four metrics have complementary strengths since they evaluate the performance from different perspectives. Because the gold standard complexes are incomplete, we also test the functional homogeneity of predicted complexes in a similar way to [17] (Additional file 1).
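As an illustration of one of these criteria, the sketch below computes a fraction of matched complexes. The exact definitions used in this study are given in Additional file 1 and are not reproduced here; the overlap score |A ∩ B|² / (|A| |B|) and the 0.25 matching threshold are assumptions borrowed from common practice in complex-detection benchmarks.

```python
def overlap_score(a, b):
    """Overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(a), set(b)
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))


def frac_matched(reference, predicted, threshold=0.25):
    """Fraction of reference complexes matched by at least one predicted complex."""
    matched = sum(
        1 for r in reference
        if any(overlap_score(r, p) >= threshold for p in predicted)
    )
    return matched / len(reference)


# Toy data: the first reference complex is recovered, the second is not.
reference = [{1, 2, 3, 4}, {5, 6, 7}]
predicted = [{1, 2, 3}, {8, 9, 10}]
frac = frac_matched(reference, predicted)
```

In the toy data the first pair has overlap score 9/12 = 0.75, so one of the two reference complexes counts as matched.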
Effect of parameters
GMFTP includes two parameters which need to be tuned: K and λ. As discussed above, because of the sparse prior we can use a value of K that is higher than the real number of complexes. We therefore set K = 1000 for all six data sets. Next, we focus on examining the influence of λ, the hyperparameter of the prior distribution. We run GMFTP with various values of λ (λ ∈ {2^{-3},2^{-2},…,2^{6}}) and evaluate the quality of predicted complexes by matching them with the reference complexes.
For each PPI network and each category of functional profile, the ACC and PR scores are used to test whether λ has an effect on the performance. Overall, GMFTP obtains competitive ACC scores when λ ∈ [2^{-3},2^{3}] and optimal PR scores when λ ∈ [2^{2},2^{4}] for both gold standards (Figures S2–S7 in Additional file 1). We also test how the parameter affects the number of predicted complexes and covered proteins. The number of predicted complexes and the number of proteins clustered into corresponding complexes decrease with increasing λ (Figures S2–S7 in Additional file 1), which shows that λ is able to control the sparsity of our model. An example which illustrates how λ influences the number of detected complexes via merging small complexes into larger ones is shown in Figure S8 (in Additional file 1). Overall, we find that GMFTP performs competitively when λ = 4, and other optimized values may further improve the performance in some cases. To avoid evaluation bias and overestimation of the performance, we do not tune the parameter to a particular data set and set λ to 4 as the default value in the following experiments.
Effect of functional property
To investigate the benefit brought by incorporating functional information into the complex detection process, we compare the complexes predicted by GMFTP using only the PPI network to those using both the PPI network and the four categories of functional profiles. For the case of using only the PPI network, we set S_{ i } = 0 for all proteins and F as a zero matrix of size N × C. For brevity, we refer to the five cases as PPI only, PPI+BP, PPI+CC, PPI+MF and PPI+total, respectively.
Comparison with previous approaches using only topological property
Since most previous approaches detect complexes based solely on the PPI network, we concentrate on testing the effectiveness of GMFTP using only the topological property first. We compare it to a representative set of approaches: AP [11], CFinder [50], ClusterONE [17], Linkcomm [38], MCL [9], MCODE [10], MINE [51], SPICi [12] and SRMCL [32]. For the four algorithms (AP, ClusterONE, MCL and SPICi) which can handle weights, we implement them on both the weighted and the unweighted versions of the four networks (Collins, Gavin, Krogan core and Krogan extended) which include edge weights. For each algorithm, except ClusterONE, Linkcomm and SRMCL for which we use the default parameters as suggested by the authors, the parameters are deliberately selected in a similar way to [17]. The details are listed in Additional file 1. For all compared approaches, like GMFTP, we exclude complex candidates with size less than three. For GMFTP, we set F as a zero matrix and S_{ i } = 0 for all proteins in this experiment. We do not tune the parameters of GMFTP and set K = 1000, λ = 4 for all datasets.
Benchmark results using solely the unweighted PPI network with respect to the CYC2008 gold standard
Network  Algorithm  Coverage  # Complexes  FRAC  MMR  ACC  PR
Collins  GMFTP  1168  179  0.868  0.591  0.765  0.593
Collins  AP  1363  207  0.697  0.785  0.497  0.444
Collins  CFinder  1161  114  0.653  0.439  0.693  0.440
Collins  ClusterONE  1293  203  0.847  0.571  0.775  0.564
Collins  Linkcomm  1126  407  0.903  0.646  0.744  0.456
Collins  MCL  1178  187  0.840  0.537  0.779  0.529
Collins  MCODE  853  115  0.743  0.496  0.730  0.593
Collins  MINE  1101  138  0.771  0.499  0.756  0.547
Collins  SPICi  958  124  0.708  0.448  0.728  0.570
Collins  SRMCL  1304  337  0.875  0.625  0.755  0.481
Gavin  GMFTP  1464  271  0.841  0.489  0.742  0.457
Gavin  AP  1815  274  0.667  0.659  0.346  0.310
Gavin  CFinder  1158  137  0.638  0.378  0.701  0.424
Gavin  ClusterONE  1624  294  0.783  0.449  0.725  0.391
Gavin  Linkcomm  1381  604  0.870  0.548  0.703  0.372
Gavin  MCL  1301  240  0.696  0.421  0.713  0.422
Gavin  MCODE  899  155  0.710  0.438  0.685  0.492
Gavin  MINE  1242  212  0.804  0.454  0.710  0.436
Gavin  SPICi  1008  184  0.746  0.434  0.697  0.478
Gavin  SRMCL  1750  735  0.819  0.539  0.701  0.327
Krogan core  GMFTP  1244  270  0.756  0.474  0.722  0.491
Krogan core  AP  2506  391  0.575  0.433  0.242  0.182
Krogan core  CFinder  1143  115  0.433  0.281  0.555  0.268
Krogan core  ClusterONE  2044  539  0.720  0.431  0.708  0.326
Krogan core  Linkcomm  962  425  0.701  0.460  0.675  0.428
Krogan core  MCL  1933  388  0.671  0.377  0.691  0.299
Krogan core  MCODE  640  95  0.463  0.268  0.583  0.406
Krogan core  MINE  937  157  0.616  0.359  0.664  0.450
Krogan core  SPICi  1249  224  0.628  0.356  0.689  0.409
Krogan core  SRMCL  2585  1833  0.884  0.575  0.686  0.197
Krogan extended  GMFTP  1197  265  0.652  0.398  0.692  0.469
Krogan extended  AP  3522  461  0.461  0.232  0.117  0.096
Krogan extended  CFinder  914  88  0.287  0.172  0.543  0.277
Krogan extended  ClusterONE  1114  239  0.481  0.296  0.633  0.407
Krogan extended  Linkcomm  1925  998  0.652  0.424  0.687  0.317
Krogan extended  MCL  2973  531  0.503  0.254  0.636  0.190
Krogan extended  MCODE  619  84  0.343  0.188  0.506  0.345
Krogan extended  MINE  902  162  0.564  0.316  0.650  0.451
Krogan extended  SPICi  1584  295  0.525  0.258  0.645  0.311
Krogan extended  SRMCL  3637  2644  0.702  0.431  0.617  0.154
DIP  GMFTP  1705  376  0.580  0.296  0.639  0.329
DIP  AP  4662  517  0.441  0.219  0.091  0.086
DIP  CFinder  635  75  0.263  0.119  0.453  0.297
DIP  ClusterONE  1402  346  0.429  0.227  0.554  0.280
DIP  Linkcomm  3396  1829  0.630  0.386  0.629  0.203
DIP  MCL  4007  609  0.451  0.234  0.628  0.173
DIP  MCODE  540  95  0.210  0.108  0.402  0.211
DIP  MINE  1135  260  0.536  0.268  0.585  0.333
DIP  SPICi  2103  403  0.455  0.228  0.583  0.245
DIP  SRMCL  4825  3222  0.674  0.376  0.583  0.141
BioGRID  GMFTP  2456  434  0.687  0.377  0.723  0.378
BioGRID  AP  5632  206  0.316  0.064  0.027  0.044
BioGRID  CFinder  1729  110  0.220  0.127  0.512  0.186
BioGRID  ClusterONE  2580  473  0.610  0.318  0.683  0.325
BioGRID  Linkcomm  4119  4446  0.678  0.459  0.701  0.243
BioGRID  MCL  3652  335  0.314  0.158  0.520  0.126
BioGRID  MCODE  1087  136  0.297  0.154  0.514  0.294
BioGRID  MINE  2414  409  0.576  0.308  0.663  0.304
BioGRID  SPICi  2756  501  0.483  0.261  0.652  0.281
BioGRID  SRMCL  5593  1097  0.496  0.273  0.594  0.143
Functional enrichment of the complexes detected using only the unweighted PPI network
Network  Algorithm  <E-15  E-15 to E-10  E-10 to E-5  E-5 to 1
Collins  GMFTP  33 (18.4%)  24 (13.4%)  60 (33.5%)  62 (34.6%)
Collins  Linkcomm  53 (13.0%)  59 (14.5%)  109 (26.8%)  186 (45.7%)
Collins  SRMCL  38 (11.3%)  36 (10.7%)  92 (27.3%)  171 (50.7%)
Gavin  GMFTP  29 (10.7%)  20 (7.4%)  54 (19.9%)  168 (62.0%)
Gavin  Linkcomm  29 (4.8%)  34 (5.6%)  112 (18.5%)  429 (71.0%)
Gavin  SRMCL  49 (6.7%)  29 (3.9%)  135 (18.4%)  522 (71.0%)
Krogan core  GMFTP  28 (10.4%)  22 (8.1%)  63 (23.3%)  157 (58.1%)
Krogan core  Linkcomm  24 (5.6%)  30 (7.1%)  114 (26.8%)  257 (60.5%)
Krogan core  SRMCL  80 (4.4%)  70 (3.8%)  264 (14.4%)  1419 (77.4%)
Krogan extended  GMFTP  29 (10.9%)  19 (7.2%)  57 (21.5%)  160 (60.4%)
Krogan extended  Linkcomm  30 (3.0%)  41 (4.1%)  158 (15.8%)  769 (77.1%)
Krogan extended  SRMCL  135 (5.1%)  86 (3.3%)  259 (9.8%)  2164 (81.8%)
DIP  GMFTP  36 (9.6%)  29 (7.7%)  68 (18.1%)  242 (64.5%)
DIP  Linkcomm  44 (2.4%)  63 (3.4%)  323 (17.7%)  1398 (76.5%)
DIP  SRMCL  174 (5.4%)  117 (3.6%)  398 (12.3%)  2533 (78.6%)
BioGRID  GMFTP  66 (15.2%)  38 (8.8%)  113 (26.0%)  217 (50.0%)
BioGRID  Linkcomm  217 (4.9%)  254 (5.7%)  1026 (23.1%)  2949 (66.3%)
BioGRID  SRMCL  166 (15.1%)  77 (7.0%)  210 (19.2%)  643 (58.7%)
Comparison with previous approaches using both functional and topological properties
To evaluate the advantage of GMFTP in incorporating functional annotation into the complex detection process, we compare its results with those of other approaches which also take functional properties into consideration. A popular framework on this topic can be divided into two steps: weighting the strengths of interactions using a semantic similarity measure, and then detecting complexes from the weighted PPI network using a graph clustering algorithm [22–25]. The main difference between these approaches lies in the similarity measures and clustering algorithms they use. Since there is no public software available for these approaches, we design a heuristic comparison. We employ three widely used measures (Jiang [52], Kappa [53] and Lin [54]) to weight the PPI network and apply four algorithms (AP, ClusterONE, MCL and SPICi) which can handle weights to detect complexes. The package csbl.go [55] is used to calculate the similarities between proteins, and the weights of interactions which involve unannotated proteins are set to 1. The parameter settings of the clustering algorithms are presented in Additional file 1. We also compare GMFTP to COAN [34], which considers GO slim annotations by constructing an ontology augmented network.
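The weighting step of this two-step framework can be sketched as below. The Jaccard index over GO term sets is used purely as a stand-in similarity; it is not the Jiang, Kappa or Lin measure, and it does not use csbl.go. The function names and toy annotations are hypothetical.

```python
def jaccard_similarity(terms_u, terms_v):
    """Stand-in similarity: Jaccard index of two GO term sets (not Jiang/Kappa/Lin)."""
    union = terms_u | terms_v
    return len(terms_u & terms_v) / len(union) if union else 0.0


def weight_network(edges, go_terms, similarity=jaccard_similarity):
    """Assign each interaction a weight from the functional similarity of its ends.

    Interactions involving an unannotated protein get weight 1, as in the text.
    """
    weights = {}
    for u, v in edges:
        if go_terms.get(u) and go_terms.get(v):
            weights[(u, v)] = similarity(go_terms[u], go_terms[v])
        else:
            weights[(u, v)] = 1.0
    return weights


# Toy data with hypothetical identifiers; P3 has no annotation.
go_terms = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:b", "GO:c"}}
weights = weight_network([("P1", "P2"), ("P2", "P3")], go_terms)
```

The resulting dictionary can then be passed to any clustering algorithm that accepts edge weights, which is the second step of the framework.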
Benchmark results using both the PPI network and the total GO annotation with respect to the CYC2008 gold standard
Network  Algorithm  GO sim  Coverage  # Complexes  FRAC  MMR  ACC  PR
Collins  GMFTP  –  1085  188  0.890  0.659  0.788  0.651
Collins  ClusterONE  Jiang  1210  169  0.868  0.575  0.784  0.579
Collins  ClusterONE  Kappa  1130  161  0.840  0.573  0.770  0.597
Collins  ClusterONE  Lin  1255  172  0.854  0.561  0.784  0.613
Gavin  GMFTP  –  1122  208  0.877  0.594  0.768  0.577
Gavin  ClusterONE  Jiang  1298  224  0.833  0.549  0.768  0.496
Gavin  ClusterONE  Kappa  1218  217  0.783  0.511  0.760  0.485
Gavin  ClusterONE  Lin  1461  253  0.804  0.514  0.757  0.449
Krogan core  GMFTP  –  1218  252  0.805  0.573  0.768  0.649
Krogan core  ClusterONE  Jiang  1516  341  0.847  0.540  0.756  0.445
Krogan core  ClusterONE  Kappa  1322  319  0.810  0.522  0.736  0.470
Krogan core  ClusterONE  Lin  1813  464  0.841  0.517  0.766  0.379
Krogan extended  GMFTP  –  1062  216  0.696  0.487  0.745  0.601
Krogan extended  ClusterONE  Jiang  2021  503  0.724  0.435  0.745  0.342
Krogan extended  ClusterONE  Kappa  1722  482  0.680  0.410  0.733  0.344
Krogan extended  ClusterONE  Lin  2458  752  0.713  0.425  0.730  0.281
DIP  GMFTP  –  1492  306  0.705  0.430  0.704  0.474
DIP  ClusterONE  Jiang  2910  733  0.741  0.406  0.714  0.299
DIP  ClusterONE  Kappa  2487  748  0.679  0.398  0.682  0.308
DIP  ClusterONE  Lin  3285  921  0.701  0.359  0.700  0.248
BioGRID  GMFTP  –  2283  413  0.750  0.474  0.754  0.448
BioGRID  ClusterONE  Jiang  3789  881  0.665  0.377  0.759  0.260
BioGRID  ClusterONE  Kappa  3303  889  0.691  0.398  0.717  0.283
BioGRID  ClusterONE  Lin  4208  1073  0.602  0.334  0.755  0.227
Detecting multifunctional proteins
It is well known that a protein may carry out different functions in different complexes. A desirable approach to complex detection should therefore be able to accommodate proteins that belong to more than one complex. Due to the absence of a reference set of bona fide multifunctional proteins, it is impractical to compare different approaches on this task directly. We instead test how well the set of multi-group proteins predicted by GMFTP matches those of the other methods which also handle overlaps (CFinder, ClusterONE, Linkcomm, MINE and SRMCL) and those of the two gold standards (CYC2008 and SGD). For GMFTP, we concentrate on the results of two cases (PPI only and PPI+total). For ClusterONE, we use the results of the two versions (weighted and unweighted) of the networks. A protein is regarded as a multi-group protein if it belongs to more than one predicted (or reference) complex, and as a mono-group protein if it belongs to only one predicted (or reference) complex. Overall, the multi-group proteins recovered by our model significantly (hypergeometric test, P-value ≤ 0.01) overlap with those of the other approaches and the gold standards (Additional file 4).
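The significance of such an overlap can be assessed with an upper-tail hypergeometric probability, sketched below with only the standard library. The function name, the argument names, and the choice of background population are assumptions; the paper does not state which protein universe was used.

```python
from math import comb


def hypergeom_pvalue(population, size_a, size_b, overlap):
    """P(X >= overlap) when size_a proteins are drawn from `population`,
    of which size_b are multi-group proteins according to the other method."""
    total = comb(population, size_a)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_b, x) * comb(population - size_b, size_a - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 proteins in total, 4 multi-group proteins from one method,
# 5 from another, and all 4 of the former are among the latter.
p_value = hypergeom_pvalue(population=10, size_a=4, size_b=5, overlap=4)
```

With these toy numbers the P-value is 5/210 ≈ 0.024; `math.comb` returns 0 when the lower argument exceeds the upper one, so infeasible terms drop out of the sum automatically.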
Discussion
The development of high-throughput experimental techniques and computational methods for delineating protein-protein interactions and predicting protein functions has produced rich interaction and functional knowledge of proteins. Recently, many studies have tried to group proteins into complexes in a given PPI network. However, the performance of approaches which use the topological property alone is limited not only by the poor quality of the underlying PPI network but also by the neglect of other available information such as the functional profile.
In our opinion, both topological and functional properties are meaningful and important for predicting protein complexes. We therefore develop a new algorithm which makes full use of them. Unlike previous approaches, we take an alternative view and propose a probabilistic model-based approach to combine these two types of properties in a natural and principled manner. Our method avoids the choice of semantic measures and naturally deals with overlaps. Owing to the superior performance and sound theoretical principle of GMFTP, we hope that our work can attract more attention to model-based methods for complex detection. Although generative models have been applied to study PPI networks, our model differs from the previous ones, since most of them focus only on the generative process of the network structure. To our knowledge, our model is one of the first to take the generative process of the functional profile into account.
One problem with considering functional properties is that the improvement in performance depends on the quality and completeness of the functional annotations in the database. It is well known that functional information is not always obtainable in practice [40]. From Equation (11), the complex(es) into which an uncharacterized protein will be clustered is determined only by the topological structure, which means our model can adaptively handle the case where the protein is not functionally characterized. Since GO terms in the cellular component subontology may provide some clues as to what complex(es) a protein may belong to, the functional property derived from this subontology may introduce biases and overestimate the performance. However, the effectiveness of our model has also been investigated in the other two subontologies. In practical applications, even if there may be some evaluation biases, we suggest combining the total GO annotations of all three subontologies to form a comprehensive functional profile to improve the performance, which works similarly to semi-supervised clustering in machine learning [56].
In general, model-based approaches are time-consuming and difficult to scale up. We now analyze the computational complexity of Equations (11) and (12). Each update of Θ takes O(KN(N + C)) time and each update of Ψ takes O(NKC) time. Therefore, the total time cost of GMFTP is O(KNT(N + C)), where T is the number of iterations. Given that real-world PPI networks and functional profiles are extremely sparse, the overall cost can be reduced to O(KT(N + C + E + R)), where E is the number of interactions and R is the number of functional associations (see Additional file 1). In the experiments, we implement the algorithm in Matlab on a workstation with Intel 4 CPU (3.40 GHz × 4) and 16 GB RAM. Each update costs at most 3.25 seconds and the entire estimation takes less than 1300 seconds when we set the maximum number of iterations to 400. This means that even though our approach may not be as fast as some local network clustering algorithms (e.g., SPICi), the time cost is still affordable. To avoid poor local minima, we repeat the updating process 100 times with random restarts. We acknowledge that this may not be a sufficient number of repetitions to ensure a globally optimal solution and that GMFTP would work better with more restarts. Instead of searching for the global minimum with millions of repetitions, we have focused on evaluating how the random initial conditions influence the stability of the results (see Additional file 1).
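The random-restart strategy described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `toy_local_search` is a hypothetical stand-in for one run of the update rules (it performs greedy descent on a simple one-dimensional objective with two local minima), and `best_of_restarts` is an assumed helper name. Neither reproduces Equations (11) and (12) or the authors' Matlab code.

```python
import random


def toy_local_search(rng, steps=200):
    """Hypothetical stand-in for one optimization run.

    Greedy descent on a non-convex 1-D objective from a random start.
    Returns (solution, objective_value).
    """
    def objective(x):
        # Two local minima: a lower one near x = -2, a higher one near x = 2.
        return (x * x - 4.0) ** 2 + x

    x = rng.uniform(-3.0, 3.0)
    step = 0.1
    for _ in range(steps):
        # Move to the best of the left neighbour, the current point
        # and the right neighbour; stays put once no neighbour is better.
        x = min((x - step, x, x + step), key=objective)
    return x, objective(x)


def best_of_restarts(run_once, n_restarts, seed=0):
    """Run a local optimizer from several random starts.

    Returns the run with the lowest objective together with all runs,
    so that the spread across restarts can also be inspected.
    """
    rng = random.Random(seed)
    results = [run_once(rng) for _ in range(n_restarts)]
    best = min(results, key=lambda r: r[1])
    return best, results


best, results = best_of_restarts(toy_local_search, 100, seed=0)
```

Keeping all runs in `results`, rather than only the best one, makes it possible to examine how much the outcome varies with the initial conditions.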
One perennial problem for model-based approaches is model selection, which here means determining the value of the parameter λ. In statistics, several model selection strategies are available [43]. A simple and widely adopted strategy is cross-validation. However, this strategy may not be applicable to network clustering, since removing a predefined fraction of proteins (or interactions) from a PPI network changes the topological structure, which amounts to adding noise rather than splitting the data set [17]. Another solution is to select the model according to a model selection criterion such as the Akaike information criterion or the Bayesian information criterion. The performance of this type of strategy varies with the choice of criterion. For simplicity and good performance, we first analyze how λ affects the performance and then set it to 4 in the comparative experiments. The model selection problem is left as an open question for future study.
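For reference, the two criteria mentioned above have the standard forms AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where L is the maximized likelihood, k the number of free parameters and n the number of observations. The Python sketch below only encodes these textbook definitions. How k and n should be counted for GMFTP is not specified in this paper, and the candidate values in the usage example are invented toy numbers, not results.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_lowest(scores):
    """Return the candidate label whose criterion value is smallest."""
    return min(scores, key=scores.get)


# Invented toy candidates: label -> (log-likelihood, number of parameters).
candidates = {"model_a": (-120.0, 10), "model_b": (-100.0, 12), "model_c": (-99.0, 30)}
aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
chosen = select_lowest(aic_scores)
```

In this toy example the second candidate is chosen because its gain in likelihood outweighs the penalty for two extra parameters, whereas the third candidate pays for many more parameters with almost no gain.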
Previous studies have shown that the quality of detected complexes can be improved if the weights of interactions are available [17]. Currently, our model is limited to unweighted networks and can be applied to weighted networks only after "binarizing" them, owing to the Bernoulli generative mechanism. In future work, we will investigate the generative process of weighted networks to make full use of the valuable information carried by the weights. In addition, the hierarchical relationships among GO terms are not used in our model. Intuitively, two proteins that share a low-level (or specific) GO function are more likely to belong to the same complex(es) than two proteins that share a high-level (or general) GO function. It would be useful to incorporate the specificity of GO terms into our model to further improve the performance.
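A minimal Python sketch of the "binarizing" step is given below, assuming that it is done by keeping every interaction whose weight reaches a cutoff. The paper does not state the rule to use, so the function name `binarize`, the threshold and the toy weights are all illustrative assumptions.

```python
def binarize(weighted_edges, threshold):
    """Turn a weighted interaction list into an unweighted one.

    weighted_edges: dict mapping a protein pair (u, v) to a weight.
    threshold: assumed cutoff; pairs with weight >= threshold are kept.
    Returns a sorted list of undirected edges with self-loops removed
    and each pair stored once as (smaller label, larger label).
    """
    kept = set()
    for (u, v), weight in weighted_edges.items():
        if u != v and weight >= threshold:
            kept.add((u, v) if u < v else (v, u))
    return sorted(kept)


# Toy weighted network with hypothetical protein labels and weights.
weighted = {("A", "B"): 0.9, ("B", "C"): 0.2, ("C", "A"): 0.5}
edges = binarize(weighted, threshold=0.5)
```

Any such cutoff discards the relative confidence of the retained interactions, which is the loss of information that a generative process for weighted networks would avoid.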
Conclusions
In this study, we have developed a new approach to protein complex detection based on a generative model of the protein-protein interaction network and the protein functional profile. Experimental results on six yeast networks show the competitive performance of our method in identifying both protein complexes and multi-functional proteins. The results also show the effect of the functional properties of proteins on complex detection, which suggests that functional annotation information should be used whenever it is available.
Declarations
Acknowledgements
We thank the associate editor and the anonymous reviewers for their helpful suggestions, which have improved this work. This work is supported by the National Science Foundation of China [11171354 and 61375033 to XFZ, DQD, LOY], the Ministry of Education of China [20120171110016 to XFZ, DQD, LOY], the Natural Science Foundation of Guangdong Province [S2013020012796 to XFZ, DQD, LOY], and City University of Hong Kong [9610308 to HY].
References
 Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Höfert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415 (6868): 141-147.
 Tarassov K, Messier V, Landry CR, Radinovic S, Molina MMS, Shames I, Malitskaya Y, Vogel J, Bussey H, Michnick SW: An in vivo map of the yeast protein interactome. Science. 2008, 320 (5882): 1465-1470.
 Li XL, Wu M, Kwoh CK, Ng SK: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics. 2010, 11 (Suppl 1): 3.
 Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, et al: Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006, 440 (7084): 631-636.
 Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006, 440 (7084): 637-643.
 Schaeffer SE: Graph clustering. Comput Sci Rev. 2007, 1 (1): 27-64.
 Fortunato S: Community detection in graphs. Phys Rep. 2010, 486 (3): 75-174.
 Newman M: Communities, modules and large-scale structure in networks. Nat Phys. 2012, 8 (1): 25-31.
 Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584.
 Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4 (1): 2.
 Frey BJ, Dueck D: Clustering by passing messages between data points. Science. 2007, 315 (5814): 972-976.
 Jiang P, Singh M: SPICi: a fast clustering algorithm for large biological networks. Bioinformatics. 2010, 26 (8): 1105-1111.
 Ren J, Wang J, Li M, Wang L: Identifying protein complexes based on density and modularity in protein-protein interaction network. BMC Syst Biol. 2013, 7 (4): 115.
 Wang J, Li M, Deng Y, Pan Y: Recent advances in clustering methods for protein interaction networks. BMC Genomics. 2010, 11 (Suppl 3): 10.
 Srihari S, Leong HW: A survey of computational methods for protein complex prediction from protein interaction networks. J Bioinform Comput Biol. 2013, 11 (02): 1230002.
 Ji J, Zhang A, Liu C, Quan X, Liu Z: Survey: Functional module detection from protein-protein interaction networks. IEEE Trans Knowl Data Eng. 2014, 26 (2): 261-277.
 Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012, 9 (5): 471-472.
 Becker E, Robisson B, Chapple CE, Guénoche A, Brun C: Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics. 2012, 28 (1): 84-90.
 Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009, 37 (3): 825-831.
 Kuchaiev O, Rašajski M, Higham DJ, Pržulj N: Geometric denoising of protein-protein interaction networks. PLoS Comput Biol. 2009, 5 (8): 1000454.
 Guimerà R, Sales-Pardo M: Missing and spurious interactions and the reconstruction of complex networks. Proc Natl Acad Sci U S A. 2009, 106 (52): 22073-22078.
 Lubovac Z, Gamalielsson J, Olsson B: Combining functional and topological properties to identify core modules in protein interaction networks. Proteins. 2006, 64 (4): 948-959.
 Cho YR, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics. 2007, 8 (1): 265.
 Wang J, Xie D, Lin H, Yang Z, Zhang Y: Filtering gene ontology semantic similarity for identifying protein complexes in large protein interaction networks. Proteome Sci. 2012, 10 (Suppl 1): 18.
 Hu A, Chan K: Utilizing both topological and attribute information for protein complex identification in PPI networks. IEEE/ACM Trans Comput Biol Bioinform. 2013, PP (99): 11.
 King AD, Pržulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics. 2004, 20 (17): 3013-3020.
 Li XL, Foo CS, Ng SK: Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. Comput Syst Bioinformatics Conf. 2007, 6: 157-168.
 Zhang S, Wang RS, Zhang XS: Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Phys Stat Mech Appl. 2007, 374 (1): 483-490.
 Farkas I, Ábel D, Palla G, Vicsek T: Weighted network modules. New J Phys. 2007, 9 (6): 180.
 Kalinka AT, Tomancak P: linkcomm: an R package for the generation, visualization, and analysis of link communities in networks of arbitrary size and type. Bioinformatics. 2011, 27 (14): 2011-2012.
 van Dongen S, Abreu-Goodger C: Using MCL to extract clusters from networks. Bacterial Molecular Networks. 2012, New York: Springer, 281-295.
 Shih YK, Parthasarathy S: Identifying functional modules in interaction networks through overlapping Markov clustering. Bioinformatics. 2012, 28 (18): 473-479.
 Guzzi PH, Mina M, Guerra C, Cannataro M: Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinformatics. 2012, 13 (5): 569-585.
 Zhang Y, Lin H, Yang Z, Wang J: Construction of ontology augmented networks for protein complex prediction. PLoS ONE. 2013, 8 (5): 62077.
 Airoldi EM, Blei DM, Fienberg SE, Xing EP: Mixed membership stochastic blockmodels. J Mach Learn Res. 2008, 9: 1981-2014.
 Zhang XF, Dai DQ, OuYang L, Wu MY: Exploring overlapping functional units with various structure in protein interaction networks. PLoS ONE. 2012, 7 (8): 43092.
 Zhang XF, Dai DQ, Li XX: Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (3): 857-870.
 Ahn YY, Bagrow JP, Lehmann S: Link communities reveal multiscale complexity in networks. Nature. 2010, 466 (7307): 761-764.
 Ball B, Karrer B, Newman M: Efficient and principled method for detecting communities in networks. Phys Rev E. 2011, 84 (3): 036103.
 Song J, Singh M: How and when should interactome-derived clusters be used to predict functional modules and protein function?. Bioinformatics. 2009, 25 (23): 3143-3150.
 Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci U S A. 2003, 100 (21): 12123-12128.
 Hoyer PO: Non-negative sparse coding. Proceedings of the 2002 12th IEEE Workshop on Neural Networks for Signal Processing, 2002. 2002, Piscataway: IEEE Press, 557-565.
 Murphy KP: Machine Learning: A Probabilistic Perspective. 2012, Cambridge: The MIT Press.
 Lee DD, Seung HS: Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst, vol. 13. 2001, Cambridge: The MIT Press, 556-562.
 Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FC, Weissman JS, Krogan NJ: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics. 2007, 6 (3): 439-450.
 Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004, 32 (suppl 1): 449-451.
 Chatr-aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L, Reguly T, Breitkreutz A, Sellam A, Chen D, Chang C, Rust J, Livstone M, Oughtred R, Dolinski K, Tyers M: The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2013, 41 (D1): 816-823.
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25 (1): 25-29.
 Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces genome database. Nucleic Acids Res. 1998, 26 (1): 73-79.
 Palla G, Derényi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005, 435 (7043): 814-818.
 Rhrissorrakrai K, Gunsalus KC: MINE: module identification in networks. BMC Bioinformatics. 2011, 12 (1): 192.
 Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of International Conference Research on Computational Linguistics (ROCLING X). 1997, Taiwan: arxiv, 19-33.
 Alvord G, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA: The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007, 8 (9): 183.
 Lin D: An information-theoretic definition of similarity. Proc Int Conf Mach Learn, vol. 1. 1998, San Francisco: Morgan Kaufmann, 296-304.
 Ovaska K, Laakso M, Hautaniemi S: Fast gene ontology based clustering for microarray experiments. BioData Min. 2008, 1 (1): 11.
 Chapelle O, Schölkopf B, Zien A: Semi-supervised Learning. 2006, Cambridge: The MIT Press.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.