Markov clustering versus affinity propagation for the partitioning of protein interaction graphs
© Vlasblom and Wodak; licensee BioMed Central Ltd. 2009
Received: 12 September 2008
Accepted: 30 March 2009
Published: 30 March 2009
Genome scale data on protein interactions are generally represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another. Since proteins tend to function in groups, or complexes, an important goal has been to reliably identify protein complexes from these graphs. This task is commonly executed using clustering procedures, which aim at detecting densely connected regions within the interaction graphs. There exists a wealth of clustering algorithms, some of which have been applied to this problem. One of the most successful clustering procedures in this context has been the Markov Cluster algorithm (MCL), which was recently shown to outperform a number of other procedures, some of which were specifically designed for partitioning protein interactions graphs. A novel promising clustering procedure termed Affinity Propagation (AP) was recently shown to be particularly effective, and much faster than other methods for a variety of problems, but has not yet been applied to partition protein interaction graphs.
In this work we compare the performance of the Affinity Propagation (AP) and Markov Clustering (MCL) procedures. To this end we derive an unweighted network of protein-protein interactions from a set of 408 protein complexes from S. cerevisiae hand curated in-house, and evaluate the performance of the two clustering algorithms in recalling the annotated complexes. In doing so the parameter space of each algorithm is sampled in order to select optimal values for these parameters, and the robustness of the algorithms is assessed by quantifying the level of complex recall as interactions are randomly added to or removed from the network to simulate noise. To evaluate the performance on a weighted protein interaction graph, we also apply the two algorithms to the consolidated protein interaction network of S. cerevisiae, derived from genome scale purification experiments, and to versions of this network in which varying proportions of the links have been randomly shuffled.
Our analysis shows that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the AP algorithm. The advantage of MCL over AP is dramatic for unweighted protein interaction graphs, as AP displays severe convergence problems on the majority of the unweighted graph versions that we tested, whereas MCL continues to identify meaningful clusters, albeit fewer of them, as the level of noise in the graph increases. MCL thus remains the method of choice for identifying protein complexes from binary interaction networks.
Protein-protein interactions play a key role in cellular processes and significant efforts are being devoted worldwide to characterizing such interactions on the scale of whole genomes (for review see [1, 2]). Genome scale data on protein interactions are typically obtained using experimental methods for detecting binary interactions [3, 4], or by affinity purifications of tagged proteins coupled to analytical methods for identifying the co-purified partners [5–7]. These data are in general represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another [8–10]. For a recent review of network analysis techniques as applied to protein interaction networks, see .
It is well known however that proteins tend to function in groups, or complexes, which in the yeast S. cerevisiae contain on average 4.7 different types of subunits [12, 13]. An important goal has therefore been to reliably identify protein complexes from the protein interaction graphs. This task is commonly carried out using graph clustering procedures, which aim at detecting densely connected regions within the interaction graphs.
Clustering is an unsupervised learning method that tackles the task of producing an intrinsic grouping of data elements on the basis of some metric (a 'distance' or similarity measure between elements). It requires solving an optimization problem, which is usually achieved with the help of heuristic algorithms whose ability to approximate the best solution (global minimum) may vary widely. Their application in the context of protein interaction networks encounters the additional problem of dealing with the significant level of background noise in these networks (e.g. spurious interactions that have no biological meaning). Dealing with a high level of noise is a major challenge for clustering procedures, as this requires mitigating the effect of noise by various means – for example by taking into account the topology properties of the network, either during the clustering process or by modifying the distance metric to incorporate such properties prior to clustering.
There exists a wealth of clustering algorithms of which hierarchical clustering (for review see [16, 17]) and K-means[18, 19] are classical examples. More recently a variety of other algorithms have been proposed, and some of these have been applied to the identification of highly connected nodes in protein interaction graphs[7, 21].
So far, one of the most successful clustering procedures used in deriving complexes from protein interaction networks seems to be the Markov Cluster algorithm (MCL). Unlike most hierarchical clustering procedures, this algorithm considers the connectivity properties of the underlying network. It has been used to derive complexes from protein interaction data in two recent comprehensive analyses of the yeast interactome [7, 21]. Furthermore, in a recent benchmark carried out by Brohée et al, MCL was shown to be especially effective for clustering protein interactions in that it possesses a high degree of noise-tolerance in comparison to other algorithms such as the Molecular Complex Detection (MCODE) and Super Paramagnetic Clustering (SPC).
Over a year ago, a novel promising clustering procedure termed Affinity Propagation (AP) was proposed . Affinity propagation identifies representative examples (exemplars) within the dataset by exchanging real-valued messages between all data points. Points are then grouped with their most representative exemplar to give the final set of clusters. AP was applied to a variety of problems including face recognition, and gene identification from putative exons using microarray data, and was shown to be faster and more accurate than the K-Centers clustering algorithm. A subsequent note suggested however, that AP was similar to the earlier vertex substitution heuristic (VSH), and that it did not perform any better. This prompted the AP authors to provide evidence that AP outperforms VSH on large problems – where it runs much faster, and was more accurate than several clustering algorithms tested.
In view of the interest in applying efficient clustering procedures to biological networks in order to identify and characterize functional modules, this paper expands the analysis of Brohée et al to the comparison of the AP and MCL algorithms. Such comparison has not been previously reported.
Following Brohée et al, we first derive an unweighted network of protein-protein interactions from a set of up-to-date hand curated protein complexes from S. cerevisiae and evaluate the performance of the two clustering algorithms in recalling the annotated complexes. In doing so the parameter space of each algorithm is sampled in order to select optimal values for these parameters, and the robustness of the algorithms is assessed by quantifying the level of complex recall as interactions are randomly added to or removed from the network to simulate noise.
To test performance on a more realistic weighted protein interaction graph, we also apply the two algorithms to the high confidence consolidated protein interaction network of S. cerevisiae recently derived by Collins et al, and to versions of this network in which varying proportions of the links have been randomly shuffled. The computed clusters are compared to the same set of curated S. cerevisiae complexes in order to assess the robustness of the two algorithms.
The comparative analysis on the unweighted networks proposed here has the advantage of representing a self-consistent approach, in which information on a predefined number of cliques is used to build the network, and hence the expected result from partitioning this network is well defined. The choice of the weighted high confidence consolidated network of S. cerevisiae recently derived from purification data also makes it possible to quantify the performance of the clustering procedures by comparing computed clusters to the annotated complexes. Such quantification is difficult with S. cerevisiae protein interaction networks built using yeast two hybrid data, because these interactions differ significantly from co-complex interactions. Partitioning such a network using any method is hence unlikely to yield clusters comparable to complexes. The much larger human protein interaction networks compiled from different sources and stored in databases such as HPRD (~50,000 interactions) would not serve our purpose either, given the still limited number of fully annotated human protein complexes against which the clustering results can be compared.
The clustering algorithms
The Markov clustering algorithm (MCL) simulates random walks on the underlying interaction network, by alternating two operations: expansion, and inflation. First, loops are added to the input graph – by default, the loop weight for each node is assigned as the maximum weight of all edges connected to the node – and this graph is then translated into a stochastic "Markov" matrix. This matrix represents the transition probabilities between all pairs of nodes, and the probability of a random walk of length n between any two nodes can be calculated by raising this matrix to the exponent n – a process referred to as expansion. As higher length paths are more common between nodes in the same cluster than nodes within different clusters, the probabilities between nodes in the same complex will typically be higher in expanded matrices. MCL further exaggerates this effect by taking entry wise exponents of the expanded matrix, and then rescaling each column so that it remains stochastic – a process called inflation. Clusters are identified by alternating expansion and inflation until the graph is partitioned into subsets so that there are no longer paths between these subsets.
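The alternation of expansion and inflation described above can be illustrated with a minimal pure-Python sketch. It is not the reference MCL implementation: the function names, the fixed expansion exponent of 2, the convergence tolerance, the 1e-6 cut-off used to read out clusters, and the toy graph (two triangles joined by one edge) are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering of a symmetric weight matrix (list of lists)."""
    n = len(weights)
    m = [list(row) for row in weights]
    for i in range(n):
        # Self-loop weight: maximum weight of the edges incident to node i
        # (1.0 for an isolated node, an assumption of this sketch).
        incident = (weights[i][j] for j in range(n) if j != i)
        m[i][i] = max(incident, default=0.0) or 1.0
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (random walks of length 2).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: entry-wise power, then rescale the columns.
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that still carry mass are attractors; the columns they
    # attract form one cluster.
    found = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    weights[u][v] = weights[v][u] = 1.0

clusters = mcl(weights)
print(clusters)
```

On this toy graph the random-walk mass is expected to stay within each triangle, so the two triangles should come out as separate clusters. Raising the inflation value generally yields a finer partition.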
In the Affinity Propagation (AP) algorithm, two kinds of messages are exchanged between nodes: the responsibility r(i, k) and the availability a(i, k). The matrix s(i, k) denotes the similarity (e.g. edge weight) between the two nodes i and k, and the diagonal of this matrix contains the preferences for each node. The two message update equations are iterated until a good set of exemplars emerges. Each node i can then be assigned to the exemplar k which maximizes the sum a(i, k) + r(i, k), and if i = k, then i is an exemplar. A damping factor between 0 and 1 is used to control for numerical oscillations.
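The update equations themselves are not reproduced in this text. The sketch below therefore assumes the standard responsibility and availability updates published by Frey and Dueck, applied to a small dense similarity matrix. The function name, the default parameter values and the toy data (six points on a line, with negative squared distance as similarity) are illustrative assumptions, not the settings or the program used in this study.

```python
def affinity_propagation(s, damping=0.9, max_iter=1000, conv_iter=100):
    """Toy affinity propagation on a dense similarity matrix (n >= 2).

    s[i][k] is the similarity of point i to point k; the diagonal
    s[k][k] holds the preferences. Returns (labels, converged), where
    labels[i] is the index of the exemplar chosen by point i.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    labels, stable = None, 0
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            vals = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            runner_up = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                target = s[i][k] - (runner_up if k == best else vals[best])
                r[i][k] = damping * r[i][k] + (1 - damping) * target
        # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) <- sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            pos[k] = r[k][k]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    target = total - r[k][k]
                else:
                    target = min(0.0, total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * target
        # Assign each point to the k that maximizes a(i,k) + r(i,k).
        new_labels = [max(range(n), key=lambda k: a[i][k] + r[i][k])
                      for i in range(n)]
        stable = stable + 1 if new_labels == labels else 0
        labels = new_labels
        if stable >= conv_iter:
            return labels, True  # assignments unchanged for conv_iter rounds
    return labels, False


# Toy data: two well separated groups of three points on a line.
points = [0, 1, 2, 50, 51, 52]
preference = -100.0  # same preference for every point (assumed value)
s = [[preference if i == k else -float((x - y) ** 2)
      for k, y in enumerate(points)] for i, x in enumerate(points)]

labels, converged = affinity_propagation(s)
print(labels, converged)
```

The returned flag only reports whether the assignments stayed unchanged for `conv_iter` consecutive iterations. This is a simple stand-in for a convergence criterion and may differ from the one used by the AP program.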
Results and discussion
Performance on unweighted protein interaction graphs
The MCL and AP clustering procedures were each applied to the different versions of the networks and the correspondence between the computed clusters and the original 408 curated complexes was evaluated for each network version. The correspondence was quantified using the Geometric Accuracy (Acc) and Geometric Separation (Sep) criteria as previously defined. Acc is computed as the geometric mean of the Positive Predictive Value and Sensitivity with which the clusters recall the original complexes. The Sep parameter is defined as the geometric mean of two quantities that measure how cluster components are on average distributed amongst complexes and how complex components are distributed among clusters, respectively (see Methods for further details).
To enable as fair a comparison as possible, values of the adjustable parameters in each clustering algorithm were selected so as to maximize the sum of the Acc and Sep values for the clusters computed from each network (see Methods).
We also tested AP on an unweighted network of 15,982 human protein-protein interactions comprising 5,850 unique proteins, annotated as experimentally characterized using affinity capture or reconstituted complexes in version 2.0.50 of the BioGRID database. Similar to the results obtained for unweighted networks to which artificial noise was added, AP did not converge for this more realistic network derived from inherently noisy experimental data. MCL produced clusterings containing between 663 and 1566 clusters, depending on the inflation value. A detailed analysis of these clusters is outside the scope of this report, but the size distributions of the clusters in the MCL partitions produced at various inflation values (Additional File 3) indicate that they are not all trivial singleton or extremely large clusters.
The Acc and Sep were also evaluated for the 408 curated complexes directly. As expected, Acc, which quantifies the maximum extent of overlap between complexes and clusters – and vice versa – is 1 for these complexes (Figures 2a, 3a). Lower Acc values are obtained for the partitions derived by both clustering algorithms – largely due to shared components in the original complexes, which can obscure their detection, especially for smaller clusters. In contrast, shared components lower the Sep values of the original complexes, and hence as the clustering algorithms partition the graphs they can achieve higher Sep values at low noise levels (Figures 2b, 3b).
These results depart sharply from those expected for random partitions, as also illustrated in Figures 2a, b, 3a, and 3b. Random partitions were generated by randomly permuting the assignments of the proteins to clusters within the MCL and AP predictions.
Performance on a weighted biological protein interaction graph
A second series of tests was performed using interaction graphs built from the consolidated network of Collins et al, where each protein-protein link has an associated confidence score ranging in values from 0 to 1. As in previous studies[21, 29], only the high confidence portion of the network was considered, comprising links whose scores are above a confidence threshold of 0.38. The resulting network comprised 12,035 interactions and 1,921 proteins. Since this network represents predicted associations from data derived in two recent high-throughput experimental studies[6, 7], some noise will naturally be present. We did however generate noisier versions of this network by randomly shuffling increasing fractions of edges, and re-evaluating the results for each of these versions. As for the performance tests on the unweighted graphs, the parameters of each algorithm are adjusted so as to optimize the correspondence with the curated complexes, by maximizing the sum of the Acc and Sep values as done above for the comparative analysis on the unweighted graphs.
These results, together with the superior Acc and Sep values obtained with MCL at high noise levels suggest that this algorithm is a better choice for weighted protein interaction networks.
In summary, our analysis has shown that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the AP algorithm. The advantage of MCL over AP is dramatic for unweighted protein interaction graphs, as AP displays severe convergence problems on the majority of the unweighted graph versions that we tested, whereas MCL continues to identify meaningful clusters, albeit fewer of them, as the level of noise in the graph increases. It is possible that AP as it stands, is not suitable for unweighted networks (as discussed below), although this is not specified in the instructions for using the program or in the original publication.
On weighted graphs constructed using data from high throughput experiments believed to be incomplete and usually quite noisy, the difference in performance is also notable. MCL achieves higher Acc and equivalent or better Sep at all significant noise levels. Furthermore, at low to moderate noise levels, these solutions include more proteins than AP. Parameters for either algorithm can be adjusted to affect the final granularity of the cluster, but either the Acc or the Sep will be lower.
Thus for physical interaction networks, we find that MCL outperforms AP in terms of its ability to generate meaningful partitions. The other cited advantages of the AP algorithm, namely its speed and ability to tackle very large networks, play only a minor role in the present application. Indeed both MCL and AP run very fast (< 10 seconds) on the weighted consolidated network of 12,035 interactions and 1,921 proteins. As noise is added to this network, AP can also fail to converge at certain preference values (Figure S1 in Additional File 4 and Additional File 5), and it can be difficult to determine which parameters will lead to convergence. For example, AP did not converge at any of the Preference values tested for unweighted networks with edges randomly removed. On weighted networks with 30% noise, the algorithm converged at Preference values 0.65 and 0.9 only (Additional File 4). Thus for this application, one difficulty in using AP is to determine an appropriate interval and level of granularity for searching Preference values. The AP authors provide tools to assist in choosing sensible Preference intervals, but not for choosing granularity. In situations where AP does not converge, the authors recommend increasing the Damping factor, the maximum number of iterations, and the number of iterations required for convergence, although increasing these parameters can increase the runtime of the algorithm.
The MCL algorithm effectively considers both edge weight and graph topology (connectivity) information. AP, on the other hand, can fail in situations where high weight edges connect two clusters. Consider the artificial situation where two cliques, A and B, are connected by a single, relatively high weight edge. If one of the nodes comprising this edge is an exemplar in clique A, the adjacent node in clique B may be incorrectly assigned to A by AP, despite being highly connected to members of B. This suggests that MCL achieves its robust performance by always considering network topology, whereas AP relies in part on the 'distance metric' (edge weight) to capture this information. To overcome this limitation one could define a modified distance metric that simultaneously captures both the propensity of two proteins to interact and the graph topology, and re-run AP on the modified graph. To some extent, the PE score is such a metric as higher scores are assigned to proteins that repeatedly co-purify together in affinity capture experiments, and lower scores are assigned to non-specific interactions that occur between promiscuous proteins. Indeed, on the PE weighted network of Collins et al, the performance of AP is much closer to that of MCL when the network is unperturbed, as randomly shuffling edges distorts the topology information contained in the edge weights. In the unweighted network, where no topological information is captured by the distance metric, AP is only able to successfully cluster unperturbed networks with very few inter-complex edges (shared components).
As noted in , the relative accuracy and performance of clustering algorithms can vary greatly for different datasets, and this report makes no attempt to address the breadth of problems for which one algorithm outperforms the other.
Building the protein interaction graphs
The unweighted interaction graph was defined by considering all possible pairs of proteins that were annotated to the same complex within a gold-standard set of yeast protein complexes (Additional File 1). Each edge was assigned a weight of 1. The resulting network comprised 11,238 interactions (edges) and 1624 proteins (nodes). For AP, the input pairwise 'similarities' were defined twice for every pair of proteins i, j as S(i, j) = S(j, i) = 1 if protein i and j were annotated to the same gold standard complex.
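As a minimal sketch of this construction, the following code links every pair of proteins annotated to the same complex and then writes each similarity in both directions, as described for the AP input. The function names and the toy complexes are hypothetical. The real input is the gold-standard set in Additional File 1.

```python
from itertools import combinations


def cocomplex_edges(complexes):
    """Return the set of unordered protein pairs sharing at least one complex."""
    edges = set()
    for members in complexes:
        # sorted() gives each pair a canonical order, so (a, b) and (b, a)
        # are never stored twice.
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges


def symmetric_similarities(edges, weight=1.0):
    """Define S(i, j) = S(j, i) = weight for every edge."""
    s = {}
    for a, b in edges:
        s[(a, b)] = weight
        s[(b, a)] = weight
    return s


# Toy complexes; protein "C" is shared between the two.
toy_complexes = [{"A", "B", "C"}, {"C", "D"}]
toy_edges = cocomplex_edges(toy_complexes)
toy_similarities = symmetric_similarities(toy_edges)
print(sorted(toy_edges))
```

A protein shared between two complexes, such as "C" here, creates edges into both cliques. These are the inter-complex links discussed in the Results.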
The weighted interaction network is that derived by Collins et al. The weight of each edge represents the confidence score of each putative interaction, as defined in ref. These confidence scores range from 0.38 to 1. For AP, the input 'similarities' were again defined twice for every pair of interacting proteins i, j as S(i, j) = S(j, i) = c, where c is the confidence assigned to the interaction.
The Acc indicates the tradeoff between the Sensitivity and the Positive Predictive Value (PPV), and is calculated by taking the geometric mean of these two quantities. Sensitivity and PPV are defined as the weighted averages of the complex-wise sensitivities, Si, and the cluster-wise positive predictive values, Pj, respectively. Si measures the best overlap of complex i with the predicted clusters, and Pj measures the best overlap of cluster j with the gold standard complexes, relative to the number of components in cluster j that are contained in the original set of complexes. The Acc alone may not give an accurate evaluation of a clustering, for example if the clustering consists of very large and very small clusters. In this case both the complex-wise Sensitivity and cluster-wise PPV will be high.
A second measure, the Sep, is therefore calculated to measure the one-to-one correspondence between predicted clusters and complexes. It is defined as the geometric mean of the average complex-wise and average cluster-wise separation, which are each derived from confusion tables modified, respectively, to indicate the fraction of overlap of each complex with every cluster, or each cluster with every complex. Unlike Brohée et al, all calculations done here consider only those components that exist in both datasets.
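The two measures can be sketched in a few lines of Python. The code assumes the confusion-table definitions of Brohée and van Helden as summarized above, restricted to proteins present in both datasets. The function name and the toy protein sets are hypothetical, and the sketch assumes that the complexes and clusters share at least one protein.

```python
from math import sqrt


def accuracy_and_separation(complexes, clusters):
    """Return (Acc, Sep) for two lists of protein sets."""
    # Keep only the components that exist in both datasets.
    shared = set().union(*complexes) & set().union(*clusters)
    cx = [set(c) & shared for c in complexes]
    cx = [c for c in cx if c]
    cl = [set(k) & shared for k in clusters]
    cl = [k for k in cl if k]

    # Confusion table: t[i][j] = proteins common to complex i and cluster j.
    t = [[len(c & k) for k in cl] for c in cx]
    row = [sum(r) for r in t]
    col = [sum(t[i][j] for i in range(len(cx))) for j in range(len(cl))]

    # Sensitivity: complex-wise best overlaps, weighted by complex size.
    sn = sum(max(r) for r in t) / sum(len(c) for c in cx)
    # PPV: cluster-wise best overlaps, weighted by the number of cluster
    # members that belong to some complex (column sums).
    ppv = sum(max(t[i][j] for i in range(len(cx)))
              for j in range(len(cl))) / sum(col)
    acc = sqrt(sn * ppv)

    # Separation: product of row-wise and column-wise overlap fractions,
    # summed over the table, then averaged per complex and per cluster.
    total = sum(t[i][j] ** 2 / (row[i] * col[j])
                for i in range(len(cx)) for j in range(len(cl)) if t[i][j])
    sep = sqrt((total / len(cx)) * (total / len(cl)))
    return acc, sep


# A perfect match, and one complex split into two clusters.
perfect = accuracy_and_separation([{"A", "B", "C"}, {"D", "E"}],
                                  [{"A", "B", "C"}, {"D", "E"}])
split = accuracy_and_separation([{"A", "B", "C", "D"}],
                                [{"A", "B"}, {"C", "D"}])
print(perfect, split)
```

In the split example the PPV stays at 1 while the Sensitivity drops to 0.5. The Sep also falls, because one complex is spread over two clusters.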
where w(e) denotes the weight of edge e.
Each clustering was performed with parameters that maximized the Geometric Accuracy and Separation. For MCL this involved sampling Inflation parameter values of 1.5 – 4 in steps of 0.1. For the AP algorithm we sampled the Preference parameters from 0.1–1 in steps of 0.05. The damping factor was set to 0.99, the maximum number of iterations to 15,000, and the number of iterations required for convergence to 1500. For AP, all proteins were assigned the same preference.
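The parameter scan can be sketched as a simple grid search that keeps the value maximizing the sum of Acc and Sep. The helper names `parameter_grid` and `best_parameter` are hypothetical, as are the `cluster_fn` and `score_fn` callables that stand in for a clustering run and for the Acc/Sep evaluation. A clustering function is assumed to return `None` when the algorithm fails to converge.

```python
def parameter_grid(start, stop, step):
    """Evenly spaced values from start to stop inclusive, rounded to
    suppress floating-point drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


def best_parameter(values, cluster_fn, score_fn):
    """Return (score, value, clusters) maximizing Acc + Sep, or None."""
    best = None
    for value in values:
        clusters = cluster_fn(value)
        if clusters is None:  # e.g. the run did not converge
            continue
        acc, sep = score_fn(clusters)
        if best is None or acc + sep > best[0]:
            best = (acc + sep, value, clusters)
    return best


# Grids matching the ranges quoted in the text.
inflation_values = parameter_grid(1.5, 4.0, 0.1)
preference_values = parameter_grid(0.1, 1.0, 0.05)
print(len(inflation_values), len(preference_values))

# Dummy stand-ins, only to show the calling convention.
demo = best_parameter(
    inflation_values,
    cluster_fn=lambda v: None if v > 3.0 else v,
    score_fn=lambda c: (1.0 - abs(c - 2.0), 0.5),
)
```

Skipping runs that return `None` mirrors the situation reported for AP, where only some Preference values lead to convergence.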
We are grateful to Delbert Dueck and Brendan Frey for guidance in using the Affinity Propagation algorithm. Miguel Santos, and the systems support team of the Centre for Computational Biology at the Hospital for Sick Children are thanked for help with the computer systems. S.J.W. is Tier 1 Canada Research Chair in Computational Biology and Bioinformatics and acknowledges support from the Canada Institute for Health Research, the Hospital for Sick Children and the Sickkids Foundation, Toronto, Canada.
- Charbonnier S, Gallego O, Gavin AC: The social network of a cell: Recent advances in interactome mapping. Biotechnology annual review 2008, 14: 1–28.
- Cusick ME, Klitgord N, Vidal M, Hill DE: Interactome: gateway into systems biology. Human molecular genetics 2005, 14(Spec No. 2):R171–181.
- Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature 1989, 340(6230):245–246.
- Johnsson N, Varshavsky A: Split ubiquitin as a sensor of protein interactions in vivo. Proceedings of the National Academy of Sciences of the United States of America 1994, 91(22):10340–10344.
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141–147.
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631–636.
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637–643.
- Yook SH, Oltvai ZN, Barabasi AL: Functional and topological characterization of protein interaction networks. Proteomics 2004, 4(4):928–942.
- Ideker TE: Network genomics. Ernst Schering Research Foundation workshop 2007, (61):89–115.
- Bader S, Kuhner S, Gavin AC: Interaction networks for systems biology. FEBS letters 2008, 582(8):1220–1224.
- Pieroni E, de la Fuente van Bentem S, Mancosu G, Capobianco E, Hirt H, de la Fuente A: Protein networking: insights into global functional organization of proteomes. Proteomics 2008, 8(4):799–816.
- Alberts B: The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 1998, 92(3):291–294.
- Formosa T, Barry J, Alberts BM, Greenblatt J: Using protein affinity chromatography to probe structure of protein machines. Methods Enzymol 1991, 208: 24–45.
- Jain AK, Dubes RC: Algorithms for clustering data. Upper Saddle River: Prentice-Hall Advanced Reference Series archive 1988.
- Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC bioinformatics 2006, 7: 488.
- Chipman H, Hastie T, Tibshirani R: Statistical Analysis of Gene Expression Microarray Data. Boca Raton, FL: Chapman and Hall; 2003:159–199.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
- MacQueen J: Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Math, Statistics, and Probability 1967, 1: 281–297.
- Lloyd S: Least squares quantization in PCM. IEEE Transactions on Information Theory 1982, 28: 128–137.
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol 2007, 3: 88.
- Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak SJ: Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 2007, 7(6):944–960.
- van Dongen S: Graph Clustering by Flow Simulation. In PhD Thesis. University of Utrecht; 2000.
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics 2003, 4: 2.
- Blatt M, Wiseman S, Domany E: Superparamagnetic clustering of data. Physical review letters 1996, 76(18):3251–3254.
- Frey BJ, Dueck D: Clustering by passing messages between data points. Science (New York, NY) 2007, 315(5814):972–976.
- Brusco MJ, Kohn HF: Comment on "Clustering by passing messages between data points". Science (New York, NY) 2008, 319(5864):726. author reply 726.
- Frey BJ, Dueck D: Response to Comment on "Clustering by Passing Messages Between Data Points". Science. 2008, 319(5864):726d.
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic acids research 2009, 37(3):825–831.
- Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege FC, Weissman JS, Krogan NJ: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 2007, 6(3):439–450.
- Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, et al.: High-quality binary protein interaction map of the yeast interactome network. Science. 2008, 322(5898):104–110.
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498–2504.
- Vlasblom J, Wu S, Pu S, Superina M, Liu G, Orsi C, Wodak SJ: GenePro: a cytoscape plug-in for advanced visualization and analysis of interaction networks. Bioinformatics (Oxford, England) 2006, 22(17):2178–2179.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic acids research 2006, (34 Database):D535–539.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.