  • Research article
  • Open access

MCL-CAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure

Abstract

Background

The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher-level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over these years, several affinity scoring schemes have also been devised to improve the quality of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt towards tackling this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracies of correctly predicted complexes.

Results

Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for the reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes, and with better accuracies, than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions on complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and the ability of proteins to form complexes.

Conclusions

We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.

Background

Most biological processes are carried out by proteins that physically interact to form stoichiometrically stable complexes. Even in the relatively simple model organism Saccharomyces cerevisiae (budding yeast), these complexes comprise many subunits that work in a coherent fashion. These complexes interact with individual proteins or other complexes to form functional modules and pathways that drive the cellular machinery. Therefore, a faithful reconstruction of the entire set of complexes from the physical interactions between proteins is essential to understand not only complex formation, but also the higher-level organization of the cell.

These physical interactions between proteins have been most extensively catalogued for yeast using high-throughput methods like yeast two-hybrid [1, 2] and direct purification of complexes using affinity tags followed by mass spectrometry (MS) analyses [3]. In 2002, the direct purification strategy or "pull-down" was first applied to yeast in two independent studies by Gavin et al.[4] and Ho et al.[5]. More recently (2006), two separate groups, Gavin et al.[6] and Krogan et al.[7], employed tandem affinity purification (TAP) followed by MS analyses to produce an enormous amount of new data, allowing a more complete mapping of the yeast interactome. Although these individual datasets are of high quality, they show a surprising lack of correlation with each other [8, 9], and some bias towards high-abundance proteins [10] and against proteins from certain cellular compartments (like cell wall and plasma membrane) [11]. Also, each dataset still contains a substantial number of false positives (noise) that can compromise the utility of these datasets for more focused studies like complex reconstruction [11]. In order to reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised [6, 7, 11–17]. These affinity scores encode the reliabilities (confidence) of physical interactions between pairs of proteins. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining multiple high-throughput datasets and by making use of these affinity scoring schemes.

The interaction data produced from the high-throughput TAP/MS experiments comprise tagged "bait" proteins and the associated "prey" proteins that co-purify with the baits. Gavin et al.[6] considered direct bait-prey as well as indirect prey-prey relationships (a combination of spoke and matrix models), followed by a socio-affinity scoring system to encode the affinities between the protein pairs. The socio-affinity score quantifies the log-ratio of the number of times two proteins are observed together relative to what would be expected from their frequency in the dataset. Subsequently, Gavin et al. used an iterative clustering approach to derive complexes. Each complex was then partitioned into groups of proteins called "core", "attachment" or "module" (depicted in Additional files 1, Figure S1). On the other hand, Krogan et al.[7] used machine learning techniques (Bayesian networks and C4.5-based decision trees) to define confidence scores for interactions derived from direct bait-prey observations (the spoke model). Subsequently, Krogan et al. defined a high-confidence 'Core' dataset of interactions, and used the Markov Clustering algorithm (MCL) [18, 19] to derive complexes. Hart et al.[12] generated a Probabilistic Integrated Co-complex (PICO) network by integrating matrix modeled relationships of the Gavin et al., Krogan et al. and Ho et al. datasets using a measure similar to socio-affinity scores, and then used a MCL procedure to derive complexes from this network. Collins et al.[11] developed a Purification Enrichment (PE) scoring system to generate the 'Consolidated network' from the matrix modeled relationships of the Gavin et al. and Krogan et al. datasets. Collins et al. used a Bayes classifier to generate the PE scores in the Consolidated network by incorporating diverse evidence from hand-curated co-complexed protein pairs, Gene Ontology (GO) annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles.
This new network was shown to be of high quality - comparable to that of PPIs derived from small-scale experiments stored at the Munich Information Center for Protein Sequences (MIPS). Zhang et al.[13] used the Dice coefficient (DC) to assign affinities to protein pairs, and evaluated their affinity measure against the socio-affinity and PE measures. They concluded that DC and PE offered the best representation for protein affinity, and subsequently used them for complex prediction. Pu et al.[20] used MCL combined with cluster overlaps on the Consolidated network to reveal interesting insights into complex organization. Wang et al.[21] proposed HACO, a hierarchical clustering with overlap algorithm, to reconstruct complexes and used them to build the 'ComplexNet', an interaction network of proteins and complexes, in order to study the higher-level organization of complexes. Chua et al.[14] and Liu et al.[15] developed network topology-based scoring schemes called Functional Similarity Weight (FS Weight) and Iterative-Czekanowski-Dice (Iterative-CD), respectively, to assign reliability scores to the interactions in networks. Subsequently, Liu et al.[16] used a maximal clique merging strategy (called CMC) to derive complexes from networks scored using these two systems. Friedel et al.[17] developed a bootstrapped scoring system to score TAP/MS interactions from Gavin et al. and Krogan et al., and subsequently derived complexes using a variant of MCL. Friedel et al.[22] also developed a minimum spanning tree-based method to reconstruct the topology of complexes from co-purified proteins in TAP/MS assays. Voevodski et al.[23] used PageRank, a random walk-based method employed in context-sensitive web search, to define the affinities between proteins within PPI networks. Subsequently, Voevodski et al. used it to predict co-complexed proteins within the network.
Approaches like CORE [24] and COACH [25] adopted local dense neighborhood search to derive cores and attachments from unscored networks. Mitrofanova et al.[26] measured the connectivity between proteins in unweighted PPI networks by edge-disjoint paths instead of edges to overcome noise, and modeled these paths as a network flow and represented it in Gomory-Hu trees. They subsequently isolated groups of nodes in the trees that shared edge-disjoint paths in order to identify complexes. Very recently, Ozawa et al.[27] used domain-domain interactions to validate and refine the complexes predicted by MCL.

In this study, we develop an algorithm to derive yeast complexes from weighted (affinity-scored) PPI networks. Inspired by the experimental findings by Gavin et al.[6] on the modularity structure in yeast complexes, and the distinctive properties of "core" and "attachment" proteins, we develop a novel core-attachment based refinement method coupled to MCL for reconstruction of yeast complexes. We had proposed the idea of core-attachment based refinement in a preliminary work [28] and called it MCL-CA.

However, MCL-CA worked only on unscored networks. Here, we devise an improved algorithm (called MCL-CAw) that extends it naturally to scored (weighted) PPI networks. Even though most eukaryotic complexes are hypothesized to display such core-attachment modularity, we design our algorithm specifically for yeast complexes because of the lack of sufficient evidence, high-throughput datasets and reference complexes from other organisms. We combine TAP/MS physical datasets from Gavin et al.[6] and Krogan et al.[7] to generate an unscored PPI network (Table 1). We then score this network using two topology-based affinity scoring schemes, FS Weight [14] and Iterative-CD [15], to generate scored PPI networks. We gather two additional readily available scored PPI networks from Collins et al.[11] and Friedel et al.[17]. The evaluation of MCL-CAw on these networks demonstrates that: (a) MCL-CAw derives a higher number of yeast complexes, and with better accuracies, than MCL; (b) affinity scoring effectively reduces the impact of noise on MCL-CAw and thereby improves the quality (precision and recall) of its predicted complexes; (c) MCL-CAw responds well to most available affinity scoring schemes for PPI networks. We compare MCL-CAw with several recent complex detection algorithms on both unscored and scored PPI networks. Finally, we perform an in-depth analysis of the complexes predicted by MCL-CAw.

Table 1 Properties of the PPI networks used for the evaluation of MCL-CAw

Methods

The MCL-CAw algorithm: Identifying complexes embedded in the interaction network

Our MCL-CAw algorithm broadly consists of two phases. In the first phase, we partition the PPI network into multiple dense clusters using MCL. Following this (in the second phase), we post-process (refine) these clusters to obtain meaningful complexes. The MCL-CAw algorithm consists of the following steps:

  1. Clustering the PPI network using MCL hierarchically

  2. Categorizing proteins as cores within clusters

  3. Filtering noisy clusters

  4. Recruiting proteins as attachments into clusters

  5. Extracting out complexes from clusters

  6. Ranking the predicted complexes

We use the following notations while describing our algorithm. The PPI network is represented as G = (V, E), where V is the set of proteins, and E is the set of interactions between these proteins. For each e = (p, q) ∈ E, there is a confidence score (weight) w(p, q) encoding the affinity between the proteins p and q. These affinity scores depend on the scoring system used.
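This notation maps directly onto a simple adjacency-map representation. A minimal sketch in Python (the protein identifiers and scores below are purely illustrative, not taken from the datasets used in this study):

```python
# A weighted PPI network G = (V, E) stored as a symmetric adjacency map:
# adj[p][q] holds the affinity score w(p, q) of the interaction (p, q).
from collections import defaultdict

def build_network(scored_edges):
    """Build a symmetric adjacency map from (p, q, weight) triples."""
    adj = defaultdict(dict)
    for p, q, score in scored_edges:
        adj[p][q] = score
        adj[q][p] = score  # interactions are undirected
    return adj

def w(adj, p, q):
    """Affinity score w(p, q); 0 if the proteins do not interact."""
    return adj[p].get(q, 0.0)

# Illustrative example with two scored interactions
network = build_network([("YPL235W", "YDR190C", 0.9),
                         ("YDR190C", "YJR065C", 0.4)])
```

Later steps of the algorithm only need this map and the accessor w(p, q).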

Clustering the PPI network using MCL hierarchically

The first step of our algorithm is to partition (cluster) the PPI network using MCL [18], which simulates random walks (called a flow) to identify relatively dense regions in the network. The inflation coefficient parameter I in MCL is used to regulate the granularity of the clusters - the higher the value, the finer the generated clusters (how to choose I in practice is discussed in the "Results" section). MCL tends to produce several large clusters (sizes ≥ 30) that amalgamate smaller clusters [7, 20]. On the other hand, the size distributions of hand-curated complexes from the Wodak lab [29], MIPS [30] and Aloy et al.[31] (Table 2) reveal that most complexes are of sizes less than 10. Therefore, we perform hierarchical clustering by iteratively selecting all clusters of sizes at least 30 and re-clustering them using MCL.

Table 2 Properties of hand-curated yeast complexes from Wodak lab [29], MIPS [30] and Aloy [31]

After iterative rounds of MCL-based hierarchical clustering on the protein network G = (V, E), we obtain a collection of k disjoint (non-overlapping) clusters {C i : C i = (V i , E i ), 1 ≤ i ≤ k}, where V i ⊆ V and E i ⊆ E.
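The iterative re-clustering loop described above can be sketched as follows; here `cluster_fn` is a stand-in for a run of MCL on the subnetwork induced by an oversized cluster, not the authors' implementation:

```python
# Iteratively re-cluster oversized clusters: any cluster with at least
# `max_size` proteins is fed back to the clustering routine until every
# cluster falls below the size threshold.

def hierarchical_clustering(proteins, cluster_fn, max_size=30):
    final, pending = [], [list(proteins)]
    while pending:
        cluster = pending.pop()
        if len(cluster) < max_size:
            final.append(cluster)
            continue
        sub = cluster_fn(cluster)
        if len(sub) <= 1:          # could not be split further; accept as-is
            final.append(cluster)
        else:
            pending.extend(sub)
    return final
```

The guard against a non-splitting `cluster_fn` ensures the loop terminates even when MCL returns a cluster unchanged.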

Categorizing proteins as cores within clusters

Microarray analysis by Gavin et al.[6] of their predicted complex components showed that a large percentage of pairs of proteins within cores were co-expressed at the same time during the cell cycle and sporulation, consistent with the view that cores represent main functional units within complexes. Three-dimensional structural and yeast two-hybrid analysis showed that the core components were most likely to be in direct physical contact with each other. To reflect these findings in our post-processing steps, we expect:

  • Every complex we predict to comprise a non-empty set of core proteins; and

  • The proteins within these cores to display a relatively high degree of physical interactivity among themselves.

We identify the core proteins within a cluster in two stages: we first identify the set of preliminary cores and subsequently extend this to form the final set of cores. We categorize a protein p∈V i to be a 'preliminary core' protein in cluster C i = (V i , E i ), given by p ∈ PCore(C i ), if:

  • The weighted in-connectivity of p with respect to C i is at least the average weighted in-connectivity of C i , given by: d in (p, C i ) ≥ d avg (C i ); and

  • The weighted in-connectivity of p with respect to C i is greater than the weighted out-connectivity of p with respect to C i , given by: d in (p, C i ) > d out (p, C i ).

The weighted in-connectivity d in (p, C i ) of p with respect to C i is the total weight (score) of interactions p has with proteins within C i . Similarly, the weighted out-connectivity d out (p, C i ) of p with respect to C i is the total weight of interactions p has with proteins outside C i . These are given by d in (p, C i ) = ∑ {w(p, q): q ∈ V i } and d out (p, C i ) = ∑ {w(p, q): q ∉ V i }, respectively. The average weighted in-connectivity d avg (C i ) of cluster C i is the average of the weighted in-connectivities of all proteins within C i , given by d avg (C i ) = (1/|C i |) · ∑ {d in (q, C i ): q ∈ V i }.
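A minimal sketch of the preliminary-core criteria, assuming the network is stored as a symmetric adjacency map of affinity scores (a sketch of the definitions, not the authors' code):

```python
def d_in(adj, p, members):
    """Total weight of p's interactions with proteins inside the cluster."""
    return sum(score for q, score in adj.get(p, {}).items() if q in members)

def d_out(adj, p, members):
    """Total weight of p's interactions with proteins outside the cluster."""
    return sum(score for q, score in adj.get(p, {}).items() if q not in members)

def preliminary_cores(adj, cluster):
    """PCore(C_i): proteins whose weighted in-connectivity is at least the
    cluster average and exceeds their weighted out-connectivity."""
    members = set(cluster)
    d_avg = sum(d_in(adj, q, members) for q in members) / len(members)
    return {p for p in members
            if d_in(adj, p, members) >= d_avg
            and d_in(adj, p, members) > d_out(adj, p, members)}
```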

We use these preliminary cores to find the 'extended core' proteins. We categorize a protein p ∉ PCore(C i ) to be an extended core protein in cluster C i , given by p ∈ ECore(C i ), if:

  • The weighted in-connectivity of p with respect to PCore(C i ) is at least the average of the weighted in-connectivities of all non-cores r ∉ PCore (C i ) to the preliminary cores, given by: d in (p, PCore (C i )) ≥ d avg (r, PCore (C i )); and

  • The weighted in-connectivity of p with respect to PCore(C i ) is greater than the weighted out-connectivity of p with respect to PCore(C i ), given by: d in (p, PCore(C i )) > d out (p, PCore(C i )).

Here, d in (p, PCore(C i )) is the total weight of interactions p has with the preliminary cores of C i , given by: d in (p, PCore(C i )) = ∑ {w(p, q): q ∈ PCore(C i )}. Similarly, d out (p, PCore(C i )) is the total weight of interactions p has with all the non-core proteins within C i , given by:

d out (p, PCore(C i )) = ∑ {w(p, r): r ∈ V i and r ∉ PCore(C i )}. Finally, d avg (r, PCore(C i )) is the average weight of interactions of all non-cores r with the preliminary cores, given by:

d avg (r, PCore(C i )) = (1/(|C i | − |PCore(C i )|)) · ∑ {d in (r, PCore(C i )): r ∈ V i and r ∉ PCore(C i )}.

Combining the preliminary and extended core proteins, we form the final set of core proteins of cluster C i , given by:

Core(C i ) = PCore(C i ) ∪ ECore(C i ).
(1)
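Sticking with the adjacency-map convention, the extended-core step can be sketched as below; here the out-connectivity of p is taken as its total weight of interactions with the non-core proteins of the cluster (a sketch, not the authors' code):

```python
def extended_cores(adj, cluster, pcore):
    """ECore(C_i): non-cores pulled in by their connectivity to PCore(C_i)."""
    members, pcore = set(cluster), set(pcore)
    non_cores = members - pcore
    if not non_cores or not pcore:
        return set()

    def to_core(p):       # total weight of p's interactions with PCore(C_i)
        return sum(s for q, s in adj.get(p, {}).items() if q in pcore)

    def to_non_core(p):   # total weight of p's interactions with non-cores of C_i
        return sum(s for q, s in adj.get(p, {}).items() if q in non_cores)

    avg = sum(to_core(r) for r in non_cores) / len(non_cores)
    return {p for p in non_cores
            if to_core(p) >= avg and to_core(p) > to_non_core(p)}
```

The final core set of Eq. (1) is then simply the union of the preliminary and extended cores.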

Filtering noisy clusters

Consistent with the assumption that every complex comprises a set of core proteins, we consider a cluster as noisy if it does not include any core protein as per our above criteria. We discard all such noisy clusters.

Recruiting proteins as attachments into clusters

Microarray analysis by Gavin et al.[6] of their predicted complex components showed that attachment proteins were closely associated with core proteins within complexes and yet showed a greater degree of heterogeneity in expression levels, supporting the notion that attachments might represent non-stoichiometric components. Also, attachment proteins were seen to be shared between two or more complexes, consistent with the view that the same protein may participate in multiple complexes [20, 21]. On the other hand, the application of MCL to PPI networks yields clusters that do not share proteins (non-overlapping clusters). Mapping these clusters back to the original PPI network shows that proteins having similar connectivities to multiple clusters are assigned arbitrarily to only one of the clusters. These proteins could equally well be assigned to multiple clusters. To reflect these findings in our algorithm, we expect the attachment proteins to be those proteins within complexes that are:

  • Non-core proteins;

  • Closely interacting with the core proteins; and

  • May be shared across multiple complexes.

We consider the following criteria to assign a non-core protein p belonging to a cluster C j (called donor cluster) as an attachment in an acceptor cluster C i (the donor and acceptor clusters may be the same), that is, p ∈ Attach(C i ):

  • Protein p has sufficiently strong interactions with the core proteins Core(C i ) of the cluster C i ;

  • The stronger the interactions among the core proteins, the stronger have to be the interactions of p with the core proteins;

  • For large core sets, strong interactions are required with only some of the core proteins or, alternatively, weaker interactions with most of them.

Combining these criteria, we assign non-core p as an attachment in the acceptor cluster C i , that is p ∈ Attach(C i ), if:

I p ≥ α · I c · (S c /2)^−γ ,
(2)

where I p = I(p, Core(C i )) is the total weight of interactions of p with Core(C i ), given by I(p, Core(C i )) = ∑ {w(p, q): q ∈ Core(C i )}, while I c = I(Core(C i )) is the total weight of interactions among the core proteins of C i , given by I(Core(C i )) = (1/2) · ∑ {w(q, r): q, r ∈ Core(C i )}, and S c = |Core(C i )|; the factor (S c /2) is normalized to yield 1 for core sets of size two. The parameters α and γ are used to control the effects of I(Core(C i )) and |Core(C i )|. For a simple illustration, let α = 0.5 and γ = 1, and consider all interactions to be of equal weight 1. Then p is attached to a core set of four proteins if the total weight of its interactions with the core proteins is at least 3, which is possible if p is connected to at least three core proteins (how to choose values for α and γ in practice is discussed in the "Results" section). This step ensures that non-core proteins having sufficiently strong interactions with the cores of more than one cluster are recruited as attachments into all those clusters.
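Reading Eq. (2) with the size factor (S c /2) raised to the power −γ, so that the factor equals 1 for cores of size two, a hypothetical check for the attachment criterion could look like this (a sketch under that reading, not the authors' code):

```python
def is_attachment(adj, p, core, alpha=1.0, gamma=0.75):
    """Check I_p >= alpha * I_c * (S_c / 2) ** (-gamma) for a non-core p.

    I_p: total weight of p's interactions with the core proteins.
    I_c: total weight of interactions among the core proteins (each
         unordered pair counted once via the 1/2 factor).
    """
    core = set(core)
    i_p = sum(s for q, s in adj.get(p, {}).items() if q in core)
    i_c = 0.5 * sum(s for q in core
                    for r, s in adj.get(q, {}).items() if r in core)
    return i_p >= alpha * i_c * (len(core) / 2.0) ** (-gamma)
```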

Extracting out complexes from clusters

For each cluster we group together its constituent core and attachment proteins to define a unique complex. We expect all the remaining proteins within the cluster to have weaker associations with this resultant complex, and therefore categorize them as noisy proteins. In fact, experiments [28] have shown that MCL clusters tend to include several such noisy proteins, leading to a reduction in the accuracies of the clusters. Therefore, our step ensures that such noisy proteins are discarded in order to extract out more accurate complexes. Additionally, since these resulting complexes include attachment proteins that may be recruited by multiple complexes, this step ensures that our predicted complexes adhere to the protein-sharing phenomenon observed in real complexes [6, 20, 21]. We discard all complexes of size less than 4 because many of these are false positives. It is difficult to predict small real complexes solely based on interaction (topological) information (also noted in [16, 24]).

For each cluster C i , we define a unique complex Cmplx(C i ) as:

Cmplx(C i ) = Core(C i ) ∪ Attach(C i ).
(3)

Each interaction (p, q) among the constituent proteins p and q within this complex carries the weight w(p, q) observed in the PPI network.

Ranking the predicted complexes

As a final step, we output our predicted complexes in a reasonably meaningful order of biological significance. For this, we rank our predicted complexes in decreasing order of their weighted densities. The weighted density WD(C′ i ) of a predicted complex C′ i is given by [16]:

WD(C′ i ) = ∑ {w(p, q): p, q ∈ C′ i } / (|C′ i | · (|C′ i | − 1)).
(4)

The unweighted density of a predicted complex is defined similarly, by setting the weights of all constituent interactions to 1. This blindly favors very small complexes, or complexes whose proteins have a large number of interactions, without considering the reliability of those interactions. The weighted density, on the other hand, accounts for the reliability (by means of affinity scores) of such interactions. If two complexes have the same unweighted density, the complex with the higher weighted density is ranked higher.
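A sketch of the weighted-density ranking of Eq. (4), again assuming the adjacency-map representation of the scored network:

```python
def weighted_density(adj, complex_members):
    """Weighted density of Eq. (4): summing w(p, q) over ordered pairs
    equals twice the sum over unordered pairs, divided by n * (n - 1)."""
    members = list(complex_members)
    n = len(members)
    total = sum(adj.get(p, {}).get(q, 0.0)
                for i, p in enumerate(members) for q in members[i + 1:])
    return 2.0 * total / (n * (n - 1))

def rank_complexes(adj, complexes):
    """Rank predicted complexes in decreasing order of weighted density."""
    return sorted(complexes, key=lambda c: weighted_density(adj, c),
                  reverse=True)
```

For a fully connected complex with unit weights the weighted density is 1, the maximum attainable.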

Results

Preparation of experimental data

We gathered high-confidence Gavin and Krogan-Core interactions deposited in BioGrid http://thebiogrid.org/[32] (version as of July 2009). These were assembled from a combination of bait-prey and prey-prey relationships (the spoke and matrix models) observed by Gavin et al.[6], and the bait-prey relationships (the spoke model) observed by Krogan et al.[7]. We combined these interactions to build the unscored Gavin+Krogan network (all edge-weights were set to 1). We then applied Iterative-CD [15, 16] and FS Weight [14] scoring (with k = 2 iterations, as recommended in [16]) on the Gavin+Krogan network, and selected all interactions with non-zero scores. This resulted in the ICD(Gavin+Krogan) and FSW(Gavin+Krogan) networks, respectively. In addition to these two scored networks, we downloaded the Consolidated3.19 network (with PE cut-off 3.19, recommended by Collins et al.[11]) from http://interactome-cmp.ucsf.edu/, and the Bootstrap0.094 network [17] (with BT cut-off 0.094) from http://www.bio.ifi.lmu.de/Complexes/ProCope/. The Consolidated network was derived from the matrix modeled relationships of the original Gavin and Krogan datasets using the PE system [11]. Therefore, this network comprised additional prey-prey interactions that were missed in the Krogan 'Core' dataset. The Bootstrap network was derived from the matrix modeled relationships using the bootstrapped scores [17]. Table 1 summarizes some properties of these networks.

The benchmark (reference) set of complexes was built from hand-curated complexes derived from three sources: 408 complexes of the Wodak lab CYC2008 catalogue [29], 313 complexes of MIPS [30], and 101 complexes curated by Aloy et al.[31]. The properties of these reference sets are shown in Table 2. We considered each of these reference sets independently for the evaluation of MCL-CAw. We did not merge them into one comprehensive list of complexes because the individual complex compositions are different across the three sources and some complexes may also get double-counted (because of different names used for the same complex). An alternative strategy was adopted by Wang et al.[21] by integrating the complexes from three sources (MIPS [30], SGD [33] and their own in-house curated complexes) using the Jaccard score: two complexes overlapping with a Jaccard score of at least 0.7 were merged together - the proteins to be included into the resultant complex were chosen based on a voting scheme.

To be accurate (as well as fair) while evaluating our method on these benchmark sets, we considered only the set of derivable benchmark complexes from each of the PPI networks: if a protein is not present in a PPI network, we remove it from the set of benchmark complexes. By repeated removals, if the size of a benchmark complex shrinks below 3, we remove the complex from our benchmark set to generate the final set of derivable benchmark complexes for each of the PPI networks.
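The derivable-benchmark filtering described above can be sketched as:

```python
def derivable_benchmarks(benchmarks, network_proteins, min_size=3):
    """Restrict each benchmark complex to the proteins present in the PPI
    network; drop any complex that shrinks below min_size members."""
    network_proteins = set(network_proteins)
    kept = []
    for complex_ in benchmarks:
        present = set(complex_) & network_proteins
        if len(present) >= min_size:
            kept.append(present)
    return kept
```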

In order to evaluate the biological coherence of our predicted complexes, we downloaded the list of cellular localizations (GO terms under "Cellular Component") of proteins from Gene Ontology (GO) [34]. We selected only the informative GO terms. A GO term is informative if at least 30 proteins are annotated with it and none of its descendant terms is annotated with at least 30 proteins [35]. The list of essential genes was obtained from the Saccharomyces Genome Deletion Project [36, 37]: http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html

Evaluation metrics for matching predicted and benchmark complexes

Let B = {B1,B2,...,B m } and C = {C1,C2,...,C n } be the sets of benchmark and predicted complexes, respectively. We use the Jaccard coefficient J to quantify the overlap between a benchmark complex B i and a predicted complex C j :

J(B i , C j ) = |B i ∩ C j | / |B i ∪ C j |.
(5)

We consider B i to be covered by C j if J(B i , C j ) ≥ overlap threshold t. In our experiments, we set the threshold t = 0.5, which requires |B i ∩ C j | ≥ (|B i | + |C j |)/3. For example, if |B i | = |C j | = 8, then the overlap between B i and C j should be at least 6.
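Eq. (5) and the t = 0.5 coverage test in code:

```python
def jaccard(b, c):
    """Jaccard coefficient J(B, C) = |B ∩ C| / |B ∪ C| of Eq. (5)."""
    b, c = set(b), set(c)
    return len(b & c) / len(b | c)

def covers(b, c, t=0.5):
    """B is covered by C when J(B, C) >= t; at t = 0.5 this is equivalent
    to requiring |B ∩ C| >= (|B| + |C|) / 3."""
    return jaccard(b, c) >= t
```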

We use previously reported [16] definitions of recall Rc (coverage) and precision Pr (sensitivity) of the set of predicted complexes:

Rc = |{B i ∈ B: ∃ C j ∈ C, J(B i , C j ) ≥ t}| / |B|
(6)

Here, |{B i ∈ B: ∃ C j ∈ C, J(B i , C j ) ≥ t}| gives the number of derived benchmarks.

Pr = |{C j ∈ C: ∃ B i ∈ B, J(B i , C j ) ≥ t}| / |C|
(7)

Here, |{C j ∈ C: ∃ B i ∈ B, J(B i , C j ) ≥ t}| gives the number of matched predictions.
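Eqs. (6) and (7) translate directly into code:

```python
def recall_precision(benchmarks, predictions, t=0.5):
    """Recall: fraction of benchmark complexes matched by some prediction.
    Precision: fraction of predictions matching some benchmark complex."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    derived = sum(1 for b in benchmarks
                  if any(jaccard(b, c) >= t for c in predictions))
    matched = sum(1 for c in predictions
                  if any(jaccard(b, c) >= t for b in benchmarks))
    return derived / len(benchmarks), matched / len(predictions)
```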

We also evaluate the performance of our method by plotting the precision versus recall curves for the predicted complexes. These curves are plotted by tuning a threshold on the number of predicted complexes considered for the evaluation. The predicted complexes are considered in decreasing order of their weighted densities (that is, in increasing order of their complex ranks).

Biological coherence of predicted complexes

A complex can be formed only if its proteins are localized within the same compartment of the cell. So, we use the localization coherence of the predicted complexes as a measure of their quality. Let L = {L1, L2,..., L k } be the set of known localization groups, where each L i contains a set of proteins with similar localization annotations. The co-localization score LS(C j ) of a predicted complex C j is defined as the maximal fraction of its constituent proteins that are co-localized within the same localization group, among the proteins that have annotations. This is given as follows [16]:

LS(C j ) = max {|C j ∩ L i |: i = 1, 2, …, k} / |{p: p ∈ C j ∧ ∃ L i ∈ L, p ∈ L i }|.
(8)

Therefore, the co-localization score LS(C) for the set of predicted complexes C is just the weighted average over all complexes [16]:

LS(C) = ∑ {max {|C j ∩ L i |: i = 1, 2, …, k}: C j ∈ C} / ∑ {|{p: p ∈ C j ∧ ∃ L i ∈ L, p ∈ L i }|: C j ∈ C}.
(9)
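Eq. (9) in code, assuming the localization groups are given as sets of annotated proteins:

```python
def localization_score(complexes, localization_groups):
    """Co-localization score of Eq. (9): the sum over complexes of the
    largest overlap with any localization group, divided by the total
    number of annotated proteins across the complexes."""
    annotated = set().union(*localization_groups)
    numerator = denominator = 0
    for c in complexes:
        c = set(c)
        numerator += max(len(c & group) for group in localization_groups)
        denominator += len(c & annotated)   # only annotated proteins count
    return numerator / denominator
```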

Setting the parameters I, α and γ for MCL-CAw

Before evaluating the performance of MCL-CAw, we describe the procedure used for setting the inflation parameter I for MCL, and α and γ for core-attachment refinement, in order to determine a good combination of parameters for MCL-CAw in practice. Only the predicted complexes of size ≥ 4 from MCL and MCL-CAw were considered for setting the parameters as well as for further experiments. We used F1 (the harmonic mean of precision and recall) measured against the Wodak lab [29], MIPS [30] and Aloy [31] benchmarks as our basis for choosing the best values for these parameters.

We adopted the following four-step procedure for each PPI network:

  1. Run MCL for a range of I values and choose the I that offers the best F1 measure;

  2. Set I to the chosen value, set a certain α for MCL-CAw, and choose the γ from a range of values that offers the best F1 measure;

  3. Set I and γ to the chosen values, and choose the α for MCL-CAw from a range of values that offers the best F1 measure;

  4. Set α and γ for MCL-CAw to the chosen values, and reconfirm the value chosen for I.
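The four-step procedure amounts to a coordinate-wise search, which can be sketched as follows; `f1_of` is a placeholder for a full clustering-plus-evaluation run against a benchmark, not the authors' code, and in step 1 the refinement parameters are held at nominal values to stand in for a run of plain MCL:

```python
def tune_parameters(f1_of, I_range, alpha_range, gamma_range, alpha0=1.0):
    """Coordinate-wise maximization of F1 over (I, alpha, gamma)."""
    I = max(I_range, key=lambda i: f1_of(i, alpha0, gamma_range[0]))   # step 1
    gamma = max(gamma_range, key=lambda g: f1_of(I, alpha0, g))        # step 2
    alpha = max(alpha_range, key=lambda a: f1_of(I, a, gamma))         # step 3
    I = max(I_range, key=lambda i: f1_of(i, alpha, gamma))             # step 4: reconfirm I
    return I, alpha, gamma
```

Like any coordinate search, this finds a good combination rather than a guaranteed global optimum, which is why the final step re-checks I.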

Setting I for MCL

Inflation I in MCL determines the granularity of the clustering - the higher the value, the finer the clusters produced. Typical values used for clustering PPI networks are I = 1.8 and 1.9 [16, 19, 38]. For each PPI network, we ran MCL over a range of I values, and measured F1 against the three benchmark sets. We then normalized these F1 values against the best F1 obtained on each benchmark, summed up these normalized F1 values across benchmarks, and finally normalized these sums to obtain a final ranking for the I values. The detailed calculations are presented in Additional files 1, Tables S1 and S2. In Figure 1, we show sample F1 versus I plots for the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for the range I = 1.25 to 3.0. We noticed that inflation I = 2.5 gave the best F1 on both unscored and scored networks. The F1 obtained at I = 1.8 and 1.9 was only marginally lower than that at I = 2.5.

Figure 1

Setting the inflation parameter I in MCL: F1 versus I plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. I = 2.5 gave the best F1 for both unscored and scored networks.

Setting α and γ for CA refinement

For each PPI network, we set I to the chosen value, fixed a certain α, and ran MCL-CAw over a range of γ. We adopted the same method as above to choose the value of γ offering the best F1 measure. Figure 2 shows sample F1 versus γ plots on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for I = 2.5, α = 1.00 and γ = 0.15 to 1.50. The detailed calculations are presented in Additional files 1, Table S3. We noticed that γ = 0.75 gave the best F1 on both unscored and scored networks.

Figure 2

Setting the parameter γ in core-attachment refinement: F1 versus γ plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. γ = 0.75 gave the best F1 for both unscored and scored networks (I = 2.50 and α = 1.00).

Next, we set I and γ to the chosen values, and ran MCL-CAw over a range of α. Figure 3 shows sample F1 versus α plots on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for I = 2.5, γ = 0.75 and α = 0.50 to 1.75. The detailed calculations are presented in Additional file 1, Table S4. We noticed that α = 1.50 gave the best F1 on the unscored network, while α = 1.0 gave the best F1 on the scored networks.

Figure 3

Setting the parameter α in core-attachment refinement: F1 versus α plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. α = 1.50 gave the best F1 for the unscored network (I = 2.50 and γ = 0.75). α = 1.00 gave the best F1 for the scored networks (I = 2.50 and γ = 0.75).

Reconfirming I for the chosen values of α and γ

Finally, for each PPI network, we ran core-attachment refinement with the chosen values of α and γ over a range of I for MCL. Figure 4 compares the F1 versus I plots for plain MCL and MCL followed by CA refinement on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks over the range I = 1.25 to 3.0. The plots reconfirmed that the chosen values of α and γ gave the best performance for CA refinement when I = 2.5 (except on the Aloy benchmark, the smallest of the three, for which F1 was best at I = 1.75 and marginally lower at I = 2.5). The detailed calculations are presented in Additional file 1, Tables S5 and S6. We settled on I = 2.5, α = 1.50 and γ = 0.75 for the unscored Gavin+Krogan network, and I = 2.5, α = 1.0 and γ = 0.75 for the scored networks as our final combination of parameters for MCL-CAw.

Figure 4

Reconfirming inflation I for MCL-CAw. (a): Plot for the unscored Gavin+Krogan network with α = 1.50 and γ = 0.75. (b): Plot for the scored ICD(Gavin+Krogan) network with α = 1.00 and γ = 0.75. I = 2.5 gave the best F1 for these chosen values of α and γ (except on the Aloy benchmark, the smallest of the three, on which I = 1.75 gave marginally better F1).

Evaluating the performance of MCL-CAw

Figure 5 shows the workflow considered for the evaluation of MCL-CAw. The predicted complexes were tapped at two successive stages:

Figure 5

Workflow for evaluation of MCL-CAw. The predicted complexes of MCL-CAw were tapped at two stages: (i) Clustering using MCL; (ii) Hierarchical clustering followed by core-attachment refinement using MCL-CAw. These predicted complexes were evaluated by matching them to the set of benchmark complexes.

  1. After clustering using MCL;

  2. After hierarchical clustering followed by core-attachment refinement using MCL-CAw.

The effect of core-attachment refinement on the predictions of MCL

Compare the topmost rows in Table 3 for MCL and MCL-CAw evaluated on the unscored Gavin+Krogan network. They show that MCL-CAw achieved significantly higher recall than MCL on Gavin+Krogan, deriving on average 31% more complexes. In fact, referring back to Figure 4(a), MCL-CAw achieved higher F1 than MCL over the entire range I = 1.25 to 3.00. To analyse this improvement further, we considered two sets of complexes derived from Gavin+Krogan: (a) Set A = MCL ∩ MCL-CAw, consisting of all complexes correctly predicted by both MCL and MCL-CAw, but with different Jaccard accuracies; (b) Set B = MCL-CAw\MCL, consisting of all complexes correctly predicted by MCL-CAw but not by MCL. No complex correctly predicted by MCL was missed by MCL-CAw. We calculated the percentage increase in accuracies for complexes in A and B. The increase for A was noticeably high, averaging 7.53% on the Wodak set; the increase for B was significantly higher, averaging 62.26% on the Wodak set. This shows that: (a) CA refinement was successful in improving the accuracies of MCL clusters; (b) the improvement was particularly high for low-quality MCL clusters (that is, set B). MCL-CAw was successful in elevating accuracies above the threshold t = 0.5 for clusters that were difficult to match to known complexes using MCL alone. Consequently, MCL-CAw derived a significantly larger number of benchmark complexes than MCL.
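The matching criterion used throughout this evaluation, Jaccard accuracy against a benchmark complex with threshold t = 0.5, can be sketched as follows (a minimal sketch assuming set-based Jaccard overlap; function names are illustrative):

```python
def jaccard(predicted, benchmark):
    """Jaccard accuracy between a predicted and a benchmark complex,
    both treated as sets of proteins."""
    p, b = set(predicted), set(benchmark)
    return len(p & b) / len(p | b)

def correctly_predicted(predicted, benchmarks, t=0.5):
    """A prediction counts as correct if it matches some benchmark
    complex with Jaccard accuracy at least t."""
    return any(jaccard(predicted, b) >= t for b in benchmarks)
```

Under this criterion, raising a cluster's accuracy from just below 0.5 to just above it (as CA refinement does for the clusters in set B) turns a missed benchmark complex into a derived one.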

Table 3 (i) Impact of core-attachment refinement on MCL; (ii) Role of affinity scoring in reducing the impact of natural noise on MCL and MCL-CAw

Impact of noise on MCL and MCL-CAw and the role of affinity scoring in reducing this impact

Table 3 compares the evaluation metrics for MCL and MCL-CAw on the unscored Gavin+Krogan network against those on the four scored PPI networks. Both MCL and MCL-CAw showed considerable improvement in precision and recall on the scored networks. For example, on the four scored networks MCL achieved about 127% higher precision and 51.3% higher recall (on average), while MCL-CAw achieved about 132% higher precision and 26.6% higher recall (on average, on the Wodak lab benchmark), compared to the unscored Gavin+Krogan network. The precision versus recall curves (Figure 6) on Gavin+Krogan dropped sharply, while those for three of the scored networks, ICD(Gavin+Krogan), FSW(Gavin+Krogan) and Consolidated3.19, displayed a more "graceful" decline. The curve for Bootstrap0.094 dipped suddenly towards the beginning, but stabilized subsequently to achieve higher (final) precision and recall than on the unscored Gavin+Krogan network.
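A common convention for precision and recall in complex prediction, which we assume here for illustration (the paper's exact definitions are given in its Methods), is: precision is the fraction of predictions that match some benchmark complex, and recall is the fraction of benchmark complexes recovered by some prediction:

```python
def precision_recall(predicted, benchmark, match):
    """Precision and recall for complex prediction.

    match(p, b) -> bool decides whether prediction p recovers
    benchmark complex b (e.g. Jaccard accuracy >= 0.5).
    """
    # Predictions that recover at least one benchmark complex
    tp = sum(1 for p in predicted if any(match(p, b) for b in benchmark))
    # Benchmark complexes recovered by at least one prediction
    hit = sum(1 for b in benchmark if any(match(p, b) for p in predicted))
    prec = tp / len(predicted) if predicted else 0.0
    rec = hit / len(benchmark) if benchmark else 0.0
    return prec, rec
```

Sweeping the match threshold (or the number of top-ranked predictions considered) over a range yields the precision versus recall curves in Figure 6.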

Figure 6

Impact of affinity scoring on the performance of MCL and MCL-CAw. (a) and (b): Precision versus recall curves on Gavin+Krogan and the four scored networks (ICD(Gavin+Krogan), FSW(Gavin+Krogan), Consolidated3.19 and Bootstrap0.094) for MCL and MCL-CAw, respectively, evaluated on the Wodak benchmark with t = 0.5. Both methods showed significant improvement on the scored networks compared to the unscored Gavin+Krogan network.

Among the four scored PPI networks, both MCL and MCL-CAw showed the best precision and recall on the Consolidated3.19 network, which can be directly attributed to the high quality of this network. However, this high quality came at the expense of lower protein coverage (see Table 4; also noted in [20]), resulting in a reduced number of derivable complexes. To counter this, we gathered a larger subset of the Consolidated network with PE cut-off 0.623 (the average PE score), which accounted for higher protein coverage (Table 4). We noticed that the improvement of MCL-CAw over MCL was significantly higher on Consolidated0.623 than on Consolidated3.19. We also noticed that ICD scoring of Consolidated0.623 drastically reduced the size of this network, revealing that this larger subset in fact included a significant amount of false positives (noise). These experiments indicate that any reasonably good algorithm like MCL can perform well on high-quality networks. However, given the limited protein coverage and scarcity of such high-quality networks, we need to consider larger networks for complex detection (particularly to be able to detect novel complexes). This in turn exposes the algorithms to higher amounts of natural noise (even in scored networks). Therefore, there is a need for algorithms that can detect a larger number of complexes in the presence of such noise. In this scenario, our results show that MCL-CAw is able to derive a considerably larger number of complexes than MCL. Taking this further, we introduced different levels of random noise to study its impact on MCL and MCL-CAw: we added 10% to 75% random noise (2000 to 10000 random interactions) to the Gavin+Krogan network. We noticed that MCL-CAw performed better than MCL even upon introducing 50% random noise (Table 5). However, at 75% random noise, the performance of MCL-CAw dropped marginally below that of MCL.
MCL-CAw was therefore reasonably robust to random noise: it was stable in the range of 10% to 40% noise, which covers the typical levels of noise seen in TAP/MS datasets [9] (we say this keeping in mind that MCL has been shown to be robust even at 80% random noise [38]). We next scored these noisy networks using the ICD scheme, and found that the performance of both MCL and MCL-CAw improved considerably on the resulting scored networks. MCL-CAw performed considerably better than MCL even at 50% to 75% random noise (Table 5). Therefore, affinity scoring helped MCL-CAw maintain its performance gain over MCL.
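The noise-injection experiment above can be sketched as follows. This is an illustrative sketch, not the authors' exact procedure; the edge representation and seeding are our assumptions:

```python
import random

def add_random_noise(edges, proteins, fraction, seed=0):
    """Augment a PPI edge set with fraction * |edges| spurious interactions.

    edges: set of frozenset pairs (undirected interactions);
    proteins: list of node names to draw random endpoints from.
    Returns a new edge set with the random (noise) edges added.
    """
    rng = random.Random(seed)
    noisy = set(edges)
    target = len(noisy) + int(fraction * len(edges))
    while len(noisy) < target:
        # sample() picks two distinct proteins, so no self-loops;
        # the set discards duplicates of existing edges
        u, v = rng.sample(proteins, 2)
        noisy.add(frozenset((u, v)))
    return noisy
```

Scoring such a noisy network with a scheme like ICD would then down-weight the spurious edges, which is how affinity scoring restores the performance gain reported in Table 5.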

Table 4 MCL-CAw performed considerably better than MCL in the presence of natural noise
Table 5 (i) Impact of introducing different levels of artificial noise on MCL and MCL-CAw (ii) Role of affinity scoring in reducing the impact of noise

Biological coherence of predicted components

The co-localization scores for the various predicted components (cores and whole complexes) of MCL-CAw are shown in Table 6. The table shows that: (a) the predicted complexes of MCL-CAw showed higher co-localization scores than those of MCL on both the unscored and scored PPI networks; MCL included several noisy proteins in its predicted clusters, thereby reducing their biological coherence; (b) the predicted cores of MCL-CAw displayed higher scores than whole complexes, indicating that proteins within cores were highly co-localized; (c) the complexes of both MCL and MCL-CAw displayed higher scores on the four scored networks than on the Gavin+Krogan network.

Table 6 Co-localization scores for predicted components from MCL and MCL-CAw

Relative ranking of complex prediction algorithms and affinity-scored networks

In order to gauge the performance of MCL-CAw relative to existing techniques, we selected the following recent algorithms proposed for complex detection:

  • On the unscored Gavin+Krogan network, we compared against MCL [18, 19], our preliminary work MCL-CA (2009) [28], CORE by Leung et al. (2009) [24], COACH by Wu Min et al. (2009) [25], CMC by Liu et al. (2009) [16], and HACO by Wang et al. (2009) [21];

  • On the affinity-scored networks, we compared against MCL, MCL incorporated with cluster overlaps by Pu et al. (2007) [20] (our implementation of this, called MCLO), CMC and HACO.

Table 7 summarizes some of the properties of, and the parameters used for, these methods. We consider only complexes of size at least 4 from all algorithms throughout this evaluation. We dropped MCL-CA, CORE and COACH from the comparisons on the affinity-scored networks because these methods assume unweighted networks as input. Further, we do not show results for the older methods MCODE by Bader and Hogue (2003) [8] and RNSC by King et al. (2004) [39], and instead include MCL in all our comparisons, because MCL significantly outperforms these methods [16, 38].

Table 7 Existing complex detection methods selected for comparisons with MCL-CAw

Tables 8, 9, 10, 11 and 12 show detailed comparisons between complex detection algorithms on the unscored and scored networks. Figures 7 and 8 show the precision versus recall curves on these networks, while Table 13 shows the area-under-the-curve (AUC) values for these curves. Allowing for ± 5% error in AUC values, the table shows that CORE attained the highest AUC, followed by MCL-CAw and CMC, on the unscored network, while MCL-CAw and CMC achieved the overall highest AUC on the scored networks. In addition, on each network we ranked the algorithms based on their normalized final F1 measures (relative to the best-performing algorithm on that network), as shown in Table 14. We summed the normalized F1 values for each algorithm across all networks to obtain an overall ranking, as shown in Table 15. The detailed calculations are presented in Additional file 1, Table S7. On the unscored network CMC showed the best F1 value, while on the scored networks MCL-CAw showed the best overall F1 value. In particular, MCL-CAw performed best on the ICD(Gavin+Krogan), FSW(Gavin+Krogan) and Consolidated3.19 networks, while HACO performed best on Bootstrap0.094. This more or less agreed with the relative performance gathered from the AUC values (Table 13).
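AUC values for precision versus recall curves like those in Figures 7 and 8 can be computed from the curve's sample points by the trapezoidal rule. The paper does not specify its exact integration method, so the following is a minimal sketch under that assumption:

```python
def auc_trapezoid(recalls, precisions):
    """Area under a precision-versus-recall curve via the trapezoidal rule.

    Points are sorted by recall; adjacent points are joined linearly.
    """
    pts = sorted(zip(recalls, precisions))
    return sum((r2 - r1) * (p1 + p2) / 2.0
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))
```

A curve that declines "gracefully" (high precision sustained over a wide recall range) yields a larger area than one with a sudden early dip, which is why the Bootstrap0.094 curves score lower despite comparable endpoints.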

Table 8 Comparisons between the different methods on the unscored Gavin+Krogan network
Table 9 Comparisons between the different methods on the ICD(Gavin+Krogan) network
Table 10 Comparisons between the different methods on the FSW(Gavin+Krogan) network
Table 11 Comparisons between the different methods on the Consolidated3.19 network
Table 12 Comparisons between the different methods on the Bootstrap0.094 network
Figure 7

Comparison of complex detection algorithms on the unscored Gavin+Krogan network. (a): Precision versus recall curves and area-under-the-curve (AUC) values for complex detection algorithms on the unscored Gavin+Krogan network, evaluated on the Wodak reference with t = 0.5. AUC for MCL = 0.225, COACH = 0.169, CORE = 0.361, MCL-CAw = 0.323, CMC = 0.271, MCL-CA = 0.238, HACO = 0.136. (b): Number of predicted complexes, and proportions of true positives (correctly matched to benchmark(s)) and false positives (not matched to any benchmark) for the algorithms.

Figure 8

Comparison of complex detection algorithms on four scored networks. Precision versus recall curves and area-under-the-curve (AUC) values for complex detection algorithms evaluated on Wodak reference with t = 0.5. (a) ICD(Gavin+Krogan): AUC values MCL = 0.436, CMC = 0.494, MCL-CAw = 0.472, MCLO = 0.435, HACO = 0.305. (b) FSW(Gavin+Krogan): AUC values MCL = 0.431, CMC = 0.481, MCL-CAw = 0.487, MCLO = 0.430, HACO = 0.461. (c) Consolidated3.19: AUC values MCL = 0.469, CMC = 0.399, MCL-CAw = 0.488, MCLO = 0.463, HACO = 0.367. (d) Bootstrap0.094: AUC values MCL = 0.349, CMC = 0.513, MCL-CAw = 0.389, MCLO = 0.353, HACO = 0.317.

Table 13 Area under the curve (AUC) values of precision versus recall curves for complex detection methods on the unscored and scored PPI networks
Table 14 Relative ranking of complex detection algorithms on unscored and affinity-scored networks
Table 15 Overall relative ranking of complex detection algorithms on unscored and affinity-scored networks

The precision of MCL-CAw (0.397) was lower on Bootstrap0.094 than on the other scored networks (ICD: 0.620, FSW: 0.615, Consolidated3.19: 0.672). MCL-CAw produced many redundant complexes from this network compared to the other scored networks, leading to the drop in precision. In fact, we observed such variance in the CMC and HACO algorithms as well. For example, CMC achieved its best recall on the ICD network, but its lowest on the Consolidated network. Also, CMC produced significantly fewer complexes (77) on the Consolidated network than on the other networks (ICD: 171, FSW: 179, Bootstrap: 203). Further, all algorithms displayed "sudden dips" towards the beginning of their precision versus recall curves on the Bootstrap0.094 network (see Figure 8). All these findings indicate that the choice of affinity scoring scheme affected the performance of the algorithms. In other words, each algorithm made use of certain characteristics of the PPI networks, and favored a scoring scheme that magnified or reinforced those characteristics. No single algorithm performed best on all the scored networks. That said, MCL-CAw ranked among the top three algorithms on all scored networks, and therefore responded reasonably well to the considered affinity scoring schemes.

We also ranked the different affinity-scored networks based on the F1 measures they offered to the complex detection algorithms, as shown in Tables 16 and 17. These tables show that the Consolidated3.19 network offered the best F1 measures, followed by the FSW(Gavin+Krogan), ICD(Gavin+Krogan) and Bootstrap0.094 networks (the detailed calculations are presented in Additional file 2, Table S8). This agreed well with the fact that the Consolidated3.19 network was shown to have a TP/FP ratio comparable to that of small-scale experiments from MIPS, and was therefore of very high quality [11].

Table 16 Relative ranking of affinity scoring schemes for complex detection
Table 17 Overall relative ranking of affinity scoring schemes for complex detection

Impact of augmenting physical PPI networks with computationally inferred interactions

In this set of experiments, we studied whether augmenting the physical PPI networks with inferred interactions improved the performance of complex detection algorithms. We gathered interactions in yeast comprising inferred interologs (inferred from interactions between orthologous proteins in organisms such as fly, mouse and human), as well as interactions based on genetic (gene fusion, chromosomal proximity, gene co-evolution) and functional (traits of neighbors, neighbors of neighbors, etc.) associations, downloaded from the Predictome database [40] (http://cagt.bu.edu/page/Predictome_about). These were used to generate the Inferred network (Table 1). We then augmented the Gavin+Krogan network with these interactions to generate the Gavin+Krogan+Inferred network and its scored versions, the ICD(Gavin+Krogan+Inferred) and FSW(Gavin+Krogan+Inferred) networks (Table 1).

We evaluated MCL, MCL-CAw, CMC and HACO on these augmented networks (Table 18). All the algorithms displayed very low precision and recall on the Inferred network, indicating that the inferred interactions alone were not sufficient to predict meaningful complexes. Interestingly, most algorithms displayed a marginal dip in performance on Gavin+Krogan+Inferred compared to Gavin+Krogan. This dip was explained by the analysis of the two augmented-scored networks, ICD(Gavin+Krogan+Inferred) and FSW(Gavin+Krogan+Inferred): most algorithms showed higher precision and recall on these two networks than on Gavin+Krogan and Gavin+Krogan+Inferred. This indicates that augmenting with raw inferred interactions gave little benefit due to the presence of false positives (noise), but scoring the augmented networks helped improve the precision and recall of the algorithms.

Table 18 Impact of augmenting inferred interactions on the performance of MCL, MCL-CAw, CMC and HACO

In-depth analysis of individual predicted complexes

To facilitate the analysis of our individual predicted complexes, we mapped the complexes back to the corresponding PPI networks and examined the interactions between components of the same complex, as well as between components of a given complex and other proteins in the network. We performed this analysis using the Cytoscape visualization environment http://www.cytoscape.org/[41].

Instances of correctly predicted complexes of MCL-CAw

The first example is of an attachment protein shared between two predicted complexes of MCL-CAw. The subunits of these predicted complexes (Id# 57 and 22) make up the Compass complex involved in telomeric silencing of gene expression [42], and the mRNA cleavage and polyadenylation specificity factor, a complex involved in RNAP II transcription termination [43]. The shared attachment Swd2 (Ykl018w) formed high-confidence connections with the subunits of both predicted complexes. On this basis, the post-processing procedure assigned Swd2 (Ykl018w) to both predicted complexes, in agreement with available evidence [44] that Swd2 (Ykl018w) belongs to both the Compass and mRNA cleavage complexes. The next example illustrates a case where a new protein was predicted as a subunit of a known complex. The attachment protein Ski7 (Yor076c) was included in a predicted complex (Id# 28) that matched the Exosome complex involved in RNA processing and degradation [45]. Additionally, Ski7 (Yor076c) was also included in a prediction (Id# 105) matching the Ski complex (Additional file 1, Figure S2). However, the Ski complex in the Wodak lab catalogue [29] did not include this new protein. A further literature survey suggested that Ski7 acts as a mediator between the Ski and Exosome complexes for 3'-to-5' mRNA decay in yeast [46].

The RNA polymerase I, II, and III complexes (also called Pol I, II, and III, respectively) are required for the generation of RNA chains [47]. As per the Wodak lab catalogue [29], all the three complexes share subunits: Yor224c, Ybr154c, Yor210w and Ypr187w, while Pol I and Pol III share Ynl113w and Ypr110c. Due to the extensive sharing of subunits, the corresponding predictions were grouped together into one large cluster by MCL. On the other hand, MCL-CAw segregated the large cluster into three independent complexes, which matched the Pol I, Pol II and Pol III complexes with accuracies of 0.714, 0.734 and 0.824, respectively.

In addition to these cases, a good fraction of the already known core-attachment structures (reported in the supplementary materials of Gavin et al.[6]) were confirmed, and putative complexes were identified (preparation of a compendium is currently in progress). Some examples are worth quoting here. Our predicted complex Id# 44 closely matched the HOPS complex. All five core proteins {Ylr148w, Ylr396c, Ymr231w, Ypl045w, Yal002w} and both attachments {Ydr080w, Ydl077c} covered by our prediction matched those reported in Gavin et al. Biological experiments show that the cores carry out vacuole protein sorting, and that, with the help of the attachments, the complex can perform homotypic vacuole fusion [48]. We identified the ubiquitin ligase ERAD-L complex comprising Yos9 (Ydr057w), Hrd3 (Ylr207w), Usa1 (Yml029w) and Hrd1 (Yol013c), which is involved in the degradation of ER proteins [49]. This matched the Hrd1/Hrd3 complex purified by Gavin et al. Four subunits {Oca4, Oca5, Siw14, Oca1} of a predicted novel complex (Id# 66) showed high similarity in function (oxidant-induced cell-cycle arrest) and localization (cytoplasmic) when verified in SGD [33]. This complex exactly matched the putative complex 490 in Gavin et al.

Instances depicting mistakes in the predictions of MCL-CAw

Here we discuss an interesting case in which the sharing of subunits was so extensive, and the web of interactions so dense, that separating out the smaller subsumed complexes purely on the basis of interaction information was much harder: the amalgamation of the clusters matching the SAGA, SAGA-like (SLIK), ADA and TFIID complexes. Based on the Wodak lab catalogue [29], the 20 subunits making up the SAGA complex involved in transcriptional regulation [50] include four subunits (Ygr252w, Ydr176w, Ydr448w, Ypl254w) that are also members of the ADA complex [51]. Sixteen components of the SAGA complex, including the four shared with the ADA complex, are also components of the SLIK complex [52]. Additionally, five subunits (Ybr198c, Ygl112c, Ymr236w, Ydr167w, Ydr145w) of the SAGA complex also belong to the TFIID complex [50]. Because of such extensive sharing of subunits involved in a dense web of interactions (436 interactions among 31 constituent proteins, as seen in the ICD(Gavin+Krogan) network), MCL-CAw was able to segregate out only two distinct complexes: SAGA (0.708) and SLIK (0.625). The clusters matching TFIID and ADA remained amalgamated. In the next set of analyses, we compared the complexes derived from the Gavin+Krogan and ICD(Gavin+Krogan) networks, and identified cases where MCL-CAw had missed a few proteins or whole complexes due to affinity scoring. From the Wodak, MIPS and Aloy reference sets, there were 13, 18 and 16 complexes, respectively, that were derived with better accuracies from the Gavin+Krogan network than from the ICD(Gavin+Krogan) network; and there were 6, 2 and 2 complexes, respectively, that were derived from the Gavin+Krogan network but missed entirely from the ICD(Gavin+Krogan) network. Table 19 shows a sample of such complexes from the Wodak reference set.
For the complexes that were derived with lower accuracies (upper half of Table 19), MCL-CAw had missed a few proteins due to the low scores assigned to the corresponding interactions. For example, in the predicted complex from the ICD(Gavin+Krogan) network matching the SWI/SNF complex, two (Ymr033w and Ypr034w) of the four missed proteins were absent due to their weak connections with the rest of the members; instead, these proteins were present in the prediction matching the RSC complex. In the Gavin+Krogan network, these two proteins were shared between the two complexes matching the SWI/SNF and RSC complexes, which also agreed with the Wodak catalogue [29].

Table 19 Complexes derived with lesser accuracy or missed by MCL-CAw due to affinity scoring

In the cases where MCL-CAw had completely missed complexes from the scored network (lower half of Table 19), it is interesting to note that MCL-CAw had pulled in many additional (noisy) proteins as attachments, which caused the accuracies to drop below 0.5. One such case is the predicted complex Id# 36 matching the eIF3 complex with a low Jaccard score of 0.4. The eIF3 complex from the Wodak lab consisted of 7 proteins: Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Ymr012w and Ymr146c. The predicted complex Id# 66 from the Gavin+Krogan network consisted of 8 proteins (Figure 9): 5 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c) and 3 attachments (Yor096w, Yal035w, Ydr091c). Therefore, there were 2 missed and 3 additional proteins in the prediction, leading to an accuracy of 0.5. The predicted complex Id# 36 from the ICD(Gavin+Krogan) network consisted of 14 proteins: 6 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Yor096w) and 8 attachments (Yal035w, Ydr091c, Yjl190c, Yml063w, Ymr146c, Ynl244c, Yor204w, Ypr041w). Therefore, there were 1 missed and 8 additional proteins, leading to an even lower accuracy of 0.4. All the core proteins had the same or similar GO annotations (involvement in translation, localized in the cytoplasm or ribosomal subunit) [34]. Upon analysing the GO annotations of the 8 attachment proteins, we noticed that only one (Ymr146c) had the same annotation as the core proteins; this protein was also part of the eIF3 complex from the Wodak lab [29]. Of the remaining 7 attachment proteins, five (Ypr041w, Ynl244c, Yml063w, Yjl190c, Ydr091c) had GO annotations (translation initiation, GTPase activity, cytoplasmic, ribosomal subunit) related to those of the core proteins. A literature search revealed that these proteins belonged to the multi-eIF initiation factor conglomerate (containing eIF1, eIF2, eIF3 and eIF5) and the 40S ribosomal subunit involved in translation [29].
The remaining two (Yal035w, Yor204w) were involved in translation activity, but were absent from the Wodak lab catalogue. These might be potentially new proteins belonging to the eIF3 or related complexes, and need to be investigated further. We also analysed the GO annotations of the level-1 neighbors of the predicted complex in the network; none of them had annotations similar to those of the proteins within the prediction. This example illustrates that carefully incorporating GO information into our algorithm to include or filter out proteins can be useful in cases where making decisions solely on the basis of interaction information is difficult.

Figure 9

Example of a complex missed by MCL-CAw from the ICD(Gavin+Krogan) network, but found from the Gavin+Krogan network. The eIF3 complex from Wodak lab consisted of 7 proteins: Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Ymr012w and Ymr146c. The predicted complex id#36 from the ICD(Gavin+Krogan) network consisted of 14 proteins: 6 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Yor096w) and 8 attachments (Yal035w, Ydr091c, Yjl190c, Yml063w, Ymr146c, Ynl244c, Yor204w, Ypr041w). Therefore, there were 1 missed and 8 additional proteins in the prediction, leading to a low accuracy of 0.4. Hexagonal (Orange): eIF3 complex from Wodak lab. Circle (Orange, Yellow and Pink): Predicted complex id#36. Rectangle (Turquoise): Level-1 neighbors to the predicted complex id#36.

Correlation between essentiality of proteins and their ability to form complexes

Early works by Jeong et al.[53] and Han et al.[54] studied the essentiality of proteins based on pairwise interactions within the interaction network, and concluded that hub (high-degree) proteins are more likely to be essential. This formed one of the criteria within the "centrality-lethality" rule [53]. However, deeper insight can be obtained by studying essentiality at the level of clusters or groups of proteins rather than pairwise interactions. Recently, Zotenko et al.[55] argued that essential proteins often group together into densely connected sets of proteins performing essential functions, and thereby become involved in a higher number of interactions, resulting in their hubness. Therefore, hubness may be just an indirect indicator of protein essentiality. More recently, Kang et al.[56] studied the essentiality of proteins by generating the reverse nearest neighbor (RNN) topology [57] from protein networks. This topology groups together those proteins that are within the reverse neighborhood of a given protein. Kang et al. concluded that centrality within the RNN topology is a better estimator of essentiality than hubness or degree in the interaction network. Studies by Hart et al.[12] showed that essential proteins are concentrated only in certain complexes, resulting in a dichotomy of essential and non-essential complexes. Wang et al.[21] concluded that the size of the (largest) recruiting complex of a protein may be a better indicator of protein essentiality than hubness.

In our work, we attempt to understand the relationship between the essentiality of proteins and their ability to form complexes. Table 20 shows that a high proportion (77.65%, 78.03%, 81.34% and 76.35% from the ICD(Gavin+Krogan), FSW(Gavin+Krogan), Consolidated3.19 and Bootstrap0.094 networks, respectively) of the essential proteins present in the four affinity-scored networks belonged to at least one correctly predicted complex. This indicated that essential proteins are often members of complexes or co-clustered groups of proteins.

Table 20 Essential genes in the predicted complexes of MCL-CAw

To further analyse this ability of essential proteins to form complexes or groups, we binned our correctly predicted complexes by size and calculated the proportion of essential proteins over all complexes in each bin (as in [21]). Figure 10(a) shows that essential proteins were present in higher proportions within larger complexes. We then calculated the proportion of essential proteins within the top K ranked complexes. Figure 10(b) shows that essential proteins were present in higher proportions within higher-ranked complexes. Both figures hint at the same finding: essential proteins come together in large groups to perform essential functions.
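The binning analysis above can be sketched as follows (a minimal illustrative sketch; the bin edges and names are ours, not the paper's):

```python
def essential_fraction_by_size(complexes, essential, bins):
    """For each size bin (lo, hi], compute the proportion of essential
    proteins among all member proteins of complexes whose size falls
    in that bin.

    complexes: iterable of sets of proteins; essential: set of
    essential proteins; bins: list of (lo, hi] size intervals.
    """
    out = {}
    for lo, hi in bins:
        # Pool members of every complex whose size falls in this bin
        members = [p for c in complexes if lo < len(c) <= hi for p in c]
        if members:
            out[(lo, hi)] = sum(p in essential for p in members) / len(members)
    return out
```

Plotting these per-bin fractions against bin size reproduces the kind of trend shown in Figure 10(a).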

Figure 10

Correlation between essentiality of proteins and their abilities to form complexes. (a): Proportion of essential proteins within complexes of different sizes, predicted from ICD(Gavin+Krogan). Proportion of essential proteins in a complex = #essential proteins/total #proteins in the complex. (b): Proportion of essential proteins within top K ranked complexes.

Discussion

In spite of the advances in computational approaches to derive complexes, high-accuracy reconstruction of complexes has still remained a challenging task. In deriving protein complexes from PPI networks, a key assumption made by most computational approaches is that complexes form densely connected regions within the networks. Therefore, these approaches attempt to cluster the networks based on measures related to connectivities between proteins in the network. Some approaches like MCL simulate random walks (called flow) to identify dense regions, while others like CMC merge maximal cliques into larger dense clusters. Therefore, the performance of these methods varies widely depending on network densities. A glance through Tables 8 to 12 reveals that all the methods considered for comparison in this work achieve very low recall on the MIPS set compared to the Wodak and Aloy sets. Table 2 shows that the average density of complexes in MIPS is much lower than that of Wodak and Aloy sets. Only 52 out of 137 (37.95%) derivable MIPS complexes of size ≥ 5 could be detected from the Gavin+Krogan network by all methods put together. We analysed the remaining 85 MIPS complexes and found most of them to have very low densities (average about 0.217) in the Gavin+Krogan network. For example, the MIPS complex 440.30.10 (involved in mRNA splicing) went undetected by all the methods even though 40 of its 42 proteins were present in the Gavin+Krogan network. There were 144 interactions among these 40 proteins, giving a low density of 0.184 to the complex in this network. Continuing with this analysis, we tested MCL and MCL-CAw on a PPI dataset from DIP http://dip.doe-mbi.ucla.edu, comprising of 17491 interactions among 4932 proteins giving a low average node degree of 7.092. MCL-CAw was able to achieve only marginal improvement (22.8% higher precision and 7.4% higher recall) over MCL, due to the low average node degree of the DIP network. 
These experiments show that all the methods considered here find it difficult to uncover complexes that are very sparse in the network. This should prompt us to reconsider whether too much importance is being given to modelling complexes as dense regions of PPI networks.
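To make the flow-simulation idea concrete, the sketch below shows a minimal, unweighted version of MCL's expansion-inflation loop; production MCL implementations additionally prune small entries and operate on weighted matrices, and the function name and thresholds here are our own illustrative choices:

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iters=100, tol=1e-8):
    """Minimal Markov Clustering: alternate expansion (flow) and inflation."""
    n = adj.shape[0]
    M = adj.astype(float) + np.eye(n)   # self-loops avoid oscillation
    M /= M.sum(axis=0)                  # column-stochastic flow matrix
    for _ in range(max_iters):
        prev = M
        M = M @ M                       # expansion: flow spreads along random walks
        M = M ** inflation              # inflation: strengthens intra-cluster flow
        M /= M.sum(axis=0)
        if np.abs(M - prev).max() < tol:
            break
    # each nonzero row of the converged matrix spans one cluster
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M}
    clusters.discard(())
    return clusters

# two triangles joined by a single bridge edge (2-3)
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
clusters = mcl(adj)   # flow collapses at the sparse bridge, separating the triangles
```

On this toy graph the inflation step starves the single bridge edge of flow, so the two dense triangles emerge as separate clusters; a sparse complex embedded in such a network suffers the same fate as the bridge.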

Apart from these limitations of existing computational methods, there are inherent difficulties in the interactome data itself that make complex detection hard. Complexes have different lifetimes, and their compositions vary with cellular localization (compartment) and condition. The same protein may be recruited by different complexes at different times and under different conditions. Owing to this temporal and spatial variability, repeated purifications using TAP/MS methods yield somewhat different "complex forms" [20]. The PPI networks constructed from such purifications represent only a probabilistic, averaged picture of the yeast interactome [20]. Therefore, complexes predicted from these networks can only approximate the actual complex compositions.

Another limitation arises from the bias in TAP/MS purifications against certain kinds of complexes (for example, membrane-bound complexes). Since TAP/MS data are acquired under a single condition (rich media), some complexes may simply not be present in the cell under that condition [21]. New experimental assays are therefore needed before such complexes can be reconstructed and studied.

Finally, even though S. cerevisiae serves as a model organism for eukaryotic interactome analysis, some key complexes specialized to other organisms (including human) can be studied only by analysing interaction datasets specific to those organisms. However, the incompleteness of interactome data from these organisms makes such reconstruction difficult.

Conclusion

The ultimate goal of interactome analysis is to understand the higher-level organization of the cell, and the reconstruction of protein complexes serves as a building block towards this goal. In this paper, inspired by the findings of Gavin et al. [6], we developed a novel core-attachment based refinement method coupled to MCL to identify yeast complexes from weighted PPI networks. We demonstrated that our algorithm (MCL-CAw) performs better than MCL in deriving meaningful yeast complexes, particularly in the presence of natural noise, and that it responds reasonably well to the affinity scoring schemes considered. In future work, we intend to improve the prediction ability of our algorithm by incorporating information from gene annotations, gene expression, literature mining, and domain-domain interactions. We also plan to extend our work to predict complexes in organisms other than yeast, and in that context to use the MCL-CAw model to study the existence (and extent) of core-attachment modularity in complexes from other organisms.

Availability

The MCL-CAw software is implemented in PL/SQL on Oracle 10g, using the framework in [58]. The source code, yeast PPI datasets, benchmark complexes, and predicted yeast complexes used in this work are freely available at the MCL-CAw project homepage hosted on the NUS server: http://www.comp.nus.edu.sg/~leonghw/MCL-CAw/.

References

  1. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci 2001, 98: 4569–4574. 10.1073/pnas.061034498

  2. Uetz P, Giot L, Cagney G, Traci A, Judson R, Knight J, Lockshon D, Narayan V, Srinivasan M, Pochart P, Emil QA, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg M: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. 10.1038/35001009

  3. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Seraphin B: A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnol 1999, 17: 1030–1032. 10.1038/13732

  4. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klien K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwin C, Heurtier MA, Copley RR, Edelmann A, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Sepharin B, Kuster B, Neubauer G, Furga GS: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. 10.1038/415141a

  5. Ho Y, Gruhler A, Heilbut A, Bader G, Moore L, Adams SL, Millar A, Taylor P, Bennet K, Boutlier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart M, Gouderault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielson P, Rasmussen K, Anderson J, Johansen L, Hansen L, Jesperson H, Podtelejnikov A, Nielson E, Crawford J, Poulsen V, Sorensen B, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CWV, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180–183. 10.1038/415180a

  6. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russel PB, Superti FG: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532

  7. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis A, Punna T, Alverez JM, Shales M, Zhang X, Davey M, Robinson M, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards D, Canadien V, Lalev A, Mena F, Wong P, Sharostine A, Canette M, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone J, Gandi K, Thompson NJ, Musso G, Onge PS, Ghanny S, Lam M, Butland G, Altaf-Ul A, Kanaya S, Shilatifard A, Weissman J, Ingles J, Hughes TR, Parkinson J, Gerstein M, Wodak S, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440: 637–643. 10.1038/nature04670

  8. Bader GD, Hogue CWV: Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 2002, 20: 991–997. 10.1038/nbt1002-991

  9. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale datasets of protein-protein interactions. Nature 2002, 417: 399–403. 10.1038/nature750

  10. Batada N, Hurst LD, Tyers M: Evolutionary and physiological importance of hub proteins. PLoS Comp Bio 2006, 2: e88. 10.1371/journal.pcbi.0020088

  11. Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege F, Weissman J, Krogan NJ: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 2007, 6: 439–450.

  12. Hart G, Lee I, Marcotte ER: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 2007, 8: 236–247. 10.1186/1471-2105-8-236

  13. Zhang B, Park BH, Karpinets T, Samatova N: From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 2008, 24: 979–986.

  14. Chua H, Ning K, Sung W, Leong H, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinformatics and Computational Biology 2008, 6: 435–466. 10.1142/S0219720008003497

  15. Liu G, Li J, Wong L: Assessing and predicting protein interactions using both local and global network topological metrics. Genome Informatics 2008, 22: 138–149.

  16. Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25: 1891–1897. 10.1093/bioinformatics/btp311

  17. Friedel C, Krumsiek J, Zimmer R: Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. Research in Computational Molecular Biology (RECOMB) 2008, 3–16.

  18. Dongen S: Graph clustering by flow simulation. PhD thesis. University of Utrecht; 2000.

  19. Enright AJ, Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–1584. 10.1093/nar/30.7.1575

  20. Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak S: Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 2007, 7: 944–960. 10.1002/pmic.200600636

  21. Wang H, Kakaradov B, Collins SR, Karotki L, Fiedler D, Shales M, Shokat KM, Walter T, Krogan NJ, Koller D: A complex-based reconstruction of the Saccharomyces cerevisiae interactome. Mol Cell Proteomics 2009, 8: 1361–1377. 10.1074/mcp.M800490-MCP200

  22. Friedel C, Zimmer R: Identifying the topology of protein complexes from affinity purification assays. Bioinformatics 2009, 25: 2140–2146.

  23. Voevodski K, Yu X: Spectral affinity in protein networks. BMC Systems Biology 2009, 3: 112. 10.1186/1752-0509-3-112

  24. Leung H, Xiang Q, Yiu SM, Chin FY: Predicting protein complexes from PPI data: a core-attachment approach. Journal of Comp Biology 2009, 16: 133–44. 10.1089/cmb.2008.01TT

  25. Wu M, Li X, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 2009, 10: 169. 10.1186/1471-2105-10-169

  26. Mitrofanova A, Farach-Colton M, Mishra B: Efficient and robust prediction algorithms for protein complexes using Gomory-Hu trees. Pacific Symposium on Biocomputing (PSB) 2009, 215–226.

  27. Ozawa Y, Saito R, Fujimori S, Kashima H, Ishizaka M, Yanagawa H, Miyamoto-Sato E, Tomita M: Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions. BMC Bioinformatics 2010, 11: 350. 10.1186/1471-2105-11-350

  28. Srihari S, Ning K, Leong HW: Refining Markov clustering for protein complex detection by incorporating core-attachment structure. Genome Informatics 2009, 23: 159–168.

  29. Pu S, Wong J, Turner B, Cho E, Wodak S: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res 2009, 37(3):825–831. 10.1093/nar/gkn1005

  30. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2006, 34: D169-D172. 10.1093/nar/gkj148

  31. Aloy P, Bottcher B, Ceulemans H, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes of yeast. Science 2004, 303: 2026–2029. 10.1126/science.1092645

  32. Breitkreutz B, Stark C, Tyers M: The GRID: The General Repository for Interaction Datasets. Genome Biology 2003, 4(3):R23. 10.1186/gb-2003-4-3-r23

  33. Cherry JM, Adler C, Chervitz SA, Dwight SS, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26: 73–79. 10.1093/nar/26.1.73

  34. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry M, Davis AP, Dolinski K, Dwight SS, Epigg J, Harris MA, Hill DP, Issel-Tarver L, Kasarkis A, Lewis S, Matase JC, Richardson J, Ringwald M, Rubin GM, Sherlock G: Gene ontology: a tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556

  35. Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci 2002, 99: 12783–8. 10.1073/pnas.192159399

  36. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, Chu AM, Connelly C, Davis K, Dietrich F, Dow SW, Bakkoury E, Foury F, Friend SH, Gentalen E, Giaever G, Hegemann JH, Jones T, Laub M, Liao H, Liebundguth N, Lockhart DJ, Lucau-Danila A, Lussier M, Menard P, Mittmann M, Pai C, Rebischung C, Revuelta JL, Riles L, Roberts CJ, Ross-MacDonald P, Scherens B, Snyder M, Sookhai-Mahadeo S, Storms RK, Veronneau S, Voet M, Volckaert G, Ward TR, Wysocki R, Yen GS, Yu K, Zimmermann K, Philippsen P, Johnston M, Davis RW: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999, 285: 901–906. 10.1126/science.285.5429.901

  37. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, Arkin AP, Astromoff A, El-Bakkoury M, Bangham R, Benito R, Brachat S, Campanaro S, Curtiss M, Davis K, Deutschbauer A, Entian KD, Flaherty P, Foury F, Garfinkel DJ, Gerstein M, Gotte D, Guldener U, Hegemann JH, Hempel S, Herman Z, Jaramillo DF, Kelly DE, Kelly SL, Kotter P, LaBonte D, Lamb DC, Lan N, Liang H, Liao H, Liu L, Luo C, Lussier M, Mao R, Menard P, Ooi SL, Revuelta JL, Roberts CJ, Rose M, Ross-Macdonald P, Scherens B, Schimmack G, Shafer B, Shoemaker DD, Sookhai-Mahadeo S, Storms RK, Strathern JN, Valle G, Voet M, Volckaert G, Wang CY, Ward TR, Wilhelmy J, Winzeler EA, Yang Y, Yen G, Youngman E, Yu K, Bussey H, Boeke JD, Snyder M, Philippsen P, Davis RW, Johnston M: Functional profiling of the Saccharomyces cerevisiae genome. Nature 2002, 418: 387–391. 10.1038/nature00935

  38. Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7: 488. 10.1186/1471-2105-7-488

  39. King AD, Przulj N, Jurisca I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20(17):3013–3020. 10.1093/bioinformatics/bth351

  40. Mellor JC, Yanai I, Karl H, Mintseris J, DeLisi C: Predictome: a database of putative functional links between proteins. Nucleic Acids Research 2002, 30: 306–309. 10.1093/nar/30.1.306

  41. Shannon P, Markiel A, Ozier O, Baliga NS, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303

  42. Miller T, Krogan NJ, Dover J, Bromage EH, Tempst P, Johnston M, Greenblatt JF, Shilatifard A: COMPASS: a complex of proteins associated with a trithorax-related SET domain protein. Proc Natl Acad Sci 2001, 98(23):12902–7. 10.1073/pnas.231473398

  43. Zhao J, Kessler M, Moore CL: Cleavage factor II of Saccharomyces cerevisiae contains homologues to subunits of the mammalian Cleavage/polyadenylation specificity factor and exhibits sequence-specific, ATP-dependent interaction with precursor RNA. J Biol Chem 1997, 272(16):10831–8. 10.1074/jbc.272.16.10831

  44. Cheng H, He X, Moore C: The Essential WD Repeat Protein Swd2 Has Dual Functions in RNA Polymerase II Transcription Termination and Lysine 4 Methylation of Histone H3. Mol Cell Biology 2004, 24: 2932–2943. 10.1128/MCB.24.7.2932-2943.2004

  45. Luz JS, Tavares JR, Gonzales FA, Santosa MCT, Oliveira CC: Analysis of the Saccharomyces cerevisiae exosome architecture and of the RNA binding activity of Rrp40p. Biochemistry J 2006, 89(5):686–691. 10.1016/j.biochi.2007.01.011

  46. Araki Y, Takahashi S, Kobaysashi T, Kajiho H, Hoshino S, Katada T: Ski7p G protein interacts with the exosome and the Ski complex for 3'-to-5' mRNA decay in yeast. EMBO J 2001, 20(17):4684–4693. 10.1093/emboj/20.17.4684

  47. Hurwitz J: The discovery of RNA polymerase. J Biol Chem 2005, 280(52):42477–42485. 10.1074/jbc.X500006200

  48. Seals DF, Eitzen G, Margolis N, Wickner T, Price A: A Ypt/Rab effector complex containing the Sec1 homolog Vps33p is required for homotypic vacuole fusion. Proc Natl Acad Sci 2000, 97(17):9402–9407. 10.1073/pnas.97.17.9402

  49. Carvalho P, Goder V, Rapoport TA: Distinct ubiquitin-ligase complexes define convergent pathways for the degradation of ER proteins. Cell 2006, 126(2):361–373. 10.1016/j.cell.2006.05.043

  50. Grant PA, Schieltz D, Pray-Grant MG, Reese JC, Yates JR, Wolkman JL: A subset of TAF(II)s are integral components of the SAGA complex required for nucleosome acetylation and transcriptional stimulation. Cell 1998, 94(1):45–53. 10.1016/S0092-8674(00)81220-9

  51. Eberharter A, Sterner DE, Schieltz D, Hassan A, Yates JR, Berger SL, Workman JL: The ADA complex is a distinct histone acetyltransferase complex in Saccharomyces cerevisiae. Mol Cell Biol 1999, 19(10):6621–6631.

  52. Grant PA, Schieltz D, McMahon SJ, Wood JM, Kennedy EL, Cook RG, Workman JL, Yates JR, Grant PA: The novel SLIK histone acetyltransferase complex functions in the yeast retrograde response pathway. Mol Cell Biol 2002, 22(24):8774–8786. 10.1128/MCB.22.24.8774-8786.2002

  53. Jeong H, Mason S, Barabasi AL, Oltvai Z: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. 10.1038/35075138

  54. Han JD, Bertin N, Hao T, Debra S, Gabriel F, Zhang V, Dupuy D, Walhout AJ, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein interaction network. Nature 2004, 430: 88–93. 10.1038/nature02555

  55. Zotenko E, Mestre J, Przytycka TM: Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Computational Biology 2008, 4(8): e1000140.

  56. Kang N, Ng HK, Srihari S, Leong HW, Nesvizhskii A: Examination of the Relationship between Essential Genes in PPI Network and Hub Proteins in Reverse Nearest Neighbor Topology. Personal communication

  57. Tao Y, Yiu ML, Mamoulis N: Reverse neighbor search in metric spaces. IEEE Trans Knowl Data Eng 2006, 18: 1239–1252. 10.1109/TKDE.2006.148

  58. Srihari S, Chandrashekar S, Parthasarathy S: A Framework for SQL-Based Mining of Large Graphs on Relational Databases. Pac Asia Conf Knowledge Discovery Data Mining (PAKDD) 2010, 2: 160–167.

Acknowledgements

We would like to thank the editor as well as the reviewers for their valuable comments and suggestions; Sean Collins (UCSF) and Caroline Friedel (LMU) for making available the Consolidated [11] and the Bootstrap [17] networks, respectively; Guimei Liu (NUS) and Limsoon Wong (NUS) for the Iterative-CD, FS Weight and CMC software [14–16]; and Henry Leung (HKU) and Wu Min (NTU) for the CORE [24] and COACH [25] software, respectively. This work was supported in part by the National University of Singapore under ARF grant R252-000-361-112.

Author information

Corresponding authors

Correspondence to Sriganesh Srihari, Kang Ning or Hon Wai Leong.

Additional information

Authors' contributions

SS conceived the initial ideas and discussed them with HWL and KN. SS devised the algorithm, developed the software, performed the experiments and analysis, and wrote and revised the manuscript. HWL supervised the project, advised SS, and reviewed and revised the manuscript. KN took part in the discussions and helped in reviewing the manuscript. All authors have read and approved the manuscript.

Electronic supplementary material

12859_2010_4087_MOESM1_ESM.PDF

Additional file 1: Additional figures and tables. Figures for core-attachment modularity and an illustration of a complex predicted by MCL-CAw; tables for the setting of MCL-CAw parameters, and for the ranking of complex detection algorithms and affinity-scored networks. (PDF 876 KB)

12859_2010_4087_MOESM2_ESM.ZIP

Additional file 2: The MCL-CAw software package. The source code and installation details for the MCL-CAw software. (ZIP 4 MB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Srihari, S., Ning, K. & Leong, H.W. MCL-CAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure. BMC Bioinformatics 11, 504 (2010). https://doi.org/10.1186/1471-2105-11-504
