  • Research article
  • Open access

MCL-CAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure

Abstract

Background

The reconstruction of protein complexes from the physical interactome of organisms serves as a building block towards understanding the higher-level organization of the cell. Over the past few years, several independent high-throughput experiments have helped to catalogue an enormous amount of physical protein interaction data from organisms such as yeast. However, these individual datasets show a lack of correlation with each other and also contain a substantial number of false positives (noise). Over these years, several affinity scoring schemes have also been devised to improve the quality of these datasets. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining datasets from multiple sources and by making use of these affinity scoring schemes. In the attempt towards tackling this challenge, the Markov Clustering algorithm (MCL) has proved to be a popular and reasonably successful method, mainly due to its scalability, robustness, and ability to work on scored (weighted) networks. However, MCL produces many noisy clusters, which either do not match known complexes or have additional proteins that reduce the accuracies of correctly predicted complexes.

Results

Inspired by recent experimental observations by Gavin and colleagues on the modularity structure in yeast complexes and the distinctive properties of "core" and "attachment" proteins, we develop a core-attachment based refinement method coupled to MCL for the reconstruction of yeast complexes from scored (weighted) PPI networks. We combine physical interactions from two recent "pull-down" experiments to generate an unscored PPI network. We then score this network using available affinity scoring schemes to generate multiple scored PPI networks. The evaluation of our method (called MCL-CAw) on these networks shows that: (i) MCL-CAw derives a larger number of yeast complexes, and with better accuracies, than MCL, particularly in the presence of natural noise; (ii) affinity scoring can effectively reduce the impact of noise on MCL-CAw and thereby improve the quality (precision and recall) of its predicted complexes; (iii) MCL-CAw responds well to most available scoring schemes. We discuss several instances where MCL-CAw was successful in deriving meaningful complexes, and where it missed a few proteins or whole complexes due to affinity scoring of the networks. We compare MCL-CAw with several recent complex detection algorithms on unscored and scored networks, and assess the relative performance of the algorithms on these networks. Further, we study the impact of augmenting physical datasets with computationally inferred interactions on complex detection. Finally, we analyse the essentiality of proteins within predicted complexes to understand a possible correlation between protein essentiality and the ability of proteins to form complexes.

Conclusions

We demonstrate that core-attachment based refinement in MCL-CAw improves the predictions of MCL on yeast PPI networks. We show that affinity scoring improves the performance of MCL-CAw.

Background

Most biological processes are carried out by proteins that physically interact to form stoichiometrically stable complexes. Even in the relatively simple model organism Saccharomyces cerevisiae (budding yeast), these complexes comprise many subunits that work in a coherent fashion. These complexes interact with individual proteins or other complexes to form functional modules and pathways that drive the cellular machinery. Therefore, a faithful reconstruction of the entire set of complexes from the physical interactions between proteins is essential to understand not only complex formation, but also the higher-level organization of the cell.

These physical interactions between proteins have been most extensively catalogued for yeast using high-throughput methods like yeast two-hybrid [1, 2] and direct purification of complexes using affinity tags followed by mass spectrometry (MS) analyses [3]. In 2002, the direct purification strategy or "pull-down" was first applied to yeast in two independent studies by Gavin et al.[4] and Ho et al.[5]. More recently (2006), two separate groups, Gavin et al.[6] and Krogan et al.[7], employed tandem affinity purification (TAP) followed by MS analyses to produce an enormous amount of new data, allowing a more complete mapping of the yeast interactome. Although these individual datasets are of high quality, they show a surprising lack of correlation with each other [8, 9], and some bias towards high-abundance proteins [10] and against proteins from certain cellular compartments (like cell wall and plasma membrane) [11]. Also, each dataset still contains a substantial number of false positives (noise) that can compromise the utility of these datasets for more focused studies like complex reconstruction [11]. In order to reduce the impact of such discrepancies, a number of data integration and affinity scoring schemes have been devised [6, 7, 11–17]. These affinity scores encode the reliabilities (confidence) of physical interactions between pairs of proteins. Therefore, the challenge now is to detect meaningful as well as novel complexes from protein interaction (PPI) networks derived by combining multiple high-throughput datasets and by making use of these affinity scoring schemes.

The interaction data produced from the high-throughput TAP/MS experiments comprise tagged "bait" proteins and the associated "prey" proteins that co-purify with the baits. Gavin et al.[6] considered direct bait-prey as well as indirect prey-prey relationships (a combination of spoke and matrix models), followed by a socio-affinity scoring system to encode the affinities between the protein pairs. The socio-affinity score quantifies the log-ratio of the number of times two proteins are observed together relative to what would be expected from their frequency in the dataset. Subsequently, Gavin et al. used an iterative clustering approach to derive complexes. Each complex was then partitioned into groups of proteins called "core", "attachment" or "module" (depicted in Additional files 1, Figure S1). On the other hand, Krogan et al.[7] used machine learning techniques (Bayesian networks and C4.5-based decision trees) to define confidence scores for interactions derived from direct bait-prey observations (the spoke model). Subsequently, Krogan et al. defined a high-confidence 'Core' dataset of interactions, and used the Markov Clustering algorithm (MCL) [18, 19] to derive complexes. Hart et al.[12] generated a Probabilistic Integrated Co-complex (PICO) network by integrating matrix modeled relationships of the Gavin et al., Krogan et al. and Ho et al. datasets using a measure similar to socio-affinity scores, and then used a MCL procedure to derive complexes from this network. Collins et al.[11] developed a Purification Enrichment (PE) scoring system to generate the 'Consolidated network' from the matrix modeled relationships of the Gavin et al. and Krogan et al. datasets. Collins et al. used a Bayes classifier to generate the PE scores in the Consolidated network by incorporating diverse evidence from hand-curated co-complexed protein pairs, Gene Ontology (GO) annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles.
This new network was shown to be of high quality - comparable to that of PPIs derived from small-scale experiments stored at the Munich Information Center for Protein Sequences (MIPS). Zhang et al.[13] used the Dice coefficient (DC) to assign affinities to protein pairs, and evaluated their affinity measure against the socio-affinity and PE measures. They concluded that DC and PE offered the best representation for protein affinity, and subsequently used them for complex prediction. Pu et al.[20] used MCL combined with cluster overlaps on the Consolidated network to reveal interesting insights into complex organization. Wang et al.[21] proposed HACO, a hierarchical clustering with overlap algorithm, to reconstruct complexes and used them to build the 'ComplexNet', an interaction network of proteins and complexes, in order to study the higher-level organization of complexes. Chua et al.[14] and Liu et al.[15] developed network topology-based scoring schemes called Functional Similarity Weight (FS Weight) and Iterative-Czekanowski-Dice (Iterative-CD), respectively, to assign reliability scores to the interactions in networks. Subsequently, Liu et al.[16] used a maximal clique merging strategy (called CMC) to derive complexes from networks scored using these two systems. Friedel et al.[17] developed a bootstrapped scoring system to score TAP/MS interactions from Gavin et al. and Krogan et al., and subsequently derived complexes using a variant of MCL. Friedel et al.[22] also developed a minimum spanning tree-based method to reconstruct the topology of complexes from co-purified proteins in TAP/MS assays. Voevodski et al.[23] used PageRank, a random walk-based method employed in context-sensitive web search, to define the affinities between proteins within PPI networks. Subsequently, Voevodski et al. used it to predict co-complexed proteins within the network.
Approaches like CORE [24] and COACH [25] adopted local dense neighborhood search to derive cores and attachments from unscored networks. Mitrofanova et al.[26] measured the connectivity between proteins in unweighted PPI networks by edge-disjoint paths instead of edges to overcome noise, and modeled these paths as a network flow and represented it in Gomory-Hu trees. They subsequently isolated groups of nodes in the trees that shared edge-disjoint paths in order to identify complexes. Very recently, Ozawa et al.[27] used domain-domain interactions to validate and refine the complexes predicted by MCL.

In this study, we develop an algorithm to derive yeast complexes from weighted (affinity-scored) PPI networks. Inspired by the experimental findings by Gavin et al.[6] on the modularity structure in yeast complexes, and the distinctive properties of "core" and "attachment" proteins, we develop a novel core-attachment based refinement method coupled to MCL for reconstruction of yeast complexes. We had proposed the idea of core-attachment based refinement in a preliminary work [28] and called it MCL-CA.

However, MCL-CA worked only on unscored networks. Here, we devise an improved algorithm (called MCL-CAw) that extends it naturally to scored (weighted) PPI networks. Even though most eukaryotic complexes are hypothesized to display such core-attachment modularity, we design our algorithm specifically for yeast complexes because of the lack of sufficient evidence, high-throughput datasets and reference complexes from other organisms. We combine TAP/MS physical datasets from Gavin et al.[6] and Krogan et al.[7] to generate an unscored PPI network (Table 1). We then score this network using two topology-based affinity scoring schemes, FS Weight [14] and Iterative-CD [15], to generate scored PPI networks. We gather two additional readily available scored PPI networks from Collins et al.[11] and Friedel et al.[17]. The evaluation of MCL-CAw on these networks demonstrates that: (a) MCL-CAw derives a higher number of yeast complexes, and with better accuracies, than MCL; (b) affinity scoring effectively reduces the impact of noise on MCL-CAw and thereby improves the quality (precision and recall) of its predicted complexes; (c) MCL-CAw responds well to most available affinity scoring schemes for PPI networks. We compare MCL-CAw with several recent complex detection algorithms on both unscored and scored PPI networks. Finally, we perform an in-depth analysis of the complexes predicted by MCL-CAw.

Table 1 Properties of the PPI networks used for the evaluation of MCL-CAw

Methods

The MCL-CAw algorithm: Identifying complexes embedded in the interaction network

Our MCL-CAw algorithm broadly consists of two phases. In the first phase, we partition the PPI network into multiple dense clusters using MCL. Following this (in the second phase), we post-process (refine) these clusters to obtain meaningful complexes. The MCL-CAw algorithm consists of the following steps:

  1. Clustering the PPI network using MCL hierarchically

  2. Categorizing proteins as cores within clusters

  3. Filtering noisy clusters

  4. Recruiting proteins as attachments into clusters

  5. Extracting out complexes from clusters

  6. Ranking the predicted complexes

We use the following notations while describing our algorithm. The PPI network is represented as G = (V, E), where V is the set of proteins, and E is the set of interactions between these proteins. For each e = (p, q) ∈ E, there is a confidence score (weight) w(p, q) encoding the affinity between the proteins p and q. These affinity scores depend on the scoring system used.
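This notation maps directly onto a simple adjacency-map representation. A minimal sketch in Python (the protein identifiers and scores below are purely illustrative, not taken from the datasets used in this study):

```python
# A weighted PPI network G = (V, E) stored as a symmetric adjacency map:
# adj[p][q] holds the affinity score w(p, q) of the interaction (p, q).
from collections import defaultdict

def build_network(scored_edges):
    """Build a symmetric adjacency map from (p, q, weight) triples."""
    adj = defaultdict(dict)
    for p, q, score in scored_edges:
        adj[p][q] = score
        adj[q][p] = score  # interactions are undirected
    return adj

def w(adj, p, q):
    """Affinity score w(p, q); 0 if the proteins do not interact."""
    return adj[p].get(q, 0.0)

# Illustrative example with two scored interactions
network = build_network([("YPL235W", "YDR190C", 0.9),
                         ("YDR190C", "YJR065C", 0.4)])
```

Later steps of the algorithm only need this map and the accessor w(p, q).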

Clustering the PPI network using MCL hierarchically

The first step of our algorithm is to partition (cluster) the PPI network using MCL [18], which simulates random walks (called a flow) to identify relatively dense regions in the network. The inflation coefficient parameter I in MCL is used to regulate the granularity of the clusters - the higher the value, the finer the generated clusters (how to choose I in practice is discussed in the "Results" section). MCL tends to produce several large clusters (sizes ≥ 30) that amalgamate smaller clusters [7, 20]. On the other hand, the size distributions of hand-curated complexes from the Wodak lab [29], MIPS [30] and Aloy et al.[31] (Table 2) reveal that most complexes are of sizes less than 10. Therefore, we perform hierarchical clustering by iteratively selecting all clusters of sizes at least 30 and re-clustering them using MCL.

Table 2 Properties of hand-curated yeast complexes from Wodak lab [29], MIPS [30] and Aloy [31]

After iterative rounds of MCL-based hierarchical clustering on the protein network G = (V, E), we obtain a collection of k disjoint (non-overlapping) clusters {C i : C i = (V i , E i ), 1 ≤ i ≤ k}, where V i ⊆ V and E i ⊆ E.
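The iterative re-clustering loop described above can be sketched as follows; here `cluster_fn` is a stand-in for a run of MCL on the subnetwork induced by an oversized cluster, not the authors' implementation:

```python
# Iteratively re-cluster oversized clusters: any cluster with at least
# `max_size` proteins is fed back to the clustering routine until every
# cluster falls below the size threshold.

def hierarchical_clustering(proteins, cluster_fn, max_size=30):
    final, pending = [], [list(proteins)]
    while pending:
        cluster = pending.pop()
        if len(cluster) < max_size:
            final.append(cluster)
            continue
        sub = cluster_fn(cluster)
        if len(sub) <= 1:          # could not be split further; accept as-is
            final.append(cluster)
        else:
            pending.extend(sub)
    return final
```

The guard against a non-splitting `cluster_fn` ensures the loop terminates even when MCL returns a cluster unchanged.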

Categorizing proteins as cores within clusters

Microarray analysis by Gavin et al.[6] of their predicted complex components showed that a large percentage of pairs of proteins within cores were co-expressed at the same time during the cell cycle and sporulation, consistent with the view that cores represent main functional units within complexes. Three-dimensional structural and yeast two-hybrid analysis showed that the core components were most likely to be in direct physical contact with each other. To reflect these findings in our post-processing steps, we expect:

  • Every complex we predict to comprise a non-empty set of core proteins; and

  • The proteins within these cores to display a relatively high degree of physical interactivity among themselves.

We identify the core proteins within a cluster in two stages: we first identify the set of preliminary cores and subsequently extend this to form the final set of cores. We categorize a protein p∈V i to be a 'preliminary core' protein in cluster C i = (V i , E i ), given by p ∈ PCore(C i ), if:

  • The weighted in-connectivity of p with respect to C i is at least the average weighted in-connectivity of C i , given by: d in (p, C i ) ≥ d avg (C i ); and

  • The weighted in-connectivity of p with respect to C i is greater than the weighted out-connectivity of p with respect to C i , given by: d in (p, C i ) > d out (p, C i ).

The weighted in-connectivity d in (p, C i ) of p with respect to C i is the total weight (score) of interactions p has with proteins within C i . Similarly, the weighted out-connectivity d out (p, C i ) of p with respect to C i is the total weight of interactions p has with proteins outside C i . These are given by d in (p, C i ) = ∑ {w(p, q): q ∈ V i } and d out (p, C i ) = ∑ {w(p, q): q ∉ V i }, respectively. The average weighted in-connectivity d avg (C i ) of cluster C i is the average of the weighted in-connectivities of all proteins within C i , given by d avg (C i ) = (1/|C i |) · ∑ {d in (q, C i ): q ∈ V i }.
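A minimal sketch of the preliminary-core criteria, assuming the network is stored as a symmetric adjacency map of affinity scores (a sketch of the definitions, not the authors' code):

```python
def d_in(adj, p, members):
    """Total weight of p's interactions with proteins inside the cluster."""
    return sum(score for q, score in adj.get(p, {}).items() if q in members)

def d_out(adj, p, members):
    """Total weight of p's interactions with proteins outside the cluster."""
    return sum(score for q, score in adj.get(p, {}).items() if q not in members)

def preliminary_cores(adj, cluster):
    """PCore(C_i): proteins whose weighted in-connectivity is at least the
    cluster average and exceeds their weighted out-connectivity."""
    members = set(cluster)
    d_avg = sum(d_in(adj, q, members) for q in members) / len(members)
    return {p for p in members
            if d_in(adj, p, members) >= d_avg
            and d_in(adj, p, members) > d_out(adj, p, members)}
```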

We use these preliminary cores to find the 'extended core' proteins. We categorize a protein p ∉ PCore(C i ) to be an extended core protein in cluster C i , given by p ∈ ECore(C i ), if:

  • The weighted in-connectivity of p with respect to PCore(C i ) is at least the average of the weighted in-connectivities of all non-cores r ∉ PCore (C i ) to the preliminary cores, given by: d in (p, PCore (C i )) ≥ d avg (r, PCore (C i )); and

  • The weighted in-connectivity of p with respect to PCore(C i ) is greater than the weighted out-connectivity of p with respect to PCore(C i ), given by: d in (p, PCore(C i )) > d out (p, PCore(C i )).

Here, d in (p, PCore(C i )) is the total weight of interactions p has with the preliminary cores of C i , given by: d in (p, PCore(C i )) = ∑ {w(p, q): q ∈ PCore(C i )}. Similarly, d out (p, PCore(C i )) is the total weight of interactions p has with all the non-core proteins within C i , given by:

d out (p, PCore(C i )) = ∑ {w(p, r): r ∈ V i and r ∉ PCore(C i )}. Finally, d avg (r, PCore(C i )) is the average weight of interactions of all non-cores r with the preliminary cores, given by:

d avg (r, PCore(C i )) = (1/(|C i | − |PCore(C i )|)) · ∑ {d in (r, PCore(C i )): r ∈ V i and r ∉ PCore(C i )}.

Combining the preliminary and extended core proteins, we form the final set of core proteins of cluster C i , given by:

Core(C i ) = PCore(C i ) ∪ ECore(C i ).
(1)
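Sticking with the adjacency-map convention, the extended-core step can be sketched as below; here the out-connectivity of p is taken as its total weight of interactions with the non-core proteins of the cluster (a sketch, not the authors' code):

```python
def extended_cores(adj, cluster, pcore):
    """ECore(C_i): non-cores pulled in by their connectivity to PCore(C_i)."""
    members, pcore = set(cluster), set(pcore)
    non_cores = members - pcore
    if not non_cores or not pcore:
        return set()

    def to_core(p):       # total weight of p's interactions with PCore(C_i)
        return sum(s for q, s in adj.get(p, {}).items() if q in pcore)

    def to_non_core(p):   # total weight of p's interactions with non-cores of C_i
        return sum(s for q, s in adj.get(p, {}).items() if q in non_cores)

    avg = sum(to_core(r) for r in non_cores) / len(non_cores)
    return {p for p in non_cores
            if to_core(p) >= avg and to_core(p) > to_non_core(p)}
```

The final core set of Eq. (1) is then simply the union of the preliminary and extended cores.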

Filtering noisy clusters

Consistent with the assumption that every complex comprises a set of core proteins, we consider a cluster as noisy if it does not include any core protein as per our above criteria. We discard all such noisy clusters.

Recruiting proteins as attachments into clusters

Microarray analysis by Gavin et al.[6] of their predicted complex components showed that attachment proteins were closely associated with core proteins within complexes and yet showed a greater degree of heterogeneity in expression levels, supporting the notion that attachments might represent non-stoichiometric components. Also, attachment proteins were seen to be shared between two or more complexes, consistent with the view that the same protein may participate in multiple complexes [20, 21]. On the other hand, the application of MCL to PPI networks yields clusters that do not share proteins (non-overlapping clusters). Mapping these clusters back to the original PPI network shows that proteins having similar connectivities to multiple clusters are assigned arbitrarily to only one of the clusters. These proteins could equally well be assigned to multiple clusters. To reflect these findings in our algorithm, we expect the attachment proteins to be those proteins within complexes that are:

  • Non-core proteins;

  • Closely interacting with the core proteins; and

  • May be shared across multiple complexes.

We consider the following criteria to assign a non-core protein p belonging to a cluster C j (called donor cluster) as an attachment in an acceptor cluster C i (the donor and acceptor clusters may be the same), that is, p ∈ Attach(C i ):

  • Protein p has sufficiently strong interactions with the core proteins Core(C i ) of the cluster C i ;

  • The stronger the interactions among the core proteins, the stronger have to be the interactions of p with the core proteins;

  • For large core sets, strong interactions are required with only some of the core proteins or, alternatively, weaker interactions with most of them.

Combining these criteria, we assign non-core p as an attachment in the acceptor cluster C i , that is p ∈ Attach(C i ), if:

I p ≥ α · I c · (S c /2)^−γ ,
(2)

where I p = I(p, Core(C i )) is the total weight of interactions of p with Core(C i ), given by I(p, Core(C i )) = ∑ {w(p, q): q ∈ Core(C i )}, while I c = I(Core(C i )) is the total weight of interactions among the core proteins of C i , given by I(Core(C i )) = (1/2) · ∑ {w(q, r): q, r ∈ Core(C i )}, and S c = |Core(C i )|; the factor (S c /2) is normalized to yield 1 for core sets of size two. The parameters α and γ are used to control the effects of I(Core(C i )) and |Core(C i )|. For a simple illustration, let α = 0.5 and γ = 1, and consider all interactions to be of equal weight 1. Then p is attached to a core set of four proteins if the total weight of its interactions with the core proteins is at least 3, which is possible if p is connected to at least three core proteins (how to choose values for α and γ in practice is discussed in the "Results" section). This step ensures that non-core proteins having sufficiently strong interactions with the cores of more than one cluster are recruited as attachments into all those clusters.
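Reading Eq. (2) with the size factor (S c /2) raised to the power −γ, so that the factor equals 1 for cores of size two, a hypothetical check for the attachment criterion could look like this (a sketch under that reading, not the authors' code):

```python
def is_attachment(adj, p, core, alpha=1.0, gamma=0.75):
    """Check I_p >= alpha * I_c * (S_c / 2) ** (-gamma) for a non-core p.

    I_p: total weight of p's interactions with the core proteins.
    I_c: total weight of interactions among the core proteins (each
         unordered pair counted once via the 1/2 factor).
    """
    core = set(core)
    i_p = sum(s for q, s in adj.get(p, {}).items() if q in core)
    i_c = 0.5 * sum(s for q in core
                    for r, s in adj.get(q, {}).items() if r in core)
    return i_p >= alpha * i_c * (len(core) / 2.0) ** (-gamma)
```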

Extracting out complexes from clusters

For each cluster we group together its constituent core and attachment proteins to define a unique complex. We expect all the remaining proteins within the cluster to have weaker associations with this resultant complex, and therefore categorize them as noisy proteins. In fact, experiments [28] have shown that MCL clusters tend to include several such noisy proteins, leading to a reduction in the accuracies of the clusters. Therefore, our step ensures that such noisy proteins are discarded in order to extract out more accurate complexes. Additionally, since these resulting complexes include attachment proteins that may be recruited by multiple complexes, this step ensures that our predicted complexes adhere to the protein-sharing phenomenon observed in real complexes [6, 20, 21]. We discard all complexes of size less than 4 because many of these are false positives. It is difficult to predict small real complexes solely based on interaction (topological) information (also noted in [16, 24]).

For each cluster C i , we define a unique complex Cmplx(C i ) as:

Cmplx(C i ) = Core(C i ) ∪ Attach(C i ).
(3)

Each interaction (p, q) among the constituent proteins p and q within this complex carries the weight w(p, q) observed in the PPI network.

Ranking the predicted complexes

As a final step, we output our predicted complexes in a reasonably meaningful order of biological significance. For this, we rank our predicted complexes in decreasing order of their weighted densities. The weighted density WD(C′ i ) of a predicted complex C′ i is given by [16]:

WD(C′ i ) = ∑ {w(p, q): p, q ∈ C′ i } / (|C′ i | · (|C′ i | − 1)).
(4)

The unweighted density of a predicted complex is defined similarly, by setting the weights of all constituent interactions to 1. This blindly favors very small complexes, or complexes whose proteins have a large number of interactions, without considering the reliability of those interactions. The weighted density, on the other hand, accounts for the reliability (by means of affinity scores) of such interactions. If two complexes have the same unweighted density, the complex with the higher weighted density is ranked higher.
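A sketch of the weighted-density ranking of Eq. (4), again assuming the adjacency-map representation of the scored network:

```python
def weighted_density(adj, complex_members):
    """Weighted density of Eq. (4): summing w(p, q) over ordered pairs
    equals twice the sum over unordered pairs, divided by n * (n - 1)."""
    members = list(complex_members)
    n = len(members)
    total = sum(adj.get(p, {}).get(q, 0.0)
                for i, p in enumerate(members) for q in members[i + 1:])
    return 2.0 * total / (n * (n - 1))

def rank_complexes(adj, complexes):
    """Rank predicted complexes in decreasing order of weighted density."""
    return sorted(complexes, key=lambda c: weighted_density(adj, c),
                  reverse=True)
```

For a fully connected complex with unit weights the weighted density is 1, the maximum attainable.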

Results

Preparation of experimental data

We gathered high-confidence Gavin and Krogan-Core interactions deposited in BioGrid http://thebiogrid.org/[32] (version as of July 2009). These were assembled from a combination of bait-prey and prey-prey relationships (the spoke and matrix models) observed by Gavin et al.[6], and the bait-prey relationships (the spoke model) observed by Krogan et al.[7]. We combined these interactions to build the unscored Gavin+Krogan network (all edge-weights were set to 1). We then applied Iterative-CD [15, 16] and FS Weight [14] scoring (with k = 2 iterations, as recommended in [16]) on the Gavin+Krogan network, and selected all interactions with non-zero scores. This resulted in the ICD(Gavin+Krogan) and FSW(Gavin+Krogan) networks, respectively. In addition to these two scored networks, we downloaded the Consolidated3.19 network (with PE cut-off 3.19, recommended by Collins et al.[11]) from http://interactome-cmp.ucsf.edu/, and the Bootstrap0.094 network [17] (with BT cut-off 0.094) from http://www.bio.ifi.lmu.de/Complexes/ProCope/. The Consolidated network was derived from the matrix modeled relationships of the original Gavin and Krogan datasets using the PE system [11]. Therefore, this network comprised additional prey-prey interactions that were missed in the Krogan 'Core' dataset. The Bootstrap network was derived from the matrix modeled relationships using the bootstrapped scores [17]. Table 1 summarizes some properties of these networks.

The benchmark (reference) set of complexes was built from hand-curated complexes derived from three sources: 408 complexes of the Wodak lab CYC2008 catalogue [29], 313 complexes of MIPS [30], and 101 complexes curated by Aloy et al.[31]. The properties of these reference sets are shown in Table 2. We considered each of these reference sets independently for the evaluation of MCL-CAw. We did not merge them into one comprehensive list of complexes because the individual complex compositions are different across the three sources and some complexes may also get double-counted (because of different names used for the same complex). An alternative strategy was adopted by Wang et al.[21] by integrating the complexes from three sources (MIPS [30], SGD [33] and their own in-house curated complexes) using the Jaccard score: two complexes overlapping with a Jaccard score of at least 0.7 were merged together - the proteins to be included into the resultant complex were chosen based on a voting scheme.

To be accurate (as well as fair) while evaluating our method on these benchmark sets, we considered only the set of derivable benchmark complexes from each of the PPI networks: if a protein is not present in a PPI network, we remove it from the set of benchmark complexes. By repeated removals, if the size of a benchmark complex shrinks below 3, we remove the complex from our benchmark set to generate the final set of derivable benchmark complexes for each of the PPI networks.
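The derivable-benchmark filtering described above can be sketched as:

```python
def derivable_benchmarks(benchmarks, network_proteins, min_size=3):
    """Restrict each benchmark complex to the proteins present in the PPI
    network; drop any complex that shrinks below min_size members."""
    network_proteins = set(network_proteins)
    kept = []
    for complex_ in benchmarks:
        present = set(complex_) & network_proteins
        if len(present) >= min_size:
            kept.append(present)
    return kept
```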

In order to evaluate the biological coherence of our predicted complexes, we downloaded the list of cellular localizations (GO terms under "Cellular Component") of proteins from Gene Ontology (GO) [34]. We selected only the informative GO terms. A GO term is informative if at least 30 proteins are annotated with it and none of its descendant terms is annotated with at least 30 proteins [35]. The list of essential genes was obtained from the Saccharomyces Genome Deletion Project [36, 37]: http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html

Evaluation metrics for matching predicted and benchmark complexes

Let B = {B1,B2,...,B m } and C = {C1,C2,...,C n } be the sets of benchmark and predicted complexes, respectively. We use the Jaccard coefficient J to quantify the overlap between a benchmark complex B i and a predicted complex C j :

J(B i , C j ) = |B i ∩ C j | / |B i ∪ C j |.
(5)

We consider B i to be covered by C j if J(B i , C j ) ≥ overlap threshold t. In our experiments, we set the threshold t = 0.5, which requires |B i ∩ C j | ≥ (|B i | + |C j |)/3. For example, if |B i | = |C j | = 8, then the overlap between B i and C j should be at least 6.
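Eq. (5) and the t = 0.5 coverage test in code:

```python
def jaccard(b, c):
    """Jaccard coefficient J(B, C) = |B ∩ C| / |B ∪ C| of Eq. (5)."""
    b, c = set(b), set(c)
    return len(b & c) / len(b | c)

def covers(b, c, t=0.5):
    """B is covered by C when J(B, C) >= t; at t = 0.5 this is equivalent
    to requiring |B ∩ C| >= (|B| + |C|) / 3."""
    return jaccard(b, c) >= t
```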

We use previously reported [16] definitions of recall Rc (coverage) and precision Pr (sensitivity) of the set of predicted complexes:

Rc = |{B i ∈ B: ∃ C j ∈ C, J(B i , C j ) ≥ t}| / |B|
(6)

Here, |{B i ∈ B: ∃ C j ∈ C, J(B i , C j ) ≥ t}| gives the number of derived benchmarks.

Pr = |{C j ∈ C: ∃ B i ∈ B, J(B i , C j ) ≥ t}| / |C|
(7)

Here, |{C j ∈ C: ∃ B i ∈ B, J(B i , C j ) ≥ t}| gives the number of matched predictions.
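Eqs. (6) and (7) translate directly into code:

```python
def recall_precision(benchmarks, predictions, t=0.5):
    """Recall: fraction of benchmark complexes matched by some prediction.
    Precision: fraction of predictions matching some benchmark complex."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    derived = sum(1 for b in benchmarks
                  if any(jaccard(b, c) >= t for c in predictions))
    matched = sum(1 for c in predictions
                  if any(jaccard(b, c) >= t for b in benchmarks))
    return derived / len(benchmarks), matched / len(predictions)
```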

We also evaluate the performance of our method by plotting the precision versus recall curves for the predicted complexes. These curves are plotted by tuning a threshold on the number of predicted complexes considered for the evaluation. The predicted complexes are considered in decreasing order of their weighted densities (that is, in increasing order of their complex ranks).

Biological coherence of predicted complexes

A complex can be formed only if its proteins are localized within the same compartment of the cell. So, we use the localization coherence of the predicted complexes as a measure of their quality. Let L = {L1, L2,..., L k } be the set of known localization groups, where each L i contains a set of proteins with similar localization annotations. The co-localization score LS(C j ) of a predicted complex C j is defined as the maximal fraction of its constituent proteins that are co-localized within the same localization group, among the proteins that have annotations. This is given as follows [16]:

LS(C j ) = max {|C j ∩ L i |: i = 1, 2, …, k} / |{p: p ∈ C j ∧ ∃ L i ∈ L, p ∈ L i }|.
(8)

Therefore, the co-localization score LS(C) for the set of predicted complexes C is just the weighted average over all complexes [16]:

LS(C) = ∑ {max {|C j ∩ L i |: i = 1, 2, …, k}: C j ∈ C} / ∑ {|{p: p ∈ C j ∧ ∃ L i ∈ L, p ∈ L i }|: C j ∈ C}.
(9)
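Eq. (9) in code, assuming the localization groups are given as sets of annotated proteins:

```python
def localization_score(complexes, localization_groups):
    """Co-localization score of Eq. (9): the sum over complexes of the
    largest overlap with any localization group, divided by the total
    number of annotated proteins across the complexes."""
    annotated = set().union(*localization_groups)
    numerator = denominator = 0
    for c in complexes:
        c = set(c)
        numerator += max(len(c & group) for group in localization_groups)
        denominator += len(c & annotated)   # only annotated proteins count
    return numerator / denominator
```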

Setting the parameters I, α and γ for MCL-CAw

Before evaluating the performance of MCL-CAw, we describe the procedure used for setting the inflation parameter I for MCL, and α and γ for core-attachment refinement, in order to determine a good combination of parameters for MCL-CAw in practice. Only the predicted complexes of size ≥ 4 from MCL and MCL-CAw were considered for setting the parameters as well as for further experiments. We used F1 (the harmonic mean of precision and recall) measured against the Wodak lab [29], MIPS [30] and Aloy [31] benchmarks as our basis for choosing the best values for these parameters.

We adopted the following four-step procedure for each PPI network:

  1. Run MCL for a range of I values and choose the I that offers the best F1 measure;

  2. Set I to the chosen value, set a certain α for MCL-CAw, and choose the γ from a range of values that offers the best F1 measure;

  3. Set I and γ to the chosen values, and choose the α for MCL-CAw from a range of values that offers the best F1 measure;

  4. Set α and γ for MCL-CAw to the chosen values, and reconfirm the value chosen for I.
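The four-step procedure amounts to a coordinate-wise search, which can be sketched as follows; `f1_of` is a placeholder for a full clustering-plus-evaluation run against a benchmark, not the authors' code, and in step 1 the refinement parameters are held at nominal values to stand in for a run of plain MCL:

```python
def tune_parameters(f1_of, I_range, alpha_range, gamma_range, alpha0=1.0):
    """Coordinate-wise maximization of F1 over (I, alpha, gamma)."""
    I = max(I_range, key=lambda i: f1_of(i, alpha0, gamma_range[0]))   # step 1
    gamma = max(gamma_range, key=lambda g: f1_of(I, alpha0, g))        # step 2
    alpha = max(alpha_range, key=lambda a: f1_of(I, a, gamma))         # step 3
    I = max(I_range, key=lambda i: f1_of(i, alpha, gamma))             # step 4: reconfirm I
    return I, alpha, gamma
```

Like any coordinate search, this finds a good combination rather than a guaranteed global optimum, which is why the final step re-checks I.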

Setting I for MCL

Inflation I in MCL determines the granularity of the clustering - the higher the value, the finer the clusters produced. Typical values used for clustering PPI networks are I = 1.8 and 1.9 [16, 19, 38]. For each PPI network, we ran MCL over a range of I values, and measured F1 against the three benchmark sets. We then normalized these F1 values against the best F1 obtained on each benchmark, summed up these normalized F1 values across benchmarks, and finally normalized these sums to obtain a final ranking for the I values. The detailed calculations are presented in Additional files 1, Tables S1 and S2. In Figure 1, we show sample F1 versus I plots for the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for the range I = 1.25 to 3.0. We noticed that inflation I = 2.5 gave the best F1 on both unscored and scored networks. The F1 obtained at I = 1.8 and 1.9 was only marginally lower than that at I = 2.5.

Figure 1

Setting the inflation parameter I in MCL: F1 versus I plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. I = 2.5 gave the best F1 for both unscored and scored networks.

Setting α and γ for CA refinement

For each PPI network, we set I to the chosen value, fixed a certain α, and ran MCL-CAw over a range of γ. We adopted the same method as above to choose the value of γ offering the best F1 measure. Figure 2 shows sample F1 versus γ plots on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for I = 2.5, α = 1.00 and γ = 0.15 to 1.50. The detailed calculations are presented in Additional files 1, Table S3. We noticed that γ = 0.75 gave the best F1 on both unscored and scored networks.

Figure 2

Setting the parameter γ in core-attachment refinement: F1 versus γ plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. γ = 0.75 gave the best F1 for both unscored and scored networks (I = 2.50 and α = 1.00).

Next, we set I and γ to the chosen values, and ran MCL-CAw over a range of α. Figure 3 shows sample F1 versus α plots on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks for I = 2.5, γ = 0.75 and α = 0.50 to 1.75. The detailed calculations are presented in Additional file 1, Table S4. We noticed that α = 1.50 gave the best F1 on the unscored network, while α = 1.0 gave the best F1 on the scored networks.

Figure 3

Setting the parameter α in core-attachment refinement: F1 versus α plot. (a): Plot for the unscored Gavin+Krogan network; (b): Plot for the scored ICD(Gavin+Krogan) network. α = 1.50 gave the best F1 for the unscored network (I = 2.50 and γ = 0.75). α = 1.00 gave the best F1 for the scored networks (I = 2.50 and γ = 0.75).

Reconfirming I for the chosen values of α and γ

Finally, for each PPI network, we ran core-attachment refinement with the chosen values of α and γ over a range of I for MCL. Figure 4 compares the F1 versus I plots for plain MCL and MCL followed by CA refinement on the unscored Gavin+Krogan and scored ICD(Gavin+Krogan) networks over the range I = 1.25 to 3.0. The plots reconfirmed that the chosen values of α and γ gave the best performance for CA refinement when I = 2.5 (except on the Aloy benchmark, the smallest of the three, for which F1 was best at I = 1.75 and marginally lower at I = 2.5). The detailed calculations are presented in Additional file 1, Tables S5 and S6. We settled on I = 2.5, α = 1.50 and γ = 0.75 for the unscored Gavin+Krogan network, and I = 2.5, α = 1.0 and γ = 0.75 for the scored networks as our final combination of parameters for MCL-CAw.

Figure 4

Reconfirming inflation I for MCL-CAw. (a): Plot for the unscored Gavin+Krogan network with α = 1.50 and γ = 0.75. (b): Plot for the scored ICD(Gavin+Krogan) network with α = 1.00 and γ = 0.75. I = 2.5 gave the best F1 for these chosen values of α and γ (except on the Aloy benchmark, the smallest of the three, on which I = 1.75 gave marginally better F1).

Evaluating the performance of MCL-CAw

Figure 5 shows the workflow considered for the evaluation of MCL-CAw. The predicted complexes were tapped at two successive stages:

Figure 5

Workflow for evaluation of MCL-CAw. The predicted complexes of MCL-CAw were tapped at two stages: (i) Clustering using MCL; (ii) Hierarchical clustering followed by core-attachment refinement using MCL-CAw. These predicted complexes were evaluated by matching them to the set of benchmark complexes.

  1. After clustering using MCL;

  2. After hierarchical clustering followed by core-attachment refinement using MCL-CAw.

The effect of core-attachment refinement on the predictions of MCL

Compare the topmost rows in Table 3 for MCL and MCL-CAw evaluated on the unscored Gavin+Krogan network. They show that MCL-CAw achieved significantly higher recall than MCL on Gavin+Krogan, deriving on average 31% more complexes. In fact, referring back to Figure 4(a), MCL-CAw achieved higher F1 than MCL over the entire range I = 1.25 to 3.00. To analyse this improvement further, we considered two sets of complexes derived from Gavin+Krogan: (a) Set A = MCL ∩ MCL-CAw, consisting of all complexes correctly predicted by both MCL and MCL-CAw, but with different Jaccard accuracies; (b) Set B = MCL-CAw\MCL, consisting of all complexes correctly predicted by MCL-CAw but not by MCL. No complex correctly predicted by MCL was missed by MCL-CAw. We calculated the percentage increase in accuracies for complexes in A and B. The increase for A was noticeably high, averaging 7.53% on the Wodak set; the increase for B was significantly higher, averaging 62.26% on the Wodak set. This shows that: (a) CA refinement was successful in improving the accuracies of MCL clusters; (b) the improvement was particularly high for low-quality MCL clusters (that is, set B). MCL-CAw was successful in elevating accuracies above the threshold t = 0.5 for clusters that were difficult to match to known complexes using MCL alone. Consequently, MCL-CAw derived a significantly larger number of benchmark complexes than MCL.
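The matching criterion used throughout this evaluation, Jaccard accuracy against a benchmark complex with threshold t = 0.5, can be sketched as follows (a minimal sketch assuming set-based Jaccard overlap; function names are illustrative):

```python
def jaccard(predicted, benchmark):
    """Jaccard accuracy between a predicted and a benchmark complex,
    both treated as sets of proteins."""
    p, b = set(predicted), set(benchmark)
    return len(p & b) / len(p | b)

def correctly_predicted(predicted, benchmarks, t=0.5):
    """A prediction counts as correct if it matches some benchmark
    complex with Jaccard accuracy at least t."""
    return any(jaccard(predicted, b) >= t for b in benchmarks)
```

Under this criterion, raising a cluster's accuracy from just below 0.5 to just above it (as CA refinement does for the clusters in set B) turns a missed benchmark complex into a derived one.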

Table 3 (i) Impact of core-attachment refinement on MCL; (ii) Role of affinity scoring in reducing the impact of natural noise on MCL and MCL-CAw

Impact of noise on MCL and MCL-CAw and the role of affinity scoring in reducing this impact

Table 3 compares the evaluation metrics for MCL and MCL-CAw on the unscored Gavin+Krogan network against those on the four scored PPI networks. Both MCL and MCL-CAw showed considerable improvement in precision and recall on the scored networks. For example, on the four scored networks MCL achieved about 127% higher precision and 51.3% higher recall (on average), while MCL-CAw achieved about 132% higher precision and 26.6% higher recall (on average, on the Wodak lab benchmark), compared to the unscored Gavin+Krogan network. The precision versus recall curves (Figure 6) on Gavin+Krogan dropped sharply, while those for three of the scored networks, ICD(Gavin+Krogan), FSW(Gavin+Krogan) and Consolidated3.19, displayed a more "graceful" decline. The curve for Bootstrap0.094 dipped suddenly towards the beginning, but stabilized subsequently to achieve higher (final) precision and recall than on the unscored Gavin+Krogan network.
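A common convention for precision and recall in complex prediction, which we assume here for illustration (the paper's exact definitions are given in its Methods), is: precision is the fraction of predictions that match some benchmark complex, and recall is the fraction of benchmark complexes recovered by some prediction:

```python
def precision_recall(predicted, benchmark, match):
    """Precision and recall for complex prediction.

    match(p, b) -> bool decides whether prediction p recovers
    benchmark complex b (e.g. Jaccard accuracy >= 0.5).
    """
    # Predictions that recover at least one benchmark complex
    tp = sum(1 for p in predicted if any(match(p, b) for b in benchmark))
    # Benchmark complexes recovered by at least one prediction
    hit = sum(1 for b in benchmark if any(match(p, b) for p in predicted))
    prec = tp / len(predicted) if predicted else 0.0
    rec = hit / len(benchmark) if benchmark else 0.0
    return prec, rec
```

Sweeping the match threshold (or the number of top-ranked predictions considered) over a range yields the precision versus recall curves in Figure 6.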

Figure 6

Impact of affinity scoring on the performance of MCL and MCL-CAw. (a) and (b): Precision versus recall curves on Gavin+Krogan and the four scored networks (ICD(Gavin+Krogan), FSW(Gavin+Krogan), Consolidated3.19 and Bootstrap0.094) for MCL and MCL-CAw, respectively, evaluated on the Wodak benchmark with t = 0.5. Both methods showed significant improvement on the scored networks compared to the unscored Gavin+Krogan network.

Among the four scored PPI networks, both MCL and MCL-CAw showed the best precision and recall on the Consolidated3.19 network, which can be directly attributed to the high quality of this network. However, this high quality came at the expense of lower protein coverage (see Table 4; also noted in [20]), resulting in a reduced number of derivable complexes. To counter this, we gathered a larger subset of the Consolidated network with PE cut-off 0.623 (the average PE score), which accounted for higher protein coverage (Table 4). We noticed that the improvement of MCL-CAw over MCL was significantly higher on Consolidated0.623 than on Consolidated3.19. We also noticed that ICD scoring of Consolidated0.623 drastically reduced the size of this network, revealing that this larger subset in fact included a significant amount of false positives (noise). These experiments indicate that any reasonably good algorithm like MCL can perform well on high-quality networks. However, given the limited protein coverage and scarcity of such high-quality networks, we need to consider larger networks for complex detection (particularly to be able to detect novel complexes). This in turn exposes the algorithms to higher amounts of natural noise (even in scored networks). Therefore, there is a need for algorithms that can detect a larger number of complexes in the presence of such noise. In this scenario, our results show that MCL-CAw is able to derive a considerably larger number of complexes than MCL. Taking this further, we introduced different levels of random noise to study its impact on MCL and MCL-CAw: we added 10% to 75% random noise (2000 to 10000 random interactions) to the Gavin+Krogan network. We noticed that MCL-CAw performed better than MCL even upon introducing 50% random noise (Table 5). However, at 75% random noise, the performance of MCL-CAw dropped marginally below that of MCL.
MCL-CAw was therefore reasonably robust to random noise: it was stable in the range of 10% to 40% noise, which covers the typical levels of noise seen in TAP/MS datasets [9] (we say this keeping in mind that MCL has been shown to be robust even at 80% random noise [38]). We next scored these noisy networks using the ICD scheme, and found that the performance of both MCL and MCL-CAw improved considerably on the resulting scored networks. MCL-CAw performed considerably better than MCL even at 50% to 75% random noise (Table 5). Therefore, affinity scoring helped MCL-CAw maintain its performance gain over MCL.
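The noise-injection experiment above can be sketched as follows. This is an illustrative sketch, not the authors' exact procedure; the edge representation and seeding are our assumptions:

```python
import random

def add_random_noise(edges, proteins, fraction, seed=0):
    """Augment a PPI edge set with fraction * |edges| spurious interactions.

    edges: set of frozenset pairs (undirected interactions);
    proteins: list of node names to draw random endpoints from.
    Returns a new edge set with the random (noise) edges added.
    """
    rng = random.Random(seed)
    noisy = set(edges)
    target = len(noisy) + int(fraction * len(edges))
    while len(noisy) < target:
        # sample() picks two distinct proteins, so no self-loops;
        # the set discards duplicates of existing edges
        u, v = rng.sample(proteins, 2)
        noisy.add(frozenset((u, v)))
    return noisy
```

Scoring such a noisy network with a scheme like ICD would then down-weight the spurious edges, which is how affinity scoring restores the performance gain reported in Table 5.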

Table 4 MCL-CAw performed considerably better than MCL in the presence of natural noise
Table 5 (i) Impact of introducing different levels of artificial noise on MCL and MCL-CAw (ii) Role of affinity scoring in reducing the impact of noise

Biological coherence of predicted components

The co-localization scores for the various predicted components (cores and whole complexes) of MCL-CAw are shown in Table 6. The table shows that: (a) the predicted complexes of MCL-CAw showed higher co-localization scores than those of MCL on both the unscored and scored PPI networks; MCL included several noisy proteins in its predicted clusters, thereby reducing their biological coherence; (b) the predicted cores of MCL-CAw displayed higher scores than whole complexes, indicating that proteins within cores were highly co-localized; (c) the complexes of both MCL and MCL-CAw displayed higher scores on the four scored networks than on the Gavin+Krogan network.

Table 6 Co-localization scores for predicted components from MCL and MCL-CAw

Relative ranking of complex prediction algorithms and affinity-scored networks

In order to gauge the performance of MCL-CAw relative to existing techniques, we selected the following recent algorithms proposed for complex detection:

  • On the unscored Gavin+Krogan network, we compared against MCL [18, 19], our preliminary work MCL-CA (2009) [28], CORE by Leung et al. (2009) [24], COACH by Wu Min et al. (2009) [25], CMC by Liu et al. (2009) [16], and HACO by Wang et al. (2009) [21];

  • On the affinity-scored networks, we compared against MCL, MCL incorporated with cluster overlaps by Pu et al. (2007) [20] (our implementation of this, called MCLO), CMC and HACO.

Table 7 summarizes some of the properties of, and the parameters used for, these methods. We consider only complexes of size at least 4 from all algorithms throughout this evaluation. We dropped MCL-CA, CORE and COACH from the comparisons on the affinity-scored networks because these methods assume unweighted networks as input. Further, we do not show results for the older methods MCODE by Bader and Hogue (2003) [8] and RNSC by King et al. (2004) [39], and instead include MCL in all our comparisons, because MCL significantly outperforms these methods [16, 38].

Table 7 Existing complex detection methods selected for comparisons with MCL-CAw

Tables 8, 9, 10, 11 and 12 show detailed comparisons between complex detection algorithms on the unscored and scored networks. Figures 7 and 8 show the precision versus recall curves on these networks, while Table 13 shows the area-under-the-curve (AUC) values for these curves. Allowing for ± 5% error in AUC values, the table shows that CORE attained the highest AUC, followed by MCL-CAw and CMC, on the unscored network, while MCL-CAw and CMC achieved the overall highest AUC on the scored networks. In addition, on each network we ranked the algorithms based on their normalized final F1 measures (relative to the best-performing algorithm on that network), as shown in Table 14. We summed the normalized F1 values for each algorithm across all networks to obtain an overall ranking, as shown in Table 15. The detailed calculations are presented in Additional file 1, Table S7. On the unscored network CMC showed the best F1 value, while on the scored networks MCL-CAw showed the best overall F1 value. In particular, MCL-CAw performed best on the ICD(Gavin+Krogan), FSW(Gavin+Krogan) and Consolidated3.19 networks, while HACO performed best on Bootstrap0.094. This more or less agreed with the relative performance gathered from the AUC values (Table 13).
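AUC values for precision versus recall curves like those in Figures 7 and 8 can be computed from the curve's sample points by the trapezoidal rule. The paper does not specify its exact integration method, so the following is a minimal sketch under that assumption:

```python
def auc_trapezoid(recalls, precisions):
    """Area under a precision-versus-recall curve via the trapezoidal rule.

    Points are sorted by recall; adjacent points are joined linearly.
    """
    pts = sorted(zip(recalls, precisions))
    return sum((r2 - r1) * (p1 + p2) / 2.0
               for (r1, p1), (r2, p2) in zip(pts, pts[1:]))
```

A curve that declines "gracefully" (high precision sustained over a wide recall range) yields a larger area than one with a sudden early dip, which is why the Bootstrap0.094 curves score lower despite comparable endpoints.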

Table 8 Comparisons between the different methods on the unscored Gavin+Krogan network
Table 9 Comparisons between the different methods on the ICD(Gavin+Krogan) network
Table 10 Comparisons between the different methods on the FSW(Gavin+Krogan) network
Table 11 Comparisons between the different methods on the Consolidated3.19 network
Table 12 Comparisons between the different methods on the Bootstrap0.094 network
Figure 7

Comparison of complex detection algorithms on the unscored Gavin+Krogan network. (a): Precision versus recall curves and area-under-the-curve (AUC) values for complex detection algorithms on the unscored Gavin+Krogan network, evaluated on the Wodak reference with t = 0.5. AUC for MCL = 0.225, COACH = 0.169, CORE = 0.361, MCL-CAw = 0.323, CMC = 0.271, MCL-CA = 0.238, HACO = 0.136. (b): Number of predicted complexes, and proportions of true positives (correctly matched to benchmark(s)) and false positives (not matched to any benchmark) for the algorithms.

Figure 8

Comparison of complex detection algorithms on four scored networks. Precision versus recall curves and area-under-the-curve (AUC) values for complex detection algorithms evaluated on Wodak reference with t = 0.5. (a) ICD(Gavin+Krogan): AUC values MCL = 0.436, CMC = 0.494, MCL-CAw = 0.472, MCLO = 0.435, HACO = 0.305. (b) FSW(Gavin+Krogan): AUC values MCL = 0.431, CMC = 0.481, MCL-CAw = 0.487, MCLO = 0.430, HACO = 0.461. (c) Consolidated3.19: AUC values MCL = 0.469, CMC = 0.399, MCL-CAw = 0.488, MCLO = 0.463, HACO = 0.367. (d) Bootstrap0.094: AUC values MCL = 0.349, CMC = 0.513, MCL-CAw = 0.389, MCLO = 0.353, HACO = 0.317.

Table 13 Area under the curve (AUC) values of precision versus recall curves for complex detection methods on the unscored and scored PPI networks
Table 14 Relative ranking of complex detection algorithms on unscored and affinity-scored networks
Table 15 Overall relative ranking of complex detection algorithms on unscored and affinity-scored networks

The precision of MCL-CAw (0.397) was lower on Bootstrap0.094 than on the other scored networks (ICD: 0.620, FSW: 0.615, Consolidated3.19: 0.672). MCL-CAw produced many redundant complexes from this network compared to the other scored networks, leading to the drop in precision. In fact, we observed such variance in the CMC and HACO algorithms as well. For example, CMC achieved its best recall on the ICD network, but its lowest on the Consolidated network. Also, CMC produced significantly fewer complexes (77) on the Consolidated network than on the other networks (ICD: 171, FSW: 179, Bootstrap: 203). Further, all algorithms displayed "sudden dips" towards the beginning of their precision versus recall curves on the Bootstrap0.094 network (see Figure 8). All these findings indicate that the choice of affinity scoring scheme affected the performance of the algorithms. In other words, each algorithm made use of certain characteristics of the PPI networks, and favored a scoring scheme that magnified or reinforced those characteristics. No single algorithm performed best on all the scored networks. That said, MCL-CAw ranked among the top three algorithms on all scored networks, and therefore responded reasonably well to the considered affinity scoring schemes.

We also ranked the different affinity-scored networks based on the F1 measures they offered to the complex detection algorithms, as shown in Tables 16 and 17. These tables show that the Consolidated3.19 network offered the best F1 measures, followed by the FSW(Gavin+Krogan), ICD(Gavin+Krogan) and Bootstrap0.094 networks (the detailed calculations are presented in Additional file 2, Table S8). This agreed well with the fact that the Consolidated3.19 network was shown to have a TP/FP ratio comparable to that of small-scale experiments from MIPS, and was therefore of very high quality [11].

Table 16 Relative ranking of affinity scoring schemes for complex detection
Table 17 Overall relative ranking of affinity scoring schemes for complex detection

Impact of augmenting physical PPI networks with computationally inferred interactions

In this set of experiments, we studied whether augmenting the physical PPI networks with inferred interactions improved the performance of complex detection algorithms. We gathered interactions in yeast comprising inferred interologs (inferred from interactions between orthologous proteins in organisms such as fly, mouse and human), as well as interactions based on genetic (gene fusion, chromosomal proximity, gene co-evolution) and functional (traits of neighbors, neighbors of neighbors, etc.) associations, downloaded from the Predictome database [40] (http://cagt.bu.edu/page/Predictome_about). These were used to generate the Inferred network (Table 1). We then augmented the Gavin+Krogan network with these interactions to generate the Gavin+Krogan+Inferred network and its scored versions, the ICD(Gavin+Krogan+Inferred) and FSW(Gavin+Krogan+Inferred) networks (Table 1).

We evaluated MCL, MCL-CAw, CMC and HACO on these augmented networks (Table 18). All the algorithms displayed very low precision and recall on the Inferred network, indicating that the inferred interactions alone were not sufficient to predict meaningful complexes. Interestingly, most algorithms displayed a marginal dip in performance on Gavin+Krogan+Inferred compared to Gavin+Krogan. This dip was explained by the analysis of the two augmented-scored networks, ICD(Gavin+Krogan+Inferred) and FSW(Gavin+Krogan+Inferred): most algorithms showed higher precision and recall on these two networks than on Gavin+Krogan and Gavin+Krogan+Inferred. This indicates that augmenting with raw inferred interactions gave little benefit due to the presence of false positives (noise), but scoring the augmented networks helped improve the precision and recall of the algorithms.

Table 18 Impact of augmenting inferred interactions on the performance of MCL, MCL-CAw, CMC and HACO

In-depth analysis of individual predicted complexes

To facilitate the analysis of our individual predicted complexes, we mapped the complexes back to the corresponding PPI networks and examined the interactions between components of the same complex, as well as between components of a given complex and other proteins in the network. We performed this analysis using the Cytoscape visualization environment http://www.cytoscape.org/[41].

Instances of correctly predicted complexes of MCL-CAw

The first example is of an attachment protein shared between two predicted complexes of MCL-CAw. The subunits of these predicted complexes (Id# 57 and 22) make up the Compass complex involved in telomeric silencing of gene expression [42], and the mRNA cleavage and polyadenylation specificity factor, a complex involved in RNAP II transcription termination [43]. The shared attachment Swd2 (Ykl018w) formed high-confidence connections with the subunits of both predicted complexes. On this basis, the post-processing procedure assigned Swd2 (Ykl018w) to both predicted complexes, in agreement with available evidence [44] that Swd2 (Ykl018w) belongs to both the Compass and mRNA cleavage complexes. The next example illustrates a case where a new protein was predicted as a subunit of a known complex. The attachment protein Ski7 (Yor076c) was included in a predicted complex (Id# 28) that matched the Exosome complex involved in RNA processing and degradation [45]. Additionally, Ski7 (Yor076c) was also included in a prediction (Id# 105) matching the Ski complex (Additional file 1, Figure S2). However, the Ski complex in the Wodak lab catalogue [29] did not include this new protein. A further literature survey suggested that Ski7 acts as a mediator between the Ski and Exosome complexes for 3'-to-5' mRNA decay in yeast [46].

The RNA polymerase I, II, and III complexes (also called Pol I, II, and III, respectively) are required for the generation of RNA chains [47]. As per the Wodak lab catalogue [29], all the three complexes share subunits: Yor224c, Ybr154c, Yor210w and Ypr187w, while Pol I and Pol III share Ynl113w and Ypr110c. Due to the extensive sharing of subunits, the corresponding predictions were grouped together into one large cluster by MCL. On the other hand, MCL-CAw segregated the large cluster into three independent complexes, which matched the Pol I, Pol II and Pol III complexes with accuracies of 0.714, 0.734 and 0.824, respectively.

In addition to these cases, a good fraction of the already known core-attachment structures (reported in the supplementary materials of Gavin et al.[6]) were confirmed, and putative complexes were identified (preparation of a compendium is currently in progress). Some examples are worth quoting here. Our predicted complex Id# 44 closely matched the HOPS complex. All five core proteins {Ylr148w, Ylr396c, Ymr231w, Ypl045w, Yal002w} and both attachments {Ydr080w, Ydl077c} covered by our prediction matched those reported in Gavin et al. Biological experiments show that the cores carry out vacuole protein sorting, and that, with the help of the attachments, the complex can perform homotypic vacuole fusion [48]. We identified the ubiquitin ligase ERAD-L complex comprising Yos9 (Ydr057w), Hrd3 (Ylr207w), Usa1 (Yml029w) and Hrd1 (Yol013c), which is involved in the degradation of ER proteins [49]. This matched the Hrd1/Hrd3 complex purified by Gavin et al. Four subunits {Oca4, Oca5, Siw14, Oca1} of a predicted novel complex (Id# 66) showed high similarity in function (oxidant-induced cell-cycle arrest) and localization (cytoplasmic) when verified in SGD [33]. This complex exactly matched the putative complex 490 in Gavin et al.

Instances depicting mistakes in the predictions of MCL-CAw

Here we discuss an interesting case in which the sharing of subunits was so extensive, and the web of interactions so dense, that separating out the smaller subsumed complexes purely on the basis of interaction information was much harder: the amalgamation of the clusters matching the SAGA, SAGA-like (SLIK), ADA and TFIID complexes. Based on the Wodak lab catalogue [29], the 20 subunits making up the SAGA complex involved in transcriptional regulation [50] include four subunits (Ygr252w, Ydr176w, Ydr448w, Ypl254w) that are also members of the ADA complex [51]. Sixteen components of the SAGA complex, including the four shared with the ADA complex, are also components of the SLIK complex [52]. Additionally, five subunits (Ybr198c, Ygl112c, Ymr236w, Ydr167w, Ydr145w) of the SAGA complex also belong to the TFIID complex [50]. Because of such extensive sharing of subunits involved in a dense web of interactions (436 interactions among 31 constituent proteins, as seen in the ICD(Gavin+Krogan) network), MCL-CAw was able to segregate out only two distinct complexes: SAGA (0.708) and SLIK (0.625). The clusters matching TFIID and ADA remained amalgamated. In the next set of analyses, we compared the complexes derived from the Gavin+Krogan and ICD(Gavin+Krogan) networks, and identified cases where MCL-CAw had missed a few proteins or whole complexes due to affinity scoring. From the Wodak, MIPS and Aloy reference sets, there were 13, 18 and 16 complexes, respectively, that were derived with better accuracies from the Gavin+Krogan network than from the ICD(Gavin+Krogan) network; and there were 6, 2 and 2 complexes, respectively, that were derived from the Gavin+Krogan network but missed entirely from the ICD(Gavin+Krogan) network. Table 19 shows a sample of such complexes from the Wodak reference set.
For the complexes that were derived with lower accuracies (upper half of Table 19), MCL-CAw had missed a few proteins due to the low scores assigned to the corresponding interactions. For example, in the predicted complex from the ICD(Gavin+Krogan) network matching the SWI/SNF complex, two (Ymr033w and Ypr034w) of the four missed proteins were absent due to their weak connections with the rest of the members; instead, these proteins were present in the prediction matching the RSC complex. In the Gavin+Krogan network, these two proteins were shared between the two complexes matching the SWI/SNF and RSC complexes, which also agreed with the Wodak catalogue [29].

Table 19 Complexes derived with lesser accuracy or missed by MCL-CAw due to affinity scoring

In the cases where MCL-CAw had completely missed complexes from the scored network (lower half of Table 19), it is interesting to note that MCL-CAw had pulled in many additional (noisy) proteins as attachments, which caused the accuracies to drop below 0.5. One such case is the predicted complex Id# 36 matching the eIF3 complex with a low Jaccard score of 0.4. The eIF3 complex from the Wodak lab consisted of 7 proteins: Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Ymr012w and Ymr146c. The predicted complex Id# 66 from the Gavin+Krogan network consisted of 8 proteins (Figure 9): 5 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c) and 3 attachments (Yor096w, Yal035w, Ydr091c). Therefore, there were 2 missed and 3 additional proteins in the prediction, leading to an accuracy of 0.5. The predicted complex Id# 36 from the ICD(Gavin+Krogan) network consisted of 14 proteins: 6 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Yor096w) and 8 attachments (Yal035w, Ydr091c, Yjl190c, Yml063w, Ymr146c, Ynl244c, Yor204w, Ypr041w). Therefore, there were 1 missed and 8 additional proteins, leading to an even lower accuracy of 0.4. All the core proteins had the same or similar GO annotations (involvement in translation, localized in the cytoplasm or ribosomal subunit) [34]. Upon analysing the GO annotations of the 8 attachment proteins, we noticed that only one (Ymr146c) had the same annotation as the core proteins; this protein was also part of the eIF3 complex from the Wodak lab [29]. Of the remaining 7 attachment proteins, five (Ypr041w, Ynl244c, Yml063w, Yjl190c, Ydr091c) had GO annotations (translation initiation, GTPase activity, cytoplasmic, ribosomal subunit) related to those of the core proteins. A literature search revealed that these proteins belonged to the multi-eIF initiation factor conglomerate (containing eIF1, eIF2, eIF3 and eIF5) and the 40S ribosomal subunit involved in translation [29].
The remaining two (Yal035w, Yor204w) were involved in translation activity, but were absent from the Wodak lab catalogue. These might be potentially new proteins belonging to the eIF3 or related complexes, and need to be investigated further. We also analysed the GO annotations of the level-1 neighbors of the predicted complex in the network; none of them had annotations similar to those of the proteins within the prediction. This example illustrates that carefully incorporating GO information into our algorithm to include or filter out proteins can be useful in cases where making decisions solely on the basis of interaction information is difficult.

Figure 9

Example of a complex missed by MCL-CAw from the ICD(Gavin+Krogan) network, but found from the Gavin+Krogan network. The eIF3 complex from Wodak lab consisted of 7 proteins: Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Ymr012w and Ymr146c. The predicted complex id#36 from the ICD(Gavin+Krogan) network consisted of 14 proteins: 6 cores (Yor361c, Ylr192c, Ybr079c, Ymr309c, Ydr429c, Yor096w) and 8 attachments (Yal035w, Ydr091c, Yjl190c, Yml063w, Ymr146c, Ynl244c, Yor204w, Ypr041w). Therefore, there were 1 missed and 8 additional proteins in the prediction, leading to a low accuracy of 0.4. Hexagonal (Orange): eIF3 complex from Wodak lab. Circle (Orange, Yellow and Pink): Predicted complex id#36. Rectangle (Turquoise): Level-1 neighbors to the predicted complex id#36.

Correlation between essentiality of proteins and their ability to form complexes

Early works by Jeong et al.[53] and Han et al.[54] studied the essentiality of proteins based on pairwise interactions within the interaction network, and concluded that hub (high-degree) proteins are more likely to be essential. This formed one of the criteria within the "centrality-lethality" rule [53]. However, deeper insight can be obtained by studying essentiality at the level of clusters or groups of proteins rather than pairwise interactions. Recently, Zotenko et al.[55] argued that essential proteins often group together into densely connected sets of proteins performing essential functions, and thereby become involved in a higher number of interactions, resulting in their hubness. Therefore, hubness may be just an indirect indicator of protein essentiality. More recently, Kang et al.[56] studied the essentiality of proteins by generating the reverse nearest neighbor (RNN) topology [57] from protein networks. This topology groups together those proteins that are within the reverse neighborhood of a given protein. Kang et al. concluded that centrality within the RNN topology is a better estimator of essentiality than hubness or degree in the interaction network. Studies by Hart et al.[12] showed that essential proteins are concentrated only in certain complexes, resulting in a dichotomy of essential and non-essential complexes. Wang et al.[21] concluded that the size of the (largest) recruiting complex of a protein may be a better indicator of protein essentiality than hubness.

In our work, we attempt to understand the relationship between the essentiality of proteins and their ability to form complexes. Table 20 shows that a high proportion (77.65%, 78.03%, 81.34% and 76.35% from the ICD(Gavin+Krogan), FSW(Gavin+Krogan), Consolidated3.19 and Bootstrap0.094 networks, respectively) of the essential proteins present in the four affinity-scored networks belonged to at least one correctly predicted complex. This indicated that essential proteins are often members of complexes or co-clustered groups of proteins.

Table 20 Essential genes in the predicted complexes of MCL-CAw

To further analyse this ability of essential proteins to form complexes or groups, we binned our correctly predicted complexes by size and calculated the proportion of essential proteins over all complexes in each bin (as in [21]). Figure 10(a) shows that essential proteins were present in higher proportions within larger complexes. We then calculated the proportion of essential proteins within the top K ranked complexes. Figure 10(b) shows that essential proteins were present in higher proportions within higher-ranked complexes. Both figures hint at the same finding: essential proteins come together in large groups to perform essential functions.
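The binning analysis above can be sketched as follows (a minimal illustrative sketch; the bin edges and names are ours, not the paper's):

```python
def essential_fraction_by_size(complexes, essential, bins):
    """For each size bin (lo, hi], compute the proportion of essential
    proteins among all member proteins of complexes whose size falls
    in that bin.

    complexes: iterable of sets of proteins; essential: set of
    essential proteins; bins: list of (lo, hi] size intervals.
    """
    out = {}
    for lo, hi in bins:
        # Pool members of every complex whose size falls in this bin
        members = [p for c in complexes if lo < len(c) <= hi for p in c]
        if members:
            out[(lo, hi)] = sum(p in essential for p in members) / len(members)
    return out
```

Plotting these per-bin fractions against bin size reproduces the kind of trend shown in Figure 10(a).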

Figure 10

Correlation between essentiality of proteins and their abilities to form complexes. (a): Proportion of essential proteins within complexes of different sizes, predicted from ICD(Gavin+Krogan). Proportion of essential proteins in a complex = #essential proteins/total #proteins in the complex. (b): Proportion of essential proteins within top K ranked complexes.

Discussion

In spite of the advances in computational approaches to derive complexes, high-accuracy reconstruction of complexes has still remained a challenging task. In deriving protein complexes from PPI networks, a key assumption made by most computational approaches is that complexes form densely connected regions within the networks. Therefore, these approaches attempt to cluster the networks based on measures related to connectivities between proteins in the network. Some approaches like MCL simulate random walks (called flow) to identify dense regions, while others like CMC merge maximal cliques into larger dense clusters. Therefore, the performance of these methods varies widely depending on network densities. A glance through Tables 8 to 12 reveals that all the methods considered for comparison in this work achieve very low recall on the MIPS set compared to the Wodak and Aloy sets. Table 2 shows that the average density of complexes in MIPS is much lower than that of Wodak and Aloy sets. Only 52 out of 137 (37.95%) derivable MIPS complexes of size ≥ 5 could be detected from the Gavin+Krogan network by all methods put together. We analysed the remaining 85 MIPS complexes and found most of them to have very low densities (average about 0.217) in the Gavin+Krogan network. For example, the MIPS complex 440.30.10 (involved in mRNA splicing) went undetected by all the methods even though 40 of its 42 proteins were present in the Gavin+Krogan network. There were 144 interactions among these 40 proteins, giving a low density of 0.184 to the complex in this network. Continuing with this analysis, we tested MCL and MCL-CAw on a PPI dataset from DIP http://dip.doe-mbi.ucla.edu, comprising of 17491 interactions among 4932 proteins giving a low average node degree of 7.092. MCL-CAw was able to achieve only marginal improvement (22.8% higher precision and 7.4% higher recall) over MCL, due to the low average node degree of the DIP network. 
These experiments show that all the methods considered here find it difficult to uncover complexes that are very sparse in the network. This should prompt us to reconsider whether too much importance is being given to modelling complexes as dense regions of PPI networks.
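To make the flow-simulation idea concrete, the sketch below shows a minimal, unweighted version of MCL's expansion-inflation loop; production MCL implementations additionally prune small entries and operate on weighted matrices, and the function name and thresholds here are our own illustrative choices:

```python
import numpy as np

def mcl(adj, inflation=2.0, max_iters=100, tol=1e-8):
    """Minimal Markov Clustering: alternate expansion (flow) and inflation."""
    n = adj.shape[0]
    M = adj.astype(float) + np.eye(n)   # self-loops avoid oscillation
    M /= M.sum(axis=0)                  # column-stochastic flow matrix
    for _ in range(max_iters):
        prev = M
        M = M @ M                       # expansion: flow spreads along random walks
        M = M ** inflation              # inflation: strengthens intra-cluster flow
        M /= M.sum(axis=0)
        if np.abs(M - prev).max() < tol:
            break
    # each nonzero row of the converged matrix spans one cluster
    clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M}
    clusters.discard(())
    return clusters

# two triangles joined by a single bridge edge (2-3)
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
clusters = mcl(adj)   # flow collapses at the sparse bridge, separating the triangles
```

On this toy graph the inflation step starves the single bridge edge of flow, so the two dense triangles emerge as separate clusters; a sparse complex embedded in such a network suffers the same fate as the bridge.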

Apart from these limitations of existing computational methods, there are inherent difficulties in the interactome data itself that make complex detection hard. Complexes have different lifetimes, and their compositions vary with cellular localization (compartment) and condition. The same protein may be recruited by different complexes at different times and under different conditions. Owing to this temporal and spatial variability, repeated purifications using TAP/MS methods yield somewhat different "complex forms" [20]. The PPI networks constructed from such purifications represent only a probabilistic, averaged picture of the yeast interactome [20]. Therefore, complexes predicted from these networks can only approximate the actual complex compositions.

Another limitation arises from the bias in TAP/MS purifications against certain kinds of complexes (for example, membrane-bound complexes). Since TAP/MS data are acquired under a single condition (rich media), some complexes may simply not be present in the cell under that condition [21]. New experimental assays are therefore needed before such complexes can be reconstructed and studied.

Finally, even though S. cerevisiae serves as a model organism for eukaryotic interactome analysis, some key complexes specialized to other organisms (including human) can be studied only by analysing interaction datasets specific to those organisms. However, the incompleteness of interactome data from these organisms makes such reconstruction difficult.

Conclusion

The ultimate goal of interactome analysis is to understand the higher-level organization of the cell, and the reconstruction of protein complexes serves as a building block towards this goal. In this paper, inspired by the findings of Gavin et al. [6], we developed a novel core-attachment based refinement method coupled to MCL to identify yeast complexes from weighted PPI networks. We demonstrated that our algorithm (MCL-CAw) performs better than MCL in deriving meaningful yeast complexes, particularly in the presence of natural noise, and that it responds reasonably well to the affinity scoring schemes considered. In future work, we intend to improve the prediction ability of our algorithm by incorporating information from gene annotations, gene expression, literature mining, and domain-domain interactions. We also plan to extend our work to predict complexes in organisms other than yeast, and in that context to use the MCL-CAw model to study the existence (and extent) of core-attachment modularity in complexes from other organisms.

Availability

The MCL-CAw software is implemented in PL/SQL on Oracle 10g, using the framework in [58]. The source code, yeast PPI datasets, benchmark complexes, and predicted yeast complexes used in this work are freely available at the MCL-CAw project homepage hosted on the NUS server: http://www.comp.nus.edu.sg/~leonghw/MCL-CAw/.

References

  1. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci 2001, 98: 4569–4574. 10.1073/pnas.061034498

  2. Uetz P, Giot L, Cagney G, Traci A, Judson R, Knight J, Lockshon D, Narayan V, Srinivasan M, Pochart P, Emil QA, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg M: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. 10.1038/35001009

  3. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Seraphin B: A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnol 1999, 17: 1030–1032. 10.1038/13732

  4. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klien K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwin C, Heurtier MA, Copley RR, Edelmann A, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Sepharin B, Kuster B, Neubauer G, Furga GS: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. 10.1038/415141a

  5. Ho Y, Gruhler A, Heilbut A, Bader G, Moore L, Adams SL, Millar A, Taylor P, Bennet K, Boutlier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart M, Gouderault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielson P, Rasmussen K, Anderson J, Johansen L, Hansen L, Jesperson H, Podtelejnikov A, Nielson E, Crawford J, Poulsen V, Sorensen B, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CWV, Figeys D, Tyers M: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180–183. 10.1038/415180a

  6. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russel PB, Superti FG: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532

  7. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis A, Punna T, Alverez JM, Shales M, Zhang X, Davey M, Robinson M, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards D, Canadien V, Lalev A, Mena F, Wong P, Sharostine A, Canette M, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone J, Gandi K, Thompson NJ, Musso G, Onge PS, Ghanny S, Lam M, Butland G, Altaf-Ul A, Kanaya S, Shilatifard A, Weissman J, Ingles J, Hughes TR, Parkinson J, Gerstein M, Wodak S, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440: 637–643. 10.1038/nature04670

  8. Bader GD, Hogue CWV: Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 2002, 20: 991–997. 10.1038/nbt1002-991

  9. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale datasets of protein-protein interactions. Nature 2002, 417: 399–403. 10.1038/nature750

  10. Batada N, Hurst LD, Tyers M: Evolutionary and physiological importance of hub proteins. PLoS Comp Bio 2006, 2: e88. 10.1371/journal.pcbi.0020088

  11. Collins SR, Kemmeren P, Zhao XC, Greenblatt JF, Spencer F, Holstege F, Weissman J, Krogan NJ: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 2007, 6: 439–450.

  12. Hart G, Lee I, Marcotte ER: A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinformatics 2007, 8: 236–247. 10.1186/1471-2105-8-236

  13. Zhang B, Park BH, Karpinets T, Samatova N: From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 2008, 24: 979–986.

  14. Chua H, Ning K, Sung W, Leong H, Wong L: Using indirect protein-protein interactions for protein complex prediction. J Bioinformatics and Computational Biology 2008, 6: 435–466. 10.1142/S0219720008003497

  15. Liu G, Li J, Wong L: Assessing and predicting protein interactions using both local and global network topological metrics. Genome Informatics 2008, 22: 138–149.

  16. Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics 2009, 25: 1891–1897. 10.1093/bioinformatics/btp311

  17. Friedel C, Krumsiek J, Zimmer R: Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. Research in Computational Molecular Biology (RECOMB) 2008, 3–16.

  18. Dongen S: Graph clustering by flow simulation. PhD thesis. University of Utrecht; 2000.

  19. Enright AJ, Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–1584. 10.1093/nar/30.7.1575

  20. Pu S, Vlasblom J, Emili A, Greenblatt J, Wodak S: Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics 2007, 7: 944–960. 10.1002/pmic.200600636

  21. Wang H, Kakaradov B, Collins SR, Karotki L, Fiedler D, Shales M, Shokat KM, Walter T, Krogan NJ, Koller D: A complex-based reconstruction of the Saccharomyces cerevisiae interactome. Mol Cell Proteomics 2009, 8: 1361–1377. 10.1074/mcp.M800490-MCP200

  22. Friedel C, Zimmer R: Identifying the topology of protein complexes from affinity purification assays. Bioinformatics 2009, 25: 2140–2146.

  23. Voevodski K, Yu X: Spectral affinity in protein networks. BMC Systems Biology 2009, 3: 112. 10.1186/1752-0509-3-112

  24. Leung H, Xiang Q, Yiu SM, Chin FY: Predicting protein complexes from PPI data: a core-attachment approach. Journal of Comp Biology 2009, 16: 133–44. 10.1089/cmb.2008.01TT

  25. Wu M, Li X, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 2009, 10: 169. 10.1186/1471-2105-10-169

  26. Mitrofanova A, Farach-Colton M, Mishra B: Efficient and robust prediction algorithms for protein complexes using Gomory-Hu trees. Pacific Symposium on Biocomputing (PSB) 2009, 215–226.

  27. Ozawa Y, Saito R, Fujimori S, Kashima H, Ishizaka M, Yanagawa H, Miyamoto-Sato E, Tomita M: Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions. BMC Bioinformatics 2010, 11: 350. 10.1186/1471-2105-11-350

  28. Srihari S, Ning K, Leong HW: Refining Markov clustering for protein complex detection by incorporating core-attachment structure. Genome Informatics 2009, 23: 159–168.

  29. Pu S, Wong J, Turner B, Cho E, Wodak S: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res 2009, 37(3):825–831. 10.1093/nar/gkn1005

  30. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, Warfsmann J, Ruepp A: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2006, 34: D169-D172. 10.1093/nar/gkj148

  31. Aloy P, Bottcher B, Ceulemans H, Mellwig C, Fischer S, Gavin AC, Bork P, Superti-Furga G, Serrano L, Russell RB: Structure-based assembly of protein complexes of yeast. Science 2004, 303: 2026–2029. 10.1126/science.1092645

  32. Breitkreutz B, Stark C, Tyers M: The GRID: The General Repository for Interaction Datasets. Genome Biology 2003, 4(3):R23. 10.1186/gb-2003-4-3-r23

  33. Cherry JM, Adler C, Chervitz SA, Dwight SS, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26: 73–79. 10.1093/nar/26.1.73

  34. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry M, Davis AP, Dolinski K, Dwight SS, Epigg J, Harris MA, Hill DP, Issel-Tarver L, Kasarkis A, Lewis S, Matase JC, Richardson J, Ringwald M, Rubin GM, Sherlock G: Gene ontology: a tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556

  35. Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci 2002, 99: 12783–8. 10.1073/pnas.192159399

  36. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, Chu AM, Connelly C, Davis K, Dietrich F, Dow SW, Bakkoury E, Foury F, Friend SH, Gentalen E, Giaever G, Hegemann JH, Jones T, Laub M, Liao H, Liebundguth N, Lockhart DJ, Lucau-Danila A, Lussier M, Menard P, Mittmann M, Pai C, Rebischung C, Revuelta JL, Riles L, Roberts CJ, Ross-MacDonald P, Scherens B, Snyder M, Sookhai-Mahadeo S, Storms RK, Veronneau S, Voet M, Volckaert G, Ward TR, Wysocki R, Yen GS, Yu K, Zimmermann K, Philippsen P, Johnston M, Davis RW: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999, 285: 901–906. 10.1126/science.285.5429.901

  37. Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, Arkin AP, Astromoff A, El-Bakkoury M, Bangham R, Benito R, Brachat S, Campanaro S, Curtiss M, Davis K, Deutschbauer A, Entian KD, Flaherty P, Foury F, Garfinkel DJ, Gerstein M, Gotte D, Guldener U, Hegemann JH, Hempel S, Herman Z, Jaramillo DF, Kelly DE, Kelly SL, Kotter P, LaBonte D, Lamb DC, Lan N, Liang H, Liao H, Liu L, Luo C, Lussier M, Mao R, Menard P, Ooi SL, Revuelta JL, Roberts CJ, Rose M, Ross-Macdonald P, Scherens B, Schimmack G, Shafer B, Shoemaker DD, Sookhai-Mahadeo S, Storms RK, Strathern JN, Valle G, Voet M, Volckaert G, Wang CY, Ward TR, Wilhelmy J, Winzeler EA, Yang Y, Yen G, Youngman E, Yu K, Bussey H, Boeke JD, Snyder M, Philippsen P, Davis RW, Johnston M: Functional profiling of the Saccharomyces cerevisiae genome. Nature 2002, 418: 387–391. 10.1038/nature00935

  38. Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7: 488. 10.1186/1471-2105-7-488

  39. King AD, Przulj N, Jurisca I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20(17):3013–3020. 10.1093/bioinformatics/bth351

  40. Mellor JC, Yanai I, Karl H, Mintseris J, DeLisi C: Predictome: a database of putative functional links between proteins. Nucleic Acids Research 2002, 30: 306–309. 10.1093/nar/30.1.306

  41. Shannon P, Markiel A, Ozier O, Baliga NS, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303

  42. Miller T, Krogan NJ, Dover J, Bromage EH, Tempst P, Johnston M, Greenblatt JF, Shilatifard A: COMPASS: a complex of proteins associated with a trithorax-related SET domain protein. Proc Natl Acad Sci 2001, 98(23):12902–7. 10.1073/pnas.231473398

  43. Zhao J, Kessler M, Moore CL: Cleavage factor II of Saccharomyces cerevisiae contains homologues to subunits of the mammalian Cleavage/polyadenylation specificity factor and exhibits sequence-specific, ATP-dependent interaction with precursor RNA. J Biol Chem 1997, 272(16):10831–8. 10.1074/jbc.272.16.10831

  44. Cheng H, He X, Moore C: The Essential WD Repeat Protein Swd2 Has Dual Functions in RNA Polymerase II Transcription Termination and Lysine 4 Methylation of Histone H3. Mol Cell Biology 2004, 24: 2932–2943. 10.1128/MCB.24.7.2932-2943.2004

  45. Luz JS, Tavares JR, Gonzales FA, Santosa MCT, Oliveira CC: Analysis of the Saccharomyces cerevisiae exosome architecture and of the RNA binding activity of Rrp40p. Biochemistry J 2006, 89(5):686–691. 10.1016/j.biochi.2007.01.011

  46. Araki Y, Takahashi S, Kobaysashi T, Kajiho H, Hoshino S, Katada T: Ski7p G protein interacts with the exosome and the Ski complex for 3'-to-5' mRNA decay in yeast. EMBO J 2001, 20(17):4684–4693. 10.1093/emboj/20.17.4684

  47. Hurwitz J: The discovery of RNA polymerase. J Biol Chem 2005, 280(52):42477–42485. 10.1074/jbc.X500006200

  48. Seals DF, Eitzen G, Margolis N, Wickner T, Price A: A Ypt/Rab effector complex containing the Sec1 homolog Vps33p is required for homotypic vacuole fusion. Proc Natl Acad Sci 2000, 97(17):9402–9407. 10.1073/pnas.97.17.9402

  49. Carvalho P, Goder V, Rapoport TA: Distinct ubiquitin-ligase complexes define convergent pathways for the degradation of ER proteins. Cell 2006, 126(2):361–373. 10.1016/j.cell.2006.05.043

  50. Grant PA, Schieltz D, Pray-Grant MG, Reese JC, Yates JR, Wolkman JL: A subset of TAF(II)s are integral components of the SAGA complex required for nucleosome acetylation and transcriptional stimulation. Cell 1998, 94(1):45–53. 10.1016/S0092-8674(00)81220-9

  51. Eberharter A, Sterner DE, Schieltz D, Hassan A, Yates JR, Berger SL, Workman JL: The ADA complex is a distinct histone acetyltransferase complex in Saccharomyces cerevisiae. Mol Cell Biol 1999, 19(10):6621–6631.

  52. Grant PA, Schieltz D, McMahon SJ, Wood JM, Kennedy EL, Cook RG, Workman JL, Yates JR, Grant PA: The novel SLIK histone acetyltransferase complex functions in the yeast retrograde response pathway. Mol Cell Biol 2002, 22(24):8774–8786. 10.1128/MCB.22.24.8774-8786.2002

  53. Jeong H, Mason S, Barabasi AL, Oltvai Z: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. 10.1038/35075138

  54. Han JD, Bertin N, Hao T, Debra S, Gabriel F, Zhang V, Dupuy D, Walhout AJ, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein interaction network. Nature 2004, 430: 88–93. 10.1038/nature02555

  55. Zotenko E, Mestre J, Przytycka TM: Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Computational Biology 2008, 4(8): e1000140.

  56. Kang N, Ng HK, Srihari S, Leong HW, Nesvizhskii A: Examination of the Relationship between Essential Genes in PPI Network and Hub Proteins in Reverse Nearest Neighbor Topology. Personal communication

  57. Tao Y, Yiu ML, Mamoulis N: Reverse neighbor search in metric spaces. IEEE Trans Knowl Data Eng 2006, 18: 1239–1252. 10.1109/TKDE.2006.148

  58. Srihari S, Chandrashekar S, Parthasarathy S: A Framework for SQL-Based Mining of Large Graphs on Relational Databases. Pac Asia Conf Knowledge Discovery Data Mining (PAKDD) 2010, 2: 160–167.

Acknowledgements

We would like to thank the editor as well as the reviewers for their valuable comments and suggestions; Sean Collins (UCSF) and Caroline Friedel (LMU) for making available the Consolidated [11] and the Bootstrap [17] networks, respectively; Guimei Liu (NUS) and Limsoon Wong (NUS) for the Iterative-CD, FS Weight and CMC software [14–16]; and Henry Leung (HKU) and Wu Min (NTU) for the CORE [24] and COACH [25] software, respectively. This work was supported in part by the National University of Singapore under ARF grant R252-000-361-112.

Author information

Corresponding authors

Correspondence to Sriganesh Srihari, Kang Ning or Hon Wai Leong.

Additional information

Authors' contributions

SS conceived the initial ideas and discussed them with HWL and KN. SS devised the algorithm, developed the software, performed the experiments and analysis, and wrote and revised the manuscript. HWL supervised the project, advised SS, and reviewed and revised the manuscript. KN took part in the discussions and helped in reviewing the manuscript. All authors have read and approved the manuscript.

Electronic supplementary material

12859_2010_4087_MOESM1_ESM.PDF

Additional file 1: Additional figures and tables. Figures for core-attachment modularity and an illustration of a complex predicted by MCL-CAw; tables for the setting of MCL-CAw parameters, and for the ranking of complex detection algorithms and affinity-scored networks. (PDF 876 KB)

12859_2010_4087_MOESM2_ESM.ZIP

Additional file 2: The MCL-CAw software package. The source code and installation details for the MCL-CAw software. (ZIP 4 MB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Srihari, S., Ning, K. & Leong, H.W. MCL-CAw: a refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure. BMC Bioinformatics 11, 504 (2010). https://doi.org/10.1186/1471-2105-11-504
