Discover Protein Complexes in Protein-Protein Interaction Networks Using Parametric Local Modularity
- Jongkwang Kim^{1} and
- Kai Tan^{1, 2} (corresponding author)
https://doi.org/10.1186/1471-2105-11-521
© Kim and Tan; licensee BioMed Central Ltd. 2010
Received: 7 April 2010
Accepted: 19 October 2010
Published: 19 October 2010
Abstract
Background
Recent advances in proteomic technologies have enabled us to create detailed protein-protein interaction maps in multiple species and in both normal and diseased cells. As the size of the interaction dataset increases, powerful computational methods are required in order to effectively distil network models from large-scale interactome data.
Results
We present an algorithm, miPALM (Module Inference by Parametric Local Modularity), to infer protein complexes in a protein-protein interaction network. The algorithm uses a novel graph theoretic measure, parametric local modularity, to identify highly connected sub-networks as candidate protein complexes. Using gold standard sets of protein complexes and protein function and localization annotations, we show that our algorithm achieves an overall improvement over previous algorithms in terms of precision, recall, and biological relevance of the predicted complexes. We applied our algorithm to predict and characterize a set of 138 novel protein complexes in S. cerevisiae.
Conclusions
miPALM is a novel algorithm for detecting protein complexes from large protein-protein interaction networks with improved accuracy over previous methods. The software is implemented in Matlab and is freely available at http://www.medicine.uiowa.edu/Labs/tan/software.html.
Background
Protein complexes carry out the majority of biological processes within a cell. Correctly identifying protein complexes in an organism is useful for deciphering the molecular mechanisms underlying many cellular functions. Recent advances in proteomics technologies such as the two-hybrid system and mass spectrometry have allowed an enormous amount of data on protein-protein interactions (PPI) to be released into the public domain [1]. As the amount of global high throughput protein interaction data keeps increasing, methods for accurately identifying protein complexes from such data become a bottleneck for further analysis of the resulting interactome.
There is a large body of research on computational methods for de novo protein complex detection in PPI networks. These methods can be roughly divided into three categories. Methods in the first group define an explicit complex criterion, such as dense connectivity within a complex. A heuristic search strategy is then employed to identify complexes [2–4]. In contrast, the second group of methods also defines a complex criterion but uses complete enumeration to find all complexes that satisfy the criterion [5–7]. Instead of using a local search strategy, the third group of methods is based on global graph partitioning techniques [8–11]. For instance, maximization of the modularity (Q) measure proposed by Newman and Girvan [12] has been successfully applied to PPI networks [11]. However, the global modularity measure has an inherent resolution limit for detecting small sub-networks [13], such as protein complexes whose median size is fewer than 10 proteins per complex. The reason for this resolution limit is that global modularity uses the entire network to compute the expected connectivity within a set of proteins, which may not be an appropriate measure of the background around protein complexes. Muff et al. [9] introduced a local version of the modularity measure (LQ) by only considering the immediate neighbors of a complex instead of the entire network. Applying it to the PPI network of E. coli, they showed that LQ was better at identifying small but biologically meaningful protein complexes.
Q and LQ represent two extremes of the neighborhood measure used to estimate background connectivity in a random network. Neither may be optimal for a given PPI network. In this study, we introduce a tunable parameter into the original formulation of modularity to help determine the optimal neighborhood size in calculating expected connectivity of a set of proteins. Another drawback of the previous LQ approach is that the computationally expensive optimization technique, simulated annealing, was used to maximize LQ, which is not feasible for large PPI networks such as yeast or human networks although it was proven useful for the smaller E. coli PPI network.
In this paper we introduce a novel algorithm to infer protein complexes by combining a parametric local modularity measure and a greedy search strategy. We evaluate our approach on the yeast PPI networks using two reference sets of protein complexes and additional functional annotations of yeast proteins. Compared to four existing methods, our algorithm achieves a significant performance improvement in terms of F-measure and biological relevance of predicted complexes. By applying our method to two large-scale PPI networks, we predict a set of 138 novel protein complexes in the baker's yeast S. cerevisiae that warrant future experimental characterization.
Results
Local Modularity with Coarseness Parameter Improves Complex Prediction
Table 1. Number and average size (arithmetic mean, in parentheses) of predicted complexes using three different modularity measures and the DIP PPI network as input.
Complex Annotation | Q | LQ | LQ_{α} |
---|---|---|---|
CYC08: 236 (6.7) | 27 (1877.6) | 542 (37.5) | 269 (4.7) |
YHTP08: 207 (8.2) | | | 262 (4.6) |
Putting All Together: the miPALM Algorithm
We introduce a novel algorithm, miPALM (Module Inference by Parametric Local Modularity), for inferring protein complexes from large-scale protein interactome data. The input to miPALM consists of an unweighted PPI graph and two parameters, α and δ. The algorithm has three major steps. Algorithmic details of each step and the corresponding pseudocode are described in the Methods section. We briefly describe the major steps of the algorithm here. First, from the input PPI network, miPALM identifies a set of triangle seeds using the topological overlap measure. A pair of nodes in a network has high topological overlap if they are both strongly connected to the same group of nodes (see Methods). Therefore, the use of the topological overlap measure serves to exclude spurious or isolated connections in the network. Second, from each seed, the algorithm uses a greedy search to expand it into one or more candidate complexes. Local modularity is used as a scoring function to assess the quality of a candidate complex. The parameter α is used to control the background neighborhood size around a candidate complex. Finally, a filtering step is performed on the set of candidate complexes based on their density scores, with the cutoff controlled by the parameter δ. The complete algorithm for complex prediction is shown in Algorithm 4.
Performance Comparison with Existing Methods
Next, we compare the performance of our algorithm with four representative algorithms for protein complex prediction: MCODE [2], MCL [10], COACH [15], and DME [7]. MCODE relies on the concept of a k-core (a sub-graph in which all nodes have a degree of at least k) and greedy search. MCL is a global graph partitioning algorithm that works by simulating stochastic flows in a graph. COACH is conceptually similar to MCODE. It first identifies the core of a candidate complex (a maximal set of connected vertices whose degrees are greater than the network average) and then expands the core by including additional nodes if more than 50% of their edges are shared with the core. DME detects all node subsets that satisfy a user-defined minimum density threshold in a greedy fashion. Of the five algorithms, MCL cannot detect overlapping complexes whereas MCODE, COACH, DME, and miPALM can. Additionally, MCL is a global graph partitioning method whereas the other four are based on seeding and local search.
We tested the performance of all five methods using two sets of known complexes in the baker's yeast, S. cerevisiae. CYC08 is a set of protein complexes manually curated from published small-scale studies [16]. Since most small-scale studies tend to be biased towards complexes involved in a limited number of cellular processes, we complemented this set with the YHTP08 set of protein complexes [16]. It was constructed by analyzing the two most recent and most comprehensive genome-wide protein complex screens based on affinity purification coupled with mass spectrometry experiments [17, 18].
Table 2. Statistics of predicted complexes by five algorithms with the best parameters optimized on CYC08 and YHTP08 sets and the DIP PPI network as input. The optimized parameter values are given in parentheses after each entry.
Algorithm (optimized parameter) | CYC08 (236/6.7) | YHTP08 (207/8.2) |
---|---|---|
COACH (affinity threshold) | 271/113/7.3 (0.1) | 271/95/7.3 (0.1) |
MCODE (vertex weight percentage, VWP) | 57/25/12.9 (0.2) | 57/18/12.9 (0.2) |
MCL (inflation) | 830/123/5.9 (1.75) | 830/115/5.9 (1.75) |
DME (density threshold) | 487/44/25.1 (0.97) | 503/40/24.7 (0.96) |
miPALM (α, δ) | 238/100/7.0 (0.364, 2.40) | 277/88/7.0 (0.374, 2.33) |
Figure 2B shows a breakdown of the F-measure into precision and recall for all five methods. On average, MCL achieved the highest recall mainly due to its large number of predictions. On the other hand, MCODE achieved the highest precision because it tends to identify a subset of known complexes with higher overlap than other methods. However, the overall accuracy of both methods (as measured by the F-measure) was lower than those of COACH and miPALM because MCL had a much lower precision and MCODE had a much lower recall. In other words, the higher F-measure achieved by COACH and miPALM is due to a balanced increase in both their recall and precision.
Although the F-measure is a popular metric for evaluating the performance of a complex predictor, it is not the only one. Biological relevance is also an important indicator of the quality of predicted complexes. Accordingly, we next conducted GO term enrichment and co-localization analyses to determine the biological relevance of the predicted complexes. Genome-wide protein localization data have been reported for baker's yeast using fluorescent imaging [19]. For each predicted complex, we calculated a log-odds score that measures the extent to which members of the complex co-localize to the same sub-cellular compartments (see Methods). Compared to the F-measure, which relies on an incomplete gold standard set, both the GO term and co-localization annotations used here are more comprehensive and thus complementary to the F-measure.
At a p-value of 0.05, our set of predictions had the highest fractions of complexes with enriched functional categories (Figure 3A). Compared to the second best performer (MCODE), the average increase in the fraction of enriched complexes was 8.9% across the two gold standard sets of complexes. For complex member co-localization, our predictions had an 18.8% average increase compared to the second best performer, DME (Figure 3B).
Taken together, our benchmarking analyses demonstrated that miPALM achieved the second highest F-measure (3% lower than COACH) when evaluated using known complexes. On the other hand, miPALM outperforms all other algorithms by a large margin (8.9% and 18.8%) when evaluated using functional annotations of complex members.
Novel Complex Predictions Using Large Yeast PPI Networks
Next, we applied miPALM to discover novel protein complexes in two large-scale yeast PPI networks based on interactions obtained from the BioGRID database [20]. The first network consists of all yeast interactions in the BioGRID database. The majority of interactions are derived from high throughput experiments. The second network consists of high-confidence interactions derived by filtering the BioGRID interactions based on their lines of supporting evidence [21]. For brevity's sake, these two networks are termed BioGRID and HC networks in this paper. The BioGRID network contains 5591 proteins and 51880 physical interactions and the HC network contains 2228 proteins and 6209 physical interactions. By studying two networks with different amounts of noise, we can assess the robustness of our method on noisy data.
To predict complexes, we set the coarseness parameter α to 0.364, the value that gave the highest F-measure as described in the performance comparison section.
Table 3. Supporting evidence for novel complexes predicted by miPALM compared to gold standard sets of known complexes.
Evidence | CYC08 (%) | YHTP08 (%) | miPALM All (%) | miPALM Novel (%) |
---|---|---|---|---|
GO | 76.7 | 56.5 | 72.5 | 61.6 |
Colocalization | 25.9 | 43.0 | 65.0 | 68.8 |
To further corroborate our predictions, we next used a genome-wide protein localization data set to examine whether members of our predicted complexes tend to co-localize in the same sub-cellular compartments. For each of our predicted complexes, we calculated a co-localization log-odds score that compares the member co-localization probability of a predicted complex to the probability of the same number of random proteins in the PPI network (see Methods). For the set of 320 predicted complexes, 208 (65.0%) are enriched for at least one sub-cellular compartment (Table 3). Examined separately, 115 (68.5%) BioGRID and 123 (62.0%) HC predictions are enriched for at least one sub-cellular compartment (Figure 4B).
To identify new complexes in our prediction, we used the union of CYC08 and YHTP08 as the set of known complexes. After filtering out those complexes matching any of the known complexes, we were left with 138 novel protein complexes. To evaluate the quality of these novel protein complexes, we computed the fraction of complexes that have enriched GO functional terms or are co-localized to the same sub-cellular compartments. Eighty-five (61.6%) of the novel complexes were enriched for at least one GO term and 95 (68.8%) complexes were enriched for at least one sub-cellular compartment (Table 3). The fraction of GO term enriched complexes was comparable to that of known complexes. Remarkably, the fraction of co-localized complexes in our prediction was much higher than those of the two gold standard sets (Table 3). These results provide further evidence that the novel complexes are true protein complexes. Information about the complete set of predicted complexes with supporting evidence is reported in Additional files 1, 2, 3 and 4.
Discussion
The global modularity measure proposed by Newman and Girvan [12] identifies clusters (sub-networks) in a network by comparing the observed fraction of edges inside a cluster to the expected fraction of edges in the cluster. In doing so, it assumes that connections between all pairs of nodes in the network are equally probable, which reflects all connectivity among all clusters. However, in many molecular interaction networks, most sub-networks are only connected locally. For instance, in metabolic networks, major pathways occur as clusters that are sparsely linked among each other [22]. The same observation can also be made on protein complexes [23].
In this study, we introduced parametric local modularity as a new measure for the quality of clusters in a network. It takes into account local cluster connectivity and overcomes global network dependency. As an analogy, the coarseness parameter functions as the resolution dial of a microscope. By changing the value of the coarseness parameter, we can adjust the size of the cluster neighborhoods when calculating the expected fraction of edges within a cluster. Since different biological networks might have distinct neighborhood connectivity, a tunable local modularity measure allows us to best estimate the local neighborhood connectivity by changing the size of the neighborhood under consideration.
Protein complexes are dynamic molecular entities. Depending on the cellular states, membership of a protein complex could change and different complexes could have shared members [18]. Our algorithm can detect overlapping complexes if during the seed expansion step seeds of different candidate complexes are close enough.
The F-measure is a popular approach to performance evaluation. A drawback of the F-measure is that it cannot distinguish whether a predicted complex overlaps with just one or with multiple known complexes, and vice versa. It has been argued that predictions that overlap with fewer known complexes should be regarded as having a higher quality [24]. To further evaluate the methods using this criterion, we use the separation metric introduced by Brohee and van Helden [24], which takes this observation into account. As shown in Figure S8 (Additional file 1), miPALM again outperforms the other methods. Therefore, it is unlikely that the performance improvement by miPALM is due to a bias in the benchmarking metrics used.
In summary, using three alternative performance measures (F-measure, Biological Relevance, Separation), our benchmarking analysis demonstrates that miPALM achieves the best overall performance among the five algorithms compared. The performance measures of the methods using three input interaction networks are summarized in Additional file 1, Tables S4, S5, S6.
The proposed algorithm can be naturally extended to handle weighted networks by using edge weights for local modularity calculation. Edge weights can be calculated based on topological features of the PPI network and domain-specific information from other omic data, such as microarray gene expression, genome-wide association study, and genome-wide sequence mutation data (e.g. cancer mutation screening). Integration of functional genomic data into miPALM will enable us to find context-dependent sub-networks that are active under specific growth conditions.
Conclusions
Using several performance measures (F-measure, Biological Relevance, and Separation), we have demonstrated that miPALM achieved an overall improvement over previous algorithms. miPALM combines the strengths of three key features: triangle seed identification using the topological overlap measure, parametric local modularity as a cluster quality measure, and recursive greedy search. By including functional genomic data as edge weights, miPALM can be extended to identify context-dependent gene modules that can in turn be used to assist in network comparison and classification tasks.
Methods
Protein interaction and complex data
Protein interaction networks
Yeast protein-protein interaction data were downloaded from the DIP [14] and BioGRID [20] databases. The DIP "full" set of PPIs (including all physical interactions in the DIP database instead of a subset of high confidence interactions) were used for algorithm development and comparison. The BioGRID and high-confidence [21] sets of PPIs were used for novel protein complex prediction. After removing self-loops and multiple edges, the three networks contain 4859, 5591, and 2228 proteins and 17138, 51880, and 6209 interactions, respectively.
Known annotated protein complexes
Two sets of annotated protein complexes were used for performance evaluation. Pu et al. generated a comprehensive catalogue of 408 protein complexes manually curated from published small-scale experiments reported as of 2008 [16]. This set provides an update of the widely used gold-standard MIPS complexes. In the same study, they also generated a catalogue of 400 high-throughput complexes by a systematic analysis of all high throughput protein-protein interaction data reported as of 2008. After removing complexes with fewer than 3 members, we ended up with two reference sets of protein complexes, termed CYC08 (236 complexes) and YHTP08 (207 complexes), respectively.
Construction of the seed set
Seeding strategy is crucial for a network searching algorithm since the search result is dependent on the starting point (e.g. a node, an edge, or a sub-network). Here we describe how to construct seeds and to rank them based on the local property of the network.
where |Γ(v, w)| is the number of common neighbors of nodes v and w, k_{v} and k_{w} are the degrees of nodes v and w, and A_{vw} = 1 if v and w have a direct link and zero otherwise.
In the original definition of O_{T}(v, w), the number of shared interacting partners is normalized by dividing |Γ(v, w)| by min(k_{v}, k_{w}) instead of (k_{v} + k_{w})/2. We modified the normalization factor because it is improper to treat two proteins as topologically equal if one protein has three interactors and the other has 100 interactors (e.g. hub proteins), even though these two proteins share the same three interacting partners.
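The effect of the changed normalization can be illustrated with a minimal Python sketch. The defining equation is not reproduced in this text, so the form used here, (|Γ(v, w)| + A_{vw}) divided by the normalization factor, is an assumption pieced together from the symbol definitions above; the function name `topological_overlap` and the toy node labels are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbours


def topological_overlap(adj: Graph, v: Hashable, w: Hashable,
                        use_min: bool = False) -> float:
    """Topological overlap of v and w.

    ASSUMED form: (|Γ(v, w)| + A_vw) / norm, with norm = (k_v + k_w) / 2
    (modified) or min(k_v, k_w) (original normalization).
    """
    common = len(adj[v] & adj[w])         # |Γ(v, w)|, shared neighbours
    a_vw = 1 if w in adj[v] else 0        # A_vw, direct link indicator
    k_v, k_w = len(adj[v]), len(adj[w])   # node degrees
    norm = min(k_v, k_w) if use_min else (k_v + k_w) / 2
    return (common + a_vw) / norm if norm > 0 else 0.0


# Toy case from the text: a protein with 3 interactors and a hub with 100,
# sharing the same three partners (here assumed not to be directly linked).
shared = {"p1", "p2", "p3"}
adj: Graph = {"small": set(shared),
              "hub": shared | {f"q{i}" for i in range(97)}}
for node in list(adj["small"] | adj["hub"]):
    adj.setdefault(node, set())
for u in ("small", "hub"):
    for n in adj[u]:
        adj[n].add(u)                     # make the graph symmetric

overlap_min = topological_overlap(adj, "small", "hub", use_min=True)
overlap_mean = topological_overlap(adj, "small", "hub")
```

Under these assumptions the min-based normalization scores the pair as fully overlapping, whereas the mean-degree normalization gives a much smaller value because the hub's 100 interactors enter the denominator.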
Second, we enumerated all triangles in the PPI network using the enumeration algorithm described in Algorithm 1. All triangles in the PPI network can be located by Algorithm 1 in O(k_{max}·m) time with an upper bound of O(n·m), where k_{ max } is the largest node degree in the network.
Algorithm 1: TriangleEnumeration (G)
input: unweighted graph G = (V, E)
output: all triangles of G
begin
  for e ∈ E do
    (v, w) ← the pair of nodes connected by e
    Γ(v, w) ← the set of common neighbors shared by v and w
    for x ∈ Γ(v, w) do
      output triplet {v, w, x}
    remove e from G
end
We then rank all triangles found by Algorithm 1 based on their triangle-weights obtained by averaging pair-wise edge weights.
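Algorithm 1 and the ranking step can be sketched in Python as follows. This is an illustrative reading, not the authors' Matlab implementation: the adjacency-dictionary representation, the function names, and the caller-supplied `weight` function (standing in for the topological overlap of an edge) are all assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbours


def enumerate_triangles(adj: Graph) -> List[FrozenSet[Hashable]]:
    """List every triangle exactly once, following Algorithm 1."""
    work = {v: set(nbrs) for v, nbrs in adj.items()}   # leave the input intact
    edges = {frozenset((v, w)) for v in work for w in work[v] if v != w}
    triangles = []
    for edge in edges:
        v, w = tuple(edge)
        for x in work[v] & work[w]:          # common neighbours Γ(v, w)
            triangles.append(frozenset((v, w, x)))
        work[v].discard(w)                   # remove e so that no triangle
        work[w].discard(v)                   # is reported a second time
    return triangles


def rank_triangles(triangles: List[FrozenSet[Hashable]],
                   weight: Callable[[Hashable, Hashable], float]
                   ) -> List[FrozenSet[Hashable]]:
    """Sort triangles by the mean of their three pairwise edge weights."""
    def triangle_weight(t: FrozenSet[Hashable]) -> float:
        return sum(weight(a, b) for a, b in combinations(tuple(t), 2)) / 3
    return sorted(triangles, key=triangle_weight, reverse=True)


# Toy example: a 4-clique on nodes 0..3 plus a pendant node 4.
toy: Graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
found = enumerate_triangles(toy)
# Hypothetical edge weights: edges inside {0, 1, 2} are weighted higher.
ranked = rank_triangles(found, lambda a, b: 1.0 if {a, b} <= {0, 1, 2} else 0.5)
```

Removing each edge after it has been processed is what prevents a triangle from being emitted again when its other two edges are visited.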
Local modularity as the scoring function
where m is the total number of edges in the network, m_{ ss } is the number of intra-module edges in module S, and d_{ s } is the sum of the degrees of nodes in module S. Essentially, Q is the difference in the fraction of within-module edges between the observed network and a random configuration network model. This definition of modularity is global in the sense that the comparison of m_{ ss }/m with (d_{ s }/2m)^{2} assumes equal probability of connection between any pair of nodes in the random network model.
where Q_{ v } and Q_{ S } are the modularity of v and S, respectively and Q_{ vS } is the modularity of the sub-network created by merging v and S.
where m_{ ss } is the number of edges within sub-network S and m_{ s } is the total number of edges in S and its first neighbours. LQ is based on the observation that in real world networks most sub-networks are only connected to a small fraction of the entire network.
where the denominator of the second term in Eq. 4 is not fixed to 2m, but varied with a parameter α that we call the coarseness parameter.
Readers are referred to the Suppl. Methods (Additional file 1) for the detailed derivation of ΔLQ_{α} from LQ_{α}.
When α = 1, ΔLQ_{α} is equivalent to ΔQ in Eq. 3. Decreasing α reduces the number of edges considered. For example, if α = 0.5, the ratio of considered edges to the total number of edges in the network (i.e. the edge-coverage ratio, r = 2m^{α}/2m) is m^{-1/2}. Conversely, if we want to cover locally 50% of edges (r = 0.5), then α can be set to 1 + log_{m}(0.5). As α goes down to zero, the size of the detected sub-network becomes smaller and smaller because the expected fraction of within-module edges, the second term in Eq. 5, becomes larger. Suppl. Figure S1 (Additional file 1) shows the edge-coverage ratio and size of resultant detected sub-networks as a function of α.
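The relation between α and the edge-coverage ratio can be checked numerically. The short sketch below only restates r = 2m^{α}/2m = m^{α-1} and its inverse α = 1 + log_{m}(r); the function names are illustrative, and the edge count is the DIP figure quoted in the Methods.

```python
import math


def edge_coverage_ratio(m: int, alpha: float) -> float:
    """r = 2 m^alpha / (2 m) = m^(alpha - 1)."""
    return m ** (alpha - 1)


def alpha_for_coverage(m: int, r: float) -> float:
    """Inverse relation: alpha = 1 + log_m(r)."""
    return 1 + math.log(r, m)


m = 17138                                  # edges in the DIP network
r_half = edge_coverage_ratio(m, 0.5)       # equals m ** -0.5
alpha_50 = alpha_for_coverage(m, 0.5)      # alpha that covers 50% of the edges
```

Because r shrinks as a power of m, even moderately small values of α restrict the background to a small fraction of the edges in a network of this size.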
Greedy search by maximizing local modularity measure
The problem of finding a network partition with maximum global modularity is known to be NP-hard [26]. Thus, various heuristic approaches have been proposed [27–32]. In particular, greedy search [31, 32] based on global modularity has been studied extensively due to its single-peakedness [33] and its speed in analyzing very large networks.
Our scoring function (Eq. 5) made it possible to adopt a greedy search strategy that iteratively expands a given triangle seed into a larger sub-network until the change in local modularity becomes negative. Pseudocode for our greedy search algorithm is shown in Algorithms 2 and 3. Briefly, starting with the top ranked triangle seed {x, y, z}, our greedy algorithm always merges the direct neighbor w of the seed that increases local modularity the most, growing the seed into a larger sub-network S = {w, x, y, z}. The algorithm outputs S when no remaining neighbor would increase the local modularity if merged. This searching process (or seed expansion) is then repeated with a new seed. The time-consuming step of the greedy search algorithm is the calculation of ΔLQ_{α} after each merging. We avoid recalculating ΔLQ_{α}(v, S') for all neighbours v ∈ N_{s'} of S' by taking advantage of the recursive relationship between ΔLQ_{α} before and after merging (see Suppl. Methods and Figure S3 for details, Additional file 1). The upper bound for the time complexity of our search algorithm is O(n_{s}·d_{s}), where n_{s} is the number of proteins in the sub-network S and d_{s} is the sum of degrees of all nodes in the sub-network S.
Algorithm 2: RecursiveGreedySearch (S, A, α)
input: triangle seed S, adjacency matrix A, and coarseness parameter α
output: expanded sub-network S' and its neighbor nodes N_{s'}
begin
  N_{s} ← neighbor nodes of S
  ΔLQ_{α}(·, S) ← change in local modularity for all v in N_{s}
  if max(ΔLQ_{α}(·, S)) < 0 then
    return S and N_{s}
  [S', N_{s'}] ← GrowSeed(S, A, N_{s}, α, ΔLQ_{α}(·, S))
  return S' and N_{s'}
end
Algorithm 3: GrowSeed (S, A, N_{s}, α, ΔLQ_{α}(·, S))
input: current sub-network S (initially a triangle seed), adjacency matrix A, the set N_{s} of neighbor nodes of S, coarseness parameter α, change in local modularity ΔLQ_{α}(v, S) for all v in N_{s}
output: expanded sub-network S' and its neighbor nodes N_{s'}
begin
  v* ← argmax_{v} {ΔLQ_{α}(v, S)}
  N_{v*} ← all neighbor nodes of v*
  S' ← {S, v*}
  N_{s'} ← (N_{s} − {v*}) ∪ (N_{v*} − (N_{s} ∪ S))
  ΔLQ_{α}(v, S') ← ΔLQ_{α}(v, S) + ΔLQ_{α}(v, v*), for v ∈ N_{s} − {v*}
  ΔLQ_{α}(v, S') ← −d_{v}d_{s}/(2m^{α+1}) + ΔLQ_{α}(v, v*), for v ∈ N_{v*} − (N_{s} ∪ S)
  if max(ΔLQ_{α}(·, S')) < 0 then
    return S' and N_{s'}
  [S', N_{s'}] ← GrowSeed(S', A, N_{s'}, α, ΔLQ_{α}(·, S'))
  return S' and N_{s'}
end
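The seed expansion of Algorithms 2 and 3 can be sketched in Python with a loop in place of the recursion. Eq. 5 is not reproduced in this text, so the sketch assumes ΔLQ_{α}(v, S) = m_{vS}/m − d_{v}d_{s}/(2m^{α+1}), where m_{vS} is the number of edges between v and S; this form is inferred from the update rules in Algorithm 3 and from the statement that α = 1 recovers ΔQ, and should be read as an assumption. Function names and the toy graph are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # node -> neighbours, no self-loops


def graph_from_edges(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    adj: Graph = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def grow_seed(adj: Graph, seed: Set[Hashable], alpha: float) -> Set[Hashable]:
    """Greedily merge the best neighbour while its gain is non-negative."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2    # total number of edges
    denom = 2 * m ** (alpha + 1)                       # 2 m^(alpha + 1)

    def pair_gain(v: Hashable, u: Hashable) -> float:
        # ASSUMED ΔLQ_α(v, u) for two single nodes.
        return (1 if u in adj[v] else 0) / m - len(adj[v]) * len(adj[u]) / denom

    S = set(seed)
    d_S = sum(len(adj[u]) for u in S)                  # sum of degrees in S
    neighbours = set().union(*(adj[u] for u in S)) - S
    # gain[v] holds ΔLQ_α(v, S) for every current neighbour v of S.
    gain = {v: len(adj[v] & S) / m - len(adj[v]) * d_S / denom
            for v in neighbours}
    while gain:
        best = max(gain, key=gain.get)
        if gain[best] < 0:
            break
        del gain[best]
        new_nbrs = adj[best] - S - set(gain)           # N_v* − (N_s ∪ S)
        for v in gain:                                 # v ∈ N_s − {v*}
            gain[v] += pair_gain(v, best)
        for v in new_nbrs:                             # no edges to the old S
            gain[v] = -len(adj[v]) * d_S / denom + pair_gain(v, best)
        S.add(best)
        d_S += len(adj[best])
    return S


# Toy graph: two 4-cliques joined by the single bridge edge (3, 4).
toy = graph_from_edges(list(combinations(range(4), 2))
                       + list(combinations(range(4, 8), 2)) + [(3, 4)])
expanded = grow_seed(toy, {0, 1, 2}, alpha=1.0)
```

The two update loops mirror the recursive relationship described above: neighbours already adjacent to S only need the pairwise term for the newly merged node added, so no gain is recomputed from scratch.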
Elimination of unpromising seeds
Unpromising seeds are those that cannot be expanded into larger sub-networks. In other words, they are triangles with no neighbor that would cause a positive change in local modularity if merged. We filtered out those triangles after the seed expansion step to speed up the algorithm and reduce the number of false positives (see Figure S2 in Additional file 1).
Complex merging
Proteins in a PPI network could belong to one or more protein complexes simultaneously. This multiple membership of proteins should be uncovered by the clustering algorithm. Complexes found by our method can overlap if they are within the same densely connected region in the PPI network. While revealing overlapping complexes is important for understanding their dynamics, allowing an algorithm to make overlapping predictions often produces an excessive number of complexes. For example, the algorithm DME [7] predicted 14,780 complexes (minimum density threshold 0.95) on the yeast DIP full set. The majority of them overlap, causing low precision and poor overall performance. In this paper we merged any two complexes S and T if they have an overlap score greater than 0.5, where the overlap score is defined as |S ⋂ T|/min(|S|, |T|).
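A minimal Python sketch of this merging rule is given below. The text does not state whether merging is repeated after two complexes have been combined; the sketch assumes it is repeated until no pair exceeds the threshold, and the function names are hypothetical.

```python
from typing import Hashable, Iterable, List, Set


def overlap_score(s: Set[Hashable], t: Set[Hashable]) -> float:
    """|S ∩ T| / min(|S|, |T|) for two non-empty complexes."""
    return len(s & t) / min(len(s), len(t))


def merge_complexes(complexes: Iterable[Set[Hashable]],
                    threshold: float = 0.5) -> List[Set[Hashable]]:
    """Merge pairs whose overlap score exceeds `threshold`.

    ASSUMPTION: merging restarts after every merge until no pair qualifies.
    """
    merged = [set(c) for c in complexes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_score(merged[i], merged[j]) > threshold:
                    merged[i] |= merged[j]     # union of the two complexes
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


merged_example = merge_complexes([{1, 2, 3, 4}, {3, 4, 5}, {7, 8, 9}])
not_merged = merge_complexes([{1, 2, 3, 4}, {3, 4, 5, 6}])
```

In the first call the pair sharing two of three proteins is merged; in the second the overlap score is exactly 0.5, which does not exceed the threshold, so both complexes are kept.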
Complex filtering by density score
After merging complexes produced by the seed expansion step, we rank the candidate complexes by their density score δ_{s}, which is defined as the product of the connectivity and size of complex S, $\delta_{s}=\frac{m_{ss}}{n_{s}(n_{s}-1)/2}\cdot n_{s}$.
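The density score can be computed directly from this definition, as in the following sketch. The adjacency-dictionary representation and the function name are illustrative; the threshold 2.40 is the δ value reported for CYC08 in the performance comparison.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbours


def density_score(adj: Graph, members: Set[Hashable]) -> float:
    """delta_s = m_ss / (n_s (n_s - 1) / 2) * n_s for a candidate complex."""
    n_s = len(members)
    if n_s < 2:
        return 0.0
    # Each intra-complex edge is seen from both endpoints, hence the / 2.
    m_ss = sum(len(adj[v] & members) for v in members) / 2
    connectivity = m_ss / (n_s * (n_s - 1) / 2)
    return connectivity * n_s


# Toy graph: a triangle {0, 1, 2} with a path 2 - 3 - 4 attached.
toy: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
delta = 2.40
kept = [s for s in ({0, 1, 2}, {2, 3, 4}) if density_score(toy, s) >= delta]
```

For a fully connected complex the score equals its size n_{s}, so under this definition a triangle scores 3 and passes the threshold, while the three-node path scores 2 and is filtered out.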
The miPALM algorithm
Our algorithm takes as input an unweighted PPI network G_{n,m} = {V, E} with n nodes and m edges and outputs a set of predicted protein complexes, M. The pseudocode of the algorithm is shown in Algorithm 4.
Algorithm 4: miPALM (G, α, δ)
input: unweighted graph G_{n,m} = {V, E}, n = |V|, m = |E|, coarseness parameter α, and density score threshold δ
output: a set of sub-networks, M
begin
  T ← TriangleEnumeration (G)
  t ← choose the top ranked triad-seed in T
  T ← delete t from list T
  while T is not empty do
    S ← RecursiveGreedySearch (t, A, α)
    t ← choose the top triad-seed uncovered by the previous search
    T ← delete t from list T
    if the size of S is three then
      continue
    S ← refine S by looking around S
    M ← {M, S}, output S
  M ← merge sub-networks in M
  for S ∈ M do
    δ_{s} ← get density score of S
    if δ_{s} < δ then
      delete S from M
end
Performance evaluation
We used the F-measure to evaluate the performance of complex prediction algorithms. The F-measure is the harmonic mean of precision (Pre) and recall (Rec), F = 2·Pre·Rec/(Pre + Rec). Precision is defined as the ratio of the number of matched sub-networks to the number of sub-networks predicted by each algorithm. Recall is the ratio of the number of matched sub-networks to the number of known complexes.
For comparison purposes, we used the complex matching criterion used in MCODE [2] to identify predicted complexes that overlap with gold standard complexes. A predicted sub-network is considered matched to a known complex if it has a matching score of 0.2 or greater. The matching score is defined as ω = c^{2}/(a·b), where a and b are the sizes of the sub-network and the known complex, respectively, and c is the number of protein members shared between the prediction and the known complex. We also examine the precision and recall rates at different overlap scores (see Figure S9 in Additional file 1).
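The evaluation described above can be sketched in Python as follows. The sketch follows the definitions literally, so the same count of matched predicted sub-networks is used in both precision and recall; the function names and the toy complexes are hypothetical.

```python
from typing import Hashable, List, Set, Tuple


def matching_score(pred: Set[Hashable], known: Set[Hashable]) -> float:
    """omega = c^2 / (a * b), with c the number of shared proteins."""
    c = len(pred & known)
    return c * c / (len(pred) * len(known))


def f_measure(predicted: List[Set[Hashable]], known: List[Set[Hashable]],
              threshold: float = 0.2) -> Tuple[float, float, float]:
    """Return (precision, recall, F) using the matching criterion above."""
    matched = sum(
        1 for p in predicted
        if any(matching_score(p, k) >= threshold for k in known)
    )
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(known) if known else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f


# Toy data: two of the three predictions match a known complex.
predicted = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10}]
known = [{1, 2, 3, 4}, {5, 6, 7}]
pre, rec, f = f_measure(predicted, known)
```

In the toy data the first two predictions each reach a matching score of 0.75 against one known complex, while the third shares no proteins with any known complex and counts against precision only.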
Parameter selection
Our algorithm has two parameters, α for determining the size of the local neighborhood of a candidate complex and δ for filtering candidate complexes based on their density score. For benchmarking purposes, we used the F-measure to determine the parameters yielding the best performance of the algorithm on three sets of known complexes. Because the δ parameter is only used for post-search filtering, we first searched for the optimal α value. We varied α from 0 to 1 with an initial step size of 0.01. Once the range of the optimal α value was located, we further searched for the optimal parameter value using a finer step size of 0.001 (Figure S4 in Additional file 1). After an optimal α was found, we determined the optimal δ by searching from 0 to 3.5 with a step size of 0.01. To determine the sensitivity of the algorithm to parameter changes, we determined the overlaps between complexes predicted using two α values that differed by 0.01. As can be seen in Figure S5 (Additional file 1), our algorithm is not overly sensitive to parameter changes.
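The two-stage search over α can be sketched as a coarse grid followed by a finer grid around the coarse optimum. In the sketch below, `score` is a hypothetical stand-in for running the predictor at a given α and returning its F-measure, and the width of the refinement window (one coarse step on either side) is an assumption, since the text does not specify it.

```python
from typing import Callable


def grid_argmax(score: Callable[[float], float],
                lo: float, hi: float, step: float) -> float:
    """Return the grid point in [lo, hi] with the highest score."""
    n = int(round((hi - lo) / step))
    # Rounding keeps grid points free of accumulated floating-point drift.
    candidates = [round(lo + i * step, 10) for i in range(n + 1)]
    return max(candidates, key=score)


def coarse_to_fine(score: Callable[[float], float],
                   lo: float = 0.0, hi: float = 1.0,
                   coarse: float = 0.01, fine: float = 0.001) -> float:
    """Coarse scan of [lo, hi], then a fine scan around the coarse optimum."""
    best = grid_argmax(score, lo, hi, coarse)
    # ASSUMPTION: refine within one coarse step on either side of the optimum.
    return grid_argmax(score, max(lo, best - coarse), min(hi, best + coarse), fine)


# Toy score with a single peak; a real run would evaluate the F-measure here.
def toy_score(alpha: float) -> float:
    return -(alpha - 0.364) ** 2


best_alpha = coarse_to_fine(toy_score)
```

The same `grid_argmax` helper could be reused for δ with the range 0 to 3.5 and a step of 0.01 once α has been fixed.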
For the other four programs we compared, we tested the following parameter ranges that gave optimal F-measure on the three sets of known complexes. For COACH, the affinity threshold was varied from 0 to 1 with a step size of 0.01. For MCL, the inflation parameter was varied from 1.2 to 5.0 with a step size of 0.01. For DME, the density threshold parameter was varied from 0.91 to 1.0 with a step size of 0.01. For MCODE, vertex weight percentage = 0.2, haircut = TRUE, and fluff = FALSE were used. These parameters of MCODE have been optimized to produce the best results by default.
Gene ontology term enrichment test
Yeast Gene Ontology (GO) slim terms were used to evaluate the biological relevance of predicted complexes. P-value for GO term enrichment was calculated using the hypergeometric distribution. A Bonferroni-corrected p-value of 0.05 is considered to be significant.
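The enrichment test can be sketched with the Python standard library alone. The sketch computes the upper tail of the hypergeometric distribution and applies a Bonferroni correction; the number of tests used in the correction (here, the number of terms tested for one complex) is an assumption, and the function names and toy annotations are hypothetical.

```python
from math import comb
from typing import Dict, Set


def hypergeom_pvalue(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when n proteins are drawn from N, K of which carry the term."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / comb(N, n)


def enriched_terms(members: Set[str], annotations: Dict[str, Set[str]],
                   background: Set[str], cutoff: float = 0.05) -> Dict[str, float]:
    """Return {term: corrected p-value} for terms at or below `cutoff`."""
    members = members & background
    n_tests = len(annotations)   # ASSUMPTION: correct for the terms tested
    hits: Dict[str, float] = {}
    for term, proteins in annotations.items():
        annotated = proteins & background
        k = len(annotated & members)
        p = hypergeom_pvalue(len(background), len(annotated), len(members), k)
        p_corrected = min(1.0, p * n_tests)       # Bonferroni correction
        if p_corrected <= cutoff:
            hits[term] = p_corrected
    return hits


# Toy data: 20 background proteins and two hypothetical GO slim terms.
background = {f"p{i}" for i in range(20)}
annotations = {"T1": {f"p{i}" for i in range(4)},
               "T2": {f"p{i}" for i in range(4, 14)}}
result = enriched_terms({"p0", "p1", "p2", "p3"}, annotations, background)
```

In the toy data the complex coincides with the four proteins annotated to T1, so only that term survives the correction.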
Co-localization analysis
where n_{sk} is the number of proteins localized in compartment k in sub-network S and p_{s} is the connectivity of the sub-network. We consider a complex to be localized to a compartment k if the log-odds score $\log(m_{sk}/\overline{m_{sk}}) > 0$.
Declarations
Acknowledgements
We thank the anonymous reviewers for their helpful comments. This work is supported by the American Cancer Society [77-004-31 to K.T.] and the Pharmaceutical Research and Manufacturers of America Foundation [to K.T.].
References
- Beyer A, Bandyopadhyay S, Ideker T: Integrating physical and genetic maps: from genomes to interaction networks. Nat Rev Genet 2007, 8: 699–710. 10.1038/nrg2144
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4: 2. 10.1186/1471-2105-4-2
- Everett L, Wang LS, Hannenhalli S: Dense subgraph computation via stochastic search: application to detect transcriptional modules. Bioinformatics 2006, 22: e117–123. 10.1093/bioinformatics/btl260
- Adamcsek B, Palla G, Farkas IJ, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 2006, 22: 1021–1023. 10.1093/bioinformatics/btl039
- Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435: 814–818. 10.1038/nature03607
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 2003, 100: 12123–12128. 10.1073/pnas.2032324100
- Georgii E, Dietmann S, Uno T, Pagel P, Tsuda K: Enumeration of condition-dependent dense modules in protein interaction networks. Bioinformatics 2009, 25: 933–940. 10.1093/bioinformatics/btp080
- Pereira-Leal JB, Enright AJ, Ouzounis CA: Detection of functional modules from protein interaction networks. Proteins 2004, 54: 49–57. 10.1002/prot.10505
- Muff S, Rao F, Caflisch A: Local modularity measure for network clusterizations. Phys Rev E 2005, 72: 056107. 10.1103/PhysRevE.72.056107
- van Dongen S: Graph Clustering by Flow Simulation. University of Utrecht, Physics 2000.
- Girvan M, Newman ME: Community structure in social and biological networks. Proc Natl Acad Sci USA 2002, 99: 7821–7826. 10.1073/pnas.122653799
- Newman MEJ, Girvan M: Finding and evaluating community structure in networks. Physical Review E 2004, 69: 026113. 10.1103/PhysRevE.69.026113
- Fortunato S, Barthelemy M: Resolution limit in community detection. Proc Natl Acad Sci USA 2007, 104: 36–41. 10.1073/pnas.0605965104
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449–451. 10.1093/nar/gkh086
- Wu M, Li X, Kwoh CK, Ng SK: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 2009, 10: 169. 10.1186/1471-2105-10-169
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res 2009, 37: 825–831. 10.1093/nar/gkn1005
- Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440: 637–643. 10.1038/nature04670
- Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B, et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532
- Huh W-K, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK: Global analysis of protein localization in budding yeast. Nature 2003, 425: 686–691. 10.1038/nature02026
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34: D535–539. 10.1093/nar/gkj109
- Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hurst LD, Tyers M: Still stratus not altocumulus: further evidence against the date/party hub distinction. PLoS Biol 2007, 5: e154. 10.1371/journal.pbio.0050154
- Guimera R, Nunes Amaral LA: Functional cartography of complex metabolic networks. Nature 2005, 433: 895–900. 10.1038/nature03288
- Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat Rev Genet 2004, 5: 101–113. 10.1038/nrg1272
- Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7: 488. 10.1186/1471-2105-7-488
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL: Hierarchical organization of modularity in metabolic networks. Science 2002, 297: 1551–1555. 10.1126/science.1073374
- Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D: On modularity clustering. IEEE Transactions on Knowledge and Data Engineering 2008, 20: 172–188. 10.1109/TKDE.2007.190689
- Agarwal G, Kempe D: Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B-Condensed Matter and Complex Systems 2008, 66: 409–418. 10.1140/epjb/e2008-00425-1
- Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z, Wagner D: On Finding Graph Clusterings with Maximum Modularity. Graph-Theoretic Concepts in Computer Science 2007, 121–132.
- Noack A, Rotta R: Multi-level Algorithms for Modularity Clustering. In Experimental Algorithms. Volume 5526. Heidelberg: Springer; 2009:257–268.
- Sales-Pardo M, Guimera R, Moreira AA, Amaral LA: Extracting the hierarchical organization of complex systems. Proc Natl Acad Sci USA 2007, 104: 15224–15229. 10.1073/pnas.0703740104
- Newman MEJ: Fast algorithm for detecting community structure in networks. Physical Review E 2004, 69: 066133. 10.1103/PhysRevE.69.066133
- Schuetz P, Caflisch A: Multistep greedy algorithm identifies community structure in real-world and computer-generated networks. Physical Review E 2008, 78: 026112. 10.1103/PhysRevE.78.026112
- Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks. Physical Review E 2004, 70: 066111. 10.1103/PhysRevE.70.066111
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.