### Protein interaction and complex data

#### Protein interaction networks

Yeast protein-protein interaction data were downloaded from the DIP [14] and BioGRID [20] databases. The DIP "full" set of PPIs (all physical interactions in the DIP database rather than a subset of high-confidence interactions) was used for algorithm development and comparison. The BioGRID and high-confidence [21] sets of PPIs were used for novel protein complex prediction. After removing self-loops and multiple edges, the three networks contain 4859, 5591, and 2228 proteins and 17138, 51880, and 6209 interactions, respectively.

#### Known annotated protein complexes

Two sets of annotated protein complexes were used for performance evaluation. Pu *et al*. generated a comprehensive catalogue of 408 protein complexes manually curated from published small-scale experiments reported as of 2008 [16]. This set updates the widely used gold-standard MIPS complexes. In the same study, they also generated a catalogue of 400 complexes by a systematic analysis of all high-throughput protein-protein interaction data reported as of 2008. After removing complexes with fewer than 3 members, we obtained two reference sets of protein complexes, termed CYC08 (236 complexes) and YHTP08 (207 complexes), respectively.

### Construction of the seed set

The seeding strategy is crucial for a network search algorithm because the search result depends on the starting point (e.g. a node, an edge, or a sub-network). Here we describe how to construct seeds and rank them based on local properties of the network.

First, we weight every interaction in the PPI network. For discovering good seeds, it is important to rank within-complex edges high and between-complex edges low. As the edge weight, we used a modified version of the topological overlap measure of Ravasz *et al*. [25], defined as follows:

*O*(*v*, *w*) = (|Γ(*v*, *w*)| + *A*_{vw}) / ((*k*_{v} + *k*_{w})/2)   (1)

where |Γ(*v*, *w*)| is the number of common neighbors of nodes *v* and *w*, *k*_{v} and *k*_{w} are the degrees of *v* and *w*, and *A*_{vw} = 1 if *v* and *w* have a direct link and zero otherwise.

In the original definition of *O*_{T}(*v*, *w*), the number of shared interacting partners is normalized by dividing |Γ(*v*, *w*)| by *min*(*k*_{v}, *k*_{w}) instead of (*k*_{v} + *k*_{w})/2. We modified the normalization factor because it is improper to treat two proteins as topologically equal when one protein has three interactors and the other has 100 (e.g. a hub protein), even though the two proteins share the same three interacting partners.
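As a concrete illustration, the modified weight can be computed from an adjacency-set representation of the network. This is a minimal sketch (function and variable names are ours, not from the original implementation):

```python
def edge_weight(adj, v, w):
    """Modified topological overlap (a sketch): shared neighbours plus the
    direct link, normalised by the mean degree (k_v + k_w) / 2."""
    shared = len(adj[v] & adj[w])          # |Gamma(v, w)|
    a_vw = 1 if w in adj[v] else 0         # A_vw
    return (shared + a_vw) / ((len(adj[v]) + len(adj[w])) / 2)

# Toy network: a triangle {a, b, c} with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(edge_weight(adj, "a", "b"))  # (1 + 1) / ((2 + 2) / 2) = 1.0
```

Under this normalization an edge touching a hub receives a lower weight than under min-degree normalization, which is exactly the behaviour motivated above.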

Second, we enumerated all triangles in the PPI network using the enumeration algorithm described in *Algorithm 1*. All triangles in the PPI network can be located by Algorithm 1 in O(*k*_{max}·*m*) time with an upper bound of O(*n*·*m*), where *k*_{max} is the largest node degree in the network.

**Algorithm 1:** *TriangleEnumeration*(*G*)

1 **input**: unweighted graph *G* = (*V*, *E*)

2 **output**: all triangles of *G*

3 **begin**

4 **for** *e* ∈ *E* **do**

5 (*v*, *w*) ← the pair of nodes connected by *e*

6 Γ(*v*, *w*) ← the set of common neighbors of *v* and *w*

7 **for** *x* ∈ Γ(*v*, *w*) **do**

8 output triplet {*v*, *w*, *x*}

9 remove *e* from *G*

10 **end**

We then rank all triangles found by Algorithm 1 by their triangle weights, which are obtained by averaging the pairwise edge weights of each triangle.
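A direct translation of Algorithm 1 into Python might look as follows (a sketch with our own names; edge deletion after processing guarantees each triangle is reported exactly once):

```python
def enumerate_triangles(adj):
    """Enumerate every triangle once (Algorithm 1): for each edge (v, w),
    emit {v, w, x} for every common neighbour x, then delete the edge so
    no triangle is reported twice."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    triangles = []
    edges = [(v, w) for v in adj for w in adj[v] if v < w]
    for v, w in edges:
        for x in adj[v] & adj[w]:
            triangles.append(frozenset((v, w, x)))
        adj[v].discard(w)                 # remove e from G
        adj[w].discard(v)
    return triangles

# The complete graph K4 contains 4 triangles; each is reported exactly once.
k4 = {v: {u for u in "abcd" if u != v} for v in "abcd"}
print(len(enumerate_triangles(k4)))  # 4
```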

### Local modularity as the scoring function

The total modularity *Q* of a network with *M* modules is defined as follows [12]:

*Q* = Σ_{S=1}^{M} [*m*_{ss}/*m* − (*d*_{s}/2*m*)^{2}]   (2)

where *m* is the total number of edges in the network, *m*_{ss} is the number of intra-module edges in module *S*, and *d*_{s} is the sum of the degrees of the nodes in module *S*. Essentially, *Q* is the difference between the fraction of within-module edges in the observed network and that in a random configuration network model. This definition of modularity is global in the sense that comparing *m*_{ss}/*m* with (*d*_{s}/2*m*)^{2} assumes an equal probability of connection between any pair of nodes in the random network model.
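Assuming an adjacency-set representation, the global modularity of a partition can be computed directly from this definition (a minimal sketch with our own names):

```python
def modularity(adj, partition):
    """Global modularity Q as defined in the text: for each module S, the
    fraction of within-module edges minus the expected fraction (d_s/2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # total number of edges
    q = 0.0
    for module in partition:
        m_ss = sum(1 for v in module for w in adj[v]
                   if w in module and v < w)          # intra-module edges
        d_s = sum(len(adj[v]) for v in module)        # summed degrees
        q += m_ss / m - (d_s / (2 * m)) ** 2
    return q

# Two disconnected triangles split into their natural modules.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(adj, [{1, 2, 3}, {4, 5, 6}]))  # 0.5
```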

During module search, when a node *v* and a sub-network *S* are merged, the change in global modularity can be derived as follows:

Δ*Q* = *Q*_{vS} − (*Q*_{v} + *Q*_{S})   (3)

where *Q*_{v} and *Q*_{S} are the modularities of *v* and *S*, respectively, and *Q*_{vS} is the modularity of the sub-network created by merging *v* and *S*.

In order to overcome the resolution limit of the global modularity measure, Muff *et al*. proposed the local modularity measure *LQ* [9]:

*LQ* = Σ_{S} [*m*_{ss}/*m*_{s} − (*d*_{s}/2*m*_{s})^{2}]

where *m*_{ss} is the number of edges within sub-network *S* and *m*_{s} is the total number of edges in *S* and its *first* neighbours. *LQ* is based on the observation that in real-world networks most sub-networks are connected to only a small fraction of the entire network.

Inspired by previous work, we introduce a new local modularity measure for a single sub-network, defined as:

*LQ*_{α}(*S*) = *m*_{ss}/*m*^{α} − (*d*_{s}/2*m*^{α})^{2}   (4)

where the denominator of the second term in Eq. 4 is not fixed to 2*m*, but varies with a parameter *α* that we call the *coarseness parameter*.

After merging *v* and *S*, the change in the newly defined local modularity is:

Δ*LQ*_{α}(*v*, *S*) = *LQ*_{α}(*vS*) − *LQ*_{α}(*v*) − *LQ*_{α}(*S*)   (5)

Readers are referred to the Suppl. Methods (Additional file 1) for a detailed derivation of Δ*LQ*_{α} from *LQ*_{α}.

When *α* = 1, Δ*LQ*_{α} is equivalent to Δ*Q* in Eq. 3. Decreasing *α* reduces the number of edges considered. For example, if *α* = 0.5, the ratio of considered edges to the total number of edges in the network (i.e. the edge-coverage ratio, *r* = 2*m*^{α}/(2*m*) = *m*^{α−1}) is *m*^{−1/2}. Conversely, if we want to cover 50% of edges locally (*r* = 0.5), *α* can be set to 1 + *log*_{m}(0.5). As *α* approaches zero, the detected sub-networks become smaller and smaller because the expected fraction of within-module edges, the second term in Eq. 4, becomes larger. Suppl. Figure S1 (Additional file 1) shows the edge-coverage ratio and the size of the detected sub-networks as a function of *α*.
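The relationship between *α* and the edge-coverage ratio can be checked numerically. This is a small sketch; `edge_coverage` and `alpha_for_coverage` are our own names:

```python
import math

def edge_coverage(m, alpha):
    """Edge-coverage ratio r = m^alpha / m = m^(alpha - 1)."""
    return m ** (alpha - 1)

def alpha_for_coverage(m, r):
    """Invert the relationship: the alpha giving coverage r is 1 + log_m(r)."""
    return 1 + math.log(r, m)

m = 17138                                   # edges in the DIP full network
print(edge_coverage(m, 0.5), m ** -0.5)     # alpha = 0.5  ->  r = m^(-1/2)
print(edge_coverage(m, alpha_for_coverage(m, 0.5)))  # target coverage 0.5
```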

### Greedy search by maximizing local modularity measure

The problem of finding a network partition with maximum global modularity is known to be NP-hard [26], so various heuristic approaches have been proposed [27–32]. In particular, greedy search based on global modularity [31, 32] has been studied extensively due to its single-peakedness [33] and its speed on very large networks.

Our scoring function (Eq. 5) makes it possible to adopt a greedy search strategy that iteratively expands a given triangle seed into a larger sub-network until the increase in local modularity becomes negative. Pseudo-code for our greedy search is shown in Algorithms 2 and 3. Briefly, starting from the top-ranked triangle seed {*x*, *y*, *z*}, the algorithm always merges the direct neighbor *w* of the seed that increases local modularity the most, growing the seed into a larger sub-network *S* = {*w*, *x*, *y*, *z*}. The algorithm outputs *S* when no additional neighbor can be merged with an increase in local modularity. This search process (seed expansion) is then repeated with a new seed. The time-consuming step of the greedy search is the calculation of Δ*LQ*_{α} after each merge. We avoid recalculating Δ*LQ*_{α}(*v*, *S'*) for all neighbours *v* ∈ *N*_{s'} of *S'* by exploiting a recursive relationship between Δ*LQ*_{α} before and after merging (see Suppl. Methods and Figure S3 for details, Additional file 1). The upper bound on the time complexity of our search algorithm is O(*n*_{s}·*d*_{s}), where *n*_{s} is the number of proteins in the sub-network *S* and *d*_{s} is the sum of the degrees of all nodes in *S*.
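The seed-expansion loop can be sketched as follows. For simplicity, this sketch scores a candidate merge as *LQ*_{α}(*S* ∪ {*w*}) − *LQ*_{α}(*S*) and recomputes scores from scratch rather than using the paper's Δ*LQ*_{α} with its recursive update; it illustrates the greedy strategy, not the authors' implementation:

```python
def lq_alpha(adj, S, m_alpha):
    """Local modularity LQ_alpha of sub-network S (a sketch)."""
    m_ss = sum(1 for v in S for w in adj[v] if w in S) / 2  # intra edges
    d_s = sum(len(adj[v]) for v in S)                       # summed degrees
    return m_ss / m_alpha - (d_s / (2 * m_alpha)) ** 2

def greedy_expand(adj, seed, alpha):
    """Grow a triangle seed by always merging the neighbour with the
    largest positive gain in LQ_alpha; stop when no merge helps."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    m_alpha = m ** alpha
    S = set(seed)
    while True:
        neighbours = {w for v in S for w in adj[v]} - S
        best, best_gain = None, 0.0
        for w in neighbours:
            gain = lq_alpha(adj, S | {w}, m_alpha) - lq_alpha(adj, S, m_alpha)
            if gain > best_gain:
                best, best_gain = w, gain
        if best is None:
            return S
        S.add(best)

# In K4 a triangle seed absorbs the fourth node, since the merge raises LQ.
k4 = {v: {u for u in "abcd" if u != v} for v in "abcd"}
print(sorted(greedy_expand(k4, {"a", "b", "c"}, alpha=1.0)))  # ['a', 'b', 'c', 'd']
```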

**Algorithm 2:** *RecursiveGreedySearch*(*S*, *A*, *α*)

1 **input**: triangle seed *S*, adjacency matrix *A*, and coarseness parameter *α*

2 **output**: expanded sub-network *S'* and its neighbor nodes *N*_{s'}

3 **begin**

4 *N*_{s} ← neighbor nodes of *S*

5 Δ*LQ*_{α}(·, *S*) ← change in local modularity for all *v* in *N*_{s}

6 **if** max(Δ*LQ*_{α}(·, *S*)) < 0 **then**

7 **return** *S* and *N*_{s}

8 [*S'*, *N*_{s'}] ← *GrowSeed*(*S*, *A*, *N*_{s}, *α*, Δ*LQ*_{α}(·, *S*))

9 **return** *S'* and *N*_{s'}

10 **end**

**Algorithm 3:** *GrowSeed*(*S*, *A*, *N*_{s}, *α*, Δ*LQ*_{α}(·, *S*))

1 **input**: triangle seed *S*, adjacency matrix *A*, the set *N*_{s} of neighbor nodes of *S*, coarseness parameter *α*, and the change in local modularity Δ*LQ*_{α}(*v*, *S*) for all *v* in *N*_{s}

2 **output**: expanded sub-network *S'* and its neighbor nodes *N*_{s'}

3 **begin**

4 *v** ← arg max(Δ*LQ*_{α}(·, *S*))

5 *N*_{v*} ← all neighbor nodes of *v**

6 *S'* ← {*S*, *v**}

7 *N*_{s'} ← (*N*_{s} ∪ *N*_{v*}) \ *S'*

8 Δ*LQ*_{α}(·, *S'*) ← update the change in local modularity for all *v* in *N*_{s'}

9 **if** max(Δ*LQ*_{α}(·, *S'*)) < 0 **then**

10 **return** *S'* and *N*_{s'}

11 **return** *GrowSeed*(*S'*, *A*, *N*_{s'}, *α*, Δ*LQ*_{α}(·, *S'*))

12 **end**

### Elimination of unpromising seeds

Unpromising seeds are those that cannot be expanded into larger sub-networks, i.e. triangles with no neighbor whose merging causes a positive change in local modularity. We filtered out such triangles after the seed expansion step to speed up the algorithm and reduce the number of false positives (see Figure S2 in Additional file 1).

### Complex merging

Proteins in a PPI network may belong to more than one protein complex simultaneously, and this multiple membership should be uncovered by a clustering algorithm. Complexes found by our method can overlap if they lie within the same densely connected region of the PPI network. While revealing overlapping complexes is important for understanding their dynamics, allowing an algorithm to make overlapping predictions often produces an excessive number of complexes. For example, the algorithm DME [7] predicted 14,780 complexes (minimum density threshold 0.95) on the yeast DIP full set; the majority of them overlap, causing low precision and poor overall performance. In this paper, we merged any two complexes *S* and *T* if their overlap score, defined as |*S* ⋂ *T*|/*min*(|*S*|, |*T*|), is greater than 0.5.
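The merging rule can be sketched as a simple fixed-point loop (our own unoptimised illustration, not the authors' implementation):

```python
def overlap_score(s, t):
    """Overlap score |S ∩ T| / min(|S|, |T|)."""
    return len(s & t) / min(len(s), len(t))

def merge_complexes(complexes, threshold=0.5):
    """Repeatedly union any two predicted complexes whose overlap score
    exceeds the threshold, until no further merge is possible."""
    merged = [set(c) for c in complexes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_score(merged[i], merged[j]) > threshold:
                    merged[i] |= merged.pop(j)   # union the pair in place
                    changed = True
                    break
            if changed:
                break
    return merged

# {1,2,3,4} and {3,4,5} overlap with score 2/3 > 0.5, so they merge.
print(merge_complexes([{1, 2, 3, 4}, {3, 4, 5}, {10, 11, 12}]))
```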

### Complex filtering by density score

After merging the complexes produced by the seed expansion step, we rank the candidate complexes by a density score *δ*_{s}, defined as the product of the connectivity and the size of the complex.
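Assuming that "connectivity" denotes the fraction of realised intra-complex edges, 2*m*_{ss}/(*n*_{s}(*n*_{s} − 1)), the density score can be sketched as follows (an interpretation, not the authors' exact formula):

```python
def density_score(adj, S):
    """Density score delta_s, sketched as (edge density) x (size), assuming
    'connectivity' means the fraction of realised intra-complex edges."""
    n = len(S)
    m_ss = sum(1 for v in S for w in adj[v] if w in S) / 2  # intra edges
    connectivity = 2 * m_ss / (n * (n - 1))
    return connectivity * n

# A triangle is fully connected: connectivity 1.0, size 3 -> score 3.0.
tri = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(density_score(tri, {1, 2, 3}))  # 3.0
```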

### The miPALM algorithm

Our algorithm takes as input an unweighted PPI network *G* = (*V*, *E*) with *n* nodes and *m* edges and outputs a set of predicted protein complexes, *M*. The pseudo-code of the algorithm is shown in Algorithm 4.

**Algorithm 4:** *miPALM*(*G*, *α*, *δ*)

1 **input**: unweighted graph *G* = (*V*, *E*), *n* = |*V*|, *m* = |*E*|, coarseness parameter *α*, and density score threshold *δ*

2 **output**: a set of sub-networks, *M*

3 **begin**

4 *T* ← *TriangleEnumeration*(*G*)

5 *t* ← the top-ranked triangle seed in *T*

6 *T* ← delete *t* from *T*

7 **while** *T* is not empty **do**

8 *S* ← *RecursiveGreedySearch*(*t*, *A*, *α*)

9 *t* ← the top-ranked triangle seed not covered by the previous searches

10 *T* ← delete *t* from *T*

11 **if** the size of *S* is three **then**

12 continue

13 *S* ← refine *S* by examining its neighborhood

14 *M* ← {*M*, *S*}, output *S*

15 *M* ← merge overlapping sub-networks in *M*

16 **for** *S* ∈ *M* **do**

17 *δ*_{s} ← density score of *S*

18 **if** *δ*_{s} < *δ* **then**

19 delete *S* from *M*

20 **end**

### Performance evaluation

We used the F-measure to evaluate the performance of complex prediction algorithms. The F-measure is the harmonic mean of precision (Pre) and recall (Rec): 2·Pre·Rec/(Pre + Rec). Precision is defined as the ratio of the number of matched sub-networks to the number of sub-networks predicted by each algorithm. Recall is the ratio of the number of matched known complexes to the total number of known complexes.

For comparison purposes, we used the complex matching criterion of MCODE [2] to identify predicted complexes that overlap with gold-standard complexes. A predicted sub-network is considered matched to a known complex if it has a matching score of 0.2 or greater. The matching score is defined as *ω* = *c*^{2}/(*a*·*b*), where *a* and *b* are the sizes of the predicted sub-network and the known complex, respectively, and *c* is the number of proteins shared between the prediction and the known complex. We also examined precision and recall at different overlap scores (see Figure S9 in Additional file 1).
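The matching criterion and the resulting F-measure can be sketched as follows (function names are ours):

```python
def matching_score(pred, known):
    """MCODE-style overlap score omega = c^2 / (a * b)."""
    c = len(pred & known)
    return c * c / (len(pred) * len(known))

def f_measure(predicted, references, cutoff=0.2):
    """Precision, recall and F-measure under the omega >= cutoff criterion."""
    matched_pred = sum(1 for p in predicted
                       if any(matching_score(p, r) >= cutoff for r in references))
    matched_ref = sum(1 for r in references
                      if any(matching_score(p, r) >= cutoff for p in predicted))
    pre = matched_pred / len(predicted)
    rec = matched_ref / len(references)
    return 2 * pre * rec / (pre + rec) if pre + rec else 0.0

# One of two predictions matches the single known complex (omega = 0.75):
# precision 0.5, recall 1.0, F-measure 2/3.
print(f_measure([{1, 2, 3}, {7, 8}], [{1, 2, 3, 4}]))
```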

### Parameter selection

Our algorithm has two parameters: *α*, which determines the size of the local neighborhood of a candidate complex, and *δ*, which filters candidate complexes by their density score. For benchmarking purposes, we used the *F-measure* to determine the parameters yielding the best performance of the algorithm on the three sets of known complexes. Because *δ* is only used for post-search filtering, we first searched for the optimal *α* value. We varied *α* from 0 to 1 with an initial step size of 0.01. Once the range of the optimal *α* value was located, we searched further using a finer step size of 0.001 (Figure S4 in Additional file 1). After an optimal *α* was found, we determined the optimal *δ* by searching from 0 to 3.5 with a step size of 0.01. To assess the sensitivity of the algorithm to parameter changes, we measured the overlap between complexes predicted using two *α* values differing by 0.01. As shown in Figure S5 (Additional file 1), our algorithm is not overly sensitive to parameter changes.

For the other four programs we compared, we tested the following parameter ranges and report the values that gave the optimal *F-measure* on the three sets of known complexes. For COACH, the affinity threshold was varied from 0 to 1 with a step size of 0.01. For MCL, the inflation parameter was varied from 1.2 to 5.0 with a step size of 0.01. For DME, the density threshold was varied from 0.91 to 1.0 with a step size of 0.01. For MCODE, we used vertex weight percentage = 0.2, haircut = TRUE, and fluff = FALSE; these default parameters of MCODE have been optimized to produce the best results.

### Gene ontology term enrichment test

Yeast Gene Ontology (GO) slim terms were used to evaluate the biological relevance of predicted complexes. P-values for GO term enrichment were calculated using the hypergeometric distribution. A Bonferroni-corrected p-value below 0.05 was considered significant.
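The enrichment test can be reproduced with an exact hypergeometric tail probability (a sketch; the gene counts below are illustrative, not taken from the study):

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for a hypergeometric draw: N annotated genes in total,
    K carrying the GO term, a complex of n proteins, k of which carry it."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Illustrative numbers: genome of 6000 genes, 100 with the term,
# complex of 10 proteins, 5 of them annotated with the term.
p = hypergeom_pvalue(6000, 100, 10, 5)
n_terms = 50   # hypothetical number of GO slim terms tested
print(p, min(1.0, p * n_terms) < 0.05)  # Bonferroni-corrected significance
```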

### Co-localization analysis

Based on fluorescence imaging, Huh *et al*. [19] classified 75% of the yeast proteome into 22 distinct sub-cellular compartments. Protein localization data were downloaded from the yeast GFP fusion localization database (http://yeastgfp.yeastgenome.org). To compute a log-odds score of complex sub-cellular localization, we compared the observed number of protein pairs within a sub-network *S* that are co-localized to sub-cellular compartment *k* (*m*_{sk}) to the expected number of such pairs in a random network, defined as follows:

*LOD*_{sk} = log(*m*_{sk} / (*p*_{s}·*n*_{sk}(*n*_{sk} − 1)/2))

where *n*_{sk} is the number of proteins localized to compartment *k* in sub-network *S* and *p*_{s} is the connectivity of the sub-network. We consider a complex to be localized to compartment *k* if its log-odds score exceeds a specified threshold.
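Under the interpretation that the expected number of co-localized pairs is *p*_{s}·*n*_{sk}(*n*_{sk} − 1)/2, the score can be sketched as follows (the choice of a base-2 logarithm here is our assumption):

```python
import math

def colocalization_lod(m_sk, n_sk, p_s):
    """Log-odds of co-localization: observed co-localized interacting pairs
    m_sk versus the expected p_s * C(n_sk, 2) pairs in a random network
    with the same connectivity (a reconstruction of the score above)."""
    expected = p_s * n_sk * (n_sk - 1) / 2
    return math.log2(m_sk / expected)

# 6 observed co-localized pairs among 4 proteins, connectivity 0.5:
# expected = 0.5 * C(4, 2) = 3 pairs, so the log-odds is log2(6 / 3) = 1.0.
print(colocalization_lod(6, 4, 0.5))  # 1.0
```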