### Preliminaries

We model a protein network as an undirected, unweighted graph where the nodes are the proteins, and two nodes are connected by an edge if the corresponding proteins are annotated as interacting with each other.

#### Graph Representation

Formally, a graph is given by a set of vertices *V* and a set of edges *E*. The degree of a node *u* ∈ *V*, denoted by *d*(*u*), is the number of edges adjacent to *u*. A graph is often represented by its adjacency matrix. The adjacency matrix *A* of a graph *G* = (*V*, *E*) is defined by

*A*(*u*, *v*) = 1 if (*u*, *v*) ∈ *E*, and 0 otherwise.

#### Random Walks

We can learn a lot about the structure of a graph by taking a random walk on it. A random walk is a process where at each step we move from some node to one of its neighbors. The transition probabilities are given by edge weights, so in the case of an unweighted network the probability of transitioning from *u* to any adjacent node is 1/*d*(*u*). Thus the transition probability matrix (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one:

*W* = *D*^{−1}*A*

Here *D* is the degree matrix, which is a diagonal matrix given by

*D*(*u*, *v*) = *d*(*u*) if *u* = *v*, and 0 otherwise.

In a random walk it is useful to consider a probability distribution vector *p* over all the nodes in the graph. Here *p* is a row vector, where *p*(*u*) is the probability that the walk is at node *u*, and ∑_{u ∈ V} *p*(*u*) = 1. Because we transition between nodes with probabilities given by *W*, if *p*_{t} is the probability distribution vector at time *t*, then *p*_{t+1} = *p*_{t}*W*.
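
The update rule can be sketched with NumPy; the 4-node path graph below is purely illustrative:

```python
import numpy as np

# Toy undirected, unweighted graph: a path on 4 nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)        # degrees d(u)
W = A / d[:, None]       # W = D^{-1} A: each row sums to one

p = np.array([1.0, 0.0, 0.0, 0.0])   # walk starts at node 0
for _ in range(3):
    p = p @ W            # p_{t+1} = p_t W
```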

If we modify the random walk to reset at each step with nonzero probability *α*, it will have a unique steady-state probability distribution. This steady-state distribution, which is known as a PageRank vector, is useful because it tells us how much time we will spend at each vertex in a very long random walk on the graph. For starting vector *s* and reset probability *α*, the PageRank vector pr_{α}(*s*) is the unique solution of the linear system

pr_{α}(*s*) = *α s* + (1 − *α*) pr_{α}(*s*)*W*

The *s* vector specifies the probability distribution for where the walk transitions when it resets. The original PageRank algorithm used a uniform unit vector for *s* [10, 11]. PageRank with non-uniform starting vectors is known as personalized PageRank, and has been used in context-sensitive search on the Web [35, 36].
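
Because *W* is fixed, the PageRank system can be solved directly. A minimal sketch with NumPy, assuming the recurrence pr = *α s* + (1 − *α*) pr *W* and reusing the toy path graph (all names are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1)[:, None]

def pagerank(s, alpha, W):
    """Solve pr = alpha*s + (1 - alpha)*pr*W for the row vector pr."""
    n = W.shape[0]
    # Rearranged as a linear system: pr (I - (1 - alpha) W) = alpha s
    return alpha * s @ np.linalg.inv(np.eye(n) - (1 - alpha) * W)

s = np.array([1.0, 0.0, 0.0, 0.0])   # personalized: the walk resets to node 0
pr = pagerank(s, alpha=0.15, W=W)
```

Since each row of *W* sums to one, the resulting pr sums to one as well.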

#### Partitioning

A common problem in graph theory is to partition the vertices of a graph into clusters while minimizing the number of intercluster edges. A matrix often used for this purpose is the graph Laplacian. The Laplacian matrix of an undirected graph *G* = (*V*, *E*) (with no self-loops) is defined as follows:

*L*_{G}(*u*, *v*) = *d*(*u*) if *u* = *v*; −1 if (*u*, *v*) ∈ *E*; and 0 otherwise.

In other words, *L*_{G} = *D*_{G} − *A*_{G}. The eigenvectors of *L*_{G} reveal some structure of the graph [30], and are often used by spectral graph partitioning algorithms.
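
A quick sketch of the construction (toy graph and names are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

L = np.diag(A.sum(axis=1)) - A   # L_G = D_G - A_G

# Every row of L sums to zero, so the all-ones vector is an eigenvector
# with eigenvalue 0; the remaining eigenvalues are nonnegative.
eigvals = np.linalg.eigvalsh(L)  # ascending order
```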

### Conductance

Conductance measures the proportion of edges leaving a set of nodes in the graph. Given a graph *G* = (*V*, *E*) and a subset of vertices *S* ⊆ *V*, let us call the edge boundary of *S* the collection of edges with one endpoint in *S* and the other outside of *S*:

∂(*S*) = {(*u*, *v*) ∈ *E* : *u* ∈ *S*, *v* ∉ *S*}

Let us also define the volume of *S* to be the sum of the degrees of its nodes:

vol(*S*) = ∑_{u ∈ S} *d*(*u*)

The conductance of *S* is then defined as the ratio of the size of its edge boundary to the volume of the smaller side of the partition:

Φ(*S*) = |∂(*S*)| / min(vol(*S*), vol(*V* ∖ *S*))

The lower the conductance, the better the cluster. Notice that a cluster can have low conductance without being dense. Using the minimum of vol(*S*) and vol(*V* ∖ *S*) in the definition disregards vacuous clusters (for example, when *S* = ∅ or *S* = *V*).
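
The definition translates directly into code. A small sketch (the two-triangle graph and all names are illustrative):

```python
import numpy as np

def conductance(A, S):
    """Conductance of vertex set S in a graph with adjacency matrix A."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(S)] = True
    boundary = A[mask][:, ~mask].sum()   # edges with exactly one endpoint in S
    vol_S = A[mask].sum()                # sum of degrees of nodes in S
    vol_rest = A[~mask].sum()
    denom = min(vol_S, vol_rest)
    return boundary / denom if denom > 0 else float("inf")

# Two triangles joined by a single edge: {0, 1, 2} is a natural cluster.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

phi = conductance(A, {0, 1, 2})   # 1 boundary edge / volume 7
```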

### Nibble

Nibble, the local clustering algorithm of Spielman and Teng [23], works by looking for a cluster of low conductance among the most popular destinations of a short random walk from the starting vertex. The algorithm starts with a probability distribution vector *p* that has all of its probability in the starting vertex, and at every iteration updates *p* by setting *p*_{t} = *p*_{t−1}*W*, where *W* is the lazy random walk transition probability matrix. A lazy random walk is a modified random walk where the probability of remaining at each vertex is 1/2; it is used to ensure that the walk reaches a steady state. After each iteration of the random walk, Nibble checks the probability distribution vector for a cluster of low conductance by performing a "sweep" of *p*.

A sweep is a technique for producing a partition (cluster) from a probability distribution vector. The vertices are ordered by degree-normalized probability, and the conductance of the first *j* vertices in the order is computed for *j* ranging from 1 to the number of nonzero entries in the vector (*N*); the set with the lowest conductance is returned. More precisely, let *v*_{1}, ..., *v*_{N} be an ordering of the (nonzero) vertices of *p* such that *p*(*v*_{i})/*d*(*v*_{i}) ≥ *p*(*v*_{i+1})/*d*(*v*_{i+1}). Consider the collection of sweep sets *S*_{j} = {*v*_{1}, ..., *v*_{j}}. Let Φ(*p*) be the smallest conductance of any of these sets,

Φ(*p*) = min_{1 ≤ j ≤ N} Φ(*S*_{j})

The algorithm finds Φ(*p*) by sorting the entries of *p* by degree-normalized probability, and then computing the conductance of each sweep set to find the minimum. The degree of each vertex *v*, denoted by *d*(*v*), is proportional to the amount of probability that *v* has in the stationary distribution of the random walk. Therefore the sweep sets contain vertices that have significantly more probability in *p* than they do in the stationary distribution, meaning that they are visited more often in a walk from the starting node than they are at steady state.
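
The sweep described above might be sketched as follows, updating the boundary and volume incrementally as vertices are added (illustrative names; the two-triangle test graph is not from the paper):

```python
import numpy as np

def sweep(A, p):
    """Return the lowest-conductance sweep set of distribution p."""
    d = A.sum(axis=1)
    order = [v for v in np.argsort(-p / d) if p[v] > 0]
    total_vol = d.sum()
    mask = np.zeros(A.shape[0], dtype=bool)
    vol = boundary = 0.0
    best_set, best_phi = None, float("inf")
    for v in order:
        # Adding v: its edges into the current set stop being boundary
        # edges, while its edges to the outside become boundary edges.
        inside = A[v, mask].sum()
        boundary += d[v] - 2 * inside
        vol += d[v]
        mask[v] = True
        denom = min(vol, total_vol - vol)
        if denom > 0 and boundary / denom < best_phi:
            best_phi = boundary / denom
            best_set = set(np.flatnonzero(mask))
    return best_set, best_phi

# Two triangles joined by an edge; p concentrates on the first triangle.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

best, phi = sweep(A, np.array([0.4, 0.3, 0.3, 0.0, 0.0, 0.0]))
```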

In order to bound the runtime of the algorithm, Nibble only looks at a small part of the graph close to the starting vertex by using a truncation operation on the probability distribution vector. Given a parameter *ϵ*, after each iteration of the random walk, we set *p*(*v*) = 0 for every *v* such that *p*(*v*) ≤ *ϵ d*(*v*). Nibble takes as input the number of iterations that it performs, as well as *ϵ*, and returns the sweep set of smallest conductance over all iterations.
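
The truncated walk itself is simple to sketch (a lazy walk with the truncation step; the graph and names are illustrative, and the full algorithm would sweep *p* after every iteration):

```python
import numpy as np

def truncated_lazy_walk(A, start, t, eps):
    """Run t steps of a lazy random walk from `start`, zeroing entries
    with p(v) <= eps * d(v) after each step (the Nibble truncation)."""
    n = A.shape[0]
    d = A.sum(axis=1)
    W = 0.5 * (np.eye(n) + A / d[:, None])   # lazy walk: stay with prob 1/2
    p = np.zeros(n)
    p[start] = 1.0
    for _ in range(t):
        p = p @ W
        p[p <= eps * d] = 0.0                # drop negligible entries
    return p

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

p = truncated_lazy_walk(A, start=0, t=5, eps=0.01)
```

Because truncation only removes probability, the entries of *p* sum to at most one.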

Deviating from the algorithm presented in [23], we also implement a constrained version of Nibble, which always reports a cluster containing the starting vertex. Here when we perform a sweep of the probability distribution vector, we always put the starting vertex *s* first in the order (set *v*_{1} = *s*), no matter how much probability there is at that vertex. Therefore, Constrained-Nibble only considers sweep sets that include the starting vertex. Similarly, Nibble can also be modified to only consider sweep sets of a certain size, which is useful when we wish to find a cluster in a specified size range.

### PageRank-Nibble

PageRank-Nibble [25] is similar to Nibble in that it looks for a cluster containing the closest vertices to the starting node. However, instead of using an evolving probability distribution of a random walk from starting node *s*, PageRank-Nibble uses a personalized PageRank vector that gives the stationary distribution of a random walk that always returns to *s* when it resets. Once the personalized PageRank vector is computed, the same sweep technique described in the previous section is used to return the cluster of lowest conductance among its sweep sets.

In addition, to bound the amount of time necessary to compute the PageRank vector and perform a sweep, the algorithm uses an approximation of it. The approximation algorithm computes an *ϵ*-approximate PageRank vector by conducting a random walk that only considers vertices *v* that have more than *ϵ d*(*v*) probability in them. The resulting PageRank vector has few non-zero entries, and the amount of error in it for any subset of vertices *S* is less than *ϵ*·vol(*S*).
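
A push-style computation in this spirit can be sketched as follows (an illustrative reading of the approximation, not the exact routine of [25]): probability is only pushed out of vertices whose residual exceeds *ϵ d*(*v*), so the result stays sparse.

```python
import numpy as np

def approx_pagerank(A, s_node, alpha, eps):
    """Approximate personalized PageRank via repeated local pushes.

    p is the approximation, r the residual; a push moves alpha*r(u)
    into p(u) and spreads the rest lazily to u and its neighbors."""
    n = A.shape[0]
    d = A.sum(axis=1)
    p, r = np.zeros(n), np.zeros(n)
    r[s_node] = 1.0
    while True:
        over = np.flatnonzero(r > eps * d)
        if over.size == 0:
            break
        u = over[0]
        p[u] += alpha * r[u]
        push = (1 - alpha) * r[u] / 2
        r += push * A[u] / d[u]   # half of the remainder to the neighbors
        r[u] = push               # the other half stays at u (lazy step)
    return p

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

p = approx_pagerank(A, s_node=0, alpha=0.15, eps=0.001)
```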

PageRank-Nibble uses the same sweep technique to find a partition, so it can also be constrained to only consider sweep sets of a certain size if we wish to find a cluster in a specified size range. Unlike Nibble, PageRank-Nibble always reports a cluster containing the starting vertex, because the starting vertex has the most degree-normalized probability in the computed PageRank vector. For calculations of PageRank, *α* (the reset probability in the PageRank equation) is typically chosen to be 0.15. However, we find that using a lower value of *α* (such as 0.02) gives us clusters of lower conductance.

### Metis

Metis is a global graph partitioning algorithm that outperforms other state-of-the-art methods [26]. It takes the number of clusters (*k*) as an argument, and maps each vertex to one of *k* balanced clusters while minimizing the size of the edge cut, which is the set of edges with endpoints in different clusters. Metis is a multilevel algorithm: it coarsens the graph, partitions the coarsened graph, and then refines the partition as the graph is rebuilt. There are two variations of the Metis algorithm for partitioning graphs, Kmetis and Pmetis. We try both and decide to use Pmetis because it gives us clusters of lower conductance. Pmetis works by recursively bisecting the graph; it is slower than Kmetis but returns clusters that are more balanced in size. The partitions reported by Pmetis are almost all of exactly the same size, so to get clusters of a certain size we simply set *k* accordingly. Because Pmetis is a global algorithm, we partition the entire graph once, and for starting vertex *s* return the cluster containing *s*.

### Spectral Clustering

We use a common spectral clustering implementation, taking the first *d* eigenvectors of the Laplacian matrix of the graph (other than the one corresponding to the lowest eigenvalue), to put each vertex in a *d*-dimensional space. We then use *k*-means clustering to partition the vertices into *k* clusters, again choosing *k* to get clusters of the desired size. However, the sizes of the found clusters vary greatly, so we also use a variation where we simply return the *k* closest vertices to the starting vertex in the spectral embedding space. Once again, if we partition the entire graph, we return the cluster that contains the starting vertex.
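
The nearest-neighbor variation can be sketched as follows (illustrative names; `dim` plays the role of *d*):

```python
import numpy as np

def spectral_neighbors(A, start, k, dim=2):
    """Embed vertices using the low eigenvectors of the Laplacian and
    return the k vertices closest to `start` in that space."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    X = vecs[:, 1:dim + 1]           # skip the constant eigenvector
    dist = np.linalg.norm(X - X[start], axis=1)
    return set(np.argsort(dist)[:k])

# Two triangles joined by an edge: the Fiedler vector separates them.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

cluster = spectral_neighbors(A, start=0, k=3, dim=1)
```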

### Measuring Functional Distance

In order to assess the functional coherence of the found clusters, we use functional distances from Yu et al. [24]. These values are derived using the Gene Ontology (GO) Process classification scheme, where **functional-distance**(*a*, *b*) is the number of gene pairs that share all of the least common ancestors of *a* and *b* in the classification hierarchy. A low functional distance means that two proteins are more functionally related, because few other protein pairs have the same functional relationship.

The functional distance measure of Yu et al., which the authors refer to as the "total ancestry measure for GO," has the obvious advantage that it considers all known functions of a pair of proteins, allowing for a great degree of precision in assessing functional similarity. Moreover, unlike other methods that derive distances from the GO classification scheme, this method is very resilient to rough functional descriptions, because it still assigns low distances to pairs of proteins that only share very broad terms, as long as there are few other protein pairs that share all of those terms.

Functional distances from [24] can be quite large, yet differences in scores at the low end are more significant than differences at the high end. For this reason we take the logarithm in our calculations, using log(**functional-distance**(*a*, *b*)) as the distance between proteins *a* and *b*.

### Calculating Functional Coherence

To determine the functional coherence of community *C* in a protein network represented by a graph *G* = (*V*, *E*), we compute an absolute and a relative functional coherence score. The *absolute functional coherence* of a community is the difference between the average functional distance of two proteins in the network and the average pairwise functional distance of proteins in the community.

The *relative functional coherence* of a community also takes into account how functionally related the proteins inside the community are to the other proteins in the network, and is defined as the difference in average functional distance of intercommunity and intracommunity protein pairs.
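
Given a precomputed matrix of pairwise (log) functional distances, both scores reduce to averages over index masks. A sketch with illustrative names and toy numbers:

```python
import numpy as np

def coherence_scores(D, C):
    """Absolute and relative functional coherence of community C,
    where D[i, j] is the (log) functional distance between i and j."""
    n = D.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[list(C)] = True
    avg_all = D[np.triu_indices(n, k=1)].mean()       # all network pairs
    intra = D[np.ix_(mask, mask)]
    avg_intra = intra[np.triu_indices(len(C), k=1)].mean()
    avg_inter = D[np.ix_(mask, ~mask)].mean()         # one endpoint in C
    return avg_all - avg_intra, avg_inter - avg_intra

# 4 proteins; {0, 1} are close (distance 1), all other pairs distant (3).
D = np.array([[0, 1, 3, 3],
              [1, 0, 3, 3],
              [3, 3, 0, 3],
              [3, 3, 3, 0]], dtype=float)

abs_c, rel_c = coherence_scores(D, {0, 1})
```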

### Correlating Conductance and Functional Coherence

To determine whether communities with better conductance are more likely to be functionally coherent, we choose groups of proteins from each network, rank these groups by conductance, absolute, and relative functional coherence, and compute the correlation between the ranks using the Pearson Correlation Coefficient [37]. How to choose the protein groups for this experiment is non-trivial. They cannot be selected by taking random subsets of nodes in each network, which would almost certainly produce disconnected groups with bad conductance and functional coherence. Furthermore, they cannot be selected using algorithms that minimize conductance, which would produce groups strongly biased toward low conductance. A better way to choose these groups is to first randomly select a vertex, and then choose *k* − 1 of its nearest neighbors, breaking ties in distance randomly. Such "random" protein groups will be connected in the network, with reasonable and variable conductance and coherence values. The size of each group is randomly chosen in the range 10 ≤ *k* ≤ 40, because we expect biologically relevant communities to be in this size range.
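
The group-selection procedure can be sketched as a breadth-first search that shuffles each distance layer (illustrative names; `adj` maps a vertex to its neighbor list):

```python
import random

def random_group(adj, k):
    """Grow a group of size k from a random vertex, taking nearest
    neighbors first and breaking ties in distance randomly."""
    start = random.choice(list(adj))
    group, frontier = {start}, [start]
    while len(group) < k and frontier:
        layer = list({v for u in frontier for v in adj[u]} - group)
        random.shuffle(layer)     # random tie-breaking within a layer
        frontier = []
        for v in layer:
            if len(group) == k:
                break
            group.add(v)
            frontier.append(v)
    return group

random.seed(1)
adj_demo = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
group = random_group(adj_demo, 3)
```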

### The Protein Networks

The protein interaction data that we use in our study is from BioGRID [38], Version 2.0.53, updated June 1, 2009. BioGRID lists interacting protein pairs, and for each pair gives the experimental method used to observe the interaction, as well as the source that submitted it. In our study we use several yeast protein-protein interaction (PPI) networks formed from interactions detected by different methods.

Two of the networks, in which protein interactions are detected by bait-and-prey experiments, are Affinity Capture-Western (referred to as **ac-western** in the figures) and Affinity Capture-MS (**ac-ms**). These networks tend to be much more cliquish and contain dense components, which is due to the nature of the experiments used to detect the interactions. A single protein (the bait) is used to pull in a set of other proteins (the prey), and an interaction is predicted either between the bait and each prey (the spoke model) or between every pair of proteins in the group (the matrix model) [39]. We also use Two-Hybrid data in our study. Two-Hybrid methods detect binary interactions; therefore, PPI networks based on Two-Hybrid data tend to be less dense and cliquish than ones derived from Affinity Capture experiments.

In addition to using a network formed from the union of all Two-Hybrid interactions listed in BioGRID (**two-hybrid**), we also consider a subset of this data submitted by [40] (**two-hybrid-2**). This network is sparser, but is believed to be of higher quality.