Preliminary definitions
Notation
A graph G is defined by a set of vertices V and edges E that connect pairs of vertices. This work considers undirected, unweighted edges with no self edges. Extensions to directed, weighted, and self-edges are possible but are not discussed here.
A “flat” model. A model M defines how vertices are collected into groups. These groups are denoted C1, C2, …, C
K
for a model with K groups. Each vertex is assigned to one of the K groups, and the groups are disjoint. This model can be summarized as M= {C
k
: k ∈ 1, …, K}. Subscripts u, v typically refer to individual vertices, and subscripts i, j, k refer to groups.
Edge counts between groups can be summarized as
for i ≠ j, and
The binary variable e
uv
= 1 for a u ~ v edge and 0 for the lack of an edge, or a hole. Total pair counts are defined as t
ij
= n
i
n
j
for i ≠ j, and t
ii
= n
i
(n
i
— 1)/2, where n
i
is the number of vertices within group i. Summary counts for holes are h
ij
= t
ij
— e
ij
. For a given pair of groups i and j, the e
ij
edges are modeled as the result of t
ij
independent Bernoulli trials with parameter θ
ij
. The probability of the observed edges, conditioned on θ
ij
, is
The maximum likelihood value
is obtained by setting θ
ij
to its maximum likelihood estimate with a uniform prior,
. A fully Bayesian probability
is obtained by integrating out the nuisance parameter θ
ij
, again with a uniform prior:
where Beta is the standard Beta function and xx = 1 for x = 0.
For a flat model, with K(K + 1)/2 parameters, the likelihood and fully Bayesian probability are
Generalization to a hierarchical model
We can extend the notion of a model M to a hierarchical random graph (HRG) based on a model that successively merges pairs of groups [11]. This original model generates a binary dendrogram T. Each node r in this dendrogram represents the joining of graph vertices L(r) underneath the left sub-tree and vertices R(r) underneath the right sub-tree. With the same Bernoulli probability model (Eq.1) as a building block, e
r
and h
r
are defined as the total number of edges and holes crossing between the left and right sub-trees. We generalize this model for the case of multiple top-level nodes, which merge together into a flat structure using a full block model. We also generalize for tree structures that are not completely branching, yielding tree nodes that collect multiple graph vertices into a single group. Similar to Eq.3, letting M ≡ T, the likelihood L(M) of a hierarchical model T and the corresponding probability P(M) of the graph given the model are
Top-level terms
and
depend on the edges e
rr′
and holes h
rr′
crossing between the top-level groups r and r′, with t
rr′
= e
rr′
+ h
rr′
. For all tree nodes,
and
. For branching nodes (including the top-level nodes), the edges e
r
holes h
r
refer to those crossing between the left and right sub-trees; for non-branching terminals, e
r
and h
r
refer to the edges and holes for vertices within the terminal groups.
Sampling trees with MCMC provides excellent results for predicting missing links by accumulating
values for link probabilities between left and right sub-trees [11]. We have found that extending the MCMC approach to genome-scale networks is computationally burdensome. Approximation methods, such as a Variational Bayes approach [17], can reduce computational costs, but still require a good initial estimate of tree structure. Here we consider agglomerative approaches for finding trees T that optimize the objective function L(M) and its fully Bayesian counterpart P(M).
Agglomerative clustering
Maximum likelihood guide tree
Suppose currently there are K top-level clusters numbered 1 … K within the R total tree nodes. This model, M, has K(K — 1)/2 + R total parameters. Merging two of the top-level nodes (and retaining the structure underneath each) gives a model with (K — 1)(K — 2)/2+ (R+ 1) parameters, a reduction of K — 2 parameters. Without loss of generality suppose we merge clusters 1 and 2 into a new cluster 1′, defining a new model M′. The model likelihood ratio is
There is a subtle but crucial difference between this agglomerative algorithm, which assumes a full block model for the top-level nodes, and the more standard approach with a star-like structure at the top with a single parameter governing the interactions between all pairs of top-level nodes. A starlike model with K top-level and R total nodes has R+1 parameters, and merging two groups increases the number of parameters by 1. The increase in parameters at each step, coupled with a maximum likelihood model, is liable to over-fit the group structure. A further problem is the model likelihood ratio for the star model,
where
and similarly h
b
= t
b
- e
b
count the edges and holes between all pairs of top-level groups before merging 1 and 2, and e12 and h12 count the edges and holes just between groups 1 and 2. Under the star model, any two groups with the same values of e12 and t12 will have identical ratios
. At the initial step, every pair of vertices will have one of two merging scores, depending on whether e12 = 1 or 0. Additional criteria are then required to avoid bad merges at the start of clustering. In contrast,
gathers information from shared patterns of connectivity with other grops. In particular, at the initial step when each group is a single vertex,
, where the number of mismatches is 
Greedy agglomerative algorithm
The likelihood ratio
leads to an agglomerative algorithm that successively merges the two clusters have the largest value.
Initialize top-level clusters as {{v} : v ∈ V}
Initialize K ← V
while K > 1 do
Find top-level clusters i,j with largest 
Add top-level cluster r; L(r) = i and R(r) = j
Remove clusters i and j from the top level
K ← K – 1
end while
We call this method HAC-ML. The time complexity of a naϊve implementation scales as O(V4), but using a priority queue, restricting possible merging pairs to clusters that share at least one common neighbor, and lazy evaluation of λ reduce the complexity to O(EJ log V), where E is the total number of edges and J is the average vertex degree.
Bayesian model selection for top-level and terminal clusters
A natural stopping criteria at the top level is obtained by augmenting
, Eq. 5 with its fully Bayesian equivalent
,
A reasonable stopping criterion is
for the best merge [18]. While there are K(K – 1)/ 2 possible merges, we do not include this factor in the stopping criterion.
Our previous work introduced a similar criterion for collapsing bottom-level clusters comparing a model with separate left and right sub-trees with a model all vertices collected in a single group [17]. Clusters with a single vertex are considered collapsed. During the merging process, if clusters 1 and 2 are selected for merging and are both collapsed, the probability ratio
is calculated, where the subscripts indicate edges and holes within and between groups. The merged cluster is collapsed if
. Clusters of two vertices are always merged because λC = 1. While there are
ways for the reverse process of splitting a cluster into two non-empty groups of sizes n1 and n2, we do not include this factor in the model selection.
Extension to multiple edge types
The HAC-ML algorithm is directly applicable to networks with multiple edge types. Rather than merging the edges into a single superimposed network, each edge type α defines its own likelihood L(α)(M) and probability P(α)(M) for a particular model M. The full likelihood and full probability are then obtained as products over the edge types, L = ∏
α
L(α) and P = ∏
α
P(α).
Performance Evaluation
Data preparation
Experimental evidence codes listed in BioGRID database (http://thebiogrid.org) provide a way to distinguish physical versus genetic interaction pairs. We built a physical network collecting all physically binding or interacting pairs and a genetic network restricted to negative interactions comprising to empirical evidence codes Negative Genetic, Synthetic Growth Defect, Synthetic Haploin-sufficiency, and Synthetic Lethality. We ignored redundant pairs within each type of network such that resulting graphs were undirected and unweighted. We then iteratively removed isolated or degree-1 vertices, as these provide scant information for clustering. For other non-BioGRID genetic interaction datasets we filtered out positively weighted pairs and applied the same iterative removal. In joint-network analysis, we restricted attention to the common intersection of genes.
Other methods
We compared HAC-ML with other deterministic methods: Fast Modularity (CNM; Clauset et al.[14]), Variational Bayes Modularity (VBM; Hofman and Wiggins [19], and Graph Diffusion Kernel (GDK; Qi et al.[20]). CNM is an efficient algorithm that directly optimizes Newman modularity [21]. VBM simplifies network data to one intra- and one inter-community probability distribution. For GDK by discriminating between even-length and odd-length paths, Qi et al.[20] improved link prediction performance, particularly for disassorative (bipartite-like) networks. We used the odd parity kernel with the recommended damping parameter set to 1.0.
Different merging scores
In addition, we also considered agglomerative clustering based on heuristic merging scores: (1) edge density, ρ
e
; (2) combined edge density and shared neighbor density, ρ
e
+ ρ
s
; and (3) decomposed Newman modularity Q from CNM [21]. The edge and shared neighbor densities for merging clusters 1 and 2 are
The summations in ρ
s
(1, 2) runs over all vertices u not in groups 1 or 2, and the logical functions evaluate to 1 and 0. The Newman modularity for merging groups 1 and 2 is
where d
u
and d
v
are vertex degrees and E is the total number of edges. This algorithm is essentially CNM, but retains the hierarchical structure defined by the merge order for link prediction (rather than predicting links based on the cut that maximizes modularity). Replacing
with ρ
e
, ρ
e
+ ρ
s
, and Q yields algorithms HAC-E, HAC-ES, and HAC-Q.
Link prediction
We assessed correctness of a model in the framework of link prediction as presented in Henderson et al.[8]. Starting with a real-world network, training networks are generated by deleting a specified fraction of edges. A test set is defined by the held-out edges and a random choice of an equal number of holes. We then ran all methods on the training data set. The trained group structure provides maximum likelihood estimates for edges within and between clusters (Eq. 9). For VBM and CNM, we estimated edge densities between all pairs of clusters and within all clusters. For hierarchical models, we estimated densities between all left and right clusters at all tree levels. For GDK, each pair’s diffusion was directly used to rank pairs. Finally we assessed precision and recall of pairs in the test set ranked by link probability or GDK score. The counts of true positives (TP), false positives (FP), and false negatives (FN) as function of the number of predictions define the Precision, TP/(TP+FP), and the Recall, TP/(TP+FN). The F-score is the maximum value of harmonic mean of Precision and Recall. This test set definition is suitable for assessment, but overstates practical performance by reducing the number of negative test examples for a sparse network. Note that for large real-world networks, group assignments are generally unknown, making it difficult to assess group assignments directly.
Implementation
Algorithms were implemented in C++ and are available under an open source BSD license as supplementary material and from http://www.baderzone.org.