Our clustered random graph generation method begins with a random graph and iteratively rewires edges to introduce triangles. Network rewiring, also known as edge swapping, is a well-known method for generating networks with desired properties [37, 36, 38]. Two edges are called adjacent if they connect to a common node. Each rewiring is performed on two non-adjacent edges of the graph and consists of removing these two edges and replacing them with another pair of edges. Specifically, a pair of edges (i, j) and (k, l) is replaced with either (i, k) and (j, l), or (i, l) and (j, k) (as illustrated in Figure 1c). This change in the graph leaves the degrees of the participating nodes unchanged, thus maintaining the specified degree sequence. Below we describe a rewiring algorithm that increases the level of clustering in a random graph, while preserving the degree sequence.
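The rewiring move described above can be sketched in a few lines. This is a minimal illustration rather than the ClustRNet implementation: it assumes the graph is stored as a dictionary mapping each node to a set of neighbors, and the function name `swap_edges` is hypothetical.

```python
def swap_edges(adj, i, j, k, l):
    """Replace edges (i, j) and (k, l) with (i, k) and (j, l), in place.

    Returns False and leaves adj untouched if the two edges are adjacent
    or if either new edge already exists.
    """
    if len({i, j, k, l}) < 4:          # adjacent edges share a node
        return False
    if k in adj[i] or l in adj[j]:     # would create a multiple edge
        return False
    adj[i].remove(j); adj[j].remove(i)
    adj[k].remove(l); adj[l].remove(k)
    adj[i].add(k); adj[k].add(i)
    adj[j].add(l); adj[l].add(j)
    return True

# Path 0-1-2-3: swap the two non-adjacent end edges (0, 1) and (2, 3).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
before = {v: len(nbrs) for v, nbrs in adj.items()}
swapped = swap_edges(adj, 0, 1, 2, 3)   # (0,1),(2,3) -> (0,2),(1,3)
after = {v: len(nbrs) for v, nbrs in adj.items()}
```

Every node loses one edge and gains one edge, so `before` and `after` agree, which is the degree-preservation property the method relies on.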
The algorithm we develop below is implemented in Python as ClustRNet. It is based on NetworkX, an open-source Python library available for download at [39], which provides standard graph library functionality (e.g. data structures, input/output, and layouts). The source code for ClustRNet, along with documentation and test network datasets, is available on the web [40]. Our algorithm joins an existing suite of random graph model-based software tools for the analysis of biological networks and the dynamics on them [41, 42].
Measures of Clustering
We begin with a graph G = (V, E) which is undirected and simple. V is the set of vertices of G and E is the set of edges. We let N = |V| and M = |E| denote the number of nodes and edges in G, respectively. The degree of a node i will be denoted $d_i$. The set of degrees for all nodes in the graph makes up the degree sequence, which follows a probability distribution called the degree distribution.
Clustering is the likelihood that two neighbors of a given node are themselves connected. In topological terms, clustering measures the density of triangles in the graph, where a triangle is the existence of the set of edges (i, j), (i, k), (j, k) between any triplet of nodes i, j, k (Figure 1b).
To quantify the local presence of triangles, δ(i) is defined as the number of triangles in which node i participates. Since each triangle consists of three nodes, it is counted thrice when we sum δ(i) for each node in the graph. Thus the total number of triangles in the graph is
$$\delta(G) = \frac{1}{3} \sum_{i \in V} \delta(i).$$
A triple is a set of three nodes, i, j, k, that are connected by edges (i, j) and (i, k), regardless of the existence of the edge (j, k) (Figure 1a). The number of triples of node i is simply
$$\tau(i) = \binom{d_i}{2} = \frac{d_i (d_i - 1)}{2},$$
assuming $d_i \geq 2$. To compute the total number of triples in the graph, τ(G), we sum τ(i) for all i ∈ V.
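As a small worked example of these counts, the sketch below computes δ(i) and τ(i) on a four-node graph. The adjacency-set representation and the function names `triangles_at` and `triples_at` are assumptions made for illustration.

```python
from itertools import combinations

def triangles_at(adj, i):
    """delta(i): number of connected pairs among the neighbors of i."""
    return sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])

def triples_at(adj, i):
    """tau(i) = d_i (d_i - 1) / 2, the number of neighbor pairs of i."""
    d = len(adj[i])
    return d * (d - 1) // 2

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
delta = {i: triangles_at(adj, i) for i in adj}
tau = {i: triples_at(adj, i) for i in adj}
total_triangles = sum(delta.values()) // 3   # each triangle is counted three times
total_triples = sum(tau.values())
```

Here the single triangle is seen once from each of its three corners, so the node-level counts sum to 3 and the graph contains one triangle and five triples.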
The clustering coefficient was introduced by Watts and Strogatz [1] as a local measure of triadic closure. For a node i with $d_i \geq 2$, the clustering coefficient c(i) is the fraction of triples for node i which are closed, and can be measured as c(i) = δ(i)/τ(i). The clustering coefficient of the graph is then given by:
$$C(G) = \frac{1}{N_2} \sum_{i \in V,\; d_i \geq 2} c(i),$$
where $N_2$ is the number of nodes with $d_i \geq 2$, that is, the nodes for which c(i) is defined. Some authors do define the clustering coefficient for all nodes of G [43].
A more global measure of the presence of triangles is called the transitivity of graph G and is defined as:
$$T(G) = \frac{3\,\delta(G)}{\tau(G)}.$$
Although they are often similar, T(G) and C(G) can vary by orders of magnitude [22]. They differ most when the triangles are heterogeneously distributed in the graph.
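The difference between the two measures is easy to see on a small graph. The sketch below is illustrative only: the function names are hypothetical, and returning 0.0 when a measure is undefined is a convention assumed here, not one taken from the text.

```python
from itertools import combinations

def local_counts(adj, i):
    """Return (delta(i), tau(i)) for node i."""
    d = len(adj[i])
    closed = sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])
    return closed, d * (d - 1) // 2

def clustering_coefficient(adj):
    """C(G): mean of delta(i)/tau(i) over nodes with at least two neighbors."""
    ratios = [c / t for c, t in (local_counts(adj, i) for i in adj) if t > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

def transitivity(adj):
    """T(G): sum of delta(i) divided by sum of tau(i), i.e. 3 delta(G) / tau(G)."""
    counts = [local_counts(adj, i) for i in adj]
    triples = sum(t for _, t in counts)
    return sum(c for c, _ in counts) / triples if triples else 0.0

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
C = clustering_coefficient(adj)
T = transitivity(adj)
```

On this graph C = 7/9 while T = 3/5: the pendant node lowers c(0) to 1/3, but node 0 carries three of the five triples and therefore weighs more heavily in T than in the node average C.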
These traditional measures of clustering are degree-dependent and thus can be biased by the degree sequence of the network. They take the maximum number of possible triangles for a given node i to be its number of triples, τ(i). For a node which is connected to only low degree neighbors, however, the maximum number of possible triangles may be much smaller than τ(i). To account for this, a new measure for clustering was introduced in [22] that calculates triadic closure as a function of degree and neighbor degree. Specifically, the Soffer-Vázquez clustering coefficient ($\tilde{C}$) and transitivity ($\tilde{T}$) are given by:
$$\tilde{C}(G) = \frac{1}{N_\omega} \sum_{i \in V,\; \omega(i) > 0} \frac{\delta(i)}{\omega(i)}, \qquad \tilde{T}(G) = \frac{\sum_{i \in V} \delta(i)}{\sum_{i \in V} \omega(i)} = \frac{3\,\delta(G)}{\omega(G)},$$
where ω(i) measures the number of possible triangles for node i, and $N_\omega$ is the number of nodes in G for which ω(i) > 0. We note that $\tilde{C}$ and $\tilde{T}$ are undefined if $\omega(G) = \sum_i \omega(i) = 0$. ω(i) is computed by counting the maximum number of edges that can be drawn among the $d_i$ neighbors of a node i, given the degree sequence of i's neighbors; this value is often smaller than $\tau(i)$ [22]. For example, consider a star network of five nodes, where four nodes have degree 1 and one node has degree 4. Although the total number of triples is τ(G) = 6, the number of possible triangles is ω(G) = 0 because the degree one nodes preclude their formation. The computation of ω(i) must be done algorithmically and is not possible in closed form. (From here on, we refer to $\tilde{C}$ as the SV-clustering coefficient and to $\tilde{T}$ as the SV-transitivity.)
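To make the role of neighbor degrees concrete, the following sketch estimates ω(i) with a greedy, Havel-Hakimi-style pairing. It is an assumption-laden illustration and not necessarily the procedure of [22]: each neighbor j of i is given min($d_j$ − 1, $d_i$ − 1) free edge ends (one end is used by the edge to i), and the neighbor with the most free ends is repeatedly joined to the neighbors with the next most. The name `omega_greedy` is hypothetical.

```python
def omega_greedy(adj, i):
    """Greedy estimate of omega(i): edges that can be placed among the
    neighbors of i without exceeding any neighbor's degree."""
    d_i = len(adj[i])
    # Free edge ends of each neighbor once its edge to i is accounted for.
    free = sorted((min(len(adj[j]) - 1, d_i - 1) for j in adj[i]), reverse=True)
    edges = 0
    while free and free[0] > 0:
        r = free.pop(0)                       # neighbor with most free ends
        for k in range(min(r, len(free))):    # join it to the next-largest ones
            if free[k] > 0:
                free[k] -= 1
                edges += 1
        free.sort(reverse=True)
    return edges

# Star network from the text: node 0 has degree 4, the leaves have degree 1.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
omega_star = omega_greedy(star, 0)

# Complete graph on four nodes: every neighbor can reach every other.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
omega_k4 = omega_greedy(k4, 0)
```

For the star, every leaf has zero free edge ends, so the estimate is 0 even though τ(0) = 6, matching the example above. For the complete graph the estimate equals τ(0) = 3.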
Generative Model
Here we develop a model to generate a simple, connected random graph with a specified degree sequence and a desired level of clustering. Generating random graphs uniformly from the set of simple, connected graphs with a prescribed degree sequence is a well-studied problem with algorithmic solutions [37]. One of the simplest and most popular of these generative algorithms was suggested by Molloy and Reed and is known as the configuration model [27]. Given a specific realizable degree sequence [44], $\{d_i\}$, this method assigns $d_j$ half-edges to each node j, and then randomly connects pairs of half-edges to create edges until there are no half-edges left. (A realizable degree sequence is one which satisfies the Handshake Theorem (the requirement that the sum of the degrees be even) and the Erdős-Gallai criterion (which requires that for each subset of the k highest degree nodes, the degrees of these nodes can be "absorbed" within the subset and the remaining degrees).) Although the model sometimes produces graphs that are not simple or connected, this can be remedied by subsequently removing multiple edges and self loops from the constructed graph and keeping only the largest connected component [37]. Our method begins by using this approach to generate a simple, connected random graph G, with a specific realizable degree sequence D. We then introduce triangles into G using a Markov chain process without disturbing the degree sequence until we achieve the desired level of clustering, as follows.
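The initialization step can be sketched as follows, under stated assumptions: nodes are labeled 0..N−1, the graph is a dictionary of neighbor sets, and the function names are hypothetical. The realizability test is the standard Erdős-Gallai inequality. Note that dropping self loops and multiple edges, and discarding small components, can leave some nodes with a degree below the requested one.

```python
import random

def is_realizable(degrees):
    """Handshake parity plus the Erdos-Gallai inequalities."""
    d = sorted(degrees, reverse=True)
    if sum(d) % 2:
        return False
    for k in range(1, len(d) + 1):
        if sum(d[:k]) > k * (k - 1) + sum(min(x, k) for x in d[k:]):
            return False
    return True

def configuration_model(degrees, rng):
    """Pair half-edges at random; self loops and repeated edges are dropped."""
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    adj = {v: set() for v in range(len(degrees))}
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:                 # skip self loops; sets absorb multi-edges
            adj[u].add(v)
            adj[v].add(u)
    return adj

def largest_component(adj):
    """Return the subgraph induced by the largest connected component."""
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:               # depth-first traversal from s
            for w in adj[stack.pop()]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {v: adj[v] & best for v in best}

degrees = [3, 3, 2, 2, 2, 1, 1]
g = largest_component(configuration_model(degrees, random.Random(0)))
```

The sequence [3, 3, 1, 1] is a useful negative case: its sum is even, yet the two degree-3 nodes cannot place all their edges, so the Erdős-Gallai test rejects it.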
Let $\mathcal{G}_D$ be the set of all simple, connected graphs with degree sequence D. If $G_1, G_2, \ldots, G_n$ are the graphs of $\mathcal{G}_D$, then we let $X_1, X_2, \ldots, X_n$ be the states of the Markov chain, P, where $X_i$ represents the state in which our graph G = $G_i$. The states $X_i$ and $X_{i+1}$ are connected in the Markov chain if $G_i$ can be changed to $G_{i+1}$ with the rewiring of one pair of edges. The state space of the Markov chain P is connected because there exists a path from $X_i$ to $X_j$ (for any pair i, j) by one or more rewiring moves that leave the degree sequence unchanged [45].
Our clustered graph generation algorithm involves starting with the random graph G (generated with the configuration model above) and transitioning from the state corresponding to G ($X_G$) to other states of P until a halting condition is reached. A transition from one state of the Markov chain to another only occurs when the algorithm makes an edge rewiring that both increases the clustering of the graph and leaves the graph connected. Since a rewiring does not alter the degree sequence of the graph, the rewired graph is still in $\mathcal{G}_D$. The transition probabilities of the Markov chain for a pair of connected states, $X_i$ to $X_j$, satisfy:
$$P(X_i \to X_j) \;\begin{cases} > 0 & \text{if } \mathrm{clust}(G_j) > \mathrm{clust}(G_i), \\ = 0 & \text{otherwise,} \end{cases}$$
where clust($G_x$) is a clustering measure for graph $G_x$, which can be replaced by any of the measures introduced in the Measures of Clustering section. The algorithm continues searching for a feasible rewiring (one that increases the clustering and does not disconnect the graph) until one is found. If a feasible move is not found, a transition is not made and the process remains in the current state.
The Markov chain above is finite and aperiodic, but not irreducible as the process can never transition to a state in which the graph has lower clustering. It does, however, have an absorbing state, X*, in which the transitivity of G* is greater than or equal to the desired transitivity or is the maximum possible transitivity given the particular degree sequence and connectivity constraints.
Algorithm
To generate clustered graphs, we simulate the above Markov chain by iteratively applying rewirings that increase graph clustering. Each rewiring takes a set of five nodes {x, y1, y2, z1, z2}, connected by four edges {(x, y1), (x, y2), (y1, z1), (y2, z2)}, and swaps the outer edges to give {(x, y1), (x, y2), (y1, y2), (z1, z2)} (illustrated in Figure 1d). This introduces a triangle among nodes {x, y1, y2} without perturbing the degree sequence. The algorithm proceeds as follows:
Input: A realizable degree sequence {d_i} and a desired clustering value, target.
Initialization: Generate a random graph G with degree sequence {d_i} (using the configuration model), and measure the clustering of G, clust(G).
while clust(G) < target do
    1. Uniformly select a random node, x, from the set of all nodes of G such that d_x > 1.
    2. Uniformly select two random neighbors, y1 and y2, of x such that d_y1 > 1, d_y2 > 1 and y1 ≠ y2.
    3. Uniformly select a random neighbor, z1, of y1 and a random neighbor, z2, of y2 such that z1 ≠ x, z2 ≠ x and z1 ≠ z2.
    4. G_cand := G, where G_cand is the candidate graph to which the transition may be made.
    5. if (y1, y2) and (z1, z2) do not exist then
           Rewire two edges of G_cand: delete (y1, z1) and (y2, z2), add (y1, y2) and (z1, z2).
       end
    6. Update the value of clust(G_cand) by measuring δ(i) (and ω(i) if relevant) for the nodes involved in the rewiring and their neighbors.
    7. if clust(G_cand) > clust(G) and G_cand is connected then
           G := G_cand
       end
end
Output: A random graph, G, with degree sequence {d_i} and clust(G) ≥ target.
The algorithm terminates when the graph attains at least the desired level of clustering. Because the desired clustering cannot always be reached, ClustRNet also places a threshold on the number of unsuccessful rewiring attempts; when this threshold is reached, the algorithm returns the graph with the maximum clustering achieved. Due to the random restarts made at every step (a new node x is drawn each time), the algorithm is prevented from getting trapped in local optima.
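The steps above, together with the stopping threshold, can be sketched as a short self-contained program. This is an illustration under stated assumptions, not the ClustRNet source: the graph is a dictionary of neighbor sets with sortable node labels, transitivity T is used as clust(G) and is recomputed from scratch rather than updated locally, and `add_clustering`, `set_edge` and `max_failures` (a cap on consecutive unsuccessful attempts) are hypothetical names.

```python
import random
from itertools import combinations

def transitivity(adj):
    closed = sum(1 for i in adj for j, k in combinations(adj[i], 2) if k in adj[j])
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triples if triples else 0.0

def is_connected(adj):
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def set_edge(adj, u, v, present):
    """Add (present=True) or delete (present=False) the edge (u, v)."""
    if present:
        adj[u].add(v); adj[v].add(u)
    else:
        adj[u].discard(v); adj[v].discard(u)

def add_clustering(adj, target, rng, max_failures=1000):
    """Rewire adj in place; stop at the target or after too many failures."""
    current, failures = transitivity(adj), 0
    centres = [v for v in adj if len(adj[v]) > 1]   # degrees never change
    while centres and current < target and failures < max_failures:
        failures += 1
        x = rng.choice(centres)                               # step 1
        ys = sorted(y for y in adj[x] if len(adj[y]) > 1)
        if len(ys) < 2:
            continue
        y1, y2 = rng.sample(ys, 2)                            # step 2
        z1 = rng.choice(sorted(adj[y1] - {x}))                # step 3
        z2 = rng.choice(sorted(adj[y2] - {x}))
        if z1 == z2 or y2 in adj[y1] or z2 in adj[z1]:        # step 5 guard
            continue
        for u, v, p in ((y1, z1, False), (y2, z2, False),
                        (y1, y2, True), (z1, z2, True)):
            set_edge(adj, u, v, p)
        new = transitivity(adj)                               # step 6
        if new > current and is_connected(adj):               # step 7
            current, failures = new, 0
        else:                                                 # undo the rewiring
            for u, v, p in ((y1, y2, False), (z1, z2, False),
                            (y1, z1, True), (y2, z2, True)):
                set_edge(adj, u, v, p)
    return current

# Triangle-free starting graph: 12 nodes on a ring with chords of length 3.
n = 12
adj = {i: {(i + 1) % n, (i - 1) % n, (i + 3) % n, (i - 3) % n} for i in range(n)}
degrees_before = {v: len(adj[v]) for v in adj}
reached = add_clustering(adj, target=0.3, rng=random.Random(1))
```

Rejected candidates are undone in place instead of copying the graph at every step, which has the same effect as the candidate graph in the pseudocode while avoiding an O(M) copy per attempt.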
The algorithm is designed to increase clustering while preserving both the degree sequence and connectedness of the graph. However, there are some cases where the desired clustering can only be reached by disconnecting the graph; and thus ClustRNet provides the option of removing the connectivity constraint (see Additional file 1, Figure S2).
Choice of Clustering Measure
The algorithm is defined independently of the choice of clustering measure. The term clust(G) in the algorithm above can be replaced by any clustering measure described in the Measures of Clustering section. ClustRNet includes all four of these clustering measures ($C$, $\tilde{C}$, $T$, $\tilde{T}$).
The algorithm output varies with the choice of clustering measure. The clustering coefficient is a local measure, and thus $C$ and $\tilde{C}$ yield networks that are only locally optimized for the desired level of clustering. The algorithm may have difficulty attaining target clustering values when using the absolute clustering measures (C or T) because of joint degree constraints (the degrees of adjacent nodes) on the possible numbers of triangles, as with the star network example presented in the Measures of Clustering section. The Soffer-Vázquez clustering measures, which explicitly consider joint degree constraints, provide a way around this difficulty [22]. Although the rewiring in our algorithm changes the joint degree distribution (and thus the degree correlations) of the graph, ω(G) is not altered significantly during network generation (as shown in Additional file 1, Figure S3). Thus, when using $\tilde{C}$ or $\tilde{T}$, clustering is increased primarily by the addition of triangles (that is, by increasing δ(G)) rather than by decreasing ω(G).
Types of Graph Changes
As shown in Figure 2, there are six types of triangles that can be added or removed for every pair of edges that are rewired. As illustrated in Figure 1d, these additions and removals can occur in combination.
- Type A: The addition of the edge (y1, y2) guarantees the addition of one triangle, (x, y1, y2), in every rewiring event.
- Type B: The addition of the edge (y1, y2) could create new triangles with shared neighbors of y1 and y2.
- Type C: The addition of the edge (z1, z2) could add a triangle if the edges (x, z1) and (x, z2) both exist.
- Type D: The addition of the edge (z1, z2) could create new triangles with shared neighbors of z1 and z2.
- Type E: The removal of edges (y1, z1) and (y2, z2) removes one triangle for each of the edges (x, z1) and (x, z2) that exists.
- Type F: The removal of the edges (y1, z1) and (y2, z2) could lead to the removal of existing triangles with shared neighbors of y1 and z1 or of y2 and z2.
We note that although the type A addition is a special case of type B, the type C addition is a special case of type D, and the type E removals are a special case of type F, we distinguish them because they have different probabilities of occurrence. Our look-ahead strategy only allows rewiring moves when the total number of Type E and F losses is fewer than the total number of Type A, B, C, and D gains.
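One way to implement such a look-ahead is to count shared neighbors of the endpoints of each removed and added edge, since every triangle through an edge (u, v) corresponds to one common neighbor of u and v. The sketch below is an assumption about how this could be done, not a description of ClustRNet; `triangle_delta` is a hypothetical name.

```python
def triangle_delta(adj, y1, z1, y2, z2):
    """Net change in the number of triangles if (y1, z1) and (y2, z2)
    are replaced by (y1, y2) and (z1, z2). adj is left unmodified."""
    nb = {v: set(adj[v]) for v in (y1, z1, y2, z2)}   # private copies

    # Losses (types E and F): triangles through each removed edge.
    lost = len(nb[y1] & nb[z1])
    nb[y1].discard(z1); nb[z1].discard(y1)
    lost += len(nb[y2] & nb[z2])
    nb[y2].discard(z2); nb[z2].discard(y2)

    # Gains (types A to D): triangles through each added edge.
    gained = len(nb[y1] & nb[y2])
    nb[y1].add(y2); nb[y2].add(y1)
    gained += len(nb[z1] & nb[z2])
    return gained - lost

# x=0, y1=1, y2=2, z1=3, z2=4: the bare five-node motif gains one triangle.
motif = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2}}
gain_plain = triangle_delta(motif, 1, 3, 2, 4)

# With the extra edge (x, z1), the type A gain is cancelled by a type E loss.
with_xz1 = {0: {1, 2, 3}, 1: {0, 3}, 2: {0, 4}, 3: {0, 1}, 4: {2}}
gain_cancelled = triangle_delta(with_xz1, 1, 3, 2, 4)
```

Only the four endpoint neighborhoods are copied, so the cost of a look-ahead is bounded by the degrees of y1, y2, z1 and z2 rather than by the size of the graph. A move would be accepted under the rule above only when the returned value is positive.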
Computational Complexity
Like many heuristic search methods, the algorithm we propose can be computationally expensive. The method outlined in Section 2.2 requires O(M) steps to generate a connected graph, and up to O(M) steps to randomize the graph, where M is the number of edges in the graph. At each step of randomization, we test that the graph remains connected (an O(M) operation), resulting in an overall $O(M^2)$ random network generation process. A naive computation of the transitivity/clustering coefficient requires checking every node for the existence of edges between every pair of neighbors of the node. This step requires $O(N d_{max}^2)$ operations, where N is the number of nodes and $d_{max}$ is the maximum degree of any node in the graph. The most expensive step of our algorithm is the introduction of triangles via rewiring. A single rewiring step requires O(M) operations for switching edges, checking for connectivity and updating the clustering measure. Although we cannot analytically calculate the number of attempted rewiring steps required to reach the desired transitivity, we have found it empirically to be O(M). Thus, the average complexity of the clustered network algorithm presented here is $O(M^2)$. This complexity has been computed for the most naive versions of our algorithms, and more efficient implementations may improve the complexity greatly. For example, we might improve efficiency by performing connectivity tests once every x rewirings (for some number x) rather than during every rewiring, as proposed in [46].
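The batched connectivity test mentioned above might look like the following sketch. It is a simplified illustration and not the procedure of [46]: a batch of proposed edge swaps is applied, connectivity is tested once, and the whole batch is undone if the graph has become disconnected. The function names and the `batch_size` parameter are hypothetical.

```python
def is_connected(adj):
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def swap(adj, i, j, k, l):
    """Replace (i, j), (k, l) with (i, k), (j, l); return True if applied."""
    if len({i, j, k, l}) < 4 or j not in adj[i] or l not in adj[k]:
        return False
    if k in adj[i] or l in adj[j]:
        return False
    for a, b in ((i, j), (k, l)):
        adj[a].remove(b); adj[b].remove(a)
    for a, b in ((i, k), (j, l)):
        adj[a].add(b); adj[b].add(a)
    return True

def rewire_in_batches(adj, proposals, batch_size):
    """Apply swaps batch by batch, with one connectivity test per batch."""
    accepted = 0
    for start in range(0, len(proposals), batch_size):
        applied = [p for p in proposals[start:start + batch_size] if swap(adj, *p)]
        if is_connected(adj):
            accepted += len(applied)
        else:
            # Undo in reverse order: swapping (i, k), (j, l) restores (i, j), (k, l).
            for i, j, k, l in reversed(applied):
                swap(adj, i, k, j, l)
    return accepted

# Six-node ring; this swap would split it into two triangles, so it is undone.
ring = {v: {(v - 1) % 6, (v + 1) % 6} for v in range(6)}
rejected_count = rewire_in_batches(ring, [(0, 1, 4, 3)], batch_size=1)
```

The trade-off is that a larger batch saves connectivity tests but discards more work whenever a single swap in the batch disconnects the graph.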