Exploring community structure in biological networks with random graphs
 Pratha Sah^{1},
 Lisa O Singh^{2},
 Aaron Clauset^{3, 4, 5} and
 Shweta Bansal^{1, 6}
DOI: 10.1186/1471-2105-15-220
© Sah et al.; licensee BioMed Central Ltd. 2014
Received: 18 December 2013
Accepted: 20 May 2014
Published: 25 June 2014
Abstract
Background
Community structure is ubiquitous in biological networks. There has been an increased interest in unraveling the community structure of biological systems as it may provide important insights into a system’s functional components and the impact of local structures on dynamics at a global scale. Choosing an appropriate community detection algorithm to identify the community structure in an empirical network can be difficult, however, as the many algorithms available are based on a variety of cost functions and are difficult to validate. Even when community structure is identified in an empirical system, disentangling the effect of community structure from other network properties such as clustering coefficient and assortativity can be a challenge.
Results
Here, we develop a generative model to produce undirected, simple, connected graphs with specified degrees and a specified pattern of communities, while maintaining a graph structure that is as random as possible. Additionally, we demonstrate two important applications of our model: (a) to generate networks that can be used to benchmark existing and new algorithms for detecting communities in biological networks; and (b) to generate null models to serve as random controls when investigating the impact of complex network features beyond the byproduct of degree and modularity in empirical biological networks.
Conclusion
Our model allows for the systematic study of the presence of community structure and its impact on network function and dynamics. This process is a crucial step in unraveling the functional consequences of the structural properties of biological systems and uncovering the mechanisms that drive these systems.
Keywords
Biological networks; Community structure; Random graphs; Modularity; Benchmark graphs
Background
Network analysis and modeling is a rapidly growing area that is advancing our understanding of biological processes. Networks are mathematical representations of the interactions among the components of a system. Nodes in a biological network usually represent biological units of interest such as genes, proteins, individuals, or species. Edges indicate interactions between nodes such as regulatory interactions, gene flow, social interactions, or infectious contacts [1]. A basic model for biological networks assumes random mixing between nodes of the network. The network patterns in real biological populations, however, are typically more heterogeneous than assumed by these simple models [2]. For instance, biological networks often exhibit properties such as degree heterogeneity, assortative mixing, non-trivial clustering coefficients, and community structure (see review by Proulx et al. [1]). Of particular interest is community structure, which reflects the presence of large groups of nodes that are typically highly connected internally but only loosely connected to other groups [3, 4]. This pattern of large and relatively dense subgraphs is called assortative community structure. In empirical networks, these groups, also called modules or communities, often correspond well with experimentally known functional clusters within the overall system. Thus, community detection, by examining the patterns of interactions among the parts of a biological system, can help identify functional groups automatically, without prior knowledge of the system’s processes.
Although community structure is believed to be a central organizational pattern in biological networks such as metabolic [5], protein [6, 7], genetic [8], food-web [9, 10] and pollination networks [11], a detailed understanding of its relationship with other network topological properties is still limited. In fact, the task of clearly identifying the true community structure within an empirical network is complicated by a multiplicity of community detection algorithms, multiple and conflicting definitions of communities, inconsistent outcomes from different approaches, and a relatively small number of networks for which ground truth is known. Although node attributes in empirical networks (e.g., habitat type in food webs) are sometimes used to evaluate the accuracy of community detection methods [12], these results are generally of ambiguous value, as the failure to recover communities that correlate with some node attribute may simply indicate that the true features driving the network’s structure are unobserved, not that the identified communities are incorrect.
A more straightforward method of exploring the structural and functional role of a network property is to generate graphs which are random with respect to other properties except the one of interest. For example, network properties such as degree distribution, assortativity and clustering coefficient have been studied using the configuration model [13], and models for generating random graphs with tunable structural features [2, 14, 15]. These graphs serve to identify the network measures that assume their empirical values in a particular network due to the particular network property of interest. In this work, we propose a model for generating simple, connected random networks that have a specified degree distribution and level of community structure.
Random graphs with tunable strength of community structure can have several purposes such as: (1) serving as benchmarks to test the performance of community detection algorithms; (2) serving as null models for empirical networks to investigate the combined effect of the observed degrees and the latent community structure on the network properties; (3) serving as proxy networks for modeling network dynamics in the absence of empirical network data; and (4) allowing for the systematic study of the impact of community structure on the dynamics that may flow on a network. Among these, the use of random graphs with tunable strength of community structure to serve as benchmarks has received the most attention and several such models have been proposed [16–21]. A few studies have also looked at the role of community structure in the flow of disease through contact networks [22–25]. However, the use of modular random graphs, which can be defined as random graphs that have a higher strength of community structure than what is expected at random, is still relatively unexplored in other applications.
Previous work
In 2002, Girvan and Newman proposed a simple toy model for generating random networks with a specific configuration and strength of community structure [3]. This model assumes a fixed number of modules of equal size, in which each node in each module has the same degree. In this way, each module is an Erdős-Rényi random graph. To produce modular structure, different but fixed probabilities are used to produce edges within or between modules. Although this toy model has been widely used to evaluate the accuracy of community detection algorithms, it has limited relevance to real-world networks, which are generally both larger and much more heterogeneous. Lancichinetti et al. [16] introduced a generalization of the Girvan-Newman model that better incorporates some of these features, e.g., by including heterogeneity in both degree and community size. However, this model assumes that degrees are always distributed in a particular way (like a power law [26]), which is also unrealistic. (A similar model by Bagrow [17] generates modular networks with power-law degree distribution and constant community size.)
Yan et al. [23] used a preferential attachment model to grow scale-free networks comprised of communities of nodes whose degrees follow a power-law distribution. Models for special graph types such as hierarchical networks [18], bipartite networks [21], and networks with overlapping modules [20] have also been proposed. These models also make strong assumptions about the degree or community size distributions, which may not be realistic for comparison with real biological networks. A recently proposed model [19] does generate networks with a broad range of degree distributions, modularity and community sizes, but its parameters have an unclear relationship with desired properties (such as degree distribution and modularity), making it difficult to use in practice. Thus, while these models may be sufficient for comparative evaluation of community detection algorithms, they are of limited value for understanding their performance and output when applied to real-world networks.
An alternative approach comes from probabilistic models, of which there are two popular classes. Exponential random graph models (ERGMs) have a long history of use in social network analysis, and can generate an ensemble of networks that contain certain frequencies of local graph features, including heterogeneous degrees, triangles, and 4-cycles [27]. However, many classes of ERGMs exhibit pathological behavior when parameterized with triangles or higher-order structures [28], which severely limits their utility. Stochastic block models (SBMs) are more promising, but require a large number of parameters to be chosen before a graph can be generated. In this approach, the probability of each link depends only on the community labels of its endpoints. Thus, to generate a network, we must specify the number of communities K, their sizes (in the form of a labeling of the vertices), and the $\binom{K}{2}$ (in the undirected case) group-pair probabilities. The result is a random graph with specified community sizes, where each community is an Erdős-Rényi random graph with a specified internal density, and each pair of communities is a random bipartite graph with specified density. The degree distribution of these networks is a mixture of Poisson distributions, which can be unrealistic. A recent generalization of the SBM due to Karrer and Newman [29] allows the specification of the degree sequence, which circumvents this limitation but introduces another set of parameters to be chosen. Although stochastic block models can in principle be used to generate synthetic networks, they are more commonly used within an inferential framework in which community structure is recovered by estimating the various parameters directly from a network.
As a result, the practical use of the SBM as a null model, either for general benchmarking of community detection algorithms or for understanding the structure of biological networks, remains largely unexplored, and we lack clear answers as to how best to sample appropriately from its large parameter space in these contexts. The SBM also does not provide a simple measure of the level of modularity in a network’s large-scale structure, which makes its structure more difficult to interpret. The SBM is a promising model for many tasks, and adapting it to the questions we study here remains an interesting avenue for future work.
Our approach
Here, we develop and implement a simple simulation model for generating modular random graphs using only a small number of intuitive and interpretable parameters. Our model can generate graphs over a broad range of distributions of network degree and community size. The generated graphs can range from very small (<10^{2}) to large (> 10^{5}) network sizes and can be composed of a variable number of communities. In Methods below, we introduce our algorithm for generating modular random graphs. In Results and discussion, we consider the performance of our algorithm and structural features of our generated graphs to show that properties such as degree assortativity, clustering, and path length remain unchanged for increasing modularity. We next demonstrate the applicability of the generated modular graphs to test the accuracy of extant community detection algorithms. The accuracy of community detection algorithms depends on several network properties such as the network mean degree and strength of community structure, which is evident in our analysis. Finally, using a few empirical biological networks, we demonstrate that our model can be used to generate corresponding null modular graphs under two different models of randomization. We conclude the paper with some thoughts about other applications and present some future directions.
Methods
We present a model that generates undirected, simple, connected graphs with prescribed degree sequences and a specified level of community structure, while maintaining a graph structure that is otherwise as random (uncorrelated) as possible. Below, we introduce some notation and a metric for measuring community structure, followed by a description of our model and the steps of the algorithm used to generate graphs with this specified structure.
Measure of community structure
We begin with a graph G=(V,E) that is comprised of a set of vertices or nodes V(G)={v_{1},…,v_{ n }} and a set of edges E(G)={e_{1},…,e_{ m }}. G is undirected and simple (i.e. a maximum of one edge is allowed between a pair of distinct nodes, and no “self” edges are allowed). The number of nodes and edges in G is |V(G)|=n and |E(G)|=m, respectively. The neighborhood of a node v_{ i } is the set of nodes v_{ i } is connected to, N(v_{ i })={v_{ j } | (v_{ i },v_{ j })∈E, v_{ i }≠v_{ j }, 1≤j≤n}. The degree of a node v_{ i }, or the size of the neighborhood connected to v_{ i }, is denoted as d(v_{ i })=|N(v_{ i })|. A degree sequence, D, specifies the set of all node degrees as tuples, such that D={(v_{ i },d(v_{ i }))}, and follows a probability distribution called the degree distribution with mean $\overline{d}$.
For a partition of G into K modules C_{1},…,C_{ K }, the strength of community structure is measured by the modularity, Q:
$$Q=\sum_{k=1}^{K}\left({e}_{kk}-{a}_{k}^{2}\right)\qquad (1)$$
where ${e}_{kk}=\frac{|E({C}_{k})|}{|E(G)|}$ denotes the proportion of all edges that are within module C_{ k }, and ${a}_{k}=\left[{\sum}_{{v}_{i}\in {C}_{k}}d({v}_{i})\right]/2|E(G)|$ represents the fraction of all edges that touch nodes in community C_{ k }. When Q=0, the density of within-community edges is equivalent to what is expected when edges are distributed at random, conditioned on the given degree sequence. Values approaching Q=1, which is the maximum possible value of Q, indicate networks with strong community structure. Typically, values for empirical network modularity fall in the range from about 0.3 to 0.7 [30]. However, Good et al. [31] show that, in theory, the maximum Q value depends on the network size and number of modules.
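As an illustration of the quantities e_{kk} and a_{k} defined above, the following minimal Python sketch computes Q as the sum over modules of e_{kk} minus a_{k} squared. The function name `modularity`, the edge-list input and the `module_of` mapping are assumptions made for this example; they are not taken from the authors' implementation.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """Return Q = sum_k (e_kk - a_k^2) for an undirected simple graph.

    edges: list of (u, v) pairs, one entry per undirected edge.
    module_of: dict mapping each node to its module label (hypothetical input format).
    """
    m = len(edges)
    within = defaultdict(int)      # number of edges inside each module, |E(C_k)|
    degree_sum = defaultdict(int)  # sum of node degrees in each module
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            within[cu] += 1
    return sum(within[k] / m - (degree_sum[k] / (2 * m)) ** 2 for k in degree_sum)


# Two triangles joined by a single edge, partitioned into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
module_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, module_of)
```

For this toy graph each module holds 3 of the 7 edges and half of the total degree, so Q = 2(3/7 − 1/4) = 5/14.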
In order to generate a graph with a specified strength of community structure, Q, equation (1) represents our first constraint, which we rewrite below in terms of the expected value of Q (full derivation in Additional file 1):
$$E[Q]=\sum_{k=1}^{K}\left[\frac{\overline{{d}_{w}}\,{s}_{k}}{2m}-{\left(\frac{\overline{d}\,{s}_{k}}{2m}\right)}^{2}\right]\qquad (2)$$
where $\overline{{d}_{w}}$ and $\overline{d}$ are the average within-degree and average degree, respectively, and s_{ k }=|V(C_{ k })| is the module size for module k. Thus, equation (2) allows us to specify $\overline{{d}_{w}}$ in terms of Q, $\overline{d}$, m and s_{ k }, assuming that the module-specific average degree and average within-degree are equal to $\overline{d}$ and $\overline{{d}_{w}}$, respectively. When ${s}_{k}=\overline{s}$ for all k, E[Q] reduces to $\frac{\overline{{d}_{w}}}{\overline{d}}-\frac{1}{K}$.
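The relationship between the target modularity and the mean within-degree can be sketched numerically. The code below assumes that each module's e_{kk} and a_{k} are replaced by their expected values, the mean within-degree times s_{k} over 2m and the mean degree times s_{k} over 2m, with 2m equal to n times the mean degree. The function names are hypothetical and the derivation in Additional file 1 remains the authoritative source.

```python
def expected_modularity(mean_within_degree, mean_degree, module_sizes):
    """E[Q] when every module has the same mean degree and mean within-degree."""
    n = sum(module_sizes)
    two_m = n * mean_degree  # 2m equals the sum of all degrees
    return sum(
        mean_within_degree * s / two_m - (mean_degree * s / two_m) ** 2
        for s in module_sizes
    )


def mean_within_degree_for(q_target, mean_degree, module_sizes):
    """Invert the expression above to get the mean within-degree for a target Q."""
    n = sum(module_sizes)
    size_term = sum((s / n) ** 2 for s in module_sizes)
    return mean_degree * (q_target + size_term)


# Four equal modules of 50 nodes, mean degree 10, target Q = 0.5.
sizes = [50, 50, 50, 50]
d_w = mean_within_degree_for(0.5, 10, sizes)
q_check = expected_modularity(d_w, 10, sizes)
```

With equal module sizes the size term is 1/K, so the example gives a mean within-degree of 10 × (0.5 + 0.25) = 7.5, and substituting it back recovers Q = 0.5.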
Algorithm
1. Assign the n network nodes to K modules based on the size distribution P(s).
2. Assign degrees, d(v_{ i }), to each node v_{ i } based on p_{ d } and $\overline{d}$. We next assign within-degrees, d_{ w }(v_{ i }), to each node v_{ i } by assuming that the within-degrees follow the same distribution as p_{ d } with mean $\overline{{d}_{w}}$, which is estimated based on equation (2) above (Figure 1a).
3. Connect between-edges based on a modified Havel-Hakimi model and randomize them (Figure 1b).
4. Connect within-edges based on the Havel-Hakimi model and randomize them (Figure 1c and 1d).
The generated graph then has a degree distribution that follows p_{ d } with mean $\overline{d}$, K modules with sizes distributed as P(s), and a modularity Q≈E[Q]. We set an arbitrary tolerance of ε=0.01, such that the achieved modularity is Q=E[Q]±ε. The graph is also as random as possible given the constraints of the degree and community structure, and contains no self loops (edges connecting a node to itself), multi-edges (multiple edges between a pair of nodes), isolated nodes (nodes with no edges), or disconnected components. Below, we elaborate on each of the steps of this algorithm.
Assigning nodes to modules
We sample module sizes, s_{ k }, for each of the K modules from the specified module size distribution, P(s) so that $\sum {s}_{k}=n$. The n nodes are then arbitrarily (without loss of generality) assigned to each module to satisfy the sampled module size sequence.
Assigning degrees
Based on the degree distribution specified, a degree sequence is sampled from the distribution to generate a degree, d(v_{ i }), for each node v_{ i } (unless a degree sequence is already specified in the input). To ensure that the degree sequence attains the expected mean of the distribution (within a specified threshold) and is realizable, we verify the Handshake Theorem (the requirement that the sum of the degrees be even) and the Erdős-Gallai criterion (which requires that for each subset of the k highest degree nodes, the degrees of these nodes can be “absorbed” within the subset and the remaining degrees) [32], and that no node is assigned a degree of zero.
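The realizability checks described above can be sketched as follows. This is a minimal illustration of the Handshake Theorem and the Erdős-Gallai inequalities in their standard form, plus the no-zero-degree requirement; the function names are assumptions for this example and the check on the mean of the sampled sequence is omitted.

```python
def is_graphical(degrees):
    """Handshake Theorem plus Erdős-Gallai test for a simple-graph degree sequence."""
    seq = sorted(degrees, reverse=True)
    if any(d < 0 for d in seq) or sum(seq) % 2 != 0:
        return False  # negative degree, or odd degree sum
    n = len(seq)
    for k in range(1, n + 1):
        # The k largest degrees must be absorbed among themselves (k(k-1))
        # and by the remaining nodes (at most min(d, k) each).
        left = sum(seq[:k])
        right = k * (k - 1) + sum(min(d, k) for d in seq[k:])
        if left > right:
            return False
    return True


def is_valid_degree_sequence(degrees):
    """Graphical and free of zero-degree (isolated) nodes."""
    return min(degrees) >= 1 and is_graphical(degrees)
```

A sampled sequence that fails either test would be rejected and resampled, in line with the rejection sampling mentioned later in this section.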

Condition 1: d(v_{ i })≥d_{ w }(v_{ i }) for all v_{ i }. To ensure this, we sort the degree sequence and within-degree sequence independently. If d(v_{ i })<d_{ w }(v_{ i }) for any v_{ i } in the ordered lists, the condition is not satisfied. In Figure S2 of Additional file 1, we discuss the rejection rates for the rejection sampling of both the degree and within-degree sequence.

Condition 2: a realizable within-degree sequence for each module, C_{ k }, as defined by the Handshake Theorem and the Erdős-Gallai criterion.
In addition, to ensure that each module approximately achieves the overall mean within-degree, $\overline{{d}_{w}}$, we specify the following constraint: $\text{max}[{\{{d}_{w}({v}_{i})\}}_{{v}_{i}\in G}]\le \text{min}[{s}_{k}]$. If the sampled module sizes do not satisfy this criterion, the module sizes are resampled or an error is generated.
The between-degree sequence is generated by specifying d_{ b }(v_{ i })=d(v_{ i })−d_{ w }(v_{ i }) for each node v_{ i }. To test if the between-degree sequence is realizable, we impose a criterion developed by Chungphaisan [33] (reviewed by Ivanyi [34]) for realizable degree sequences in multigraphs. To do so, we imagine a coarse graph, H, where the modules of G are the nodes of H (i.e. V(H)={C_{1},C_{2},…,C_{ K }}), and the between-edges that connect modules of G are the edges of H. We note that H is a multigraph, because multiple between-edges of G can connect the same pair of modules. In this case, the degree sequence of H is $D=\left\{({C}_{k},d({C}_{k}))\mid d({C}_{k})={\sum}_{{v}_{j}\in {C}_{k}}{d}_{b}({v}_{j}),\ k=1\dots K\right\}$.

Condition 1: the Handshake Theorem is satisfied for {d(C_{ k })}: ${\sum}_{k=1}^{K}d({C}_{k})={\sum}_{k=1}^{K}{\sum}_{{v}_{j}\in {C}_{k}}{d}_{b}({v}_{j})$ is even.

Condition 2: ${\sum}_{k=1}^{j}d({C}_{k})-bj(j-1)\le {\sum}_{k=j+1}^{K}\text{min}[jb,d({C}_{k})]$ for j=1,…,K−1.
Here, b is defined as the maximum number of edges allowed between a pair of nodes in H; in our case, b=max[{d_{ b }(v_{ i })}], the maximum between-degree of any node v_{ i }∈G.
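The two conditions on the coarse graph H can be checked with a short routine. The sketch below aggregates node between-degrees into module degrees and then applies the parity condition and the Chungphaisan-type inequality, assuming the module degrees are indexed in non-increasing order (an assumption made here, since the ordering is not stated in the text). Function names and input formats are hypothetical.

```python
def module_between_degrees(between_degree, module_of):
    """Sum node between-degrees d_b(v) within each module to get d(C_k)."""
    totals = {}
    for node, d_b in between_degree.items():
        module = module_of[node]
        totals[module] = totals.get(module, 0) + d_b
    return totals


def is_multigraphical(module_degrees, b):
    """Check Conditions 1 and 2 for a multigraph with at most b parallel edges."""
    seq = sorted(module_degrees, reverse=True)  # assumed non-increasing order
    if sum(seq) % 2 != 0:
        return False  # Condition 1: the degree sum must be even
    K = len(seq)
    for j in range(1, K):
        left = sum(seq[:j]) - b * j * (j - 1)
        right = sum(min(j * b, d) for d in seq[j:])
        if left > right:
            return False  # Condition 2 violated for this j
    return True


# Three modules of two nodes each; b is the largest node between-degree.
module_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
between_degree = {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}
totals = module_between_degrees(between_degree, module_of)
ok = is_multigraphical(totals.values(), b=max(between_degree.values()))
```

In the example, module A has total between-degree 4 and modules B and C have 2 each, which is realizable by two edges from A to B and two from A to C.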
We also generate graphs with Q=0 by assuming the network is composed of a single module with no between-edges. Thus, d_{ w }(v_{ i })=d(v_{ i }) and d_{ b }(v_{ i })=0 for all v_{ i }∈G.
Connecting edges
Based on the within-degree sequence and between-degree sequence specified above, edges are connected in two steps (Figure 1). Nodes that belong to different modules are connected based on their between-degree to form between-edges (Figure 1b) and nodes that belong to the same module are connected according to their within-degree to form within-edges (Figure 1c and 1d).
We connect between-edges using a modified version of the Havel-Hakimi algorithm. The Havel-Hakimi algorithm [35, 36] constructs graphs by sorting nodes according to their degree and successively connecting nodes of highest degree with each other. After each step of connecting the highest degree node, the degree list is re-sorted and the process continues until all the edges on the graph are connected. Here, we modify this to construct between-edges by sorting nodes by highest between-degree, in order of highest total between-degree for the module to which they belong, and successively connecting the node at the top of the list randomly with other nodes. Connections are only made between nodes if they are not previously connected, belong to different modules, and do not both have within-degree of zero (to avoid disconnected components). After each step the between-degree list is re-sorted, and the process continues until all between-edges are connected. After all between-edges have been connected, the connections are randomized using a well-known method of rewiring through double-edge swaps [37]. Specifically, two randomly chosen between-edges (u,v) and (x,y) are removed, and replaced by two new edges (u,x) and (v,y), as long as u and x, and v and y belong to different modules, respectively. The swaps are constrained to avoid the formation of self loops and multi-edges. This process is repeated a large number of times to randomize edges.
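The constrained double-edge swap used to randomize between-edges can be sketched as below. This is an illustrative version only: the function name, the edge-list representation, the random flip of edge orientation and the fixed number of attempted swaps are assumptions, and the additional safeguard against disconnecting the graph is not included.

```python
import random


def rewire_between_edges(edges, module_of, n_swaps, rng=None):
    """Randomize between-module edges with degree-preserving double-edge swaps."""
    rng = rng or random.Random()
    edges = [tuple(e) for e in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        u, v = edges[i]
        x, y = edges[j]
        if rng.random() < 0.5:
            x, y = y, x  # consider both ways of pairing the endpoints
        if u == x or v == y:
            continue  # would create a self loop
        if module_of[u] == module_of[x] or module_of[v] == module_of[y]:
            continue  # new edges must still join different modules
        if frozenset((u, x)) in present or frozenset((v, y)) in present:
            continue  # would create a multi-edge
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {frozenset((u, x)), frozenset((v, y))}
        edges[i], edges[j] = (u, x), (v, y)
    return edges
```

Rejected proposals are simply skipped, so the number of successful swaps is at most `n_swaps`. Every accepted swap leaves each node's between-degree unchanged.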
We then connect within-edges using the standard Havel-Hakimi algorithm, applied to each module independently. Specifically, within-edges of a module are connected by sorting nodes of the module according to their within-degree and successively connecting nodes of highest within-degree with each other. Connections are only made between nodes if they are not previously connected, and do not both have a between-degree of zero (to avoid disconnected components). After each step the within-degree list is re-sorted and the process continues until all the within-edges of the module are connected. The connections are then randomized by rewiring through double-edge swaps [37]. We do not specify that each module be connected (only that the full graph is connected). However, if this is required, Taylor’s algorithm can be used to rewire pairs of edges until the module is connected [38]. Specifically, the algorithm selects two random edges (u,v) and (x,y) that belong to two different disconnected components of the module. As long as (u,x) and (v,y) are not existing edges, the (u,v) and (x,y) edges are removed and (u,x) and (v,y) are added. Taylor’s theorem proves that, by repeating such operations, any disconnected module can be converted to a connected module with the same degree sequence.
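For reference, the standard Havel-Hakimi construction applied to a single module can be sketched as follows. This is the textbook form of the algorithm, written here as an assumption-laden illustration: it omits the extra rule about nodes with zero between-degree and the subsequent randomization, and the function name and dictionary input are hypothetical.

```python
def havel_hakimi(degree_of):
    """Build a simple graph realizing the given degrees, or raise ValueError.

    degree_of: dict mapping node -> target degree (e.g. within-degrees of one module).
    Returns a list of (u, v) edges.
    """
    remaining = dict(degree_of)
    edges = []
    while remaining:
        # Re-sort after every step so the highest remaining degree comes first.
        order = sorted(remaining, key=remaining.get, reverse=True)
        node = order[0]
        d = remaining[node]
        if d == 0:
            break  # every stub has been matched
        targets = order[1:d + 1]
        if len(targets) < d or remaining[targets[-1]] == 0:
            raise ValueError("degree sequence is not graphical")
        for t in targets:
            remaining[t] -= 1
            edges.append((node, t))
        remaining[node] = 0  # this node is now fully connected
    return edges


# Within-degrees of a hypothetical five-node module.
module_edges = havel_hakimi({"a": 3, "b": 2, "c": 2, "d": 2, "e": 1})
```

Because each node is fully connected in a single step and then set to zero, no pair of nodes is ever joined twice, so the output contains no multi-edges or self loops.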
Results and discussion
Performance & properties of generated graphs
Performance
Our model generates graphs that closely match the expected modularity and degree distribution. The deviation of the observed modularity is less than 0.01 from the expected value, given the specified partition. The modular random graphs with Poisson degree distribution generated by our model are similar to the ones described by Girvan and Newman [3] with linking (p_{ in }) and cross-linking probability (p_{ out }) equal to $\frac{\overline{{d}_{w}}}{\overline{s}-1}$ and $\frac{\overline{d}-\overline{{d}_{w}}}{\overline{s}(K-1)}$ respectively. However, our model overcomes several limitations of the model proposed by Girvan and Newman [3] and others [16, 17] by considering heterogeneity in total degree, within-module degree distribution, and module sizes. Unlike many of the existing models [18–21], our model can generate modular random graphs with arbitrary degree distributions, including those obtained from empirical networks. Though we discuss modular random graphs with positive Q values, our model can also generate disassortative modular random graphs (see Figure S3 in Additional file 1). In this case, nodes tend to connect to nodes in other modules and thus the density of edge connections within a module is less than what is expected at random. Additionally, we compare our model to graphs generated based on a degree-corrected stochastic block model (SBM). The details of the parameterization of the SBM and the results are shown in Figure S4 in Additional file 1.
Structural properties
There are several other topological properties (besides degree distribution and community structure) that can influence network function and dynamics. The most significant of these properties are degree assortativity (the correlation between a node’s degree and its neighbors’ degrees), clustering coefficient (the propensity of a node’s neighbors to also have edges among them) and average path length (the typical number of edges between pairs of nodes in the graph). We have developed this model to generate graphs with specified degree distribution and modularity, while minimizing structural byproducts. Thus, it is important to confirm that we have reached this goal with the generative model above.
Biological networks show remarkable variation in network size, connectivity and community size distribution, with some of them having particularly small network size, high degree, and small module sizes (e.g. food-web networks). We therefore tested the performance of our generated networks under deviations in the network specifications of size, mean degree and module size distribution (results presented in Additional file 1: Figure S5, S6 and S7). We find that the structural properties of our generated modular random graphs remain constant, except for two constraining conditions: a) high average degree ($\overline{d}>10$) and b) low average module size ($\overline{s}<50$). At these parameter extremes, the modular random graphs become degree disassortative and have increased clustering coefficient. A similar observation of network degree disassortativity has been made in hierarchically modular networks [42]. In these two scenarios, the highest value of within-degree (d_{ w }(v_{ i })) that a node can attain is constrained by the community size, which reduces the number of possible high within-degree nodes. As a consequence, high within-degree nodes must connect to low within-degree nodes more than expected, resulting in a degree disassortative network. In these two cases, modules also become more dense and thus create more triangles, resulting in a gradual increase in clustering. Path length, on the other hand, is not affected by these conditions and shows a consistent dependence on network size and mean degree, which is well known [43, 44].
Application: benchmark graphs for community-detection algorithms
Detecting communities in empirical networks has been an area of intensive research in the past decade [45] since Girvan and Newman’s seminal paper on community detection [3]. Extant techniques such as modularity maximization, hierarchical clustering, the clique-based method, the spin glass method etc. aim at achieving high levels of accuracy in detecting the correct partition (for a detailed review see [45]), but have their own set of strengths and weaknesses. Choosing the best algorithm can be a difficult task, especially as algorithms often use distinct definitions of communities and perform well within that description. Thus, it is exceedingly important to test community-detection algorithms against a suitable benchmark. We propose our modular random graphs as benchmark graphs for the validation of existing and new algorithms of community detection.
In addition to comparing the estimated values of modularity to the known values in the modular random graphs, we can compare the similarity in the partitions detected by the algorithms to the true partitions. For this comparison, we use the Jaccard similarity (J), which measures the similarity between two partitions based on the proportion of the union of the partitions that is made up by the intersection of the partitions [52]; as well as the Variation of Information (VI), which measures the distance between two partitions based on the amount of information lost when going from one partition to another [53]. These results are presented in Additional file 1 (Figure S11). As reflected in the results above, we find that partitioning is inaccurate when the true community structure is weak but improves as the Q_{ true } value increases. These observations have also been noted before by Lancichinetti and Fortunato [54].
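Both partition-comparison measures can be computed directly from two label vectors. The sketch below assumes the common pair-counting form of the Jaccard index (pairs of nodes placed together in both partitions, divided by pairs placed together in at least one) and the entropy-based form of VI with natural logarithms; the exact variants used in the paper may differ, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard similarity between two partitions of the same nodes."""
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0


def variation_of_information(labels_a, labels_b):
    """VI = H(A|B) + H(B|A), in nats; 0 means identical partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        vi -= (n_ab / n) * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi


true_partition = [0, 0, 1, 1]
detected = [0, 1, 0, 1]
j = jaccard_index(true_partition, detected)
vi = variation_of_information(true_partition, detected)
```

For the two crossed partitions in the example no pair of nodes is grouped together in both, so J = 0, and VI equals 2 ln 2. Both measures depend only on the grouping, not on the label names.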
Application: null analysis of empirical networks
It is crucial to have random controls in the study of biological systems. Our algorithm can be used to generate null models and applied to the detection of structure in empirical biological networks. These null networks can be used to test hypotheses regarding the role of modularity and other topological features of the empirical networks. To do so, one would first determine the number of communities and modularity level (Q) of the sampled network using an appropriate community detection algorithm (the previous section describes the use of random modular graphs to validate existing algorithms of community detection). Our algorithm can then be used to generate an ensemble of networks that match the empirical degree structure and community structure, and then compare the structural, functional, or dynamical properties of the empirical network to those of the generated modular random graphs. Because our model generates graphs without any structural byproducts (as illustrated in a previous section), this is an appropriate model for generation of null models. We note that our algorithm does not necessarily require knowledge of the complete empirical network, but rather only estimates of the degree structure and community structure. The literature on algorithms for inference of network structure from a sample is growing, and currently includes work on inference of missing nodes, edges and even community structure [55–57].
We also generate random graphs based on the configuration model that have the same degree distribution and average network degree as the empirical network but are random with respect to other network properties for each of the four empirical networks (Figure 6, dark gray bars). Our modular random graph model identifies which network measures assume their empirical values in a particular network because of (i) the observed degrees and (ii) the latent community structure. The configuration model, on the other hand, only specifies (i) and not (ii) [13]. Comparison to these configuration model networks thus helps us highlight the utility of our model to identify which empirical patterns in a network are deserving of further investigation. Figure 6 shows the value of each of these properties for the empirical networks as well as the ensemble mean of modular random and random graphs with matched degree distribution.
From Figure 6 it is evident that none of the empirical biological networks have network structure identical to their null counterparts. This suggests that the structure of each of these biological systems is governed by more than what is specified by the degree distribution and community structure. However, the observed network properties of empirical networks are closer to the ensemble means of the modular random graphs, which indicates that modularity is an essential structural component of real biological networks and that it plays an important role in influencing other structural properties of the network. For instance, compartmentalization induced by modularity promotes species persistence and system robustness by containing localized perturbation [11, 62, 63], which might favor their selection during the course of evolution. Our results show that the empirical networks tested have a much higher modularity than the simple random graphs (Figure 6a) and therefore provide evidence for this selection. Out of the three network properties that we tested apart from modularity, we found clustering coefficient of the generated random graphs to be significantly different from each of the empirical counterparts. This may point to a functional role for “triangles” in these biological networks, significantly above or below what is prescribed by the degree and community structure.
Little Rock Lake food web interactions (FW)
Among the four empirical networks that we tested, the ensemble-mean properties of the null models, such as assortativity and path length, closely match most of the observed properties of the Little Rock Lake food web. The observed clustering coefficient of the food web is strikingly lower than that of either random graph model, which confirms earlier observations of low clustering in food webs (Figure 6d). The observed path length of this food web is short (Figure 6c) and only slightly longer than the path lengths of the random graphs, which has also been noted before [64–66]. We note that for this food web, the structural properties of the random graphs with matched degree distribution are quite similar to those of their modular random graph counterparts, suggesting that the degree distribution, particularly the high density of edges in the network, governs most of the other topological characteristics of this network. Modularity, on the other hand, seems to play a minor role in dictating the structural properties of this network.
Yeast protein-protein interaction network (YP)
The empirical yeast protein network is more disassortative than the ensemble mean of the null modular graphs (Figure 6b). Disassortative interactions in protein-protein interaction networks are known to reduce interference between functional modules and thus increase the overall robustness of the network to deleterious perturbations [6], while also allowing functions to be performed concurrently [67]. The results therefore suggest that disassortative interactions may be selected for in the evolution of biological networks. From Figure 6(d) it is also evident that the yeast protein network has a higher clustering coefficient than that predicted by the modular random graphs. A high clustering coefficient indicates that there are several alternate interaction paths between two proteins, making the system more robust to perturbation [68].
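Degree assortativity, as used above, is the Pearson correlation between the degrees of the nodes at either end of an edge, with negative values indicating disassortative mixing [14]. A minimal sketch of that computation for an undirected edge list follows; the function name `degree_assortativity` and the toy graphs are illustrative assumptions, not part of the authors' code.

```python
import math


def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    if not edges:
        return math.nan
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so one mean suffices
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return math.nan  # undefined when all end-point degrees are equal
    return cov / var


# A star is maximally disassortative: the hub only touches degree-1 leaves.
star = [(0, leaf) for leaf in range(1, 6)]
print(degree_assortativity(star))

# A 4-node path is mildly disassortative.
path = [(0, 1), (1, 2), (2, 3)]
print(degree_assortativity(path))
```

Comparing this coefficient for an empirical network with its distribution over an ensemble of null graphs indicates whether the observed mixing pattern exceeds what the constraints of the null model already imply.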
C. elegans metabolic interaction network (CM)
The C. elegans metabolic network has a shorter path length but a higher clustering coefficient than both the modular random graphs and the random graphs with matched degree distribution (Figure 6c and 6d). A high clustering coefficient combined with a short path length suggests that the graph has small-world properties, which have been observed in other metabolic networks as well [69]. A highly disassortative degree structure is also well known in metabolic networks, although the mechanism leading to this property is unclear (see the review in [39]). Because the disassortativity predicted by the modular random graphs is closer to the observed value than that of the degree-matched random graphs, our results suggest that the strong community structure of metabolic networks could be one of the factors contributing to high degree disassortativity. (As discussed earlier, community structure leads to significant degree correlations in small networks with long-tailed degree distributions; see Figure S5 in Additional file 1 for an example.)
Social interaction network of dolphins at Doubtful Sound, New Zealand (DS)
The empirical social interaction network of dolphins that we investigated has negative assortativity (disassortativity), as do other real biological networks (Figure 6b). Interestingly, the assortativity of both null counterparts of the dolphin network (the modular random graphs and the random graphs with matched degree distribution) is lower than the observed value, which suggests that the network is more assortative than expected. Degree assortativity has also been observed in other animal [70] and human [4] social interaction networks. This result is quite intuitive for a social network and is also referred to as homophily: more gregarious individuals tend to interact with other gregarious individuals, while introverted individuals prefer to associate with other introverts [14]. The empirical dolphin network also has a lower clustering coefficient than expected under either null model. Low clustering coupled with higher-than-expected degree assortativity indicates that dolphin populations may be more susceptible to the propagation of infection or information, as transmission may occur rapidly through an entire network with such properties [70, 71].
Conclusions
In summary, the model that we propose in this study generates modular random graphs over a broad range of degree distributions, modularity values and module size distributions. We highlight that our model is specifically designed to generate networks whose modularity is evenly divided across their modules, modulo the impact of module size. This means that we are mitigating the resolution limit effect and indeed generating networks with the maximum modularity partition. We also confirm that structural properties of our generated modular graphs, such as assortativity, clustering and path length, remain unperturbed for a broad range of parameter values. This important feature allows these graphs to act as benchmark and control graphs for explicitly testing hypotheses regarding the function and evolution of modularity in biological systems. Of the approaches available, our method provides flexibility and has been explored the most fully for these applications.
Compartmentalization of biological networks has been an area of great interest to biologists. What we refer to as community structure in this work is any segregation of a biological system into smaller subunits interconnected by only a few connections. It has been suggested that modularity in a system promotes system robustness and enhances species persistence by containing localized perturbations [11, 63]. Metabolic networks of organisms living in a variable environment have indeed been found to be more modular [62]. Maintaining and selecting for modularity in biological networks, however, comes at considerable cost: reduced system complexity [72], longer developmental time, and the cost of complete module replacement in case of failure [73]. It is therefore unclear why modularity would be strongly selected for as a structural feature of biological systems. There is also a lack of evidence that the functional localization of subgoals overlaps with the structural segregation of the network into communities. Our work provides a tool for the systematic study of network structure (through benchmark graphs) and of the impact of connectivity and compartmentalization on system function and dynamics (through control graphs).
The detection of community structure plays a crucial role in our topological understanding of complex networks. Currently, the performance of community detection methods is usually evaluated against ground truth from real networks. However, determining reference communities in real networks is often a difficult task. Moreover, ground truth data on empirical network partitions do not necessarily identify system features based on network topology and thus may create a bias when analyzing community structure. A more convenient technique for evaluating community detection methods is to use artificial random graphs, but this approach has been limited because most existing models fail to incorporate the degree heterogeneity of real networks. By providing a systematic method to generate benchmark graphs, our model can aid in the development of more robust community detection algorithms, and therefore improve our topological understanding of empirical networks.
A step beyond identifying the topological presence of network communities is understanding their evolution as well as the functional and dynamical role of community structure. We believe this process can be facilitated by using an appropriate class of control or null graphs. As a model for generating null networks, our method joins a suite of random graph models, each contributing to a hierarchy of null models. The simplest model for generating random graphs (based on only a single parameter) is the Erdős-Rényi random graph model, which produces graphs that are completely defined by their average degree and are random in all other respects. A slightly more complex and general model is one that generates graphs that have a specified degree distribution (or degree sequence) but are random in all other respects [13, 74, 75]. These models can be extended to sequentially include additional independent structural constraints, such as degree distribution and clustering coefficient [2], or degree structure and community structure, as we have demonstrated here. A further extension of this work will be to design models that generate random graphs with multiple structural constraints. For example, our model can be combined with the one proposed in [2] to generate random graphs with a specified degree distribution as well as tunable strength of modularity and clustering coefficient.
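The lowest level of this hierarchy can be illustrated with a short sketch. It assumes the G(n, m) variant of the Erdős-Rényi model, which fixes the number of edges and therefore matches the average degree of a target network exactly while leaving everything else random; the function name `erdos_renyi_matched` and the parameter values are hypothetical and chosen only for illustration.

```python
import random


def erdos_renyi_matched(n_nodes, n_edges, rng):
    """Sample a simple undirected graph with exactly n_nodes nodes and n_edges edges."""
    max_edges = n_nodes * (n_nodes - 1) // 2
    if n_edges > max_edges:
        raise ValueError("too many edges for a simple graph on this many nodes")
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)  # two distinct nodes, so no self-loops
        edges.add((min(u, v), max(u, v)))     # the set discards repeated pairs
    return sorted(edges)


rng = random.Random(42)
null_edges = erdos_renyi_matched(n_nodes=20, n_edges=30, rng=rng)
mean_degree = 2 * len(null_edges) / 20
print(mean_degree)  # 3.0
```

Each further level of the hierarchy adds one constraint to such a generator: first the full degree sequence, then clustering or community structure, so that any property still unexplained at a given level can be attributed to structure beyond those constraints.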
Availability and requirements
Project name: Modular random graph generator
Project home page: http://github.com/bansallab
Operating system(s): Platform independent
Programming language: Python 2.7
Other requirements: Networkx Python package
License: BSD-style
Any restrictions to use by non-academics: None
Declarations
Acknowledgements
This work was supported by NSF award DEB-1216054.
Authors’ Affiliations
References
1. Proulx SR, Promislow DEL, Phillips PC: Network thinking in ecology and evolution. Trends Ecol Evol. 2005, 20 (6): 345-53. 10.1016/j.tree.2005.04.004.
2. Bansal S, Khandelwal S, Meyers LA: Exploring biological network structure with clustered random networks. BMC Bioinformatics. 2009, 10: 405. 10.1186/1471-2105-10-405.
3. Girvan M, Newman MEJ: Community structure in social and biological networks. Proc Nat Acad Sci USA. 2002, 99 (12): 7821-7826. 10.1073/pnas.122653799.
4. Newman M: Mixing patterns in networks. Phys Rev E. 2003, 67 (2): 026126.
5. Ravasz E, Somera AL, Oltvai ZN, Barabási AL, Mongru DA: Hierarchical organization of modularity in metabolic networks. Science (New York, NY). 2002, 297 (5586): 1551-1555. 10.1126/science.1073374. [http://www.ncbi.nlm.nih.gov/pubmed/12202830]
6. Maslov S, Sneppen K: Specificity and stability in topology of protein networks. Science (New York, NY). 2002, 296 (5569): 910-913. 10.1126/science.1065103.
7. Han JDJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004, 430 (6995): 88-93. 10.1038/nature02555.
8. Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genet. 2002, 31: 64-68. 10.1038/ng881.
9. Krause AE, Mason DM, Ulanowicz RE, Taylor WW, Frank KA: Compartments revealed in food-web structure. Nature. 2003, 426 (6964): 282-285. 10.1038/nature02115.
10. Stouffer DB, Sales-Pardo M, Newman MEJ, Guimerà R: Origin of compartmentalization in food webs. Ecology. 2010, 91 (10): 2941-2951. 10.1890/09-1175.1.
11. Olesen JM, Bascompte J, Dupont YL, Jordano P: The modularity of pollination networks. Proc Nat Acad Sci USA. 2007, 104 (50): 19891-19896. 10.1073/pnas.0706375104.
12. Yang J, Leskovec J: Defining and evaluating network communities based on ground-truth. Proc ACM SIGKDD Workshop Mining Data Semantics - MDS '12. 2012, New York: ACM Press, 1-8.
13. Molloy M, Reed B: A critical point for random graphs with a given degree sequence. Random Struct Algorithms. 1995, 6 (2–3): 161-180.
14. Newman M: Assortative mixing in networks. Phys Rev Lett. 2002, 89 (20): 208701.
15. Xulvi-Brunet R, Sokolov I: Reshuffling scale-free networks: from random to assortative. Phys Rev E. 2004, 70 (6): 066102. [http://link.aps.org/doi/10.1103/PhysRevE.70.066102]
16. Lancichinetti A, Fortunato S, Radicchi F: Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008, 78 (4): 1-6.
17. Bagrow JP: Evaluating local community methods in networks. J Stat Mech Theory Exper. 2008, 2008 (05): P05001.
18. Arenas A, Díaz-Guilera A, Pérez-Vicente C: Synchronization reveals topological scales in complex networks. Phys Rev Lett. 2006, 96 (11): 114102.
19. Hintze A, Adami C: Modularity and anti-modularity in networks with arbitrary degree distribution. Biol Direct. 2010, 5: 32. 10.1186/1745-6150-5-32.
20. Sawardecker EN, Sales-Pardo M, Nunes Amaral LA: Detection of node group membership in networks with group overlap. Eur Phys J B. 2008, 67 (3): 277-284. [http://www.springerlink.com/index/10.1140/epjb/e2008-00418-0]
21. Sales-Pardo M, Nunes Amaral LA, Guimerà R: Module identification in bipartite and directed networks. Phys Rev E, Stat Nonlinear Soft Matter Phys. 2007, 76 (3 Pt 2): 036102.
22. Zhao H, Gao ZY: Modular effects on epidemic dynamics in small-world networks. Euro Phys Lett (EPL). 2007, 79 (3): 38002. 10.1209/0295-5075/79/38002.
23. Yan G, Fu ZQ, Ren J, Wang WX: Collective synchronization induced by epidemic dynamics on complex networks with communities. Phys Rev E. 2007, 75: 016108.
24. Chu X, Guan J, Zhang Z, Zhou S: Epidemic spreading in weighted scale-free networks with community structure. J Stat Mech Theory Exper. 2009, 2009 (07): P07043.
25. Salathe M, Jones JH: Dynamics and control of diseases in networks with community structure. PLoS Comput Biol. 2010, 6 (4): 1-11.
26. Clauset A, Shalizi CR, Newman MEJ: Power-law distributions in empirical data. SIAM Rev. 2009, 51 (4): 661-703. 10.1137/070710111.
27. Wang P, Robins G, Pattison P, Lazega E: Exponential random graph models for multilevel networks. Soc Netw. 2013, 35: 96-115. 10.1016/j.socnet.2013.01.004.
28. Chatterjee S, Diaconis P: Estimating and understanding exponential random graph models. Ann Stat. 2013, 41 (5): 2428-2461. 10.1214/13-AOS1155.
29. Karrer B, Newman M: Stochastic blockmodels and community structure in networks. Phys Rev E. 2011, 83: 1-11.
30. Newman MEJ: Detecting community structure in networks. Eur Phys J B - Condensed Matter. 2004, 38 (2): 321-330.
31. Good BH, de Montjoye YA, Clauset A: Performance of modularity maximization in practical contexts. Phys Rev E. 2010, 81 (4): 046106.
32. Zverovich IE, Zverovich VE: Contributions to the theory of graphic sequences. Discrete Math. 1992, 105: 293-303. 10.1016/0012-365X(92)90152-6.
33. Chungphaisan V: Conditions for sequences to be r_graphic. Discrete Math. 1974, 7: 31-39. 10.1016/S0012-365X(74)80016-6.
34. Iványi A: Degree sequences of multigraphs. Annales Univ Sci Budapest Sect Comp. 2012, 37: 195-214.
35. Havel V: A remark on the existence of finite graphs. Casopis Pest Mat. 1955, 80: 477-480.
36. Hakimi S: On realizability of a set of integers as degrees of the vertices of a linear graph. I. J Soc Industrial Appl. 1962, 10 (3): 496-506. 10.1137/0110037.
37. Gkantsidis C, Mihail M, Zegura E: The Markov chain simulation method for generating connected power law random graphs. Proceedings of the Fifth Workshop on Algorithm Engineering and Experiments. Edited by: Ladner RE. 2003, SIAM, 16–25.
38. Taylor R: Constrained Switchings in Graphs. 1981, Berlin, Heidelberg: Springer.
39. Barabási AL, Oltvai ZN: Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004, 5 (2): 101-113. 10.1038/nrg1272.
40. Przulj N: Biological network comparison using graphlet degree distribution. Bioinformatics. 2007, 23 (2): e177-e183. 10.1093/bioinformatics/btl301.
41. Tanaka R: Scale-rich metabolic networks. Phys Rev Lett. 2005, 94 (16): 168101.
42. Jing Z, Lin T, Hong Y, Jian-Hua L: The effects of degree correlations on network topologies and robustness. Chinese. 2007, 16 (12): 3571-3580.
43. Dorogovtsev S, Mendes J, Oliveira J: Degree-dependent intervertex separation in complex networks. Phys Rev E. 2006, 73 (5): 056122.
44. Hołyst J, Sienkiewicz J, Fronczak A, Fronczak P, Suchecki K: Universal scaling of distances in complex networks. Phys Rev E. 2005, 72 (2): 026108.
45. Newman MEJ: Communities, modules and large-scale structure in networks. Nat Phys. 2011, 8: 25-31. 10.1038/nphys2162.
46. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J Stat Mech Theory Exper. 2008, 2008 (10): P10008. 10.1088/1742-5468/2008/10/P10008.
47. Clauset A, Newman M, Moore C: Finding community structure in very large networks. Phys Rev E. 2004, 70 (6): 066111.
48. Reichardt J, Bornholdt S: Statistical mechanics of community detection. Phys Rev E. 2006, 74: 1-16.
49. Rosvall M, Axelsson D, Bergstrom CT: The map equation. Eur Phys J Special Topics. 2010, 178: 13-23.
50. Raghavan U, Albert R, Kumara S: Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E. 2007, 76 (3): 036106. [http://link.aps.org/doi/10.1103/PhysRevE.76.036106]
51. Pons P, Latapy M: Computing communities in large networks using random walks. J Graph Algorithms Appl. 2006, 10 (2): 191-218. 10.7155/jgaa.00124.
52. Downton M, Brennan T: Comparing classifications: an evaluation of several coefficients of partition agreement. Classification Society, Boulder, CO, vol. 4. 1980.
53. Meilǎ M: Comparing clusterings by the variation of information. Learn Theory Kernel Mach. 2003, 2777: 173-187. 10.1007/978-3-540-45167-9_14.
54. Lancichinetti A, Fortunato S: Community detection algorithms: a comparative analysis. Phys Rev E. 2009, 80 (5): 056117.
55. Chen J, Zaïane O, Goebel R: Local community identification in social networks. Soc Netw Anal. 2009, 237-242.
56. Kim M, Leskovec J: The network completion problem: inferring missing nodes and edges in networks. SDM. 2011, 47-58.
57. Lin W, Kong X, Yu PS, Wu Q, Jia Y, Li C: Community detection in incomplete information networks. Proc 21st Int Conf World Wide Web - WWW '12. 2012, New York: ACM Press, 341-341.
58. Martinez N: Artifacts or attributes? Effects of resolution on the Little Rock Lake food web. Ecol Monograph. 1991, 61 (4): 367-392. 10.2307/2937047.
59. Colizza V, Flammini A, Maritan A, Vespignani A: Characterization and modeling of protein-protein interaction networks. Phys A Stat Mech Appl. 2005, 352: 1-27. 10.1016/j.physa.2004.12.030.
60. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL: The large-scale organization of metabolic networks. Nature. 2000, 407 (6804): 651-654. 10.1038/35036627.
61. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM: The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol. 2003, 54 (4): 396-405. 10.1007/s00265-003-0651-y.
62. Parter M, Kashtan N, Alon U: Environmental variability and modularity of bacterial metabolic networks. BMC Evol Biol. 2007, 7: 169. 10.1186/1471-2148-7-169.
63. Stouffer DB, Bascompte J: Compartmentalization increases food-web persistence. Proc Nat Acad Sci USA. 2011, 108 (9): 3648-3652. 10.1073/pnas.1014353108.
64. Williams RJ, Berlow EL, Barabási AL, Martinez ND, Dunne JA: Two degrees of separation in complex food webs. Proc Nat Acad Sci USA. 2002, 99 (20): 12913-12916. 10.1073/pnas.192448799.
65. Montoya JM, Sole RV: Small world patterns in food webs. J Theor Biol. 2002, 214 (3): 405-412. 10.1006/jtbi.2001.2460.
66. Williams RJ, Martinez ND, Dunne JA: Food-web structure and network theory: the role of connectance and size. Proc Nat Acad Sci USA. 2002, 99 (20): 12917-12922. 10.1073/pnas.192407699.
67. Khor S: Concurrency and network disassortativity. Artif Life. 2010, 16 (3): 225-232. 10.1162/artl_a_00001.
68. Wuchty S, Barabási AL, Ferdig MT: Stable evolutionary signal in a yeast protein interaction network. BMC Evol Biol. 2006, 6: 8. 10.1186/1471-2148-6-8.
69. Wagner A, Fell DA: The small world inside large metabolic networks. Proc Biol Sci R Soc. 2001, 268 (1478): 1803-1810. 10.1098/rspb.2001.1711.
70. Croft D, James R, Ward AJW, Botham MS, Mawdsley D, Krause J: Assortative interactions and social networks in fish. Oecologia. 2005, 143: 211-219. 10.1007/s00442-004-1796-8.
71. Newman M: Properties of highly clustered networks. Phys Rev E. 2003, 68 (2): 026121.
72. Welch JJ, Waxman D: Modularity and the cost of complexity. Evol Int J Organic Evol. 2003, 57 (8): 1723-1734. 10.1111/j.0014-3820.2003.tb00581.x.
73. Krohs U: The cost of modularity. Functions in Biological and Artificial Worlds: Comparative Philosophical Perspectives. 2009, MIT Press, 259-276.
74. Aiello W, Chung F, Lu L: A random graph model for massive graphs. Proc Thirty-Second Annual ACM Symposium on Theory of Computing - STOC '00. 2000, New York: ACM Press, 171-180.
75. Newman MEJ, Strogatz SH, Watts DJ: Random graphs with arbitrary degree distributions and their applications. Phys Rev E. 2001, 64 (2): 026118.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.