Network analysis and modeling is a rapidly growing area that is advancing our understanding of biological processes. Networks are mathematical representations of the interactions among the components of a system. Nodes in a biological network usually represent biological units of interest such as genes, proteins, individuals, or species. Edges indicate interactions between nodes, such as regulatory interactions, gene flow, social interactions, or infectious contacts [1]. A basic model for biological networks assumes random mixing between nodes of the network. The network patterns in real biological populations, however, are typically more heterogeneous than these simple models assume [2]. For instance, biological networks often exhibit properties such as degree heterogeneity, assortative mixing, non-trivial clustering coefficients, and community structure (see review by Proulx et al. [1]). Of particular interest is community structure, which reflects the presence of large groups of nodes that are typically highly connected internally but only loosely connected to other groups [3, 4]. This pattern of large and relatively dense subgraphs is called assortative community structure. In empirical networks, these groups, also called modules or communities, often correspond well with experimentally known functional clusters within the overall system. Thus, community detection, by examining the patterns of interactions among the parts of a biological system, can help identify functional groups automatically, without prior knowledge of the system’s processes.

Although community structure is believed to be a central organizational pattern in biological networks such as metabolic [5], protein [6, 7], genetic [8], food-web [9, 10] and pollination networks [11], a detailed understanding of its relationship with other network topological properties is still limited. In fact, the task of clearly identifying the true community structure within an empirical network is complicated by a multiplicity of community detection algorithms, multiple and conflicting definitions of communities, inconsistent outcomes from different approaches, and a relatively small number of networks for which ground truth is known. Although node attributes in empirical networks (e.g., habitat type in food webs) are sometimes used to evaluate the accuracy of community detection methods [12], these results are generally of ambiguous value: a failure to recover communities that correlate with some node attribute may simply indicate that the true features driving the network’s structure are unobserved, not that the identified communities are incorrect.

A more straightforward method of exploring the structural and functional role of a network property is to generate graphs that are random with respect to all properties except the one of interest. For example, network properties such as degree distribution, assortativity and clustering coefficient have been studied using the configuration model [13] and models for generating random graphs with tunable structural features [2, 14, 15]. These graphs help identify which network measures take their empirical values in a given network because of the property of interest. In this work, we propose a model for generating simple, connected random networks that have a specified degree distribution and level of community structure.
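As a rough illustration of the configuration-model idea mentioned above, the following minimal sketch pairs edge "stubs" uniformly at random so that each node receives a prescribed degree. It is not the model proposed in this work: the function names `configuration_model_edges` and `realized_degrees` are hypothetical, and unlike the simple, connected graphs targeted here, plain stub matching may produce self-loops, repeated edges, and disconnected graphs.

```python
import random


def configuration_model_edges(degrees, seed=None):
    """Pair edge stubs uniformly at random for a given degree sequence.

    Illustrative sketch only: the result may contain self-loops and
    repeated edges, and it is not guaranteed to be connected.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("degree sequence must have an even sum")
    rng = random.Random(seed)
    # Node v contributes degrees[v] stubs (half-edges).
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined into edges.
    return list(zip(stubs[0::2], stubs[1::2]))


def realized_degrees(n_nodes, edges):
    """Count edge endpoints per node (a self-loop contributes 2)."""
    counts = [0] * n_nodes
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


degrees = [3, 2, 2, 2, 1]
edges = configuration_model_edges(degrees, seed=1)
```

Because every stub is used exactly once, the endpoint counts returned by `realized_degrees` match the input sequence by construction; everything else about the graph is left to chance, which is what makes it usable as a null model for degree-related effects.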

Random graphs with tunable strength of community structure can serve several purposes: (1) as benchmarks to test the performance of community detection algorithms; (2) as null models for empirical networks to investigate the combined effect of the observed degrees and the latent community structure on the network properties; (3) as proxy networks for modeling network dynamics in the absence of empirical network data; and (4) as a tool for the systematic study of the impact of community structure on the dynamics that may flow on a network. Among these, the use of such graphs as benchmarks has received the most attention, and several such models have been proposed [16–21]. A few studies have also looked at the role of community structure in the flow of disease through contact networks [22–25]. However, the use of modular random graphs, which can be defined as random graphs that have a higher strength of community structure than what is expected at random, is still relatively unexplored in other applications.

### Previous work

In 2002, Girvan and Newman proposed a simple toy model for generating random networks with a specific configuration and strength of community structure [3]. This model assumes a fixed number of modules of equal size, in which every node has the same expected degree. In this way, each module is an Erdős-Rényi random graph. To produce modular structure, different but fixed probabilities are used to produce edges within or between modules. Although this toy model has been widely used to evaluate the accuracy of community detection algorithms, it has limited relevance to real-world networks, which are generally both larger and much more heterogeneous. Lancichinetti et al. [16] introduced a generalization of the Girvan-Newman model that better incorporates some of these features, e.g., by including heterogeneity in both degree and community size. However, this model assumes that degrees are always distributed in a particular way (such as a power law [26]), which is also unrealistic. (A similar model by Bagrow [17] generates modular networks with a power-law degree distribution and constant community size.)
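The Girvan-Newman construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original authors' code: the function name `planted_partition` and the parameter names `p_in` and `p_out` (the fixed within-module and between-module edge probabilities) are hypothetical, and the values in the usage lines are arbitrary.

```python
import itertools
import random


def planted_partition(n_modules, module_size, p_in, p_out, seed=None):
    """Equal-sized modules with fixed within/between edge probabilities.

    Returns (module, edges), where module[v] is the module index of
    node v and edges is a list of undirected pairs (u, v) with u < v.
    """
    rng = random.Random(seed)
    n = n_modules * module_size
    # Nodes 0..module_size-1 form module 0, the next block module 1, etc.
    module = [v // module_size for v in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        # Same module: use p_in; different modules: use p_out.
        p = p_in if module[u] == module[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return module, edges


# Arbitrary illustrative parameters: p_in > p_out yields modular structure.
module, edges = planted_partition(4, 32, p_in=0.3, p_out=0.02, seed=0)
within = sum(module[u] == module[v] for u, v in edges)
fraction_within = within / len(edges) if edges else 0.0
```

Every node here has the same expected degree, `(module_size - 1) * p_in + (n - module_size) * p_out`, which is the homogeneity that limits the model's relevance to heterogeneous real-world networks.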

Yan et al. [23] used a preferential attachment model to grow scale-free networks composed of communities of nodes whose degrees follow a power-law distribution. Models for special graph types such as hierarchical networks [18], bipartite networks [21], and networks with overlapping modules [20] have also been proposed. These models also make strong assumptions about the degree or community size distributions, which may not be realistic for comparison with real biological networks. A recently proposed model [19] does generate networks with a broad range of degree distributions, modularity and community sizes, but its parameters have an unclear relationship with desired properties (such as degree distribution and modularity), making it difficult to use in practice. Thus, while these models may be sufficient for comparative evaluation of community detection algorithms, they are of limited value for understanding the performance and output of those algorithms when applied to real-world networks.

An alternative approach comes from probabilistic models, of which there are two popular classes. Exponential random graph models (ERGMs) have a long history of use in social network analysis, and can generate an ensemble of networks that contain certain frequencies of local graph features, including heterogeneous degrees, triangles, and 4-cycles [27]. However, many classes of ERGMs exhibit pathological behavior when parameterized with triangles or higher-order structures [28], which severely limits their utility. Stochastic block models (SBMs) are more promising, but require a large number of parameters to be chosen before a graph can be generated. In this approach, the probability of each link depends only on the community labels of its endpoints. Thus, to generate a network, we must specify the number of communities *K*, their sizes (in the form of a labeling of the vertices), and the $\binom{K}{2}$ (in the undirected case) group-pair probabilities. The result is a random graph with specified community sizes, where each community is an Erdős-Rényi random graph with a specified internal density, and each pair of communities is a random bipartite graph with specified density. The degree distribution of these networks is a mixture of Poisson distributions, which can be unrealistic. A recent generalization of the SBM due to Karrer and Newman [29] allows the specification of the degree sequence, which circumvents this limitation but introduces another set of parameters to be chosen. Although stochastic block models can in principle be used to generate synthetic networks, they are more commonly used within an inferential framework in which community structure is recovered by estimating the various parameters directly from a network.
As a result, the practical use of the SBM as a null model, either for general benchmarking of community detection algorithms or for understanding the structure of biological networks, remains largely unexplored, and we lack clear answers as to how best to sample appropriately from its large parameter space in these contexts. The SBM also does not provide a simple measure of the level of modularity in a network’s large-scale structure, which makes its structure more difficult to interpret. The SBM is a promising model for many tasks, and adapting it to the questions we study here remains an interesting avenue for future work.
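To make the Poisson-mixture remark about the basic SBM concrete, the following worked expression may help. The notation is introduced here for illustration and is not taken from the cited works: $n_r$ is the number of nodes in community $r$, $n = \sum_r n_r$, and $p_{rs}$ is the link probability between communities $r$ and $s$. Assuming a sparse network (large $n_s$, small $p_{rs}$), the degree of a node in community $r$ is approximately Poisson with mean $\lambda_r$, so the overall degree distribution is approximately a mixture weighted by community size:

$$
\lambda_r = (n_r - 1)\,p_{rr} + \sum_{s \neq r} n_s\,p_{rs},
\qquad
P(k) \approx \sum_{r=1}^{K} \frac{n_r}{n}\,\frac{\lambda_r^{k}\,e^{-\lambda_r}}{k!}
$$

With at most $K$ distinct means $\lambda_r$, such a mixture cannot easily reproduce the broad degree distributions seen in many empirical networks unless $K$ is large.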

### Our approach

Here, we develop and implement a simple simulation model for generating modular random graphs using only a small number of intuitive and interpretable parameters. Our model can generate graphs over a broad range of distributions of network degree and community size. The generated graphs can range from very small ($<10^{2}$) to large ($>10^{5}$) network sizes and can be composed of a variable number of communities. In Methods below, we introduce our algorithm for generating modular random graphs. In Results and discussion, we consider the performance of our algorithm and the structural features of our generated graphs to show that properties such as degree assortativity, clustering, and path length remain unchanged as modularity increases. We next demonstrate the applicability of the generated modular graphs to test the accuracy of extant community detection algorithms. Our analysis shows that the accuracy of these algorithms depends on several network properties, such as the network mean degree and the strength of community structure. Finally, using a few empirical biological networks, we demonstrate that our model can be used to generate corresponding null modular graphs under two different models of randomization. We conclude the paper with some thoughts about other applications and present some future directions.