Molecular ecological network analyses
© Deng et al.; licensee BioMed Central Ltd. 2012
Received: 29 December 2011
Accepted: 30 May 2012
Published: 30 May 2012
Skip to main content
© Deng et al.; licensee BioMed Central Ltd. 2012
Received: 29 December 2011
Accepted: 30 May 2012
Published: 30 May 2012
Understanding the interaction among different species within a community and their responses to environmental changes is a central goal in ecology. However, defining the network structure in a microbial community is very challenging due to their extremely high diversity and as-yet uncultivated status. Although recent advance of metagenomic technologies, such as high throughout sequencing and functional gene arrays, provide revolutionary tools for analyzing microbial community structure, it is still difficult to examine network interactions in a microbial community based on high-throughput metagenomics data.
Here, we describe a novel mathematical and bioinformatics framework to construct ecological association networks named molecular ecological networks (MENs) through Random Matrix Theory (RMT)-based methods. Compared to other network construction methods, this approach is remarkable in that the network is automatically defined and robust to noise, thus providing excellent solutions to several common issues associated with high-throughput metagenomics data. We applied it to determine the network structure of microbial communities subjected to long-term experimental warming based on pyrosequencing data of 16 S rRNA genes. We showed that the constructed MENs under both warming and unwarming conditions exhibited topological features of scale free, small world and modularity, which were consistent with previously described molecular ecological networks. Eigengene analysis indicated that the eigengenes represented the module profiles relatively well. In consistency with many other studies, several major environmental traits including temperature and soil pH were found to be important in determining network interactions in the microbial communities examined. To facilitate its application by the scientific community, all these methods and statistical tools have been integrated into a comprehensive Molecular Ecological Network Analysis Pipeline (MENAP), which is open-accessible now (http://ieg2.ou.edu/MENA).
The RMT-based molecular ecological network analysis provides powerful tools to elucidate network interactions in microbial communities and their responses to environmental changes, which are fundamentally important for research in microbial ecology and environmental microbiology.
In an ecosystem, different species/populations interact with each other to form complicated networks through various types of interactions such as predation, competition and mutualism. On the basis of ecological interactions, ecological networks can be grouped as antagonistic, competitive and mutualistic networks . Traditionally, food webs have been intensively studied in ecological research because they are critical to study the complexity and stability of ecological communities [2, 3]. Recent studies showed that food webs possessed typical properties of network topology (e.g. degree distribution, small world effect) [1, 4, 5]. Within the last decade, mutualistic networks have also been intensively studied . But, it appears that no studies have been performed to examine competitive networks. This is most likely because the network structure is not available based on competitive interactions. Unlike food webs and plant-animal mutualistic networks where the structure is already known, quantifying competitive interactions among different species/populations within a given habitat is difficult so that the network structure for competitive interactions is unknown. This is also true for network studies in microbial ecology. Because of their vast diversity, as-yet uncultivated status  and of the lack of appropriate theoretical frameworks and experimental data, very few community-scale network studies have been performed in microbial ecology.
Various network approaches have been developed and widely applied in genomic biology . To reveal the interactions among biological molecules including genes and proteins, differential equation-based network methods [9–12], Bayesian network methods [13, 14], and relevance/co-expression network methods [15–20], have been used to infer cellular networks based on gene expression data. Among them, the correlation-based relevance network method is most commonly used largely due to its simple calculation procedure and noise tolerance . However, most studies involving relevance network analysis use arbitrary thresholds, and thus the constructed networks are subjective rather than objective . This problem has been solved by our recent development of a random matrix theory (RMT)-based approach, which is able to automatically identify a threshold for cellular network construction from microarray data [22–24]. Our results showed that the developed novel RMT-based approach can automatically identify cellular networks based on microarray data. Our results also indicated that this approach is a reliable, sensitive and robust tool for identifying transcriptional networks for analyzing high-throughput genomics data for modular network identification and gene function prediction [22, 23].
High-throughput technologies such as microarrays and high throughout sequencing have generated massive amounts of data on microbial community diversity and dynamics across various spatial and temporal scales [25, 26]. These data offer an unprecedented opportunity to examine interactions among different microbial populations . Recently, a novel conceptual framework, termed molecular ecological networks (MENs), has been proposed and applied to characterize microbial communities in response to elevated CO2[27, 28]. Here, we provide detailed mathematical and bioinformatic foundation of this novel approach, and further applications to characterize microbial community network interactions in response to long-term experimental warming. Additionally, we provide an online tool, named the Molecular Ecological Network Analyses Pipeline (MENAP), which is freely accessible to the scientific community.
An ecological network is a representation of various biological interactions (e.g., predation, competition, mutualism) in an ecosystem, in which species (nodes) are connected by pairwise interactions (links) [1, 29–32]. As previously described, we refer such molecule-based ecological networks in microbial communities as molecular ecological networks (MENs) [27, 28], in which different nodes (molecular markers, e.g., OTUs, functional genes, intergenic regions) are linked by edges (i.e., interactions). The MENs derived from functional gene markers are referred as functional molecular ecological networks (fMENs)  and those based on phylogenetic gene markers as phylogenetic molecular ecological networks (pMENs) .
The network topological indexes used in this study
Part I: network indexes for individual nodes
is the connection strength between nodes i and j.
It is also called node degree. It is the most commonly used concept for desibing the topological property of a node in a network.
is the number of shortest paths between nodes j and k that pass through node i.
It is used to desibe the number of geodesic paths that pass through the ith node. High Stress node can serve as a broker.
is the total number of shortest paths between j and k.
It is used to desibe the ratio of paths that pass through the ith node. High Betweenness node can serve as a broker similar to stress centrality.
M(i) is the set of nodes that are connected to the ith node and λ is a constant eigenvalue.
It is used to desibe the degree of a central node that it is connected to other central nodes.
l i is the number of links between neighbors of node i and k i ’ is the number of neighbors of node i.
It desibes how well a node is connected with its neighbors. If it is fully connected to its neighbors, the clustering coefficient is 1. A value close to 0 means that there are hardly any connections with its neighbors. It was used to desibe hierarchical properties of networks.
E is the global efficiency and E i is the global efficiency after the removal of the node i and its entire links.
It measures the deease of node i on the system performance if node i and all associated links are removed.
Part II: The overall network topological indexes
k i is degree of node i and n is the number of nodes.
Higher avgK means a more complex network.
Average geodesic distance
d ij is the shortest path between node i and j.
A smaller GD means all the nodes in the network are closer.
all parameters shown above.
It is the opposite of GD. A higher E means that the nodes are closer.
Harmonic geodesic distance
E is geodesic efficiency.
The reciprocal of E, which is similar to GD but more appropriate for disjoint graph.
Centralization of degree
max(k) is the maximal value of all connectivity values and k i represents the connectivity of i th node. Finally this value is normalized by the theoretical maximum centralization score.
It is close to 1 for a network with star topology and in contrast close to 0 for a network where each node has the same connectivity.
Centralization of betweenness
max(B) is the maximal value of all betweenness values and B i represents the betweenness of i th node. Finally this value is normalized by the theoretical maximum centralization score.
It is close to 0 for a network where each node has the same betweenness, and the bigger the more difference among all betweenness values.
Centralization of stress centrality
max(SC) is the maximal value of all stress centrality values and SC i represents the stress centrality of i th node. Finally this value is normalized by the theoretical maximum centralization score.
It is close to 0 for a network where each node has the same stress centrality, and the bigger the more difference among all stress centrality values.
Centralization of eigenvector centrality
max(EC) is the maximal value of all eigenvector centrality values and EC i represents the eigenvector centrality of i th node. Finally this value is normalized by the theoretical maximum centralization score.
It is close to 0 for a network where each node has the same eigenvector centrality, and the bigger the more difference among all eigenvector centrality values.
l is the sum of total links and l exp is the number of possible links.
It is closely related to the average connectivity.
Average clustering coefficient
is the clustering coefficient of node i.
It is used to measure the extent of module structure present in a network.
l i is the number of links between neighbors of node i and k i ’ is the number of neighbors of node i.
Sometimes it is also called the entire clustering coefficient. It has been shown to be a key structural property in social networks.
W is the number of pairs of nodes that are not reachable.
It is one of the most important measurements for summarizing hierarchical structures. Con is 0 for graph without edges and is 1 for a connected graph.
Common characters of complex networks
It is a most notable characteristic in complex systems. It was used to desibe the finding that most nodes in a network have few neighbors while few nodes have large amount of neighbors. In most cases, the connectivity distribution asymptotically follows a power law . It can be expressed in , where P(k) is the number of nodes with k degrees, k is connectivity/degrees and γ is a constant.
It is a terminology in network analyses to depict the average distance between nodes in a network is short, usually logarithmically with the total number of nodes . It means the network nodes are always closely related with each other.
It was used to demonstrate a network which could be naturally divided into communities or modules . Each module in gene regulation networks is considered as a functional unit which consisted of several elementary genes and performed an identifiable task [23, 46]. A modularity value can be calculated by Newman’s method  which was used to measure how well a network is able to be separated into modules. The value is between 0 to 1.
It was used to depict the networks which could be arranged into a hierarchy of groups representing in a tree structure. Several studies demonstrated that metabolic networks are usually accompanied by a hierarchical modularity [37, 44]. It was potentially consistent with the notion that the accumulation of many local changes affects the small highly integrated modules more than the larger, less integrated modules . One of the most important signatures for hierarchical modular organizations is that the scaling of clustering coefficient follows C(k) ~ k −γ (scaling law), in which k is connectivity and γ is a constant .
Topological properties of the empirical molecular ecological networks (MENs) of additional miobial communities and their associated random MENs a
Habitats of communitiesb
Similarity threshold (s t )
Network size (n)
R2 of power law
R2 of scaling law
Average path (GD)
Average Clustering coefficient (avgCC)
Modularity & (the number of modules)
Average path (GD)
Average clustering coefficient (avgCC)
Grassland soils under elevated CO2, MN (i)
3.00 ± 0.03
0.099 ± 0.009
0.31 ± 0.01
Grassland soils under ambient CO2, MN (i)
3.84 ± 0.06
0.028 ± 0.007
0.52 ± 0.01
Lake sediment, Lake DePue, WI (ii)
3.46 ± 0.05
0.046 ± 0.010
0.45 ± 0.01
Groundwater, Well 101–2, Oak Ridge, TN (iii)
3.13 ± 0.07
0.081 ± 0.017
0.40 ± 0.01
Groundwater Well 102–2, Oak Ridge, TN (iii)
3.89 ± 0.08
0.033 ± 0.012
0.53 ± 0.01
Groundwater Well 102–3, Oak Ridge, TN (iii)
3.54 ± 0.09
0.049 ± 0.013
0.48 ± 0.01
Phylogenetic MENs (454 pyrosequencing)
Grassland soils under warming, Norman, OK (iv)
3.94 ± 0.20
0.020 ± 0.008
0.44 ± 0.01
Grassland soils under unwarming, Norman, OK (iv)
3.39 ± 0.23
0.038 ± 0.010
0.47 ± 0.01
Grassland soils under elevated CO2, MN (i)
3.98 ± 0.22
0.015 ± 0.006
0.61 ± 0.02
Grassland soils under ambient CO2, MN (i)
4.10 ± 0.20
0.017 ± 0.005
0.59 ± 0.01
Agricultural soil, Africa (v)
3.99 ± 0.04
0.020 ± 0.004
0.48 ± 0.01
Human intestine, Stanford, CA (vi)
4.23 ± 0.10
0.025 ± 0.009
0.58 ± 0.01
Scale-free, small-world, modularity and hierarchy are common network properties in many complex systems (Table 2) [8, 53, 54]. The overall topology indices (Table 3) revealed that all curves of network connectivity distribution were fitted well with the power-law model (R2 values from 0.74 to 0.92), indicative of scale-free networks. Also, the average path lengths (GD) were 3.09 to 5.08, which were close to logarithms of the total number of network nodes and comparable to those in other networks displaying small-world behavior, suggesting that the MENs in these microbial communities had the typical property of small world. For modularity, all modularity values (M) were from 0.44 to 0.86, which were significantly higher than the M values from their corresponding randomized networks, Therefore, all constructed MENs appeared to be modular. Finally, the hierarchy property was examined by the scaling of clustering coefficient. R2 values of the linear relationship between logarithms of clustering coefficients and the logarithms of connectivity ranged from 0.10 to 0.73, indicating the hierarchical behavior was quite variable. MENs from certain habitats may have highly hierarchical structures like sediment samples from Lake DePue (0.73), but others may not (Table 3). Overall, our constructed MENs from different habitats clearly exhibit scale free, small world and modularity properties, but hierarchy property is displayed on certain networks.
Modularity is a very important concept in ecology. It could originate from specificity of interactions (e.g. predation, pollination), habitat heterogeneity, resource partition, ecological niche overlap, natural selection, convergent evolution, and phylogenetic relatedness, and it could be important for system stability and resilience . In MENs, a module in the network is a group of OTUs that are highly connected among themselves, but had much fewer connections with OTUs outside the group. Random matrix theory-based approach is able to delineate separate modules, but some modules could still be very big.
After modules and submodules are determined, the eigengene analysis is used to reveal higher order organizations in the network structure [60–62]. In the eigengene analysis, each module is represented by its singular value decomposition (SVD) of abundance profile called module eigengene . In the warming pMEN, the module eigengenes from top 10 large submodules (≥8 nodes) explained 30 - 68 % variations of relative abundance across different replicates, suggesting that these eigengenes represented the module profiles relatively well. The correlations among module eigengenes were used to define the eigengene network. Eigengene analysis is important for revealing higher order organization and identifying key populations based on network topology . In warming pMEN, these correlations of 10 largest submodules were visualized as a heat-map and hierarchical clustering diagram (Figure 4B). The eigengenes within several groups of submodules showed significant correlations and clustered together as super-groups, such as #6 and #8, #2, #5 and #3, and #1, and #7 and #9, which were referred as meta-modules that exhibit a high order organization among submodules. Besides, within each module, eigengene analysis approach was able to show the representative abundance profile and identify key members as shown in our previous paper .
Different nodes play distinct topological roles in the network . The analysis of modular topological roles is important to identify key populations or functional genes based on the nodes’ roles in their own modules. Their topological roles can be defined by two parameters, within-module connectivity (z i ) and among-module connectivity (P i ). The topological roles of nodes in warming and unwarming pMENs were illustrated in ZP-plot (Figure 4C). According to values of z i and P i , the roles of nodes were classified into four categories: peripherals, connectors, module hubs and network hubs. From ecological perspectives, peripherals might represent specialists whereas module hubs and connectors were close to generalists and network hubs as super-generalists . Here, the majority of OTUs (90.9 %) under warming and unwarming conditions were peripherals with most of their links inside their own modules. A total of 26 nodes (7.9 %) were connectors and only four nodes (1.2 %) were module hubs. Those four OTUs as module hubs were derived from Planctomyces (Planctomycetes), Nocardioides (Actinobacteria) under warming condition, and Thermoleophilum (Actinobacteria) and GP4 (Acidobacteria) under unwarming condition, indicating that the hubs of pMENs were substantially different under different conditions.
The partial Mantel tests on connectivity vs. the OTU significances of soil geochemical variables and soil temperature in warming pyrosquencing molecular ecological network
GSof soil geochemistryapartialGSof temperature
GSof temperature partialGSof soil geochemistry
All detected OTUs
The network analysis component is further divided into three major parts:
Network characterization. Various network properties are calculated and evaluated, such as connectivity, betweenness, clustering coefficient, and geodesic distance. The module/submodule detection and modularity analyses is performed using fast greedy modularity optimization . Eigengene network analysis is performed to understand network characteristics at higher organization levels and to identify key microbial populations or key functional genes in terms of network topology.
Network visualization. An automatic pipeline is constructed to visualize the constructed network. Moreover, the file format for software Cytoscape 2.6.0  is prepared to visualize more complex and delicate network graphs. Other data associated with OTUs, such as taxonomy, relative abundance, edge information, and positive and negative correlations is imported and visualized in network figures.
Network comparison. Various randomization methods like the Maslov-Sneppen method  are used obtain random networks for network comparison. Various indices are evaluated for comparing the differences of networks among different communities in terms of sensitivity and robustness. In addition, OTU significances are calculated to reveal associations of the network structure to the ecological functional traits .
Most previous studies on the biodiversity of microbial communities have been focused on the number of species and the abundance of species, but not interactions among species. However, species interactions could be more important to ecosystem functioning than species richness and abundance, especially in complex ecosystems [1, 27–29]. Several recent analyses show that the ecological networks of ecosystems are highly structured [1, 69, 70], thus ignoring the structure of network and the interactions among network components precludes further assessment of biodiversity and its dynamics. Several recent breakthroughs have been made to analyze species interactions of animals and plants [1, 4, 31, 55, 70, 71], but it is difficult to detect network interactions of a microbial community [72–74]. Therefore, in this study, we systematically described a mathematical and bioinformatic framework of MENA based on RMT, a powerful method well established in quantitative physics [23, 75, 76]. Our results demonstrate that the RMT-based approach is powerful in discerning network interactions in microbial communities.
The network approach described is based on the transition of two universal distributions from the random matrix theory. A major advantage of RMT method is that the threshold to construct network is automatically determined. In contrast, most other methods studies use arbitrary thresholds, which are usually based on limited knowledge of biological information [8, 72–74, 77]. RMT-based approach selects an optimal threshold without ambiguity, which ensures its construction of optimal networks. Another advantage of RMT-based approach is its remarkable capacity in tolerating noise, resulting in reliable, robust networks. Our results show even with 100 % Gaussian noise, more than 85 % nodes from original network are still preserved. This characteristic could be very important for dealing with the large-scale data, such as metogenomics and micrrorrays, which are generally inherent with high noise.
Nevertheless, characterizing ecological network of microbial communities poses major challenges. MENs are constructed by the adjacency matrix originated from the pair-wise correlations of relative OTU abundance across different samples. Therefore, a network interaction between two OTUs or genes describes the co-occurrence of these two OTUs or genes across different samples. The co-occurrence might be caused by species or genes performing similar or complementary functions, or shared environmental conditions that microbial species coexist in . However, the former possibility can be complicated by the observations that functionally redundant genes are not necessary co-regulated, but instead co-regulated through other genes, which is coined as transitive co-regulation . The latter possibility can be complicated by the distinctiveness of individuals in microbial niches observed in their behaviors and responses to environmental perturbation . Therefore, caution must be taken for the interpretation of underlying mechanisms that shape microbial communities.
A long-held tenet is that the structure of ecological networks has significant influence on the dynamics [1, 80]. Most complex systems have common characteristics such as small world, scale-free, modularity and hierarchy [8, 53, 54]. Consistently, MENs were found to be scale-free, small world and modular, in addition to hierarchical property in some MENs. These network properties are important for the robustness and stability of complex systems [8, 27, 28, 81]. For example, our results showed that any two microbial species in the community can be linked by just a few other neighbor species, showing small-world property. This may imply that the energy, materials and information can be easily transported through entire systems. In microbial communities, this characteristic drives efficient communications among different members so that relevant responses can be taken rapidly to environmental changes. Meanwhile, it is intriguing to note that modularity is prevailing in MENs, while hierarchy is present only in some MENs. Research on a wide range of architectural patterns in mutualistic (pollination) and trophic (predation) networks showed that hierarchy, also called nestedness, was strong in mutualistic networks, but that modularity was strong in trophic networks . Although ecological networks of microbial communities are very complicated and cannot be classified into simple mutualistic or trophic networks, it would be interesting to compare a number of ecological networks of microbial communities to catalog different architectural patterns and to explore the mechanisms underlying the stability and resilience of communities.
In addition to interactions among microbes within a community, MENs allow for analyses of interactions with their environment through correlations with abiotic environmental measurements, which might provide insights on the conditions that have significant impact on the co-occurring organisms. It is also possible to link groups of organisms with biogeochemical measurements to reveal the functional role of organism in biogeochemical processes. These kinds of data are important for generating hypotheses to help explain natural environments that microbial communities reside, which might lead to forecasting responses of microbial communities when environment changes .
In summary, our study provides a mathematical/bioinformatic framework for network construction based on metagenomics data such as sequencing  and microarray hybridization data . It is useful, as demonstrated with the microbial communities under experimental warming, for dissecting interactions within a microbial community as well as with environment, thus allowing microbial ecologists to address a variety of ecological questions at the community-wide scale [83, 84]. It is also possible to extend MENA to emerging fields of microbial ecology such as high-throughput proteomics, since RMT is not stringent on data types. In addition, broad application of MENA will generate a number of ecological networks that allow for exploration of architectural patterns of microbial communities . This RMT-based molecular ecological network analysis provides powerful tools to elucidate network interactions in microbial communities and their responses to environmental changes, which are fundamentally important for research in microbial ecology, systems microbiology, and global change.
The network construction begins with a data table with n distinct operational taxonomic units (OTUs) based on 16 S rRNA genes or functional genes observed across m replicates or samples. Typically OTUs are used to refer taxonomic classification based on ribosomal RNA genes. For convenience, in the following sections, we use OTUs to refer the classifications derived from both 16 S rRNA genes and/or functional genes. Let y ik represent the abundance or relative abundance of the i-th OTU in the k-th sample ( , ) and Y nxm = [y ik ] is the abundance matrix. Usually, the abundance profile of i-th OTU is standardized as below. If the mean and standard deviation of y i across all samples are and , the standardized abundance of the i-th OTU in the k-th sample is , where x ik has mean value of 0 and variance value of 1. X nxm is the standardized data matrix and used for subsequent correlation analysis.
Molecular ecological networks can be built on the basis of the measurements of relative OTU abundance in microbial communities. In MENs, each OTU corresponds to a node. Each network corresponds to an adjacency matrix (or interaction matrix), A nxn = a ij , which encodes the connection strength between each pair of nodes . In an unweighted network, the adjacency a ij =1 if nodes i and j are connected, and a ij =0 otherwise . For an undirected network, the adjacency matrix is symmetric. In weighted network, the pairwise adjacency has values between 0 and 1, i.e., 0 ≤ a ij ≤ 1. The adjacency matrix is the foundation of all subsequent steps in network analysis.
where s tb is the threshold parameter. The resulting adjacency matrix, A p×p, is generally smaller than the similarity matrix because the rows or columns are removed if all of their elements are less than the threshold value.
The structure of relevance network strongly depends on the threshold value, s t . In some network analysis, the threshold value is chosen arbitrarily based on known biological information or set by the empirical study . Thus, the resulting network is more or less subjective [19, 20, 85, 87]. However, it is difficult to select appropriate thresholds, especially for poorly studied organisms/communities. In MENA, we use the random matrix theory (RMT)-based approach, which is able to identify the threshold automatically based on the data structure itself [22, 46] to select the final threshold parameter, s t .
Initially proposed by Wigner and Dyson in the 1960s for studying the spectrum of complex nuclei , random matrix theory (RMT) is a powerful approach for identifying and modeling phase transitions associated with disorder and noise in statistical physics and materials science. It has been successfully used for studying the behavior of different complex systems, such as spectra of large atoms , metal insulator transitions in disorder systems, spectra of quasiperiodic systems , chaotic systems , the stock market , brain response , gene co-expression networks  and protein interaction networks . However, its suitability for complex biological systems, especially microbial communities, remains largely unexplored.
RMT predicts two universal extreme distributions of the nearest neighbor spacing distribution (NNSD) of eigenvalues: Gaussian orthogonal ensemble (GOE) statistics, which corresponds to random properties of complex system, and Poisson distribution, which corresponds to system-specific, nonrandom properties of complex systems . These two different universal laws depend on the properties of the matrix. On one hand, if consecutive eigenvalues are completely uncorrelated, the NNSD follows Poisson statistics. Considering a series of eigenvalues, the probability of an eigenvalue falling in a scale D D + s is independent of the start point D, where s can be any positive values. It means the probability of an eigenvalue falling in any scales with certain length s will be identical, no matter where the scales begin. The NNSD under such assumption follows a Poisson random process, so-called exponential distribution of Poisson process . On the other hand, for correlated eigenvalues, the NNSD has Gaussian orthogonal ensemble (GOE) statistics. Given a series of correlated eigenvalues, the probability of one eigenvalue falling into a scale D D + s is proportional to s. Wigner illustrated that the NNSD under this assumption was closely to Gaussian orthogonal ensemble so-called Wigner surmise.
The key concept of RMT is to mainly concern with the local property between eigenvalues rather than the global property of a series of eigenvalues. Here, the local property between eigenvalues means the eigenvalue fluctuations and the global property is the average eigenvalue density. In order to reveal the fluctuations of eigenvalues, the average eigenvalue density has to be removed from system so that the average eigenspacing is constant. Also, this procedure to generate a uniform eigenvalues distribution is called unfolding. The unfolded eigenvalues will fall between 0 and 1, and its density does not depend on the overall level distribution. Consider a sequence of eigenvalues from adjacency matrix, and those eigenvalues have been ordered as . In practice, we replace eigenvalues λ i with where N av is the continuous density of eigenvalues obtained by fitting and smoothing the original integrated density of eigenvalues to a cubic spline or by local density average.
We use the χ 2 goodness-of-fit test to assess whether NNSD follows Wigner-Dyson distribution or Poisson distribution. We assume that the NNSD of any biological system obeys these two extreme distributions [22, 23, 27, 28], and that there is a transition point from GOE to Poisson distribution, and this transition point can be used as the threshold for defining adjacency matrix.
The following major steps are used to define the threshold (s t ) based on the standardized relative abundance of OTUs across different samples (Figure 2).
Calculate the Pearson correlation matrix, R nxn, based on the standardized relative abundance of OTUs, X nxm with n distinct OTUs across m samples.
Obtain similarity data, S nxn, by taking the absolute value of correlation matrix R n×n .
Set an initial threshold value, s tb (e.g., 0.3 based on our experiences).
Calculate the adjacency matrix, A pxp = [a ij ] according to s tb , where p is the number of OTUs retained in the adjacency matrix with non-zero rows or columns.
Calculate eigenvalues λ i of the adjacency matrix based on the equation , where λ is the eigenvalue, v is the corresponding eigenvector, and I is the identity matrix. Because S is the symmetric matrix and v is a non-zero vector, we can get p number of eigenvalues to solve the equation . To test NNSD distribution, order the eigenvalues as
To get unfolded eigenvalues, replace λ i with , where N av is the continuous density of eigenvalues and can be obtained by fitting the original integrated density to a cubic spline or by local average.
Calculate the nearest neighbor spacing distribution of eigenvalues, P(d), which defines the probability density of unfolded eigenvalues spacing,
Using the χ 2 goodness-of-fit test to determine whether the probability density function P(d) follows the exponential distribution of Poisson statistic, .H0: P(d) follows the Poisson distribution.H1: P(d) does not follows the Poisson distribution.The χ 2 goodness-of-fit test has the test statistics, , where d i is the observed nearest neighbor spacing and E(d i ) is an expected (theoretical) nearest neighbor spacing from Poisson distribution. The resulting χ 2 value is compared to the χ 2 distribution. Let be the critical value at a significant level of 0.01 based on χ 2 distribution with u degrees of freedom.
If , the null hypothesis H0 is not rejected. Then go to step (j).If , the null hypothesis H0 is rejected. Then, increase the threshold by 0.1, s tb + 0.1, and repeat the steps from (e) to (h).
Find a finer scale threshold value by increasing the threshold with 0.01 within the range of [s tb -0.1, s tb ]. Then repeat the steps from (e) to (h).
If H0 is accepted, i.e., the P(d) follows Poisson distribution, the finer scale threshold identified is used as the optimal threshold for defining the adjacency matrix.
where s t is the final threshold parameter. Two nodes are linked if the similarity between their abundance profiles across all samples is equal to 1.
Once MENs are determined, various network topology indices can be calculated based on the adjacency matrix (Table 1). The overall topological indices describe the overall network topology in different views and thus are useful in characterizing various MENs identified under different situations. The indices for describing individual nodes are useful in assessing their roles in the network.
Scale-free, small world, modularity and hierarchy are most common network characteristics of interest [8, 53, 93]. A scale-free network is a network whose connectivity follows a power law, at least asymptotically , that is, only a few nodes in the network have many connections with other nodes while most of nodes have only a few connections with other nodes. It can be expressed by , where k is connectivity and λ is a constant. A small-world network is the network in which most nodes are not neighbors of one another, but most nodes can be reached by a few paths (typically, less than 6). Small world network has a small average shortest path (GD) typically as the logarithm of the number of nodes . In addition, there is no formal definition for hierarchical topology . One of the most important signatures for hierarchical, modular organizations is that the scaling of clustering coefficient follows C(k) ~ k −γ , in which k is connectivity and γ is a constant. By log-transformation, logC(k) ~ −γlog(k), the logarithms of clustering coefficients have a linear relationship with the logarithms of connectivity.
where N M is the number of modules in the network, l b is the number of links among all nodes within the b th module, L is the number of all links in the network, and is the sum of degrees (connectivity) of nodes which are in the b th module. M measures the extension whose nodes have more links within their own modules than expected if linkage is random. It varies with the range of [−1, 1].
Several different algorithms can be used to separate modules, including short random walks, leading eigenvector of the community matrix, simulated annealing approach, and fast greedy modularity optimization [56, 57]. The algorithm of short random walks is based on the idea that all random walks tend to stay in the densely connected parts of a graph that was corresponding to the modules . After calculating a distance between two nodes or between sets of nodes by random walk algorithm, it uses a hierarchical clustering approach to present the structural similarities between all nodes. Thereafter this approach will choose the best partition automatically. The advantage of this algorithm is efficient and fast computation.
Once the network modularity value (M) was explicitly defined, theoretically the module structure can be determined by maximizing M values over all possible divisions of network. However, exhaustive maximization over all divisions is computational intractable . The algorithm of leading eigenvector is one of several approximate optimization methods have been proven effectively obtained higher M values with high speed. It simplified the maximization process in terms of a modularity matrix Bnxn that can be obtained by the adjacent matrix Anxn subtracting an expected edges matrix Pnxn from a null model. Then the network can be split into two groups by finding the leading eigenvector that was corresponding to the largest positive eigenvalue of modularity matrix. This splitting process can be looped until any further divisions will not increase the M value . This method shows more accurate separations than other algorithms in several well-studied social networks .
The algorithm of simulated annealing approach usually produces the best separation of the modules by direct maximization of M. The simulated annealing is a stochastic optimization technique to find “low cost” configurations . It carries out the exhaustive search on network structures to merge and split priori-modules and move individual nodes from one module to another. Although this is a time-consuming process, it is expected to obtain clear module separations with a higher M.
The algorithm of fast greedy modularity optimization is to isolate modules via directly optimizing the M score [66, 97]. It starts with treating each node as the unique member of one module, and then repeatedly combines two modules if they generate the largest increase in modularity M. This algorithm has advantages with fast speed, accurate separations and ability to handle huge networks [66, 97].
where is the number of links of node i to other nodes in its module b and are the average and standard deviation of within-module connectivity, respectively over all the nodes in module b, is the number of links of node i in the whole network, is the number of links from node i to nodes in module c, and N M is the number of modules in the network.
The within-module connectivity, z i , describes how well node i is connected to other nodes in the same module, and the participation coefficient, P i , reflects what degree that node i connects to different modules. P i is also referred as the among-module connectivity . If all links of node i only belong to its own module, P i = 0. If the links of node i are distributed evenly among modules, P i → 1. The topological roles of individual nodes can be assigned by their position in the z-parameter space. Originally, Guimera et al. [33, 59] divided the topological roles of individual nodes into seven categories. Olesen et al.  simplified this classification into four categories for pollination networks. In this study, we use the simplified classification as follows: (i) Peripheral nodes (z i ≤ 2.5, P i ≤ 0.62), which have only a few links and almost always to the nodes within their modules, (ii) Connectors (z i ≤ 2.5, P i > 0.62), which are highly linked to several modules, (iii) Module hubs (z i > 2.5, P i ≤ 0.62), which are highly connected to many nodes in their own modules, and (iv) Network hubs (z i > 2.5, P i > 0.62), which act as both module hubs and connectors. From ecological perspective, peripheral nodes represent specialists whereas the other three are generalists.
One of the grand challenges in dealing with high throughput metagenomics data is the high dimensionality. Various statistical approaches are used to reduce dimensions and extract major features, including principal component analysis (PCA), detrended correspondence analysis (DCA), and singular value decomposition (SVD). SVD is an orthogonal linear transformation of data (e.g., microbial data) from the complexity to the comprehensibility . Based on SVD analysis, the Eigengene is a linear combination of genes and eigenvalues. In the diagonalized data, each eigengene is just expressed in the corresponding eigen arrays. Langfelder and Horvath  proposed eigengene network analysis to summarize the gene expression data from each module as a centroid. Eigengene network analysis is powerful to reveal higher order organization among gene co-expression modules [33, 61, 62]. Here, we have adopt this method to analyze modules in MENs.
where both and are column-orthogonal matrices, and is a diagonal matrix of the singular values . The matrices V b and D b are denoted as and
Assuming that the singular values are arranged in decreasing order, the first column of V b is referred as the Module Eigen-gene, E b , for the b-th module. That is, .
Generally, the module eigen-gene can explain approximately 50 % or more of the variance of the OTU abundances in the module . Since PCA and SVD are identical if each OTU relative abundance has been standardized to mean 0 and variance 1, E b is the first principal component based on PCA analysis .
If is close to 1 or −1, it is evident that the i-th OTU is close to the centroid of module b.
Since only a single data point is available for each network parameter, we are not able to perform standard statistical analyses to assess statistical significances. Similar to the concept of hypothesis testing, the null model is generated to assess the performance of the alternative model. Thus, the random networks are generated to compare different complex networks using the Maslov-Sneppen procedure . The Maslov-Sneppen method keeps the numbers of nodes and links unchanged but rewires the positions of all links in the MENs so that the sizes of networks are the same and the random rewired networks are comparable with original ones. This method has been typically used for ecological network analyses . For each network identified, a total of 100 randomly rewired networks are usually generated by the Maslov-Sneppen procedure  and all network indices are calculated individually for each randomized network. Then the average and standard deviation for each index of all random networks are obtained. The statistical Z-test is able to test the differences of the indices between the MEN and random networks. Meanwhile, for the comparisons between the network indices under different conditions, the Student t-test can be employed by the standard deviations derived from corresponding random networks.
where x i is the relative abundance of the i-th OTU and T h is the h-th sample trait (e.g. soil pH, N content, total plant biomass) ( ). Since the measurement units for different traits vary, all trait data should be standardized prior to statistical analysis. Finally, an OTU significance matrix, GS nxg, is obtained.
To discern the relationships between molecular ecological networks and soil properties, Mantel tests can be performed . The relationships between the MENs and environmental variables were determined as follows: First, the significances of variables are calculated with the above equation (Eq 13) and the OTU significance matrix is generated. Then the Euclidean distance matrix is generated by calculating the Euclidean distance between every two OTUs. The distance matrix among all OTUs’ connectivity ( ) was calculated as well. In addition, Mantel tests are performed between the distance matrices of the connectivity ( ) and OTU significance ( ) to examine the relationships between network structure (i.e., connectivity) and soil variables. The Mantel tests were performed using the programs available in R vegan package .
This work has been supported, through contract DE-SC0004613 and contract DE-AC02-05CH11231 (as part of ENIGMA, a Scientific Focus Area) and contract DE-SC0004601, by the US Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomics: GTL Foundational Science, the United States Department of Agriculture (Project 2007-35319-18305) through NSF-USDA Microbial Observatories Program, the Oklahoma Bioenergy Center (OBC) of State of Oklahoma, and State Key Joint Laboratory of Environment Simulation and Pollution Control (Grant 11Z03ESPCT) at Tsinghua University.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.