### Defining adjacency matrix

Molecular ecological networks (MENs) can be built from measurements of relative OTU abundance in microbial communities. In a MEN, each OTU corresponds to a node. Each network corresponds to an adjacency matrix (or interaction matrix), *A*^{n×n} = [*a*_{ij}], which encodes the connection strength between each pair of nodes [20]. In an unweighted network, the adjacency *a*_{ij} = 1 if nodes *i* and *j* are connected, and *a*_{ij} = 0 otherwise [20]. For an undirected network, the adjacency matrix is symmetric. In a weighted network, the pairwise adjacency takes values between 0 and 1, i.e., 0 ≤ *a*_{ij} ≤ 1. The adjacency matrix is the foundation of all subsequent steps in network analysis.

To define the adjacency matrix, the similarity of OTU abundance across all samples must be measured first. This similarity measures the degree of concordance between the abundance profiles of OTUs across different samples. As in widely used gene co-expression analyses [20, 61, 85, 86], pairwise Pearson correlation coefficients (*r*_{ij}) are used to measure the similarity between the *i*-th and *j*-th OTUs across different samples. Let *R*^{n×n} = [*r*_{ij}] be the Pearson correlation matrix; then

$$
r_{ij} = \frac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)^2 \, \sum_{k=1}^{m} (x_{jk} - \bar{x}_j)^2}}
$$

where *x*_{ik} and *x*_{jk} are the standardized abundances of the *i*-th and *j*-th OTUs in the *k*-th sample, and $\bar{x}_i$ and $\bar{x}_j$ are the mean abundances of the *i*-th and *j*-th OTUs over all samples. In general, the absolute value of the correlation coefficient (*r*_{ij}) is used to define the abundance similarity between the *i*-th and *j*-th OTUs (*s*_{ij}), that is

$$
s_{ij} = |r_{ij}|
$$

Let *S*^{n×n} = [*s*_{ij}] be the similarity matrix of OTU abundance. In molecular ecological network analysis, the adjacency matrix is derived from the OTU abundance similarity matrix by applying a threshold. As in relevant gene co-expression network analyses [20, 61, 85, 86], nodes are connected if they have significant pairwise similarities (i.e., correlations) across different samples. Thus, using a threshold value (*s*_{tb}), the OTU abundance similarity matrix, *S*^{n×n} = [*s*_{ij}], is converted into the adjacency matrix, *A*^{p×p} = [*a*_{ij}], where p ≤ n. The adjacency *a*_{ij} between the *i*-th and *j*-th OTUs is defined by thresholding the OTU abundance similarity [33]:

$$
a_{ij} =
\begin{cases}
1, & \text{if } s_{ij} \ge s_{tb} \text{ and } i \ne j \\
0, & \text{otherwise}
\end{cases}
$$

where *s*_{tb} is the threshold parameter. The resulting adjacency matrix, *A*^{p×p}, is generally smaller than the similarity matrix because rows and columns are removed if all of their elements are below the threshold value.
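As a concrete illustration, the pipeline from abundance table to unweighted adjacency matrix can be sketched in a few lines of Python (a minimal NumPy sketch; the function name and the toy data are ours, not part of MENA):

```python
import numpy as np

def adjacency_from_abundance(X, s_t):
    """Build an unweighted adjacency matrix from an OTU abundance table.

    X   : (n OTUs) x (m samples) array of standardized relative abundances.
    s_t : similarity threshold in [0, 1].
    """
    R = np.corrcoef(X)                 # Pearson correlation matrix, n x n
    S = np.abs(R)                      # similarity s_ij = |r_ij|
    A = (S >= s_t).astype(int)
    np.fill_diagonal(A, 0)             # no self-links
    keep = A.sum(axis=1) > 0           # drop all-zero rows/columns, so p <= n
    return S, A[np.ix_(keep, keep)]

# Toy data: 4 OTUs across 5 samples; OTU 1 closely tracks OTU 0
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
X[1] = X[0] + 0.01 * rng.normal(size=5)
S, A = adjacency_from_abundance(X, s_t=0.9)
```

The returned adjacency matrix is symmetric with a zero diagonal, matching the unweighted, undirected definition above.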

### Determining the threshold by random matrix theory-based approach

The structure of a relevance network strongly depends on the threshold value, *s*_{t}. In some network analyses, the threshold value is chosen arbitrarily based on known biological information or set empirically [8]; the resulting network is thus more or less subjective [19, 20, 85, 87]. It is difficult to select appropriate thresholds, especially for poorly studied organisms/communities. In MENA, we use the random matrix theory (RMT)-based approach, which identifies the threshold automatically from the data structure itself [22, 46], to select the final threshold parameter, *s*_{t}.

### Basic concept of RMT

Initially proposed by Wigner and Dyson in the 1960s for studying the spectra of complex nuclei [88], random matrix theory (RMT) is a powerful approach for identifying and modeling phase transitions associated with disorder and noise in statistical physics and materials science. It has been successfully used for studying the behavior of different complex systems, such as the spectra of large atoms [89], metal-insulator transitions in disordered systems, spectra of quasiperiodic systems [90], chaotic systems [91], the stock market [76], brain responses [92], gene co-expression networks [22] and protein interaction networks [46]. However, its suitability for complex biological systems, especially microbial communities, remains largely unexplored.

RMT predicts two universal extreme distributions of the nearest neighbor spacing distribution (NNSD) of eigenvalues: Gaussian orthogonal ensemble (GOE) statistics, which correspond to the random properties of complex systems, and the Poisson distribution, which corresponds to system-specific, nonrandom properties of complex systems [89]. Which of the two universal laws applies depends on the properties of the matrix. On one hand, if consecutive eigenvalues are completely uncorrelated, the NNSD follows Poisson statistics. For such a series of eigenvalues, the probability of an eigenvalue falling in an interval [*D*, *D* + *s*] is independent of the starting point *D*, where *s* can be any positive value; that is, the probability of an eigenvalue falling in any interval of a given length *s* is identical, no matter where the interval begins. The NNSD under this assumption follows a Poisson random process, i.e., the exponential distribution of a Poisson process [89]. On the other hand, for correlated eigenvalues, the NNSD follows Gaussian orthogonal ensemble (GOE) statistics. Given a series of correlated eigenvalues, the probability of one eigenvalue falling into an interval [*D*, *D* + *s*] is proportional to *s*. Wigner showed that the NNSD under this assumption closely follows the Gaussian orthogonal ensemble, the so-called *Wigner surmise* [89].

The key feature of RMT is that it is mainly concerned with the *local property* of a series of eigenvalues rather than their *global property*. Here, the local property refers to the eigenvalue fluctuations, and the global property is the average eigenvalue density. To reveal the fluctuations of the eigenvalues, the average eigenvalue density has to be removed from the system so that the average eigenvalue spacing is constant; this procedure, which generates a uniform eigenvalue distribution, is called *unfolding*. The unfolded eigenvalues fall between 0 and 1, and their density does not depend on the overall level distribution. Consider a sequence of eigenvalues *λ*_{1}, *λ*_{2}, …, *λ*_{p} from the adjacency matrix, ordered so that *λ*_{1} ≤ *λ*_{2} ≤ … ≤ *λ*_{p}. In practice, we replace each eigenvalue *λ*_{i} with the unfolded eigenvalue *e*_{i} = *N*_{av}(*λ*_{i}), where *N*_{av} is the continuous density of eigenvalues obtained by fitting and smoothing the original integrated density of eigenvalues with a cubic spline or by local density averaging.

After unfolding the eigenvalues, three statistical quantities can be used to extract information from a sequence of eigenvalues: the eigenvalue spacing distribution *P*(*d*), the number variance of eigenvalues Σ, and the spectral rigidity Δ. *P*(*d*) is the probability density function of the unfolded eigenvalue spacings *d*_{i} = *e*_{i+1} − *e*_{i}, i.e., the NNSD of the eigenvalues. For completely uncorrelated eigenvalues, *P*(*d*) follows Poisson statistics and can be expressed as

$$
P(d) = e^{-d}
$$

On the other hand, for correlated eigenvalues, *P*(*d*) closely follows the Wigner-Dyson distribution of GOE statistics and can be expressed as

$$
P(d) = \frac{\pi d}{2} \, e^{-\pi d^2 / 4}
$$

We use the *χ*^{2} goodness-of-fit test to assess whether the NNSD follows the Wigner-Dyson distribution or the Poisson distribution. We assume that the NNSD of any biological system obeys one of these two extreme distributions [22, 23, 27, 28], that there is a transition point from the GOE to the Poisson distribution, and that this transition point can be used as the threshold for defining the adjacency matrix.

### Algorithms of detecting the threshold value

The following major steps are used to define the threshold (*s*_{t}) based on the standardized relative abundance of OTUs across different samples (Figure 2).

- (a) Calculate the Pearson correlation matrix, *R*^{n×n}, based on the standardized relative abundance of OTUs, *X*^{n×m}, with *n* distinct OTUs across *m* samples.

- (b) Obtain the similarity matrix, *S*^{n×n}, by taking the absolute value of the correlation matrix *R*^{n×n}.

- (c) Set an initial threshold value, *s*_{tb} (e.g., 0.3 based on our experience).

- (d) Calculate the adjacency matrix, *A*^{p×p} = [*a*_{ij}], according to *s*_{tb}, where *p* is the number of OTUs retained in the adjacency matrix with non-zero rows or columns.

- (e) Calculate the eigenvalues of the adjacency matrix *A*^{p×p} and sort them in ascending order.

- (f) To obtain the unfolded eigenvalues, replace each *λ*_{i} with *e*_{i} = *N*_{av}(*λ*_{i}), where *N*_{av} is the continuous density of eigenvalues, obtained by fitting the original integrated density to a cubic spline or by local averaging.

- (g) Calculate the nearest neighbor spacing distribution of the eigenvalues, *P*(*d*), which is the probability density of the unfolded eigenvalue spacings, *d*_{i} = *e*_{i+1} − *e*_{i}.

- (h) Use the *χ*^{2} goodness-of-fit test to test the null hypothesis H_{0} that *P*(*d*) follows the Poisson distribution.

- (i) If H_{0} is not rejected, go to step (j). If H_{0} is rejected, increase the threshold by 0.1 (i.e., replace *s*_{tb} with *s*_{tb} + 0.1) and repeat steps (e) to (h).

- (j) Find a finer-scale threshold value by increasing the threshold in steps of 0.01 within the range [*s*_{tb} − 0.1, *s*_{tb}], repeating steps (e) to (h) at each value.

- (k) If H_{0} is accepted, i.e., *P*(*d*) follows the Poisson distribution, the finer-scale threshold identified is used as the optimal threshold for defining the adjacency matrix.
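The coarse pass of this scan, steps (c) through (i), can be sketched as follows (a simplified sketch: the unfolding uses a low-order polynomial fit of the integrated density as a stand-in for the cubic spline, the finer 0.01 refinement of step (j) is omitted, and the stopping rules are ours):

```python
import numpy as np
from scipy import stats

def poisson_pvalue(A, n_bins=6):
    """p-value of a chi-square test of the unfolded eigenvalue spacings
    of adjacency matrix A against the Poisson law P(d) = exp(-d)."""
    lam = np.sort(np.linalg.eigvalsh(A))
    # crude unfolding: smooth the integrated eigenvalue density with a
    # low-order polynomial (stand-in for the cubic-spline N_av)
    e = np.polyval(np.polyfit(lam, np.arange(1.0, len(lam) + 1), 5), lam)
    d = np.diff(e)
    d = d[d > 0]
    if len(d) == 0:
        return 1.0
    d = d / d.mean()                          # normalize mean spacing to 1
    edges = np.linspace(0.0, 3.0, n_bins + 1)
    obs, _ = np.histogram(d, bins=edges)
    expected = np.diff(1.0 - np.exp(-edges))
    expected = expected * max(obs.sum(), 1) / expected.sum()
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return stats.chi2.sf(chi2, df=n_bins - 1)

def scan_threshold(S, start=0.3, step=0.1, alpha=0.05, min_nodes=20):
    """Raise the threshold from `start` in `step` increments until the
    NNSD is consistent with Poisson statistics (coarse pass only)."""
    s_tb = start
    while s_tb < 1.0:
        A = (S >= s_tb).astype(float)
        np.fill_diagonal(A, 0.0)
        keep = A.sum(axis=1) > 0
        if keep.sum() < min_nodes:            # network has become too small
            break
        if poisson_pvalue(A[np.ix_(keep, keep)]) > alpha:
            break                             # H0 not rejected: stop here
        s_tb += step
    return s_tb

rng = np.random.default_rng(2)
S = np.abs(np.corrcoef(rng.normal(size=(60, 30))))
s_t = scan_threshold(S)
```

Production implementations replace the polynomial unfolding with the spline fit described above and add the finer 0.01 scan of step (j).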

Once the final threshold value *s*_{t} is determined at the finer scale, an adjacency matrix is obtained by retaining all the OTUs whose abundance similarity values are greater than the determined threshold. Currently, we have only adopted the unweighted network in the subsequent network topological analysis. Hence, the final adjacency *a*_{ij} is:

$$
a_{ij} =
\begin{cases}
1, & \text{if } s_{ij} \ge s_t \text{ and } i \ne j \\
0, & \text{otherwise}
\end{cases}
$$

where *s*_{t} is the final threshold parameter. Two nodes are linked (*a*_{ij} = 1) if the similarity between their abundance profiles across all samples is at least *s*_{t}.

### Calculation of MEN topological indices and general features

Once MENs are determined, various network topology indices can be calculated based on the adjacency matrix (Table 1). The overall topological indices describe the network topology from different perspectives and thus are useful for characterizing the various MENs identified under different situations. The indices describing individual nodes are useful for assessing their roles in the network.

Scale-free, small-world, modular, and hierarchical structures are the most common network characteristics of interest [8, 53, 93]. A scale-free network is a network whose connectivity distribution follows a power law, at least asymptotically [94], i.e., *P*(*k*) ~ *k*^{−λ}, where *k* is connectivity and *λ* is a constant: only a few nodes in the network have many connections with other nodes while most nodes have only a few connections. A small-world network is a network in which most nodes are not neighbors of one another, but most nodes can be reached from any other by a few steps (typically fewer than 6). A small-world network has a small average shortest path length (*GD*), typically growing as the logarithm of the number of nodes [43]. In addition, there is no formal definition of hierarchical topology [95]. One of the most important signatures of a hierarchical, modular organization is that the clustering coefficient scales as *C*(*k*) ~ *k*^{−γ}, where *k* is connectivity and *γ* is a constant. After log-transformation, log *C*(*k*) ~ −*γ* log(*k*), so the logarithms of the clustering coefficients have a linear relationship with the logarithms of connectivity.
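Both signatures reduce to fitting a straight line in log-log space, which can be sketched directly (the helper name and the synthetic exponent are ours):

```python
import numpy as np

def loglog_slope(k, y):
    """Least-squares slope of log(y) versus log(k). For P(k) ~ k^-lambda
    this estimates -lambda; for C(k) ~ k^-gamma it estimates -gamma."""
    slope, _intercept = np.polyfit(np.log(k), np.log(y), 1)
    return slope

# Synthetic check: C(k) = k^-1.2 exactly, so the fitted slope is -1.2
k = np.arange(1.0, 50.0)
slope = loglog_slope(k, k ** -1.2)
```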

### Module detection

Modularity is a fundamental characteristic of biological networks as well as many engineered systems [53]. In MENs, a module is a group of OTUs that are highly connected among themselves but have few connections to OTUs outside the group. The maximum modularity score is used to separate the graph into multiple dense sub-graphs, or modules. The modularity of each network (*M*) is estimated using the equation [66]:

$$
M = \sum_{b=1}^{N_M} \left[ \frac{l_b}{L} - \left( \frac{d_b}{2L} \right)^2 \right]
$$

where *N*_{M} is the number of modules in the network, *l*_{b} is the number of links among all nodes within the *b*^{th} module, *L* is the number of all links in the network, and *d*_{b} is the sum of the degrees (connectivities) of the nodes in the *b*^{th} module. *M* measures the extent to which nodes have more links within their own modules than expected if links were placed at random. Its value lies in the range [−1, 1].
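This formula translates directly into code (a minimal NumPy sketch; the two-triangle toy graph is ours):

```python
import numpy as np

def modularity(A, labels):
    """M = sum_b [ l_b / L - (d_b / 2L)^2 ] for an unweighted,
    undirected adjacency matrix A and one module label per node."""
    L = A.sum() / 2.0                            # total number of links
    M = 0.0
    for b in np.unique(labels):
        idx = np.where(labels == b)[0]
        l_b = A[np.ix_(idx, idx)].sum() / 2.0    # links inside module b
        d_b = A[idx].sum()                       # summed degree of module b
        M += l_b / L - (d_b / (2.0 * L)) ** 2
    return M

# Two disconnected triangles, perfectly partitioned into two modules
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
M = modularity(A, np.array([0, 0, 0, 1, 1, 1]))
```

For this graph the correct partition gives M = 0.5, and any partition that mixes the triangles scores lower.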

Several different algorithms can be used to separate modules, including short random walks, the leading eigenvector of the community matrix, simulated annealing, and fast greedy modularity optimization [56, 57]. The short-random-walks algorithm is based on the idea that random walks tend to stay within the densely connected parts of a graph, which correspond to the modules [56]. After calculating a distance between two nodes, or between two sets of nodes, with the random-walk algorithm, it uses hierarchical clustering to represent the structural similarities between all nodes, and then chooses the best partition automatically. The advantage of this algorithm is its efficient and fast computation.

Once the network modularity (*M*) is explicitly defined, the module structure can in theory be determined by maximizing *M* over all possible divisions of the network. However, exhaustive maximization over all divisions is computationally intractable [57]. The leading-eigenvector algorithm is one of several approximate optimization methods that have been shown to obtain high *M* values quickly. It recasts the maximization in terms of a modularity matrix B^{n×n}, obtained by subtracting from the adjacency matrix A^{n×n} an expected-edges matrix P^{n×n} derived from a null model. The network is then split into two groups according to the leading eigenvector, which corresponds to the largest positive eigenvalue of the modularity matrix. This splitting process is repeated until further division no longer increases *M* [57]. This method yields more accurate separations than other algorithms in several well-studied social networks [57].

The simulated annealing approach usually produces the best separation of modules by direct maximization of *M* [58]. Simulated annealing is a stochastic optimization technique for finding "low-cost" configurations [96]. It carries out an extensive search over network partitions, merging and splitting modules and moving individual nodes from one module to another. Although this is a time-consuming process, it is expected to yield clear module separations with a higher *M*.

The fast greedy modularity optimization algorithm isolates modules by directly optimizing the *M* score [66, 97]. It starts by treating each node as the sole member of its own module, and then repeatedly merges the two modules whose combination produces the largest increase in *M*. This algorithm has the advantages of high speed, accurate separations, and the ability to handle very large networks [66, 97].
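A naive version of this greedy merging is easy to write down (a teaching sketch with quartic cost, not the fast heap-based implementation of [66, 97]; the modularity helper repeats the formula from the module-detection section):

```python
import numpy as np

def modularity(A, labels):
    """M = sum_b [ l_b / L - (d_b / 2L)^2 ]."""
    L = A.sum() / 2.0
    M = 0.0
    for b in np.unique(labels):
        idx = np.where(labels == b)[0]
        M += A[np.ix_(idx, idx)].sum() / 2.0 / L - (A[idx].sum() / (2.0 * L)) ** 2
    return M

def fast_greedy(A):
    """Start with singleton modules and repeatedly merge the pair of
    modules that yields the largest increase in M; stop when no merge helps."""
    labels = np.arange(len(A))
    M = modularity(A, labels)
    while True:
        mods = np.unique(labels)
        best_gain, best_pair = 1e-12, None
        for i, a in enumerate(mods):
            for b in mods[i + 1:]:
                trial = labels.copy()
                trial[trial == b] = a             # tentatively merge b into a
                gain = modularity(A, trial) - M
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:                     # no merge improves M: stop
            return labels, M
        a, b = best_pair
        labels[labels == b] = a
        M += best_gain

# Two disconnected triangles are recovered as two modules with M = 0.5
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
labels, M = fast_greedy(A)
```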

### Identification of key module members

After all modules are separated, each node can be assigned a role based on its topological properties [59]. The role of node *i* is characterized by its within-module connectivity (*z*_{i}) and among-module connectivity (*P*_{i}) as follows:

$$
z_i = \frac{k_{ib} - \bar{k}_b}{\sigma_{k_b}}, \qquad
P_i = 1 - \sum_{c=1}^{N_M} \left( \frac{k_{ic}}{k_i} \right)^2
$$

where *k*_{ib} is the number of links of node *i* to other nodes in its module *b*, $\bar{k}_b$ and $\sigma_{k_b}$ are the average and standard deviation, respectively, of the within-module connectivity over all nodes in module *b*, *k*_{i} is the number of links of node *i* in the whole network, *k*_{ic} is the number of links from node *i* to nodes in module *c*, and *N*_{M} is the number of modules in the network.

The within-module connectivity, *z*_{i}, describes how well node *i* is connected to other nodes in the same module, and the participation coefficient, *P*_{i}, reflects the degree to which node *i* connects to different modules. *P*_{i} is also referred to as the among-module connectivity [98]. If all links of node *i* belong to its own module, *P*_{i} = 0. If the links of node *i* are distributed evenly among modules, *P*_{i} → 1. The topological roles of individual nodes can be assigned by their position in the *z*-*P* parameter space. Originally, Guimera et al. [33, 59] divided the topological roles of individual nodes into seven categories. Olesen et al. [98] simplified this classification into four categories for pollination networks. In this study, we use the simplified classification as follows: (i) peripheral nodes (*z*_{i} ≤ 2.5, *P*_{i} ≤ 0.62), which have only a few links, almost always to nodes within their own modules; (ii) connectors (*z*_{i} ≤ 2.5, *P*_{i} > 0.62), which are highly linked to several modules; (iii) module hubs (*z*_{i} > 2.5, *P*_{i} ≤ 0.62), which are highly connected to many nodes in their own modules; and (iv) network hubs (*z*_{i} > 2.5, *P*_{i} > 0.62), which act as both module hubs and connectors. From an ecological perspective, peripheral nodes represent specialists whereas the other three categories represent generalists.
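These role statistics are straightforward to compute from the adjacency matrix and a module assignment (a NumPy sketch; assigning P_i = 0 to isolated nodes is our convention, not part of the original definition):

```python
import numpy as np

def node_roles(A, labels):
    """Within-module degree z_i and participation coefficient
    P_i = 1 - sum_c (k_ic / k_i)^2 for every node."""
    n = len(A)
    k = A.sum(axis=1).astype(float)              # whole-network degree k_i
    z = np.zeros(n)
    P = np.ones(n)
    for b in np.unique(labels):
        members = labels == b
        k_ic = A[:, members].sum(axis=1)         # links of every node into module b
        kb = k_ic[members].astype(float)         # within-module degrees of members
        sd = kb.std()
        z[members] = (kb - kb.mean()) / sd if sd > 0 else 0.0
        P -= (k_ic / np.maximum(k, 1.0)) ** 2    # accumulate the (k_ic/k_i)^2 terms
    P[k == 0] = 0.0                              # convention for isolated nodes
    return z, P

# Two triangles: every link stays inside its own module, so P_i = 0,
# and all within-module degrees are equal, so z_i = 0 (peripheral nodes)
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
z, P = node_roles(A, np.array([0, 0, 0, 1, 1, 1]))
```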

### Eigen-gene analysis

One of the grand challenges in dealing with high-throughput metagenomic data is its high dimensionality. Various statistical approaches are used to reduce dimensionality and extract major features, including principal component analysis (PCA), detrended correspondence analysis (DCA), and singular value decomposition (SVD). SVD is an orthogonal linear transformation that brings data (e.g., microbial data) from complexity to comprehensibility [99]. In SVD, each eigengene is a linear combination of the original gene profiles, and in the diagonalized data each eigengene is expressed only in the corresponding eigenarray. Langfelder and Horvath [61] proposed eigengene network analysis to summarize the gene expression data of each module by a centroid. Eigengene network analysis is powerful for revealing higher-order organization among gene co-expression modules [33, 61, 62]. Here, we have adopted this method to analyze modules in MENs.

#### SVD analysis to define module eigen-gene

Suppose there are *n*^{b} OTUs in the *b*-th module. Let *X*^{b} = [*x*^{b}_{iq}] represent the relative abundance matrix of the *b*-th module, where *x*^{b}_{iq} is the relative abundance of the *i*-th OTU in the *q*-th sample (*i* = 1, 2, …, *n*^{b}; *q* = 1, 2, …, *m*). In SVD analysis, *X*^{b} can be decomposed as follows:

$$
X^b = U^b D^b (V^b)^{T}
$$

where both *U*^{b} and *V*^{b} are column-orthogonal matrices, and *D*^{b} is a diagonal matrix of the singular values. The matrices *V*^{b} and *D*^{b} are denoted as

$$
V^b = \left[ v^b_1, v^b_2, \ldots, v^b_m \right], \qquad
D^b = \mathrm{diag}\!\left( d^b_1, d^b_2, \ldots, d^b_m \right)
$$

Assuming that the singular values are arranged in decreasing order, the first column of *V*^{b} is referred to as the module eigen-gene, *E*^{b}, for the *b*-th module. That is, *E*^{b} = *v*^{b}_{1}.

The eigen-gene represents the relative abundance profile of the OTUs within a module. In addition, the sum of the variances of the OTU abundances equals the sum of the squared singular values in the SVD. Therefore, the percentage of the variance explained by the eigen-gene is given by

$$
\mathrm{Var}(E^b) = \frac{\left( d^b_1 \right)^2}{\sum_{q=1}^{m} \left( d^b_q \right)^2}
$$

Generally, the module eigen-gene can explain approximately 50 % or more of the variance of the OTU abundances in the module [61]. Since PCA and SVD are identical when each OTU relative abundance has been standardized to mean 0 and variance 1, *E*^{b} is also the first principal component from PCA [61].
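The decomposition and the variance fraction can be reproduced directly with NumPy's SVD (a sketch; the synthetic one-profile module is ours):

```python
import numpy as np

def module_eigengene(Xb):
    """Module eigen-gene E^b (first right-singular vector of the
    n_b x m module abundance matrix) and the fraction of variance
    it explains, d_1^2 / sum_q d_q^2."""
    _U, d, Vt = np.linalg.svd(Xb, full_matrices=False)
    return Vt[0], d[0] ** 2 / (d ** 2).sum()

# A module of 5 OTUs that all follow one underlying sample profile
rng = np.random.default_rng(3)
profile = rng.normal(size=8)
Xb = np.outer(rng.normal(1.0, 0.1, size=5), profile)
Xb += 0.05 * rng.normal(size=(5, 8))
# standardize each OTU to mean 0, variance 1, so SVD matches PCA
Xb = (Xb - Xb.mean(axis=1, keepdims=True)) / Xb.std(axis=1, keepdims=True)
E, frac = module_eigengene(Xb)
```

Because the module is essentially rank one, the eigen-gene captures nearly all of the variance and closely tracks the underlying sample profile (up to sign).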

#### Module membership

The module eigen-gene provides the best summary of the variation in relative abundance of the OTUs within a module, but it is a centroid of the module rather than a real OTU. In practice, it is often important to know how close a given actual OTU is to its module eigen-gene. The correlation of the eigen-gene of module *b* with the *i*-th actual OTU across all experimental samples is defined as

$$
MM^b_i = \mathrm{cor}\!\left( E^b, x_i \right)
$$

If *MM*^{b}_{i} is close to 1 or −1, the *i*-th OTU is close to the centroid of module *b*.
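Computing these membership values is one correlation per OTU (a sketch; the toy eigen-gene and OTU profiles are ours):

```python
import numpy as np

def module_membership(E, Xb):
    """Pearson correlation of each OTU profile (rows of Xb) with the
    module eigen-gene E; values near +/-1 mark OTUs near the centroid."""
    return np.array([np.corrcoef(E, x)[0, 1] for x in Xb])

E = np.sin(np.linspace(0.0, 3.0, 10))            # a toy eigen-gene
Xb = np.vstack([
    2.0 * E + 1.0,                               # perfectly correlated OTU
    -0.5 * E,                                    # perfectly anti-correlated OTU
    np.cos(np.linspace(0.0, 40.0, 10)),          # unrelated OTU
])
mm = module_membership(E, Xb)
```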

### Random network construction and network comparison

Since only a single data point is available for each network index, standard statistical analyses cannot be applied directly to assess statistical significance. Following the logic of hypothesis testing, a null model is generated to assess the performance of the alternative model: random networks are generated with the Maslov-Sneppen procedure [68] to compare different complex networks. The Maslov-Sneppen method keeps the numbers of nodes and links unchanged but rewires the positions of all links in the MEN, so that the network sizes are the same and the randomly rewired networks are comparable with the original ones. This method is commonly used in ecological network analyses [4]. For each network identified, a total of 100 randomly rewired networks are usually generated by the Maslov-Sneppen procedure [68], and all network indices are calculated individually for each randomized network. The average and standard deviation of each index over all random networks are then obtained. A *Z*-test can be used to test the differences in the indices between the MEN and the random networks. For comparisons of network indices between different conditions, the Student *t*-test can be employed, using the standard deviations derived from the corresponding random networks.
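A minimal Maslov-Sneppen rewiring plus a Z-score for one index, here the triangle count, might look like this (a sketch; the swap cap, graph size, number of randomizations, and index choice are ours):

```python
import numpy as np

def maslov_sneppen(A, n_swaps, rng):
    """Degree-preserving rewiring: pick two links (a-b, c-d) and swap
    endpoints to (a-d, c-b) when no self-loop or duplicate results."""
    A = A.copy()
    edges = [tuple(e) for e in np.argwhere(np.triu(A))]
    done, attempts = 0, 0
    while done < n_swaps and attempts < 50 * n_swaps:
        attempts += 1
        i, j = rng.choice(len(edges), size=2, replace=False)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4 or A[a, d] or A[c, b]:
            continue                                 # rejected swap
        A[a, b] = A[b, a] = A[c, d] = A[d, c] = 0
        A[a, d] = A[d, a] = A[c, b] = A[b, c] = 1
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return A

def triangles(A):
    return np.trace(A @ A @ A) / 6.0                 # closed 3-walks / 6

rng = np.random.default_rng(4)
n = 30
A = np.triu((rng.random((n, n)) < 0.2).astype(int), 1)
A = A + A.T                                          # random undirected graph
rand_tri = [triangles(maslov_sneppen(A, 100, rng)) for _ in range(20)]
z_score = (triangles(A) - np.mean(rand_tri)) / np.std(rand_tri)
```

The same pattern applies to any index in Table 1: compute it on the MEN, on each rewired network, and compare via the Z-score.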

### Trait-based gene significance measure

In gene expression network analyses, the gene significance (*GS*_{i,h}) is the correlation between the expression profile of the *i*-th gene and the *h*-th sample trait, *T*_{h} [33]. The higher *GS*_{i,h} is, the more biologically significantly gene *i* is related to sample trait *h*. Similarly, in this study, the trait-based OTU significance is defined as:

$$
GS_{i,h} = \left[ \mathrm{cor}\!\left( x_i, T_h \right) \right]^2
$$

where *x*_{i} is the relative abundance of the *i*-th OTU (*i* = 1, 2, …, *n*) and *T*_{h} is the *h*-th sample trait (e.g., soil pH, N content, total plant biomass; *h* = 1, 2, …, *g*). Since the measurement units of different traits vary, all trait data should be standardized prior to statistical analysis. Finally, an OTU significance matrix, *GS*^{n×g}, is obtained.
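Assembling the GS matrix is then a double loop over OTUs and standardized traits (a sketch, taking GS as the squared Pearson correlation; the function name and toy data are ours):

```python
import numpy as np

def otu_significance(X, T):
    """Trait-based OTU significance matrix GS (n x g): squared Pearson
    correlation of each OTU profile (rows of X, n x m) with each
    standardized trait (rows of T, g x m)."""
    GS = np.empty((len(X), len(T)))
    for i, x in enumerate(X):
        for h, t in enumerate(T):
            GS[i, h] = np.corrcoef(x, t)[0, 1] ** 2
    return GS

rng = np.random.default_rng(5)
T = rng.normal(size=(2, 12))                 # g = 2 standardized traits, m = 12
X = np.vstack([3.0 * T[0] + 1.0,             # OTU 0 tracks trait 0 exactly
               rng.normal(size=(3, 12))])    # 3 unrelated OTUs
GS = otu_significance(X, T)
```

Entries lie in [0, 1]; an OTU that tracks a trait perfectly scores 1 for that trait.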