A statistical toolbox for metagenomics: assessing functional diversity in microbial communities

Background The 99% of bacteria in the environment that are recalcitrant to culturing have spurred the development of metagenomics, a culture-independent approach to sample and characterize microbial genomes. Massive datasets of metagenomic sequences have been accumulated, but analysis of these sequences has focused primarily on the descriptive comparison of the relative abundance of proteins that belong to specific functional categories. More robust statistical methods are needed to make inferences from metagenomic data. In this study, we developed and applied a suite of tools to describe and compare the richness, membership, and structure of microbial communities using peptide fragment sequences extracted from metagenomic sequence data. Results Application of these tools to acid mine drainage, soil, and whale fall metagenomic sequence collections revealed groups of peptide fragments with a relatively high abundance and no known function. When combined with analysis of 16S rRNA gene fragments from the same communities these tools enabled us to demonstrate that although there was no overlap in the types of 16S rRNA gene sequence observed, there was a core collection of operational protein families that was shared among the three environments. Conclusion The results of comparisons between the three habitats were surprising considering the relatively low overlap of membership and the distinctively different characteristics of the three habitats. These tools will facilitate the use of metagenomics to pursue statistically sound genome-based ecological analyses.


Background
Metagenomics, the culture-independent isolation and characterization of DNA from uncultured microorganisms [1], has facilitated the analysis of the functional biodiversity harbored in the large reservoir of uncultured bacteria and archaea [2][3][4]. Although early metagenomic studies identified individual genes or activities of interest, recent advances in genome sequencing technologies have made obtaining a complete metagenomic sequence more tractable. Sequence-based approaches combined with functional expression approaches have the potential to identify novel genes important for industrial and ecological applications. Sequence-based approaches have recently been applied to DNA obtained from viruses [5,6], seawater [7][8][9][10], wastewater [11,12], sediment [13], sponges [14], acid mine drainage [15], marine worms [16], human gut [17], soil [18], and decomposing whale carcasses [18]. The analysis used to describe these communities has primarily focused on the descriptive characterization and comparison of the relative abundance of proteins that belong to specific functional categories.
Attempts to analyze metagenomic sequences have proven that a metagenomic sequence is more than just a large genome sequencing project. First, the goal of most genome sequence projects is a closed genome sequence where every nucleotide is represented by a desired number of independent sequence reads. In metagenomics, the probability of finding overlapping sequence reads is low in most environments [19]. The probability that overlapping sequence reads are from the same population of bacteria or archaea is even lower so that contigs that are formed are out of necessity chimeras of different genomes that may not even be from the same phylum [20]. Second, a closed genome represents a statistical population of the genes harbored by that organism; therefore, comparing genome sequences for the presence or absence of genes is straightforward. Since it is not possible to close a metagenome, every metagenomic sequence collection represents a statistical sample of the genomes in an environment. Therefore, it is necessary to treat the comparison of communities as a statistical problem. Third, although labbased cultures that are sequenced do evolve, the differences between lab stocks is minimal compared to the changes faced by natural communities over short periods of time. This makes it difficult to reanalyze a community once a genome sequence has been obtained to improve annotations and understand gene expression.
Five general approaches have been taken to bring statistical analysis to the analysis of metagenomic sequences. The first adapts genomics-based approaches to metagenomics by constructing and curating databases to aid in the annotation and analysis of genes and the contigs they reside on [21,22]. Unfortunately, although such databases provide a critical infrastructure, given the large number of ORFs that have no known function (e.g. 69% in the Sargasso Sea [7]) and the paucity of contigs formed from many sequencing projects (e.g. <1% in the soil [18]), such database searches will be of limited value for comparative metagenomics. The second approach to analyzing metagenomic sequences has been based on the comparison on the relative abundance of annotation categories within the different sequence collections and within databases of assembled genomes [9,10,17,18,23]; these methods implicitly assume that the metagenomic sequences represent a statistical population and/or that the reference databases represent the normal distribution of genes in communities. A third set of approaches attempts to assign a phylogenetic origin for a sequence fragment in the absence of a phylogenetic anchor (e.g. 16S rRNA gene) using nucleotide frequency analysis or sequence signatures [24][25][26][27]. Such methods are limited for use with most environments because of the difficulty in forming contigs that are long enough to carry out a robust analysis and assume that the contigs that form are not chimeric. A fourth approach has attempted to compare communities without an annotation. These have attempted to quantify the species richness of communities based on the distribution of sequence read depth among contigs [7] and to compare the diversity of communities based on the relative frequency of different length oligonucleotides in the DNA sequence pool [28]. Finally, there have been attempts to using traditional population biology by analyzing the diversity of specific families of genes found in metagenomic collections [29].
Based on previous metagenomic sequencing efforts, we were interested in developing statistical tools to compare the richness, membership, and structure of the complement of ORFs from multiple communities in which assembly of the entire genomes is not possible. To address this problem, we adapted a set of statistical tools designed to analyze collections of 16S rRNA gene sequences to the analysis of protein coding genes [30][31][32][33]. Our goal was to provide additional tools to make statistical and ecological inferences using metagenomic sequence data. Instead of using a traditional pairwise DNA distance matrix obtained from a sequence alignment of homologous genes as is done with 16S rRNA genes, we used BLAST score ratios (BSRs) to develop a distance matrix that represents the similarity of ORFs across homologous groups [34]. To make comparisons among communities, we propose grouping ORFs into operational protein families (OPFs) which are analogous to operational taxonomic units (OTUs) derived from 16S rRNA gene sequences.

A new distance matrix
The goal of this aspect of the work was to develop a method to compare sequence alignments that circumvented the considerable computational effort required to obtain every possible global sequence alignment and pairwise distance. We used local alignments provided in BLAST and the resulting pairwise BLAST scores to generate BSRs. The BSRs approximate the fraction of identical amino acids between two peptide fragments so that a BSR value of 0.30 between two fragments means that they are approximately 30% identical over their full length. By analogy to the analysis of 16S rRNA gene sequences of uncultured bacteria where OTUs are developed based on a distance matrix, we propose using BSR values to define OPFs. Depending on the goals of the analysis an OPF can be defined as necessary. For illustrative purposes and based on previous implementations of BSRs for comparative genomics applications [34,35], unless otherwise indi-cated we will operationally define an OPF as a collection of fragments that have a BSR greater than 0.30.
To assess the feasibility of using peptide fragments from individual sequence reads, we identified peptide fragments from the individual sequence reads used to assemble the Bacillus anthracis, str. Ames genome (GenBank Accession NC_003997, [36]), which contains 4,514 ORFs that were longer than 100 aa. From the individual sequence reads, we identified 92,220 peptide fragments longer than 100 aa. The computational effort required for the pairwise alignment and distance calculation among 92,220 ORFs was prohibitive. Because we expected a majority of the peptide pairs would not have significant similarity, we used BLAST to identify those comparisons that had significant similarity and to calculate BSRs as a surrogate for similarity or distance values (distance = 1-BSR). Instead of generating a 92,220 × 92,220 matrix with 8.5 × 10 9 values, we took advantage of the sparseness of the matrix to simplify the calculations and construct a set of three linked-lists in which each list contained the row, column, and BSR values of the full BSR matrix. Since the BSR for a peptide fragment compared to itself is 1.0 and the BSR for a non-significant comparison is 0.0, the corresponding entries in the linked lists could be removed. Once this was completed, there were 2.1 × 10 6 values, which represented a significant reduction in the memory required to store the data.

MG-DOTUR
To assign peptide fragments to OPFs we rewrote the computer code for DOTUR to be compatible with sparse BSR matrices. DOTUR is used to assign collections of 16S rRNA gene sequences and to use the resulting frequency distribution of sequences among OTUs to estimate richness and diversity (Table 1). By analogy, MG-DOTUR assigns peptide fragments to OPFs and estimates the richness and diversity of OPFs for any desired OPF definition. Two classes of methods are available to estimate richness based on frequency distributions. The first uses parametric distributions such as the lognormal distribution to predict the number of unseen groups in a community [37]. Although it is often assumed that microbial communities follow a lognormal distribution, there are no published examples in the microbial ecology literature for which the observed data support such an assumption. This is primarily due to the difficulty in obtaining a sufficient number of observations to implement these methods. An alternative approach uses non-parametric estimators that do not assume an underlying frequency distribution and are relatively easy to compute. These estimators are implemented in DOTUR and MG-DOTUR.
Based on the observed frequency distribution of peptide fragments in each OPF 0.30 , we applied the Chao1, ACE, and interpolated Jackknife richness estimators to predict the OPF 0.30 richness. The predicted OPF richness was approximately three times greater than the OPF richness that was observed in the assembled B. anthracis genome ( Table 1). When we mapped each OPF from the closed genome to the OPFs from the individual sequence reads we found that each OPF from the closed genome was linked to an average of 3.08 (s.d. = 2.75) OPFs from the sequence reads. Further inspection showed that the multiple OPFs from the sequence reads corresponded to different regions of long ORFs from the closed genome sequence. Similar results have been observed when attempting to estimate the number of expressed genes using expressed sequence tags [38].
To overcome this problem, we developed a method of merging OPFs from the sequence reads to obtain a more meaningful OPF distribution. For two OPFs to merge, we required that the carboxyl-terminus of at least one sequence in the first OPF overlap with the amino-terminus of at least one sequence in the second OPF by at least 5 amino acids. Furthermore, we incorporated a BSR penalty so that for two OPFs to merge the overlapping region had to have a BSR greater than the BSR currently being used to form clusters. We used penalties of 0.00, 0.05, 0.10, 0.15, and 0.20 (Table 1). We then applied this merging scheme to the OPFs from the sequence reads and cal- AMOVA/MG-AMOVA Determines whether two or more communities differ significantly in genetic diversity using an analysis of variance-type formulation Distance Matrix or BLAST Table [33,47,48] culated two types of error [38]. Type I errors corresponded to the fraction of OPFs from the closed genome that mapped to multiple OPFs from the sequence reads. Type II errors corresponded to the fraction of OPFs from the sequence reads that corresponded to different OPFs from the closed genome (Table 2). We found that as we increased the penalty, the Type I error decreased and the Type II error increased. Based on this analysis, we decided to implement a penalty of 0.15 because both types of error were 7.1 and 7.4%, respectively. When the resulting frequency distribution was used to calculate collector's curves using the observed and predicted richness, the curves converged towards the true OPF richness (Fig. 1A). This was used to further validate the choice of penalty. A limitation of this approach is that the resulting number of peptide fragments in a merged OPF is a product of the length of the complete ORF and the relative abundance of the ORF in the metagenome. Therefore, we will report OPF richness from merged analysis and annotations from both merged and non-merged analyses.

Comparing membership and structure using OPFs
Other tools have been developed to compare the membership (e.g. SONS) and structure (e.g. ∫-LIBSHUFF and AMOVA) of microbial communities using 16S rRNA gene sequences. Again, by analogy we were interested in using OPFs and BSRs to compare microbial communities using metagenomic sequences. SONS uses the output of DOTUR and MG-DOTUR to complete its analysis and required no further modification for use with metagenomic sequences. ∫-LIBSHUFF and AMOVA were modified to use the sparse matrix data representation used in MG-DOTUR. The resulting programs were designated MG-LIB-SHUFF and MG-AMOVA. To test these programs, we randomly divided the 92,220 B. anthracis peptide fragments into two artificial communities that were each represented by 46,110 peptide fragments.
We first applied SONS to these two communities to compare the membership and structure of the artificial communities using OPFs. We calculated the shared OPF richness using the Chao non-parametric estimator of shared richness and obtained a value of 3,561 OPFs. Although this estimate of shared richness is lower than 95% confidence interval observed for the total collection of peptide fragments using the Chao1, ACE, or Jackknife estimators, the shard Chao estimator was still increasing with additional sampling (Fig. 1B). This indicates that if sequencing had continued the estimate of shared richness would have probably overlapped eventually. The abundance-based Jaccard (J abund ) estimate of similarity was 1.00, which predicted that all of the peptide fragments belonged to shared OPFs 0.30 . Yue and Clayton's measure of community overlap, θ, was 0.97, which indicated that the distribution of peptide fragments among OPFs was the same in both artificial communities. These results indicate that SONS is amenable to analyzing OPFs to detect similarity between the memberships and structures of different communities.
An alternative approach to comparing community structures is to perform hypothesis tests. AMOVA uses an analysis of variance (ANOVA)-type framework to test the hypothesis that the difference in genetic diversity between two or more communities is not significantly different than the diversity within each community. We implemented this analysis in a program designated MG-AMOVA to perform a single-classification analysis. Our comparison of two randomly generated B. anthracis peptide fragment pseudo-communities revealed that the observed differences between the two pseudo-communities were not statistically significant (p > 0.05). Next we modified the program ∫-LIBSHUFF to create MG-LIB-SHUFF to test the hypothesis that two communities have the same structures. As expected, the differences in structure between the two pseudo-communities were not sta- Analysis of the richness and community membership when peptide fragments identified in individual sequence reads were used to assemble the Bacillus anthracis str  B tistically significant (P > 0.05). Each of these comparisons indicate that we can make statistical comparisons between the membership and structure of microbial communities using peptide fragments identified in single sequence reads from metagenomic data.

Acid Mine Drainage
Tyson et al. [15] used metagenomic sequencing to analyze a biofilm growing in acid mine drainage (AMD) that had a pH below 1.0. They obtained 322 archaeal and bacterial 16S rRNA gene sequences and 103,462 random paired sequence reads, which represented 76.2 Gbp of DNA. We used DOTUR to assign 16S rRNA gene sequences to nine OTUs and predicted there were an additional three OTUs (95% confidence interval [95% CI] = 0 to 8) that were not observed ( Fig. 2A) Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an AMD biofilm community Figure 2 Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an AMD biofilm community. Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an agricultural soil in Minnesota, USA Figure 3 Collector's curves for the OTU (A) and OPF (B) richness observed and estimated using DNA extracted from an agricultural soil in Minnesota, USA. Although there was an insufficient number of peptide fragments to obtain a reliable estimate of the fraction of OPF membership that was shared between any two of the three communities, we estimated that they shared at least between 10 and 20% of their OPF membership (Fig. 4). The "core" whalebone OPF membership that was shared among the three whalebone communities had a richness of at least 3,800 OPFs (approximately 2.5% of the total richness); 1,678 of these were actually observed in the sequence collection. The most commonly shared OPFs among the three communities represented a variety of activities including metal transport, sensors, and housekeeping functions ( Table 3).

16S rRNA Sequences
Comparison of the community structures using the peptide fragments using MG-LIBSHUFF (all p < 0.001) and MG-AMOVA (all p < 0.001) found that the structures of these three communities were significantly different.
Using OPFs, θ varied between 0.39 and 0.55 indicating that there was some similarity in community structure. The ability to quantify and assess the differences in communities without exhaustive sampling of the three whalebone communities indicates the importance of applying statistical methods to metagenomic sequence data. Such analyses make comparative metagenomics amenable to ecologically-based hypothesis testing.

Comparison of the three environments
To assess the relative similarity of OTU 0.03 membership between environments, we used DOTUR to cluster the 2,161 16S rRNA gene fragments from the AMD (n = 322), soil (n = 1,633), and whalebone communities (n = 206). No OTUs were shared between any two of the three communities; however, additional sampling may have identified OTUs that were shared between environments.
We compared the relative similarity of OPF membership between environments by clustering the 351,186-peptide fragments from the AMD (n = 99,419), soil (n = 143,422), and whalebone communities (n = 108,345) using MG-DOTUR and then we estimated the membership and structure overlap among the three communities (Fig. 5).
Measuring the overlap of OPFs measurement among the three communities resulted in the estimate that more than 800 OPFs were shared among the five communities; this represents less than 0.3% of the total OPF richness found in the five communities. Of this pool, 774 merged OPFs and were actually observed with functions including metal transport, housekeeping, and various dehydrogenase activities ( Table 4). Applications of the statistical tools to these types of comparisons will enable researchers to investigate the problem of biogeography using genome-based methods. Venn diagram comparing the OPF membership found in three whalebone microbial communities (AGZO, n = 38,981 peptide fragments; AHAA, n = 36,165; and AHAI, n = 33,199). Below each community name is the Chao1 richness estimate and the 95% confidence interval for that community. We estimated the richness of the overlapping regions based on the pairwise S A,B Chao shared richness estimates between the three communities and by pooling two communities and estimating the shared fraction with the third community. These estimates are provided on the right side of the figure. This analysis suggests that the comparisons between the OPFs 0.30 identified in the metagenomic sequences represent the level of differences expected between phylogenetically disparate groups of bacteria. Furthermore, analyses using completed genome sequences may enable investigators to define the size and boundaries of so-called "pangenomes."

Conclusion
We present a statistical toolbox to estimate the functional richness and overlap among communities based on peptide fragments deduced from DNA sequence data. These statistical approaches are necessary, in part, because the immense genomic diversity contained in most communities precludes the formation of contigs. There is also considerable question regarding the robustness of sequence assembly [40]. Although understanding these complex communities is tantalizing, it may prove useful to identify more communities similar to the AMD and whalebone communities that have a relatively low diversity to develop and test tools that can then be applied to soil. As Venn diagram comparing the pooled OPF membership found in the AMD (n = 99,419 peptide fragments), soil (n = 143,422), and whalebone (n = 108,345) microbial communi-ties  sequencing technologies improve, the feasibility of obtaining nearly complete sequence coverage of the more diverse communities will improve. The rapid advances in sequencing short DNA fragments (approximately 100 bp long) in a highly parallelized manner [41] presents many new opportunities, but the method may not be amenable to metagenomic sequencing because the short sequence reads produce peptide fragments less than 100 aa long, which could make a meaningful ORF identification and analysis of functional diversity difficult.
Innovative methods have been developed to compare collections of 16S rRNA sequences, and analogous new methods are needed for comparing metagenomic sequences. For example, improving our ability to estimate and interpret the biological meaning of OPF richness will be helpful for describing the relative functional capacity of a community. Our analysis does not address the possibility that distant OPFs might serve the same biological function and that members of the same OPF might have different functions. Therefore, further work is needed to unify studies of functionally active clones into a statistical framework. For example, comparing the collection of genes conferring antibiotic resistance found in multiple environments would enable us to understand better the diversity of these genes as well as their biogeography.
Our analysis moves beyond previous attempts to compare microbial communities at the genomic level by not being dependent upon reference databases and introducing statistical rigor to the description and comparison of microbial communities. For example, previous analyses formed clusters based on similarity to reference databases and excluded those peptide fragments with no significant matches, which limited the scope of the analysis. Here, we formed OPFs using the observed data, in essence allowing the data to "speak for themselves", which allowed for a comprehensive comparison of the data. Previous analyses also based the level of similarity between communities on the observed peptide fragments as though they represented a statistical population. Here, we treated the data as a statistical sample and employed statistical tools to estimate the level of similarity between community membership and structure. These tools enable a quantitative, comprehensive, and statistically robust analysis of microbial communities at the genomic level.
Shotgun sequencing of metagenomic communities is becoming increasingly popular and routine. The results of these efforts will provide more insight if they are wrapped in robust ecological and statistical frameworks. Tools are needed to advance data analysis beyond the frequency of different COGs or KEGG categories that are found within a community. This study is a step in building such a framework to compare microbial communities functionally at the genomic level. In addition to estimating community relatedness based on metagenomic data, our approach accounts for present but unsampled peptide fragments, is independent of a subjective annotation process, and includes peptide fragments with no known function.

Genome sequence data
We obtained the 101,379 sequence reads used to assemble the Bacillus anthracis str. Ames whole genome sequence from GenBank (NC_003997). Each sequence read was evaluated by fastgenesb at the Joint Genome Institute using the same parameters used to predict the identity of peptide fragments in two previous metagenomic sequencing studies [15,18]. We also obtained the complete complement of 4,514 ORFs from the finished genome that were longer than 100 aa. All of the predicted peptide fragments from the published metagenomic sequencing projects using an acid mine drainage biofilm [15], whalebone [18], and soil [18] were obtained from the Joint Genome Institute. Only those ORFs and peptide fragments longer than 100 aa were considered in our analyses.

Modified toolbox
DOTUR is a freely available computer program that uses a distance matrix to assign sequences to operational taxonomic units (OTUs) using either the nearest, average, or furthest neighbor clustering algorithms for all possible distances and then constructs rarefaction and collector's curves for a variety of ecological parameters [30]. These curves can be used to compare the relative richness, the number of different OTUs in a community, of two samples and to estimate the overall richness within a sample. Similarly, MG-DOTUR clusters sequences into OPFs using a BLAST table as the input. ORFs are assigned to OPFs using the furthest neighbor clustering algorithm [42], which requires that all sequences in the OPF have a pairwise BSR value greater than a specified value. Because BSRs are not necessarily symmetric (i.e. BSR ij ≠ BSR ji ), they were forced to be symmetric by using the smaller of the two values. Once MG-DOTUR assigns sequences to OPFs, rarefaction curves of the number of OPFs observed on average as a function of ORFs sampled and collector's curves of the Chao1 [43], ACE [44], and the interpolated Jackknife [45] richness estimates as a function of ORFs sample are calculated at multiple BSR values predefined by the user. MG-DOTUR uses a switch to calculate the ACE estimator. If the coefficient of variation (γ) is greater than 0.8, then the ACE-1 estimator is calculated, otherwise the simple ACE estimator is used. This follows recommendations made by Anne Chao for use of the program SPADE [46]. This study reports results obtained by defining an OPF a group of sequences with a BSR value greater than 0.30 [34,35] and OTU as a group of sequences that are all more than 97% identical to each other [30].
∫-LIBSHUFF [31] is a modified version of the program LIB-SHUFF [32] that makes use of the integral form of the Cramér-von Mises statistic to determine whether two communities are either samples of the same statistical population, sub-samples of each other, or were drawn from different statistical populations. As employed in ∫-LIBSHUFF, the Cramér-von Mises statistic is a function of the coverage of one sequence collection onto itself (i.e. homologous coverage, C X ) compared to its coverage onto another collection (i.e. heterologous coverage, C XY ). Coverage is the fraction of sequences that have another sequence within a given distance of them. Application of the LIBSHUFF-style analysis requires converting BSR values into distances by subtracting the BSR value from one and setting the limits of integration from zero to 0.70. MG-LIBSHUFF calculates the ∆C XY statistic and evaluates its significance using a Monte Carlo testing procedure as described elsewhere [31,32].
where, D = the distance (1-BSR) that is used to determine the level of coverage. C X (D) and C XY (D) = measures of homologous and heterologous library coverage.
BSR min = the smallest meaningful BSR value; for this analysis set at 0.30 Population biologists have developed an analysis of variance (ANOVA)-style of analysis, which tests whether a collection of communities have similar genetic diversities using mitochondrial DNA sequences and other genetic markers. This method has been designated as either the analysis of molecular variance (AMOVA) [47] or non-parametric multivariate analysis of variance (MANOVA) [48]. This analysis has been applied for comparing bacterial communities using 16S rRNA sequences [33]. The general method is based on partitioning the sum of the squared elements in a distance matrix similar to what is done in an ANOVA. As applied in MG-AMOVA, we implement a single-classification ANOVA design to determine whether the average genetic average genetic difference between the three whalebone communities was significantly greater than the difference within a community. The total sumsquared error (SS T ) and within community sum-squared error (SS W ) is calculated by where, BSR ij = the BSR value between the i th and the jth peptide fragments.
ε ij = 1 if i and j are in the same community, otherwise it is 0.

N = total number of peptide fragments
The sum-squared error among communities (SSA) can be calculated as SS A = SS T -SS W . Significance was determined by randomizing the assignment of sequences to the sequence collections and recalculating the statistic and determining the proportion of randomizations resulting in an equal or smaller SS W value than that observed from the randomized distribution [48].

OPF-based comparisons of community membership and structure
Using the frequency that each OPF was observed in multiple communities, it has been possible to estimate the number of OPFs that are shared between communities as well as describe the overlap between community structures. Analogous to the Chao1 non-parametric richness estimator [43], Chao et al. [49] derived a non-parametric estimator of the richness shared between two communities: where, S 12 = number of shared OPFs in A and B f 11 = number of shared OPFs with one observed individual in A and B f 1+ , f 2+ = number of shared OPFs with one or two individuals observed in A f +1 , f +2 = number of shared OPFs with one or two individuals observed in B By a similar approach the fraction of individuals or peptide fragments that belong to a shared OPF can be estimated [50,51]: where, U est , V est = fraction of sequences from A and B that belong to a shared OTU X i , Y i = abundance of the i th shared OTU in A and B n total , m total = total number of sequences sampled in A and B I(·) = if the argument, ·, is true then I(·) is 1; otherwise it is 0.
U est and V est can then be used to estimate an abundancebased Jaccard similarity coefficient: To incorporate into the measure of community similarity the proportion of peptide fragments in each OPF, Yue and Clayton [52] developed the parameter θ: where, S 1 and S 2 = observed number of OPFs in each community.

16S rRNA sequence analysis
The three metagenomic sequencing projects were selected because they were accompanied by parallel 16S rRNA sequence collections. We obtained the sequences from the original authors and aligned the sequences using the . Aligned sequences were imported to ARB [54] and overlapping sequences were used to construct distance matrices with a Jukes-Cantor correction for multiple substitutions. Distance matrices were analyzed using DOTUR [30], ∫-LIBSHUFF [31], and MG-AMOVA as described above.

Availability of data and software
MG-DOTUR, MG-LIBSHUFF, MG-AMOVA and all sequence and analysis files are available from the authors' website [55].