M-pick, a modularity-based method for OTU picking of 16S rRNA sequences
© Wang et al; licensee BioMed Central Ltd. 2013
Received: 19 June 2012
Accepted: 30 January 2013
Published: 7 February 2013
Binning 16S rRNA sequences into operational taxonomic units (OTUs) is an initial crucial step in analyzing large sequence datasets generated to determine microbial community compositions in various environments including that of the human gut. Various methods have been developed, but most suffer from either inaccuracies or from being unable to handle millions of sequences generated in current studies. Furthermore, existing binning methods usually require a priori decisions regarding binning parameters such as a distance level for defining an OTU.
We present a novel modularity-based approach (M-pick) to address the aforementioned problems. The new method utilizes ideas from community detection in graphs, where sequences are viewed as vertices on a weighted graph, each pair of sequences is connected by an imaginary edge, and the similarity of a pair of sequences represents the weight of the edge. M-pick first generates a graph based on pairwise sequence distances and then applies a modularity-based community detection technique on the graph to generate OTUs to capture the community structures in sequence data. To compare the performance of M-pick with that of existing methods, specifically CROP and ESPRIT-Tree, sequence data from different hypervariable regions of 16S rRNA were used and binning results were compared.
A new modularity-based clustering method for OTU picking of 16S rRNA sequences is developed in this study. The algorithm does not require a predetermined cut-off level, and our simulation studies suggest that it is superior to existing methods that require specified distance levels to define OTUs. The source code is available at http://plaza.ufl.edu/xywang/Mpick.htm.
Recent advances in high-throughput sequencing technologies have contributed to an explosion in sequence data from studies of microbial composition in various environments that harbor complex microbial communities. As one of the most commonly used approaches for such studies, 16S rRNA sequences are analyzed to estimate species composition and diversity.
An initial requirement for downstream analyses of 16S rRNA sequences is the binning into operational taxonomic units (OTUs) that contain similar sequences. The existing methods can be divided into two classes, taxonomy-dependent methods and taxonomy-independent (TI) methods [1, 2]. For taxonomy-dependent methods, query sequences are compared with known sequences deposited in annotated databases (e.g., RDP  and Greengenes ) . Sequences that match with a reference sequence with a simialrity less than a predetermined cut-off value are grouped together. In contrast, TI methods apply clustering algorithms to pairwise sequence distances to assign query sequences into OTUs [6, 7]. A major advantage of TI methods is their independence from the coverage of existing databases, which allows the analysis of sequences from unknown microorganisms, because novel sequences usually represent a large proportion of a sequence dataset .
In TI methods, pairwise sequence distances are computed either by multiple sequence alignment (MSA) or pairwise sequence alignment (PSA) and several clustering algorithms can then be applied to form OTUs. These clustering algorithms include hierarchical clustering algorithms such as DOTUR , MOTHUR , ESPRIT  and ESPRIT-Tree , as well as heuristic algorithms such as CD-HIT  and UCLUST . In a recent benchmark study, we demonstrated that ESPRIT-tree appeared to have advantages in terms of both accuracy and computational efficiency .
One of the critical problems with existing TI methods is the need to set an appropriate distance threshold to retrieve the optimal OTU binning at a distinct taxonomic level such as species. Applying different thresholds leads to inconsistent binning results. Furthermore, appropriate distance levels appear to vary depending on the chosen hypervariable region , which makes it impossible to create one single distance-based threshold for defining a taxonomic level .
Some efforts have been made recently to address this issue. In , a semi-supervised clustering method was developed to identify a cut within a hierarchical clustering tree that maximizes the fit with a labeled subset of the sequences so that varied distance levels were applied in the clustering process to improve clustering accuracy. However, this approach shares a crucial disadvantage with taxonomy-dependent methods: the need to preselect labels to perform OTU picking. In , a Bayesian clustering method called CROP was developed, which uses a Gaussian mixture model to describe the pairwise sequence distance distribution in an OTU to avoid the need to set a single distance level for all clusters. Although this method does not use hard thresholds, it actually utilizes a lower and upper bounds that can be transformed to a threshold. Another Bayesian based method BEBaC  utilizes a crude 3-mer count based preclustering step, and then the partition space is searched for the partition having maximum posterior possibility for given sequence data. A minimum description length criterion is then applied in a fine clustering step to determine the number of OTUs and generate the final partitioning. Users only need to provide one parameter - the possible maximum number of OTUs as the input. The major disadvantage of this approach is its high computational cost.
In this study, a modularity-based clustering method was developed for OTU picking. By viewing an OTU as a collection of related sequences with similar densities in a sequence space, we applied a community detection method and treated OTU picking as a community structure detection problem.
Modularity itself is also a quality function that indicates whether a partitioning of a graph can reveal the community structure on the graph if such structure exists. The maximum value of modularity is 1; a large value implies good partitioning. The maximum Q value corresponds to the optimal partitioning on the graph, which best reflects its community structure. The community detection problem thus can be formulated as an optimization problem to find the partitioning that maximizes Q.
Several algorithms have been developed to efficiently optimize modularity. Among them, the algorithm in  appears superior in terms of both accuracy and speed [17, 19], and it is chosen in our study to optimize modularity and find a clustering result that reflects community structure in our sequence data. The algorithm takes a bottom-up approach: it initially assigns each vertex to be a distinct cluster; it then moves a vertex into another cluster if the resultant modularity is increased; afterwards it recursively repeats the process by viewing each cluster as a vertex until a maximum modularity is obtained.
In the context of OTU picking, a weighted graph is formed by: i) viewing sequences as vertices, where each pair of sequences is connected by an imaginary edge, and ii) viewing the simlarity of a pair of sequences as the weight on the edge connecting these two sequences. Thus the modularity of a partition of sequences can be computed using Equation (1); the best clustering result is the one that maximizes the modularity. In such a result, each cluster represents an OTU with high homogeneity inside, that is, similarities between sequences within OTUs are greater than those between them. Using this approach, OTUs are defined by homogeneity of edge densities and not by distance between neighborhood clusters, circumventing the need for choosing distance levels.
Our modularity-based approach includes three steps. (1) Pairwise sequence distances are computed using the alignment module of ESPRIT . (2) An ɛ-neighborhood graph is formed by only retaining the pairwise sequence distances less than ɛ, or equivalently pairwise sequence similarity greater than 1-ɛ. This step is somehow similar to single-linkage clustering. (3) Modularity-based clustering is recursively performed on the graph generated in the previous step.
In the first step, we generate a pairwise distance matrix, viewable as a fully connected graph. However, the fully connected graph cannot be directly used to perform clustering because of i) prohibitive computational costs and ii) the resolution limit problem which states that modularity-based methods may fail to acquire clusters smaller than a scale depending on the total size of the graph . This implies that if a complete graph of significant size is used, small clusters in the graph will likely be ignored even if they show connectivity, albeit weak, to the rest of the graph and thus should be recognized as distinct OTUs. Therefore, we use a parameter ɛ in step 2 to mitigate these problems. Ideally, ɛ should be chosen to be greater than the maximum pairwise neighborhood sequence distance within a taxon, but not too large so that it includes all the sequences in multiple taxa into one connected graph. A graph formed in this way can guarantee that the sequences within a taxon are connected and the edge density within a taxon is greater than the density between taxa, making the community structure in the original fully connected graph more prominent.
Clustering results validation
Different clustering results are frequently obtained for the same sequence data set by applying different clustering methods and/or different parameter settings. The lack of a ground truth complicates an objective comparison of clustering methods. Generally, there are two types of clustering validation methods , either using external or internal criteria. Using external criteria the clustering results are compared to correct class labels from the 'ground truth', while only quantities inherent to the data are used for internal validation.
Normalized mutual information (NMI) is a well-known external criterion previously used for validating OTU picking; it measures the difference of a clustering result from a perceived ground truth . NMI views the sequence distributions in the clustering result and ground truth as two discrete random variable distributions, and computes the NMI of the two random variables as the measure for quantifying agreement. The maximum NMI score is 1 which means the clustering result completely match with the partition in ground truth; the higher the NMI score, the more match. NMI is equivalent to variation of information used in White et al. .
Another popular external criterion is the F-score, which jointly considers precision and recall . A common problem with F-score is that it does not satisfy the cluster completeness constrain that each cluster ω k in ground truth is only judged by the best-matched cluster in the clustering result. Thus, other small clusters that match with ω k can not affect the F-score, overestimating correlation when many small clusters are present [21, 23].
Internal validation indices such as Silhoutette width  and Dunn index  have been used to evaluate clustering performance without the need for a ground truth. Quantities such as compactness, connectedness, and separation in the cluster distribution are used to evaluate clustering performance. While independence from questionable ground truths is a clear advantage, internal validation is only possible if the studied dataset has well-defined community structure, a condition that frequently is not met. For the above-mentioned reasons, we herein only use the external criteria based NMI score for clustering validation.
16S rRNA sequences of different hypervariable regions were used to compare M-pick with ESPRIT-Tree and CROP.
Case study 1 - V2 variable region
We used published sequences previously generated to study the association between obesity and the composition of human gut microbiota . The dataset contains ~1.1M sequences covering the V2 region with an average length of 231 nucleotides. We blasted the sequences against the annotated RDP-II database, filtered the sequences using the criteria described in the previous section, and picked the species labels of the retained sequences as ground truth. We then randomly extracted 10 test subsets from these retained sequences, each containing 1000 sequences from the 50 most abundant species (total 50,000 sequences).
While ESPRIT-Tree and CROP can achieve NMI scores greater than 0.9 at their optimum distance level, results are sensitive to the chosen distance level (which is not known a priori). M-pick generated the most accurate results for all of the test datasets.
Number of OTUs and the best distance levels of clustering algorithms (Case study 1)
# OTU (mean, std)
Best distance level
Case study 2- V9 variable region
To confirm the observation described above and to be able to generalize our findings, we performed additional studies using different datasets covering various 16S rRNA hypervariable regions. Results from another case study are presented below; The second study was performed on a dataset retrieved from a soil microbial diversity study  where 139,000 bacterial 16S rRNA sequences (hypervariable V9 region) were obtained from samples collected in Brazil, Florida, Illinois, and Canada.
Case study 3- V3 variable region
For the ease of presentation, we only used the top 50 or 100 species in the previous case studies, which may not give a complete picture of how M-pick works on a whole real data.
Number of OTUs and NMI generated by ESPRIT-Tree at varied distance levels and M-pick (Case study 3)
Case study 4- simulated dataset
In the above case studies, the ground truth was generated by keeping the sequences that highly matched with the RDP-II database through the stringent criteria. However, the way to genererate ground truth could be quenstionable. To adress this concern, we included another simulated dataset from , which contains 22,000 sequences from 11 taxa generated from a Gaussian distribution model with varied deviations. We applied M-pick on the data, and it correctly grouped sequences into 11 taxa with a perfect NMI score of 1, which is better than those from BEBaC, UCLUST, ESPRIT-Tree, and CROP shown in . We also investigated how the problem of resolution limit affected the clustering results by keeping only 20 sequences from Taxon 8. M-pick still retrieved the correct clustering result, which confirms that M-pick worked well for this rare taxon case without the problem of resolution limit.
Additional case studies
Additional case studies were provided in the Additional file 1. The results were consistent with the findings presented in the previous sections.
We herein developed a novel modularity-based clustering method, M-pick, for binning 16S rRNA sequences into OTUs. M-pick is based on graph partitioning, and does not require a predetermined distance level to generate OTUs, which is a challenging requirement for many other OTU picking methods.
M-pick is based on a concept from graph partitioning. It initially creates a similarity based graph composed of all the sequences in a dataset. The algorithm first computes the pairwise sequence distances, and then implicitly creates an ɛ-neighborhood graph from the fully connected graph by only keeping sequence connections with pairwise distances less than ɛ. This strategy is used to save computational cost and to make community structure in the original graph more prominent. Modularity is used not only as the quality function to perform clustering but also as the criterion for terminating the recursive clustering process. We stop partitioning a graph (cluster) when all of its partitions have a modularity value smaller than δ. Both settings of ɛ and δ help to alleviate the resolution limit problem. Although we cannot claim that the proposed method has solved the problem, we found in our empirical studies that the resolution limit does not seem to be a serious issue.
We used multiple sequence datasets from different hypervariable regions of 16S rRNA to compare the performance of M-pick with two other commonly used algorithms, CROP and ESPRIT-Tree. Both are thought to generate accurate clustering results if the optimal distance level is known. However, the optimal distance level, which is not known a priori, varies for different hypervariable regions and even for different datasets from the same region. M-pick outperformed the other two algorithms in most cases even when the optimal distance level was used in the two competing algorithms.
Two parameters are required by M-pick. ɛ is used in the process of creating a graph and δ is used to decide when to stop the recursive clustering. The constraint on an OTU introduced by ɛ and δ is different from that of preset distance level used in ESPRIT-Tree and CROP. It can create arbitrarily shaped OTUs, which alleviates the problem of similar sequences being split into separate OTUs. We found that ɛ should be chosen to be larger than the maximum pairwise neighborhood sequence distance within a species. In all datasets that we analyzed, ɛ and δ were set to be 0.04 and 0.1, respectively, and the results were superior to those achieved by the other two algorithms. Thus, we suggest users to use this parameter setting for picking OTUs at species level. A systematic study to determine the two parameters for other phylogenetic levels needs to be carried out in the future. For the stopping criterion δ, similar considerations should be taken as in  in order to determine this parameter based on the statistical significance of the maximum modularity values of sub-clusters generated in the recursive clustering process.
The computational cost is composed of two parts. (1) O(n2) is consumed in computing pairwise sequence, where n is the number of sequences. (2) The cost of performing modularity-based clustering is approximately linear with respect to m, the number of edges in an ɛ-neighborhood graph. The running time is mainly consumed in the computation of pairwise sequence distances. Therefore, it is highly desirable to develop a more efficient pairwise sequence alignment method. At present, large datasets are handled by adding a preprocessing step. Sequences are pre-clustered at 1% distance level using a high-speed method such as UCLUST, and a representative sequence from each cluster is used to form a reduced dataset, on which the pairwise sequence distances are computed.
We developed M-pick, a new modularity-based clustering method, for OTU picking of 16S rRNA sequences. The algorithm does not require a predetermined cut-off value, and our simulation studies suggest that it is superior to the methods that require specified distance levels to define OTUs. M-pick appears to offer a viable alternative for binning similar sequences into OTUs.
We thank the editor and reviewers for their comments and suggestions that significantly improve the quality of this article. This work is supported in part by National Science Foundation under grant No. DBI-1062362.
- Sun Y, Cai Y, Huse SM, Knight R, Farmerie WG, Wang X: A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform 2011, 13: 107-121.PubMed CentralView ArticlePubMed
- Schloss PD, Westcott SL: Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol 2011, 77: 3219-3226. 10.1128/AEM.02810-10PubMed CentralView ArticlePubMed
- Cole JR, Chai B, Farris BJ, Wang Q, Kulam SA, McGarrell DM: The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res 2005, 33: 294-296.View Article
- Desantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006, 72: 5069-72. 10.1128/AEM.03006-05PubMed CentralView ArticlePubMed
- Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet 2008, 4: e1000255. 10.1371/journal.pgen.1000255PubMed CentralView ArticlePubMed
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658-1659. 10.1093/bioinformatics/btl158View ArticlePubMed
- Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res 2009,37(10):e76. 10.1093/nar/gkp285PubMed CentralView ArticlePubMed
- Schloss PD, Handelsman J: Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Appl Environ Microbiol 2005, 71: 1501-1506. 10.1128/AEM.71.3.1501-1506.2005PubMed CentralView ArticlePubMed
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009,75(23):7537-7541. 10.1128/AEM.01541-09PubMed CentralView ArticlePubMed
- Cai Y, Sun Y: ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res 2011, 39: e95. 10.1093/nar/gkr349PubMed CentralView ArticlePubMed
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010,26(19):2460-2461. 10.1093/bioinformatics/btq461View ArticlePubMed
- White JR, Navlakha S, Nagarajan N, Ghodsi M, Kingsfor C, Pop M: Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics 2010, 11: 152. 10.1186/1471-2105-11-152PubMed CentralView ArticlePubMed
- Hao X, Jiang R, Chen T: Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering. Bioinformatics 2011, 27: 611-618. 10.1093/bioinformatics/btq725PubMed CentralView ArticlePubMed
- Cheng L, Walke AW, Corander J: Bayesian estimation of bacterial community composition from 454 sequencing data. Nucleic Acids Res 2012. 10.1093/nar/gks227
- Newman MEJ: Analysis of weighted networks. Phys Rev 2004, 70: 056131.
- Newman MEJ: Modularity and community structure in networks. PNAS 2006,103(23):8577-8582. 10.1073/pnas.0601602103PubMed CentralView ArticlePubMed
- Fortunato S: Community detection in graphs. Phys Rep 2010,486(3-5):75-174.View Article
- Blondel VD, Cuillaume JL, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J Stat Mech 2008, 1-12. P10008 P10008
- Lancichinetti A, Fortunato S, Lancichinetti A, Fortunato S: Community detection algorithms: a comparative analysis. Phys Rev 2009,80(5):056117.
- Fortunato S, Barthelemy M: Resolution limit in community detection. PNAS 2007,104(1):36-41. 10.1073/pnas.0605965104PubMed CentralView ArticlePubMed
- Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005,21(15):3201-3212. 10.1093/bioinformatics/bti517View ArticlePubMed
- Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval. Cambridge University Press; Online edition; 2008.View Article
- Amigo E, Gonzalo J, Artiles J, Verdejo F: A comparison of extrinsic clustering evaluation metrics based on formal constrains. Inf Retrieval 2009, 12: 461-486. 10.1007/s10791-008-9066-8View Article
- Rosseeuw PJ: Sihouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987, 20: 53-65.View Article
- Dunn JC: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J Cybernetics 1973,3(3):32-57. 10.1080/01969727308546046View Article
- Giongo A, Richardson AGD, Crabb DB, Triplett EW: Tax Collector: modifying current 16S rRNA databases for the rapid classification at six taxonomic levels. Diversity 2010, 2: 1015-1025. 10.3390/d2071015View Article
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE: A core gut microbiome in obese and lean twins. Nature 2009, 457: 480-484. 10.1038/nature07540PubMed CentralView ArticlePubMed
- Luiz FW: Pyrosequencing enumerates and contrasts soil microbial diversity. ISME J 2007, 1: 283-290.
- Ruan J, Zhang W: Identifying network communities with a high resolution. Phys Rev 2008,77(1):016104.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.