Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison
© Yang et al.; licensee BioMed Central Ltd. 2013
Received: 4 December 2012
Accepted: 10 April 2013
Published: 23 April 2013
Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons.
We created a metric, called compression-based distance (CBD), for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases to assess the applicability of CBD. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while requiring less computational time than similar tools and no expert user intervention.
CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets.
Keywords: Microbiota comparison; Microbiome analysis; Compression-based distance
Human-associated microbes outnumber human cells by a factor of ten. The majority of these microbes are harbored in the gastrointestinal tract (GIT) and play a strong role in determining an individual’s health. Commensal GIT microbes may modulate nutrient uptake and utilization, promote GIT development and maturation, extract energy from indigestible non-starch polysaccharides, maintain a healthy immune system, and regulate brain development and behavior [3-5]. According to some studies, many diseases, ranging from neurological disorders, such as Parkinson’s disease, to GIT-related diseases, such as Crohn’s disease (CD), ulcerative colitis (UC), irritable bowel syndrome and obesity [10, 11], are correlated with disturbed microbiotas that differ from those of healthy individuals. Surveys of the microbial diversity in the GIT of patients diagnosed with CD and UC found differing levels of microbial diversity between healthy and diseased GIT samples [7, 8]. Examination of the GIT of obese humans and mice revealed a markedly decreased fraction of Bacteroides and a markedly increased fraction of Firmicutes [10, 11]. These studies suggest a strong link between GIT microbial composition and GIT-related diseases. Recent work has correlated the alleviation of disease symptoms with treatments that alter the microbiota, such as fecal transplants. For example, recurrent Clostridium difficile-associated infections have been treated using fecal microbiome transplantation (FMT). The study showed that after two weeks, patient prognosis vastly improved and, correspondingly, the fecal bacteria composition of the patient became similar to that of the healthy donor. While many of these results are preliminary in nature [14-16], they all point to an area of rich research and the growing importance of the GIT microbiota.
The GIT microbiota composition has profound health implications. Modern characterization of GIT microbes is based on culture-independent methods using 16S ribosomal RNA gene (rDNA) hypervariable tag sequencing technologies. 16S rDNA is the most widely used marker for microbial species identification. Currently, next-generation 16S rDNA-based sequencing produces millions of sequences from a single run. This advance in sequencing technologies, however, represents a significant methodological challenge. Widely used methodologies include LIBSHUFF [19, 20], analysis of molecular variance (AMOVA) [21-23], parsimony tests [23-25] and UniFrac [26-28]. LIBSHUFF uses the Cramér-von Mises statistic to assess whether or not two microbial communities have the same structure [19, 20]. AMOVA determines whether or not there is a significant difference between the diversity within the two populations and the diversity of all the populations pooled [21-23]. Parsimony tests describe whether or not two community structures significantly differ from each other [23-25]. UniFrac uses phylogenetic information to detect differences between two microbiotas [26-28]. One weakness of the above methods is that they rely on multiple alignments and/or phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. Small changes in algorithms and parameters can have significant influences on the results of microbiota comparisons [29-31]. The underlying issue with multiple alignments and phylogenetic inference is that the search space for identifying the optimal multiple alignments and phylogenetic trees grows rapidly with the number of sequences. As the ability to sequence continues to outpace advances in computer hardware, more efficient computational algorithms with little or no sacrifice to accuracy will become necessary.
Data-compression techniques based on the notion of Kolmogorov complexity provide an alternative for microbiota comparisons that bypasses multiple alignments and phylogenetic inference. Kolmogorov complexity is defined as the minimum amount of information to reproduce a set of data . As such, Kolmogorov complexity serves as a measure of the repetitiveness within a data set—a powerful proxy for measuring the similarities and differences between datasets [34-36]. However, this theoretically defined concept cannot be computed exactly. Instead, compression algorithms are often used as an approximation for the Kolmogorov complexity [34, 35]. The idea of using compression-based metrics on biological data has a long and established history. Data-compression techniques have been used to construct phylogenetic trees , analyze mitochondrial genomes , classify protein sequences , quantify the time-evolution of macrophage gene expression , and classify 16S rDNA sequences at family level . Here, we extend the application of a data-compression method for microbiota comparisons based on the repetitive nature of 16S rDNA hypervariable tag sequencing.
One advantage of CBD is that it operates more directly on the quality-filtered sequence data to generate distance matrices, thus omitting the need for expert intervention in multiple alignments and phylogenetic inference. In this study, three previously published GIT microbiota datasets were used to demonstrate simplicity, speed and accuracy in the application of CBD on GIT microbiotas comparisons. Although compression algorithms can be parameterized to achieve different levels of compression, our applications of these algorithms were done without any significant parameter tuning, highlighting an important practical advantage of CBD.
Table: Comparisons of CBD with mothur and QIIME
In order to assess the accuracy of CBD, three published datasets were chosen to repeat previous analyses using distances obtained from CBD: (1) human GIT microbiota ; (2) humanized mouse GIT microbiota ; and (3) human mucosa-associated microbiota .
Human GIT microbiota
Turnbaugh et al. used unweighted UniFrac to analyze a total of 1,937,461 V2 and V6 bacterial 16S rDNA sequences from fecal samples of 154 individuals (31 monozygotic, 23 dizygotic twin pairs, and their mothers). The average sequences per V2 and V6 sample were 3,984 ± 232 and 24,786 ± 1,403, respectively. This revealed that family members had greater similarity in their GIT microbiota composition than unrelated individuals; there is a much greater resemblance in the GIT microbiotas of lean or obese related individuals than lean or obese unrelated individuals . The data were then reanalyzed and compared with previously published results.
Humanized mouse GIT microbiota
Human mucosa-associated microbiota
The development of advanced and cost-effective DNA sequencing techniques enables the generation of tremendous datasets. For example, three recent studies reported that Illumina GAIIx or HiSeq platform produced millions of reads [45-47]. To accommodate this high-throughput data generation, simple and fast tools are extremely important for efficiently and accurately extracting information to further characterize microbiota. Increasing the efficiency of microbial community comparisons has profound implications for research. The CBD method described here facilitates efficient similarity comparisons between microbiotas.
CBD generates the distance matrix directly from sample sequences in relatively few steps. In contrast, the tree-based metric required multiple steps including assignment of OTUs, alignment, production of phylogenetic trees and generation of a distance matrix . Furthermore, Caporaso et al. determined that approximately 92% of the computational time was devoted to picking OTUs rather than determining distance assessment. Compared to QIIME and mothur, CBD required much less time completing the distance matrix from large numbers of sequences.
The accuracy of CBD was demonstrated by the reproduction of the statistical relationships between different classes of microbiotas and the ability to reproduce the results from microbial comparison using various methods. In this way, CBD was shown to be a robust and useful tool. However, we note that CBD is not a wholesale replacement for more involved analyses. For example, CBD does not provide information such as taxa or OTU distributions. It provides a simple, rapid, and accurate metric for comparing distances between entire communities of microbes, not a fine-grained assessment of particular species within a community.
The simplicity, speed, and accuracy of CBD suggest that it can facilitate microbiota research on human-related samples. It does not require enormous sequencing depths obtained from non-invasively collected stool samples, and it is relatively simple for a biological/clinical researcher to compute CBD values. There is increasing evidence supporting the application of GIT microbiota assessments. Smith et al. have implicated the GIT microbial composition as a causal factor of kwashiorkor. Qin et al. reported that the GIT microbiota of CD patients could be differentiated from that of healthy controls and UC patients based on the abundance of 155 bacterial species. Khoruts et al. observed that, two weeks after fecal transplantation, the fecal microbes of Clostridium difficile-associated disease patients were similar to those of healthy donors. In a recent study, switching mice from a low-fat diet to a high-fat diet was shown to abruptly change the population of GIT microorganisms within one day. Potentially, CBD could support more informed microbial management by comparing the microbiota before, during, and after manipulation. It could facilitate the exploration of new treatment strategies, and it could be used for diagnosis and prognosis of GIT-related diseases.
The focus of this work was to explore CBD as a tool for microbiota community comparison with a focus on clinical applications. However, the principles behind CBD should be equally applicable to any set of sequenced amplicons. This may be useful in other studies related to the microbiota that focus on fungal or other eukaryotic organisms in the gastrointestinal tract or other environments by examining 18S rRNA hypervariable tag sequencing or internal transcribed spacer regions (ITS).
CBD is web-based and freely accessible at http://tornado.igb.uiuc.edu/CBD/CBD.html. Sequence data in FASTA format can be directly uploaded to the CBD website for analysis. CBD is copyrighted by the board of trustees of the University of Illinois.
CBD provides a simple, rapid, and accurate method for microbiota comparisons. It uses the relative compression of combined and individual datasets to quantify overlaps between two microbial communities and is therefore independent of multiple alignments and phylogenetic inference. CBD works directly on sequence datasets without intermediate steps. The speed advantages of CBD over pipelines in QIIME and mothur became more pronounced as dataset size increased. Tests run on previously analyzed data indicated strong agreement between CBD and more time-consuming analyses.
CBD is computed from three compressed sizes: C(X), the size of data X after compression; C(Y), the size of data Y after compression; and C(XY), the size after compression of the concatenated data XY, in which data Y is appended to the end of data X. The Lempel-Ziv-Markov chain algorithm (LZMA) compressor (compression level −9) was used. CBD scores ranged between 0 and 1, with similar datasets returning smaller values and different datasets returning greater values. The similarity between two microbiotas calculated by the CBD metric was influenced by two factors: the number of similar sequences shared by the two microbiotas and the total size of the concatenation of the two microbiota datasets. For the same number of similar sequences, the larger the total size of the concatenated datasets, the greater the CBD value.
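To illustrate how the three compressed sizes can be turned into a distance, the following Python sketch computes C(X), C(Y), and C(XY) with LZMA at preset 9 and combines them using the normalized compression distance (NCD) from the clustering-by-compression literature. NCD is used here only as a stand-in and is an assumption of this sketch; it should not be read as the CBD expression itself. The function names are hypothetical.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of `data` after LZMA compression (preset 9)."""
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: NCD is a stand-in for illustration, not the CBD formula.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # y concatenated to the end of x
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Under this stand-in, two copies of the same dataset give a value near 0, because the second copy adds almost nothing to the compressed size, while unrelated datasets give a value near 1.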
The specific compression tool (LZMA) was chosen based on tests indicating that it provided better compression ratios than other commonly available compression tools such as zip, gzip, or bz2. For all datasets, we removed the sequence labels before compression so that the sequence names did not affect our results. The datasets were then sorted before compression in order to improve the compression ratio further. Sorting resulted in a large performance boost, especially for datasets larger than the memory footprint of the compression algorithm, by placing similar sequences near each other in memory.
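The label removal and sorting steps can be sketched in Python as follows. This is a minimal illustration that assumes FASTA input in which header lines start with ">" and a sequence may span several lines; the function name and the choice of a newline separator between sequences are assumptions of the sketch, not details taken from the CBD implementation.

```python
def fasta_to_sorted_payload(fasta_text: str) -> bytes:
    """Drop FASTA labels, sort the sequences, and return bytes to compress."""
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header ends the previous record; the label itself is discarded.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    # Sorting places similar sequences next to each other before compression.
    sequences.sort()
    return "\n".join(sequences).encode("ascii")
```

The returned bytes can be passed directly to a compressor such as Python's `lzma.compress`.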
Test of CBD on artificial datasets
Datasets used in this analysis
In this study, three previously published GIT microbiota datasets were used: 1) V2 and V6 16S rDNA datasets from a recent study that focused on the GIT microbiotas of lean and obese twin pairs and their mothers ; 2) V2 16S rDNA datasets from an analysis of the effect of diet switch from low-fat diet to high-fat diet on humanized murine GIT microbiota composition ; and 3) full-length 16S rDNA datasets from mucosa-associated microbiotas from inflamed and non-inflamed sites of CD and UC patients in the colon as well as that from healthy controls . These datasets were used to test if CBD could successfully recapture the conclusions of previous clinical studies. The links to the three published GIT microbiota datasets can be found at http://tornado.igb.uiuc.edu/CBD/CBDFiles/CBDDownload.html. The first human GIT microbiota data was also used to assess the speed of CBD.
Measurement of computational time
The first five, ten, fifteen, and twenty V2 16S rDNA datasets at the first time point in Additional file 1: Table S1 of Turnbaugh et al. were chosen to form four group files. One thousand sequences were randomly chosen from each file within the group files and compared pairwise to produce either a CBD distance matrix or an unweighted UniFrac distance matrix [40, 41]. The unweighted UniFrac matrices were produced with the QIIME pipeline (http://qiime.sourceforge.net) using default parameters (except that cd-hit was used for OTU picking) or with mothur (using unique.seqs to remove identical sequences, align.seqs to align unique sequences, clearcut to produce neighbor joining trees, and unifrac.unweighted to generate the UniFrac distance matrix). Because QIIME integrates many 16S rDNA analysis software tools into one system, the fastest way to run QIIME (v.1.2.0) is to build the QIIME Virtual Box, which requires at least 1024 MB memory, 120 GB storage and a 64-bit system. The time analysis of CBD and QIIME was therefore performed on the same computer configuration (8 Intel(R) Xeon(R) CPU E5504 at 2.00 GHz). Because the generation of the tree file with the clearcut command in mothur v.1.24.1 requires large amounts of memory (RAM), the time analysis of CBD and mothur was performed on a large-memory cluster located at the Institute for Genomic Biology at the University of Illinois at Urbana-Champaign (2 nodes: 16 2.4 GHz Intel CPUs and 256 GB of RAM, as well as 24 2.0 GHz Intel CPUs and 1024 GB of RAM) [40, 41]. Sequence data used to measure the computational time can be downloaded at http://tornado.igb.uiuc.edu/CBD/CBDFiles/CBDDownload.html.
Mantel test for dissimilarity between CBD and UniFrac matrix
The Mantel statistic based on Pearson’s product-moment correlation with 1000 permutations was used to evaluate the relation between the CBD and unweighted UniFrac distance matrices. The first twenty V2 16S rDNA datasets at the first time point in Additional file 1: Table S1 of Turnbaugh et al. were used to perform the Mantel test in R (v.2.11.1). The Pearson correlation coefficient between the CBD matrix and the unweighted UniFrac distance matrix obtained from mothur was 0.868 (P-value = 0.001), indicating a strong, positive, and statistically significant correlation between the two matrices. The Pearson correlation coefficient between the CBD matrix and the unweighted UniFrac distance matrix obtained from QIIME was 0.208 (P-value = 0.035), a weaker but still statistically significant correlation. The Pearson correlation coefficient between the mothur distance matrix and the QIIME distance matrix was 0.226 (P-value = 0.027), which is similarly weak, positive, and statistically significant. While all matrices are significantly correlated, the strength of the correlation differs considerably, particularly in comparisons involving QIIME.
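For readers unfamiliar with the Mantel test, the following Python sketch shows the general idea: the Pearson correlation is computed over the upper triangles of two distance matrices, and its significance is estimated by permuting the rows and columns of one matrix together. This is a simplified, one-sided illustration and not the R routine used in the analysis; the function names, the fixed seed, and the (hits + 1) / (permutations + 1) P-value convention are assumptions of the sketch.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def mantel(d1, d2, permutations=1000, seed=0):
    """One-sided Mantel test on two symmetric distance matrices (lists of lists)."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))  # upper-triangle index pairs
    flat1 = [d1[i][j] for i, j in pairs]
    observed = pearson(flat1, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel the samples of d2: rows and columns are permuted together.
        permuted = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(flat1, permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Permuting sample labels rather than individual matrix entries preserves the dependence among distances that share a sample, which is why an ordinary correlation test is not appropriate for distance matrices.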
Sequence datasets from three previous studies were used to generate a respective distance matrix. In the study of identical and fraternal twin pairs and their mothers , V2 16S rDNA sequences from the same person at two different time points were merged. Sequences were sorted for each merged V2 and V6 dataset. All pairs of merged V2 or V6 16S rDNA sequences were then compared using the CBD metric. These pairwise distances were used to generate a distance matrix. Twenty-one pairs of samples were analyzed by CBD (Additional file 1: Table S1). In the study of the effect of diet on humanized murine GIT microbiota, all GIT microbiotas under different diets were pairwise compared to each other to generate a distance matrix . In order to study the effect of disease on GIT microbiota composition, all mucosa-associated microbiotas from CD and UC patients’ inflamed and non-inflamed sites and healthy controls were pairwise compared to generate a distance matrix . The distance matrices can be downloaded at http://tornado.igb.uiuc.edu/CBD/CBDFiles/CBDDownload.html.
In the study of identical and fraternal twin pairs and their mothers, the rows and columns of the distance matrix were randomly permuted 1000 times. To determine whether differences were significant, the distribution of the permuted results was compared to the actual values.
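A permutation test of this kind can be sketched in Python as follows. The text does not state which summary statistic was compared, so the sketch assumes one plausible choice, the mean distance between samples that share a group label (for example, a family identifier), and tests whether it is smaller than expected by chance. Shuffling the labels is equivalent to permuting the rows and columns of the matrix together. The function names and the statistic are assumptions.

```python
import random
from itertools import combinations


def mean_within_group(dist, labels):
    """Mean distance over all pairs of samples that carry the same label."""
    values = [dist[i][j]
              for i, j in combinations(range(len(labels)), 2)
              if labels[i] == labels[j]]
    return sum(values) / len(values)


def permutation_p_value(dist, labels, permutations=1000, seed=0):
    """Estimate how often shuffled labels give a within-group mean this small."""
    observed = mean_within_group(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # same labels, randomly reassigned to samples
        if mean_within_group(dist, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

A small P-value under this sketch indicates that samples sharing a label are closer to each other than randomly grouped samples would be.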
Metric dimensional scaling
In order to visualize the distance relationships between data samples from different individuals, metric dimensional scaling (MDS) in R (v.2.11.1) was used to project the distance information into a low-dimensional, easy-to-visualize space in which similarities between data points were conserved as much as possible. A two-dimensional MDS representation of the distance matrices was plotted with matplotlib (a Python 2D graphics package for generating publication-quality images).
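The projection step can be illustrated with a small Python sketch of classical (Torgerson) scaling, one common form of metric MDS. The text does not state which R function was used, so this choice is an assumption. The sketch double-centres the squared distances and extracts the leading eigenvectors by power iteration with deflation; it assumes the leading eigenvalues are positive and is intended for small matrices only. The function name is hypothetical.

```python
import random
from math import sqrt


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Embed a symmetric distance matrix (list of lists) in `dims` dimensions."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigenvalue <= 0.0:
            break  # remaining axes stay at zero
        for i in range(n):
            coords[i][d] = v[i] * sqrt(eigenvalue)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords
```

Each row of the returned list holds the coordinates of one sample, which can then be drawn as a scatter plot. For matrices of realistic size, a library eigendecomposition would be a more practical choice than power iteration.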
rDNA: Ribosomal RNA gene
OTUs: Operational taxonomic units
MDS: Metric dimensional scaling
AMOVA: Analysis of molecular variance
PCoA: Principal coordinates analysis.
This research was supported by grants AG2008-34480-19328 and 454538AG58-5438-7-317l (LBS) from United States Department of Agriculture and Agricultural Research Service and by the Institute for Genomic Biology at the University of Illinois at Urbana-Champaign (NC). We are also indebted to Han Jiang for his technical support and discussion contributions; to Maksim Sipos for his technical support contributions; and to Dr. Saurabh Sinha, Shuyi Ma and Matthew A. Richards for their critical editing contributions.
- Savage DC: Microbial ecology of the gastrointestinal tract. Annu Rev Microbiol. 1977, 31 (6): 107-133.
- Jia W, Li H, Zhao L, Nicholson JK: Gut microbiota: a potential new territory for drug targeting. Nat Rev Drug Discov. 2008, 7 (2): 123-129. 10.1038/nrd2505.
- Hrncir T, Stepankova R, Kozakova H, Hudcovic T, Tlaskalova-Hogenova H: Gut microbiota and lipopolysaccharide content of the diet influence development of regulatory T cells: studies in germ-free mice. BMC Immunol. 2008, 9: 65.
- Sekirov I, Russell SL, Caetano M, Antunes L, Finlay BB: Gut microbiota in health and disease. Physiol Rev. 2010, 90 (3): 859-904. 10.1152/physrev.00045.2009.
- Heijtz RD, Wang S, Anuar F, Qian Y, Björkholm B, Samuelsson A, Hibberd ML, Forssberg H, Pettersson S: Normal gut microbiota modulates brain development and behavior. Proc Natl Acad Sci USA. 2011, 108 (7): 3047-3052. 10.1073/pnas.1010529108.
- Ananthaswamy A: Bugs from your gut to mine. New Sci. 2011, 209 (2796): 8-9. 10.1016/S0262-4079(11)60124-3.
- Manichanh C, Rigottier-Gois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, Nalin R, Jarrin C, Chardon P, Marteau P, Roca J, Dore J: Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut. 2006, 55 (2): 205-211. 10.1136/gut.2005.073817.
- Andoh A, Sakata S, Koizumi Y, Mitsuyama K, Fujiyama Y, Benno Y: Terminal restriction fragment length polymorphism analysis of the diversity of fecal microbiota in patients with ulcerative colitis. Inflamm Bowel Dis. 2007, 13 (8): 955-962. 10.1002/ibd.20151.
- Si J, Yu Y, Fan Y, Chen S: Intestinal microecology and quality of life in irritable bowel syndrome patients. World J Gastroenterol. 2004, 10 (12): 1802-1805.
- Ley RE, Bäckhed F, Turnbaugh P, Lozupone CA, Knight RD, Gordon JI: Obesity alters gut microbial ecology. Proc Natl Acad Sci USA. 2005, 102 (31): 11070-11075. 10.1073/pnas.0504978102.
- Ley RE, Turnbaugh PJ, Klein S, Gordon JI: Microbial ecology: Human gut microbes associated with obesity. Nature. 2006, 444 (7122): 1022-1023. 10.1038/4441022a.
- Khoruts A, Dicksved J, Jansson JK, Sadowsky MJ: Changes in the composition of the human fecal microbiome after bacteriotherapy for recurrent Clostridium difficile-associated diarrhea. J Clin Gastroenterol. 2010, 44 (5): 354-360.
- Yoon SS, Brandt LJ: Treatment of refractory/recurrent C. difficile-associated disease by donated stool transplanted via colonoscopy: a case series of 12 patients. J Clin Gastroenterol. 2010, 44 (8): 562-566. 10.1097/MCG.0b013e3181dac035.
- Duncan SH, Lobley GE, Holtrop G, Ince J, Johnstone AM, Louis P, Flint HJ: Human colonic microbiota associated with diet, obesity and weight loss. Int J Obes. 2008, 32 (11): 1720-1724. 10.1038/ijo.2008.155.
- Schwiertz A, Taras D, Schafer K, Beijer S, Bos NA, Donus C, Hardt PD: Microbiota and SCFA in lean and overweight healthy subjects. Obesity (Silver Spring). 2010, 18 (1): 190-195. 10.1038/oby.2009.167.
- Salonen A, De Vos WM, Palva A: Gastrointestinal microbiota in irritable bowel syndrome: present state and perspectives. Microbiology. 2010, 156 (Pt 11): 3205-3215.
- Pace NR: A molecular view of microbial diversity and the biosphere. Science. 1997, 276 (5313): 734-740. 10.1126/science.276.5313.734.
- Woese CR, Kandler O, Wheelis ML: Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci USA. 1990, 87 (12): 4576-4579. 10.1073/pnas.87.12.4576.
- Schloss PD, Larget BR, Handelsman J: Integration of microbial ecology and statistics: a test to compare gene libraries. Appl Environ Microbiol. 2004, 70 (9): 5485-5492. 10.1128/AEM.70.9.5485-5492.2004.
- Singleton DR, Furlong MA, Rathbun SL, Whitman WB: Quantitative comparisons of 16S rRNA gene sequence libraries from environmental samples. Appl Environ Microbiol. 2001, 67 (9): 4374-4376. 10.1128/AEM.67.9.4374-4376.2001.
- Anderson M: A new method for non-parametric multivariate analysis of variance. Austral Ecol. 2001, 26 (1): 32-46.
- Excoffier L, Smouse P, Quattro J: Analysis of molecular variance inferred from metric distances among DNA haplotypes - application to human mitochondrial-DNA restriction data. Genetics. 1992, 131 (2): 479-491.
- Martin AP: Phylogenetic approaches for describing and comparing the diversity of microbial communities. Appl Environ Microbiol. 2002, 68 (8): 3673-3682. 10.1128/AEM.68.8.3673-3682.2002.
- Fitch W: Toward defining course of evolution - minimum change for a specific tree topology. Syst Zool. 1971, 20 (4): 406-416. 10.2307/2412116.
- Maddison W, Slatkin M: Null models for the number of evolutionary steps in a character on a phylogenetic tree. Evolution. 1991, 45 (5): 1184-1197. 10.2307/2409726.
- Lozupone C, Hamady M, Knight R: UniFrac - an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics. 2006, 7: 371. 10.1186/1471-2105-7-371.
- Lozupone C, Knight R: UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005, 71 (12): 8228-8235. 10.1128/AEM.71.12.8228-8235.2005.
- Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R: UniFrac: an effective distance metric for microbial community comparison. ISME J. 2011, 5 (2): 169-172. 10.1038/ismej.2010.133.
- White JR, Navlakha S, Nagarajan N, Ghodsi M, Kingsford C, Pop M: Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinformatics. 2010, 11: 152. 10.1186/1471-2105-11-152.
- Schloss PD: The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol. 2010, 6 (7): e1000844. 10.1371/journal.pcbi.1000844.
- Sipos M, Jeraldo P, Chia N, Qu A, Dhillon AS, Konkel ME, Nelson KE, White BA, Goldenfeld N: Robust computational analysis of rRNA hypervariable tag datasets. PLoS One. 2010, 5 (12): e15220. 10.1371/journal.pone.0015220.
- Rudi K, Zimonja M, Kvenshagen B, Rugtveit J, Midtvedt T, Eggesbo M: Alignment-independent comparisons of human gastrointestinal tract microbial communities in a multidimensional 16S rRNA gene evolutionary space. Appl Environ Microbiol. 2007, 73 (8): 2727-2734. 10.1128/AEM.01205-06.
- Li M, Vitányi P: An introduction to Kolmogorov complexity and its applications. 2008, New York: Springer.
- Li M, Chen X, Li X, Ma B, Vitanyi PMB: The similarity metric. IEEE Trans Inf Theory. 2004, 50 (12): 3250-3264. 10.1109/TIT.2004.838101.
- Cilibrasi R, Vitanyi PMB: Clustering by compression. IEEE Trans Inf Theory. 2005, 51 (4): 1523-1545. 10.1109/TIT.2005.844059.
- Nykter M, Price ND, Aldana M, Ramsey SA, Kauffman SA, Hood LE, Yli-Harja O, Shmulevich I: Gene expression dynamics in the macrophage exhibit criticality. Proc Natl Acad Sci USA. 2008, 105 (6): 1897-1900. 10.1073/pnas.0711525105.
- Otu HH, Sayood K: A new sequence distance measure for phylogenetic tree construction. Bioinformatics. 2003, 19 (16): 2122-2130. 10.1093/bioinformatics/btg295.
- Kocsor A, Kertész-Farkas A, Kaján L, Pongor S: Application of compression-based distance measures to protein sequence classification: A methodological study. Bioinformatics. 2006, 22 (4): 407-412. 10.1093/bioinformatics/bti806.
- Santoni D, Romano-Spica V: A gzip-based algorithm to identify bacterial families by 16S rRNA. Lett Appl Microbiol. 2006, 42 (4): 312-314. 10.1111/j.1472-765X.2006.01872.x.
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75 (23): 7537-7541. 10.1128/AEM.01541-09.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7 (5): 335-336. 10.1038/nmeth.f.303.
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature. 2009, 457 (7228): 480-484. 10.1038/nature07540.
- Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, Gordon JI: The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med. 2009, 1 (6): 6ra14. 10.1126/scitranslmed.3000322.
- Walker AW, Sanderson JD, Churcher C, Parkes GC, Hudspith BN, Rayment N, Brostoff J, Parkhill J, Dougan G, Petrovska L: High-throughput clone library analysis of the mucosa-associated microbiota reveals dysbiosis and differences between inflamed and non-inflamed regions of the intestine in inflammatory bowel disease. BMC Microbiol. 2011, 11: 7. 10.1186/1471-2180-11-7.
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA. 2011, 108 (SUPPL. 1): 4516-4522.
- Bartram AK, Lynch MDJ, Stearns JC, Moreno-Hagelsieb G, Neufeld JD: Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl Environ Microbiol. 2011, 77 (11): 3846-3852. 10.1128/AEM.02772-10.
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Huntley J, Fierer N, Owens SM, Betley J, Fraser L, Bauer M, Gormley N, Gilbert JA, Smith G, Knight R: Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 2012, 6 (8): 1621-1624. 10.1038/ismej.2012.8.
- Smith MI, Yatsunenko T, Manary MJ, Trehan I, Mkakosya R, Cheng J, Kau AL, Rich SS, Concannon P, Mychaleckyj JC, Liu J, Houpt E, Li JV, Holmes E, Nicholson J, Knights D, Ursell LK, Knight R, Gordon JI: Gut microbiomes of Malawian twin pairs discordant for kwashiorkor. Science. 2013, 339 (6119): 548-554. 10.1126/science.1229000.
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto J, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P: A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010, 464 (7285): 59-65. 10.1038/nature08821.
- Mugavin ME: Multidimensional scaling - a brief overview. Nurs Res. 2008, 57 (1): 64-68. 10.1097/01.NNR.0000280659.88760.7c.
- Hunter JD: Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007, 9 (3): 90-95.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.