Complex microbial communities, such as those of the human gastrointestinal (GI) tract and of environmental specimens, have gained increased attention in recent years, thanks to technological advances in culture-independent methods based on the amplification of 16S rRNA genes [1, 2]. The NIH Roadmap Human Microbiome Project (HMP) has undertaken a large-scale effort to characterize 16S rRNA sequences from healthy human subjects and from human subjects with various diseases. In the course of the project, the sequencing centers used both ABI 3730 Sanger sequencing and 454 FLX Titanium pyrosequencing platforms to generate and release reference data from multiple body sites sampled in 300 healthy human subjects [3, 4]. Traditional phylogenetic analysis of a sample is performed by amplifying 16S rRNA genes, cloning, and sequencing by the Sanger method. An advantage of this method is that a single-pass Sanger read of 900–1000 bases is sufficient for classifying bacteria. Disadvantages include potential cloning bias, as well as time and expense, which can be prohibitive for in-depth sampling of complex microbial communities.
Next-generation sequencing (NGS) technology provides a promising alternative for quantifying the microbiome without the limitations of cloning and Sanger sequencing. For instance, a single run of the 454 Life Sciences pyrosequencing platform can produce 1.2 million sequences in 8 hours, an output that would require months or years of work with the older methods. The high throughput per run means that the unit cost of NGS is only a fraction of that of Sanger sequencing. The new technology also eliminates cloning bias by directly sequencing the 16S rRNA genes generated by polymerase chain reaction (PCR). High-throughput sequencing is therefore ideal, provided it can be adapted to the requirements of microbiome work. Its main limitation, however, is read length: reads from NGS technologies are considerably shorter than those from Sanger sequencing. Illumina's Solexa and Applied Biosystems' SOLiD platforms generate reads of about 25–100 bases, while 454 sequencing technology yields reads of up to 400–500 bases. The concern is a loss of classification accuracy with shorter reads [8, 9]. The bias associated with PCR amplification is a further concern for PCR-based next-generation sequencing. Several strategies have been tried to maximize the information obtained from short sequences. One is to target the hypervariable regions (HVR) that are most informative for the specific microbiome of interest [11, 12]. In contrast to the Sanger and NGS methods, quantitative PCR (qPCR) employs primers specific for a particular bacterium to detect and quantify it. Although qPCR provides a reliable and accurate measure of the absolute amount of 16S rRNA genes from one specific organism, its accuracy relies on proper primer design.
To date, few attempts have been made to systematically compare and combine different measurement modalities for microbiome analysis. Nossa et al. surveyed broad-range 16S rRNA primers for use in 454 pyrosequencing to classify bacteria from the human foregut microbiome. Reads 900 bases long were simulated to represent Sanger sequences, and the taxonomies assigned from them were treated as accurate. The group concluded that the 347F/803R primer pair (covering the 16S rRNA V3V4 region) is the most suitable for pyrosequencing-based classification of foregut 16S rRNA genes. Frank et al. observed similar results from Sanger sequencing and pyrosequencing in the human nasal microbiota. One recent work has demonstrated that the measured profile (identification and abundance) of microbial communities depends highly on the choice of sequencing platform: Sanger sequencing and pyrosequencing with different target regions (V1V3, V4V6, V7V9) yielded varying patterns for different genera. It is therefore difficult to compare the accuracies of different sequencing platforms for measuring microbiome composition using an experimental approach.
Here we propose an alternative analytical approach that uses latent variable structural equation modeling (SEM) to compare and integrate microbiome measurements from different measurement platforms. The latent variable SEM treats the true bacterial composition of a sample as the latent (unobserved) variable and estimates the relations between, and the reliabilities of, the different measurement platforms. If necessary, it then combines the platforms for a joint analysis in which each platform is weighted by its reliability. The latent variable SEM includes repeated measures ANOVA, in both its univariate and multivariate versions, as special cases, and is free from the rigid assumptions of those approaches, such as weighting each platform equally in the analysis regardless of its reliability and assuming equal measurement error variances. Furthermore, as with repeated measures ANOVA, the latent variable SEM can easily incorporate covariates such as disease phenotypes and genotypes [20, 21] to examine their influences on the underlying microbiome composition/bacteria expression.
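To make the idea of platform reliabilities concrete, the following is a minimal sketch of a one-factor measurement model with three platforms, fitted by a simple method of moments on simulated data. It is not the estimation procedure used in this paper: the model form (y_j = lambda_j * eta + e_j with Var(eta) = 1 and independent errors), the simulated loadings and error standard deviations, and the function names `simulate`, `one_factor_fit` and `bartlett_weights` are all assumptions introduced for illustration. The sketch also assumes that all pairwise covariances between platforms are positive.

```python
import random
from statistics import fmean


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def simulate(n, seed, loadings=(1.0, 0.8, 0.6), error_sds=(0.5, 0.6, 0.9)):
    """Simulate n samples measured on three hypothetical platforms.

    eta is the latent (unobserved) true abundance; each platform observes
    lambda_j * eta plus independent Gaussian measurement error.
    """
    rng = random.Random(seed)
    columns = [[], [], []]
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)
        for col, lam, sd in zip(columns, loadings, error_sds):
            col.append(lam * eta + rng.gauss(0.0, sd))
    return columns


def one_factor_fit(y1, y2, y3):
    """Method-of-moments fit of a one-factor model with three indicators.

    Under the assumed model, Cov(y_j, y_k) = lambda_j * lambda_k for j != k,
    so lambda_1^2 = c12 * c13 / c23, and likewise for the other platforms.
    Reliability is the share of a platform's variance due to the latent variable.
    """
    c12, c13, c23 = covariance(y1, y2), covariance(y1, y3), covariance(y2, y3)
    lam_sq = [c12 * c13 / c23, c12 * c23 / c13, c13 * c23 / c12]
    fits = []
    for y, l2 in zip((y1, y2, y3), lam_sq):
        var_y = covariance(y, y)
        fits.append({
            "loading": l2 ** 0.5,        # positive root, by assumption
            "error_var": var_y - l2,     # measurement error variance
            "reliability": l2 / var_y,
        })
    return fits


def bartlett_weights(fits):
    """Normalized weights proportional to loading / error variance."""
    raw = [f["loading"] / f["error_var"] for f in fits]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    fits = one_factor_fit(*simulate(20000, seed=1))
    for name, fit, w in zip(("platform 1", "platform 2", "platform 3"),
                            fits, bartlett_weights(fits)):
        print(name, round(fit["reliability"], 3), round(w, 3))
```

In this sketch a platform with a larger loading relative to its error variance receives a larger weight in the combined score, whereas an equal-weight analysis would correspond to forcing all loadings and all error variances to be the same across platforms.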
In this paper, we demonstrate the latent variable SEM approach through a study of the microbiome in inflammatory bowel diseases (IBD). Our primary goal is to identify the most reliable microbiome measurement platform. A secondary goal is to examine the impact of IBD disease phenotypes (Crohn's disease [CD] and ulcerative colitis [UC]) on the enteric microbiota. The measurement platforms compared in this study are: 1) ABI 3730 (Sanger) sequencing of the entire 16S rRNA gene; 2) 454 sequencing of the V1-V3 hypervariable regions; 3) 454 sequencing of the V3-V5 hypervariable regions. For a single bacterial taxon, Faecalibacterium spp., we also compared the three sequencing platforms with an established qPCR assay.