Removing Noise From Pyrosequenced Amplicons
© Quince et al; licensee BioMed Central Ltd. 2011
Received: 21 June 2010
Accepted: 28 January 2011
Published: 28 January 2011
In many environmental genomics applications a homologous region of DNA from a diverse sample is first amplified by PCR and then sequenced. The next generation sequencing technology, 454 pyrosequencing, has allowed much larger read numbers from PCR amplicons than ever before. This has revolutionised the study of microbial diversity as it is now possible to sequence a substantial fraction of the 16S rRNA genes in a community. However, there is a growing realisation that because of the large read numbers and the lack of consensus sequences it is vital to distinguish noise from true sequence diversity in this data. Otherwise this leads to inflated estimates of the number of types or operational taxonomic units (OTUs) present. Three sources of error are important: sequencing error, PCR single base substitutions and PCR chimeras. We present AmpliconNoise, a development of the PyroNoise algorithm that is capable of separately removing 454 sequencing errors and PCR single base errors. We also introduce a novel chimera removal program, Perseus, that exploits the sequence abundances associated with pyrosequencing data. We use data sets where samples of known diversity have been amplified and sequenced to quantify the effect of each of the sources of error on OTU inflation and to validate these algorithms.
AmpliconNoise outperforms alternative algorithms, substantially reducing per base error rates for both the GS FLX and the latest Titanium protocol. All three sources of error lead to inflation of diversity estimates. In particular, chimera formation has a hitherto unrealised importance which varies according to amplification protocol. We show that AmpliconNoise allows accurate estimates of OTU number. Just as importantly, AmpliconNoise generates the right OTUs even at low sequence differences. We demonstrate that Perseus has very high sensitivity and is able to find 99% of chimeras, which is critical when these are present at high frequencies.
AmpliconNoise followed by Perseus is a very effective pipeline for the removal of noise. In addition the principles behind the algorithms, the inference of true sequences using Expectation-Maximization (EM), and the treatment of chimera detection as a classification or 'supervised learning' problem, will be equally applicable to new sequencing technologies as they appear.
Next generation sequencing has revolutionised many areas of biology by providing a cheaper and faster alternative to Sanger sequencing. One technology that is finding many applications, for example in de novo genome sequencing or in diversity studies of regions of DNA that have been amplified by PCR, is 454 pyrosequencing. It is this latter application of 454 to the sequencing of PCR products, or amplicons, that we will focus on here. 454 pyrosequencing generates large numbers of reads, 400,000 in the original GS FLX implementation increasing to 800,000 with Titanium reagents, which are long compared to other next generation sequencing platforms: 250 bp for GS FLX, increasing to around 500 bp for Titanium. This makes it ideal for high resolution studies of the sequences and their relative frequencies in relatively long PCR products. The method is simply to sequence the diverse amplicon sample without cloning individual sequences. This has many applications, for instance in viral population dynamics or in characterising microbial communities through amplification of 16S rRNA genes.
Per base error rates from 454 pyrosequencing are comparable to those from Sanger sequencing but, without cloning, resequencing is impossible. In addition, the large read numbers obtainable mean that the absolute number of noisy reads is substantial. Consequently, it is critical to distinguish true diversity in the sample from noise introduced by the experimental procedure. This is particularly true if we want to calculate the absolute number of different sequences, or clusters of sequences, present. This is effectively the problem in microbial diversity estimation, where sequences are clustered into Operational Taxonomic Units (OTUs) that proxy for traditional taxa and we are interested in estimating the number of such OTUs in a community. It has already been noted that noise in 454 amplicon sequencing leads to inflated estimates of OTU number [5, 6]. This is important because surprisingly large OTU numbers, together with a bias towards rare taxa, were observed in the first studies of pyrosequenced 16S rRNA genes [3, 7]. This preponderance of rare taxa has been termed the 'rare biosphere'. The spurious OTUs generated by noise generally have low frequencies; consequently, noise may explain both the high OTU numbers and the bias towards low abundances reported. Development of effective noise removal algorithms is therefore a matter of urgency in the exploration of microbial diversity.
PyroNoise is a relatively sophisticated algorithm that uses a mixture model to reconstruct the true sequences and frequencies in the sample prior to OTU construction. It is based on clustering flowgrams rather than sequences, which allows 454 errors to be modelled naturally. Using this approach it is possible to account for two facts: firstly, that sequences with errors are likely to be rare and, secondly, that they should be similar to a true abundant sequence. The mixture model approach allows this to be done in a very natural way by fitting noise distributions around each proposed true sequence. The drawback is that an iterative, and hence computationally costly, algorithm must be used.
Two sources of error need to be considered in pyrosequenced amplicons: those from the pyrosequencing itself and those introduced by the PCR amplification. The original implementation of PyroNoise simultaneously removed both sources of error. Consequently it was necessary to align flowgrams. Here we present a new approach, AmpliconNoise, which couples a fast flowgram clustering step without alignment, still called PyroNoise, to a sequence based clustering, SeqNoise, which does perform alignments. The latter explicitly accounts for the differential rates of nucleotide errors in the PCR process, and uses sequence frequencies to inform the clustering process. The result is a more sensitive program than the original PyroNoise, achieved at lower computational cost because the fast alignment-free flowgram clustering reduces the data set size for the slower sequence clustering. AmpliconNoise has already been used to determine gut microbial diversities and for viral population genetics.
Recently another flowgram based denoising algorithm, DeNoiser, has been developed. This was motivated by the original PyroNoise and uses the same flowgram alignments but incorporates several modifications to increase speed. It begins by finding unique sequences, orders them by frequency and then, starting with the most abundant, maps the other reads onto these 'centroids' if their distance to the centroid is smaller than some threshold. The distance used is the same flowgram based measure as in the original PyroNoise. It is therefore a greedy agglomerative clustering rather than an iterative one. This reduces the computational cost of the algorithm, but misassignment of reads when the true sequences are similar may result in a loss of ability to accurately reconstruct OTUs. An even faster approach is to perform the same centroid based clustering using sequence rather than flowgram based distances; this is referred to as single-linkage preclustering (SLP) by Huse et al., and a similar strategy is adopted in the PyroTagger program. In this paper we will describe AmpliconNoise fully for the first time and compare it to the original PyroNoise algorithm, DeNoiser, and SLP in terms of ability to remove noise and allow accurate OTU construction. We will use 454 pyrosequencing data from known sequences for these evaluations, including both standard GS FLX and newer Titanium data.
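The greedy centroid strategy described above can be illustrated with a minimal sketch. It is not the DeNoiser or SLP implementation: the function names are hypothetical, reads are assumed to be pre-aligned and of equal length, and a simple fraction of mismatching positions stands in for the flowgram or sequence distances used by those programs.

```python
from collections import Counter


def fractional_distance(a, b):
    """Fraction of mismatching positions (assumes equal-length, pre-aligned reads)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length reads")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_centroid_clustering(reads, threshold):
    """Map reads onto centroids, visiting unique sequences from most to least abundant."""
    counts = Counter(reads)
    # Most abundant first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    centroids = {}  # centroid sequence -> number of reads mapped onto it
    mapping = {}    # unique sequence -> centroid it was mapped onto
    for seq in ordered:
        for centroid in centroids:
            if fractional_distance(seq, centroid) < threshold:
                centroids[centroid] += counts[seq]
                mapping[seq] = centroid
                break
        else:
            # No existing centroid is close enough: the sequence founds a new one.
            centroids[seq] = counts[seq]
            mapping[seq] = seq
    return centroids, mapping
```

Because each sequence is assigned once, to the first sufficiently close centroid, a rare true sequence that lies within the threshold of an abundant one is absorbed and cannot be recovered later, which is the misassignment risk noted above.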
SeqNoise accounts for PCR single base errors, but the PCR process can also introduce sequences that are composed of two or more true sequences, so called 'chimeras'. These generate sequences that are quite different from either parent and so cannot be removed by clustering. Chimeras are generated when incomplete extension occurs during the PCR process and the resulting fragment effectively acts as the primer in the next round of PCR. Existing algorithms for removing chimeras were developed for full length clone sequences and lack the sensitivity for short pyrosequencing reads [12, 13]. The program ChimeraSlayer is, to our knowledge, the only current chimera checker capable of handling 454 reads effectively. ChimeraSlayer requires a reference data set of sequences that are known to be non-chimeric. We introduce a new algorithm, Perseus, based on two novel principles for chimera removal: firstly, because the parents of any chimera must experience at least one more round of PCR than the chimera, we can search amongst all those sequences of equal or greater abundance than the chimera for possible parents; secondly, chimera removal should, with suitable training data sets, be treated as a problem in classification or 'supervised learning'. Adopting these principles, Perseus has the sensitivity to remove chimeras from 250 bp GS FLX reads, with the advantage of not requiring a set of good reference sequences.
We will now demonstrate that AmpliconNoise followed by Perseus is capable of removing the vast majority of erroneous reads from 454 pyrosequencing data, reducing overall error rates, and thereby allowing accurate OTU construction and microbial diversity estimation.
Test Data Sets
To test the noise removal algorithms we used eight previously published test data sets and one hitherto unpublished. These were all generated by preparing mixtures of known DNA sequences in known concentrations and then amplifying and pyrosequencing them. For the published data sets the standard GS FLX protocol was used, but specifically for this study we generated a further data set with the most recent Titanium reagents. Three different 16S rRNA regions were amplified, in all cases with a standard Taq polymerase: the V2 region, the V5 region, and for the Titanium data V4-V5. The mixtures consisted either of 16S rRNA clones, in the case of the V5 and V4-V5 data sets, or of DNA extracted from 67 separate isolated organisms for V2. Two V5 data sets were prepared: one 'Divergent' data set comprising 23 clones that differed at at least 7% of nucleotide positions, mixed in equal proportions, facilitating the unambiguous mapping of each read to a known clone; and one 'Artificial Community' where some clones differed by just a single nucleotide and concentrations varied by two orders of magnitude, mimicking a natural community. The V2 'Mock Communities' were similarly split between three replicates where the extracted DNA was mixed in equal proportions (Even1, Even2, Even3) and three where it was mixed unevenly (Uneven1, Uneven2, Uneven3). Full details of these test data sets are available from the original publications [5, 8].
The Titanium data was generated by pyrosequencing a mixture of 91 full length 16S rRNA clones obtained from an Arctic soil sample. These clones were independently Sanger sequenced although only 89 sequences could actually be recovered. Consequently the results presented here will be a lower bound on accuracy with a few sequences falsely categorised as errors that should be in the sample. Since this will apply equally to all the tested algorithms our ability to compare between them is not affected. The mixture contained each clone in equal abundance. This DNA mixture was then pyrosequenced following amplification with 16S rRNA primers that also had a tag (AGTGCGTA) and the standard Titanium A and B adaptors attached. The primers used were both degenerate, F515 (GTGNCAGCMGCCGCGGTAA) and R926 (CCGYCAATTYMTTTRAGTTT). Sequencing was forward from F515 so as to capture the V4 and most of the V5 region with a 400-500 bp Titanium read.
The eight GS FLX and one Titanium test data sets used in this study
Origins of Pyrosequencing Noise
Filtering Noisy Reads
It has previously been shown that some features correlate with noise in reads from 454 data [4, 6]. Consequently, filtering for those reads can reduce the overall level of noise in the data set. However, substantial noise remains after this process. The purpose of this study is not to evaluate different filtering methods for reads, but rather to address the problem of how to account for this remaining noise. Consequently we adopted a rather strict filtering procedure. First we checked for an exact match to the primer, and to the tag if present. We then used the observation that signal intensities between 0.5 and 0.7 are associated with noisy reads. We therefore truncated each read at the first such signal, or at the first sequence of four nucleotide flows that failed to give a signal ≥ 0.5, and removed any read where this occurred before flow 360 of the 400 flows in a GS FLX run. For the Titanium reads, which have 800 flows, we used the same procedure, keeping only reads where the first noisy flow occurred on or after flow 360. In addition, the level of noise increases towards the ends of reads. To account for this we removed the last 10% of flows, truncating all GS FLX reads at flow 360 and Titanium reads at flow 720. The numbers of reads after filtering for each data set are given in Table 1.
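The flow-level part of this filter can be sketched as follows. This is an illustration only, not the filtering code used in the study: the function names are hypothetical, the primer and tag check is omitted, the 0.5 to 0.7 interval is treated as inclusive, and the 'four nucleotide flows' condition is interpreted as any four consecutive flows with no signal ≥ 0.5. All of these are assumptions.

```python
def first_noisy_flow(flows, frame=4, low=0.5, high=0.7):
    """Index of the first noisy flow, or len(flows) if the flowgram is clean.

    A flow is treated as noisy if its signal lies in [low, high], or if it
    starts a run of `frame` consecutive flows that all have signal < low.
    """
    for i, signal in enumerate(flows):
        if low <= signal <= high:
            return i
        window = flows[i:i + frame]
        if len(window) == frame and all(s < low for s in window):
            return i
    return len(flows)


def filter_flowgram(flows, min_flows=360, max_flows=360):
    """Return the truncated flowgram, or None if the read should be discarded.

    The defaults follow the GS FLX settings in the text (discard if noisy
    before flow 360, truncate at flow 360); for Titanium the text implies
    min_flows=360 and max_flows=720.
    """
    cut = first_noisy_flow(flows)
    if cut < min_flows:
        return None
    return flows[:min(cut, max_flows)]
```

A Titanium read would be processed with `filter_flowgram(flows, min_flows=360, max_flows=720)`, so a read whose first noisy flow is at position 500 is kept but truncated to 500 flows.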
Removing Pyrosequencing Noise
to generate a total distance which is simply the sum of the distances for each signal. We used the V2 distributions, Figure 1, to calculate flowgram distances for all the GS FLX data sets but calculated new distributions for the Titanium data.
To infer the true sequences and their frequencies we maximise this likelihood using an expectation-maximization (EM) algorithm. EM algorithms apply very naturally to complex clustering problems. Intuitively, they exploit the fact that if the properties of the cluster centres are known, then the probability that a given data point was generated by a given cluster centre is easy to calculate; similarly, if those probabilities are known, then maximum likelihood solutions for the parameters of the cluster centres can be calculated. What is hard is to perform these two steps simultaneously. In an EM algorithm we avoid that by iterating the two steps separately until the parameters converge at a local maximum of the likelihood. The algorithm used here differed in two ways from that proposed previously. Firstly, we did not align flowgrams to our denoised sequences before calculating the distances. That was necessary previously to allow for PCR errors that cause changes resulting in insertions and deletions at the flowgram level. Flowgram gaps do occur very rarely as a result of pyrosequencing noise, when occasionally no signal ≥ 0.5 is observed in a frame of four nucleotides, but as described above we truncated our flowgrams when this was observed. Consequently, by not performing alignments we ensured that only pyrosequencing noise would be filtered at this step. Secondly, we did not construct the maximum likelihood sequences one flow at a time; instead we only allowed sequences that were observed in the data. This allows the final denoised reads to be mapped to the originals.
1. M step: Given the conditional probabilities, set each denoised sequence to the sequence corresponding to the perfect flowgram Ū_j that maximises Equation 3, restricted to the set of P unique perfect flowgrams obtained by rounding the observed flowgrams to integers. This corresponds to finding the perfect flowgram with the smallest total distance to all the reads, weighted by the conditional probabilities that each flowgram was generated by that denoised sequence (Equation 4).
2. Calculate the new distances between each flowgram and each denoised sequence.
3. E step: Calculate the new conditional probabilities (Equation 5).
4. Repeat until convergence.
Because they only find a local optimum, expectation-maximization algorithms are sensitive to initial conditions. To initialise the EM algorithm we performed a complete linkage hierarchical clustering based on flowgram distances and formed clusters at a given cut-off, c_p. This also defines the number of denoised sequences L, although the number with non-zero weight τ_j usually decreases during the iteration. The pyrosequencing noise removal therefore has two parameters, σ_p and c_p; for all the results presented here these were set at the values 1/60 and 0.01 respectively.
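The iteration can be illustrated with a minimal sketch under stated assumptions. It is not the PyroNoise implementation. It assumes a precomputed matrix of distances from every read to every candidate centre (the candidates being sequences observed in the data), an initial hard clustering supplied by the caller, and mixture components that decay exponentially with distance with characteristic length `sigma`, so that the conditional probability of read i belonging to component j is proportional to τ_j exp(-d_ij/σ). The function name, argument names and convergence test are hypothetical.

```python
import math


def em_denoise(dist, init_labels, sigma, max_iter=100, tol=1e-6):
    """dist[i][c]: distance from read i to candidate centre c.
    init_labels[i]: initial cluster index of read i (e.g. from hierarchical clustering).
    Returns (centres, weights, z), where z[i][j] are conditional probabilities."""
    n_reads, n_cand = len(dist), len(dist[0])
    n_comp = max(init_labels) + 1
    # Start from hard assignments given by the initial clustering.
    z = [[1.0 if init_labels[i] == j else 0.0 for j in range(n_comp)]
         for i in range(n_reads)]
    centres, weights = [0] * n_comp, [0.0] * n_comp
    for _ in range(max_iter):
        # M step: component weights, then the candidate centre with the
        # smallest total distance to the reads weighted by z.
        for j in range(n_comp):
            weights[j] = sum(z[i][j] for i in range(n_reads)) / n_reads
            centres[j] = min(
                range(n_cand),
                key=lambda c: sum(z[i][j] * dist[i][c] for i in range(n_reads)),
            )
        # E step: recompute conditional probabilities from the new centres.
        new_z = []
        for i in range(n_reads):
            d = [dist[i][centres[j]] for j in range(n_comp)]
            # Subtract the smallest distance among live components to avoid underflow.
            d_min = min(d[j] for j in range(n_comp) if weights[j] > 0.0)
            row = [weights[j] * math.exp(-(d[j] - d_min) / sigma) for j in range(n_comp)]
            total = sum(row)
            new_z.append([r / total for r in row])
        change = max(abs(new_z[i][j] - z[i][j])
                     for i in range(n_reads) for j in range(n_comp))
        z = new_z
        if change < tol:
            break
    return centres, weights, z
```

Restricting the centres to candidates observed in the data mirrors the second modification described above, and is what allows each denoised read to be mapped back to an original read.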
Removing PCR Noise
This requires alignment of the read to the sequence. Alignment was performed with a specially modified version of the Needleman-Wunsch algorithm with a reduced gap cost for homopolymer insertions and deletions. This accounted for the possibility of pyrosequencing noise on low frequency reads, which may not have been removed in the flowgram clustering. Gap penalties were included in the distance measure.
PCR per base error probabilities for the three 'Even' V2 'Mock Communities'
We used a mixture model to cluster the sequences, just as for pyrosequencing noise removal, where each component of the mixture corresponds to a true sequence about which observed noisy reads are distributed. The relative weights of each component are the true relative frequencies of the sequences. The reads are assumed to be distributed as exponentially decaying functions of their sequence error corrected distance from these true sequences. The magnitude of the sequence noise is described by the characteristic length of these exponentials, σ_s. A maximum likelihood fit of the mixture model can be obtained using an Expectation-Maximization algorithm initialised using the clusters formed from a hierarchical clustering of sequences at a given distance cut-off, c_s. In this study we used parameter values of σ_s = 0.033 and c_s = 0.08, parameters that experience has taught us work well for GS FLX data. For the Titanium data we compared two different values for σ_s, 0.1 and 0.04, whilst keeping c_s = 0.08. A standard gap was given a penalty of 15.0 and a homopolymer gap a penalty of 4.0. Prior to our sequence clustering step we truncated at 220 bp for GS FLX and 400 bp for Titanium because of the increase in error rates at the ends of reads.
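A global alignment with a reduced homopolymer gap cost, using the two gap penalties quoted above, might be sketched as follows. This is not the SeqNoise aligner. The mismatch cost of 1.0 is a placeholder (the study uses costs derived from PCR error probabilities), and the rule that a gap counts as a homopolymer gap when the base aligned against it repeats the preceding base in the same sequence is an assumption. The function name is hypothetical.

```python
def alignment_cost(a, b, mismatch=1.0, gap=15.0, homopolymer_gap=4.0):
    """Minimum-cost global (Needleman-Wunsch style) alignment of a and b."""

    def gap_cost(seq, k):
        # Cost of aligning seq[k] against a gap: cheaper if seq[k] extends a
        # homopolymer run, i.e. repeats the preceding base (an assumption).
        if k > 0 and seq[k] == seq[k - 1]:
            return homopolymer_gap
        return gap

    n, m = len(a), len(b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap_cost(a, i - 1)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap_cost(b, j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0.0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution,     # match or mismatch
                cost[i - 1][j] + gap_cost(a, i - 1),   # a[i-1] against a gap
                cost[i][j - 1] + gap_cost(b, j - 1),   # b[j-1] against a gap
            )
    return cost[n][m]
```

With these settings a read that differs from a sequence only by an extra base in a homopolymer run is much closer to it than one with an arbitrary insertion, which is the behaviour the reduced gap cost is intended to give.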
Chimeras are generated when incomplete extension occurs in one round of PCR and the resulting sequence fragment then acts as a primer for a different sequence in the next round. Consequently chimeras are composed of two (or occasionally more) true sequences with a discrete break point where the transition from one sequence to another occurs. For our nine test data sets we were therefore able to determine which sequences after denoising were likely chimeras by aligning each sequence against the known reference sequences and finding the putative parents and break point which gave the closest match to the query sequence. If the closest match to a chimera of two sequences was at least three nucleotides better than that to a single reference sequence, then the query was considered a possible two sequence chimera or 'Bimera'. If it was not, then it was considered a 'Good' sequence. Similarly, if the match was further improved by three nucleotides when two break points were allowed, then it was classified as possibly comprising three sequences, a 'Trimera', and again for the transition to a composite of four sequences or 'Quadramera'. However, the sequence was only assigned to one of these putative classes if the absolute match was sufficiently good as measured in terms of the sequence error corrected distance (e < 0.15, a distance corresponding to one non-homopolymer gap per 100 nucleotides). Otherwise the sequence was denoted as 'unclassified'. These could include contaminants, real unidentified 16S rRNA operons, gross pyrosequencing or PCR errors, or, most likely, chimeras that failed to fall under our rather strict definition.
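The 'Good' versus 'Bimera' decision can be illustrated with a simplified sketch. It assumes that the query and all references are already aligned to a common length and it counts plain mismatches, whereas the study aligns each query to the references and uses the sequence error corrected distance; the e < 0.15 check and the extension to 'Trimera' and 'Quadramera' are omitted. The function name is hypothetical.

```python
def classify_against_references(query, references, min_improvement=3):
    """Return ("Bimera", (mismatches, left_parent, right_parent, break_point))
    or ("Good", None). Parents are given as indices into `references`."""
    n = len(query)
    # prefix[r][k] = mismatches between query[:k] and references[r][:k]
    prefix = []
    for ref in references:
        if len(ref) != n:
            raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
        counts = [0]
        for q, r in zip(query, ref):
            counts.append(counts[-1] + (q != r))
        prefix.append(counts)
    best_single = min(p[n] for p in prefix)
    best_bimera = None
    indices = range(len(references))
    for k in range(1, n):  # break point: left parent covers [:k], right parent [k:]
        left = min(indices, key=lambda r: prefix[r][k])
        right = min(indices, key=lambda r: prefix[r][n] - prefix[r][k])
        total = prefix[left][k] + prefix[right][n] - prefix[right][k]
        if left != right and (best_bimera is None or total < best_bimera[0]):
            best_bimera = (total, left, right, k)
    if best_bimera is not None and best_single - best_bimera[0] >= min_improvement:
        return "Bimera", best_bimera
    return "Good", None
```

Storing prefix mismatch counts means the best left and right parents at each break point can be found independently, so the search is linear in the number of references per break point rather than quadratic.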
Perseus: Chimera Removal as a Problem in 'Supervised Learning'
For real pyrosequencing data we will not know a priori what sequences should be present and therefore chimera identification algorithms are necessary. Given the mechanics of PCR amplification, any chimera generated will experience at least one fewer PCR cycle than either parent; consequently both parents of a chimera will be present in the data set, each with a frequency at least equal to that of the chimera. This ignores the possibility of the chimera experiencing preferential PCR bias over its parents, but it will be true in the vast majority of cases. To exploit this observation we developed an algorithm, 'Perseus', that considers each sequence in turn and performs exact pairwise alignments to all sequences with equal or greater abundance, the set of possible parents. The two parents and break point that give the closest match to the query are then identified and a three way alignment of these sequences is generated using the mafft-linsi program. We calculate two quantities from this alignment. The first is the PCR error corrected distance from the query to the optimum chimera. For a sequence to be classified as chimeric this distance has to be smaller than 0.15 in absolute terms and smaller than or equal to the distance to the closest sequence amongst the best possible parents. This simply ensures that the hypothesis that the sequence is a chimera is possible. However, we still have to account for the possibility that the chimeric pattern could have evolved. We do this by calculating a second quantity from the alignment, the 'chimera index' I.
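The abundance rule and the two distance conditions can be written down directly. The sketch below is illustrative only: the function names are hypothetical, sequences are represented by identifiers in a dictionary of abundances, and the distances are assumed to have been computed elsewhere.

```python
def possible_parents(abundance, query):
    """Sequences whose abundance is equal to or greater than the query's,
    excluding the query itself, ordered from most to least abundant."""
    threshold = abundance[query]
    candidates = [s for s, n in abundance.items() if s != query and n >= threshold]
    return sorted(candidates, key=lambda s: (-abundance[s], s))


def passes_distance_checks(dist_to_chimera, dist_to_closest_parent, max_dist=0.15):
    """A chimeric explanation is only considered possible if the query is within
    max_dist of the optimum chimera and at least as close to it as to the
    closest of the best possible parents."""
    return dist_to_chimera < max_dist and dist_to_chimera <= dist_to_closest_parent
```

The most abundant sequence in a data set has no possible parents under this rule and so can never be flagged as chimeric.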
Given the query sequence, the closest matching parent, and the more distant parent, we calculate using parsimony the sequence ancestral to all three. We find the number of base pair changes along the three branches leading to these sequences and denote these x, y, and z. We resolve these changes between the two parts of the alignment, the part of the chimera matching one parent (A) and the part matching the other (B), and denote these x_A and x_B, y_A and y_B, and z_A and z_B. For a given chimera to be observed, two independent events must occur. First, the changes to the distant parent must occur on that part of the alignment matching the closer parent. Assuming all base changes are equally likely, the distribution of changes across the two parts will be binomial, with probability proportional to the size of each part. Therefore, we can calculate the probability of the changes being as biased as, or more biased than, those observed. Second, the same arguments apply for the changes to the closer parent: they should all lie in the part matching the more distant parent, and we can calculate that probability too. We then multiply the two probabilities together and take the negative log to obtain an index that increases the less likely a chimeric pattern is to have evolved. This index is defined for two parent chimeras, our so called 'bimeras'. It could be extended to higher order chimeras but we did not do this, finding that it sufficed for identifying these anyway.
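The index can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the Perseus code: the branch change counts are taken as given (the parsimony reconstruction is not shown), the natural logarithm is used because the base is not stated here, and the function and argument names are hypothetical.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chimera_index(distant_total, distant_in_closer_part,
                  closer_total, closer_in_distant_part,
                  len_closer_part, len_distant_part):
    """Negative log of the probability that the observed bias arose by chance.

    distant_total / distant_in_closer_part: changes on the branch to the more
        distant parent, and how many fall in the part matching the closer parent.
    closer_total / closer_in_distant_part: changes on the branch to the closer
        parent, and how many fall in the part matching the more distant parent.
    len_*_part: lengths of the two parts of the alignment.
    """
    length = len_closer_part + len_distant_part
    p_distant = binomial_upper_tail(distant_total, distant_in_closer_part,
                                    len_closer_part / length)
    p_closer = binomial_upper_tail(closer_total, closer_in_distant_part,
                                   len_distant_part / length)
    return -math.log(p_distant * p_closer)
```

For example, with two parts of equal length, five changes to the distant parent all in the part matching the closer parent, and three changes to the closer parent all in the other part, the two probabilities are 1/32 and 1/8, giving an index of 8 ln 2, or about 5.5.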
We also added to our training data set the result of taking the reference sequences and calculating their indices without regard to sequence frequency, i.e., comparing all sequences to all others. We then used the test data sets with uneven abundances, the V5 'Artificial Community' and the three V2 'Uneven' data sets, for validation, running them through our algorithm and then using the logistic regressions to generate probabilities of each sequence being chimeric. Those sequences that do not have a good chimeric match have this probability set to zero. We then defined all sequences with a probability of greater than 50% of being chimeric as chimeras. This will minimise total misclassifications. We also trained the classifier with the Titanium V4-V5 data and associated reference sequences, but in this case we lacked a separate data set for validation.
Results and Discussion
In addition to running the AmpliconNoise pipeline we also denoised the data sets with the DeNoiser algorithm and with single-linkage preclustering (SLP), at the recommended 2% sequence difference as well as at 1% for comparative purposes. We truncated the reads at 220 bp and 400 bp for GS FLX and Titanium respectively before calculating exact pairwise sequence distances for the SLP algorithm. For SLP we used the same filtered reads as for AmpliconNoise, but this was not possible for the DeNoiser since there filtering is through the QIIME pipeline. The QIIME filtering is slightly less stringent than the procedure described above for GS FLX data, but more so for Titanium, where a quality window is recommended. The read numbers following QIIME filtering are also given in Table 1.
Per base error rates following noise removal
Percentage per base error rates in pyrosequencing reads before and after application of denoising algorithms
Noise removal by all the algorithms can be considered a form of mapping. We map a noisy read onto another that we believe really generated that read. To calculate per base error rates after noise removal we must account for the possibility that the mapping may be to the wrong read. To allow for this we estimated the denoised error rates by calculating, for each read, the number of differences between the denoised read it maps to and the closest matching reference of the original undenoised read. The total of the differences across the data set was then normalised by total alignment length to estimate the per base error rate. The results for the four algorithms are given in Table 3. For the DeNoiser results we only used those reads that were included in the AmpliconNoise and SLP data sets, ensuring a fair comparison despite the slight differences in filtering. What is immediately apparent is that SLP at both cut-offs does not actually remove errors; instead it inflates them. This is due to the high rate of misassignment, where a read is mapped not onto the reference that generated it but onto a similar but incorrect sequence. The DeNoiser algorithm does better, reducing per base error rates in most cases, but it is substantially out-performed by AmpliconNoise, which is capable of reducing noise by one-third to a half in all data sets. Given that some residual error will always remain, because the sequencing of the references may not be entirely accurate and because of PCR chimeras and contaminants, this is impressive.
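The error rate estimate described above can be sketched as follows, assuming for simplicity that reads and references are pre-aligned to a common length so that differences can be counted position by position; the study itself uses pairwise alignments. The function names and the dictionary representation of the read-to-denoised-read mapping are hypothetical.

```python
def count_differences(a, b):
    """Number of differing positions (assumes pre-aligned, equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def per_base_error_rate(reads, denoised_of, references):
    """reads: original reads; denoised_of: original read -> denoised read it maps to."""
    total_differences = 0
    total_length = 0
    for read in reads:
        # The reference is chosen using the ORIGINAL read, so a read mapped
        # onto the wrong denoised sequence is penalised.
        reference = min(references, key=lambda r: count_differences(read, r))
        total_differences += count_differences(denoised_of[read], reference)
        total_length += len(reference)
    return total_differences / total_length
```

Passing an identity mapping gives the error rate before denoising, while a mapping that merges a true sequence into a similar but different one gives a higher rate than the raw data, which is the inflation effect reported for SLP.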
Relative Importance of Pyrosequencing and PCR noise
The classification of denoised sequences from the nine test data sets
Accuracy of OTU Construction Following Noise Removal
Run times in seconds for AmpliconNoise and the original PyroNoise for the V5 'Artificial Community' data set. Rows: initial pairwise distances (NDist), initial clustering (FCluster), flowgram clustering (PyroNoise), sequence clustering (SeqNoise).
In Figure 3 we also show the effect of applying single-linkage preclustering at 1% and the recommended 2% level, and the DeNoiser algorithm, prior to OTU construction. Single-linkage preclustering at 1% is very poor, greatly over-estimating OTU numbers. 2% SLP and the DeNoiser perform similarly above a 3% cut-off, predicting slightly more OTUs than the original PyroNoise program; both were worse than AmpliconNoise. Below this cut-off, 2% SLP predicts a constant OTU number, suggesting it is aggregating OTUs that should be separated. However, both SLP and DeNoiser were orders of magnitude faster than AmpliconNoise. Single-linkage preclustering does not require a cluster and for this data set it ran in just a few minutes on a standard computer. The DeNoiser took less than an hour on a single core.
Accuracy of 3% OTU construction following application of the noise removal algorithms
Chimera Classification Accuracy
Chimera classification accuracies for Perseus applied to the three denoised V2 'Uneven' data sets.
Chimera classification accuracies for ChimeraSlayer applied to the three denoised V2 'Uneven' data sets.
In practice, generating dedicated training data sets for each study may not be possible, although we would recommend it. If the data are GS FLX V5 or V4-V5 Titanium then the two pairs of α and β values given above should work well for their corresponding data types. For V2 data, we would not recommend the values in the caption of Figure 8 because there is an implicit assumption of very high prior chimera probability generated from training on this atypical data set, reflected in the high value of α = -2.83542. The β values on all data sets are in the region of β = 0.5, and we have found that this value, paired with α = -7.5, performs well across a wide range of data sets.
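How the general-purpose values α = -7.5 and β = 0.5 translate into a decision can be shown with a short sketch. It assumes the standard logistic form P = 1 / (1 + exp(-(α + βI))) for the probability that a sequence with chimera index I is chimeric; this parameterisation is an assumption here, and the function names are hypothetical.

```python
import math


def chimera_probability(index, alpha=-7.5, beta=0.5, has_chimeric_match=True):
    """Probability of being chimeric under an assumed logistic model of the index."""
    if not has_chimeric_match:
        # Sequences without a good chimeric match have the probability set to zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(-(alpha + beta * index)))


def is_chimera(index, alpha=-7.5, beta=0.5, has_chimeric_match=True):
    """Classify as a chimera when the probability exceeds 50%."""
    return chimera_probability(index, alpha, beta, has_chimeric_match) > 0.5
```

Under this assumed form the 50% threshold corresponds to an index of -α/β = 15, so a less negative α such as -2.83542 lowers the index at which sequences are called chimeric, consistent with the high prior chimera probability noted above.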
We have demonstrated that AmpliconNoise followed by Perseus has the sensitivity to remove the majority of errors from GS FLX and Titanium pyrosequenced amplicons and allow accurate estimates of OTU number. AmpliconNoise outperforms both agglomerative clusterers, SLP and the DeNoiser, in terms of both per base error rates and OTU construction, but at a cost of increased computational complexity. No doubt in some cases, where some noise can be tolerated, these simpler heuristic approaches may be the best option. However, the results here suggest that both agglomerative approaches must be treated with caution. By looking not simply at OTU numbers, as in previous evaluations [9, 10], but also at their identity, we have established that these approaches are prone to over-clustering, removing a substantial fraction of the true diversity even at the 3% level.
Consequently, we believe that the AmpliconNoise-Perseus pipeline, which is freely available as open source software with all data on a dedicated website, will find a wide range of applications in microbial diversity estimation and population genetics. It could be critical to the success of large-scale publicly funded efforts to explore microbial diversity, such as the sequencing of human associated microbes being conducted in the Human Microbiome Project. To facilitate its use we include programs for integrating its output into the QIIME 16S rRNA analysis pipeline.
We have shown the importance of considering the effects of both PCR and sequencing errors in studies of diversity based on 16S rRNA amplicons. This suggests the use of high fidelity polymerases to reduce per base PCR error rates. However, one of our most striking observations was just how variable chimera frequencies were in the test data sets. This must be due to PCR conditions, principally cycle number, extension time, primer and template concentrations, and polymerase type. Therefore optimising the whole PCR process to minimise all types of errors is probably a better strategy than just focussing on the enzyme.
In addition, the principles outlined here (the rigorous validation of noise removal algorithms with test data; the use of EM algorithms to generate effective consensus sequences and remove noise from pyrosequenced amplicons; the use of sequence abundances in the classification of PCR chimeras; and the treatment of the latter as a supervised learning problem) will provide the basis for further algorithm development in this field and contribute to the maturation of next generation sequencing as a quantitative technique for the analysis of PCR amplicon diversity. These principles will be equally applicable to other next generation sequencing platforms, for example the Illumina HiSeq 2000, which is capable of generating orders of magnitude more reads per run [27, 28]. They should even hold for the third generation of sequencing technologies that will target individual molecules. As ever larger amounts of sequence data are generated, the question of how to distinguish true diversity from noise will become ever more important.
The Titanium full length clone sequences have been submitted to GenBank with accession numbers HQ462473-560. The Titanium pyrosequencing reads have been submitted to the Short Read Archive with accession SRP003773.
CQ is supported by an Engineering and Physical Sciences Research Council Career Acceleration Fellowship (EP/H003851/1). PT is supported by NIH P50 GM068763. We would like to thank Jens Reeder and Rob Knight for providing the DeNoiser processed data sets and Sue Huse for a copy of the single-linkage preclustering algorithm. We also thank two anonymous reviewers and Dr Keith Harris for constructive comments on the manuscript.
- Margulies M, Egholm M, Altman W, Attiya S, Bader J, Bemben L, Berka J, Braverman M, Chen Y, Chen Z, Dewell S, Du L, Fierro J, Gomes X, Godwin B, He W, Helgesen S, Ho C, Irzyk G, Jando S, Alenquer M, Jarvie T, Jirage K, Kim J, Knight J, Lanza J, Leamon J, Lefkowitz S, Lei M, Li J, Lohman K, Lu H, Makhijani V, McDade K, McKenna M, Myers E, Nickerson E, Nobile J, Plant R, Puc B, Ronan M, Roth G, Sarkis G, Simons J, Simpson J, Srinivasan M, Tartaro K, Tomasz A, Vogt K, Volkmer G, Wang S, Wang Y, Weiner M, Yu P, Begley R, Rothberg J: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437: 376–380.
- Wang GP, Sherrill-Mix SA, Chang KM, Quince C, Bushman FD: Hepatitis C Virus Transmission Bottlenecks Analyzed by Deep Sequencing. J Virol 2010, 84(12):6218–6228. 10.1128/JVI.02271-09
- Huber JA, Mark Welch D, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML: Microbial population structures in the deep marine biosphere. Science 2007, 318: 97–100. 10.1126/science.1146689
- Huse SM, Huber JA, Morrison HG, Sogin ML, Mark Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 2007, 8(7). 10.1186/gb-2007-8-7-r143
- Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT: Accurate determination of microbial diversity from 454 pyrosequencing data. Nat Methods 2009, 6: 639–641. 10.1038/nmeth.1361
- Kunin V, Engelbrektson A, Ochman H, Hugenholtz P: Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 2010, 12: 118–123. 10.1111/j.1462-2920.2009.02051.x
- Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci USA 2006, 103: 12115–12120. 10.1073/pnas.0605127103
- Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, Niazi F, Affourtit J, Egholm M, Henrissat B, Knight R, Gordon JI: Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proc Natl Acad Sci USA 2010, 107(16):7503–7508. 10.1073/pnas.1002355107
- Reeder J, Knight R: Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat Methods 2010, 7(9):668–669. 10.1038/nmeth0910-668b
- Huse SM, Welch DM, Morrison HG, Sogin ML: Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 2010, 12(7):1889–1898. 10.1111/j.1462-2920.2010.02193.x
- Kunin V, Hugenholtz P: PyroTagger: A fast, accurate pipeline for analysis of rRNA amplicon pyrosequence data. The Open Journal 2010, 1: 1.
- Huber T, Faulkner G, Hugenholtz P: Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics 2004, 20: 2317–2319. 10.1093/bioinformatics/bth226
- Ashelford K, Chuzhanova N, Fry J, Jones A, Weightman A: At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microb 2005, 71: 7724–7736. 10.1128/AEM.71.12.7724-7736.2005
- Haas B, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methe B, DeSantis TZ, The Human Microbiome Consortium, Petrosino JF, Knight R, Birren BW: Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res 2011, in press.
- Balzer S, Malde K, Lanzen A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data-enabling realistic simulation with flowsim. Bioinformatics 2010, 26(18):i420-i425. 10.1093/bioinformatics/btq365
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comp J 1998, 41: 578–588. 10.1093/comjnl/41.8.578
- Eckert KA, Kunkel TA: DNA polymerase fidelity and the polymerase chain reaction. PCR Methods Appl 1991, 1: 17–24.
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33: 511–518. 10.1093/nar/gki198
- Bishop CM: Pattern Recognition and Machine Learning. New York: Springer; 2006.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R: QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010, 7: 335–336. 10.1038/nmeth.f.303
- Sun Y, Cai Y, Liu L, Yu F, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res 2009, 37(10). 10.1093/nar/gkp285
- White JR, Navlakha S, Nagarajan N, Ghodsi MR, Kingsford C, Pop M: Alignment and clustering of phylogenetic markers - implications for microbial diversity studies. BMC Bioinf 2010, 11. 10.1186/1471-2105-11-152
- AmpliconNoise Google Code Project [http://code.google.com/p/ampliconnoise/]
- AmpliconNoise Data [http://people.civil.gla.ac.uk/~quince/Data/AmpliconNoise.html]
- Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, Baker CC, Di Francesco V, Howcroft TK, Karp RW, Lunsford RD, Wellington CR, Belachew T, Wright M, Giblin C, David H, Mills M, Salomon R, Mullins C, Akolkar B, Begg L, Davis C, Grandison L, Humble M, Khalsa J, Little AR, Peavy H, Pontzer C, Portnoy M, Sayre MH, Starke-Reed P, Zakhari S, Read J, Watson B, Guyer M, NIH HMP Working Grp: The NIH Human Microbiome Project. Genome Res 2009, 19: 2317–2323. 10.1101/gr.096651.109
- Lahr DJG, Katz LA: Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase. Biotechniques 2009, 47(4):857–863.
- Lazarevic V, Whiteson K, Huse S, Hernandez D, Farinelli L, Osteras M, Schrenzel J, Francois P: Metagenomic study of the oral microbiota by Illumina high-throughput sequencing. J Microbiol Meth 2009, 79: 266–271. 10.1016/j.mimet.2009.09.012
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA 2011, in press.
- Rusk N: Cheap third-generation sequencing. Nat Methods 2009, 6: 244–245. 10.1038/nmeth0409-244a
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. [ISBN 3-900051-07-0] [http://www.R-project.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.