Denoising PCR-amplified metagenome data
- Michael J Rosen^{1}Email author,
- Benjamin J Callahan^{1},
- Daniel S Fisher^{1, 2} and
- Susan P Holmes^{3}
DOI: 10.1186/1471-2105-13-283
© Rosen et al.; licensee BioMed Central Ltd. 2012
Received: 3 April 2012
Accepted: 19 October 2012
Published: 31 October 2012
Abstract
Background
PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy.
Results
We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche’s 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise.
Conclusions
DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.
Background
The potential of high-throughput sequencing as a tool for exploring biological diversity is great, but so too are the challenges that arise in its analysis. These technologies have made possible the characterization of very rare genotypes in heterogeneous populations of DNA at low cost. But when applied to a metagenomic sample, the resulting raw data consist of an unknown mixture of genotypes that are convolved with errors introduced during amplification and sequencing.
There are two broad approaches to high-throughput sequencing of metagenomes: in amplicon sequencing (also called gene-centric or gene-targeted metagenomics) a pool of DNA for sequencing is produced by using PCR to amplify all the variant sequences in a sample that begin and end with a chosen pair of primers [1–3], frequently targeting hypervariable regions of the 16S ribosomal RNA gene [4]; in de novo genome assembly total DNA is sequenced without amplification and reads are clustered into “species bins”, each providing the material for genome assembly by shotgun methods (see Table 1 in [5] for a list of such studies).
By trading off a broad survey of gene content for greater sequencing depth at the sampled loci, amplicon sequencing has the potential to detect the rarest members of the sampled community, but errors interfere more profoundly. Unlike genome assembly projects, where one needs only to determine the consensus base at each locus or decide whether a SNP is present in a population, the space of possible distributions for the sample genotypes and frequencies is effectively infinite. As a result, ambiguities in genome projects can usually be resolved by increasing the amount of data, whereas increasing depth (as much as 10^{6} in recent studies [6, 7]) increases the number of both real and error-containing sequences and makes the challenge of distinguishing minority variants from errors only greater under amplicon sequencing. Greater depth therefore calls for progressively more sophisticated methods of analysis.
The analysis of amplicon sequence data typically begins with the construction of OTUs (operational taxonomic units), clusters of sequences that are within a cutoff in Hamming distance from one another. OTUs serve to collapse the complete set of sequences into a smaller collection of representative sequences – one for each OTU – and corresponding abundances based on the number of reads falling within each cluster. OTUs were developed as a tool for classifying microbial species, but have also been repurposed to the task of correcting errors; the sequences within an OTU are typically interpreted as a taxonomic grouping without specifying whether the variation within an OTU represents errors or real diversity on a finer scale than that chosen to define the OTU. If the scale of the noise is smaller than that of the clusters, then the construction of OTUs will appropriately group error-containing sequences together with their true genotype. However, as sequencing depth increases, low probability errors outside the OTU radius will start to appear, and will be incorrectly assigned to their own OTU. Early studies using this approach on high-throughput metagenome data sets reported large numbers of low-abundance, previously unobserved genotypes that were collectively dubbed the rare biosphere[8]. Later, analyses of control data sets indicated that the diversity estimates in such studies tends to be highly inflated [9] and that results may lack reproducibility [10]. The dual purpose of OTUs for correcting errors and for taxonomic grouping is appropriate when the diversity is being sampled at a coarse level, e.g. the frequency of different phlya. However, when probing finer-scale diversity, OTU methods have intrinsically high false positive and false negative rates: they both overestimate diversity when there exist errors larger than the OTU-defining cutoff and cannot resolve real diversity at a scale finer than that (arbitrary) cutoff.
In response, a variety of approaches to disentangling errors from actual genetic variation have been proposed recently [11–14]. These include multiple rounds of OTU clustering with different hierarchical methods [11], utilizing sequence abundance information implicitly by starting new clusters with common sequences [11, 12], and replacing OTU clustering with an Expectation-Maximization (EM) approach [13, 14]. Accuracy has steadily improved, but all methods still fall short of maximizing the information acquired from metagenome data sets.
We believe that the way forward is to model the error process and evaluate the validity of individual sequences in the context of the full metagenomic data set, crucially including the abundances (number of reads) corresponding to each sequence. Major progress in this direction has been made recently by Quince et al[13, 14]. In the specific context of pyrosequencing, often used for metagenomics, strings of the same nucleotide (homopolymers) are problematic, and Quince et al incorporated a model of the distribution of homopolymer light intensities into an Expectation-Maximization (EM) algorithm, Pyronoise, which infers the homopolymer lengths of sequencing reads [13] (Q09). Later, Quince et al released AmpliconNoise, an extension of PyroNoise that includes rates of single-nucleotide substitution errors obtained from training data [14] (Q11). These methods were shown to more accurately infer the underlying sample genotypes than other approaches, demonstrating the worth of explicitly modeling errors. However, the methods of Quince et al. have several shortcomings that we would like to rectify: (i) as the size of sequence data sets grows, AmpliconNoise becomes too slow to use in many applications; (ii) estimation of error rates relies on the existence of training data specific to the PCR and sequencing chemistries used; (iii) differentiation of fine-scale diversity is limited because read abundances are not fully utilized when calculating the distance between sequences and clusters; (iv) the parameters that determine how conservative the algorithm is in inferring real diversity are ad hoc and cannot be tuned without experiment-specific training data.
We build on the error-modeling approach pioneered in AmpliconNoise by developing a novel algorithm, DADA, to denoise metagenomic amplicon sequence data that addresses the concerns raised above [15]. We start with a parametric statistical model of substitution errors. We incorporate this error model into a divisive hierarchical clustering algorithm that groups error-containing reads into clusters consistent with being derived from a single sample genotype. Finally we couple this clustering algorithm with the inference of the error parameters from the clustered data, and perform each step in alternation until both converge. This method is presented below, and is shown to outperform previous methods in both speed and accuracy on several control data sets.
Results
Model and algorithm
We introduce a first-order model of the error process by assuming (1) each sequence read originates from a distinct DNA molecule in the sample, and therefore that the presence of errors on different reads are statistically independent, and (2) errors on different sites of the same read are also statistically independent events. The independence of errors across different reads relies on the independence of the PCR replication histories of those reads, a condition that holds when the total number of reads is significantly smaller than the total number of DNA molecules present in the initial environmental sample and there are no strong amplification biases for sequences with errors.
Under these conditions, the numbers of reads (abundances) of the error-containing sequences derived from a sample genotype follow the multinomial distribution, and the abundance r of each particular sequence is binomially distributed (see Methods) with a probability λ determined by the particular combination of errors in that sequence and a number of trials ρ given by the total number of reads of its sample genotype. These facts allow us to establish two statistics to evaluate the hypothesis that a collection of sequencing reads derives from a single sample genotype. The abundance p-value determines when there are too many reads of the same sequence to be consistent with the error model, and the read p-value determines when a sequencing read is too far away to be an error from an inferred sample genotype.
These statistics serve as the basis of a sequence-clustering algorithm in which (1) reads are assigned to clusters, (2) a putative sample genotype is inferred for each cluster, (3) reads are reassigned to the cluster for which they are most likely to have resulted as errors from the inferred sample genotype, (4) the two p-value statistics are computed given the inferred sample genotypes and the clustering of the sequences (5) additional clusters are created if the clustering is statistically inconsistent with the error model (as suggested by small p-values).
The p-values
We introduce two statistics for deciding that particular sequences did not arise from errors. The read p-value is the probability of having observed at least one read of a sequence that is as improbable as the most improbable sequence amongst the observed reads. This statistic treats each read as a separate event (giving rise to its name) and therefore does not utilize sequence abundance. It results in a hard cutoff, λ^{∗}, below which reads are decided not to be errors by DADA. This cutoff is set by the choice of a significance threshold Ω_{ r }, the probability of having observed at least one read more unlikely than λ^{∗}. The abundance p-value, which is computed for each sequence individually, is the probability of having observed at least as many identical reads as we did of each sequence (conditioned on having observed at least one). The conservativeness of this measure is set by a significance threshold Ω_{ a }, the probability that at least one sequence should have been as overabundant as the most overabundant sequence. The abundance p-value gives DADA significantly greater sensitivity than previous methods.
For both the real and simulated data, the abundance p-value does a good job of tracking the form of the abundances of the errors, and the read p-value sits to the right of all observed data. For the real data, a small number of errors sit on or above the abundance discrimination line. Such errors were individually not expected to be observed at all, but ended up with a small number of reads larger than one. This pattern was observed across many clusters, and we believe that it reveals the presence of small violations of our assumption of the independence between reads. In particular, in a regime where the ratio of the number of error-free reads to the number of DNA molecules in the sample that act as the basis for amplification is of order one or larger, then errors during early stages of PCR may be sampled multiple times in the sequence data. As a result, the distribution for the number of reads of these errors may fall off much more slowly than what our model suggests. To deal with this effect in this paper, we lowered the Ω_{ a }threshold using an ad hoc method (discussed below) to prevent excess false positives. Doing so did not affect DADA’s ability to detect the genuine diversity in the data analyzed in this paper, which was typical of the data analyzed in many microbial metagenomics studies, but the sensitivity that is lost by using very small values of Ω_{ a } could be limiting for samples with even finer-scale diversity. Further analytics that model PCR as a branching process improve this current ad hoc threshold (unpublished work).
Treatment of insertions and deletions
DADA does not attempt to explicitly model the indel error processes, and indels do not contribute to the determination of whether sequences are related to each other via errors. Instead, sequences are aligned to each putative sample genotype, and are assigned to clusters on the basis of substitutions. During the computation of p-values, we sum together the reads of sequences within each cluster that have the same set of substitutions (forming structures that we call indel families). The number of reads of each of these indel families, rather than those of the raw sequences, are the basis of our p-values (see Methods).
Treating indels in this way does not affect the accuracy of DADA for the test data sets analyzed here, as the sample genotypes all differed from each other by at least one substitution, and these provided enough information for DADA to distinguish between them. However, DADA cannot distinguish between sequences that differ only by indels. In such cases, if the amplicons being denoised are coding regions, frame information should be used for making decisions about whether particular indels are real or errors, but in order to denoise non-coding regions with pure indel diversity, DADA is not sufficient in its current form.
Preclustering
Prior to our probabilistic sequence clustering we divided the raw data into coarse 3% single-linkage clusters (with indels not contributing to distance), subsets for which each sequence is ≤3% from at least one other sequence in its cluster and >3% from all sequences in other clusters. Due to its speed, we employed the ESPRIT algorithm for this task [16]. Single-linkage’s propensity for chaining was advantageous in this circumstance, as all error-containing sequences are very likely to be in the same cluster as their originating sample genotypes; for a sample genotype and one of its errors to end up in different clusters, the error would have to be ≥3% from the nearest error clustered with that genotype, corresponding to a large gap in the error cloud, which is unlikely under our error model.
Clustering
Each precluster is partitioned into sets of sequences that are conjectured to contain all errors arising from different sample genotypes. This partition is initialized to a single cluster containing all sequences. Two procedures then alternate. First, the indel family most unlikely to have resulted from errors is split off into a new cluster. Sequences then move between clusters based on the probability that they were generated as errors by each one, and the consensus sequence for each cluster is updated until there are no remaining reassignments that can improve the probability of the data. This second step is analogous to the assignment and update steps of standard k-means clustering. This alternation stops when the partition of the sequences fits with the current error model at the significance levels provided by the user.
Accuracy
We evaluated the accuracy of DADA by denoising three of the data sets in Q11 used to demonstrate AmpliconNoise’s accuracy relative to the earlier SLP and DeNoiser algorithms. These data are derived from mixtures of known clones that were amplified together and sequenced on the 454 platform, and consisted of different hypervariable regions of the 16S RNA subunit of bacterial ribosomes (16S rRNA), which are commonly used as a proxy for phylogenetic diversity in metagenomic studies [4]. Two of the data sets, Divergent and Artificial, with 35,190 and 31,867 reads, were sequenced with the GS-FLX chemistry and were truncated at 220 nucleotides. They were constructed by amplifying the V5 region of the 16S rRNA gene from 23 and 90 clones, respectively, isolated from lake water. The Divergent clones were mixed in equal proportions and are separated from each other by a minimum nucleotide divergence of 7%, while the Artificial clones were mixed in abundances that span several orders of magnitude, with some of the clones differing by a single SNP. The other data set, Titanium, with 25,438 reads, was sequenced with the newer Titanium chemistry and was truncated at 400 nucleotides. It contains V4-5 16S rRNA genes from 89 clones isolated from Arctic soil with varying abundance and genetic distances, similar to the Artificial set.
All data sets had undergone filtering of reads deemed to be of low quality prior to application of AmpliconNoise in Q11, so for purposes of comparison, we denoised the same set of filtered reads. The presence of a small number of low-quality reads in 454 data has been previously demonstrated [17], and as we do not expect these to be well described by our error model, we encourage the use of such quality filtering before applying DADA to non-test data. As SLP and DeNoiser were already demonstrated to be less accurate than AmpliconNoise on these data, we include here DADA’s performance only relative to that of AmpliconNoise. There were six other data sets presented in Q11 of V2 regions from a gut microbial community, but these had such an overwhelming number of chimeric sequences (reported to be as high as 85% in Q11), which neither DADA nor AmpliconNoise attempts to address, that we opted not to include these data sets in our analysis.
Tuning algorithmic sensitivity
DADA employs two tunable parameters that determine how conservative or liberal the algorithm is to be in deciding whether particular sequences could have resulted from errors: Ω_{ a }, and Ω_{ r }, the significance levels for its abundance and read p-values. Decisions about singletons, the sequences represented by a single read, depend on Ω_{ r }, whereas decisions about sequences with several reads depend on Ω_{ a }. The two values may be tuned independently to match the priority being placed on capturing the rarest and more common diversity.
We did not observe any significant departures in these data from our model that would affect the read p-values, and it was therefore possible to maintain the interpretation of Ω_{ r }as a significance threshold. As a result, for these data, which contain ≤50 preclusters that were clustered separately by DADA, we set Ω_{ r }=10^{−3} so that the probability of having a false positive would be ≤5% for each data set.
False negatives and false positives
The purpose of DADA (and AmpliconNoise) is the inference of the genotypes present in the underlying sample from a set of noisy (error-containing) sequencing reads. There are two types of errors that such an algorithm can make: false positives in which a sample genotype is inferred that was not present in the sample, and false negatives in which the algorithm fails to infer a sample genotype that was present in the sequencing reads. The tradeoff between false positives and false negatives in the two algorithms can be controlled by the algorithmic parameters, depending on which type of error presents more of a problem to the user.
False positives and false negatives
DADA | AmpliconNoise | |||
---|---|---|---|---|
Sample | False Pos | False Neg | False Pos | False Neg |
Divergent | 0 | 0 | 2 | 0 |
Artificial | 1 | 2 | 8 | 7 |
Titanium (s10) | 6 | 0 | 8 | 9 |
Titanium (s25) | 23 | 4 |
DADA is more accurate in its inference of the sample genotypes than is AmpliconNoise on every data set. The difference is especially strong among false negatives, where DADA successfully identifies virtually all sample genotypes; DADA’s only two false negatives, both in the Artificial set, result from pathological alignment issues between sequences that differ only in the last two bases.
Speed
CPU times for clustering artificial community
Function | CPU time (seconds) |
---|---|
ESPRIT | |
kmerdist | 74.80 |
needledist | 924.68 |
Total | 1002.68 |
DADA | |
N-W alignments | 97.81 |
read p-values | 58.84 |
Total | 296.64 |
ESPRIT+DADA | 1.30×10^{3} |
AmpliconNoise | 7.57×10^{4} |
DADA is currently written in MATLAB, but sequence alignments and the construction of indel families were bottlenecks that we reimplemented as MEX (Matlab executable) C programs. The majority of the time to run our denoising pipeline on the Artificial data set is spent on ESPRIT’s performance of pairwise alignments during the single-linkage pre-clustering step (needledist). A newer version of ESPRIT promises to be released soon that may dramatically lower this time [18]. If additional speedups are needed as data sets grow, it should be possible to replace the global alignments of ESPRIT by banded alignments that would be guaranteed to produce the same clusters if the width of the band is equal to the cluster radius, and would have have roughly linear (in sequence length) rather than quadratic, running time. Nonetheless, for these data DADA already gives a 60-fold speedup over AmpliconNoise, making jobs that required a 64-core cluster to run AmpliconNoise appropriate for a laptop running DADA.
As read lengths continue to grow, we expect the time complexity of DADA to be affected in two primary ways. First, because the complexity of the Needleman-Wunsch alignment algorithm used by both DADA and ESPRIT scales with the product of the lengths of the input sequences [19] there will be a quadratic slowdown with increasing read length unless heuristics are employed. On data sets comparable to those analyzed here, alignments consume the majority of algorithmic time and this scaling will dominate in the near future. Second, our current implementation for computing read p-values has both time and space complexity that grows very rapidly with read length (it is asymptotically $\mathcal{O}\left({L}^{11}\right)$). This was not strongly limiting for these data, but in case it should become so as reads become longer, we have explored the use of a continuous approximation for the error probabilities that may alleviate this problem.
PCR substitution probabilities: symmetries and nearest-neighbor context-dependence
The magnitude of context-dependence for these data was moderate (most context-dependent probabilities differed by <50% from context-independent ones) as seen in the spread of points along the diagonal in Figure 6d,e,f . As a result, maintaining separate probabilities for different contexts did not affect the inferred sample genotypes. Nonetheless, that DADA was robust to significant variation in its parameters is a strong check on the stability of its sample inference.
We have worked with data for which context-dependence is large and has a strong effect on clustering. Therefore, we leave use of context-dependence as an optional feature of DADA, either as a consistency check, or when justified by the amount and nature of the data. But a caution is in order: with modest sized data sets, or if the sequences are too similar, use of the context-dependent rates could result in over-fitting and calling too many errors. However if this did occur, the expected complementarity symmetry of the inferred error probabilities would be unlikely to obtain unless the sequences were read in both directions.
Discussion
DADA explicitly incorporates read abundance when deciding whether sequences are genuine or errors; if there are many identical reads of a sequence, DADA will be more likely to infer an underlying sample genotype, even if individually those reads would be consistent with being an error from a nearby genotype. Furthermore, DADA implicitly assumes, via the error model, that reads near highly abundant sequences are far more likely to be errors. In contrast, previous methods have typically treated each read independently. AmpliconNoise partially incorporates abundance by weighting the prior probability that a read belongs to a cluster by the frequency of that cluster, but this is weaker than DADA, where dependence on cluster size shows up in a binomial coefficient (see Methods), especially for high-abundance errors. By using both sequence identity and abundance in this way, DADA is able to disentangle real diversity from errors at finer scales than previous methods, even when tuned to be very conservative.
However, full incorporation of abundance information makes DADA sensitive to early-stage PCR effects and the mis-estimation of error probabilities. The problem of early-stage effects is particularly pronounced in these data: when clustered with Ω_{ a }=Ω_{ r }=10^{−3}, the Artificial data produces 68 false positives (we would have expected no false positives if the model assumptions were not violated). The majority of these sequences have 2-5 reads and 2-4 errors. Such problems would be typical of moderately oversampled PCR, the regime in which initial sample molecules are typically sampled multiple times, allowing a single error during early stages to show up in more than one read.
In lieu of an abundance statistic that appropriately compensates for this affect, we deal with this problem by lowering the sensitivity of the algorithm by tuning down Ω_{ a }. Further, because the probability given to each sequence scales as the error probabilities to the power of the number of reads (see Methods), if certain error parameters are larger than estimated in certain contexts, then the statistical significance of an error with many reads can be substantially overestimated. This problem gets progressively worse for deeper data sets, as all one-away errors begin to take on many reads. In anticipation of this problem, we have introduced nearest-neighbor context-dependence of error rates (see Methods). These had no impact on the final clustering for the test data presented, but in other data sets with larger context-dependent effects, we found a reduction in diversity estimates when context-dependence was included (data not shown).
DADA is a divisive hierarchical clustering algorithm: all sequences are assigned to a single cluster that is subdivided until the clustering fits an error model. Previous methods, including AmpliconNoise and simple OTU-clustering, have predominantly taken the opposite, agglomerative approach, which starts with too many clusters and merges them until some condition is met. This gives DADA a practical advantage, as the computational and space requirements (especially the number of alignments to perform and store) scale with the square of the initial number of clusters [20]. The typical problem of divisive methods – that the number of possible splittings is too large – is handled in DADA by seeding new clusters with sequences that are identified as not being errors and allowing other sequences, e.g. errors associated with the new clusters, to relocate if they become more probable by doing so.
Finally, DADA uses unsupervised learning to acquire error probabilities from the data that it is given. As PCR protocols vary in their choice of polymerase and number of rounds, these parameters vary by data set, perhaps greatly. This makes the universality of DADA’s approach especially attractive, and will be important as new sequencing methods come into use such as longer read-length and paired-end Illumina that commonly make substitution as well as indel errors [21]. While it now relies on training data to establish error parameters, AmpliconNoise could be embedded in the same procedure of estimating error probabilities after each successive round of clustering, but this would multiply the computation requirements by a factor of the number of rounds of re-estimation, compounding the problem of its slower speed.
Conclusions
OTUs serve as a rough analogue for microbes of the more clearly defined taxonomic groups of higher organisms. However, the repurposing of the OTU concept to the problem of inferring sample genotypes from error-prone metagenomic sequence data has serious and inherent shortcomings. The absence of an error model causes estimates of diversity, especially species richness, to depend strongly on experimental variables such as the size of the data set, the length of the region sequenced, and the details of the PCR/sequencing chemistry. These shortcomings are not amenable to simple fixes; it is not possible to separate real diversity from errors using an OTU approach when the diversity and the errors exist at similar scales (as measured by Hamming distance), as is the case in many metagenomic studies. PyroNoise and AmpliconNoise have demonstrated the usefulness of denoising sequence data with statistical, physically-based error models. These methods are based on the classical statistical technique of expectation-maximization. We have presented an alternate approach, DADA, which is more targeted to the particular task of producing conservative estimates of diversity from noisy sequence data. It is much faster and more capable of resolving fine-scale diversity while maintaining a lower false positive rate.
We did not achieve our goal of complete freedom from ad hoc parameters in this work. Even though Ω_{ a }, our input parameter, has a simple probabilistic meaning that is data set independent, there are corrections to our PCR model, and as a result Ω_{ a } takes on an ad hoc quality in this analysis. Nonetheless, Ω_{ a }can be coarsely tuned from the data itself in the way shown. Alternatively, for conservative diversity estimates, Ω_{ a } may be set to very small values (such as 10^{−100}), and the resolution of the algorithm may be directly quantified. DADA not only guesses what is there, but knows what would have been missed if it were present, making Ω_{ a } ad hoc but not arbitrary.
Much work remains to be done, and it is not yet clear how the algorithms will fare with extremely rich fine-scale diversity as occurs for the antibody repertoire of B-cells and T-cells of the human immune system [22, 23]. DADA must be equipped with statistics that correctly describe the abundance distribution of sequencing errors when a realistic model of PCR is used in which some reads are the result of shared lineages. More sophisticated methods for chimera detection that explicitly parameterize the chimera formation process analogously to the substitution and indel processes are also needed. Finally, these methods must be fully adapted and tested on sequencing platforms other than Roche’s 454.
Methods
General notation
From a sequencing data set $\mathcal{S}=\{{s}_{x},\phantom{\rule{1em}{0ex}}{r}_{x}\}$, where r_{ x } is the number of individual reads of each distinct sequence s_{ x }, we would like to construct an estimate $\mathcal{G}=\left\{{G}_{\alpha}\right\}$ of the set of genotypes in the sample that gave rise to $\mathcal{S}$. With this aim, we construct a partition $\mathcal{B}$ of the sequences {s_{ x }} where each ${B}_{\alpha}\in \mathcal{B}$ is a collection of sequences hypothesized to have originated from a common G_{ α }, and notate the number of reads assigned to G_{ α }by ${\rho}_{\alpha}=\sum _{x|{s}_{x}\in {B}_{\alpha}}{r}_{x}$. Because each s_{ x } can reside in only one B_{ α } and it is assumed that G_{ α } is the source of all s_{ x } in B_{ α }, this framework does not allow for multiple G_{ α } to contribute reads to the same s_{ x }. Allowing the latter is likely to affect $\mathcal{G}$ only in special cases and adds complications.
Treatment of insertions and deletions: the construction of indel families
In addition to substitution errors, reads acquire insertions and deletions (indels) during amplification and sequencing. Both substitutions and indels could be used to parameterize an error model, but here we focus on substitutions and do not attempt to characterize the statistics of indels. Instead, we collapse together all the reads of sequences within each B_{ α }that differ from each other only by the location of indels in their alignments to G_{ α }, forming subsets of each B_{ α } that we call indel families. We call the indel families $\mathcal{F}(\mathcal{S},\mathcal{B},\mathcal{G})=\{{\mathbf{s}}_{y},{\mathbf{r}}_{y}\}$, where each s_{ y } refers either to a subset of some B_{ α } or the sequence identical with G_{ α } except for the substitution errors of its constituents, and r_{ y } is the number of reads in the family. The r_{ y } of each indel family will be used to test whether $\mathcal{B}$ agrees with an error model, i.e. whether the substitution errors observed on the families in each B_{ α } was not too improbable under an error model of substitution errors.
Alignments between sequences and each G_{ α } in this paper took place with a scoring matrix of 5 + logT(to make them comparable with NCBI BLAST’s NUC.4.4 matrix [24]), where T, introduced below, is a matrix of substitution error probabilities. We used a gap penalty of −4 and a hompolymer gap penalty of −1. The gap penalty had to be less than half the smallest mismatch score or alignments would favor a pair of indels to that mismatch. The worst mismatch score tended to be about −6, and so −4 was chosen as a gap penalty to allow as many gaps as possible without making any mismatches prohibited within alignments.
The independence between substitution errors on different reads implies a binomial distribution for the number of reads of each family
If the occurrence of substitution errors on different reads are independent events, then each read of genotype G_{ α } has an i.i.d. one-trial multinomial distribution with parameters Λ={λ_{ yα }}, which we call the genotype error probabilities, to belong to each indel family s_{ y }. Λ also parameterizes the probability distribution for R_{ y }, the number of reads of family y: if s_{ y }⊆B_{ α }, then because R_{ y } is the sum of ρ_{ α } Bernoulli random variables each with success probability λ_{ yα }, it follows the binomial distribution, R_{ y }∼Bin(λ_{ yα },ρ_{ α }). The assumption of independence between reads does not hold if early round PCR errors may be sampled multiple times in the final sequence data. Then, if we condition on having observed a particular error on some other read, the probability to observe it additional times is increased.
Λ may be constructed from simple nucleotide transition matrices
where α_{ n } and y_{ n } denote the n^{ th } nucleotides of G_{ α } and s_{ y } (also let λ_{ xα }=λ_{ yα } for all x|s_{ x }∈s_{ y }, used in Algorithm Algorithm 1).
If the nucleotide error probabilities at each site can depend upon the nearest-neighbor flanking nucleotides, we can keep track of a transition matrix T^{(L,R)} for each possible (L,R) pair of flanking nucleotides such that ${\mathcal{T}}_{\mathrm{ij}}^{(L,R)}=P\left(\mathrm{LjR}\right|\mathrm{LiR})$, with P(ATC|AGC) the error probability for taking AGC to ATC. This generalization, which we call context-dependence, increases the number of free parameters from 12 to 192: there are 16 possible pairs of flanking nucleotides each with a 12 free parameter stochastic matrix. DADA may be run with or without context-dependence, but due to the risk of overfitting and because we saw no substantial effect on the outcome of clustering when it was used, context-dependence was not used to produce the results presented in this paper.
Assessing fit with an error model via tail probabilities
In order to assess whether $\mathcal{G}$ and $\mathcal{B}$ fit the error model Λ, we introduce two statistics, p_{ y }and q_{ α }: p_{ y } is the probability of having seen at least r_{ y }reads of s_{ y } given that we saw at least one and q_{ α }is the probability of having seen at least one read with a genotype error probability at least as small as the smallest genotype error probability of an observed indel family in B_{ α }, ${\lambda}_{\alpha}^{\ast}=\underset{y|{\mathbf{s}}_{y}\subseteq {B}_{\alpha}}{min}{\lambda}_{\mathrm{y\alpha}}$.
p_{ y }: the abundance p-value
We refer to this as the abundance p-value because it evaluates the probability of having observed more extreme abundances under the null hypothesis that each r_{ y }was generated by the error model. Because one abundance p-value is generated for each indel family, we use a Bonferroni correction and compare each p_{ y }with ${\Omega}_{a}/\left|\mathcal{F}\right|$, where Ω_{ a } is a joint significance threshold that is provided to DADA by the user.
If we had not conditioned on having observed at least one read of each family, then the unobserved families would not have born any significance (they would all have p_{ y }=1), but before looking at the data these families could have been significant. This would create a difficulty in choosing an appropriate multiple hypothesis correction; a naive Bonferroni correction of $\prod _{\alpha}{4}^{{L}_{\alpha}}$ with L_{ α } the length of G_{ α }, which treats all possible families as tested hypotheses, would deprive the p-value of any statistical power. Conditioning on R_{ y }>0 and evaluating only the observed sequences avoids this complication. However, any family with r_{ y }=1 obtains p_{ y }=1 regardless of the smallness of λ_{ yα }, which necessitates our second statistic, q_{ α }.
q_{ α }: the read p-value
Vectors of λ_{ γ } and m_{ γ }(α) can be computed starting with γ representing the more common errors and extended to more rare errors as needed to compute p-values for smaller ${\lambda}_{\alpha}^{\ast}$. Finally, rather than maintaining vectors of m_{ γ }(α) for each α, we keep one for each of a small number of possible base compositions and interpolate between the q_{ α }that would result from each of these in order to approximate the q_{ α } that would result from the exact base composition of G_{ α }. Because one q_{ α } is generated for each B_{ α }, the q_{ α } are then compared with ${\Omega}_{r}/\left|\mathcal{B}\right|$, where Ω_{ r } is another joint significance threshold provided to DADA by the user, in order to determine whether any ${\lambda}_{\alpha}^{\ast}$ are too small to be the result of errors.
Maximum likelihood estimate (mle) of error probabilities
where i≠j, L,R are any pair of left and right flanking nucleotides, N_{ LijR } is the total number of LjR codons in $\mathcal{S}$ that result from LiR codons in $\mathcal{G}$ according to $\mathcal{B}$, and N_{ LiR } = ∑_{ j }N_{ LijR }.
Algorithm for inferring the sample genotypes and error probabilities
The p-values p_{ y } and q_{ α } are the basis for an algorithm that alternatively updates $\mathcal{B}$ and the nucleotide error probabilities T, which may be specified by the user as either context-independent or dependent, denoting their values after t iterations by ${\mathcal{B}}^{\mathcal{T}}$ and T^{ t }, in order to improve the likelihood of the data. This is similar to the hard-EM algorithm except that the partition ${\mathcal{B}}^{\mathcal{T}}$ at each step is the result of a model-based divisive hierarchical clustering approach and does not maximize the probability of $\mathcal{S}$ given T^{ t }. The algorithm requires two user inputs, Ω_{ a }and Ω_{ r }, which are the joint significance thresholds for the abundance and read p-values.
Algorithm 1
DADA Sequence clustering algorithm [15]
T^{0}= ${\widehat{\mathcal{T}}}_{\mathrm{mle}}\left({\mathcal{B}}_{0}\right)$, where ${\mathcal{B}}_{0}$ is the trivial partition containing the entire $\mathcal{S}$.
t←1
repeat
repeat
if ${\mathcal{B}}^{\mathcal{T}}\ne {\mathcal{B}}_{0}$
start a new cluster within ${\mathcal{B}}^{\mathcal{T}}$ containing the most statistically inconsistent family.
repeat
update $\left\{{\u011c}_{\alpha}\right\}$
each s_{ x }joins B_{ α }where $\alpha ={\text{arg max}}_{{\alpha}^{\u2033}}\left({\rho}_{{\alpha}^{\u2033}}{\lambda}_{x{\alpha}^{\u2033}}\right)$
until${\mathcal{B}}^{\mathcal{T}}$ is unchanged
update {p_{ y }} and {q_{ α }}
until$min{p}_{y}\ge {\Omega}_{a}/\left|\mathcal{F}\right|$ and $min{q}_{\alpha}\ge {\Omega}_{r}/\left|\mathcal{B}\right|$
t←t + 1
until T has converged
- 1.
Starting with T ^{0}, the maximum likelihood nucleotide error probabilities given the trivial partition ${\mathcal{B}}_{0}$ of all sequences into a single cluster, the outermost loop iteratively updates $\mathcal{B}$ and T until T converges. We have observed cases where T does not completely settle down but fluctuates within a small basin of attraction. To deal with such cases, DADA terminates if T ever returns to a previously held value or ||T ^{ t }−T ^{t−1}|| < ∊, where the tolerance ∊ = 1e − 9 is used as a default and may be altered by the user. If convergence has not been reached in ten rounds, DADA terminates with a warning message.
- 2.
For each T ^{ t }, the next loop begins with the trivial partition, ${\mathcal{B}}^{\mathcal{T}}={\mathcal{B}}_{0}$, and adds blocks to ${\mathcal{B}}^{\mathcal{T}}$ until the {p _{ y }} and {q _{ α }} do not allow rejection of the error model at joint significance levels Ω _{ a }and Ω _{ s }. New ${B}_{\alpha}^{\mathcal{T}}$ are seeded by the sequences in families with the smallest p-values. If statistically significant families exist under both p-values, then those significant under the abundance p-value take priority for starting new clusters. This approach avoids the need to put an explicit penalty on the number of blocks of $\mathcal{B}$, instead aiming for the smallest $\mathcal{B}$ under which the current error model cannot be rejected.
- 3.
After adding a new block ${B}_{\alpha}^{\mathcal{T}}$ to ${\mathcal{B}}^{\mathcal{T}}$, the innermost loop raises the probability of the data by reassigning each sequence to the block that would produce (under the error model) the largest expected number of reads of that sequence. The putative genotype of a cluster, G _{ α }, is also updated if a cluster B _{ α }has a new consensus sequence. This continues until sequences cease changing clusters.
Appendix 1: chimeras, contaminants, and missing or incorrect Sanger sequences
Additional sources of noise and false positives
DADA | AmpliconNoise | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Sample | Denoised | Clone | Chim | Contam | Other | Denoised | Clone | Chim | Contam | Other |
Divergent | 43 | 23 | 18 | 2 | 0 | 51 | 23 | 23 | 3 | 2 |
Artificial | 65 | 50 | 14 | 0 | 1 | 73 | 44 | 21 | 0 | 8 |
Titanium (s10) | 274 | 80 | 185 | 3 | 6 | 163 | 71 | 82 | 2 | 8 |
Titanium (s25) | 304 | 76 | 203 | 2 | 23 |
Titanium genotypes with more error-containing than Sanger sequence matching reads
Reads of nearby pyrosequence | Reads of sanger sequence | Errors |
---|---|---|
5 | 0 | G 8→G 7 (301), |
G 5→G 4 (354) | ||
70 | 0 | G 6→G 5 (133) |
75 | 1 | C 7→C 6 (313) |
21 | 18 | G 6→G 5 (133) |
14 | 0 | G 6→G 5 (133) |
80 | 0 | G 6→G 5 (133) |
77 | 3 | G 6→G 5 (133) |
80 | 1 | G 6→G 5 (133) |
Next we identified chimeras: sequences consisting of two sections with one section a close match to one sample genotype and the other a close match to a second sample genotype. These can be produced in substantial quantities by PCR [25]. Analogously to Q11, for each denoised sequence we computed the Hamming distance to the nearest Sanger sequence and to the nearest exact chimera by considering all possible breakpoints between all pairs of sequences of higher abundance (a chimera will have fewer reads than its parents unless it acquires substantial PCR bias). For a denoised sequence to be classified as a chimera, we required that it be at least 3 nts closer to the nearest exact chimera than the nearest sample genotype and within 5 nts of the optimal chimera (also analogous to the procedure used in Q11). We waived the 3 nt improvement criteria for denoised sequences that were identical to exact chimeras, which occurred for some particularly highly abundant chimeras between closely related sample genotypes. All data sets had a large number of chimeras amongst their denoised reads, with Titanium having more chimeras than sample genotypes (both algorithms), highlighting how essential accurate chimera identification is in tandem with the correction of PCR and sequencing errors.
Contaminants
Accession | Reads/Frequency | D _{ GB } | D _{ sample } | D _{ chim } | Source | Type | DADA | AN |
---|---|---|---|---|---|---|---|---|
Divergent | ||||||||
FR697039 | 14/4×10^{−4} | 0 | 11 | 10 | Lake Water | Bacterium | Y | Y |
EU633742 | 1/3×10^{−5} | 1 | 9 | 9 | Showerhead | Methylobacter | Y | Y |
JF515955 | 1/3×10^{−5} | 1 | 8 | 8 | Soil | Nitrosomonadaceae | N | Y |
Titanium | ||||||||
FJ004768 | 77/3×10^{−3} | 2 | 39 | 27 | Soil | Bacterium | Y | Y |
JF190756 | 1/4×10^{−5} | 1 | 40 | 27 | Human Skin | Bacterium | Y | Y |
JQ462329 | 2/8×10^{−5} | 0 | 7 | 5 | Human Mouth | Bacterium | Y | N |
Detectable genotypes
Sample | Genotypes | Present and distinct |
---|---|---|
Divergent | 23 | 23 |
Artificial | 90 | 50 |
Titanium | 91 | 80 |
Several aspects of this post-processing pipeline – especially contaminant identification– utilize knowledge of the sample genotypes and do not constitute useful methods for non-test data. Our approach to chimera identification does not utilize sample genotype information, but requires more development to be applied to non-test data: it does not search for higher-order chimeras that are combinations of three or more parental sequences, the criteria for being labelled as a chimera do not scale with error rates and read lengths, and no attempt has been made to realistically model the chimera formation process. Our goal has been only the isolation of errors due to PCR and sequencing error.
Appendix 2: Ω_{ a }robustness
False positives and false negatives for each data set with Ω_{ a } = {10^{−15},10^{−40},10^{−100}} and Ω_{ r } = 10^{−3}
Divergent | Artificial | Titanium | ||||
---|---|---|---|---|---|---|
Ω _{ a } | False Pos | False Neg | False Pos | False Neg | False Pos | False Neg |
10^{−15} | 0 | 0 | 10 | 2 | 7 | 0 |
10^{−40} | 0 | 0 | 1 | 2 | 6 | 0 |
10^{−100} | 0 | 0 | 1 | 2 | 6 | 0 |
Declarations
Acknowledgements
Thanks to Chris Quince and Sue Huse for providing and helping with clarifications about test data. Thanks also to Yijun Sun for help with the ESPRIT clustering algorithm. Finally, thanks again to Chris Quince and to Kabir Peay for useful comments on the manuscript. This work was supported in part by the NSF under grant DMS-1120699. SH is supported by grant NIH-R01GM086884.
Authors’ Affiliations
References
- Cheung MK, Au CH, Chu KH, Kwan HS, Wong CK: Composition and genetic diversity of picoeukaryotes in subtropic coastal waters as revealed by 454 pyrosequencing. ISME J 2010, 4: 1053–1059. 10.1038/ismej.2010.26View ArticlePubMed
- Iwai S, Chai B, Sul WJ, Cole JR, Hashsham SA, Tiedje JM: Gene-targeted-metagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment. ISME J 2010, 4: 279–285. 10.1038/ismej.2009.104PubMed CentralView ArticlePubMed
- Teixeria LCRS, Peixoto RS, Cury JC, Sul WJ, Pellizari VH, Tiedje J, Rosado AS: Bacterial diversity in rhizosphere soil from Antarctic vascular plants of Admiralty Bay, maritime Antarctica. ISME J 2010, 4: 989–1001. 10.1038/ismej.2010.35View Article
- Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS Genet 2008, 4: e1000255. 10.1371/journal.pgen.1000255PubMed CentralView ArticlePubMed
- Wilmes P, Simmons SL, Denef VJ, Banfield JF: The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev 2009, 33: 109–132. 10.1111/j.1574-6976.2008.00144.xPubMed CentralView ArticlePubMed
- Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML: Microbial population structures in the deep marine biosphere. Science 2007, 318: 97–100. 10.1126/science.1146689View ArticlePubMed
- Turnbaugh PJ, Quince C, Faith JJ, McHardy AC, Yatsunenko T, Niazi F, Affourtit J, Egholm M, Henrissat B, Knight R, Gordon JI: Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. PNAS 2010, 107: 7503–7507. 10.1073/pnas.1002355107PubMed CentralView ArticlePubMed
- Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored rare biosphere. PNAS 2006, 103: 12115–12120. 10.1073/pnas.0605127103PubMed CentralView ArticlePubMed
- Kunan V, Engelbrektson A, Ochman H, Hugenholtz P: Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 2010, 12: 118–123. 10.1111/j.1462-2920.2009.02051.xView Article
- Zhou J, Wu L, Deng Y, Zhi X, Jiang Y, Tu Q, Xie J, Nostrand JDV, He Z, Yang Y: Reproducibility and quantitation of amplicon sequencing-based detection. ISME J 2011, 5: 1303–1313. 10.1038/ismej.2011.11PubMed CentralView ArticlePubMed
- Huse SM, Welch DM, Morrison HG, Sogin ML: Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 2010, 12: 1889–1898. 10.1111/j.1462-2920.2010.02193.xPubMed CentralView ArticlePubMed
- Reeder J, Knight R: Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat methods 2010, 7: 668–669.PubMed CentralView ArticlePubMed
- Quince C, Lanzen A, Curtis TP, Davenport RJ, Hall N, Head IA, Read LF, Sloan WT: Accurate determination of microbial diversity from 454 pyrosequencing data. Nat methods 2009, 6: 639–641. 10.1038/nmeth.1361View ArticlePubMed
- Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ: Removing noise from pyrosequenced amplicons. BMC Bioinf 2011, 12: 38. 10.1186/1471-2105-12-38View Article
- Michael Rosen: DADA website. 2012.http://sites.google.com/site/dadadenoiser
- Sun Y, Cai Y, Yu F, Farrell MF, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res 2009, 37: e76. 10.1093/nar/gkp285PubMed CentralView ArticlePubMed
- Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 2007, 8: R143. 10.1186/gb-2007-8-7-r143PubMed CentralView ArticlePubMed
- Cai Y, Sun Y: ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res 2011, 39: e95. 10.1093/nar/gkr349PubMed CentralView ArticlePubMed
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMed
- Fraley C, Raftery AE: How many clusters? which clustering method? answers via model-based cluster analysis. Comput J 1998, 41: 578–588. 10.1093/comjnl/41.8.578View Article
- Yang X, Aluru S, Dorman KS: Repeat-aware modeling and correction of short read errors. BMC Bioinf 2011, 12: S52. 10.1186/1471-2105-12-S1-S52View Article
- Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD, Nadeau KC, Egholm M, Miklos DB, Zehnder JL, Fire AZ: Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing. Sci Translational Med 2009, 1: 12ra23. 10.1126/scitranslmed.3000540View Article
- Wang C, Sanders CM, Yang Q, Schroeder HW, Wang E, Babrzadeh F, Gharizadeh B, Myers RM, Hudson JR, Davis RW, Han J: High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. PNAS 2009, 107: 1518–1523.View Article
- Todd Lowe: NUC.4.4 score matrix, NCBI. 1992.ftp://ftp.ncbi.nih.gov/blast/matrices/NUC.4.4
- Lahr DJG, Katz LA: Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase. BioTechniques 2009, 47: 857–866.PubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.