Denoising PCR-amplified metagenome data

Background PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy. Results We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche’s 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise. Conclusions DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.


Background
The potential of high-throughput sequencing as a tool for exploring biological diversity is great, but so too are the challenges that arise in its analysis. These technologies have made possible the characterization of very rare genotypes in heterogeneous populations of DNA at low cost. But when applied to a metagenomic sample, the resulting raw data consist of an unknown mixture of genotypes that are convolved with errors introduced during amplification and sequencing.
There are two broad approaches to high-throughput sequencing of metagenomes: in amplicon sequencing (also called gene-centric or gene-targeted metagenomics) a pool of DNA for sequencing is produced by using PCR to amplify all the variant sequences in a sample that begin and end with a chosen pair of primers [1][2][3], frequently targeting hypervariable regions of the 16S ribosomal RNA gene [4]; in de novo genome assembly total DNA is sequenced without amplification and reads are clustered into "species bins", each providing the material for genome assembly by shotgun methods (see Table 1 in [5] for a list of such studies).
By trading off a broad survey of gene content for greater sequencing depth at the sampled loci, amplicon sequencing has the potential to detect the rarest members of the sampled community, but errors interfere more profoundly. Unlike genome assembly projects, where one needs only to determine the consensus base at each locus or decide whether a SNP is present in a population, the space of possible distributions for the sample genotypes and frequencies is effectively infinite. As a result, ambiguities in genome projects can usually be resolved by increasing the amount of data, whereas increasing depth (as much as 10 6 in recent studies [6,7]) increases the number of both real and error-containing sequences and makes the challenge of distinguishing minority variants from errors only greater under amplicon sequencing. Greater depth therefore calls for progressively more sophisticated methods of analysis.
The analysis of amplicon sequence data typically begins with the construction of OTUs (operational taxonomic units), clusters of sequences that are within a cutoff in Hamming distance from one another. OTUs serve to collapse the complete set of sequences into a smaller collection of representative sequences -one for each OTUand corresponding abundances based on the number of reads falling within each cluster. OTUs were developed as a tool for classifying microbial species, but have also http://www.biomedcentral.com/1471-2105/13/283 been repurposed to the task of correcting errors; the sequences within an OTU are typically interpreted as a taxonomic grouping without specifying whether the variation within an OTU represents errors or real diversity on a finer scale than that chosen to define the OTU. If the scale of the noise is smaller than that of the clusters, then the construction of OTUs will appropriately group errorcontaining sequences together with their true genotype. However, as sequencing depth increases, low probability errors outside the OTU radius will start to appear, and will be incorrectly assigned to their own OTU. Early studies using this approach on high-throughput metagenome data sets reported large numbers of low-abundance, previously unobserved genotypes that were collectively dubbed the rare biosphere [8]. Later, analyses of control data sets indicated that the diversity estimates in such studies tends to be highly inflated [9] and that results may lack reproducibility [10]. The dual purpose of OTUs for correcting errors and for taxonomic grouping is appropriate when the diversity is being sampled at a coarse level, e.g. the frequency of different phlya. However, when probing finer-scale diversity, OTU methods have intrinsically high false positive and false negative rates: they both overestimate diversity when there exist errors larger than the OTU-defining cutoff and cannot resolve real diversity at a scale finer than that (arbitrary) cutoff.
In response, a variety of approaches to disentangling errors from actual genetic variation have been proposed recently [11][12][13][14]. These include multiple rounds of OTU clustering with different hierarchical methods [11], utilizing sequence abundance information implicitly by starting new clusters with common sequences [11,12], and replacing OTU clustering with an Expectation-Maximization (EM) approach [13,14]. Accuracy has steadily improved, but all methods still fall short of maximizing the information acquired from metagenome data sets.
We believe that the way forward is to model the error process and evaluate the validity of individual sequences in the context of the full metagenomic data set, crucially including the abundances (number of reads) corresponding to each sequence. Major progress in this direction has been made recently by Quince et al [13,14]. In the specific context of pyrosequencing, often used for metagenomics, strings of the same nucleotide (homopolymers) are problematic, and Quince et al incorporated a model of the distribution of homopolymer light intensities into an Expectation-Maximization (EM) algorithm, Pyronoise, which infers the homopolymer lengths of sequencing reads [13] (Q09). Later, Quince et al released Amplicon-Noise, an extension of PyroNoise that includes rates of single-nucleotide substitution errors obtained from training data [14] (Q11). These methods were shown to more accurately infer the underlying sample genotypes than other approaches, demonstrating the worth of explicitly modeling errors. However, the methods of Quince et al. have several shortcomings that we would like to rectify: (i) as the size of sequence data sets grows, Amplicon-Noise becomes too slow to use in many applications; (ii) estimation of error rates relies on the existence of training data specific to the PCR and sequencing chemistries used; (iii) differentiation of fine-scale diversity is limited because read abundances are not fully utilized when calculating the distance between sequences and clusters; (iv) the parameters that determine how conservative the algorithm is in inferring real diversity are ad hoc and cannot be tuned without experiment-specific training data.
We build on the error-modeling approach pioneered in AmpliconNoise by developing a novel algorithm, DADA, to denoise metagenomic amplicon sequence data that addresses the concerns raised above [15]. We start with a parametric statistical model of substitution errors. We incorporate this error model into a divisive hierarchical clustering algorithm that groups error-containing reads into clusters consistent with being derived from a single sample genotype. Finally we couple this clustering algorithm with the inference of the error parameters from the clustered data, and perform each step in alternation until both converge. This method is presented below, and is shown to outperform previous methods in both speed and accuracy on several control data sets.

Model and algorithm
We introduce a first-order model of the error process by assuming (1) each sequence read originates from a distinct DNA molecule in the sample, and therefore that the presence of errors on different reads are statistically independent, and (2) errors on different sites of the same read are also statistically independent events. The independence of errors across different reads relies on the independence of the PCR replication histories of those reads, a condition that holds when the total number of reads is significantly smaller than the total number of DNA molecules present in the initial environmental sample and there are no strong amplification biases for sequences with errors.
Under these conditions, the numbers of reads (abundances) of the error-containing sequences derived from a sample genotype follow the multinomial distribution, and the abundance r of each particular sequence is binomially distributed (see Methods) with a probability λ determined by the particular combination of errors in that sequence and a number of trials ρ given by the total number of reads of its sample genotype. These facts allow us to establish two statistics to evaluate the hypothesis that a collection of sequencing reads derives from a single sample genotype. The abundance p-value determines when there are too many reads of the same sequence to be consistent with http://www.biomedcentral.com/1471-2105/13/283 the error model, and the read p-value determines when a sequencing read is too far away to be an error from an inferred sample genotype.
These statistics serve as the basis of a sequenceclustering algorithm in which (1) reads are assigned to clusters, (2) a putative sample genotype is inferred for each cluster, (3) reads are reassigned to the cluster for which they are most likely to have resulted as errors from the inferred sample genotype, (4) the two p-value statistics are computed given the inferred sample genotypes and the clustering of the sequences (5) additional clusters are created if the clustering is statistically inconsistent with the error model (as suggested by small p-values).
The full algorithm ( Figure 1) combines this probabilistic sequence clustering with the estimation of substitutionerror probabilities that are used to compute the p-values. The algorithm begins by assuming all reads derive from a single sample genotype and estimates initial error probabilities given this assumption. It then alternates between clustering the reads and re-estimating the error probabilities until it converges to a final set of mutually consistent clusters and error probabilities.

The p-values
We introduce two statistics for deciding that particular sequences did not arise from errors. The read p-value is the probability of having observed at least one read of a sequence that is as improbable as the most improbable sequence amongst the observed reads. This statistic treats each read as a separate event (giving rise to its name) and therefore does not utilize sequence abundance. It results in a hard cutoff, λ * , below which reads are decided not to be errors by DADA. This cutoff is set by the choice of a significance threshold r , the probability of having observed at least one read more unlikely than λ * . The abundance p-value, which is computed for each sequence individually, is the probability of having observed at least as many identical reads as we did of each sequence (conditioned on having observed at least one). The conservativeness of this measure is set by a significance threshold a , the probability that at least one sequence should have been as overabundant as the most overabundant sequence. The abundance p-value gives DADA significantly greater sensitivity than previous methods. Figure 2 shows simulated and real data from a typical cluster of sequences that originated from a common genotype (from the Artificial data set, introduced below). The abundance r and probability λ of each sequence is plotted, as the ability of the read and abundance p-values to discriminate between errors and non-errors is easily visualized in this parameter space. The regions where DADA will declare a sequence to be an error or a real sequence are delineated by a dashed line for each of our p-value statistics. The λ values have been log-transformed and scaled by the most common error probability, making the x-axis interpretable as an effective Hamming distance. Due to this scaling, it is also useful to interpret this plot in terms of real Hamming distances, in which case the a line represents a lower bound on DADA's resolution for any error at that distance.
For both the real and simulated data, the abundance p-value does a good job of tracking the form of the abundances of the errors, and the read p-value sits to the right of all observed data. For the real data, a small number  On the x-axis, we plot log λ scaled by the most common error probability, T A→G , so that values can be interpreted as an effective Hamming distance. The dashed lines delineate the region -the lower left quadrant -where, for significance thresholds a and r provided by the user, DADA accepts that a sequence could have arisen via the error model. The vertical dashed lines shows the λ below which (or the effective distance above which) the read p-value rejects sequences as being errors, and the curved dashed line shows the abundances above which the abundance p-value rejects sequences as being errors for each value of λ. There are several sequences in the real data (red diamonds) that would be rejected by the abundance p-value at the a = .01 significance level; we posit that early round PCR effects are a suitable candidate to explain these departures from the error model. of errors sit on or above the abundance discrimination line. Such errors were individually not expected to be observed at all, but ended up with a small number of reads larger than one. This pattern was observed across many clusters, and we believe that it reveals the presence of small violations of our assumption of the independence between reads. In particular, in a regime where the ratio of the number of error-free reads to the number of DNA molecules in the sample that act as the basis for amplification is of order one or larger, then errors during early stages of PCR may be sampled multiple times in the sequence data. As a result, the distribution for the number of reads of these errors may fall off much more slowly than what our model suggests. To deal with this effect in this paper, we lowered the a threshold using an ad hoc method (discussed below) to prevent excess false positives. Doing so did not affect DADA's ability to detect the genuine diversity in the data analyzed in this paper, which was typical of the data analyzed in many microbial metagenomics studies, but the sensitivity that is lost by using very small values of a could be limiting for samples with even finer-scale diversity. Further analytics that model PCR as a branching process improve this current ad hoc threshold (unpublished work).

Treatment of insertions and deletions
DADA does not attempt to explicitly model the indel error processes, and indels do not contribute to the determination of whether sequences are related to each other via errors. Instead, sequences are aligned to each putative sample genotype, and are assigned to clusters on the basis of substitutions. During the computation of p-values, we sum together the reads of sequences within each cluster that have the same set of substitutions (forming structures that we call indel families). The number of reads of each of these indel families, rather than those of the raw sequences, are the basis of our p-values (see Methods).
Treating indels in this way does not affect the accuracy of DADA for the test data sets analyzed here, as the sample genotypes all differed from each other by at least one substitution, and these provided enough information for DADA to distinguish between them. However, DADA cannot distinguish between sequences that differ only by indels. In such cases, if the amplicons being denoised are coding regions, frame information should be used for making decisions about whether particular indels are real or errors, but in order to denoise non-coding regions with pure indel diversity, DADA is not sufficient in its current form.

Preclustering
Prior to our probabilistic sequence clustering we divided the raw data into coarse 3% single-linkage clusters (with indels not contributing to distance), subsets for which each sequence is ≤ 3% from at least one other sequence in its cluster and > 3% from all sequences in other clusters. Due to its speed, we employed the ESPRIT algorithm for this task [16]. Single-linkage's propensity for chaining was advantageous in this circumstance, as all error-containing sequences are very likely to be in the same cluster as their originating sample genotypes; for a sample genotype and one of its errors to end up in different clusters, the error would have to be ≥ 3% from the nearest error clustered http://www.biomedcentral.com/1471-2105/ 13/283 with that genotype, corresponding to a large gap in the error cloud, which is unlikely under our error model.

Clustering
Each precluster is partitioned into sets of sequences that are conjectured to contain all errors arising from different sample genotypes. This partition is initialized to a single cluster containing all sequences. Two procedures then alternate. First, the indel family most unlikely to have resulted from errors is split off into a new cluster. Sequences then move between clusters based on the probability that they were generated as errors by each one, and the consensus sequence for each cluster is updated until there are no remaining reassignments that can improve the probability of the data. This second step is analogous to the assignment and update steps of standard k-means clustering. This alternation stops when the partition of the sequences fits with the current error model at the significance levels provided by the user.

Accuracy
We evaluated the accuracy of DADA by denoising three of the data sets in Q11 used to demonstrate AmpliconNoise's accuracy relative to the earlier SLP and DeNoiser algorithms. These data are derived from mixtures of known clones that were amplified together and sequenced on the 454 platform, and consisted of different hypervariable regions of the 16S RNA subunit of bacterial ribosomes (16S rRNA), which are commonly used as a proxy for phylogenetic diversity in metagenomic studies [4]. Two of the data sets, Divergent and Artificial, with 35,190 and 31,867 reads, were sequenced with the GS-FLX chemistry and were truncated at 220 nucleotides. They were constructed by amplifying the V5 region of the 16S rRNA gene from 23 and 90 clones, respectively, isolated from lake water. The Divergent clones were mixed in equal proportions and are separated from each other by a minimum nucleotide divergence of 7%, while the Artificial clones were mixed in abundances that span several orders of magnitude, with some of the clones differing by a single SNP. The other data set, Titanium, with 25,438 reads, was sequenced with the newer Titanium chemistry and was truncated at 400 nucleotides. It contains V4-5 16S rRNA genes from 89 clones isolated from Arctic soil with varying abundance and genetic distances, similar to the Artificial set.
All data sets had undergone filtering of reads deemed to be of low quality prior to application of AmpliconNoise in Q11, so for purposes of comparison, we denoised the same set of filtered reads. The presence of a small number of low-quality reads in 454 data has been previously demonstrated [17], and as we do not expect these to be well described by our error model, we encourage the use of such quality filtering before applying DADA to nontest data. As SLP and DeNoiser were already demonstrated to be less accurate than AmpliconNoise on these data, we include here DADA's performance only relative to that of AmpliconNoise. There were six other data sets presented in Q11 of V2 regions from a gut microbial community, but these had such an overwhelming number of chimeric sequences (reported to be as high as 85% in Q11), which neither DADA nor AmpliconNoise attempts to address, that we opted not to include these data sets in our analysis.

Tuning algorithmic sensitivity
DADA employs two tunable parameters that determine how conservative or liberal the algorithm is to be in deciding whether particular sequences could have resulted from errors: a , and r , the significance levels for its abundance and read p-values. Decisions about singletons, the sequences represented by a single read, depend on r , whereas decisions about sequences with several reads depend on a . The two values may be tuned independently to match the priority being placed on capturing the rarest and more common diversity.
Due to early-stage PCR effects discussed above, it was necessary to use a significance levels lower than typical values. In order to select such values, we first performed a loose clustering of each data set with larger values of a and r and then made histograms of the a thresholds that would be required for each cluster to be reabsorbed into some other cluster ( Figure 3). If there are errors with moderate statistical deviations from our model, we expect that these will show up as a tail of increasingly small pvalues that will disappear smoothly as we lower the a threshold. Thus, we looked for the first large gap in these histograms that would suggest all such model departures had been captured. Such a gap occurs at a = 10 −15 for the Divergent data, a = 10 −40 for the Artificial data, and a = 10 −100 for the Titanium data. We used these values in the analysis that follows, but also clustered all three data sets with a = 10 −100 and found that the results were unchanged (see appendix 2). This suggests that a = 10 −100 is a reasonable default value to use when clustering diversity at this scale, even though higher resolution may be achieved by the method outlined above. For non-test metagenome data that is more diverse and less oversampled, we have seen evidence that using much larger values of a (such as .01) may be possible without compromised accuracy, but in such cases it is always advisable to make histograms of the type above to ensure that there is not an excess of clusters that would vanish if a were lowered slightly. We did not observe any significant departures in these data from our model that would affect the read p-values, and it was therefore possible to maintain the interpretation of r as a significance threshold. As a result, for these data, which contain ≤ 50 preclusters that were clustered separately by DADA, we set r = 10 −3 so that the http://www.biomedcentral.com/1471-2105/13/283 probability of having a false positive would be ≤ 5% for each data set.

False negatives and false positives
The purpose of DADA (and AmpliconNoise) is the inference of the genotypes present in the underlying sample from a set of noisy (error-containing) sequencing reads. There are two types of errors that such an algorithm can make: false positives in which a sample genotype is inferred that was not present in the sample, and false negatives in which the algorithm fails to infer a sample genotype that was present in the sequencing reads. The tradeoff between false positives and false negatives in the two algorithms can be controlled by the algorithmic parameters, depending on which type of error presents more of a problem to the user.
We present, in Table 1, a comparison of the false positives and false negatives for DADA and Amplicon-Noise applied to the control data sets described above. Note, however, one important detail: these algorithms are designed to remediate substitution and indel errors, not all possible errors. In particular, we found that contaminants, chimeras, and pathological homopolymer errors contributed to these sequencing data sets. Using ad hoc methods, discussed in appendix 1, we accounted for these additional error sources, and did not penalize either algorithm for them.
DADA is more accurate in its inference of the sample genotypes than is AmpliconNoise on every data set. The difference is especially strong among false negatives, where DADA successfully identifies virtually all sample genotypes; DADA's only two false negatives, both in the Artificial set, result from pathological alignment issues between sequences that differ only in the last two bases.
The differences in the nature of the false positives and negatives made by DADA and AmpliconNoise are shown in Figure 4. DADA produces one false positive in the Artificial data: a sequence with 268 reads that is a single substitution from a sample genotype with only 210 reads. Due to its vast abundance, this is unlikely to be an error, and we speculate it may represent a polymorphism that arose early in the growth of this clone. Amplicon-Noise produces eight false positives in the Artificial data: all sequences with 1-4 reads that are (except for one) 1-4 substitutions away from clusters with a few hundreds of reads. Although these sequences were not atypical errors as judged by DADA due to their low abundances, Ampli-conNoise calls them real as a result of setting a narrow error radius that is needed to prevent additional false negatives. The differences between the errors made by the two algorithms is less clear in the Titanium data set, but DADA outperforms AmpliconNoise in both FPs and FNs. We included AmpliconNoise's results for the Titanium data set for both parameter settings included in Q11: the σ = .04 clustering (s25), which produces only four false negatives, leads to many false positives similar in nature to those of the Artificial clustering -low abundances and a small number of substitutions away from large clusters; the σ = .1 (s10) clustering produces many fewer false positives but misses nine sample genotypes.

Speed
We evaluated the speed of DADA applied to the Artificial data, which was used to profile AmpliconNoise in Q11 (Table 2). ESPRIT was run on a single core of an AMD Phenom II 3.2GHz running Ubuntu and DADA was run on a MacBook Pro with an Intel Core 2 Duo 2.4 GHz. The CPU times for a few significant subroutines within DADA and ESPRIT as well as their total CPU times.
DADA is currently written in MATLAB, but sequence alignments and the construction of indel families were bottlenecks that we reimplemented as MEX (Matlab executable) C programs. The majority of the time to run our denoising pipeline on the Artificial data set is spent on ESPRIT's performance of pairwise alignments during the single-linkage pre-clustering step (needledist). A newer version of ESPRIT promises to be released soon that may dramatically lower this time [18]. If additional speedups are needed as data sets grow, it should be possible to replace the global alignments of ESPRIT by banded alignments that would be guaranteed to produce the same clusters if the width of the band is equal to the cluster radius, and would have have roughly linear (in sequence length) rather than quadratic, running time. Nonetheless, for these data DADA already gives a 60-fold speedup over AmpliconNoise, making jobs that required a 64-core cluster to run AmpliconNoise appropriate for a laptop running DADA.
As read lengths continue to grow, we expect the time complexity of DADA to be affected in two primary ways. First, because the complexity of the Needleman-Wunsch alignment algorithm used by both DADA and ESPRIT scales with the product of the lengths of the input sequences [19] there will be a quadratic slowdown with increasing read length unless heuristics are employed. On data sets comparable to those analyzed here, alignments consume the majority of algorithmic time and this scaling will dominate in the near future. Second, our current implementation for computing read p-values has both time and space complexity that grows very rapidly with read length (it is asymptotically O(L 11 )). This was not strongly limiting for these data, but in case it should become so as reads become longer, we have explored the use of a continuous approximation for the error probabilities that may alleviate this problem.

PCR substitution probabilities: symmetries and nearest-neighbor context-dependence
DADA not only infers sample genotypes, it also infers the substitution error probabilities caused by the amplification and sequencing processes. The substitution error probabilities inferred by DADA for all three data sets exhibit an approximate symmetry under complementation of the two bases involved. For example, the A→G probability is close to the T→C probability. This symmetry is expected if substitution errors predominantly arise during PCR amplification because substitution errors during PCR can be the result of either of two different mis-pairing events (from when the sequence is being copied to the opposite strand, or from when it is being copied back), and complementary substitution errors share causal mis-pairing events (see Figure 5 for a schematic). As it was not imposed, and the identities of the original genotypes were not known to the algorithm, the symmetry is a highly non-trivial check on DADA's ability to learn error probabilities without training data. Additionally, the inferred substitution probabilities were similar across the data sets, and especially so between Divergent and Artificial, which were generated with the same PCR protocols.
We also found that the nearest-neighbor nucleotide context affects the probability of substitution errors. We therefore introduced context-dependent substitution probabilities into DADA, allowing for dependence on the nucleotides immediately preceding and following the substituted nucleotide. Such probabilities are expected to exist in reverse complementary pairs for the same reason given for the context-independent case (again, see Figure 5); the lir → ljr probability is expected to be similar to the lir → ljr probability where lir denotes the reverse complement of lir. The degree of symmetry in the inferred probabilities, both context-independent and contextdependent, are shown in Figure 6 for all three data sets.
The magnitude of context-dependence for these data was moderate (most context-dependent probabilities differed by < 50% from context-independent ones) as seen in the spread of points along the diagonal in Figure 6 d,e,f. As a result, maintaining separate probabilities for different contexts did not affect the inferred sample genotypes. Nonetheless, that DADA was robust to significant variation in its parameters is a strong check on the stability of its sample inference. We have worked with data for which contextdependence is large and has a strong effect on clustering. Therefore, we leave use of context-dependence as an optional feature of DADA, either as a consistency check, or when justified by the amount and nature of the data. But a caution is in order: with modest sized data sets, or if the sequences are too similar, use of the contextdependent rates could result in over-fitting and calling too many errors. However if this did occur, the expected complementarity symmetry of the inferred error probabilities would be unlikely to obtain unless the sequences were read in both directions.

Discussion
DADA explicitly incorporates read abundance when deciding whether sequences are genuine or errors; if there are many identical reads of a sequence, DADA will be more likely to infer an underlying sample genotype, even if individually those reads would be consistent with being an error from a nearby genotype. Furthermore, DADA implicitly assumes, via the error model, that reads near highly abundant sequences are far more likely to be errors. In contrast, previous methods have typically treated each read independently. AmpliconNoise partially incorporates abundance by weighting the prior probability that a read belongs to a cluster by the frequency of that cluster, but this is weaker than DADA, where dependence on cluster size shows up in a binomial coefficient (see Methods), especially for high-abundance errors. By using both sequence identity and abundance in this way, DADA is able to disentangle real diversity from errors at finer scales than previous methods, even when tuned to be very conservative.
However, full incorporation of abundance information makes DADA sensitive to early-stage PCR effects and the mis-estimation of error probabilities. The problem of early-stage effects is particularly pronounced in these data: when clustered with a = r = 10 −3 , the Artificial data produces 68 false positives (we would have expected no false positives if the model assumptions were not violated). The majority of these sequences have 2-5 reads and 2-4 errors. Such problems would be typical of moderately oversampled PCR, the regime in which initial sample molecules are typically sampled multiple times, allowing a single error during early stages to show up in more than one read. http://www.biomedcentral.com/1471-2105/13/283 In lieu of an abundance statistic that appropriately compensates for this affect, we deal with this problem by lowering the sensitivity of the algorithm by tuning down a . Further, because the probability given to each sequence scales as the error probabilities to the power of the number of reads (see Methods), if certain error parameters are larger than estimated in certain contexts, then the statistical significance of an error with many reads can be substantially overestimated. This problem gets progressively worse for deeper data sets, as all one-away errors begin to take on many reads. In anticipation of this problem, we have introduced nearest-neighbor contextdependence of error rates (see Methods). These had no impact on the final clustering for the test data presented, but in other data sets with larger context-dependent effects, we found a reduction in diversity estimates when context-dependence was included (data not shown).
DADA is a divisive hierarchical clustering algorithm: all sequences are assigned to a single cluster that is subdivided until the clustering fits an error model. Previous methods, including AmpliconNoise and simple OTU-clustering, have predominantly taken the opposite, agglomerative approach, which starts with too many clusters and merges them until some condition is met. This gives DADA a practical advantage, as the computational and space requirements (especially the number of alignments to perform and store) scale with the square of the initial number of clusters [20]. The typical problem of divisive methods -that the number of possible splittings is too large -is handled in DADA by seeding new clusters with sequences that are identified as not being errors and allowing other sequences, e.g. errors associated with the new clusters, to relocate if they become more probable by doing so.
Finally, DADA uses unsupervised learning to acquire error probabilities from the data that it is given. As PCR protocols vary in their choice of polymerase and number of rounds, these parameters vary by data set, perhaps greatly. This makes the universality of DADA's approach especially attractive, and will be important as new sequencing methods come into use such as longer read-length and paired-end Illumina that commonly make substitution as well as indel errors [21]. While it now relies on training data to establish error parameters, AmpliconNoise could be embedded in the same procedure of estimating error probabilities after each successive round of clustering, but this would multiply the computation requirements by a factor of the number of rounds of re-estimation, compounding the problem of its slower speed.

Conclusions
OTUs serve as a rough analogue for microbes of the more clearly defined taxonomic groups of higher organisms.
However, the repurposing of the OTU concept to the problem of inferring sample genotypes from error-prone metagenomic sequence data has serious and inherent shortcomings. The absence of an error model causes estimates of diversity, especially species richness, to depend strongly on experimental variables such as the size of the data set, the length of the region sequenced, and the details of the PCR/sequencing chemistry. These shortcomings are not amenable to simple fixes; it is not possible to separate real diversity from errors using an OTU approach when the diversity and the errors exist at similar scales (as measured by Hamming distance), as is the case in many metagenomic studies. PyroNoise and Ampli-conNoise have demonstrated the usefulness of denoising sequence data with statistical, physically-based error models. These methods are based on the classical statistical technique of expectation-maximization. We have presented an alternate approach, DADA, which is more targeted to the particular task of producing conservative estimates of diversity from noisy sequence data. It is much faster and more capable of resolving fine-scale diversity while maintaining a lower false positive rate.
We did not achieve our goal of complete freedom from ad hoc parameters in this work. Even though a , our input parameter, has a simple probabilistic meaning that is data set independent, there are corrections to our PCR model, and as a result a takes on an ad hoc quality in this analysis. Nonetheless, a can be coarsely tuned from the data itself in the way shown. Alternatively, for conservative diversity estimates, a may be set to very small values (such as 10 −100 ), and the resolution of the algorithm may be directly quantified. DADA not only guesses what is there, but knows what would have been missed if it were present, making a ad hoc but not arbitrary.
Much work remains to be done, and it is not yet clear how the algorithms will fare with extremely rich fine-scale diversity as occurs for the antibody repertoire of B-cells and T-cells of the human immune system [22,23]. DADA must be equipped with statistics that correctly describe the abundance distribution of sequencing errors when a realistic model of PCR is used in which some reads are the result of shared lineages. More sophisticated methods for chimera detection that explicitly parameterize the chimera formation process analogously to the substitution and indel processes are also needed. Finally, these methods must be fully adapted and tested on sequencing platforms other than Roche's 454.

General notation
From a sequencing data set S = {s x , r x }, where r x is the number of individual reads of each distinct sequence s x , we would like to construct an estimate G = {G α } of the set of genotypes in the sample that gave rise to S. With http://www.biomedcentral.com/1471-2105/13/283 this aim, we construct a partition B of the sequences {s x } where each B α ∈ B is a collection of sequences hypothesized to have originated from a common G α , and notate the number of reads assigned to G α by ρ α = x|s x ∈B α r x . Because each s x can reside in only one B α and it is assumed that G α is the source of all s x in B α , this framework does not allow for multiple G α to contribute reads to the same s x . Allowing the latter is likely to affect G only in special cases and adds complications.

Treatment of insertions and deletions: the construction of indel families
In addition to substitution errors, reads acquire insertions and deletions (indels) during amplification and sequencing. Both substitutions and indels could be used to parameterize an error model, but here we focus on substitutions and do not attempt to characterize the statistics of indels. Instead, we collapse together all the reads of sequences within each B α that differ from each other only by the location of indels in their alignments to G α , forming subsets of each B α that we call indel families. We call the indel families F (S, B, G) = {s y , r y }, where each s y refers either to a subset of some B α or the sequence identical with G α except for the substitution errors of its constituents, and r y is the number of reads in the family. The r y of each indel family will be used to test whether B agrees with an error model, i.e. whether the substitution errors observed on the families in each B α was not too improbable under an error model of substitution errors.
Alignments between sequences and each G α in this paper took place with a scoring matrix of 5 + log T (to make them comparable with NCBI BLAST's NUC.4.4 matrix [24]), where T, introduced below, is a matrix of substitution error probabilities. We used a gap penalty of −4 and a hompolymer gap penalty of −1. The gap penalty had to be less than half the smallest mismatch score or alignments would favor a pair of indels to that mismatch. The worst mismatch score tended to be about −6, and so −4 was chosen as a gap penalty to allow as many gaps as possible without making any mismatches prohibited within alignments.

The independence between substitution errors on different reads implies a binomial distribution for the number of reads of each family
If the occurrence of substitution errors on different reads are independent events, then each read of genotype G α has an i.i.d. one-trial multinomial distribution with parameters = {λ yα }, which we call the genotype error probabilities, to belong to each indel family s y . also parameterizes the probability distribution for R y , the number of reads of family y: if s y ⊆ B α , then because R y is the sum of ρ α Bernoulli random variables each with success probability λ yα , it follows the binomial distribution, R y ∼ Bin(λ yα , ρ α ). The assumption of independence between reads does not hold if early round PCR errors may be sampled multiple times in the final sequence data. Then, if we condition on having observed a particular error on some other read, the probability to observe it additional times is increased.

may be constructed from simple nucleotide transition matrices
If the occurrence of substitution errors on different sites of the same read are independent events that do not depend upon the absolute position of the sites, then we can write each λ yα in terms of a homogeneous Markov chain T, whose elements we call nucleotide error probabilities. The simplest useful model of this sort is the 4 × 4 transition matrix T ij = P(j|i) with, for example P(C|A) the probability for taking nucleotide A in the sample to C in the data (i,j will always index nucleotides), and this is the model used by AmpliconNoise. These probabilities generate the genotype error probabilities via where α n and y n denote the n th nucleotides of G α and s y (also let λ xα = λ yα for all x|s x ∈ s y , used in Algorithm 1). If the nucleotide error probabilities at each site can depend upon the nearest-neighbor flanking nucleotides, we can keep track of a transition matrix T (L,R) for each possible (L,R) pair of flanking nucleotides such that T (L,R) ij = P(LjR|LiR), with P(ATC|AGC) the error probability for taking AGC to ATC. This generalization, which we call context-dependence, increases the number of free parameters from 12 to 192: there are 16 possible pairs of flanking nucleotides each with a 12 free parameter stochastic matrix. DADA may be run with or without context-dependence, but due to the risk of overfitting and because we saw no substantial effect on the outcome of clustering when it was used, context-dependence was not used to produce the results presented in this paper.

Assessing fit with an error model via tail probabilities
In order to assess whether G and B fit the error model , we introduce two statistics, p y and q α : p y is the probability of having seen at least r y reads of s y given that we saw at least one and q α is the probability of having seen at least one read with a genotype error probability at least as small as the smallest genotype error probability of an observed indel family in B α , λ * α = min y|s y ⊆B α λ yα . http://www.biomedcentral.com/1471-2105/13/283

p y : the abundance p-value
Call R + y the number of reads of s y given that we observed at least one: Given the definition of p y above, We refer to this as the abundance p-value because it evaluates the probability of having observed more extreme abundances under the null hypothesis that each r y was generated by the error model. Because one abundance p-value is generated for each indel family, we use a Bonferroni correction and compare each p y with a /|F |, where a is a joint significance threshold that is provided to DADA by the user.
If we had not conditioned on having observed at least one read of each family, then the unobserved families would not have born any significance (they would all have p y = 1), but before looking at the data these families could have been significant. This would create a difficulty in choosing an appropriate multiple hypothesis correction; a naive Bonferroni correction of α 4 L α with L α the length of G α , which treats all possible families as tested hypotheses, would deprive the p-value of any statistical power. Conditioning on R y > 0 and evaluating only the observed sequences avoids this complication. However, any family with r y = 1 obtains p y = 1 regardless of the smallness of λ yα , which necessitates our second statistic, q α .

q α : the read p-value
For each cluster B α , we compute the probability q α , which we call the read p-value, that there is at least one read with a genotype error probability at least as small as λ * α . Let l α be a random variable representing the smallest genotype error probability when ρ α reads of G α are generated according to . Then where e iterates over all 4 L α sequences, and λ eα are the genotype error probabilities of these sequences. Evaluating the sum in this form would be computationally wasteful; instead we iterate over sets of sequences that share the same types of substitution errors. We index these sets by 4×4 off-diagonal matrices γ whose elements γ i =j specify the number of i's on a genotype that appear as j's on the sequence. The genotype error probability for sequences of type γ away from genotype G α is λ 0α × λ γ with λ 0α = i T n αi ii the probability of having no errors (where n αi is the number of nucleotides of type i on G α ) and λ γ = i,j , which is independent of α. We also need the number of distinct sequences of type γ for each G α , which we call the degeneracy m γ (α) of γ on G α . This is computed by taking a product over multinomial coefficients: m γ (α) = i n αi ! (n αi − j γ ij )! j γ ij ! . Rewritten as a sum over γ , q α becomes Vectors of λ γ and m γ (α) can be computed starting with γ representing the more common errors and extended to more rare errors as needed to compute p-values for smaller λ * α . Finally, rather than maintaining vectors of m γ (α) for each α, we keep one for each of a small number of possible base compositions and interpolate between the q α that would result from each of these in order to approximate the q α that would result from the exact base composition of G α . Because one q α is generated for each B α , the q α are then compared with r /|B|, where r is another joint significance threshold provided to DADA by the user, in order to determine whether any λ * α are too small to be the result of errors.

Maximum likelihood estimate (mle) of error probabilities
After forming a partition B of S that fits the error model generated by T, DADA updates T to its maximum likelihood estimate given this partition. The likelihood of T given S and B is where α n and y n denote the n th aligned nucleotides of G α and s x . For the case without context-dependence, the likelihood may be rewritten as

Algorithm for inferring the sample genotypes and error probabilities
The p-values p y and q α are the basis for an algorithm that alternatively updates B and the nucleotide error probabilities T, which may be specified by the user as either context-independent or dependent, denoting their values after t iterations by B t and T t , in order to improve the likelihood of the data. This is similar to the hard-EM algorithm except that the partition B t at each step is the result of a model-based divisive hierarchical clustering approach and does not maximize the probability of S given T t . The algorithm requires two user inputs, a and r , which are the joint significance thresholds for the abundance and read p-values.

Algorithm 1
DADA Sequence clustering algorithm [15] T 0 =T mle (B 0 ), where B 0 is the trivial partition containing the entire S.
There are three levels of nesting, each beginning with a repeat statement in Algorithm 1. From outer to inner, we give a qualitative description of their purpose: 1. Starting with T 0 , the maximum likelihood nucleotide error probabilities given the trivial partition B 0 of all sequences into a single cluster, the outermost loop iteratively updates B and T until T converges. We have observed cases where T does not completely settle down but fluctuates within a small basin of attraction. To deal with such cases, DADA terminates if T ever returns to a previously held value or ||T t − T t−1 || < , where the tolerance = 1e − 9 is used as a default and may be altered by the user. If convergence has not been reached in ten rounds, DADA terminates with a warning message. After adding a new block B t α to B t , the innermost loop raises the probability of the data by reassigning each sequence to the block that would produce (under the error model) the largest expected number of reads of that sequence. The putative genotype of a cluster, G α , is also updated if a cluster B α has a new consensus sequence. This continues until sequences cease changing clusters.

Appendix 1: chimeras, contaminants, and missing or incorrect Sanger sequences
There are disagreements between the Sanger sequences of the clonal isolates used to construct the data sets and the denoised sequences of DADA and Amplicon-Noise that are due to sources other than PCR substitutions and pyrosequencing errors. These include contamination, chimeric sequences that result from the co-amplification of genomes with regions over which they exactly match, Sanger errors, the absence of any reads of two sample genotypes, and disagreements between the Sanger sequences and majority of the 454 reads about the lengths of several homopolymers (for example, not a single 454 read matched four of the Sanger sequences while many were identical except for the presence of a single deletion on a G homopolymer). In order to evaluate the relative performance of DADA and AmpliconNoise as denoising http://www.biomedcentral.com/1471-2105/13/283 algorithms, it was necessary to identify which disagreements between Sanger and denoised sequences were due to these sources and which, falling outside these categories, were due to algorithmic shortcomings. We chose criteria for classifying errors of these types and applied them to the sequences denoised by DADA and Amplicon-Noise (Table 3).
We began by correcting possible errors in the Sanger sequences. In the Divergent and Artificial data sets there were disagreements between very high abundance denoised sequences and their nearest neighbor Sanger sequences (12/23 Divergent Sanger sequences and 63/90 Artificial Sanger sequences). The denoised sequences were a consensus of thousands of pyrosequencing reads and did not differ from the Sanger sequences near homopolymers; rather, all disagreements were nonhomopolymer related deletions near the starts of the reads. It was confirmed (Chris Quince, personal communication) that all bases of Sanger sequences aligning to sites within 13 nucleotides (nts) of the forward primer of the pyrosequencing reads had been removed in Q11, and so were likewise removed in all our analysis. In the Titanium data, the denoised sequences closest to eight of the Sanger sequences had over 100 reads but differed from them by one or two homopolymer deletions at several long homopolymers. DADA and AmpliconNoise (s10 and s25) agreed on the presence of the deletions in all of these sample genotypes, and there were more copies of the error-containing sequences than the Sanger sequences in the raw data (Table 4), suggesting either an error probability greater than 50% for the combined amplification/pyrosequencing process or problems with the Sanger sequences. Therefore, we did not consider these disagreements to be false positives or false negatives for either algorithm.
Next we identified chimeras: sequences consisting of two sections with one section a close match to one sample genotype and the other a close match to a second sample genotype. These can be produced in substantial quantities by PCR [25]. Analogously to Q11, for each denoised sequence we computed the Hamming distance to the nearest Sanger sequence and to the nearest exact chimera by considering all possible breakpoints between all pairs of sequences of higher abundance (a chimera will have fewer reads than its parents unless it acquires substantial PCR bias). For a denoised sequence to be classified as a chimera, we required that it be at least 3 nts closer to the nearest exact chimera than the nearest sample genotype and within 5 nts of the optimal chimera (also analogous to the procedure used in Q11). We waived the 3 nt improvement criteria for denoised sequences that were identical to exact chimeras, which occurred for some particularly highly abundant chimeras between closely related sample genotypes. All data sets had a large number of chimeras amongst their denoised reads, with Titanium having more chimeras than sample genotypes (both algorithms), highlighting how essential accurate chimera identification is in tandem with the correction of PCR and sequencing errors. Finally, we found several sequences too far from any sample genotypes or exact chimeras to be explained by being errors away from either. Some of these sequences were similar to previously observed sequences found on GenBank (Table 5). We classified as a contaminant any For each data set and both algorithms: the total number of denoised sequences, the number that matched one of the Sanger sequenced clones, the number classified as chimeras, the number classified as contaminants, and all other false positives. http://www.biomedcentral.com/1471-2105/13/283 D GB , D sample and D chim are the Hamming distances to the given GenBank entry, the nearest sample genotype, and the optimal chimera for each putative contaminant denoised sequence. No entries are given for the Artificial data set because no likely contaminants were found in the denoised sequences of either algorithm. The last two columns show which contaminants were present in the denoised sequences of each algorithm ("Y" if found and "N" if not found). The number of genotypes used to construct the sample and the number that were present and distinct and so could detected by a denoising algorithm.
sequence within 2 nts of a GenBank sequence and at least 5 nts closer to a GenBank sequence than any sample genotype or chimera. We found a mixture of contaminants likely to come from the original sample (lake water bacteria), and contaminants that may have entered the sample during processing and sequencing (bacteria previously observed in human skin, a human mouth, and soil samples). These contaminants were not previously mentioned in Q11 but were straightforward to detect when looking at DADA's denoised sequences, in part because having a smaller pool of algorithmic false positives makes identifying contaminants much easier.
In classifying false negatives, we sought to evaluate the ability of the algorithms to detect the presence of genuine diversity in the pyrosequencing reads. However, not all clones used to construct the samples in Q11 had exact matches amongst the pyrosequencing reads: one Sanger sequence in the Artificial data set (#69 in Q11) was 29 nts away from the nearest 454 read and one Sanger sequence in the Titanium data set (#66 in Q11) was 61 nts away from the nearest 454 read. We assumed that these were missing from the 454 data, and they do not contribute to the false negatives of either algorithm. Further, a number of clones were identical to each other up to the point of truncation of the pyrosequencing reads. Finally, a number of the Titanium clones differed from each other only by the presence of Ns, bases that Sanger was unable to resolve. In such cases, we collapsed clones together and assumed the non-N containing Sanger sequence was correct. Table 6 gives the number of distinct (up to Ns) clones that are present in the data, and how many had been used to construct the sample.
Several aspects of this post-processing pipeline -especially contaminant identification-utilize knowledge of the sample genotypes and do not constitute useful methods for non-test data. Our approach to chimera identification does not utilize sample genotype information, but requires more development to be applied to non-test data: it does not search for higher-order chimeras that are combinations of three or more parental sequences, the criteria for being labelled as a chimera do not scale with error rates and read lengths, and no attempt has been made to realistically model the chimera formation process. Our goal has been only the isolation of errors due to PCR and sequencing error.

Appendix 2: a robustness
To assess whether DADA's performance in this paper was the result of the fine-tuning of a , we evaluated each data Table 7  set under all three a values. The results are given in Table 7 and demonstrate that the same results would have been achieved by using a = 10 −100 for all three data sets.