Estimation of sequencing error rates in short reads

Background Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. Results We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/∼vwang/shadowRegression.html. Conclusions The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.


Background
The rapid development of new DNA sequencing technologies is transforming biology by allowing individual investigators to sequence volumes previously requiring a major genome center. Before the development of nextgeneration sequencing platforms, Sanger biochemistry was the basis of sequencing production. Sanger sequencing, or conventional sequencing has been fine-tuned to achieve read-lengths of up to ∼1,000 bp and per-base accuracies as high as 99.999% [1]. However, given several bottlenecks in conventional sequencing that restrict its parallelism, the optimization in throughput and cost has reached a plateau. Several alternative sequencing strategies have been proposed in the past several years to reduce sequencing time and cost. http://www.biomedcentral.com/1471-2105/13/185 next-generation sequencing pipelines. At the nucleotide level, base-calling algorithms developed by manufacturers (for example, Bustard by Illumina) and independent investigators [2][3][4][5] provide per-base phred-like [6] quality scores as a byproduct. These methods require fluorescence intensity measurements from sequencing runs as input, and the per-base quality scores produced need to be further summarized to assess sequencing quality at higher levels. At the technology level, there have been efforts to characterize error patterns associated with different platforms, which include Dohm et al. and Hansen et al. [7,8] for the Illumina platform, and Huse et al. [9] for the 454 Genome Sequencers. These studies are very important in facilitating our understanding of the quality characteristics, however, they do not provide methods to assess the quality of sequencing data that are produced everyday in individual laboratories.
Sequencing quality also needs to be evaluated and analyzed in light of different applications. For example, Bullard et al. [10] conducted a study of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing data. They evaluated how DE results are affected by varying gene lengths, base-calling calibration methods, and library preparation effects. They obtained the number of uniquely mapped reads with 0 (U0), 1 (U1) or 2 (U2) mismatches and used the ratio (U1+U2)/(U0+U1+U2) to estimate the perread sequencing error rate, which is the proportion of reads that contain at least one error. We refer to this method of counting the number of mismatches to the reference genome as the mismatch counting method. This builds on the assumption that the perfect-matching reads contain no errors and that U1 and U2 contain sequencing errors but not SNPs. It also requires mapping to a reference genome, a step that may be problematic in applications such as threat detection, as the genome sequence of interest may not be known.
Tools also exist to correct errors from next generation sequencers [11][12][13][14][15][16][17] and error rates can be estimated as a by-product. These methods are based on k-mer or substring frequencies, or finding overlaps between reads, which are very computationally intensive, require a large amount of memory, and are difficult to work with large genomes.
In this work we develop a simple and efficient method to estimate the error rates in any sequencing pipelines based on the observation that there is a linear relationship between the number of reads sequenced and the number of reads containing errors. We refer to this proposed method as shadow regression, and show that it works well in applications with moderate to high sequencing depth, such as mRNA-seq, re-sequencing and SAGE.
As with any high throughput experiments, it is important to monitor the quality of next-generation sequencing data at the level of individual experiments. Currently the percentage of reads mapped is used as a quality indicator but it does not directly address the fundamental question of how much error is present in the reads obtained from a sample. A reference genome may not be available at all. Even if the reference genome is available, we show, using simulated data and real data on PhiX, that mapping reads to the reference genome can introduce biases even at relatively modest polymorphism rates. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data. For example, error rates are useful in understanding the fidelity of SNP or mutation calls as a function of coverage.

Next-generation sequencing and SAGE data
Public next-generation sequencing data were obtained from the NCBI Sequence Read Archive (SRA) [18]. The SRA accession numbers are: SRA010153 (mRNAseq, MAQC), SRA001150 (mRNA-seq, Encode) and SRA010105 (re-sequencing, mutation screening). In addition, PhiX data were generated on Illumina Genome Analyzer II by the Center for Cancer Computational Biology at Dana-Farber Cancer Institute (DFCI). SAGE data were obtained from NCBI SAGEmap [19]. Next-generation sequencing data were downloaded in FASTQ format and converted to read counts. SAGE data were downloaded as read counts. Reads that contain N's (no calls) or consist of all A's, C's, G's or T's are filtered out before shadow counts are computed. More details about the data can be found in the Results and Discussion section.

Estimating sequencing error rates
Sequencing pipelines output many short sequence reads representative of the sequences in the sample. If substitutions, insertions or deletions occur, the resulting sequence read differs from the true sequence. Given a read t, we refer to reads that have sufficient sequence similarity with t as the shadows of t. The results shown here are based on differences by up to two bases, though different definitions could be adopted depending on the application. Among the shadows, there may be sequences that are legitimate and error-free. Our method is based on the observation that the number of shadows due to sequencing errors increases linearly with the read count of t, while the number of legitimate shadows is independent of the read count of t. Figure 1 shows read-shadow relationships for samples from some of the data sets to which shadow regression was applied. The same plots for two DNA sequencing runs of the bacteriophage PhiX are shown in Figure 2. We see that as read count increases, shadow count also increases. Infrequently occurring reads have wildly varying numbers of shadows; however, among frequently occurring reads, http://www.biomedcentral.com/1471-2105/13/185 the relationship between read count and shadow count is clear and positive. We propose a linear model based on this observation to estimate error rates in sequencing pipelines, which we call shadow regression. Even though there may be more than two errors in a read, as is the case for reads containing sequence-specific errors shown in Nakamura et al. [20], we found that using reads that differ from t by up to two bases in shadow regression gives us accurate estimates of error rates without excessive computational cost. Note that we present results with substitution errors only in this paper; application of this method to insertions and deletions differ only in the way shadows are defined and are implemented in the R package. Insertions at the beginning of reads and deletions at the end of reads may result in genuine reads that are shifted by one base. Therefore the resulting estimates of position-specific indel rates may be biased, so these estimates are not computed at the extremes. Read-level indel rates are unlikely to be affected.

Per-read error rates
Let r t denote the true number of reads with sequence t in a given sample, and where n t is the number of reads with sequence t that have been sequenced correctly and e t is the number of reads that come from sequence t, but contain sequencing errors and are therefore slightly different from t. We define the per-read error rate to be the proportion of reads containing sequencing errors among all the reads in a sample, which is the same as the proportion of reads containing errors per additional unit of reads sequenced: Error Rate read = t e t t r t = e t r t .
Based on our observation of the linear relationship between the number of shadow reads due to sequencing errors and the number of true reads for a given sequence t, we propose the following linear model to estimate samplespecific error rates in sequencing runs. In the case of Illumina Genome Analyzer (Solexa), a sample corresponds to a lane in a flow-cell. Our model is where s t denotes the number of observed shadows of sequence t (defined to be reads that differ from sequence t by up to two bases) in the sample, is independent of n t and approximately Gaussian with mean 0 and variance σ 2 , α is the intercept, and β is the parameter that represents the slope. Because shadows can also come from legitimate reads that are error free, β needs to be estimated robustly so that they are not influenced by error-free shadows.
Shadow counts are computed by first enumerating all possible shadows of each observed read, and then identifying the shadows that are actually observed. It is generally observed that in a given sample, a small number of reads have high frequencies while the majority of reads have very low frequencies. We find that the relationship between the number of reads and the number of shadows is captured in high frequency reads, and that it is enough to use the top 1000 reads with highest frequencies to estimate β. We also regard the top 1000 reads to be error free, and thus exclude them from shadow counts of any read.
The estimated slopeβ is the number of additional error shadows we get for each additional unit of correctly sequenced read t, i.e.β = e t / n t . The per-read error rate we want to estimate can be obtained the following way: Thus, by examining the slope, we may obtain a samplespecific estimate of the sequencing error rate. The error due to substitutions, insertions or deletions may be estimated separately with this method. Alternatively, the aggregate error from any source may be estimated.

Position-dependent error rates
It has been observed that error rates in sequencing pipelines depend on the base position in the read [5,7], which can be estimated by stratifying shadow reads by position. We define the per-base error rate at position i to be the proportion of reads with sequencing errors at position i among all the reads in the sample, i.e., where e i t is the number of reads that should be t if error free, but are not because of at least a sequencing error at position i. The corresponding linear model is t is the number of error shadows that differ from read t at least at position i, and i is Gaussian with mean 0 and variance σ i2 . As with per-read error rates,β i /(1 + β i ) is the per-base error rate at position i that we want to estimate.

Application to different data types
Our method, shadow regression can be applied to mRNAseq, SAGE, re-sequencing and whole genome sequencing data regardless of platform, as long as there is sufficient coverage in the input data. The current implementation requires the read lengths from a given sample to be the same. Precautions need to be taken when working with http://www.  Figure 3 Simulations: per-read error rate estimates based on simulated reads under five different error rates. The horizontal lines are drawn at 0, where the error rate is estimated perfectly. Each boxplot contains the differences between error rate estimates and the true error rate from 100 simulated data sets. Shadow regression and mismatch counting estimates are plotted next to each other for a given error rate.
whole genome sequencing data as some genomes are highly repetitive and may result in over estimation of the error rates. One way to assess whether a particular genome is too repetitive to apply shadow regression directly is to sample reads from the reference genome and see if shadow regression gives an error rate estimate that is very close to 0. If this error rate is significantly greater than 0, indicating a genome with repetitive regions, one can mask out reads from the repetitive regions or retain only reads from the coding regions before applying shadow regression. The test for repetitive genomes described here is implemented in the R package.

Required coverage
Shadow regression requires sufficient coverage to estimate the error rate accurately. For mRNA-seq, samples shown in the results section for MAQC and Encode human data contain about 12 million reads, which is sufficient for shadow regression to work well. When using about 1.2 million reads sampled from the 12 million reads as input, shadow regression still gives good estimates (data not shown). Therefore we are confident that our method works well with any mRNA-seq data. In general, we would like the top 1000 read counts, which is what we use for error estimation, to have a range of about 500 or more. Therefore the maximum read count needs to reach about 500 or more.

Simulation studies
In order to determine the accuracy of our proposed method, we performed two simulation studies. The first one assumes that errors in a read occur independently of each other, while the second study assumes that once an error occurs in a read, it is more likely to make another error in that read. For both studies, we start with U0 reads (reads that are uniquely mapped to the reference genome with no mismatches) of an MAQC experiment 2 sample (SRR037440). These reads, which map to the reference genome perfectly, are assumed to be the error-free ones for the purpose of generating synthetic data. Substitution errors are then added to these reads according to pre-specified position-specific error rates. For the second study where the errors do not occur independently of each other, the error rates for the rest of the read double once an error occurs in a read, i.e. if an error occurs at position i, the error rates for positions j > i become twice their pre-specified error rates. The pre-specified error rates are based on the estimated error rates of sample SRR037440 by counting the number of mismatches to the reference genome at each position. This set of position-specific error rates are then scaled to create a range of different error rates. The estimated error rates were calculated by transforming the slope from a robust linear regression (as implemented in the rlm() function in the R library MASS). For each set of pre-specified error rates, we repeated the error simulation and estimation one hundred times. Figure 3 shows the performance of shadow regression for estimating per-read error rates. Under the independent error model, shadow regression gives estimates that are usually within 2% of the true error rates. When the errors are dependent, shadow regression is usually within 5% of the truth. Note that as the error rate increases, there is a tendency for shadow regression to underestimate the error rate. This is because at higher error rates, more reads have more than two errors, which are not captured because we only use shadows that differ from the read of interest by two bases. This trend is more pronounced in the dependent case as it is easier to have multiple errors in a read if the first error induces later ones. Figure 4 shows the position-specific error rate estimates given by shadow regression, which are very similar to estimates from mismatch counting, and track the true error rates very closely. The simulations presented this far assumed that the true genome sequence of the sample is exactly the same as that of the reference genome. In practice this is seldom the case. To assess the performance of shadow regression and mismatch counting when the sample genome sequence differs from the reference genome, we performed a simulation where we assumed that there is a single base The intervals in parentheses for shadow regression estimates are the 95% confidence intervals of the estimates. The column "mm" gives the mismatch counting estimates using the standard PhiX reference genome. The column "mm / new genome" gives the mismatch counting estimates using the PhiX genome corrected for mutations shown in Figure 7.
pair difference between the sample genome and the reference genome every 1000 base pairs on average, a realistic value for human polymorphisms. This only affected the error rate estimates produced by mismatch counting since shadow regression does not use the reference genome.
The results are shown in Figures 5 and 6. As expected,  mismatch counting overestimates the error rate in this case.

PhiX DNA data
Next, we evaluated our method using sequencing data of the bacteriophage PhiX 174RF1 (PhiX) DNA, whose complete sequence is known. DNA from PhiX is sometimes sequenced in one lane of Illumina flowcells to calibrate quality scores of the base caller [21] (Supplementary information page 7). We obtained Illumina data for two PhiX samples from the Center for Cancer Computational Biology at DFCI. Figure 2 shows that shadow regression fit reasonable lines to the PhiX samples and we would expect accurate error rate estimates from it. Mismatch counting first gave much higher error rate estimates than shadow regression. On further inspection, we realized that this was caused by several mutations in the PhiX genome. Because the coverages of PhiX samples were in the thousands, differences between the reference genome and the actual PhiX sequence in a few positions substantially inflated the error rates estimated by mismatch counting. This is shown clearly in Figure 7. At five places in each sample, almost all the reads differed from the reference genome. Closer examination of the data confirmed that at each of these places, almost all the mismatched reads had the same difference between the actual reads and the reference genome, leading us to conclude that the difference was due to real mutations rather than sequencing errors. Once we incorporated these mutations into http://www.biomedcentral.com/1471-2105/13/185 the reference genome, mismatch counting and shadow regression gave error rate estimates that are much closer to each other ( Table 1). The percentage of mismatched reads was much higher than the background at some other positions, although not nearly as high as 100%.
These may be due to mixed PhiX species in the PhiX sample and other irregularities introduced in the sample processing steps, and explains the higher estimates given by mismatch counting compared to shadow regression. Our findings here demonstrate the advantage of shadow regression over methods that depend on the reference genome. Bullard and colleagues [10] found that PhiX calibration did not improve the detection of differentially expressed genes in mRNA-seq experiments and yielded fewer high quality reads per lane. Our analysis showed that this may be due to mutations in PhiX and that the reference sequence used in the calibration step can be different from the actual sequence of PhiX used. This is corroborated by the estimate given by [22], indicating that PhiX undergoes 1.0 × 10 −6 substitutions per base per round of copying.

mRNA-seq: MAQC brain experiment 2 using auto calibration
We applied shadow regression to Illumina mRNA-seq data from the MAQC project [23]. A scatter plot of read and shadow counts for one of the samples is shown in Figure 1. Specifically, we used shadow regression to estimate the per-read and position-specific error rates for fourteen samples on two flow-cells (SRX016366 and SRX016368) run on the Illumina 1G Genome Analyzer, and results are shown in Table 2 and Figure 8 respectively. Most per-read error rates fell between 15% and 20%. For the two samples with higher than 20% per-read error rates, their corresponding position-specific error rates were not as smooth as other samples and they had substantial differences in neighboring positions. In all fourteen samples, mismatch counting gave higher error rate estimates than shadow regression, which is expected because mismatch counting classifies genuine differences between the sample of interest and the reference genome as errors.

mRNA-seq: Encode transcriptome
We applied shadow regression to a second Illumina mRNA-seq data set, this time five samples of human cell line K562 from the Encode project (SRX000570) [24], also run on the Illumina 1G Genome Analyzer. A scatter plot of read and shadow counts for one of the samples is shown in Figure 1. Results are shown in Table 3 and Figure 9. We found much higher error rates in these samples with perread estimates around 40% to 50% compared to around 20% for the MAQC samples. The position-specific error rates were similar to MAQC samples in early cycles but increased dramatically after cycle 25, thus inflating the per-read error rates.

Mutation screening: re-sequencing
In addition to mRNA-seq, we also applied shadow regression to re-sequencing data from 24 samples as described in Hu et al. [25]. A novel multiplex PCR method and next generation sequencing were combined to screen patients with X-linked mental retardation for mutations in 86 genes. Sequencing was done using Illumina Genome Analyzer II. A scatter plot of read and shadow counts for one of the samples is shown in Figure 1. We used shadow regression to estimate the error rates in these samples, and results are shown in Figure 10 and Table 4. The error rates in this data set are in general lower than the two mRNA-seq data sets, possibly due to improvements in the Genome Analyzer II. Shadow regression gave lower perread estimates than mismatch counting for most samples as seen in the two mRNA-seq data sets. In about half the samples, shadow regression estimated much higher error rates in the last few cycles than mismatch counting.

Application to Serial Analysis of Gene Expression (SAGE)
SAGE is a powerful technique for the examination of genome-wide expression levels that involves considerable sequencing of concatenated ten base pair long tags corresponding to transcripts [26]. Velculescu and colleagues [27,28] estimated the magnitude of sequencing errors in a SAGE library using error estimates obtained from studies in yeast. Previous work by the same group compared the 60,633 transcripts in the SAGE yeast library to the  known yeast genome, and determined that 56,291 tags match the tags predicted by the sequence of the yeast genome, 88 of the tags match the mitochondrial genome, and an additional 91 tags match a plasmid present in yeast [27]. The remaining 4,163 tags did not match the yeast genome and were therefore considered to be the result of sequencing errors. As these 4,163 tags represent 6.8% of the total library, Velculescu reports that 6.8% of all tags are thought to have a sequencing error. This estimate may be conservative because it only considers novel tags introduced by sequencing errors. Sequencing errors which create additional copies of tags already present in the library rather than novel tags have not been considered in this calculation of the error rate. Additionally, this error rate may include legitimate tags not matching the yeast genome databases because of single nucleotide differences between the yeast strain analyzed by SAGE and those used for genome sequencing, or because of incomplete genome sequencing. We applied shadow regression to sixteen human samples of pancreatic and colorectal tissue available from the NCBI SAGEMap website. A scatter plot of read and shadow counts for one of the samples is shown in Figure 1. Estimated error rates are shown in Table 5. These error rates are slightly higher than the reported 6.8%, as expected, confirming the accuracy of the shadow regression method. These error rates are lower than in previous data sets as conventional sequencing was used in these data sets instead of next generation sequencing.

Conclusions
We established a simple, general, flexible and robust methodology to estimate error rates for any of the existing second-generation sequencing technologies producing short read sequences. The fundamental advance behind the proposed methodology is our observation that there is a linear relationship between the frequency of short reads and the frequencies of their 'neighboring' reads, where neighbors are defined by sequence similarity. This linear increase reflects sequencing errors, as frequently occurring tags cast larger "shadow" of errors on neighboring tags. This observation holds true of SAGE libraries and seems to hold universally across second-generation sequencing platforms with moderate or high sequencing depths as well. Because shadow regression estimates the slope robustly, the error rate estimates are not influenced by shadows that are error-free. The shadow regression method does not require mapping reads to a reference genome. This has significant computational advantages, but more importantly it can address critical biological needs. For example, the ability to measure error rates independent of a reference genome can be critical in experiments designed to detect unknown species, as in threat detection, and in experiments investigating many species simultaneously, as when studying the microbiome. When the reference genome is prone to errors, for example because of a relatively high mutation rate of the system studied, we showed that the method of using the differences between the sample of interest and the reference genome as proxy for sequencing errors can produce estimates that are substantially inflated, while shadow regression is not affected by such differences.
We also showed the accuracy of shadow regression through simulation studies and analyses of the PhiX and SAGE data. We applied shadow regression to mRNAseq, DNA sequencing, mutation screening and SAGE, and demonstrated that this approach can be immediately used to evaluate sequencing error rates in different applications as they are generated. Even though our next-generation sequencing examples are all from the Illumina platform, shadow regression can be applied to other sequencing platforms as well.
We hope that the availability of a simple and computationally effective way of computing error rates at the level of single sequencing experiments will contribute significantly to quality control, proper analysis and experimental design of second-generation sequencing efforts. http://www.biomedcentral.com/1471-2105/13/185 co-developed the statistical methods. RS designed the PhiX experiment and edited the manuscript. GP co-developed the statistical methods, edited the manuscript, and coordinated the research. All authors read and approved the final manuscript.