- Methodology article
- Open Access
Estimation of sequencing error rates in short reads
© Wang et al.; licensee BioMed Central Ltd. 2012
- Received: 20 December 2011
- Accepted: 13 July 2012
- Published: 30 July 2012
Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments.
We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/∼vwang/shadowRegression.html.
The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
- Error Rate
- Reference Genome
- Sequencing Error
- Read Count
- Error Rate Estimate
The rapid development of new DNA sequencing technologies is transforming biology by allowing individual investigators to sequence volumes previously requiring a major genome center. Before the development of next-generation sequencing platforms, Sanger biochemistry was the basis of sequencing production. Sanger sequencing, or conventional sequencing has been fine-tuned to achieve read-lengths of up to ∼1,000 bp and per-base accuracies as high as 99.999% . However, given several bottlenecks in conventional sequencing that restrict its parallelism, the optimization in throughput and cost has reached a plateau. Several alternative sequencing strategies have been proposed in the past several years to reduce sequencing time and cost.
One category of such alternatives is cyclic-array sequencing, referred to as second-generation or next-generation sequencing, which has been made available commercially, and includes products such as the 454 Genome Sequencers (Roche Applied Science), the Illumina (Solexa) platform (Illumina), the SOLiD platform (Applied Biosystems) and the HeliScope Single Molecule Sequencer (Helicos). Because of the much higher degree of parallelism and much smaller reaction volumes, next-generation sequencing achieves much higher throughput with dramatically lower cost. The disadvantages are shorter reads and higher error rates compared to Sanger sequencing. Next-generation sequencing has been applied in many areas of biology, including quantification of gene expression and alternative splicing, polymorphism and mutation discovery, microRNA profiling, and genome-wide mapping of protein-DNA interactions. A detailed review of sequencing technologies can be found in .
Aware of the large impact of sequencing quality on downstream analysis, several groups have attempted to detect, quantify and understand errors that arise from next-generation sequencing pipelines. At the nucleotide level, base-calling algorithms developed by manufacturers (for example, Bustard by Illumina) and independent investigators [2–5] provide per-base phred-like  quality scores as a byproduct. These methods require fluorescence intensity measurements from sequencing runs as input, and the per-base quality scores produced need to be further summarized to assess sequencing quality at higher levels. At the technology level, there have been efforts to characterize error patterns associated with different platforms, which include Dohm et al. and Hansen et al. [7, 8] for the Illumina platform, and Huse et al.  for the 454 Genome Sequencers. These studies are very important in facilitating our understanding of the quality characteristics, however, they do not provide methods to assess the quality of sequencing data that are produced everyday in individual laboratories.
Sequencing quality also needs to be evaluated and analyzed in light of different applications. For example, Bullard et al.  conducted a study of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing data. They evaluated how DE results are affected by varying gene lengths, base-calling calibration methods, and library preparation effects. They obtained the number of uniquely mapped reads with 0 (U0), 1 (U1) or 2 (U2) mismatches and used the ratio (U1+U2)/(U0+U1+U2) to estimate the per-read sequencing error rate, which is the proportion of reads that contain at least one error. We refer to this method of counting the number of mismatches to the reference genome as the mismatch counting method. This builds on the assumption that the perfect-matching reads contain no errors and that U1 and U2 contain sequencing errors but not SNPs. It also requires mapping to a reference genome, a step that may be problematic in applications such as threat detection, as the genome sequence of interest may not be known.
Tools also exist to correct errors from next generation sequencers [11–17] and error rates can be estimated as a by-product. These methods are based on k-mer or substring frequencies, or finding overlaps between reads, which are very computationally intensive, require a large amount of memory, and are difficult to work with large genomes.
In this work we develop a simple and efficient method to estimate the error rates in any sequencing pipelines based on the observation that there is a linear relationship between the number of reads sequenced and the number of reads containing errors. We refer to this proposed method as shadow regression, and show that it works well in applications with moderate to high sequencing depth, such as mRNA-seq, re-sequencing and SAGE.
As with any high throughput experiments, it is important to monitor the quality of next-generation sequencing data at the level of individual experiments. Currently the percentage of reads mapped is used as a quality indicator but it does not directly address the fundamental question of how much error is present in the reads obtained from a sample. A reference genome may not be available at all. Even if the reference genome is available, we show, using simulated data and real data on PhiX, that mapping reads to the reference genome can introduce biases even at relatively modest polymorphism rates. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data. For example, error rates are useful in understanding the fidelity of SNP or mutation calls as a function of coverage.
Next-generation sequencing and SAGE data
Public next-generation sequencing data were obtained from the NCBI Sequence Read Archive (SRA) . The SRA accession numbers are: SRA010153 (mRNA-seq, MAQC), SRA001150 (mRNA-seq, Encode) and SRA010105 (re-sequencing, mutation screening). In addition, PhiX data were generated on Illumina Genome Analyzer II by the Center for Cancer Computational Biology at Dana-Farber Cancer Institute (DFCI). SAGE data were obtained from NCBI SAGEmap . Next-generation sequencing data were downloaded in FASTQ format and converted to read counts. SAGE data were downloaded as read counts. Reads that contain N’s (no calls) or consist of all A’s, C’s, G’s or T’s are filtered out before shadow counts are computed. More details about the data can be found in the Results and Discussion section.
Estimating sequencing error rates
Per-read error rates
where s t denotes the number of observed shadows of sequence t (defined to be reads that differ from sequence t by up to two bases) in the sample, ε is independent of n t and approximately Gaussian with mean 0 and variance σ2, α is the intercept, and β is the parameter that represents the slope. Because shadows can also come from legitimate reads that are error free, β needs to be estimated robustly so that they are not influenced by error-free shadows. Shadow counts are computed by first enumerating all possible shadows of each observed read, and then identifying the shadows that are actually observed.
It is generally observed that in a given sample, a small number of reads have high frequencies while the majority of reads have very low frequencies. We find that the relationship between the number of reads and the number of shadows is captured in high frequency reads, and that it is enough to use the top 1000 reads with highest frequencies to estimate β. We also regard the top 1000 reads to be error free, and thus exclude them from shadow counts of any read.
Thus, by examining the slope, we may obtain a sample-specific estimate of the sequencing error rate. The error due to substitutions, insertions or deletions may be estimated separately with this method. Alternatively, the aggregate error from any source may be estimated.
Position-dependent error rates
where is the number of error shadows that differ from read t at least at position i, and ε i is Gaussian with mean 0 and variance σi2. As with per-read error rates, is the per-base error rate at position i that we want to estimate.
Application to different data types
Our method, shadow regression can be applied to mRNA-seq, SAGE, re-sequencing and whole genome sequencing data regardless of platform, as long as there is sufficient coverage in the input data. The current implementation requires the read lengths from a given sample to be the same. Precautions need to be taken when working with whole genome sequencing data as some genomes are highly repetitive and may result in over estimation of the error rates. One way to assess whether a particular genome is too repetitive to apply shadow regression directly is to sample reads from the reference genome and see if shadow regression gives an error rate estimate that is very close to 0. If this error rate is significantly greater than 0, indicating a genome with repetitive regions, one can mask out reads from the repetitive regions or retain only reads from the coding regions before applying shadow regression. The test for repetitive genomes described here is implemented in the R package.
Shadow regression requires sufficient coverage to estimate the error rate accurately. For mRNA-seq, samples shown in the results section for MAQC and Encode human data contain about 12 million reads, which is sufficient for shadow regression to work well. When using about 1.2 million reads sampled from the 12 million reads as input, shadow regression still gives good estimates (data not shown). Therefore we are confident that our method works well with any mRNA-seq data. In general, we would like the top 1000 read counts, which is what we use for error estimation, to have a range of about 500 or more. Therefore the maximum read count needs to reach about 500 or more.
In order to determine the accuracy of our proposed method, we performed two simulation studies. The first one assumes that errors in a read occur independently of each other, while the second study assumes that once an error occurs in a read, it is more likely to make another error in that read. For both studies, we start with U0 reads (reads that are uniquely mapped to the reference genome with no mismatches) of an MAQC experiment 2 sample (SRR037440). These reads, which map to the reference genome perfectly, are assumed to be the error-free ones for the purpose of generating synthetic data. Substitution errors are then added to these reads according to pre-specified position-specific error rates. For the second study where the errors do not occur independently of each other, the error rates for the rest of the read double once an error occurs in a read, i.e. if an error occurs at position i, the error rates for positions j > i become twice their pre-specified error rates. The pre-specified error rates are based on the estimated error rates of sample SRR037440 by counting the number of mismatches to the reference genome at each position. This set of position-specific error rates are then scaled to create a range of different error rates. The estimated error rates were calculated by transforming the slope from a robust linear regression (as implemented in the rlm() function in the R library MASS). For each set of pre-specified error rates, we repeated the error simulation and estimation one hundred times.
PhiX DNA data
Estimated error rates (as a percent) in PhiX samples
mm / new genome
2.613 (2.264, 2.962)
1.416 (1.331, 1.500)
mRNA-seq: MAQC brain experiment 2 using auto calibration
MAQC brain experiment 2 using auto calibration: per-read error rate estimates
sr standard error
mRNA-seq: Encode transcriptome
Encode project mRNA-seq data: per-read error rate estimates
sr standard error
Mutation screening: re-sequencing
Hu re-sequencing data: per-read error rate estimates
sr standard error
Application to Serial Analysis of Gene Expression (SAGE)
SAGE data: human colorectal and pancreatic samples
sr standard error
We established a simple, general, flexible and robust methodology to estimate error rates for any of the existing second-generation sequencing technologies producing short read sequences. The fundamental advance behind the proposed methodology is our observation that there is a linear relationship between the frequency of short reads and the frequencies of their ‘neighboring’ reads, where neighbors are defined by sequence similarity. This linear increase reflects sequencing errors, as frequently occurring tags cast larger “shadow” of errors on neighboring tags. This observation holds true of SAGE libraries and seems to hold universally across second-generation sequencing platforms with moderate or high sequencing depths as well. Because shadow regression estimates the slope robustly, the error rate estimates are not influenced by shadows that are error-free.
The shadow regression method does not require mapping reads to a reference genome. This has significant computational advantages, but more importantly it can address critical biological needs. For example, the ability to measure error rates independent of a reference genome can be critical in experiments designed to detect unknown species, as in threat detection, and in experiments investigating many species simultaneously, as when studying the microbiome. When the reference genome is prone to errors, for example because of a relatively high mutation rate of the system studied, we showed that the method of using the differences between the sample of interest and the reference genome as proxy for sequencing errors can produce estimates that are substantially inflated, while shadow regression is not affected by such differences.
We also showed the accuracy of shadow regression through simulation studies and analyses of the PhiX and SAGE data. We applied shadow regression to mRNA-seq, DNA sequencing, mutation screening and SAGE, and demonstrated that this approach can be immediately used to evaluate sequencing error rates in different applications as they are generated. Even though our next-generation sequencing examples are all from the Illumina platform, shadow regression can be applied to other sequencing platforms as well.
We hope that the availability of a simple and computationally effective way of computing error rates at the level of single sequencing experiments will contribute significantly to quality control, proper analysis and experimental design of second-generation sequencing efforts.
We thank John Quackenbush for sharing data from the Center for Cancer Computational Biology, and Victor Velculescu for insight into an earlier unpublished version of this algorithm for SAGE data. Funding: This work was supported by National Science Foundation award 1041698 and 1042785.
- Shendure J, Ji H: Next-generation DNA sequencing. Nature Biotechnology. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.View ArticlePubMedGoogle Scholar
- Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ: Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature Methods. 2008, 5 (8): 679-682. 10.1038/nmeth.1230.PubMed CentralView ArticlePubMedGoogle Scholar
- Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F: Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics. 2008, 9: 431-10.1186/1471-2105-9-431.PubMed CentralView ArticlePubMedGoogle Scholar
- Kao W, Stevens K, Song Y: BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research. 2009, 19 (10): 1884-10.1101/gr.095299.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Bravo H, Irizarry R: Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics. 2010, 66 (3): 665-674. 10.1111/j.1541-0420.2009.01353.x.PubMed CentralView ArticlePubMedGoogle Scholar
- Ewing B, Green P: Base-calling of automated sequencer traces using Phred. II. error probabilities. Genome Research. 1998, 8 (3): 186-View ArticlePubMedGoogle Scholar
- Dohm J, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research. 2008, 36 (16): e105-10.1093/nar/gkn425.PubMed CentralView ArticlePubMedGoogle Scholar
- Hansen K, Brenner S, Dudoit S: Biases in illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research. 2010, 38 (12): e131-10.1093/nar/gkq224.PubMed CentralView ArticlePubMedGoogle Scholar
- Huse S, Huber J, Morrison H, Sogin M, Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology. 2007, 8 (7): R143-10.1186/gb-2007-8-7-r143.PubMed CentralView ArticlePubMedGoogle Scholar
- Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94-10.1186/1471-2105-11-94.PubMed CentralView ArticlePubMedGoogle Scholar
- Butler J, MacCallum I, Kleber M, Shlyakhter I, Belmonte M, Lander E, Nusbaum C, Jaffe D: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research. 2008, 18 (5): 810-10.1101/gr.7337908.PubMed CentralView ArticlePubMedGoogle Scholar
- Schröder J, Schröder H, Puglisi S, Sinha R, Schmidt B: SHREC: a short-read error correction method. Bioinformatics. 2157, 25 (17): 2009-Google Scholar
- Kelley D, Schatz M, Salzberg S: Quake: quality-aware detection and correction of sequencing errors. Genome Biology. 2010, 11 (11): R116-10.1186/gb-2010-11-11-r116.PubMed CentralView ArticlePubMedGoogle Scholar
- Salmela L: Correction of sequencing errors in a mixed set of reads. Bioinformatics. 2010, 26 (10): 1284-10.1093/bioinformatics/btq151.View ArticlePubMedGoogle Scholar
- Schröder J, Bailey J, Conway T, Zobel J: Reference-free validation of short read data. PloS ONE. 2010, 5 (9): e12681-10.1371/journal.pone.0012681.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010, 20 (2): 265-10.1101/gr.097261.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Kao W, Chan A, Song Y: ECHO: A reference-free short-read error correction algorithm. Genome Research. 2011, 21 (7): 1181-1192. 10.1101/gr.111351.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Leinonen R, Sugawara H, Shumway M: The sequence read archive. Nucleic Acids Research. 2011, 39 (suppl 1): D19-PubMed CentralView ArticlePubMedGoogle Scholar
- Lash A, Tolstoshev C, Wagner L, Schuler G, Strausberg R, Riggins G, Altschul S: SAGEmap: a public gene expression resource. Genome Research. 2000, 10 (7): 1051-10.1101/gr.10.7.1051.PubMed CentralView ArticlePubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak M, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research. 2011, 39 (13): e90-e90. 10.1093/nar/gkr344.PubMed CentralView ArticlePubMedGoogle Scholar
- Bentley D, Balasubramanian S, Swerdlow H, Smith G, Milton J, Brown C, Hall K, Evers D, Barnes C, Bignell H, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.PubMed CentralView ArticlePubMedGoogle Scholar
- Cuevas J, Duffy S, Sanjuan R: Point mutation rate of bacteriophage ΦX174. Genetics. 2009, 183 (2): 747-749. 10.1534/genetics.109.106005.PubMed CentralView ArticlePubMedGoogle Scholar
- Shi L, Reid L, Jones W, Shippy R, Warrington J, Baker S, Collins P, De Longueville F, Kawasaki E, Lee K: The MicroArray Quality Control (MAQC) project shows inter-and intraplatform reproducibility of gene expression measurements. Nature Biotechnology. 2006, 24 (9): 1151-1161. 10.1038/nbt1239.View ArticlePubMedGoogle Scholar
- Birney E, Stramatoyannopoulos JA, Dutta A, Guigó R, Thomas R, Elliott H, Zhiping Weng M, Emmanouil T, John A, Robert E, Michael S, Christopher M, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticlePubMedGoogle Scholar
- Hu H, Wrogemann K, Kalscheuer V, Tzschach A, Richard H, Haas S, Menzel C, Bienek M, Froyen G, Raynaud M, Van Bokhoven H, Chelly J, Ropers H, Chen W: Mutation screening in 86 known X-linked mental retardation genes by droplet-based multiplex PCR and massive parallel sequencing. HUGO J. 2009, 3: 41-49. 10.1007/s11568-010-9137-y.PubMed CentralView ArticlePubMedGoogle Scholar
- Velculescu V, Zhang L, Vogelstein B, Kinzler K: Serial analysis of gene expression. Science. 1995, 270 (5235): 484-10.1126/science.270.5235.484.View ArticlePubMedGoogle Scholar
- Velculescu V, Vogelstein B, Kinzler K: Characterization of the yeast transcriptome. Cell. 1997, 88 (2): 243-251. 10.1016/S0092-8674(00)81845-0.View ArticlePubMedGoogle Scholar
- Zhang L, Zhou W, Velculescu V, Kern S, Hruban R, Hamilton S, Vogelstein B, Kinzler K: Gene expression profiles in normal and cancer cells. Science. 1997, 276 (5316): 1268-10.1126/science.276.5316.1268.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.