Sequencing error correction without a reference genome
© Sleep et al.; licensee BioMed Central Ltd. 2013
Received: 3 May 2013
Accepted: 10 December 2013
Published: 18 December 2013
Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology, however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors due to the high number of nucleotides being sequenced. Identifying sequencing errors from true biological variants is a challenging task. For organisms without a reference genome this difficulty is even more challenging.
We have developed a method for the correction of sequencing errors in data from the Illumina Solexa sequencing platforms. It does not require a reference genome and is of relevance for microRNA studies, unsequenced genomes, variant detection in ultra-deep sequencing and even for RNA-Seq studies of organisms with sequenced genomes where RNA editing is being considered.
The derived error model is novel in that it allows different error probabilities for each position along the read, in conjunction with different error rates depending on the particular nucleotides involved in the substitution, and does not force these effects to behave in a multiplicative manner. The model provides error rates which capture the complex effects and interactions of the three main known causes of sequencing error associated with the Illumina platforms.
The combination of a high read depth and the highly expressed nature of some sequences can result in some reads occurring millions of times in a next generation sequencing data set. For these situations, even very low error rates may still result in the presence of a multitude of sequence variants. Distinguishing these variants from true biological variants is a technological and computational challenge. In many species, this difficulty is compounded by the lack of an available reference genome.
The importance of identifying and correcting sequence errors has been highlighted by the recent discussion prompted by the report of the presence of widespread differences between the human genome (DNA) and reads derived from the corresponding RNA . While it is tempting to interpret such differences as being due to the presence of RNA editing, a reanalysis of this same data set showed that the majority of the reported differences were actually consistent with technical artefacts arising from sequencing errors (see, e.g. ).
It goes without saying that when the genome of an organism has not been sequenced and assembled, the difficulty of identifying possible sequencing errors is greatly increased, necessitating the development of alternate analysis methods.
Sequencing errors arising from the use of Illumina sequencers, on which we concentrate, can occur for a variety of reasons. One source of error originates from a phenomena referred to as crosstalk. Crosstalk occurs when there is an overlap in signals of the dye emission frequencies used in sequencing machines.
This overlap can lead to confusion of the nucleotide G with nucleotide T, and of A with C [3, 4]. A second cause of error is referred to as either dephasing or phasing. Since sequencing is done in cycles, an error in an earlier cycle may propagate to and affect later cycles. This usually results in the errors appearing more frequently toward the ends of the reads. T fluorophore accumulation is another source of error, and results in more T’s being incorrectly attributed towards the ends of reads. For an extensive review, see , which also discusses other possible sources of sequencing errors such as signal decay, mixed clusters and boundary effects. Additionally, sequence-specific error patterns, including inverted repeats and the effects of the nucleotide sequence GGC have been proposed as an important cause of sequencing errors through dephasing .
The issue of sequencing errors is so ubiquitous that being able to detect and correct them is essential in many areas of molecular biology, particularly in the identification of miRNAs. In , the occurrence of errors and their corresponding rates were investigated by looking at Illumina data sets (2.8 million 27-base reads) taken from Beta vulgaris and Helicobacter acinonychis. By aligning reads to the known genomes of these bacteria, error rates were derived for each of the 12 possible nucleotide substitutions.
This work is typical of procedures that rely on the availability of a reference genome and many methods and software packages have been developed for the detection and/or correction of sequencing errors in this setting . One such method  is based on an algorithm for correcting sequencing errors that uses a ‘generalized suffix trie’. However, this method requires a reference genome and assumes a uniformly distributed error rate. A similar method using suffix arrays is that of Ilie et al. . An alternative method for correcting short reads that takes into account genomic repeats, is described in . Based on a position-dependent error model, error probabilities are estimated for each nucleotide substitution type. The method of  also requires a reference genome, has a position dependent error model, but it is one that is not base-specific.
where P error is the overall probability of an error, P error (pos) is the adjusted probability of an error at each position, and Rate (Q,R) gives the probability of pattern R occurring when the quality score takes the value Q. For each child, the expected number of errors are compared to the actual frequency, using a Z-test with the null hypothesis being that the sequence read contains a sequencing error.
where L is the length of the sequence. Children of a given sequence are then classified as true biological variants if their abundance is > > p X, where X is the abundance of the ‘parent’ sequence. If their abundance is roughly pX or less, they are considered to be sequencing errors. Additionally, reads with an abundance of less than 12 are also removed. This approach has the advantages of not requiring a reference genome or platform-provided quality scores. However, a limitation of this model lies in the assumption of a constant error rate along the read. While the number of sequence variants does increase with increasing abundance of a sequence, there is considerable variation (as seen in Figure 1) that cannot be explained by a probabilistic model based on a constant error rate.
In the following sections we present a method for modelling sequencing errors, extending the graph-based approach described in  but incorporating position- and substitution-specific error rates. Additionally, we do not enforce these effects to be working multiplicatively, and the method does not require a reference genome or quality scores. Results are included that highlight the advantages of the method in its ability to account for the interactions between the different causes of sequencing errors.
We extend the approach of  by allowing for error rates that are both dependent on position along the read and vary by each nucleotide substitution pattern (e.g., T → G). As stated earlier, we do not restrict ourselves to the assumption that these two effects work in a multiplicative fashion. The method was tested on data sets obtained from high-throughput sequencing of short RNA reads extracted from the leaves and roots of three different cultivars of wheat. For this purpose, the Illumina Solexa sequencing technology was employed.
The number of reads for each sample and lane ranged from 5 to 41 million reads, and several sequencing runs were performed approximately two years apart. The first samples were 36 base reads run on the Illumina Genome Analyzer (GeneWorks Pty. Ltd., c. 2009) and the second were 50 base reads from Illumina HiSeq with Illumina TruSeq v3 reagents (Australian Genome Research Facility Ltd., December 2011). What we term an individual data set is the sequenced data from a particular lane (numbered between 1 and 8), corresponding to their physical locations on the flow cell. We refer to lanes 4 and 5 as being the innermost lanes and 1 and 8 as the outermost.
Data processing and graph construction
Processing began with the 3′ adaptors being trimmed from the sequences. A number of mismatches to this adaptor were allowed depending on the length of the matching sections, as described in . Reads containing homopolymer tracts were not removed at this early stage. Removal occurred as a final step after the error model is built and the data corrected for sequencing errors. This was done so as to prevent the removal of any parent sequences of erroneous reads. Reads containing undetermined nucleotides (denoted by the letter N) were, however, excluded from our analysis. Sequences of length outside the region of interest (20-24 nucleotides) were not studied any further. Unique sequences were identified along with the frequency (abundance) with which each was seen in the data.
Properties of created subgraphs
Illumina GA lane 2
Frequency of subgraph sizes
Excluding adaptor trimming, the graphs are created in approximately 40 minutes (on a single processor of a PC running 32 bit Windows XP with 3.45GB of RAM) for a file of 6 million 35-base reads. Our algorithm was not parallelised, but can be, which would greatly reduce the processing time. A similar amount of time is required for the building of error models and correction of the graphs. The full source code is publicly available .
where a parent is the number of times the parent sequence appears in the data set and a total is the sum of the frequencies of sequences in the subgraph. Starting from graphs satisfying the highest parental abundance threshold, we analyse the children of the most abundant sequence, recording the abundances of the sequence and each child sequence, the position along the read where the child differs from the parent, and the nucleotide substitution that has occurred.
where is the parental abundance of the parent sequence used to estimate .
Using this estimate, and assuming that our data may be modelled as coming from a binomial distribution B (n,p), we calculate 95% confidence intervals. The parameters used in the binomial distribution are n, being the number of sequences, given by and p, being the error rate . Those estimates lying above this confidence interval, and thus most likely to be derived from biological variants, are precluded from being used in the error model, and the weighted average and confidence interval is then recalculated. To ensure that positions with no sequencing errors and only biological variants (which would not be removed by the confidence interval method described above) did not contribute exceptionally high data points, an additional smoothing technique was employed. This involved adjusting probability estimates that were greater than twice the average of their two nearest neighbours. These unusually high values were replaced with this average.
and hence are able to model non-multiplicative effects.
The model described above is used to find and correct sequencing errors by comparing the observed sequence abundances with those predicted by the model. Statistical hypothesis testing is used for this purpose with the null hypothesis being that a given sequence is a sequencing error. Sequences for which the null hypothesis is rejected are classified as true biological variants, the remaining sequences are classified as sequencing errors.
Results and discussion
Summary of modelled error probabilities and model parameters
Illumina GA lane 2
Illumina GA lane 4
Illumina HiSeq lane 2
A → C
A → G
A → T
C → A
C → G
C → T
G → A
G → C
G → T
T → A
T → C
T → G
By comparing Figure 3(a) with Figure 3(b), it can be seen that in GA datasets there is a strong dependence of overall error rate on the sequencing lane, with error rates lowest in the inside lanes (Figure 3(b)). Whether this is a more general phenomenon requires further investigation. However, it highlights the necessity of processing lanes separately.
The error profiles of the sequenced reads from lane 2 of the Illumina HiSeq data (Figure 3(c)) show a qualitatively different profile in that some error rates are initially decreasing along the reads. Error rates along the read are substantially lower overall. The substitution G → T is higher at the beginning of the reads but becomes lower moving along the read, which is in contrast to the reverse error, T →G, which increases towards the end. The reason for this phenomena is unclear but it is hypothesised as due either to altered chemistry (the washing away of T fluorophores becoming more (too) effective) or to the changes in the different base calling algorithm (overcompensation for the T fluorophore accumulation phenomenon). We consider the latter scenario more likely as we see similar patterns for many of the other corresponding pairs of nucleotide substitutions. Moving along the first 24 bases of the read, C → T, G → A and G → C also decrease, while their reversed substitutions T → C, A → G and C → G increase. A lane trend is also observed in the Illumina HiSeq data. However, it does not involve all nucleotide transitions. The outer lanes, 2 and 8, have significantly higher error rates for the G → T and T → G transitions. These two error rates become progressively lower toward the inside lanes. The other nucleotide transition error rates remain essentially constant across the lanes.
To address the difficult matter of evaluation we undertook two benchmarking analyses. Firstly, we applied our model to a simulated data set and secondly we checked the performance of our model by correcting reads from an organism with a known reference genome.
Summary of model parameters resulting from simulated data
A → C
A → G
A → T
C → A
C → G
C → T
G → A
G → C
G → T
T → A
T → C
T → G
Evaluation of error correction algorithm on PhiX genomic sequences
We have proposed a model of sequencing errors that is flexible enough to incorporate known sources of error intrinsic to the Illumina sequencing technologies and that does not rely on the availability of a reference genome for error detection. We have demonstrated the advantages of using of a non-factorisable model, particularly necessitated by the presence of accumulated T fluorophores in the Illumina GA data, and other unknown non-multiplicative effects in the Illumina HiSeq data. The method described herein is potentially applicable not only to short RNA reads but also to other sequencing activities where a reliable sequenced genome is not available, such as in the field of metagenomics, where a mixed sample containing reads from many organisms is sequenced, or when trying to distinguish sequencing errors from single nucleotide polymorphisms. While, as discussed in the results section, our model performs well in identifying sequencing errors (our method identifies at least 96.64% of errors in the example PhiX data set), we note that our model may not account for some errors that arise before the sequences enter a flowcell, e.g. during reverse transcription or library amplification. These errors may lack a highly abundant parent sequence and thus are difficult to identify without a reference genome.
A possible direction to improve this model is to include the investigation of the role of single and multiple preceding or following bases in determining error rates. The inclusion in the model of error prone positions, such as those reported in [6, 17] is an area of future interest. Correcting for local variants in error rates within lanes, possibly produced by bubbles in flowcells, also warrants further investigation. Additionally, we note that the characteristic phasing-related rise is not visible for all error types in the first 24 bases of GA data. If one were to model the error rate beyond this point one would have to incorporate a second, increasing, exponential in the fitting function or use a more flexible method, such as fitting splines.
The authors would like to thank Professor Stan Miklavcic for feedback on the manuscript, Dr Chris Brien for useful statistical discussions, Mr John Toubia for technical assistance and Dr Bu-Jun Shi for providing the data used for analysis. This work was supported through funding from the Australian Research Council, Grains Research and Development Corporation, and the Government of South Australia.
- Li M, Wang IX, Li Y, Bruzel A, Richards AL, Toung JM, Cheung VG: Widespread RNA and DNA sequence differences in the human transcriptome. Science. 2011, 333 (6038): 53-58. 10.1126/science.1207018.PubMed CentralView ArticlePubMed
- Pickrell JK, Gilad Y, Pritchard JK: Comment on ‘Widespread RNA and DNA sequence differences in the human transcriptome’?. Science. 1302, 335 (6074):
- Whiteford N, Skelly T, Curtis C, Ritchie ME, Löhr A, Zaranek AW, Abnizova I, Brown C: Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009, 25 (17): 2194-2199. 10.1093/bioinformatics/btp383.PubMed CentralView ArticlePubMed
- Li L, Speed T: An estimate of the crosstalk matrix in four-dye fluorescence-based DNA sequencing. Electrophoresis. 1999, 20: 1522-2683.
- Ledergerber C, Dessimoz C: Base-calling for next-generation sequencing platforms. Brief Bioinform. 2011, 12 (5): 489-497. 10.1093/bib/bbq077.PubMed CentralView ArticlePubMed
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011, 39: e90-10.1093/nar/gkr344.PubMed CentralView ArticlePubMed
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-10.1093/nar/gkn425.PubMed CentralView ArticlePubMed
- Yang X, Chockalingam SP, Aluru S: A survey of error-correction methods for next-generation sequencing. Brief Bioinformatics. 2013, 14: 56-66. 10.1093/bib/bbs015.View ArticlePubMed
- Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B: SHREC a short-read error correction method. Bioinformatics. 2009, 25 (17): 2157-2163. 10.1093/bioinformatics/btp379.View ArticlePubMed
- Ilie L, Fazayeli F, Ilie S: HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics. 2011, 27 (3): 295-302. 10.1093/bioinformatics/btq653.View ArticlePubMed
- Yang X, Aluru S, Dorman K: Repeat-aware modeling and correction of short read errors. BMC Bioinformatics. 2011, 12 (Suppl 1): S52-10.1186/1471-2105-12-S1-S52.PubMed CentralView ArticlePubMed
- Wijaya E, Frith MC, Suzuki Y, Horton P: Recount: expectation maximization based error correction tool for next generation sequencing data. In Genome Inform. 2009, 23: 189-201.
- Qu W, Morishita S, Hashimoto S i: Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. Genome Res. 2009, 19 (7): 1309-1315. 10.1101/gr.089151.108.PubMed CentralView ArticlePubMed
- Schreiber A, Shi BJ, Huang CY, Langridge P, Baumann U: Discovery of barley miRNAs through deep sequencing of short reads. BMC Genomics. 2011, 12: 129-10.1186/1471-2164-12-129.PubMed CentralView ArticlePubMed
- Source code for sequencing error correction without a reference genome. [http://unisa.edu.au/Research/Phenomics-and-Bioinformatics-Research-Centre/Software/],
- Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009, 10 (8): R83+-10.1186/gb-2009-10-8-r83.PubMed CentralView ArticlePubMed
- Minoche A, Dohm J, Himmelbauer H: Evaluation of genomic high-throughput sequencing data generated on illumina HiSeq and genome analyzer systems. Genome Biol. 2011, 12 (11): R112-10.1186/gb-2011-12-11-r112.PubMed CentralView ArticlePubMed
- Illumina IGenomes collection. [http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn],
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.