Repeat-aware modeling and correction of short read errors

Background High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences with valid kmers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content. Results We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”. Conclusions We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.


Background
High throughput next generation sequencing has revolutionized genomics, making it possible to sequence new genomes or resequence individual genomes at a manifold cheaper cost and in an order of magnitude less time than earlier Sanger sequencing. With this technology, ambitious genome sequencing projects target many organisms rather than a few, and large scale studies of sequence variation become feasible [1]. Many nextgeneneration sequencing technologies have been developed, including systems currently in wide use, such as the Illumina Genome Analyzer (earlier known as Solexa) and Applied Biosystems SOLiD, as well as more recent and new offerings from companies such as Complete Genomics and Pacific Biosciences [2]. Many nextgeneration sequencing systems produce short reads, e.g., the widely used Illumina Genome Analyzer systems typically produce 35-150bp reads. Short read technologies have been widely adopted for both genome sequencing and resequencing applications; hence, development of high quality short read assemblers (e.g., [3][4][5][6][7]) and short read mapping tools that map reads to a reference genome [8,9] are important.
Short reads of novel genomes are typically assembled using de Bruijn graphs that represent observed kmers as nodes and length (k -1) overlaps as edges. In the absence of errors, the size of such a graph is bounded by the length of the genome, but can be as high as 4 k in the presence of errors. In the mapping process, a read with sequencing errors may map to multiple locations, or sometimes nowhere at all. Thus, error removal or correction is necessary to keep the size of the graph manageable [7,10] and simplify non-repetitive read mapping [11].
Many approaches have been proposed to identify and sometimes correct sequencing errors in next-generation sequencing data. More recent ones include SAP (Spectral Alignment Problem)-based methods [4,10], SHREC [12] and Reptile [13]. SAP-based methods identify any kmer occurring less than a constant, user-specified frequency threshold to be erroneous. Chin et al. [14] have shown that an optimum threshold can be derived analytically, assuming the genome to be a random sequence and that errors are independently and uniformly distributed in the reads. SHREC, a suffix trie based method, classifies any substring that occurs less often than an analytically calculated threshold to contain errors based on the same assumptions as in [14]. An erroneous base, identified as an infrequent branch of the suffix trie, is corrected to one of its siblings when applicable. Reptile explores read decompositions and makes corrections to any substring whenever an unambiguous choice can be made. In contrast to the previous two approaches, erroneous substrings are inferred based on assessing their frequencies relative to the frequencies of the alternative (error free) substrings. This accommodates undersampled genomic regions. All of these methods are mainly suitable for genomes with a low degree of repetitive sequences.
Repeats in genomes can lead to mishandling of errors in many ways. Nearly identical repeats can easily be mistaken to be sequencing errors. Even when errors are rare, an erroneous kmer may appear at a moderate frequency if it has few nucleotide differences from one or more valid kmers that have a high frequency of occurrence in the genome. The problem of detecting and correcting sequencing errors among reads in the presence of repeats has so far not been adequately addressed. Nevertheless, repeats are widely prevalent, even in some viral genomes such as N. meningitidis. Other genomes, like those of plants, are known for their high repeat content; for instance, an estimated 65-80% of the maize genome is spanned by repeats, which makes the assembly, mapping and error detection and correction tasks difficult. Although packages like FreClu [11] and Recount [15] could be potentially adapted to consider repeats, they are specifically designed for transcriptome data and correct read counts rather than identify and correct erroneous bases within reads. Moreover, insufficient replication of full length reads in genomic data prevents these methods from accurately estimating model parameters.
In this paper, we address the problem of identifying and correcting sequencing errors in short reads from genomes with different levels of repetition, particularly for reads produced by the widely used Illumina Genome Analyzer platform. Similar to existing approaches, we decompose the input reads into kmer substrings and count the number of times Y l each kmer x l occurs in the reads. However, instead of inferring erroneous kmers based on these observed occurrence frequencies [10,12], we developed a maximum likelihood estimate of the expected number T l of attempts to read x l , including both attempts that resulted in error-free reads and erroneous reads. In addition, we propose a new method to choose the threshold, which can be used to identify erroneous kmers as those x l 's for which T l 's are lower than the threshold. We demonstrate that using estimates of read attempts enables more accurate detection of sequencing errors than using observed frequencies for a wide choice of thresholds. We further develop an error correction method to transform erroneous bases in each read to the correct ones and compare the results with SHREC [12] and Reptile [13], two of the most recent error correction methods. The results demonstrate significant improvement in error correction capabilities for genomes with high repeat content. The proposed method is made available through the software package REDEEM (Read Error DEtection and Correction via Expectation Maximization) at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".

Methods
Let G denote the reference genome to be sequenced, and let R = {r 1 , r 2 ,…, r N } be the collection of resulting short reads. For simplicity, we assume each read has a fixed length L. The sequence coverage is where |G| is the genome length. Define the k-spectrum of a read r to be the set is the substring from position i to j in r. The k-spectrum produced by all the reads is Each kmer x l R k has a l occurrences in G and Y l instances observed in read set R. Define s G k the probability that a random k-length fragment in the genome is kmer x l . Occurrence frequency a l , or equivalently s l , is unobserved, but of paramount interest. Indeed, if we knew s l = 0, but observed Y l > 0, then we would know each observed instance of x l contains at least one misread base. Under the assumption that errors are rare, it makes sense to label kmers x l with Y l <M as errors, where M is chosen such that P(Y l <M | s l > 0) is reasonably small. Since s l is unknown, the threshold M is set ad hoc, based on training or simulated data [10], or analytical calculations [12,14] assuming the genome to be a random sequence and errors to be distributed uniformly in the reads. In practice, these assumptions do not hold true. Moreover, the problem of misread kmers contributing to the observed frequencies Y l (see Fig. 1) is exacerbated in repetitious genomes where kmers with high genomic occurrence may result in generation of the same misread multiple times [10].
We will develop a model that estimates the expected number of attempts T l to read each kmer x l . The threshold is then applied to each estimated T l instead of the corresponding observed Y l . The model we propose is similar to that of RECOUNT [15] used to correct next generation short read counts. Both models derive from a method originally meant to detect sequencing errors in SAGE libraries [16]. Our model differs from the previous models in that it works with kmers rather than full reads, since there is insufficient replication of full length reads in genomic data (as compared to transcriptome data). In addition, instead of assuming the misread bases to be drawn from {A, C, G, T} with equal probability, we propose a parametric error model that can be trained from the reads produced by the control lane (e.g.using the Illumina Genome Analyzer) in the same experiment. This strategy has already proven to be useful in several pioneering works [11,17]. In addition, we will show that the model is somewhat robust to incorrect assumptions in the underlying error model. A further step is to modify erroneous bases to their true forms in each read. This task has rarely been attempted previously for repetitive regions. We propose a method that utilizes transition probabilities and the contexual information of individual reads to achieve this goal. Like others, we ignore insertion and deletion errors assuming they are rarely produced by next-generation sequencing technology, which is true for reads from the Illumina Genome Analyzer [18].

Error model
The simplest error model posits that sequencing errors occur independently at all sites with constant probability p e . Let p e (x m ,x l ) be the probability that x m is misread as x l . This model produces symmetric misread probabilities: where d(·, ·) denotes the Hamming distance between two kmers. It is known, however, that short read technology produces errors with distinct patterns [18]. As a first approximation, we assume that errors strike sites in the kmer independently, but with varying probabilities. For example, we observe in dataset preparation section that errors cluster in the 3′ portion of reads and, consequently, kmers. Let q i (a,b) be the probability that nucleotide a at position i of a kmer is (mis)read as nucleotide b, with ∑ b q i (a, b) = 1. Then, the misread probability is  We now discuss some ways to reduce and simplify the calculations. We observe substitution errors are relatively rare, so misread kmers generally contain far fewer than k errors. Thus, when considering possible origins of a misread kmer, we can safely restrict our attention to kmers within some Hamming distance d max from the current kmer. Capping the maximum distance between kmers at d max induces a sparse P e , whose entries are normalized by dividing each row by the corresponding row sum. Finally, we ignore kmers that are not observed in the data (i.e. Y m = 0 or, equivalently, x m ∉ R k ), so the (incomplete) neighborhood of kmer x l , denoted by N l d max , is given as Failure to include unobserved kmers could bias estimation of a l by ignoring kmers actually present in G and capable of contributing to Y l . However, the bias cannot be large since a m must be small, otherwise Y m would not be zero.
After considering errors and applying the simplifications, the counts Y l follow a Multinomial distribution Y = (Y 1 ,…,Y| R k|)~Multinomial(N(Lk + 1),p), but unobserved kmers are ignored and the probability vector p = (p 1 ,p 2 …,p l …,p| R k|) depends on the kmer neighborhood. Specifically, where s = (s 1 ,…,s| R k |) is restricted to the set of observed kmers R k . It becomes clear that when x l is surrounded by highly repetitious x m with large s m , then Y l may exceed threshold M because of high misread occurrence with probability s m p e (x m ,x l ). Thus, when errors combine with repeats, it is more appropriate to apply a threshold to estimates of the parameters s l than observed Y l .
The observed log likelihood, l(s |Y), involves a mixture over the neighborhood (Fig. 1) of kmers that could be (mis) read as kmer x l , This setup lends itself to maximum likelihood estimation via the EM algorithm [19]. The update equations are adapted from [16] using a different error model and are given as follows: The expectations of hidden data Y lm obtained in the E step are Notice the estimated expected number of attempts to read kmer x l is T l = ŝ l N(Lk + 1), directly proportional to ŝ l and sitting on the same scale as Y l . In fact, by observing the E step is unchanged and the log likelihoood l(s |Y) is computed up to an additive constant when s l is replaced with T l , we use the EM algorithm to compute T l directly. For inference, we apply the threshold to estimates T = (T 1 ,…,T| R k|) rather than ŝ, to more easily compare our method with thresholding on Y. The algorithm is initialized by setting T l =Y l and iterating until the log likelihood converges.

Error detection and correction
Error detection, in practice, requires a method to choose a threshold M that minimizes the number of wrong decisions when classifying kmers as erroneous or not. We discuss a model-free method for estimating the threshold M in the Appendix, but no results presented in this paper use estimated thresholds.
To correct errors, consider each of the nucleotides in a read r. Each nucleotide appears in at least one and up to k kmers. Suppose the nucleotide at position 1 ≤ i ≤ L of the read appears at position 1 ≤ t ≤ k of kmer x l . The probability that the true nucleotide at position t was b prior to possible misread is

Dataset preparation
In order to test our model, we compiled various simulated and real datasets ( Table 1). The datasets are classified into the following types (Table 1, column 2). Type 1 are simulated Illumina reads from (a) synthetically constructed genomes embedded with various types of non-overlapping repeats, and (b) previously sequenced genomes known to be rich in repeats. Type 2 are actual Illumina reads from a previously sequenced genome with a low degree of repetition.

Reference genome preparation
The reference genomes of type 1(a) were initially generated using the nucleotide distribution of a piece of B73 maize genome (A: 28% C:23% G: 22% T: 27%). Then, repeat regions of different lengths and multiplicities (Table 1, column 6) derived from the same nucleotide distribution were embedded at random locations in these reference genomes. The reference genome N. meningitidis (NC_013016) of D4 is known to be a small, repeat rich, viral genome. The maize genome is known to contain up to 80% repeats and only the relatively unique regions have been fully assembled. Hence, we concatenated the first 20 contigs from Chromosome 1 of the B73 assembly, and removed all non-ACGT characters to form the reference genome of D5.

Short read preparation
The simulated Illumina reads (type 1) were produced by first estimating an error distribution from a real Illumina short read dataset, then simulating uniformly distributed reads of the reference genomes with these error rates. We used the RMAP software [9] to map Illumina data (Sequence Read Archive ID: SRX000429) to the reference genome E. coli str. K-12 allowing up to three mismatches. We were able to map 98.5% of reads; this percent is increased to 99.1% by allowing up to ten mismatches. However, allowing more mismatches increases the chance of a mismapped read since reads are only 36bp, and typically, mapping software can work at full sensitivity for up to two mismatches. Unmapped reads were discarded, and all remaining reads were assumed correctly mapped. By comparing the mapped reads to the reference genome, we estimated L 4 × 4 misread probability matrices M = (M 1 ,M 2 ,…,M L ), where L is the read length and each entry (a, b) (a, b {A, C, G, T}) in misread probability matrix M i (1 ≤ i ≤ L) is the probability a nucleotide a on the reference genome is (mis) read as b at position i in the read. This is calculated as the total number of times a is read to be b at position i among all mapped reads, divided by the number of times the corresponding position of the reference genome is a. Finally, we simulated Illumina sequencing to generate N reads by applying M to N uniformly distributed L-substrings in the reference genome.

Rationale
Simulated data are essential because highly repetitive genomic regions, for which our error model is designed, are often masked prior to assembly. Even when assembly can be done, accurate mapping of sequenced reads back to the assembly is difficult when genomes are repetitive [20]. Under these conditions, only simulation can provide unambiguous error information. Type 1(a) datasets were prepared such that they emulate repeat content ranging from a microbial genome with low repeats to a highly repetitive plant genome. However, to inject reality wherever possible, the reference genomes of Type 1(b) were selected from the previous assemblies. Lastly, the type 2 dataset demonstrated the applicability of our model to real, although non-repetitive, real read data.

Error detection and correction results
Our model accommodates sequencing errors via the misread probabilities p e (x m , x l ) between any two kmers x m and x l . To calculate p e (x m ,x l ), we need to specify the position specific misread probabilities, q i (α, b), 1 ≤ i ≤ k, a, b {A, C, G, T}, for each position of a kmer. Ideally, we would set q i (·, ·) to match the errors in the current dataset inferred from reads in the control lane [11,17]. When such information is not available, we can rely on information derived from other read data generated on the same platform. In the worst case, we can use the simple error model of Eq. (1), which only requires specification of the average error rate p e .
Based on these choices, we tested our datasets using four types of sequencing error (misread) distributions: tIED, wIED, tUED, and wUED (defined below). Our simulation procedure introduced errors according to the misread probability matrices M estimated from dataset SRX000429, so the true error distribution, tIED, was obtained by estimating q i (·, ·) from the same dataset SRX000429. The estimation procedure is similar to the '-' denotes the information that is not quantified; K: thousand; M: million.
one used for estimating M (defined in the previous section), except each read is decomposed into Lk + 1 kmers and the count of each type of misread nucleotide at each kmer position is determined. (Note, the same nucleotide contributes counts in up to k distinct kmers.) Since, the estimated q i (·, ·) represent fewer parameters than M, q i (·, ·) only approximates the true misread probability matrices M, which themselves only approximate true read errors. The wrong Illumina error distribution, wIED, is the situation encountered when Illumina data are only available from a different experiment (and often different lab). To emulate this case, we derived a second set of error probabilities q i (·, ·) from Illumina reads of Acinetobacter sp. ADP1 (Short Read Archive acc. SRX001814, 17.7M reads of 36bp length). The error rates differ at kmer position i = 11 (Table 2) and others (not shown) in the E. coli and A. sp. ADP1 short read datasets, demonstrating that wIED is indeed the wrong error distribution. Finally, in the absence of detailed error information, we can use the uniform error distribution with constant error probability p e . When the average error rate p e = 0.006 is estimated from dataset SRX000429, the error distribution is the true uniform error distribution (tUED). When the error rate is overestimated at p e = 0.02, above the published rate of 0.01-0.015 [21], it is the wrong uniform error distribution, wUED. The same measures as in [14] are used for evaluation, where a false positive (FP) denotes an error free kmer has been considered as erroneous and a false negative (FN) denotes an unidentified erroneous kmer. Table 3 reports the minimum number of wrong predictions (WPs), FP+FN, achieved by applying optimum thresholds on observed Y , used by existing methods, or by applying thresholds on the estimated number of attempts to read T, used in our method. The results of our method are shown for the four types of error distributions in columns tIED, wIED, tUED, and wUED. Bolded entries indicate where lower minima were achieved with our method compared to the standard method. Given the true error distribution, our method committed over 95% fewer WPs for all datasets except D6, where our method still managed 7% fewer WPs. Interestingly, using the wrong Illumina error distribution (column wIED) achieved at least 33% fewer WPs in all repetitive genomes except D4, where our wIED method performed about on par with applying the threshold on Y . The minimum WPs achieved by the true uniform error model tUED are two-to three-fold smaller than the corresponding values in column Y . However, using elevated error rate p e = 0.02 led to higher minimum WPs, except in dataset D3, the most highly repetitive simulated genome.
Even though we presented a method to choose the threshold value (see Appendix), it is not possible for any method to guarantee the optimal threshold. Ideally, the error detection methods should be relatively insensitive to choice of threshold. To compare methods across many thresholds, we plot log(FP + FN), with respect to a wide range of thresholds (Fig. 2). The plot in every case resembles a U-shape since many error kmers are missed (FN) when the threshold is too low and many correct kmers are declared errors (FP) when the threshold is too high. Our method achieved fewer WPs for datasets D1,D2,D3 and Maize with error distributions tIED, wIED, and tUED at all thresholds. The improvement in error detection increased with the degree of repetition, seen in simulations D1 to D3 and also in comparing N. meningitidinis and Maize. Our method tended to flatten the bottom of the U-shape such that a wider range of thresholds often beat even the minimum error obtained under Y thresholding. In all datasets, our method often shifted the U-shape leftward, such that for very small thresholds, our WPs were far less than the Y -based methods, regardless of the error distribution used. As the threshold increases, WPs for all methods eventually converge to the same constant, where all kmers are considered erroneous. For moderately large thresholds, our method sometimes resulted in higher WPs, especially for the wrong error distribution wUED, and sometimes wIED, and genomes with a low degree of repeats.
REDEEM misclassified the fewest kmers when using the "true" error model, but even in our simulations, there was a mismatch between the simulated errors and Table 2 Estimated error probabilities q i (·, ·), position i = 11 E. coli str. K-12 substr.
Acinetobacter sp. ADP1  the estimated "true" error model. The position-specific error probabilities used to compute kmer misread probabilities are not the true error probabilities that vary by position in the full length read. The difference is exacerbated as reads get longer relative to the kmer length.
Since it would be possible to compute misread probabilities p e (x m , x l ) using read-derived position probabilities, this mismatch between kmer and read errors can be eliminated with more sophisticated error models that account for the position of each kmer in the read. Since data to estimate the read error properties can be collected in parallel on a known, control genome, we contend that estimating the true error model is not an undue burden in practical applications [11,17]. Quality scores may also inform on errors [15] and could be incorporated in the REDEEM error model. As discussed previously, only simulated data with different degrees of repeats can be utilized to measure error correction results for repeat-rich genomic regions due to the fact that mapping short reads from such regions uniquely to the reference genome, and the assembly of genomes with high repeat content, remain open problems. We compare our correction results with SHREC [12] and Reptile [13] using datasets D1, D2 and D3 with increasing degrees of repeat content. The results are shown in Table 4. To be self-contained, we reproduce the evaluation measures from [13]: A True Positive (TP) is any erroneous base that is changed to the true base, a False Positive (FP) is any true base changed wrongly, a True Negative (TN) is any true base left unchanged, and a False Negative (FN) is any erroneous base left unchanged. Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP). Gain = (TP -FP) / (TP + FN) denotes the total percentage of erroneous bases removed from the dataset post-correction.
REDEEM is designed to specifically target error correction for repeat-rich genomes. While both SHREC and Reptile do not explicitly model the effect of repeats, the variable length suffixes captured by SHREC and different read decompositions explored by Reptile can provide richer and more precise information about errors. Currently, REDEEM does not utilize all such information. Therefore, in genomes with low repeat content, both SHREC and Reptile outperform. However, as the repetition within the genome increases, REDEEM significantly outperforms both methods by accounting for the repetition in the kmer neighborhood. Further experiments show that error correction results are affected mainly by the percentage of the length of the genome spanned by repeats, rather than repeat types and lengths. This can serve as a yardstick in deciding when to use REDEEM over conventional error correction. It is also possible to combine the features of a conventional error correction method such as Reptile with the explicit modeling of repeats as done in REDEEM to produce an error-correction method that is superior both when sampling low repeat and highly-repetitive genomes.
All experiments were carried out on 3.16GHz Intel Xeon Processors; run time and memory usage of all three programs are shown in the last two columns in Table 4. As expected, the run time of REDEEM is longer due to the complexity of modeling repeats explicitly. The largest simulation, D6, took 120 minutes and 9 GB. No error detection/correction method except naÃ¯ve thresholding on observed counts yet scales to practical next-generation applications, but REDEEM is at least comparable to existing, non-repeat-aware methods.

Conclusions
There have been some attempts to formally characterize repeats in genomes [22], but generally, the term "repeat" is used loosely in the literature, with meaning varying by context. In this paper, we consider kmer x l a repeat when its genomic occurrence a l is higher than what is expected in a random genomic sequence. Because genomes are not random, all genomes display some degree of repetition. Perhaps such cryptic repetition explains why we can achieve lower false prediction rates at optimal thresholds even on genomes like E. coli, which according to the I r measure of [22] is only somewhat repetitive.
In summary, we have presented a new method that improved error detection and correction when sampling repeat-rich genomic regions using next-generation sequencers. Important future work includes better models and algorithms to simultaneously estimate error parameters from the data, to consider variation in coverage along the genome, to speed up computations, and to handle larger datasets through better memory management.