Repeat-aware modeling and correction of short read errors
- Xiao Yang^{1},
- Srinivas Aluru^{1, 2} and
- Karin S Dorman^{3}Email author
https://doi.org/10.1186/1471-2105-12-S1-S52
© Yang et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of k mers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous k mer may be frequently observed if it has few nucleotide differences with valid k mers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content.
Results
We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k mers from their observed frequencies by analyzing the misread relationships among observed k mers. We also propose a method to estimate the threshold useful for validating k mers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.
Conclusions
We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
Keywords
Background
High throughput next generation sequencing has revolutionized genomics, making it possible to sequence new genomes or resequence individual genomes at a manifold cheaper cost and in an order of magnitude less time than earlier Sanger sequencing. With this technology, ambitious genome sequencing projects target many organisms rather than a few, and large scale studies of sequence variation become feasible [1]. Many next-geneneration sequencing technologies have been developed, including systems currently in wide use, such as the Illumina Genome Analyzer (earlier known as Solexa) and Applied Biosystems SOLiD, as well as more recent and new offerings from companies such as Complete Genomics and Pacific Biosciences [2]. Many next-generation sequencing systems produce short reads, e.g., the widely used Illumina Genome Analyzer systems typically produce 35 - 150bp reads. Short read technologies have been widely adopted for both genome sequencing and resequencing applications; hence, development of high quality short read assemblers (e.g., [3–7]) and short read mapping tools that map reads to a reference genome [8, 9] are important.
Short reads of novel genomes are typically assembled using de Bruijn graphs that represent observed k mers as nodes and length (k – 1) overlaps as edges. In the absence of errors, the size of such a graph is bounded by the length of the genome, but can be as high as 4^{ k } in the presence of errors. In the mapping process, a read with sequencing errors may map to multiple locations, or sometimes nowhere at all. Thus, error removal or correction is necessary to keep the size of the graph manageable [7, 10] and simplify non-repetitive read mapping [11].
Many approaches have been proposed to identify and sometimes correct sequencing errors in next-generation sequencing data. More recent ones include SAP (Spectral Alignment Problem)-based methods [4, 10], SHREC [12] and Reptile [13]. SAP-based methods identify any k mer occurring less than a constant, user-specified frequency threshold to be erroneous. Chin et al. [14] have shown that an optimum threshold can be derived analytically, assuming the genome to be a random sequence and that errors are independently and uniformly distributed in the reads. SHREC, a suffix trie based method, classifies any substring that occurs less often than an analytically calculated threshold to contain errors based on the same assumptions as in [14]. An erroneous base, identified as an infrequent branch of the suffix trie, is corrected to one of its siblings when applicable. Reptile explores read decompositions and makes corrections to any substring whenever an unambiguous choice can be made. In contrast to the previous two approaches, erroneous substrings are inferred based on assessing their frequencies relative to the frequencies of the alternative (error free) substrings. This accommodates under-sampled genomic regions. All of these methods are mainly suitable for genomes with a low degree of repetitive sequences.
Repeats in genomes can lead to mishandling of errors in many ways. Nearly identical repeats can easily be mistaken to be sequencing errors. Even when errors are rare, an erroneous k mer may appear at a moderate frequency if it has few nucleotide differences from one or more valid k mers that have a high frequency of occurrence in the genome. The problem of detecting and correcting sequencing errors among reads in the presence of repeats has so far not been adequately addressed. Nevertheless, repeats are widely prevalent, even in some viral genomes such as N. meningitidis. Other genomes, like those of plants, are known for their high repeat content; for instance, an estimated 65-80% of the maize genome is spanned by repeats, which makes the assembly, mapping and error detection and correction tasks difficult. Although packages like FreClu [11] and Recount [15] could be potentially adapted to consider repeats, they are specifically designed for transcriptome data and correct read counts rather than identify and correct erroneous bases within reads. Moreover, insufficient replication of full length reads in genomic data prevents these methods from accurately estimating model parameters.
In this paper, we address the problem of identifying and correcting sequencing errors in short reads from genomes with different levels of repetition, particularly for reads produced by the widely used Illumina Genome Analyzer platform. Similar to existing approaches, we decompose the input reads into k mer substrings and count the number of times Y_{ l } each k mer x_{ l } occurs in the reads. However, instead of inferring erroneous k mers based on these observed occurrence frequencies [10, 12], we developed a maximum likelihood estimate of the expected number T_{ l } of attempts to read x_{ l }, including both attempts that resulted in error-free reads and erroneous reads. In addition, we propose a new method to choose the threshold, which can be used to identify erroneous k mers as those x_{ l }’s for which T_{ l }’s are lower than the threshold. We demonstrate that using estimates of read attempts enables more accurate detection of sequencing errors than using observed frequencies for a wide choice of thresholds. We further develop an error correction method to transform erroneous bases in each read to the correct ones and compare the results with SHREC [12] and Reptile [13], two of the most recent error correction methods. The results demonstrate significant improvement in error correction capabilities for genomes with high repeat content. The proposed method is made available through the software package REDEEM (R ead E rror DE tection and Correction via E xpectation M aximization) at “http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.
Methods
Let G denote the reference genome to be sequenced, and let R = {r_{1}, r_{2},…, r_{ N }} be the collection of resulting short reads. For simplicity, we assume each read has a fixed length L. The sequence coverage is , where |G| is the genome length. Define the k-spectrum of a read r to be the set r^{ k } = {r[i : i + k – 1] | 0 ≤ i <L – k + 1}, where r[i : j] is the substring from position i to j in r. The k-spectrum produced by all the reads is .
We will develop a model that estimates the expected number of attempts T_{ l } to read each k mer x_{ l }. The threshold is then applied to each estimated T_{ l } instead of the corresponding observed Y_{ l }. The model we propose is similar to that of RECOUNT [15] used to correct next generation short read counts. Both models derive from a method originally meant to detect sequencing errors in SAGE libraries [16]. Our model differs from the previous models in that it works with k mers rather than full reads, since there is insufficient replication of full length reads in genomic data (as compared to transcriptome data). In addition, instead of assuming the misread bases to be drawn from {A, C, G, T} with equal probability, we propose a parametric error model that can be trained from the reads produced by the control lane (e.g.using the Illumina Genome Analyzer) in the same experiment. This strategy has already proven to be useful in several pioneering works [11, 17]. In addition, we will show that the model is somewhat robust to incorrect assumptions in the underlying error model. A further step is to modify erroneous bases to their true forms in each read. This task has rarely been attempted previously for repetitive regions. We propose a method that utilizes transition probabilities and the contexual information of individual reads to achieve this goal. Like others, we ignore insertion and deletion errors assuming they are rarely produced by next-generation sequencing technology, which is true for reads from the Illumina Genome Analyzer [18].
Error model
These misread probabilities are no longer symmetric, and can be arranged into a 4^{ k } × 4^{ k } matrix P_{ e }, where non-zero entries in the l th row identify all possible ways to (mis) read k mer x_{ l }.
We now discuss some ways to reduce and simplify the calculations. We observe substitution errors are relatively rare, so misread k mers generally contain far fewer than k errors. Thus, when considering possible origins of a misread k mer, we can safely restrict our attention to k mers within some Hamming distance d_{max} from the current k mer. Capping the maximum distance between k mers at d_{max} induces a sparse P_{ e }, whose entries are normalized by dividing each row by the corresponding row sum. Finally, we ignore k mers that are not observed in the data (i.e. Y_{ m } = 0 or, equivalently, x_{ m } ∉ R^{ k }), so the (incomplete) neighborhood of k mer x_{ l }, denoted by , is given as {x_{ m } ∈ R^{ k } : d(x_{ l },x_{ m }) ≤ d_{max}}. Failure to include unobserved k mers could bias estimation of α_{ l } by ignoring k mers actually present in G and capable of contributing to Y_{ l }. However, the bias cannot be large since α_{ m } must be small, otherwise Y_{ m } would not be zero.
After considering errors and applying the simplifications, the counts Y_{ l } follow a Multinomial distribution
Y = (Y_{1},…,Y_{ |R }^{ k }|) ~ Multinomial(N(L – k + 1),p),
where s = (s_{1},…,s_{ |R }^{k}|) is restricted to the set of observed k mers R^{ k }. It becomes clear that when x_{ l } is surrounded by highly repetitious x_{ m } with large s_{ m }, then Y_{ l } may exceed threshold M because of high misread occurrence with probability s_{ m }p_{ e }(x_{ m },x_{ l }). Thus, when errors combine with repeats, it is more appropriate to apply a threshold to estimates of the parameters s_{ l } than observed Y_{ l }.
This setup lends itself to maximum likelihood estimation via the EM algorithm [19]. The update equations are adapted from [16] using a different error model and are given as follows:
Notice the estimated expected number of attempts to read k mer x_{ l } is T_{ l } = ŝ_{ l }N(L – k + 1), directly proportional to ŝ_{ l } and sitting on the same scale as Y_{ l }. In fact, by observing the E step is unchanged and the log likelihoood l(s | Y) is computed up to an additive constant when s_{ l } is replaced with T_{ l }, we use the EM algorithm to compute T_{ l } directly. For inference, we apply the threshold to estimates T = (T_{1},…,T_{ |R }^{k}|) rather than ŝ, to more easily compare our method with thresholding on Y . The algorithm is initialized by setting T_{ l } =Y_{ l } and iterating until the log likelihood converges.
Error detection and correction
Error detection, in practice, requires a method to choose a threshold M that minimizes the number of wrong decisions when classifying k mers as erroneous or not. We discuss a model-free method for estimating the threshold M in the Appendix, but no results presented in this paper use estimated thresholds.
where estimates T_{ m } are substituted for the unknown α_{ m }. Since multiple overlapping k mers provide non-independent information about the base at position i, we average across available t to obtain distribution p_{ i }(b). If argmax_{ b }p_{ i }(b) ≠ r[i : i], then we declare nucleotide r[i : i] misread and correct it to argmax_{ b }p_{ i }(b). To limit computations, we apply this method to reads likely to contain at least one erroneous k mer, as identified with a liberal threshold M.
Results and discussion
Dataset preparation
Experimental datasets
Dataset | Type | Reference genome | Genome length | Repeats | Repeat Types (length, multiplicity) | C | Number of reads |
---|---|---|---|---|---|---|---|
D 1 | 1(a) | - | 1M | 20% | (1000, 200) | 80x | 2.2M |
D 2 | 1(a) | - | 1M | 50% | (500, 400), (1500, 200) | 80x | 2.2M |
D 3 | 1(a) | - | 1M | 80% | (500, 400), (1500, 200) | 80x | 2.2M |
(3000, 100) | |||||||
D 4 | 1(b) | N. meningitidis | 2.1M | - | - | 80x | 4.8M |
D 5 | 1(b) | Maize | 418K | - | - | 80x | 0.92M |
D 6 | 2 | E. coli | 4.6M | - | - | 160x | 20.7M |
Reference genome preparation
The reference genomes of type 1(a) were initially generated using the nucleotide distribution of a piece of B73 maize genome (A: 28% C:23% G: 22% T: 27%). Then, repeat regions of different lengths and multiplicities (Table 1, column 6) derived from the same nucleotide distribution were embedded at random locations in these reference genomes. The reference genome N. meningitidis (NC_013016) of D 4 is known to be a small, repeat rich, viral genome. The maize genome is known to contain up to 80% repeats and only the relatively unique regions have been fully assembled. Hence, we concatenated the first 20 contigs from Chromosome 1 of the B73 assembly, and removed all non-ACGT characters to form the reference genome of D 5.
Short read preparation
The simulated Illumina reads (type 1) were produced by first estimating an error distribution from a real Illumina short read dataset, then simulating uniformly distributed reads of the reference genomes with these error rates. We used the RMAP software [9] to map Illumina data (Sequence Read Archive ID: SRX000429) to the reference genome E. coli str. K-12 allowing up to three mismatches. We were able to map 98.5% of reads; this percent is increased to 99.1% by allowing up to ten mismatches. However, allowing more mismatches increases the chance of a mismapped read since reads are only 36bp, and typically, mapping software can work at full sensitivity for up to two mismatches. Unmapped reads were discarded, and all remaining reads were assumed correctly mapped. By comparing the mapped reads to the reference genome, we estimated L 4 × 4 misread probability matrices M = (M_{1},M_{2},…,M_{ L }), where L is the read length and each entry (α, β) (α, β ∈ {A, C, G, T}) in misread probability matrix M_{ i } (1 ≤ i ≤ L) is the probability a nucleotide α on the reference genome is (mis)read as β at position i in the read. This is calculated as the total number of times α is read to be β at position i among all mapped reads, divided by the number of times the corresponding position of the reference genome is α. Finally, we simulated Illumina sequencing to generate N reads by applying M to N uniformly distributed L-substrings in the reference genome.
Rationale
Simulated data are essential because highly repetitive genomic regions, for which our error model is designed, are often masked prior to assembly. Even when assembly can be done, accurate mapping of sequenced reads back to the assembly is difficult when genomes are repetitive [20]. Under these conditions, only simulation can provide unambiguous error information. Type 1(a) datasets were prepared such that they emulate repeat content ranging from a microbial genome with low repeats to a highly repetitive plant genome. However, to inject reality wherever possible, the reference genomes of Type 1(b) were selected from the previous assemblies. Lastly, the type 2 dataset demonstrated the applicability of our model to real, although non-repetitive, real read data.
Error detection and correction results
Our model accommodates sequencing errors via the misread probabilities p_{ e }(x_{ m }, x_{ l }) between any two k mers x_{ m } and x_{ l }. To calculate p_{ e }(x_{ m },x_{ l }), we need to specify the position specific misread probabilities, q_{ i }(α, β), 1 ≤ i ≤ k, α, β ∈ {A, C, G, T}, for each position of a k mer. Ideally, we would set q_{ i }(·, ·) to match the errors in the current dataset inferred from reads in the control lane [11, 17]. When such information is not available, we can rely on information derived from other read data generated on the same platform. In the worst case, we can use the simple error model of Eq. (1), which only requires specification of the average error rate p_{ e }.
Based on these choices, we tested our datasets using four types of sequencing error (misread) distributions: tIED, wIED, tUED, and wUED (defined below). Our simulation procedure introduced errors according to the misread probability matrices M estimated from dataset SRX000429, so the true error distribution, tIED, was obtained by estimating q_{ i }(·, ·) from the same dataset SRX000429. The estimation procedure is similar to the one used for estimating M (defined in the previous section), except each read is decomposed into L – k + 1 k mers and the count of each type of misread nucleotide at each k mer position is determined. (Note, the same nucleotide contributes counts in up to k distinct k mers.) Since, the estimated q_{ i }(·, ·) represent fewer parameters than M, q_{ i }(·, ·) only approximates the true misread probability matrices M, which themselves only approximate true read errors. The wrong Illumina error distribution, wIED, is the situation encountered when Illumina data are only available from a different experiment (and often different lab). To emulate this case, we derived a second set of error probabilities q_{ i }(·, ·) from Illumina reads of Acinetobacter sp. ADP1 (Short Read Archive acc. SRX001814, 17.7M reads of 36bp length). The error rates differ at k mer position i = 11 (Table 2) and others (not shown) in the E. coli and A. sp. ADP1 short read datasets, demonstrating that wIED is indeed the wrong error distribution. Finally, in the absence of detailed error information, we can use the uniform error distribution with constant error probability p_{ e }. When the average error rate p_{ e } = 0.006 is estimated from dataset SRX000429, the error distribution is the true uniform error distribution (tUED). When the error rate is over estimated at p_{ e } = 0.02, above the published rate of 0.01–0.015 [21], it is the wrong uniform error distribution, wUED.
The same measures as in [14] are used for evaluation, where a false positive (FP) denotes an error free k mer has been considered as erroneous and a false negative (FN) denotes an unidentified erroneous k mer. Table 3 reports the minimum number of wrong predictions (WPs), FP+FN, achieved by applying optimum thresholds on observed Y , used by existing methods, or by applying thresholds on the estimated number of attempts to read T, used in our method. The results of our method are shown for the four types of error distributions in columns tIED, wIED, tUED, and wUED. Bolded entries indicate where lower minima were achieved with our method compared to the standard method. Given the true error distribution, our method committed over 95% fewer WPs for all datasets except D 6, where our method still managed 7% fewer WPs. Interestingly, using the wrong Illumina error distribution (column wIED) achieved at least 33% fewer WPs in all repetitive genomes except D 4, where our wIED method performed about on par with applying the threshold on Y . The minimum WPs achieved by the true uniform error model tUED are two- to three-fold smaller than the corresponding values in column Y . However, using elevated error rate p_{ e } = 0.02 led to higher minimum WPs, except in dataset D 3, the most highly repetitive simulated genome.
Estimated error probabilities q_{ i }(·, ·), position i = 11
E. coli str. K-12 substr. | Acinetobacter sp. ADP1 | ||||||||
---|---|---|---|---|---|---|---|---|---|
×10^{–2} | A | C | G | T | ×10^{–2} | A | C | G | T |
A | 98.96 | 0.63 | 0.18 | 0.23 | A | 96.18 | 2.53 | 0.19 | 1.10 |
C | 0.15 | 99.60 | 0.10 | 0.15 | C | 0.20 | 99.32 | 0.08 | 0.40 |
G | 0.05 | 0.17 | 99.25 | 0.53 | G | 0.12 | 0.30 | 97.60 | 1.98 |
T | 0.05 | 0.19 | 0.18 | 99.58 | T | 0.09 | 0.18 | 0.13 | 99.60 |
A comparison of minimum error rates.
Data | Minimum (FP + FN) Value | ||||
---|---|---|---|---|---|
Y | tIED | wIED | tUED | wUED | |
D1 | 2212 | 18 | 984 | 1020 | 4648 |
D2 | 6392 | 23 | 1300 | 3150 | 6729 |
D3 | 6809 | 19 | 1300 | 2696 | 3124 |
D4 | 216 | 10 | 236 | 80 | 719 |
D5 | 552 | 14 | 373 | 297 | 1346 |
D6 | 14236 | 13275 | 13441 | 13671 | 18793 |
REDEEM misclassified the fewest k mers when using the “true” error model, but even in our simulations, there was a mismatch between the simulated errors and the estimated “true” error model. The position-specific error probabilities used to compute k mer misread probabilities are not the true error probabilities that vary by position in the full length read. The difference is exacerbated as reads get longer relative to the k mer length. Since it would be possible to compute misread probabilities p_{ e }(x_{ m }, x_{ l }) using read-derived position probabilities, this mismatch between k mer and read errors can be eliminated with more sophisticated error models that account for the position of each k mer in the read. Since data to estimate the read error properties can be collected in parallel on a known, control genome, we contend that estimating the true error model is not an undue burden in practical applications [11, 17]. Quality scores may also inform on errors [15] and could be incorporated in the REDEEM error model.
As discussed previously, only simulated data with different degrees of repeats can be utilized to measure error correction results for repeat-rich genomic regions due to the fact that mapping short reads from such regions uniquely to the reference genome, and the assembly of genomes with high repeat content, remain open problems. We compare our correction results with SHREC [12] and Reptile [13] using datasets D 1, D 2 and D 3 with increasing degrees of repeat content. The results are shown in Table 4. To be self-contained, we reproduce the evaluation measures from [13]: A True Positive (TP) is any erroneous base that is changed to the true base, a False Positive (FP) is any true base changed wrongly, a True Negative (TN) is any true base left unchanged, and a False Negative (FN) is any erroneous base left unchanged. Sensitivity = TP / (TP + FN) and Specificity = TN / (TN + FP). Gain = (TP - FP) / (TP + FN) denotes the total percentage of erroneous bases removed from the dataset post-correction.
Error correction results
Data | Method (d) | Sensitivity | Specificity | Gain | CPU Time(min) | Memory(GB) |
---|---|---|---|---|---|---|
D 1 | SHREC | 81.2% | 99.9% | 80.3% | 23.9 | 5.9 |
Reptile | 78.9% | 99.9% | 78.9% | 0.6 | 0.19 | |
REDEEM | 71.3% | 99.9% | 51.5% | 114.1 | 2.5 | |
D 2 | SHREC | 54.0% | 99.9% | 52.7% | 22.7 | 5.8 |
Reptile | 57.8% | 99.9% | 57.8% | 0.5 | 0.16 | |
REDEEM | 78.6% | 99.9% | 64.6% | 72.7 | 1.6 | |
D 3 | SHREC | 29.3% | 99.9% | 26.7% | 21.7 | 5.8 |
Reptile | 46.8% | 99.9% | 46.8% | 0.5 | 0.13 | |
REDEEM | 86.4% | 99.9% | 79.4% | 31.2 | 0.63 |
All experiments were carried out on 3.16GHz Intel Xeon Processors; run time and memory usage of all three programs are shown in the last two columns in Table 4. As expected, the run time of REDEEM is longer due to the complexity of modeling repeats explicitly. The largest simulation, D6, took 120 minutes and 9 GB. No error detection/correction method except naÃ¯ve thresholding on observed counts yet scales to practical next-generation applications, but REDEEM is at least comparable to existing, non-repeat-aware methods.
Conclusions
There have been some attempts to formally characterize repeats in genomes [22], but generally, the term “repeat” is used loosely in the literature, with meaning varying by context. In this paper, we consider k mer x_{ l } a repeat when its genomic occurrence α_{ l } is higher than what is expected in a random genomic sequence. Because genomes are not random, all genomes display some degree of repetition. Perhaps such cryptic repetition explains why we can achieve lower false prediction rates at optimal thresholds even on genomes like E. coli, which according to the I_{ r } measure of [22] is only somewhat repetitive.
In summary, we have presented a new method that improved error detection and correction when sampling repeat-rich genomic regions using next-generation sequencers. Important future work includes better models and algorithms to simultaneously estimate error parameters from the data, to consider variation in coverage along the genome, to speed up computations, and to handle larger datasets through better memory management.
Competing interets
The authors declare that they have no competing interests.
Authors contributions
XY and KD developed and implemented the model and algorithmic solutions. SA helped conceive of and coordinated the study. All authors contributed to the development and revision of the manuscript.
Declarations
Acknowledgements
This research is supported in part by NSF CCF-0811804.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
Authors’ Affiliations
References
- Stratton M: Genome resequencing and genetic variation. Nature Biotechnology 2008, 26(1):65–66.View ArticlePubMedGoogle Scholar
- Perkel JM: Sanger Who? Sequencing the Next Generation. Science 2009, 10: 275–279.View ArticleGoogle Scholar
- Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810–820.PubMed CentralView ArticlePubMedGoogle Scholar
- Chaisson M, Pevzner P: Short read fragment assembly of bacterial genomes. Genome Research 2008, 18(2):324–330.PubMed CentralView ArticlePubMedGoogle Scholar
- Jackson B, Regennitter M, Yang X, Schnable P, Aluru S: Parallel de novo Assembly of Large Genomes from High-Throughput Short Reads. 24th IEEE International Parallel & Distributed Processing Symposium 2010, 1–10.Google Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Research 2009, 19(6):1117–1123.PubMed CentralView ArticlePubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 2008, 18(5):821–829.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008, 9: 128–135.PubMed CentralView ArticlePubMedGoogle Scholar
- Chaisson M, Pevzner P, Tang H: Fragment assembly with short reads. Bioinformatics 2004, 20(13):2067–2074.View ArticlePubMedGoogle Scholar
- Qu W, Hashimoto S, Morishita S: Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. Genome Research 2009, 19(7):1309–15.PubMed CentralView ArticlePubMedGoogle Scholar
- Schröder J, Schröder H, Puglisi SJ, Sinha R, Schmidt B: SHREC: a short-read error correction method. Bioinformatics 2009, 25(17):2157–2163.View ArticlePubMedGoogle Scholar
- Yang X, Dorman KS, Aluru S: Reptile: Representative tiling for short read error correction. Bioinformatics 2010, 26(20):2526–2533.View ArticlePubMedGoogle Scholar
- Chin FYL, Leung HCM, Li WL, Yiu SM: Finding optimal threshold for correction error reads in DNA assembling. BMC Bioinformatics 2009, 10(Suppl 1):S15.PubMed CentralView ArticlePubMedGoogle Scholar
- Wijaya E, Frith MC, Suzuki Y, Horton P: Recount: expectation maximization based error correction tool for next generation sequencing data. Genome Informatics 2009, 23(1):189–201.PubMedGoogle Scholar
- Beissbarth T, Hyde L, Smyth GK, Job C, Boon WM, Tan SS, Scott HS, Speed TP: Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics 2004, 20(Suppl 1):i31-i39.View ArticlePubMedGoogle Scholar
- Weese D, Emde AK, Rausch T, Doring A, Reinert K, et al.: RazerS–fast read mapping with sensitivity control. Genome Research 2009, 19(9):1646–1654.PubMed CentralView ArticlePubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research 2008, 36(16):e105.PubMed CentralView ArticlePubMedGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39: 1–38.Google Scholar
- Zhi D, Keich U, Pevzner P, Heber S, Tang H: Correcting base-assignment errors in repeat regions of shotgun assembly. IEEE/ACM Trans Comput Biol Bioinform 2007, 4(1):54–64.View ArticlePubMedGoogle Scholar
- Shendure J, Ji H: Next-generation DNA sequencing. Nature Biotechnology 2008, 26(10):1135–1145.View ArticlePubMedGoogle Scholar
- Haubold B, Wiehe T: How repetitive are genomes? BMC Bioinformatics 2006, 7: 541.PubMed CentralView ArticlePubMedGoogle Scholar
- Schwarz G: Estimating the Dimension of a Model. The Annals of Statistics 1978, 6(2):461–464.View ArticleGoogle Scholar
- McCullagh P, Nelder JA: Generalized Linear Models. 2nd edition. New York: Chapman & Hall; 1989.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.