Discovering motifs that induce sequencing errors
© Allhoff et al.; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
Skip to main content
© Allhoff et al.; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
Elevated sequencing error rates are the most predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific sequence patterns. Statistically principled ways to associate sequence patterns with base calling errors have not been previously described. Extant approaches either incur decisive losses in power, due to relating errors with individual genomic positions rather than motifs, or do not properly distinguish between motif-induced and sequence-unspecific sources of errors.
Here, for the first time, we describe a statistically rigorous framework for the discovery of motifs that induce sequencing errors. We apply our method to several datasets from Illumina GA IIx, HiSeq 2000, and MiSeq sequencers. We confirm previously known error-causing sequence contexts and report new more specific ones.
Checking for error-inducing motifs should be included into SNP calling pipelines to avoid false positives. To facilitate filtering of sets of putative SNPs, we provide tracks of error-prone genomic positions (in BED format).
Next-generation sequencing (NGS) technologies have tremendously influenced biomedical research. Thanks to its high speed and low cost, NGS has facilitated projects [1, 2] that are based on tens of terabytes of sequencing data. Exome sequencing , which has been used in hundreds of studies , is even more cost-effective, as it limits itself to the medically most relevant coding regions. A major focus in most NGS-based studies are single-nucleotide polymorphisms (SNPs), many of which can be associated with phenotypic traits or diseases.
For all NGS platforms, cost efficiency and higher throughput come at the price of higher sequencing error rates. Beyond random miscalls, there are also systematic sources of errors. Since any base-calling error can be mistaken for a SNP, correcting for a maximum amount of (whatever kind of) sequencing errors is vital. In this study, we focus on a class of such errors that are characteristic for Illumina sequencing platforms . On Illumina platforms, sequencing proceeds in cycles, where, in a rough sketch, during the i-th cycle, the i-th base of the fragment is read. Cycling can be confounded by various factors, which leads to dephasing: sequencing of partial amounts of fragments lags behind the overall sequencing procedure. Indeed, it is routine for Illumina sequencers to correct for mistakenly calling the (i + 1)-th or (i - 1)-th base in the i-th cycle. Beyond this, higher error rates close to the end of reads  and an increase in miscalls in GC-rich regions  have been observed. Both of these error sources are likely to be related to dephasing. In the meantime, they have become common knowledge and are routinely handled.
In this work, we focus on a third kind of error, which, to date, has not yet undergone much principled investigation. In an initial study , it was pointed out that dephasing is also likely to be associated with specific sequence patterns. For example, large amounts of miscalls followed the nucleotide motif GGC and inverted repeats. It was hypothesized that dephasing may be due to backfolding of DNA (inverted repeats) and/or sequence-specific, preferential inhibition of RNA polymerase binding. Both phenomena are plausible and relate specific sequence patterns with delays in the cycling procedure. However, no rigorous framework for detecting those sequence patterns was provided in . To overcome this limitation, Meacham et al.  developed a statistically principled framework, which combines hypothesis testing with logistic regression based classification. As an example, they report that the most common error-associated sequence context is GGT. The integration of (Phred-score based) read error profiles into the framework yields a gain in statistical power (recall) for detecting miscalls. However, it also leads to detection of sequence-unspecific error positions.
Meacham et al.  also report substantial loss of power when considering statistics on position-wise read directionality, which motivates their above-mentioned integration of position-wise error profiles. This, however, while attaining good performance rates, also leads to potential confusion with sequence-unspecific errors, such as certain batch errors. In summary, none of the previous approaches provides a rigorous, statistically consistent framework to reveal sequence motifs as the very reason for base call errors. While  exclusively focus on CSEs, their approach for CSE detection is ad-hoc and, in fact, they report on suboptimal performance rates when distinguishing between true SNPs and CSEs. The GATK pipeline acts in a statistically principled manner, but has too little power where read coverage is low.
In this article, we propose a statistically principled procedure for CSE detection at maximum power. Our main idea is to pool genomic positions according to contexts. When screening pools for significant strand bias, we associate base calling errors with contexts, instead of single positions. This direct association is not only helpful in detecting relevant CSE-related sequences, but can also compensate for too low overall coverage, because pooling keeps statistical power at a high level. The detected motifs can serve for spotting error-prone positions already before the read mapping step. Last but not least, the motifs identified may also yield further insights into sequencing technology itself. Note also that our approach is generic and can be applied to any CSE-prone sequencing technology.
The article is organized as follows. In the Preliminaries section, we summarize a common SNP calling approach together with the statistical foundations to quantify strand bias. In the Algorithm section, we introduce our motif-based context discovery algorithm. We report the discovered sequence contexts in datasets from different Illumina sequencing platforms in the Results section, where we also investigate whether known SNPs (from dbSNP) are related to our discovered contexts. After that, we present a concluding discussion.
SNPs are the most common kind of DNA mutation identified by the 1000 Genomes Project [1, 12]. In diploid organisms, we distinguish between two types of SNPs. A homozygous SNP differs from a reference genome in both alleles, while a heterozygous SNP only differs in one allele. CSEs may be mistaken for heterozygous SNPs (if strand bias is not taken into account), as on average half of the reads differ from the reference.
We now briefly review a typical SNP calling pipeline. Read mappers usually align reads one by one, so they cannot make use of the information available through the set of all reads mapping to a locus. Therefore, tools such as the GATK [10, 11] provide a local realignment algorithm for reads close to insertions and deletions, avoiding many wrong indel calls. Duplicates, i.e. reads which are derived from a single DNA fragment, are removed, since they do not give any additional evidence for or against a mutation and should not be considered independent in downstream analyses. The nucleotide (phred) quality scores given by the NGS device are recalibrated based on empirical probabilities. None of these steps is intended to detect or correct context-specific errors. We also remark that more accurate base-calling algorithms do not prevent CSEs .
2 × 2 contingency table
Fisher's exact test computes a p-value from a 2 × 2 contingency table (Table 1) to decide whether the two data characteristics "read direction" and "mismatch fraction" are independent, which is equivalent to testing whether the rows have the same distribution. If this is the case, then there is no evidence for strand bias.
where the "extreme" values of a' are from the set . If the p-value is sufficiently low, we reject the null hypothesis, meaning we assume that the two rows were not sampled from the same distribution. The quantity - log10 can be considered as a quantitative measure of strand bias and is called the strand bias score.
Fisher's exact test is computationally expensive for tables with large entries, but can be replaced by a χ2 test in this case .
When many statistical tests are performed, the expected number of false positives can also be large. There are many strategies to deal with such situations of multiple hypotheses testing. One popular approach, for instance, is to control the false discovery rate (FDR) as advocated by Benjamini and Hochberg . Another option is to control the family-wise error rate (FWER) by means of a Bonferroni correction, that is, controlling the probability of making at least one type I error among all tested hypotheses, which is more conservative than controlling the FDR (at the same level). In our case, p-values of significant motifs are very low due to the large amounts of available NGS data and, in particular, due to our pooling strategy. Therefore, we can opt for the more conservative Bonferroni correction without losing many significant motifs.
We analyse the power of Fisher's exact test by a sampling procedure, resulting in Figure 2.
To this end, we pick an error probability e (color-coded in Figure 2, between 0.1 and 0.9) and a coverage n (x-axis in Figure 2, between 2 and 200 with a step size of 2) for a simulation. We assume that there is a constant low background error probability unrelated to CSEs, here set to ε = 0.01.
We draw n/2 samples to obtain the "forward" row of Table 1 using a mismatch probability of e; that is, on average, we obtain a = (1 - e)n/2 and b = en/2. We draw further n/2 samples (for a total coverage of n) to obtain the "backward" row of the table using a mismatch probability of ε; i.e., on average, c = (1 - ε)n/2 and d = εn/2. This restricts the sampled tables to those with equal row sums, which is sufficient for our illustration.
We perform Fisher's exact test and record whether we reject the null hypothesis (as we should, since e ≫ ε) at α = 0.05/(5 · 107), a Bonferroni-corrected test level for the human exome. We repeat the sampling experiment T = 3000 times. The empirical power is the fraction of times that we reject the null hypothesis out of these T repetitions.
Figure 2 clearly shows that at low coverages, it is almost impossible to detect significant strand bias. The power curves are improved by choosing less conservative thresholds, but in the end, only high coverage guarantees statistical power.
the i-th term of the sum giving the number of motifs with i Ns. Conceptually, we want to perform one strand bias test as described in the Preliminaries section for each motif in the motif space , omitting motifs that do not occur in the given reference genome.
For a given motif m, we locate all occurrences of m in the reference genome and its reverse complement. From pileups, we costruct an aggregated contingency table, whose entries we name a, b, c, d as in Table 1. This is in principle straightforward, but some care is required to keep track of the two genome strands and both read directions. The pileup position (i.e., the position of interest) is defined to be the last position of the motif. This is not a severe restriction when a sufficient number of wildcards is allowed.
A match occurs at position i if either the read is an F-read and the nucleotide of the forward reference at position i equals the aligned read nucleotide, or if the read is an R-read and the forward reference at position i equals the complement of the aligned read nucleotide. In all other cases, a mismatch occurs. As a convention, in a pileup, the nucleotides of R-reads have been already complemented, so a pileup column can always be compared to the forward reference. At the single R-position in Figure 3, while the pileup indicates an A → G mismatch in many of the R-reads, this is in fact technically a T → C mismatch because of this convention. All three positions shown in this hypothetical example would therefore consistently indicate strand bias and hence a CSE (more precisely, a consistent T → C miscall after CCAGAC).
We compute the motif's contingency table entries as follows:
Initialize a = b = c = d = 0.
For every F-position of motif m, we obtain a pileup, and we increment a by the number of matching F-reads, b by the number of mismatching F-reads, c by the number of matching R-reads, and d by the number of mismatching R-reads in each pileup.
For every R-position of motif m, we obtain a pileup, and we increment a by the number of matching R-reads, b by the number of mismatching R-reads, c by the number of matching F-reads, and d by the number of mismatching F-reads in each pileup.
Note that in some cases, a genome position may be both an F-position and an R-position, such as for the motif GGN and the reference ...GGACC..., where this occurs at the A.
We now describe our algorithm to discover contexts inducing CSEs. The input to the algorithm consists of a reference genome, a collection of aligned reads, a motif length q, a maximal wildcard number n, an FWER threshold α >0, a background error rate threshold ε >0 and an error rate difference (ERD) cutoff δ >0. Typical values are q ≤ 10, n ≤ q/2, α = 0.05, ε = 0.03, δ = 0.05.
We first compute the Bonferroni threshold . By a single linear pass through the reference genome (computing its reverse complement locally on the fly), we incrementally compute the contingency tables of each q-gram (a DNA sequence of length q without wildcards). All contingency tables are stored in a hash map; the q-gram sequence is mapped to its contingency table. For each motif , we now obtain the contingency table of m by adding the tables of all q-grams that match m. We apply Fisher's test and compute the strand bias p-value and score and deem all motifs with a p-value lower than or equal to the Bonferroni threshold T significant.
For each significant motif, we compute the forward error rate FER = b/(a + b) and the reverse error rate RER = d/(c + d) in terms of the contingency table (Table 1). We remove motifs with RER ≥ ε (i.e., above the typical background error rate threshold). For the remaining motifs, we compute the error rate difference ERD = FER - RER. We remove those motifs with too small ERD to be of interest, i.e., those with ERD < δ.
We sort the remaining motifs by decreasing ERD and report them with their properties. We opted for sorting the motifs by ERD (instead of by strand bias score), because a higher score does not imply that the motif is more likely to cause sequencing errors. A higher score (i.e., lower p-value) may simply reflect that the motif occurs more often in the genome or is covered by more reads, which grants higher statistical power and lower p-values.
Note that some of the best motifs (with highest ERD) may be similar to each other and do not provide fundamentally new information. We choose to report all motifs initially, as the resulting list may be postprocessed by comparing motifs depending on the situation.
We expect the discovered motifs to be platform-dependent rather than genome-dependent and, thus, to also cause CSEs when sequencing another genome. The power to detect CSE-causing motifs depends on the number of their occurrences in the used genomes and we might thus miss motifs for MiSeq that are not frequent in E. coli.
We use BWA version 0.5.9-r18-dev with standard parameters for read mapping . As described in the Preliminaries section, SNP discovery pipelines spend considerable computational effort on cleaning up the initial mapping by re-aligning reads around gaps, removing duplicates, and recalibrating quality values. We examine how strongly these steps influence the discovered contexts. As we report in section "Effects of postprocessing", we find that they do not strongly influence the results from a qualitative point of view. Therefore, the results reported in the following section are based on read mapping only, without additional postprocessing steps.
Our context discovery algorithm is run with different parameter settings on the motif length q and number of wildcards n; to assess a broad range of possible motifs we analyse all datasets with the two combinations (q, n) = (8, 4) and (q, n) = (4, 1). The setting (4, 1) was chosen to compare our findings with previously reported motifs which are frequently of type (3, 0), while the setting (8, 4) aims at examining whether there exist more specific motifs that cause CSEs. These two resulting motif spaces are quite different in size, and , and thus resemble different trade-offs between flexibility and excessive multiple testing. We also applied our algorithm using (q, n) = (10, 2). While the corresponding results did not undergo a thorough analysis, we display some of the highlights below and provide the full lists as additional files 6, 7, 8, 9.
Overview of filter settings
Number of motifs per dataset
α = 0.05
ε = 0.03
δ = 0.01
α = 0.05
ε = 0.03
δ = 0.1
A selection of CSE-causing motifs
No motifs passed filter
We included motif space (4,1) to compare our results to previous findings. As discussed in the introduction, the 3-grams GGC and GGT have been reported to be linked to CSEs. Table 4 shows that we indeed find strong evidence for a significantly biased error distribution at GGT sites for datasets GAIIx-bs, HiSeq-hg, and MiSeq-ec. However, the observed difference in error rates at such sites is quite low, ranging from 1.2 to 2.0 percent, and therefore these motifs alone will most likely not disrupt SNP calling. The motif GGC does not appear in the (4,1)-results. By looking at the (8,4)-results, we see that GGC is associated to CSEs, but usually appears 4 base pairs before the first error site. Our analysis also reveals the motif CGGG that appears in the result list for HiSeq-hg and MiSeq-ec.
The observed ERD values for (8,4)-motifs are consistently larger than for (4,1)-motifs by approximately one order of magnitude. This shows that our algorithm is able to discover longer, more specific and informative motifs. For many motifs, especially for the HiSeq and MiSeq platforms, ERD is around 20 percent. Motif NGGCGGGT, for instance, leads to forward errors rates of 20.7 percent on HiSeq instruments, while the reverse error rate is 0.7, which is a normal value for this platform. This particular motif was discovered in all four datasets. In general, the found motifs were quite consistent across platforms and datasets, showing that, on the one hand, Illumina sequencers share common characteristics and, on the other hand, that our algorithm robustly detects CSE-causing motifs. Such specific high-ERD CSE-causing motifs have never been reported in the literature before.
While we do not analyse motifs from (10,2) in detail here, it is notable that some of such motifs discovered in the HiSeq-hg dataset have a FER larger than 0.5, implying that base calls at corresponding positions are more likely to be erroneous than to be correct. These motifs may serve as points of entrance for examining machine protocols. See additional files 6, 7, 8, 9 for full lists of all (10,2)-motifs for all four datasets.
Top 10 discovered motifs after alignment postprocessing
Today, detecting SNPs is routinely done and millions of SNPs have been collected in databases like db-SNP . In the following, we discuss to what extend SNP calling is influenced by error-causing motifs. To identify known SNPs that might be difficult to call using Illumina technology, we pool all (8,4)-motifs discovered in the four datasets. The resulting 91 motifs yield 6 622 827 putatively error-prone positions on the human genome (0.21% of all positions). Of these, 82 684 are co-located with SNPs in dbSNP build 137; that means, 0.29% of all 28 440 783 SNPs in the database lie at error-prone positions. Although the fraction is small, an absolute number of 82 684 SNPs that are difficult to genotype using Illumina devices is still remarkable. The difference between 0.21% of all genomic positions and 0.29% of all SNPs is small, but it is statistically significant (p <10 - 15 according to a χ 2-test). However, whether this difference is due to false positives caused by CSEs or due to other effects remains open.
As a next step, we called SNPs in the two human datasets GAIIx-hg and HiSeq-hg to test whether they are enriched for CSE-prone positions. For SNP calling, we used the GATK's UnifiedGenotyper with default parameters. Datasets GAIIx-hg and HiSeq-hg yielded 2 525 553 and 2 609 149 SNPs, respectively; out of these, 9 126 (0.36%) and 14 844 (0.57%) were found at CSE-prone positions. Thus, the fractions of SNPs at such positions are clearly higher than the corresponding fraction in dbSNP, which might indicate that the set of called SNPs does indeed contain false positives that are due to CSEs.
We have presented an algorithm to identify sequence contexts that cause context-specific sequencing errors (CSEs). In contrast to previous approaches, which detect positions with strand bias and then report on common sequence motifs at the identified positions, we start from the motifs and aggregate information at matching positions, which grants much higher statistical power. Our approach is thus able to integrate many weak but consistent positional signals. Allowing wildcards within the sequence motifs grants additional flexibility, e.g. the motifs with (n, q) = (4, 1) will also discover contexts of length 3.
Our approach is the first motif-based CSE discovery method. We confirm previously reported error-prone sequence contexts [8, 9] but also find much more informative motifs with an ERD higher by one order of magnitude. This allows to exactly pinpoint problematic positions, while the previously known short contexts GGT and GGC do not reliably predict strongly CSE-prone positions. The approach is also robust in the sense that extracted motifs were closely related across datasets.
The practical significance of error context discovery lies in the fact that an increasing number of exome sequencing studies to identify genetic causes of Medelian diseases and genome-wide association studies depend on reliable SNP calling. Our work can be integrated as an additional step into SNP calling pipelines, down-weighting proposed SNPs at known error contexts for the platform, independently of the coverage and strand bias in the particular dataset under investigation. To facilitate filtering of SNPs, we provide platform-specific annotation tracks (in BED format) with positions in the human genome matching discovered contexts. Our implementation is available under the terms of the GNU General Public License. Annotation tracks and source code can be obtained from the URL given in the Abstract.
We plan to systematically compare discovered contexts on more datasets from different organisms, sequenced on the same platform and with more parameter combinations (q, n) and with additional IUPAC wildcards (beyond N) to quantify the robustness of motif-based approaches comprehensively. We will also extend our algorithm to contexts on both sides of error-prone positions, with a special emphasis on inverted repeats. Furthermore, we plan to provide error contexts and annotation tracks for other (non-Illumina) sequencing platforms.
SR was partially funded by DFG Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Data Analysis" subproject TB1. MA was partially funded by EU COST action BM1006 "Next Generation Sequencing Data Analysis Network". MA and IGC were partially funded by the Excellence Initiative of the German federal and state governments and the German Research Foundation through Grant GSC 111 and IZKF Aachen (Interdisciplinary Centre for Clinical Research within the faculty of Medicine at RWTH Aachen University).
Publication of this article was funded by RWTH Aachen and CWI Amsterdam.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.