- Proceedings
- Open Access
- Published:

# CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing

*BMC Bioinformatics*
**volume 15**, Article number: S10 (2014)

## Abstract

Merging the forward and reverse reads from paired-end sequencing is a critical task that can significantly improve the performance of downstream tasks, such as genome assembly and mapping, by providing them with virtually elongated reads. However, due to the inherent limitations of most paired-end sequencers, the chance of observing erroneous bases grows rapidly as the end of a read is approached, which becomes a critical hurdle for accurately merging paired-end reads. Although there exist several sophisticated approaches to this problem, their performance in terms of quality of merging often remains unsatisfactory. To address this issue, here we present a **c** ontext-**a** ware scheme for **p** aired-**e** nd **r** eads (CASPER): a computational method to rapidly and robustly merge overlapping paired-end reads. Being particularly well suited to amplicon sequencing applications, CASPER is thoroughly tested with both simulated and real high-throughput amplicon sequencing data. According to our experimental results, CASPER significantly outperforms existing state-of-the art paired-end merging tools in terms of accuracy and robustness. CASPER also exploits the parallelism in the task of paired-end merging and effectively speeds up by multithreading. CASPER is freely available for academic use at http://best.snu.ac.kr/casper.

## Introduction

The advent and widespread use of next-generation sequencing (NGS) [1–3] has posed new challenges and opportunities for informatics [4, 5] due to the high-throughput nature and the relatively short and noisy reads compared to the traditional Sanger sequencing. NGS thus sparked the development of new pipelines (for base-calling, genome assembly/mapping, and other essential tasks) that consider the characteristics of the NGS platforms used.

In paired-end sequencing, a DNA fragment is read from either end of the fragment, and some NGS platforms (such as Illumina HiSeq, MiSeq, and GAIIx) inherently generate paired-end reads [2, 3]. When a fragment is larger than the sum of the forward and reverse reads, there exists a gap between the two reads [6]. Otherwise, the forward and reverse reads overlap, which can ideally give the effect of elongating reads. Having longer reads provides many benefits to downstream tasks in the informatics pipeline [7, 6, 8]. The read length of today's sequencers tends to continuously increase: *e.g*., at the time of writing, Illumina MiSeq can produce 2 × 300 bp reads using its reagent kit v3 (http://www.illumina.com).

However, NGS techniques including the Illumina platform tend to result in rapid degradation of the sequencing quality as the end of a read is approached (Figure 1). As a result, the overlapping region (formed by the ends of forward and reverse reads) in a paired-end read frequently contains errors originating from sequencing and/or base-calling. Sequencing results are often annotated with per-base *quality scores* representing the error probability [9].

Such limitations of sequencing technology often prevent the accurate merging of paired-end reads, so the pursuit of new tools for the reliable merging of overlapping paired-end reads has become an active area of research, *e.g*., SHERA [7], FLASH [6], PANDAseq [10], and COPE [11]. Most of these tools start by finding the best overlap between a pair of forward and reverse reads and then try to merge them by resolving mismatching bases in the overlap. The best overlap is sought by considering the overlap alignment [7], the fraction of (mis)matching bases [6, 11], or quality scores [10, 11]. The mismatch resolution is mostly achieved by considering quality scores [6, 10] and replacing the base with the lower quality score by the base with the higher quality score. Quality-score-based resolution often produces incorrect results, especially when the quality scores of mismatching bases do not differ significantly [12]. It was proposed to use quality scores and *k*-mer frequency together as the merge criterion [11], but the resulting methodology tends to be time-consuming and of unsatisfactory accuracy according to our experience. Due to the importance of paired-end merging, some sequence assemblers (*e.g*., ALLPATHS-LG [13]) contain a module for merging paired-end reads as a preprocessing stage, although the performance, flexibility, and applicability of such internal modules tends to be limited compared to the aforementioned methods dedicated to paired-end merging.

To overcome the limitations of the current approaches to merging paired-end reads from amplicon sequencing, here we propose a computational method called *context-aware scheme for paired-end reads* (CASPER). In this scheme, when the difference between the quality scores of mismatching bases is significant, CASPER relies on the quality scores for correction. If not, CASPER instead examines *k*-mer-based contexts around the mismatch and makes a statistical decision (up to *k* partial decisions for each mismatch). CASPER then makes a final decision based on the ensemble of the earlier partial decisions. According to our experiments, CASPER significantly outperforms the existing approaches in terms of accuracy of merging and resilience to noise. Furthermore, the time demand of CASPER remains reasonable in most cases, taking only a few tens of seconds to process one million reads. CASPER is freely available for academic use at http://best.snu.ac.kr/casper.

## Proposed method

Figure 2 shows the overall flow of the proposed CASPER approach that consists of five main steps: (1) preprocessing, (2) constructing a table of *k*-mer counts, (3) finding the best overlap position, (4) resolving mismatches in the overlap, and (5) merging forward and reverse reads. Before explaining the details of each step, we present the assumptions and definitions to be used in the explanation.

In paired-end sequencing, the sequencer produces two reads for each DNA fragment. Suppose that the fragment length is *m* (bases), and the lengths of forward and reverse reads are both *n*. An overlap between the two reads occurs when *m <* 2*n*. The sequence of a fragment is denoted by *S* = 〈*s*_{1}*, s*_{2}*, . . . , s*_{
m
}〉, *s*_{
i
} ϵ {A, C, G, T}. We denote the forward and reverse reads from *S* by *X* = 〈*x*_{1}*, x*_{2}*, . . . , x*_{
n
}〉 and *W* = 〈*w*_{1}*, w*_{2}*, . . . , w*_{
n
}〉, respectively. The *i*-th base in *X* (*i.e*., *x*_{
i
}) is also denoted by *X*(*i*), and the subsequence of *X* ranging from the *i*-th to *j*-th bases (*i.e*., 〈*x*_{
i
}*, . . . , x*_{
j
}〉) is denoted by *X*(*i* : *j*).

With no sequencing error (*e.g*., substitution, insertion, and deletion), *x*_{
i
} = *s*_{
i
} and ${\overline{w}}_{i}={s}_{m-i+1}$ for 1 *≤ i ≤ n*, where ${\overline{w}}_{i}$ represents the Watson-Crick complementary base of ${\overline{w}}_{i}$ (*e.g*., Ā is T, *vice versa*). In practice, bases may be called incorrectly, and each called base is accompanied by a (Phred) *quality score* denoted by *Q* and defined as *Q* = *−* 10 log_{10} *p*, where *p* is the probability that the corresponding base call is incorrect [9]. Sequences *X* and *W* are assumed to be accompanied by ${Q}_{X}=\u27e8{q}_{{x}_{1}},{q}_{{x}_{2}},\dots ,{q}_{{x}_{n}}\u27e9$ and ${Q}_{W}=\u27e8{q}_{{w}_{1}},{q}_{{w}_{2}},\dots ,{q}_{{w}_{n}}\u27e9$, respectively. Symbol 'N' represents any base (indecisive base-call) and is normally accompanied by the lowest quality score available. We assume that there are few indel-type sequencing errors, as is commonly the case with the Illumina platform.

### Preprocessing, *k*-mer counting, and overlap detection (steps 1-3)

In the first step of CASPER, every reverse read *W* in the input is preprocessed to facilitate the downstream steps. Specifically, CASPER reverses the order of bases in *W* and then complements each base. *W* and *Q*_{
W
} are converted to *Y* = 〈*y*_{1}*, y*_{2}*, . . . , y*_{
n
}〉 and ${Q}_{Y}=\u27e8{q}_{{y}_{1}},{q}_{{y}_{2}},\dots ,{q}_{{y}_{n}}\u27e9$, respectively. ${y}_{i}={\overline{w}}_{n-i+1}$ and ${q}_{{y}_{1}}={q}_{{w}_{n-i+1}}$ for 1 *≤ i ≤ n*. In the remainder of this paper, the term 'reverse read' refers to *Y* instead of *W*.

In the second step of CASPER, a table of *k*-mer counts is constructed from the input reads, where *k* is a user-specified parameter (Table 1). There has been active research on efficient *k*-mer counting. For instance, Jellyfish [14] provides a time-efficient, parallel solution to *k*-mer counting. Methods focused on memory efficiency also exist, *e.g*., bloom-filter-based BFcounter [15] and DSK [16]. In CASPER, we adopt and customize Jellyfish for building a table of *k*-mer counts. We denote this table by *T*_{
k
} . For *k*-mer *X*(*i* : *i* + *k −* 1) = (*x*_{
i
}, *x*_{i+1}, . . . , *x*_{i+k− 1}), *T*_{
k
} [*X*(*i* : *i* + *k −* 1)] indicates the number of occurrences of this *k*-mer in all forward and reverse reads of the input data.

Algorithm 1 shows pseudo-code of the remaining steps of CASPER. In the third step of CASPER, it is decided as to how much it needs to shift (to the right) the reverse read *Y* with respect to the forward read *X* in such a way that the fraction of mismatching bases in the overlap region is minimized (lines 1-8). The resulting overlap region is considered the best. The (mis)match ratio is widely used in the literature to locate the optimal overlap between paired-end reads [6, 11]. Parameter *ω* specifies the minimum length of an overlap. If too many mismatches exist in the overlap (*i.e*., the mismatch ratio exceeds a user-specified 'give-up' threshold *γ*), CASPER does not merge the reads (line 9).

### Resolving mismatching bases in forward and reverse reads (step 4)

In the fourth step of CASPER, the mismatching bases in the overlap region are corrected. Ideally, for each position of the overlap region, the base in the forward read should match the base in the reverse read. Due to experimental error and other non-idealities, however, these two bases often mismatch. To merge paired-end reads successfully, we need to resolve mismatching bases. The basic principle is simple: overwrite the incorrect base with the correct base, assuming that the base either in forward or reverse read is correct. The remaining question is which of the two bases is correct. Lines 10-28 in Algorithm 1 reveal how CASPER answers this question. CASPER scans the overlap region twice, once for quality-based resolution and once more for context-based resolution.

Note that in lines 11-28 of the code, position *i* in forward read *X* corresponds to position *i′* in reverse read *Y* . We assume that the two bases *X*(*i*) and *Y* (*i′*) are not 'N' (if both *X*(*i*) and *Y* (*i′*) are 'N', then they are skipped; if either one is 'N', then the other base is informative).

#### First pass: quality-score-based correction (lines 10-14)

For each pair of mismatching bases in the overlap between the forward and reverse reads, CASPER first considers the difference in their quality scores. For *X*(*i*) and *Y* (*i′*), let *b*_{
h
} (*b*_{
l
}) denote the base with the higher (lower) quality score. For a user-specified parameter *δ*, if *|Q*_{
X
} (*i*) *− Q*_{
Y
} (*i′*)*| > δ*, then CASPER replaces *b*_{
l
} by *b*_{
h
}. The rationale behind this is that a significantly higher quality score of *b*_{
h
} is a strong indicator of its correctness. This is in fact the basis on which most of the current approaches to paired-end merging are grounded. CASPER is differentiated from them by the second step.

#### Second pass: context-based correction (lines 15-28)

A more interesting scenario arises when the difference in quality scores is moderate. In such cases, CASPER no longer makes decisions based on the quality scores, since we cannot assume that the error probability of *b*_{
h
} is negligible relative to that of *b*_{
l
} [12]. For the cases where *|Q*_{
X
} (*i*) *− Q*_{
Y
} (*i′*)*| ≤ δ*, CASPER therefore makes context-based decisions by examining the bases before and after the mismatching position, instead of relying on quality scores.

For bases *X*(*i*) and *Y* (*i′*), we consider *k* different (but progressively overlapping) windows along the reads and define the *j*-th context (for 1 *≤ j ≤ k*) in terms of two subsequences of *X* and *Y* :

as depicted in Figure 3. For each *j*, CASPER estimates $P\left\{X\left(i\right)|{\mathcal{C}}_{j}\left(i,{i}^{\prime}\right)\right\}$ and $P\left\{Y\left({i}^{\prime}\right)|{\mathcal{C}}_{j}\left(i,{i}^{\prime}\right)\right\}$ and then constructs a *k*-dimensional decision vector *D* = 〈*d*_{1}*, d*_{2}*, . . . , d*_{
j
} *, . . . , d*_{
k
}〉, where *d*_{
j
} is defined as

Element *d*_{
j
} = 1(0) represents that the base in the forward (reverse) read is correct. Note that this decision rule is based on the Bayesian decision theory [17], and thus the error involved in decision *dj* is given by ${\epsilon}_{j}=\text{min}\left(P\left\{X\left(i\right)|{\mathcal{C}}_{j}\left(i,{i}^{\prime}\right)\right\},P\left\{Y\left({i}^{\prime}\right)|{\mathcal{C}}_{j}\left(i,{i}^{\prime}\right)\right\}\right)$.

In other words, CASPER makes a series of preliminary decisions and stores the results in *D*. The elements in *D* can be considered as the outputs from *k* classifiers. We can make a final decision based on *D* using ensemble learning techniques that can combine results from multiple learners [18]. CASPER employs the idea of *linear opinion pools* to make a final decision:

which can be implemented as voting. If *d*_{
j
} were i.i.d., then the error from the final decision would be *ε*_{
j
}*/k* [18], but here the *d*_{
j
}'s have dependence on each other originating from the definition of the context, and we expect a higher level of error than the theoretical estimate (see 'Experiments on the context definition and probability computation' in Results and Discussion).

CASPER estimates the probabilities in Eq. (1) using the *k*-mer-count table *T*_{
k
} constructed in step 2. By Bayes' theorem, we can express these probabilities as

and

Note that the context-based mismatch resolution step assumes that the bases in each of the forward and reverse reads have a similar probability of being correct. This assumption further gives

(see Results and Discussion for our experimental result that supports the above formulation).

Thus, we can rewrite the decision rule in Eq. (1) using *k*-mer counts as follows:

In certain cases, the number of partial decisions made per mismatch can be less than *k*. CASPER skips those context windows whose span exceeds the read boundary (lines 20-21). Additionally, when the rightmost position of a window spans a pair of mismatching bases for *j* = *j*^{∗} *>* 1 (*i.e*., *X*(*i* + *j −* 1) *≠* *Y* (*i′* + *j −* 1)), CASPER skips all of the following contexts for *j ≥ j*^{∗} (line 22). Some bases appearing in such contexts may be erroneous, since they are yet to be corrected. In contrast, it is guaranteed that *X*(*i − k* + *j* : *i −* 1) = *Y* (*i′ − k* + *j* : *i′ −* 1) for 1 *≤ j < j*^{∗}, because the mismatch resolution starts with the leftmost position in the overlap and proceeds to the right.

Lines 18-28 of Algorithm 1 implement the decision rules represented by Eq. (2) and Eq. (5), also considering the cases in which the number of partial decisions is less than *k*.

### Additional details

In the final step of CASPER, each pair of forward and reverse reads is merged to produce virtually elongated reads (lines 29-31). The quality score of a base not processed by the mismatch resolution remains unchanged. For a newly replaced base, the quality score of the replacing base is used.

Assuming sequential execution, the worst-case time complexity of CASPER is *O*(*m* log *m*+*N n*^{2}), where *N* is the total number of paired-end reads, *n* is the length of each read, and *m* is the number of *k*-mers stored in the *k*-mer table. The worst-case space complexity is *O*(*m* + *N n*). Note that the *O*(*m* log *m*) time and *O*(*m*) space complexity terms are due to the Jellyfish algorithm [14] we utilize for *k*-mer counting. By parallelization, the *O*(*N n*^{2}) term in the time complexity becomes *O*(*N n*^{2}*/t*), where *t* is the number of threads used.

There are two main parts of CASPER that are suitable for parallelization: *k*-mer counting and the two-step merging process. Counting *k*-mers in different reads or merging different pairs of reads provides ample opportunities for data-level parallelization. The former is handled by the parallel implementation of Jellyfish. For the latter, we implement the proposed merging algorithm using OpenMP (http://www.openmp.org), an API that supports shared-memory multi-threaded programming.

One might consider making a context-based decision using two separate *k*-mer tables for forward and reverse reads. The context-based decision rule in Eq. (1) will then use *P* {*X*(*i*)*|X*(*i − k* + *j* : *i* + *j −* 1)} and *P* {*Y* (*i′*)*|Y* (*i′ −k* +*j* : *i′* +*j −* 1)}, reflecting respectively the forward and reverse contexts. This would make the probability estimation more complicated, since the normalization becomes different. Ideally, the forward and (preprocessed) reverse reads should bear the same information and can be considered two replicates of a fragment. Thus, considering the two reads independently may lose the advantage of having such replicates. According to our experiments, this separate-*k*-mer-table approach, while taking more time than the current implementation of CASPER, has a negligible impact on its performance.

## Results and discussion

We test the proposed CASPER methodology with the seven datasets listed in Table 2. To generate A4, A5, S4, and S5, we customize the GemSIM sequencing simulator [19] and apply it to a public dataset [20] using the Illumina error models v4 and v5. Note that datasets A4 and S4 are generated with higher error rates than A5 and S5. In S4 and S5, there is a single reference fragment, whereas A4 and A5 have all of the twenty three reference sequences used in the original publication. Datasets C1, C2, and PA are from real sequencing experiments on bacterial 16S ribosomal RNAs [21, 10].

For performance comparison with the proposed CASPER method, we employ three widely used approaches to merging paired-end reads: COPE [11], FLASH [6], and PANDAseq [10]. All of the four tools compared take quality scores [9] as input, and the quality score information of each sequence used was prepared in the FASTQ format. The specification of the machine used is as follows: Ubuntu 12.04, 4*×* 2.2 GHz Intel Xeon E5-4620 CPUs (8 cores/16 threads each), and 512 GB main memory.

### Performance evaluation and comparison

Table 3 lists the performance of CASPER and three other tools in terms of runtime (measured with 32 threads used) and other widely used metrics including the accuracy and *F*_{1} score [22]. We use the evaluation methodology proposed in [10], which examines the completeness of resolving mismatches in the overlap to call the success of a merge (refer to Additional file 1 for more details on the definition of evaluation metrics and additional experimental results).

Most notably, CASPER exhibits the highest level of accuracy and *F*_{1} score over all the datasets used. In particular, for A4/S4 (simulated with higher error rates) and C1/C2/PA (real data), the average accuracy of CASPER is higher than that of COPE, FLASH, and PANDAseq by 86%, 21%, and 15%, respectively. In terms of *F*_{1} scores, CASPER outperforms the other tools by 8-55% for A4/S4 and C1/C2/PA. For the simulated datasets with lower error rates (A5/S5), the performance of CASPER remains superior to the alternatives tested, although the performance gap is narrower due to the low error rates used to simulate A5 and S5. The evident robustness of CASPER would make it an invaluable tool for handling noisy sequencing data which are encountered frequently in practice. The performance of the tasks that utilize the paired-end merging results (such as genome assembly and mapping) will also benefit significantly from using CASPER, which can provide longer reads by merging forward and reverse reads robustly.

As for runtime, CASPER is not the fastest, as expected from the fact that CASPER relies on the time-consuming *k*-mer counting. PANDAseq and FLASH, which do not employ *k*-mer counts, normally take the least amount of runtime. Nonetheless, the runtime of CASPER is obviously reasonable in most cases (taking about thirty seconds), unlike COPE, another *k*-mer-based merger that takes noticeably more time than the *k*-mer-free tools. Clearly, there is a trade-off between the runtime and accuracy of the merging tools, but considering the quality of merges CASPER can provide, we believe that CASPER is the tool that best balances this trade-off.

### Experiments on the context definition and probability computation

In the context-based mismatch resolution, CASPER makes a call based on the ensemble of up to *k* individual decisions. For comparison, we additionally implement and measure the performance of a version that makes a decision based on only one of the *k* contexts, as shown in Figure 4. The distribution of the accuracy values this modified version reports for different *j* values (1 *≤ j ≤ k*) is represented by a box plot. We observe that the ensemble approach is effective for delivering more robust performance, although the observed performance gain is lower than the theoretical one (see 'Second pass: context-based correction' in Proposed Method), presumably due to the dependence between contexts.

As stated in 'Second pass: context-based correction' in Proposed Method, the context-based mismatch resolution of CASPER assumes that the bases in each of the forward and reverse reads have similar probabilities of being correct and thus does not consider quality scores when estimating probabilities using *k*-mer contexts. Alternatively, we can include quality scores in the probability estimation by multiplying the ratio of probabilities (*i.e*., $1{0}^{-{Q}_{X}\left(i\right)/10}/1{0}^{-{Q}_{Y}\left({i}^{\prime}\right)/10}$) in Eq. (3) and Eq. (4). We implement both approaches and compare the results in Table 4 which indicates that there is negligible difference between these two approaches, and thus implying that the quality scores add little to the information provided by the *k*-mer contexts.

### Effects of algorithm parameters and multithreading on performance

#### Parameter k: the length of k-mer

In general, using larger *k* has the advantage of examining wider contexts, but on the other hand, demands more memory and runtime. In the literature, there exist many *k*-mer-based bioinformatics tools, and different values of *k* are recommended depending on tasks and applications. For instance, the default *k* for seed-and-extend aligners are 11 for BLAST [23] and 28 for MegaBLAST [24].

Figure 5(a) shows how the accuracy and runtime of CASPER varies as the *k*-mer size changes from *k* = 2 to *k* = 31 for dataset C1. To clearly show the overhead incurred by *k*-mer counting, Figure 5(a) also shows the time taken to count *k*-mers. In this figure, the accuracy of CASPER is seen to increase only up to a certain point as we increase *k*, plateauing out around *k* = 8. As *k* increases, the time to count *k*-mers starts decreasing at first but tends to increase in the long run, due to the IO bottleneck incurred by multithreading (see [14] for more details of this behavior). The only variation here is the value of *k*, and the total runtime of CASPER is directly proportional to the time taken to perform *k*-mer counting. The default value CASPER uses is *k* = 17, which is a widely used value in genome assembly [25]. Given the length of fragments in this C1 dataset, smaller values of *k* should work, as is observed in Figure 5(a). Users can change *k* depending on their analysis target.

#### Parameter δ: the threshold for starting context-based mismatch resolution

Recall from 'Second pass: context-based correction' in Proposed Method that CASPER makes context-based decisions on which of the forward and reverse reads is correct if the quality score difference falls below *δ*. Otherwise, CASPER utilizes the quality score difference for the decisions. Figure 5(b) presents the effects of *δ* on the accuracy of CASPER. For the datasets used (A4, S4, C1), we observe a similar trend: as *δ* increases, the accuracy improves and eventually becomes saturated. Note that the number of mismatches handled by the quality-score-based decision increases as we lower *δ*. Thus, the observation that the accuracy degrades as *δ* is lowered strongly implies that the quality-score-based scheme becomes ineffective as *δ* decreases. This observation justifies the switch to the context-based scheme when quality scores do not differ significantly between two mismatching bases. This is also compatible with our motivation for developing CASPER: simply relying on quality scores may incur mistakes in resolving mismatches, when the difference in quality scores is not significant.

#### Parameters γ and ω

CASPER abandons read merging if the mismatch ratio in the overlap region is greater than *γ*. Figure 6(a) shows how the accuracy of CASPER is affected by changing the *γ* threshold (data: C1, S4, A4). The values of the other parameters are set to their default. We can observe that the accuracy steeply increases as we change *γ* from 0 to approximately 0.2, but after that, the change in accuracy seems negligible.

Figure 6(b) shows how the accuracy changes as we vary the value of *ω* from 3 to 20 for the datasets C1, S4 and A4 (default values were used for the other parameters). In this plot, we do not see any significant change up to *ω* = 10. For larger values of *ω*, the accuracy gradually decreases for A4, whereas the accuracy does not change for the other datasets. This observation can be explained by noticing the size of overlaps in the datasets used. The overlap size is 40 for S4 and 50-81 for C1, as listed in Table 2. Consequently, there should be no difference in accuracy for these datasets by varying *ω* from 3 to 20. In contrast, the overlap size for A4 is between 10 and 40, and the accuracy of CASPER for A4 becomes affected if we increase *ω* over 10.

#### The number of computing threads used for parallelization

Considering that read pairs can be merged independently of each other, most of the current approaches to merging paired-end reads provide parallel implementations. The task of *k*-mer counting adds another opportunity for parallelization, and CASPER utilizes the parallelized version of Jellyfish [14] for constructing the *k*-mer-count table.

Figure 7(a) shows how the runtime of CASPER decreases as more threads are used for parallel execution. Due to the part of the code that cannot be parallelized, the effect of multithreading diminishes as the number of threads increases past a certain point. The speedup of *k*-mer counting shows a similar trend, as can be observed in the plot.

Figure 7(b) shows a comparison of the four paired-end merging tools under comparison in terms of speedups by multithreading. For the cases where only a few threads are used, the *k*-mer-based methods (CASPER and COPE) are slower than FLASH and PANDAseq. As more threads are utilized, the runtime of CASPER steeply decreases and shows a similar run-time to the two non-*k*-mer approaches. In contrast, although the time demand of COPE is alleviated by multithreading to a certain extent, its runtime remains significantly higher than the others.

## Concluding remarks

Among the four merging tools tested, CASPER consistently shows the best performance in terms of accuracy and *F*_{1} score for the datasets used. We attribute the main reason for this improvement to the mismatching resolution policy employed by CASPER: when the quality scores differ significantly, it trusts the scores and uses them to decide the correct bases; otherwise, it switches to the context-based decision scheme based on *k*-mer counts around the mismatching bases. CASPER is most clearly differentiated from other approaches by this policy.

One might suggest incorporating quality scores to some extent even when CASPER makes a context-based decision, similarly to an existing approach [11]. For instance, we could adjust the *k*-mer counts by considering the quality scores of the bases in a *k*-mer. According to our tests (*e.g*., see Table 4), however, it is difficult to see any significant effect of this hybrid approach on the accuracy of merging once the quality score difference becomes lower than a threshold value.

The most notable findings from our experiments can thus be summarized as follows: when the difference between the quality scores of a pair of mismatching bases falls below a threshold, the quality scores often fail to deliver decisive information on the identity of the correct base. Instead, using the *k*-mer contexts around the mismatching position is believed to be more informative.

We envision various opportunities for improving CASPER: (1) Currently, the overlap finding step and the mismatch resolution step run separately. Given that a suboptimal arrangement of the overlap may become optimal after the mismatches therein are corrected, a future revision of CASPER may include simultaneous or iterative optimization of these two steps. (2) The context-based decision of CASPER presently relies on the idea of voting and may further be improved by adopting other types of ensemble learning techniques. In the revision process, care should be taken not to incur excessive computational overhead for learning. (3) CASPER presently considers only substitution type sequencing errors. Although substitutions are the most common error type in Illumina sequencing, indels (especially in homopolymers) are still problematic intervals particularly at read ends. By revising the context-based mismatch-resolution scheme, we may augment CASPER so that it can recognize and address indel-type sequencing errors as well as substitutions. (4) For the current form of dependence on *k*-mer counts, CASPER is particularly well suited for high-coverage amplicon sequencing. According to our experiments (summarized in Additional file 1), however, the degree of sequencing depth required to obtain a high accuracy with CASPER tends to be moderate, suggesting its wide applicability of CASPER. For whole-genome shotgun sequencing, the context around mismatches in the overlap between forward and reverse reads may be defined differently to optimize the effectiveness of CASPER. It would be intriguing to analyze the effect of running CASPER on a variety of downstream tasks such as sequence assembly/mapping and metagenomic diversity estimation.

**input** : *X, Y, Q*_{
X
} *, Q*_{
Y
} // *X*: forward read, *Y* : reverse read, *Q*_{
X
} *,Q*_{
Y
} : q-scores

**param** : *ω, γ, δ, k* // *ω*: min overlap, *γ*: give-up threshold, *δ*: difference threshold, *k*: *k*-mer size

**output**: *Z* // virtually elongated read by merging forward & reverse reads

// step 3: find the best overlapping region between forward & reverse reads

**1** lowestMismatchRatio ← ∞, bestOverlapStartIndex ← 0

**2 for** *i ←* 1 **to** *n − ω* + 1 **do**

**3** numMismatches ← 0

**4** **for** *j ← i* **to** *n* **do**

**5** **if** X(j) ≠ Y (j − i + 1) **then** numMismatches ← numMismatches + 1

**6** $\frac{numMismatches}{n-i+1}$ *< lowestMismatchRatio* **then**

**7** *lowestMismatchRatio←*$\frac{numMismatches}{n-i+1}$

**8** bestOverlapStartIndex ← i

**9 if** *lowestMismatchRatio > γ* **then** return // give up (too many mismatches)

// step 4 (first pass): qualty-score-based mismatch resolution

**10 for** i ← bestOverlapStartIndex **to** n **do**

**11** *i′* = *i − bestOverlapStartIndex* + 1 // for convenience in indexing *Y*

**12** **if** *X*(*i*) = *Y* (*i′*) **then** continue // skip matching bases

**13** **else if** *Q*_{
X
}(*i*) *− Q*_{
Y
}(*i′*) *> δ* **then** *Y* (*i′*) = *X*(*i*) // q in *X* is significantly better

**14** **else if** *Q*_{
Y
}(*i′*) *− Q*_{
X
}(*i*) *> δ* **then** *X*(*i*) = *Y*(*i′*) // q in *Y* is significantly better

// step 4 (second pass): context-based mismatch resolution

**15 for** i ← bestOverlapStartIndex **to** n **do**

**16** *i′* = *i − bestOverlapStartIndex* + 1 // for convenience in indexing *Y*

**17 if** *X*(*i*) = *Y* (*i′*) **then** continue // skip matching bases

**18** *ForwardVotes ←* 0, *ReverseVotes ←* 0 // otherwise, examine the *k*-mer context

**19** **for** *j ←* 1 **to** *k* **do**

**20** **if** *i − k* + *j <* 1 **then** continue // index exceeds read boundary (left); continue to next *j*

**21** **if** *i* + *j −* 1 *> n* **then** break // index exceeds read boundary (right); go to line 27

**22** **if** *j >* 1 **and** *X*(*i* + *j −* 1) *≠* *Y* (*i′* + *j −* 1) **then** break // context scan stops; go to line 27

**23** *Context*_{
X
} *← T*_{
k
}[*X*(*i − k* + *j* : *i* + *j −* 1)] // *k*-mer counts in forward reads

**24** *Context*_{
Y
} *← T*_{
k
}[*Y* (*i′ − k* + *j* : *i′* + *j −* 1)] // *k*-mer counts in reverse reads

**25** **if** Context_{X} > Context_{Y} **then** ForwardVotes ← ForwardVotes + 1

**26** **else if** Context_{X} < Context_{Y} **then** ReverseVotes ← ReverseVotes + 1

**27 if** *ForwardVotes > ReverseVotes* **then** *Y*(*i′*) *← X*(*i*) // use the base in *X*

**28 else** *X*(*i*) *← Y* (*i′*) // use the base in *Y*

// step 5: merge forward & reverse reads

**29 for** *i ←* 1 **to** *bestOverlapStartIndex −* 1 **do** *Z*(*i*) *← X*(*i*) // copy first part from *X*

**30 for** *i ←* 1 **to** *n* **do** *Z*(*i* + *bestOverlapStartIndex −* 1) *← Y* (*i*) // copy second part from *Y*

**31** return *Z*

**Algorithm 1:** Steps 3-5 of the proposed CASPER algorithm

## References

- 1.
Schuster SC: Next-generation sequencing transforms today's biology. Nature. 2007, 200 (8):

- 2.
Shendure J, Ji H: Next-generation dna sequencing. Nature biotechnology. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.

- 3.
Metzker ML: Sequencing technologies?the next generation. Nature Reviews Genetics. 2009, 11 (1): 31-46.

- 4.
Pop M, Salzberg SL: Bioinformatics challenges of new sequencing technology. Trends in Genetics. 2008, 24 (3): 142-149. 10.1016/j.tig.2007.12.006.

- 5.
Pop M: Genome assembly reborn: recent computational challenges. Briefings in bioinformatics. 2009, 10 (4): 354-366. 10.1093/bib/bbp026.

- 6.
Magoč T, Salzberg SL: FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011, 27 (21): 2957-2963. 10.1093/bioinformatics/btr507.

- 7.
Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, Alm EJ, Chisholm SW: Unlocking short read sequencing for metagenomics. PLOS One. 2010, 5 (7): 11840-10.1371/journal.pone.0011840.

- 8.
Martin JA, Wang Z: Next-generation transcriptome assembly. Nature Reviews Genetics. 2011, 12 (10): 671-682. 10.1038/nrg3068.

- 9.
Ewing B, Green P: Base-calling of automated sequencer traces using Phred. II error probabilities. Genome research. 1998, 8 (3): 186-194.

- 10.
Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD: PANDAseq: paired-end assembler for Illumina sequences. BMC Bioinformatics. 2012, 13 (1): 31-10.1186/1471-2105-13-31.

- 11.
Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam TW: COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics. 2012, 28 (22): 2870-2874. 10.1093/bioinformatics/bts563.

- 12.
Kwon S, Park S, Lee B, Yoon S: In-depth analysis of interrelation between quality scores and real errors in illumina reads. Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. 2013, IEEE, 635-638.

- 13.
Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences. 2011, 108 (4): 1513-1518. 10.1073/pnas.1017351108.

- 14.
Marçais G, Kingsford C: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011, 27 (6): 764-770. 10.1093/bioinformatics/btr011.

- 15.
Melsted P, Pritchard JK: Efficient counting of k-mers in dna sequences using a bloom filter. BMC Bioinformatics. 2011, 12 (1): 333-10.1186/1471-2105-12-333.

- 16.
Rizk G, Lavenier D, Chikhi R: DSK: k-mer counting with very low memory usage. Bioinformatics. 2013, 29 (5): 652-653. 10.1093/bioinformatics/btt020.

- 17.
Berger JO: Statistical Decision Theory and Bayesian Analysis. 2004, Springer, New York, 2:

- 18.
Alpaydin E: Introduction to Machine Learning. 2004, MIT press, Cambridge, Massachusetts

- 19.
McElroy KE, Luciani F, Thomas T: Gemsim: general, error-model based simulator of next-generation sequencing data. BMC genomics. 2012, 13 (1): 74-10.1186/1471-2164-13-74.

- 20.
Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ: Removing noise from pyrosequenced amplicons. BMC Bioinformatics. 2011, 12 (1): 38-10.1186/1471-2105-12-38.

- 21.
Bartram AK, Lynch MD, Stearns JC, Moreno-Hagelsieb G, Neufeld JD: Generation of multimillion-sequence 16s rrna gene libraries from complex microbial communities by assembling paired-end illumina reads. Applied and environmental microbiology. 2011, 77 (11): 3846-3852. 10.1128/AEM.02772-10.

- 22.
Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, Massachusetts. 2005

- 23.
Altschul SF, Madden TL, Sch¨affer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.

- 24.
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning dna sequences. Journal of Computational biology. 2000, 7 (1-2): 203-214. 10.1089/10665270050081478.

- 25.
Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y: The sequence and de novo assembly of the giant panda genome. Nature. 2009, 463 (7279): 311-317.

## Acknowledgements

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT and Future Planning) [No. 2011-0009963, and No. 2012M3A9D1054622] and in part by Samsung Electronics Co., Ltd. The authors would like to thank Sungwoon Choi and Hanjoo Kim at Yoon lab for helpful discussions, and Dr. Daniel Mason for proofreading the manuscript.

**Declarations**

Funding for open access charge was supported by Seoul National University.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.

## Author information

## Additional information

### Authors' contributions

SK implemented the method, carried out the experiments, and analyzed the results. BL participated in the experiments and analyzed the results. SY conceived the research, analyzed the results, and wrote the manuscript. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

## Electronic supplementary material

## Rights and permissions

## About this article

### Cite this article

Kwon, S., Lee, B. & Yoon, S. CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing.
*BMC Bioinformatics* **15, **S10 (2014) doi:10.1186/1471-2105-15-S9-S10

#### Published

#### DOI

### Keywords

- paired-end merging
- amplicon sequencing
- high-throughput short-read sequencing
- sequence analysis