CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing

Merging the forward and reverse reads from paired-end sequencing is a critical task that can significantly improve the performance of downstream tasks, such as genome assembly and mapping, by providing them with virtually elongated reads. However, due to the inherent limitations of most paired-end sequencers, the chance of observing erroneous bases grows rapidly as the end of a read is approached, which becomes a critical hurdle for accurately merging paired-end reads. Although there exist several sophisticated approaches to this problem, their performance in terms of quality of merging often remains unsatisfactory. To address this issue, here we present a context-aware scheme for paired-end reads (CASPER): a computational method to rapidly and robustly merge overlapping paired-end reads. Being particularly well suited to amplicon sequencing applications, CASPER is thoroughly tested with both simulated and real high-throughput amplicon sequencing data. According to our experimental results, CASPER significantly outperforms existing state-of-the-art paired-end merging tools in terms of accuracy and robustness. CASPER also exploits the parallelism in the task of paired-end merging and achieves an effective speedup through multithreading. CASPER is freely available for academic use at http://best.snu.ac.kr/casper.


[Supplementary information] S1 Performance evaluation methods
To assess the performance of the four tools compared in the paper, we use the evaluation methodology proposed in [1], which judges the success of a merge by the completeness of mismatch resolution in the overlap region (see Fig. S1). Specifically, we treat the question of whether a merge is successful as a binary classification problem. We define a true positive (TP) as a merge that correctly resolves all the mismatching bases in the overlap region with respect to the reference sequence. A false positive (FP) is defined as a merge with incorrect mismatch resolution in the overlap region. A false negative (FN) is a merge that escapes detection by CASPER. A true negative (TN) is undefined in this context.
Using these definitions of TP, FP, and FN, the accuracy and F1 score (two widely used performance metrics) are defined as follows [2]:
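As a sketch of these definitions, assuming the conventional forms of precision and recall and an accuracy that omits the TN term (TN being undefined under this labeling), the metrics can be written as below. The exact notation is an assumption; it is chosen to match the #TP/#FP/#FN style of equation (5) in Section S1.1.

$$
\text{precision} = \frac{\#\text{TP}}{\#\text{TP} + \#\text{FP}}, \qquad
\text{recall} = \frac{\#\text{TP}}{\#\text{TP} + \#\text{FN}}
$$

$$
\text{accuracy} = \frac{\#\text{TP}}{\#\text{TP} + \#\text{FP} + \#\text{FN}}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$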

S1.1 An alternative evaluation method
Note that there is another way of defining true/false positives/negatives in the literature, as shown in Figure S1. Given that the novelty of CASPER lies in resolving mismatching bases in overlapping regions (rather than finding overlaps per se), the main text uses the 'Label definition I' scheme shown in Figure S1 for performance comparison. It is also possible to use the 'Label definition II' depicted in Figure S1 for evaluating CASPER and the other three methods. In this labeling scheme, true negatives are defined as correct predictions of the reads that do not truly overlap, and the definition of accuracy becomes

$$
\text{accuracy} = \frac{\#\text{TP} + \#\text{TN}}{\#\text{TP} + \#\text{TN} + \#\text{FP} + \#\text{FN}} \tag{5}
$$

while the definition of the F1 score remains unchanged. Table S1 lists the performance statistics evaluated using the 'Label definition II' scheme for the same datasets used in the main text. This result shows that CASPER consistently produces the best accuracy and F1 scores for both of the labeling schemes depicted in Figure S1.

Figure S1: Definitions of labels included in performance metrics. 'Label definition I' was proposed in [1] and used in the main text, whereas 'Label definition II' was proposed in [3] and used in Appendix. For output types 1-5, the forward and reverse reads overlap; for output types 6-7, there is no overlap between the reads. For types 1 and 2, the length of the fragment is correctly predicted, but the predicted overlap is correct only for type 1. For type 3, the bases in the overlap region are correctly predicted, but the location of overlap is incorrectly predicted. For types 4 and 5, the overlap is either not detected or incorrectly predicted. Types 6 and 7 are not defined in Label definition I.
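The difference between the two labeling schemes can be illustrated with a minimal Python sketch. The `MergeCounts` class and both function names are hypothetical helpers introduced here, not part of CASPER. The accuracy without TN is an assumption that follows from TN being undefined under 'Label definition I', and the F1 score uses the standard form 2TP / (2TP + FP + FN).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MergeCounts:
    """Counts of merge outcomes (hypothetical container).

    tn is only meaningful under 'Label definition II'.
    """
    tp: int
    fp: int
    fn: int
    tn: int = 0


def f1_score(c: MergeCounts) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); TN plays no role."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def accuracy(c: MergeCounts, use_tn: bool) -> float:
    """Accuracy with the TN term (equation (5)) when use_tn is True.

    With use_tn=False the TN term is dropped, which is the assumed
    form under 'Label definition I'.
    """
    tn = c.tn if use_tn else 0
    denom = c.tp + tn + c.fp + c.fn
    return (c.tp + tn) / denom if denom else 0.0
```

For example, with made-up counts `MergeCounts(tp=80, fp=10, fn=10, tn=100)`, the accuracy is 0.8 without TN and 0.9 with TN, while the F1 score is the same under both schemes.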

S2.1 Performance comparison for datasets with non-overlapping reads
We carried out additional experiments to show the performance of CASPER for datasets with true negatives (i.e., non-overlapping reads). To this end, we first created a new dataset called N4 using nearly the same method as that used to create the four simulated datasets presented in the main text (A4/A5/S4/S5). That is, we used the GemSIM (v4) model to simulate 1,000,000 reads (100 bp each) from twenty-three reference sequences originating from the V5 region of bacterial 16S rRNAs. However, the fragment length was set to 200-250 bp in N4 so that forward and reverse reads do not overlap. We then mixed N4 with A4 and C2 in turn, generating two mixed datasets in which a number of forward and reverse reads do not overlap. The results from applying CASPER and the other three tools to these mixture datasets are presented in Table S2.
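The choice of fragment length for N4 follows from simple arithmetic: two reads of length L sequenced from opposite ends of a fragment of length F share 2L - F bases, so 100 bp reads cannot overlap once F reaches 200 bp. A minimal Python sketch of this check is given below. The function names are hypothetical, and the formula assumes both reads have the same length and that the fragment is at least as long as one read.

```python
def overlap_length(read_len: int, fragment_len: int) -> int:
    """Signed overlap between equal-length forward and reverse reads.

    A positive value is the number of overlapping bases. Zero means the
    reads just touch, and a negative value is the size of the
    unsequenced gap between them. Assumes fragment_len >= read_len.
    """
    return 2 * read_len - fragment_len


def reads_overlap(read_len: int, fragment_len: int) -> bool:
    """True only if the two reads share at least one base."""
    return overlap_length(read_len, fragment_len) > 0
```

With 100 bp reads, `reads_overlap(100, f)` is false for every fragment length `f` from 200 to 250, whereas a 150 bp fragment would give a 50-base overlap.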
According to this result, CASPER maintains its superiority to the compared alternatives in terms of accuracy and F1 score even for the datasets with many non-overlapping reads. The ability of CASPER to discover overlaps is similar to that of the alternatives (i.e., similar amounts of TNs except PANDAseq in Table S2), but CASPER outperforms the other methods in terms of correcting mismatches in the overlap (i.e., better performance in terms of TPs, FPs, and FNs), yielding the best accuracy and F1 scores overall.

Parameters: k = 17, ω = 10, γ = 0.5, δ = 19; machine: Ubuntu 12.04, Intel Xeon E5-4620×4, 512-GB memory; these statistics were computed in terms of the 'Label definition II' in Figure S1.

Parameters: k = 17, ω = 10, γ = 0.27, δ = 19; machine: Ubuntu 12.04, Intel Xeon E5-4620×4, 512-GB memory; these performance statistics were calculated using the 'Label definition II' in Figure S1.

S2.2 Effects of sequencing depth on the accuracy of CASPER
Because CASPER currently depends on k-mer counts, it is suited primarily to high-coverage amplicon sequencing. To determine the level of sequencing coverage required for CASPER to perform well, we measured the accuracy of CASPER as the sequencing depth was varied from 1 to 500 for the simulated data created from the A4 dataset. Figure S2 shows the result. As expected, the performance improves as the sequencing depth increases. However, beyond a certain point (in this case, a sequencing depth of approximately five), the improvement is no longer noticeable. This experimental result suggests that CASPER is applicable not only to high-coverage amplicon sequencing data but also to moderate-coverage data.

Figure S2: Effect of sequencing depth on accuracy. The accuracy of CASPER improves as the sequencing depth increases and becomes saturated when the depth is 5 or higher.
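To give some intuition for why depth matters to a method built on k-mer counts, the following Python sketch tallies k-mers across a set of reads. It is a generic illustration, not CASPER's implementation: the function name `count_kmers` and the toy reads are hypothetical, and k = 3 is used only for readability (the experiments reported here use k = 17).

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k substring across all reads."""
    counts: Counter = Counter()
    for read in reads:
        # A read shorter than k contributes no k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy example: five error-free copies of a read plus one copy with a
# single substituted base (T -> A at the fourth position).
correct = "ACGTAC"
erroneous = "ACGAAC"
counts = count_kmers([correct] * 5 + [erroneous], k=3)
```

In this toy setting the k-mer `CGT` from the correct reads is seen five times, while `CGA`, which arises only from the substituted base, is seen once. At a depth of one there would be no such contrast, which is consistent with the dependence on depth described above.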