SeqPurge: highly-sensitive adapter trimming for paired-end NGS data
© Sturm et al. 2016
Received: 22 January 2016
Accepted: 3 May 2016
Published: 10 May 2016
Trimming of adapter sequences from short read data is a common preprocessing step during NGS data analysis. When performing paired-end sequencing, the overlap between forward and reverse read can be used to identify excess adapter sequences. This is exploited by several previously published adapter trimming tools. However, our evaluation on amplicon-based data shows that most of the current tools are not able to remove all adapter sequences and that adapter contamination may even lead to spurious variant calls.
Here we present SeqPurge (https://github.com/imgag/ngs-bits), a highly-sensitive adapter trimmer that uses a probabilistic approach to detect the overlap between forward and reverse reads of Illumina sequencing data. SeqPurge can detect very short adapter sequences, even if only one base long. Compared to other adapter trimmers specifically designed for paired-end data, we found that SeqPurge achieves a higher sensitivity. The number of remaining adapter bases after trimming is reduced by up to 90 %, depending on the compared tool. In simulations with different error rates, we found that SeqPurge is also the most error-tolerant adapter trimmer in the comparison.
SeqPurge achieves a very high sensitivity and a high error-tolerance, combined with a specificity and runtime that are comparable to other state-of-the-art adapter trimmers. The very good adapter trimming performance, complemented with additional features such as quality-based trimming and basic quality control, makes SeqPurge an excellent choice for the pre-processing of paired-end NGS data.
Adapter contamination is a common problem of short-read sequencing. It arises from fragments of the sequencing library that are shorter than the read length itself. In that case, a ‘read through’ into the sequencing adapters occurs after sequencing the actual insert of interest. These adapter sequences often disturb downstream analysis of the data, i.e. read mapping and variant calling, or de-novo assembly of reads.
Thus, trimming adapter sequences is a common preprocessing step in most NGS data analysis pipelines. For amplicon-based sequencing approaches, which are widely used in clinical diagnostics, sensitive adapter trimming is of special importance. Recurrent untrimmed adapters at the same genomic position can lead to spurious variant calls. Shotgun sequencing approaches with random distribution of reads over the target region are more robust towards adapter contamination. The random read distribution reduces the probability of spurious variant calls for most variant calling applications. However, when calling variants with low allele frequencies, e.g. somatic variants or mosaic variants, adapter contamination can still lead to spurious variant calls. Therefore, adapter trimming is an essential step for shotgun data as well.
While analyzing NGS data generated with the amplicon-based HaloPlex enrichment (Agilent, Santa Clara CA), we found that adapter contamination was not fully removed by any of the adapter trimming tools we applied, even with optimized parameters [1, 8, 10]. Thus, we developed SeqPurge, a very sensitive paired-end adapter trimmer that is based on a probabilistic approach.
Before going into the algorithm details, we will briefly define the problem more formally: Given forward/reverse adapter sequences and the forward/reverse read produced from one DNA fragment, remove all bases from the reads that stem from a read-through into the adapters, if present. Those bases that must not be trimmed because they stem from the actual sequence of interest are called insert in the following text.
Calculation of non-random match probability
We call two sequences a (non-random) match, if P is less than a given threshold. This threshold is the main parameter that balances sensitivity/specificity of our algorithm.
Using this simple approach is possible because modelling indels is not necessary - an insertion/deletion in the insert is present in both read directions. We use a fixed match probability p of 0.25 for all bases, i.e. we assume that all four bases occur at the same rate. Based on our performance evaluations, this simplification causes no major problems.
The primary design goal of SeqPurge was to achieve a very high sensitivity while maintaining a state-of-the-art specificity. In general, the insert match approach is very sensitive and thus is the best approach for paired-end reads. However, certain sequence motives and unbalanced base content can make sequencing difficult in one read direction or even in both read directions. In these difficult sequences, the adapter match approach can perform better than the insert match approach. Thus, we combine the two approaches to increase the sensitivity of SeqPurge.
First, we try to find an insert match between forward read and the reverse-complement of the reverse read. To detect an insert match, each possible offset between both reads is tested for a match (see Fig. 1b). If we find a match, the adapters are trimmed and the insert remains. If we find several matches, we select the best offset, i.e. the one with the lowest probability of being random. Several matches occur primarily in reads of regions with simple repeats. To prevent overtrimming because of false-positive matches in simple repeat regions, we also require a match between the previously defined adapter sequence and the sequence flanking the putative insert. This additional adapter match is required in only one out of the two reads, which makes the algorithm more robust towards bad read quality in one read direction.
If no insert match was found, we check for adapter matches in the forward and reverse read separately. Again, each possible offset is tested for an adapter sequence match. If an adapter match is detected, the read is trimmed starting at the offset position. If only one of the two reads has an adapter match, the other read is trimmed at the offset as well, because a ‘read through’ is always symmetrical. The rationale is again that one read could have bad quality due to sequencing problems.
Theoretical runtime and runtime optimizations
The theoretical runtime of the algorithm is nm 2 where n is the number of reads and m is the read length. We implemented several optimizations to reduce the runtime in practice.
Because we need to calculate the match probability m 2 times for each read pair and three faculty values are required for the calculation, the faculty values are pre-calculated and stored in a hash for fast lookup.
To avoid a large part of the match probability calculations, we added a minimum matching bases parameter (default is 80 %). Once the maximum allowed number of mismatches of a comparison is exceeded, the rest of the comparison can be skipped. This reduces the overall runtime by up to 75 %.
Reading the data from file and processing the data in the same thread can slow down the analysis considerably when file I/O is slow. Thus, we pre-fetch reads in an I/O thread and use n additional threads for data analysis (n = 1 by default). This strategy of course increases the memory usage slightly.
We are currently evaluating the possible speedup when using four bit-arrays instead of characters to store DNA sequences. This approach is also used by Skewer , the fastest tool in our comparison.
For our benchmark we selected the best tools for paired-end reads from : Skewer , AdapterRemoval , Trimmomatic , SeqPrep (https://github.com/jstjohn/SeqPrep) and FlexBar . Additionally, we benchmarked the recently published tool PEAT , which was not part of the comparison by Jiang and colleagues.
For the main performance comparison, we used real-world data to make the benchmark as realistic as possible. We created a benchmark dataset by sequencing the NIST reference sample NA12878 : (1) A library was created from the NA12878 sample using a HaloPlex custom panel containing the exons of 71 genes (247,305 bases in total) related to hereditary breast and ovarian cancer; (2) the library was sequenced on an Illumina MiSeq (Illumina Inc, San Diego, CA) resulting in about half a million paired-end reads of 158 bp length. An amplicon dataset was used for the benchmark, because the effect of incomplete adapter trimming can even be seen in variant lists. Generally, an adapter trimmer that performs well on amplicon data will perform equally well on shotgun data.
For a second benchmark, we simulated paired-end data. Five million read pairs for the coding region of the genome (CCDS) were created with varying error rates of 0 to 4 %. The 100 bp read pairs were created based on a theoretical library with mean insert size of 100 bp and a standard deviation of 50 bp. This results in a dataset where 50 % of the reads contain adapter contamination and need to be trimmed. The reads were simulated using the PERsim (see Availability of data and materials).
We compare the adapter trimming performance based on metrics of raw reads, mapped reads and variants called. For each of the three steps, different metrics are used.
For raw reads, the number of bases remaining after trimming is counted. Additionally, we report the number of remaining adapter 20-mers from the beginning of the adapters. This crude metric already gives a good impression of the sensitivity. After perfect adapter trimming no adapter 20-mer should be found in trimmed data.
For mapped reads, the first metric is the number of properly paired reads with a mapping quality of 40 or higher. The high mapping quality cutoff of 40 was chosen to remove low-accuracy alignments. Removing low-accuracy alignments is important because alignments are the basis to calculate the metrics of overtrimmed and undertrimmed bases: The start positions of the forward and reverse read indicate the start and end position of the insert, respectively (see Fig. 1a). These metrics depend on the mapping software and its parameters. For the real data evaluation, we assume that all alignments are correct and that the Phred mapping quality is well-calibrated. For the simulated data evaluation, no mapping is required and the correct trimming is known a priori.
For variants, the most basic metric is the overall number of variants called. This simple metric can be used to identify adapter trimmers that cannot trim short adapter fragments. Remaining adapter sequences result in higher variant counts. The more significant metrics are the number of uncalled true-positives and called false-positives. To determine these, all variants with genotype calls differing between tools were manually inspected.
Benchmark results on real data
All adapter trimmers were configured similarly to make the results comparable: one thread was used for trimming; reads smaller than 15 bases after trimming were discarded; no trimming by base quality or no-call stretches (i.e. no base could be called) was performed. For algorithm-specific parameters of the adapter trimmers default values were used. Mapping was performed using the BWA “mem” algorithm with default parameters. Variant calling was performed on bases with a Phred base quality of 20 or higher only.
Adapter trimming benchmark results on real data (raw reads)
In terms of bases left after trimming, most tools produce roughly 142.5 million bases. SeqPrep removes significantly more bases. This will be discussed in the next section together with the number of properly-paired reads. PEAT removes significantly less bases, because PEAT keeps reads with an insert size of zero, i.e. adapter-dimers with no insert.
Adapter trimming benchmark results on real data (mapping)
When looking at overtrimmed and undertrimmed bases, only SeqPurge and SeqPrep show balanced sensitivity (i.e. a low number of undertrimmed bases) and specificity (i.e. a low number of overtrimmed bases). FlexBar and PEAT have a low specificity. AdapterRemoval, PEAT, Skewer and Trimmomatic have a low sensitivity.
Adapter trimming benchmark results on real data (variant calling)
Without adapter trimming, 24 spurious variants were called. They were all caused by short adapter remains in mapped reads. The high number of spurious variant calls underlines the importance of highly sensitive adapter trimming for amplicon-based sequencing data. Using FlexBar and Trimmomatic 19 and 23 false-positive variants were called, respectively, because they failed to trim short adapter remains. FlexBar trims adapter sequences longer than two bases only. Trimmomatic trims adapter sequences longer than seven bases only.
Interestingly, one true-positive variant was missed when applying no adapter trimming. The allele frequency of the heterozygous variant dropped from 54 to 3 % due to incorrect mapping of untrimmed adapter remains at the same genomic position (see Additional file 1: Figure S2 for details).
Resources benchmark results on real data
Variant calling time
The adapter trimming benchmarks presented so far, did not take trimming of low-quality bases into account. We excluded this feature, because not all tools in the comparison support it. Those tools that support it use basically the same approach with only minimal differences: low-quality bases are removed from the end of the read until a given base quality cutoff is exceeded. Thus, no significant performance differences between tools can be expected.
It is however interesting if quality trimming improves mapping and variant calling in general. Our benchmarks using SeqPurge and Skewer (see Additional file 1: Table S1 for details) indicate that trimming low-quality bases improves the properly-paired read count and the undertrimmed bases count. Interestingly, most of the performance improvement was already achieved at moderate quality score cutoffs (Q5 to Q15) and increasing the cutoff did not improve the performance any further. A more extensive evaluation of read trimming effects on NGS data can be found in .
Benchmark results on simulated data
In our benchmark on real data we measured the adapter trimming performance using a comprehensive set of metrics. However, on real data we could not measure the tolerance towards sequencing errors. In practice, this is an important feature because it is not possible to adjust algorithm parameters for each dataset in a high-throughput setting. Thus, we benchmarked the error tolerance on simulated data with different error rates.
Trimming benchmark results on simulated data without errors
Undertrimmed base counts on simulated data
We have presented SeqPurge, a novel adapter trimmer for paired-end sequencing data that is based on a probabilistic approach. The performance of SeqPurge was compared to six other adapter trimming tools on an amplicon-based benchmark dataset and on simulated data.
Our comparison shows that SeqPurge is the only tool that offers a good performance in terms of sensitivity, specificity, speed and error tolerance: AdapterRemoval is very slow and not very sensitive. FlexBar is quite slow, not very specific and fails to trim short adapter remains. PEAT has a low sensitivity and specificity, fails to remove reads without insert and has a high memory usage. SeqPrep is very slow and removes good reads from the dataset. Skewer is not as sensitive as SeqPurge, but it is by far the fastest tool in the comparison. Using Skewer might be considered for large amounts of shotgun data where the runtime is more important than sensitivity, i.e. for whole genome data. Trimmomatic is not very sensitive, keeps short adapter remains and has a high memory usage.
For the real data benchmark we focused on amplicon-based sequencing data because the effects of incomplete adapter trimming can be demonstrated more easily on amplicon data than on shotgun data. A similar benchmark was done on shotgun data (data not shown). The results are comparable to the amplicon-based benchmark with two exceptions: (1) Spurious variant calls are rare in shotgun data because untrimmed adapters are generally not placed at the same genomic position; (2) adapter dimers without insert occur less frequently in shotgun data than in amplicon data.
When considering the overall runtime and file I/O load of an analysis pipeline, processing the FASTQ raw data is one of the main contributors. Thus, it is advantageous if only one tool is needed for raw read processing. SeqPurge not only offers adapter trimming functionality, but also trimming by base quality and no-call stretches. Additionally, it can merge input FASTQ files from several runs or lanes during the trimming process, and it can calculate basic quality control metrics on raw read data. The combination of these features reduces the number of passes needed to perform the basic data preparation steps significantly. To our knowledge, SeqPurge is the only adapter trimmer that combines all these features into a single tool. SeqPurge currently does not support merging of overlapping read pairs into longer single-end reads, which is supported by SeqPrep and AdapterRemoval. A detailed feature overview of other adapter trimmers from our comparison can be found in .
Availability and requirements
SeqPurge is implemented in C++ and runs both under Linux and Windows. It is available under the ‘GNU General Public License version 2’ as part of the ngs-bits project: https://github.com/imgag/ngs-bits.
Ethics approval and consent to participate
Consent for publication
Availability of data and materials
The real NGS dataset supporting the conclusions of this article is available in the European Nucleotide Archive, ftp://ftp.sra.ebi.ac.uk/vol1/ERA494/ERA494451/. The simulated data supporting the conclusions of this article was created using PERsim which is available at https://github.com/imgag/ngs-bits.
We thank all members of the Institute of Medical Genetics and Applied Genomics, who found and reported false-positive variants, which helped us improve our algorithm.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bolger AM, et al. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.View ArticlePubMedPubMed CentralGoogle Scholar
- Dodt M, et al. FLEXBAR - Flexible Barcode and Adapter Processing for Next-Generation Sequencing Platforms. Biology. 2012;1:895–905.View ArticlePubMedPubMed CentralGoogle Scholar
- Garrison E, et al. Haplotype-based variant detection from short-read sequencing. 2012. http://arxiv.org/abs/1207.3907. Accessed 18 Jan 2016.
- Giorgi FM, et al. An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE. 2013;8(12):e85024.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang H, et al. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics. 2014;15:182.View ArticlePubMedPubMed CentralGoogle Scholar
- Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. http://arxiv.org/abs/1303.3997. Accessed 18 Jan 2016.
- Li YL, et al. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics. 2015;16:S2.View ArticlePubMedPubMed CentralGoogle Scholar
- Lindgreen S. AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes. 2012;5:337.View ArticlePubMedPubMed CentralGoogle Scholar
- Lunter G, et al. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011;21:936–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 2011;17:10–2.View ArticleGoogle Scholar
- Mose LE, et al. ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics. 2014;30(19):2813–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson JT, et al. Integrative Genomics Viewer. Nat Biotechnol. 2011;29:24–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Zook JM, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. 2015. http://dx.doi.org/10.1101/026468. Accessed 18 Jan 2016.