AfterQC: automatic filtering, trimming, error removing and quality control for fastq data
© The Author(s) 2017
Published: 14 March 2017
Some applications, especially those clinical applications requiring high accuracy of sequencing data, usually have to face the troubles caused by unavoidable sequencing errors. Several tools have been proposed to profile the sequencing quality, but few of them can quantify or correct the sequencing errors. This unmet requirement motivated us to develop AfterQC, a tool with functions to profile sequencing errors and correct most of them, plus highly automated quality control and data filtering features. Different from most tools, AfterQC analyses the overlapping of paired sequences for pair-end sequencing data. Based on overlapping analysis, AfterQC can detect and cut adapters, and furthermore it gives a novel function to correct wrong bases in the overlapping regions. Another new feature is to detect and visualise sequencing bubbles, which can be commonly found on the flowcell lanes and may raise sequencing errors. Besides normal per cycle quality and base content plotting, AfterQC also provides features like polyX (a long sub-sequence of a same base X) filtering, automatic trimming and K-MER based strand bias profiling.
For each single or pair of FastQ files, AfterQC filters out bad reads, detects and eliminates sequencer’s bubble effects, trims reads at front and tail, detects the sequencing errors and corrects part of them, and finally outputs clean data and generates HTML reports with interactive figures. AfterQC can run in batch mode with multiprocess support, it can run with a single FastQ file, a single pair of FastQ files (for pair-end sequencing), or a folder for all included FastQ files to be processed automatically. Based on overlapping analysis, AfterQC can estimate the sequencing error rate and profile the error transform distribution. The results of our error profiling tests show that the error distribution is highly platform dependent.
Much more than just another new quality control (QC) tool, AfterQC is able to perform quality control, data filtering, error profiling and base correction automatically. Experimental results show that AfterQC can help to eliminate the sequencing errors for pair-end sequencing data to provide much cleaner outputs, and consequently help to reduce the false-positive variants, especially for the low-frequency somatic mutations. While providing rich configurable options, AfterQC can detect and set all the options automatically and require no argument in most cases.
As next generation sequencing (NGS) technology being used more broadly in clinical applications, sequencing data quality control is becoming more important. In some NGS applications like ctDNA(circulating tumour DNA) sequencing , we need to detect ultra low frequency somatic mutations to help diagnosing cancers. However, the experiments (such like DNA amplification) and sequencing process always introduce errors and biases . Typically the somatic mutation rate in ctDNA is near 1% for advanced tumour patients, and can be as low as 1% for early stage tumour patients , which is very close to the error rate of mainstream NGS platforms. The presence of these errors degrades the performance of variant calling tools in detection of true low frequency mutations while keeping false-positive mutations away. This problem drives us to not only apply better preprocessing with better quality control strategies and stricter filtering criterions, but also develop sequencing error profiling and correction algorithms to recognise and reduce errors as much as possible.
For sequencing data, some good tools can already perform quality control, such like FastQC  with per-base and per-sequence quality profiling functions and PRINSEQ  with FASTA/FASTQ statistics capability, while some other tools being able to read trimming, such like Trimmomatic  and SolexaQA . Since the way to do data filtering depends on the QC result and the filtered data also need a post filtering QC, a tool with both rich QC and filtering functions is still wanted. Another improvement that can be made to these tools is overlapping analysis for pair-end sequencing, for which each DNA template is sequenced twice in forward and reverse directions. When the DNA template length is less than twice of the sequencing length, the pair of reads will be overlapped. Note that each base in the overlapping region is actually sequenced twice, so the inconsistency of these pairs may reflect the sequencing errors.
Another function needed for data preprocessing is cutting adapters. When the sequenced DNA template is shorter than sequencing length, part of sequencing adapters may be contained in the output reads. In this case, the adapters should be error-tolerantly detected and removed. Some tools like Trimmomatic  and Cutadapt  can handle such tasks, but they usually require users to input the sequence of the adapters, which are usually not well known for the people doing data analysis. By searching the best overlapping of each pair, AfterQC automatically detects and cuts adapters for pair-end data, with no need of adapter sequence input.
We will present AfterQC in this paper, a tool developed to address major practical sequencing data quality control and filtering problems. In addition to regular quality control functions like per-cycle base content and quality statistics, AfterQC also provides new functions like automatic trimming and overlapping analysis. For example, we found that some sequencers (like Illumina NextSeq series) may output a lot of polyX reads with high quality score. AfterQC can remove them using its polyX filter, while normal quality filters cannot. Another example is that we found if the amplification or sequencing process has serious strand bias, the sequence reads will show K-MER count bias (i.e. the counts of ATCGATCG and its reverse complement CGATCGAT are significantly different). Based on this finding, AfterQC provides K-MER counting based strand bias profiling. Another major contribution of this tool is overlapping analysis for pair-end sequencing data, which can be used to profile the sequencing error rate and apply error base correction or removing. For each input of a single or pair of FastQ files, AfterQC outputs a HTML report, which contains a quality control and data filtering summary, and a list of interactive figures.
Bubble detection and visualisation
In the whole sequencing process, the first several cycles can have more biases or errors since the signal coordination hasn’t been established yet, and the last several cycles can also have errors due to error accumulation and lack of following correction. In some cases, the beginning or ending of the reads may have significant statistical biases. For example, library preparation bias or sequencing bias can cause GC percentage higher than 70% at some beginning or ending cycles, and these cycles should be considered as abnormal cycles, and surely should be removed by some methods.
There are two strategies for trimming, namely local strategy and global strategy. Some tools, like Trimmomatic , apply local strategy, which perform trimming read by read. However local trimming has two drawbacks. The first drawback is that local trimming only uses the quality information for trimming, but cannot utilise the global statistical information to discover the abnormal cycles. The second drawback is local trimming results in unaligned trimming, which means duplicated reads may be trimmed differently, and consequently lead to failure of de-duplication tools like Picard . Most of these de-duplication tools detect duplications only by clustering reads with same mapping positions.
According to our experiments, AfterQC only trims very few cycles for data with good sequencing quality (i.e. 1 base in front, and 1 base in tail), so normally it will not significantly affect the data utilisation rate. However, for some extreme cases, the sequencing quality is quite low, and the mean base content percentage or quality curves can be totally chaotic. To not trim too many data for such cases, AfterQC limits the trimming cycles both in front and tail. The default setting is no more than 10% in front and no more than 5% in tail.
After trimming is done, AfterQC will apply a series of filters on the reads. AfterQC implements quality filters and polyX filters. Quality filters are trivial, which just count the number of low quality bases or N, calculate the mean quality of each read, and then determine whether to keep or discard this read. AfterQC implements an error-tolerantly method to detect polyX (X is one of A/T/C/G). Two arguments (P and L) are used to configure the polyX detection algorithm, P (default is 35) means how long the polyX sequence should have, while L (default is 2) refers to how many non-X bases can be tolerated in each polyX sub-sequence. According to our experiments, NextSeq sequencers are more likely to produce polyX reads, and most of them are polyG.
The order of applying different filters is not important. If one read is filtered out, a new sequence name containing the filter name will be assigned, and then this read will be streamed into the bad output.
Overlapping analysis and error correction
Let T denote the length of a sequenced DNA template, and S denote the length of pair-end sequencing length, then the pair of reads will totally overlap if T≤S, will overlap with a length of 2S−T, if S<T<2S, and will not overlap if 2S≤T. Based on edit distance  optimisation, we developed a method to check how each pair of reads overlap, for data from pair-end sequencing. For a pair of reads R1 and R2, let O be the offset we place R2 under R1, then we’ll have vertically aligned subsequences R1 o and R2 o , and we can calculate their edit distance ed(R1 o ,R2 o ). Our method optimises offset O to obtain the minimal edit distance, ed(R1 o−1,R2 o−1)<ed(R1 o ,R2 o )<ed(R1 o+1,R2 o+1). We consider R1 and R2 overlapped at this offset O if this edit distance ed(R1 o ,R2 o ) and overlapped length L o meet the thresholds.
Sequencing error profiling
As described above, AfterQC can detect the mismatches in the overlapping regions. For those reads with very long overlap (i.e. overlap_len>50), the edit distance of overlapped subsequences is mainly caused by sequencing errors, because an error-free overlapping is usually completely identical (edit distance should be 0). Based on this assumption, we can count the total bases and the mismatched bases in all overlapping regions. And we can consider the ratio of (mismatch/total) reflecting the sequencing error rate, which can be called estimated sequencing error rate. Furthermore, a mismatched pair usually consists of one high quality base (i.e. Q30+) and one low quality base (i.e. < Q15). In this case, we can consider that the low quality base in this pair is a sequencing error, and furtherly profile the sequencing error transform distribution (i.e. how many T bases are sequenced as C).
Automatic adapter cutting
In the overlapping analysis process, we get the optimal offset O for the best local alignment of each pair. The overlap length of this pair can be directly calculated using the offset O. If O is found negative, the bases outside overlapping region will be considered as part of adapter sequences, and then be trimmed automatically.
A feature comparison of AfterQC with existing tools. From the table we can find that AfterQC is versatile on common quality control and data filtering tasks, and offers novel features not implemented by other tools before
Read by read
Read by read
Cutting adapter only
Supported with error correction
Sequence error profiling
Fast only for single-end
Since AfterQC provides some functions that other high throughput sequencing QC or filtering tools do not possess, it usually runs slower than those other tools. In our evaluation, for pair-end sequencing data, AfterQC can process 2*240K pair-end reads per minute, while FastQC can process 2*1.5M reads per minute, which is 6X faster as AfterQC. However, the most time consuming parts of AfterQC are overlapping analysis and error correction processes, which are very useful for pair-end data. Actually, for single-end data, AfterQC can run as fast as FastQC, since no overlapping analysis is involved.
This tool is written in Python, with an edit distance module written in C. PyPy is supported for performance consideration. Currently, the fastest way to run AfterQC is using PyPy, but we are also re-implementing AfterQC using C/C++ only. The performance will be improved after the slow python code is replaced.
Results and discussion
AfterQC has been used to process all of our 100+ runs’ sequencing data, most of which are cell-free DNA sequencing. According to previous studies, the mean length of cell-free DNA is around 167 bp . This relatively short length of cell-free DNA makes AfterQC’s overlapping analysis very useful since most pairs of reads will be overlapped. The AfterQC results of our 100 runs’ data also confirm the reported length distribution. According to our results, sequencing quality can vary greatly with different runs, machines and samples. This suggests us to pay more attention to QC and data filtering, especially for clinical applications.
For pair-end sequencing, AfterQC provides an option to store only the overlapped sub-sequences, which means all pairs with no overlap, and outside overlapping areas will be discarded. Because the overlapped parts of each pair will be completely reverse complemented after overlapping analysis and error correction, this feature actually converts the pair-end sequencing data into high quality clean single-end sequencing data. Since most bases are double confirmed by pair-end sequencing, this overlapped data will have very high quality, and due to overlapping analysis based error correction, the sequence errors will be significantly eliminated.
In summary, we developed a tool called AfterQC with rich quality control, data filtering, error profiling and correction functions for next generation sequencing data. AfterQC is fully tested with a large amount of data and has been accepted by some community users. The overlapping analysis and other techniques used in this tool make it possible to generate high quality clean reads, and make it very useful for low frequency somatic mutation detection in deep sequencing applications.
The authors would like to thank Liwei Deng and Cheng Jin for beautifying the diagram charts.
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 3, 2017. Selected articles from the 15th Asia Pacific Bioinformatics Conference (APBC 2017): bioinformatics. The full contents of the supplement are available online https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-3
This study was funded by National Science Foundation of China (No.61472411) for high performance servers and publication cost, by Technology Development and Creative Design Program of Nanshan Shenzhen (No. KC2015JSJS0028A) for data collection, and by Special Funds for Future Industries of Shenzhen (No. JSGG20160229123927512) for sequencer and reagents.
Availability of data and materials
Project name: AfterQC
Project home page: https://github.com/OpenGene/AfterQC
Operating system(s): Linux, Mac OS X and Windows
Programming language(s): Python and C
Other requirements: None
License: MIT License
Any restrictions to use by non-academics: None
SC wrote most codes and wrote the manuscript, YZ/TH/YH did the testing and revised the manuscript, MX provided the idea of sequencer’s bubble effect and JG gave some key ideas. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Schwarzenbach H, Hoon DS, Pantel K. Cell-free nucleic acids as biomarkers in cancer patients. Nat Rev Cancer. 2011; 11(6):426–37.View ArticlePubMedGoogle Scholar
- Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of ion torrent, pacific biosciences and illumina miseq sequencers. BMC genomics. 2012; 13(1):1.View ArticlePubMedPubMed CentralGoogle Scholar
- Newman AM, Bratman SV, To J, Wynne JF, Eclov NC, Modlin LA, Liu CL, Neal JW, Wakelee HA, Merritt RE, Shrager JB. An ultrasensitive method for quantitating circulating tumor dna with broad patient coverage. Nature medicine. 2014; 20(5):548.View ArticlePubMedPubMed CentralGoogle Scholar
- Andrews S. A Quality Control Tool for High Throughput Sequence Data. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/. Accessed 7 Dec 2016.
- Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2014; 27(6):266–7.Google Scholar
- Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for illumina sequence data. Bioinformatics. 2014; 13:266–7.Google Scholar
- Cox MP, Peterson DA, J BP. Solexaqa: At-a-glance quality assessment of illumina second-generation sequencing data. BMC Bioinforma. 2010; 11(1):266–7.View ArticleGoogle Scholar
- M M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011; 17(1):10–12.View ArticleGoogle Scholar
- Illumina: Two-Channel SBS Sequencing Technology. San Francisco; 2015. https://www.illumina.com/content/dam/illumina-marketing/documents/products/techspotlights/techspotlight_two-channel_sbs.pdf.
- Institute B. A Set of Java Command Line Tools for Manipulating High-throughput Sequencing (HTS) Data and Formats. https://github.com/broadinstitute/picard. Accessed 7 Dec 2016.
- Gao X, Xiao B, Tao D, Li X. A survey of graph edit distance. Pattern Anal Applic. 2010; 13(1):266–7.View ArticleGoogle Scholar
- Hunter JD. Matplotlib: A 2d graphics environment. Comput Sci Eng. 2007; 9(3):90–5.View ArticleGoogle Scholar
- Sievert C, Parmer C, Hocking T, Chamberlain S, Ram K, Corvellec M, Despouy P. MCreate Interactive Web Graphics Via Plotly.js. https://github.com/plotly/plotly.js. Accessed 7 Dec 2016.
- Snyder MW, Kircher M, Hill AJ, Daza RM, Shendure J. Cell-free dna comprises an in vivo nucleosome footprint that informs its tissues-of-origin. Cell. 2016; 164(1):57–68.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and samtools. Bioinformatics. 2009; 25(16):2078–079.View ArticlePubMedPubMed CentralGoogle Scholar
- Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L, Wilson RK. Varscan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research. 2012; 22(3):568–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Hyman D, Solit D, Arcila M, Cheng D, Sabbatini P, Baselga J, Berger M, Ladanyi M. Precision medicine at memorial sloan kettering cancer center: clinical next-generation sequencing enabling next-generation targeted therapy trials. Drug discovery todal. 2015; 20(12):1422–8.View ArticleGoogle Scholar