HTQC: a fast quality control toolkit for Illumina sequencing data
© Yang et al.; licensee BioMed Central Ltd. 2013
Received: 7 September 2012
Accepted: 27 January 2013
Published: 31 January 2013
The Illumina sequencing platform is widely used in genome research. Quality assessment and control of sequence reads are needed before downstream analysis. However, software that provides both efficient quality assessment and versatile filtration methods is still lacking.
We have developed a toolkit named HTQC (an abbreviation of High-Throughput Quality Control) for quality control of sequence reads. It consists of six programs for reads quality assessment, reads filtration and generation of graphic reports.
The HTQC toolkit generates reads quality assessments faster than existing tools. These assessments provide guidance for its reads filtration utilities, which allow users to choose different strategies to remove low quality reads.
Next generation sequencing technologies are generating massive amounts of sequence data, and different platforms introduce varied levels of read error. Among them, the Illumina platform is the most widely used for genome sequencing and has the lowest error rate per base. However, due to the nature of the method, it still produces a considerable number of errors, and these follow a specific pattern. The device performs sequencing by DNA synthesis on clusters of identical DNA molecules simultaneously. When elongation of some DNA molecules stops accidentally, the cluster’s fluorescent signal is disturbed, resulting in sequencing errors. Such errors accumulate during the sequencing process, so read quality decreases as read length grows. In addition, defects in sequencing chips and air bubbles on the chip surface can cause reads from a whole tile to fail. To get reliable results in downstream analysis, it is necessary to remove low quality reads, which avoids mismatches in read mapping and false paths during genome assembly.
Table: Comparison of sequencing quality control software tools. Recoverable entries: reads filtration methods: tail trimming, filter by quality/length/tile; not limited, any FASTQ file.
The HTQC toolkit consists of six programs that perform reads quality assessment and filtration. To improve run-time performance, the time-consuming programs are implemented in C++. The FASTQ format is used for sequence data input and output, and the QC report is generated as a tab-separated plain-text file. A Perl script is included to create graphical charts from the QC report. GNU GLib is used for base utilities such as the command-line parser and portable support for threads and asynchronous queues. All programs of the HTQC toolkit support single-end and paired-end sequencing experiments, 33-based and 64-based quality score encodings, and the FASTQ sequence identifier formats from different versions of the CASAVA tools (the traditional format like “@HWUSI-EAS100R:6:73:941:1973#0/1”, and the new format used by CASAVA version 1.8 like “@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG”).
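To make the two identifier formats and the two quality encodings concrete, the following Python sketch parses both example identifiers quoted above and decodes a quality string with either offset. It is an illustration only and not HTQC code: the regular expressions, the function names `parse_identifier` and `decode_quality`, and the field names are assumptions based on the usual Illumina field order (instrument, lane, tile, x, y for the traditional format; instrument, run, flowcell, lane, tile, x, y followed by read, filter flag, control bits and index for CASAVA 1.8).

```python
import re

# Traditional identifier (assumed layout): @instrument:lane:tile:x:y#index/read
OLD_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<read>[12])$"
)

# CASAVA 1.8 identifier (assumed layout):
# @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
NEW_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<run>\d+):(?P<flowcell>[^:]+):(?P<lane>\d+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<read>[12]):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_identifier(line):
    """Return (lane, tile, read_number) for either identifier style."""
    text = line.strip()
    for pattern in (NEW_ID, OLD_ID):
        match = pattern.match(text)
        if match:
            return int(match["lane"]), int(match["tile"]), int(match["read"])
    raise ValueError(f"unrecognised FASTQ identifier: {line!r}")


def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to integer scores (offset 33 or 64)."""
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    print(parse_identifier("@HWUSI-EAS100R:6:73:941:1973#0/1"))
    print(parse_identifier("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
    print(decode_quality("II5", offset=33))
```

Trying the CASAVA 1.8 pattern first is a deliberate choice in this sketch: it requires seven colon-separated fields before the space, so a traditional identifier cannot match it by accident.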
Once the quality of the sequence reads has been assessed, low quality reads should be removed. The HTQC toolkit provides four programs for reads filtration: ht_tile_filter, ht_trim, ht_qual_filter and ht_length_filter. ht_tile_filter removes reads from problematic tiles that may not be reliable due to sequencing chip quality; ht_trim cuts low quality bases from the beginning or the end of a read until the quality score reaches a given threshold; ht_qual_filter removes reads with low quality; and ht_length_filter removes short reads. When only one end of a paired-end read is of acceptable quality, it is stored in a separate file. The cutoff values of these programs, such as the thresholds on minimum read quality or minimum read length, are user defined.
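The filtration steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HTQC implementation: the function names, the default thresholds and the use of mean quality as the "low quality" criterion are all hypothetical, since the text does not specify how ht_qual_filter scores a read.

```python
def trim_tails(seq, quals, min_qual=20):
    """Cut bases from both ends until a base with quality >= min_qual is reached."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filters(quals, min_mean_qual=20, min_length=30):
    """Length filter followed by a mean-quality filter (assumed criterion)."""
    if not quals or len(quals) < min_length:
        return False
    return sum(quals) / len(quals) >= min_mean_qual


def filter_pair(read1, read2, min_qual=20, min_mean_qual=20, min_length=30):
    """Trim and filter both ends of a pair.

    Each read is a (sequence, qualities) tuple. Returns a label
    ('paired', 'single' or 'dropped') and the list of surviving reads,
    mirroring the idea of writing lone survivors to a separate file.
    """
    kept = []
    for seq, quals in (read1, read2):
        seq, quals = trim_tails(seq, quals, min_qual)
        if passes_filters(quals, min_mean_qual, min_length):
            kept.append((seq, quals))
    label = {2: "paired", 1: "single", 0: "dropped"}[len(kept)]
    return label, kept


if __name__ == "__main__":
    good = ("ACGTAC", [2, 30, 30, 30, 5, 2])
    bad = ("ACGTAC", [2, 2, 2, 2, 2, 2])
    print(filter_pair(good, bad, min_length=3))
```

Trimming before filtering matters in this sketch: a read whose low quality tail has been removed may still pass the mean-quality check, whereas a read that trims down to nothing fails the length check.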
Results and discussion
To demonstrate the function of HTQC, paired-end sequence data from a human gut metagenome were used as an example. To reduce the time cost, one tenth of a total of 35,625,015 paired-end reads were randomly picked. The read length was 120 bp. Quality assessment was performed using ht_stat, which presents read quality in a series of charts that were described above in Implementation. When quality was assessed by base position, a gradual decrease of read quality towards the 3’-end (tail) could be observed (Figure 1A and 1B). Tail trimming with the program ht_trim would routinely be applied to cut these low quality tails. In Figure 1C, at least 10% of reads contained an invalid nucleotide sequence represented by contiguous Ns. The bad reads containing these Ns can be filtered with the program ht_length_filter or ht_qual_filter. When quality was assessed by tile, we observed that tiles 5, 31, 110, 113, 117 and 118 produced reads of very low quality (Figure 1F); these reads can be removed using ht_tile_filter. For paired-end reads, if the quality of read 2 is worse than that of read 1, this imbalance is picked up by ht_stat (Figure 1G).
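The two views of quality used in this example, by base position and by tile, can be illustrated with a short Python sketch. It is not the ht_stat algorithm, whose statistics are not detailed here: the function names, the input layout (an iterable of tile number and decoded quality scores) and the use of a simple mean with a cutoff of 20 are assumptions made for illustration.

```python
from collections import defaultdict


def summarise_quality(reads):
    """Aggregate mean quality by base position and by tile.

    reads: iterable of (tile, qualities) with qualities as integer scores.
    Returns (list of mean quality per position, dict of mean quality per tile).
    """
    pos_sum, pos_count = [], []
    tile_sum, tile_count = defaultdict(int), defaultdict(int)
    for tile, quals in reads:
        # Grow the per-position accumulators when a longer read appears.
        if len(quals) > len(pos_sum):
            grow = len(quals) - len(pos_sum)
            pos_sum.extend([0] * grow)
            pos_count.extend([0] * grow)
        for i, q in enumerate(quals):
            pos_sum[i] += q
            pos_count[i] += 1
        tile_sum[tile] += sum(quals)
        tile_count[tile] += len(quals)
    by_position = [s / c for s, c in zip(pos_sum, pos_count)]
    by_tile = {t: tile_sum[t] / tile_count[t] for t in tile_sum if tile_count[t]}
    return by_position, by_tile


def flag_tiles(by_tile, min_mean_qual=20):
    """Return the sorted tile numbers whose mean quality is below the cutoff."""
    return sorted(t for t, q in by_tile.items() if q < min_mean_qual)


if __name__ == "__main__":
    toy_reads = [(5, [10, 10]), (5, [20, 0]), (7, [30, 35])]
    by_position, by_tile = summarise_quality(toy_reads)
    print(by_position, by_tile, flag_tiles(by_tile))
```

A list of flagged tiles like the one returned by `flag_tiles` is the kind of input a tile-based filter needs, and a per-position mean that drops towards the tail is the signal that motivates trimming.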
Program run-time efficiency
The HTQC toolkit provides convenient utilities for Illumina sequencing QC. It processes sequencing data faster than existing tools and generates quality assessment reports in both plain-text and graphical form, which helps in making decisions about further reads filtration. The HTQC package also provides four programs that perform reads filtration using different methods. Unlike previous tools, which allow only a single filtration method, HTQC lets users choose the methods they prefer for removing low quality reads and combine several filtration methods in any order.
Availability and requirements
Project name: HTQC
Project home page: https://sourceforge.net/projects/htqc
Operating system: Linux, potentially any POSIX compliant system.
Other requirements: GNU GLib (http://ftp.gnome.org/pub/GNOME/sources/glib), pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config), CMake (http://www.cmake.org), Perl (http://www.perl.org), Gnuplot (http://gnuplot.info)
Programming languages: C++, Perl
License: GNU GPL version 3 or later
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.