Compression of next-generation sequencing quality scores using memetic algorithm
- Jiarui Zhou^{1, 2},
- Zhen Ji^{2} (corresponding author),
- Zexuan Zhu^{2} and
- Shan He^{3}
https://doi.org/10.1186/1471-2105-15-S15-S10
© Zhou et al.; licensee BioMed Central Ltd. 2014
Published: 3 December 2014
Abstract
Background
The exponential growth of next-generation sequencing (NGS) derived DNA data poses great challenges to data storage and transmission. Although many compression algorithms have been proposed for DNA reads in NGS data, few methods are designed specifically to handle the quality scores.
Results
In this paper we present MMQSC, a memetic algorithm (MA) based compressor for NGS quality score data. The algorithm extracts raw quality score sequences from FASTQ formatted files and designs a compression codebook using MA based multimodal optimization. The input data is then compressed in a substitutional manner. Experimental results on five representative NGS data sets show that MMQSC obtains a better compression ratio than other state-of-the-art methods. In particular, MMQSC is a lossless, reference-free compression algorithm, yet it obtains an average compression ratio of 22.82% on the experimental data sets.
Conclusions
The proposed MMQSC compresses NGS quality score data effectively. It can be utilized to improve the overall compression ratio on FASTQ formatted files.
Background
DNA sequencing provides fundamental data for many research areas, e.g., genomics, bioinformatics, and biology [1]. DNA sequencing technologies, especially high-throughput next-generation sequencing (NGS), have progressed rapidly in the last few years. Newly proposed high-efficiency methods significantly stimulate the production and usage of NGS data [2]. However, the exponential growth of NGS data poses a huge challenge to data storage and transmission [3]. Efficient compression algorithms are therefore required.
General-purpose compression algorithms, e.g., gzip and bzip2, usually fail to obtain satisfactory results on NGS data because they are designed for ordinary plain text or binary files. To achieve a higher compression ratio, many specialized methods have been proposed. For instance, Cox et al. [4] proposed a Burrows-Wheeler transform based compression algorithm for large-scale DNA sequence databases. Jones et al. [5] presented Quip, a highly efficient reference-based NGS data compression tool that relies on external reference genomes or lightweight de novo assembly to generate reference sequences from the target data. Popitsch et al. [6] proposed the NGC tool for compressing SAM format files. Hach et al. [7] proposed SCALCE, which introduces locally consistent parsing in data encoding. More comprehensive reviews of NGS data compression can be found in [8, 9].
Typically, raw NGS data includes a series of sequencing records (called reads). Each record consists of three major components: metadata containing the read name, platform, and other useful information; a DNA sequence read obtained from one fold of the oversampling; and a quality score sequence estimating the accuracy of the corresponding DNA bases. Existing algorithms usually focus on compressing the DNA sequence reads and rely on conventional methods, e.g., Huffman coding and run-length encoding (RLE), to compress quality scores [10]. Quality score data is considered more important than the metadata, and usually occupies as much space as the DNA sequences or more. It may pose bigger challenges for compression than DNA sequence reads because of its larger alphabet size. By introducing a compressor specific to NGS quality scores, the overall compression ratio of NGS data can be further improved.
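The three-component record described above can be illustrated with a minimal sketch that pulls quality score strings out of FASTQ text. It assumes the common four-line record layout (name line starting with "@", read, separator line starting with "+", quality string) and the Sanger offset of 33 described in [15]; the function names `quality_lines` and `phred_scores` are hypothetical and are not part of MMQSC.

```python
from typing import Iterable, Iterator, List


def quality_lines(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality string of each four-line FASTQ record."""
    record: List[str] = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name, read, plus, quality = record
            # Basic sanity checks on the assumed four-line layout.
            if not name.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed FASTQ record: " + name)
            if len(read) != len(quality):
                raise ValueError("read/quality length mismatch: " + name)
            yield quality
            record = []


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integer scores (Sanger offset assumed)."""
    return [ord(ch) - offset for ch in quality]


example = [
    "@read1 demo\n", "ACGT\n", "+\n", "IIC#\n",
    "@read2 demo\n", "GGA\n", "+\n", "5;I\n",
]
qualities = list(quality_lines(example))
print(qualities)
print(phred_scores(qualities[0]))
```

Each base is paired with exactly one quality character, which is why the quality stream is at least as long as the read stream.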
In this paper, we propose MMQSC, a memetic algorithm (MA) based compression algorithm for NGS quality scores. The algorithm consists of three major parts: Huffman coding based preprocessing is conducted first, followed by MA based design of the encoding codebook. Finally, the quality score data is compressed using the codebook. An MA is widely known as a synergy of a population-based evolutionary algorithm and local search or individual learning procedures. MAs are capable of solving various complex optimization problems more effectively than their conventional counterparts [11]. In this work, self-adaptive differential evolution combined with neighborhood search (SaNSDE) [12] and Davies, Swann, and Campey with Gram-Schmidt orthogonalization (DSCG) [13], or SaNSDE-DSCG for short, are introduced into MMQSC to optimize the compression codebook for NGS quality scores. With this codebook, the most repetitive short score segments are identified and represented with much shorter encodings.
In conventional MAs, each individual represents a candidate solution of the entire problem, i.e., a whole compression codebook. Optimizing it is highly complex, because the codebook consists of hundreds of quality score symbols. Multimodal optimization tries to find all or most of the multiple solutions within a population in a single simulation run [14]. Based on multimodal optimization, MMQSC uses an individual to represent only a single specific code vector, and composes the codebook from the entire evolution population. The computational complexity of each individual's fitness evaluation is thereby significantly reduced.
The proposed MMQSC obtains promising performance on NGS quality scores stored in the widely used FASTQ format [15]. Experimental results on five representative NGS data sets show that MMQSC obtains a better compression ratio than other state-of-the-art methods. In particular, MMQSC is a reference-free algorithm for lossless compression, yet it obtains an average compression ratio of 22.82%, i.e., 1.81 bits per quality value (BPQ), on the experimental data sets.
The remainder of this paper is organized as follows: the Methods section describes the details of the proposed MMQSC compression algorithm. The Results and discussion section presents the experimental results of MMQSC on the five real-world NGS data sets. Finally, the Conclusions section summarizes the work.
Methods
SaNSDE and DSCG based memetic algorithm for multimodal optimization
MAs employ the concept of a "meme", coined by Dawkins [16], within an evolutionary computation framework to improve search efficiency. Typically, an MA uses a population-based global optimizer as its fundamental framework, and embeds separate local searches or 'memes' in each generation of the global evolution to refine the population [17]. The procedure of a canonical MA framework is illustrated in Algorithm 1 [18].
In MAs, both the global search and the local search strategies can be selected flexibly according to the target problem. Typically, NGS data consists of thousands or even millions of read entries, each of which contains hundreds of quality score symbols. Finding a codebook for compressing such data is naturally a high-dimensional optimization problem. Differential evolution (DE) [19] is capable of solving large-scale problems effectively. In this paper, a high-performance DE variant, SaNSDE, is used as the global optimizer. Specifically, SaNSDE uses three self-adaptive mechanisms to select mutation strategies and control parameter values. By introducing neighborhood search into the optimization process, SaNSDE obtains higher performance than conventional algorithms. Moreover, the widely used local search strategy DSCG is introduced to increase convergence speed. DSCG is a gradient-based optimizer that searches the solution space in a greedy manner. By combining the exploration of SaNSDE with the exploitation of DSCG, the proposed MA obtains promising performance on codebook design for quality score compression.
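The canonical framework (a population-based global step followed by a local refinement of each individual in every generation) can be sketched as follows. This is an illustration only: the global step is plain DE/rand/1 with fixed parameters rather than SaNSDE, the local step is a simple greedy coordinate probe rather than DSCG, and all names and parameter values (`memetic_minimize`, `local_search`, F = 0.5, CR = 0.9, step 0.1) are assumptions.

```python
import random
from typing import Callable, List

Vector = List[float]


def local_search(fitness: Callable[[Vector], float], x: Vector, step: float) -> Vector:
    """Greedy coordinate probing: keep any +/- step move that improves fitness."""
    best, best_fit = list(x), fitness(x)
    for d in range(len(best)):
        for delta in (step, -step):
            cand = list(best)
            cand[d] += delta
            cand_fit = fitness(cand)
            if cand_fit < best_fit:
                best, best_fit = cand, cand_fit
    return best


def memetic_minimize(fitness: Callable[[Vector], float], dim: int,
                     pop_size: int = 20, generations: int = 50,
                     seed: int = 0) -> Vector:
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Global search: DE/rand/1 mutation with binomial crossover.
        for i in range(pop_size):
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            forced = rng.randrange(dim)  # at least one dimension is mutated
            trial = [
                a[d] + 0.5 * (b[d] - c[d])
                if (rng.random() < 0.9 or d == forced) else pop[i][d]
                for d in range(dim)
            ]
            if fitness(trial) <= fitness(pop[i]):
                pop[i] = trial
        # Local search ("meme"): refine every individual in this generation.
        for i in range(pop_size):
            pop[i] = local_search(fitness, pop[i], step=0.1)
    return min(pop, key=fitness)


def sphere(x: Vector) -> float:
    return sum(v * v for v in x)


best = memetic_minimize(sphere, dim=3)
print(sphere(best))
```

Both steps only accept non-worsening moves, so the best fitness in the population never degrades from one generation to the next.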
where dist( x_{ i }, x_{ j }) is the Manhattan distance between x_{ i } and x_{ j }. If the evolution population gathers in the same optimal region, the shared fitness values of its individuals deteriorate significantly, which disperses the individuals. By using f_{ S }( x_{ i }) to guide the search process, SaNSDE-DSCG is capable of finding all optima effectively.
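The effect of sharing can be illustrated with a short sketch. The source's exact sharing equation is not reproduced here; the code assumes the standard triangular sharing function from the fitness sharing literature [20] and, because smaller raw fitness is better in this paper, assumes that the raw value is multiplied by the niche count so that crowded individuals are penalized. The names `sharing`, `shared_fitness`, `radius`, and `alpha` are hypothetical.

```python
from typing import List, Sequence


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))


def sharing(distance: float, radius: float, alpha: float = 1.0) -> float:
    """Standard sharing function: 1 at distance 0, falling to 0 at the niche radius."""
    return 1.0 - (distance / radius) ** alpha if distance < radius else 0.0


def shared_fitness(raw: Sequence[float], population: Sequence[Sequence[float]],
                   radius: float, alpha: float = 1.0) -> List[float]:
    """Penalize crowded individuals (minimization, positive raw fitness assumed)."""
    shared = []
    for i, xi in enumerate(population):
        niche_count = sum(sharing(manhattan(xi, xj), radius, alpha)
                          for xj in population)
        shared.append(raw[i] * niche_count)
    return shared


population = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
raw = [1.0, 1.0, 1.0]
result = shared_fitness(raw, population, radius=1.0)
print(result)
```

The two individuals sitting 0.1 apart receive a worse (larger) shared value than the isolated one, even though all three have the same raw fitness, which is the dispersing pressure described above.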
Compression codebook design using SaNSDE-DSCG
in which U denotes symbol matching (unchanged), I stands for insertion (marked with "∧"), D for deletion (marked with "−"), and S for substitution. For insertions and substitutions, the original symbol should also be recorded (for instance "C" on the third quality score). This matching process is conducted by using dynamic programming (DP).
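A DP based matching of this kind can be sketched as a standard edit-distance table with a backtrace that emits U, I, D, and S operations plus the original symbols recorded for I and S. This is a minimal illustration under unit edit costs; the function names, the tie-breaking order in the backtrace, and the example strings are assumptions, not the authors' implementation.

```python
from typing import Tuple


def edit_ops(code: str, target: str) -> Tuple[str, str]:
    """Return (ops, recorded): ops over {U, I, D, S} turning code into target,
    and the target symbols that must be stored for I and S."""
    n, m = len(code), len(target)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (code[i - 1] != target[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    ops, recorded = [], []
    i, j = n, m
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (code[i - 1] != target[j - 1]):
            if code[i - 1] == target[j - 1]:
                ops.append("U")
            else:
                ops.append("S")
                recorded.append(target[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ops.append("I")
            recorded.append(target[j - 1])
            j -= 1
        else:
            ops.append("D")
            i -= 1
    return "".join(reversed(ops)), "".join(reversed(recorded))


def apply_ops(code: str, ops: str, recorded: str) -> str:
    """Rebuild the original sequence from a code vector and its differences."""
    out, ci, ri = [], 0, 0
    for op in ops:
        if op == "U":
            out.append(code[ci]); ci += 1
        elif op == "S":
            out.append(recorded[ri]); ri += 1; ci += 1
        elif op == "I":
            out.append(recorded[ri]); ri += 1
        else:  # "D": skip one code-vector symbol
            ci += 1
    return "".join(out)


ops, recorded = edit_ops("ABCD", "ABXD")
print(ops, recorded)
```

Because `apply_ops` restores the target exactly from the code vector, the operation string, and the recorded symbols, storing those three items is sufficient for lossless decoding.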
where M is the number of code vectors in a codebook, P^{∗} denotes the length of the symbol differences sequence (i.e. Q^{∗}), and R is the number of original symbols recorded for insertions and substitutions. In the example above, the original quality score sequence takes L_{ O } = 48 bits of storage, while the encoded data uses only about 24 bits. Usually the encoding process makes L_{ C } smaller than L_{ O }, i.e., it achieves compression. The more similar the code vector is to the original quality score sequence, the higher the compression ratio. That is, the quality of the codebook determines the overall compression performance.
in which K is the total number of quality score sequences, L_{ C }( C_{ i }, Q_{ k }) denotes encoded data size on the k th sequence Q_{ k } using code vector C_{ i }. Small f_{ R }( x_{ i }) indicates that C_{ i } is more similar to the original score data, i.e., obtains higher compression ratio. Shared fitness f_{ S }( x_{ i }) is then calculated accordingly.
in which P_{ k } is the length of Q_{ k }, and lev(·) denotes the Levenshtein distance between the two input sequences. Its value is multiplied by 4 because half of the operations in {U, I, D, S}, namely I and S, need to record the original quality score. Accurate matching information is obtained only after the codebook design process, when the actual sequences are compressed.
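The Levenshtein term of this estimate can be computed with a two-row DP, as sketched below. Only the 4 × lev(·) term stated in the text is shown; the remaining terms of the source's estimate are not reproduced. The reading that 4 bits is half of an 8-bit recorded symbol is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j - 1] + (ca != cb),  # match or substitution
                previous[j] + 1,               # deletion from a
                current[j - 1] + 1,            # insertion into a
            ))
        previous = current
    return previous[-1]


def recorded_symbol_bits(code_vector: str, sequence: str) -> int:
    """Estimated bits for recorded symbols: 4 bits per edit (assumed to be
    half of an 8-bit symbol, since only I and S store the original score)."""
    return 4 * levenshtein(code_vector, sequence)


print(levenshtein("IIIHH", "IIHHG"))
print(recorded_symbol_bits("IIIHH", "IIHHG"))
```

The distance alone needs only two table rows and no backtrace, which is why it is cheaper than the full matching used for the final compression.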
The MMQSC algorithm
The proposed MMQSC algorithm consists of three major parts: Huffman coding based preprocessing, SaNSDE-DSCG based codebook design, and quality score data compression. Details of SaNSDE-DSCG optimization algorithm and its application on compression codebook design have been discussed in the previous sections.
in which h_{ t } is the t-th symbol in the converted string H_{ k } = "h_{1} h_{2} ... h_{ T } ", the function huff(·) denotes Huffman coding, int(·) converts binary data to the corresponding integer value, and chr(·) maps an integer to an ASCII character. An offset of 33 is added to the input number in chr(·) to ensure that the exported character is readable. In the majority of cases H_{ k } consists of fewer symbols than the original sequence Q_{ k }. The dimensionality of the code vectors is thereby reduced, and the codebook design problem is simplified.
SaNSDE-DSCG is then run on the encoded sequences. After optimization, MMQSC maps the individuals in the evolution population into code vectors using Eq. (6) to construct a compression codebook. This codebook is then used to compress the input data, with accurate symbol differences information obtained by the DP based matching algorithm.
The procedure of the MMQSC algorithm is given in Algorithm 2. It is worth noting that the codebook design process can also be conducted offline. That is, a universal compression codebook obtained from representative NGS quality score data sets can be used to encode all input sequences. The time-consuming optimization process is then performed only once. Subsequent compressions, which are usually conducted repeatedly, require much less computational time.
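The preprocessing described here (Huffman coding, then integer conversion, then mapping to a readable character with offset 33) can be sketched as follows. The source does not state how many bits each output character carries; the 6-bit chunk width and the zero padding of the last chunk are assumptions chosen so that every character falls in the printable range 33 to 96. The function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict


def huffman_code(text: str) -> Dict[str, str]:
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so dicts are never compared
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def to_readable(bits: str, chunk: int = 6, offset: int = 33) -> str:
    """Split a bit string into fixed-width chunks and map each to a character."""
    padded = bits + "0" * (-len(bits) % chunk)  # zero-pad the last chunk
    return "".join(chr(int(padded[i:i + chunk], 2) + offset)
                   for i in range(0, len(padded), chunk))


quality = "IIIIIIIIHHHHGG##"
code = huffman_code(quality)
bits = "".join(code[sym] for sym in quality)
converted = to_readable(bits)
print(len(quality), len(bits), len(converted))
```

In this toy input the 16 original symbols shrink to 5 converted symbols, which illustrates how the preprocessing lowers the dimensionality seen by the optimizer. Decoding would additionally need the code table and the original bit length, which this sketch does not store.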
Results and discussion
Table 1. NGS data sets for MMQSC performance evaluation.
Data | Species | Number of Reads | Number of Bases | File Size (MB) |
---|---|---|---|---|
SRR027474 | Marine metagenome | 28,109 | 3,580,544 | 9.2 |
SRR396942 | Homo sapiens | 1,199,786 | 250,755,274 | 602 |
SRR824063 | Caenorhabditis elegans | 711,156 | 142,231,200 | 348 |
SRR824065 | Caenorhabditis elegans | 64,492 | 12,898,400 | 32 |
SRR932018 | Clostridium symbiosum | 169,457 | 8,472,850 | 27 |
Table 2. Parameter settings for SaNSDE-DSCG optimization.
Parameter | Population Size | Dimension | Range | ε | α | FEs |
---|---|---|---|---|---|---|
Value | M | N | (0, \|S\|) | 0.1 × N | 50 | 1E+4 |
Table 3. Compression performance on experimental NGS data sets.
Method | Metric | SRR027474 | SRR396942 | SRR824063 | SRR824065 | SRR932018 |
---|---|---|---|---|---|---|
RLE | CR (%) | 38.95 | 60.52 | 54.97 | 47.27 | 64.29 |
RLE | BPQ | 3.11 | 4.84 | 4.40 | 3.78 | 5.14 |
Huffman | CR (%) | 53.22 | 60.83 | 42.35 | 58.12 | 49.30 |
Huffman | BPQ | 4.26 | 4.87 | 3.39 | 4.65 | 3.94 |
gzip | CR (%) | 22.70 | 35.94 | 30.33 | 26.79 | 30.25 |
gzip | BPQ | 1.82 | 2.88 | 2.43 | 2.14 | 2.42 |
bzip2 | CR (%) | 16.23 | 31.12 | 25.84 | 21.14 | 25.07 |
bzip2 | BPQ | 1.30 | 2.49 | 2.07 | 1.69 | 2.01 |
LZMA | CR (%) | 17.63 | 31.32 | 25.63 | 23.08 | 25.00 |
LZMA | BPQ | 1.41 | 2.51 | 2.05 | 1.85 | 2.00 |
MMQSC | CR (%) | 14.38 | 30.96 | 18.63 | 27.38 | 22.75 |
MMQSC | BPQ | 1.15 | 2.40 | 1.49 | 2.19 | 1.82 |
The results in Table 3 show that the proposed MMQSC obtains the best compression ratio on four of the five data sets (all except SRR824065, where gzip, bzip2, and LZMA do better) and the best average among the compared algorithms. In particular, MMQSC obtains an average compression ratio of 22.82%, i.e., a 77.18% size reduction in the quality score data. The average of 1.81 BPQ is much smaller than the original 8 BPQ in ASCII format. Moreover, the performance of MMQSC remains stable across the experimental data sets, indicating that the algorithm is robust on different types of NGS data.
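The averages quoted above can be recomputed from the MMQSC rows of Table 3. The short check below also converts each compression ratio to bits per quality value, assuming the relation BPQ = 8 × CR / 100 implied by the 8 BPQ ASCII baseline; the variable names are hypothetical.

```python
# MMQSC rows copied from Table 3.
cr = {"SRR027474": 14.38, "SRR396942": 30.96, "SRR824063": 18.63,
      "SRR824065": 27.38, "SRR932018": 22.75}
bpq = {"SRR027474": 1.15, "SRR396942": 2.40, "SRR824063": 1.49,
       "SRR824065": 2.19, "SRR932018": 1.82}

mean_cr = sum(cr.values()) / len(cr)      # average compression ratio (%)
mean_bpq = sum(bpq.values()) / len(bpq)   # average bits per quality value
reduction = 100.0 - mean_cr               # size reduction (%)

# Data sets whose listed BPQ is not within 0.01 of 8 * CR / 100.
mismatches = [name for name in cr
              if abs(8.0 * cr[name] / 100.0 - bpq[name]) > 0.01]

print(round(mean_cr, 2), round(mean_bpq, 2), round(reduction, 2))
print(mismatches)
```

Under that assumed relation, four of the five listed BPQ values agree with their compression ratios; for SRR396942, 8 × 30.96% is about 2.48 against the listed 2.40, a gap the source does not explain.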
Conclusions
This paper presents MMQSC, an MA based compression algorithm for NGS quality scores. MMQSC uses Huffman coding to preprocess raw quality score sequences stored in FASTQ format. To obtain higher performance, a SaNSDE and DSCG based MA is proposed to optimize the compression codebook design. The Levenshtein distance is used in fitness evaluations to estimate the encoded data size, which improves computation speed. After the codebook optimization, a DP based matching algorithm is run to obtain accurate symbol differences information. This information, together with the optimized codebook, is used to compress the input quality score data. Experimental results on five NGS data sets show that the proposed MMQSC obtains a better compression ratio than counterpart state-of-the-art algorithms. In particular, MMQSC reduces the storage space by about 77% on the experimental data sets.
Declarations
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China Joint Fund with Guangdong, under Key Project U1201256, the National Natural Science Foundation of China, under grant 61171125, in part by the NSFC-RS joint project under Grants IE111069 and 61211130120, in part by the Guangdong Foundation of Outstanding Young Teachers in Higher Education Institutions under grant Yq2013141, in part by the Guangdong Natural Science Foundation, under grants S2012010009545, in part by Science Foundation of Shenzhen City, under Grants JCYJ20130329115450637, ZYC201105170243A and KQC201108300045A, and in part by Program for New Century Excellent Talents in University, under grant 004549.
Declarations
Publication charges for this article have been funded by the National Natural Science Foundation of China (61171125).
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 15, 2014: Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S15.
References
- You ZH, Yin Z, Han K, Huang DS, Zhou XB: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics. 2010, 11: 343. 10.1186/1471-2105-11-343.
- Bonfield JK, Mahoney MV: Compression of FASTQ and SAM format sequencing data. PloS One. 2013, 8: 59190. 10.1371/journal.pone.0059190.
- Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics. 2010, 11: 473-483. 10.1093/bib/bbq015.
- Cox AJ, Bauer MJ, Jakobi T, Rosone G: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012, 28: 1415-1419. 10.1093/bioinformatics/bts173.
- Jones DC, Ruzzo WL, Peng X, Katze MG: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research. 2012, 40: 171. 10.1093/nar/gks754.
- Popitsch N, von Haeseler A: NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research. 2013, 41: 27. 10.1093/nar/gks939.
- Hach F, Numanagic I, Alkan C, Sahinalp SC: SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. 2012, 28: 3051-3057. 10.1093/bioinformatics/bts593.
- Zhu Z, Zhang Y, Ji Z, He S, Yang X: High-throughput DNA sequence data compression. Briefings in Bioinformatics. 2013, bbt087.
- Giancarlo R, Rombo SE, Utro F: Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Briefings in Bioinformatics. 2013, bbt088.
- Janin L, Rosone G, Cox AJ: Adaptive reference-free compression of sequence quality scores. arXiv Preprint. 2013, arXiv:1305.0159.
- Moscato P, Cotta C, Mendes A: Memetic algorithms. New Optimization Techniques in Engineering. 2004, New York: Springer, 53-85.
- Yang Z, Tang K, Yao X: Self-adaptive differential evolution with neighborhood search. Proceedings of IEEE Congress on Evolutionary Computation: 1-6 June 2008. 2008, Hong Kong, 1110-1116.
- Nguyen QH, Ong YS, Lim MH: A probabilistic memetic framework. IEEE Transactions on Evolutionary Computation. 2009, 13: 604-623.
- Singh G, Deb K: Comparison of multi-modal optimization algorithms based on evolutionary algorithms. Proceedings of Genetic and Evolutionary Computation Conference: 8-12 July 2006. 2006, Seattle, 1305-1312.
- Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research. 2010, 38: 1767-1771. 10.1093/nar/gkp1137.
- Dawkins R: The Selfish Gene. 2006, UK: Oxford University Press.
- Huang DS, Du JX: A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks. IEEE Transactions on Neural Networks. 2008, 19: 2099-2115.
- Chen X, Ong YS, Lim MH, Tan KC: A multi-facet survey on memetic computation. IEEE Transactions on Evolutionary Computation. 2011, 15: 591-607.
- Storn R, Price K: Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization. 1997, 11: 341-359. 10.1023/A:1008202821328.
- Sareni B, Krahenbuhl L: Fitness sharing and niching methods revisited. IEEE Transactions on Evolutionary Computation. 1998, 2: 97-106. 10.1109/4235.735432.
- Gooskens C, Heeringa W: Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change. 2004, 16: 189-207.
- Huang DS: Radial basis probabilistic neural networks: model and application. International Journal of Pattern Recognition and Artificial Intelligence. 1999, 13: 1083-1101. 10.1142/S0218001499000604.
- Leinonen R, Sugawara H, Shumway M: The sequence read archive. Nucleic Acids Research. 2011, 39 (suppl 1): 19-21.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.