GTZ: a fast compression and cloud transmission tool optimized for FASTQ files
- Yuting Xing†1,
- Gen Li†2,
- Zhenguo Wang2,
- Bolun Feng2,
- Zhuo Song2 (corresponding author) and
- Chengkun Wu1 (corresponding author)
© The Author(s). 2017
Published: 28 December 2017
The dramatic development of DNA sequencing technology is generating truly big data, which demands more storage and bandwidth. To speed up data sharing and bring data to computing resources faster and more cheaply, it is necessary to develop a compression tool that can support efficient compression and transmission of sequencing data onto cloud storage.
This paper presents GTZ, a compression and transmission tool optimized for FASTQ files. As a reference-free lossless FASTQ compressor, GTZ treats different lines of FASTQ separately, utilizes adaptive context modelling to estimate their characteristic probabilities, and compresses data blocks with arithmetic coding. GTZ can also be used to compress multiple files or directories at once. Furthermore, as a tool for the cloud computing era, it can either save compressed data locally or transmit it directly to the cloud. We evaluated the performance of GTZ on diverse FASTQ benchmarks. Results show that in most cases, it outperforms many other tools in terms of compression ratio, speed and stability.
GTZ is a tool that enables efficient lossless FASTQ data compression and simultaneous data transmission onto the cloud. It emerges as a useful tool for NGS data storage and transmission in the cloud environment. GTZ is freely available online at: https://github.com/Genetalks/gtz.
Next generation sequencing (NGS) has greatly facilitated the development of genome analyses, which is vital for reaching the goal of precision medicine. Yet the exponential growth of accumulated sequencing data poses serious challenges to the transmission and storage of NGS data. Efficient compression methods provide the possibility to address this increasingly prominent problem.
Previously, general-purpose compression tools, such as gzip (http://www.gzip.org/), bzip2 (http://www.bzip.org/) and 7z (www.7-zip.org), have been utilized to compress NGS data. These tools do not take advantage of the characteristics of genome data, such as a small alphabet and repeated sequence segments, which leaves room for performance optimization. Recently, some specialized compression tools have been developed for NGS data. These tools are either reference-based or reference-free. The main difference lies in whether extra genome sequences are used as references. Reference-based algorithms encode the differences between the target and reference sequences, and consume more memory to improve compression performance. GenCompress and SlimGene use various entropy encoders, such as arithmetic, Golomb and Huffman coders, to compress integer values. These values represent properties of reads, such as starting position and read length. A statistical compression method, GReEn, uses an adaptive model to estimate probabilities based on the frequencies of characters. The probabilities are then compressed with an arithmetic encoder. QUIP exploits arithmetic coding associated with models of order-3 and high-order Markov chains in all three parts of FASTQ data. LW-FQZip utilizes incremental and run-length-limited encoding schemes to compress the metadata and quality scores, respectively. Reads are pre-processed by a light-weight mapping model, and then the three components are combined and compressed by a general-purpose tool such as LZMA. Fqzcomp estimates character probabilities by order-k context modelling and compresses NGS data in FASTQ format with the help of arithmetic coders.
Nevertheless, reference-based algorithms can be inefficient if the similarity between target and reference sequences is low. Therefore, reference-free methods were also proposed to address this problem. Biocompress is a compression method dedicated to genomic sequences. Its main idea is based on a classical dictionary-based compression method, the Ziv and Lempel compression algorithm. Repeats and palindromes are encoded using the length and the position of their earliest occurrences. As an extension of biocompress, biocompress-2 exploits the same scheme, and uses order-2 arithmetic coding when no significant repetition exists. The DSRC algorithm splits sequences into blocks and compresses them independently with LZ77 and Huffman encoding. It is faster than QUIP in both compression and decompression speed, but inferior to the latter in terms of compression ratio. DSRC2, the multithreaded version of DSRC, splits the input into three streams for pre-processing. After pre-processing, metadata, reads, and quality scores are compressed separately in DSRC. A boosting algorithm, SCALCE, which re-organizes the reads, can outperform other algorithms on most datasets in both compression ratio and compression speed.
Nowadays, it is evident that cloud computing has become increasingly important for genomic analyses. However, above-mentioned tools were developed for local usage. Compression has to be completed locally before a data transmission onto the cloud can begin.
AdOC is a general-purpose tool that allows the overlap of compression and communication in the context of a distributed computing environment. It presents a model for transport-level compression with dynamic compression level adaptation, which can be used in an environment where resource availability and bandwidth vary unpredictably.
Generally, the compression performances of the universal compression algorithms, such as AdOC, are unsatisfactory for NGS datasets.
GTZ exploits context modelling technology combined with multiple prediction modelling schemes. It employs parallel processing to improve the compression speed.
GTZ can compress directories or folders into a single archive, which is called a multi stream file system. The all-in-one scheme can satisfy purposes of transmission, validation and storage.
GTZ supports random access to files or archives. GTZ utilizes block storage, such that users can extract some parts of genome sequences out of a FASTQ file or some files in a folder, without a complete decompression of the compressed archive.
GTZ can transfer compressed blocks to cloud storage while compression is still in progress, which is a novel feature compared with other compression tools. Overlapping the two steps can greatly reduce the total time needed for compression and data transmission onto the cloud. For instance, it could compress and transmit a 200 GB FASTQ file to cloud storage such as AWS and Alibaba Cloud within 14 min.
GTZ provides a Python API, through which users can integrate GTZ in their own applications flexibly.
In the remainder of this paper, we introduce how GTZ works and evaluate its performance on several benchmark datasets using the AWS service.
GTZ involves procedures on clients and the cloud end.
Read in streams of large data files.
Pre-process the input by dividing data streams into three sub-streams: metadata, base sequence, and quality score.
Buffer sub-streams in local memories and assemble them into different types of data blocks with a fixed size.
Compress assembled data blocks and their descriptions, and then transmit output blocks into the cloud storage.
Create three types of object-oriented containers (shown in Fig. 2), which define a tree structure.
Loop and wait to receive output blocks sent by the client.
Save received output blocks into block containers according to their types.
Stop if no more output blocks are received.
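The client-side pre-processing step listed above (dividing the input into metadata, base sequence and quality score sub-streams) can be illustrated with a minimal Python sketch. It assumes the common four-line FASTQ record layout (identifier line, bases, a "+" separator line, qualities); the function name `split_fastq` is hypothetical and is not part of GTZ.

```python
from typing import Iterable, List, Tuple


def split_fastq(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split four-line FASTQ records into metadata, base and quality sub-streams."""
    metadata: List[str] = []
    bases: List[str] = []
    qualities: List[str] = []
    record: List[str] = []
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            # Assumed layout: '@' identifier line and '+' separator line.
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {record!r}")
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            metadata.append(header)
            bases.append(seq)
            qualities.append(qual)
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
    return metadata, bases, qualities
```

Because the function consumes an iterable, it can be fed from an open file handle and does not need the whole file in memory at once; buffering the three lists into fixed-size blocks is left out of this sketch.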
We explain each step of FASTQ file processing in more detail below:
The client reading streams of large data files
The format of a FASTQ file
GTZ is a general-purpose compression tool that uses statistical modelling (http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/) and arithmetic coding.
It is time-consuming to read all the sequences into main memory before compression.
If an input stream does not match the previously accumulated sequence well, the compression ratio degrades, and the output stream may even become larger than the input stream.
In GTZ, we employ an adaptive statistical data compression technique based on context modelling. Adaptive modelling does not need to scan the whole sequence and generate probabilities before coding. Instead, adaptive prediction allows on-the-fly reading and compression; that is, probabilities are calculated from the characters already read into memory, and they may change as more characters are scanned. Initially, the performance of adaptive statistical modelling may be poor because few reads have been seen. However, as more sequences are processed, the predictions tend to become more accurate.
Every time the compressor encodes a character, it updates the corresponding counter in the prediction table. When a new character X arrives (suppose the sequence before X is ABCD), GTZ traverses the prediction table, finds every character that has followed ABCD before, and compares their frequencies of appearance. For instance, if ABCDX has appeared 10 times and ABCDY only once, GTZ will assign a higher probability to X.
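The counting scheme in the ABCD example can be sketched in Python as follows. This is an illustration under stated assumptions, not the GTZ implementation: the class name `AdaptiveContextModel`, the fixed context order and the add-one smoothing (which keeps unseen characters from getting zero probability) are all choices made for this sketch.

```python
from collections import defaultdict


class AdaptiveContextModel:
    """Order-k adaptive model: counts which character follows each context."""

    def __init__(self, order: int, alphabet: str) -> None:
        if order < 1:
            raise ValueError("order must be at least 1")
        self.order = order
        self.alphabet = alphabet
        # prediction table: context -> following character -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def probability(self, context: str, symbol: str) -> float:
        """Estimate P(symbol | context) from the counts seen so far."""
        table = self.counts.get(context[-self.order:], {})
        # Add-one smoothing is an assumption of this sketch.
        total = sum(table.values()) + len(self.alphabet)
        return (table.get(symbol, 0) + 1) / total

    def update(self, context: str, symbol: str) -> None:
        """Called after each character is encoded, so the model adapts on the fly."""
        self.counts[context[-self.order:]][symbol] += 1


# The example from the text: ABCDX seen 10 times, ABCDY seen once.
model = AdaptiveContextModel(order=4, alphabet="ABCDXY")
for _ in range(10):
    model.update("ABCD", "X")
model.update("ABCD", "Y")
p_x = model.probability("ABCD", "X")  # 11/17
p_y = model.probability("ABCD", "Y")  # 2/17
```

With these counts the model assigns X a probability of 11/17 and Y a probability of 2/17, so X is predicted as the more likely next character.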
Adaptive prediction modelling can effectively reduce compression time. There is no need to read all sequences at once, and scanning can overlap with compression.
GTZ utilizes specific compression units for different kinds of data blocks: a low-order encoder for genetic sequences, a multi-order encoder for quality scores and mixed encoders for metadata. Finally, the outputs in this procedure are blocks at a fixed size.
The main idea of arithmetic coding is to convert reads into a floating-point number in the range from zero to one (precisely, greater than or equal to zero and less than one) based on the predicted probabilities of characters. If the statistical model estimates every character accurately for the compressor, compression performance will be high. Conversely, poor predictions may result in expansion of the original sequence instead of compression. Thus, the performance of a compressor largely relies on whether the statistical model can output near-optimal predictive probabilities.
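The interval-narrowing idea behind arithmetic coding can be shown with a small Python sketch. It is a textbook illustration, not GTZ's coder: it uses a fixed (non-adaptive) probability table and exact fractions, whereas practical coders use integer arithmetic with renormalisation and an adaptive model. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Tuple

Ranges = Dict[str, Tuple[Fraction, Fraction]]


def cumulative_ranges(probs: Dict[str, Fraction]) -> Ranges:
    """Give each symbol a sub-interval of [0, 1) as wide as its probability."""
    ranges: Ranges = {}
    start = Fraction(0)
    for symbol, p in probs.items():
        ranges[symbol] = (start, start + p)
        start += p
    return ranges


def encode(message: str, probs: Dict[str, Fraction]) -> Tuple[Fraction, Fraction]:
    """Narrow [0, 1) once per symbol; any number in the result identifies the message."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        sym_low, sym_high = ranges[symbol]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def decode(value: Fraction, length: int, probs: Dict[str, Fraction]) -> str:
    """Recover `length` symbols from a number inside the encoded interval."""
    ranges = cumulative_ranges(probs)
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                # Rescale so the next symbol is read from [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


probs = {"A": Fraction(1, 2), "C": Fraction(1, 4), "G": Fraction(1, 8), "T": Fraction(1, 8)}
low, high = encode("AACG", probs)
```

For this table, "AACG" maps to the interval [11/64, 23/128), whose width 1/128 is the product of the four symbol probabilities. Better predictions leave a wider final interval, which takes fewer bits to identify; that is why the quality of the model determines the compression ratio.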
A low-order encoder for reads
The simplest implementation of adaptive modeling is order-0. That is, it does not consider any context information, so this short-sighted model sees only the current character and makes predictions that are independent of the preceding sequence. Similarly, an order-1 encoder makes predictions based on one preceding character. Consequently, low-order modeling contributes little to the performance of compressors. Its main advantage is that it is very memory efficient. Hence, for quality score streams that do not have spatial locality, low-order modeling is adequate for a moderate compression rate.
A multi-order encoder for quality scores
For arithmetic coding to be effective, the statistical model needs to produce a non-uniform probability distribution. High-order modeling assigns high probabilities to characters that appear frequently and low probabilities to those that appear infrequently. As a result, compared with low-order encoders, higher-order encoders can enhance adaptive modeling.
High-order modeling considers several characters preceding the current position. It can obtain better compression performance at the expense of more memory usage. Higher-order modeling used to be less common because of limited memory capacity, which is no longer a problem.
First, to generate the probabilities of characters, the input stream flows through an expanding character probability prediction model, which is composed of first-order, second-order, fourth-order and sixth-order prediction models and a matching model. As in the low-order encoder, the probabilities of characters undergo weighted averaging, quantization and interpolation to obtain the final results. Second, we use a bit arithmetic coding algorithm for compression.
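The weighted-averaging step can be sketched as follows. The text does not say how GTZ chooses or updates its weights, so the fixed weights, the two toy distributions and the function name `mix_predictions` are assumptions of this illustration; the quantization and interpolation steps are omitted.

```python
from typing import Dict, List


def mix_predictions(predictions: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Weighted average of per-model distributions over the same alphabet."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    symbols = predictions[0].keys()
    return {
        s: sum(w * p[s] for w, p in zip(weights, predictions)) / total_weight
        for s in symbols
    }


# Two hypothetical models predicting the next quality character.
order1 = {"I": 0.6, "#": 0.4}
order2 = {"I": 0.9, "#": 0.1}
mixed = mix_predictions([order1, order2], weights=[1.0, 3.0])
```

Here the higher-order model is trusted three times as much as the lower-order one, giving a mixed probability of 0.825 for "I" and 0.175 for "#". The mixed distribution still sums to one, so it can be passed directly to an arithmetic coder.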
A hybrid scheme for metadata
For metadata sub-streams, GTZ first uses delimiters (punctuation) to split them into segments, and then processes each segment according to the type of its field:
For numbers in ascending or descending order, we employ incremental encoding to represent the difference between each value and its preceding neighbor. For instance, ‘3458644’ will be encoded as 3,1,1,3,-2,-2,0.
For runs of identical characters, we exploit run-length limited encoding to record their values and numbers of repetitions.
For random numbers with various precisions, we convert their formats by UTF-8 coding without adding a single separator, and then use a low-order encoder for compression.
For all other fields, we use the low-order encoder to compress the metadata.
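The first two field transforms can be sketched in Python. The incremental encoder reproduces the ‘3458644’ example from the text. The second function is plain run-length encoding with no cap on the run length, shown here as a simplification of the run-length limited scheme; both function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value's difference from its predecessor."""
    return [v - prev for v, prev in zip(values, [0] + values[:-1])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    out, current = [], 0
    for d in deltas:
        current += d
        out.append(current)
    return out


def run_length_encode(text: str) -> List[Tuple[str, int]]:
    """Collapse each run of identical characters into (character, run length)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


digits = [int(c) for c in "3458644"]
deltas = delta_encode(digits)      # [3, 1, 1, 3, -2, -2, 0]
runs = run_length_encode("AAAB")   # [("A", 3), ("B", 1)]
```

Both transforms are lossless and cheap; their benefit is that the transformed values (small differences, short run lists) are more predictable for the entropy coder that follows.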
In conclusion, during this process, sub-streams are fed into a dynamic probability prediction model and an arithmetic encoder, and they are transformed into compressed blocks at a fixed size.
The key objective is to transmit output blocks to a given cloud storage platform, with annotations about the types, sizes and numbers of data blocks.
Note that different types of encoders may compress at different speeds, which can lead to blockage of the data pipe. Thus, in our system, the pipe-filter pattern is designed to synchronize the input and output speeds; e.g., the input flow is blocked when the input stream is faster than the output stream, and the pipe is also blocked when there is no input flow.
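The blocking behaviour of such a pipe can be sketched with a bounded queue from the Python standard library. This is an illustration of the pattern only: `run_pipeline` is a hypothetical name, `zlib.compress` stands in for the GTZ encoders, and error handling inside the worker threads is omitted.

```python
import queue
import threading
import zlib
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of the input stream


def run_pipeline(
    blocks: Iterable[bytes],
    compress: Callable[[bytes], bytes] = zlib.compress,
    capacity: int = 4,
) -> List[bytes]:
    """Connect a reader and a compressor through a bounded pipe."""
    pipe: queue.Queue = queue.Queue(maxsize=capacity)
    outputs: List[bytes] = []

    def producer() -> None:
        for block in blocks:
            pipe.put(block)  # waits while the pipe is full (input faster than output)
        pipe.put(_DONE)

    def consumer() -> None:
        while True:
            item = pipe.get()  # waits while the pipe is empty (no input flow)
            if item is _DONE:
                return
            outputs.append(compress(item))

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return outputs


blocks = [b"ACGT" * 50, b"IIII" * 50, b"@read" * 10]
compressed = run_pipeline(blocks, capacity=1)
```

The `capacity` bound is what provides the synchronization: a fast reader cannot run arbitrarily far ahead of a slow encoder, so memory use stays bounded by the pipe size.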
Storage at the cloud end — Creating an object-oriented nested container system
GTZ creates containers as storage compartments that provide a way to manage instances and store file directories. They are organized in a tree structure. Containers can be nested to represent locations of instances: a root container represents a complete compressed file; a block container includes different types of sub-stream containers where specific instances are stored. The nesting structure is shown in Fig. 2.
A root container represents a FASTQ file and holds N block containers, each of which includes metadata sub-containers, base sequence sub-containers and quality score sub-containers. A metadata sub-container nests repetitive data blocks, random data blocks, incremental data blocks, etc. Base sequence sub-containers and quality score sub-containers nest instance block 0 to instance block N. Taking base sequences as an example, output blocks 0 to (N-1) are stored in the 0th block container, output blocks N to (2N-1) are stored in the 1st block container, and so on.
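The mapping from an output block to its block container follows directly from this numbering. A minimal sketch, with the hypothetical name `locate_block` and N passed as `blocks_per_container`:

```python
from typing import Tuple


def locate_block(block_index: int, blocks_per_container: int) -> Tuple[int, int]:
    """Return (block container index, position inside that container)."""
    if block_index < 0 or blocks_per_container <= 0:
        raise ValueError("block_index must be >= 0 and blocks_per_container > 0")
    # Blocks 0..N-1 go to container 0, blocks N..2N-1 to container 1, and so on.
    return divmod(block_index, blocks_per_container)
```

For example, with N = 4, output block 3 is the last block of container 0 and output block 4 is the first block of container 1.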
This kind of hierarchy allows users to maintain a directory structure to manage compressed files, thereby facilitating random access to specific sequences. Decompression and extraction of target data from the compressed archive work as follows: in decompression mode, the system locates the start line number n (given by the user on the command line), fetches the corresponding sequences from their block containers, and decompresses the requested number of lines (also specified by the user).
Receive data — Receive and store output blocks
The cloud storage platform receives output blocks together with descriptive information such as the number of data blocks, the sizes of data blocks and, most importantly, the line number of every base sequence within each data block. This description enables us to index specific sequences directly by line number and decode only the blocks they belong to, rather than extracting the whole file. Output blocks are stored in containers of the corresponding types.
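A line-number index of this kind supports random access through a binary search. The sketch below assumes a simplified index that records only the first line number of each block, in ascending order; the actual layout of the GTZ block descriptions is not specified here, and `blocks_for_lines` is a hypothetical name.

```python
from bisect import bisect_right
from typing import List, Tuple


def blocks_for_lines(block_first_lines: List[int], start_line: int, line_count: int) -> Tuple[int, int]:
    """Return the inclusive range of block indices that cover the requested lines.

    block_first_lines[i] is the line number of the first sequence stored in
    block i, sorted in ascending order (an assumed index layout).
    """
    if line_count <= 0:
        raise ValueError("line_count must be positive")
    if not block_first_lines or start_line < block_first_lines[0]:
        raise ValueError("start_line is before the first indexed line")
    first = bisect_right(block_first_lines, start_line) - 1
    last = bisect_right(block_first_lines, start_line + line_count - 1) - 1
    return first, last


index = [0, 1000, 2000, 3000]
```

With this index, a request for 10 lines starting at line 1995 touches only blocks 1 and 2, so only those two blocks need to be fetched and decoded.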
It is worth noting that non-FASTQ files can also be compressed and transmitted through GTZ. Additionally, because GTZ uses object-oriented programming, it is not restricted to a specific type of cloud storage platform, but is applicable to most existing cloud storage platforms, such as Amazon Web Services and Alibaba Cloud.
Results and discussion
Considering that our method is lossless, we exclude methods that allow losses as counterparts.
NGS data can be stored in either FASTQ or SAM/BAM formats; we only take into account tools targeted at FASTQ files.
Comparison will be conducted among the algorithms that do not reorder input sequences.
[Table: Descriptions of 8 FASTQ datasets used for performance evaluation. Columns include reference genome size and No. of quality scores in data file; rows include RNA seq (H. sapiens) and NA12878 (read 2).]
We evaluated the performance of the different tools using the following metrics: the compression ratio, the coefficient of variation (CV) of compression ratios, the compression speed, and the total time of compression and transmission to cloud storage. Specifically, the compression ratio is defined as follows:
compression ratio = (compressed file size / original file size) × 100%
According to this definition, a smaller compression ratio represents a more effective compression in terms of size reduction. The coefficient of variation (CV) measures the extent of variability in relation to the mean, and is defined as the ratio of the standard deviation (SD) to the average (avg):
CV = SD / avg
A smaller CV indicates better robustness and stability. In addition, GTZ not only performs well in compression on local computers, but also achieves satisfactory results in transmission to cloud storage. On local computers, the compression speed is chosen for evaluation, and it can be measured simply by the time used for compression (for different tools applied to the same data). In the cloud setting, the run time of an algorithm is the sum of compression and transmission time, namely the time from the start of compression to the completion of transmission onto the cloud.
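Both metrics are simple to compute. The sketch below assumes that the compression ratio is the compressed size divided by the original size, expressed as a percentage (consistent with the statement that a smaller ratio is better), and it uses the population standard deviation, since the text does not say whether the population or sample form was used. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def compression_ratio(compressed_size: int, original_size: int) -> float:
    """Compressed size as a percentage of the original size (smaller is better)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * compressed_size / original_size


def coefficient_of_variation(ratios: Sequence[float]) -> float:
    """SD divided by the average; the population SD is an assumption of this sketch."""
    return pstdev(ratios) / mean(ratios)


ratio = compression_ratio(20, 100)            # 20.0
cv = coefficient_of_variation([10, 20, 30])   # about 0.408
```

A tool whose ratios vary little across datasets gets a CV close to zero, which is how the CV captures stability independently of how good the average ratio is.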
[Table: Compression ratios of different tools on 8 FASTQ datasets. Values are compression ratios (%); rows include NA12878 (read 2).]
Note that in Table 3, some fields for the NA12878 (read 2) dataset, which is very large, are filled with “TLE” (Time Limit Exceeded; the threshold was empirically set to 6 h), and some fields for the LFQC tool on the SRR5419422 and ERR137269 datasets are filled with “Error” (the output could not be decompressed after compression; those two datasets represent RNA sequences and metagenomics data, respectively). These “outliers” indicate low robustness (for convenience of CV calculation, we simply filter out “TLE” and “Error”). For instance, LFQC yields the best result on 5 out of 8 datasets. However, it got “TLE” on three datasets, which indicates poor stability in compression efficiency. In addition, although the CV of pigz is the lowest, its average compression ratio ranks at the bottom. Moreover, GTZ ranks second with an average compression ratio of 17.86%, and the CV of GTZ is far below that of LFQC (which has the best compression ratio). In summary, GTZ not only maintains a better average compression ratio than most of its counterparts, but also exhibits better stability and robustness when dealing with different datasets.
[Table: Compression time of different tools on 8 FASTQ datasets. Values are compression times (s); rows include NA12878 (read 2) and the average speed (MB/s).]
[Table: Total time of different tools on 8 FASTQ datasets with maximum bandwidth. Values are compression time (s) plus the best data upload time; rows include NA12878 (read 2) and the average speed (MB/s).]
[Table: Total time of different tools on the SRR125858_2 dataset in a real test. Columns include compression ratio (%) and total time (s).]
Qualitative performance summary
Compression rate on different data sections
[Table: The compression ratio of GTZ on the three components of FASTQ files. Values are compression ratios (%); rows include NA12878 (read 2).]
The dramatic development of NGS technology has made it challenging to store and transmit genome sequences. Efficient compression tools are feasible solutions to this problem. Therefore, GTZ, an efficient lossless compression tool for cloud computing of FASTQ files, was proposed in this paper. GTZ is the winning solution of the GCTA competition (reports can be found at http://vcbeat.net/35028.html). GTZ integrates context modeling technology with multiple prediction modelling schemes. It also introduces parallel processing for improved and steady compression efficiency. Moreover, it enables random access to specific reads. By virtue of block storage, users can decompress and read only some parts of the genome sequences, without the need for a complete decompression of the original FASTQ file. Another important feature is that it can overlap data transmission with the compression process, which can greatly reduce the total time needed.
We evaluated the performance of GTZ on eight real-world FASTQ datasets and compared it with other state-of-the-art tools. Experimental results validate that GTZ performs well in terms of both compression rate and compression speed and its performance is steady across different datasets. GTZ managed to compress and transfer a 200GB FASTQ file to cloud storages like AWS and Alibaba cloud within 14 min.
For future work, we will investigate how DSRC2, which exhibits a good performance of compression alone, can be optimized for the cloud environment by utilizing data segmentation and the optimization techniques proposed in GTZ.
Publication of this article was funded by the National Natural Science Foundation of China grant (No.31501073, No.81522048, No.81573511), the National Key Research and Development Program (No.2016YFC0905000), and the Genetalks Biotech. Co.,Ltd.
Availability of data and materials
GTZ is freely available at https://github.com/Genetalks/gtz.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 16, 2017: 16th International Conference on Bioinformatics (InCoB 2017): Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-18-supplement-16.
Yuting Xing, Dr. Gen Li and Dr. Chengkun Wu developed the algorithms and drafted the manuscript; they developed the code of GTZ together with Zhenguo Wang and Bolun Feng; Dr. Zhuo Song and Dr. Chengkun Wu proposed the idea of the project, prepared the 8 FASTQ datasets for testing, drafted the discussion and revised the whole manuscript. All the authors have read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Daily K, Rigor P, Christley S, Xie X, Baldi P. Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinformatics. 2010;11:514.
- Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G. Compressing genomic sequence fragments using SLIMGENE. J Comput Biol. 2010;18:401–13.
- Pinho AJ, Pratas D, Garcia SP. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res. 2012;40:e27–7.
- Jones DC, Ruzzo WL, Peng X, Katze MG. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012;40:e171–1.
- Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z. Light-weight reference-based compression of FASTQ data. BMC Bioinformatics. 2015;16:188.
- Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. Gormley M, editor. PLoS One. 2013;8:e59190–10.
- Grumbach S, Tahi F. Compression of DNA sequences. In: I.N.R.I.A; 1994.
- Ziv J, Lempel A. A universal algorithm for sequential data compression. IEEE Trans Inf Theory. 1977;IT-23:337–43.
- Grumbach S, Tahi F. A new challenge for compression algorithms: genetic sequences. Inf Process Manag. 1994;30:875–86.
- Deorowicz S, Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics. 2011;27:860–2.
- Huffman DA. A method for the construction of minimum-redundancy codes. Proc IRE. 1952;40:1098–101.
- Roguski L, Deorowicz S. DSRC 2-industry-oriented compression of FASTQ files. Bioinformatics. 2014;30:2213–5.
- Hach F, Numanagic I, Alkan C, Sahinalp SC. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. 2012;28:3051–7.
- Jeannot E, Knutsson B. Adaptive online data compression. In: Proceedings th IEEE international symposium on high performance distributed computing; 2017. p. 1–10.
- Nicolae M, Pathak S, Rajasekaran S. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics. 2015;31:3276–81.