GTZ: a fast compression and cloud transmission tool optimized for FASTQ files

Background The dramatic development of DNA sequencing technology is generating real big data, craving for more storage and bandwidth. To speed up data sharing and bring data to computing resource faster and cheaper, it is necessary to develop a compression tool than can support efficient compression and transmission of sequencing data onto the cloud storage. Results This paper presents GTZ, a compression and transmission tool, optimized for FASTQ files. As a reference-free lossless FASTQ compressor, GTZ treats different lines of FASTQ separately, utilizes adaptive context modelling to estimate their characteristic probabilities, and compresses data blocks with arithmetic coding. GTZ can also be used to compress multiple files or directories at once. Furthermore, as a tool to be used in the cloud computing era, it is capable of saving compressed data locally or transmitting data directly into cloud by choice. We evaluated the performance of GTZ on some diverse FASTQ benchmarks. Results show that in most cases, it outperforms many other tools in terms of the compression ratio, speed and stability. Conclusions GTZ is a tool that enables efficient lossless FASTQ data compression and simultaneous data transmission onto to cloud. It emerges as a useful tool for NGS data storage and transmission in the cloud environment. GTZ is freely available online at: https://github.com/Genetalks/gtz. Electronic supplementary material The online version of this article (10.1186/s12859-017-1973-5) contains supplementary material, which is available to authorized users.


Background
Next generation sequencing (NGS) has greatly facilitated the development of genome analyses, which is vital for reaching the goal of precision medicine. Yet the exponential growth of accumulated sequencing data poses serious challenges to the transmission and storage of NGS data. Efficient compression methods provide the possibility to address this increasingly prominent problem.
Previously, general-propose compression tools, such as gzip (http://www.gzip.org/), bzip2 (http://www.bzip.org/) and 7z (www.7-zip.org), have been utilized to compress NGS data. These tools do not take advantage of the characteristics of genome data, such as a small size alphabet and repeated sequences segments, which leaves space for performance optimization. Recently, some specialized compression tools have been developed for NGS data. These tools are either reference-based or reference-free. The main difference lies in whether extra genome sequences are used as references. Reference-based algorithms encode the differences between the target and reference sequences, and consume more memory to improve compression performance. GenCompress [1] and SimGene [2] use various entropy encoders, such as arithmetic, Golomb and Huffman to compress integer values. The values show properties of reads, like starting position, length of reads, etc. A statistical compression method, GReEn [3], uses an adaptive model to estimate probabilities based on the frequencies of characters. The probabilities are then compressed with an arithmetic encoder.
QUIP [4] exploits arithmetic coding associated with models of order-3 and high-order Markov chains in all three parts of FASTQ data. LW-FQZip [5] utilized incremental and run-length-limited encoding schemes to compress the metadata and quality scores, respectively. Reads are pre-processed by a light-weight mapping model and then three components are combined to be compressed by a general-purpose tool, like LZMA. Fqzcomp [6] estimates character probabilities by order-k context modelling and compresses NGS data in FASTQ format with the help of arithmetic coders.
Nevertheless, reference-based algorithms can be inefficient if the similarity between target and reference sequences is low. Therefore, reference-free methods were also proposed to address this problem. Biocompress proposed in [7] is a compression method dedicated to genomic sequences. Its main idea is based on the classical dictionary-based compression method -the Ziv and Lempel [8] compression algorithm. Repeats and palindromes are encoded using the length and the position of their earliest occurrences. As an extension of biocompress [7], biocompress-2 [9] exploits the same scheme, and uses arithmetic coding of order-2 when no significant repetition exists. The DSRC [10] algorithm splits sequences into blocks and compresses them independently with LZ77 [8] and Huffman [11] encoding. It is faster than QUIP both in compression and decompression speed, but inferior to the later in terms of compression ratio. DSRC2 [12], the multithreaded version of DSRC [10], splits the input into three streams for pre-processing. After pre-processing, metadata, reads, and quality scores are compressed separately in DRSC. A boosting algorithm, SCALCE [13], which re-organizes the reads, can outperform other algorithms on most datasets both in the compression ratio and the compression speed.
Nowadays, it is evident that cloud computing has become increasingly important for genomic analyses. However, above-mentioned tools were developed for local usage. Compression has to be completed locally before a data transmission onto the cloud can begin.
AdOC proposed in [14] is a general-propose tool that allows the overlap of compression and communication in the context of a distributed computing environment. It presents a model for transport level compression with dynamic compression level adaptation, which can be used in an environment where resource availability and bandwidth vary unpredictably.
Generally, the compression performances of the universal compression algorithms, such as AdOC, are unsatisfactory for NGS datasets.
In this paper, we present a tool GTZ, it is characterized as a lossless and efficient compression tool to be used jointly with cloud computing for large-scale genomic data analyses: In the remaining of this paper, we will introduce how GTZ works and evaluate its performance on several benchmark datasets using the AWS service.

Methods
GTZ supports efficient compression in parallel, parallel transmission and random fetching. Figure 1 demonstrates the workflow of GTZ processing.
GTZ involves procedures on clients and the cloud end. A client takes the following steps: (1)Read in streams of large data files.
(2)Pre-process the input by dividing data streams into three sub-streams: metadata, base sequence, and quality score. (3)Buffer sub-streams in local memories and assemble them into different types of data blocks with a fixed size. (4)Compress assembled data blocks and their descriptions, and then transmit output blocks into the cloud storage.
On the cloud, the followings steps are executed: (1)Create three types of object-oriented containers (shown in Fig. 2), which define a tree structure. (2)Loop and wait to receive output blocks sent by the client.
(3)Save received output blocks into block containers according to their types. (4)Stop if no more output blocks are received.
We will explain all the steps in further details about processing FASTQ files below: The client reading streams of large data files Raw NGS data files are typically stored in FASTQ format for the convenience of compression. A typical FASTQ file contains four lines per sequence: Line 1 begins with a character '@' followed by a sequence identifier; Line 2 holds the raw sequence composed of A, C, T, and G; line 3 begins with a character '+' and is optionally followed by the same sequence identifier (and any description) again; line 4 holds the corresponding quality scores in ASCII characters for the sequence characters in line 2. An example of a read is given in Table 1.

Data pre-processing
During the second step, a data stream is split into metadata sub-streams, base sequence sub-streams and quality  score sub-streams. (Since uninformative comment lines normally do not provide any useful information for compression, comment streams are omitted during preprocessing.) Three types of date pre-processing controllers buffer sub-streams and save them in data blocks at a fixed size respectively. Afterwards, data blocks with annotations (about numbers of blocks, sizes of blocks and types of streams) are sent to corresponding compression units. Figure 3 demonstrates how to pre-process data files with the help of pre-processing controllers and compression units.
Statistical modelling can be categorized into two types: static and adaptive statistical modelling. Conventional methods are normally static, which means probabilities are calculated after sequences are scanned from the beginning to end. A static modelling keeps a static table that records character-frequency counts. Although they produce relatively accurate results, the drawbacks are obvious: 1. It is time-consuming to read all the sequences into main memory before compression. 2. If an input stream does not match well with the previously accumulated sequence, the compression ratio will be degraded, even the output stream will become larger than the input stream.
In GTZ, we employ an adaptive statistical data compression technique based on context modelling. An adaptive modeling needs not to scan the whole sequence and generate probabilities before coding. Instead, the adaptive prediction technology provides on-the-fly reading and compression, that is probabilities are calculated based on the characters already read into the memory. Probabilities may alter with more characters scanned. Initially, the performance of adaptive statistical modelling may be poor due to the lack of reads. However, with more sequences processed, the prediction tends to be more accurate.
Every time the compressor encodes a character, it will update the counter in the prediction table. When a new character X (suppose the sequence before X is ABCD) comes, GTZ will traverse the prediction table, find every character that has followed ABCD before, and compare their appearance frequencies. For instance, if both ABCDX appears 10 times, and ABCDY only once. Then GTZ will assign a higher probability for X.
The work flow of an adaptive model is depicted in Fig. 4. The box 'Update model' means converting low-order modellings to high-order modellings (the meaning of low-order and high-order will be discussed in the next subsection.).
Adaptive prediction modelling can effectively reduce compression time. There is no need to read all sequences in a time and it introduces overlap of scanning and compression.
GTZ utilizes specific compression units for different kinds of data blocks: a low-order encoder for genetic sequences, a multi-order encoder for quality scores and mixed encoders for metadata. Finally, the outputs in this procedure are blocks at a fixed size.
The main idea about arithmetic coding is to convert reads into a floating point ranging from zero to one (precisely greater than or equal to zero and less than one) based on the predictive probabilities of characters. If the statistical modelling estimates every single character accurately for the compressor, we will have high compression performance. On the contrary, a poor prediction may result in expansion of the original sequence, instead of compression. Thus, the performance of a compressor largely relies on the whether the statistical modelling can output nearoptimal predictive probabilities.

A low-order encoder for reads
The simplest implementation of adaptive modeling is order-0. Exactly, it does not consider any context  information, thus this short-sighted modeling can only see the current character and make prediction that is independent of the previous sequences. Similarly, an order-1 encoder makes prediction based on one preceding character. Consequently, the low-order modeling makes little contribution to the performance of compressors. Its main advantage is that it is very memory efficient. Hence, for quality score streams that do not have spatial locality, a low-order modeling is adequate for moderate compression rate. Our tailored low-order encoder for reads is demonstrated in Fig. 5. The first step is to transform sequences with the BWT algorithm. BWT (Burrows-Wheeler transform) rearranges reads into runs of similar characters. In the second step, the zero-order and the first-order prediction model are used to calculate appearance probability of each character. Since a poor probability accuracy contributes to undesirable encoding results, we add interpolation after quantizing the weighted average probability, to reduce prediction errors and improve compression ratios. In the last procedure, the bit arithmetic coding algorithm produces decimals ranging from zero to one as outputs to represent sequences.

A multi-order encoder for quality scores
The statistical modeling needs non-uniform probability distribution for arithmetic algorithms. The high-order modeling enables high probabilities for those characters which appear frequently, and low probabilities for those which appear infrequently. As a result, compared with low-order encoders, higher-order encoders can enhance adaptive modeling.
A high-order modeling considers several characters preceding the current position. It can obtain better compression performance at the expense of more memory usage. Higher-order modeling was less used due to the limited memory capacity, which is no longer a problem anymore.
Without transformation, a multi-order encoder (See Fig. 6) for quality scores includes two procedures: Firstly, to generate probabilities of characters, input stream flows through an expanding character probability prediction model, which is composed of firstorder, second-order, fourth-order, sixth-order prediction models and a matching model. Like a low-order encoder, probabilities of characters undergo weighted averaging, quantization and interpolation to obtain final results. Secondly, we use bit arithmetic coding algorithm for compression.

A hybrid scheme for metadata
For metadata sub-streams, GTZ first uses delimiters (punctuations) to split them into different segments, then uses different ways to process metadata according to their fields: For numbers in an ascending or descending order, we employ incremental encoding to represent the variations of one metadata to its preceding neighbors. For instance, '3458644' will be compressed into 3,1,1,3,-2,-2,0. For continuous identical characters, we exploit run-length limited encoding to show their values and numbers of  repetition. For random numbers with various precisions, we convert their formats by UTF-8 coding without adding a single separator, and then use a low-order encoder for compression. Otherwise, use the low-order encoder to compress metadata.
In conclusion, during this process, sub-streams are fed into a dynamic probability prediction model and an arithmetic encoder, and they are transformed into compressed blocks at a fixed size.

Data transmission
The key objective is to transmit output blocks to a certain cloud storage platform, with annotations about types, sizes, numbers of data blocks.
To note, different types of encoders may lead to inconsistency in compression speed, which can lead to a data pipe blockage. Thus, in our system, the pipe-filter pattern is designed to synchronize input and output speed, e.g., the input flow will be blocked when the speed of input stream is faster than that of the output stream; The pipe will also be blocked when there is no input flow.
Storage at the cloud end -Creating an object-oriented nested container system GTZ creates containers as storage compartments that provide a way to manage instances and store file directories. They are organized in a tree structure. Containers can be nested to represent locations of instances: a root container represents a complete compressed file; a block container includes different types of sub-stream containers where specific instances are stored. The nesting structure is showed in Fig. 2.
A root container represents a FASTQ file and it holds N block containers, each of which includes metadata sub-containers, base sequence sub-containers and quality score sub-containers. A metadata sub-container nests repetitive data blocks, random data blocks, incremental data blocks, etc. Base sequence sub-containers and quality score sub-containers nest 0 instance block to N instance block. Taking base sequences for examples, the 0 to (N-1) output blocks are stored in the 0th block container, and the N to (2 N-1) output blocks are stored in the 1st block container, and so on. This kind of hierarchy allows users to maintain a directory structure to manage compressed files, thereby facilitating random access to specific sequence. Here, we show how to decompress and extract the target files from the compressed archive: in decompression mode, the system will index the start line number n (which is given by users through the command line), then fetch the certain sequence from their according block containers and compress certain (which are also specified by users) lines of the sequence.

Receive data -Receive and store output blocks
Cloud storage platform receives output blocks and descriptive information such as numbers of data blocks, sizes of data blocks, most importantly, the line number of every base sequence within data blocks. The description enables us to directly index certain sequences with line numbers and decode their affiliated blocks rather than extract the whole file. Output blocks are stored in corresponding types of containers.
What is worth noting is that non-FASTQ files can also be compressed and transmitted through GTZ. Additionally, GTZ uses object-oriented programming, it is not restricted to interact with a specific type of cloud storage platform, but applicable to most existing cloud storage platforms, such as the Amazon Web Service and the Alibaba cloud.

Results and discussion
In this section, we conducted experiments on a 32-core AWS R4.8xlarge instance with 244GB of memory to evaluate the performance of GTZ in terms of compression ratio and compression speed. During the experiments, the following points should be noted: (1)Considering that our method is lossless, we exclude methods that allow losses as counterparts. (2)NGS data can be stored in either FASTQ or SAM/ BAM formats, we only take into account tools targeted at FASTQ format files. (3)Comparison will be conducted among the algorithms that do not reorder input sequences.
We carried out tests on 8 publicly accessible FASTQ datasets, which are downloaded from the Sequence Read Archive(SRA) initiated by NCBI and the GCTA competition website (https://tianchi.aliyun.com/mini/challenge.htm#training-profile). To ensure the comprehensiveness of our evaluation, we chose datasets that are heterogeneous: the size of datasets ranges from 556MBs to 202, 631MBs; different species and different types of data were chosen, including DNA reads, one RNA-seq dataset of Homo sapiens, one metagenome dataset and read 2 of NA12878 (the GCTA competition datasets). Different quality score encoding methods, such as Sanger and Illumina 1.8+, are selected to cover different numbers of quality scores in datasets. Quality scores are logarithmically linked to error probabilities, leading to a larger alphabet than meta data and reads, thus encodings with small numbers of quality scores normally contribute to a higher compression performance. Descriptions of the  The best results of all the tools are boldfaced datasets are listed in Table 2. Besides, for comparison, based on a comprehensive literature survey, we selected four state-of-the-art and widely-used lossless compression algorithms, including DSRC2 [12] (the improved version of DSRC [10]), quip [4], LW-FQZip [5], Fqzcomp [6], LFQC [15] and pigz. Among them, LW-FQZip [5], Fqzcomp [15] are representatives of reference-based tools; DSRC2 [12] and quip [4] are reference-free methods; pigz is a generalpurpose tool for compression. All the experimental results are included in Additional file 1.

Evaluation results
We evaluated the performance of different tools by the following related metrics: the compression ratio, the coefficient of variation (CV) of compression ratios, the compression speed, the total time of compression and transmission to cloud storages. Specifically, the compression ratio is defined as follows: According to this definition, a smaller compression ratio represents a more effective compression in terms of size reduction; The coefficient of variation (CV) stands for the extent of variability in relation to the mean and it is defined as the ratio of the standard deviation (SD) divided by the average (avg): A smaller CV reveals better robustness and stability; additionally, GTZ not only performs well in compression on local computers, but also gains satisfactory results in transmission to cloud storages. On local computers, the compression speed is chosen for evaluation, and it can be simply measured by the time used for the compression (for different tools applied on the same data). Under the latter circumstance, the run time of algorithms should be the sum of compression and transmission time, namely, from the start of compression to the completion of transmission onto the cloud.

Compression ratio
Performance evaluation results are demonstrated in Table 3 and the best compression ratio, the best CV, which are the smallest, are boldfaced. Comparative results of CV are shown in Fig. 7.
To note, in Table 3, some fields on datasets NA12878 (read 2, a very large dataset) are filled with "TLE" (Time Limit Exceeded, the threshold is empirically set as 6 h), and  The best results of all the tools are boldfaced some fields of the LFQC tools on the SRR5419422, ERR137269 datasets are filled with "Error" (Cannot decompress after compression, those two datasets represent RNA sequences and metagenomics data respectively). Those "outliers" represent a low robustness (for convenience of CV calculation, we just filter out "TLE" and "Error"). For instance, LFQC [15] yields the best result on 5 out of 8 datasets. However, it got "TLE" on three datasets, which means a poor stability in compression efficiency. In addition, despite the CV of pigz is the lowest, its average compression ratio ranks at the bottom. Moreover, GTZ ranks second with an average compression ratio of 17.86%, and the CV of GTZ is far below that of LFQC [15] (which has the best compression ratio). In summary, GTZ not only maintains a relatively good average compression ratio than most of its counterparts, but also exhibits better stability and robustness when dealing with different datasets.

Compression speed
Results for the compression speed tests are shown in Table 4 and the best results are boldfaced. LFQC [15] and LW-FQZip [5] fail to compress the GCTA dataset NA12878 (read 2) within 6 h(21,600 s, which is empirically set). On datasets SRR5419422 and ERR137269, compressed files generated by LFQC cannot be decompressed, which are considered as errors (possibly because SRR5419422 is a RNA dataset and ERR137269 is a metagenomics dataset). Table 4 reveals that the referencebased methods LW-FQZip [5] and LFQC [15] are very slow on large datasets like NA12878 (read 2). DSRC2 [12], which is the representative of reference-free methods, performs best in terms of the average compression speed. GTZ ranks second in terms of compression time. However, we are mostly interested in the total time of compression and transmission. Under the condition where the data transmission throughput is 10Gb/s (1.25 GB/s at best of AWS settings), we tested and estimated the total time of all tools and the results are listed in Table 5. To note, this is a very optimistic optimization. Here, only GTZ supports data upload while compressing, other tools have to finish compression before submission. We can see the average compression and upload speed of GTZ (269.3 MB/ s) is the highest, DSRC2 comes second with an average speed of 269.1 MB/s. In general, if the input data size is very large, GTZ will be even faster than DSRC2: 7% faster in the case of the SRR125858 dataset (a 50GB dataset).
To note, the upload time are estimated with the maximum bandwidth, while in practice, the upload speed could be much slower than that. To verify this, we carried out a real upload test using the relatively big dataset, SRR125858_2.fastq (about half of the SRR125858 dataset), which is 25GBs in size. The compression ratios of GTZ and DSRC2 happen to be the same on this dataset. It took GTZ 99 s to finish compression and transmission, while it took 122 s for DSRC2. Our optimistic estimation of a fast upload takes only 20.3 s, whereas in practice, it took about 45 s. The details are listed in Table 6.
In Table 7, we present a qualitative performance summary of all tools. The parameters, high, moderate, and low show the comparison between different tools. Compression ratio of a tool is said to be high if it is the best compressor or close to the known best algorithm. GTZ achieves satisfactory results both in compression ratio and compression speed (as well as the total time considering data upload) on tested datasets.

Compression rate on different data sections
The compression rates of GTZ on the three sections of a FASTQ file are reported in Table 8.