FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy

Background: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, Big Data technologies are seen as the future for genomic data storage and processing, with MapReduce-Hadoop as the leader. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop; indeed, their deployment there is not exactly immediate. Such a state of the art is problematic.

Results: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, with respect to the use of the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System.

Conclusions: Our methods and the corresponding software substantially contribute to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it is very likely that it can also be applied to FASTA/Q compression methods that will appear in the future.

Availability: The software and the datasets are available at https://github.com/fpalini/fastdoopc

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04063-1.


Apache Hadoop
Apache Hadoop is the most popular framework supporting the MapReduce paradigm. It allows for the execution of distributed computations thanks to the interplay of two architectural components: YARN (Yet Another Resource Negotiator) [8] and HDFS (Hadoop Distributed File System) [7]. YARN manages the lifecycle of a distributed application by keeping track of the resources available on a computing cluster and allocating them for the execution of application tasks, modeled after one of the supported computing paradigms. HDFS is a distributed and block-structured file system designed to run on commodity hardware and able to provide fault tolerance through replication of data.
A basic Hadoop cluster is composed of a single master node and multiple worker nodes. The master node arbitrates the assignment of computational resources to applications to be run on the cluster and maintains an index of all the directories and files stored in the HDFS distributed file system. Moreover, it keeps track of the worker nodes physically storing the HDFS data blocks making up these files. The worker nodes host a set of workers (also called Containers), in charge of running the map and reduce tasks of a MapReduce application, as well as using the local storage to maintain a subset of the HDFS data blocks.
One of the main characteristics of Hadoop is its ability to exploit data-local computing. By this term, we mean the possibility of moving applications closer to the data (rather than vice versa). This greatly reduces network congestion and increases the overall throughput of the system when processing large amounts of data. Moreover, in order to reliably maintain files and to properly balance the load between the different nodes of a cluster, large files are automatically split into smaller HDFS data blocks, which are replicated and spread across different nodes.
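To make the block/replica organization concrete, the following minimal sketch uses the standard HDFS Java API to list the data blocks of a file and the worker nodes hosting their replicas. It is purely illustrative and not part of the software described here; the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path: any large file stored on HDFS.
        Path file = new Path("/data/sample.fastq");
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per HDFS data block of the file.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                b.getOffset(), b.getLength(),
                String.join(",", b.getHosts()));
        }
        fs.close();
    }
}

A scheduler that knows these host lists can place a map task on a node already holding a replica of its input block, which is exactly the data-locality principle described above.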

Specialized Compressors Supported by means of our Splittable Compressor Meta-Codec

Among the many compression algorithms specialized for genomic data [4], DSRC is the only one featuring a splittable Codec among the data compression tools that, based on benchmarking, achieve the best performance when dealing with FASTA/Q files. It represents a robust testbed for our solution because its original implementation has been developed in C++ and its integration within a Java Codec is not trivial to realize.
A DSRC standard compressed file is organized in three parts (see the illustrative parsing sketch after this list).
• Body. It contains a set of compressed data blocks, each of which is compressed, and can be decompressed, independently of the others. The default size of each compressed data block is 10MB.
• Header. It reports the number of compressed data blocks existing in that file, the size of the footer and its relative position inside the file.
• Footer. It reports the size of each compressed data block and the flags used for its compression.
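To make the role of the three parts concrete, the following sketch outlines how a reader could combine them to locate the independently decompressible blocks. It is purely illustrative: the field widths, their order, the little-endian encoding, the header being at offset 0 and the footer position being an absolute byte offset are all assumptions for the example, not the actual DSRC on-disk layout (the per-block compression flags stored in the footer are also ignored here).

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DsrcLayoutSketch {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(args[0], "r")) {
            // Hypothetical header: number of blocks, footer size and
            // footer offset, 8 bytes each. Real DSRC widths may differ.
            ByteBuffer header = read(in, 0, 24);
            long nBlocks      = header.getLong();
            long footerSize   = header.getLong();
            long footerOffset = header.getLong();

            // Hypothetical footer: the size of each compressed data block.
            ByteBuffer footer = read(in, footerOffset, (int) footerSize);
            long offset = 24; // body assumed to start right after the header
            for (long i = 0; i < nBlocks; i++) {
                long blockSize = footer.getLong();
                // (offset, blockSize) identifies one independently
                // decompressible block: exactly the information a
                // splittable Codec needs to align input splits with
                // block boundaries.
                System.out.printf("block %d: offset=%d size=%d%n",
                                  i, offset, blockSize);
                offset += blockSize;
            }
        }
    }

    private static ByteBuffer read(RandomAccessFile in, long pos, int len)
            throws IOException {
        byte[] buf = new byte[len];
        in.seek(pos);
        in.readFully(buf);
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
    }
}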

Implementation details
The special-purpose Codec supporting DSRC, HS DSRC, has been obtained following our Splittable Compressor Meta-Codec, as described in Section 2.3 of the Main Manuscript. It required the development of two Java classes: DSRCInputFormat and DSRCCodec. In particular, DSRCCodec uses the JNI framework [5] to load into memory and instantiate the dynamic library containing the native DSRC implementation. Then, it uses the DSRCInputFormat class to extract the information regarding the DSRC parameters and the list of compressed data blocks, according to the DSRC format. In addition, this class initializes the CodecInputStream object pointing to the file to be decompressed during the execution of a job. Finally, it runs the decompress method of NativeCodecDecompressor on each compressed data block to obtain its decompressed version.
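The fragment below sketches the general shape of such a class. It is a simplified illustration, not the actual HS DSRC source: the native library name and the native method signature are placeholders invented for the example, error handling is omitted, and the real Codec implements the full Hadoop compression Codec interfaces.

// Illustrative fragment only.
public class DSRCCodec {

    static {
        // JNI: load the dynamic library wrapping the native C++ DSRC
        // implementation. The library name is a placeholder.
        System.loadLibrary("dsrcjni");
    }

    // Hypothetical native entry point exposed by the JNI wrapper:
    // decompresses one DSRC data block into a newly allocated array.
    private static native byte[] decompressBlock(byte[] compressed);

    // Decompress every block listed by DSRCInputFormat (sketch).
    public void decompressAll(java.util.List<byte[]> compressedBlocks,
                              java.io.OutputStream out)
            throws java.io.IOException {
        for (byte[] block : compressedBlocks) {
            // Each DSRC block is self-contained, so blocks can be
            // handled independently -- the property that makes the
            // format splittable.
            out.write(decompressBlock(block));
        }
    }
}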

Specialized Compressors Supported by means of our Universal Compressor Meta-Codec
In this section, we provide details about the work done to incorporate into Hadoop the specialized compressors reported in Section 2.5.1 of the Main Manuscript, using our Universal Compressor Meta-Codec.
For each compressor, the only step required to support it is the definition of a set of properties stating the supported input file types and the command lines required for compressing and decompressing a generic input file. Let X be the unique name denoting the compressor to be supported and F the file being processed; the following command-line properties are available for its integration:
• uc.X.compress.cmd: the command line to be used for compressing F using X.
• uc.X.decompress.cmd: the command line to be used for decompressing F using X.
• uc.X.io.input.flag: the command-line flag used to specify the input filename.
• uc.X.io.output.flag: the command-line flag used to specify the output filename.
• uc.X.compress.ext: the extension used by X for saving a compressed copy of F.
• uc.X.decompress.ext: the extension used by X for saving a decompressed copy of F ("fastq" by default).
• uc.X.io.reverse: set to true if X requires the output file name to be specified before the input file name; false, otherwise.
Table 1 reports the command lines used for integrating the target specialized compressors by means of our Universal Compressor Meta-Codec.
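As a concrete illustration, assume a hypothetical command-line compressor named mycomp, invoked as "mycomp -c -i IN -o OUT" for compression and "mycomp -d -i IN -o OUT" for decompression, and producing files with extension .mc. The compressor name, its flags and its extension are invented for this example; only the uc.* property names are the ones defined above. Its integration would then reduce to setting the following properties, e.g., in a Java driver:

import org.apache.hadoop.conf.Configuration;

public class MycompConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // X = "mycomp": a hypothetical compressor used for illustration.
        conf.set("uc.mycomp.compress.cmd", "mycomp -c");
        conf.set("uc.mycomp.decompress.cmd", "mycomp -d");
        conf.set("uc.mycomp.io.input.flag", "-i");
        conf.set("uc.mycomp.io.output.flag", "-o");
        conf.set("uc.mycomp.compress.ext", "mc");
        conf.set("uc.mycomp.decompress.ext", "fastq");
        // mycomp expects the input flag before the output flag,
        // so no reversal of the argument order is needed.
        conf.set("uc.mycomp.io.reverse", "false");
        return conf;
    }
}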

Datasets
For our experiments we considered two different types of datasets.
The first type of dataset, referred to as type 1 datasets, is a collection of FASTQ and FASTA files of different sizes. The FASTA files of these datasets contain a set of reads extracted uniformly at random from a collection of genomic sequences coming from the Human genome [1]. The FASTQ files of these datasets contain a set of reads extracted uniformly at random from a collection of genomic sequences coming from the Pinus taeda genome [9]. Details about these datasets are reported in Table 2 and Table 3.

The second type of dataset, referred to as type 2 datasets, is a collection of FASTQ files corresponding to different coverages of the H. sapiens genome. It has been assembled using the same methodology and the same input FASTQ files considered in [2] for their experiments: that is, several input FASTQ files with a known coverage are concatenated to obtain a higher coverage.
Namely, the hsapiens1 dataset (coverage 1.6x) has been obtained by concatenating the SRR062634_1.fastq and SRR062634_2.fastq files. The hsapiens2 dataset (coverage 14.4x) has been obtained by concatenating the ERP174324_1.fastq and ERP174324_2.fastq files. The hsapiens3 dataset (coverage 26.6x) has been obtained by concatenating the NA12878-Rep-1_S1_L001_R1_001.fastq and NA12878-Rep-1_S1_L001_R1_002.fastq files. The only difference with respect to the methodology used in [2] is that we did not have to first trim the input sequences, because our HS and HU Codecs support variable-length reads.
Details about these datasets are reported in Table 4.

Assessing the invariance of the compression properties of codecs executed via our HU and HS Codecs
In order to prove that our HU and HS Codecs do not change in any way the compression properties of the compressors we import, we perform the following experiment. We create two 128MB input files (equivalent to a single HDFS block in our Hadoop installation) by extracting the corresponding number of bytes from the initial part of the 16GB FASTA file and of the 16GB FASTQ file of our type 1 datasets. Then, we compress the resulting files using each of the considered FASTA/Q specialized compression codecs, in their original form, as well as the same codecs as imported in our HU Codec and in our HS Codec (when available). At this point, we check whether the two compressed files are identical. A simple but effective way to perform this check is to compare the MD5 hashes of the two files. MD5 is a hash function used in cryptography to produce a 128-bit message digest, so that the probability for two different files to generate the same hash is extremely low (see [6]). So, we assessed that the compressed file returned by each codec was identical to the one returned by the same codec encapsulated in our HU Codec and our HS Codec by comparing the corresponding MD5 hashes, as reported in Tables 5-6.

Table 6: MD5 hashes of the compressed files obtained by executing compression codecs either as stand-alone methods or as encapsulated in our HU and HS Codecs using, as input, the first 128MB of the 16GB FASTQ file of our type 1 datasets.
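For completeness, the comparison step described above amounts to the following minimal Java sketch; the two input file names are placeholders standing for the output of a stand-alone codec and of the same codec encapsulated in our HU (or HS) Codec.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Compare {
    // Returns the MD5 digest of a file as a hexadecimal string.
    static String md5(String path)
            throws IOException, NoSuchAlgorithmException {
        byte[] data = Files.readAllBytes(Paths.get(path));
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names for the two compressed outputs.
        String standalone = md5("block.standalone.dsrc");
        String wrapped    = md5("block.hu.dsrc");
        System.out.println(standalone.equals(wrapped)
            ? "identical" : "different");
    }
}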