
Table 1 Summary of the related works of this paper

From: SparkGC: Spark based genome compression for large collections of genomes

| Year | Name | Methodology | Characteristics | Parallelization |
| --- | --- | --- | --- | --- |
| 2009 | DNAZip [13] | A series of compression techniques (variable integer (VINT), delta positions (DELTA), SNP mapping (DBSNP), k-mer partitioning (KMER)) is applied together to reduce the size of a single genome | The SNP database dbSNP [24] and the mapping results between the reference and target sequences must be supplied as prerequisites, which limits its practicality | Serial |
| 2012 | BlockCompression [25] | The reference and target sequences are divided into fixed-length blocks, and matching is performed between the blocks | A compressed suffix tree is employed to save memory; straightforward approximate matching is used to improve the matching rate | Block processing can be distributed across several CPUs |
| 2013 | FRESCO [17] | A suffix tree is used to index the reference sequence; the base after each exact match is saved as a mutation | Three schemes (selecting a good reference, reference rewriting, and second-order compression) are proposed to improve the compression ratio | Serial |
| 2015 | COGI [18] | COGI transforms the genomic sequences into a bitmap, then applies a rectangular partition coding algorithm to compress the binary image | The reference sequence is selected using techniques based on co-occurrence entropy and multi-scale entropy; compressing multiple sequences is supported, but the compression ratio decreases dramatically | Serial |
| 2015 | GDC2 [26] | GDC2 is developed to compress large collections of genomes; a second-order compression scheme and variable-integer encoding are employed to reduce the size of the compressed files | GDC2 is implemented in a multithreaded fashion; by default it uses 4 threads: 3 for the first-level Ziv–Lempel factoring and 1 for the second-level factoring and arithmetic coding | Multithreaded parallel |
| 2015 | iDoComp [27] | A suffix array is used to index the reference sequence; a greedy matching scheme matches the target sequence against the reference | The suffix array has to be pre-computed and stored on disk before compression | Serial |
| 2016 | NRGC [28] | NRGC uses a score-based placement technique to quantify the differences between genome sequences and thereby find the best position of each target block on the reference blocks | NRGC has strict requirements on the similarity between the reference and target sequences, which makes it prone to compression failure | Serial |
| 2017 | HiRGC [29] | In the pre-processing stage, HiRGC separates the target sequence file into the identifier, the length of each line, the position intervals of lowercase letters and the letter 'N', special letters, and base letters, and then compresses each stream with a scheme suited to its characteristics | The greedy matching scheme generates some suboptimal matching results | Serial |
| 2018 | SCCG [30] | SCCG optimizes the greedy matching scheme of HiRGC by combining it with the segmentation matching used in NRGC: each part of the target sequence is first matched to the corresponding reference segment, which improves the compression ratio | Compression time and memory consumption increase significantly | Serial |
| 2019 | HRCM [21] | HRCM supports both pairwise sequence compression and multiple-sequence compression; when multiple sequences are compressed, an optimized second-order compression scheme is used to improve the compression ratio | HRCM balances compression speed, compression ratio, and robustness well, especially for compressing large collections of genomes | Serial |
| 2020 | memRGC [7] | The bfMEM algorithm [31] is used to reduce compression time and memory usage; memRGC extends the MEMs when there are fewer than two SNPs between them, which improves the compression ratio | INDELs (insertions and deletions) and runs of more than two SNPs are omitted in the approximate matching of memRGC | Multithreaded parallel |
| 2021 | HadoopHRCM [22] | HDFS and the Map/Reduce architecture are employed to improve the compression speed of HRCM | Distributed parallel computing technology is introduced to FASTA compression | Hadoop |
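
The variable-integer (VINT) encoding mentioned for DNAZip and GDC2 can be illustrated with a minimal sketch. This is the generic base-128 varint idea (7 payload bits per byte, high bit as continuation flag), not the exact byte layout of either tool; function names are illustrative.

```python
def vint_encode(n: int) -> bytes:
    """Encode a non-negative integer as a variable-length byte string.
    Each byte carries 7 payload bits; the high bit flags continuation,
    so small values (e.g. short match distances) cost only one byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def vint_decode(data: bytes) -> int:
    """Decode a complete varint byte string back into an integer."""
    value = 0
    for shift, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * shift)
    return value
```

For example, `vint_encode(5)` is a single byte while a fixed 32-bit field would spend four, which is why referential compressors use varints for the many small position deltas they emit.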
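
The greedy matching scheme that several of the tools above build on (iDoComp, HiRGC, SCCG) can be sketched as follows. This is a simplified k-mer-hash version under assumed names, not any tool's actual implementation: the target is factored greedily into `(reference_position, length)` matches, with unmatched bases emitted as literals.

```python
def greedy_match(reference: str, target: str, k: int = 11):
    """Greedily factor `target` into (ref_position, length) matches plus
    literal mismatch characters, using a k-mer hash index of `reference`."""
    # Index every k-mer of the reference by its start positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    out, t = [], 0
    while t < len(target):
        seeds = index.get(target[t:t + k])
        if seeds:
            best_pos, best_len = seeds[0], k
            for r in seeds:  # extend every seed, keep the longest match
                length = k
                while (r + length < len(reference)
                       and t + length < len(target)
                       and reference[r + length] == target[t + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = r, length
            out.append((best_pos, best_len))  # matched segment
            t += best_len
        else:
            out.append(target[t])  # literal mismatch character
            t += 1
    return out
```

The greediness is visible in the loop: it always takes the longest extension available at the current position, which is fast but can miss a globally better factorization, the suboptimality noted for HiRGC in the table above.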