
Table 1 Summary of the related works of this paper

From: SparkGC: Spark based genome compression for large collections of genomes

| Year | Name | Methodology | Characteristics | Parallelization |
| --- | --- | --- | --- | --- |
| 2009 | DNAZip [13] | A series of compression techniques (variable integer (VINT), delta positions (DELTA), SNP mapping (DBSNP), k-mer partitioning (KMER)) is applied together to reduce the size of a single genome | The SNP database dbSNP [24] and the mapping results between the reference and target sequences must be supplied as prerequisites, which limits its practicality | Serial |
| 2012 | BlockCompression [25] | The reference and target sequences are divided into fixed-length blocks, and matching is performed between the blocks | A compressed suffix tree is employed to save memory; straightforward approximate matching is used to improve the matching rate | Block processing can be distributed across several CPUs |
| 2013 | FRESCO [17] | A suffix tree is used to index the reference sequence; the base after each exact match is saved as a mutation | Three schemes (selecting a good reference, reference rewriting, and second-order compression) are proposed to improve the compression ratio | Serial |
| 2015 | COGI [18] | COGI transforms the genomic sequences into a bitmap, then applies a rectangular partition coding algorithm to compress the binary image | The reference sequence is selected using techniques based on co-occurrence entropy and multi-scale entropy; compressing multiple sequences is supported, but the compression ratio decreases dramatically | Serial |
| 2015 | GDC2 [26] | GDC2 is developed to compress large collections of genomes; a second-order compression scheme and variable-integer encoding are employed to reduce the size of the compressed files | GDC2 is implemented in a multithreaded fashion; by default it uses 4 threads: 3 for the first-level Ziv–Lempel factoring and 1 for the second-level factoring and arithmetic coding | Multithreaded parallel |
| 2015 | iDoComp [27] | A suffix array is used to index the reference sequence; a greedy matching scheme matches the target sequence against the reference | The suffix array has to be pre-computed and stored on disk before compression | Serial |
| 2016 | NRGC [28] | NRGC uses a score-based placement technique to quantify the differences between genome sequences and thereby find the best position of each target block on the reference blocks | NRGC has strict requirements on the similarity between the reference and target sequences, which makes it prone to compression failure | Serial |
| 2017 | HiRGC [29] | In the pre-processing stage, HiRGC separates the target sequence file into the identifier, the length of each line, the position intervals of lowercase letters and the letter 'N', special letters, and base letters, and then compresses each stream with a scheme suited to its characteristics | The greedy matching scheme generates some suboptimal matching results | Serial |
| 2018 | SCCG [30] | SCCG optimizes the greedy matching scheme of HiRGC by combining it with the segmentation matching used in NRGC: each part of the target sequence is first matched to the corresponding reference segment, which improves the compression ratio | Compression time and memory consumption increase significantly | Serial |
| 2019 | HRCM [21] | HRCM supports both pairwise sequence compression and multiple-sequence compression; when multiple sequences are compressed, an optimized second-order compression scheme is used to improve the compression ratio | HRCM balances compression speed, compression ratio, and robustness well, especially for compressing large collections of genomes | Serial |
| 2020 | memRGC [7] | The bfMEM algorithm [31] is used to reduce compression time and memory usage; memRGC extends the MEMs when there are fewer than two SNPs between them, which improves the compression ratio | INDELs (insertions and deletions) and runs of more than two SNPs are omitted in the approximate matching of memRGC | Multithreaded parallel |
| 2021 | HadoopHRCM [22] | HDFS and the Map/Reduce architecture are employed to improve the compression speed of HRCM | Distributed parallel computing technology is introduced to FASTA compression | Hadoop |
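
The variable-integer (VINT) encoding mentioned for DNAZip and GDC2 can be illustrated with a minimal sketch. This is the generic base-128 varint idea (7 payload bits per byte, high bit as continuation flag), not the exact byte layout of either tool; function names are illustrative.

```python
def vint_encode(n: int) -> bytes:
    """Encode a non-negative integer as a variable-length byte string.
    Each byte carries 7 payload bits; the high bit flags continuation,
    so small values (e.g. short match distances) cost only one byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def vint_decode(data: bytes) -> int:
    """Decode a complete varint byte string back into an integer."""
    value = 0
    for shift, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * shift)
    return value
```

For example, `vint_encode(5)` is a single byte while a fixed 32-bit field would spend four, which is why referential compressors use varints for the many small position deltas they emit.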
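
The greedy matching scheme that several of the tools above build on (iDoComp, HiRGC, SCCG) can be sketched as follows. This is a simplified k-mer-hash version under assumed names, not any tool's actual implementation: the target is factored greedily into `(reference_position, length)` matches, with unmatched bases emitted as literals.

```python
def greedy_match(reference: str, target: str, k: int = 11):
    """Greedily factor `target` into (ref_position, length) matches plus
    literal mismatch characters, using a k-mer hash index of `reference`."""
    # Index every k-mer of the reference by its start positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    out, t = [], 0
    while t < len(target):
        seeds = index.get(target[t:t + k])
        if seeds:
            best_pos, best_len = seeds[0], k
            for r in seeds:  # extend every seed, keep the longest match
                length = k
                while (r + length < len(reference)
                       and t + length < len(target)
                       and reference[r + length] == target[t + length]):
                    length += 1
                if length > best_len:
                    best_pos, best_len = r, length
            out.append((best_pos, best_len))  # matched segment
            t += best_len
        else:
            out.append(target[t])  # literal mismatch character
            t += 1
    return out
```

The greediness is visible in the loop: it always takes the longest extension available at the current position, which is fast but can miss a globally better factorization, the suboptimality noted for HiRGC in the table above.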