From: SparkGC: Spark based genome compression for large collections of genomes
Year | Name | Methodology | Characteristics | Parallelization |
---|---|---|---|---|
2009 | DNAZip [13] | A series of compression techniques (variable integer coding (VINT), delta positions (DELTA), SNP mapping (DBSNP), k-mer partitioning (KMER)) are combined to reduce the size of a single genome | The SNP database dbSNP [24] and the mapping results between the reference and target sequence have to be supplied as prerequisites, which limits its practicality | Serial |
2012 | BlockCompression [25] | The reference and target sequence are divided into fixed-length blocks, and matching is performed between the blocks | A compressed suffix tree is employed to save memory. Straightforward approximate matching is used to improve the matching rate | Block processing can be distributed over several CPUs |
2013 | FRESCO [17] | A suffix tree is used to index the reference sequence. The base after each exact match is saved as a mutation | Three schemes (selecting a good reference, reference rewriting, and second-order compression) are proposed to improve the compression ratio | Serial |
2015 | COGI [18] | COGI transforms the genomic sequences into a bitmap, then applies a rectangular partition coding algorithm to compress the binary image | The reference sequence is selected using techniques based on co-occurrence entropy and multi-scale entropy. COGI supports compressing multiple sequences, but the compression ratio decreases dramatically | Serial |
2015 | GDC2 [26] | GDC2 is designed to compress large collections of genomes. A second-order compression scheme and a variable integer encoding scheme are employed to reduce the size of the compressed files | GDC2 is implemented in a multithreaded fashion. By default, GDC2 uses 4 threads: 3 for the first-level Ziv–Lempel factoring and 1 for the second-level factoring and arithmetic coding | Multithreaded parallel |
2015 | iDoComp [27] | A suffix array is used to index the reference sequence. A greedy matching scheme is used to match the reference and the target sequence | The suffix array has to be pre-computed and stored on disk before compression | Serial |
2016 | NRGC [28] | NRGC uses a score-based placement technique to quantify the differences between genome sequences, so as to obtain the best position of each target block on the reference blocks | NRGC has strict requirements on the similarity between the reference and target sequence, which makes it prone to compression failure | Serial |
2017 | HiRGC [29] | In the pre-processing stage, HiRGC separates the target sequence file into the identifier, the length of each line, the position intervals of lowercase letters and the letter ‘N’, special letters, and base letters; different compression schemes are then used to compress each part according to its characteristics | The greedy matching scheme generates some suboptimal matching results | Serial |
2018 | SCCG [30] | SCCG optimizes the greedy matching scheme of HiRGC by combining greedy matching with the segmentation matching used in NRGC: the target sequence is first matched to the corresponding reference segment, which improves the compression ratio | The compression time and memory consumption increase significantly | Serial |
2019 | HRCM [21] | HRCM supports both pairwise sequence compression and multi-sequence compression. When multiple sequences are compressed, an optimized second-order compression scheme is used to improve the compression ratio | HRCM balances compression speed, compression ratio, and robustness well, especially for compressing large collections of genomes | Serial |
2020 | memRGC [7] | The bfMEM algorithm [31] is used to save compression time and memory. memRGC extends the MEMs if there are fewer than two SNPs between them, which improves the compression ratio | INDELs (insertions and deletions) and runs of more than two SNPs are omitted in the approximate matching of memRGC | Multithreaded parallel |
2021 | HadoopHRCM [22] | The HDFS and MapReduce architecture is employed to improve the compression speed of HRCM | Distributed parallel computing technology is introduced to FASTA compression | Hadoop |
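The core technique shared by most of the tools above is reference-based matching: the target genome is encoded as (position, length) matches against an indexed reference, with unmatched bases (e.g. SNPs) stored as literals. The sketch below illustrates this idea in a deliberately simplified form, using a k-mer hash index and greedy extension; the function names, the tiny k-mer length, and the list-of-tuples encoding are illustrative assumptions, not the actual implementation of any tool in the table.

```python
# Illustrative sketch of reference-based compression via greedy matching.
# Real tools (HiRGC, iDoComp, etc.) use suffix trees/arrays or large k-mer
# hashes plus entropy coding; this only shows the matching principle.

K = 4  # k-mer length for the index (real tools use a much larger k)

def build_index(reference):
    """Map every k-mer in the reference to the list of its positions."""
    index = {}
    for i in range(len(reference) - K + 1):
        index.setdefault(reference[i:i + K], []).append(i)
    return index

def compress(target, reference):
    """Encode target as (ref_pos, length) matches and literal bases."""
    index = build_index(reference)
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for p in index.get(target[i:i + K], []):
            # Greedily extend this candidate match as far as possible.
            l = 0
            while (p + l < len(reference) and i + l < len(target)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        if best_len >= K:
            out.append((best_pos, best_len))  # matched reference segment
            i += best_len
        else:
            out.append(target[i])  # literal base (mismatch, e.g. a SNP)
            i += 1
    return out

def decompress(encoded, reference):
    """Reconstruct the target from matches and literals."""
    parts = []
    for item in encoded:
        if isinstance(item, tuple):
            p, l = item
            parts.append(reference[p:p + l])
        else:
            parts.append(item)
    return "".join(parts)
```

Because highly similar genomes yield a few long matches plus scattered literals, the encoded stream is far smaller than the raw sequence; second-order schemes (GDC2, HRCM) then compress these match streams again across many genomes.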