
Table 1. Compression features obtained for the three high-coverage WGS datasets with several compression tools. Total compression ratio is the compression ratio (original size / compressed size) of the whole FASTQ file (header, sequence and quality combined).

From: Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

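| Method | Total ratio | Header ratio | Base ratio | Quality ratio | Compression time (s) | Compression mem. (MB) | Decompression time (s) | Decompression mem. (MB) |
|---|---|---|---|---|---|---|---|---|
| SRR959239 - WGS E. coli - 1.4 GB - 116x | | | | | | | | |
| gzip | 3.9 | - | - | - | 179 | 1 | 13 | 1 |
| dsrc-lossy | 7.6 | - | - | - | 9 | 1942 | 13 | 1998 |
| fqzcomp-lossy | 17.9 | 35.2 | 12.0 | 19.6 | 73 | 4171 | 74 | 4160 |
| fastqz-lossy | 13.4 | 40.8 | 14.1 | 8.7 | 255 | 1375 | 298 | 1375 |
| leon-lossy | 30.9 | 45.1 | 17.5 | 59.3 | 39 | 353 | 33 | 205 |
| scalce-lossy | 9.8 | 21.4 | 8.3 | 9.2 | 62 | 2012 | 35 | 2012 |
| quip | 8.4 | 29.8 | 8.5 | 5.3 | 244 | 1008 | 232 | 823 |
| mince | - | - | 16.7 | - | 77# | 1812 | 19# | 242 |
| orcom* | - | - | 34.3* | - | 10# | 2243 | 15# | 197 |
| SRR065390 - WGS C. elegans - 17 GB - 70x | | | | | | | | |
| gzip | 3.8 | - | - | - | 2145 | 1 | 165 | - |
| dsrc-lossy | 7.9 | - | - | - | 67 | 5039 | 85 | 5749 |
| fqzcomp-lossy | 12.8 | 54.2 | 7.6 | 15.0 | 952 | 4169 | 1048 | 4159 |
| fastqz-lossy | 10.3 | 61.9 | 7.3 | 8.7 | 2749 | 1527 | 3326 | 1527 |
| leon-lossy | 21.3 | 48.6 | 12.0 | 32.9 | 627 | 1832 | 471 | 419 |
| scalce-lossy | 8.2 | 34.1 | 6.5 | 7.2 | 751.4 | 5309 | 182.3 | 1104 |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 928 | 775 | 968 | 771 |
| mince | - | - | 10.3 | - | 1907# | 21,825 | 387# | 242 |
| orcom* | - | - | 24.2* | - | 113# | 9408 | 184# | 1818 |
| SRR345593/SRR345594 - WGS human - 733 GB - 102x | | | | | | | | |
| gzip | 3.3 | - | - | - | 104,457 | 1 | 9124 | 1 |
| dsrc-lossy | 7.4 | - | - | - | 2797 | 5207 | 3598 | 5914 |
| fqzcomp-lossy | 9.3 | 23.2 | 5.3 | 15.0 | 39,613 | 4169 | 48,889 | 4158 |
| fastqz(a) | - | - | - | - | - | - | - | - |
| leon-lossy | 15.6 | 27.5 | 9.2 | 26.8 | 40,766 | 9556 | 21,262 | 5869 |
| scalce(b) | - | - | - | - | - | - | - | - |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 52,854 | 776 | 46,594 | 775 |
| mince(a) | - | - | - | - | - | - | - | - |
| orcom* | - | - | 19.2* | - | 29,364# | 27,505 | 10,889# | 60,555 |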
  1. The Header, Base and Quality columns give the compression ratio of each individual component, when available. Running time (in s) and peak memory (in MB) are given for compression and decompression. All tools were used without a reference genome
  2. (a) Program does not support variable-length sequences
  3. (b) SCALCE was not able to finish on the large WGS human dataset
  4. The -lossy suffix means the method was run in lossy mode for quality score compression
  5. *Stars indicate that the given program changes read order and loses read-pairing information, and thus cannot be directly compared to other tools
  6. Running times marked with # are for the DNA sequence only
  7. Best overall results are in bold
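
The caption defines the compression ratio as original size divided by compressed size. As a minimal sketch of that arithmetic (Python; the function names and all byte counts are hypothetical and are not taken from the table), the snippet below also shows why a total ratio is not a plain average of the header, base and quality ratios: it is the total original size over the total compressed size, so the larger streams weigh more.

```python
def compression_ratio(original_size, compressed_size):
    """Ratio as defined in the caption: original size / compressed size."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def total_ratio(original_sizes, ratios):
    """Combine per-component ratios into a whole-file ratio.

    original_sizes: uncompressed size of each component (hypothetical values).
    ratios: compression ratio achieved on each component.
    """
    # Compressed size of each component is its original size divided by its ratio.
    compressed_total = sum(size / ratios[name] for name, size in original_sizes.items())
    return sum(original_sizes.values()) / compressed_total


# Hypothetical component sizes (arbitrary units), not measurements from the paper.
sizes = {"header": 100, "sequence": 400, "quality": 400}
component_ratios = {"header": 50.0, "sequence": 20.0, "quality": 50.0}

print(compression_ratio(1400, 350))
print(total_ratio(sizes, component_ratios))
```

With these made-up inputs the combined ratio comes out below the simple mean of the three component ratios, because the sequence stream is large and compresses least well.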