
Table 1. Compression features obtained for the three high-coverage WGS datasets with several compression tools. The total compression ratio is the compression ratio (original size / compressed size) of the whole FASTQ file, with header, sequence and quality combined.

From: Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

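| Method | Ratio: Total | Ratio: Header | Ratio: Base | Ratio: Quality | Compression time (s) | Compression mem. (MB) | Decompression time (s) | Decompression mem. (MB) |
|---|---|---|---|---|---|---|---|---|
| SRR959239 - WGS E. coli - 1.4 GB - 116x | | | | | | | | |
| gzip | 3.9 | | | | 179 | 1 | 13 | 1 |
| dsrc-lossy | 7.6 | | | | 9 | 1942 | 13 | 1998 |
| fqzcomp-lossy | 17.9 | 35.2 | 12.0 | 19.6 | 73 | 4171 | 74 | 4160 |
| fastqz-lossy | 13.4 | 40.8 | 14.1 | 8.7 | 255 | 1375 | 298 | 1375 |
| leon-lossy | 30.9 | 45.1 | 17.5 | 59.3 | 39 | 353 | 33 | 205 |
| scalce-lossy | 9.8 | 21.4 | 8.3 | 9.2 | 62 | 2012 | 35 | 2012 |
| quip | 8.4 | 29.8 | 8.5 | 5.3 | 244 | 1008 | 232 | 823 |
| mince | | | 16.7 | | 77# | 1812 | 19# | 242 |
| orcom* | | | 34.3* | | 10# | 2243 | 15# | 197 |
| SRR065390 - WGS C. elegans - 17 GB - 70x | | | | | | | | |
| gzip | 3.8 | | | | 2145 | 1 | 165 | |
| dsrc-lossy | 7.9 | | | | 67 | 5039 | 85 | 5749 |
| fqzcomp-lossy | 12.8 | 54.2 | 7.6 | 15.0 | 952 | 4169 | 1048 | 4159 |
| fastqz-lossy | 10.3 | 61.9 | 7.3 | 8.7 | 2749 | 1527 | 3326 | 1527 |
| leon-lossy | 21.3 | 48.6 | 12.0 | 32.9 | 627 | 1832 | 471 | 419 |
| scalce-lossy | 8.2 | 34.1 | 6.5 | 7.2 | 751.4 | 5309 | 182.3 | 1104 |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 928 | 775 | 968 | 771 |
| mince | | | 10.3 | | 1907# | 21,825 | 387# | 242 |
| orcom* | | | 24.2* | | 113# | 9408 | 184# | 1818 |
| SRR345593/SRR345594 - WGS human - 733 GB - 102x | | | | | | | | |
| gzip | 3.3 | | | | 104,457 | 1 | 9124 | 1 |
| dsrc-lossy | 7.4 | | | | 2797 | 5207 | 3598 | 5914 |
| fqzcomp-lossy | 9.3 | 23.2 | 5.3 | 15.0 | 39,613 | 4169 | 48,889 | 4158 |
| fastqz(a) | | | | | | | | |
| leon-lossy | 15.6 | 27.5 | 9.2 | 26.8 | 40,766 | 9556 | 21,262 | 5869 |
| scalce(b) | | | | | | | | |
| quip | 6.5 | 54.3 | 4.8 | 5.2 | 52,854 | 776 | 46,594 | 775 |
| mince(a) | | | | | | | | |
| orcom* | | | 19.2* | | 29,364# | 27,505 | 10,889# | 60,555 |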

  1. The Header, Base and Quality columns give the ratio for each individual component, when available. Running time (in s) and peak memory (in MB) are given for compression and decompression. All tools were used without a reference genome. Best overall results are in bold
  2. (a) Program does not support variable length sequences
  3. (b) SCALCE was not able to finish on the large WGS human dataset
  4. The -lossy suffix means the method was run in lossy mode for quality score compression
  5. Stars (*) indicate that the given program changes read order and loses read-pairing information, and thus cannot be directly compared to other tools
  6. Running times marked with # are for the DNA sequence only
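The ratio definition in the caption (original size / compressed size) and the per-component breakdown can be illustrated with a minimal Python sketch. It is not how any tool in the table works: it uses the standard-library `zlib` as a stand-in compressor, and the helper names (`compression_ratio`, `split_fastq`, `component_ratios`) and the toy reads are hypothetical. It also assumes plain 4-line FASTQ records.

```python
import zlib


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size, as defined in the caption."""
    if not compressed:
        raise ValueError("compressed data is empty")
    return len(original) / len(compressed)


def split_fastq(text: str) -> dict:
    """Split 4-line FASTQ records into header, base and quality streams."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    streams = {
        "header": lines[0::4],   # '@...' identifier lines
        "base": lines[1::4],     # nucleotide sequences
        "quality": lines[3::4],  # quality strings (the '+' line is skipped)
    }
    return {name: "\n".join(part).encode("ascii") for name, part in streams.items()}


def component_ratios(text: str) -> dict:
    """Ratio for the whole file plus one ratio per component stream."""
    whole = text.encode("ascii")
    ratios = {"total": compression_ratio(whole, zlib.compress(whole, 9))}
    for name, data in split_fastq(text).items():
        ratios[name] = compression_ratio(data, zlib.compress(data, 9))
    return ratios


# Toy, highly repetitive input; real sequencing data compresses far less well.
record = "@read{0}\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n"
toy_fastq = "".join(record.format(i) for i in range(1000))

for name, ratio in component_ratios(toy_fastq).items():
    print(f"{name}: {ratio:.1f}")
```

The per-component ratios in the table come from each tool's own header, sequence and quality streams, so a general-purpose compressor applied to split streams, as here, only mirrors the bookkeeping and not the reported values.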