Table 4 Compression Algorithm Results on Three High-Throughput Data Sets

From: Data structures and compression algorithms for high-throughput sequencing technologies

 

|  | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---:|---:|---:|
| Standalone Methods |  |  |  |
| Read Length | 6,439,584 | 1,697,990 | 59,267,219 |
| Chromosome | 31,576,860 | 9,997,062 | 31,118,531 |
| Strand | 6,439,584 | 1,697,990 | 31,118,531 |
| # Mismatches | 12,382,598 | 2,499,664 | 55,624,291 |
| Total | 50,399,042 | 14,194,716 | 117,861,353 |
| Start Location |  |  |  |
| MOV† | 121,565,953 | 44,200,254 | 787,554,494 |
| EG† | 236,691,716 | 86,701,276 | 1,543,990,407 |
| REG† | 10,745,562 | 26,180,752 | 76,430,489 |
| Huffman | 91,019,189 | 82,444,521 | 1,324,964,740 |
| RHuffman | 10,311,095 | 19,066,500 | 65,905,674 |
| Best Standalone | 60,710,137 | 33,261,216 | 183,767,027 |
| Combined Methods |  |  |  |
| (C,S,M) Lookup | 64,424,309 | 33,809,380 | 158,272,463 |
| REG Indexed† | 12,133,110 | 32,342,080 | 144,975,985 |
| Mismatches |  |  |  |
| Nucleotide | 13,917,023 | 1,307,870 | 53,441,350 |
| From Start | 30,028,807 | 4,177,576 | 159,433,004 |
| From End | 32,671,455 | 2,333,372 | 153,865,294 |
| Total Start | 43,945,830 | 5,485,446 | 212,874,354 |
| Total End | 46,588,478 | 3,641,242 | 207,306,644 |
| Combined† | 44,033,309 | 3,757,400 | 186,298,126 |
| Best Compression | 56,078,940 | 35,983,322 | 390,541,330 |
| GenCompress | 56,166,419 | 36,099,480 | 390,541,330 |
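
The table does not say how its summary rows are derived. The sketch below checks one reading that is consistent with the printed numbers: "Total" appears to equal Chromosome + Strand + # Mismatches (Read Length is not included), and "Best Standalone" appears to equal that Total plus the smallest Start Location entry. Both relationships are inferred from the values, not stated by the source, and the names `start_location` and `check_summary_rows` are illustrative only.

```python
# Values copied from Table 4, in the order Dataset 1, Dataset 2, Dataset 3.
chromosome = [31_576_860, 9_997_062, 31_118_531]
strand = [6_439_584, 1_697_990, 31_118_531]
mismatches = [12_382_598, 2_499_664, 55_624_291]
total = [50_399_042, 14_194_716, 117_861_353]
best_standalone = [60_710_137, 33_261_216, 183_767_027]

# Start Location rows of the table (standalone methods).
start_location = {
    "MOV": [121_565_953, 44_200_254, 787_554_494],
    "EG": [236_691_716, 86_701_276, 1_543_990_407],
    "REG": [10_745_562, 26_180_752, 76_430_489],
    "Huffman": [91_019_189, 82_444_521, 1_324_964_740],
    "RHuffman": [10_311_095, 19_066_500, 65_905_674],
}


def check_summary_rows():
    """Return, per dataset, whether the two assumed relationships hold."""
    results = []
    for i in range(3):
        # Assumed: Total = Chromosome + Strand + # Mismatches.
        total_ok = chromosome[i] + strand[i] + mismatches[i] == total[i]
        # Assumed: Best Standalone = Total + smallest Start Location entry.
        smallest = min(values[i] for values in start_location.values())
        best_ok = total[i] + smallest == best_standalone[i]
        results.append((total_ok, best_ok))
    return results


if __name__ == "__main__":
    for dataset, (total_ok, best_ok) in enumerate(check_summary_rows(), start=1):
        print(f"Dataset {dataset}: Total matches={total_ok}, Best Standalone matches={best_ok}")
```

Under this reading, both relationships hold for all three datasets. The same check is not attempted for "Best Compression" and "GenCompress", because the table gives no indication of which rows they combine.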