BMC Bioinformatics

Table 4 Compression Algorithm Results on Three High-Throughput Data Sets

From: Data structures and compression algorithms for high-throughput sequencing technologies

Field or method	Dataset 1	Dataset 2	Dataset 3
Standalone Methods
Read Length	6,439,584	1,697,990	59,267,219
Chromosome	31,576,860	9,997,062	31,118,531
Strand	6,439,584	1,697,990	31,118,531
# Mismatches	12,382,598	2,499,664	55,624,291
Total	50,399,042	14,194,716	117,861,353
Start Location
MOV^†	121,565,953	44,200,254	787,554,494
EG^†	236,691,716	86,701,276	1,543,990,407
REG^†	10,745,562	26,180,752	76,430,489
Huffman	91,019,189	82,444,521	1,324,964,740
RHuffman	10,311,095	19,066,500	65,905,674
Best Standalone	60,710,137	33,261,216	183,767,027
Combined Methods
(C,S,M) Lookup	64,424,309	33,809,380	158,272,463
REG Indexed^†	12,133,110	32,342,080	144,975,985
Mismatches
Nucleotide	13,917,023	1,307,870	53,441,350
From Start	30,028,807	4,177,576	159,433,004
From End	32,671,455	2,333,372	153,865,294
Total Start	43,945,830	5,485,446	212,874,354
Total End	46,588,478	3,641,242	207,306,644
Combined^†	44,033,309	3,757,400	186,298,126
Best Compression	56,078,940	35,983,322	390,541,330
GenCompress	56,166,419	36,099,480	390,541,330
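The summary rows of the table appear to be sums of the rows above them. The sketch below checks that reading against the printed figures. The relationships it tests are inferred from the numbers, not stated in the table: Total looks like Chromosome + Strand + # Mismatches; Best Standalone looks like Total plus the smallest Start Location row; Best Compression looks like REG Indexed plus the smallest mismatch encoding; and for Dataset 3 only, the final two rows also seem to include Read Length. The helper names `column_sum` and `column_min` are hypothetical and introduced only for this illustration.

```python
# Figures copied from Table 4, one tuple per row: (Dataset 1, Dataset 2, Dataset 3).
rows = {
    "Read Length": (6_439_584, 1_697_990, 59_267_219),
    "Chromosome": (31_576_860, 9_997_062, 31_118_531),
    "Strand": (6_439_584, 1_697_990, 31_118_531),
    "# Mismatches": (12_382_598, 2_499_664, 55_624_291),
    "Total": (50_399_042, 14_194_716, 117_861_353),
    "MOV": (121_565_953, 44_200_254, 787_554_494),
    "EG": (236_691_716, 86_701_276, 1_543_990_407),
    "REG": (10_745_562, 26_180_752, 76_430_489),
    "Huffman": (91_019_189, 82_444_521, 1_324_964_740),
    "RHuffman": (10_311_095, 19_066_500, 65_905_674),
    "Best Standalone": (60_710_137, 33_261_216, 183_767_027),
    "REG Indexed": (12_133_110, 32_342_080, 144_975_985),
    "Nucleotide": (13_917_023, 1_307_870, 53_441_350),
    "From Start": (30_028_807, 4_177_576, 159_433_004),
    "From End": (32_671_455, 2_333_372, 153_865_294),
    "Total Start": (43_945_830, 5_485_446, 212_874_354),
    "Total End": (46_588_478, 3_641_242, 207_306_644),
    "Combined": (44_033_309, 3_757_400, 186_298_126),
    "Best Compression": (56_078_940, 35_983_322, 390_541_330),
    "GenCompress": (56_166_419, 36_099_480, 390_541_330),
}


def column_sum(*names):
    """Add the named rows dataset by dataset (hypothetical helper)."""
    return tuple(sum(col) for col in zip(*(rows[n] for n in names)))


def column_min(*names):
    """Take the smallest of the named rows per dataset (hypothetical helper)."""
    return tuple(min(col) for col in zip(*(rows[n] for n in names)))


# Standalone block: Total = Chromosome + Strand + # Mismatches (Read Length excluded).
assert column_sum("Chromosome", "Strand", "# Mismatches") == rows["Total"]

# Best Standalone = Total + the smallest Start Location encoding (RHuffman in all three).
best_start = column_min("MOV", "EG", "REG", "Huffman", "RHuffman")
best_standalone = tuple(t + s for t, s in zip(rows["Total"], best_start))
assert best_standalone == rows["Best Standalone"]

# Mismatch block: each total is the Nucleotide row plus one position encoding.
assert column_sum("Nucleotide", "From Start") == rows["Total Start"]
assert column_sum("Nucleotide", "From End") == rows["Total End"]

# Best Compression = REG Indexed + the smallest mismatch encoding.
# Assumption inferred from the figures: Dataset 3 also adds its Read Length row.
best_mismatch = column_min("Total Start", "Total End", "Combined")
best = [r + m for r, m in zip(rows["REG Indexed"], best_mismatch)]
best[2] += rows["Read Length"][2]
assert tuple(best) == rows["Best Compression"]

# GenCompress = REG Indexed + Combined, with the same Dataset 3 adjustment.
gen = list(column_sum("REG Indexed", "Combined"))
gen[2] += rows["Read Length"][2]
assert tuple(gen) == rows["GenCompress"]
```

Under these assumptions every summary row reproduces exactly, which suggests the table is internally consistent as printed.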

ISSN: 1471-2105
