
Table 2 Compressed file sizes

From: Compressing DNA sequence databases with coil

| Dataset | FASTA | bz2 | nrdb+bz2 | 7z | PPMdi | coil |
|---|---|---|---|---|---|---|
| ems1 | 23292780 | 5876747 | 5871445 | **4870989** | 5331953 | 4990193 |
| | 23199910 | 5853780 | 5852865 | **4854519** | 5311350 | 4981279 |
| | 23201245 | 5852837 | 5852772 | **4857631** | 5312411 | 4988747 |
| ems2 | 46519702 | 11576074 | 11574420 | **9057588** | 10475531 | 9432789 |
| | 46428669 | 11557030 | 11556573 | **9023980** | 10454826 | 9410376 |
| | 46390115 | 11547516 | 11549117 | **9036246** | 10445740 | 9426594 |
| ems3 | 69631679 | 17211495 | 17205793 | **12922145** | 15537092 | 13607729 |
| | 69647486 | 17212318 | 17208461 | **12907737** | 15543489 | 13592739 |
| | 69715954 | 17231912 | 17225610 | **12920294** | 15558845 | 13623246 |
| ems4 | 92905691 | 22841127 | 22810035 | **16600091** | 20601712 | 17625302 |
| | 93012024 | 22868732 | 22849091 | **16611294** | 20629724 | 17655369 |
| | 92850447 | 22813494 | 22799324 | **16587471** | 20585812 | 17584008 |
| ems5 | 116125238 | 28428297 | 28415051 | **20245345** | 25636473 | 21509065 |
| | 116249077 | 28451622 | 28426520 | **20260429** | 25663621 | 21547174 |
| | 116117128 | 28413464 | 28397742 | **20239745** | 25630456 | 21496207 |
| ems10 | 232365230 | 56136032 | 56054164 | **37932764** | 50662993 | 39774087 |
| | 232226017 | 56101887 | 56085818 | **37910566** | 50643774 | 39711435 |
| | 232230440 | 56099503 | 56030860 | **37871106** | 50622855 | 39685294 |
| ems15 | 348404276 | 83539894 | 83461996 | **55591757** | 75411889 | 56758484 |
| | 348435883 | 83529794 | 83463158 | **55594352** | 75435650 | 56771053 |
| | 348292392 | 83453434 | 83396104 | **55580937** | 75374710 | 56768193 |
| ems20 | 464825178 | 110838776 | 110755872 | 72989089 | 100113255 | **72984372** |
| | 464778933 | 110777795 | 110650470 | **72991749** | 100083039 | 73004561 |
| | 464532828 | 110766213 | 110653180 | **72918789** | 100046482 | 72978434 |
| ems25 | 581105516 | 137940393 | 137814551 | 89636246 | 124600275 | **88816000** |
| | 580758935 | 137898843 | 137748733 | 89647136 | 124521398 | **88829572** |
| | 580693026 | 137884675 | 137756070 | 89594767 | 124526386 | **88745435** |
| ems50 | 1161787240 | 272394718 | 271857439 | 169833915 | 244302747 | **164139098** |
| | 1161908810 | 272481687 | 271896055 | 169824808 | 244355206 | **164069331** |
| | 1161582289 | 272310746 | 271844248 | 169812165 | 244255108 | **164093038** |
| ems75 | 1742471477 | 405262890 | 404293340 | 247835911 | 362403056 | **236517596** |
| | 1742664959 | 405243466 | 404268271 | 247921410 | 362419128 | **236506851** |
| | 1742458336 | 405281768 | 404397179 | 247684455 | 362394809 | **236572552** |
| ems100 | 2323234744 | 533757352 | | 324292321 | 478735224 | **308211386** |
| | 2323234744 | 533757352 | | 324292321 | 478735224 | **308211685** |
| | 2323234744 | 533757352 | | 324292321 | 478735224 | **308211677** |
| ems100* | 2323234744 | | | | | **308212275** |
| rfam_full | 140518668 | 4413613 | | 4113889 | 9504648 | **3995880** |
| | 140518668 | 4413613 | | 4113889 | 9504648 | **3996447** |
| | 140518668 | 4413613 | | 4113889 | 9504648 | **3995925** |

  1. All sizes are in bytes. The FASTA column shows the size of the original uncompressed FASTA file. The smallest file in each row is shown in bold. * This row shows the result of using a version of find_edges optimised for the Pentium 4. nrdb+bz2 failed to compress the ems100 dataset because the size of the FASTA file exceeded 2 GB. All coil runs performed on the rfam_full dataset used the -x option to enable in-order recovery of sequences. nrdb+bz2 was not used with the rfam_full dataset because it is incapable of restoring this order.
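
As a reading aid, the sketch below shows one way to derive compression ratios and the smallest output per row from the table. It uses only the first row of three datasets, copied from the table above. The function names (`compression_ratio`, `smallest_output`) and the constant `TWO_GB` are hypothetical and not part of coil or any other tool listed. Reading the 2-gigabyte limit mentioned in the note as 2**31 bytes is an assumption; the note does not state the exact threshold.

```python
# Sizes in bytes, copied from the first row of three datasets in Table 2.
# None marks a cell that is empty in the table.
ROWS = {
    "ems1": {"FASTA": 23292780, "bz2": 5876747, "nrdb+bz2": 5871445,
             "7z": 4870989, "PPMdi": 5331953, "coil": 4990193},
    "ems100": {"FASTA": 2323234744, "bz2": 533757352, "nrdb+bz2": None,
               "7z": 324292321, "PPMdi": 478735224, "coil": 308211386},
    "rfam_full": {"FASTA": 140518668, "bz2": 4413613, "nrdb+bz2": None,
                  "7z": 4113889, "PPMdi": 9504648, "coil": 3995880},
}

# Assumption: the 2-gigabyte limit is taken to be 2**31 bytes.
TWO_GB = 2**31


def compression_ratio(row, tool):
    """Compressed size divided by the original FASTA size (smaller is better)."""
    return row[tool] / row["FASTA"]


def smallest_output(row):
    """Name of the tool that produced the smallest compressed file in this row."""
    sizes = {tool: size for tool, size in row.items()
             if tool != "FASTA" and size is not None}
    return min(sizes, key=sizes.get)


if __name__ == "__main__":
    for name, row in ROWS.items():
        best = smallest_output(row)
        print(f"{name}: smallest = {best}, "
              f"ratio = {compression_ratio(row, best):.3f}, "
              f"FASTA larger than 2**31 bytes: {row['FASTA'] > TWO_GB}")
```

Skipping empty cells (rather than treating them as zero) keeps a failed or omitted run from being reported as the smallest output.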