Table 2 Compressed file sizes

From: Compressing DNA sequence databases with coil

Dataset FASTA bz2 nrdb+bz2 7z PPMdi coil
ems1 23292780 5876747 5871445 4870989 5331953 4990193
  23199910 5853780 5852865 4854519 5311350 4981279
  23201245 5852837 5852772 4857631 5312411 4988747
ems2 46519702 11576074 11574420 9057588 10475531 9432789
  46428669 11557030 11556573 9023980 10454826 9410376
  46390115 11547516 11549117 9036246 10445740 9426594
ems3 69631679 17211495 17205793 12922145 15537092 13607729
  69647486 17212318 17208461 12907737 15543489 13592739
  69715954 17231912 17225610 12920294 15558845 13623246
ems4 92905691 22841127 22810035 16600091 20601712 17625302
  93012024 22868732 22849091 16611294 20629724 17655369
  92850447 22813494 22799324 16587471 20585812 17584008
ems5 116125238 28428297 28415051 20245345 25636473 21509065
  116249077 28451622 28426520 20260429 25663621 21547174
  116117128 28413464 28397742 20239745 25630456 21496207
ems10 232365230 56136032 56054164 37932764 50662993 39774087
  232226017 56101887 56085818 37910566 50643774 39711435
  232230440 56099503 56030860 37871106 50622855 39685294
ems15 348404276 83539894 83461996 55591757 75411889 56758484
  348435883 83529794 83463158 55594352 75435650 56771053
  348292392 83453434 83396104 55580937 75374710 56768193
ems20 464825178 110838776 110755872 72989089 100113255 72984372
  464778933 110777795 110650470 72991749 100083039 73004561
  464532828 110766213 110653180 72918789 100046482 72978434
ems25 581105516 137940393 137814551 89636246 124600275 88816000
  580758935 137898843 137748733 89647136 124521398 88829572
  580693026 137884675 137756070 89594767 124526386 88745435
ems50 1161787240 272394718 271857439 169833915 244302747 164139098
  1161908810 272481687 271896055 169824808 244355206 164069331
  1161582289 272310746 271844248 169812165 244255108 164093038
ems75 1742471477 405262890 404293340 247835911 362403056 236517596
  1742664959 405243466 404268271 247921410 362419128 236506851
  1742458336 405281768 404397179 247684455 362394809 236572552
ems100 2323234744 533757352 - 324292321 478735224 308211386
  2323234744 533757352 - 324292321 478735224 308211685
  2323234744 533757352 - 324292321 478735224 308211677
ems100* 2323234744 - - - - 308212275
rfam_full 140518668 4413613 - 4113889 9504648 3995880
  140518668 4413613 - 4113889 9504648 3996447
  140518668 4413613 - 4113889 9504648 3995925
  1. All sizes are in bytes. The FASTA column shows the size of the original uncompressed FASTA file. The smallest file in each row is shown in bold. * This row shows the result of using a version of find_edges optimised for the Pentium 4. nrdb+bz2 failed to compress the ems100 dataset because the size of the FASTA file exceeded 2 GB. All coil runs performed on the rfam_full dataset used the -x option to enable in-order recovery of sequences. nrdb+bz2 was not used with the rfam_full dataset because it cannot restore the original sequence order.
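The byte counts above are easier to compare as compression ratios (FASTA size divided by compressed size). The following is a minimal sketch of that arithmetic, not part of the original article. It uses only the first ems1 row and the first ems100 row of the table, and the names `ROWS`, `compression_ratios`, and `smallest` are hypothetical helpers introduced for illustration.

```python
# Sizes in bytes, copied from the first ems1 row and the first ems100 row of Table 2.
# None marks a cell that is empty in the table (nrdb+bz2 failed on ems100).
ROWS = {
    "ems1": {"FASTA": 23292780, "bz2": 5876747, "nrdb+bz2": 5871445,
             "7z": 4870989, "PPMdi": 5331953, "coil": 4990193},
    "ems100": {"FASTA": 2323234744, "bz2": 533757352, "nrdb+bz2": None,
               "7z": 324292321, "PPMdi": 478735224, "coil": 308211386},
}


def compression_ratios(row):
    """Return {tool: FASTA size / compressed size} for every tool with a result."""
    original = row["FASTA"]
    return {tool: original / size
            for tool, size in row.items()
            if tool != "FASTA" and size is not None}


def smallest(row):
    """Return the name of the tool that produced the smallest file in this row."""
    sizes = {tool: size for tool, size in row.items()
             if tool != "FASTA" and size is not None}
    return min(sizes, key=sizes.get)


if __name__ == "__main__":
    for name, row in ROWS.items():
        ratios = compression_ratios(row)
        summary = ", ".join(f"{tool} {ratio:.2f}x" for tool, ratio in ratios.items())
        print(f"{name}: smallest = {smallest(row)}; {summary}")
    # Check that the ems100 FASTA file is larger than 2**31 bytes (2 GiB),
    # consistent with the footnote's remark about the 2 GB limit.
    print(ROWS["ems100"]["FASTA"] > 2**31)
```

On these two rows the sketch picks 7z as the smallest output for ems1 and coil for ems100, which matches the trend in the table: coil overtakes 7z as the datasets grow.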