Table 1 The rationale behind Compression Efficiency (CE)

From: Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

| Sequence based on 30 SNPs | Rationale | CE, % |
|---|---|---|
| 000000000011111111112222222222 | 10 “0” + 10 “1” + 10 “2” | 38.33 |
| 012012012012012012012012012012 | 10 “012” | 40.00 |
| 001202020022111221100211121200 | Random location of 10 “0”, “1” and “2” | 11.66 |
| 000000000011111111112222222222 | 10 “0” + 10 “1” + 10 “2” replicated 5 times | 75.48 |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 012012012012012012012012012012 | 10 “012” replicated 5 times | 76.13 |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 001202020022111221100211121200 | Random location of 10 “0”, “1” and “2” replicated 5 times | 67.10 |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | 5 different random locations of 10 “0”, “1” and “2” | 40.64 |
| 112220101200102022010110102212 | | |
| 210200221211112120020122001010 | | |
| 120221110000202110122021012210 | | |
| 210010201112220012100101222012 | | |
  1. Compression efficiency for hypothetical sequences based on 30 SNPs for one individual (first three rows) or five individuals (last four rows). In all cases Shannon's entropy is identical, i.e. 1.585 bits (= -3 × (1/3) × log2(1/3)), because the three SNP calls "0", "1" and "2" are equiprobable. Compression efficiency exploits patterns in order as well as in proportion, which allows it to discriminate data that Shannon's entropy cannot. The Rationale column is a verbal approximation of the algorithmic complexity. Regular sequences have a small algorithmic complexity and a high compression efficiency. However, complex, irregular sequences embedded in a homogeneous population can still exhibit a relatively high compression efficiency. Our sliding-window, heterozygosity-corrected compression efficiency approach exploits this combination (the last four rows). In effect, it makes a population-level assessment of a region's sequence entropy, detecting high co-sharing of even very complex regions.
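
The contrast between Shannon's entropy and compression efficiency can be reproduced with a minimal sketch. It rests on several assumptions that are not taken from the article: `zlib` stands in for the compressor, CE is taken as 100 × (1 - compressed size / raw size), the sequences of several individuals are simply concatenated, and the function names are hypothetical. The resulting percentages will therefore not match the CE column above, and for a single 30-character sequence the compressor's fixed overhead can even push the value below zero. Only the ordering of the values is meant to be illustrative.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol, from symbol frequencies only."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def compression_efficiency(seqs) -> float:
    """Assumed definition: percentage of bytes saved by zlib (level 9).

    `seqs` is a list of genotype strings, one per individual; they are
    concatenated before compression (an assumption of this sketch).
    """
    raw = "".join(seqs).encode("ascii")
    packed = zlib.compress(raw, 9)
    return 100.0 * (1.0 - len(packed) / len(raw))


# Sequences copied from the table.
BLOCKS = "0" * 10 + "1" * 10 + "2" * 10
CYCLIC = "012" * 10
RANDOM = "001202020022111221100211121200"
FIVE_RANDOM = [
    "001202020022111221100211121200",
    "112220101200102022010110102212",
    "210200221211112120020122001010",
    "120221110000202110122021012210",
    "210010201112220012100101222012",
]

CASES = {
    "blocks x1": [BLOCKS],
    "cyclic x1": [CYCLIC],
    "random x1": [RANDOM],
    "blocks x5": [BLOCKS] * 5,
    "cyclic x5": [CYCLIC] * 5,
    "random x5 (replicated)": [RANDOM] * 5,
    "random x5 (different)": FIVE_RANDOM,
}

for label, seqs in CASES.items():
    h = shannon_entropy("".join(seqs))
    ce = compression_efficiency(seqs)
    print(f"{label:24s} entropy={h:.3f} bits  CE={ce:6.2f}%")
```

The entropy function sees only the proportions of "0", "1" and "2", so it cannot separate the ordered, cyclic and shuffled single sequences. The compressor responds to order and to repetition across individuals, which is the property the table illustrates.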