
Table 1 The rationale behind Compression Efficiency (CE)

From: Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

| Sequence based on 30 SNPs | Rationale | CE, % |
|---|---|---|
| 000000000011111111112222222222 | 10 “0” + 10 “1” + 10 “2” | 38.33 |
| 012012012012012012012012012012 | 10 “012” | 40.00 |
| 001202020022111221100211121200 | Random location of 10 “0”, “1” and “2” | 11.66 |
| 000000000011111111112222222222 | 10 “0” + 10 “1” + 10 “2” replicated 5 times | 75.48 |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 000000000011111111112222222222 | | |
| 012012012012012012012012012012 | 10 “012” replicated 5 times | 76.13 |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 012012012012012012012012012012 | | |
| 001202020022111221100211121200 | Random location of 10 “0”, “1” and “2” replicated 5 times | 67.10 |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | | |
| 001202020022111221100211121200 | 5 different random locations of 10 “0”, “1” and “2” | 40.64 |
| 112220101200102022010110102212 | | |
| 210200221211112120020122001010 | | |
| 120221110000202110122021012210 | | |
| 210010201112220012100101222012 | | |

  1. Compression efficiency (CE) for hypothetical sequences based on 30 SNPs, for one individual (first three rows) or five individuals (last four rows). In every case Shannon’s entropy is identical, 1.585 bits (= -3 × (1/3) × log2(1/3)), because the three SNP calls are equiprobable. Compression efficiency exploits patterns in order as well as in proportion, so it can discriminate data that Shannon’s entropy cannot. The Rationale column is a verbal approximation of the algorithmic complexity. Regular sequences have low algorithmic complexity and high compression efficiency. However, complex, irregular sequences embedded in a homogeneous population can still show a relatively high compression efficiency. Our sliding-window, heterozygosity-corrected compression efficiency approach exploits this combination (the last four rows). In effect, it assesses a region’s sequence entropy at the population level, detecting high co-sharing of even very complex regions.
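
The contrast drawn in the table note can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: CE is taken here to be the percentage reduction in size after compression, and zlib stands in for the compressor, since neither the exact formula nor the compressor behind the tabulated values is given in this section. The function names are hypothetical. The numbers it prints will therefore not reproduce the table, and very short inputs can even give negative values because of the fixed zlib header overhead; only the qualitative ordering is of interest.

```python
import math
import zlib
from collections import Counter

# Sequences copied from the table (30 SNP calls each).
BLOCKS = "000000000011111111112222222222"
CYCLIC = "012012012012012012012012012012"
RANDOM = "001202020022111221100211121200"
FIVE_DIFFERENT = [
    RANDOM,
    "112220101200102022010110102212",
    "210200221211112120020122001010",
    "120221110000202110122021012210",
    "210010201112220012100101222012",
]


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol; depends on proportions only."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def compression_efficiency(seqs: list[str]) -> float:
    """ASSUMED definition: percentage size reduction under zlib.

    One sequence per line, mimicking one individual per row.
    """
    raw = "\n".join(seqs).encode("ascii")
    packed = zlib.compress(raw, 9)
    return 100.0 * (1.0 - len(packed) / len(raw))


if __name__ == "__main__":
    # Entropy cannot tell the three single-individual sequences apart.
    for name, seq in [("blocks", BLOCKS), ("cyclic", CYCLIC), ("random", RANDOM)]:
        print(f"{name:>6}: H = {shannon_entropy(seq):.3f} bits")

    # Compression is sensitive to order and to sharing across individuals.
    print("random x5 (shared):", round(compression_efficiency([RANDOM] * 5), 2))
    print("5 different random:", round(compression_efficiency(FIVE_DIFFERENT), 2))
```

Under these assumptions, the irregular sequence shared by all five individuals compresses far better than five different irregular sequences, even though the per-sequence entropy is the same, which is the population-level effect the note describes.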