Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

Table 1 The rationale behind Compression Efficiency (CE)

Sequence based on 30 SNPs	Rationale	CE (%)
000000000011111111112222222222	10 “0” + 10 “1” + 10 “2”	38.33
012012012012012012012012012012	10 “012”	40.00
001202020022111221100211121200	Random location of 10 “0”, “1” and “2”	11.66
000000000011111111112222222222	10 “0” + 10 “1” + 10 “2” replicated 5 times	75.48
000000000011111111112222222222
000000000011111111112222222222
000000000011111111112222222222
000000000011111111112222222222
012012012012012012012012012012	10 “012” replicated 5 times	76.13
012012012012012012012012012012
012012012012012012012012012012
012012012012012012012012012012
012012012012012012012012012012
001202020022111221100211121200	Random location of 10 “0”, “1” and “2” replicated 5 times	67.10
001202020022111221100211121200
001202020022111221100211121200
001202020022111221100211121200
001202020022111221100211121200
001202020022111221100211121200	5 different random locations of 10 “0”, “1” and “2”	40.64
112220101200102022010110102212
210200221211112120020122001010
120221110000202110122021012210
210010201112220012100101222012

Compression efficiency (CE) for hypothetical sequences based on 30 SNPs for one individual (first three rows) or five individuals (last four rows). In all cases Shannon’s entropy is identical, i.e. 1.585 bits (= −3 × (1/3) × log₂(1/3)), because the three SNP calls “0”, “1” and “2” are equiprobable. CE exploits patterns in the order of the calls as well as in their proportions, which allows it to discriminate data that Shannon’s entropy cannot. The Rationale column is a verbal approximation of the algorithmic complexity. Regular sequences have a small algorithmic complexity and a high CE. However, complex, irregular sequences embedded in a homogeneous population can still exhibit a relatively high CE. Our sliding-window, heterozygosity-corrected CE approach exploits this combination (the last four rows). In effect, a population-level assessment is made of a region’s sequence entropy, detecting high co-sharing of even very complex regions.
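The contrast between the two measures can be illustrated with a minimal sketch. It assumes that CE is the percentage reduction in size after compression, that zlib (DEFLATE) can stand in for the compressor, and that the sequences of several individuals are simply concatenated; none of these choices is confirmed by the table, and the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(calls: str) -> float:
    """Shannon entropy in bits per SNP call, from symbol frequencies alone."""
    counts = Counter(calls)
    total = len(calls)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_efficiency(sequences: list[str]) -> float:
    """Percentage size reduction after compression.

    Assumptions: zlib (DEFLATE) stands in for the compressor, and the
    individuals' sequences are concatenated without separators.
    """
    raw = "".join(sequences).encode("ascii")
    packed = zlib.compress(raw, 9)
    return 100.0 * (len(raw) - len(packed)) / len(raw)


# The three single-individual sequences from Table 1.
ordered = "000000000011111111112222222222"
periodic = "012" * 10
shuffled = "001202020022111221100211121200"

for name, seq in [("ordered", ordered), ("periodic", periodic), ("shuffled", shuffled)]:
    print(
        name,
        round(shannon_entropy(seq), 3),                # frequency-based measure
        round(compression_efficiency([seq]), 2),       # one individual
        round(compression_efficiency([seq] * 5), 2),   # five identical individuals
    )
```

The entropy column is 1.585 for all three sequences, whereas the compression-based values differ with the ordering of the calls and rise when the same sequence is shared by five individuals. The absolute percentages should not be expected to match Table 1: on inputs this short the fixed overhead of the zlib format dominates, and the result can even be negative for a single irregular sequence.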

ISSN: 1471-2105