Given that the purpose of genomes is to encode and transact information, it is not surprising that principles from Information Theory have been previously used to quantify their informational properties [38–46]. However, the historical use of compression per se in genomics has been more from a practical and technical perspective [47, 48], such as how to store large datasets in an efficient manner, rather than that of pattern recognition and biological interpretation. The only previous example of clustering by compression in the genomic area that we are aware of is the mitochondrial work of  which reconstructed eutherian and mammalian phylogenies.
Here, we have illustrated that CE analysis of SNP data allows one to 1) discriminate populations, reconstituting known aspects of their diversity, 2) identify at very high resolution genomic regions of evolutionary interest, including validated signatures of selection, 3) visualise evolution through informational space and 4) achieve these goals using a method that has an appealing logical simplicity. The hypothesis-free pattern recognition qualities of CE allow exploitation of order as well as proportion, the two sources of regularity in any data stream. This is important because the detection of patterns can be used to cluster data based on guilt-by-association and drive the inference of biological meaning.
One conceptual advantage of using a proxy for direct genetic evidence (CE) rather than the direct evidence itself (such as FST) is that previously unrecognized informational footprints may be present in the data which we may wish to exploit but whose expected properties we do not need to identify, nor fully understand, a priori.
Overall, we document a promising new perspective on analysing genomic data which is intended to be complementary to existing mathematical approaches, not to supplant them. Given this is the first publication of a wholly new approach, we are not yet in a position to formally connect our ideas to existing population genetics theory in a rigorous mathematical sense. Nevertheless, given its particular biological interest to human genetics, we do explore the population-level allele pattern content of the CEU and ASW SLC24A5 skin pigmentation region in some more detail. From the broader genome-wide perspective we have to rely on less quantitative, verbally expressed arguments. These draw intuitive connections between CE and various aspects of population genetic theory.
In order to infer population history, molecular geneticists conventionally look for specific genomic sites across individuals and search for changes in abundance or even fixation of an ancestral or derived allele. Here we have taken a complementary strategy, exploiting various longitudinal patterns along an individual’s genome (i.e. the rows of Table 1 rather than just the columns). The assumption we make is that closely related individuals will be more likely to share these longitudinal patterns of genome composition (i.e. haplotypes), however complex they may be. One appeal of this approach is that we do not lose the informational context provided by physically proximate SNPs.
In terms of the whole genome, we find CE allows all human populations to be discriminated, sometimes with little or no overlap. Using the Human HapMap3 dataset, the lowest CE is exhibited by the African American and other African populations and the highest CE by the Asian populations (Chinese and Japanese). The groupings resonate with published phylogeographic reconstructions based on FST and PCA analyses but are obtained much more quickly and cheaply, consuming only a fraction of the CPU time. In broad terms, at heterozygosity levels less than a third, there is a strong negative relationship between CE and heterozygosity. However, this observation does not fully explain the CE output, because populations of similar heterozygosity are still discriminated by differential CE. The population discrimination is robust across mammalian species (Figure 4). Runs of homozygosity are an obvious compositional feature that will be exploited by Gzip to compress the SNP data string in the whole genome version of the CE analysis, but there are many other sources of regularity.
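To make the whole genome calculation concrete, the following is a minimal sketch rather than our production code. It assumes that CE is the fraction of bytes saved by gzip on an individual's genotype string coded as "0", "1", "2" (with "1" the heterozygote); the function names and the two toy individuals are hypothetical and serve only to illustrate the calculation.

```python
import gzip
import random


def compression_efficiency(genotypes: str) -> float:
    """Fraction of bytes saved by gzip (an assumed working definition of CE)."""
    raw = genotypes.encode("ascii")
    return 1.0 - len(gzip.compress(raw)) / len(raw)


def heterozygosity(genotypes: str) -> float:
    """Proportion of heterozygous calls, assuming '1' codes the heterozygote."""
    return genotypes.count("1") / len(genotypes)


rng = random.Random(0)

# Hypothetical individual dominated by long runs of homozygosity (10,000 SNPs).
runs = "".join(rng.choice("02") * 50 for _ in range(200))
# Hypothetical individual whose genotypes are drawn independently at each SNP.
mixed = "".join(rng.choice("012") for _ in range(10_000))

for name, g in (("runs", runs), ("mixed", mixed)):
    print(name, round(compression_efficiency(g), 3), round(heterozygosity(g), 3))
```

Under these assumptions the individual with long homozygous runs receives the higher CE and the lower heterozygosity, in line with the negative relationship described above.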
One appealing analytical implication of the genome-wide CE approach is that the different scales of the various informational regularities can be assessed simultaneously by a single metric, irrespective of their size, direction or crypticity. The genomes of domesticated species offer a useful model in which to explore the genomic consequences imparted by population histories characterized by bottlenecks and artificial selection, as the genetic similarity of the various breeds provides a relatively stable background against which various evolutionary forces can be inferred. In each domestic species broadly similar patterns were observed. Genome-wide CE increased for populations likely to have been founded from a small number of founders, and it decreased for outbred populations expected to be highly heterogeneous. Plotting CE against heterozygosity showed that signals of population substructure were evident in non-human species (Figure 4).
Genomic regions harbouring signatures of selection
Given the ability of whole genome CE to discriminate populations, we next explored within-genome structure to prioritise regions of particular biological interest. We used a high resolution sliding window expressed relative to heterozygosity and normalized by Z-score (CEhZ). The correction for heterozygosity means the CE differences are more likely to be attributable to the pattern in order of heterozygotes and homozygotes and less to changes in proportion. The approach integrates a combination of individual genome regularity in the context of population homogeneity in that region (Table 1). Regions of even quite complex composition will yield a compression peak by CEhZ if they are shared by many members of a given population. This feature discriminates the window-based CEhZ analysis (which compresses shared complex regions as well as shared simpler ones) from the whole genome CE (where compression will be most strongly influenced by low information content regions such as simple runs of homozygosity).
That is, for CEhZ not only would we expect to find peaks over regions characterized by shared runs of homozygosity (as exemplified by the MSTN locus in muscular Texel sheep which has recently been swept clean of genetic diversity), but other compositionally more complex regions as well. The application of a matrix structure that permits comparisons of the same genomic regions across individuals clearly connects the output to existing population-level metrics such as LD. However, CEhZ finds loci over many different kinds of compositional regularities in a manner that defies a simple summary. A more detailed examination of allele composition in the SLC24A5 region in ASW, CEU and JPT reinforces the challenge of describing the mathematical nature of the compressible patterns exploited by Gzip (Figure 6). Irrespective of this, the data can be exploited to find commonalities and differences across any set of population groupings, in this particular case highlighting population substructure and showing the CEU and JPT are awarded similarly high CEhZ peaks for different reasons, and with different background HET.
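The window-based logic can be sketched as follows. This is an illustration under stated assumptions, not the exact CEhZ implementation: we assume each window is compressed as the row-wise concatenation of every individual's genotypes in that window, that the heterozygosity correction is the residual of a least-squares fit of window CE on window heterozygosity, and that zlib (DEFLATE) stands in for Gzip. The function names, window size and toy population are hypothetical.

```python
import random
import statistics
import zlib


def window_block(rows, start, width):
    """Concatenate the same genomic window across all individuals."""
    return "".join(r[start:start + width] for r in rows)


def window_ce(rows, start, width):
    block = window_block(rows, start, width).encode("ascii")
    return 1.0 - len(zlib.compress(block, 9)) / len(block)


def window_het(rows, start, width):
    block = window_block(rows, start, width)
    return block.count("1") / len(block)


def cehz(rows, width, step):
    """Heterozygosity-adjusted, Z-scored window CE (assumed form of CEhZ)."""
    starts = list(range(0, len(rows[0]) - width + 1, step))
    ce = [window_ce(rows, s, width) for s in starts]
    het = [window_het(rows, s, width) for s in starts]
    # Assumed correction: residual of CE after regressing on heterozygosity.
    mh, mc = statistics.fmean(het), statistics.fmean(ce)
    sxx = sum((h - mh) ** 2 for h in het)
    sxy = sum((h - mh) * (c - mc) for h, c in zip(het, ce))
    slope = sxy / sxx if sxx else 0.0
    resid = [c - (mc + slope * (h - mh)) for h, c in zip(het, ce)]
    mean_r, sd_r = statistics.fmean(resid), statistics.pstdev(resid)
    z = [(r - mean_r) / sd_r if sd_r else 0.0 for r in resid]
    return starts, z


rng = random.Random(42)
# Hypothetical population: 20 individuals, 600 SNPs of unstructured genotypes.
rows = ["".join(rng.choice("012") for _ in range(600)) for _ in range(20)]
# A compositionally complex 40-SNP segment shared by every individual.
core = list("0" * 13 + "1" * 13 + "2" * 14)
rng.shuffle(core)
shared = "".join(core)
rows = [r[:300] + shared + r[340:] for r in rows]

starts, z = cehz(rows, width=40, step=20)
peak = starts[z.index(max(z))]
print("peak window starts at SNP", peak)
```

In this toy example the shared segment is not a run of homozygosity and has heterozygosity close to the background, yet it yields the strongest peak because the same pattern recurs across individuals.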
In attempting to interpret the biological relevance of the compression peaks we examined the extent to which our new regions overlap with known signatures of selection. We find CEhZ successfully detects many of the known major signatures of selection in the various species. With regard to the human output we used  and found substantial overlap (Additional file 4: Data S2). The CE approach has the appeal of pinpointing particular genomic regions, such as coding sequence or parts of coding sequences, at very high resolution. In human populations the highlighted regions capture genes encoding proteins involved in skin pigmentation (SLC24A5), blue eye colour (HERC2), lactase persistence (LCT) and hair texture (ECAR).
A recent paper  detected a signature of positive selection in SLC24A5 in CEU, GIH and TSI exactly concordant with our observations based on CEhZ. They used a new method (haploPS) which leverages 2 sources of information relating to haplotype length and structure. HaploPS is similar to EHH and XP-EHH except that it estimates the population frequency of the allele under question and identifies the haplotype sequence on which the selected allele sits.
Furthermore, the LCT gene encodes the lactase protein that allows milk digestion into adulthood. It is known to be under selection only in those European (CEU) and African (MKK) populations with an extensive pastoral history characterised by livestock domestication and an adaptation favouring regular milk and dairy consumption (Figure 7). The lactase signature recently detected in MKK was notable in that a combination of 3 computationally intensive measures had to be leveraged (fixation index, integrated haplotype score and cross population extended haplotype homozygosity). Moreover, in contrast to CEhZ which provides strong evidence for the exact gene, the alternative methods could resolve the region only to a relatively broad 1.7 Mb. In our data the nature of the LCT-specific peak in CEU and MKK is visibly different, consistent with the purported evolutionary independence of the selection event.
New human population predictions unique to CEhZ imply hitherto unrecognised roles for a long non-coding RNA in European populations, a chromatin remodeller (SCMH1) in African populations and the EDA2R gene in Asian populations (Figure 8). The EDA2R is noteworthy in that it is present on the X chromosome which is not amenable to conventional analyses because it is hemizygous in males.
A number of CEhZ peaks are shared by all the human populations. We might speculate these represent genomic constraint at a much deeper taxonomic level, perhaps the branch point of modern humans from other primates. It is interesting to note that the more homogeneous Asian populations exhibit a frequency distribution of CEh Z-scores characterised by a low mean value, but a number of very extreme outliers compared to the more heterogeneous African populations (Additional file 2: Figure S2).
In cattle, divergent regions contrasting the Angus (Bos taurus) and Brahman (Bos indicus) breeds aligned with previously described signatures of selection. The recently documented PLAG1 is considered fundamental to tropical adaptation in Brahman cattle (Figure 9A, Table 2). New predictions for the cattle breed comparison include EN1, EYA1 and ARID4A (Figure 9B, C, D). Overall, it appears CEhZ is a useful metric for exploiting intra-chromosomal heterogeneity in a rapid, straightforward fashion.
Evolution and information compression
What are the biological implications of the CE values we have computed? As a first step, we defined the total parameter space (Additional file 5: Figure S3). This allows us to encapsulate the boundaries of the DNA informational universe. In turn, this universe enables us to envisage how real DNA sequence evolves at a whole genome level. We modelled ‘totally regular’ (highly compressible) by first sorting each individual SNP genotype prior to compression, and ‘totally irregular’ (uncompressible) through several different randomisation procedures set at both the individual and population levels that account for proportion and order. The initial randomisation provided by RAND1 breaks LD by scrambling the identical “0, 1, 2” proportions into chaotic order (Additional file 5: Figure S3).
This result shows that in practice LD serves to enforce data regularities (rather than irregularities) on real-world DNA sequences. The deeper randomisation provided by RAND2 (that borrows the proportions of 0’s and 2’s from the entire population, not the individual) is perhaps surprising. It suggests each individual genome possesses highly cryptic proportional regularities not present in the population at large (Additional file 5: Figure S3). We found that all populations in all species occupy a very specific zone, clearly converging at – or emerging from – a very well defined point in CE genomic space. They intersect at the point at which highly complex sequence most deeply explores the disordered space, without actually becoming chaotic. The overall uniformity of the shape of the output across all the populations/breeds of all the species, despite considerable compositional differences in both the density and functional bias of the SNP chip technologies, points to the very high robustness of the result. Nevertheless, SNP chips characterise highly variable (i.e. chaotic) regions, so the translation of this output to full genome sequence remains uncertain and should be a focus of future work. We were next interested in direction of travel through this space. Are we observing an emergence or a convergence from the point of minimum CE and maximum noise?
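The modelling of the boundaries can be sketched as below. This is a simplified illustration, assuming CE is the fraction of bytes saved by DEFLATE; the RAND2 function is only one plausible reading of the population-level randomisation (each genotype drawn from the pooled population proportions), and the block-structured toy population is hypothetical.

```python
import random
import zlib


def ce(genotypes: str) -> float:
    """Fraction of bytes saved by DEFLATE (an assumed working definition of CE)."""
    raw = genotypes.encode("ascii")
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)


def sorted_model(genotypes: str) -> str:
    """'Totally regular' bound: same proportions, perfectly ordered."""
    return "".join(sorted(genotypes))


def rand1(genotypes: str, rng: random.Random) -> str:
    """RAND1-style shuffle: keep the individual's proportions, destroy order."""
    chars = list(genotypes)
    rng.shuffle(chars)
    return "".join(chars)


def rand2(genotypes: str, population, rng: random.Random) -> str:
    """Assumed RAND2-style draw: genotypes sampled from pooled population proportions."""
    pool = "".join(population)
    return "".join(rng.choice(pool) for _ in genotypes)


rng = random.Random(1)
# Hypothetical population built from a small library of shared 25-SNP blocks,
# a crude stand-in for haplotype structure (LD).
blocks = ["".join(rng.choice("012") for _ in range(25)) for _ in range(8)]
population = ["".join(rng.choice(blocks) for _ in range(200)) for _ in range(10)]
real = population[0]

for label, g in (
    ("sorted", sorted_model(real)),
    ("real", real),
    ("RAND1", rand1(real, rng)),
    ("RAND2", rand2(real, population, rng)),
):
    print(label, round(ce(g), 3))
```

In this toy setting the structured sequence sits between the sorted and the shuffled bounds, which mirrors the observation that breaking LD removes regularity and lowers CE.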
Our first attempt to answer this question was to examine the domestic species data. These species have a clearly identified progenitor which provides an unambiguous evolutionary sequence. However, SNP ascertainment bias confounds interpretation here. What we can say, though, is that life, characterised by negative entropy, evolved from non-life, which is usually characterised by high entropy. It is therefore tempting to assume that the more ancestral compositional states would have been more entropic. Further, another source of sequence data in a range of species supports this idea. In the context of particular protein coding sequences, we have previously noted that when genome-wide codon bias is quantified informationally, it is those proteins apparently most relevant to (or diagnostic of) the lineage under scrutiny that exhibit the lowest entropy.
Examples of these low-entropy ‘derived’ molecules include proteins influencing chloroplast physiology in plants, mitochondrial function in birds and hair formation in mammals. Generalizing this broad line of reasoning (high entropy ancestral, low entropy derived) is appealing as it places representatives of the modern African populations as relatively basal (Figure 10), which seems to be consistent with the consensus “Out of Africa” hypothesis of modern human evolution. In the future the hypothesis that derived genome sequences possess relatively low entropy could be validated using domestic species as a resource. One could compare the whole genome CE of the extant representative of the wild ancestor to various domestic populations. For example, we would predict the CE of the red jungle fowl genome, at an individual level, to be lower (i.e. more entropic) than individual genomes representing meat and egg producing domestic chickens. We would also predict population-level CEhZ sliding window scores to possess a more extreme distribution in the domestic breeds. Some of these CEhZ peaks would characterise signatures of selection for egg and meat production.
What else does this mean for our understanding of biological encoding systems? The phase transition between regularity and irregularity is theorised to be a high-impact zone of enormous computational power and evolutionary potential [55, 56]. This interests us given a genome is a computing device made of nucleic acid that is the product of evolution. The overall position of all the human populations supports a controversial concept from complex systems science [55, 57–61] that genomes are poised at or close to ‘The Edge of Chaos.’ This conclusion resonates closely with that of Kong et al., who analysed 384 prokaryotic and 402 eukaryotic genomes using a novel regularity/order index called ø, based on averages of nucleotide distributions in a given sequence of pre-defined length. Figure 10 also summarises the possible mechanistic explanations for the various trajectories taken by the populations and individuals through information space, based on the implications of our data modelling coupled with the real-world mammalian genomes. We see different spatial impacts of LD and extent of outbreeding depending on the particular population under consideration.
The meaning of CE in the context of population genetics theory
To finish, it is appropriate to more directly connect our CE work to existing population genetics theory, whose goal is to study the frequency and interaction of alleles and genes in populations. In population genetics theory, various evolutionary processes, particularly natural selection (in numerous guises), drift, mutation and gene flow are explored to make inference about population history. The Hardy-Weinberg principle says that the frequency of alleles will remain constant in the idealised absence of selection, mutation, migration and drift  and this provides a theoretical expectation (equilibrium) against which population level deviations from equilibrium (dis-equilibrium) can be quantified and subsequently interpreted. In Information Theory terms, the point of equilibrium corresponds to maximum entropy, and extent of dis-equilibrium reflects differing amounts of negentropy.
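The correspondence between equilibrium and maximum entropy can be illustrated for a single biallelic locus. The sketch below is an illustration, not part of our analysis: it assumes a fixed allele frequency p, treats a genotype as an ordered pair of alleles, and parameterises departures from Hardy-Weinberg proportions with an inbreeding-style coefficient F (a symbol introduced here for illustration only). With the allele frequency held fixed, the joint entropy of the two alleles is largest when they are independent, i.e. at F = 0.

```python
import math


def ordered_pair_freqs(p: float, f: float = 0.0):
    """Frequencies of (A,A), (A,a), (a,A), (a,a) for allele frequency p.

    f = 0 gives Hardy-Weinberg proportions; f != 0 is an assumed,
    illustrative way to model a departure from equilibrium.
    """
    q = 1.0 - p
    het = p * q * (1.0 - f)
    return [p * p + f * p * q, het, het, q * q + f * p * q]


def entropy_bits(probs) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


p = 0.3
for f in (-0.2, 0.0, 0.3):
    h = entropy_bits(ordered_pair_freqs(p, f))
    print(f"F = {f:+.1f}  entropy = {h:.4f} bits")
```

The shortfall in entropy relative to the F = 0 case is one simple way to express the negentropy associated with dis-equilibrium at a locus.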
At a population level, nearby pairs of alleles have a high tendency to be correlated with each other (LD). In genetic ‘hitchhiking’ an allele at one locus rises to high frequency in a population because it is linked to an allele under selection at a nearby locus, not because it has been selected itself. The same phenomenon applies to genes under runaway sexual selection. Clearly, this phenomenon culminates in population-level homogeneity (pattern) in allele combinations because of genomic similarity between individuals. Adding further dynamism, these population-level patterns are gradually broken by the individual cellular/molecular process of genetic recombination, but at a slow rate.
This ebb and flow of allele pattern formation and destruction among individuals can be exploited to detect the action of natural selection via selective sweeps, and to view the impact of migrations and founder effects. For example, it is well known that there is higher LD in Asian populations, presumably due to the founder effects that occur during migrations limiting the number of haplotypes. LD is often viewed by a decay plot e.g. , where it can be shown that deviation from equilibrium is considerably stronger for nearby loci. These decay plots are relatively extreme for Africans due to faster LD decay and correspondingly smaller haplotype blocks than in the comparison Asian and European populations. A number of existing metrics for selection (EHH, IHH) are based on considerations of local decay of haplotypes.
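For readers less familiar with LD, the pairwise correlation that underlies a decay plot can be computed as follows. This is the standard r² statistic for two biallelic loci on phased haplotypes; the function name and the example haplotypes are hypothetical and are not drawn from our data.

```python
def r_squared(locus_a, locus_b) -> float:
    """Squared allelic correlation (r^2) between two biallelic loci.

    Each argument is a list of 0/1 alleles, one entry per phased haplotype.
    """
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(a & b for a, b in zip(locus_a, locus_b)) / n
    d = p_ab - p_a * p_b  # the classical disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Hypothetical alleles at three loci across eight haplotypes.
snp1 = [0, 0, 1, 1, 0, 1, 0, 1]
snp2 = [0, 0, 1, 1, 0, 1, 0, 1]  # always co-occurs with snp1
snp3 = [0, 1, 0, 1, 0, 1, 0, 1]  # only partly associated with snp1

print(r_squared(snp1, snp2))
print(r_squared(snp1, snp3))
```

A decay plot summarises such values for all pairs of loci as a function of their physical separation.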
What does this mean for the various CE metrics and what are the phenomena that serve to underpin the patterns quantified by CE? Whole genome CE is computed on an isolated individual basis. The coordinates (i.e. shape and location) of the population cluster describe the data at the population level. However, given this is fundamentally an individual-level metric, its relationship to LD might not be straightforward. For example, other sources of (unknown) compositional regularity may apply, including segmental duplications and G4 motifs and structures. It is also true, in theory, that one can achieve the same compression efficiency for different reasons, but in practice we find that the accurate phylogeographic population-level clustering implies it is only similar, related genome compositions that are awarded similar CE scores. Also, we know that the RAND1 modelling procedure serves to break LD and reduces CE (Additional file 5: Figure S3). Based on this reasoning, it is tempting to speculate that individuals with high CE belong to populations that have higher LD. This conclusion is consistent with the population-level CE ranking we observe in Figures 1, 2 and 3, which mirrors known differences in LD between human populations, that is the Africans showing the least LD and the Asians the most LD, with the Europeans intermediate.
Next we will consider the relationship of this thinking to the particular genomic regions identified by the sliding window CEhZ. From our informational perspective, remnant population-level patterns can clearly be quantified by CEhZ, and contrasted across populations, no matter how cryptic or complex the genomic composition may be at an individual level. A detailed specific example is given by the CEU and GIH skin lightening signature of selection that resides over SLC24A5 (Figure 6). At this stage, a confident determination of CEhZ’s exact biological origin – i.e. is there a particular compressible pattern diagnostic of natural selection versus genetic drift versus a founder effect? – is not possible.
That being said, there are clearly numerous reliable patterns present at very specific genomic regions in one population, but not another. Many of these have not been described before including those specifically overlying known functionally important parts of the genome, such as protein-coding genes, non-coding RNA and so on. These new discoveries may reflect the fact that the mathematical nature of the population-level patterns we have highlighted did not have to be specified a priori, unlike FST which is more tightly expressed. The guilt-by-association heuristic tells us there is some bona fide population-level meaning in those regions. It is our contention that a post-publication community effort and a range of techniques will be required to ascribe functional significance or not on a case-by-case basis. To expedite this process, we have uploaded the CEhZ tracks onto the UCSC genome web browser.
Slower LD decay in Asian populations seems consistent with our finding that the Asians possess extreme outlier peaks in CEhZ reflecting high homogeneity in certain regions not observed in the other populations (Additional file 2: Figure S2). It is worth pointing out that direct comparisons of two specified loci between populations are not apparent in decay plots, as all conceivable pairs are simultaneously plotted. By contrast, one strength of the CEhZ sliding window approach is we maintain the identity of the genomic region such that the population contrasts are directly comparable, and therefore biological interpretation can conveniently be made on a fine-grained regional basis. An example would be the European and Gujarati Indian selection events around the SLC24A5 gene (Figure 6).
Finally, population genetic diversity has been quantified by allelic diversity – namely, the proportion of all copies of a gene made up of a particular variant. The 1000 genomes consortium showed that CEU, JPT and YRI possess many SNPs displaying substantial absolute differences in allele frequency, and that this ability to differentiate populations decays rapidly as one increases physical distance from genic SNPs. Our observations are consistent with some of these findings, namely the whole genome discrimination determined by CE, the concordance of some CEh peaks over genic regions in particular, and the elevation of African CE when examining coding SNPs only.
CE operates by exploiting regularities within a sequence regardless of the origin of the sequence. CE is not based on any theory of segregation inheritance, nor does it require knowledge of ancestry to phase the genotypes. In light of the strong correlation between FST and CE (r = 0.885) we conclude that CE accurately estimates genetic relatedness among populations without recourse to additional sources of information. The same conclusion is reached when CE is used in a sliding window approach to capture genes under selection.
Because CE is a hypothesis-free pattern recognition method that detects regularities in segments of the genome, it is more in the spirit of the various haplotype-based methods than of single marker methods of population differentiation. The main weakness we have identified is that, like EHH and FST, it does not allow for population stratification. Further work is required to formalise other strengths and weaknesses relative to existing methods.
Our implementation of CE requires the use of the gzip tool, which incorporates the DEFLATE algorithm. The gzip tool is included in all Unix environments. However, other compression tools exist and could be used. Also, from the computational perspective, the sequential use of gzip (i.e. one genotype sequence at a time) requires a great deal of argument parsing and many I/O operations. This is not an issue when compressing a whole matrix comprising genotype sequences from individuals within a population. Nevertheless, if we were to program the DEFLATE algorithm and perform CE analyses entirely in memory then the computational efficiency of CE analyses would be greatly improved.
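As an indication of what an in-memory variant might look like, the sketch below uses the DEFLATE implementation exposed by Python's standard zlib module in place of repeated calls to the gzip command-line tool. It is not the implementation used in this study; the function names and toy genotype strings are hypothetical, and we assume the same working definition of CE as the fraction of bytes saved.

```python
import gzip
import zlib


def deflate_size(data: bytes, level: int = 9) -> int:
    """Size of a raw DEFLATE stream; negative wbits omits header and trailer."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return len(comp.compress(data)) + len(comp.flush())


def ce_in_memory(genotype_rows):
    """CE for each individual without writing files or spawning processes."""
    scores = []
    for genotypes in genotype_rows:
        raw = genotypes.encode("ascii")
        scores.append(1.0 - deflate_size(raw) / len(raw))
    return scores


# Hypothetical genotype strings for two individuals.
rows = ["0" * 500 + "1" * 500, "012" * 300]
print([round(score, 3) for score in ce_in_memory(rows)])

# The gzip container wraps the DEFLATE stream in a header and trailer,
# so its output is slightly larger than the raw stream for the same data.
sample = rows[0].encode("ascii")
print(deflate_size(sample), len(gzip.compress(sample)))
```

Because the gzip container adds a fixed overhead per sequence, CE values from a raw DEFLATE stream will differ slightly from gzip-based values, so the two should not be mixed within one analysis.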