Kolmogorov Complexity and Information Theory: the Universal Similarity Metric and its compression-based approximations
The conditional Kolmogorov complexity K(x|y) of two strings x and y is the length of the shortest binary program P that computes x with input y [8, 40]. Thus, K(x|y) represents the minimal amount of information required to generate x by any effective computation when y is given as an input to the computation. The Kolmogorov complexity K (x) of a string x is defined as K(x|λ), where λ stands for the empty string. Given a string x, let x* denote the shortest binary program that produces x on an empty input; if there is more than one shortest program, x* is the first one in enumeration order. The Kolmogorov complexity K (x, y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart. It is then possible to define the information distance ID (x, y) between two objects satisfying the following identity, up to logarithmic additive terms:
ID (x, y) = max {K (x|y), K (y|x)}    (1)
Equation (1) is the larger of the amount of information needed to describe x given y and the amount needed to describe y given x. It can be shown to be a proper metric [41] and therefore a distance function for strings. A major further advance in Kolmogorov complexity-based distance functions was obtained in [7], where
USM (x, y) = max {K (x|y*), K (y|x*)} / max {K (x), K (y)}    (2)
has been defined and named the universal similarity metric (USM). It is indeed a metric, it is normalized, and it is universal: a lower bound to, and therefore a refinement of, any normalized distance function that one can define and compute.
It is well known that there is a relationship between Kolmogorov complexity of sequences and Shannon information theory [42]: the expected Kolmogorov complexity of a sequence x is asymptotically close to the entropy of the information source emitting x.
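More precisely, for a finite-alphabet i.i.d. source X one has H (X) <= (1/n) E [K (X1 ... Xn | n)] <= H (X) + O (log n / n) [42], so the expected per-symbol complexity converges to the entropy of the source as the sequence length n grows.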
Such a fact is of great use in defining workable distance and similarity functions stemming from the theoretical setting outlined above. Indeed, Kolmogorov complexities are not computable in the Turing sense, so the universal similarity metric must be approximated, for instance via the entropy of the information source emitting x. However, it is very hard to infer the information source from the data, so in order to approximate K (x) one resorts to compressive estimates of entropy.
Let C be a compression algorithm and C (x) (usually a binary string) its output on a string x. Let |C (x)|/|x| be its compression rate, i.e., the ratio of the lengths of the two strings. Usually, the compression rate of good compressors approaches the entropy of the information source emitting x [42]. While entropy establishes a lower bound on compression rates, it is not straightforward to measure entropy itself, as already pointed out. One empirical method inverts the relationship and estimates entropy by applying a provably good compressor to a sufficiently long, representative string. That is, the compression rate becomes a compressive estimate of entropy.
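For instance, a memoryless binary source emitting 1 with probability 0.9 has entropy -0.9 log2 0.9 - 0.1 log2 0.1, approximately 0.47 bits per symbol; a good compressor applied to a sufficiently long string emitted by this source is therefore expected to use about 0.47 bits per input symbol, and this measured rate can in turn be taken as an estimate of the (unknown) entropy of the source.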
In conclusion, we must be satisfied with approximating K (x) by the length |C (x)|. Furthermore, since K (x, y) = K (xy) up to additive logarithmic precision [7], K (x, y) can likewise be approximated by the length |C (xy)| of the compression of the concatenation of x and y. Finally, since K (x, y) = K (x) + K (y|x*) = K (y) + K (x|y*), again up to constant additive precision [8], the conditional complexity K (x|y*) can be approximated by |C (xy)| - |C (y)|, and K (y|x*) by |C (yx)| - |C (x)|. This gives us our first dissimilarity function:
UCD (x, y) = max {|C (xy)| - |C (y)|, |C (yx)| - |C (x)|} / max {|C (x)|, |C (y)|}    (3)
Furthermore, the authors of the USM methodology have devised compressive estimates of it, namely
NCD1 (x, y) = (|C (xy)| - min {|C (x)|, |C (y)|}) / max {|C (x)|, |C (y)|}    (4)
Based on it, we consider
NCD (x, y) = min {NCD1 (x, y), NCD1 (y, x)}    (5)
Notice that this is a slight variation with respect to the original definition, introduced to make the function symmetric. Finally, the following dissimilarity function was proposed in the realm of data mining as an approximation of USM, and independently in table compression applications:
CD (x, y) = |C (xy)| / (|C (x)| + |C (y)|)    (6)
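As a concrete illustration of how these functions are computed in practice, the following minimal sketch (not part of the software described below; zlib is used here only as a stand-in for the compressor C, and the two strings are toy examples) evaluates UCD, NCD and CD of equations (3)-(6) from the four compressed lengths |C (x)|, |C (y)|, |C (xy)| and |C (yx)|:

```cpp
// Minimal sketch: the compressive dissimilarities of equations (3)-(6),
// with zlib as a stand-in for the compressor C. Toy strings only; in the
// experiments x and y are whole sequences or structures and C is one of
// the compressors described in the Algorithms section.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <zlib.h>

// |C(s)|: length of the zlib-compressed form of s.
static double clen(const std::string& s) {
    uLongf dlen = compressBound(s.size());
    std::vector<Bytef> dest(dlen);
    compress2(dest.data(), &dlen,
              reinterpret_cast<const Bytef*>(s.data()), s.size(),
              Z_BEST_COMPRESSION);
    return static_cast<double>(dlen);
}

int main() {
    std::string x = "ACGTACGTACGTACGTACGT", y = "ACGTTCGAACGTACGAACGT";
    double cx = clen(x), cy = clen(y);
    double cxy = clen(x + y), cyx = clen(y + x);

    // Equation (3): UCD(x,y) = max{|C(xy)|-|C(y)|, |C(yx)|-|C(x)|} / max{|C(x)|, |C(y)|}
    double ucd = std::max(cxy - cy, cyx - cx) / std::max(cx, cy);

    // Equations (4)-(5): NCD(x,y) = min{NCD1(x,y), NCD1(y,x)}
    double ncd1_xy = (cxy - std::min(cx, cy)) / std::max(cx, cy);
    double ncd1_yx = (cyx - std::min(cx, cy)) / std::max(cx, cy);
    double ncd = std::min(ncd1_xy, ncd1_yx);

    // Equation (6): CD(x,y) = |C(xy)| / (|C(x)| + |C(y)|)
    double cd = cxy / (cx + cy);

    std::cout << "UCD=" << ucd << " NCD=" << ncd << " CD=" << cd << '\n';
}
```

Any of the compressors described in the Algorithms section can be substituted for zlib by replacing the clen function with a call to the corresponding tool.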
Data sets
The Chew-Kedem data set consists of the following proteins:
alpha-beta 1aa900, 1chrA1, 1ct9A1, 1gnp00, 1qraA0, 2mnr01, 4enl01, 5p2100, 6q21A0, 6xia00.
mainly beta 1cd800, 1cdb00, 1ci5A0, 1hnf01, 1neu00, 1qa9A0, 1qfoA0.
mainly alpha 1ash00, 1babA0, 1babB0, 1cnpA0, 1eca00, 1flp00, 1hlb00, 1hlm00, 1ithA0, 1jhgA0, 1lh200, 1mba00, 1myt00, 2hbg00, 2lhb00, 2vhb00, 2vhbA0, 3sdhA0, 5mbn00.
Protein chain 2vhb appears twice in this data set, as 2vhb00 and 2vhbA0, in order to test whether compression-based classification methods detect the two copies as identical (and thus cluster them together) or not.
The Sierk-Pearson protein data set is as follows:
alpha-beta 1a1mA1, 1a2vA2, 1akn00, 1aqzB0, 1asyA2, 1atiA2, 1auq00, 1ax4A1, 1b0pA6, 1b2rA2, 1bcg00, 1bcmA1, 1bf5A4, 1bkcE0, 1bp7A0, 1c4kA2, 1cd2A0, 1cdg01, 1d0nA4, 1d4oA0, 1d7oA0, 1doi00, 1dy0A0, 1e2kB0, 1eccA1, 1fbnA0, 1gsoA3, 1mpyA2, 1obr00, 1p3801, 1pty00, 1qb7A0, 1qmvA0, 1urnA0, 1zfjA0, 2acy00, 2drpA1, 2nmtA2, 2reb01, 4mdhA2.
mainly beta 1a8d02, 1a8h02, 1aozA3, 1b8mB0, 1bf203, 1bjqB0, 1bqyA2, 1btkB0, 1c1zA5, 1cl7H0, 1d3sA0, 1danU0, 1dsyA0, 1dxmA0, 1et6A2, 1extB1, 1nfiC1, 1nukA0, 1otcA1, 1qdmA2, 1qe6D0, 1qfkL2, 1que01, 1rmg00, 1tmo04, 2tbvC0.
mainly alpha 1ad600, 1ao6A5, 1bbhA0, 1cnsA1, 1d2zD0, 1dat00, 1e12A0, 1eqzE0, 1gwxA0, 1hgu00, 1hlm00, 1jnk02, 1mmoD0, 1nubA0, 1quuA1, 1repC1, 1sw6A0, 1trrA0, 2hpdA0, 2mtaC0.
The Apostolico data set consists of the mitochondrial DNA complete genome of the following species:
Laurasiatheria blue whale (B. musculus), finback whale (B. physalus), white rhino (C. simum), horse (E. caballus), gray seal (H. grypus), harbor seal (P. vitulina).
Murinae house mouse (M. musculus), rat (R. norvegicus).
Hominoidea gorilla (G. gorilla), human (H. sapiens), gibbon (H. lar), pigmy chimpanzee (P. paniscus), chimpanzee (P. troglodytes), sumatran orangutan (P. pygmaeus abelii), orangutan (P. pygmaeus).
Algorithms
Compression algorithms
The compression algorithms we have used for the computation of the dissimilarity functions defined earlier, together with their parameter settings (a crucial and often overlooked aspect of data compression), can be broadly divided into four groups. The first group consists of three state-of-the-art tools for general-purpose compression.
Gzip and Bzip2 These are the well-known compression tools based, respectively, on the classic Lempel-Ziv algorithm [43] and on the bwt (Burrows-Wheeler transform) [44].
PPMd This algorithm, written by Dmitry Shkarin [45, 46], is the current state-of-the-art for PPM compression. In our tests we used it with four different context lengths (see below).
The algorithms in the second group are known as order-zero (or memoryless) compressors, since the codeword for a symbol depends only on its overall frequency (for Huffman coding) or on its frequency in the already scanned part of the input (for range and arithmetic coding).
Huffman An implementation of the classic two-pass Huffman coding scheme.
Ac Arithmetic coding algorithm as implemented in [47].
Rc Range coding algorithm. Range coding and arithmetic coding are based on similar concepts and achieve similar compression. We used the Range Coding implementation from [48].
The algorithms in the next group are all based on the bwt and have been implemented and tested in [49].
BwtMtfRleHuff and BwtRleHuff The first of these algorithms consists of computing the bwt, followed by Move-to-front encoding, followed by run-length encoding, followed by Huffman coding (as implemented by algorithm Huffman); the Move-to-front step is sketched after this group of algorithms. The algorithm BwtRleHuff is analogous to BwtMtfRleHuff, except that the Move-to-front encoding step is omitted.
BwtMtfRleAc and BwtRleAc are analogous respectively to BwtMtfRleHuff and BwtRleHuff, except that in the final step they use Arithmetic Coding (algorithm Ac) instead of Huffman coding.
BwtMtfRleRc and BwtRleRc are analogous respectively to BwtMtfRleHuff and BwtRleHuff, except that in the final step they use Range Coding (algorithm Rc) instead of Huffman Coding.
BwtWavelet computes the bwt of the input and compresses the resulting string using a wavelet tree encoder [50].
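The Move-to-front step shared by the pipelines above can be illustrated with the following minimal encoder (an illustrative re-implementation, not the code tested in [49]): each symbol is replaced by its current position in a recency list and then moved to the front of that list, so the bwt output, which clusters equal symbols together, becomes runs of small numbers that run-length and entropy coding compress well.

```cpp
// Minimal Move-to-front (MTF) encoder, given only to illustrate the step
// shared by the Bwt*Mtf* pipelines above (not the implementation of [49]).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

std::vector<uint8_t> mtf_encode(const std::string& s) {
    std::vector<uint8_t> recency(256);
    std::iota(recency.begin(), recency.end(), 0);   // recency list: 0, 1, ..., 255
    std::vector<uint8_t> out;
    for (unsigned char c : s) {
        // the emitted code is the current position of c in the recency list
        size_t pos = std::find(recency.begin(), recency.end(), c) - recency.begin();
        out.push_back(static_cast<uint8_t>(pos));
        recency.erase(recency.begin() + pos);        // move c to the front
        recency.insert(recency.begin(), c);
    }
    return out;
}

int main() {
    for (uint8_t code : mtf_encode("aaabbbaaacc"))   // bwt-like clustered input
        std::cout << int(code) << ' ';               // runs of equal symbols -> zeros
    std::cout << '\n';
}
```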
Finally, we used an algorithm specially designed for compressing DNA sequences.
Gencompress This is currently the best available algorithm to compress DNA sequences [51]. It makes clever use of approximate occurrences of substrings in DNA sequences to achieve good compression.
Range/arithmetic coding variants
The behavior of range and arithmetic coding depends on two parameters: MaxFreq and Increment. The ratio between these two values essentially controls how quickly the coding "adapts" to the new statistics. For range coding we set MaxFreq = 65,536 (the largest possible value) and we experimented with three different values of Increment: setting Increment = 256 we get a range coder with FAST adaptation, with Increment = 32 we get a range coder with MEDIUM adaptation, and with Increment = 4 we get a range coder with SLOW adaptation. Similarly, for arithmetic coding we set MaxFreq = 16,383 (the largest possible value) and Increment equal to 64, 8, and 1 to achieve FAST, MEDIUM, and SLOW adaptation, respectively.
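The mechanism behind these settings can be sketched as follows (a generic adaptive order-zero frequency model, shown only to illustrate the role of Increment and MaxFreq; the actual coders are those of [47, 48]): every coded symbol has its count increased by Increment, and when the total count reaches MaxFreq all counts are halved, so a large Increment makes recent symbols dominate the statistics quickly (FAST adaptation), while a small one changes them slowly (SLOW adaptation).

```cpp
// Sketch of the adaptive order-zero frequency model driving a range or
// arithmetic coder; illustrative only, the actual coders are those of
// [47, 48]. Increment and MaxFreq control how fast the model adapts.
#include <cstdint>
#include <iostream>
#include <vector>

struct AdaptiveModel {
    std::vector<uint32_t> freq;     // per-symbol counts, initialized to 1
    uint32_t total;
    uint32_t max_freq, increment;

    AdaptiveModel(size_t alphabet, uint32_t max_freq, uint32_t increment)
        : freq(alphabet, 1), total(static_cast<uint32_t>(alphabet)),
          max_freq(max_freq), increment(increment) {}

    // Called after coding each symbol: its count grows by `increment`, and
    // when the total reaches `max_freq` all counts are halved, ageing the
    // old statistics. The coder assigns symbol s probability freq[s]/total.
    void update(size_t sym) {
        freq[sym] += increment;
        total += increment;
        if (total >= max_freq) {
            total = 0;
            for (uint32_t& f : freq) { f = (f + 1) / 2; total += f; }
        }
    }
};

int main() {
    AdaptiveModel fast(256, 65536, 256), slow(256, 65536, 4);   // FAST vs SLOW
    for (int i = 0; i < 100; ++i) { fast.update('a'); slow.update('a'); }
    std::cout << "P('a') fast: " << double(fast.freq['a']) / fast.total
              << "  slow: "      << double(slow.freq['a']) / slow.total << '\n';
}
```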
PPMd variants
The compressive power of PPMd is closely related to the length of the contexts used for predicting the next symbol. In our experiments we have used models (contexts) of length 2, 4, 8 and 16, and 256 MB of working memory. Contexts of length 16 represent PPMd at its maximum strength. In the tables, the number beside PPMd indicates the context length used.
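The role of the context length can be illustrated with the following toy order-k model (this is not PPMd, which in addition escapes to shorter contexts and manages its working memory carefully): the next symbol is predicted from the counts collected after previous occurrences of the same k preceding symbols, so longer contexts give sharper predictions but require more data before most contexts have been seen.

```cpp
// Toy order-k context model, given only to illustrate the role of the
// context length; PPMd additionally escapes to shorter contexts and
// manages its working memory, which this sketch does not attempt.
#include <iostream>
#include <map>
#include <string>

int main() {
    const int k = 2;                       // context length (2, 4, 8 or 16 in our runs)
    std::string text = "abracadabraabracadabra";

    // counts[c][s] = how many times symbol s followed the k-symbol context c
    std::map<std::string, std::map<char, int>> counts;
    for (size_t i = k; i < text.size(); ++i)
        counts[text.substr(i - k, k)][text[i]]++;

    // After the context "ab" the model has only ever seen 'r', so it predicts
    // it with high confidence; with a larger k most contexts occur only once.
    for (const auto& [next, n] : counts["ab"])
        std::cout << "'" << next << "' after \"ab\": " << n << " times\n";
}
```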
Alignment and alignment-free algorithms
The alignment algorithms we have used are the global alignment algorithm of [52] with the PAM120 substitution scoring matrix [53], and the local alignment algorithm of [54] with the BLOSUM62 substitution scoring matrix [55], as implemented in the R package pairseqsim [56].
For alignment-free comparison, we have computed the linear correlation coefficient between the 20^k-dimensional vectors of k-mer frequencies [5, 15], for k = 1, 2, 3, using the R package seqinr [57].
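For illustration, the following sketch (an independent re-implementation, given only to make the computation explicit; in the experiments it was carried out with seqinr, as stated above) builds the 20^k-dimensional k-mer frequency vectors of two protein sequences and returns their Pearson correlation; the sequences shown are made-up toy inputs.

```cpp
// Sketch of alignment-free comparison: Pearson correlation between the
// 20^k-dimensional k-mer frequency vectors of two protein sequences.
// Illustrative re-implementation; the experiments used the R package seqinr.
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

static const std::string AA = "ACDEFGHIKLMNPQRSTVWY";   // amino acid alphabet

// Relative frequency of each k-mer occurring in s.
std::map<std::string, double> kmer_freq(const std::string& s, int k) {
    std::map<std::string, double> f;
    for (size_t i = 0; i + k <= s.size(); ++i) f[s.substr(i, k)] += 1.0;
    for (auto& kv : f) kv.second /= double(s.size() - k + 1);
    return f;
}

// Pearson correlation over all 20^k coordinates (missing k-mers count as 0).
double kmer_correlation(const std::string& a, const std::string& b, int k) {
    auto fa = kmer_freq(a, k), fb = kmer_freq(b, k);
    std::vector<std::string> kmers(1, "");
    for (int i = 0; i < k; ++i) {                 // enumerate all 20^k k-mers
        std::vector<std::string> next;
        for (const auto& p : kmers)
            for (char c : AA) next.push_back(p + c);
        kmers.swap(next);
    }
    double n = kmers.size(), sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (const auto& m : kmers) {
        double x = fa.count(m) ? fa[m] : 0.0, y = fb.count(m) ? fb[m] : 0.0;
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
    }
    return (sxy - sx * sy / n) /
           std::sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int main() {
    std::string a = "MKVLAAGICLVALSA", b = "MKVLAVGLCLVALSA";   // toy inputs
    std::cout << kmer_correlation(a, b, 2) << '\n';
}
```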
Dissimilarity matrix construction algorithms
We have produced BioPerl scripts that take as input a set of files to be classified and a compression algorithm (together with all of its parameter settings) and return a dissimilarity matrix. In particular, a preprocessing script computes all the compression values needed to compute UCD, NCD and CD. In turn, a script corresponding to each of these functions, given as input the results of the preprocessing step, produces the actual dissimilarity matrix. A detailed description of these scripts is given in the readme file at the supplementary material web site [1].
ROC curves and external measures
A ROC curve [58] is a plot of the true positive rate (the sensitivity) versus the false positive rate (one minus the specificity) of a binary classifier as the discrimination threshold changes. The AUC (area under the ROC curve) ranges from 1 for a perfect classification, with 100% sensitivity (all true positives are found), to 0 for the worst possible classification, with 0% sensitivity (no true positive is found); a random classifier has an AUC value around 0.5. When plotted, the closer the ROC curve follows the left-hand (ordinate) axis and then the top border of the plot, the better the classification; the ROC curve of a random classifier lies close to the diagonal. The F-measure [59] takes as input two classifications of the same set of objects, that is, two partitions of that set, and returns a value ranging from 0 for highest dissimilarity to 1 for identical classifications. The partition (or symmetric difference) distance [60, 61] takes as input the tree topologies of two alternative classifications of n species and returns a value ranging from 0 to 4n - 10: it is the number of clades in the two rooted trees that do not match, it increases with dissimilarity, and a value of zero indicates isomorphic trees.
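As a reminder of how AUC values can be computed from a dissimilarity function, the following sketch (illustrative only; any standard ROC package computes the same quantity) uses the equivalence between the AUC and the Wilcoxon-Mann-Whitney statistic: given the dissimilarities assigned to positive pairs (pairs of objects from the same class) and negative pairs (pairs from different classes), the AUC is the fraction of (positive, negative) pairs in which the positive pair receives the smaller dissimilarity, ties counting one half.

```cpp
// Sketch: AUC computed as the Wilcoxon-Mann-Whitney statistic, i.e. the
// fraction of (positive, negative) pairs that the dissimilarity ranks
// correctly (smaller dissimilarity for the positive pair). Toy values only.
#include <iostream>
#include <vector>

double auc(const std::vector<double>& pos, const std::vector<double>& neg) {
    double correct = 0;
    for (double p : pos) {
        for (double q : neg) {
            if (p < q)       correct += 1.0;    // positive ranked before negative
            else if (p == q) correct += 0.5;    // ties count one half
        }
    }
    return correct / (double(pos.size()) * double(neg.size()));
}

int main() {
    // Dissimilarities of same-class pairs (positives) and different-class pairs.
    std::vector<double> pos = {0.12, 0.20, 0.31, 0.45};
    std::vector<double> neg = {0.40, 0.55, 0.62, 0.80, 0.95};
    std::cout << "AUC = " << auc(pos, neg) << '\n';   // 1 would be a perfect ranking
}
```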
Experimental set-up and availability
All the experiments have been performed on a 64-bit AMD Athlon 2.2 GHz processor with 1 GB of main memory, running both Windows XP and Linux. All software and data sets involved in our experimentation are available at the supplementary web site [1]. The software is provided both as source code (C++ and Perl scripts) and as executables, tested on Linux (various versions; see the supplementary material web site), BSD Unix, Mac OS X and Windows operating systems.