An investigation into inter- and intragenomic variations of graphic genomic signatures

Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.


Introduction
Alongside DNA barcoding, [1] and Klee diagrams [2], Chaos Game Representation (CGR) patterns of genomic segments have been proposed as another method for the classification and identification of genomic sequences [3][4][5][6][7].The concept of genomic signature was first introduced in [8], as being any specific quantitative characteristic of a DNA genomic sequence that is pervasive along the genome of the same organism, while being dissimilar for DNA sequences originating from different organisms.Initial studies [3,9], terns of mtDNA sequences can be different from those of DNA sequences from the major genome of the same organism, and that large scale quantitative analyses of the hypothesis that CGR can play the role of a genomic signature for genomic sequences have not, to our knowledge, been performed.The objective of this study is to confirm that CGR can play the role of genomic signature for genomic DNA sequences, as well as to assess various distances that can be used to compare CGRs of genomic sequences.
We analyze 508 fragments, 150 kbp (kilo base pairs) long, taken from complete genomic DNA sequences of six species, each representing a different kingdom: chromosome 21 of Homo sapiens, chromosome 4 of Saccharomyces cerevisiae, chromosome 1 of Arabidopsis thaliana, chromosome 14 of Plasmodium falciparum, the genome of Escherichia coli, and the genome of Pyrococcus furiosus, for a total length of 76,200,000 bp analyzed.We analyze the intergenomic and intragenomic variation of CGR genomic signatures of these sequences by using six different distances for image comparison: Structural Dissimilarity Index (DSSIM) [10], Euclidean distance, Pearson correlation distance [11], Manhattan distance [12], approximated information distance [13], and a distance we propose here, called descriptor distance.We visualize the results by computing the Molecular Distance Maps of all DNA sequences in the database, for each of the six distances.The resulting Molecular Distance Maps show a good clustering of the DNA sequences, with those originating from the same genome being largely grouped together, and separated from sequences belonging to genomes of different organisms.We observe that, in some of the cases where the clustering was suboptimal, the computation of three-dimensional Molecular Distance Maps resolves what appeared to be cluster overlaps in the two-dimensional Molecular Distance Maps.Lastly, using the "ground-truth" that sequences from the same genomes should have similar structural characteristics and thus be grouped together, while those from genomes of different organisms should be separated, we assess the six distances by combining three different quality measures: correlation to an idealized cluster distance, silhouette accuracy, and histogram overlap.We conclude that DSSIM and the descriptor distance perform best according to these measures.We also provide preliminary evidence of this method's applicability to classifying genomic DNA sequences of closely related species by comparing the H. sapiens (chromosome 21) sequences with 168 genomic DNA sequences, 150 kbp long, from Pan troglodytes (chimp, chromosome Y), for an additional length of 25,200,000 bp analyzed.Further research may lead to improvements of these distances for optimal genomic DNA sequence identification and classification results.
Note that other alignment-free methods have been used for phylogenetic analysis of DNA sequences.The initial reports on CGRs of genomic sequences [3,14] contained mostly qualitative assessments of CGR patterns of whole genes.In [7], several datasets of up to 36 genomic DNA sequences were analyzed, and in [9] some various-length sequences were analyzed based on computing Euclidean distances between frequencies of their k-mers, for k = 1, ..., 8. Subsequently, [5] computed the Euclidean distance between frequencies of k-mers (k ≤ 5) for the analysis of 125 GenBank DNA sequences from 20 bird species and the American alligator.In [15], 27 microbial genomes were analyzed to find implications of 4-mer frequencies (k = 4) on their evolutionary relationships.In [13], 20 mammalian complete mtDNA sequences were analyzed using the "similarity metric", for k = 7.Another study, [16], analyzed 459 bacteriophage genomes and compared them with their host genomes to infer host-phage relationships, by computing Euclidean distances between frequencies of k-mers for k = 4.In [17], 75 complete HIV genome sequences were compared using the Euclidean distance between frequencies of 6-mers (k = 6), in order to group them in subtypes.In [4] a dataset of 3,176 complete mtDNA sequences was analyzed, and several Molecular Distance Maps were obtained using DSSIM and a value of k = 9.
The main contributions of this paper are: • We tested and confirmed for an extensive dataset, of a total length of 101,400,000 bp, the hypothesis that CGR images of genomic DNA sequences can play the role of a (graphic) genomic signature, meaning that they have a desirable genomeand species-specificity.The dataset comprised 150 kbp long sequences taken from genomes of organisms from each of the six kingdoms of life, augmented by a set of same-length genomic sequences from P. troglogytes as a test-case of this method's applicability to closely related species.• We assessed the performance of six different distances in this context, and this analysis included both same-genome and different-genome DNA fragment pairs.For several of these distances, the intragenomic values were overall smaller than intergenomic values, suggesting that this method could separate DNA genomic fragments belonging to different genomes, based on their CGRs.• We showed that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.In particular, we determined that the DSSIM distance and descriptor distance (introduced here), both of whom essentially compare the k-mer composition of DNA sequences (herein k = 9), were best able to differentiate sequences originating from different genomes in this dataset.• This study represents, to the best of our knowledge, the largest combined dataset size and value of k for this type of analysis.• Based on preliminary data, we suggest the use of three-dimensional Molecular Distance Maps for improved visualization of the simultaneous interrelationships among similar or very distant DNA sequences.

Methods
In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method, and conclude with a description of the six distances that we considered.

Dataset
The dataset we used includes complete genomic sequences from six organisms, each representing one of the six kingdoms of life, see    In order to have relatively comparable number of DNA sequences for each organism, we chose the longest chromosomes for all organisms except H. sapiens, for which the shortest chromosome was chosen.
The DNA sequences in the NCBI database are represented as strings of letters "A", "C", "G", "T", and "N" which represent the four nucleobases Adenine, Cytosine, Guanine, Thymine, and "unidentified Nucleotide", respectively.For our analysis we ignored all letters "N".In S. cerevisiae and E. coli there were no ignored letters, and in P. falciparum and P. furiosus the number of ignored letters is of the order of 0.001% of the length of the sequence.In H. sapiens this number is 27%, and in A. thaliana is 0.54%.In H. sapiens, in particular, 96.4% of these ignored letters exist in centromeric and telomeric regions of the chromosome.
The resulting genomic DNA sequences were divided into successive, non-overlapping, contiguous fragments, each 150 kbp long.When the last sequence was shorter than 150 kbp, it was not included in the analysis.This resulted in 234 fragments for H. sapiens, 30 fragments for E. coli, 10 fragments for S. cerevisiae, 201 fragments for A. thaliana, 21 fragments for P. falciparum, and 12 fragments for P. furiosus, for a total of 508 DNA fragments, see Table 2.

Overview
The method we used to analyze and classify the 508 sequences of the dataset has three steps: (i) generate graphical representations (images) of each DNA sequence using Chaos Game Representation (CGR), (ii) compute all pairwise distances between these images, and (iii) visualize the interrelationships implied by these distances as two-or three-dimensional maps, using Multi-Dimensional Scaling (MDS).
CGR is a method introduced by Jeffrey [3] in 1990 to visualize the structure of a DNA sequence.A CGR associates an image to each DNA sequence as follows.
Starting from a unit square with corners labelled A, C, G, and T, and the center of the square as the starting point, the image is obtained by successively plotting each nucleotide as the middle point between the current point and the corner labelled by the nucleotide to be plotted.If the generated square image has a size of 2 k × 2 k pixels, then every pixel represents a distinct k-mer: A pixel is black if the k-mer it represents occurs in the DNA sequence, otherwise it is white.CGR images of genetic DNA sequences originating from various species show patterns such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns, Figure 1.
For step (i), a slight modification of the original CGR was used, introduced by Deschavanne [7]: a k-th order FCGR (frequency CGR) is a 2 k × 2 k matrix that can be constructed by dividing the CGR plot into a 2 k × 2 k grid, and defining the element a ij as the number of points that are situated in the corresponding grid square.A first and second order FCGR are shown below, where N w is the number of occurrences of the oligonucleotide w in the sequence s.
The (k + 1)-th order F CGR k+1 (s) can be obtained by replacing each element N X in F CGR k (s) with four elements where X is a sequence of length k over the alphabet {A, C, G, T }.For step (ii), after computing the FCGR matrices for each of the 150 kbp sequences in our dataset, the goal was to measure "distances" between every two CGR images.There are many distances that can be defined and used for this purpose, [18].One of the goals of this study was to identify what distance is better able to differentiate the structural differences of various genomic DNA sequences and classify them based on the species they belong to.In this paper we use six different distances: Structural Dissimilarity Index (DSSIM), descriptor distance (defined here), Euclidean distance, Manhattan distance, Pearson correlation distance, and approximated information distance.
For step (iii), after computing all possible pairwise distances we obtained six different distance matrices.To visualize the inter-relationships between sequences implied by each of the distance matrices, and to thus visually assess each of the distances, we used Multi-Dimensional Scaling (MDS).MDS is an information visualization technique introduced by Kruskal in [19].Given as input a distance matrix that contains the pairwise distances among a set of items [1] , the output of MDS is a spatial representation of the items on a common Euclidean space wherein each item is represented as a point and the spatial distance between any two points corresponds to the distance between the items in the distance matrix: Objects with a small pairwise distance will result in points that are close to each other, while objects with a large pairwise distance will become points that are far apart.For example, in [4] MDS was used in conjunction with DSSIM and CGR to produce Molecular Distance Maps that visually display the simultaneous interrelationships among a set of full mitochondrial DNA sequences.
The ideal Molecular Distance Map is a placement of n items as points in an (n − 1)-dimensional space.The two-dimensional Molecular Distance Map is simply an approximation, a flattening of this highly-dimensional space onto the plane, which may sometimes result in erroneous positioning of some points.Increasing the dimensionality of the Molecular Distance Map often results in a more accurate representation of the real interrelationships between sequences, as embodied in the original distance matrix.

Distances
In this section we describe and formally define each of the six distances used in our analysis: DSSIM, descriptor distance (introduced here), Euclidean, Manhattan, Pearson, and approximated information distance.
Structural Similarity Index, SSIM, was introduced in [10] for the purpose of assessing the degree of similarity between two images.Given two images X, Y as n × n matrices having as elements integers ranging in the interval [0, L], SSIM computes three factors (luminance, contrast and structure) and combines them to obtain a similarity value.However, instead of computing a global similarity between the two images, each image is divided into 11 × 11 sliding square windows X ij (Y ij respectively) with i, j = 1, • • • , n − 10 which move pixel by pixel to eventually cover the entire image, and the SSIM similarity of any given pair of images is computed by comparing their corresponding windows.In addition, an 11 × 11 circular symmetric Gaussian weighting function W ∈ R 11×11 with a fixed standard deviation of 1.5, normalized to unit sum ( 11 p=1 11 q=1 W pq = 1), is used.Then, the mean µ x,i,j (µ y,i,j for Y ), variance σ x,i,j (σ y,i,j for Y ) and correlation σ xy,i,j are computed, as follows: W pq X ij pq [1] In this paper the items are the 150 kpb DNA sequences analyzed.
where A pq denotes the (p, q) element of the matrix A.
Based on these values, the luminance Then, these three factors are combined to get and finally, the SSIM index used to evaluate the overall image similarity is computed as In theory, the values for SSIM range in the interval [−1, 1] with the similarity being 1 between two identical images, 0, for example, between a black image and a white image, and −1 if the two images are negatively correlated; that is, SSIM(X, Y ) = −1 if and only if X and Y have the same luminance µ and every pixel x i of image X has the inverted value of the corresponding pixel To compute the distance rather than the similarity between two images, we calculate DSSIM (X, Y ) = 1− SSIM(X, Y ).Consequently, the range of DSSIM is the interval [0, 2]: two identical images will result in a DSSIM distance of 0, while two images that are the negatives of each other would result in a DSSIM distance of 2.
The descriptor distance between two FCGRs X, Y ∈ N 2 k ×2 k aims to compare a combination of several different"descriptors", that is, a combination of several different aspects, of the two given FCGRs.
A descriptor is a vector characterized by parameters m and r, as well as r intervals, where m is the size of the non-overlapping windows in which the FCGR is divided (scale of the comparison), and the r intervals represent the "granularity" of the analysis, in that they define the intervals of numbers of k-mer occurrences that are considered significant.For a given m ≤ k and r, and intervals [a 0 , a 1 ), [a 1 , a 2 ), Starting from the top-left corner, we divide each of the two FCGR matrices X and Y into non-overlapping submatrices [2] of size 2 m × 2 m .This procedure results in 4 k−m submatrices X ij and Y ij with i, j = 1, • • • , 2 k−m , which will be pairwise compared.
The choice of the r intervals, called "bins", points to the fact that, rather than considering the finest granularity, we are interested in a coarser comparison.This means that, instead of a computationally expensive pairwise comparison of all possible numbers of occurrences of k-mers, we are interested only in certain "bins" of such numbers.For example, in our case, we use r = 5 and consider only 5 different bins, that is only k-mers with number of occurences: 0 (not occurring), 1 (one occurrence), 2 (two occurrences), between 2 and 5, between 5 and 20, and greater than 20 (most frequent).Formally, we use r = 5 and [0, Afterwards, we compute for every In our case, for each X ij , we compute a five-tuple wherein, for example, the 4th element represents the number of 9-mers whose number of occurrences is in the 4th bin, that is, at least 5 but less than 20.The division to 2 m × 2 m is to obtain a probability distribution for each submatrix.The same procedure is performed for Y ij , resulting in the vector vecY ij .
We further append all vectors vecX ij and form a new vector vecX m,r and, using the same order of appending, we append all vectors vecY ij forming a new vector vecY m,r .These two vectors are the "descriptors" of the FCGR matrices X and Y for the parameters m, r and the r chosen bins.
As a last step, we combine descriptors vecX m,r (respectively vecY m,r ) for several values of m and r by appending them one after another, in the same order, to obtain the vector vecX (respectively vecY ). [2]In general, these windows (submatrices) can be overlapping, but in this paper we made the choice of using non-overlapping windows.
The descriptor distance between the two FCGRs X and Y is now defined as the Euclidean distance between the vectors vecX and vecY In our case we computed descriptors for m = 4, 5, 6 therefore forming vectors vecX and vecY of length 5 ( 512 64 ) 2 + ( 512 32 ) 2 + ( 512 16 ) 2 = 6720.In general, for a given r, the length of the vectors compared is r(( , where m 1 , m 2 , . . ., m p are the values used for m.The choice of m for this study was made to balance the computational cost of calculating the vector of descriptors with the ability to compare the two matrices at various scales: large (m = 6, that is, compare windows of size 64×64), medium (m = 5, windows of size 32×32)) and small (m = 4, windows of size 16 × 16).The parameter r = 5 and the 5 bins were kept constant throughout our calculations but, in general, these parameters can also be varied, and the resulting vectors for each value added to the vector of descriptors, resulting in a larger vector.
In principle, the descriptor distance between two FC-GRs effectively compares the distribution of frequencies of k-mers between the corresponding submatrices X ij and Y ij , and does that for several values of m, that is, at several different scales.(Note that, in each window X ij , all k-mers have the same suffix of length k − m.) We now illustrate the descriptor distance by an example wherein k = 3, m = 2, r = 3, and the 3 bins are [0, 15)∪[15, 30)∪[30, ∞).Since k = 3, the FCGR table will contain the number of occurrences of all 3-mers in a DNA sequence, as follows: CCC GCC CGC GGC CCG GCG CGG GGG ACC TCC AGC TGC ACG TCG AGG TGG CAC GAC CTC GTC CAG GAG CTG GTG AAC TAC ATC TTC AAG TAG ATG TTG CCA GCA CGA GGA CCT GCT CGT GGT ACA TCA AGA TGA ACT TCT AGT TGT CAA GAA CTA GTA CAT GAT CTT GTT AAA TAA ATA TTA AAT TAT ATT TTT Take the two FCGRs X, Y ∈ N 8×8 , (k = 3, thus 2 3 × 2 3 ) corresponding to two genomic 150 kbp sequences of our dataset (one human and one bacterial), respectively.In order to use small numbers throughout the example, we divide all elements of the obtained matrices by 100 and take the integer part of each element, obtaining: Thus, in the human DNA sequence, the triplet CCC appears about 4200 times, the triplet GCC appears about 3300 times, the triplet CGC appears about 900 times, etc.
Since m = 2, we divide each of the matrices X and Y into non-overlapping submatrices of size 4×4 (2 2 ×2 2 ).For X we thus obtain X 11 , X Since the r = 3 bins are [0, 15) ∪ [15, 30) ∪ [30, ∞), we will count, for each submatrix, the number of 3mers for which the number of occurrences is less than 15, between 15 and 30, and greater than or equal to 30.Thus we obtain vecX 11 = 1 (3, 7, 6) which has as elements the number of elements of X 11 which belong in each of the intervals selected, divided by the total number of elements of X 11 .We proceed similarly for vecX We apply exactly the same procedure for the matrix Y and we get vecY = 1 16 (1, 12, 3, 3, 9, 4, 1, 12, 3, 0, 15, 1) .
The descriptor distance between these two FCGRs is computed as the Euclidean distance between vecX and vecY , in this case d D (X, Y ) ≈ 0.718.Note that, since we started by dividing the number of 3-mer occurrences by 100, as well as because of the bin selection, this is a fictitious example.The real value of the descriptor distance between the mentioned human and bacterial sequences is 8.66, and the range of the descriptor distance for this dataset of DNA sequences is [0, 13.17].In general, the descriptor distance has a variable range, that depends on the choices of parameters used.
To compute the Euclidean, Manhattan and Pearson distances, we first convert the matrices X, Y ∈ N n×n into 1 × n 2 vectors.For two vectors x, y ∈ R n , their Euclidean distance d E (x, y) and their Manhattan distance d M (x, y) are computed as while their Pearson distance d P (x, y) is defined as where In theory, the correlation coefficient The last distance we considered is based on the information distance defined in [13].The use of this distance is motivated computationally since it is easily computed from FCGRs as it tracks the number of different k-mers for a sequence instead of the actual set.
In [13], for a given k, the information distance for two strings x, y is defined as with where N k (x) is the number of different k-mers (possibly overlapping) which occur in x.We go one step further and modify this in order to avoid the creation of "unwanted" k-mers from the concatenation xy of x and y.First, we need to show how we compute N k (x) for a sequence x.For a sequence x, firstly, we build its FCGR(x) = X ∈ N 2 k ×2 k , which is a matrix of 2 k × 2 k with element values in N. Then we unitize X, that is every non-zero entry becomes 1, while zeros remain 0. N k (x) is now computed as the sum of the elements of this unitized FCGR, that is, N k (x) = f (X) = SumOfElements(Unitize(X)).For two strings x and y, with FCGRs X and Y respectively, we define N k (x|y) as: This slight modification of the information distance gives us also the desired properties of d(x, x) = 0 and d(x, y) = d(y, x) which were not satisfied before.Using (1), we now define the approximated information distance (AID) as: where x, y are the strings and X, Y ∈ N 2 k ×2 k their FCGRs, respectively.It also turns out that this distance is in fact the normalised Hamming Distance of the unitized FCGRs X and Y .Note that, for two sets X and Y, the normalized Hamming distance is where denotes the symmetric difference.
The generation of CGR images, calculation of distance matrices and creation of 2D and 3D Molecular Distance Maps with MDS were done and can be tested with the code available in [20] written in Wolfram Mathematica, version 9. The interactive webtool ModMap, [21], allows in-depth exploration of the 2D Mod Maps (Molecular Distance Maps) in this paper [3] . [3]When using the interactive webtool MoDMap, clicking on a distance underneath a dataset will result in Online Supplemental Material [20] includes all distance matrices and the code used to produce all figures and plots in this paper.More details about the online resources can be found in Appendix B.

Analysis and Results
For our dataset, we use k = 9, that is, each DNA sequence was represented as a 2 9 × 2 9 FCGR matrix.In practice, this means that the FCGR of a DNA sequence contains the full information regarding its k-mer sequence composition, for k = 1, 2, ..., 9.The length choice of 150 kbp and value of k = 9 is justified by the fact that, for a random sequence of length 150 kbp, its CGR at resolution 2 9 × 2 9 has around half of the pixels black, and half white.
Figure 2 depicts two-dimensional Molecular Distance Maps for the over five hundred DNA sequences in our dataset, computed using the DSSIM distance, descriptor distance, Euclidean distance, Manhattan distance, Pearson distance and approximated information distance, respectively.Figure 3 depicts the corresponding three-dimensional Molecular Distance Maps for the same dataset.The projection of each threedimensional map is chosen by hand in order to visually separate clusters of points which appear to be overlapping in the two-dimensional maps, as discussed below.
We note that MDS is not a clustering method, as the clusters are defined beforehand by the coloring scheme used (blue for H. sapiens, green for E. coli, and so on).MDS simply tries to display visually the interrelationships between the given items, based on the pairwise distances in the distance matrix which is its input.Note also that an increase in dimensionality from 2 to 3 can lead to a better cluster visualization.For example, if we compare the two-dimensional and the threedimensional Molecular Distance Maps obtained using DSSIM, we see that points that appeared to be erroneously mixed with each other in the two-dimensional map, Figure 2(a), (S. cerevisiae and P. falciparum sequences mixed in with A. thaliana sequences) were in fact clearly separated from each other in Figure 3(a), the three-dimensional version of the Molecular Distance Map.
plotting the MoD Map of the dataset computed with that distance.On any particular MoD Map, clicking on a point will display a window with information about the subsequence represented by that point: its NCBI accession number, scientific name of the organism it originates from, and its CGR pattern.Clicking on the "From here" and "To here" buttons on two such selected windows will display the distance between the corresponding genomic subsequences in the distance matrix.
Figure 4 displays the histograms of the pairwise intragenomic distances (dark blue and turqoise) and intergenomic distances (grey) of DNA sequences from H. sapiens and A. thaliana, obtained using each of the six distances.As noted, some distances seem to perform better than others.Visually, the poorest performer for these two sets of sequences (from H. sapiens and A. thaliana) seems to be the Euclidean distance wherein the intragenomic distances are as high as intergenomic distances, and no separation is visible.In contrast, DSSIM gives -for the same data -intergenomic distances that are overall much higher than intragenomic distances, resulting in a clear classification of DNA sequences into the species they belong to.Table 3 displays the mean and standard deviation of distances between clusters C i and C j , 1 ≤ i, j ≤ 6, where a cluster C is defined as the set of all genomic sequences from the genome of organism , as labelled in Table 1.In each subtable, the diagonals represent the means and standard deviation for intragenomic distances, while the other entries are all intergenomic distances.From this table we see that for DSSIM, Manhattan and approximated information distance, the maximum of all the averages of intragenomic distances in this dataset is strictly smaller than the minimum of all the averages of intergenomic distances.For the descriptor distance and Pearson distance the previous statement does not hold but, for each pair of organisms, the two averages of intragenomic distances (e.g., human-human and plant-plant) are both lower than the average of the intergenomic distances (human-plant).For the Euclidean distance, none of the previous statements holds: For example, the average of the plant-plant intragenomic distances (element 4-4 in the Euclidean distance subtable of Table 3) intragenomic distances is 723, which is larger than 672, the average of the yeast-plant intergenomic distances (element 3-4 in the Euclidean distance subtable of Table 3).The complete histograms of all pairwise comparisons C i −C j can be found in Appendix C.

Quality Measures for Distances
In this section we present three quality measures that each evaluates the quality of the six distances considered.In the data mining literature a wide range of quality measures for clusterings has been defined; see for example [22,23].Most of these methods are designed to assess the quality of different automated clustering methods while using the same distance.Our set-up is different, as we use different distances while the clustering is fixed and given by the initial colourcoding of the sequence-representing points.Thus, we have to use other approaches to compare the distances we analyze.In particular, as the six distances have different ranges, we have to use assessment methods which are invariant to the scale of the distance.
The "ground-truth" that we use as a basis for our distance assessment is the fact that the "ideal" clustering of DNA sequences and the points that represent them is known: sequences from the same organism should be close to one another and far from sequences originating from other organisms.(This assumption is justified -for this dataset -as the six organisms considered are very different from one another, belonging to different kingdoms of life.)Thus, an optimal distance should yield a relatively small distance between two FCGRs which were generated from the DNA sequences originating from the same organism, and relatively high distances between two FCGR originating from DNA sequences coming from different organisms.
In order to assess each of the six distances quantitatively, we computed three quality measures which rate different features of a distance: • the correlation to an idealized cluster distance • the silhouette cluster accuracy • the relative overlap between the intragenomic and intergenomic distance histograms.Let us stress that all three quality measures of the six distances are based on the distance matrices which we computed and not on their MDS plots.We will define the three quality measures such that their expected values range in the interval [0, 1] where higher values correspond to better performance.
Let us first describe the three quality measures informally.An idealized distance is a distance that would be able to differentiate DNA sequences by species, that is, a distance δ for which δ(x, y) = 0 if x and y are sequences from the same species and δ(x, y) = 1 otherwise.The first quality measure, the correlation to an idealized cluster distance, measures how well a distance is linearly correlated to the idealized distance δ.The second quality measure, silhouette cluster accuracy, is the percentage of points that are best embedded in the cluster they belong to.The third quality measure quantifies the "visual overlap" between the intragenomic and intergenomic distance histograms.Given our dataset, it is reasonable to expect that a good distance gives a low value if applied to FCGRs of genomic sequences of the same organism, and a high value when applied to FCGRs of genomic sequences from two different organisms, thus separating the histograms of intragenomic distances from that of intergenomic distances.This is illustrated by the histograms in Figure 4, where a high overlap between the graph of intragenomic distances (dark blue and turquoise) and the graphs of intergenomic distances (grey) is an indication of a poorly performing distance.In a theoretically optimal situation, there would exist a value c such that all distances that are smaller than c are intragenomic distances and all distances that are larger than c are intergenomic distances.This can usually not be expected from real data, but a low overlap between histograms is nevertheless indicative of a "good" distance.
In order to formally define the three quality measures, we consider a dataset V which is partitioned into p non-overlapping clusters C 1 , . . ., C p for which a distance d α : V × V → R ≥0 exists.The cardinalities of the sets are |V | = m and |C i | = m i for i = 1, . . ., p.In our analysis, p = 6 and C 1 contains all FCGRs generated from genomic DNA sequences from H. sapiens, C 2 contains all FCGRs generated from genomic sequences of E.coli, and so on, according to the order in Table 1.The distance d α is one of the six distances α ∈ {DSSIM, D, E, M, P, AID}.
The correlation to an idealized cluster distance is computed as follows.We define the idealized cluster distance as a function (or matrix) δ : V × V → {0, 1} such that δ(x, y) = 0 if and only if x and y belong to the same cluster, and δ(x, y) = 1 otherwise.Because we can view d α and δ as discrete, symmetric functions which have the same domain, we can compute their correlation coefficient.We define the correlation of δ to d α to be the Pearson correlation of δ and d α .More precisely, the upper triangular part of the matrix corresponding to a distance d α is interpreted as a vector (x 1 , . . ., x n ) and compared with the corresponding values (y 1 , . . ., y n ) given by δ.We obtain the δ-correlation as The correlation ranges in the interval [−1, 1]: a value of 1 means that d α and δ are linearly correlated, and a value of 0 means that they are unrelated.In other words, if the value obtained by measuring the correlation of a given distance to the idealized cluster distance is close to 1, this means that the given distance is closer to the idealized cluster distance, and hence, performs well.Note that negative values for this measure are not expected as this would imply that d α and δ were negatively related (d α would perform worse than a matrix containing random entries).
The silhouette cluster accuracy is based on the silhouette coefficient, defined in [24], as a measure that determines how well a single point is embedded in the cluster to which it belongs.For a point x from cluster C i we define a x as the average distance of this point to all other points in C i , that is, and we define b x as the minimum over the average distances of x to all points of a different cluster The silhouette coefficient of x is defined as If a point x has a silhouette coefficient S α (x) ≤ 0, then x is at least as close to a cluster to which it does not belong than to its own cluster.The silhouette cluster accuracy A α denotes the percentage of points with a silhouette coefficient greater than 0, that is the percentage of points which are well-embedded in their own cluster, Obviously, the silhouette cluster accuracy ranges in [0, 1] with a high accuracy being desirable.For assessing the relative overlap of the histograms, consider any two clusters C i and C j with i = j (for example, C 1 is the H. sapiens cluster and C 4 the A. thaliana cluster).We compare the two sets of intragenomic distances C i -C i and C j -C j with the set of intergenomic distances C i -C j .For a distance d α , we divide the range from min(d α ) to the maximum distance max(d α ) in this dataset into 100 bins of size r = max(dα)−min(dα) 100 and count the distances which fall into this bin: c i,i [ ] denotes bin containing distances from C i -C i and c i,j [ ] denotes bin i containing distances from C i -C j .For = 1, . . ., 100 we let By s i ,j we denote the sum over all c i ,j -bins The overlap is normalized to the range [0, 1] where 0 means no overlap of elements of bins between intra-and intergenomic distances, and 1 means that one of the histograms completely "covers" the other.Also note that we are not interested in the overlap of C i -C i with C j -C j as both sets of distances are intragenomic distances.
Since we intend to define the a quality measure where a value close to 1 should represent a small overlap, we will use 1 − O α (i, j) as relative overlap.Furthermore, we combine these quantities for all possible pairs of clusters C i and C j , obtaining the relative overlap as: For example, in Figure 4, for each of the considered distance, the dark blue histograms depict the C 1 − C 1 (H.sapiens -H.sapiens) intragenomic distances, the turquoise histograms the C 4 − C 4 (A.thaliana -A.thaliana) intragenomic distances, and grey histograms the C 1 − C 4 (H.sapiens -A.thaliana) intergenomic distances.As seen from this figure, the descriptor distance appears to visually perform best at separating the two intragenomic distance histograms from the intergenomic histogram, while the Euclidean distance has the weakest performance.The relative overlap attempts to quantify this by computing the overlaps of each of the two pairs of histograms (dark blue with grey and turquoise with grey).Note that small visual histogram overlaps will result in a high numerical relative overlap, and is indicative of a better performing distance.

Distance Comparison Results
The results of comparing the six distances we analyzed, using the three quality measures, are listed in Table 4. Recall that all quality measures have an expected range of [0, 1] where larger values imply better performance.Dα is the correlation to an idealized cluster, Aα the silhouette cluster accuracy, and Oα the relative overlap.Higher is better.

Dα
To compare each distance relative to all the other distances, we further compute for each quality measure (each column) the standard scores (z-scores) of each distance d α , where α ∈ {DSSIM, D, E, M, P, AID}, as z(d α ) = dα−µ σ where µ is the mean and σ is the deviation of all six d α for that particular quality measure (column).A positive value of the standard score will mean that a distance performs above average (in this category) and a negative value that it performs below average.
Finally, we compute the sum of the z-scores for each quality measure as seen in Table 4.Note that the total of z-scores for a distance represents the performance of that distance relative to the other distances, and indicates its relative ranking.
The conclusion of this analysis is that the best performing distances are the descriptor distance and DSSIM.Manhattan, Pearson, and approximate information distance perform well in some categories but not so well in other categories.For this dataset and value of k, the Euclidean distance had the weakest performance in all measured categories, which confirms the visual assessment of the MDS plots obtained by using the Euclidean distance, as seen in Figure 2 and Figure 3.
It is worth noting that the two distances which perform best (DSSIM and descriptor) treat FCGR matrices as two-dimensional maps in which the local arrangement of the cells (matrix entries) influences the computed distance, whereas the other distances treat the FCGR matrices as linear vectors.This suggests that the organization of the k-mer tallies (in this paper k = 9) of a DNA sequence as an FCGR matrix, rather than a simple vector, reveals structural properties of the DNA sequence that could be utilized in order to identify and classify genomic DNA sequences.

Figure 2
Figure 2 Two-dimensional Molecular Distance Maps of DNA genomic sequences from all six organisms in the dataset, obtained using DSSIM, descriptor, Euclidean, Manhattan, Pearson and aproximated information distance, respectively.Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turqoise), P. falciparum (magenta), and P. furiosus (orange).

Figure 3
Figure 3 Three-dimensional Molecular Distance Maps of genomic DNA sequences from all six organisms in the dataset, obtained using DSSIM, descriptor, Euclidean, Manhattan, Pearson and approximated information distance, respectively.Each point corresponds to a 150 kbp genomic sequence from H. sapiens (blue), E. coli (green), S. cerevisiae (red), A. thaliana (turqoise), P. falciparum (magenta), and P. furiosus (orange).

Figure 4
Figure 4 Histograms of pairwise intragenomic and intergenomic distances among the DNA sequences from H. sapiens and A. thaliana.

Table 4
Summary of quality measures for the performances of six distances (DSSIM, descriptors, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 508 genomic DNA sequences taken from organisms from each kingdom of life.