An investigation into inter- and intragenomic variations of graphic genomic signatures
 Rallis Karamichalis^{1},
 Lila Kari^{1} (corresponding author),
 Stavros Konstantinidis^{2} and
 Steffen Kopecki^{1, 2}
Received: 19 December 2014
Accepted: 30 June 2015
Published: 7 August 2015
Abstract
Background
Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences.
Results
We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria; full genome), and P. furiosus (Archaea; full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million base pairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships.
Conclusion
Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
Keywords
Comparative genomics; Genomic signature; Species classification
Background
Alongside DNA barcoding [1] and Klee diagrams [2], Chaos Game Representation (CGR) patterns of genomic segments have been proposed as another method for the classification and identification of genomic sequences [3, 4]. The concept of genomic signature was first introduced in [5], as being any specific quantitative characteristic of a DNA genomic sequence that is pervasive along the genome of the same organism, while being dissimilar for DNA sequences originating from different organisms. Initial studies [3, 6], suggesting that short fragments of genomic sequences retain most of the characteristics of the genome of origin, indicated that such genomic signatures exist. In particular, the Chaos Game Representation (CGR) of a DNA sequence, a graphic representation of its sequence composition, was proposed in [3] as having both the pervasiveness and differentiability properties necessary for it to qualify as a genomic signature. Indeed, CGRs of genomic DNA sequences have been shown to be genome- and species-specific; see, e.g., [3, 4, 6–12]. Note that CGR patterns of mtDNA sequences can be different from those of DNA sequences from the major genome of the same organism, and that large-scale quantitative analyses, at all taxonomic levels, of the hypothesis that CGR can play the role of a genomic signature for genomic sequences have not, to our knowledge, been performed. The long-term objective of this research is to find out whether CGR can play the role of genomic signature for genomic DNA sequences, and can be used to identify and classify genomic sequences at all taxonomic levels. To this end, the objective of this study is to quantitatively assess the usability of CGR for classification of genomic sequences at the kingdom level, as well as to assess various distances that can be used to compare CGRs of genomic sequences for this purpose.
We first analyze 508 fragments, 150 kbp (kilo base pairs) long, spanning single complete chromosomes of six organisms, each representing a different kingdom: chromosome 21 of Homo sapiens, chromosome 4 of Saccharomyces cerevisiae, chromosome 1 of Arabidopsis thaliana, chromosome 14 of Plasmodium falciparum, the genome of Escherichia coli, and the genome of Pyrococcus furiosus, for a total length of 76,200 kbp analyzed. We analyze the intergenomic and intragenomic variation of CGR genomic signatures of these sequences by using six different distances: Structural Dissimilarity Index (DSSIM) [13], Euclidean distance, Pearson correlation distance [14], Manhattan distance [15], approximated information distance [16], and a distance defined here, based on an idea from computer vision, called descriptor distance. For each of the six distances, we visualize the results by computing Molecular Distance Maps [12], which represent sequences as points in a two-dimensional or three-dimensional space, and thus display all their interrelationships simultaneously. The resulting Molecular Distance Maps show good clustering, with genomic sequences originating from the same genome being largely grouped together, and separated from sequences belonging to genomes of different organisms. We observe that, in some of the cases where the clustering is suboptimal, the computation of three-dimensional Molecular Distance Maps resolves what appeared to be cluster overlaps in the two-dimensional Molecular Distance Maps. Using the “ground truth” that sequences from the same genomes should have similar structural characteristics and thus be grouped together, while those from genomes of different organisms should be separated, we assess the six distances by combining three different quality measures: correlation to an idealized cluster distance, silhouette accuracy, and histogram overlap. We conclude that, for this dataset, DSSIM and the descriptor distance perform best according to these measures.
To maximize the diversity within each species, we also analyze a set of 526 fragments, 150 kbp long, sampled from the entire genomes of the aforementioned six organisms, for a total length of 78,900 kbp analyzed. The resulting Molecular Distance Maps are very similar to the ones in the first experiment, and the distance ranking is also the same, confirming the preceding results.
Lastly, we provide some preliminary evidence of this method’s applicability to classifying genomic DNA sequences at lower taxonomic levels by comparing 240 genomic sequences, 150 kbp long, sampled from the entire genome of Homo sapiens (class Mammalia, order Primates) with 210 genomic sequences, 150 kbp long, sampled from the entire genome of Mus musculus (mouse, class Mammalia, order Rodentia) for an additional length of 67,500 kbp analyzed. While a clear separation of sequences by genome is indeed achieved, we observe that the distance ranking is quite different compared to the previous two experiments, indicating that different distances may have to be used for comparing genomic sequences at different taxonomic levels.
Note that early analyses of genomic sequences with regard to similarities in the relative abundances of oligonucleotides of lengths k=1,…,6 exist and include [17–25]. Also, several alignment-free methods that use fixed-length word frequencies have been used for phylogenomic analysis of DNA sequences [26–28]. These methods include statistical studies of word frequency within a DNA sequence [5, 29–34], or employ k-words and the Markov model to obtain information about DNA sequences [35–39]. Iterated map methods for DNA sequence comparison include CGR-based analyses, see [3, 40–46], and such alignment-free methods have been successfully applied for sequence comparison [4, 11, 12, 47–53].
The initial reports on CGRs of genomic sequences [3, 6] contained mostly qualitative assessments of CGR patterns of whole genes. In [54], several comparisons of eukaryotic genomic sequences, including within-species comparisons, were reported, using di-, tri-, and tetranucleotide relative abundance distance (k=2,3,4). In [25], di- and tetranucleotide abundance profiles (k=2, k=4) were compared for genomic collections from genomes of 5 gram-negative proteobacteria (including 2 complete genomes), 3 gram-positive bacteria, 2 mycoplasmas (complete genomes), 2 cyanobacteria (1 complete genome), and 3 thermophilic archaea (1 complete genome), using the \(\delta^{*}\) distance, which computes the average absolute difference of the dinucleotide relative abundance values. In [4], several datasets of up to 36 genomic DNA sequences were analyzed, and in [9] sequences of various lengths were analyzed based on computing Euclidean distances between frequencies of their k-mers, for k=1,…,8. Subsequently, [10] computed the Euclidean distance between frequencies of k-mers (k≤5) for the analysis of 125 GenBank DNA sequences from 20 bird species and the American alligator. In [47], 27 microbial genomes were analyzed to find implications of 4-mer frequencies (k=4) on their evolutionary relationships. In [16], 20 mammalian complete mtDNA sequences were analyzed using the “similarity metric”, for k=7. In [50], a multigene dataset of 33 genes for 9 bacteria and one archaeal species, as well as the whole genomes of a set of 16 γ-proteobacteria, were analyzed, using values of k between 1 and 10, and Euclidean and \(\chi^{2}\) distances. In [11], a collection of 26 complete mitochondrial genomes was analyzed, using the Euclidean distance and an “image distance”, with a value of k=10. In [55], a megabase-scale phylogenomic analysis of the Reptilia was reported that compared frequency distributions of 8-mer oligonucleotides (k=8) using Euclidean distance.
Another study, [56], analyzed 459 bacteriophage genomes and compared them with their host genomes to infer host-phage relationships, by computing Euclidean distances between frequencies of k-mers for k=4. In [57], 75 complete HIV genome sequences were compared using the Euclidean distance between frequencies of 6-mers (k=6), in order to group them in subtypes. In [58], several datasets were analyzed (109 complete genomes of prokaryotes and eukaryotes, 34 prokaryote and chloroplast genomes, mitochondrial genomes of 64 vertebrates, and 62 complete genomes of alpha proteobacteria) using values of k=5,6 for protein-coding genes and k=11,12 for whole genomes, with two distances: chord distance and piecewise distance. In [12], a dataset of 3,176 complete mtDNA sequences was analyzed using an image distance, DSSIM, and a value of k=9, and several Molecular Distance Maps were obtained which displayed sequences’ interrelationships at several taxonomic levels (phylum Vertebrata, kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates).

We tested and confirmed, for an extensive dataset of a total length of approximately 174 Mbp, the hypothesis that CGR images of genomic DNA sequences can play the role of a (graphic) genomic signature, meaning that they have a desirable genome- and species-specificity. The dataset comprised 150 kbp fragments taken from genomes of six organisms, one from each of the six kingdoms of life. This was augmented by a set of 150 kbp fragments randomly sampled from all chromosomes of M. musculus, as a test case of this method’s applicability at lower taxonomic levels.

We assessed the performance of six different distances in this context, and this analysis included both same-genome and different-genome DNA fragment pairs. For several of these distances, the intragenomic values were overall smaller than intergenomic values, suggesting that this method could separate DNA genomic fragments belonging to different genomes, based on their CGRs.

We showed that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies. In particular, we determined that the DSSIM distance and the descriptor distance, adapted from computer vision for this application, were best able to differentiate sequences originating from different genomes at the kingdom level. Both these distances essentially compare the k-mer composition of DNA sequences (herein k=9).

Based on preliminary data, we suggested the use of three-dimensional Molecular Distance Maps for improved visualization of the simultaneous interrelationships within a given set of genomic sequences.
Further analysis is needed to explore this method’s potential to differentiate genomic sequences originating from closely related species (e.g. within the same order). Additional refinements of the distances considered may have to be defined for optimal genomic DNA sequence identification and classification at very low taxonomic levels.
Methods
In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method, and conclude with a description of the six distances that we considered.
Dataset
Dataset for the first experiment: NCBI accession numbers of the complete chromosomes considered, in increasing order of their NCBI accession number
No.  Organism  NCBI Acc. Nr.

1  H. sapiens, chrom. 21 (Animalia)  NC_000021.8 
2  E. coli (Bacteria)  NC_000913.3 
3  S. cerevisiae, chrom. 4 (Fungi)  NC_001136.10 
4  A. thaliana, chrom. 1 (Plantae)  NC_003070.9 
5  P. falciparum, chrom. 14 (Protista)  NC_004317.2 
6  P. furiosus (Archaea)  NC_018092.1 
In order to have relatively comparable numbers of DNA sequences for each organism, we chose the longest chromosomes for all organisms except H. sapiens, for which the shortest chromosome was chosen.
The DNA sequences in the NCBI database are represented as strings of letters “A”, “C”, “G”, “T”, and “N” which represent the four nucleotides Adenine, Cytosine, Guanine, Thymine, and “unidentified Nucleotide”, respectively. For our analysis we ignored all letters “N”. In S. cerevisiae and E. coli there were no ignored letters, and in P. falciparum and P. furiosus the number of ignored letters was of the order of 0.001 % of the length of the sequence. In H. sapiens this number was 27 %, and in A. thaliana it was 0.54 %. In H. sapiens, in particular, 96.4 % of these ignored letters are located in centromeric and telomeric regions of the chromosome.
The first experiment: Organisms considered, total length of the chromosome (respectively genome), number of ignored letters “N”, and number of DNA fragments (sequences) obtained by splitting a single complete chromosome per organism into consecutive, non-overlapping, equal-length (150 kbp) contiguous fragments
Organism  Length(bp)  # Letters “N”  # Fragments 

H. sapiens  48,129,895  13,023,253  234 
E. coli  4,641,652  0  30 
S. cerevisiae  1,531,933  0  10 
A. thaliana  30,427,671  164,359  201 
P. falciparum  3,291,871  37  21 
P. furiosus  1,909,827  10  12 
To maximize the diversity within each species, the dataset of the second experiment comprised fragments randomly sampled from each chromosome of the six chosen organisms, as follows. After deleting all “N” nucleotides, each chromosome was divided into successive, non-overlapping, contiguous fragments, each 150 kbp long. When the last fragment was shorter than 150 kbp, it was not included in the analysis. Next, for each chromosome we randomly selected 10 such fragments to represent the chromosome, see [59], Appendix B. In the cases where there were fewer than 10 fragments in a chromosome, all of them were considered. In the cases of E. coli and P. furiosus, we retained all complete fragments of the genome. This resulted in 240 fragments for H. sapiens, 30 fragments for E. coli, 73 fragments for S. cerevisiae, 50 fragments for A. thaliana, 121 fragments for P. falciparum, and 12 fragments for P. furiosus, for a total of 526 fragments.
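The fragment preparation described above can be sketched as follows. This is a minimal illustration, not the authors' Mathematica code: the function names, the fixed random seed, and the use of plain Python strings for chromosomes are assumptions made for the example.

```python
import random


def split_into_fragments(chromosome, fragment_len=150_000):
    """Delete all 'N' letters, then cut the sequence into consecutive,
    non-overlapping fragments; a shorter trailing fragment is discarded."""
    cleaned = chromosome.upper().replace("N", "")
    n_full = len(cleaned) // fragment_len
    return [cleaned[i * fragment_len:(i + 1) * fragment_len]
            for i in range(n_full)]


def sample_fragments(fragments, max_count=10, seed=0):
    """Pick up to `max_count` fragments at random; if there are not more
    than `max_count` fragments, keep all of them.
    The seed is a hypothetical choice, used only for reproducibility."""
    if len(fragments) <= max_count:
        return list(fragments)
    return random.Random(seed).sample(fragments, max_count)
```

For the two prokaryotic genomes, where all complete fragments were retained, only `split_into_fragments` would be applied.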
Overview
The method we used to analyze and classify genomic sequences has three steps: (i) generate graphical representations (images) of each DNA sequence using Chaos Game Representation (CGR), (ii) compute all pairwise distances between these images, and (iii) visualize the interrelationships implied by these distances as two- or three-dimensional maps, using Multi-Dimensional Scaling (MDS).
where X is a sequence of length k over the alphabet {A, C, G, T}.
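Step (i) can be illustrated with a short sketch that counts k-mer occurrences and arranges them on a \(2^{k} \times 2^{k}\) grid in the manner of a frequency CGR (FCGR). The corner assignment used here (A, C, G, T mapped to the bit pairs below) and the row orientation are assumptions chosen for the example; the only property relied upon is the usual CGR one, namely that the last nucleotide of a k-mer selects the coarsest quadrant.

```python
def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as in a CGR."""
    size = 2 ** k
    # Assumed corner convention: (x_bit, y_bit) for each nucleotide.
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    grid = [[0] * size for _ in range(size)]
    # Letters other than A, C, G, T (for example "N") are ignored.
    seq = [c for c in sequence.upper() if c in corner]
    for start in range(len(seq) - k + 1):
        x = y = 0
        for pos, base in enumerate(seq[start:start + k]):
            bx, by = corner[base]
            # Later nucleotides occupy more significant bits, so the
            # suffix of the k-mer determines the coarse position.
            x |= bx << pos
            y |= by << pos
        grid[y][x] += 1
    return grid
```

With this layout, every aligned \(2^{m} \times 2^{m}\) block of the grid collects the k-mers that share the same suffix of length k−m.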
For step (ii), after computing the FCGR matrices for each of the 150 kbp sequences in a given dataset, the goal was to measure “distances” between every two CGR images. There are many distances that can be defined and used for this purpose [64]. One of the goals of this study was to identify which distance is best able to capture the structural differences among genomic DNA sequences and to classify them according to the species they belong to. In this paper we use six different distances: Structural Dissimilarity Index (DSSIM), descriptor distance (adapted from computer vision for this application), Euclidean distance, Manhattan distance, Pearson correlation distance, and approximated information distance.
For step (iii), after computing all possible pairwise distances we obtained six different distance matrices. To visualize the interrelationships between sequences implied by each of the distance matrices, and to thus visually assess each of the distances, we used Multi-Dimensional Scaling (MDS). MDS is an information visualization technique introduced by Kruskal in [65]. MDS takes as input a distance matrix that contains the pairwise distances among a set of items (here the items are the 150 kbp DNA sequences analyzed). The output of MDS is a spatial representation of the items in a common Euclidean space, wherein each item is represented as a point and the spatial distance between any two points corresponds to the distance between the items in the distance matrix. Objects with a small pairwise distance will result in points that are close to each other, while objects with a large pairwise distance will become points that are far apart.
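To make the input and output of step (iii) concrete, the sketch below implements classical (Torgerson) MDS using only the Python standard library. This is a swapped-in, simplified variant chosen for brevity, not necessarily the MDS algorithm used for the maps in this paper; the function name, the power-iteration eigen-solver, and its parameters are assumptions of the example, and the pure-Python loops are far too slow for serious use on hundreds of sequences.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n items as points in `dims` dimensions from an n x n
    symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant eigenvector of the current B.
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the (signed) eigenvalue.
        vv = sum(x * x for x in v)
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n)) / vv
        # Non-positive eigenvalues contribute no real coordinate.
        scale = math.sqrt(max(eigenvalue, 0.0) / vv)
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j] / vv
    return coords
```

Calling `classical_mds(matrix, dims=3)` instead of the default `dims=2` yields the three-dimensional variant of the same embedding.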
The combination of CGR/DSSIM/MDS was first proposed in [12, 66] as a tool to quantitatively measure and display the interrelationships among a set of complete mitochondrial sequences. The outputs of this method, called Molecular Distance Maps, are two-dimensional maps wherein each point represents a mitochondrial genome, and the spatial distances between any two points correspond to the differences between the structural composition of the corresponding DNA sequences. The ideal Molecular Distance Map is a placement of n items as points in an (n−1)-dimensional space. The two-dimensional Molecular Distance Map is simply an approximation, a flattening of this high-dimensional space onto the plane, which may sometimes result in erroneous positioning of some points. Increasing the dimensionality of the Molecular Distance Map often results in a more accurate representation of the real interrelationships between sequences, as embodied in the original distance matrix.
Distances
In this section we describe and formally define each of the six distances used in our analysis: DSSIM, descriptor distance (adapted from computer vision for this application), Euclidean, Manhattan, Pearson, and approximated information distance.
In theory, the values for SSIM range in the interval [−1,1], with the similarity being 1 between two identical images, 0, for example, between a black image and a white image, and −1 if the two images are negatively correlated; that is, SSIM(X,Y)=−1 if and only if X and Y have the same luminance μ and every pixel \(x_{i}\) of image X has the inverted value of the corresponding pixel \(y_{i} = 2\mu - x_{i}\) in Y.
To compute the distance rather than the similarity between two images, we calculate DSSIM(X,Y) = 1 − SSIM(X,Y). Consequently, the range of DSSIM is the interval [0,2]: two identical images will result in a DSSIM distance of 0, while two images that are the negatives of each other would result in a DSSIM distance of 2.
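The two extreme cases above can be checked numerically with a deliberately simplified sketch. It evaluates the standard SSIM expression once over the whole image (a single global window) rather than over local windows, so it is an assumption-laden stand-in and not the exact SSIM computation used in this study; the stabilising constants `c1` and `c2` are likewise hypothetical values.

```python
def ssim_global(x, y, c1=1e-8, c2=1e-8):
    """Single-window SSIM of two equally long lists of pixel values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def dssim_global(x, y):
    """Distance derived from the similarity: DSSIM = 1 - SSIM."""
    return 1.0 - ssim_global(x, y)


image = [0.0, 1.0, 2.0, 3.0]
mean = sum(image) / len(image)
negative = [2 * mean - p for p in image]  # y_i = 2*mu - x_i
print(dssim_global(image, image))     # identical images
print(dssim_global(image, negative))  # inverted image, same luminance
```

The first value is 0 and the second is 2 up to the small offset introduced by the constants.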
For defining the descriptor distance we adapted for this application the spatial pyramid matching approach of [67], which is used to calculate hierarchical image descriptors. The descriptor distance between two FCGRs \(X,Y \in \mathbb {N}^{2^{k} \times 2^{k}}\) aims to compare a combination of several different “descriptors”, that is, a combination of several different aspects, of the two given FCGRs.
A descriptor is a vector characterized by parameters m and r, as well as r intervals, where m determines the size (\(2^{m} \times 2^{m}\)) of the non-overlapping windows into which the FCGR is divided (scale of the comparison), and the r intervals represent the “granularity” of the analysis, in that they define the intervals of numbers of k-mer occurrences that are considered significant.
For a given m≤k and r, and intervals \([a_{0},a_{1}), [a_{1},a_{2}), \ldots, [a_{r-1},a_{r})\) such that \(\bigcup _{i=0}^{r-1} [a_{i},a_{i+1})=[0,\infty)\) and \([a_{i},a_{i+1}) \cap [a_{j},a_{j+1}) = \emptyset\) for all i,j with i≠j, a descriptor is constructed as follows.
Starting from the top-left corner, we divide each of the two FCGR matrices X and Y into non-overlapping submatrices of size \(2^{m} \times 2^{m}\). This procedure results in \(4^{k-m}\) submatrices \(X_{ij}\) and \(4^{k-m}\) submatrices \(Y_{ij}\), with \(i,j=1,\ldots,2^{k-m}\), which will be pairwise compared.
The choice of the r intervals, called “bins”, reflects the fact that, rather than considering the finest granularity, we are interested in a coarser comparison. This means that, instead of a computationally expensive pairwise comparison of all possible numbers of occurrences of k-mers, we are interested only in certain “bins” of such numbers. For example, in our case, we use r=5 and consider only 5 different bins, that is, we distinguish k-mers only by whether their number of occurrences is 0 (not occurring), 1 (one occurrence), at least 2 but less than 5, at least 5 but less than 20, or at least 20 (most frequent). Formally, we use r=5 and [0,∞)=[0,1)∪[1,2)∪[2,5)∪[5,20)∪[20,∞) as the 5 bins.
Afterwards, we compute for every \(X_{ij}\) a vector vec \(X_{ij} = \frac{1}{2^{m} \times 2^{m}} (b_{1}, b_{2}, \ldots, b_{r})\), where \(b_{i} = |\{x \in X_{ij} : a_{i-1} \le x < a_{i}\}|\) is the number of entries of \(X_{ij}\) that fall in the i-th bin. In our case, for each \(X_{ij}\), we compute a five-tuple wherein, for example, the 4th element represents the number of 9-mers whose number of occurrences is in the 4th bin, that is, at least 5 but less than 20. The division by \(2^{m} \times 2^{m}\) serves to obtain a probability distribution for each submatrix. The same procedure is performed for \(Y_{ij}\), resulting in the vector vec \(Y_{ij}\).
We further append all vectors vec X _{ ij } and form a new vector vec X ^{ m, r } and, using the same order of appending, we append all vectors vec Y _{ ij } forming a new vector vec Y ^{ m, r }. These two vectors are the “descriptors” of the FCGR matrices X and Y for the parameters m, r and the r chosen bins.
As a last step, we combine descriptors vec X ^{ m, r } (respectively vec Y ^{ m, r }) for several values of m and r by appending them one after another, in the same order, to obtain the vector vecX (respectively vecY).
In our case we computed descriptors for m=4,5,6, therefore forming vectors vecX and vecY of length \(5\left ((\frac {512}{64})^{2}+(\frac {512}{32})^{2}+(\frac {512}{16})^{2}\right)=6720\). In general, for a given r, the length of the vectors compared is \(r ((2^{k-m_{1}})^{2} + (2^{k-m_{2}})^{2} + \ldots + (2^{k-m_{p}})^{2})\), where \(m_{1},m_{2},\ldots,m_{p}\) are the values used for m. The choice of m for this study was made to balance the computational cost of calculating the vector of descriptors with the ability to compare the two matrices at various scales: large (m=6, that is, compare windows of size 64×64), medium (m=5, windows of size 32×32) and small (m=4, windows of size 16×16). The parameter r=5 and the 5 bins were kept constant throughout our calculations but, in general, these parameters can also be varied, and the resulting vectors for each value added to the vector of descriptors, resulting in a larger vector.
In principle, the descriptor distance between two given FCGRs effectively compares the distribution of frequencies of k-mers between the corresponding submatrices \(X_{ij}\) and \(Y_{ij}\), and does that for several values of m, that is, at several different scales. (Note that, in each window \(X_{ij}\), all k-mers have the same suffix of length k−m.)
We now illustrate the descriptor distance by an example wherein k=3, m=2, r=3, and the 3 bins are [0,15), [15,30), and [30,∞). Since k=3, the FCGR table will contain the number of occurrences of all 3-mers in a DNA sequence, as follows:
Thus, in the human DNA sequence, the triplet CCC appears about 42 × 100 times, the triplet GCC appears about 33 × 100 times, the triplet CGC appears about 9 × 100 times, etc.
and similarly for Y.
The descriptor distance between these two FCGRs is computed as the Euclidean distance between vecX and vecY, in this case \(d_{D}(X,Y) \approx 0.718\). Note that, since we started by dividing the number of 3-mer occurrences by 100, as well as because of the bin selection, this is a fictitious example. The real value of the descriptor distance between the mentioned human and bacterial sequences is 8.66, and the range of the descriptor distance for this dataset of DNA sequences is [0, 13.17]. In general, the descriptor distance has a variable range, which depends on the choices of parameters used.
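The construction of the descriptor vectors and of the resulting distance can be sketched as below. The code follows the steps given in the text (windows of size \(2^{m} \times 2^{m}\), the five bins, normalisation by the window area, concatenation over m=4,5,6, then Euclidean distance); the function names and the representation of an FCGR as a list of lists are assumptions of this example rather than the authors' implementation.

```python
import bisect
import math

# Lower edges of the five bins [0,1), [1,2), [2,5), [5,20), [20, inf).
BIN_EDGES = [0, 1, 2, 5, 20]


def descriptor(grid, m, edges=BIN_EDGES):
    """Concatenate the normalised bin histograms of all 2^m x 2^m windows."""
    size, win = len(grid), 2 ** m
    if win > size:
        raise ValueError("window larger than the FCGR (need m <= k)")
    vec = []
    for top in range(0, size, win):
        for left in range(0, size, win):
            hist = [0] * len(edges)
            for row in grid[top:top + win]:
                for count in row[left:left + win]:
                    # Index of the bin whose interval contains `count`.
                    hist[bisect.bisect_right(edges, count) - 1] += 1
            vec.extend(h / (win * win) for h in hist)
    return vec


def descriptor_vector(grid, scales=(4, 5, 6), edges=BIN_EDGES):
    """Append the descriptors for all scales, always in the same order."""
    return [v for m in scales for v in descriptor(grid, m, edges)]


def descriptor_distance(x, y, scales=(4, 5, 6), edges=BIN_EDGES):
    """Euclidean distance between the two descriptor vectors."""
    vx = descriptor_vector(x, scales, edges)
    vy = descriptor_vector(y, scales, edges)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vx, vy)))
```

For a \(512 \times 512\) FCGR (k=9) and the default scales, `descriptor_vector` returns a list of length 6720, in agreement with the count given above.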
In theory, the correlation coefficient \(\frac {\sigma _{\textit {xy}}}{\sigma _{x} \sigma _{y}}\) ranges in the interval [−1,1], and therefore the Pearson distance ranges in the interval [0,2].
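As a small illustration of the stated range, the sketch below computes a Pearson distance as one minus the correlation coefficient of the flattened FCGRs, alongside the Euclidean and Manhattan distances. These are the textbook forms of the three distances and are assumed here, since their formal definitions are not reproduced in this section; the function names are hypothetical.

```python
import math


def flatten(grid):
    """Turn a list-of-lists FCGR into a flat list of counts."""
    return [v for row in grid for v in row]


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(flatten(x), flatten(y))))


def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(flatten(x), flatten(y)))


def pearson_distance(x, y):
    """1 - correlation; undefined if either FCGR is constant."""
    u, v = flatten(x), flatten(y)
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)
```

Two perfectly anti-correlated matrices, such as `[[0, 1], [2, 3]]` and `[[3, 2], [1, 0]]`, reach the upper end of the interval, while identical matrices give the lower end.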
where x,y are the strings and \(X,Y \in \mathbb {N}^{2^{k} \times 2^{k}}\) their FCGRs, respectively. It also turns out that this distance is in fact the normalized Hamming distance of the unitized FCGRs X and Y. Note that, for two sets \(\mathcal {X}\) and \(\mathcal {Y}\), the normalized Hamming distance is \(\frac {|\mathcal {X} \triangle \mathcal {Y}|}{|\mathcal {X} \cup \mathcal {Y}|} = 2 - \frac {|\mathcal {X}| + |\mathcal {Y}|}{|\mathcal {X} \cup \mathcal {Y}|}\), where △ denotes the symmetric difference.
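The set formulation above translates directly into code. In this sketch, "unitizing" an FCGR is taken to mean recording only which k-mers occur at least once, which is an interpretation assumed for the example; the function names are hypothetical.

```python
def occupied_cells(grid):
    """Set of grid positions whose k-mer occurs at least once."""
    return {(r, c)
            for r, row in enumerate(grid)
            for c, count in enumerate(row) if count > 0}


def normalized_hamming(x, y):
    """|X symmetric-difference Y| / |X union Y| for two FCGRs."""
    a, b = occupied_cells(x), occupied_cells(y)
    union = a | b
    if not union:
        return 0.0  # two empty FCGRs are treated as identical
    return len(a ^ b) / len(union)
```

For example, `normalized_hamming([[1, 0], [3, 0]], [[0, 0], [2, 5]])` compares the sets {(0,0), (1,0)} and {(1,0), (1,1)}: the symmetric difference has 2 elements and the union has 3, giving 2/3, which equals 2 − (2+2)/3 as in the identity above.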
The online material [59] includes the code used, the distance matrices, and an Appendix (Appendix A with details about accessing the online resources, Appendix B with information about the dataset, and Appendix C with additional histograms for the first experiment). The code, written in Wolfram Mathematica version 9, was used (and can be tested) for the generation of CGR images, the calculation of distance matrices, and the creation of 2D and 3D Molecular Distance Maps. The interactive web tool MoDMap [68] allows in-depth exploration of the 2D MoD Maps (Molecular Distance Maps) in this paper. When using MoDMap, clicking on a distance underneath a dataset will result in plotting the MoD Map of the dataset computed with that distance. On any particular MoD Map, clicking on a point will display a window with information about the subsequence represented by that point: its NCBI accession number, the scientific name of the organism it originates from, and its CGR pattern. Clicking on the “From here” and “To here” buttons on two such selected windows will display the distance between the corresponding genomic subsequences in the distance matrix.
Results and discussion
For our dataset, we used k=9, that is, each DNA sequence was represented as a \(2^{9} \times 2^{9}\) FCGR matrix. In practice, this means that the FCGR of a DNA sequence contains the full information regarding its k-mer sequence composition, for k=1,2,…,9. The choice of the fragment length of 150 kbp and of the value k=9 is justified partly by the fact that, for a random sequence of length 150 kbp, the CGR at resolution \(2^{9} \times 2^{9}\) has around half of its pixels black and half white, and partly by the fact that this choice empirically produced good results while at the same time being computationally inexpensive.
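The "around half of the pixels" statement can be checked with a back-of-the-envelope estimate. The sketch below assumes a sequence of independent, uniformly distributed nucleotides and treats the overlapping k-mers as independent draws, which is an approximation introduced for this example only.

```python
def expected_occupied_fraction(seq_len, k):
    """Approximate fraction of the 4^k CGR cells hit at least once by a
    uniformly random sequence of length `seq_len`."""
    n_kmers = seq_len - k + 1
    cells = 4 ** k
    # Probability that a given cell is missed by every one of the k-mers.
    p_empty = (1.0 - 1.0 / cells) ** n_kmers
    return 1.0 - p_empty


print(expected_occupied_fraction(150_000, 9))
```

Under these assumptions the estimate for 150 kbp and k=9 is roughly 0.44, that is, somewhat fewer than half of the \(4^{9} = 262{,}144\) cells are expected to be occupied, which is of the same order as the "around half" quoted above.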
We note that MDS is not a clustering method, as the clusters are defined beforehand by the coloring scheme used (blue for H. sapiens, green for E. coli, and so on). MDS simply tries to display visually the interrelationships between the given items, based on the pairwise distances in the distance matrix which is its input. Note also that an increase in dimensionality from 2 to 3 can lead to a better cluster visualization. For example, if we compare the two-dimensional and the three-dimensional Molecular Distance Maps obtained using DSSIM, we see that points that appeared to be erroneously mixed with each other in the two-dimensional map, Fig. 2(a), (S. cerevisiae and P. falciparum sequences mixed in with A. thaliana sequences) are in fact clearly separated from each other in Fig. 3(a), the three-dimensional version of the Molecular Distance Map.
The first experiment: Mean and standard deviation of distances between clusters \(C_{i}\) and \(C_{j}\), for i,j=1,…,6, listed separately for each of the six distances
  1  2  3  4  5  6

DSSIM
1  0.81±0.04  0.99±0.01  0.92±0.02  0.91±0.03  0.92±0.03  0.91±0.02
2    0.85±0.01  0.97±0.01  0.99±0.01  0.99±0.01  0.99±0.
3      0.87±0.01  0.89±0.02  0.91±0.  0.91±0.01
4        0.87±0.03  0.9±0.02  0.91±0.01
5          0.74±0.01  0.94±0.
6            0.83±0.01

Descriptor
1  3.76±1.69  9.74±0.66  5.92±1.14  5.71±1.41  9.33±1.23  5.44±0.92
2    2.5±0.28  8.05±0.39  9.1±0.55  12.67±0.19  9.38±0.41
3      2.12±0.08  3.42±1.05  9.48±0.31  4.6±0.09
4        2.75±1.33  8.23±0.94  4.94±0.76
5          1.53±0.14  9.99±0.28
6            2.4±0.32

Euclidean
1  756±498  856±349  756±361  818±514  3914±510  812±356
2    558±5  674±17  802±366  4102±466  696±18
3      564±11  672±383  3964±472  633±20
4        723±535  3923±506  748±372
5          999±276  4085±468
6            585±24

Manhattan (in thousands)
1  171±15  222±5  189±13  188±17  213±20  191±9
2    175±2  209±4  219±8  252±4  218±3
3      171±2  177±10  206±2  184±2
4        172±16  200±11  188±9
5          105±3  224±2
6            167±3

Pearson
1  0.5±0.12  0.97±0.02  0.69±0.1  0.64±0.12  0.65±0.09  0.81±0.06
2    0.71±0.02  0.93±0.02  0.96±0.02  0.98±0.01  0.99±0.02
3      0.6±0.02  0.6±0.07  0.71±0.03  0.75±0.02
4        0.53±0.11  0.63±0.09  0.76±0.04
5          0.02±0.01  0.94±0.01
6            0.64±0.03

Approx. Information
1  0.65±0.03  0.78±0.01  0.7±0.03  0.7±0.03  0.76±0.04  0.69±0.02
2    0.67±0.  0.75±0.01  0.77±0.02  0.85±0.01  0.77±0.01
3      0.67±0.01  0.68±0.02  0.74±0.  0.69±0.
4        0.67±0.03  0.73±0.02  0.69±0.02
5          0.64±0.01  0.76±0.01
6            0.65±0.01
Quality measures for distances
In this section we present three quality measures, each of which evaluates the quality of the six distances considered. In the data mining literature a wide range of quality measures for a given clustering has been defined; see for example [69, 70]. Most of these measures are designed to assess the quality of different automated clustering methods while using the same distance. Our setup is different, as we use different distances while the clustering is fixed and given by the initial colour-coding of the sequence-representing points. Thus, we have to use other approaches to compare the distances we analyze. In particular, as the six distances have different ranges, we have to use assessment methods which are invariant to the scale of the distance.
The “ground truth” that we use as a basis for our distance assessment is the fact that the “ideal” clustering of DNA sequences and the points that represent them is known: sequences from the same organism should be close to one another and far from sequences originating from other organisms. (This assumption is justified, for this dataset, as the six organisms considered are very different from one another, belonging to different kingdoms of life.) Thus, an optimal distance should yield a relatively small value for two FCGRs which were generated from DNA sequences originating from the same organism, and relatively high values for two FCGRs generated from DNA sequences coming from different organisms.

The three quality measures we use are (i) the correlation to an idealized cluster distance, (ii) the silhouette cluster accuracy, and (iii) the relative overlap between the intragenomic and intergenomic distance histograms.
Let us stress that all three quality measures of the six distances are based on the distance matrices which we computed and not on their MDS plots. We will define the three quality measures such that their expected values range in the interval [0,1] where higher values correspond to better performance.
Let us first describe the three quality measures informally. An idealized distance is a distance that would be able to differentiate DNA sequences by species, that is, a distance δ for which δ(x,y)=0 if x and y are sequences from the same species and δ(x,y)=1 otherwise. The first quality measure, the correlation to an idealized cluster distance, measures how well a distance is linearly correlated to the idealized distance δ. The second quality measure, silhouette cluster accuracy, is the percentage of points that are best embedded in the cluster they belong to. The third quality measure quantifies the “visual overlap” between the intragenomic and intergenomic distance histograms. Given our dataset, it is reasonable to expect that a good distance gives a low value if applied to FCGRs of genomic sequences of the same organism, and a high value when applied to FCGRs of genomic sequences from two different organisms, thus separating the histograms of intragenomic distances from that of intergenomic distances. This is illustrated by the histograms in Fig. 4, where a high overlap between the graph of intragenomic distances (dark blue and turquoise) and the graphs of intergenomic distances (grey) is an indication of a poorly performing distance. In a theoretically optimal situation, there would exist a value c such that all distances that are smaller than c are intragenomic distances and all distances that are larger than c are intergenomic distances. This can usually not be expected from real data, but a low overlap between histograms is nevertheless indicative of a “good” distance.
In order to formally define the three quality measures, we consider a dataset \(V\) which is partitioned into \(p\) non-overlapping clusters \(C_{1},\ldots,C_{p}\) and for which a distance \(d_{\alpha}\colon V\times V \to \mathbb{R}_{\ge 0}\) exists. The cardinalities of the sets are \(|V|=m\) and \(|C_{i}|=m_{i}\) for \(i=1,\ldots,p\). In our analysis, \(p=6\) and \(C_{1}\) contains all FCGRs generated from genomic DNA sequences from H. sapiens, \(C_{2}\) contains all FCGRs generated from genomic sequences of E. coli, and so on, according to the order in Table 1. The distance \(d_{\alpha}\) is one of the six distances, \(\alpha\in\{\text{DSSIM}, \text{D}, \text{E}, \text{M}, \text{P}, \text{AID}\}\).
The correlation ranges in the interval [−1,1]: a value of 1 means that \(d_{\alpha}\) and \(\delta\) are perfectly positively linearly correlated, and a value of 0 means that they are linearly uncorrelated. In other words, if the correlation of a given distance to the idealized cluster distance is close to 1, the given distance behaves much like the idealized cluster distance and hence performs well. Note that negative values for this measure are not expected, as this would imply that \(d_{\alpha}\) and \(\delta\) were negatively correlated (\(d_{\alpha}\) would perform worse than a matrix containing random entries).
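As an illustration of the first quality measure, the following is a minimal sketch, assuming that the correlation is the Pearson coefficient taken over all unordered pairs of distinct points and that the distance matrix is symmetric. The function name `idealized_correlation` and its arguments are hypothetical and are not taken from the original analysis.

```python
from math import sqrt


def idealized_correlation(dist, labels):
    """Pearson correlation between a distance matrix and the idealized 0/1 distance.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- labels[i] is the cluster (species) of point i
    Assumption: only unordered pairs i < j are used; the diagonal is skipped.
    Both the distances and the idealized values must have nonzero variance.
    """
    xs, ys = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            xs.append(dist[i][j])
            # Idealized distance: 0 within a species, 1 across species.
            ys.append(0.0 if labels[i] == labels[j] else 1.0)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Because the Pearson coefficient is unchanged by positive rescaling of the distances, this sketch is invariant to the scale of the distance, which is the property required of the assessment methods here.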
Obviously, the silhouette cluster accuracy ranges in [0,1] with a high accuracy being desirable.
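The informal description of the silhouette cluster accuracy (the percentage of points that are best embedded in the cluster they belong to) can be sketched as follows. This is a minimal illustration under one assumption: a point counts as "best embedded" when its mean distance to the other members of its own cluster is strictly smaller than its mean distance to the members of every other cluster. The name `silhouette_accuracy` is hypothetical.

```python
def silhouette_accuracy(dist, labels):
    """Fraction of points whose own cluster is, on average, the nearest one.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- labels[i] is the cluster (species) of point i
    """
    clusters = sorted(set(labels))
    n = len(labels)
    well_placed = 0
    for i in range(n):
        mean_to = {}
        for c in clusters:
            # Members of cluster c, excluding the point itself.
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = mean_to.get(labels[i])
        others = [v for c, v in mean_to.items() if c != labels[i]]
        # A singleton cluster has no intra-cluster distance; such a point
        # is counted as not well placed in this sketch.
        if own is not None and others and own < min(others):
            well_placed += 1
    return well_placed / n
```

Only comparisons between mean distances enter the result, so the value lies in [0,1] and does not change when all distances are multiplied by a positive constant.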
The relative overlap \(\mathcal{O}_{\alpha}(j,i)\) of \(C_{j}\)–\(C_{j}\) with \(C_{i}\)–\(C_{j}\) is defined analogously; note that \(\mathcal{O}_{\alpha}(i,j) \neq \mathcal{O}_{\alpha}(j,i)\) in general. The overlap is normalized to the range [0,1] where 0 means no overlap of elements of bins between intra- and intergenomic distances, and 1 means that one of the histograms completely “covers” the other. Also note that we are not interested in the overlap of \(C_{i}\)–\(C_{i}\) with \(C_{j}\)–\(C_{j}\) as both sets of distances are intragenomic distances.
For example, in Fig. 4, for each of the considered distances, the dark blue histograms depict the \(C_{1}\)–\(C_{1}\) (H. sapiens – H. sapiens) intragenomic distances, the turquoise histograms the \(C_{4}\)–\(C_{4}\) (A. thaliana – A. thaliana) intragenomic distances, and the grey histograms the \(C_{1}\)–\(C_{4}\) (H. sapiens – A. thaliana) intergenomic distances. As seen from this figure, the descriptor distance appears to visually perform best at separating the two intragenomic distance histograms from the intergenomic histogram, while the Euclidean distance has the weakest performance. The relative overlap attempts to quantify this by computing the overlaps of each of the two pairs of histograms (dark blue with grey, and turquoise with grey). Note that a small visual overlap between histograms results in a high numerical value of the relative overlap, which is indicative of a better-performing distance.
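To make the idea of quantifying histogram overlap concrete, the sketch below computes a simple separation score for one pair of histograms (one intragenomic, one intergenomic). It is not the formula behind \(\mathcal{O}_{\alpha}\); it is a stand-in built on assumptions: both sets of distances are binned over their common range into a fixed number of equal-width bins, the shared content of each bin is the smaller of the two counts, and the score is one minus the shared content divided by the size of the smaller set. The name `histogram_separation` and the default of 20 bins are hypothetical.

```python
def histogram_separation(intra, inter, bins=20):
    """Score in [0, 1]: 1.0 for disjoint histograms, 0.0 if one covers the other.

    intra -- intragenomic distances (e.g. all C_i-C_i distances)
    inter -- intergenomic distances (e.g. all C_i-C_j distances)
    """
    lo = min(min(intra), min(inter))
    hi = max(max(intra), max(inter))
    if hi == lo:
        return 0.0  # all distances identical: the histograms coincide
    width = (hi - lo) / bins

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp so that the maximum value falls into the last bin.
            k = min(int((v - lo) / width), bins - 1)
            counts[k] += 1
        return counts

    h_intra = histogram(intra)
    h_inter = histogram(inter)
    shared = sum(min(a, b) for a, b in zip(h_intra, h_inter))
    return 1.0 - shared / min(len(intra), len(inter))
```

The score is oriented so that higher values indicate better separation, in line with the convention that higher values of a quality measure correspond to better performance. Swapping which intragenomic set is compared with the intergenomic set generally changes the result, which mirrors the asymmetry noted for \(\mathcal{O}_{\alpha}(i,j)\) and \(\mathcal{O}_{\alpha}(j,i)\).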
Distance comparison results
The first experiment: Summary of quality measures for the performances of six distances (DSSIM, descriptor, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 508 genomic DNA sequences spanning one complete chromosome for multichromosome organisms and the complete genome otherwise, of one organism from each kingdom of life
Distance  \(\mathcal{D}_{\alpha}\)  \(\mathcal{A}_{\alpha}\)  \(\mathcal{O}_{\alpha}\)  z-score sum  Rank
DSSIM  0.627  1.000  0.965  1.895  2nd
Descriptor  0.639  0.976  0.988  2.509  1st
Euclidean  0.231  0.325  0.907  −4.831  6th
Manhattan  0.527  1.000  0.951  0.840  3rd
Pearson  0.536  0.980  0.888  −0.875  5th
Approx. Inf.  0.527  1.000  0.937  0.462  4th
To compare each distance relative to all the other distances, we compute for each quality measure (each column) the standard scores (z-scores) of each distance \(d_{\alpha}\), where \(\alpha\in\{\text{DSSIM}, \text{D}, \text{E}, \text{M}, \text{P}, \text{AID}\}\), as \(z(d_{\alpha}) = \frac{d_{\alpha} - \mu}{\sigma}\). Here, by a slight abuse of notation, \(d_{\alpha}\) in the numerator stands for the value that the distance \(d_{\alpha}\) attains for that quality measure, \(\mu\) is the mean and \(\sigma\) is the standard deviation of that particular quality measure (column).
A positive standard score means that a distance performs above average (in this category), and a negative one that it performs below average. Finally, we compute, for each distance, the sum of its z-scores over the three quality measures, as seen in Table 4, second-to-last column. Note that the total of z-scores for a distance represents the performance of that distance relative to the other distances, and indicates its relative ranking.
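The z-score computation can be retraced from the rounded values in the first-experiment table. The sketch below is a minimal illustration; the names `scores` and `zscore_sums` are hypothetical, and the use of the sample standard deviation (denominator n − 1) is an assumption, chosen because it reproduces the tabulated z-score sums to within rounding.

```python
from statistics import mean, stdev

# Quality measures (D, A, O) copied from the first-experiment table.
scores = {
    "DSSIM":        (0.627, 1.000, 0.965),
    "Descriptor":   (0.639, 0.976, 0.988),
    "Euclidean":    (0.231, 0.325, 0.907),
    "Manhattan":    (0.527, 1.000, 0.951),
    "Pearson":      (0.536, 0.980, 0.888),
    "Approx. Inf.": (0.527, 1.000, 0.937),
}


def zscore_sums(scores):
    """Sum, for each distance, its z-scores over all quality measures."""
    names = list(scores)
    # Transpose rows (one per distance) into columns (one per measure).
    columns = list(zip(*(scores[name] for name in names)))
    totals = dict.fromkeys(names, 0.0)
    for column in columns:
        mu = mean(column)
        sigma = stdev(column)  # sample standard deviation (n - 1)
        for name, value in zip(names, column):
            totals[name] += (value - mu) / sigma
    return totals


totals = zscore_sums(scores)
# Rank distances from the highest to the lowest z-score sum.
ranking = sorted(totals, key=totals.get, reverse=True)
```

Since each column of z-scores sums to zero by construction, the totals over all six distances also sum to zero; the totals therefore express only relative, not absolute, performance.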
The second experiment: Summary of quality measures for the performances of six distances (DSSIM, descriptor, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 526 genomic DNA sequences sampled randomly (10 fragments per chromosome for multichromosome organisms, and all fragments of the genome otherwise) from the genomes of organisms from each kingdom of life
Distance  \(\mathcal{D}_{\alpha}\)  \(\mathcal{A}_{\alpha}\)  \(\mathcal{O}_{\alpha}\)  z-score sum  Rank
DSSIM  0.729  1.000  0.964  1.980  2nd
Descriptor  0.726  0.998  0.984  2.336  1st
Euclidean  0.438  0.608  0.861  −5.292  6th
Manhattan  0.662  1.000  0.955  1.172  3rd
Pearson  0.639  0.949  0.875  −0.954  5th
Approx. Inf.  0.637  1.000  0.946  0.759  4th
The conclusion of these analyses is that the best performing distances for this dataset are the descriptor distance and DSSIM. The Manhattan, Pearson, and approximate information distances perform well in some categories but not so well in other categories. For this dataset and value of k, the Euclidean distance had the weakest performance in all measured categories, which confirms the visual assessment of the MDS plots obtained by using the Euclidean distance, as seen in Figs. 2 and 3.
It is worth noting that the two distances which perform best (DSSIM and descriptor) treat FCGR matrices as two-dimensional maps in which the local arrangement of the cells (matrix entries) influences the computed distance, whereas the other distances treat the FCGR matrices as linear vectors. This suggests that the organization of the k-mer tallies (in this paper k=9) of a DNA sequence as an FCGR matrix, rather than a simple vector, reveals structural properties of the DNA sequence that could be utilized in order to identify and classify genomic DNA sequences.
Conclusions
In this study we test, at the kingdom level, the hypothesis that CGR-based genomic signatures of genomic DNA sequences are indeed species- and genome-specific. With this goal in mind we first analyzed over five hundred 150 kbp genomic DNA sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life. We then separately analyzed over five hundred 150 kbp genomic sequences randomly sampled from the complete genomes of all organisms considered.
Our quantitative comparison of six different distances suggests that several other distances outperform the Euclidean distance, which has been until now almost exclusively used in such studies. Our preliminary results show that two of these distances, DSSIM and the descriptor distance (introduced here), when applied to CGR-based genomic signatures, do have the ability to differentiate between DNA sequences coming from different species at this taxonomic level. This indicates that the k-mer sequence composition (where k=1,2,…,9) of genomic sequences contains taxonomic information which could potentially aid in the identification, comparison and classification of species based on molecular evidence. The two-dimensional and three-dimensional Molecular Distance Maps we obtain, which visualize the simultaneous intragenomic and intergenomic interrelationships among the sequences in our dataset, show this method’s potential.
Further analysis is needed to explore this method’s applicability to genomic species identification and classification at lower taxonomic levels. As a preview experiment, we applied it to 240 fragments randomly sampled from the entire genome of H. sapiens (10 fragments per chromosome), and 210 fragments randomly sampled from the entire genome of M. musculus (10 fragments per chromosome). See [59], Appendix B, for dataset details.
The preview experiment: Summary of quality measures for the performances of six distances (DSSIM, descriptor, Euclidean, Manhattan, Pearson, approximated information distance) on a dataset of 450 DNA sequences, sampled from the entire genome (10 fragments per chromosome) of H. sapiens and M. musculus
Distance  \(\mathcal{D}_{\alpha}\)  \(\mathcal{A}_{\alpha}\)  \(\mathcal{O}_{\alpha}\)  z-score sum  Rank
DSSIM  0.422  1.000  0.618  3.014  2nd
Descriptor  0.032  0.560  0.063  −3.347  6th
Euclidean  0.079  0.658  0.318  −1.558  4th
Manhattan  0.209  0.969  0.336  0.601  3rd
Pearson  0.531  0.993  0.647  3.643  1st
Approx. Inf.  0.101  0.578  0.195  −2.353  5th
Further large-scale computational experiments have to be carried out to confirm these preliminary results and establish their validity, as well as to establish the applicability of this method to genomic sequence identification and classification at lower taxonomic levels. Such experiments could provide additional insights regarding the choice of optimal distance for structural genomic sequence comparisons in different settings.
Declarations
Acknowledgements
We thank Yuri Boykov, Lena Gorelick and Olga Veksler for discussions on image descriptors, Stephen Solis for comments on earlier drafts of the manuscript, Genlou Sun for biology expertise, and the Reviewers for their comments and suggestions to improve the paper. We acknowledge the assistance of Nikesh Dattani with the NCBI interface.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Hebert PD, Cywinska A, Ball SL, et al. Biological identifications through DNA barcodes. Proc R Soc Lond Series B: Biol Sci. 2003; 270(1512):313–21.
2. Sirovich L, Stoeckle MY, Zhang Y. Structural analysis of biodiversity. PLoS One. 2010; 5(2):e9266.
3. Jeffrey H. Chaos game representation of gene structure. Nucleic Acids Res. 1990; 18(8):2163–170.
4. Deschavanne P, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999; 16(10):1391–9.
5. Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.
6. Jeffrey H. Chaos game visualization of sequences. Comput Graphics. 1992; 16(1):25–33.
7. Hill K, Schisler N, Singh S. Chaos game representation of coding regions of human globin genes and alcohol dehydrogenase genes of phylogenetically divergent species. J Mol Evol. 1992; 35(3):261–9.
8. Hill K, Singh S. Evolution of species-type specificity in the global DNA sequence organization of mitochondrial genomes. Genome. 1997; 40:342–56.
9. Deschavanne P, Giron A, Vilain J, Dufraigne C, Fertil B. Genomic signature is preserved in short DNA fragments. In: Proceedings of IEEE International Symposium on Bio-Informatics and Biomedical Engineering. New York, USA: IEEE: 2000. p. 161–7.
10. Edwards S, Fertil B, Giron A, Deschavanne P. A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst Biol. 2002; 51(4):599–613.
11. Wang Y, Hill K, Singh S, Kari L. The spectrum of genomic signatures: From dinucleotides to chaos game representation. Gene. 2005; 346:173–85.
12. Kari L, Hill KA, Sayem AS, Karamichalis R, Bryans N, Davis K, et al. Mapping the space of genomic signatures. PLoS One. 2015; 10(5):e0119815.
13. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process. 2004; 13(4):600–12.
14. Iversen GR, Gergen M, Gergen MM. Statistics: The Conceptual Approach. Berlin Heidelberg: Springer; 1997.
15. Krause EF. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Mineola, New York: Courier Dover Publications; 2012.
16. Li M, Chen X, Li X, Ma B, Vitányi P. The similarity metric. IEEE Trans Inf Theory. 2004; 50(12):3250–264.
17. Phillips GJ, Arnold J, Ivarie R. Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res. 1987; 15(6):2611–626.
18. Beutler E, Gelbart T, Han J, Koziol JA, Beutler B. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc Natl Acad Sci. 1989; 86(1):192–6.
19. Deschavanne P, Radman M. Counterselection of GATC sequences in enterobacteriophages by the components of the methyl-directed mismatch repair system. J Mol Evol. 1991; 33(2):125–32.
20. Bhagwat AS, McClelland M. DNA mismatch correction by Very Short Patch repair may have altered the abundance of oligonucleotides in the E. coli genome. Nucleic Acids Res. 1992; 20(7):1663–1668.
21. Burge C, Campbell AM, Karlin S. Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci. 1992; 89(4):1358–62.
22. Karlin S, Burge C, Campbell AM. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992; 20(6):1363–70.
23. Blaisdell BE, Rudd KE, Matin A, Karlin S. Significant dispersed recurrent DNA sequences in the Escherichia coli genome: several new groups. J Mol Biol. 1993; 229(4):833–48.
24. Gelfand MS, Koonin EV. Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Res. 1997; 25(12):2430–439.
25. Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997; 179(12):3899–913.
26. Vinga S, Almeida J. Alignment-free sequence comparison–a review. Bioinformatics. 2003; 19(4):513–23.
27. Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. 2014; 15(6):890–905.
28. Almeida JS. Sequence analysis by iterated maps, a review. Brief Bioinform. 2014; 15(3):369–75.
29. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci. 1986; 83(14):5155–159.
30. Sitnikova T, Zharkikh A. Statistical analysis of L-tuple frequencies in eubacteria and organelles. Biosystems. 1993; 30(1):113–35.
31. Wu TJ, Burke JP, Davison DB. A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics. 1997; 53(4):1431–9.
32. Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics. 2001; 57(2):441–8.
33. Stuart GW, Moffett K, Baker S. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics. 2002; 18(1):100–8.
34. Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J Mol Evol. 2004; 58(1):1–11.
35. Pham TD, Zuegg J. A probabilistic measure for alignment-free sequence comparison. Bioinformatics. 2004; 20(18):3455–461.
36. Pham TD. Spectral distortion measures for biological sequence comparisons and database searching. Pattern Recog. 2007; 40(2):516–29.
37. Kantorovitz MR, Robinson GE, Sinha S. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007; 23(13):249–55.
38. Van Helden J. Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics. 2004; 20(3):399–406.
39. Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison. Bioinformatics. 2008; 24(20):2296–302.
40. Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M. Analysis of genomic sequences by Chaos Game Representation. Bioinformatics. 2001; 17(5):429–37.
41. Almeida JS, Vinga S. Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics. 2002; 3(1):6.
42. Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol. 2006; 1:18.
43. Almeida JS, Vinga S. Biological sequences as pictures–a generic two dimensional solution for iterated maps. BMC Bioinformatics. 2009; 10(1):100.
44. Feng J, Hu Y, Wan P, Zhang A, Zhao W. New method for comparing DNA primary sequences based on a discrimination measure. J Theor Biol. 2010; 266(4):703–7.
45. Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol. 2012; 62(2):756–63.
46. Pandit A, Vadlamudi J, Sinha S. Analysis of dinucleotide signatures in HIV-1 subtype B genomes. J Genet. 2013; 92(3):403–12.
47. Pride D, Meinersmann R, Wassenaar T, Blaser M. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 2003; 13(2):145–58.
48. Sandberg R, Bränden CI, Ernberg I, Cöster J. Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene. 2003; 311:35–42.
49. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004; 5(1):163.
50. Chapus C, Dufraigne C, Edwards S, Giron A, Fertil B, Deschavanne P. Exploration of phylogenetic data using a global sequence analysis method. BMC Evol Biol. 2005; 5(1):63.
51. Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P. Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005; 33(1):6.
52. Joseph J, Sasikumar R. Chaos game representation for comparison of whole genomes. BMC Bioinformatics. 2006; 7(1):243.
53. Tanchotsrinon W, Lursinsap C, Poovorawan Y. A high performance prediction of HPV genotypes by chaos game representation and singular value decomposition. BMC Bioinformatics. 2015; 16(1):71.
54. Karlin S, Ladunga I. Comparisons of eukaryotic genomic sequences. Proc Natl Acad Sci. 1994; 91(26):12832–6.
55. Shedlock AM, Botka CW, Zhao S, Shetty J, Zhang T, Liu JS, et al. Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci. 2007; 104(8):2767–772.
56. Deschavanne P, DuBow M, Regeard C. The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination. Virol J. 2010; 7(1):163.
57. Pandit A, Sinha S. Using genomic signatures for HIV-1 subtyping. BMC Bioinformatics. 2010; 11(Suppl 1):26.
58. Yu ZG, Zhan XW, Han GS, Wang RW, Anh V, Chu KH. Proper distance metrics for phylogenetic analysis using complete genomes without sequence alignment. Int J Mol Sci. 2010; 11(3):1141–54.
59. Online Material. https://github.com/rallis/intraSupplemental_Material.
60. Burma PK, Raj A, Deb JK, Brahmachari SK. Genome analysis: a new approach for visualization of sequence organization in genomes. J Biosci. 1992; 17(4):395–411.
61. Dutta C, Das J. Mathematical characterization of chaos game representation: New algorithms for nucleotide sequence analysis. J Mol Biol. 1992; 228(3):715–9.
62. Goldman N. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res. 1993; 21(10):2487–491.
63. Oliver J, Bernaola-Galván P, Guerrero-García J, Román-Roldán R. Entropic profiles of DNA sequences through chaos-game-derived images. J Theor Biol. 1993; 160(4):457–70.
64. Deza MM, Deza E. Encyclopedia of Distances. Berlin Heidelberg: Springer; 2009.
65. Kruskal J. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964; 29(1):1–27.
66. Kari L, Sayem AS, Dattani N, Hill K. Map of life: Measuring and visualizing species’ relatedness with genome distance maps. University of Western Ontario Technical Report 756, 978–0771430220 April 2013.
67. Lazebnik S, Schmid C, Ponce J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference On, vol. 2. New York, USA: IEEE: 2006. p. 2169–178.
68. Karamichalis R. Molecular Distance Map Interactive Webtool. 2014. https://github.com/rallis/intraMoDMap.
69. Pang-Ning T, Steinbach M, Kumar V, et al. Introduction to data mining. Pearson; 2006.
70. Zhao Y, Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach Learn. 2004; 55(3):311–31.
71. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.