Additive methods for genomic signatures
 Rallis Karamichalis^{2},
 Lila Kari^{1, 2},
 Stavros Konstantinidis^{3},
 Steffen Kopecki^{2, 3} and
 Stephen Solis-Reyes^{2}
https://doi.org/10.1186/s12859-016-1157-8
© The Author(s) 2016
Received: 13 May 2016
Accepted: 19 July 2016
Published: 22 August 2016
Abstract
Background
Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (that is, to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date.
Results
We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 base pairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information.
Conclusions
Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.
Background
Motivated by the general need to identify and classify species based on molecular evidence, alignment-free genome comparisons have been proposed, based on comparing Chaos Game Representations (CGR) of genomic DNA sequences. The CGR of a DNA sequence, proposed by Jeffrey [1, 2], is a graphical representation of a DNA sequence, where the patterns in the image correspond to the frequencies of k-mers in the sequence. Deschavanne et al. [3, 4] were the first to suggest that CGR is a good candidate for the role of “genomic signature”, defined by Karlin and Burge [5] as any specific quantitative characteristic of a sequence that is pervasive along the genome, while being dissimilar for sequences originating from organisms of different species.
CGR is one of a variety of alignment-free methods (see [6–11] for detailed literature reviews) that have been proposed for sequence and genome comparisons, as a computationally efficient approach that performs well even with DNA sequences that have little or nothing in common. (We use the following notational conventions for genomic DNA: nDNA (nuclear/nucleoid DNA), mtDNA (mitochondrial DNA), cpDNA (chloroplast DNA), and pDNA (plasmid DNA).)
Initially, CGR images were only qualitatively analyzed [12–14], and Dutta et al. and Goldman both advanced the suggestion that CGR images represent no more information than second-order Markov chains [15, 16], which was later disproven by Almeida et al. [17, 18] and others [19, 20]. CGR has been applied extensively to phylogenetics together with the Euclidean distance, for instance on nDNA fragments from various domains [3], 27 genomes from various genera [4], 125 nDNA fragments from several bird genomes [21], 26 mtDNA sequences (also with the Pearson distance and a custom image distance) [19], 4 bacteria and about 200 phages [22], 75 HIV-1 genomes [23], and 10 mtDNA sequences and 14 nDNA sequences from plants in the Brassicales order [24]. Other distances have also been used, for instance the DSSIM image distance on a set of 3,176 mtDNA sequences [20], and six different distances on 174 million base pairs of sampled nDNA fragments from organisms of all major kingdoms of life [25]. The performance of several distance functions has also been compared and benchmarked on their accuracy in constructing phylogenetic trees [26–32]. Initially, CGR was used only for strings over a 4-letter alphabet (like DNA), but generalizations have been proposed to peptide sequences [33–38], and Almeida and Vinga proposed a derivative of CGR called the Universal Sequence Map (USM), which is suitable for alphabets of any size [39, 40]. CGRs have also been subjected to multifractal analysis (which measures the degree of self-similarity within the image), see, e.g., [35, 41–46]. Lastly, CGR has been used to estimate sequence entropy [47–49], to speed up local-alignment algorithms [50], and, together with neural networks, to classify HPV genomes by genotype [51].
Several CGR studies [13, 20, 52] observed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be completely different. While the hypothesis that CGRs of mitochondrial DNA sequences can play the role of genomic signatures was tested and validated on the set of all 3,176 sequenced mitochondrial genomes (totalling 91.3 megabase pairs) available in the NCBI GenBank sequence database in July 2012 [20], to our knowledge no such extensive analysis of CGRs of nuclear/nucleoid genomic sequences exists to date.

We present an extensive analysis of the hypothesis that conventionally computed (called herein “conventional”) nDNA signatures can play the role of genomic signatures at multiple taxonomic levels, from kingdom to species. Our dataset totals 1.45 gigabase pairs of nDNA sequences from 42 different genomes, from all major kingdoms of life.

Our analysis indicates that conventional nDNA signatures of two different origins cannot always be differentiated, especially if they originate from closely related organisms. To address this issue, we propose taking into account information obtained from organellar DNA, in addition to nDNA. More generally, we propose the concept of an additive DNA signature of a set (collection) of DNA sequences, and define two particular instances: composite DNA signatures and assembled DNA signatures.

We explore composite DNA signatures, which combine conventional nDNA signatures with organellar DNA signatures (mtDNA, cpDNA, or pDNA) of the same organism. We demonstrate that, in this dataset, the composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of conventional nDNA signatures failed. In particular, composite DNA signatures from genomes of species as closely related as H. sapiens and P. troglodytes, or E. coli and E. fergusonii, can be successfully separated.

We explore assembled DNA signatures, which combine information from many short contigs (e.g., 100 bp) of a DNA fragment to produce a recognizable signature. This is in contrast to conventional DNA signatures, wherein one single long (thousands to hundreds of thousands of base pairs) DNA sequence is needed to generate a recognizable signature.
The enhanced discriminating power of composite DNA signatures, and the ability of assembled DNA signatures to operate with scattered and reduced sequence data, open the possibility of practical applications including aiding species identification or classification, and comparisons of DNA fragments of various origins such as genomes of extinct organisms, synthetic genomes, raw unassembled next-generation sequencing (NGS) read data, or even computer-generated DNA sequences.
Results
The first objective of this study was to test, on a comprehensive dataset, the hypothesis that conventional nDNA signatures can be used to differentiate between nuclear DNA sequences originating from different organisms, spanning all major kingdoms of life, at multiple taxonomic levels.
To this end, the following computational experiment was performed, for each of the major kingdoms of life, at various taxonomic levels. We chose a pivot organism (e.g., H. sapiens for Kingdom Animalia) and proceeded to use conventional nDNA signatures to compare fragments of its nuclear/nucleoid genome with fragments of the nuclear/nucleoid genome of one other organism from the same kingdom. The process was then repeated with the second organism being at increasing degrees of relatedness to the pivot organism.

Step 1. Randomly sample 150 kbp nDNA fragments from every chromosome (20 per chromosome, or all fragments if fewer) of the two genomes involved in the comparison. For each such nDNA fragment, construct its corresponding conventional nDNA signature using the process described in Section “Methods”.

Step 2. Compute pairwise distances for all pairs of conventional nDNA signatures generated in Step 1. The distance used initially was an approximated information distance (AID), formally defined in Section “Methods” (see also [25, 53]), since it is computationally simple and uses the least amount of sequence information. If separation was not achieved using AID, five other distance measures were used: Structural Dissimilarity Index (DSSIM) [54], Euclidean distance, Pearson correlation distance [55], Manhattan distance [56], and descriptor distance [25].

Step 3. Use the distance matrix obtained in Step 2 as input to a Multi-Dimensional Scaling (MDS) algorithm to produce a 3D Molecular Distance Map [25]: Each point in the map corresponds to (the conventional nDNA signature of) an nDNA fragment from Step 1, and the geometric distance between every two points corresponds to the distance between the respective conventional nDNA signatures in the distance matrix. Assess, for each Molecular Distance Map, whether or not separation between conventional nDNA signatures of DNA fragments from the pivot organism and those from the other organism was achieved, by using either k-means clustering [57] or by verifying the existence of a separating plane.
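The separating-plane check in the last step above can be illustrated with a minimal sketch. The source does not say how the existence of a plane was verified, so the perceptron search below is an assumption chosen for brevity, and the function name `find_separating_plane` and the `max_epochs` parameter are hypothetical. The points are assumed to be the 3D coordinates already produced by MDS.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def find_separating_plane(
    group_a: Sequence[Point],
    group_b: Sequence[Point],
    max_epochs: int = 1000,
) -> Optional[Tuple[List[float], float]]:
    """Search for a plane w.x + b = 0 with group_a strictly on the positive
    side and group_b strictly on the negative side.

    Uses the classic perceptron update (an assumption, not the paper's
    stated procedure). Returns (w, b) if a plane is found within
    max_epochs passes; None means only that none was found in that budget.
    """
    samples = [(p, 1.0) for p in group_a] + [(p, -1.0) for p in group_b]
    w = [0.0, 0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, point)) + b
            if label * score <= 0.0:
                # Misclassified (or on the plane): nudge the plane toward the point.
                w = [wi + label * xi for wi, xi in zip(w, point)]
                b += label
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# Toy usage: two well-separated clouds of map points.
pivot = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
other = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]
plane = find_separating_plane(pivot, other)
```

A returned plane is a certificate of separation. A `None` result is weaker: the perceptron is guaranteed to stop only when the two groups are linearly separable, so exhausting the epoch budget suggests, but does not prove, that no plane exists.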
For Kingdom Fungi, the pivot organism is the model organism Saccharomyces cerevisiae (16 chromosomes, 73 fragments), a species of yeast instrumental to winemaking, baking, and brewing. Separation of its conventional nDNA signatures was achieved down to and including separation from C. dubliniensis (same family, different genus). In the case of the comparison with K. pastoris, marked with Y* in Fig. 2, the accuracy score was lower than 85 %: This is an artifact of the shape of the 3D Molecular Distance Map wherein one of the clusters has a trailing set of points that become erroneously separated by k-means from all the rest of the points. Because of this, and since the use of k-means on the 2D Molecular Distance Map of the same dataset resulted in an accuracy score of 100 %, we interpreted this comparison as resulting in separation. The results of the comparison between the conventional nDNA signatures of the pivot organism and those of S. arboricola (same genus, different species) were inconclusive: The use of Euclidean and Pearson distances resulted in separation (both with accuracy of 88.48 %), while the use of the other four distances (DSSIM, Manhattan, descriptor, approximated information distance) did not result in separation.
For Kingdom Plantae, the pivot organism is the model organism Brassica napus (19 chromosomes, 380 DNA fragments), rapeseed, a flowering member of the family Brassicaceae (mustard or cabbage family). Separation of its conventional nDNA signatures was achieved down to and including separation from C. papaya (papaya, same order, different family). For the comparisons with A. thaliana (thale cress, same family, different tribe) and R. sativus (radish, same tribe, different genus), cluster separation was visually observed but not quantitatively confirmed by either k-means or plane separation. The comparison with B. oleracea (wild cabbage, same genus, different species) did not result in separation with any of the six distances.
For Kingdom Protista, the pivot organism is the model organism Plasmodium falciparum, a protozoan parasite (14 chromosomes, 149 DNA fragments), one of the species of Plasmodium that cause malaria in humans. Separation of its conventional nDNA signatures from those of other organisms from the same kingdom was achieved at all taxonomic levels, down to and including separation from P. vivax (same genus, different species).
For Kingdom Bacteria, the pivot organism is the model organism Escherichia coli (20 genomic DNA fragments), a bacterium commonly found in the lower intestine of warm-blooded organisms. Separation of its conventional nDNA signatures from those of other bacteria was successful down to and including separation from S. enterica (same family, different genus), but failed with all six distances in the comparison with E. fergusonii (same genus, different species).
For Kingdom Archaea, the pivot organism is the model organism Pyrococcus furiosus (12 genomic DNA fragments), an extremophilic species of Archaea. Separation of its conventional nDNA signatures from those of other archaea was successful at all levels, down to and including separation from P. yayanosii (same genus, different species).
The above results indicate that, especially in Kingdom Animalia, conventional nDNA signatures cannot always be used to differentiate nuclear/nucleoid genomic sequences originating from two different genomes. This suggests that conventional nDNA signatures cannot always play the role of a “genomic signature”, particularly when the genomes being compared belong to closely related species.
Composite DNA signatures
To enhance the discriminating power of conventional nDNA signatures, our second objective was to introduce and explore the concept of composite DNA signatures, which combine conventional nuclear/nucleoid DNA signatures with signatures of organellar genomes (mtDNA, cpDNA, or pDNA).
To test the discriminating power of composite DNA signatures, we repeated all previous pairwise comparisons (where sequenced organellar DNA was available), using this time composite DNA signatures. The results are presented in the last two columns of Fig. 2.
For Kingdom Bacteria, the use of composite DNA signatures combining nDNA and pDNA (when available) resulted in separation in all cases.
Overall, the use of composite DNA signatures resulted in separation in all pairwise comparisons in Fig. 2 (where organellar DNA sequencing data was available), including in those where the use of conventional nDNA signature failed or resulted in inconclusive separations.
Assembled DNA signatures
As the third objective of this study, we explored a way to enhance the practical applicability of conventional DNA signatures. Recall that, to produce a recognizable visual pattern that can be reliably used to represent a genome, a conventional DNA signature needs as input a long contiguous (two to several hundred kilobase pairs) DNA fragment. This assumes a high quality and reliability of sequencing and assembly, which are not always available. We propose instead to approximate a conventional signature by an assembled DNA signature, which combines the conventional DNA signatures of many short contigs (e.g., 100 bp) of the given fragment. Note that these contigs need not cover the entire DNA fragment.
In what follows, we denote by \(|s|\) the length of the sequence s. Given a DNA fragment s, an assembled DNA signature of s, using r equi-length contigs of length n (subfragments of the sequence s), is defined as the sum of the conventional DNA signatures of all of the r contigs. A particular case of assembled DNA signature is where the fragment s is partitioned into equi-length, consecutive, non-overlapping contigs, that is, \(s = s_{1} s_{2} \ldots s_{r} s_{r+1}\), and \(|s_{i}| = n\) for \(1 \le i \le r\), with \(|s_{r+1}| < n\). In this case, we call the assembled signature a fully-assembled DNA signature of the sequence s, using equi-length contigs of length n.
Table 1. (A) through (C): Distances between the conventional nDNA signature of a fragment and its assembled DNA signatures, for various numbers r of contigs of the same length n: (A) distances to fully-assembled DNA signatures; (A′) theoretical upper bounds for (A); (B) distances to assembled DNA signatures; (C) same as (B), when tripling the number of contigs

| n     | r    | (A)    | (A′)   | (B)  | r    | (C)   | r    | (B′) | r    | (C′)  |
| 100   | 1500 | 0.05   | 0.13   | 0.29 | 4500 | 0.042 | 1475 | 0.32 | 4434 | 0.041 |
| 150   | 1000 | 0.03   | 0.09   | 0.29 | 3000 | 0.034 | 1000 | 0.29 | 2999 | 0.040 |
| 200   | 750  | 0.02   | 0.07   | 0.28 | 2250 | 0.033 | 750  | 0.29 | 2250 | 0.038 |
| 300   | 500  | 0.02   | 0.04   | 0.28 | 1500 | 0.030 | 500  | 0.28 | 1500 | 0.038 |
| 500   | 300  | 0.01   | 0.03   | 0.26 | 900  | 0.037 | 300  | 0.28 | 900  | 0.033 |
| 1000  | 150  | 0.005  | 0.01   | 0.30 | 450  | 0.030 | 150  | 0.25 | 450  | 0.039 |
| 2000  | 75   | 0.003  | 0.007  | 0.30 | 225  | 0.041 | 75   | 0.26 | 225  | 0.023 |
| 3000  | 50   | 0.002  | 0.004  | 0.25 | 150  | 0.044 | 50   | 0.29 | 150  | 0.021 |
| 10000 | 15   | 0.0004 | 0.001  | 0.30 | 45   | 0.053 | 15   | 0.25 | 45   | 0.045 |
| 15000 | 10   | 0.0003 | 0.0008 | 0.24 | 30   | 0.12  | 10   | 0.23 | 30   | 0.079 |
| 30000 | 5    | 0.0001 | 0.0004 | 0.36 | 15   | 0.13  | 5    | 0.41 | 15   | 0.058 |
Also as expected, for the same values of n and r, the distance between an assembled DNA signature and the conventional nDNA signature of the same fragment (Table 1, Column (B)) is higher than the one between a fully-assembled DNA signature and the conventional nDNA signature of the same fragment (Table 1, Column (A)). This indicates that the assembled DNA signature approximates a conventional nDNA signature less accurately than the fully-assembled DNA signature does. The reason is that, given a fixed number r of contigs, in the case of an assembled DNA signature the contigs are allowed to overlap and need not cover the entire fragment. This can be compensated for by increasing the coverage, that is, the number r of contigs. Table 1, Column (C) shows that tripling the number of contigs results in significantly smaller distances between assembled DNA signatures and the conventional DNA signature of the fragment that they were meant to approximate.
The results in Table 1 suggest that assembled DNA signatures have the potential to play the role of “genomic signatures”, and be used directly on raw unassembled next-generation sequencing read data, or in cases where other methods are not directly applicable because high-quality sequencing data is not available. To test this hypothesis, we considered the organism pairs in Fig. 2 for which separation was obtained using conventional nDNA signatures, and attempted to reproduce these successful separations using assembled DNA signatures instead. In addition, we empirically sought to find, in each case, the coverage (amount of sequence data) needed to achieve separation, as a percentage of total fragment length.
To determine the threshold interval where separation between assembled DNA signatures of a given pair of organisms was achieved, when contigs of length n=300 were used, the following process was employed. For various values of t, 0≤t≤1 (representing the fragment coverage, e.g., t=0.5 means that 50 % of the fragment data was used), we attempted to see if separation of assembled DNA signatures from the two organisms was achieved, in the following way.
For each of the 150 kbp fragments s from the two genomes, q random positive integers were picked from the interval 1 to \(|s| - n + 1 = 150{,}000 - 300 + 1\), where \(q = \lfloor t \cdot |s| / n \rfloor\), that is, the integer part of \(t \cdot |s| / n\). These q numbers represent the start positions of the q chosen contigs. For each contig start position, a contig of length n=300 was read and used for the assembled DNA signature of the fragment s.
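The contig-sampling step just described can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function name `sample_contigs` is hypothetical, and start positions are drawn independently (with replacement), which matches the statement in Section “Methods” that sampled contigs may overlap or be identical.

```python
import math
import random
from typing import List


def sample_contigs(fragment: str, n: int, t: float, rng: random.Random) -> List[str]:
    """Pick q = floor(t * |s| / n) contigs of length n from the fragment s.

    Start positions are 1-based integers in [1, |s| - n + 1], drawn
    independently, so contigs may overlap or coincide (assumption).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("coverage t must lie in [0, 1]")
    if n <= 0 or n > len(fragment):
        raise ValueError("contig length n must satisfy 0 < n <= |s|")
    q = math.floor(t * len(fragment) / n)
    last_start = len(fragment) - n + 1  # largest valid 1-based start position
    starts = [rng.randint(1, last_start) for _ in range(q)]
    # Convert each 1-based start to a 0-based slice of length n.
    return [fragment[start - 1 : start - 1 + n] for start in starts]


# Toy usage: a synthetic 150 kbp fragment, contig length 300, 50 % coverage.
fragment = "ACGT" * 37_500
contigs = sample_contigs(fragment, n=300, t=0.5, rng=random.Random(0))
```

With |s| = 150,000, n = 300 and t = 0.5, this draws q = 250 contigs, that is, 75,000 bases read in total, some of which may be read more than once.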
For each value of t, the corresponding 3D Molecular Distance Map of the assembled DNA signatures of the two organisms was then analyzed, by verifying the existence (or absence) of a separating plane.
Table 2. Assembled nDNA signatures: sequence coverage (amount of DNA fragment information) needed for separation of the assembled nDNA signatures of the pivot organism from assembled nDNA signatures of the comparison organism, for all major kingdoms of life. Separations were confirmed by verifying the existence of separating planes

Animalia
| H. sapiens vs.    | Different taxon            | Thresh.  |
| D. melanogaster   | Phylum: Arthropoda         | 1–5 %    |
| G. gallus         | Class: Aves                | 3–10 %   |
| M. musculus       | Order: Rodentia            | 10–20 %  |
| M. murinus        | Suborder: Strepsirrhini    | 60–80 %  |
| T. syrichta       | Infraorder: Tarsiiformes   | 20–40 %  |

Fungi
| S. cerevisiae vs. | Different taxon            | Thresh.  |
| C. gattii         | Phylum: Basidiomycota      | 0.5–2 %  |
| F. oxysporum      | Class: Sordariomycetes     | 0.5–2 %  |
| K. pastoris       | Family: Phaffomycetaceae   | 2–10 %   |
| C. dubliniensis   | Genus: Candida             | 2–10 %   |

Plantae
| B. napus vs.      | Different taxon            | Thresh.  |
| M. pusilla        | Phylum: Chlorophyta        | 2–3 %    |
| P. patens         | Unranked: Bryophyta        | 3–4 %    |
| M. domestica      | Unranked: Fabids           | 4–5 %    |
| C. papaya         | Family: Caricaceae         | 4–5 %    |

Protista
| P. falciparum vs. | Different taxon            | Thresh.  |
| O. trifallax      | Phylum: Ciliophora         | 0.5–2 %  |
| T. gondii         | Class: Conoidasida         | 0.5–2 %  |
| T. orientalis     | Order: Piroplasmida        | 0.5–2 %  |
| P. vivax          | Species: P. vivax          | 0.5–2 %  |

Bacteria
| E. coli vs.       | Different taxon            | Thresh.  |
| S. aureus         | Phylum: Firmicutes         | 0.5–2 %  |
| H. pylori         | Class: Epsilonproteobact.  | 0.5–2 %  |
| A. baumannii      | Order: Pseudomonadales     | 0.5–2 %  |
| S. enterica       | Genus: Salmonella          | 10–20 %  |

Archaea
| P. furiosus vs.   | Different taxon            | Thresh.  |
| S. islandicus     | Phylum: Crenarchaeota      | 0.5–2 %  |
| M. smithii        | Class: Methanobacteria     | 0.5–2 %  |
| Thermococcus      | Genus: Thermococcus        | 0.5–2 %  |
| P. yayanosii      | Species: P. yayanosii      | 0.5–2 %  |
The actual threshold values lie in the intervals listed, and may be subject to the quality of the sequencing. As expected, in general, the thresholds needed for separation increase with the increase in the degree of relatedness of the organisms being compared. This suggests that nDNA sequences from closely related organisms require a higher coverage (that is, a higher amount of information from each sequence) to be separated. The only exception to this trend, in this dataset, was the pair H. sapiens and M. murinus (gray mouse lemur), requiring 60–80 % sequence coverage, compared with the pair H. sapiens and T. syrichta (Philippine tarsier), requiring 20–40 % sequence coverage. Thus, the (human, lemur) pair required higher sequence coverage to achieve separation than the (human, tarsier) pair, even though the gray mouse lemur belongs to a different primate suborder (Strepsirrhini) than the modern human, while the tarsier belongs to the same primate suborder as the modern human (Haplorrhini), and thus one would expect that more information would be needed to achieve the latter separation. This apparent anomaly may be partly related to the fact that the phylogenetic placement of tarsiers within the order Primates has been controversial for over a century [59]: In [60] tarsiers are placed within Haplorrhini, while according to [20, 61], mitochondrial DNA evidence places Tarsiiformes as a sister group to Strepsirrhini.
Table 2 indicates that the amount of DNA fragment information needed to achieve separation, at the same taxonomic level, can differ from one kingdom to another. For example, in Kingdom Animalia, conventional nDNA signatures of two organisms of different genera (H. sapiens and P. troglodytes) could not be separated even when using 100 % of the DNA fragment information. In contrast, in Kingdom Fungi, assembled nDNA signatures from two organisms of different genera (S. cerevisiae and C. dubliniensis) could be separated even when using only 10 % of DNA fragment data. Similarly, in Kingdom Bacteria, assembled nDNA signatures from two organisms of different genera (E. coli and S. enterica) could be separated even when using only 20 % of DNA fragment data. The situation is even more extreme in Kingdom Protista and Kingdom Archaea, where even organisms belonging to the same genus could be separated with very little sequence coverage. Indeed, in Kingdom Protista, assembled nDNA signatures of two organisms of the same genus (P. falciparum and P. vivax) could be separated using only 2 % of DNA fragment data. Similarly, in Kingdom Archaea, assembled nDNA signatures from two organisms of the same genus (P. furiosus and P. yayanosii) could also be separated using only 2 % of DNA fragment data. This suggests that some taxonomic categories, such as “genus”, do not necessarily reflect the same degree of structural similarity of genomic sequences uniformly across kingdoms.
Composite-assembled DNA signatures
We now briefly explore the potential of combining the approach of composite DNA signatures with that of assembled DNA signatures. A composite-assembled DNA signature is produced by combining information from the assembled DNA signatures of two (or more) different types of DNA fragments. For example, a composite-assembled signature using nDNA and mtDNA is obtained by combining the assembled nDNA signature of one 150 kbp nDNA fragment with the assembled mtDNA signature of the mtDNA genome of the same organism.
Conclusions
The first objective of this paper was to conduct a comprehensive analysis, on a dataset totalling 1.45 Gb, of the hypothesis that Chaos Game Representations of nuclear/nucleoid genomic sequences can play the role of “genomic signatures”, that is, that they are genome- and species-specific. Our results suggest that this hypothesis is not always valid, in that nuclear/nucleoid DNA sequences belonging to closely related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii cannot always be separated using conventionally computed CGR signatures.
To address this issue, as a second objective, we propose the use of composite DNA signatures, which combine information from the nuclear/nucleoid genome with that from one or more organellar genomes (mtDNA, cpDNA and/or pDNA). Composite DNA signatures were found, in this study, to result in successful separation of DNA sequences by organism in all cases, including those where conventional nDNA signatures failed.
As a third objective, we propose the use of assembled DNA signatures, which combine information from short contigs (subfragments) of a DNA fragment, rather than using the entire contiguous fragment, to produce its signature. We show that assembled DNA signatures can be successful replacements of conventional DNA signatures, and also that the composite and assembled DNA signature approaches can be used simultaneously.
Mathematically, composite and assembled DNA signatures are both particular cases of a general concept, namely that of an additive DNA signature of a set of DNA sequences (see Section “Methods”). Our results indicate that such additive DNA signatures could be considered as potential candidates for the role of “genomic signatures” at various taxonomic levels, from distant to closely related species, and thus complement other methods for species identification and classification.
Several directions of future research stem from the fact that existing literature indicates that the oligomer composition of nuclear/nucleoid DNA sequences and mitochondrial DNA sequences can be a source of taxonomic information. Such directions include testing the discriminating power of additive DNA signatures in large-scale multi-genome comparisons, and exploring their utility in practical applications such as DNA sequence identification and classification (including directly on raw unassembled NGS read data or when high-quality sequencing data is not available), metagenomics, and synthetic genomes.
Methods
Dataset
The dataset, totalling 1.45 Gb, comprised whole nuclear/nucleoid genomes and organellar genomes of 42 organisms, spanning all major kingdoms of life (see Additional file 1 for the scientific name, NCBI accession number, chromosome number, and number of fragments sampled). In our analysis, for each complete genomic sequence, all letters other than A, C, G, T were ignored, and the resulting DNA sequence was divided into successive, non-overlapping, contiguous fragments, each 150 kbp long (when the last portion was shorter than 150 kbp, it was not included in the analysis). The choice of fragment length, 150 kbp, was due to our choice of CGR image resolution (namely \(2^{9} \times 2^{9}\), that is, k=9), empirical testing, and computational efficiency reasons, see [25].
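The preprocessing described above can be sketched in a few lines. The function name `split_into_fragments` is hypothetical, and converting lowercase letters to uppercase before filtering is an assumption (the source only states that letters other than A, C, G, T were ignored).

```python
from typing import List


def split_into_fragments(genome: str, fragment_len: int = 150_000) -> List[str]:
    """Drop every letter other than A, C, G, T, then cut the remaining
    sequence into successive, non-overlapping fragments of fragment_len
    bases. A final portion shorter than fragment_len is discarded.
    """
    # Assumption: soft-masked (lowercase) bases are kept by upper-casing first.
    cleaned = "".join(base for base in genome.upper() if base in "ACGT")
    full_fragments = len(cleaned) // fragment_len
    return [
        cleaned[i * fragment_len : (i + 1) * fragment_len]
        for i in range(full_fragments)
    ]


# Toy usage with a tiny fragment length: the two N's are dropped and the
# trailing "AC" is too short to form a fragment.
toy_fragments = split_into_fragments("ACGTNNACGTAC", fragment_len=4)
```

Here `toy_fragments` is `["ACGT", "ACGT"]`: filtering leaves ten bases, which yield two full fragments of length four and a discarded tail of two bases.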
Subsequently, 20 such 150 kbp fragments were randomly sampled from each chromosome and, for each such fragment, a corresponding conventional nDNA signature was constructed, as described below. (If there were fewer than 20 fragments, all fragments in the chromosome were chosen.) In the cases where the genome assembly of the organism was at the contig/scaffold level, the contigs/supercontigs of the assembly were sorted by length and the first 500 contigs/supercontigs were selected. (If there were fewer than 500 contigs/supercontigs, all were selected.) From each contig/supercontig, only the first 150 kbp fragment was considered.
We note that this method is alignment-free, and that its approach contrasts with typical biodiversity and species identification research [62–65] in that it uses randomly selected DNA sequences rather than specific marker genes for identification and classification of species. This approach is somewhat similar to novel approaches in metagenomics, metatranscriptomics, and viromics [66], but there are also substantial differences, such as that metatranscriptomics is based on RNA rather than DNA and that it groups sequences based on functionality rather than oligomer composition.
Chaos Game Representation (CGR)
We used a modification of the original CGR, introduced by Deschavanne et al. [3]: a k-th order FCGR (frequency CGR) of a sequence s, denoted by \({FCGR}_{k}(s)\), is a \(2^{k} \times 2^{k}\) matrix that can be constructed by dividing the CGR image of the sequence s into a \(2^{k} \times 2^{k}\) grid, and defining the element \(a_{ij}\) of the matrix \({FCGR}_{k}(s)\) as the number of points that are situated in the corresponding grid square.
We now formally define the conventional DNA signature of a sequence s to be the matrix \({FCGR}_{k}(s)\), which records the numbers of occurrences of all possible k-mers in the sequence s. Throughout this paper, the parameter k is assumed to be a fixed constant. In particular, similar to [25], in all computational experiments in this paper the value used was k=9.
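Since the conventional DNA signature records k-mer occurrence counts, its content can be illustrated with a plain k-mer counter. The sketch below is an assumption-laden simplification: it stores counts in a dictionary keyed by k-mer instead of laying them out in the \(2^{k} \times 2^{k}\) grid (the grid position of each k-mer depends on a CGR corner convention not restated here), and the function name `fcgr_counts` is hypothetical.

```python
from collections import Counter

DNA_LETTERS = frozenset("ACGT")


def fcgr_counts(seq: str, k: int) -> Counter:
    """Count occurrences of every k-mer over {A, C, G, T} in seq.

    This holds the same numbers as the entries of FCGR_k(seq), but as a
    sparse mapping rather than a 2^k x 2^k matrix (simplifying assumption).
    Windows containing any other letter are skipped.
    """
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if DNA_LETTERS.issuperset(kmer):
            counts[kmer] += 1
    return counts


# Toy usage with k = 3; the experiments in the paper use k = 9.
toy_signature = fcgr_counts("ACGTACG", 3)
```

For the toy input, the five overlapping windows are ACG, CGT, GTA, TAC and ACG, so `toy_signature` counts ACG twice and each of the others once. A sequence of length m contributes at most m − k + 1 counts.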
For computing composite and assembled DNA signatures, we introduce the general concept of additive DNA signature of a set of sequences, formally defined as follows.
Definition 1

The conventional DNA signature of a sequence s is the additive DNA signature of the set {s} consisting of a single sequence s, that is, \({FCGR}_{k}(s) = {FCGR}_{k}(\{s\})\).

The composite DNA signature using two DNA sequences \(s_{1}, s_{2}\), of two different types, is
\({FCGR}_{k}(\{s_{1}, s_{2}\}) = {FCGR}_{k}(s_{1}) + {FCGR}_{k}(s_{2}).\)

An assembled signature of a sequence s, using r equi-length contigs of length n, is
\({FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i = 1}^{r} {FCGR}_{k}(s_{i}),\) where \(s = \alpha_{i} s_{i} \beta_{i}\) and \(|s_{i}| = n\), for \(1 \le i \le r\).

The fully-assembled DNA signature of a sequence s, using equi-length contigs of length n, is
\({FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i=1}^{r} {FCGR}_{k}(s_{i}),\) where \(r = \lfloor |s|/n \rfloor\), \(s = s_{1} s_{2} \ldots s_{r} s_{r+1}\), and \(|s_{i}| = n\) for \(1 \le i \le r\), while \(|s_{r+1}| < n\).
To compute the fullyassembled DNA signature of a sequence s, using equilength contigs of length n, one adds the F C G R _{ k } of all the adjacent consecutive contigs of length n that cover s (except possibly a short tail of length less than n), where the first contig starts at the beginning of the sequence. In contrast, to compute an assembled signature of s using equilength contigs of length n, one has the freedom to set the number of such contigs as an additional parameter r, and then add the F C G R _{ k } of r contigs sampled randomly from the sequence s. Thus, for a given n, a sequence s has only one fullyassembled DNA signature, but many different assembled signatures, each depending on both the choice of parameter r, and the particular sampling of the r sequences (which may overlap or be identical).
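The difference between the fully-assembled and the randomly sampled assembled signature can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a `Counter` keyed by k-mer stands in for the FCGR matrix, all function names are hypothetical, and contigs are kept in a list so that a contig sampled twice is counted twice.

```python
import random
from collections import Counter


def kmer_counts(seq, k):
    """Sparse stand-in for FCGR_k(seq): maps each k-mer to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def additive_signature(seqs, k):
    """Entrywise sum of the signatures of all sequences in seqs."""
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k))
    return total


def fully_assembled_contigs(seq, n):
    """Adjacent contigs of length n starting at position 0; a tail shorter than n is dropped."""
    r = len(seq) // n
    return [seq[i * n:(i + 1) * n] for i in range(r)]


def sampled_contigs(seq, n, r, rng):
    """r contigs of length n at random start positions (they may overlap or coincide)."""
    starts = [rng.randrange(len(seq) - n + 1) for _ in range(r)]
    return [seq[p:p + n] for p in starts]


s = "ACGTACGTTGCA"
fully_assembled = additive_signature(fully_assembled_contigs(s, 5), 3)
assembled = additive_signature(sampled_contigs(s, 5, 3, random.Random(0)), 3)
# A composite signature is the same sum applied to two sequences of different
# types, e.g. additive_signature([ndna_fragment, mtdna_genome], k).
```

For the toy sequence above, the fully-assembled signature is built from the two contigs ACGTA and CGTTG, and the trailing CA is ignored.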
Approximated Information Distance (AID)
For a finite set X, we denote by |X| the cardinality of X, that is, the number of elements in X. Given a set of sequences \(S=\{s_{1}, s_{2}, \ldots, s_{n}\}\) we denote by \(M_{k}(S)\) the set of all distinct k-mers that occur in at least one of the sequences of S. In the case of a set consisting of a single sequence s, we write \(M_{k}(s)\) to denote \(M_{k}(\{s\})\).
The distance \(d_{\mathtt {AID}}^{k}(s,t)\) was used for most of the computations of pairwise distances between conventional DNA signatures in this paper.
For two sets of sequences S and T, the generalized approximated information distance is \(d_{\mathtt {AID}}^{k}(S,T) = \frac {|M_{k}(S) \setminus M_{k}(T)| + |M_{k}(T) \setminus M_{k}(S)|}{|M_{k}(S \cup T)|}\). This generalization of the approximated information distance preserves the original meaning of the concept as the ratio between the number of non-common k-mers of the two sets S and T and the total number of k-mers that occur in S or in T (or both). This distance was used to compute distances between conventional, composite and assembled DNA signatures in this paper.
The next proposition yields a formula for computing the generalized approximated information distance, and gives a theoretical upper bound for it in the case of fully-assembled DNA signatures. The following auxiliary lemma follows from standard set-theoretic arguments.
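The verbal description of the generalized distance (non-common k-mers divided by all k-mers occurring in S or T) can be turned into a few lines of Python. The sketch below is one reading of that description, with hypothetical function names; it also evaluates the closed form \(2 - (|M_{k}(S)|+|M_{k}(T)|)/|M_{k}(S \cup T)|\) so that the two can be compared on a small example.

```python
def kmer_set(seqs, k):
    """M_k(S): the distinct k-mers occurring in at least one sequence of seqs."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def aid_distance(S, T, k):
    """Number of non-common k-mers divided by the number of k-mers seen in S or T."""
    ms, mt = kmer_set(S, k), kmer_set(T, k)
    union = ms | mt
    if not union:
        raise ValueError("neither set contains a k-mer of the requested length")
    return len(ms ^ mt) / len(union)  # ^ is the symmetric difference


def aid_distance_closed_form(S, T, k):
    """The same quantity written as 2 - (|M_k(S)| + |M_k(T)|) / |M_k(S u T)|."""
    ms, mt = kmer_set(S, k), kmer_set(T, k)
    return 2 - (len(ms) + len(mt)) / len(ms | mt)


d = aid_distance(["ACGT"], ["CGTA"], 2)  # AC and TA are non-common; CG and GT are shared
```

Because only the presence or absence of a k-mer enters the distance, the same value is obtained from the number of nonzero entries of the corresponding additive signatures.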
Lemma 2
 1. If S⊆T then \(|M_{k}(S)| \le |M_{k}(T)|\) and \(M_{k}(S \cup T) = M_{k}(T)\).
 2. If every sequence in S is a subsequence of a given sequence s, then \(M_{k}(S) \cup M_{k}(s) = M_{k}(s)\).
 3. The number of distinct k-mers that occur in S but not in T is \(|M_{k}(S) \setminus M_{k}(T)| = |M_{k}(S \cup T)| - |M_{k}(T)|\).
 4. \(|M_{k}(S)| = \# {FCGR}_{k}(S)\), where for a numerical matrix A we denote by #(A) or #A the number of nonzero entries of A.
Proposition 3
 1. \(d_{\mathtt {AID}}^{k}(S,T)= 2 - \frac {|M_{k}(S)|+|M_{k}(T)|}{|M_{k}(S \cup T)|}\).
 2. If \(s = s_{1} s_{2} \ldots s_{r}\) and each \(s_{i}\) is of length n, n>k, then \(d_{\mathtt {AID}}^{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}, s)\le \frac {\min \{(r-1)(k-1), |M_{k}(s)|\}}{|M_{k}(s)|}\).
 3. There is a sequence s for which the above relation holds with “=”.
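Proof
For the first statement, Lemma 2.3 gives \(|M_{k}(S) \setminus M_{k}(T)| = |M_{k}(S \cup T)| - |M_{k}(T)|\) and, symmetrically, \(|M_{k}(T) \setminus M_{k}(S)| = |M_{k}(S \cup T)| - |M_{k}(S)|\). Dividing the sum of these two quantities by \(|M_{k}(S \cup T)|\) yields \(\frac {2|M_{k}(S \cup T)| - |M_{k}(S)| - |M_{k}(T)|}{|M_{k}(S \cup T)|}\), which is indeed equal to the required formula.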
For the second statement, let \(S=\{s_{1}, s_{2}, \ldots, s_{r}\}\) and T={s}. By the definition of the generalized approximated information distance, \(d_{\mathtt {AID}}^{k}(\{s_{1}, \ldots, s_{r}\}, s)\) equals a fraction whose numerator is the sum of the number of distinct k-mers that appear in \(\{s_{1}, \ldots, s_{r}\}\) but not in s, and the number of distinct k-mers that appear in s but not in \(\{s_{1}, \ldots, s_{r}\}\). The first term of this sum is zero, since the \(s_{i}\) are contigs that span the sequence s. Thus, the numerator of this fraction is the second term of the sum, namely the number of distinct k-mers that appear in \(s = s_{1} s_{2} \ldots s_{r}\) but not in \(\{s_{1}, \ldots, s_{r}\}\). We can count these k-mers by noticing that the only k-mers that appear in s but not in \(\{s_{1}, \ldots, s_{r}\}\) are the ones that span consecutive contigs.
We now note that each joint of two contigs \(s_{i} s_{i+1}\) contains at most (k−1) distinct k-mers that span both contigs, and that s contains (r−1) such joints \(s_{i} s_{i+1}\). Thus, the total number of k-mers that are in s but not in \(\{s_{1}, \ldots, s_{r}\}\) is at most (r−1)·(k−1).
By Lemma 2.2 the denominator of the fraction equals \(|M_{k}(s)|\). Since the approximated information distance ranges between 0 and 1, the numerator is also at most \(|M_{k}(s)|\), and the required inequality follows.
which equals the given upper bound. □
Remark that, by Lemma 2.4, the formula in Proposition 3.1 can be written as \(d_{\mathtt {AID}}^{k}(S,T)= 2 - \frac {\# {FCGR}_{k}(S) + \# {FCGR}_{k}(T)}{\# {FCGR}_{k}(S \cup T)}\), which is the formula that was used for all generalized approximated information distance calculations in this paper.
Remark also that the upper bound determined in Proposition 3.2 for the generalized approximated information distance, in the case of the comparison between the conventional DNA signature of a sequence and the fully-assembled DNA signature of its r contigs of length n, is the one illustrated in Column (A′) of Table 1.
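The upper bound of Proposition 3.2 can be checked numerically on synthetic data. The sketch below is only a sanity check under stated assumptions: the sequence is random rather than genomic, the function name is hypothetical, and any tail shorter than n is trimmed so that the hypothesis \(s = s_{1} s_{2} \ldots s_{r}\) holds exactly.

```python
import random


def kmer_set(seqs, k):
    """Distinct k-mers occurring in at least one sequence of seqs."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def fully_assembled_gap(s, n, k):
    """Return (distance, bound) for s versus its adjacent contigs of length n."""
    r = len(s) // n
    if r == 0 or n <= k:
        raise ValueError("need at least one contig and n > k")
    s = s[:r * n]  # drop the tail so that s is exactly s_1 s_2 ... s_r
    contigs = [s[i * n:(i + 1) * n] for i in range(r)]
    whole = kmer_set([s], k)
    parts = kmer_set(contigs, k)
    distance = len(whole ^ parts) / len(whole | parts)
    bound = min((r - 1) * (k - 1), len(whole)) / len(whole)
    return distance, bound


rng = random.Random(1)
toy = "".join(rng.choice("ACGT") for _ in range(1000))
distance, bound = fully_assembled_gap(toy, 100, 9)
```

The k-mers missing from the contigs are exactly those that straddle a joint, which is why the distance shrinks as the contig length n grows relative to k.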
Multidimensional scaling and separation assessment
To visualize the interrelationships among DNA signatures originating from a pair of genomes, and thus to visually assess whether separation was achieved, we used MultiDimensional Scaling (MDS). MDS is an information visualization technique introduced by Kruskal in [67]. MDS takes as input a distance matrix that contains the pairwise distances among a set of items (here the items are DNA signatures), and outputs a spatial representation of the items in a common Euclidean space. Each item is represented as a point, and the spatial distance between any two points corresponds to the distance between the items in the distance matrix. Objects with a smaller pairwise distance will result in points that are close to each other, while objects with a larger pairwise distance will become points that are far apart.
Concretely, classical MDS, which we use in this paper, receives as input an m×m distance matrix \((\Delta (i,j))_{1 \le i,j \le m}\) of the pairwise distances between any two items in the set. The output of classical MDS consists of m points in a q-dimensional space whose pairwise spatial (Euclidean) distances are a linear function of the distances between the corresponding items in the input distance matrix. More precisely, MDS will return m points \(p_{1},p_{2},\ldots,p_{m}\in \mathbb {R}^{q}\) such that \(d(i,j)= \|p_{i} - p_{j}\| \approx f(\Delta (i,j))\) for all i, j∈{1,…,m}, where d(i, j) is the spatial distance between the points \(p_{i}\) and \(p_{j}\), and f is a function linear in Δ(i, j). Here, q can be at most (m−1) and the points are recovered from the eigenvalues and eigenvectors of the input m×m distance matrix. If we choose q=3, the result of classical MDS is an approximation of the original (m−1)-dimensional space as a three-dimensional map, such as the Molecular Distance Maps in this paper. Throughout the paper, for consistency, all Molecular Distance Maps have been scaled so that the x, y, and z coordinates always span the interval [−1,1]. The formula used for scaling is \(x_{\text {sca}} =2 \cdot \left (\frac {x - x_{\text {min}}}{x_{\text {max}} - x_{\text {min}}}\right) - 1\), where \(x_{\text {min}}\) and \(x_{\text {max}}\) are the minimum and maximum of the x-coordinates of all the points in the original map, and similarly for \(y_{\text {sca}}\) and \(z_{\text {sca}}\). In all Molecular Distance Maps displayed in this paper, the origin of coordinates (0,0,0) is the center of the depicted cube, and the edges of the cube are parallel to the x, y, and z axes, respectively. The maps have been rotated for optimal visualization and, for each of the axes, the length units are displayed only on one of the four edges of the cube that are parallel to it.
A feature of MDS is that the points \(p_{i}\) are not unique. Indeed, one can translate or rotate a map without affecting the pairwise spatial distances \(d(i,j)=\|p_{i} - p_{j}\|\). In addition, the obtained points in an MDS map may change coordinates when more data items are added to, or removed from, the dataset. This is because MDS aims to preserve only the pairwise spatial distances between points, and this can be achieved even when some of the points change their coordinates. In particular, the (x, y, z)-coordinates of a point representing the DNA signature of a particular DNA fragment of H. sapiens in Fig. 1 will not be the same as the (x, y, z)-coordinates of the point representing the same DNA fragment in Fig. 3.
For a given Molecular Distance Map, k-means clustering [57] was used to assess whether separation of the DNA sequences by organism was achieved. The reasons for this choice were that in all computed Molecular Distance Maps the number of clusters was known a priori, k=2 (not to be confused with k-mers, where k has a different meaning), that the clusters had approximately the same number of points and thus the prior probability of the two clusters was the same, and that in most cases the clusters were somewhat spherical in shape. Moreover, the use of k-means yielded satisfactory results in the majority of cases.
For Molecular Distance Maps with more complex cluster shapes, where k-means accuracy is low and separating planes do not exist, the use of other clustering methods such as density-based spatial clustering of applications with noise (DBSCAN) [69] would have to be explored to see if separation is achieved.
The webtool MoDMap3D [58] illustrates the 3D Molecular Distance Maps that correspond to each of the comparisons listed in Fig. 2, in the same way the Molecular Distance Map in Fig. 1 illustrates the positive separation result listed in Fig. 2, subfigure Animalia, line 1. The webtool is, moreover, interactive, and allows for an in-depth exploration of each particular 3D Molecular Distance Map. After first selecting the pair of genomes to be compared, the user can navigate in the three-dimensional space of their DNA signatures: clicking on any point in the map will display information about the DNA fragment represented by that point, such as its NCBI accession number or assembly number, scientific name of the organism it originates from, chromosome or contig/scaffold number, length of the subsequence in bp, and fragment number from the original sequence.
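To make the two steps above concrete (recovering points from the eigenvectors of the distance data, then rescaling each axis to [−1,1]), here is a dependency-free Python sketch. It is an assumption-laden illustration, not the Mathematica routine used in the paper: it follows the textbook form of classical MDS (double-centring the squared distances) and finds eigenvectors by plain power iteration with deflation, which is adequate only for small, well-conditioned examples. The names `classical_mds` and `rescale` are hypothetical.

```python
import math
import random


def classical_mds(dist, q, iters=500, seed=0):
    """Embed m items in R^q from a symmetric m x m distance matrix."""
    m = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(m)] for i in range(m)]
    row_mean = [sum(row) / m for row in sq]
    grand_mean = sum(row_mean) / m
    # Double centring: B = -1/2 * J * D^2 * J with J = I - (1/m) * ones
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(m)] for i in range(m)]
    rng = random.Random(seed)
    coords = [[0.0] * q for _ in range(m)]
    for axis in range(q):
        v = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):  # power iteration towards the dominant eigenvector
            w = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        Bv = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
        eigval = sum(v[i] * Bv[i] for i in range(m))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(m):
            coords[i][axis] = scale * v[i]
        for i in range(m):  # deflate so the next pass finds the next eigenvector
            for j in range(m):
                B[i][j] -= eigval * v[i] * v[j]
    return coords


def rescale(values):
    """Map one coordinate axis onto [-1, 1] using 2 * (x - min) / (max - min) - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2 * (x - lo) / (hi - lo) - 1 for x in values]


triangle = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]  # a 3-4-5 right triangle
points = classical_mds(triangle, 2)
scaled_x = rescale([p[0] for p in points])
```

Note that rescaling each axis independently changes the pairwise distances by an axis-dependent factor, so the rescaled map is meant for display rather than for reading off exact distances.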
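A small Python sketch of this separation check is given below. It is a simplified stand-in, not the procedure from [68]: it runs Lloyd's algorithm [57] with two clusters, starts from a deterministic farthest-point initialisation (an assumption chosen to keep the example reproducible), and scores the result as the fraction of points whose cluster agrees with the known organism label under the better of the two possible labellings. The function names and the toy coordinates are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iters=100):
    """Lloyd's algorithm with two clusters; returns a 0/1 label for every point."""
    first = points[0]
    second = max(points, key=lambda p: sq_dist(p, first))  # farthest-point start
    centers = [list(first), list(second)]
    labels = None
    for _ in range(iters):
        new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def clustering_accuracy(labels, truth):
    """Fraction of points matching the true labels, up to swapping cluster names."""
    agree = sum(1 for a, b in zip(labels, truth) if a == b)
    return max(agree, len(labels) - agree) / len(labels)


# Toy 3D map: three signatures from one genome, three from another
toy_map = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
organism = [0, 0, 0, 1, 1, 1]
accuracy = clustering_accuracy(two_means(toy_map), organism)
```

On well-separated clusters like these the accuracy is 1.0; values near 0.5 would indicate that the two genomes are not separated by the clustering.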
Software
The code for running the experiments [68] was written in Wolfram Mathematica, and was used for the generation of FCGRs, the computation of composite and assembled DNA signatures, the calculation of distance matrices, the creation of the 3D Molecular Distance Maps, and the computation of the separating planes.
Remarks
One observation should be made about the genome assemblies at contig/scaffold level in the dataset. The general intent was for the 150 kbp DNA fragments from a given genome not to be overlapping. This is because sequence overlaps could result in artificially smaller intragenomic distances due to the increase in sequences’ similarities, and this could potentially lead to false positive cluster separations. However, some overlap may have been unavoidable in the cases where only contig/scaffold level data was available. The availability of contig/scaffold data only may thus explain why in Fig. 2 the accuracy scores do not always decrease uniformly, as expected, when one compares the pivot organism with organisms more and more closely related to it.
Another observation should be made about the length of sequences analyzed. When computing composite DNA signatures, the signature of the mitochondrial genome (or entire chloroplast or plasmid) was appended to that of each 150 kbp nDNA fragment. This, in some sense, magnifies the role of the organellar genome in the composite signature. Depending on the application, one can generalize Definition 1 to a weighted additive DNA signature which gives different weights to the different types of DNA that compose it.
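The weighted generalization mentioned here is not specified further in the text, so the following Python sketch is purely hypothetical: it multiplies each sequence's k-mer counts by a user-chosen weight before summing. The function names and the example weights are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Sparse stand-in for FCGR_k(seq): maps each k-mer to its number of occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def weighted_additive_signature(seqs, weights, k):
    """Sum of w_i * FCGR_k(s_i); with all weights equal to 1 this is the plain additive signature."""
    if len(seqs) != len(weights):
        raise ValueError("one weight per sequence is required")
    total = {}
    for s, w in zip(seqs, weights):
        for kmer, count in kmer_counts(s, k).items():
            total[kmer] = total.get(kmer, 0.0) + w * count
    return total


# Hypothetical use: down-weight the organellar sequence by one half
signature = weighted_additive_signature(["ACGT", "CGTT"], [1.0, 0.5], 2)
```

With positive weights the set of nonzero entries is unchanged, so a distance that depends only on which k-mers are present would not be affected; the weights would matter for distances that use the counts themselves.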
We now discuss some limitations of the proposed methods. First, note that assembled DNA signatures as defined here use equilength contigs. Preliminary computational experiments, illustrated in Table 1, columns (B′) and (C′), show the results of comparisons between a conventional nDNA signature and variable-length assembled DNA signatures of the same fragment. In those experiments, contig lengths are drawn from a normal distribution N(μ,σ) with mean μ=n (the length of the contig in the corresponding equilength contig experiment) and variance σ=40. The table shows that the performance of assembled DNA signatures using variable-length contigs is comparable with the performance of those using equilength contigs. This indicates that assembled DNA signatures built from either equilength or variable-length contigs could be reliable approximations of conventional genomic signatures, depending on the application. Additional exploration is needed to confirm this hypothesis.
Second, every computational experiment in this study is a comparison between DNA signatures of genomic sequences belonging to two different organisms. Further analysis is needed to determine if the positive preliminary results on the discriminating power of composite and composite-assembled DNA signatures extend successfully to multi-genome comparisons. A necessary step for such an experiment would be a thorough investigation of intragenomic variations of FCGRs and finding a method to determine, for each genome, a single “representative” FCGR matrix to successfully represent that genome.
The definition of \({FCGR}_{k}\) can be easily modified to make it an exact homomorphism by, e.g., defining a marked catenation of sequences s and t as s·t=s$t, with $ a new symbol, and constructing \({FCGR}_{k}\) so as not to count any k-mer that includes the symbol $. Next steps in the exploration of the mathematical properties of additive DNA signatures include studying the implications of the homomorphic, structure-preserving nature of \({FCGR}_{k}\), as well as extensions of the concept of additive DNA signature to, e.g., weighted additive DNA signatures, which would give different weights to the different types of DNA that compose it.
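One way such variable-length contigs could be generated is sketched below in Python. The details are assumptions: the text does not say whether the contigs were adjacent or sampled, nor whether the value 40 denotes a standard deviation or a variance, so the sketch cuts adjacent contigs and takes the spread as an explicit standard-deviation argument. The function name is hypothetical.

```python
import random


def variable_length_contigs(seq, mean_len, sd, k, rng):
    """Cut seq into adjacent contigs with lengths drawn from a normal distribution.

    Drawn lengths are rounded and clamped to at least k so that every contig
    contributes at least one k-mer; a final piece shorter than k is dropped.
    """
    contigs, pos = [], 0
    while pos < len(seq):
        length = max(k, round(rng.gauss(mean_len, sd)))
        piece = seq[pos:pos + length]
        if len(piece) >= k:
            contigs.append(piece)
        pos += length
    return contigs


rng = random.Random(7)
toy = "".join(rng.choice("ACGT") for _ in range(2000))
contigs = variable_length_contigs(toy, mean_len=100, sd=40, k=9, rng=rng)
```

The resulting list can be summed into an assembled signature in the same way as equilength contigs.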
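The effect of the marked catenation can be seen on a toy example. In the Python sketch below (hypothetical function names, with a `Counter` standing in for the FCGR matrix), k-mers containing the marker are skipped, so the signature of s$t equals the sum of the signatures of s and t, whereas plain catenation also counts the k-mers that straddle the junction.

```python
from collections import Counter


def kmer_counts(seq, k, marker="$"):
    """k-mer counts of seq, ignoring every k-mer that contains the marker symbol."""
    return Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if marker not in seq[i:i + k]
    )


def marked_catenation(s, t, marker="$"):
    """s . t = s $ t, with $ a symbol that does not occur in s or t."""
    return s + marker + t


s, t, k = "ACGTAC", "GTTACA", 3
marked = kmer_counts(marked_catenation(s, t), k)
summed = kmer_counts(s, k) + kmer_counts(t, k)
plain = kmer_counts(s + t, k)  # includes the junction 3-mers ACG and CGT
```

Here `marked` and `summed` coincide, while `plain` contains two additional 3-mer occurrences contributed by the junction.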
Abbreviations
AID, approximated information distance; CGR, chaos game representation; cpDNA, chloroplast DNA; FCGR, frequency CGR; MDS, multi dimensional scaling; mtDNA, mitochondrial DNA; nDNA, nuclear/nucleoid DNA; pDNA, plasmid DNA.
Declarations
Acknowledgements
We thank Kathleen Hill (Biology, University of Western Ontario) for valuable comments and suggestions, Genlou Sun (Biology, St. Mary’s University) for general molecular biology expertise, and Stephen M. Watt (University of Waterloo) for useful discussions.
Funding
The research presented in this paper was supported by the Natural Sciences and Engineering Research Council of Canada (Grant No. R2824A01 to L.K., Grant No. 220259 to S.K., and Undergraduate Student Research Award No. 480936 to S.S.R.). The funding bodies had no role in the design of the study, the collection, analysis and interpretation of data, and in writing the manuscript.
Availability of data and materials
The source code for computing FCGR matrices, distance matrices, MultiDimensional Scaling and separation planes can be found in [68]. The NCBI accession numbers of all DNA sequences involved in this study can be found in the Additional file 1.
Authors’ contributions
The author order in the title is alphabetical. RK: data collection; data analysis, methodology and result interpretation; manuscript tables and figures; manuscript editing; software design and implementation. LK: data analysis, methodology and result interpretation; manuscript draft; manuscript editing. S. Kon: data analysis, methodology and result interpretation; manuscript editing. S. Kop: data analysis, methodology and result interpretation. S. Solis-Reyes: manuscript draft (part of Section Background); data collection and analysis (plant experiments); software performance enhancements; language editing. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
 1. Jeffrey HJ. Chaos game representation of gene structure. Nucleic Acids Res. 1990; 18(8):2163–70.
 2. Jeffrey HJ. Chaos game visualization of sequences. Computers & Graphics. 1992; 16(1):25–33.
 3. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999; 16(10):1391–9.
 4. Deschavanne PJ, Giron A, Vilain J, Dufraigne C, Fertil B. Genomic signature is preserved in short DNA fragments. In: Proceedings of the IEEE International Symposium on BioInformatics and Biomedical Engineering. IEEE: 2000. p. 161–7.
 5. Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.
 6. Karlin S, Campbell AM, Mrázek J. Comparative DNA analysis across diverse genomes. Annu Rev Genet. 1998; 32:185–225.
 7. Vinga S, Almeida JS. Alignment-free sequence comparison - a review. Bioinformatics. 2003; 19(4):513–23.
 8. Nalbantoglu OU, Sayood K. Computational Genomic Signatures. Synth Lect Biomed Eng. 2011; 6(2):1–129.
 9. Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. 2013; 15(6):890–905.
 10. Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform. 2014; 15(3):354–68.
 11. Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform. 2014; 15(3):343–53.
 12. Burma PK, Raj A, Deb JK, Brahmachari SK. Genome analysis: A new approach for visualization of sequence organization in genomes. J Biosci. 1992; 17(4):395–411.
 13. Hill KA, Singh SM. The evolution of species-type specificity in the global DNA sequence organization of mitochondrial genomes. Genome. 1997; 40(3):342–56.
 14. Hao B, Lee HC, Zhang SY. Fractals related to long DNA sequences and complete genomes. Chaos Solitons Fractals. 2000; 11(6):825–36.
 15. Dutta C, Das J. Mathematical characterization of chaos game representation. New algorithms for nucleotide sequence analysis. J Mol Biol. 1992; 228(3):715–9.
 16. Goldman N. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res. 1993; 21(10):2487–491.
 17. Almeida JS, Carriço JAA, Maretzek A, Noble PA, Fletcher M. Analysis of genomic sequences by Chaos Game Representation. Bioinformatics. 2001; 17(5):429–37.
 18. Almeida JS. Sequence analysis by iterated maps, a review. Brief Bioinform. 2014; 15(3):369–75.
 19. Wang Y, Hill K, Singh S, Kari L. The spectrum of genomic signatures: From dinucleotides to chaos game representation. Gene. 2005; 346:173–85.
 20. Kari L, Hill KA, Sayem AS, Karamichalis R, Bryans N, Davis K, Dattani NS. Mapping the space of genomic signatures. PLoS ONE. 2015; 10(5):e0119815.
 21. Edwards SV, Fertil B, Giron A, Deschavanne PJ. A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst Biol. 2002; 51(4):599–613.
 22. Deschavanne P, DuBow MS, Regeard C. The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination. Virol J. 2010; 7:163.
 23. Pandit A, Sinha S. Using genomic signatures for HIV-1 subtyping. BMC Bioinformatics. 2010; 11(Suppl 1):26.
 24. Hatje K, Kollmar M. A phylogenetic analysis of the Brassicales clade based on an alignment-free sequence comparison method. Front Plant Sci. 2012; 3(192):11–22.
 25. Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics. 2015; 16(1):246.
 26. Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005; 21(22):4125–32.
 27. Höhl M, Rigoutsos I, Ragan MA. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol Bioinforma. 2006; 2:359–75.
 28. Höhl M, Ragan MA. Is multiple-sequence alignment required for accurate inference of phylogeny? Syst Biol. 2007; 56(2):206–21.
 29. Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: A synergy that produces novel statistical measures for sequence comparison. Bioinformatics. 2008; 24(20):2296–302.
 30. Guyon F, Brochier-Armanet C, Guénoche A. Comparison of alignment free string distances for complete genome phylogeny. Adv Data Anal Classif. 2009; 3(2):95–108.
 31. Jayalakshmi R, Natarajan R, Vivekanandan M, Natarajan GS. Alignment-free sequence comparison using N-dimensional similarity space. Curr Computer-Aided Drug Des. 2010; 6(4):290–6.
 32. Haubold B. Alignment-free phylogenetics and population genetics. Brief Bioinform. 2014; 15(3):407–18.
 33. Fiser A, Tusnády GE, Simon I. Chaos game representation of protein structures. J Mol Graph. 1994; 12(4):302–4.
 34. Basu S, Pan A, Dutta C, Das J. Chaos game representation of proteins. J Mol Graph Modell. 1997; 15(5):279–89.
 35. Yu ZG, Anh V, Lau KS. Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. J Theor Biol. 2004; 226(3):341–8.
 36. Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol. 2009; 257(4):618–26.
 37. Randić M, Novič M, Vikić-Topić D, Plavšić D. Novel numerical and graphical representation of DNA sequences and proteins. SAR QSAR Environ Res. 2006; 17(6):583–95.
 38. Almeida JS, Vinga S. Biological sequences as pictures: a generic two dimensional solution for iterated maps. BMC Bioinformatics. 2009; 10:100.
 39. Almeida JS, Vinga S. Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics. 2002; 3:6.
 40. Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol. 2006; 1:18.
 41. Fu W, Wang Y, Lu D. Multifractal analysis of genomic sequences CGR images. In: Proceedings of the 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. vol. 5. IEEE: 2005. p. 4783–786.
 42. Fu W, Wang Y, Lu D. Multifractal analysis of genomes sequences’ CGR graph. J Biomed Eng. 2007; 24(3):522–5.
 43. Vélez PE, Garreta LE, Martínez E, Díaz N, Amador S, Tischer I, Gutiérrez JM, Moreno PA. The Caenorhabditis elegans genome: A multifractal analysis. Genet Mol Res. 2010; 9(2):949–65.
 44. Moreno PA, Vélez PE, Martínez E, Garreta LE, Díaz N, Amador S, Tischer I, Gutiérrez JM, Naik AK, Tobar F, García F. The human genome: a multifractal analysis. BMC Genomics. 2011; 12(1):506.
 45. Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol. 2012; 62(2):756–63.
 46. Pal M, Satisha B, Srinivas K, Madhusudana Rao P, Manimaran P. Multifractal detrended cross-correlation analysis of coding and non-coding DNA sequences through chaos-game representation. Physica A: Stat Mech Appl. 2015; 436:596–603.
 47. Oliver JL, Bernaola-Galván P, Guerrero-García J, Román-Roldán R. Entropic profiles of DNA sequences through chaos-game-derived images. J Theor Biol. 1993; 160(4):457–70.
 48. Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol. 2004; 231(3):377–88.
 49. Vinga S, Almeida JS. Local Rényi entropic profiles of DNA sequences. BMC Bioinformatics. 2007; 8:393.
 50. Joseph J, Sasikumar R. Chaos game representation for comparison of whole genomes. BMC Bioinformatics. 2006; 7:243.
 51. Tanchotsrinon W, Lursinsap C, Poovorawan Y. A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition. BMC Bioinformatics. 2015;16(1).
 52. Campbell AM, Mrázek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A. 1999; 96(16):9184–9.
 53. Li M, Chen X, Li X, Ma B, Vitanyi PMB. The similarity metric. Inf Theory IEEE Trans. 2004; 50(12):3250–264.
 54. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process. 2004; 13(4):600–12.
 55. Iversen GR, Gergen M, Gergen MM. Statistics: The Conceptual Approach. Berlin Heidelberg: Springer; 1997.
 56. Krause EF. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Mineola, New York: Courier Dover Publications; 2012.
 57. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982; 28(2):129–37.
 58. Karamichalis R. Molecular Distance Map Interactive Webtool. 2015. https://github.com/rallis/MoDMap3D. Accessed 27 Jul 2016.
 59. Jameson NM, Hou ZC, Sterner KN, Weckle A, Goodman M, Steiper ME, Wildman DE. Genomic data reject the hypothesis of a prosimian primate clade. J Human Evol. 2011; 61(3):295–305.
 60. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MAM, Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MPC, Silva A, O’Brien SJ, Pecon-Slattery J. A molecular phylogeny of living primates. PLoS Genet. 2011; 7(3):1001342.
 61. Chatterjee H, Ho S, Barnes I, Groves C. Estimating the phylogeny and divergence times of primates using a supermatrix approach. BMC Evol Biol. 2009; 9(1):259.
 62. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010; 11(5):473–83.
 63. Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: Current challenges and future perspectives. PLoS ONE. 2011; 6(3):18093.
 64. Grossmann L, Jensen M, Heider D, Jost S, Glücksman E, Hartikainen H, Mahamdallie SS, Gardner M, Hoffmann D, Bass D, et al. Protistan community analysis: key findings of a large-scale molecular sampling. ISME J. Springer Nature; 2016.
 65. Lange A, Jost S, Heider D, Bock C, Budeus B, Schilling E, Strittmatter A, Boenigk J, Hoffmann D. Ampliconduo: A split-sample filtering protocol for high-throughput amplicon sequencing of microbial communities. PLoS ONE. 2015; 10(11):0141590.
 66. Bikel S, Valdez-Lara A, Cornejo-Granados F, Rico K, Canizales-Quinteros S, Soberón X, Del Pozo-Yauner L, Ochoa-Leyva A. Combining metagenomics, metatranscriptomics and viromics to explore novel microbial interactions: towards a systems-level understanding of human microbiome. Comput Struct Biotechnol J. 2015; 13:390–401.
 67. Kruskal JB. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964; 29(1):1–27.
 68. Karamichalis R. Source code for computing FCGR matrices, distance matrices, MultiDimensional Scaling and separation planes. https://github.com/rallis/GenomicSignatures. Accessed 27 Jul 2016.
 69. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Conference on Knowledge Discovery and Data Mining; vol. 96. AAAI Press: 1996. p. 226–31.