Additive methods for genomic signatures

Karamichalis, Rallis; Kari, Lila; Konstantinidis, Stavros; Kopecki, Steffen; Solis-Reyes, Stephen

doi:10.1186/s12859-016-1157-8

Research Article
Open access
Published: 22 August 2016

Additive methods for genomic signatures

Rallis Karamichalis²,
Lila Kari^1,2,
Stavros Konstantinidis³,
Steffen Kopecki^2,3 &
…
Stephen Solis-Reyes²

BMC Bioinformatics volume 17, Article number: 313 (2016) Cite this article

2937 Accesses
17 Citations
8 Altmetric
Metrics details

Abstract

Background

Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date.

Results

We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information.

Conclusions

Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.

Background

Motivated by the general need to identify and classify species based on molecular evidence, alignment-free genome comparisons have been proposed, based on comparing Chaos Game Representations (CGR) of genomic DNA sequences. The CGR of a DNA sequence, proposed by Jeffrey [1, 2], is a graphical representation of a DNA sequence, where the patterns in the image correspond to the frequencies of k-mers in the sequence. Deschavanne et al. [3, 4] were the first to suggest that CGR is a good candidate for the role of “genomic signature” defined by Karlin and Burge [5] as any specific quantitative characteristic of a sequence that is pervasive along the genome, while being dissimilar for sequences originating from organisms of different species.

CGR is one of a variety of alignment-free methods (see [6–11] for detailed literature reviews) that have been proposed for sequence and genome comparisons, as a computationally efficient approach that performs well even with DNA sequences that have nothing or little in common. (We use the following notational conventions for genomic DNA: nDNA (nuclear/nucleoid DNA), mtDNA (mitochondrial DNA), cpDNA (chloroplast DNA), and pDNA (plasmid DNA)).

Initially, CGR images were only qualitatively analyzed [12–14], and Dutta et al. and Goldman both advanced the suggestion that CGR images represent no more information than second-order Markov chains [15, 16], which was later disproven by Almeida et al. [17, 18] and others [19, 20]. CGR has been applied extensively to phylogenetics together with the Euclidean distance, for instance on nDNA fragments from various domains [3], 27 genomes from various genera [4], 125 nDNA fragments from several bird genomes [21], 26 mtDNA sequences (also with the Pearson distance and a custom image distance) [19], 4 bacteria and about 200 phages [22], 75 HIV-1 genomes [23], 10 mtDNA sequences and 14 nDNA sequences from plants in the Brassicales order [24]. Other distances have also been used, for instance the DSSIM image distance on a set of 3,176 mtDNA sequences [20], and six different distances on 174 million base pairs of sampled nDNA fragments from organisms of all major kingdoms of life [25]. The performance of several distance functions has also been compared and benchmarked on their accuracy in constructing phylogenetic trees [26–32]. Initially, CGR was used only for strings over a 4-letter alphabet (like DNA), but generalizations have been proposed to peptide sequences [33–38], and Almeida and Vinga proposed a derivative of CGR called the Universal Sequence Map (USM), which is suitable for alphabets of any size [39, 40]. CGRs have also been subjected to multifractal analysis (which measures the degree of self-similarity within the image), see, e.g., [35, 41–46]. Lastly, CGR has been used to estimate sequence entropy [47–49], to speed up local-alignment algorithms [50], and has been used together with neural networks to classify HPV genomes by genotype [51].

Several CGR studies [13, 20, 52] observed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be completely different. While the hypothesis that CGRs of mitochondrial DNA sequences can play the role of genomic signatures was tested and validated on the set of all 3,176 sequenced mitochondrial genomes (totalling 91.3 megabase pairs) available in the NCBI GenBank sequence database in July 2012 [20], to our knowledge no such extensive analysis of CGRs of nuclear/nucleoid genomic sequences exists to date.

The main contributions of this paper are:

We present an extensive analysis of the hypothesis that conventionally computed (called herein “conventional”) nDNA signatures can play the role of genomic signatures at multiple taxonomic levels, from kingdom to species. Our dataset totals 1.45 gigabase pairs of nDNA sequences from 42 different genomes, from all major kingdoms of life.
Our analysis indicates that conventional nDNA signatures of two different origins cannot always be differentiated, especially if they originate from closely related organisms. To address this issue, we propose taking into account information obtained from organellar DNA, in addition to nDNA. More generally, we propose the concept of an additive DNA signature of a set (collection) of DNA sequences, and define two particular instances: composite DNA signatures and assembled DNA signatures.
We explore composite DNA signatures, which combine conventional nDNA signatures with organellar DNA signatures (mtDNA, cpDNA, or pDNA) of the same organism. We demonstrate that, in this dataset, the composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of conventional nDNA signatures failed. In particular, composite DNA signatures from genomes of species as closely related as H. sapiens and P. troglodytes, or E. coli and E. fergusonii, can be successfully separated.
We explore assembled DNA signatures, which combine information from many short contigs (e.g., 100 bp) of a DNA fragment to produce a recognizable signature. This is in contrast to conventional DNA signatures wherein one single long (thousand to hundreds of thousands of basepairs) DNA sequence is needed to generate a recognizable signature.

The enhanced discriminating power of composite DNA signatures, and the ability of assembled DNA signatures to operate with scattered and reduced sequence data, open the possibility of practical applications including aiding species identification or classification, and comparisons of DNA fragments of various origins such as genomes of extinct organisms, synthetic genomes, raw unassembled next-generation sequencing (NGS) read data, or even computer-generated DNA sequences.

Results

The first objective of this study was to test, on a comprehensive dataset, the hypothesis that conventional nDNA signatures can be used to differentiate between nuclear DNA sequences originating from different organisms, spanning all major kingdoms of life, at multiple taxonomic levels.

To this end, the following computational experiment was performed, for each of the major kingdoms of life, at various taxonomic levels. We chose a pivot organism (e.g., H. sapiens for Kingdom Animalia) and proceeded to use conventional nDNA signatures to compare fragments of its nuclear/nucleoid genome with fragments of the nuclear/nucleoid genome of one other organism from the same kingdom. The process was then repeated with the second organism being at increasing degrees of relatedness to the pivot organism.

More precisely, for each such pairwise comparison, the following three-step process was implemented.

Randomly sample 150 kbp nDNA fragments from every chromosome (20 per chromosome, or all fragments if fewer) of the two genomes involved in the comparison. For each such nDNA fragment, construct its corresponding conventional nDNA signature using the process described in Section “Methods”.
Compute pairwise distances for all pairs of conventional nDNA signatures generated in Step 1. The distance used to start with was an approximated information distance (AID), formally defined in Section “Methods” (see also [25, 53]), since it is computationally simple and uses the least amount of sequence information. If separation was not achieved using AID, five other distance measures were used: Structural Dissimilarity Index (DSSIM) [54], Euclidean distance, Pearson correlation distance [55], Manhattan distance [56], and descriptor distance [25].
Use the distance matrix obtained in Step 2 as input to a Multi-Dimensional Scaling (MDS) algorithm to produce a 3D Molecular Distance Map [25]: Each point in the map corresponds to (the conventional nDNA signature of) an nDNA fragment from Step 1, and the geometric distance between every two points corresponds to the distance between the respective conventional nDNA signatures in the distance matrix. Assess, for each Molecular Distance Map, whether or not separation between conventional nDNA signatures of DNA fragments from the pivot organism and those from the other organism was achieved, by using either k-means clustering [57] or by verifying the existence of a separating plane.

Figure 1 illustrates an example of the end result of this three-step process: A three-dimensional Molecular Distance Map that displays the conventional nDNA signatures of the pivot organism of Kingdom Animalia, H. sapiens, plotted together with the conventional nDNA signatures of D. melanogaster.

The results for all kingdoms are presented in Fig. 2 (the first two result columns) and the corresponding 3D Molecular Distance Maps can be found in [58]. For Kingdom Animalia, the approximated information distance succeeded to separate H. sapiens (24 chromosomes, 480 fragments) conventional nDNA signatures from those of other organisms, down to and including from M. murinus (grey mouse lemur, same order but different suborder) and T. syrichta (Phillipine tarsier, same suborder but different infraorder). In the cases marked Y* in Fig. 2, while the accuracy was less than the threshold for separation (85 %), the existence of a separating plane was verified. See discussion in Section “Methods” for details.

The use of conventional nDNA signatures failed to achieve separation for genomes of more closely related species. In particular, it failed to separate conventional nDNA signatures of H. sapiens from those of C. jacchus (common marmoset, same infraorder), P. anubis (Anubis baboon, same parvorder), N. leucogenys (northern white-cheeked gibbon, same superfamily), P. abelii (Sumatran orangutan, same family), G. gorilla (gorilla, same subfamily, and P. troglodytes (chimpanzee, same tribe, see Fig. 3). For those organisms where separation was not achieved with approximated information distance, we performed the comparisons with the other five distances. The results of these multiple computations were that, in all cases where approximated information distance failed to achieve separation, the other distances also failed.

For Kingdom Fungi, the pivot organism is the model organism Saccharomyces cerevisiae (16 chromosomes, 73 fragments), a species of yeast instrumental to winemaking, baking, and brewing. Separation of its conventional nDNA signatures was achieved down to and including separation from C. dubliniensis (same family, different genus). In the case of the comparison with K. pastoris, marked with Y* in Fig. 2, the accuracy score was lower than 85 %: This is an artifact of the shape of the 3D Molecular Distance Map wherein one of the clusters has a trailing set of points that become erroneously separated by k-means from all the rest of the points. Because of this, and since the use of k-means on the 2D Molecular Distance Map of the same dataset resulted in an accuracy score of 100 %, we interpreted this comparison as resulting in separation. The results of the comparison between the conventional nDNA signatures of the pivot organism and those of S. arboricola (same genus, different species), were inconclusive: The use of Euclidean and Pearson distances resulted in separation (both with accuracy of 88.48 %), while the use of the other four distances (DSSIM, Manhattan, descriptor, approximated information distance) did not result in separation.

For Kingdom Plantae, the pivot organism is the model organism Brassica napus (19 chromosomes, 380 DNA fragments), rapeseed, a flowering member of the family Brassicaceae (mustard or cabbage family). Separation of its conventional nDNA signatures was achieved down to and including separation from C. papaya (papaya, same order, different family). For the comparisons with A. thaliana (thale cress, same family, different tribe) and R. sativus (radish, same tribe, different genus), cluster separation was visually observed but not quantitatively confirmed by either k-means or plane separation. The comparison with B. oleracea (wild cabbage, same genus, different species) did not result in separation, with any of the six distances.

For Kingdom Protista, the pivot organism is the model organism Plasmodium falciparum, a protozoan parasite (14 chromosomes, 149 DNA fragments), one of the species of Plasmodium that cause malaria in humans. Separation of its conventional nDNA signatures from those of other organisms from the same kingdom was achieved at all taxonomic levels, down to and including separation from P. vivax (same genus, different species).

For Kingdom Bacteria, the pivot organism is the model organism Escherichia coli (20 genomic DNA fragments), a bacterium commonly found in the lower intestine of warm-blooded organisms. Separation of its conventional nDNA signatures from those of other bacteria was successful down to and including separation from S. enterica (same family, different genus), but failed with all six distances in the comparison with E. fergusonii (same genus, different species).

For Kingdom Archaea, the pivot organism is the model organism Pyrococcus furiosus (12 genomic DNA fragments), an extremophilic species of Archaea. Separation of its conventional nDNA signatures from those of other archaea was successful at all levels, down to and including separation from P. yayanosii (same genus, different species).

The above results indicate that, especially in Kingdom Animalia, conventional nDNA signatures cannot always be used to differentiate nuclear/nucleoid genomic sequences originating from two different genomes. This suggests that conventional nDNA signatures cannot always play the role of a “genomic signature”, particularly when the genomes being compared belong to closely related species.

Composite DNA signatures

To enhance the discriminating power of conventional nDNA signatures, our second objective was to introduce and explore the concept of composite DNA signatures, which combine conventional nuclear/nucleoid DNA signatures with signatures of organellar genomes (mtDNA, cpDNA, or pDNA).

To test the discriminating power of composite DNA signatures, we repeated all previous pairwise comparisons (where sequenced organellar DNA was available), using this time composite DNA signatures. The results are presented in the last two columns of Fig. 2.

For Kingdoms Animalia, Fungi and Protista we used composite DNA signatures combining the conventional nDNA signature of each nuclear/nucleoid genomic fragment with that of the mtDNA of the same organism (when available). Using such composite DNA signatures, differentiation of DNA signatures by organism was successful in all cases, including all cases where the use of conventional nDNA signature previously failed or was inconclusive. See Fig. 3 (H. sapiens vs. P. troglodytes conventional nDNA signatures, no separation) versus Fig. 4 (H. sapiens vs. P. troglodytes composite DNA signatures using nDNA and mtDNA, complete separation).

To test the discriminating power of composite DNA signatures using nDNA, mtDNA and cpDNA, we employed them to perform comparisons for all genome pairs from Kingdom Plantae. Separation was achieved using all of: composite DNA signatures using nDNA and mtDNA, composite DNA signatures using nDNA and cpDNA, and composite DNA signatures using nDNA, mtDNA, and cpDNA. See Fig. 5 for the Molecular Distance Maps illustrating the relationships between these signatures for B. napus and B. oleracea.

For Kingdom Bacteria, the use of composite DNA signatures combining nDNA and pDNA (when available) resulted in separation in all cases.

Overall, the use of composite DNA signatures resulted in separation in all pairwise comparisons in Fig. 2 (where organellar DNA sequencing data was available), including in those where the use of conventional nDNA signature failed or resulted in inconclusive separations.

Assembled DNA signatures

As the third objective of this study, we explored a way to enhance the practical applicability of conventional DNA signatures. Recall that, to produce a recognizable visual pattern that can be reliably used to represent a genome, a conventional DNA signature needs as input a long contiguous (two to several hundred kilobase pairs) DNA fragment. This assumes a high quality and reliability of sequencing and assembly, which are not always available. We propose instead to approximate a conventional signature by an assembled DNA signature, which combines the conventional DNA signatures of many short contigs (e.g., 100 bp) of the given fragment. Note that these contigs need not cover the entire DNA fragment.

In what follows, we denote by |s| the length of the sequence s. Given a DNA fragment s, an assembled DNA signature of s, using r equi-length contigs of length n (subfragments of the sequence s), is defined as the sum of the conventional DNA signatures of all of the r contigs. A particular case of assembled DNA signature is where the fragment s is partitioned into equi-length, consecutive, non-overlapping contigs, that is, s=s ₁ s ₂…s _r s _r+1, and |s _i|=n for 1≤i≤r, with |s _r+1|<n. In this case, we call the assembled signature a fully-assembled DNA signature of the sequence s, using equi-length contigs of length n.

Table 1 ((A) through (C)) presents a comparison between the conventional nDNA signature of a given DNA fragment and its assembled DNA signatures, as well as fully-assembled DNA signatures, for various values of contig length n, and number of contigs r. The DNA fragment used is from H. sapiens, chromosome 21, fragment 20 (from position 2,850,001 to 3,000,000 after removing all Ns in the original sequence), and the distance used is approximated information distance between CGRs. For example, the distance between the conventional nDNA signature and the fully-assembled DNA signature of the same fragment, that uses 1,000 contigs of length 150 bp each, is 0.03 (row 2, column (A)). This value is very small, given that approximated information distance theoretically ranges between 0 and 1. This suggests that, for these parameter values (n=150 and r=1,000), a fully-assembled DNA signature can be an excellent approximation of the conventional DNA signature of the same fragment. This was expected, given that the only information lost in the computation of a fully-assembled DNA signature, when using the approximated information distance, is the information about the k-mers situated at the borders between contigs.

Table 1 (A) through (C) – Distances between the conventional nDNA signature of a fragment and its assembled DNA signatures, for various numbers r of contigs of the same length n: (A) distances to fully-assembled DNA signatures; (A ^′) theoretical upper bounds for (A); (B) distances to assembled DNA signatures; (C) same as (B), when tripling the number of contigs

Full size table

Also as expected, for the same values of n and r, the distance between an assembled DNA signature and the conventional nDNA signature of the same fragment (Table 1, Column (B)) is higher than the one between a fully-assembled DNA signature and the conventional nDNA signature of the same fragment (Table 1, (A)). This indicates that the assembled DNA signature is less performant than the fully-assembled DNA signature as an approximation of a conventional nDNA signature. The reason is that, given a fixed number r of contigs, in the case of an assembled DNA signature the contigs are allowed to overlap and need not cover the entire fragment. This can be compensated by increasing the coverage, that is, the number r of contigs. Table 1, (C) shows that tripling the number of contigs results in significantly smaller differences between assembled DNA signatures and the conventional DNA signature of the same fragment which they were meant to approximate.

The results in Table 1 suggest that assembled DNA signatures have the potential to play the role of “genomic signatures”, and be used directly on raw unassembled next-generation sequencing read data, or in cases where other methods are not directly applicable because high-quality sequencing data is not available. To test this hypothesis, we considered the organism pairs in Fig. 2 for which separation was obtained using conventional nDNA signatures, and attempted to reproduce these successful separations using assembled DNA signatures instead. In addition, we empirically sought to find, in each case, the coverage (amount of sequence data) needed to achieve separation, as a percentage of total fragment length.

To determine the threshold interval where separation between assembled DNA signatures of a given pair of organisms was achieved, when contigs of length n=300 were used, the following process was employed. For various values of t, 0≤t≤1 (representing the fragment coverage, e.g., t=0.5 means that 50 % of the fragment data was used), we attempted to see if separation of assembled DNA signatures from the two organisms was achieved, in the following way.

For each of the 150 kbp fragments s from the two genomes, q random positive integers were picked from the interval 1 to |s|−n+1=(150,000−300+1), where q=⌊t∗|s|/n⌋, that is, the integer part of t∗|s|/n. These q numbers represent the start positions of the q chosen contigs. For each contig start position, a contig of length n=300 was read and used for the assembled DNA signature of the fragment s.

For each value of t, the corresponding 3D Molecular Distance Map of the assembled DNA signatures of the two organisms was then analyzed, by verifying the existence (or absence) of a separating plane.

The results are summarized in Table 2 and can be interpreted as follows. In the comparison between H. sapiens and D. melanogaster the threshold interval is 1 –5 %. The lower limit of this interval is 1 %, and this means that in the computation using the coverage value t=0.01 (implying q=⌊0.01∗150,000/300⌋=5), separation was not achieved. That is, for each of the 150 kbp nDNA fragments available (480 from H. sapiens and 128 from D. melanogaster), when employing assembled nDNA signatures using only 5 contigs per fragment (for a maximum of 1 % of each fragment length, that is, 1,500 bp per fragment), separation was not achieved. The upper limit of the interval is 5 %, and this means that in the computation using the coverage value t=0.05 (implying q=25), separation was achieved. That is, when employing assembled nDNA signatures using 25 contigs per fragment (for a maximum of 5 % of each fragment length, that is, 7,500 bp per fragment), separation was achieved.

Table 2 Assembled nDNA signatures: sequence coverage (amount of DNA fragment information) needed for separation of the assembled nDNA signatures of the pivot organism from assembled nDNA signatures of the comparison organism, for all major kingdoms of life. Separations were confirmed by verifying the existence of separating planes

Full size table

The actual threshold values lie in the intervals listed, and may be subject to the quality of the sequencing. As expected, in general, the thresholds needed for separation increase with the increase in the degree of relatedness of the organisms being compared. This suggests that nDNA sequences from closely related organisms require a higher coverage (that is, a higher amount of information from each sequence) to be separated. The only exception to this trend, in this dataset, were the pairs H. sapiens with M. murinus (gray mouse lemur) requiring 60 –80 % sequence coverage, and H. sapiens and T. syrichta (Philippine tarsier) requiring 20 –40 % sequence coverage. Thus, the (human, lemur) pair required higher sequence coverage to achieve separation than the (human, tarsier) pair, even though the gray mouse lemur belongs to a different primate suborder (Haplorrhini) than the modern human, while the tarsier belongs to the same primate suborder as the modern human (Strepsirrhini), and thus one would expect that more information would be needed to achieve the latter separation. This apparent anomaly may be partly related to the fact that the phylogenetic placement of tarsiers within the order Primates has been controversial for over a century [59]: In [60] tarsiers are placed within Haplorrhini, while according to [20, 61], mitochondrial DNA evidence places tarsiiformes as a sister group to Strepsirrhini.

Table 2 indicates that the amount of DNA fragment information needed to achieve separation, at the same taxonomic level, can differ from one kingdom to another. For example, in Kingdom Animalia, conventional nDNA signatures of organisms from two species of a different species (H. sapiens and P. troglodytes) could not be separated even though we use 100 % of the DNA fragment information. In contrast, in Kingdom Fungi, assembled nDNA signatures from two organisms of a different genus (S cerevisiae and C. dubliniensis) could be separated even when using only 10 % of DNA fragment data. Similarly, in Kingdom Bacteria, assembled nDNA signatures from two organisms of different genus (E. coli and S. enterica) could be separated even when using only 20 % of DNA fragment data. The situation is even more extreme in Kingdom Protista and Kingdom Archaea, where even organisms belonging to the same genus could be separated with very little sequence coverage. Indeed, in Kingdom Protista, assembled nDNA signatures of two organisms of the same genus (P. falciparum and P. vivax) could be separated using only 2 % of DNA fragment data. Similarly, in Kingdom Archaea, assembled nDNA signatures from two organisms of the same genus (P. furiosus and P. yananosii) could also be separated using only 2 % of DNA fragment data. This suggests that some taxonomic categories, such as “genus”, do not necessarily reflect the same degree of structural similarity of genomic sequences uniformly across kingdoms.

Composite-assembled DNA signatures

We now briefly explore the potential of combining the approach of composite DNA signatures with that of assembled DNA signatures. A composite-assembled DNA signature is produced by combining information from the assembled DNA signatures of two (or more) different types of DNA fragments. For example, a composite-assembled signature using nDNA and mtDNA is obtained by combining the assembled nDNA signature of one 150 kbp nDNA fragment, with the assembled mtDNA signature of the mtDNA genome of the same organism.

Figure 6 plots together composite DNA signatures and composite-assembled DNA signatures using nDNA and mtDNA from H. sapiens and P. troglodytes. Note that composite-assembled DNA signatures and composite DNA signatures of fragments (using nDNA and mtDNA), from the same species are closely clustered together. On the other hand, all DNA signatures of H. sapiens are separated from all DNA signatures of P. troglodytes, and the existence of a separating plane was verified. These results suggest that composite-assembled DNA signatures could also be potential candidates for the role of “genomic signature”, as they have in general better discriminating power than conventional nDNA signatures while using scattered and potentially less sequence information.

Conclusions

The first objective of this paper was to conduct a comprehensive analysis, on a dataset totalling 1.45 Gb, of the hypothesis that Chaos Game Representations of nuclear/nucleoid genomic sequences can play the role of “genomic signatures”, that is, that they are genome- and species-specific. Our results suggest that this hypothesis is not always valid, in that nuclear/nucleoid DNA sequences belonging to closely related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii cannot always be separated using conventionally computed CGR signatures.

To address this issue, as a second objective, we propose the use of composite DNA signatures, which combine information from the nuclear/nucleoid genome with that from one or more organellar genomes (mtDNA, cpDNA and/or pDNA). Composite DNA signatures were found, in this study, to result in successful separation of DNA sequences by organism in all cases, including those where conventional nDNA signatures failed.

As a third objective, we propose the use of assembled DNA signatures, which combine information from short contigs (subfragments) of a DNA fragment, rather than using the entire contiguous fragment, to produce its signature. We show that assembled DNA signatures can be successful replacements of conventional DNA signatures, and also that the composite and assembled DNA signature approaches can be used simultaneously.

Mathematically, composite and assembled DNA signatures are both particular cases of a general concept, namely that of an additive DNA signature of a set of DNA sequences (see Section “Methods”). Our results indicate that such additive DNA signatures could be considered as potential candidates for the role of “genomic signatures” at various taxonomic levels, from distant to closely related species, and thus complement other methods for species identification and classification.

Several directions of future research stem from the fact that existing literature indicates that the oligomer composition of nuclear/nucleoid DNA sequences and mitochondrial DNA sequences can be a source of taxonomic information. Such directions include testing the discriminating power of additive DNA signatures in large-scale multi-genome comparisons, and exploring their utility in practical applications such as DNA sequence identification and classification (including directly on raw unassembled NGS read data or when high-quality sequencing data is not available), metagenomics, and synthetic genomes.

Methods

Dataset

The dataset, totalling 1.45 Gb, comprised whole nuclear/nucleoid genomes and organellar genomes of 42 organisms, spanning all major kingdoms of life (see Additional file 1 for the scientific name, NCBI accession number, chromosome number, and number of fragments sampled). In our analysis, for each complete genomic sequence, all letters other than A, C,G, T were ignored, and the resulting DNA sequence was divided into successive, non-overlapping, contiguous fragments, each 150 kbp long (when the last portion was shorter than 150 kbp, it was not included in the analysis). The choice of fragment length, 150 kbp, was due to our choice of CGR image resolution (namely 2⁹×2⁹, that is, k=9), empirical testing, and computational efficiency reasons, see [25].

Subsequently, 20 such 150 kbp fragments were randomly sampled from each chromosome and, for each such fragment, a corresponding conventional nDNA signature was constructed, as described below. (If there were fewer than 20 fragments, all fragments in the chromosome were chosen.) In the cases where the genome assembly of the organism was at the contig/scaffold level, the contigs/supercontigs of the assembly were sorted by length and the first 500 contigs/supercontigs were selected. (If there were fewer than 500 contigs/supercontigs, all were selected.) From each contig/supercontig, only the first 150 kbp fragment was considered.

We note that this method is alignment-free, and that its approach contrasts typical biodiversity and species identification research [62–65] in that it uses randomly selected DNA sequences rather than specific marker genes for identification and classification of species. This approach is somewhat similar to novel approaches in metagenomics, metatranscriptomics, and viromics [66], but there are also substantial differences such as that metatranscriptomics is based on RNA rather than DNA and that it groups sequences based on functionality rather than oligomer composition.

Chaos Game Representation (CGR)

CGR is a method introduced by Jeffrey [1] as a way to visualize the structural composition of a DNA sequence. This method associates an image to each DNA sequence as follows: Starting from a square with corners labelled A, C, G, and T, and the center of the square as the starting point, the image is obtained by successively plotting each nucleotide as the middle point between the current point and the corner labelled by the nucleotide to be plotted. If the generated square image has a size of 2^k×2^k pixels, then every pixel represents a distinct k-mer: A pixel is black if the k-mer it represents occurs in the DNA sequence, otherwise it is white. CGR images of genetic DNA sequences originating from various species show patterns such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns, as shown in Fig. 7.

We used a modification of the original CGR, introduced by Deschavanne [3]: a k-th order FCGR (frequency CGR) of a sequence s, denoted by F C G R _k(s), is a 2^k×2^k matrix that can be constructed by dividing the CGR image of the sequence s into a 2^k×2^k grid, and defining the element a _ij of the matrix F C G R _k(s) as the number of points that are situated in the corresponding grid square.

We now formally define the conventional DNA signature of a sequence s to be the matrix F C G R _k(s), which records the numbers of occurrences of all possible k-mers in the sequence s. Throughout this paper, the parameter k is assumed to be a fixed constant. In particular, similar to [25], in all computational experiments in this paper the value used was k=9.

For computing composite and assembled DNA signatures, we introduce the general concept of additive DNA signature of a set of sequences, formally defined as follows.

Definition 1

The additive DNA signature of a set of sequences S={s ₁,s ₂,…,s _r},r≥1, is defined as

$${FCGR}_{k}(S) = {FCGR}_{k}(s_{1})+ \ldots + {FCGR}_{k}(s_{r}). $$

Note that the notions of conventional DNA signature, composite DNA signature, assembled DNA signature, and fully-assembled DNA signature, are all particular cases of additive DNA signatures, as follows:

The conventional DNA signature of a sequence s is the additive DNA signature of the set {s} consisting of a single sequence s, that is, F C G R _k(s)=F C G R _k({s}).
The composite DNA signature using two DNA sequences s ₁,s ₂, of two different types, is

F C G R _k({s ₁,s ₂})=F C G R _k(s ₁)+F C G R _k(s ₂),
An assembled signature of a sequence s, using r equi-length contigs of length n, is

${FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i = 1}^{r} {FCGR}_{k}(s_{i}),$ where s=α _i s _i β _i,|s _i|=n, for 1≤i≤r.
The fully-assembled DNA signature of a sequence s, using equi-length contigs of length n, is

${FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i=1}^{r} {FCGR}_{k}(s_{i}),$ where r=⌊|s|/n⌋,s=s ₁ s ₂…s _r s _r+1, and |s _i|=n for 1≤i≤r, while |s _r+1|<n.

To compute the fully-assembled DNA signature of a sequence s, using equi-length contigs of length n, one adds the F C G R _k of all the adjacent consecutive contigs of length n that cover s (except possibly a short tail of length less than n), where the first contig starts at the beginning of the sequence. In contrast, to compute an assembled signature of s using equi-length contigs of length n, one has the freedom to set the number of such contigs as an additional parameter r, and then add the F C G R _k of r contigs sampled randomly from the sequence s. Thus, for a given n, a sequence s has only one fully-assembled DNA signature, but many different assembled signatures, each depending on both the choice of parameter r, and the particular sampling of the r sequences (which may overlap or be identical).

For example, if s is the DNA sequence

$$ s=AAAAACCCCCGGGGGTTT, $$

of length 18, and if we consider contigs of length n=5, then the fully-assembled DNA signature of s is unique and is obtained by adding the F C G R _k of the following r=⌊18/5⌋=3 contigs

$$ \{AAAAA,CCCCC,GGGGG\} $$

that cover s (except the discarded remainder TTT).

For the same sequence s and contig length n=5, many diferent assembled DNA signatures can be computed. For example, an assembled DNA signature of s using r=3 equi-length contigs of length n=5 could use contigs {A A A C C, C C C G G, C C C G G}, while another could use contigs {A A C C C, C C C C G, G G T T T}. In addition, other assembled DNA signatures of s with equi-length contigs of length n=5 exist, depending on the parameter r. For example, an assembled DNA signature of s using r=5 equi-length contigs of length n=5 could use the contigs

$$ \{AAAAA, AAACC, CGGGG, GGGGT, GGTTT \}. $$

Approximated Information Distance (AID)

For a finite set X, we denote by |X| the cardinality of X, that is the number of elements in X. Given a set of sequences S={s ₁,s ₂,…,s _n} we denote by M _k(S) the set of all distinct k-mers that occur in all the sequences of S. In the case of a set consisting of a single sequence s, we write M _k(s) to denote M _k({s}).

The approximated information distance between two sequences s and t (introduced in [25] as a slight modification of a distance used in [53]) is defined as:

$$d_{\mathtt{AID}}^{k}(s,t) = \frac{|M_{k}(s)\setminus M_{k}(t)| + |M_{k}(t)\setminus M_{k}(s)|}{|M_{k}(\{s, t\})|}, $$

where for two sets X and Y, X∖Y denotes the set difference between X and Y, that is, the set of elements that belong to X but not to Y.

The distance d AID k(s, t) was used for most of the computations of pairwise distances between conventional DNA signatures in this paper.

The notion of approximated information distance between two sequences can now be extended to that of generalized approximated information distance between two sets of sequences S and T, as:

$$ d_{\mathtt{AID}}^{k}(S,T) = \frac{|M_{k}(S)\setminus M_{k}(T)| + |M_{k}(T)\setminus M_{k}(S)|}{|M_{k}(S\cup T)|}. $$

This generalization of the approximated information distance preserves the original meaning of the concept as the ratio between the number of noncommon k-mers of the two sets S and T and the total number of k-mers that occur in S or in T (or both). This distance was used to compute distances between conventional, composite and assembled DNA signatures in this paper.

The next Proposition leads to a formula for the computation of the generalized approximated information distance, as well as gives a theoretical upper bound for the generalized approximated information distance in the case of fully-assembled DNA signatures. The following auxiliary lemma follows from standard set theory arguments.

Lemma 2

Let s be a sequence and S, T be two finite sets of sequences over the DNA alphabet {A, C,G, T}, and let k≥2 be an integer. The following statements hold true.

1.
If S⊆T then |M _k(S)|≤|M _k(T)| and

|M _k(S∪T)|=|M _k(T)|,
2.
If every sequence in S is a subsequence of a given sequence s, then

|M _k(S)∪M _k(s)|=|M _k(s)|,
3.
The number of distinct k-mers that occur in S but not in T is |M _k(S)∖M _k(T)|=|M _k(S∪T)|−|M _k(T)|,
4.
|M _k(S)|=# F C G R _k(S),

where for a numerical matrix A we denote by #(A) or # A the number of non-zero entries of A.

Proposition 3

Let s be a sequence and let S, T be two sets of sequences. The following statements hold true.

1.
$d_{\mathtt {AID}}^{k}(S,T)= 2 - \frac {|M_{k}(S)|+|M_{k}(T)|}{|M_{k}(S \cup T)|}$
2.
If s=s ₁ s ₂…s _r and each s _i is of length n, n>k,

then

$d_{\mathtt {AID}}^{k}(\{s_{1}s_{2}\ldots s_{r}\}, s)\le \frac {\text {min}\{(r-1)(k-1), |M_{k}(s)|\}}{|M_{k}(s)|}.$
3.
There is a sequence s for which the above relation holds with “=”.

Proof

The first statement follows from Lemma 2.3, by noting that d AID k(S, T) equals

$$\begin{aligned} \frac{\Big{(}|M_{k}(S \cup T)| - |M_{k}(T)| \Big{)} +\Big{(}|M_{k}(S \cup T)| - |M_{k}(S)| \Big{)}}{|M_{k}(S\cup T)|} \end{aligned} $$

which is indeed equal to the required formula.

For the second statement, let S={s ₁,s ₂,…,s _r} and T={s}. By the definition of the generalized information distance, d AID k({s ₁,…,s _r},s) equals a fraction, where the numerator is the sum between the number of distinct k-mers that appear in {s ₁,…,s _r} but not in s, and the number of distinct k-mers that appear in s but not in {s ₁,…,s _r}. The first term of this sum is obviously zero, since s _i are contigs that span the sequence s. Thus, the numerator of this fraction is the second term of the sum, namely the number of distinct k-mers that appear in s=s ₁ s ₂…s _r but not in {s ₁,…,s _r}. We can count these k-mers by noticing that the only k-mers that appear in s but not in {s ₁,…,s _r}, are the ones that span consecutive contigs.

We now note that each joint of two contigs s _i s _i+1 contains at most (k−1) distinct k-mers that span both contigs, and that s contains (r−1) such joints s _i s _i+1. Thus, the total number of k-mers of s, that are in s but not in {s ₁,…,s _r}, is at most (r−1)·(k−1).

Since the denominator of the fraction is, by Lemma 2.2, |M _k(s)∪M _k({s ₁,s ₂,…,s _r})|=|M _k(s)|, we have that

$$ d_{\mathtt{AID}}^{k}(\{s_{1},\ldots, s_{r}\},s)\le \frac{0 + (r-1)(k-1)}{|M_{k}(s)|}. $$

Since the approximated information distance ranges between 0 and 1, the required inequality follows.

For the third statement, an example of a sequence where the upper bound of the distance between the conventional DNA signature of the sequence and the fully-assembled DNA signature of its contigs is reached is the sequence

$$ s=AAAACCCCGGGGTTTT, $$

with k=3 and n=r=4. Then s contains exactly 10 different 3-mers, that is, |M ₃(s)|=10, and (r−1)·(k−1)/|M ₃(s)|=0.6. On the other hand, let s ₁=A A A A, s ₂=C C C C, s ₃=G G G G, s ₄=T T T T. Then we have |M ₃({s ₁,s ₂,s ₃,s ₄})|=4, since only 4 distinct 3-mers, namely AAA, CCC, GGG and TTT can be found in this set, and thus

$$d_{\mathtt{AID}}^{3}(\{s_{1},s_{2},s_{3},s_{4}\},s)=2-\frac{4+10}{10}=0.6, $$

which equals the given upper bound. □

Remark that, by Proposition 3.1, the generalized approximated distance between two sets of sequences S and T can be now computed as

$$ d_{AID}^{k} (S, T) = 2 - \frac{\#{FCGR}_{k}(S) + \# {FCGR}_{k}(T)}{\#({FCGR}_{k}(S) + {FCGR}_{k}(T))}, $$

which is the formula that was used for all generalized approximated information distance calculations in this paper.

Remark also that the upper bound determined in Proposition 3.2 for the generalized approximated information distance, in the case of the comparison between the conventional DNA signature of a sequence and the fully-assembled DNA signature of its r contigs of length n, is the one illustrated in Column (A ^′) of Table 1.

Multi-dimensional scaling and separation assessment

To visualize the interrelationships among DNA signatures originating from a pair of genomes, and thus to visually assess whether separation was achieved, we used Multi-Dimensional Scaling (MDS). MDS is an information visualization technique introduced by Kruskal in [67]. MDS takes as input a distance matrix that contains the pairwise distances among a set of items (here the items are DNA signatures), and outputs a spatial representation of the items in a common Euclidean space. Each item is represented as a point, and the spatial distance between any two points corresponds to the distance between the items in the distance matrix. Objects with a smaller pairwise distance will result in points that are close to each other, while objects with a larger pairwise distance will become points that are far apart.

Concretely, classical MDS, which we use in this paper, receives as input an m×m distance matrix (Δ(i, j))_{1≤i, j≤m} of the pairwise distances between any two items in the set. The output of classical MDS consists of m points in a q-dimensional space whose pairwise spatial (Euclidean) distances are a linear function of the distances between the corresponding items in the input distance matrix. More precisely, MDS will return m points $p_{1},p_{2},\ldots,p_{m}\in \mathbb {R}^{q}$ such that $d(i, j)= ||p_{i}-p_{j}||\thickapprox f(\Delta (i,j))$ for all i, j∈{1,…,m} where d(i, j) is the spatial distance between the points p _i and p _j, and f is a function linear in Δ(i, j). Here, q can be at most (m−1) and the points are recovered from the eigenvalues and eigenvectors of the input m×m distance matrix. If we choose q=3, the result of classical MDS is an approximation of the original (m−1)-dimensional space as a three-dimensional map, such as the Molecular Distance Maps in this paper. Throughout the paper, for consistency, all Molecular Distance Maps have been scaled so that the x-, y-, and z- coordinates always span the interval [−1,1]. The formula used for scaling is $x_{\text {sca}} =2 \cdot \left (\frac {x - x_{\text {min}}}{x_{\text {max}} - x_{\text {min}}}\right) - 1$, where x _min and x _max are the minimum and maximum of the x-coordinates of all the points in the original map, and similarly for y _sca and z _sca. In all Molecular Distance Maps displayed in this paper, the origin of coordinates (0,0,0) is the center of the depicted cube, and the parallel edges of the cube are parallel to one of the x-, y-, and z- axis respectively. The maps have been rotated for optimal visualization and, for each of the axes, the length units are displayed only on one of the four edges of the cube that are parallel to it.

A feature of MDS is that the points p _i are not unique. Indeed, one can translate or rotate a map without affecting the pairwise spatial distances d(i, j)=||p _i−p _j||. In addition, the obtained points in an MDS map may change coordinates when more data items are added to, or removed from, the dataset. This is because MDS aims to preserve only the pairwise spatial distances between points, and this can be achieved even when some of the points change their coordinates. In particular, the (x, y,z)-coordinates of a point representing the DNA signature of a particular DNA fragment of H. Sapiens in Fig. 1 will not be the same as the (x, y,z)-coordinates of the point representing the same DNA fragment in Fig. 3.

For a given Molecular Distance Map, k-means clustering [57] was used to assess whether separation of the DNA sequences by organism was achieved. The reason for this choice were that in all computed Molecular Distance Maps the number of clusters was known a priori, k=2 (not to be confused with k-mers, where k has a different meaning), that the clusters had approximately the same number of points and thus the prior probability of the two clusters was the same, and that in most cases the clusters were somewhat spherical in shape. Moreover, the use of k-means yielded satisfactory results in the majority of cases.

The k-means clustering algorithm proceeds as follows. Suppose S ₁ is the set of points originating from the genome of one of the organisms, and S ₂ is the set of points originating from the second one. k-means assigns labels A and B to all given points, in its attempt to cluster them into two clusters, A and B. The k-means accuracy score is computed by counting how many points were assigned correctly to their cluster, that is,

$$Acc = \frac{max \{ |A_{S_{1}}| + |B_{S_{2}}|, \;\; |B_{S_{1}}| + |A_{S_{2}}| \}}{|S_{1}| +|S_{2}|} $$

where $A_{S_{1}}$ is the set of points in the cluster A that belong to the set S ₁, and $B_{S_{2}}$ is the set of points in the cluster B that belong to the set S ₂ ($B_{S_{1}}$ and $A_{S_{2}}$ are defined similarly). If label A would correspond to species S ₁, and B to species S ₂, the quantity $ |A_{S_{1}}| + |B_{S_{2}}|$ would represent the number of points that have been correctly classified in this Molecular Distance Map, while $ |B_{S_{1}}| + |A_{S_{2}}|$ would represent the number of points that have been incorrectly classified. As a number, Acc is a quantity between 0.5 and 1, with 50 % indicating the worst clustering, and 100 % indicating perfect clustering. For this paper, any Molecular Distance Map with an accuracy greater than 85 % was interpreted as achieving separation of points by species.

In some cases the accuracy was less than 85 % in spite of the fact that separation of clusters could clearly be observed visually. A closer look at those cases revealed that they were generally plots similar to Fig. 4, that is, consisting of two long and thin clusters. In addition, in those plots the clusters were closer to each other than in Fig. 4. In such cases, k-means erroneously labelled the top halves of the two clusters by A, and the two bottom halves by B. For such situations, where the k-means clustering algorithm had a relatively low accuracy score but visual separation was nevertheless observed, we verified the existence of a plane that completely separated the two clusters. That is, if cluster S ₁ had n ₁ points of coordinates $(x_{i_{1}}, x_{i_{2}}, x_{i_{3}})$, where 1≤i≤n ₁, and cluster S ₂ had n ₂ points $(y_{j_{1}}, y_{j_{2}}, y_{j_{3}})$, where 1≤j≤n ₂, then our Mathematica-based code [68] was used to find one (out of possibly infinitely many) solutions to the system of equations with unknowns a, b,c, d:

$$\left\{ \begin{array}{lclr} a\cdot x_{i_{1}} + b\cdot x_{i_{2}} + c\cdot x_{i_{3}} + d & > 0, & i = 1,\ldots, n_{1}\\ a\cdot y_{j_{1}} + b\cdot y_{j_{2}} + c\cdot y_{j_{3}} + d & < 0, & j = 1,\ldots, n_{2} \end{array} \right. $$

that is, it found the equation a x+b y+c z+d=0 of a plane with the property that the points of the cluster S ₁ are situated on one of its sides, while those of cluster S ₂ are situated on the other. For example, in Fig. 6, the equation of a plane computed by this method, that completely separates the points originating from H. sapiens from those originating from P. troglodytes, is x+0.918 y+0.37 z+0.0002=0.

For Molecular Distance Maps with more complex cluster shapes, where k-means accuracy is low and separating planes do not exist, the use of other clustering methods such as density-based spatial clustering of applications with noise (DBSCAN) [69] would have to be explored to see if separation is achieved.

The webtool MoDMap3D, [58], illustrates the 3D Molecular Distance Maps that correspond to each of the comparisons listed in Fig. 2, in the same way the Molecular Distance Map in Fig. 1 illustrates the positive separation result listed in Fig. 2, subfigure Animalia, line 1. The webtool MoDMap3D is, moreover, interactive, and allows for an in-depth exploration of each particular 3D Molecular Distance Map. After first selecting the pair of genomes to be compared, the user can navigate in the three-dimensional space of their DNA signatures: clicking on any point in the map will display information about the DNA fragment represented by that point, such as its NCBI accession number or assembly number, scientific name of the organism it originates from, chromosome or contig/scaffold number, length of the subsequence in bp, and fragment number from the original sequence.

Software

The code for running the experiments [68] was written in Wolfram Mathematica, and was used for the generation of FCGRs, the computation of composite and assembled DNA signatures, the calculation of distance matrices, the creation of the 3D Molecular Distance Maps, and the computation of the separating planes.

Remarks

One observation should be made about the genome assemblies at contig/scaffold level in the dataset. The general intent was for the 150 kbp DNA fragments from a given genome not to be overlapping. This is because sequence overlaps could result in artificially smaller intragenomic distances due to the increase in sequences’ similarities, and this could potentially lead to false positive cluster separations. However, some overlap may have been unavoidable in the cases where only contig/scaffold level data was available. The availability of contig/scaffold data only may thus explain why in Fig. 2 the accuracy scores do not always decrease uniformly, as expected, when one compares the pivot organism with organisms more and more closely related to it.

Another observation should be made about the length of sequences analyzed. When computing composite DNA signatures, the signature of the mitochondrial genome (or entire chloroplast or plasmid) was appended to that of each 150 kbp nDNA fragment. This, in some sense, magnifies the role of the organellar genome in the composite signature. Depending on the application, one can generalize Definition 1 to a weighted additive DNA signature which gives different weights to the different types of DNA that compose it.

We now discuss some limitations of the proposed methods. First, note that assembled DNA signatures as defined here use equi-length contigs. Preliminary computational experiments, illustrated in Table 1, columns (B ^′) and (C ^′), show the results of comparisons between a conventional nDNA signature and variable-length assembled DNA signatures of the same fragment. In those experiments, contig lengths are drawn from a normal distribution N(μ,σ) with mean μ=n (the length of the contig in the corresponding equi-length contig experiment) and variance σ=40. The table shows that the performance of assembled DNA signatures using variable-length contigs is comparable with the performance of those using equi-length contigs. This indicates that both equi-length and variable-length contigs assembled DNA signatures could be reliable approximations of conventional genomic signatures, depending on the application. Additional exploration is needed to confirm this hypothesis.

Second, every computational experiment in this study is a comparison between DNA signatures of genomic sequences belonging to two different organisms. Further analysis is needed to determine if the positive preliminary results on the discriminating power of composite and composite-assembled DNA signatures extend successfully to multi-genome comparisons. A necessary step for such an experiment would be a thorough investigation of intragenomic variations of FCGRs and finding a method to determine, for each genome, a single “representative” FCGR matrix to successfully represent that genome.

Third, we mention a case where separation by organism could not be achieved, even when using composite DNA signatures (nDNA and cpDNA). This is the pairwise comparison between a cultivated pepper Capsicum annuum L, cultivar Zunla-1 (domesticated) and its wild progenitor Capsicum annuum var. glabriusculum, cultivar Chiltepin (wild), see Fig. 8.

Several directions of future research stem from the observation that the function F C G R _k is a quasi-homomorphism from the set of all DNA sequences with the operation of catenation, to the set of 2^k×2^k matrices with the operation of addition, in the sense that for sequences s, t, we have

$${FCGR}_{k}(st) \approx {FCGR}_{k}(s) + {FCGR}_{k}(t). $$

The definition of F C G R _k can be easily modified to make it an exact homomorphism by, e.g, defining a marked catenation of sequences s and t as s·t=s $ t, with $ a new symbol, and constructing F C G R _k so as to not count any k-mer that includes the symbol $. Next steps in the exploration of the mathematical properties of additive DNA signatures include studying the implications of the homomorphic, structure-preserving, nature of F C G R _k, as well as extensions of the concept of additive DNA signature, to, e.g., weighted additive DNA signatures which would give different weights to the different types of DNA that compose it.

Abbreviations

AID, approximated information distance; CGR, chaos game representation; cpDNA, chloroplast DNA; FCGR, frequency CGR; MDS, multi dimensional scaling; mtDNA, mitochondrial DNA; nDNA, nuclear/nucleoid DNA; pDNA, plasmid DNA.

References

Jeffrey HJ. Chaos game representation of gene structure. Nucleic Acids Res. 1990; 18(8):2163–70.
Article CAS PubMed PubMed Central Google Scholar
Jeffrey HJ. Chaos game visualization of sequences. Computers & Graphics. 1992; 16(1):25–33.
Article Google Scholar
Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999; 16(10):1391–9.
Article CAS PubMed Google Scholar
Deschavanne PJ, Giron A, Vilain J, Dufraigne C, Fertil B. Genomic signature is preserved in short DNA fragments. In: Proceedings of the IEEE International Symposium on Bio-Informatics and Biomedical Engineering. IEEE: 2000. p. 161–7.
Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.
Article CAS PubMed Google Scholar
Karlin S, Campbell AM, Mrázek J. Comparative DNA analysis across diverse genomes. Annu Rev Genet. 1998; 32:185–225.
Article CAS PubMed Google Scholar
Vinga S, Almeida JS. Alignment-free sequence comparison - a review. Bioinformatics. 2003; 19(4):513–23.
Article CAS PubMed Google Scholar
Nalbantoglu OU, Sayood K. Computational Genomic Signatures. Synth Lect Biomed Eng. 2011; 6(2):1–129.
Article Google Scholar
Bonham-Carter O, Steele J, Bastola D. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. 2013; 15(6):890–905.
Article PubMed PubMed Central Google Scholar
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform. 2014; 15(3):354–68.
Article PubMed Google Scholar
Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform. 2014; 15(3):343–53.
Article PubMed Google Scholar
Burma PK, Raj A, Deb JK, Brahmachari SK. Genome analysis: A new approach for visualization of sequence organization in genomes. J Biosci. 1992; 17(4):395–411.
Article CAS Google Scholar
Hill KA, Singh SM. The evolution of species-type specificity in the global DNA sequence organization of mitochondrial genomes. Genome. 1997; 40(3):342–56.
Article CAS PubMed Google Scholar
Hao B, Lee HC, Zhang S-Y. Fractals related to long DNA sequences and complete genomes. Chaos Solitons Fractals. 2000; 11(6):825–36.
Article CAS Google Scholar
Dutta C, Das J. Mathematical characterization of chaos game representation. New algorithms for nucleotide sequence analysis. J Mol Biol. 1992; 228(3):715–9.
Article CAS PubMed Google Scholar
Goldman N. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res. 1993; 21(10):2487–491.
Article CAS PubMed PubMed Central Google Scholar
Almeida JS, Carriço JAA, Maretzek A, Noble PA, Fletcher M. Analysis of genomic sequences by Chaos Game Representation. Bioinformatics. 2001; 17(5):429–37.
Article CAS PubMed Google Scholar
Almeida JS. Sequence analysis by iterated maps, a review. Brief Bioinform. 2014; 15(3):369–75.
Article PubMed Google Scholar
Wang Y, Hill K, Singh S, Kari L. The spectrum of genomic signatures: From dinucleotides to chaos game representation. Gene. 2005; 346:173–85.
Article CAS PubMed Google Scholar
Kari L, Hill KA, Sayem AS, Karamichalis R, Bryans N, Davis K, Dattani NS. Mapping the space of genomic signatures. PLoS ONE. 2015; 10(5):e0119815.
Article PubMed PubMed Central Google Scholar
Edwards SV, Fertil B, Giron A, Deschavanne PJ. A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst Biol. 2002; 51(4):599–613.
Article PubMed Google Scholar
Deschavanne P, DuBow MS, Regeard C. The use of genomic signature distance between bacteriophages and their hosts displays evolutionary relationships and phage growth cycle determination. Virol J. 2010; 7:163.
Article PubMed PubMed Central Google Scholar
Pandit A, Sinha S. Using genomic signatures for HIV-1 sub-typing. BMC Bioinformatics. 2010; 11(Suppl 1):26.
Article Google Scholar
Hatje K, Kollmar M. A phylogenetic analysis of the Brassicales clade based on an alignment-free sequence comparison method. Front Plant Sci. 2012; 3(192):11–22.
Google Scholar
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics. 2015; 16(1):246.
Article PubMed PubMed Central Google Scholar
Wu TJ, Huang YH, Li LA. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics. 2005; 21(22):4125–32.
Article CAS PubMed Google Scholar
Höhl M, Rigoutsos I, Ragan MA. Pattern-based phylogenetic distance estimation and tree reconstruction. Evol Bioinforma. 2006; 2:359–75.
Google Scholar
Höhl M, Ragan MA. Is multiple-sequence alignment required for accurate inference of phylogeny?Syst Biol. 2007; 56(2):206–21.
Article PubMed Google Scholar
Dai Q, Yang Y, Wang T. Markov model plus k-word distributions: A synergy that produces novel statistical measures for sequence comparison. Bioinformatics. 2008; 24(20):2296–302.
Article CAS PubMed Google Scholar
Guyon F, Brochier-Armanet C, Guénoche A. Comparison of alignment free string distances for complete genome phylogeny. Adv Data Anal Classif. 2009; 3(2):95–108.
Article Google Scholar
Jayalakshmi R, Natarajan R, Vivekanandan M, Natarajan GS. Alignment-free sequence comparison using N-dimensional similarity space. Curr Computer-Aided Drug Des. 2010; 6(4):290–6.
Article CAS Google Scholar
Haubold B. Alignment-free phylogenetics and population genetics. Brief Bioinform. 2014; 15(3):407–18.
Article PubMed Google Scholar
Fiser A, Tusnády GE, Simon I. Chaos game representation of protein structures. J Mol Graph. 1994; 12(4):302–4.
Article CAS PubMed Google Scholar
Basu S, Pan A, Dutta C, Das J. Chaos game representation of proteins. J Mol Graph Modell. 1997; 15(5):279–89.
Article CAS Google Scholar
Yu ZG, Anh V, Lau KS. Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. J Theor Biol. 2004; 226(3):341–8.
Article CAS PubMed Google Scholar
Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol. 2009; 257(4):618–26.
Article CAS PubMed Google Scholar
Randić M, Novič M, Vikić-Topić D, Plašsić D. Novel numerical and graphical representation of DNA sequences and proteins. SAR QSAR Environ Res. 2006; 17(6):583–95.
Article PubMed Google Scholar
Almeida JS, Vinga S. Biological sequences as pictures: a generic two dimensional solution for iterated maps. BMC Bioinformatics. 2009; 10:100.
Article PubMed PubMed Central Google Scholar
Almeida JS, Vinga S. Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics. 2002; 3:6.
Article PubMed PubMed Central Google Scholar
Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol. 2006; 1:18.
Article PubMed PubMed Central Google Scholar
Fu W, Wang Y, Lu D. Multifractal analysis of genomic sequences CGR images. In: Proceedings of the 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. vol. 5. IEEE: 2005. p. 4783–786.
Fu W, Wang Y, Lu D. Multifractal analysis of genomes sequences’ CGR graph. J Biomed Eng. 2007; 24(3):522–5.
CAS Google Scholar
Vélez PE, Garreta LE, Martínez E, Díaz N, Amador S, Tischer I, Gutiérrez JM, Moreno PA. The Caenorhabditis elegans genome: A multifractal analysis. Genet Mol Res. 2010; 9(2):949–65.
Article PubMed Google Scholar
Moreno PA, Vélez PE, Martínez E, Garreta LE, Díaz N, Amador S, Tischer I, Gutiérrez JM, Naik AK, Tobar F, García F. The human genome: a multifractal analysis. BMC Genomics. 2011; 12(1):506.
Article CAS PubMed PubMed Central Google Scholar
Pandit A, Dasanna AK, Sinha S. Multifractal analysis of HIV-1 genomes. Mol Phylogenet Evol. 2012; 62(2):756–63.
Article CAS PubMed Google Scholar
Pal M, Satisha B, Srinivas K, Madhusudana Rao P, Manimaran P. Multifractal detrended cross-correlation analysis of coding and non-coding DNA sequences through chaos-game representation. Physica A: Stat Mech Appl. 2015; 436:596–603.
Article CAS Google Scholar
Oliver JL, Bernaola-Galván P, Guerrero-García J, Román-Roldán R. Entropic profiles of DNA sequences through chaos-game-derived images. J Theor Biol. 1993; 160(4):457–70.
Article CAS PubMed Google Scholar
Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol. 2004; 231(3):377–88.
Article CAS PubMed Google Scholar
Vinga S, Almeida JS. Local Rényi entropic profiles of DNA sequences. BMC Bioinformatics. 2007; 8:393.
Article PubMed PubMed Central Google Scholar
Joseph J, Sasikumar R. Chaos game representation for comparison of whole genomes. BMC Bioinformatics. 2006; 7:243.
Article PubMed PubMed Central Google Scholar
Tanchotsrinon W, Lursinsap C, Poovorawan Y. A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition. BMC Bioinformatics. 2015;16(1).
Campbell AM, Mrázek J, Karlin S. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc Natl Acad Sci U S A. 1999; 96(16):9184–9.
Article CAS PubMed PubMed Central Google Scholar
Li M, Chen X, Li X, Ma B, Vitanyi PMB. The similarity metric. Inf Theory IEEE Trans. 2004; 50(12):3250–264.
Article Google Scholar
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process. 2004; 13(4):600–12.
Article PubMed Google Scholar
Iversen GR, Gergen M, Gergen MM. Statistics: The Conceptual Approach. Berlin Heidelberg: Springer; 1997.
Book Google Scholar
Krause EF. Taxicab Geometry: An Adventure in Non-Euclidean geometry. Mineola, New York: Courier Dover Publications; 2012.
Google Scholar
Lloyd S. Least squares quantization in pcm. IEEE Trans Inf Theory. 1982; 28(2):129–37.
Article Google Scholar
Karamichalis R. Molecular Distance Map Interactive Webtool. 2015. https://github.com/rallis/MoDMap3D. Accessed 27 Jul 2016.
Jameson NM, Hou ZC, Sterner KN, Weckle A, Goodman M, Steiper ME, Wildman DE. Genomic data reject the hypothesis of a prosimian primate clade. J Human Evol. 2011; 61(3):295–305.
Article Google Scholar
Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MAM, Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MPC, Silva A, O’Brien SJ, Pecon-Slattery J. A molecular phylogeny of living primates. PLoS Genet. 2011; 7(3):1001342.
Article Google Scholar
Chatterjee H, Ho S, Barnes I, Groves C. Estimating the phylogeny and divergence times of primates using a supermatrix approach. BMC Evol Biol. 2009; 9(1):259.
Article PubMed PubMed Central Google Scholar
Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010; 11(5):473–83.
Article CAS PubMed PubMed Central Google Scholar
Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: Current challenges and future perspectives. PLoS ONE. 2011; 6(3):18093.
Article Google Scholar
Grossmann L, Jensen M, Heider D, Jost S, Glücksman E, Hartikainen H, Mahamdallie SS, Gardner M, Hoffmann D, Bass D, et al. Protistan community analysis: key findings of a large-scale molecular sampling. ISME J. Springer Nature; 2016.
Lange A, Jost S, Heider D, Bock C, Budeus B, Schilling E, Strittmatter A, Boenigk J, Hoffmann D. Ampliconduo: A split-sample filtering protocol for high-throughput amplicon sequencing of microbial communities. PLoS ONE. 2015; 10(11):0141590.
Article Google Scholar
Bikel S, Valdez-Lara A, Cornejo-Granados F, Rico K, Canizales-Quinteros S, Soberón X, Del Pozo-Yauner L, Ochoa-Leyva A. Combining metagenomics, metatranscriptomics and viromics to explore novel microbial interactions: towards a systems-level understanding of human microbiome. Comput Struct Biotechnol J. 2015; 13:390–401.
Article CAS PubMed PubMed Central Google Scholar
Kruskal JB. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964; 29(1):1–27.
Article Google Scholar
Karamichalis R. Source code for computing FCGR matrices, distance matrices, MultiDimensional Scaling and separation planes. https://github.com/rallis/GenomicSignatures. Accessed 27 Jul 2016.
Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Conference on Knowledge Discovery and Data Mining; vol. 96. AAAI Press: 1996. p. 226–31.

Download references

Acknowledgements

We thank Katheen Hill (Biology, University of Western Ontario) for valuable comments and suggestions, Genlou Sun (Biology, St.Mary’s University) for general molecular biology expertize, and Stephen M. Watt (University of Waterloo) for useful discussions.

Funding

The research presented in this paper was supported by the Natural Sciences and Engineering Research Council of Canada (Grant No. R2824A01 to L.K., Grant No. 220259 to S.K., and Undergraduate Student Research Award No. 480936 to S.S.-R.). The funding bodies had no role in the design of the study, the collection, analysis and interpretation of data, and in writing the manuscript.

Availability of data and materials

The source code for computing FCGR matrices, distance matrices, MultiDimensional Scaling and separation planes can be found in [68]. The NCBI accession numbers of all DNA sequences involved in this study can be found in the Additional file 1.

Authors’ contributions

The author order in the title is alphabetical. RK data collection; data analysis, methodology and result interpretation; manuscript tables and figures, manuscript editing; software design and implementation. LK data analysis, methodology and result interpretation; manuscript draft; manuscript editing. S.Kon data analysis, methodology and result interpretation; manuscript editing. S. Kop data analysis, methodology, result interpretation; S. Solis-Reyes manuscript draft (part of Section Background); data collection and analysis (plant experiments); software performance enhancements; language editing. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Author information

Authors and Affiliations

School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
Lila Kari
Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
Rallis Karamichalis, Lila Kari, Steffen Kopecki & Stephen Solis-Reyes
Department of Mathematics and Computing Science, Saint Mary’s University, Halifax NS, Canada
Stavros Konstantinidis & Steffen Kopecki

Authors

Rallis Karamichalis
View author publications
You can also search for this author in PubMed Google Scholar
Lila Kari
View author publications
You can also search for this author in PubMed Google Scholar
Stavros Konstantinidis
View author publications
You can also search for this author in PubMed Google Scholar
Steffen Kopecki
View author publications
You can also search for this author in PubMed Google Scholar
Stephen Solis-Reyes
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lila Kari.

Additional file

Additional file 1

The 42 genomes analyzed in the Results section, and the two genomes exemplified in the Remarks subsection: Scientific name, number of chromosomes, NCBI accession number. (∗P.patens genome from JGI Phytozome). (PDF 127 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Karamichalis, R., Kari, L., Konstantinidis, S. et al. Additive methods for genomic signatures. BMC Bioinformatics 17, 313 (2016). https://doi.org/10.1186/s12859-016-1157-8

Download citation

Received: 13 May 2016
Accepted: 19 July 2016
Published: 22 August 2016
DOI: https://doi.org/10.1186/s12859-016-1157-8

Additive methods for genomic signatures

Abstract

Background

Results

Conclusions

Background

Results

Composite DNA signatures

Assembled DNA signatures

Composite-assembled DNA signatures

Conclusions

Methods

Dataset

Chaos Game Representation (CGR)

Definition 1

Approximated Information Distance (AID)

Lemma 2

Proposition 3

Proof

Multi-dimensional scaling and separation assessment

Software

Remarks

Abbreviations

References

Acknowledgements

Funding

Availability of data and materials

Authors’ contributions

Competing interests

Consent for publication

Ethics approval and consent to participate

Author information

Authors and Affiliations

Corresponding author

Additional file

Additional file 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us