Dataset
The dataset, totalling 1.45 Gb, comprised whole nuclear/nucleoid genomes and organellar genomes of 42 organisms, spanning all major kingdoms of life (see Additional file 1 for the scientific name, NCBI accession number, chromosome number, and number of fragments sampled). In our analysis, for each complete genomic sequence, all letters other than A, C,G, T were ignored, and the resulting DNA sequence was divided into successive, non-overlapping, contiguous fragments, each 150 kbp long (when the last portion was shorter than 150 kbp, it was not included in the analysis). The choice of fragment length, 150 kbp, was due to our choice of CGR image resolution (namely 29×29, that is, k=9), empirical testing, and computational efficiency reasons, see [25].
Subsequently, 20 such 150 kbp fragments were randomly sampled from each chromosome and, for each such fragment, a corresponding conventional nDNA signature was constructed, as described below. (If there were fewer than 20 fragments, all fragments in the chromosome were chosen.) In the cases where the genome assembly of the organism was at the contig/scaffold level, the contigs/supercontigs of the assembly were sorted by length and the first 500 contigs/supercontigs were selected. (If there were fewer than 500 contigs/supercontigs, all were selected.) From each contig/supercontig, only the first 150 kbp fragment was considered.
We note that this method is alignment-free, and that its approach contrasts typical biodiversity and species identification research [62–65] in that it uses randomly selected DNA sequences rather than specific marker genes for identification and classification of species. This approach is somewhat similar to novel approaches in metagenomics, metatranscriptomics, and viromics [66], but there are also substantial differences such as that metatranscriptomics is based on RNA rather than DNA and that it groups sequences based on functionality rather than oligomer composition.
Chaos Game Representation (CGR)
CGR is a method introduced by Jeffrey [1] as a way to visualize the structural composition of a DNA sequence. This method associates an image to each DNA sequence as follows: Starting from a square with corners labelled A, C, G, and T, and the center of the square as the starting point, the image is obtained by successively plotting each nucleotide as the middle point between the current point and the corner labelled by the nucleotide to be plotted. If the generated square image has a size of 2k×2k pixels, then every pixel represents a distinct k-mer: A pixel is black if the k-mer it represents occurs in the DNA sequence, otherwise it is white. CGR images of genetic DNA sequences originating from various species show patterns such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns, as shown in Fig. 7.
We used a modification of the original CGR, introduced by Deschavanne [3]: a k-th order FCGR (frequency CGR) of a sequence s, denoted by F
C
G
R
k
(s), is a 2k×2k matrix that can be constructed by dividing the CGR image of the sequence s into a 2k×2k grid, and defining the element a
ij
of the matrix F
C
G
R
k
(s) as the number of points that are situated in the corresponding grid square.
We now formally define the conventional DNA signature of a sequence s to be the matrix F
C
G
R
k
(s), which records the numbers of occurrences of all possible k-mers in the sequence s. Throughout this paper, the parameter k is assumed to be a fixed constant. In particular, similar to [25], in all computational experiments in this paper the value used was k=9.
For computing composite and assembled DNA signatures, we introduce the general concept of additive DNA signature of a set of sequences, formally defined as follows.
Definition 1
The additive DNA signature of a set of sequences S={s
1,s
2,…,s
r
},r≥1, is defined as
$${FCGR}_{k}(S) = {FCGR}_{k}(s_{1})+ \ldots + {FCGR}_{k}(s_{r}). $$
Note that the notions of conventional DNA signature, composite DNA signature, assembled DNA signature, and fully-assembled DNA signature, are all particular cases of additive DNA signatures, as follows:
-
The conventional DNA signature of a sequence s is the additive DNA signature of the set {s} consisting of a single sequence s, that is, F
C
G
R
k
(s)=F
C
G
R
k
({s}).
-
The composite DNA signature using two DNA sequences s
1,s
2, of two different types, is
F
C
G
R
k
({s
1,s
2})=F
C
G
R
k
(s
1)+F
C
G
R
k
(s
2),
-
An assembled signature of a sequence s, using r equi-length contigs of length n, is
\({FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i = 1}^{r} {FCGR}_{k}(s_{i}),\) where s=α
i
s
i
β
i
,|s
i
|=n, for 1≤i≤r.
-
The fully-assembled DNA signature of a sequence s, using equi-length contigs of length n, is
\({FCGR}_{k}(\{s_{1}, s_{2}, \ldots, s_{r}\}) = \sum _{i=1}^{r} {FCGR}_{k}(s_{i}),\) where r=⌊|s|/n⌋,s=s
1
s
2…s
r
s
r+1, and |s
i
|=n for 1≤i≤r, while |s
r+1|<n.
To compute the fully-assembled DNA signature of a sequence s, using equi-length contigs of length n, one adds the F
C
G
R
k
of all the adjacent consecutive contigs of length n that cover s (except possibly a short tail of length less than n), where the first contig starts at the beginning of the sequence. In contrast, to compute an assembled signature of s using equi-length contigs of length n, one has the freedom to set the number of such contigs as an additional parameter r, and then add the F
C
G
R
k
of r contigs sampled randomly from the sequence s. Thus, for a given n, a sequence s has only one fully-assembled DNA signature, but many different assembled signatures, each depending on both the choice of parameter r, and the particular sampling of the r sequences (which may overlap or be identical).
For example, if s is the DNA sequence
$$ s=AAAAACCCCCGGGGGTTT, $$
of length 18, and if we consider contigs of length n=5, then the fully-assembled DNA signature of s is unique and is obtained by adding the F
C
G
R
k
of the following r=⌊18/5⌋=3 contigs
$$ \{AAAAA,CCCCC,GGGGG\} $$
that cover s (except the discarded remainder TTT).
For the same sequence s and contig length n=5, many diferent assembled DNA signatures can be computed. For example, an assembled DNA signature of s using r=3 equi-length contigs of length n=5 could use contigs {A
A
A
C
C, C
C
C
G
G, C
C
C
G
G}, while another could use contigs {A
A
C
C
C, C
C
C
C
G, G
G
T
T
T}. In addition, other assembled DNA signatures of s with equi-length contigs of length n=5 exist, depending on the parameter r. For example, an assembled DNA signature of s using r=5 equi-length contigs of length n=5 could use the contigs
$$ \{AAAAA, AAACC, CGGGG, GGGGT, GGTTT \}. $$
Approximated Information Distance (AID)
For a finite set X, we denote by |X| the cardinality of X, that is the number of elements in X. Given a set of sequences S={s
1,s
2,…,s
n
} we denote by M
k
(S) the set of all distinct k-mers that occur in all the sequences of S. In the case of a set consisting of a single sequence s, we write M
k
(s) to denote M
k
({s}).
The approximated information distance between two sequences s and t (introduced in [25] as a slight modification of a distance used in [53]) is defined as:
$$d_{\mathtt{AID}}^{k}(s,t) = \frac{|M_{k}(s)\setminus M_{k}(t)| + |M_{k}(t)\setminus M_{k}(s)|}{|M_{k}(\{s, t\})|}, $$
where for two sets X and Y, X∖Y denotes the set difference between X and Y, that is, the set of elements that belong to X but not to Y.
The distance d
AID
k(s, t) was used for most of the computations of pairwise distances between conventional DNA signatures in this paper.
The notion of approximated information distance between two sequences can now be extended to that of generalized approximated information distance between two sets of sequences S and T, as:
$$ d_{\mathtt{AID}}^{k}(S,T) = \frac{|M_{k}(S)\setminus M_{k}(T)| + |M_{k}(T)\setminus M_{k}(S)|}{|M_{k}(S\cup T)|}. $$
This generalization of the approximated information distance preserves the original meaning of the concept as the ratio between the number of noncommon k-mers of the two sets S and T and the total number of k-mers that occur in S or in T (or both). This distance was used to compute distances between conventional, composite and assembled DNA signatures in this paper.
The next Proposition leads to a formula for the computation of the generalized approximated information distance, as well as gives a theoretical upper bound for the generalized approximated information distance in the case of fully-assembled DNA signatures. The following auxiliary lemma follows from standard set theory arguments.
Lemma 2
Let s be a sequence and S, T be two finite sets of sequences over the DNA alphabet {A, C,G, T}, and let k≥2 be an integer. The following statements hold true.
-
1.
If S⊆T then |M
k
(S)|≤|M
k
(T)| and
|M
k
(S∪T)|=|M
k
(T)|,
-
2.
If every sequence in S is a subsequence of a given sequence s, then
|M
k
(S)∪M
k
(s)|=|M
k
(s)|,
-
3.
The number of distinct k-mers that occur in S but not in T is |M
k
(S)∖M
k
(T)|=|M
k
(S∪T)|−|M
k
(T)|,
-
4.
|M
k
(S)|=#
F
C
G
R
k
(S),
where for a numerical matrix A we denote by #(A) or #
A the number of non-zero entries of A.
Proposition 3
Let s be a sequence and let S, T be two sets of sequences. The following statements hold true.
-
1.
\(d_{\mathtt {AID}}^{k}(S,T)= 2 - \frac {|M_{k}(S)|+|M_{k}(T)|}{|M_{k}(S \cup T)|}\)
-
2.
If s=s
1
s
2…s
r
and each s
i
is of length n, n>k,
then
\(d_{\mathtt {AID}}^{k}(\{s_{1}s_{2}\ldots s_{r}\}, s)\le \frac {\text {min}\{(r-1)(k-1), |M_{k}(s)|\}}{|M_{k}(s)|}.\)
-
3.
There is a sequence s for which the above relation holds with “=”.
Proof
The first statement follows from Lemma 2.3, by noting that d
AID
k(S, T) equals
$$\begin{aligned} \frac{\Big{(}|M_{k}(S \cup T)| - |M_{k}(T)| \Big{)} +\Big{(}|M_{k}(S \cup T)| - |M_{k}(S)| \Big{)}}{|M_{k}(S\cup T)|} \end{aligned} $$
which is indeed equal to the required formula.
For the second statement, let S={s
1,s
2,…,s
r
} and T={s}. By the definition of the generalized information distance, d
AID
k({s
1,…,s
r
},s) equals a fraction, where the numerator is the sum between the number of distinct k-mers that appear in {s
1,…,s
r
} but not in s, and the number of distinct k-mers that appear in s but not in {s
1,…,s
r
}. The first term of this sum is obviously zero, since s
i
are contigs that span the sequence s. Thus, the numerator of this fraction is the second term of the sum, namely the number of distinct k-mers that appear in s=s
1
s
2…s
r
but not in {s
1,…,s
r
}. We can count these k-mers by noticing that the only k-mers that appear in s but not in {s
1,…,s
r
}, are the ones that span consecutive contigs.
We now note that each joint of two contigs s
i
s
i+1 contains at most (k−1) distinct k-mers that span both contigs, and that s contains (r−1) such joints s
i
s
i+1. Thus, the total number of k-mers of s, that are in s but not in {s
1,…,s
r
}, is at most (r−1)·(k−1).
Since the denominator of the fraction is, by Lemma 2.2, |M
k
(s)∪M
k
({s
1,s
2,…,s
r
})|=|M
k
(s)|, we have that
$$ d_{\mathtt{AID}}^{k}(\{s_{1},\ldots, s_{r}\},s)\le \frac{0 + (r-1)(k-1)}{|M_{k}(s)|}. $$
Since the approximated information distance ranges between 0 and 1, the required inequality follows.
For the third statement, an example of a sequence where the upper bound of the distance between the conventional DNA signature of the sequence and the fully-assembled DNA signature of its contigs is reached is the sequence
$$ s=AAAACCCCGGGGTTTT, $$
with k=3 and n=r=4. Then s contains exactly 10 different 3-mers, that is, |M
3(s)|=10, and (r−1)·(k−1)/|M
3(s)|=0.6. On the other hand, let s
1=A
A
A
A, s
2=C
C
C
C, s
3=G
G
G
G, s
4=T
T
T
T. Then we have |M
3({s
1,s
2,s
3,s
4})|=4, since only 4 distinct 3-mers, namely AAA, CCC, GGG and TTT can be found in this set, and thus
$$d_{\mathtt{AID}}^{3}(\{s_{1},s_{2},s_{3},s_{4}\},s)=2-\frac{4+10}{10}=0.6, $$
which equals the given upper bound. □
Remark that, by Proposition 3.1, the generalized approximated distance between two sets of sequences S and T can be now computed as
$$ d_{AID}^{k} (S, T) = 2 - \frac{\#{FCGR}_{k}(S) + \# {FCGR}_{k}(T)}{\#({FCGR}_{k}(S) + {FCGR}_{k}(T))}, $$
which is the formula that was used for all generalized approximated information distance calculations in this paper.
Remark also that the upper bound determined in Proposition 3.2 for the generalized approximated information distance, in the case of the comparison between the conventional DNA signature of a sequence and the fully-assembled DNA signature of its r contigs of length n, is the one illustrated in Column (A
′) of Table 1.
Multi-dimensional scaling and separation assessment
To visualize the interrelationships among DNA signatures originating from a pair of genomes, and thus to visually assess whether separation was achieved, we used Multi-Dimensional Scaling (MDS). MDS is an information visualization technique introduced by Kruskal in [67]. MDS takes as input a distance matrix that contains the pairwise distances among a set of items (here the items are DNA signatures), and outputs a spatial representation of the items in a common Euclidean space. Each item is represented as a point, and the spatial distance between any two points corresponds to the distance between the items in the distance matrix. Objects with a smaller pairwise distance will result in points that are close to each other, while objects with a larger pairwise distance will become points that are far apart.
Concretely, classical MDS, which we use in this paper, receives as input an m×m distance matrix (Δ(i, j))1≤i, j≤m
of the pairwise distances between any two items in the set. The output of classical MDS consists of m points in a q-dimensional space whose pairwise spatial (Euclidean) distances are a linear function of the distances between the corresponding items in the input distance matrix. More precisely, MDS will return m points \(p_{1},p_{2},\ldots,p_{m}\in \mathbb {R}^{q}\) such that \(d(i, j)= ||p_{i}-p_{j}||\thickapprox f(\Delta (i,j))\) for all i, j∈{1,…,m} where d(i, j) is the spatial distance between the points p
i
and p
j
, and f is a function linear in Δ(i, j). Here, q can be at most (m−1) and the points are recovered from the eigenvalues and eigenvectors of the input m×m distance matrix. If we choose q=3, the result of classical MDS is an approximation of the original (m−1)-dimensional space as a three-dimensional map, such as the Molecular Distance Maps in this paper. Throughout the paper, for consistency, all Molecular Distance Maps have been scaled so that the x-, y-, and z- coordinates always span the interval [−1,1]. The formula used for scaling is \(x_{\text {sca}} =2 \cdot \left (\frac {x - x_{\text {min}}}{x_{\text {max}} - x_{\text {min}}}\right) - 1\), where x
min and x
max are the minimum and maximum of the x-coordinates of all the points in the original map, and similarly for y
sca and z
sca. In all Molecular Distance Maps displayed in this paper, the origin of coordinates (0,0,0) is the center of the depicted cube, and the parallel edges of the cube are parallel to one of the x-, y-, and z- axis respectively. The maps have been rotated for optimal visualization and, for each of the axes, the length units are displayed only on one of the four edges of the cube that are parallel to it.
A feature of MDS is that the points p
i
are not unique. Indeed, one can translate or rotate a map without affecting the pairwise spatial distances d(i, j)=||p
i
−p
j
||. In addition, the obtained points in an MDS map may change coordinates when more data items are added to, or removed from, the dataset. This is because MDS aims to preserve only the pairwise spatial distances between points, and this can be achieved even when some of the points change their coordinates. In particular, the (x, y,z)-coordinates of a point representing the DNA signature of a particular DNA fragment of H. Sapiens in Fig. 1 will not be the same as the (x, y,z)-coordinates of the point representing the same DNA fragment in Fig. 3.
For a given Molecular Distance Map, k-means clustering [57] was used to assess whether separation of the DNA sequences by organism was achieved. The reason for this choice were that in all computed Molecular Distance Maps the number of clusters was known a priori, k=2 (not to be confused with k-mers, where k has a different meaning), that the clusters had approximately the same number of points and thus the prior probability of the two clusters was the same, and that in most cases the clusters were somewhat spherical in shape. Moreover, the use of k-means yielded satisfactory results in the majority of cases.
The k-means clustering algorithm proceeds as follows. Suppose S
1 is the set of points originating from the genome of one of the organisms, and S
2 is the set of points originating from the second one. k-means assigns labels A and B to all given points, in its attempt to cluster them into two clusters, A and B. The k-means accuracy score is computed by counting how many points were assigned correctly to their cluster, that is,
$$Acc = \frac{max \{ |A_{S_{1}}| + |B_{S_{2}}|, \;\; |B_{S_{1}}| + |A_{S_{2}}| \}}{|S_{1}| +|S_{2}|} $$
where \(A_{S_{1}}\) is the set of points in the cluster A that belong to the set S
1, and \(B_{S_{2}}\) is the set of points in the cluster B that belong to the set S
2 (\(B_{S_{1}}\) and \(A_{S_{2}}\) are defined similarly). If label A would correspond to species S
1, and B to species S
2, the quantity \( |A_{S_{1}}| + |B_{S_{2}}|\) would represent the number of points that have been correctly classified in this Molecular Distance Map, while \( |B_{S_{1}}| + |A_{S_{2}}|\) would represent the number of points that have been incorrectly classified. As a number, Acc is a quantity between 0.5 and 1, with 50 % indicating the worst clustering, and 100 % indicating perfect clustering. For this paper, any Molecular Distance Map with an accuracy greater than 85 % was interpreted as achieving separation of points by species.
In some cases the accuracy was less than 85 % in spite of the fact that separation of clusters could clearly be observed visually. A closer look at those cases revealed that they were generally plots similar to Fig. 4, that is, consisting of two long and thin clusters. In addition, in those plots the clusters were closer to each other than in Fig. 4. In such cases, k-means erroneously labelled the top halves of the two clusters by A, and the two bottom halves by B. For such situations, where the k-means clustering algorithm had a relatively low accuracy score but visual separation was nevertheless observed, we verified the existence of a plane that completely separated the two clusters. That is, if cluster S
1 had n
1 points of coordinates \((x_{i_{1}}, x_{i_{2}}, x_{i_{3}})\), where 1≤i≤n
1, and cluster S
2 had n
2 points \((y_{j_{1}}, y_{j_{2}}, y_{j_{3}})\), where 1≤j≤n
2, then our Mathematica-based code [68] was used to find one (out of possibly infinitely many) solutions to the system of equations with unknowns a, b,c, d:
$$\left\{ \begin{array}{lclr} a\cdot x_{i_{1}} + b\cdot x_{i_{2}} + c\cdot x_{i_{3}} + d & > 0, & i = 1,\ldots, n_{1}\\ a\cdot y_{j_{1}} + b\cdot y_{j_{2}} + c\cdot y_{j_{3}} + d & < 0, & j = 1,\ldots, n_{2} \end{array} \right. $$
that is, it found the equation a
x+b
y+c
z+d=0 of a plane with the property that the points of the cluster S
1 are situated on one of its sides, while those of cluster S
2 are situated on the other. For example, in Fig. 6, the equation of a plane computed by this method, that completely separates the points originating from H. sapiens from those originating from P. troglodytes, is x+0.918 y+0.37 z+0.0002=0.
For Molecular Distance Maps with more complex cluster shapes, where k-means accuracy is low and separating planes do not exist, the use of other clustering methods such as density-based spatial clustering of applications with noise (DBSCAN) [69] would have to be explored to see if separation is achieved.
The webtool MoDMap3D, [58], illustrates the 3D Molecular Distance Maps that correspond to each of the comparisons listed in Fig. 2, in the same way the Molecular Distance Map in Fig. 1 illustrates the positive separation result listed in Fig. 2, subfigure Animalia, line 1. The webtool MoDMap3D is, moreover, interactive, and allows for an in-depth exploration of each particular 3D Molecular Distance Map. After first selecting the pair of genomes to be compared, the user can navigate in the three-dimensional space of their DNA signatures: clicking on any point in the map will display information about the DNA fragment represented by that point, such as its NCBI accession number or assembly number, scientific name of the organism it originates from, chromosome or contig/scaffold number, length of the subsequence in bp, and fragment number from the original sequence.
Software
The code for running the experiments [68] was written in Wolfram Mathematica, and was used for the generation of FCGRs, the computation of composite and assembled DNA signatures, the calculation of distance matrices, the creation of the 3D Molecular Distance Maps, and the computation of the separating planes.
Remarks
One observation should be made about the genome assemblies at contig/scaffold level in the dataset. The general intent was for the 150 kbp DNA fragments from a given genome not to be overlapping. This is because sequence overlaps could result in artificially smaller intragenomic distances due to the increase in sequences’ similarities, and this could potentially lead to false positive cluster separations. However, some overlap may have been unavoidable in the cases where only contig/scaffold level data was available. The availability of contig/scaffold data only may thus explain why in Fig. 2 the accuracy scores do not always decrease uniformly, as expected, when one compares the pivot organism with organisms more and more closely related to it.
Another observation should be made about the length of sequences analyzed. When computing composite DNA signatures, the signature of the mitochondrial genome (or entire chloroplast or plasmid) was appended to that of each 150 kbp nDNA fragment. This, in some sense, magnifies the role of the organellar genome in the composite signature. Depending on the application, one can generalize Definition 1 to a weighted additive DNA signature which gives different weights to the different types of DNA that compose it.
We now discuss some limitations of the proposed methods. First, note that assembled DNA signatures as defined here use equi-length contigs. Preliminary computational experiments, illustrated in Table 1, columns (B
′) and (C
′), show the results of comparisons between a conventional nDNA signature and variable-length assembled DNA signatures of the same fragment. In those experiments, contig lengths are drawn from a normal distribution N(μ,σ) with mean μ=n (the length of the contig in the corresponding equi-length contig experiment) and variance σ=40. The table shows that the performance of assembled DNA signatures using variable-length contigs is comparable with the performance of those using equi-length contigs. This indicates that both equi-length and variable-length contigs assembled DNA signatures could be reliable approximations of conventional genomic signatures, depending on the application. Additional exploration is needed to confirm this hypothesis.
Second, every computational experiment in this study is a comparison between DNA signatures of genomic sequences belonging to two different organisms. Further analysis is needed to determine if the positive preliminary results on the discriminating power of composite and composite-assembled DNA signatures extend successfully to multi-genome comparisons. A necessary step for such an experiment would be a thorough investigation of intragenomic variations of FCGRs and finding a method to determine, for each genome, a single “representative” FCGR matrix to successfully represent that genome.
Third, we mention a case where separation by organism could not be achieved, even when using composite DNA signatures (nDNA and cpDNA). This is the pairwise comparison between a cultivated pepper Capsicum annuum L, cultivar Zunla-1 (domesticated) and its wild progenitor Capsicum annuum var. glabriusculum, cultivar Chiltepin (wild), see Fig. 8.
Several directions of future research stem from the observation that the function F
C
G
R
k
is a quasi-homomorphism from the set of all DNA sequences with the operation of catenation, to the set of 2k×2k matrices with the operation of addition, in the sense that for sequences s, t, we have
$${FCGR}_{k}(st) \approx {FCGR}_{k}(s) + {FCGR}_{k}(t). $$
The definition of F
C
G
R
k
can be easily modified to make it an exact homomorphism by, e.g, defining a marked catenation of sequences s and t as s·t=s
$
t, with $ a new symbol, and constructing F
C
G
R
k
so as to not count any k-mer that includes the symbol $. Next steps in the exploration of the mathematical properties of additive DNA signatures include studying the implications of the homomorphic, structure-preserving, nature of F
C
G
R
k
, as well as extensions of the concept of additive DNA signature, to, e.g., weighted additive DNA signatures which would give different weights to the different types of DNA that compose it.