Alignment-free analysis of barcode sequences by means of compression-based methods

Background The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. DNA barcodes represent a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes. In the present work we aim at introducing the use of an alignment-free approach for the taxonomic analysis of barcode sequences. Our approach is based on the use of two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the use of USM also for the analysis of short DNA barcode sequences, showing how USM is able to correctly extract taxonomic information from this kind of sequence.
Results We downloaded from the Barcode of Life Data System (BOLD) database 30 datasets of barcode sequences belonging to different animal species. We built phylogenetic trees of every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, the similarity between evolutionary and compression-based trees was between 80% and 100% for most datasets (28 of 30, about 93%). Moreover, we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees range between 83% and 99% for all simulated datasets.
Conclusions Our results demonstrate the reliability of compression-based methods even for the analysis of short barcode sequences. Compression-based methods, with their strong theoretical assumptions, may then represent a valid alignment-free and parameter-free approach for barcode studies.


Background
The use of DNA sequences to integrate ecological, morphological and genetic information, and thereby to improve taxonomic studies of biological species [1], dates back to 2003, when it was proposed by Hebert et al. [2]. The authors introduced and discussed the need for DNA sequences acting as taxon "barcodes". The main purpose was to identify, for each kingdom of life (animals, plants, fungi, and so on), a short DNA fragment that could capture the diversity among different species. This way taxonomists can focus above all on discovering new species and on describing and revising existing taxa, leaving identification issues to barcode-based tools [3].
A 648-bp region of the cytochrome c oxidase I (COI) gene has been identified as a DNA barcode sequence for the animal kingdom [4]. The DNA barcode approach has proven useful for studying the biodiversity of very different groups, including fishes [5,6], birds [7] and bugs [8-10].
The analysis of DNA barcode sequences is usually done by means of clustering methods, such as the Neighbor Joining (NJ) method [11], which produce phylogenetic trees (dendrograms) of the input sequences. Taxonomic studies with DNA barcoding data rely on traditional approaches that evaluate genetic distances among species in order to perform distance-based clustering analysis [12]. Moreover, the computation of genetic distances needs a preprocessing step, namely sequence alignment, in order to compare corresponding loci. Genetic distances, also called evolutionary distances, are stochastic estimates and they do not define a distance metric [13].
In this work we propose a novel alignment-free approach for the analysis of DNA barcode data, based on information theory concepts. Our aim is to employ the Universal Similarity Metric (USM) [14] in order to compute genetic distances among biological species described by DNA barcode sequences. USM represents a class of distance measures that is based on Kolmogorov complexity [15] and that defines, under some assumptions, a distance metric.
USM is said to be universal because it can be applied to data belonging to very different domains: it has in fact been used for text and language analysis and for image and sound processing [16]. As said earlier, USM is based on Kolmogorov complexity which is, unfortunately, not computable. For this reason, USM needs to be approximated. One of USM's approximations, called Normalized Compression Distance (NCD), was adopted for the first time for the analysis of biological sequences in [16], where a coherent phylogenetic tree of 24 species belonging to Eutherian orders was built from complete mammalian mtDNA sequences. Another compression-based approximation, the Information-Based Distance (IBD) [17], was applied to the study of whole mitochondrial genome phylogeny. USM and its compression-based approximations have also been used for the analysis of different biological datasets in [18], including protein and genomic (complete mitochondrial genome) sequences. The authors compared phylogenetic trees obtained through USM with gold standard trees using the F-measure [19] and the Robinson metric [20], obtaining encouraging results about the use of USM in bioinformatics. NCD has also been adopted for clustering bacteria, considering 16S rRNA gene sequences and topographic representations obtained by means of the Self-Organizing Map algorithm [21,22].
Our proposed approach, then, aims to demonstrate that it is possible to apply information theory techniques to the study of short biological sequences for taxonomic and phylogenetic purposes. Genetic distances, obtained through USM's approximations, are used to compute phylogenetic trees of 30 barcode sequence datasets, and those trees are then compared with the ones obtained using traditional bioinformatics approaches that depend on sequence alignment and the computation of evolutionary distances. The presented results, showing a tree similarity between 80% and 100% for most datasets, demonstrate that our approach can be adopted for the aforementioned analysis. In order to further validate our results, we also carried out experimental tests with simulated barcode datasets, composed of 100, 150, 200 and 500 sequences. For each dataset composition, we considered 25 different barcode datasets, for a total of 100 experiments. These results, showing a tree similarity between 83% and 99% for all simulations, strengthen our findings with real barcode datasets.
In this work, we use USM's compression-based approximations for an in-depth analysis of short DNA barcode sequences. Preliminary results on this topic were presented in [23].

Methods
The application of USM's compression-based approximations to barcode sequence data has been studied considering both the Normalized Compression Distance (NCD) and the Information-Based Distance (IBD). Those two distances have been used to compute dissimilarities among species belonging to different kingdoms of life. DNA barcode datasets have been downloaded from the Barcode of Life Data System (BOLD) [24], which represents the best source and repository for barcode sequences. In our work we considered 30 datasets of different size and species composition. Using NCD and IBD dissimilarity matrices, we built phylogenetic trees of each of the thirty datasets through two state-of-the-art phylogenetic algorithms, Neighbor Joining and Unweighted Pair Group Method with Arithmetic Mean. Those trees were compared with the ones obtained from five different kinds of evolutionary distances (see next Sections). Figure 1 shows the flowchart of the experimental setup.
The following subsections give a brief explanation of all the employed techniques and algorithms.

USM and compression-based distances
The Universal Similarity Metric is a class of distance measures defined in terms of Kolmogorov complexity. The Kolmogorov complexity K(x) of a string x is the length of the shortest binary program x* that computes x on a universal Turing machine [14,15]. K(x|y) represents the conditional Kolmogorov complexity of x given y, and it is defined as the length of the shortest binary program that produces x as output, given the input y [14,15]. In other terms, K(x|y) is the minimal amount of information needed to generate x when y is given as input.
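USM was presented in [14] and is defined as:

$$\mathrm{USM}(x,y) = \frac{\max\{K(x|y^*),\, K(y|x^*)\}}{\max\{K(x),\, K(y)\}}$$

where x* and y* denote the shortest binary programs that compute x and y, respectively. It has been demonstrated [14] that USM is a metric, is normalized (it ranges between 0 and 1) and is universal.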
In order to be adopted as a distance measure, USM needs to be approximated, since Kolmogorov complexity is not computable. In our work we considered two USM approximations based on data compression: the Normalized Compression Distance (NCD) and the Information-Based Distance (IBD) defined in [17]. We chose NCD and IBD because they have been successfully used for the analysis of biological data [16-18,21,22].
NCD and IBD are respectively defined as:

$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}} \qquad (3)$$

$$\mathrm{IBD}(x,y) = 1 - \frac{C(x) - C(x|y)}{C(xy)} \qquad (4)$$

In Eq. (3) and (4), C(x) is the size, in bytes, of the compressed version of string x; C(xy) is the size of the compressed version of the concatenation of strings x and y; C(x|y) is the size of the conditional compression of string x given string y. The basic idea of a string compression algorithm is to find portions of the input string that are repeated and to substitute them with a shorter reference. The set of repeated string portions is called the "dictionary". Compressing a string x given a string y means that the compression algorithm builds the dictionary using the string y and encodes string x as references into that dictionary. This gives a measure of the similarity between the two strings. Both NCD and IBD give better USM approximations if the strings are compressed with optimized compression ratios.
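As an illustration of the NCD computation, the following minimal sketch uses the form NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. It is not the setup used in this work: Python's general-purpose `zlib` compressor stands in for a DNA-specific compressor, and the function names and toy sequences are hypothetical.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is only a stand-in compressor (assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: a random 650-base sequence, a close variant of it,
# and an unrelated random sequence of the same length.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(650))
seq_b = seq_a[:100] + "A" + seq_a[101:400] + "C" + seq_a[401:]
seq_c = "".join(rng.choice("ACGT") for _ in range(650))

d_close = ncd(seq_a.encode(), seq_b.encode())
d_far = ncd(seq_a.encode(), seq_c.encode())
print(f"NCD(a, near-copy) = {d_close:.3f}, NCD(a, unrelated) = {d_far:.3f}")
```

With a real compressor the value can slightly exceed 1 and C(xy) generally differs from C(yx), so the resulting matrix is only approximately a metric.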
In our experiments, we used the GenCompress [26] compressor to compute both NCD and IBD. GenCompress is a Lempel-Ziv dictionary-based compressor [27] optimized to work with DNA sequences. If GenCompress is given generic text strings as input, it works as a generic ASCII text compressor, without any DNA-specific optimization.
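Conditional compression, where the dictionary is built from y and then used to encode x, can be sketched with the preset-dictionary option of Python's `zlib`. This is an assumption-laden stand-in for C(x|y) and not GenCompress itself. The IBD form used here, 1 - (C(x) - C(x|y)) / C(xy), follows the definition in [17], and all names and data are hypothetical.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Plain compressed size C(x), using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def csize_given(x: bytes, y: bytes) -> int:
    """Stand-in for C(x|y): compress x with y supplied as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=y)
    return len(comp.compress(x) + comp.flush())


def ibd(x: bytes, y: bytes) -> float:
    """Information-Based Distance approximated with compressed sizes."""
    return 1.0 - (csize(x) - csize_given(x, y)) / csize(x + y)


# Hypothetical toy data, as in any quick experiment with random sequences.
rng = random.Random(1)
a = "".join(rng.choice("ACGT") for _ in range(650)).encode()
b = a[:300] + b"T" + a[301:]          # near-copy of a
c = "".join(rng.choice("ACGT") for _ in range(650)).encode()

ibd_close = ibd(a, b)
ibd_far = ibd(a, c)
print(f"IBD(a, near-copy) = {ibd_close:.3f}, IBD(a, unrelated) = {ibd_far:.3f}")
```

When y shares long substrings with x, the dictionary lets the compressor replace most of x with references, so C(x|y) drops well below C(x) and the distance shrinks.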

Evolutionary distances and phylogenetic trees
Evolutionary distances are distance measures used to compute the dissimilarity among genetic sequences [13]. They are estimates obtained through stochastic methods that take into account many biological phenomena such as convergent substitutions, multiple substitutions per site or retro-mutations. There exist several kinds of evolutionary distance, according to the prior assumptions of the stochastic model adopted and their related complexity. The more complex the model, the more accurate and computationally expensive the resulting evolutionary distance. In our work, we used five different evolutionary distances, sorted by complexity level, in order to compute phylogenetic trees: Kimura 2-parameter [28], Tajima-Nei [29], Tamura 3-parameter [30], Tamura-Nei [31] and Maximum Composite Likelihood (MCL) [32]. The Kimura 2-parameter distance model corrects for multiple hits, taking into account transitional and transversional substitution rates, while assuming that the four nucleotide frequencies are the same and that rates of substitution do not vary among sites. The Tajima-Nei distance model derives from the simpler Jukes-Cantor distance [33] and gives a better estimate of the number of nucleotide substitutions. The Tajima-Nei model assumes an equality of substitution rates among sites and between transitional and transversional substitutions. The Tamura 3-parameter model corrects for multiple hits, taking into account the differences in transitional and transversional rates and the G+C-content bias. The Tamura-Nei distance with the gamma model corrects for multiple hits, taking into account the different rates of substitution between nucleotides and the inequality of nucleotide frequencies. As for the MCL model, a composite likelihood is defined as a sum of log-likelihoods for related estimates. In [32] it is shown that pairwise evolutionary distances and the related parameters are accurately estimated by maximizing the composite likelihood. It is also stated there that a complex model has virtually no disadvantage in the composite likelihood method for phylogenetic analyses. In our case, the maximum composite likelihood method is used for describing the sum of log-likelihoods for all pairwise distances estimated by using the Tamura-Nei model. Evolutionary distances were computed using the MEGA 5 software [34].
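To make the simplest of these models concrete, the sketch below computes the Kimura 2-parameter distance from two already aligned sequences with the standard closed form d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is an illustrative re-implementation, not the MEGA 5 code. The function name is hypothetical, and skipping sites with gaps or undefined bases is an assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(s1: str, s2: str) -> float:
    """Kimura 2-parameter distance between two aligned, equal-length sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and undefined bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy aligned pair with a single A<->G transition in ten sites.
d_one_transition = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(round(d_one_transition, 4))
```

The need for equal-length, site-by-site comparable input is exactly the alignment step that the compression-based distances avoid.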
Phylogenetic relationships among biological species are usually inferred by means of phylogenetic trees [35]. In our work we considered the two most popular distance-based algorithms to build phylogenetic trees: Neighbor Joining (NJ) [11] and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [36]. NJ and UPGMA are called "distance-based" because they take as input a dissimilarity (distance) matrix among elements. Our goal is not to compare the two tree construction methods, but to build and compare two trees, one with an evolutionary distance and the other with a compression distance, first using NJ and then using UPGMA.
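How a distance-based algorithm turns a dissimilarity matrix into a tree can be sketched with a compact UPGMA implementation. This is a minimal illustration under simplifying assumptions (branch heights are not tracked, ties are broken by insertion order), not the software used for the experiments. The function name and the toy matrix are hypothetical.

```python
def upgma(labels, dist):
    """Build a nested-tuple tree from a square distance matrix with UPGMA."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}   # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                       # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        del d[(i, j)]
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            # Size-weighted mean = arithmetic mean over all original members.
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy matrix: A and B are close, C and D are close, the two pairs are far apart.
toy_labels = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 10, 10],
    [2, 0, 10, 10],
    [10, 10, 0, 4],
    [10, 10, 4, 0],
]
toy_tree = upgma(toy_labels, toy_dist)
print(toy_tree)
```

The same function accepts a matrix of evolutionary distances or of compression-based distances, which is what makes the two kinds of trees directly comparable.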

Phylogenetic trees comparison algorithms
It is possible to obtain different phylogenetic trees for the same input dataset, according to the adopted distance measure and/or the algorithm used. For this reason, methods exist to compute the similarity between trees, so that it is possible to assess the information content they share. One of the most popular similarity measures between phylogenetic trees is the symmetric distance introduced by Robinson and Foulds [20]. Robinson's metric considers as tree distance the number of "shifts", i.e. edit operations, required to obtain the second tree from the first one (and vice versa). This approach makes the symmetric distance a "local" similarity algorithm, because it penalizes all the mis-pairings in the same way, without considering the global clustering results and the tree topology representing the actual phylogenetic relationships.
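The symmetric distance is commonly computed by counting the leaf bipartitions (splits) that appear in one tree but not in the other. The following sketch uses that formulation on trees written as nested tuples. The representation and function names are assumptions made for illustration.

```python
def splits(tree):
    """Non-trivial leaf bipartitions of a nested-tuple tree, treated as unrooted."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    ref = min(all_leaves)          # canonical side: the one without this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(t1, t2):
    """Robinson-Foulds symmetric distance: number of splits not shared."""
    return len(splits(t1) ^ splits(t2))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
rf_12 = rf_distance(tree_1, tree_2)
print(rf_12)
```

Every unshared split counts the same, however similar the two groupings are, which is the "local" behaviour described above.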
For this reason, in our work, we adopted a more recent algorithm for tree comparison: the PhyloCore algorithm developed by Nye et al. [37], which takes a different approach from Robinson's. PhyloCore builds an alignment between trees by matching corresponding branches that share the same leaf elements. Each edge (branch) in a phylogenetic tree divides the tree into two subtrees, thus creating a partition of the leaf nodes into two subsets. Each pair of edges, one from each tree, is given a score by comparing the two corresponding partitions of leaf elements. Tree partitions with the same leaf nodes represent corresponding clusters, and therefore a similarity in terms of topology and phylogenetic preservation. PhyloCore gives the percentage of topology similarity between trees.
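The idea of scoring pairs of edges by the overlap of their leaf partitions can be sketched as follows. This is a simplified illustration and not the PhyloCore implementation. The pair score (the better of the two possible side pairings, each rated by its weaker Jaccard overlap) is an assumption based on the description above. Matching each edge to its best partner independently is a further simplification, and all names are hypothetical.

```python
def bipartitions(tree):
    """Set of non-trivial splits, each a frozenset holding its two leaf sides."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        rest = all_leaves - clade
        if len(clade) > 1 and len(rest) > 1:
            found.add(frozenset([clade, rest]))
        return clade

    visit(tree)
    return found


def pair_score(p, q):
    """Overlap score in [0, 1] between two splits (assumed scoring rule)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    p0, p1 = tuple(p)
    q0, q1 = tuple(q)
    return max(min(jaccard(p0, q0), jaccard(p1, q1)),
               min(jaccard(p0, q1), jaccard(p1, q0)))


def topology_similarity(t1, t2):
    """Mean best-match score, averaged over both directions (simplification)."""
    s1, s2 = bipartitions(t1), bipartitions(t2)
    if not s1 or not s2:
        return 1.0 if not s1 and not s2 else 0.0

    def one_way(src, dst):
        return sum(max(pair_score(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(s1, s2) + one_way(s2, s1))


tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
similarity_12 = topology_similarity(tree_1, tree_2)
print(f"{100 * similarity_12:.1f}%")
```

Unlike a shared/unshared count, partially overlapping partitions still contribute a partial score, so two trees that differ by a single misplaced leaf remain rated as largely similar.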

Results and discussion
In order to extensively test the proposed compression-based approach, we used both real and synthetic datasets and compared the results with the ones obtained using the evolutionary distances. In the following subsections we describe the datasets and discuss the comparison between the two approaches.

Barcode datasets
We performed our experiments considering real barcode datasets, all taken from the Barcode of Life Data System (BOLD). Since our purpose was to test the reliability of compression-based distance models, we considered a subset of the whole database. We selected 30 datasets that differ from each other in the type of species (birds, fish, and so on), the number of species, the number of barcode sequences per species (specimens), the sequence length and the sequence quality, expressed as the percentage of sequences with undefined nucleotides, marked with the "N" character. We did not consider the whole BOLD database because we had no interest in obtaining a phylogenetic tree for all available datasets. It is very important to consider the percentage of sequences containing undefined bases because, as highlighted in Section "Methods", GenCompress works as an optimized compressor for DNA sequences only when dealing with strings made of the four letters A, C, G, T. In all other situations, GenCompress works as a generic ASCII text compressor. That means GenCompress will give bad compression ratios for those sequences, and as a consequence the NCD and IBD distances (see Eq. (3) and (4)) will not properly approximate USM. Since the typical sequence length of the COI barcode gene is about 650 bp [4], longer sequences contain information related to other genes, whereas shorter sequences carry incomplete information. In our study, we therefore considered as "good" those datasets having a low percentage of sequences with undefined bases and sequences of about the same length (the 650 bp length of a typical COI barcode sequence).
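The two quality criteria described above, the percentage of sequences containing undefined bases and the spread of sequence lengths, are simple to compute. The sketch below is a hypothetical helper for screening a dataset; the function names and the toy sequences are not from the original study.

```python
def has_undefined(seq: str) -> bool:
    """True if the sequence contains any symbol other than A, C, G, T."""
    return any(base not in "ACGT" for base in seq.upper())


def dataset_summary(seqs):
    """Basic quality indicators for a list of barcode sequences."""
    lengths = [len(s) for s in seqs]
    flagged = sum(has_undefined(s) for s in seqs)
    return {
        "n_sequences": len(seqs),
        "pct_undefined": 100.0 * flagged / len(seqs),
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Hypothetical toy dataset: one of four sequences contains an "N".
toy_dataset = ["ACGTACGT", "ACGNACGT", "ACGTAC", "ACGTACGT"]
summary = dataset_summary(toy_dataset)
print(summary)
```

A high `pct_undefined` or a wide gap between `min_length` and `max_length` would flag a dataset as less suitable for compression-based distances under the criteria above.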
The complete list of the barcode datasets of our experiments is summarized in Table 1 and Table 2.

Data simulation
In order to test our approach on synthetic data as well, we simulated barcode datasets using a generation strategy similar to the one reported in [38,39]. We started by simulating a random ultrametric species tree with the Mesquite software (version 2.75, build 564) [40] using the Yule model [41]. We generated four different simulated species trees with, respectively, 10, 15, 20 and 50 species and a total tree depth of 1 million generations. Gene trees were then simulated on the species trees, using the Coalescent package of Mesquite, considering 10 individuals (specimens) per species, thus obtaining gene trees with, respectively, 100, 150, 200 and 500 individuals. Gene trees were simulated using an effective population size of 10000. We finally added noise to the gene trees in order to produce non-ultrametric trees. We considered normally distributed noise with a variance of 0.7 times the original branch length, as done in [38]. Barcode sequence datasets were simulated from the gene trees using the Seq-Gen software (version 1.3.3) [42]. We adopted the HKY model of evolution [43], with a transition/transversion ratio of 3, nucleotide frequencies of 0.3 (A), 0.2 (C), 0.2 (G), 0.3 (T), and a sequence length of 650 bp, representing the typical COI gene length. For each gene tree, we obtained 25 barcode datasets, resulting in a total of 100 simulated datasets.

Table 1 The 30 barcode datasets used in our experimental tests, taken from the BOLD database.
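The noise step of the data simulation can be sketched as follows, under one possible reading of the description: each branch length L is replaced by a draw from a normal distribution with mean L and variance 0.7 L. Clamping negative draws to zero is an additional assumption, and the function name and toy branch lengths are hypothetical.

```python
import math
import random


def add_branch_noise(lengths, factor=0.7, rng=None):
    """Perturb branch lengths with Gaussian noise of variance factor * length."""
    rng = rng or random.Random()
    noisy = []
    for length in lengths:
        sigma = math.sqrt(factor * length)        # standard deviation
        noisy.append(max(0.0, rng.gauss(length, sigma)))  # assumption: no negatives
    return noisy


# Hypothetical branch lengths; a fixed seed makes the perturbation repeatable.
branches = [0.0, 1.0, 2.5, 4.0]
noisy_a = add_branch_noise(branches, rng=random.Random(42))
noisy_b = add_branch_noise(branches, rng=random.Random(42))
print(noisy_a)
```

Because each branch is perturbed independently, root-to-tip path lengths are no longer equal, which is what makes the resulting tree non-ultrametric.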

Experimental results
The purpose of the proposed experimental tests is to demonstrate that compression-based distances represent a valid alignment-free approach for the analysis of phylogenetic relationships among short barcode sequences. Tables 3, 4, 5, 6 and 7 summarize the similarity scores, computed with PhyloCore, between evolutionary-based trees and compression-based trees of the real barcode datasets. In more detail, for each compression-based distance (NCD and IBD) and each phylogenetic tree inference algorithm (NJ and UPGMA), each table gives the similarity scores with respect to one reference evolutionary distance model (Kimura 2-parameter, Tamura-Nei and so on).
Since, in our experiments, we use two kinds of compression-based distances, NCD and IBD, and two different phylogenetic tree inference algorithms, NJ and UPGMA, we are interested in the specific behavior of each distance measure and algorithm. In Figure 2(a) we show the trend curves, for the NCD and IBD methods, of the mean PhyloCore similarity scores over all evolutionary distance models for the input datasets. The two curves have a similar trend, that is, NCD and IBD give very close similarity scores, except for the AGWEB, CLNVA, DSFCH and RDMYS datasets. That chart does not give enough information about which compression-based distance produces the most regular results in terms of topology similarity. Our next step was then to check, separately, the similarity scores obtained using the NJ and UPGMA algorithms. In Figure 2(b) and 2(c) we show the trend curves of the mean PhyloCore similarity scores over all evolutionary distance models, considering respectively only the NJ algorithm and only the UPGMA algorithm. From those charts we can state that the NCD and IBD distance models give almost identical similarity scores in tree comparison when the UPGMA algorithm is used for tree inference. Using the NJ algorithm, instead, we obtain a very unstable trend, with similarity scores generally lower than the corresponding scores obtained through the UPGMA algorithm. Moreover, in Figure 3 we show a histogram of the highest similarity values, considering all the evolutionary distance models and input datasets, obtained using the NJ and UPGMA algorithms. From that chart, we can see that in 90% (27/30) of cases, the best similarity scores from the comparison between evolutionary-based trees and compression-based trees are obtained using UPGMA. This indicates that UPGMA is the most suitable tree inference algorithm when adopting compression-based distance models. Looking again at Figure 2(c), the lowest scores, below 80% similarity, are obtained for the AGWEB, JTB and RDMYS datasets. According to Table 2, AGWEB and RDMYS are the datasets with the highest percentage of sequences with undefined bases, respectively 87% and 32%. These low similarity results are then explained by the low quality of the input datasets, which gave bad compression ratios using GenCompress; these in turn produced bad estimates of NCD and IBD and consequently a wrong phylogenetic tree. As for JTB, its low similarity score is explained by the different lengths of its sequences, ranging from 658 to 899 bp. As said earlier in Section "Barcode datasets", longer sequences contain additional information not related to the COI barcode gene, and furthermore the spread of sequence lengths influences the NCD and IBD computation (Eq. (3) and (4)).

Table 2 The main features of the 30 barcode datasets used in our experimental tests.

Table caption: PhyloCore similarity scores of the 30 barcode datasets. The scores are obtained comparing compression-based trees, using both NCD and IBD, with evolutionary-based trees, using the Tamura 3-parameter distance. The trees were generated through the NJ and UPGMA algorithms.

Table caption: PhyloCore similarity scores of the 30 barcode datasets. The scores are obtained comparing compression-based trees, using both NCD and IBD, with evolutionary-based trees, using the Tamura-Nei distance. The trees were generated through the NJ and UPGMA algorithms.
In order to identify the most similar compression-based and evolutionary-based trees with regard to the evolutionary distance model adopted, we drew the histogram of Figure 4. The histogram is obtained considering the highest similarity values from Tables 3, 4, 5, 6 and 7, that is, considering both the NJ and UPGMA algorithms and both the NCD and IBD distance models. The chart in Figure 4 shows that the highest similarity scores are reached in the comparison between compression-based trees and evolutionary-based trees obtained through the MCL distance model. Moreover, in Figure 5 we show the boxplot of the similarity scores obtained comparing MCL-based trees and compression-based (NCD and IBD) trees using both the NJ and UPGMA algorithms. This chart confirms that the best similarity scores, in terms of minimum, maximum and mean values, are reached in the comparison between MCL-based trees and compression-based trees using the UPGMA algorithm. Finally, in the pie chart of Figure 6, we summarize the mean similarity scores for the 30 datasets resulting from the comparison between compression-based trees and MCL-based trees using the UPGMA algorithm. The pie chart shows that in 7% of cases (2/30) we obtain a similarity score below 80% (corresponding to the AGWEB and JTB datasets); in 57% of cases (17/30) we have similarity scores ranging from 80% to 90%; in 33% of the considered datasets (10/30) we obtain a similarity score over 90%; and in 3% of cases (1/30) we reach 100% tree similarity. It is interesting to note that the perfect similarity score (100%) is obtained for the BPRP dataset which, as reported in Table 2, represents an ideal barcode dataset, with a sequence length of 658 bp and 0% of sequences with undefined bases. As explained in Section "Evolutionary distances and phylogenetic trees", the MCL method gives better estimates of evolutionary distance than the other four distance models, and consequently more accurate phylogenetic trees. From our experimental study we found that the NCD and IBD compression-based distances, using the UPGMA algorithm, build phylogenetic trees that have the best similarity scores with MCL-based trees, which, in turn, give the most accurate phylogenetic relationships.
In order to strengthen our experimental results, we carried out further tests using simulated data, as described in Section "Data simulation". Results obtained with simulated datasets are summarized in Tables 8 and 9. Since we obtained analogous results using both the NCD and IBD distance measures, we report only the similarity scores obtained using NCD, for the sake of simplicity. For each number of input sequences (100, 150, 200, 500), we replicated the simulation 25-fold, for a total of 100 new experiments. Considering all five evolutionary models and the NJ algorithm, we evaluated the comparison between compression-based and evolutionary trees, obtaining a very high mean similarity score (83%, with a variance between $10^{-3}$ and $10^{-4}$). Using the UPGMA algorithm the similarity score was even higher, with a mean of 99% and a variance between $10^{-3}$ and $10^{-6}$. We can state, then, that our proposed approach is very reliable on simulated data and robust enough to be applied to real barcode datasets.

Speed evaluation
In order to compare the processing time of the proposed approach with that of evolutionary distance methods, we performed additional experiments. The compression-based distances can be calculated separately for each sequence versus all the others, so that, in principle, all the distances can be calculated by running all the programs at the same time (one program for each sequence, each running on one processor core). This makes the compression-based method intrinsically parallel. To compare the performance of the proposed method with that of the alignment-based distance, we have to take into account a parallel version of the alignment algorithm. We used the algorithm described in [44], which exploits multi-core processors and becomes faster as more processor cores become available. In this algorithm the speed gain decreases in a non-linear way each time we double the number of cores. On the other hand, as said above, in the compression-based distance method the speed gain is constant: each time we double the number of cores, the speed doubles. For this reason, if we compare the running times of the two methods as a function of the number of cores, we will find a crossover point. Experiments for the evaluation of running times were carried out on a multicore system with up to 16 cores. We measured the execution times of both compression and alignment for a barcode dataset of 500 sequences versus the number of cores. Running times are summarized in Figure 7, which shows real (solid line) and estimated (dashed line) times on a log2 scale. According to the estimated times, the compression-based approach becomes faster than the alignment approach beyond 32 cores.

Figure 2
Mean PhyloCore similarity scores of 30 input datasets. Mean PhyloCore similarity scores resulting from the comparison of NCD and IBD based trees with the trees obtained from all five evolutionary distance models. We considered separately the results obtained using both the NJ and UPGMA algorithms (a), only the NJ algorithm (b), and only the UPGMA algorithm (c). The trend curves show that the NCD and IBD distance models give almost identical similarity scores in tree comparison when the UPGMA algorithm is used for tree inference.

Figure 3
Histogram of the best similarity scores, for all the evolutionary distance models and input datasets, using the NJ and UPGMA algorithms. In 90% (27/30) of cases, the best similarity scores from the comparison between evolutionary-based trees and compression-based trees are obtained using UPGMA.

Figure 4
Histogram of the best PhyloCore similarity scores for all input datasets. For each dataset, the chart shows the best similarity score resulting from the pairwise comparison of compression-based trees and the five trees derived from the five evolutionary distance models. The chart shows that the highest similarity scores are reached in the comparison between compression-based trees and evolutionary-based trees obtained through the MCL distance model.

Figure 6
Pie chart summarizing the mean similarity scores between compression-based trees and MCL-based trees obtained using the UPGMA algorithm. The chart shows that in 7% of cases (2/30) we obtain a similarity score below 80% (corresponding to the AGWEB and JTB datasets); in 57% of cases (17/30) we have similarity scores ranging from 80% to 90%; in 33% of the considered datasets (10/30) we obtain a similarity score over 90%; and in 3% of cases (1/30) we reach 100% tree similarity.
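The observation in the speed evaluation that each sequence can be compared against all the others independently can be sketched by computing one row of the distance matrix per worker. The example below is an illustration only: it uses `zlib` as a stand-in compressor and a thread pool to stay short, whereas a setup driving an external compressor would more naturally launch one process per row. All names and the toy data are hypothetical.

```python
import random
import zlib
from concurrent.futures import ThreadPoolExecutor


def csize(data: bytes) -> int:
    """Compressed size in bytes (zlib as a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd_row(i, seqs, sizes):
    """Distances from sequence i to every sequence; independent of other rows."""
    row = []
    for j, other in enumerate(seqs):
        if i == j:
            row.append(0.0)
            continue
        cxy = csize(seqs[i] + other)
        row.append((cxy - min(sizes[i], sizes[j])) / max(sizes[i], sizes[j]))
    return row


def ncd_matrix(seqs, workers=4):
    """Full NCD matrix, one row per task in the worker pool."""
    sizes = [csize(s) for s in seqs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: ncd_row(i, seqs, sizes), range(len(seqs))))


# Hypothetical toy dataset of eight random 650-base sequences.
rng = random.Random(7)
toy_seqs = ["".join(rng.choice("ACGT") for _ in range(650)).encode() for _ in range(8)]
matrix_serial = ncd_matrix(toy_seqs, workers=1)
matrix_parallel = ncd_matrix(toy_seqs, workers=4)
print(len(matrix_parallel), "rows computed")
```

No row depends on any other, so the work divides evenly among workers with no communication between them, which is the property behind the constant speed gain described above.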

Conclusions
In this paper we presented a novel alignment-free approach for the study of barcode genetic sequences. We used two compression-based approximations of USM, namely NCD and IBD, for reconstructing phylogenetic trees of short barcode sequences. In previous works, in fact, compression-based distances were used only for the analysis of whole mitochondrial genomes. We tested our approach considering 30 barcode datasets, of different size and belonging to different species, and 100 simulated datasets composed of different numbers of sequences (100, 150, 200, 500). Compression-based trees, obtained from NCD and IBD distances, were compared with evolutionary-based trees derived using five evolutionary distance models: Kimura 2-parameter, Tajima-Nei, Tamura 3-parameter, Tamura-Nei and MCL. Trees were obtained using the NJ and UPGMA algorithms. Our experimental tests demonstrated that using the NCD and IBD compression-based distances we were able to obtain phylogenetic trees quite similar to evolutionary-based trees, with similarity scores ranging from 80% to 100% for most datasets. In more detail, the highest similarity scores were reached comparing compression-based trees with MCL-based trees using the UPGMA algorithm, with no substantial differences between NCD and IBD. MCL provides better estimates of evolutionary distance, and as a consequence more accurate phylogenetic trees, than the other methods considered. As for simulated data, our experimental trials show very stable results with regard to the number of input sequences and the evolutionary model considered, with mean similarity scores ranging from 83%, using the NJ algorithm, to 99%, using the UPGMA algorithm. The NCD and IBD compression distance models represent a sound alignment-free and parameter-independent approach, based on strong theoretical assumptions. Using these models it is possible to obtain very reliable phylogenetic trees, and they are a valid tool for the analysis of barcode sequences.

Table 8 PhyloCore similarity scores of the simulated datasets. According to the number of barcode sequences, each simulation was replicated 25-fold, for a total of 100 simulated datasets. The scores (mean and variance) are obtained comparing compression-based trees, using NCD, with evolutionary-based trees obtained through all five evolutionary models. The trees were generated using the NJ algorithm.

Table 9 Tree similarity score (mean and variance) among compression-based trees and evolutionary trees, obtained with UPGMA, of simulated datasets. PhyloCore similarity scores of the simulated datasets. According to the number of barcode sequences, each simulation was replicated 25-fold, for a total of 100 simulated datasets. The scores (mean and variance) are obtained comparing compression-based trees, using NCD, with evolutionary-based trees obtained through all five evolutionary models. The trees were generated using the UPGMA algorithm.