An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction
BMC Bioinformatics volume 21, Article number: 404 (2020)
Abstract
Background
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_{k}, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS_{k} takes \(O(n \log^{k} n)\) time and is hence impractical for large datasets, multiple heuristics that can approximate ACS_{k} have been introduced.
Results
In this paper, we present a novel linear-time heuristic to approximate ACS_{k}, which is faster than computing the exact ACS_{k} while being closer to the exact ACS_{k} values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
Conclusions
Our method produces a better approximation for ACS_{k} and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyarrs.
Background
Over the past two decades, many similarity measures based on alignment-free methods have been proposed for sequence comparison for a diverse range of bioinformatics applications. With the increasing availability of sequence data from multiple sources and as alignment algorithms are reaching their limits, many of these alignment-free methods have become popular in applications such as phylogeny reconstruction, sequence clustering, transcript quantification and detection of horizontal gene transfers [1, 2].
For phylogeny reconstruction, alignment-free methods are used to construct the pairwise distance matrix, a symmetric matrix of sequence similarity measures computed for every pair in the given set of sequences. With the distance matrix as their input, algorithms such as unweighted pair group method with arithmetic mean (UPGMA) [3] or neighbor-joining (NJ) [4] construct the desired tree.
Alignment-free methods for computation of similarity measures can be classified based on whether the seeds are exact or approximate and whether the seeds are of fixed or variable length. The most popular among the fixed-length exact seed methods are k-mer-based techniques, which proceed by first constructing the sets of all the k-mers (k-mers are fixed-length exact seeds of length k) of a pair of sequences, followed by the estimation of a similarity measure based either on the k-mer frequency profile (e.g., Euclidean distance, CVTree [5], FFP [6]) or on the intersection/differences of the k-mer sets (e.g., Jaccard coefficient). Lu et al. [7] present a comprehensive review of about 28 such measures typically used in the construction of phylogeny trees. Methods using approximate fixed-length seeds, such as spaced-seed approaches [8], allow the use of k-mers with mismatches at specific locations and make use of multiple patterns to improve accuracy.
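To make the two families of k-mer measures concrete, the following is a minimal Python sketch of a set-based measure (Jaccard) and a profile-based measure (Euclidean distance between normalized k-mer frequencies). The function names are hypothetical and chosen only for illustration; the sketch is not taken from CVTree, FFP or any of the reviewed tools.

```python
from collections import Counter
from math import sqrt


def kmers(seq, k):
    """All overlapping substrings of length k (the fixed-length exact seeds)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard_distance(x, y, k):
    """1 - |A ∩ B| / |A ∪ B| over the k-mer sets A and B of x and y."""
    a, b = set(kmers(x, k)), set(kmers(y, k))
    if not a and not b:
        return 0.0  # both sequences shorter than k: treat as identical
    return 1.0 - len(a & b) / len(a | b)


def euclidean_profile_distance(x, y, k):
    """Euclidean distance between the normalized k-mer frequency profiles."""
    fx, fy = Counter(kmers(x, k)), Counter(kmers(y, k))
    nx = sum(fx.values()) or 1  # avoid division by zero for short inputs
    ny = sum(fy.values()) or 1
    words = set(fx) | set(fy)
    return sqrt(sum((fx[w] / nx - fy[w] / ny) ** 2 for w in words))


if __name__ == "__main__":
    print(jaccard_distance("ACGT", "ACGA", 2))
    print(euclidean_profile_distance("ACGTACGT", "ACGAACGA", 2))
```

Both functions depend on a single fixed k, which is the restriction that variable-length measures such as ACS avoid.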
Among the variable-length seeding methods, one of the measures shown to be effective in phylogeny applications is the average common substring, ACS, which is computed for a pair of sequences as the mean of the lengths of the longest common prefixes [9]. After computing the ACS, the sequence similarity of two sequences X and Y is computed as follows:
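As an illustration, the sketch below assumes that the similarity measure takes the symmetrized form introduced together with the ACS measure [9], namely \(\frac{1}{2}\left[\frac{\log|\mathsf{Y}|}{\mathsf{ACS}(\mathsf{X},\mathsf{Y})} + \frac{\log|\mathsf{X}|}{\mathsf{ACS}(\mathsf{Y},\mathsf{X})}\right] - \left[\frac{\log|\mathsf{X}|}{|\mathsf{X}|} + \frac{\log|\mathsf{Y}|}{|\mathsf{Y}|}\right]\). Both this form and the use of the natural logarithm are assumptions made for the example, and the function name is hypothetical.

```python
from math import log


def acs_distance(acs_xy, acs_yx, len_x, len_y):
    """Symmetrized ACS-based distance (assumed form, natural logarithm).

    acs_xy is ACS(X, Y), acs_yx is ACS(Y, X); len_x and len_y are |X| and |Y|.
    """
    # Average of the two directed terms log|Y|/ACS(X,Y) and log|X|/ACS(Y,X).
    cross = 0.5 * (log(len_y) / acs_xy + log(len_x) / acs_yx)
    # Correction term that accounts for the sequence lengths.
    self_term = log(len_x) / len_x + log(len_y) / len_y
    return cross - self_term


if __name__ == "__main__":
    # A larger average common substring length yields a smaller distance.
    print(acs_distance(5.0, 5.0, 100, 100))
    print(acs_distance(10.0, 10.0, 100, 100))
```

The measure needs both directed ACS values because ACS is not symmetric in its arguments.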
By introducing k mismatches into the average common substring metric (abbreviated as ACS_{k}), Leimeister and Morgenstern [10] demonstrated improved accuracy for phylogeny applications. However, their approach, called kmacs, uses a greedy heuristic as an approximation of ACS_{k}, since computing the exact ACS_{k} is computationally expensive and was shown to take \(O(n \log^{k} n)\) time for a pair of sequences of total length n [11]. In a later work, Thankachan et al. [12] also proved that the runtime bound remains \(O(n \log^{k} n)\) even when insertions and deletions are allowed along with mismatches. Based on [11], Thankachan et al. [13] presented another greedy heuristic to approximate ACS_{k}.
In this work, we present a novel linear-time heuristic that is a more accurate approximation of ACS_{k} than the kmacs approach. While kmacs constructs an ACS_{k} approximation by means of a forward extension of the longest common prefixes, our algorithm performs both forward and backward extensions to identify a k-mismatch common substring of longer length, and hence produces a closer approximation to the exact ACS_{k}. Using four real datasets, we evaluate the runtime, accuracy and the effectiveness of our proposed approach. We also demonstrate its applicability for phylogeny tree construction.
Methods
Notations and preliminaries
Let X and Y be two sequences drawn from the alphabet set Σ. We denote the length of X by |X| and the suffix of X starting at position i by X_{i}. Also, we use \(\overleftarrow {\mathsf {X}}\) and \(\overleftarrow {\mathsf {Y}}\) to denote the reverse of the strings X and Y respectively.
Let |X|+|Y|=n. We define LCP(X_{i},Y_{j}) to be the longest common prefix of X_{i} and Y_{j}, and LCP_{k}(X_{i},Y_{j}) to be its k-mismatch counterpart, i.e., a longest common prefix that allows k mismatches, k≥0 (also termed the longest k-mismatch substring starting at X_{i}). We denote \(\max_{j} |\mathsf{LCP}_{k}(\mathsf{X}_{i},\mathsf{Y}_{j})|\) by λ_{k}(i) and the position in Y corresponding to the λ_{k}(i)-length match by μ_{k}(i), i.e.,
\[\lambda_{k}(i) = \max_{j} |\mathsf{LCP}_{k}(\mathsf{X}_{i},\mathsf{Y}_{j})|, \qquad \mu_{k}(i) = \arg\max_{j} |\mathsf{LCP}_{k}(\mathsf{X}_{i},\mathsf{Y}_{j})|.\]
For the sake of brevity, we abbreviate λ_{0}(i) and μ_{0}(i) as λ(i) and μ(i) respectively.
The average common substring, ACS, of X w.r.t. Y is defined as
\[\mathsf{ACS}(\mathsf{X},\mathsf{Y}) = \frac{1}{|\mathsf{X}|}\sum_{i=1}^{|\mathsf{X}|} \lambda(i).\]
ACS_{k}(X,Y), k≥0, of X w.r.t. Y is defined similarly with LCP_{k} instead of LCP in the above equation. Note that, in general, ACS_{k}(X,Y)≠ACS_{k}(Y,X).
We use GST_{f} and GST_{r} to denote the generalized suffix trees constructed for the concatenated strings \(\mathsf{T} = \mathsf{X}\$_{1}\mathsf{Y}\$_{2}\) and \(\overleftarrow{\mathsf{T}} = \overleftarrow{\mathsf{X}}\$_{1}\overleftarrow{\mathsf{Y}}\$_{2}\) respectively, where $_{1},$_{2}∉Σ. For our algorithm, GST_{f} and GST_{r} serve as indexing data structures that enable us to perform longest common prefix queries for X and Y in constant time. Both GST_{f} and GST_{r} can be constructed in O(n) time with O(n) space.
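The definitions above can be checked on small inputs with a direct, brute-force computation. The following Python sketch is only a reference implementation of the definitions: it uses 0-based positions, scans every pair of suffixes, and is therefore far slower than any of the methods discussed in this paper. The function names are hypothetical.

```python
def lcp_k(x, i, y, j, k):
    """Length of the longest prefix of x[i:] that matches y[j:] with at most
    k mismatches (k = 0 gives the ordinary longest common prefix)."""
    length, mismatches = 0, 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break  # mismatch budget exhausted
            mismatches += 1
        length += 1
    return length


def lambda_mu(x, y, k):
    """Arrays lambda_k and mu_k (0-based): for every position i of x, the best
    k-mismatch match length and the first position in y that attains it."""
    lam, mu = [], []
    for i in range(len(x)):
        best_len, best_j = 0, 0
        for j in range(len(y)):
            candidate = lcp_k(x, i, y, j, k)
            if candidate > best_len:
                best_len, best_j = candidate, j
        lam.append(best_len)
        mu.append(best_j)
    return lam, mu


def acs_k(x, y, k):
    """ACS_k of x w.r.t. y: the mean of lambda_k(i) over all positions of x."""
    lam, _ = lambda_mu(x, y, k)
    return sum(lam) / len(x)


if __name__ == "__main__":
    X, Y = "CATTGCATACGA", "ATGGATCCAATAG"
    print(acs_k(X, Y, 0), acs_k(Y, X, 0))
    print(acs_k(X, Y, 1), acs_k(Y, X, 1))
```

Such a quadratic scan is useful only as a correctness oracle for short strings when testing faster approximations.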
Previous greedy heuristics
Using the notations described above, ACS_{k}(X,Y) is computed as \(\mathsf{ACS}_{k}(\mathsf{X}, \mathsf{Y}) = \sum_{i=1}^{|\mathsf{X}|} \lambda_{k}(i) / |\mathsf{X}|\). The key difficulty in computing ACS_{k} is the estimation of the array λ_{k}(i), i=1,…,|X|. Before we present our linear-time approximate algorithm for computing λ_{k}, we briefly discuss the previously established heuristic methods for approximating λ_{k}.
In kmacs [10], the previously published greedy approach, λ_{k}(i) is approximated by extending the longest common prefixes. kmacs uses the longest common substring of suffixes X_{i} and Y_{q}, \(q = \arg\max_{j} |\mathsf{LCP}(\mathsf{X}_{i},\mathsf{Y}_{j})|\), as the initial anchor segment, then performs a forward extension to identify the common substring with k−1 mismatches and approximates the total length as λ_{k}(i). For example, if X and Y are the strings CATTGCATACGA and ATGGATCCAATAG respectively, then to compute an approximation to λ_{2}(4), kmacs would first identify the LCP match of X_{4} at Y_{2} and then approximate λ_{2}(4) as 6 by matching the segments TGCATA and TGGATC. Formally, kmacs computes the following measure as an approximation of λ_{k}(i):
\[|\mathsf{LCP}_{k}(\mathsf{X}_{i},\mathsf{Y}_{\mu(i)})|.\]
Using a generalized suffix tree constructed for X and Y, the above measure can be calculated in O(k) time via k consecutive LCP queries starting with X_{i} and Y_{μ(i)}. Therefore, the similarity metric based on the above heuristic measure can be computed in O(nk) time.
ALFRED-G [13] follows a similar logic except that it includes an extra mismatch in the initial anchor segment. Formally, ALFRED-G approximates λ_{k}(i) with the following measure:
\[|\mathsf{LCP}_{k}(\mathsf{X}_{i},\mathsf{Y}_{\mu_{1}(i)})|.\]
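The two greedy strategies differ only in how the anchor position in Y is chosen, which the following Python sketch makes explicit. It is an illustration under simplifying assumptions and not the kmacs or ALFRED-G code: anchors are found by scanning Y directly instead of querying a suffix tree, ties are broken by taking the first best position, positions are 0-based, and the function names and the `anchor_mismatches` parameter are hypothetical.

```python
def lcp_k(x, i, y, j, k):
    """Longest prefix of x[i:] matching y[j:] with at most k mismatches."""
    length, mismatches = 0, 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break
            mismatches += 1
        length += 1
    return length


def best_anchor(x, i, y, anchor_mismatches):
    """First position j in y that maximizes the anchor match with x[i:]."""
    return max(range(len(y)),
               key=lambda j: lcp_k(x, i, y, j, anchor_mismatches))


def greedy_lambda(x, y, k, anchor_mismatches=0):
    """Greedy approximation of lambda_k for every position of x.

    anchor_mismatches=0 anchors on the exact LCP (kmacs-style);
    anchor_mismatches=1 anchors on the 1-mismatch LCP (ALFRED-G-style).
    The anchor is then extended forward up to k mismatches in total.
    """
    return [lcp_k(x, i, y, best_anchor(x, i, y, anchor_mismatches), k)
            for i in range(len(x))]


if __name__ == "__main__":
    x, y = "AATCGGT", "AATGGGAAACCGGT"
    print(greedy_lambda(x, y, 1))                       # exact-LCP anchors
    print(greedy_lambda(x, y, 1, anchor_mismatches=1))  # 1-mismatch anchors
```

Because each value is the length of an actual k-mismatch match, both variants can only underestimate the exact λ_{k}(i); allowing a mismatch in the anchor widens the set of candidate positions, which is why the second variant can recover matches that the first one misses.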
Proposed algorithm
In our algorithm, we make use of the following observation: a k-mismatch common substring of two suffixes X_{i} and Y_{j} consists of k+1 common substrings separated by k mismatch characters. This observation leads to the key intuition behind our algorithm: the anchor segment can be any one of the k+1 segments that constitute a k-mismatch common substring. As mentioned above, both kmacs and ALFRED-G consider the first segment as the anchor segment. Our heuristic, denoted by λ′_{k}(i), is computed by extending all k+1 matching substrings that overlap the position i, as anchors.
We illustrate our approach with the following example. Consider the three suffixes X_{p}=AATCGGT...,Y_{q}=AATGGGA... and Y_{r}=AACCGGT..., and let μ(p)=q;μ(p+3)=r+3. A greedy heuristic based algorithm such as kmacs, which uses the LCP to find anchor segments, will select Y_{q} as the anchor point and approximate λ_{1}(p) as 5, even though there is a better match at Y_{r}. Extending the segments backward overcomes this limitation in this example because a backward extension from the Y_{r+3}=CGG... segment from X_{p+3} can identify Y_{r} to be the better match for X_{p}.
Algorithm 1 presents the pseudocode for our proposed heuristic. It takes as input the strings X and Y, and outputs an array λ′_{k} of length |X|, whose i-th entry contains the approximation for λ_{k}(i).
After constructing the two generalized suffix trees GST_{f} and GST_{r} and initializing the λ′_{k}(i) entries (Lines 1–3), the algorithm proceeds in two phases. In the first phase, we compute the forward and backward extensions of the longest common substring for each position in X (Lines 4–25). Here, we make use of two arrays L_{f} and L_{r}, each of length k+1, which contain the lengths of the 0-, 1-, 2-, …, k-mismatch substrings starting and ending at position i respectively. L_{f} and L_{r} can be computed via k LCP queries on GST_{f} and GST_{r} respectively (Lines 13 and 21). After computing the L_{f} and L_{r} arrays, we update the λ′_{k} array at the k+1 possible positions corresponding to all the possible forward and backward extensions (Lines 22–25).
In some cases, the approximation computed at position i in the first phase can be improved by examining the λ′_{k}(i−1) entry, if the i-th position does not correspond to a mismatch character of the preceding entry. In the second phase, we update those entries for which a better approximation is one less than the preceding entry in λ′_{k} (Lines 26–29).
Since LCP queries take constant time using GSTs, the first phase can be accomplished in O(nk) time and O(n+k) space. Since the second phase is just a left-to-right pass over the λ′_{k} array and since the construction of the suffix trees also takes O(n) time and space, our algorithm takes O(nk) time overall, i.e., linear time for a fixed k.
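The two phases described above can be sketched in Python as follows. This is a simplified illustration written from the prose description, not a transcription of Algorithm 1: anchors and extensions are found by scanning the strings directly rather than through LCP queries on GST_{f} and GST_{r} (so the sketch is quadratic, not linear), only the first best anchor is used, positions are 0-based, and all function names are hypothetical.

```python
def extend(x, i, y, j, k, step):
    """Match lengths reached from (i, j), moving by step (+1 forward, -1
    backward), with at most 0, 1, ..., k mismatches; returns k + 1 lengths."""
    lengths, length, mismatches = [], 0, 0
    while 0 <= i + step * length < len(x) and 0 <= j + step * length < len(y):
        if x[i + step * length] != y[j + step * length]:
            lengths.append(length)  # best length using `mismatches` mismatches
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    while len(lengths) < k + 1:     # ran off a string end: budgets left over
        lengths.append(length)
    return lengths


def anchor_position(x, i, y):
    """First position in y with the longest exact match to x[i:]."""
    return max(range(len(y)), key=lambda q: extend(x, i, y, q, 0, +1)[0])


def approx_lambda(x, y, k):
    """Approximate lambda_k by forward and backward extension of LCP anchors."""
    n = len(x)
    lam = [0] * n
    # Phase 1: extend each anchor forward (L_f) and backward (L_r).
    for i in range(n):
        j = anchor_position(x, i, y)
        fwd = extend(x, i, y, j, k, +1)
        bwd = extend(x, i - 1, y, j - 1, k, -1)
        for t in range(k + 1):
            # t mismatches spent backward, k - t forward; the combined match
            # starts bwd[t] positions to the left of i.
            start = i - bwd[t]
            lam[start] = max(lam[start], bwd[t] + fwd[k - t])
    # Phase 2: dropping the first character of a match leaves a match that is
    # one character shorter, so each entry is at least its predecessor minus 1.
    for i in range(1, n):
        lam[i] = max(lam[i], lam[i - 1] - 1)
    return lam


if __name__ == "__main__":
    # Two candidate regions in y: the longest exact match for x[0:] lies in
    # the first, while the better 1-mismatch match lies in the second.
    print(approx_lambda("AATCGGT", "AATGGGAAACCGGT", 1))
```

Every value written to the array is the length of a real match with at most k mismatches, so the result never exceeds the exact λ_{k}(i); and since the t=0 case followed by the second phase reproduces the forward-only extension, the result is never smaller than the forward-only greedy value computed from the same anchor.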
Implementation details
We implemented our algorithm in the Rust programming language [14] and used the libdivsufsort library [15] to construct the suffix array data structures.
Our algorithm requires only the computation of LCP queries, which can be answered using only generalized suffix arrays along with the longest-common-prefix (LCP) arrays and range minimum query (RMQ) data structures. The use of suffix and LCP arrays instead of the suffix trees significantly reduces the memory footprint of our implementation.
Whenever there are multiple options to extend the LCP match (i.e., there are multiple locations in which a common substring can be equal in length and be the longest), we apply the heuristic discussed earlier in this section to all possible locations and select the longest among them. In the worst case this can make the implementation take O(n^{2}k) time, but in practice it takes O(nkz) time, where z is the average number of maximal matches to a substring in Y starting at a position i in X.
The core computation of the algorithm demands multiple LCP queries. However, we observed that in most cases, walking through the text to identify the longest common prefix is faster than evaluating the LCP query. This is because of two reasons: (a) modern CPU architectures include multiple hierarchies of caches that enable faster access to data elements located within relatively close storage locations, and (b) most practical datasets have a distribution of relatively short LCP lengths. Therefore, in all our experiments reported in the next section, except for the case when the LCP is 0, we walk the text to identify the longest common prefixes. When the LCP is 0, i.e., the case when there is no suffix match in Y for a suffix X_{i} or vice versa, we estimate the k-mismatch LCP by the approximation computed for the suffix X_{i+1}.
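The combination of suffix array, LCP array and RMQ can be sketched in Python as follows. The sketch makes several assumptions that differ from the actual implementation: the suffix array is built by naive sorting rather than with libdivsufsort, the LCP array uses the algorithm of Kasai et al., the RMQ is a sparse table, and the separator characters `\x01` and `\x00` stand in for $_{1} and $_{2}. All class and function names are hypothetical.

```python
def suffix_array(t):
    """Naive suffix array by sorting suffixes (for illustration only)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp_array(t, sa):
    """Kasai et al.: lcp[r] is the LCP of suffixes sa[r-1] and sa[r]."""
    n = len(t)
    rank = [0] * n
    for r, s in enumerate(sa):
        rank[s] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp, rank


class SparseTableMin:
    """Range minimum queries in O(1) after O(n log n) preprocessing."""

    def __init__(self, values):
        self.table = [list(values)]
        j = 1
        while (1 << j) <= len(values):
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append([min(prev[i], prev[i + half])
                               for i in range(len(values) - (1 << j) + 1)])
            j += 1

    def query(self, lo, hi):
        """Minimum of values[lo:hi]; requires hi > lo."""
        j = (hi - lo).bit_length() - 1
        return min(self.table[j][lo], self.table[j][hi - (1 << j)])


class LcpIndex:
    """Answers LCP queries between suffixes of x and suffixes of y."""

    def __init__(self, x, y):
        self.t = x + "\x01" + y + "\x00"   # separators assumed absent from x, y
        self.offset = len(x) + 1           # start of y inside the concatenation
        sa = suffix_array(self.t)
        self.lcp, self.rank = lcp_array(self.t, sa)
        self.rmq = SparseTableMin(self.lcp)

    def lcp_query(self, i, j):
        """Length of the longest common prefix of x[i:] and y[j:] (0-based)."""
        a, b = sorted((self.rank[i], self.rank[self.offset + j]))
        return self.rmq.query(a + 1, b + 1)


if __name__ == "__main__":
    index = LcpIndex("CATTGCATACGA", "ATGGATCCAATAG")
    print(index.lcp_query(3, 1))
```

The LCP of two suffixes equals the minimum LCP-array entry between their ranks in the suffix array, which is why a range minimum structure is sufficient and no explicit tree is needed.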
Results and discussion
All the experiments were run on a system having two 2.4 GHz 14-core Intel E5-2680 v4 processors and 256 GB of main memory, and running the Red Hat Enterprise Linux (RHEL) 7.0 operating system. Along with our implementation, we also ran kmacs [10] and ALFRED-G [13] for comparison. kmacs and ALFRED-G were compiled using gcc compiler version 8.3.0. Our implementation was compiled using rust compiler version 1.3.6.
To evaluate the runtime, the relative accuracy and the effectiveness of our proposed algorithm, we used four real datasets – Primates, Roseobacter, BAliBASE and E. coli, all of which have been previously used to evaluate alignment-free techniques to estimate sequence similarity [10, 13, 16].
Primates is a DNA sequence dataset collected from eukaryotic organisms and has 27 primate mitochondrial genomes with a total length of ≈450 kilobases. The reference phylogeny tree for this dataset was constructed based on a multiple sequence alignment of the 27 sequences.
The Roseobacter dataset is a set of prokaryotic DNA sequences with a total length of ≈875 kilobases, collected from the coding regions of 32 Roseobacter genomes as described in [13]. The BAliBASE dataset is a collection of 218 protein sequence datasets of total length ≈2.5 megabases, gathered from BAliBASE V3.0 [17], a popular benchmark for evaluating multiple sequence alignment algorithms.
The E. coli dataset is a collection of 29 whole genomes of E. coli/Shigella strains, originally compiled by [16]. This dataset has a total length of 138 megabases, in which the size of the sequences ranges from 4.3 megabases to 5.4 megabases.
For the Roseobacter dataset, we used the phylogenetic tree presented in [18] as the reference tree. In the case of the BAliBASE datasets, the reference trees are constructed from the corresponding reference alignments using the proml program available in PHYLIP [19], which implements the Maximum Likelihood method. For the E. coli dataset, we used the phylogenetic tree presented in [16] as the reference tree.
We conducted experiments on all the software to evaluate (i) the accuracy of the estimated ACS_{k}, (ii) runtime characteristics, and (iii) applicability of our algorithm to phylogeny reconstruction.
To estimate the accuracy of the ACS_{k} estimated using our heuristic, we approximate ACS_{k} for every pair of input sequences in the Primates and Roseobacter datasets, using our proposed heuristic with k=1,…,5. We then used the ALFRED software published by [20] to find the true value of ACS_{k}, computed the error percentage of the estimated ACS_{k} compared to the true values, and finally plotted the average error percentages against increasing values of k. Figures 1a and b illustrate the average deviation of the approximate values computed by the respective software from the exact value of ACS_{k} for the Primates and Roseobacter datasets respectively.
Figures 1a and b show that the error percentage for our proposed method is less than that of kmacs in all the cases. Specifically, in the case of kmacs, the error percentages can be as high as 80% with k=4 for the Primates dataset, whereas our method shows a deviation of less than 40%. Even though the error percentages increase as k increases for both kmacs and our algorithm, the error rate grows more slowly with increasing k for our method than for kmacs.
It can also be observed in Figs. 1a and b that compared to both kmacs and our method, ALFRED-G has a much lower error percentage. However, in terms of runtime, our method runs 1.5–2.5X faster than ALFRED-G, as illustrated by the runtime plots in Figs. 2a, b and c. As expected, the runtime grows approximately linearly as k increases. In contrast, the timings for the exact method grow exponentially: for the Primates dataset they range from 8.7 seconds for k=1 to 2202.83 seconds for k=5. Similar behavior is observed for the Roseobacter dataset, with timings ranging from 15.04 seconds for k=1 to 800.11 seconds for k=5 (Fig. 3).
For the Primates dataset, we also ran the software MissMax, a heuristic alignment-free method for sequence comparisons developed in [21]. While the method produced an error rate of at most 0.52% with respect to exact values, it is ≈145–185X slower than ALFRED-G.
For the E. coli full genomes dataset, the timings are shown in Fig. 2d. Note that the runtime is shown in hours as compared to seconds for the other datasets, and only for kmacs and our method, since ALFRED-G failed to complete its run in the allotted time of 72 hours.
To test the effectiveness of our approach for phylogeny construction in comparison to kmacs and ALFRED-G, we also constructed the phylogeny trees using our method as follows:

1. For every pair of input sequences in a dataset, compute λ′_{k}(·) both for X w.r.t. Y and for Y w.r.t. X.
2. Using the approximate ACS_{k}(X,Y) and ACS_{k}(Y,X) estimated from λ′_{k}(·), compute the sequence similarity measure defined by Eq. 1.
3. Construct the symmetric distance matrix with entries filled with the sequence similarity measures computed in the previous step. This matrix is of size 27×27, 32×32 and 29×29 for the Primates, Roseobacter, and E. coli datasets respectively.
4. Reconstruct the phylogenetic tree using the neighbor program in the PHYLIP software suite [19] with the distance matrix as its input. The neighbor program constructs the phylogeny tree with the Neighbor Joining methodology.
5. Compute the Robinson–Foulds (RF) distance w.r.t. the reference tree using the treedist program in the PHYLIP [19] software suite. Note that the lower the RF distance, the better the match in topology between two trees. If the RF distance is zero, then there is an exact match between the two trees.
6. Repeat the above steps with k=1,…,10 for the Primates, Roseobacter and BAliBASE datasets, and with k=1,…,7 for the E. coli dataset.
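The matrix construction and hand-off to PHYLIP in the steps above can be sketched as follows. The Python sketch assumes the square distance-matrix layout read by the neighbor program (the number of taxa on the first line, then one row per taxon with the name padded to ten characters). The function names are hypothetical, and the toy per-position mismatch fraction used in the usage example is only a placeholder for the ACS_{k}-based measure.

```python
from itertools import combinations


def distance_matrix(seqs, distance):
    """Symmetric matrix of pairwise distances.

    seqs maps a taxon name to its sequence; distance is any symmetric
    function of two sequences (here a stand-in for the ACS_k-based measure).
    """
    names = list(seqs)
    n = len(names)
    d = [[0.0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        d[a][b] = d[b][a] = distance(seqs[names[a]], seqs[names[b]])
    return names, d


def to_phylip(names, d):
    """Render the matrix in PHYLIP's square distance-matrix format."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, d):
        # Names are truncated or padded to exactly ten characters.
        lines.append(f"{name[:10]:<10}" + " ".join(f"{v:.6f}" for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"taxonA": "ACGT", "taxonB": "ACGA", "taxonC": "TTTT"}
    # Placeholder distance: fraction of differing positions.
    mismatch_fraction = lambda x, y: sum(p != q for p, q in zip(x, y)) / len(x)
    names, d = distance_matrix(toy, mismatch_fraction)
    print(to_phylip(names, d), end="")
```

The resulting text can be saved to a file and supplied to neighbor as its input; the tree it produces is then compared with the reference tree using treedist.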
For the Primates dataset, the Robinson–Foulds (RF) distance of the reconstructed tree with respect to the reference tree is 0 for k=5 and 2–4 for other values of k (Fig. 4). For the same dataset kmacs reported an RF distance of 2–8, whereas ALFRED-G reported an RF distance of 0–2 [13]. Similar to ALFRED-G, our method was able to recover the expected phylogenetic tree for Primates. For the Roseobacter dataset, the RF distances of our algorithm are in the range of 20–8, and as the value of k increases from 1 to 10, the RF distance tends to decline. kmacs and ALFRED-G reported RF distances, for the Roseobacter dataset, in the range of 18–10 and 18–8 respectively [13]. For the BAliBASE datasets, all three methods reported average RF distances in the same range of 31–26. For the E. coli dataset, both kmacs and our method reported RF distances in the same range of 24–26, while ALFRED-G was not able to complete within the allotted time limit of 24 hours per pair. Both kmacs and our method were able to complete their runs in less than 11 hours for k=5.
Finally, to evaluate the scalability of our algorithm in its ability to process genome-length sequences, we ran both kmacs and our method on the full genome sequences of 14 plant species. This dataset was originally compiled by [22] and is of total size 4.5 gigabases, with the sequence lengths ranging from 111 megabases to 746 megabases. For the 50 pairs of sequences that both kmacs and our method were able to process in the allotted time limit of 72 hours per pair, our method was able to complete the runs in an average of 1.56 hours per pair, whereas kmacs took an average of 4.67 hours per pair. This discrepancy in time is due to the difference in how kmacs and our method process the suffixes whose LCP is 0, which happens more often in longer genomes. Neither ALFRED-G nor the exact method is capable of processing even the smallest of the sequences of this dataset.
To summarize, our proposed method provides results more accurate than kmacs for the Primates and Roseobacter datasets, while being competitive in runtime compared to kmacs and much faster than ALFRED-G. In the case of the Primates dataset, our method was able to recover the reference tree for k=5. For the BAliBASE and E. coli datasets, the results are comparable to those of kmacs. With respect to scalability, our method shows considerable improvement over kmacs for longer full genomes that are a few hundred megabases long.
Conclusions
In this paper, we presented a novel linear-time heuristic to compute the alignment-free measure of sequence similarity ACS_{k}. We evaluated the accuracy of the ACS_{k} estimated from the proposed heuristic and demonstrated its applicability in the construction of phylogeny trees.
We plan to extend this heuristic in the future in two different ways. Currently, all the published heuristics, including the one introduced in this work, can handle only mismatches and not insertions or deletions. We plan to adapt the proposed algorithm such that it allows insertions and deletions, where the key challenge is managing the varying lengths of matched segments. Another way we plan to develop this heuristic is to enable forward and backward extensions on a 1-mismatch anchor segment.
Availability of data and materials
Datasets are available at http://alurulab.cc.gatech.edu/phylo and http://afproject.org/app/ and the code is available at https://github.com/srirampc/adyarrs
Abbreviations
ACS: Average Common Substring
ACS_{k}: Average Common Substring with k mismatches
UPGMA: Unweighted Pair Group Method with Arithmetic Mean (for phylogeny reconstruction)
NJ: Neighbor-joining method for phylogeny reconstruction
LCP: Longest Common Prefix
LCP_{k}: Longest Common Prefix while allowing k mismatches
GST: Generalized Suffix Tree
RMQ: Range minimum query
References
1. Vinga S, Almeida J. Alignment-free sequence comparison–a review. Bioinformatics. 2003; 19(4):513–23.
2. Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017; 18(1):186.
3. Sokal RR. A statistical method for evaluating systematic relationship. Univ Kans Sci Bull. 1958; 28:1409–38.
4. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987; 4(4):406–25.
5. Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J Mol Evol. 2004; 58(1):1–11.
6. Sims GE, Jun SR, Wu GA, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci. 2009; 106(8):2677–82.
7. Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res. 2017; 45(W1):554–9.
8. Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister CA, Morgenstern B. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 2014; 42(W1):7–11.
9. Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006; 13(2):336–50.
10. Leimeister CA, Morgenstern B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014; 30(14):2000–8.
11. Aluru S, Apostolico A, Thankachan SV. Efficient alignment free sequence comparison with bounded mismatches. In: International Conference on Research in Computational Molecular Biology. Springer: 2015. p. 1–12.
12. Thankachan SV, Aluru C, Chockalingam SP, Aluru S. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: International Conference on Research in Computational Molecular Biology. Springer: 2018. p. 211–24.
13. Thankachan SV, Chockalingam SP, Liu Y, Krishnan A, Aluru S. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics. 2017; 18(8):238.
14. Matsakis ND, Klock II FS. The rust language. In: ACM SIGAda Ada Letters. ACM: 2014. p. 103–4.
15. Mori Y. Libdivsufsort. 2006. https://github.com/y256/libdivsufsort. Accessed on 9 Sept 2020.
16. Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013; 41(7):75.
17. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins Struct Funct Bioinforma. 2005; 61(1):127–36.
18. Newton RJ, Griffin LE, Bowles KM, Meile C, Gifford S, Givens CE, Howard EC, King E, Oakley CA, Reisch CR, et al. Genome characteristics of a generalist marine bacterial lineage. ISME journal. 2010; 4(6):784–98.
19. Felsenstein J. PHYLIP (phylogeny Inference Package), Version 3.5 C: Joseph Felsenstein; 1993.
20. Thankachan SV, Chockalingam SP, Liu Y, Apostolico A, Aluru S. ALFRED: a practical method for alignment-free distance computation. J Comput Biol. 2016; 23(6):452–60.
21. Pizzi C. MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol Biol. 2016; 11(1):6.
22. Hatje K, Kollmar M. A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method. Frontiers Plant Sci. 2012; 3:192.
Acknowledgements
We thank the reviewers of the preliminary version of this article for their helpful comments.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume21supplement6.
Funding
The funding for publication of the article was by the U.S. National Science Foundation grants CCF1704552 and CCF1703489. The funding was used to develop, implement and evaluate the proposed algorithms. The funding body did not play any role in the design and implementation of the algorithms and in writing the manuscript.
Author information
Contributions
SC wrote the manuscript, worked on improving the performance of the implementation and performed the experiments; JP wrote the initial implementation and performed some of the experiments; SH performed some of the experiments; ST conceived the algorithm; SA conceptualized the study. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Chockalingam, S.P., Pannu, J., Hooshmand, S. et al. An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction. BMC Bioinformatics 21, 404 (2020). https://doi.org/10.1186/s12859020037385
DOI: https://doi.org/10.1186/s12859020037385
Keywords
 Alignment-free methods
 Sequence comparison
 Phylogeny reconstruction