Skip to main content

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

Abstract

Background

Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n logkn) time and hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced.

Results

In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.

Conclusions

Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.

Background

Over the past two decades, many similarity measures based on alignment-free methods have been proposed for sequence comparison for a diverse range of bioinformatics applications. With the increasing availability of sequence data from multiple sources and as alignment algorithms are reaching their limits, many of these alignment-free methods have become popular in applications such as phylogeny reconstruction, sequence clustering, transcript quantification and detection of horizontal gene transfers [1, 2].

For phylogeny reconstruction, alignment-free methods are used to construct the pairwise distance matrix, a symmetric matrix of sequence similarity measures computed for every pair in the given set of sequences. With the distance matrix as their input, algorithms such as unweighted pair group method with arithmetic mean (UPGMA) [3] or neighbor-joining (NJ) [4] construct the desired tree.

Alignment-free methods for computation of similarity measures can be classified based on whether the seeds are exact or approximate and whether the seeds are of fixed- or variable-length. The most popular among the fixed-length exact seed methods are kmer-based techniques, which proceed by first constructing the sets of all the kmers (kmers are fixed-length exact seeds of length k) of a pair of sequences, followed by the estimation of a similarity measure either based on the kmer frequency profile (Eg. Euclidean distance, CVTree [5],FFP [6]) or based on the intersection/differences of the kmer sets (Eg. Jaccard coefficient). Lu et al. [7] presents a comprehensive review of about 28 different such measures typically used in the construction of phylogeny trees. Methods using approximate fixed-length such as spaced-seeds approaches [8] allow the use of kmers with mismatches at specific locations and make use of multiple patterns to improve accuracy.

Among the variable-length seeding methods, one of the measures shown to be effective in phylogeny applications is the average common substring, ACS, which is computed for a pair of sequences as the mean of the lengths of the longest common prefixes [9]. After computing the ACS, the sequence similarity of two sequences X and Y is computed as follows:

$$ d(\mathsf{X}, \mathsf{Y}) = \frac{1}{2} \left(\frac{\log |\mathsf{Y}|}{\mathsf{ACS}(\mathsf{X}, \mathsf{Y})} + \frac{\log |\mathsf{X}|}{\mathsf{ACS}(\mathsf{Y}, \mathsf{X})} \right) - \left(\frac{\log |\mathsf{X}|}{|\mathsf{X}|} + \frac{\log |\mathsf{Y}|}{|\mathsf{Y}|} \right) $$
(1)

By introducing k mismatches into the average common substring metric (abbreviated as ACSk), Leimeister and Morgenstern [10] demonstrated improved accuracy for phylogeny applications. However, their approach, called kmacs, uses a greedy heuristic as an approximation of ACSk since computing exact ACSk is computationally expensive and was shown to take O(n logkn) for a pair of sequences of total length n [11]. In a later work, Thankachan et. al. [12] also proved that the runtime bounds remain O(n logkn) even when insertions and deletions allowed along with mismatches. Based on [11, 13] presented another greedy heuristic to approximate ACSk.

In this work, we present a novel linear-time heuristic that is a more accurate approximation of ACSk than kmacs’ approach. While kmacs constructs an ACSk approximation by means of a forward extension of the longest common prefixes, our algorithm performs both forward and backward extensions to identify a k-mismatch common substring of longer length, and hence, producing a closer approximation to the exact ACSk. Using three real datasets, we evaluate the runtime, accuracy and the effectiveness of our proposed approach. We also demonstrate its applicability for phylogeny tree construction.

Methods

Notations and preliminaries

Let X and Y be two sequences drawn from the alphabet set Σ. We denote the length of X by |X|, the suffix of X starting at the position i as Xi. Also, we use \(\overleftarrow {\mathsf {X}}\) and \(\overleftarrow {\mathsf {Y}}\) to denote the reverse of the strings X and Y respectively.

Let |X|+|Y|=n. We define LCP(Xi,Yj) to be the longest common prefix of Xi that matches with Yj and LCPk(Xi,Yj) its k-mismatch counterpart i.e., a longest common prefix that allows k mismatches, k≥0 (also termed as the longest k-mismatch substring starting at Xi). We denote maxj|LCPk(Xi,Yj)| by λk(i) and the position in Y corresponding to the λk(i)-length match as μk(i) i.e.,

$$\mu_{k}(i) = \arg \max_{j} \left|\mathsf{LCP}_k \left(X_{i}, Y_{j}\right)\right|, k \geq 0. $$

For the sake of brevity, we abbreviate λ0(i) and μ0(i) as λ(i) and μ(i) respectively.

The average common substring, ACS of X w.r.t. Y is defined as

$$ \mathsf{ACS}\left(\mathsf{X}, \mathsf{Y}\right) = \frac{1}{|\mathsf{X}|}\sum_{i=1}^{|\mathsf{X}|} \max_{j} \left|\mathsf{LCP}\left(\mathsf{X}_{i}, \mathsf{Y}_{j}\right)\right|. $$
(2)

ACSk(X,Y),k≥0 of X w.r.t. Y is defined similarly with LCPk instead of LCP in the above equation. Note that ACSk(X,Y)≠ACSk(Y,X).

We use GSTf and GSTr to denote the generalized suffix tree constructed for the concatenated strings \(\mathsf {T} = \mathsf {X}\$_{1}\mathsf {Y}\$_{2} \text {and} \overleftarrow {\mathsf {T}} = \overleftarrow {\mathsf {X}}\$_{1}\overleftarrow {\mathsf {Y}}\$_{2}\) respectively, where $1,$2Σ. For our algorithm, GSTf and GSTr serve as an indexing data structures that enable us to perform longest common prefix queries for X and Y in constant time. Both GSTf and GSTr can be constructed in O(n) time with O(n) space.

Previous greedy heuristics

Using the notations described above, ACSk(X,Y) is computed as \(\mathsf {ACS}_{k}(\mathsf {X}, \mathsf {Y}) = \sum _{i=1} \lambda _{k}(i) / |\mathsf {X}|\). The key difficulty in computing ACSk is the estimation of the array λk(i),i=1,…,|X|. Before we present our linear-time approximate algorithm for computing λk, we briefly discuss the previously established heuristic methods for approximating λk.

In kmacs [10], the previously published greedy approach, λk(i) is approximated by extending the longest common prefixes. kmacs uses the longest common substring of suffixes Xi and Yq,q= arg maxj|LCP(Xi,Yj)| as the initial anchor segment, then performs a forward extension to identify the common substring with k−1 mismatches and approximates the total length as λk(i). For example, if X and Y are the strings CATTGCATACGA and ATGGATCCAATAG respectively, then to compute an approximation to λ2(4), kmacs would first identify the LCP match of X4 at Y2 and then approximate λ2(4) as 6 by matching the segments TGCATA and TGGATC. Formally, kmacs computes the following measure as an approximation of λk(i):

$$\lambda(i) + 1 + \left|\mathsf{LCP}_{k-1} \left(\mathsf{X}_{i + \lambda(i) + 1}, \mathsf{Y}_{\mu(i) + \lambda(i) + 1}\right)\right|. $$

Using a generalized suffix tree constructed for X and Y, the above measure can be calculated in O(k) time via k consecutive LCP queries starting with Xi and Yμ(i). Therefore, the similarity metric based on the above heuristic measure can be computed in O(nk) time.

ALFRED-G [13] follows a similar logic except that it includes an extra mismatch in the initial anchor segment. Formally, ALFRED-G approximates λk(i) with the following measure:

$$\lambda_{1}(i) + 1 + \left|\mathsf{LCP}_{k-2} \left(\mathsf{X}_{i + \lambda_{1}(i) + 1}, \mathsf{Y}_{\mu_{1}(i) + \lambda_{1}(i) + 1}\right)\right|. $$

Proposed algorithm

In our algorithm, we make use of the following observation: a k-mismatch common substring of two suffixes Xi and Yj includes k−1 common substrings separated by k mismatch characters. This observation leads to the following key intuition behind our algorithm – the anchor segment can be any one of the k−1 segments that constitute a k-mismatch common string. As mentioned above, both kmacs and ALFRED-G consider the first segment as the anchor segment. Our heuristic, denoted by λk′(i), is computed by extending all k−1 matching substrings that overlap the position i, as anchors.

We illustrate our approach with the following example. Consider the three suffixes Xp=AATCGGT...,Yq=AATGGGA... and Yr=AACCGGT..., and let μ(p)=q;μ(p+3)=r+3. A greedy heuristic based algorithm such as kmacs, which uses the LCP to find anchor segments, will select Yq as the anchor point and approximate λ1(p) as 5, even though there is a better match at Yr. Extending the segments backward overcomes this limitation in this example because a backward extension from the Yr+3=CGG... segment from Xp+3 can identify Yr to be the better match for Xp.

Algorithm 1 presents the pseudo-code for our proposed heuristic. It takes as input strings X and Y, and outputs an array λk′ of length |X|, whose ith entry contains the approximation for λk(i).

After constructing the two generalized suffix trees GSTf and GSTr and initializing the λk′(i) entries (Lines 1– 3), the algorithm proceeds in two phases. In the first phase, we compute the forward and backward extensions of the longest common substring for each position in X (Lines 4– 25). Here, we make use of two arrays Lf and Lr, each of length k+1, which contain the lengths of the 0,1,2…,k-mismatch substrings starting and ending at position i respectively. Lf and Lr can be computed via k LCP queries on GSTf and GSTr respectively (Lines 13 and 21). After computing the Lf and Lr arrays, we update λk′ arrays for the k+1 possible positions corresponding to all the possible forward and backward extensions(Lines 22– 25).

In some cases, the approximation computed at position i in the first phase can be improved by examining the λk′(i−1) entry, if the ithe position doesn’t correspond to a mismatch character of the preceding entry. In the second phase, we update those entries for whom a better approximation is one less than the preceding entry in λk′ (Lines 26– 29).

Since LCP queries take constant time using GSTs, the first phase can be accomplished in O(nk) time and O(n+k) space. Since the second phase is just a left to right pass over the λk′ array and since the construction of the suffix trees also take O(n) time and space, our algorithm takes linear time.

Implementation details

We implemented our algorithm in the Rust progamming language [14] and used the libdivsufsort library [15] to construct the suffix array data structures.

Our algorithm requires only the computation of LCP queries, which can be done only using generalized suffix arrays along with the longest-common-prefix (LCP) arrays and range minimum query (RMQ) data structures. The use of suffix and LCP arrays instead of the suffix trees significantly reduces the memory footprint for our implementation.

Whenever there are multiple options to extend the LCP match (i.e., there are multiple locations in which a common substring can be equal in length and be the longest), we apply the heuristic discussed earlier in this section to all possible locations and select the longest among them. In the worst case this can make the implementation to take O(n2k) time but in practice, it takes O(nkz) time, where z is the average number of maximal matches to a substring in Y starting at a position i in X.

The core computation of the algorithm demands multiple LCP queries. However, we observed that in most cases, the time taken to walk through the text to identify is faster compared to evaluating the LCP query. This is because of two reasons (a) modern CPU architectures include multiple hierarchies of caches, that enable faster access to data elements that are located with in relatively close storage locations and (b) most practical datasets have a distribution of relatively short LCP lengths. Therefore, in all our experiments reported in the next section, except for the case when LCP is 0, we walk the text to identify the longest common prefixes. When the LCP is 0 i.e., the case when there is no suffix match in Y for a suffix Xi or vice-versa, we estimate the k-mismatch LCP by the approximation computed for the suffix Xi+1.

Results and discussion

All the experiments were run on a system having two 2.4 GHz 14-Core Intel E5-2680 V4 processors and 256 GB of main memory, and running RedHat Enterprise Linux (RHEL) 7.0 operating system. Along with our implementation, we also ran kmacs [10] and ALFRED-G [13] for comparison. kmacs and ALFRED-G were compiled using gcc compiler version 8.3.0. Our implementation was compiled using rust compiler version 1.3.6.

To evaluate the runtime, the relative accuracy and the effectiveness of our proposed algorithm, we used four real datasets – Primates, Roseobacter, BAliBASE and E. coli, all of which have been previously used to evaluate alignment-free techniques to estimate sequence similarity [10, 13, 16].

Primates is a DNA sequence dataset collected from prokaryotic organisms and has 27 primate mitochondrial genomes with a total length of ≈450 kilobases. The reference phylogeny tree for this dataset was constructed based on multiple sequence alignment the 27 sequences.

Roseobacter dataset is a set of eukaryotic DNA sequences with a total length of ≈875 kilobases, collected from the coding regions of 32 Roseobacter genomes as described in [13]. BAliBASE dataset is a collection of 218 protien sequence datasets of total length ≈2.5 megabases, gathered from the BAliBASE V3.0 [17], a popular benchmark for evaluating multiple sequence alignment algorithms.

E. coli dataset is a collection of 29 whole genomes of E. coli/Shigella strains, originally compiled by [16]. This dataset has a total length of 138 megabases in which the size of the seqeunces range from 4.3 megabases to 5.4 megabases.

For the Roseobacter dataset, we used the phylogenetic tree presented in [18] as the reference tree. In case of BAliBASE datasets, the reference trees are constructed from the corresponding reference alignments using the proml program available in PHYLIP [19], which implements the Maxmimum Likelihood method. For the E. coli datasets, we used the phylogenetic tree presented in [16] as the reference tree.

We conducted experiments on all the software to evaluate (i) the accuracy of the estimated ACSk, (ii) runtime characteristics, and (iii) applicability of our algorithm to phylogeny reconstruction.

To estimate the accuracy of the ACSk estimated using our heuristic, we approximate ACSk, for every pair of input sequence in the Primates and Roseobacter datasets, using our proposed heuristic with k=1,…,5. We, then, used the ALFRED software published by [20] to find the true value of ACSk, computed the error percentage of estimated ACSk compared to the true values, and finally plotted the average error percentages against increasing values of k. Figures 1a and b illustrate the average deviation of the approximate values computed by the respective software from the exact value of ACSk for Primates and Roseobacter datasets respectively.

Fig. 1
figure1

Avg. error percentage of estimated ACSk w.r.t. exact ACSk

Figures 1a and b show that the error percentage for our proposed method is less than that of kmacs in all the cases. Specifically, in case of kmacs, the error percentages can be as high as 80% with k=4 for the Primates datasets, where as our method shows a deviation of less than 40%. Even though the error percentages increases as k increases for both kmacs and our algorithm, compared to kmacs, the error rate grows relatively slower with increasing k for our method.

It can also be observed in Figs. 1a and b that compared to both kmacs and our method, ALFRED-G has a much lower error percentage. However, in terms of runtime, our method runs 1.5–2.5X faster than that of ALFRED-G as illustrated by the runtime plots in Figs. 2a, b and c. As expected, the runtime grows approximately linearly as k increases. In constrast, the timings for the exact methods grows exponentially for both the Primates ranging from 8.7 seconds for k=1 to 2202.83 seconds for k=5. Similar behavior is observed for the Roseobacter dataset ranging from 15.04 seconds for k=1 to 800.11 seconds for k=5 (Fig. 3).

Fig. 2
figure2

Total runtime of all pairwise comparisons

Fig. 3
figure3

Run time to compute exact ACSk

For the Primates dataset, we also ran the software MissMax, a heuristic alignment-free method for sequence comparisons developed in [21]. While the method produced an error rate of at most 0.52% with respect to exact values, its runtime is ≈145 – 185X slower than that of ALFRED-G.

For the E. coli full genomes dataset, the timings are shown in Fig. 2d. Note that the runtime is shown in hours as compared to seconds for the other datasets and only for kmacs and our method since ALFRED-G failed to complete its run in the allotted time of 72 hours.

To test the effectiveness of our approach for phylogeny construction in comparison to kmacs and ALFRED-G, we also constructed the phylogeny trees using our method as follows:

  1. 1

    For every pair of input sequence in a dataset, compute λk′(·) both for X w.r.t. Y and for Y w.r.t. X.

  2. 2

    Using approximate ACSk(X,Y) and ACSk(Y,X) estimated from λk′(·), compute the sequence similarity measure defined by Eq. 1.

  3. 3

    Construct the symmetric distance matrix with entries filled with sequence similarity measures computed in the previous step. This matrix is of size 27×27,32×32 and 29×29 for Primates, Roseobacter, and E. coli datasets respectively.

  4. 4

    Reconstruct phylogenetic tree using the neighbor program in the PHYLIP software suite [19] with the distance matrix as its input. The neighbor program constructs the phylogeny tree with Neighbor Joining methodology.

  5. 5

    Compute the Robinson–Foulds distance (R-F) distance w.r.t the reference tree using the treedist program in the PHYLIP [19] software suite. Note that lower the R-F distance, better the matching of topology between two trees. If RF distance is zero, then there is an exact match between the two trees.

  6. 6

    Repeat the above steps with k=1,…,10 for Primates, Roseobacter and BAliBASE datasets, and with k=1,…,7 for the E. coli dataset.

For the Primates dataset, the Robinson–Foulds (RF) distance of the reconstructed tree with respect to the reference tree is 0 for k=5, 2–4 for other values of k (Fig. 4). For the same dataset kmacs reported an RF distance of 2–8, whereas ALFRED-G reported an RF distance of 0–2 [13]. Similar to ALFRED-G, our method was able to recover the expected phylogenetic tree for Primates. For the Roseobacter dataset, the RF distance of our algorithm are in the range of 20–8, and as the value of k is increases from 1 to 10, the RF distance tends to decline. kmacs and ALFRED-G reported RF distances, for the Roseobacter dataset, in the range of 18–10 and 18–8 respectively [13]. For BAliBASE datasets, all the three methods reported an average RF distances in the same range of 31–26. For E. coli dataset, both kmacs and our method reported RF distances in the same range of 24–26, while ALFRED-G was not able to complete within the alloted time limit of 24 hours per pair. Both kmacs and our method were able to complete their runs in less than 11 hours for k=5.

Fig. 4
figure4

Robinson–Foulds distance with respect to the reference trees

Finally, to evaluate the scalability of our algorithm in its ability to process genome-length sequences, we ran both kmacs and our method on the full genome sequences of 14 plant species. This dataset was originally compiled by [22] and is of total size 4.5 gigabases with the sequence lengths ranging from 111 megabases to 746 megabases. For the 50 pairs of sequences that both kmacs and our method were able to process in the allotted time limit of 72 hours per pair, our method was able to complete the runs in a average of 1.56 hours per pair, where as kmacs took an average of 4.67 hours per pair. This discrepancy in time is due to the difference in how kmacs and our method process the suffixes whose LCP is 0, which happens more often in longer genomes. Neither ALFRED-G nor the exact method is capable of processing smallest of the sequences of this dataset.

To summarize, our proposed method provides results more accurate than kmacs for the Primates and Roseobacter datasets, while being competitive in runtime compared to kmacs and much faster than ALFRED-G. In case of the Primates dataset, our method was able to recover the reference tree for k=5. For BAliBASE and E. coli datasets, the results are comparable to that of kmacs. With repsect to scalability, our method shows considerably improvement over that of kmacs for longer full genomes that are few hundred megabases long.

Conclusions

In this paper, we presented a novel linear-time heuristic to compute the alignment-free measure of sequence similarity ACSk. We evaluate the accuracy of the ACSk estimated from the proposed heuristic and demonstrated its applicability in construction of phylogeny trees.

We plan to extend this heuristic in the future in two different ways. Currently, all the published heuristics, including the one introduced in this work, can handle only mismatches and not insertions or deletions. We plan to adapt the proposed algorithm such that it allows insertions and deletions, where the key challenge is to manage is varying lengths of matched segments. Another way we plan to develop this heuristic is to enable forward and backward extensions on a 1-mismatch anchor segment.

Availability of data and materials

Datasets are available at http://alurulab.cc.gatech.edu/phyloand http://afproject.org/app/and the code is available at https://github.com/srirampc/adyar-rs

Abbreviations

ACS :

Average Common Substring

ACS k :

Average Common Substring with k mismatches

UPGMA:

Unweighted Pair Group Method for Phylogeny reconstruction

NJ:

Neighbor-joining Method for Phylogeny reconstruction

LCP :

Longest Common Prefix

LCP k :

Longest Common Prefix while allowing k mismatches

GST :

Generalized Suffix Tree

RMQ:

Range miniumn query

References

  1. 1

    Vinga S, Almeida J. Alignment-free sequence comparison–a review. Bioinformatics. 2003; 19(4):513–23.

    CAS  Article  Google Scholar 

  2. 2

    Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017; 18(1):186.

    Article  Google Scholar 

  3. 3

    Sokal RR. A statistical method for evaluating systematic relationship. Univ Kans Sci Bull. 1958; 28:1409–38.

    Google Scholar 

  4. 4

    Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987; 4(4):406–25.

    CAS  Google Scholar 

  5. 5

    Qi J, Wang B, Hao B-I. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J Mol Evol. 2004; 58(1):1–11.

    CAS  Article  Google Scholar 

  6. 6

    Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci. 2009; 106(8):2677–82.

    CAS  Article  Google Scholar 

  7. 7

    Lu YY, Tang K, Ren J, Fuhrman JA, Waterman MS, Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res. 2017; 45(W1):554–9.

    Article  Google Scholar 

  8. 8

    Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, Leimeister C-A, Morgenstern B. Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 2014; 42(W1):7–11.

    Article  Google Scholar 

  9. 9

    Ulitsky I, Burstein D, Tuller T, Chor B. The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006; 13(2):336–50.

    CAS  Article  Google Scholar 

  10. 10

    Leimeister C-A, Morgenstern B. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014; 30(14):2000–8.

    CAS  Article  Google Scholar 

  11. 11

    Aluru S, Apostolico A, Thankachan SV. Efficient alignment free sequence comparison with bounded mismatches. In: International Conference on Research in Computational Molecular Biology. Springer: 2015. p. 1–12.

  12. 12

    Thankachan SV, Aluru C, Chockalingam SP, Aluru S. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: International Conference on Research in Computational Molecular Biology. Springer: 2018. p. 211–24.

  13. 13

    Thankachan SV, Chockalingam SP, Liu Y, Krishnan A, Aluru S. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics. 2017; 18(8):238.

    Article  Google Scholar 

  14. 14

    Matsakis ND, Klock II FS. The rust language. In: ACM SIGAda Ada Letters. ACM: 2014. p. 103–4.

  15. 15

    Mori Y. Libdivsufsort. 2006. https://github.com/y-256/libdivsufsort. Accessed on 9 Sept 2020.

  16. 16

    Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013; 41(7):75.

    Article  Google Scholar 

  17. 17

    Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins Struct Funct Bioinforma. 2005; 61(1):127–36.

    CAS  Article  Google Scholar 

  18. 18

    Newton RJ, Griffin LE, Bowles KM, Meile C, Gifford S, Givens CE, Howard EC, King E, Oakley CA, Reisch CR, et al. Genome characteristics of a generalist marine bacterial lineage. ISME journal. 2010; 4(6):784–98.

    CAS  Article  Google Scholar 

  19. 19

    Felsenstein J. PHYLIP (phylogeny Inference Package), Version 3.5 C: Joseph Felsenstein; 1993.

  20. 20

    Thankachan SV, Chockalingam SP, Liu Y, Apostolico A, Aluru S. Alfred: a practical method for alignment-free distance computation. J Comput Biol. 2016; 23(6):452–60.

    CAS  Article  Google Scholar 

  21. 21

    Pizzi C. MissMax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol Biol. 2016; 11(1):6.

    Article  Google Scholar 

  22. 22

    Hatje K, Kollmar M. A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method. Frontiers Plant Sci. 2012; 3:192.

    Article  Google Scholar 

Download references

Acknowledgements

We thank the reviewers of the preliminary version of this article for their helpful comments.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-6.

Funding

The funding for publication of the article was by the U.S. National Science Foundation grants CCF-1704552 and CCF-1703489. The funding was used to develop, implement and evaluate the proposed algorithms. The funding body did not play any role in the design and implementation of the algorithms and in writing the manuscript.

Author information

Affiliations

Authors

Contributions

SC wrote the manuscript, worked on improving the performance of the implementation and performed the experiments; JP wore the initial implementation and performed some of the experiments; SH performed some of the experiments; ST conceived the algorithm; SA conceptualized the study. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Sriram P. Chockalingam.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Chockalingam, S.P., Pannu, J., Hooshmand, S. et al. An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction. BMC Bioinformatics 21, 404 (2020). https://doi.org/10.1186/s12859-020-03738-5

Download citation

Keywords

  • Alignment-free methods
  • Sequence comparison
  • Phylogeny reconstruction