# Fastphylo: Fast tools for phylogenetics

Mehmood Alam Khan^{1, 7} (corresponding author), Isaac Elias^{1, 4}, Erik Sjölund^{2}, Kristina Nylander^{3}, Roman Valls Guimera^{2}, Richard Schobesberger^{4, 6}, Peter Schmitzberger^{4, 6}, Jens Lagergren^{1} and Lars Arvestad^{1, 5} (corresponding author)

*BMC Bioinformatics* 2013, **14**:334

**DOI: **10.1186/1471-2105-14-334

© Khan et al.; licensee BioMed Central Ltd. 2013

**Received: **15 July 2013

**Accepted: **14 November 2013

**Published: **20 November 2013

## Abstract

### Background

Distance methods are ubiquitous tools in phylogenetics. Their primary purpose may be to reconstruct evolutionary history, but they are also used as components in bioinformatic pipelines. However, poor computational efficiency has been a constraint on the applicability of distance methods on very large problem instances.

### Results

We present fastphylo, a software package containing implementations of efficient algorithms for two common problems in phylogenetics: estimating DNA/protein sequence distances and reconstructing a phylogeny from a distance matrix. We compare fastphylo with other neighbor joining based methods and report the results in terms of speed and memory efficiency.

### Conclusions

Fastphylo is a fast, memory efficient, and easy to use software suite. Due to its modular architecture, fastphylo is a flexible tool for many phylogenetic studies.

## Background

Distance methods are important for phylogenetic inference, as confirmed by the many available algorithms and software implementations [1-12]. The main ambition of several implementation efforts has been to improve computational efficiency, which is essential for any method’s applicability. In particular, the cubic time complexity of Neighbor Joining (NJ) [13] has been an obvious obstacle that several groups have challenged. One of these efforts is Fast Neighbor Joining (FNJ), a quadratic-time algorithm for tree reconstruction presented by Elias and Lagergren [14], who showed that FNJ performs similarly to the canonical NJ method. FNJ modifies the NJ selection function for joining a pair of sequences and introduces the concept of a *visibility set* to avoid redundant computation, giving a significant improvement in speed with accuracy similar to that of NJ when computing the phylogenetic tree. This paper presents fnj, a fast and practical implementation of the FNJ algorithm.
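To illustrate the selection step, the following minimal sketch computes the canonical NJ criterion Q(i, j) = (n - 2) d(i, j) - R_i - R_j, where R_i is the sum of row i, and then restricts the search to one "visible" partner per taxon. It is an illustration only, not the fnj implementation: the function names are hypothetical, and the visible pairs are recomputed from scratch here, whereas the point of FNJ [14] is to maintain them across iterations so that the total running time stays quadratic.

```python
from itertools import combinations


def row_sums(dist):
    """R_i: sum of distances from taxon i to all other taxa."""
    return [sum(row) for row in dist]


def nj_selection(dist, sums, i, j):
    """Canonical NJ criterion Q(i, j) = (n - 2) * d(i, j) - R_i - R_j."""
    return (len(dist) - 2) * dist[i][j] - sums[i] - sums[j]


def best_pair_exhaustive(dist):
    """Scan all O(n^2) pairs, as plain NJ does in every iteration."""
    sums = row_sums(dist)
    return min(combinations(range(len(dist)), 2),
               key=lambda p: nj_selection(dist, sums, *p))


def visible_pairs(dist):
    """For each taxon a, keep only the partner b that minimises Q(a, b)."""
    sums = row_sums(dist)
    n = len(dist)
    return [(a, min((b for b in range(n) if b != a),
                    key=lambda b: nj_selection(dist, sums, a, b)))
            for a in range(n)]


def best_pair_visible(dist):
    """Pick the best pair among the n visible pairs instead of all pairs."""
    sums = row_sums(dist)
    a, b = min(visible_pairs(dist),
               key=lambda p: nj_selection(dist, sums, *p))
    return (min(a, b), max(a, b))


# Small illustrative distance matrix on five taxa (made-up values).
D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
print(best_pair_exhaustive(D), best_pair_visible(D))
```

On a freshly computed set of visible pairs the two searches agree, because the globally best pair is by definition the best partner of each of its two members.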

A sometimes overlooked issue in distance-based method development is that computing the distance matrix, the input to tree reconstruction algorithms, is the real computational bottleneck. With *n* sequences of length *l*, one cannot do better than *O*(*ln*^{2}) time for estimating a distance matrix. Since *l* is rarely smaller than *n*, the distance computations effectively have cubic time complexity, and there is therefore little to gain from efficient tree reconstruction alone.
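As a rough worked count (constant factors ignored, so these are operation counts rather than running times), take the largest dataset-1 family described under Simulated data, with *n* = 10,000 sequences of length *l* = 2,000:

$$
\binom{n}{2}\, l \;=\; \frac{10^{4}\,(10^{4}-1)}{2} \times 2 \times 10^{3} \;\approx\; 1.0 \times 10^{11}
$$

site comparisons are needed for the distance matrix, compared with on the order of $n^{2} = 10^{8}$ steps for a quadratic-time tree reconstruction and $n^{3} = 10^{12}$ for cubic-time NJ. Under these assumptions, the distance computation outweighs a quadratic-time reconstruction by about three orders of magnitude.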

We address this efficiency problem by making the speedup techniques of Elias and Lagergren [15] available in a space-efficient implementation, the fastdist program. With novel substitution-counting algorithms and register-based bit-fiddling in 128-bit registers, common distance estimators for DNA sequences can reach a speedup of two orders of magnitude compared to, e.g., PHYLIP. In addition, the implementation makes optimal use of ambiguity symbols instead of dismissing them, as described in [15]. Similarly, for fast computation of distance matrices for protein sequences, we introduce fastprot and fastprot_mpi.
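The general idea behind bit-parallel substitution counting can be sketched as follows. This is only an illustration under stated assumptions: the 2-bit encoding and the function names are hypothetical, ambiguity symbols are not handled, and Python's arbitrary-width integers stand in for the 128-bit registers used by the techniques in [15].

```python
# Hypothetical 2-bit encoding (an assumption, not necessarily fastdist's).
# Low bit = pyrimidine flag, so transitions flip only the high bit.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack an unambiguous DNA string into one integer, 2 bits per site."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word


def low_bit_mask(length):
    """Mask 0b0101...01 selecting the low bit of each 2-bit site."""
    return (4 ** length - 1) // 3


def hamming_packed(x, y, length):
    """Count differing sites with one XOR, one fold, and one popcount."""
    diff = x ^ y                                   # non-zero group <=> sites differ
    per_site = (diff | (diff >> 1)) & low_bit_mask(length)
    return bin(per_site).count("1")


def ts_tv_packed(x, y, length):
    """Count (transitions, transversions) between two packed sequences."""
    low = low_bit_mask(length)
    diff = x ^ y
    tv = diff & low                  # low bit differs: purine <-> pyrimidine
    ts = (diff >> 1) & low & ~tv     # only the high bit differs: A<->G, C<->T
    return bin(ts).count("1"), bin(tv).count("1")


print(hamming_packed(pack("AAAA"), pack("CGTA"), 4))   # 3 differing sites
print(ts_tv_packed(pack("AAAA"), pack("GCTA"), 4))     # (1, 2)
```

All sites of a word are compared at once, which is where the speedup over a character-by-character loop comes from.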

We present fastphylo as a package of efficient phylogenetic tools.

## Implementation

to compute a phylogenetic tree and save it to a file.

By reading and writing the commonly used sequence formats, FASTA and PHYLIP, compatibility is maintained with existing phylogenetic tools such as PHYLIP [4] and RAxML [10]. However, we have also implemented support for XML-based I/O to encourage validatable data handling. Using XML simplifies format conversion, safeguards against formatting mistakes, and enables validation of input and output. To support validation, the RelaxNG XML [16] schemas for sequence data, distance matrices, and phylogenies are built into all the fastphylo modules and can be easily retrieved from the programs. Unlike the PHYLIP format, XML also enables users to work with long accessions.

One of the main issues with phylogeny reconstruction is the storage of distance matrices: storing a distance matrix for a very large gene family requires a large amount of disk space. We therefore introduce a binary format that overcomes this problem (see Section ‘Features of fastdist’ for further details).

### Features of fastdist

The fastdist program estimates distance matrices from DNA alignments. It implements fast computation of four distance estimators: Hamming (also known as *p*-distance), JC [17], K2P [18], and TN93 [19]. K2P is the default distance estimator for fastdist.
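For reference, two of these estimators can be written down directly from their standard closed forms: JC as d = -(3/4) ln(1 - 4p/3) for a proportion p of differing sites, and K2P as d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for transition and transversion proportions P and Q. The sketch below is a plain, unoptimised illustration with hypothetical function names. It assumes unambiguous, aligned sequences over A, C, G, T, and raising an error on saturated distances is our choice, not necessarily how fastdist behaves.

```python
import math

PURINES = {"A", "G"}


def ts_tv_proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    assert len(a) == len(b) and len(a) > 0, "sequences must be aligned"
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1          # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1          # purine<->pyrimidine
    return ts / len(a), tv / len(a)


def jc_distance(p):
    """Jukes-Cantor distance from the p-distance (Hamming proportion)."""
    arg = 1.0 - 4.0 * p / 3.0
    if arg <= 0.0:
        raise ValueError("saturated: JC distance is undefined")
    return -0.75 * math.log(arg)


def k2p_distance(P, Q):
    """Kimura 2-parameter distance from transition/transversion proportions."""
    a, b = 1.0 - 2.0 * P - Q, 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("saturated: K2P distance is undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


P, Q = ts_tv_proportions("ACGTACGTAC", "ACGTACGTGT")
print(P, Q, jc_distance(P + Q), k2p_distance(P, Q))
```

When transversions are exactly twice as frequent as transitions (P = p/3, Q = 2p/3), the K2P formula reduces to the JC formula, which is a convenient sanity check.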

### Features of fastprot and fastprot_mpi

fastprot estimates the evolutionary distance between aligned protein sequences. It implements two methods for calculating the distance between protein sequences: the maximum likelihood (ML) estimator, which for two aligned sequences *a* and *b* returns *argmax*_{*d*} *Pr*(*a*,*b*∣*d*), and the expected distance, which returns *E*[*d*∣*a*,*b*] (see further [11]). The ML estimator uses the Newton-Raphson method to find the optimum. It is, however, slower than the expectation estimator.
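A minimal sketch of ML distance estimation by Newton-Raphson is given below. To keep it short and checkable, it swaps in an equal-rates 20-state (Poisson) substitution model, which is an assumption made for illustration and not one of the models fastprot offers. Under this toy model the likelihood depends only on the numbers of matching and mismatching sites, and the optimum has the closed form d = -(19/20) ln(1 - 20p/19), which can be used to check the iteration.

```python
import math


def ml_distance_poisson(matches, mismatches, tol=1e-10, max_iter=50):
    """ML distance under an equal-rates 20-state model via Newton-Raphson."""
    n = matches + mismatches
    if n == 0:
        raise ValueError("no aligned sites")
    if mismatches == 0:
        return 0.0
    if mismatches / n >= 19 / 20:
        raise ValueError("saturated: ML distance is unbounded")
    c = 20.0 / 19.0
    d = mismatches / n                      # start from the p-distance
    for _ in range(max_iter):
        e = math.exp(-c * d)
        ps = (1 + 19 * e) / 20              # P(residue unchanged | d)
        pd = (1 - e) / 20                   # P(one specific other residue | d)
        dps, dpd = -e, e / 19               # first derivatives w.r.t. d
        d2ps, d2pd = c * e, -c * e / 19     # second derivatives w.r.t. d
        # Gradient and curvature of the log-likelihood
        # matches * log(ps) + mismatches * log(pd).
        grad = matches * dps / ps + mismatches * dpd / pd
        hess = (matches * (d2ps * ps - dps ** 2) / ps ** 2
                + mismatches * (d2pd * pd - dpd ** 2) / pd ** 2)
        step = grad / hess
        d -= step                           # Newton-Raphson update
        if abs(step) < tol:
            return d
    raise RuntimeError("Newton-Raphson did not converge")


closed_form = -0.95 * math.log(1 - 20 * 0.3 / 19)
print(ml_distance_poisson(70, 30), closed_form)
```

With an empirical replacement matrix there is no closed form, so an iterative optimiser of this kind is needed for every sequence pair, which is consistent with the ML estimator being the slower of the two.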

**Time comparison of fastprot vs fastprot_mpi**

| Tool | Nodes | Time (minutes) |
|---|---|---|
| fastprot | 1 | 1149 |
| fastprot_mpi | 8 | 148 |
| fastprot_mpi | 16 | 76 |
| fastprot_mpi | 32 | 40 |
| fastprot_mpi | 64 | 22 |
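The timings in the table translate into speedup (single-node time divided by parallel time) and parallel efficiency (speedup divided by node count). The short calculation below uses only the numbers reported above; the function name is ours.

```python
# Minutes copied from the table above.
BASELINE_MINUTES = 1149                      # fastprot on 1 node
MPI_MINUTES = {8: 148, 16: 76, 32: 40, 64: 22}


def speedup_and_efficiency(baseline, timings):
    """Return {nodes: (speedup, parallel efficiency)}."""
    return {nodes: (baseline / t, baseline / t / nodes)
            for nodes, t in timings.items()}


results = speedup_and_efficiency(BASELINE_MINUTES, MPI_MINUTES)
for nodes, (s, e) in results.items():
    print(f"{nodes:>2} nodes: speedup {s:5.1f}x, efficiency {e:4.0%}")
```

By this measure, fastprot_mpi scales close to linearly: efficiency is about 97% on 8 nodes and still about 82% on 64 nodes.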

### Features of fnj

The fnj program implements three tree reconstruction methods, and the default is FNJ [14]. Furthermore, Neighbor-Joining [13], the mainstay of phylogenetics, as well as the more recent improvement BioNJ [1], are available as command line options. The program supports the formats used by fastdist and fastprot (i.e. XML and PHYLIP).

### Bootstrap feature

We provide a random seed option -s for the reproducibility of results. If no seed is specified, the program seeds the bootstrap analysis with the current time stamp.
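The role of the seed can be illustrated with a generic bootstrap replicate, in which alignment columns are resampled with replacement. This is a sketch of the standard procedure using Python's own random number generator and a hypothetical function name. It does not reproduce the random numbers or the internals of the fastphylo programs.

```python
import random


def bootstrap_alignment(seqs, seed=None):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    # With seed=None, Python seeds the generator from system entropy or the
    # clock, so results differ between runs; a fixed seed makes them repeatable.
    rng = random.Random(seed)
    cols = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are applied to every sequence.
    return ["".join(s[c] for c in cols) for s in seqs]


alignment = ["ACGTAC", "ACGTTC", "AGGTAC"]
first = bootstrap_alignment(alignment, seed=42)
second = bootstrap_alignment(alignment, seed=42)
print(first == second)   # identical replicates for identical seeds
```

Fixing the seed therefore fixes the sequence of resampled columns, and with it every downstream distance matrix and tree.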

## Results and discussion

To assess the performance of fastphylo compared to other NJ-based tools, we considered two performance metrics: speed and memory utilization. In addition, we were interested in measuring how large a gene family fastphylo can handle. The motivation for this analysis is that most NJ tools fail to compute phylogenetic trees for very large gene families.

### Simulated data

To evaluate the performance of fastphylo, we simulated two different datasets. The first dataset, which we call dataset-1, consists of 10 gene families with family sizes ranging from 1,000 to 10,000 members. The second dataset, dataset-2, contains 20 gene families with family sizes ranging from 5,000 to 100,000 members. Each gene sequence is 2,000 nucleotides long, while each protein sequence is 350 amino acids long.

We used tools developed by our colleagues Ali Tofigh and Bengt Sennblad to generate trees and sequences. All the details on parameter settings for generating trees and sequences are mentioned in Additional file 1.

### Environment and experimental set-up

for all our experiments. All experiments were performed on a cluster machine. Each cluster node has 8 cores and each core has 3 GB of RAM. We set up two experimental environments: one for dataset-1 and one for dataset-2, separately. For dataset-1, we ran each experiment on a single dedicated core with a time duration of 2 hours for each job. However, for dataset-2, the time limit for each experiment was set to 24 hours, and each experiment was performed on a node instead of a core due to memory requirements.

We used Massif, a memory profiling tool available in the Valgrind suite [26], to profile memory consumption of the aforementioned NJ tools. The standard time tool available in Linux (version 2.6.32) was used for measuring running time of each experiment. Only "User time" output from the time tool is considered in the time comparison analysis. We tried to use the best performance parameters for each tool in our analysis. All the details on the choice of parameters used, for different NJ tools, are mentioned in Additional file 1.

### Results on dataset-1

### Results on dataset-2

Figure 8 shows the time and memory comparison of fnj and RapidNJ. The input to both programs is a distance matrix. We used the binary-format output of fastdist as input to fnj, and distance matrices in PHYLIP format as input to RapidNJ. It is interesting to note that fnj and RapidNJ performed similarly in both the time and memory comparisons, although RapidNJ has an advantage in memory usage.

## Conclusions

Fastphylo is a software package that is easy to use and has well-defined interfaces. It is efficient software that makes very large problem sizes tractable. In addition, Fastphylo can be a good tool of choice in many studies: for instance, in MCMC and maximum likelihood (ML) methods for phylogeny reconstruction, it can be used to generate a good starting tree. Furthermore, Fastphylo’s modular architecture offers maximum flexibility in phylogenetic computations.

## Availability and requirements

**Project name:** Fastphylo

**Project home page:**
http://fastphylo.sourceforge.net

**Operating system(s):** Linux, Mac OS X (10.6.8 and 10.8.4)

**Programming language:** C++

**Licence:** MIT License

**Any restrictions to use by non-academics:** None

## Declarations

### Acknowledgements

This work was supported by The Royal Institute of Technology (KTH), Sweden, and University of Engineering and Technology (UET Peshawar), Pakistan. The computations were performed using the resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project b2012160: FastPhylo. We thank Henric Zazzi at PDC, KTH Royal Institute of Technology, for coding assistance.

## Authors’ Affiliations

## References

1. Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997, 14 (7): 685-695. 10.1093/oxfordjournals.molbev.a025808.
2. St John K, Warnow T, Moret BME, Vawter L: Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. J Algorithms. 2003, 48: 173-193. 10.1016/S0196-6774(03)00049-X.
3. Roshan U, Moret BME, Warnow T, Williams TL: Rec-I-DCM3: a fast algorithmic technique for reconstructing large Phylogenetic trees. Proc. 3rd Computational Systems Bioinformatics Conference. 2004, IEEE Computer Society: Washington DC, 98-109.
4. Felsenstein J: PHYLIP - Phylogeny inference package (version 3.2). Cladistics. 1989, 5: 164-166.
5. Howe KL, Bateman A, Durbin R: QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics. 2002, 18 (11): 1546-1547. 10.1093/bioinformatics/18.11.1546.
6. Mailund T, Pedersen CNS: QuickJoin-fast neighbour-joining tree reconstruction. Bioinformatics. 2004, 20 (17): 3261-3262. 10.1093/bioinformatics/bth359.
7. Mailund T, Brodal G, Fagerberg R, Pedersen C, Phillips D: Recrafting the neighbor-joining method. BMC Bioinformatics. 2006, 7: 29. 10.1186/1471-2105-7-29.
8. Wheeler TJ: Large-scale neighbor-joining with NINJA. Proceedings of the 9th international conference on Algorithms in bioinformatics. 2009, WABI’09, Berlin, Heidelberg: Springer-Verlag, 375-389.
9. Sheneman L, Evans J, Foster J: Clearcut: a fast implementation of relaxed neighbor joining. Bioinformatics. 2006, 22 (22): 2823-2824. 10.1093/bioinformatics/btl478.
10. Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics. 2005, 21 (4): 456-463. 10.1093/bioinformatics/bti191.
11. Agarwal P, States DJ: A Bayesian evolutionary distance for parametrically aligned sequences. J Comput Biol. 1996, 3: 1-17. 10.1089/cmb.1996.3.1.
12. Simonsen M, Mailund T, Pedersen CNS: Rapid neighbour-joining. WABI, Volume 5251 of Lecture Notes in Computer Science. Edited by: Crandall KA, Lagergren J. 2008, Berlin, Heidelberg: Springer-Verlag, 113-122.
13. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
14. Elias I, Lagergren J: Fast neighbor joining. Theor Comput Sci. 2009, 410 (21-23): 1993-2000. 10.1016/j.tcs.2008.12.040.
15. Elias I, Lagergren J: Fast computation of distance estimators. BMC Bioinformatics. 2007, 8: 89. 10.1186/1471-2105-8-89.
16. Clark J, Makoto M: RELAX NG specification. OASIS. 2001, [http://www.oasis-open.org/committees/relax-ng/spec.html].
17. Jukes T, Cantor C: Evolution of protein molecules. Mamm Protein Metab. 1969, 3: 21-132.
18. Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16 (2): 111-120. 10.1007/BF01731581.
19. Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10 (3): 512-526.
20. Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001, 18: 691-699. 10.1093/oxfordjournals.molbev.a003851.
21. Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8 (3): 275-282.
22. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas Protein Seq Structure. 1978, 5 (suppl 3): 345-351.
23. Müller T, Vingron M: Modeling amino acid replacement. J Comput Biol. 2000, 7 (5): 761-76.
24. Le SQ, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol. 2008, 25 (7): 1307-1320. 10.1093/molbev/msn067. [http://mbe.oxfordjournals.org/cgi/content/abstract/25/7/1307].
25. Tange O: GNU Parallel - The Command-Line Power Tool. The USENIX Magazine. 2011, 36 (1): 42-47. [http://www.gnu.org/s/parallel].
26. Seward J, Nethercote N, Weidendorfer J: Valgrind 3.3 - Advanced Debugging and Profiling for GNU/Linux applications. 2008, UK: Network Theory Ltd.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.