YBYRÁ facilitates comparison of large phylogenetic trees
© Machado. 2015
Received: 25 March 2015
Accepted: 6 June 2015
Published: 1 July 2015
The number and size of tree topologies being compared by phylogenetic systematists are increasing due to technological advancements in high-throughput DNA sequencing. However, we still lack tools to facilitate comparison among phylogenetic trees with a large number of terminals.
The “YBYRÁ” project integrates software solutions for data analysis in phylogenetics. It comprises tools for (1) topological distance calculation based on the number of shared splits or clades, (2) sensitivity analysis and automatic generation of sensitivity plots and (3) clade diagnoses based on different categories of synapomorphies. YBYRÁ also provides (4) an original framework to facilitate the search for potential rogue taxa based on how much they affect average matching split distances (using MSdist).
YBYRÁ facilitates comparison of large phylogenetic trees and outperforms competing software in terms of usability and time efficiency, especially for large data sets. The programs that comprise this toolkit are written in Python; hence, they do not require installation and have minimal dependencies. The entire project is available under an open-source licence at http://www.ib.usp.br/grant/anfibios/researchSoftware.html.
Phylogenetic trees comprising hundreds or thousands of terminals are becoming increasingly common [1], and technological breakthroughs in high-throughput DNA sequencing promise to allow trees to expand even more [2]. Within this context, there is an increasing demand for software solutions that help phylogeneticists to automate the process of comparing multiple optimal or nearly optimal topologies, as well as topologies derived from different data partitions, optimality criteria, or assumption sets, and to extract information about the distribution of evidence in those trees [3]. The “YBYRÁ” package was developed to allow researchers to compare multiple trees containing large numbers of terminals quickly and accurately.
YBYRÁ is written in Python; hence, it is a cross-platform application (e.g., Windows, OS X or Linux) and does not require compilation. YBYRÁ makes use of free, easy-to-install Python modules to root trees and print images in SVG format. The search for potential wildcard taxa and the identification of diagnostic character states require MSdist v0.5 [4] and TNT v1.1 [5], respectively. The programs, example files and a graphical user interface for creating and editing configuration files can be downloaded under the GNU General Public License version 3.0 (GPL-3.0) at http://www.ib.usp.br/grant/anfibios/researchSoftware.html. A wiki page is available at https://gitlab.com/MachadoDJ/ybyra/wikis/home.
In phylogenetic systematics, authors make use of sensitivity analysis to address how much hypothesis choice may be affected by variables such as different tree search strategies, optimality criteria, alignment methods, and transformation cost schemes [8–10]. There is some debate in the literature regarding the scientific and heuristic value of sensitivity analysis [11, 12]. However, the instrumental value of sensitivity analysis as a means to describe and compare different methodological approaches in systematics is indisputable.
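The idea behind a sensitivity table can be illustrated with a minimal Python sketch. It is not YBYRÁ's own code: trees are written as nested tuples rather than read from files, all trees are assumed to be rooted in the same way, and the names `clades`, `sensitivity_table` and the parameter-set labels are hypothetical. For each clade of a reference tree, the sketch records whether the clade is recovered under each analytical condition.

```python
def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of terminals."""
    found = set()

    def visit(node):
        if isinstance(node, str):          # a terminal
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)                 # an internal node defines a clade
        return members

    visit(tree)
    return found


def sensitivity_table(reference, analyses):
    """Map each non-root reference clade to {analysis label: recovered?}."""
    reference_clades = clades(reference)
    root = max(reference_clades, key=len)  # the clade holding all terminals
    return {
        clade: {label: clade in clades(tree) for label, tree in analyses.items()}
        for clade in reference_clades
        if clade != root
    }


# Hypothetical example: one reference tree and two parameter sets.
reference = ((("A", "B"), "C"), ("D", "E"))
analyses = {
    "cost 1:1": ((("A", "B"), "C"), ("D", "E")),
    "cost 2:1": (("A", ("B", "C")), ("D", "E")),
}
table = sensitivity_table(reference, analyses)
for clade, row in sorted(table.items(), key=lambda item: sorted(item[0])):
    print("+".join(sorted(clade)), row)
```

Each row of the resulting table corresponds to one node of the reference tree and each column to one analysis, which is the information a sensitivity plot displays graphically.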
Comparing the approximate execution time and memory use of Cladescan and YBYRÁ (2.9 GHz Intel Core i7, 8 GB 1600 MHz DDR3)
Evaluation of diagnostic character states
Differing from character-based DNA barcoding approaches such as CAOS [14], YBYRÁ categorizes character transformation events from any source of data given all possible optimization schemes in a set of trees. The input consists of one or more trees in TREAD format and a matrix in simplified NEXUS format containing a single DATA block. YBYRÁ then passes the tree(s) and data matrix to TNT, which compiles synapomorphies using its “apo” command. Synapomorphies are categorized as ambiguously or unambiguously optimized. Unambiguously optimized synapomorphies are further classified as unique and non-homoplastic, unique and homoplastic, or non-unique and homoplastic. Program output consists of a table in comma-separated values (CSV) format and vector graphic files (SVG) illustrating the categorized character states (Fig. 1b; see manually edited tree in Fig. 1c).
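The four categories can be sketched as a small decision rule in Python. The operational definitions used here are assumptions made for illustration and are not taken from the YBYRÁ source: a state is treated as "unique" when it originates only once on the tree, and as "homoplastic" when it either originates more than once or is transformed again within the clade. The `Transformation` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    character: int        # character index in the matrix
    state: str            # derived state at the node
    ambiguous: bool       # True if optimization schemes disagree (assumed flag)
    origins: int          # independent origins of this state on the tree
    changed_within: bool  # True if the state changes again inside the clade


def categorize(event):
    """Assign one of the four categories under the assumed definitions."""
    if event.ambiguous:
        return "ambiguous"
    if event.origins > 1:
        return "non-unique and homoplastic"
    if event.changed_within:
        return "unique and homoplastic"
    return "unique and non-homoplastic"


# Hypothetical transformation events for a single node.
events = [
    Transformation(12, "A", ambiguous=False, origins=1, changed_within=False),
    Transformation(40, "T", ambiguous=False, origins=1, changed_within=True),
    Transformation(77, "G", ambiguous=False, origins=3, changed_within=False),
    Transformation(90, "C", ambiguous=True, origins=1, changed_within=False),
]
labels = [categorize(event) for event in events]
for event, label in zip(events, labels):
    print(event.character, event.state, label)
```

In practice the inputs to such a rule would come from TNT's synapomorphy lists rather than from hand-written records.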
Detection of wildcard taxa
In phylogenetic analysis, lack of data or conflicting information may cause some terminals to be highly unstable “wildcards” or “rogues” (see [3] for a recent empirical example). YBYRÁ offers a framework to rank every terminal according to how much it affects the average matching split distance (MSD) calculated in MSdist. Trees are pruned one terminal at a time and submitted to MSdist. YBYRÁ then generates an ordered list of terminals according to how much they affect MSD (see Fig. 1d). Terminals whose removal results in the lowest MSD are more likely to cause a decrease in resolution and may be considered potential wildcards.
In that study, the authors used homemade scripts to prune terminals from the set of most parsimonious trees, recalculate the strict consensus using TNT and count the number of nodes in an iterative manner. YBYRÁ automates this process and was able to recover the same results with fewer commands.
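The prune-and-rank procedure described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not YBYRÁ's implementation: trees are nested tuples, and the distance is a simple count of unshared clades (in the style of Robinson and Foulds) used as a stand-in for the matching split distance that MSdist computes. The function names are hypothetical.

```python
from itertools import combinations


def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of terminals."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found


def prune(tree, terminal):
    """Remove one terminal, collapsing nodes left with a single child."""
    if isinstance(tree, str):
        return None if tree == terminal else tree
    kept = [p for p in (prune(child, terminal) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def distance(tree_a, tree_b):
    """Stand-in distance: clades found in one tree but not the other."""
    return len(clades(tree_a) ^ clades(tree_b))


def mean_pairwise_distance(trees):
    pairs = list(combinations(trees, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def rank_terminals(trees, terminals):
    """Order terminals by the mean distance among trees after pruning each."""
    scores = {
        name: mean_pairwise_distance([prune(tree, name) for tree in trees])
        for name in terminals
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical input: terminal "X" attaches in a different place in each tree.
trees = [
    ((("A", "X"), "B"), ("C", "D")),
    (("A", "B"), (("C", "X"), "D")),
    (("A", "B"), ("C", ("D", "X"))),
]
ranking = rank_terminals(trees, ["A", "B", "C", "D", "X"])
for name, score in ranking:
    print(name, round(score, 2))
```

Removing "X" makes the three trees identical, so it heads the list with a mean distance of zero, which is the pattern expected of a wildcard terminal.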
YBYRÁ is designed for phylogeneticists with minimal computational skills. To facilitate use, it is accompanied by a graphical user interface for creating and editing configuration files, and users receive instructions when additional modules are required to run specific functions. The package integrates strategies for topological comparison and distance calculation, as well as a novel framework to search for potential rogue taxa. It also offers a different strategy from CAOS to compile and evaluate diagnostic character states. While CAOS aims to identify diagnostic character states from molecular sequences without reference to tree topology, YBYRÁ uses TNT to categorize all transformation events considering every possible optimization scheme in the observed trees. Finally, YBYRÁ outperforms Cladescan for phylogenetic sensitivity analysis, allowing automatic generation of sensitivity plots for large data sets in feasible time.
The present project provides user-friendly programs that allow automation and reproducibility of result analysis operations in phylogenetics. To the best of my knowledge, YBYRÁ is the first software package to integrate solutions for topological distance calculation, extraction of diagnostic characters and search for potential rogue taxa. Additionally, it outperforms Cladescan for the analysis of large data sets and is currently the only viable solution for automated phylogenetic sensitivity analysis of large trees (over 1,000 terminals).
Availability and requirements
- Project name: YBYRÁ
- Project home page: http://www.ib.usp.br/grant/anfibios/researchSoftware.html
- Operating system(s): Windows, Linux, OS X
- Programming language: Python
- Licence: GNU General Public License version 3.0 (GPL-3.0)
- Other requirements: view dependencies in the documentation.
- Any restrictions to use by non-academics: view license.
YBYRÁ was first introduced as a poster at the XXXII Willi Hennig Meeting (Rostock, Germany, 2013). I thank Fernando P. L. Marques, Taran Grant and two anonymous reviewers for their insightful suggestions. The name ybyrá is a noun in Tupi which means tree, stick, rod, stalk, lance or spear. I thank Miguel T. Rodrigues for suggesting the name. This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo in Brazil (FAPESP Proc. No. 2009/13561-5, 2013/05958-8, and 2012/10000-5).
1. Goloboff, PA, SA Catalano, J Marcos Mirande, CA Szumik, J Salvador Arias, M Källersjö, and JS Farris. 2009. Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 25(3): 211–30. doi:10.1111/j.1096-0031.2009.00255.x.
2. McCormack, JE, SM Hird, AJ Zellmer, BC Carstens, and RT Brumfield. 2013. Applications of next-generation sequencing to phylogeography and phylogenetics. Mol Phylogenet Evol 66(2): 526–38. doi:10.1016/j.ympev.2011.12.007.
3. Padial, JM, T Grant, and DR Frost. 2014. Molecular systematics of terraranas (Anura: Brachycephaloidea) with an assessment of the effects of alignment and optimality criteria. Zootaxa 3825(1): 1–132. doi:10.11646/zootaxa.3825.1.1.
4. Bogdanowicz, D, and K Giaro. 2012. Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinform 9(1): 150–60. doi:10.1109/TCBB.2011.48.
5. Goloboff, PA, JS Farris, and KC Nixon. 2008. TNT, a free program for phylogenetic analysis. Cladistics 24(5): 774–86.
6. Robinson, DF, and LR Foulds. 1981. Comparison of phylogenetic trees. Math Biosci 53: 131–47.
7. Paradis, E, J Claude, and K Strimmer. 2004. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics 20(2): 289–90.
8. Higdon, JW, ORP Bininda-Emonds, RMD Beck, and SH Ferguson. 2007. Phylogeny and divergence of the pinnipeds (Carnivora: Mammalia) assessed using a multigene dataset. BMC Evol Biol 7: 216.
9. Miller, JA, A Carmichael, MJ Ramírez, JC Spagna, CR Haddad, M Rezác, J Johannesen, J Král, X Wang, and CE Griswold. 2010. Phylogeny of entelegyne spiders: affinities of the family Penestomidae (NEW RANK), generic phylogeny of Eresidae, and asymmetric rates of change in spinning organ evolution (Araneae, Araneoidea, Entelegynae). Mol Phylogenet Evol 55(3): 786–804. doi:10.1016/j.ympev.2010.02.021.
10. Payne, A. 2014. Resolving the relationships of apid bees (Hymenoptera: Apidae) through a direct optimization sensitivity analysis of molecular, morphological, and behavioural characters. Cladistics 30(1): 11–25.
11. Grant, T, and AG Kluge. 2005. Stability, sensitivity, science and heurism. Cladistics 21(6): 597–604.
12. Giribet, G, and WC Wheeler. 2007. The case for sensitivity: a response to Grant and Kluge. Cladistics 23(3): 294–6.
13. Sanders, JG. 2010. Program note: Cladescan, a program for automated phylogenetic sensitivity analysis. Cladistics 26(1): 114–6. doi:10.1111/j.1096-0031.2009.00280.x.
14. Sarkar, IN, PJ Planet, and R Desalle. 2008. CAOS software for use in character-based DNA barcoding. Mol Ecol Resour 8: 1256–59. doi:10.1111/j.1755-0998.2008.02235.x.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.