L.U.St: a tool for approximated maximum likelihood supertree reconstruction
© Akanni et al.; licensee BioMed Central Ltd. 2014
Received: 30 January 2014
Accepted: 2 June 2014
Published: 12 June 2014
Supertrees combine disparate, partially overlapping trees to generate a synthesis that provides a high level perspective that cannot be attained from the inspection of individual phylogenies. Supertrees can be seen as meta-analytical tools that can be used to make inferences based on results of previous scientific studies. Their meta-analytical application has increased in popularity since it was realised that the power of statistical tests for the study of evolutionary trends critically depends on the use of taxon-dense phylogenies. Further to that, supertrees have found applications in phylogenomics where they are used to combine gene trees and recover species phylogenies based on genome-scale data sets.
Here, we present the L.U.St package, a Python tool for approximate maximum likelihood supertree inference, and illustrate its application using a genomic data set for the placental mammals. L.U.St calculates the approximate likelihood of a supertree given a set of input trees, performs heuristic searches for the supertree of highest likelihood, and performs statistical tests of two or more supertrees. To this end, L.U.St implements a winning sites test that ranks a collection of a-priori selected hypotheses, given as a collection of input supertree topologies. It also outputs a file of input-tree-wise likelihood scores that can be used as input to CONSEL for the calculation of standard tests of two trees (e.g. Kishino-Hasegawa, Shimodaira-Hasegawa and Approximately Unbiased tests).
This is the first fully parametric implementation of a supertree method. It has clearly understood properties and provides several advantages over currently available supertree approaches. It is easy to implement and works on any platform that has Python installed.
Availability: bitBucket page - https://email@example.com/afro-juju/l.u.st.git.
Supertree methods are generalisations of consensus methods to the case of partially overlapping input trees, and any method that can be used to amalgamate a collection of such trees is a supertree method. Supertrees were formally introduced to the realm of the classification sciences by Gordon, who described a Strict Consensus Supertree method. However, the first supertree algorithm was introduced by Aho and colleagues as an application to merge partially overlapping databases. Since these early works, there has been a lot of interest in supertree reconstruction, particularly in evolutionary biology, where supertrees have found an application as meta-analytical tools used to combine, and derive inferences from, published phylogenetic trees. Purvis presented the first application of a supertree in this context, merging primate phylogenies obtained from the literature to generate a supertree and using it to test evolutionary hypotheses. Since then, the application of supertrees, and more specifically their use for reconstructing large phylogenies in evolutionary biology, has continued to rise, paralleled by a substantial interest in the development of supertree methods. More recently, supertrees have also found important applications in genomics, where they have been used to combine gene trees and derive species phylogenies [5–9].
A large number of supertree methods have been developed since the time of the Aho algorithm. However, most actual supertrees have been derived using the Matrix Representation with Parsimony (MRP) method of Baum and Ragan. This is due to the availability of excellent parsimony software and the generally good understanding of the theory underlying parsimony. Yet theoretical justifications for the application of parsimony to the supertree setting are weak, and MRP is mostly used because it is easily applicable in practice and tends to return well-resolved trees. More generally, most available supertree methods are ad hoc, their properties are often poorly known, and the rationale for their application is unclear [13–15]. The only exceptions seem to be those based on generalisations of well-known consensus methods, and the maximum likelihood (ML) method of Steel and Rodrigo.
We present a Python implementation of the ML supertree method of Steel and Rodrigo. Unlike other approaches such as MRP, the method has been shown to be consistent under general statistical conditions. It is closely related to the majority rule (-) supertree method, with which it has been suggested to share important properties. In particular, the supertrees it generates have been suggested to be, like those derived using majority rule (-), median trees for the input set.
The method is “approximate” in the sense that likelihood values are not normalised for tree size. However, it has been pointed out that, at least in the context of Maximum Likelihood analyses and given the parametric conditions under which our software is limited to work, this should not be a problem.
The ML supertree method is available as part of the Likelihood Utility for Supertrees (L.U.St) package. L.U.St is licensed under the GNU General Public License. Once downloaded, L.U.St can be run on any platform on which Python is installed.
In the model of Steel and Rodrigo, the probability of observing an input tree given a proposed supertree depends on the distance d between the two trees, where α is a normalising constant and β is a value representing the quantity and quality of the data used to infer the input tree. An exponential distribution is used to model phylogenetic error. This implies that the probability that a given input tree is a sample of the proposed supertree decreases exponentially as d increases. The likelihood of each proposed supertree is then calculated by summing across all tree-wise likelihood scores.
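The exponential error model described here can be written compactly as follows. This is a sketch reconstructed from the definitions in the text, not a quotation: the symbols T_i (an input tree), X_i (its taxon set), T|X_i (the proposed supertree pruned to that taxon set), k (the number of input trees) and the per-tree subscripts are notational assumptions, and the tree-wise scores are assumed to be summed on the log scale.

```latex
\Pr[T_i \mid T] = \alpha_i \exp\bigl(-\beta_i \, d(T_i, T|_{X_i})\bigr),
\qquad
\ln L(T) = \sum_{i=1}^{k} \bigl[\ln \alpha_i - \beta_i \, d(T_i, T|_{X_i})\bigr]
```

Under this form, a larger distance d between an input tree and the pruned supertree lowers that tree's contribution exponentially, which matches the behaviour described in the text.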
The method is “approximate” in the sense that likelihood values are not normalised for tree size. This means that the likelihood we calculate is a “weighted” sum of the input tree likelihoods, where the weights correspond to the tree-specific normalising constant (α). Although calculating these normalising factors is in theory possible, it is computationally very time consuming. However, Bryant and Steel pointed out that if one uses small β values, the normalising constants simplify to a value that can be approximated using α = 1 irrespective of the input-tree sizes. For pragmatic reasons (to maximise speed of execution), we currently do not allow the user to select β, which has been fixed to a low value (β = 1). This should cause the normalising factor (α) of Steel and Rodrigo to simplify to a value of one (i.e. α = 1). It has been pointed out that, at least in the context of Maximum Likelihood analyses, this should not affect the ranking of the supertrees. Indeed, analyses performed to test the accuracy of the method and to compare it with other supertree methods seem to confirm the results of Bryant and Steel. We acknowledge, however, that the ranking will be based on approximate, rather than correct, likelihood values.
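A minimal sketch of the approximate score may help make this concrete. With α = 1 and β = 1, the log-likelihood of a supertree reduces to minus the sum of the distances between each input tree and the supertree pruned to that tree's taxa. The code below is an illustration under stated assumptions, not L.U.St's own code: trees are represented as nested tuples of taxon names, the distance is the symmetric difference (Robinson-Foulds) distance on unrooted bipartitions, and all function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def splits(tree, taxa):
    """Non-trivial bipartitions of `tree` after restricting it to `taxa`.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so equal splits always compare equal.
    """
    taxa = frozenset(taxa)
    ref = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node]) & taxa
        clade = frozenset().union(*(visit(child) for child in node))
        side = clade if ref not in clade else taxa - clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(taxa) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(input_tree, supertree):
    """Symmetric difference between an input tree and the pruned supertree."""
    taxa = leaves(input_tree)
    return len(splits(input_tree, taxa) ^ splits(supertree, taxa))


def approx_log_likelihood(supertree, input_trees, beta=1.0):
    """Approximate log-likelihood with alpha = 1 (so ln(alpha) = 0)."""
    return sum(-beta * rf_distance(t, supertree) for t in input_trees)


# Hypothetical toy data: one candidate supertree and two partial gene trees.
supertree = (((("human", "mouse"), ("cat", "hedgehog")),
              ("elephant", "armadillo")), "opossum")
gene_trees = [
    ((("human", "mouse"), "cat"), ("elephant", "opossum")),  # agrees
    (("human", "cat"), ("mouse", "hedgehog"), "opossum"),    # conflicts
]
print(approx_log_likelihood(supertree, gene_trees))
```

Because every input tree contributes a separate term, the per-tree values of `-beta * rf_distance(...)` are the kind of input-tree-wise scores on which tests of two trees can operate.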
L.U.St includes methods that allow for a variety of extra functions, including statistical tests for choosing between alternative hypotheses (tests of two trees: the winning sites test, the Kishino-Hasegawa (KH) test, the Shimodaira-Hasegawa (SH) test and the Approximately Unbiased (AU) test). Whilst the winning sites test can be run natively in L.U.St, the calculation of KH, SH, AU and other tests requires the use of CONSEL. To our knowledge, there is no other software package that allows the extension of standard tests of two trees to the supertree framework. However, tests of two trees can have great utility in supertree research, as they can be used, for example, to investigate the extent to which current evidence (i.e. currently published trees) supports alternative phylogenetic hypotheses (i.e. a set of proposed supertrees). Further to that, tests of two trees can be used in the phylogenomic context to evaluate the extent to which a set of gene trees can reject a set of alternative phylogenetic hypotheses (i.e. a set of supertrees). An example of the use of tests of two (super)trees in the phylogenomic context is provided below.
L.U.St offers the user other useful functions to randomly resolve polytomies, deroot trees, reroot trees, resolve polytomies in a set of trees according to a user-provided input tree, create bootstrap replicates of input tree datasets, prune phylogenies, convert NEXUS formatted trees to the Newick format and vice versa, and extract the taxon set of sets of trees.
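To illustrate how a winning sites test can operate on input-tree-wise scores, the sketch below treats it as a sign test: it counts how many input trees favour each of two supertrees and computes a two-sided binomial p-value. This is one common formulation and is an assumption here; the text does not specify how L.U.St computes its test statistic, and the function name and example scores are hypothetical.

```python
from math import comb


def winning_sites_test(scores_a, scores_b):
    """Sign test on paired per-input-tree log-likelihoods.

    Returns (wins for A, wins for B, two-sided p-value). Ties are ignored.
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Hypothetical scores of four input trees under two candidate supertrees.
scores_a = [-1.0, -2.0, -3.0, -4.0]
scores_b = [-2.0, -3.0, -4.0, -4.0]
print(winning_sites_test(scores_a, scores_b))
```

With so few input trees, even three wins to none cannot reach conventional significance, which is why such tests are most informative on large collections of input trees.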
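As an illustration of one of these utilities, bootstrap replicates of an input tree dataset can be generated by resampling the input trees with replacement. The sketch below assumes the trees are held as a list of Newick strings; the function name, its parameters and the example trees are hypothetical and do not reflect L.U.St's actual interface.

```python
import random


def bootstrap_replicates(trees, n_replicates, seed=None):
    """Resample the input trees with replacement, one list per replicate."""
    rng = random.Random(seed)
    return [rng.choices(trees, k=len(trees)) for _ in range(n_replicates)]


# Hypothetical input trees in Newick format.
input_trees = [
    "((human,mouse),(cat,hedgehog),opossum);",
    "((human,mouse),elephant,opossum);",
    "((cat,hedgehog),(elephant,armadillo),opossum);",
]
replicates = bootstrap_replicates(input_trees, n_replicates=5, seed=1)
print(len(replicates), len(replicates[0]))
```

Each replicate has the same number of trees as the original dataset, so a supertree can be inferred from every replicate and the results summarised in a consensus tree.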
Using supertrees to investigate deep placental phylogeny
We present an exemplar phylogenomic study of mammalian relationships to illustrate our supertree software because, based on current knowledge, we can predict what results to expect from our analyses and check whether the actual outcomes deviate from those expectations. More precisely, based on previously published results, we expect that: (1) either the Afrotheria (Figure 1A) or the Atlantogenata (Figure 1B) hypothesis will emerge in our optimal ML supertree (most genes in mammalian genomes support one of these two topologies). (2) Similarly, a bootstrap majority rule consensus tree will most likely display one of these two hypotheses (Figure 1A or B). However, (3) as many genes are known to support both of the topologies in Figure 1A-B (and, to a lesser extent, the tree in Figure 1C), bootstrap support for the basal placental split in the optimal ML supertree (and in the bootstrap consensus tree) is expected to be low. (4) Tests of two trees are not expected to differentiate significantly between the topologies in Figure 1A-B. Indeed, given previously published results, we can confidently predict that the trees in Figure 1A and B should be the first and second best fitting hypotheses, even though we cannot predict their relative order (i.e. whether the tree in Figure 1A or in Figure 1B will be the best fitting one). Similarly, (5) whilst we cannot predict whether the Xenarthra hypothesis of Figure 1C will be significantly rejected by the Approximately Unbiased test (or by another test, e.g. the Kishino-Hasegawa test), we can predict that this hypothesis should emerge as the third best one. Finally, although we cannot predict how the trees in Figure 1D-F will be ranked, given what is known of the distribution of the signal in mammal gene trees, we would expect all these hypotheses to be significantly rejected by the data and to emerge as the three hypotheses that fit our data worst.
To reconstruct our ML supertree of the placental mammals, a previously published gene-tree dataset was employed. This dataset was pruned to exclude irrelevant taxa using Clann. Only six placentals (human, mouse, cat, hedgehog, elephant and armadillo) and one marsupial (the opossum) were retained. This meant that the dataset was reduced from 42 taxa overlapping on 2216 gene trees to 7 taxa overlapping on 389 gene trees (with the gene trees being partially overlapping and containing between 4 and 7 taxa).
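The pruning step can be sketched as follows. In the study it was carried out with Clann; the Python below only illustrates the idea under simplifying assumptions: trees are nested tuples of taxon names, function names are hypothetical, and the extra taxa in the example ("chimp", "rat", "dog") are invented for illustration. The minimum of four retained taxa mirrors the range of tree sizes reported above.

```python
KEEP = {"human", "mouse", "cat", "hedgehog", "elephant", "armadillo", "opossum"}


def prune(tree, keep):
    """Remove taxa not in `keep`, collapsing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


def count_leaves(tree):
    """Number of taxa in a nested-tuple tree."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child in tree)


def prune_dataset(trees, keep, min_taxa=4):
    """Prune every tree and drop those left with fewer than `min_taxa` taxa."""
    pruned = (prune(tree, keep) for tree in trees)
    return [t for t in pruned if t is not None and count_leaves(t) >= min_taxa]


# Hypothetical gene trees containing taxa outside the retained set.
gene_trees = [
    (("human", "chimp"), ("mouse", "rat"), (("cat", "dog"), "opossum")),
    (("human", "chimp"), ("mouse", "rat")),
]
print(prune_dataset(gene_trees, KEEP))
```

The second tree retains only two taxa after pruning and is therefore discarded, which is how a pruning step can reduce both the number of taxa and the number of gene trees in a dataset.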
Results and discussion
[Table: Results of the test of two trees. Only the row labels "Erinaceous root 1" and "Erinaceous root 2" survive.]
All results generated were in agreement with our expectations (see above). Apart from confirming that the phylogenetic relationships of the mammals are still far from being resolved, they illustrate that L.U.St behaves as expected and returns results that reflect well the current understanding of mammal evolution. Overall, this illustrates that L.U.St will be a useful tool in phylogenomics and in supertree reconstruction more broadly.
L.U.St represents the first implementation of a maximum likelihood supertree method. The method calculates approximate ML values and has the advantage of finding a tree that has been suggested to be representative of the median of the set of input trees when the symmetric difference metric is used to calculate the tree-to-tree distance. An added advantage of having an approximate ML supertree implementation is that it allows statistical tests to be performed on trees to choose between alternative hypotheses. The results obtained with our toy example reflect current knowledge of mammalian evolution and confirm that the L.U.St package behaves as expected when applied to a phylogenetic problem that is well known to be difficult. Being a freely available package for the Python programming environment, L.U.St is flexible, platform-independent, user friendly and easy to implement.
Availability and requirements
Project name: L.U.St.
Project home page: https://firstname.lastname@example.org/afro-juju/l.u.st.git.
Operating system(s): Linux.
Programming language: Python.
Other requirements: CONSEL.
License: GNU GPL.
This project was made possible by funding received from the Irish Research Council and the UK Biotechnology and Biological Sciences Research Council (grant BB/K007440/1). It was also partially supported by the computing resources at the National University of Ireland, Maynooth and the University of Bristol, UK. DP was supported by a Science Foundation Ireland Grant SFI-RFP 11/RFP/EOB/3106.
- Semple C, Steel M: A supertree method for rooted trees. Discret Appl Math. 2000, 105: 147-158.
- Gordon AD: Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labeled leaves. J Classif. 1986, 3: 335-348.
- Aho AV, Sagiv Y, Szymanski TG, Ullman JD: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981, 10: 405-421.
- Purvis A: A modification to Baum and Ragan’s method for combining phylogenetic trees. Syst Biol. 1995, 44: 251-255.
- Daubin V, Gouy M, Perriere G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 2002, 12: 1080-1090.
- Creevey CJ, Fitzpatrick DA, Philip GK, Kinsella RJ, O’Connell MJ, Pentony MM, Travers SA, Wilkinson M, McInerney JO: Does a tree–like phylogeny only exist at the tips in the prokaryotes? Proc R Soc Lond Ser B Biol Sci. 2004, 271: 2551-2558.
- Fitzpatrick DA, Logue ME, Stajich JE, Butler G: A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis. BMC Evol Biol. 2006, 6: 99.
- Pisani D, Cotton JA, McInerney JO: Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol Biol Evol. 2007, 24: 1752-1760.
- Holton TA, Pisani D: Deep genomic-scale analyses of the metazoa reject Coelomata: evidence from single-and multigene families analyzed under a supertree and supermatrix paradigm. Genome Biol Evol. 2010, 2: 310.
- Baum BR: Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon. 1992, 41: 3-10.
- Ragan MA: Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol. 1992, 1: 53-58.
- Wilkinson M, Cotton JA, Lapointe F-J, Pisani D: Properties of supertree methods in the consensus setting. Syst Biol. 2007, 56: 330-337.
- Lapointe F-J, Wilkinson M, Bryant D: Matrix representations with parsimony or with distances: two sides of the same coin? Syst Biol. 2003, 52: 865-868.
- Gatesy J, Springer MS: A critique of matrix representation with parsimony supertrees. Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Edited by: Bininda-Emonds ORP. 2004, Dordrecht: Kluwer Academic, 369-388.
- Wilkinson M, Thorley JL, Pisani DE, Lapointe F-J, McInerney JO: Some desiderata for liberal supertrees. Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Edited by: Bininda-Emonds ORP. 2004, Dordrecht: Kluwer Academic, 227-246.
- Cotton JA, Wilkinson M: Majority-rule supertrees. Syst Biol. 2007, 56: 445-452.
- Steel M, Rodrigo A: Maximum likelihood supertrees. Syst Biol. 2008, 57: 243-250.
- Bryant D, Steel M: Computing the distribution of a tree metric. IEEE/ACM Trans Comput Biol Bioinform. 2009, 6: 420-426.
- Akanni WA: Developing and Applying Supertree methods in Phylogenomics and Macroevolution. 2014, PhD Thesis, Department of Biology, The National University of Ireland, Maynooth. Maynooth, Ireland.
- Robinson D, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147.
- Kishino H, Hasegawa M: Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J Mol Evol. 1989, 29: 170-179.
- Shimodaira H, Hasegawa M: Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999, 16: 1114-1116.
- Shimodaira H: An approximately unbiased test of phylogenetic tree selection. Syst Biol. 2002, 51: 492-508.
- Shimodaira H, Hasegawa M: CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics. 2001, 17: 1246-1247.
- O’Leary MA, Bloch JI, Flynn JJ, Gaudin TJ, Giallombardo A, Giannini NP, Goldberg SL, Kraatz BP, Luo Z-X, Meng J: The placental mammal ancestor and the post–K-Pg radiation of placentals. Science. 2013, 339: 662-667.
- McCormack JE, Faircloth BC, Crawford NG, Gowaty PA, Brumfield RT, Glenn TC: Ultraconserved elements are novel phylogenomic markers that resolve placental mammal phylogeny when combined with species-tree analysis. Genome Res. 2012, 22: 746-754.
- Romiguier J, Ranwez V, Delsuc F, Galtier N, Douzery EJ: Less is more in mammalian phylogenomics: AT-rich genes minimize tree conflicts and unravel the root of placental mammals. Mol Biol Evol. 2013, 30: 2134-2144.
- Meredith RW, Janečka JE, Gatesy J, Ryder OA, Fisher CA, Teeling EC, Goodbla A, Eizirik E, Simão TL, Stadler T: Impacts of the Cretaceous Terrestrial Revolution and KPg extinction on mammal diversification. Science. 2011, 334: 521-524.
- Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci. 2012, 109: 14942-14947.
- Morgan CC, Foster PG, Webb AE, Pisani D, McInerney JO, O’Connell MJ: Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013, 30: 2145-2156.
- Cao Y, Fujiwara M, Nikaido M, Okada N, Hasegawa M: Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene. 2000, 259: 149-158.
- Corneli PS, Ward RH: Mitochondrial genes and mammalian phylogenies: increasing the reliability of branch length estimation. Mol Biol Evol. 2000, 17: 224-234.
- Misawa K, Nei M: Reanalysis of Murphy et al.’s data gives various mammalian phylogenies and suggests overcredibility of Bayesian trees. J Mol Evol. 2003, 57: S290-S296.
- Foster PG, Cox CJ, Embley TM: The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods. Philos Trans R Soc Lond B Biol Sci. 2009, 364: 2197-2207.
- Rota-Stabelli O, Lartillot N, Philippe H, Pisani D: Serine codon-usage bias in deep phylogenomics: pancrustacean relationships as a case study. Syst Biol. 2013, 62: 121-133.
- Rota-Stabelli O, Campbell L, Brinkmann H, Edgecombe GD, Longhorn SJ, Peterson KJ, Pisani D, Philippe H, Telford MJ: A congruent solution to arthropod phylogeny: phylogenomics, microRNAs and morphology support monophyletic Mandibulata. Proc R Soc B Biol Sci. 2011, 278: 298-306.
- Creevey CJ, McInerney JO: Clann: investigating phylogenetic information through supertree analyses. Bioinformatics. 2005, 21: 390-392.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.