Analysis on the reconstruction accuracy of the Fitch method for inferring ancestral states
- Jialiang Yang^{1},
- Jun Li^{1},
- Liuhuan Dong^{1} and
- Stefan Grünewald^{1}
https://doi.org/10.1186/1471-2105-12-18
© Yang et al; licensee BioMed Central Ltd. 2011
Received: 8 December 2010
Accepted: 13 January 2011
Published: 13 January 2011
Abstract
Background
As one of the most widely used parsimony methods for ancestral reconstruction, the Fitch method minimizes the total number of hypothetical substitutions along all branches of a tree to explain the evolution of a character. Because the method is used so extensively, its reconstruction accuracy has been studied actively in recent years. However, most studies are restricted to 2-state evolutionary models, and a study of higher-state models is needed because DNA sequences have 4 states and protein sequences have 20.
Results
In this paper, the ambiguous and unambiguous reconstruction accuracies of the Fitch method are studied for N-state evolutionary models. Given an arbitrary phylogenetic tree, a recurrence system is first presented to calculate the two accuracies iteratively. As complete binary trees and comb-shaped trees are the two extremal evolutionary tree topologies with respect to balance, we focus on the reconstruction accuracies on these two topologies and analyze their asymptotic properties. Then, 1000 Yule trees with 1024 leaves are generated and analyzed to simulate real evolutionary scenarios. It is known that more taxa do not necessarily increase the reconstruction accuracies under 2-state models. The corresponding result under N-state models is also tested.
Conclusions
In a large tree with many leaves, the reconstruction accuracies of using all taxa are sometimes less than those of using a leaf subset under N-state models. For complete binary trees, there always exists an equilibrium interval [a, b] of conservation probability, in which the limiting ambiguous reconstruction accuracy equals the probability of randomly picking a state. The value b decreases as the number of states increases, and it seems to converge. When the conservation probability is greater than b, the reconstruction accuracies of the Fitch method increase rapidly. The reconstruction accuracies on 1000 simulated Yule trees also exhibit similar behaviors. For comb-shaped trees, the limiting reconstruction accuracies of using all taxa are always less than or equal to those of using the nearest root-to-leaf path when the conservation probability is not less than $\frac{1}{N}$. As a result, more taxa are suggested for ancestral reconstruction when the tree topology is balanced and the sequences are highly similar, and a few taxa close to the root are recommended otherwise.
Background
Ancestral state reconstruction attempts to predict properties of ancestral proteins, genes and even whole genomes in a given phylogeny according to data of extant species. This approach to understanding protein functions and evolution was first proposed by Pauling and Zuckerkandl in their seminal work [1]. Thereafter, with the increasing availability of biological data, it has become a technique of growing importance in investigating the functions and origins of genes and proteins [2–9].
Parsimony and maximum likelihood (ML) are the two most popular criteria utilized to reconstruct ancestral states when the phylogenetic tree representing the evolutionary history of a character is known [6, 10]. Parsimony methods minimize the total number of hypothetical substitutions along all branches of the evolutionary tree. The Fitch method was the first parsimony method for inferring ancestral states [11]. It is a linear time algorithm and is accurate for taxa with highly similar sequences. The method was later modified by Sankoff to account for different rates of substitutions among states [12, 13]. The reader is referred to a survey book [14] for reviews of parsimony methods and their variants. In contrast to parsimony methods, ML methods choose as the ancestral state the state from which the observed states could have evolved with maximum likelihood. ML inference of ancestral sequences was pioneered by Yang, Kumar and Nei [15] and by Koshi and Goldstein [16]. Later, a widely used variant of the ML method, the Bayesian approach, was introduced by Huelsenbeck and his coworkers [17, 18]. The reader is referred to [6] and [19] for reviews of ML methods and their variants.
Due to the extensive usage of ancestral reconstruction methods, it has become a significant scientific endeavor to study their reconstruction accuracies. These accuracies of different methods have been either estimated by statistical simulations [20, 21] or calculated precisely by theoretical analyses [22–27]. For example, under a 2-state Jukes-Cantor model, several recurrence systems for calculating the reconstruction accuracies of the Fitch algorithm were presented for a given phylogenetic tree [22, 24, 25, 27]. It was shown in these studies that the reconstruction accuracies depend largely on the topology of the phylogenetic tree. Thus, reconstruction accuracies and their asymptotic properties on the number of leaves were also analyzed for extremal trees like complete binary trees and comb-shaped trees (or rooted caterpillars) [25, 27]. However, by far most theoretical analyses have been limited to 2-state models. More effort should be made to study the reconstruction accuracies under higher-state models as there are 4 states for DNA sequences and even 20 states for protein sequences.
In this paper, we study the ambiguous and unambiguous reconstruction accuracy of the Fitch algorithm for reconstructing the root state under N-state evolutionary models. We first present a general recurrence system for calculating the reconstruction accuracies on any given phylogenetic tree. We developed software that implements this system. As pointed out by Li et al. [24], more taxa are not necessarily better for the reconstruction of ancestral states. Our recurrence system and software can be used to select good subsets of taxa to reconstruct ancestral DNAs, proteins, or other characters.
After that, we restrict the analyses to 3 extremal evolutionary trees under the N-state Jukes-Cantor model, namely the equal-branch complete binary tree, the equal-branch comb-shaped tree and the Hennigian comb-shaped tree [24, 28]. For the equal-branch trees the substitution probability along any branch is the same and is denoted by p, thus the conservation probability is q := 1 - (N - 1)p. As examples, we analyze reconstruction accuracies and their asymptotic properties on the number of leaves for N = 2, 4, 5, 20. We also compare the limiting ambiguous and unambiguous reconstruction accuracies of using all taxa with those of using a nearest root-to-leaf path. Finally, 1000 Yule trees with 1024 leaves are generated by the software Mesquite [29] and analyzed to simulate real phylogenetic trees.
From the studies, we observe several interesting properties for the reconstruction accuracies under N-state models. First, for equal-branch complete binary trees, there always exists an equilibrium interval [a, b] of conservation probability q such that, with the number of leaves tending to infinity, the ambiguous reconstruction accuracy converges to $\frac{1}{N}$, the reconstruction accuracy of randomly picking the ancestral state from N possible states. For example, the equilibrium interval for complete binary trees is $\left[\frac{1}{8},\frac{7}{8}\right]$ under the 2-state Jukes-Cantor model [22, 25, 27]. However, a becomes 0 when N ≥ 3. We calculate b for N = 2, ..., 25 and find that b decreases slowly with the increase of N. The reconstruction accuracies for 1000 Yule trees exhibit similar behaviors. Second, for any equal-branch comb-shaped tree, the limiting reconstruction accuracies using the nearest root-to-leaf path are always greater than those using all taxa if the conservation probability is larger than $\frac{1}{N}$. Finally, for any Hennigian comb-shaped tree, the limiting ambiguous reconstruction accuracy is always equal to that of randomly picking a state, $\frac{1}{N}$, whereas the limiting unambiguous one is equal to $\frac{{N}^{N-2}}{\sum_{i=1}^{N}{N}^{N-i}\frac{(N-1)!}{(N-i)!}}$.
Our results suggest that more taxa should be used for reconstructing ancestral states if the tree topology is balanced and the sequences are highly similar, whereas some taxa close to the root are recommended if the tree topology is very unbalanced. In addition, under evolutionary models with a molecular clock, the state reconstructed by the Fitch algorithm is as bad as a randomly picked state when the conservation probability is low or the phylogenetic tree is very unbalanced. The suggestions are also partially applicable to ML methods, as ML inference of the root state is the same as maximum parsimony estimation under simple models such as Jukes-Cantor models when the branch lengths of the phylogenetic tree are unknown [30].
Methods
The Fitch Method
- (1) If u is a leaf, S_{ u } contains only the state of u,
- (2) If u is an internal node having children v and w, ${S}_{u}={S}_{v}*{S}_{w}=\begin{cases}{S}_{v}\cap {S}_{w} & \text{if } {S}_{v}\cap {S}_{w}\ne \varnothing ,\\ {S}_{v}\cup {S}_{w} & \text{otherwise}.\end{cases}$
As a result, any state in S_{ r } is chosen as the root state with an equal probability $\frac{1}{\left|{S}_{r}\right|}$, where |S_{ r }| denotes the cardinality of S_{ r }.
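The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ program: the nested-tuple tree encoding and the function names `fitch_combine`, `fitch_root_set` and `pick_root_state` are assumptions made for this example.

```python
import random


def fitch_combine(s_v, s_w):
    """Fitch set operation: the intersection if it is non-empty, else the union."""
    common = s_v & s_w
    return common if common else s_v | s_w


def fitch_root_set(tree):
    """Return the Fitch state set of a rooted binary tree.

    Hypothetical encoding: a leaf is its observed state (any non-tuple value),
    an internal node is a pair (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return frozenset([tree])          # rule (1): a leaf keeps its own state
    left, right = tree
    return fitch_combine(fitch_root_set(left), fitch_root_set(right))  # rule (2)


def pick_root_state(tree, rng=random):
    """Choose one state uniformly at random from the root state set."""
    return rng.choice(sorted(fitch_root_set(tree)))


# Four leaves with states A, C, A, G: the children sets are {A, C} and {A, G}.
example = (("A", "C"), ("A", "G"))
print(sorted(fitch_root_set(example)))
```

Here the two child sets intersect in a single state, so the root set is unambiguous; replacing the second cherry by `("G", "T")` gives an empty intersection and a root set containing all four states.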
Reconstruction Accuracies of the Fitch Method
- (1) Pr[t_{ r } = i], that is, the initial probability that the root state is i for i = 1, 2, ..., N, and
- (2) Pr[t_{ v } = j | t_{ u } = i], that is, the transition probability that a state i evolves to j along the branch from node u to v for any i, j = 1, 2, ..., N and branch uv.
In particular, Pr[t_{ v } = i | t_{ u } = i] is called the conservation probability of i along uv, and Pr[t_{ v } = j | t_{ u } = i] with i ≠ j is called the substitution probability from i to j along uv. Clearly, the probability of a state in each node is already determined by the model. For any state i and vertex u in T, we use Pr[t_{ u } = i] to denote the probability that the state of u is i. We assume throughout this paper that the evolutionary model is symmetric on all states, that is Pr[t_{ v } = j | t_{ u } = i] = Pr[t_{ v } = i | t_{ u } = j] and Pr[t_{ v } = i | t_{ u } = i] = Pr[t_{ v } = j | t_{ u } = j] for any two states i and j, and any branch uv.
That is, if the reconstructed set B contains 1, there is still a probability of $\frac{1}{\left|B\right|}$ to infer the true root state.
Recurrence Relations to Calculate Reconstruction Accuracies
Let Z be an internal node with two children X and Y. Since the evolutionary model is symmetric on all N states, the substitution probability between any two states is the same along a given branch. We use p_{ X } and p_{ Y } to denote the substitution probabilities along branches ZX and ZY, respectively. Clearly, the corresponding conservation probabilities on any state are 1 - (N - 1)p_{ X } and 1 - (N - 1)p_{ Y } . In the following, we derive a recursive system involving 2N - 1 recursive formulas to calculate the reconstruction accuracies of a parent node from those of its two children.
where ${\mathrm{Pr}}_{u}[\mathcal{D}|1]$ denotes the probability that the leaf configuration under u is $\mathcal{D}$ given that the true state at u is 1, and ${C}_{u}({\mathcal{B}}_{i},\mathcal{D})$ denotes the probability that the reconstructed set at u from $\mathcal{D}$ is ${\mathcal{B}}_{i}$. By this definition, $UA={A}_{1}^{r}$ and $AA=\sum_{k=1}^{N}\binom{N-1}{k-1}\frac{1}{k}{A}_{2k-1}^{r}$.
For any node u, let the reconstructed state set be B_{ u }. Then for any $B\in {\mathcal{B}}_{i}$, B_{ Z } = B if and only if: (1) B_{ X } ∩ B_{ Y } = B, or (2) B_{ X } ∩ B_{ Y } = ∅ and B_{ X } ∪ B_{ Y } = B. Thus ${A}_{i}^{Z}$ can be calculated from the reconstruction accuracies on nodes X and Y in conjunction with the substitution or conservation probabilities along the two branches ZX and ZY (see Additional File 1 for details). Recurrence formulas for 2-state models can be found in [22–27]. We present the recurrence system and initial conditions for N-state models in Additional File 1. To facilitate our study, we also implemented a computer program which takes a phylogenetic tree in Newick format and the substitution rate along branches as inputs. The phylogenetic tree can be inferred by methods like Neighbor-Joining [31], and the substitution rate can also be estimated; see for example [32]. A potential application of our algorithm and program is to select subsets of taxa for accurately reconstructing ancestral sequences such as DNAs and proteins.
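The bottom-up idea can be illustrated with a small dynamic program. The sketch below is not the 2N - 1 formula system of Additional File 1: it does not exploit the symmetry of the model and instead tracks, for every node, the full distribution of the reconstructed set given the true state at that node, so it is exponential in N and only meant for small examples. The tree encoding (the string `"leaf"`, or a pair of `(child, substitution_probability)` entries) and all function names are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import product


def fitch_combine(a, b):
    """Fitch set operation: the intersection if it is non-empty, else the union."""
    common = a & b
    return common if common else a | b


def set_distribution(node, n_states):
    """dist[s][B] = Pr[reconstructed set at node is B | true state at node is s]."""
    if node == "leaf":
        return [{frozenset([s]): 1.0} for s in range(n_states)]
    (child_x, p_x), (child_y, p_y) = node

    def seen_from_parent(child, p):
        # Mix the child's distributions over the child's own true state,
        # using conservation probability q = 1 - (N - 1)p along the branch.
        dist_child = set_distribution(child, n_states)
        q = 1.0 - (n_states - 1) * p
        mixed_per_state = []
        for s in range(n_states):
            mixed = defaultdict(float)
            for t in range(n_states):
                weight = q if t == s else p
                for b, pr in dist_child[t].items():
                    mixed[b] += weight * pr
            mixed_per_state.append(mixed)
        return mixed_per_state

    from_x = seen_from_parent(child_x, p_x)
    from_y = seen_from_parent(child_y, p_y)
    result = []
    for s in range(n_states):
        combined = defaultdict(float)
        # The two subtrees evolve independently given the parent's state.
        for (bx, px), (by, py) in product(from_x[s].items(), from_y[s].items()):
            combined[fitch_combine(bx, by)] += px * py
        result.append(dict(combined))
    return result


def accuracies(tree, n_states):
    """Return (UA, AA) at the root, taking state 0 as the true root state."""
    root = set_distribution(tree, n_states)[0]
    ua = root.get(frozenset([0]), 0.0)
    aa = sum(pr / len(b) for b, pr in root.items() if 0 in b)
    return ua, aa


cherry = (("leaf", 0.1), ("leaf", 0.1))
tree = ((cherry, 0.1), ("leaf", 0.1))
print(accuracies(cherry, 4))
print(accuracies(tree, 4))
```

For a two-leaf tree with N = 4 and p = 0.1 on both branches, the root set is {0} only when both leaves conserve the root state, so UA = q^2, and each two-element set containing the true state contributes half of its probability, so AA = q^2 + 3qp = q.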
Results and Discussion
As the reconstruction accuracies of the Fitch algorithm depend largely on the topology of phylogenetic trees, we focused our attention on two extremal tree shapes: complete binary trees which are the most balanced trees and comb-shaped trees (caterpillars), the most unbalanced trees. We are interested in trees with many taxa and in our results and figures we choose sufficiently many taxa to exhibit the asymptotic behavior. In order to simulate more realistic evolutionary scenarios, we also generated and analyzed 1000 Yule trees with 1024 leaves.
Reconstruction Accuracies for Equal-branch Complete Binary Trees
Similar to the 2-state case in [24], we observed that for N = 4, 5 and 20: (1) UA or AA on a root-to-leaf path is always less than or equal to AA on all the terminal taxa, and (2) UA or AA on a root-to-leaf path is greater than UA on all terminal taxa when q is small but becomes smaller than that when q is larger than a threshold. We conjecture that the two properties hold for an arbitrary number of states.
The estimated values of b for the number of states N = 2, ..., 25.
| N | 2 | 3 | 4 | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|---|---|---|
| b | 0.875 | 0.839 | 0.821 | 0.809 | 0.784 | 0.774 | 0.768 | 0.763 |
In summary, when q ≤ b the performance of the Fitch method is as poor as randomly picking a state. Only when q > b can the Fitch method be used to reconstruct ancestral states, and its performance improves quickly as q increases. We conclude that the conservation probability is the most important factor in determining the performance of the Fitch method. The method is reliable only when q is large, which indicates that the taxa are highly similar. However, when taxa are not similar, no reconstruction method performs well, so more effort should be made to develop a reliable method for this scenario. A suggestion for ancestral reconstruction is that, instead of treating all taxa as a whole, one first reconstructs the ancestors of subsets of very similar taxa and then uses the reconstructed ancestral sequences to infer the ancestor of the whole taxa set.
Reconstruction Accuracies on Comb-shaped Trees
Equal-branch Comb-shaped Trees
An interesting observation is that, in contrast to complete binary trees, there is no interval of conservation probability such that AA converges to $\frac{1}{N}$ for any equal-branch comb-shaped tree. A possible reason is that the leaves that are close to the root dominate the reconstruction accuracy and their distances to the root do not increase with time. In addition, the limiting AA using the nearest root-to-leaf path is always greater than that using all taxa if the conservation probability is larger than $\frac{1}{N}$.
Hennigian Comb-shaped Trees
Clearly, complete binary trees and Hennigian comb-shaped trees are the two kinds of extremal ultrametric trees. By comparing the reconstruction accuracies on both trees with those on real evolutionary trees, one can examine which extremal trees are more realistic. Under the Jukes-Cantor model, for a Hennigian comb tree in which each branch has its own length l, the substitution probability along a branch is $p=\frac{1}{N}-\frac{1}{N}{e}^{-N\lambda l}$, and the conservation probability is $q=1-(N-1)p=\frac{1}{N}+\frac{N-1}{N}{e}^{-N\lambda l}$, where λ is the substitution rate. Similarly, the recurrence system to calculate UA of the Fitch algorithm along the Hennigian tree can be derived from the general recurrence relations.
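The two branch probabilities can be evaluated directly from these formulas. The helper below is a minimal sketch; the function name and argument names are assumptions made for this example, with `rate` standing for λ and `length` for l.

```python
import math


def jukes_cantor_probabilities(n_states, rate, length):
    """Return (p, q): the substitution probability to one specific other state
    and the conservation probability along a branch of the given length."""
    decay = math.exp(-n_states * rate * length)
    p = (1.0 - decay) / n_states
    q = 1.0 / n_states + (n_states - 1) / n_states * decay
    return p, q


for length in (0.0, 0.1, 1.0, 10.0):
    p, q = jukes_cantor_probabilities(4, 1.0, length)
    print(f"l = {length}: p = {p:.4f}, q = {q:.4f}")
```

A branch of length 0 gives q = 1, and q approaches 1/N as the branch grows, so long branches carry no information about the parent state.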
The estimated values for UA of Hennigian comb-shaped trees when n is large for the number of states N = 2, ..., 50.
| N | 2 | 3 | 4 | 5 | 10 | 15 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|
| UA | 0.3333 | 0.1765 | 0.1126 | 0.0797 | 0.0274 | 0.0145 | 0.0098 | 0.0053 | 0.0034 | 0.0022 |
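The closed form for the limiting UA given in the Background can be evaluated exactly with rational arithmetic. The sketch below assumes that the formula reads $\frac{N^{N-2}}{\sum_{i=1}^{N}N^{N-i}\frac{(N-1)!}{(N-i)!}}$, that is, with the sum starting at i = 1 and (N - i)! in the inner denominator; this reading agrees with the tabulated estimates for N = 2, 3, 4, 5 to within 10^{-4}. The function name is an assumption made for this example.

```python
from fractions import Fraction
from math import factorial


def limiting_ua_hennigian(n):
    """Closed-form limiting UA for a Hennigian comb-shaped tree with n >= 2 states,
    under the reading of the formula stated in the lead-in."""
    numerator = n ** (n - 2)
    denominator = sum(
        # (n-1)!/(n-i)! is an integer for 1 <= i <= n, so floor division is exact.
        n ** (n - i) * (factorial(n - 1) // factorial(n - i))
        for i in range(1, n + 1)
    )
    return Fraction(numerator, denominator)


for n in (2, 3, 4, 5):
    value = limiting_ua_hennigian(n)
    print(n, value, float(value))
```

For N = 2 the sum in the denominator is 2 + 1 = 3, so the limiting UA is 1/3; for N = 3 it is 9 + 6 + 2 = 17, giving 3/17.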
Reconstruction Accuracies for Yule Trees with 1024 Leaves
Conclusions
In this paper, we study the unambiguous and ambiguous reconstruction accuracies of the Fitch method. We first present a general recurrence system as well as a program for calculating reconstruction accuracies on arbitrary trees. Based on the system and program, we analyze 3 special trees under the Jukes-Cantor evolutionary model, namely equal-branch complete binary trees, equal-branch comb-shaped trees, and Hennigian comb-shaped trees, as well as 1000 randomly generated Yule trees to simulate real evolutionary scenarios. From the analyses, we conclude that (1) for equal-branch complete binary trees, there always exists an interval [0, b] of conservation probability in which the ambiguous reconstruction accuracy converges to $\frac{1}{N}$, the probability of randomly picking a state; when the conservation probability is greater than b, both reconstruction accuracies increase rapidly, and the randomly generated Yule trees exhibit the same behavior; and (2) for unbalanced trees like comb-shaped trees, the reconstruction accuracies using the nearest root-to-leaf path are always greater than or equal to those using all taxa. In conclusion, more taxa are suggested for ancestral reconstruction when the tree topology is balanced and the sequences of taxa are highly similar, and a few taxa close to the root are recommended otherwise.
Availability and Requirements
The software as well as the source code in C++ to calculate the reconstruction accuracy of the Fitch method on any tree with an arbitrary number of states under the one-parameter Jukes-Cantor model can be found in Additional File 4. The reader is referred to the "install.txt" and "help.txt" files for the installation and usage of the program, or can alternatively run the bash file "accuracy.out" on a Unix/Linux system. The programs to draw the figures and tables are written in Matlab and can also be found in Additional File 4; Matlab must be installed to run them.
Declarations
Acknowledgements
This work was partially supported by the Natural Science Foundation of China (No. 10971213 and No. 10701070).
Authors’ Affiliations
References
- Pauling L, Zuckerkandl E: Chemical paleogenetics: molecular restoration studies of extinct forms of lives. Acta Chem Scand 1963, 17: S9-S16. 10.3891/acta.chem.scand.17s-0009
- Hillis D, Huelsenbeck J, Cunningham C: Application and accuracy of molecular phylogenies. Science 1994, 264: 671–677. 10.1126/science.8171318
- Jermann T, Opitz J, Stackhouse S, Benner S: Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 1995, 374: 57–59. 10.1038/374057a0
- Zhang J, Rosenberg H: Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates. Proc Natl Acad Sci 2002, 99: 5486–5491. 10.1073/pnas.072626199
- Zhang C, Zhang M, Ju J, Neitfeldt J, Wise J, et al.: Genome diversification in phylogenetic lineages I and II of Listeria monocytogenes: Identification of segments unique to lineage II populations. J Bacteriol 2003, 185: 5573–5584. 10.1128/JB.185.18.5573-5584.2003
- Felsenstein J: Inferring phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2004.
- Krishnan N, Seligmann H, Stewart C, De Koning A, Pollock D: Ancestral sequence reconstruction in primate mitochondrial DNA: Compositional bias and effect on functional inference. Mol Biol Evol 2004, 21: 1871–1883. 10.1093/molbev/msh198
- Taubenberger J, Reid A, Lourens R, Wang R, Jin G: Characterization of the 1918 influenza virus polymerase genes. Nature 2005, 437: 889–893. 10.1038/nature04230
- Clemente J, Ikeo K, Valiente G, Gojobori T: Optimized ancestral state reconstruction using Sankoff parsimony. BMC Bioinformatics 2009, 10: 51. 10.1186/1471-2105-10-51
- Crisp M, Cook L: Do early branching lineages signify ancestral traits? Trends in Ecol Evol 2005, 20: 122–128. 10.1016/j.tree.2004.11.010
- Fitch W: Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 1971, 20: 406–416. 10.2307/2412116
- Sankoff D: Minimal mutation trees of sequences. SIAM J Appl Math 1975, 28: 35–42. 10.1137/0128004
- Sankoff D, Rousseau P: Locating the vertices of a Steiner tree in an arbitrary metric space. Math Program 1975, 9: 240–246. 10.1007/BF01681346
- Albert V: Parsimony, Phylogeny, and Genomics. Natural History Museum, University of Oslo, Norway: Oxford Scholarship Online; 2007.
- Yang Z, Kumar S, Nei M: A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 1995, 141: 1641–1650.
- Koshi J, Goldstein R: Probabilistic reconstruction of ancestral protein sequences. J Mol Evol 1996, 42: 313–320. 10.1007/BF02198858
- Huelsenbeck J, Bollback J: Empirical and hierarchical Bayesian estimation of ancestral states. Syst Biol 2001, 50: 351–366. 10.1080/106351501300317978
- Huelsenbeck J, Ronquist F: MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 2001, 17: 754–755. 10.1093/bioinformatics/17.8.754
- Lio P, Goldman N: Models of molecular evolution and phylogeny. Genome Res 1998, 8: 1233–1244.
- Zhang J, Nei M: Accuracies of ancestral amino acid sequences inferred by parsimony, likelihood, and distance methods. J Mol Evol 1997, 44: 139–146. 10.1007/PL00000067
- Salisbury B, Kim J: Ancestral state estimation and taxon sampling density. Syst Biol 2001, 50: 557–564. 10.1080/106351501750435103
- Steel M: Distributions on bicoloured evolutionary trees. PhD thesis. Massey University, New Zealand; 1989.
- Maddison W: Calculating the probability distributions of ancestral states reconstructed by parsimony on phylogenetic trees. Syst Biol 1995, 44: 474–481.
- Li G, Steel M, Zhang L: More taxa are not necessarily better for the reconstruction of ancestral character states. Syst Biol 2008, 57: 647–653. 10.1080/10635150802203898
- Yang J: Three mathematical issues in reconstructing ancestral genome. PhD thesis. National University of Singapore, Singapore; 2008.
- Fischer M, Thatte B: Maximum parsimony on subsets of taxa. J Theor Biol 2009, 260: 290–293. 10.1016/j.jtbi.2009.06.010
- Zhang L, Shen J, Yang J, Li G: Analyzing the Fitch method for reconstructing ancestral states on ultrametric phylogenetic trees. Bull Math Biol 2010, 72: 1760–1782. 10.1007/s11538-010-9505-8
- Oakley T, Gu Z, Abouheif E, Patel N, Li W: Comparative methods for the analysis of gene-expression evolution: An example using yeast functional genomic data. Mol Biol Evol 2005, 22: 40–50. 10.1093/molbev/msh257
- Maddison W, Maddison D: Mesquite: a modular system for evolutionary analysis. 2010. [http://mesquiteproject.org]
- Tuffley C, Steel M: Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull Math Biol 1997, 59: 581–607. 10.1007/BF02459467
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
- Lundstrom R, Tavare S, Ward R: Estimating substitution rates from molecular data using the coalescent. Proc Natl Acad Sci 1992, 89: 5961–5965. 10.1073/pnas.89.13.5961
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.