Multiple sequence alignment accuracy and evolutionary distance estimation
© Rosenberg; licensee BioMed Central Ltd. 2005
Received: 14 July 2005
Accepted: 23 November 2005
Published: 23 November 2005
Sequence alignment is a common tool in bioinformatics and comparative genomics. It is generally assumed that multiple sequence alignment yields better results than pair wise sequence alignment, but this assumption has rarely been tested, and never with the control provided by simulation analysis. This study used sequence simulation to examine the gain in accuracy of adding a third sequence to a pair wise alignment, particularly concentrating on how the phylogenetic position of the additional sequence relative to the first pair changes the accuracy of the initial pair's alignment as well as their estimated evolutionary distance.
The maximal gain in alignment accuracy was found not when the third sequence is directly intermediate between the initial two sequences, but rather when it perfectly subdivides the branch leading from the root of the tree to one of the original sequences (making it half as close to one sequence as the other). Evolutionary distance estimation in the multiple alignment framework, however, is largely unrelated to alignment accuracy and rather is dependent on the position of the third sequence; the closer the branch leading to the third sequence is to the root of the tree, the larger the estimated distance between the first two sequences.
The bias in distance estimation appears to be a direct result of the standard greedy progressive algorithm used by many multiple alignment methods. These results have implications for choosing new taxa and genomes to sequence when resources are limited.
DNA sequence alignment is a common step in molecular evolutionary analysis. Aligned sequences are used for many purposes, including estimation of patterns of divergence, selection, the tempo and mode of evolutionary change, identification of functional elements and constraints, and phylogenetic history, just to name a few. Alignments are a hypothesis of site homology; as evolutionary distance among sequences increases, alignments are known to become less accurate [1–7]. The effect of alignment accuracy on downstream analysis in comparative genomics and bioinformatics is largely an unexplored topic, although some empirical studies have attempted to examine this with respect to functional element identification [8, 9] and phylogenetic analysis [10–16].
Beyond simple accuracy, multiple sequence alignment may affect downstream sequence analysis in unexpected ways relative to pair wise sequence alignment. In a previous study, I showed that evolutionary distance estimation from DNA sequences can be surprisingly robust to alignment error (evolutionary distance is the number of substitutions per site that have occurred since a pair of sequences diverged from a common ancestral sequence). This previous work was based on alignments of paired sequences; the relationship between accuracy of alignment and distance estimation might differ under multiple alignment conditions.
The primary goal of this study was to use simulation to examine the improvement in alignment accuracy when going from pair wise to multiple alignments, profiling the change versus the position of the additional sequences. How much is the accuracy of alignment of a pair of sequences improved by the addition of a third sequence with an intermediate evolutionary history (relative to the initial pair)? Where should the third sequence be in order to maximize the accuracy of the initial pair's alignment? The effects of these multiple sequence alignments on evolutionary distance estimation were also profiled. Does the position of the third sequence have an effect on the estimation of evolutionary distance of the initial pair, independent of the accuracy of the alignments?
Results and discussion
Accuracy of multiple alignments
These results add an additional spin to the debate over the importance of taxon sampling in phylogeny reconstruction. Although it is generally thought that increased taxon sampling yields more accurate phylogenies, simulation studies have proven to be equivocal and controversial [21–29]. However, all of these studies (and most of the empirical ones as well) tacitly assume perfect sequence alignment. Even if increased taxon sampling has no direct effect on phylogenetic accuracy (a debatable point; see cited literature above), it certainly appears to have an effect on the accuracy of alignment prior to the phylogenetic reconstruction process. How alignment accuracy may directly affect phylogeny reconstruction is a topic in need of further study (although see ).
These results also imply a specific strategy for choosing species to sequence when resources are limited. For example, given current estimates of the mammalian phylogeny [31–33], these results suggest that error in human and mouse sequence alignments could be reduced by about 15% if aligned with a third species with an evolutionary position similar to that of the Capuchin (Cebus albifrons) or the blind mole rat (Spalax judaei). In contrast, inclusion of sequences from the completed rat genome would only be expected to decrease error in human and mouse alignments by about 7%.
Multiple alignment and evolutionary distance estimation
Careful examination of the strict (greedy) progressive alignment algorithm [17, 18] used by Clustal explains this pattern. When sequences A and B are aligned directly in a pair wise alignment, the hypothesized transitions and transversions are based solely on the properties of these sequences. In the multiple alignment, Clustal begins by aligning the closest pair of sequences (A and C). The more distant sequence C is from sequence A, the more transversions will have occurred since they shared a common ancestor. More transversions are therefore identified (hypothesized) between this pair during alignment. When sequence B is added to the alignment, numerous potential transversions have already been "set" between sequences A and C. Thus, any potential transversional difference between sequences A and B has a lower cost (on average) than would be found in the corresponding pair wise alignment because a transversion may already have been identified between sequences A and C. The biasing effect of phylogenetic position on distance estimation from pair wise alignment appears to be a consequence of the greedy progressive algorithm implemented in Clustal.
This study used Clustal because it is one of the most widely used alignment programs, particularly for high-throughput genomic analysis, and tends to be among the most accurate [7, 35]. While it is quite possible that the resulting alignments could be improved by changing alignment parameters (such as the mismatch and gap costs), the purpose of this study is not to optimize the alignment but rather to examine the difference multiple alignment makes under simple conditions and to examine the downstream effects of these errors on distance estimation.
To partially examine whether the observed results are specific to Clustal or may be a more general alignment problem, most of the alignments were repeated with T-Coffee 1.37. Using the default parameters, T-Coffee produced significantly worse alignments and distance estimates than Clustal (the purpose of this study was not to compare the relative accuracy of these alignment methods, and the absolute differences may be due to default parameterization choices and not the overall quality of the methods themselves). However, the shapes of the alignment accuracy curves were the same (i.e., the peak gain in accuracy during multiple alignment occurred when the third sequence bisected the branch leading from the root to sequence A). The biasing effect of the position of the third sequence appeared to be less severe and, in some cases, completely absent; unfortunately, the differences in accuracy made it difficult to systematically compare these results. Unlike Clustal, T-Coffee uses both global and local pair wise alignments to guide the production of the final global alignment. Although it is also a progressive algorithm, the intermediate alignments it produces make use of more information at early stages, which may help prevent the distance estimate bias.
There are many additional multiple DNA sequence alignment algorithms and programs available, some of which use progressive alignment schemes similar to those of Clustal and T-Coffee but allow for revision of previously aligned sequences, and some of which use very different approaches, including statistical alignments based on maximum likelihood or Bayesian methods (e.g., [37–41]). Some of these methods simultaneously estimate evolutionary distance and alignment [42, 43], while methods for simultaneously estimating phylogenies and alignments are also being developed [30, 44–48]. Comparisons of the overall accuracies of some of these programs in pair wise DNA sequence analysis have recently been conducted. How these programs and algorithms compare under multiple alignment conditions and whether the observed biasing effect is widespread or narrow across algorithms is a task for future investigation.
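The first step of the greedy progressive scheme described above can be illustrated with a toy sketch. This is not Clustal's code: the Needleman-Wunsch scoring values, the use of a simple proportion of differences as the distance, and all function names are assumptions chosen for illustration. The sketch only shows how the closest pair (here A and C) is selected and aligned before the more distant sequence B is considered.

```python
from itertools import combinations


def nw_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap cost (hypothetical scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(row_a, row_b):
    """Proportion of differing sites among columns without gaps."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_pair(seqs):
    """Return the names of the pair a greedy progressive method would align first."""
    return min(combinations(sorted(seqs), 2),
               key=lambda pr: p_distance(*nw_align(seqs[pr[0]], seqs[pr[1]])))


# Made-up sequences: C differs from A at one site, B differs from both at several.
seqs = {"A": "ACGTACGTACGT", "B": "TCGTTCGTACCA", "C": "ACGTACGTACGA"}
first = closest_pair(seqs)
```

In a strict progressive method the columns fixed in this first A–C alignment are not revisited when B is added, which is the property the paragraph above identifies as the likely source of the distance bias.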
Multiple sequence alignments do improve upon pair wise sequence alignments. The optimal taxon sampling strategy for maximally improving alignments is to bisect long branches in a balanced framework. Independent of alignment accuracy, however, multiple alignment using a progressive algorithm can bias evolutionary distance estimates, with larger estimates consistently found as intermediate sequences appear deeper in the phylogeny.
In addition to point substitutions under the HKY model, insertions and deletions were allowed to occur, with the expected rate of deletion events being one occurrence every 40 substitutions and the expected rate of insertion events being one occurrence every 100 substitutions (as observed in primates and rodents). Realized numbers of insertions and deletions were drawn from a Poisson distribution with mean equal to the expected value. The lengths of individual insertion and deletion events were also chosen from a truncated (so as not to include zero) Poisson distribution with a mean of 4 bases (as observed from primate and rodent lineages) [52, 53]. Variation in insertion/deletion rate and size can have a large effect on alignment accuracy. However, it is likely that changing the values of these parameters in the present study would have similar effects across all conditions. Each simulation condition was replicated 1000 times.
For every simulated data set, the fate of each of the original sites was tracked and an alignment representing the true homology was constructed for each data set (that is, the simulation program produced gapped sequences in which all aligned sites were truly homologous). The gaps were removed from the sequences and data sets consisting of all three sequences and of just sequences A and B were constructed. Each data set was aligned using Clustal W version 1.83 with the default parameters, as is common in high-throughput analysis and comparative studies of this sort [3, 6, 7, 36, 55–57]. This produced a hypothesized alignment, just as one would obtain from analysis of real data.
The hypothesized alignments were compared to the true alignment derived from the simulation. Evolutionary distances between sequences A and B were estimated for the correct alignment, the AB hypothesized alignment, and the ABC hypothesized alignment using the Tamura-Nei formula.
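The indel sampling scheme described above can be sketched as follows. This is a minimal illustration, not the simulation program used in the study: the function names are hypothetical, and the value 4 is treated as the rate of the underlying Poisson before zeros are rejected, which is one reading of "a mean of 4 bases" (the mean after truncation is then slightly above 4).

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the small means used here; exp(-mean) underflows for
    very large means, where another sampler would be needed.
    """
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def zero_truncated_poisson(rate, rng):
    """Draw from a Poisson distribution conditioned on being at least 1."""
    while True:
        k = poisson(rate, rng)
        if k > 0:
            return k


def sample_indels(expected_substitutions, rng,
                  deletions_per_sub=1 / 40, insertions_per_sub=1 / 100,
                  length_rate=4):
    """Sample indel event lengths for one branch (assumed parameterization).

    The number of events of each kind is Poisson with mean equal to the
    expected number of substitutions times the per-substitution rate.
    """
    n_del = poisson(expected_substitutions * deletions_per_sub, rng)
    n_ins = poisson(expected_substitutions * insertions_per_sub, rng)
    return {
        "deletions": [zero_truncated_poisson(length_rate, rng) for _ in range(n_del)],
        "insertions": [zero_truncated_poisson(length_rate, rng) for _ in range(n_ins)],
    }


rng = random.Random(1)
events = sample_indels(expected_substitutions=200, rng=rng)
```

With 200 expected substitutions on a branch, this draws on average 5 deletion events and 2 insertion events, each at least one base long.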
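Scoring such a hypothesized alignment against the true one can be sketched as below. The accuracy measure is an assumption: this section does not spell out the exact score used, so the sketch takes the fraction of truly homologous site pairs that the hypothesized alignment recovers. The function names and the toy alignment rows are hypothetical.

```python
def degap(row):
    """Remove gap characters, as done before handing sequences to the aligner."""
    return row.replace("-", "")


def aligned_pairs(row_a, row_b):
    """Return the set of (site in A, site in B) index pairs placed in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def pair_accuracy(true_rows, hyp_rows):
    """Fraction of truly homologous pairs that the hypothesized alignment recovers."""
    if [degap(r) for r in true_rows] != [degap(r) for r in hyp_rows]:
        raise ValueError("alignments must contain the same underlying sequences")
    true_pairs = aligned_pairs(*true_rows)
    hyp_pairs = aligned_pairs(*hyp_rows)
    if not true_pairs:
        raise ValueError("true alignment has no homologous pairs")
    return len(true_pairs & hyp_pairs) / len(true_pairs)


# Toy example: the true alignment has a gap at the third column of A,
# while the hypothesized alignment places it one column later.
true_ab = ("AC-GT", "ACTGT")
hyp_ab = ("ACG-T", "ACTGT")
accuracy = pair_accuracy(true_ab, hyp_ab)  # 3 of 4 true pairs recovered
```

Because the score is computed only over the A and B rows, the same function applies to the AB alignment and to the A and B rows extracted from an ABC alignment.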
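For readers who want to reproduce the distance step, the Tamura-Nei (1993) distance can be sketched as follows. This uses the standard published form of the estimator; the function names, the pooling of base frequencies over both sequences, and the exclusion of gapped or ambiguous columns are assumptions rather than details stated in this article.

```python
import math
from collections import Counter


def tn93_from_proportions(p1, p2, q, freqs):
    """Tamura-Nei distance from observed difference proportions.

    p1: proportion of purine transitions (A<->G)
    p2: proportion of pyrimidine transitions (C<->T)
    q:  proportion of transversions
    freqs: base frequencies keyed by "A", "C", "G", "T"
    """
    a, c, g, t = (freqs[base] for base in "ACGT")
    if min(a, c, g, t) <= 0:
        raise ValueError("all four base frequencies must be positive")
    r, y = a + g, c + t
    w1 = 1 - r * p1 / (2 * a * g) - q / (2 * r)
    w2 = 1 - y * p2 / (2 * t * c) - q / (2 * y)
    w3 = 1 - q / (2 * r * y)
    if min(w1, w2, w3) <= 0:
        raise ValueError("distance undefined: sequences too divergent")
    return (-(2 * a * g / r) * math.log(w1)
            - (2 * t * c / y) * math.log(w2)
            - 2 * (r * y - a * g * y / r - t * c * r / y) * math.log(w3))


def tn93_distance(row_a, row_b):
    """Tamura-Nei distance between two aligned rows, skipping non-ACGT columns."""
    cols = [(x, y) for x, y in zip(row_a.upper(), row_b.upper())
            if x in "ACGT" and y in "ACGT"]
    if not cols:
        raise ValueError("no comparable columns")
    n = len(cols)
    counts = Counter(base for col in cols for base in col)
    freqs = {base: counts[base] / (2 * n) for base in "ACGT"}  # pooled frequencies
    p1 = sum({x, y} == {"A", "G"} for x, y in cols) / n
    p2 = sum({x, y} == {"C", "T"} for x, y in cols) / n
    q = sum((x in "AG") != (y in "AG") for x, y in cols) / n
    return tn93_from_proportions(p1, p2, q, freqs)


# With equal base frequencies and equal transition proportions, the estimator
# reduces to the Kimura two-parameter distance, which gives a convenient check.
equal = {base: 0.25 for base in "ACGT"}
d_tn = tn93_from_proportions(0.05, 0.05, 0.10, equal)
d_k2p = -0.5 * math.log(1 - 2 * 0.10 - 0.10) - 0.25 * math.log(1 - 2 * 0.10)
```

Applying `tn93_distance` to the A and B rows of the true alignment, the AB alignment, and the ABC alignment would give the three estimates compared in the study.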
Thanks to Heath Ogden and anonymous reviewers for comments on this manuscript. This work was partially supported by the NIH R03-LM008637 and Arizona State University.
- Pevsner J: Bioinformatics and Functional Genomics. Hoboken, NJ, Wiley; 2003:753.
- Briffeuil P, Baudoux G, Lambert C, De Bolle X, Vinals C, Feytmans E, Depiereux E: Comparative analysis of seven multiple protein sequence alignment servers: Clues to enhance reliability of predictions. Bioinformatics 1998, 14(4):357–366.
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 1999, 27(13):2682–2690.
- Duret L, Abdeddaim S: Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure, and Databanks. Edited by: Higgins D, Taylor W. Oxford, Oxford University Press; 2000:51–76.
- Altschul SF, Gish W: Local alignment statistics. In Methods in Enzymology: Computer Methods for Macromolecular Sequence Analysis. Volume 266. Edited by: Doolittle RF. San Diego, Academic Press; 1996:460–480.
- Rosenberg MS: Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics 2005, 6: 102.
- Pollard DA, Bergman CM, Stoye J, Celniker SE, Eisen MB: Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics 2004, 5(1):6.
- Frith MC, Hansen U, Spouge JL, Weng Z: Finding functional sequence elements by multiple local alignment. Nucleic Acids Research 2004, 32(1):189–200.
- Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Research 2003, 13(12):2507–2518.
- Xia XH, Xie Z, Kjer KM: 18S ribosomal RNA and tetrapod phylogeny. Systematic Biology 2003, 52(3):283–295.
- Cammarano P, Creti R, Sanangelantoni AM, Palm P: The Archaea monophyly issue: A phylogeny of translational elongation factor G(2) sequences inferred from an optimized selection of alignment positions. Journal of Molecular Evolution 1999, 49(4):524–537.
- Kjer KM: Aligned 18S and insect phylogeny. Systematic Biology 2004, 53(3):506–514.
- Kjer KM: Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: An example of alignment and data presentation from the frogs. Molecular Phylogenetics and Evolution 1995, 4(3):314–330.
- Titus T, Frost DR: Molecular homology assessment and phylogeny in the lizard family Opluridae (Squamata: Iguania). Molecular Phylogenetics and Evolution 1996, 6: 49–62.
- Morrison DA, Ellis JT: Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNAs of Apicomplexa. Molecular Biology and Evolution 1997, 14: 428–441.
- Hwang UW, Kim W, Tautz D, Friedrich M: Molecular phylogenetics at the Felsenstein zone: Approaching the Strepsiptera problem using 5.8S and 28S rDNA sequences. Molecular Phylogenetics and Evolution 1998, 9: 470–480.
- Feng DF, Doolittle RF: Progressive alignment and phylogenetic tree construction of protein sequences. Methods in Enzymology 1990, 183: 375–387.
- Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 1987, 25: 351–360.
- Raghava GPS, Searle SMJ, Audley PC, Barber JD, Barton GJ: OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 2003, 4: 47.
- Thompson JD, Plewniak F, Poch O: BaliBASE: A benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics 1999, 1: 87–88.
- Rosenberg MS, Kumar S: Taxon sampling, bioinformatics, and phylogenomics. Systematic Biology 2003, 52(1):119–124.
- Rosenberg MS, Kumar S: Incomplete taxon sampling is not a problem for phylogenetic inference. Proceedings of the National Academy of Sciences USA 2001, 98(19):10751–10756.
- Pollock DD, Zwickl DJ, McGuire JA, Hillis DM: Increased taxon sampling is advantageous for phylogenetic inference. Systematic Biology 2002, 51(4):664–671.
- Zwickl DJ, Hillis DM: Increased taxon sampling greatly reduces phylogenetic error. Systematic Biology 2002, 51(4):588–598.
- Kim J: General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Systematic Biology 1996, 45(3):363–374.
- Kim J: Large-scale phylogenies and measuring the performance of phylogenetic estimators. Systematic Biology 1998, 47(1):43–60.
- Hendy MD, Penny D: A framework for the quantitative study of evolutionary trees. Systematic Zoology 1989, 38(4):297–309.
- Graybeal A: Is it better to add taxa or characters to a difficult phylogenetic problem? Systematic Biology 1998, 47(1):9–17.
- Poe S, Swofford DL: Taxon sampling revisited. Nature 1999, 398(6725):299–300.
- Fleißner R: Sequence alignment and phylogenetic inference. In Mathematisch-Naturwissenschaftlichen Fakultät. Düsseldorf, Heinrich-Heine-Universität Düsseldorf; 2003:132.
- Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ: Molecular phylogenetics and the origins of placental mammals. Nature 2001, 409: 614–618.
- Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS: Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 2001, 294(5550):2348–2351.
- Reyes A, Gissi C, Catzeflis F, Nevo E, Pesole G, Saccone C: Congruent mammalian trees from mitochondrial and nuclear genes using Bayesian methods. Molecular Biology and Evolution 2004, 21(2):397–403.
- Gibbs RA, Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G, Steffen D, Worley KC, Burch PE, Okwuonu G, Hines S, Lewis L, DeRamo C, Delgado O, Dugan-Rocha S, Miner G, Morgan M, Hawes A, Gill R, Holt RA, Adams MD, Amanatides PG, Baden-Tillson H, Barnstead M, Chin S, Evans CA, Ferriera S, Fosler C, Glodek A, Gu ZP, Jennings D, Kraft CL, Nguyen T, Pfannkoch CM, Sitter C, Sutton GG, Venter JC, Woodage T, Smith D, Lee HM, Gustafson E, Cahill P, Kana A, Doucette-Stamm L, Weinstock K, Fechtel K, Weiss RB, Dunn DM, Green ED, Blakesley RW, Bouffard GG, de Jong J, Osoegawa K, Zhu BL, Marra M, Schein J, Bosdet I, Fjell C, Jones S, Krzywinski M, Mathewson C, Siddiqui A, Wye N, McPherson J, Zhao SY, Fraser CM, Shetty J, Shatsman S, Geer K, Chen YX, Abramzon S, Nierman WC, Havlak PH, Chen R, Durbin KJ, Egan A, Ren YR, Song XZ, Li BS, Liu Y, Qin X, Cawley S, Cooney AJ, D'Souza LM, Martin K, Wu JQ, Gonzalez-Garay ML, Jackson AR, Kalafus KJ, McLeod MP, Milosavljevic A, Virk D, Volkov A, Wheeler DA, Zhang ZD, Bailey JA, Eichler EE, Tuzun E, Birney E, Mongin E, Ureta-Vidal A, Woodwark C, Zdobnov E, Bork P, Suyama M, Torrents D, Alexandersson M, Trask BJ, Young JM, Huang H, Wang HJ, Xing HM, Daniels S, Gietzen D, Schmidt J, Stevens K, Vitt U, Wingrove J, Camara F, Alba MM, Abril JF, Guigo R, Smit A, Dubchak I, Rubin EM, Couronne O, Poliakov A, Hubner N, Ganten D, Goesele C, Hummel O, Kreitler T, Lee YA, Monti J, Schulz H, Zimdahl H, Himmelbauer H, Lehrach H, Jacob HJ, Bromberg S, Gullings-Handley J, Jensen-Seaman MI, Kwitek AE, Lazar J, Pasko D, Tonellato PJ, Twigger S, Ponting P, Duarte JM, Rice S, Goodstadt L, Beatson SA, Emes RD, Winter EE, Webber C, Brandt P, Nyakatura G, Adetobi M, Chiaromonte F, Elnitski L, Eswara P, Hardison RC, Hou MM, Kolbe D, Makova K, Miller W, Nekrutenko A, Riemer C, Schwartz S, Taylor J, Yang S, Zhang Y, Lindpaintner K, Andrews TD, Caccamo M, Clamp M, Clarke L, Curwen V, Durbin R, Eyras E, Searle SM, Cooper GM, Batzoglou S, Brudno M, 
Sidow A, Stone EA, Payseur BA, Bourque G, Lopez-Otin C, Puente XS, Chakrabarti K, Chatterji S, Dewey C, Pachter L, Bray N, Yap VB, Caspi A, Tesler G, Pevzner PA, Haussler D, Roskin KM, Baertsch R, Clawson H, Furey TS, Hinrichs AS, Karolchik D, Kent WJ, Rosenbloom KR, Trumbower H, Weirauch M, Cooper DN, Stenson PD, Ma B, Brent M, Arumugam M, Shteynberg D, Copley RR, Taylor MS, Riethman H, Mudunuri U, Peterson J, Guyer M, Felsenfeld A, Old S, Mockrin S, Collins F: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 2004, 428(6982):493–521.
- Hickson RE, Simon C, Perrey SW: The performance of several multiple-sequence alignment programs in relation to secondary-structure features for an rRNA sequence. Molecular Biology and Evolution 2000, 17(4):530–539.
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 2000, 302(1):205–217.
- Keightley PD, Johnson T: MCALIGN: Stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution. Genome Research 2004, 14(3):442–450.
- Holmes I, Bruno WJ: Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 2001, 17(9):803–820.
- Thorne JL, Kishino H, Felsenstein J: Inching toward reality: An improved likelihood model of sequence evolution. Journal of Molecular Evolution 1992, 34: 3–16.
- Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution 1991, 33: 114–124.
- Metzler D, Fleißner R, Wakolbinger A, von Haeseler A: Assessing variability by joint sampling of alignments and mutation rates. Journal of Molecular Evolution 2001, 53: 660–669.
- Hein J, Wiuf C, Knudsen B, Møller MB, Wibling G: Statistical alignment: Computational properties, homology testing and goodness-of-fit. Journal of Molecular Biology 2000, 302: 265–279.
- Fleißner R, Metzler D, von Haeseler A: Can one estimate distances from pairwise sequence alignments? In Proceedings of the German Conference on Bioinformatics. Edited by: Bornberg-Bauer E, Rost U, Stoye J, Vingron M. Berlin, Logos Verlag; 2000:89–95.
- Gladstein D, Wheeler WC: POY: The Optimization of Alignment Characters. New York, American Museum of Natural History; 1997.
- Redelings BD, Suchard MA: Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, in press.
- Lunter G, Miklos I, Drummond A, Jensen JL, Hein J: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 2005, 6: 83.
- Fleissner R, Metzler D, von Haeseler A: Simultaneous statistical multiple alignment and phylogeny reconstruction. Systematic Biology 2005, 54(4):548–561.
- Rosenberg MS: MySSP: Non-stationary evolutionary sequence simulation, including indels. Evolutionary Bioinformatics Online 2005, 1: 51–53.
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 1985, 22: 160–174.
- Rosenberg MS, Subramanian S, Kumar S: Patterns of transitional mutation biases within and among mammalian genomes. Molecular Biology and Evolution 2003, 20(6):988–993.
- Ophir R, Graur D: Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 1997, 205(1–2):191–202.
- Sundström H, Webster MT, Ellegren H: Is the rate of insertion and deletion mutation male biased?: Molecular evolutionary analysis of avian and primate sex chromosome sequences. Genetics 2003, 164: 259–268.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22: 4673–4680.
- Morgenstern B: DIALIGN 2: Improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15(3):211–218.
- Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Research 2003, 13(1):97–102.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 2003, 13(4):721–731.
- Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 1993, 10: 512–526.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.