Multiple sequence alignment accuracy and evolutionary distance estimation
© Rosenberg. 2005
Received: 14 July 2005
Accepted: 23 November 2005
Published: 23 November 2005
Sequence alignment is a common tool in bioinformatics and comparative genomics. It is generally assumed that multiple sequence alignment yields better results than pair wise sequence alignment, but this assumption has rarely been tested, and never with the control provided by simulation analysis. This study used sequence simulation to examine the gain in accuracy of adding a third sequence to a pair wise alignment, particularly concentrating on how the phylogenetic position of the additional sequence relative to the first pair changes the accuracy of the initial pair's alignment as well as their estimated evolutionary distance.
The maximal gain in alignment accuracy was found not when the third sequence is directly intermediate between the initial two sequences, but rather when it perfectly subdivides the branch leading from the root of the tree to one of the original sequences (making it half as close to one sequence as the other). Evolutionary distance estimation in the multiple alignment framework, however, is largely unrelated to alignment accuracy and rather is dependent on the position of the third sequence; the closer the branch leading to the third sequence is to the root of the tree, the larger the estimated distance between the first two sequences.
The bias in distance estimation appears to be a direct result of the standard greedy progressive algorithm used by many multiple alignment methods. These results have implications for choosing new taxa and genomes to sequence when resources are limited.
DNA sequence alignment is a common step in molecular evolutionary analysis. Aligned sequences are used for many purposes, including estimation of patterns of divergence, selection, the tempo and mode of evolutionary change, identification of functional elements and constraints, and phylogenetic history, just to name a few. Alignments are a hypothesis of site homology; as evolutionary distance among sequences increases, alignments are known to become less accurate [1–7]. The effect of alignment accuracy on downstream analysis in comparative genomics and bioinformatics is largely an unexplored topic, although some empirical studies have attempted to examine this with respect to functional element identification [8, 9] and phylogenetic analysis [10–16].
Beyond simple accuracy, multiple sequence alignment may affect downstream sequence analysis in unexpected ways relative to pair wise sequence alignment. In a previous study, I showed that evolutionary distance estimation from DNA sequences can be surprisingly robust to alignment error (evolutionary distance is the number of substitutions per site that have occurred since a pair of sequences diverged from a common ancestral sequence). This previous work was based on alignments of paired sequences; the relationship between accuracy of alignment and distance estimation might differ under multiple alignment conditions.
The primary goal of this study was to use simulation to examine the improvement in alignment accuracy when going from pair wise to multiple alignments, profiling the change versus the position of the additional sequences. How much is the accuracy of alignment of a pair of sequences improved by the addition of a third sequence with an intermediate evolutionary history (relative to the initial pair)? Where should the third sequence be in order to maximize the accuracy of the initial pair's alignment? In addition, the effects of these multiple sequence alignments on evolutionary distance estimation were also profiled. Does the position of the third sequence have an effect on the estimation of evolutionary distance of the initial pair, independent of the accuracy of the alignments?
These results add an additional spin to the debate over the importance of taxon sampling in phylogeny reconstruction. Although it is generally thought that increased taxon sampling yields more accurate phylogenies, simulation studies have proven to be equivocal and controversial [21–29]. However, all of these studies (and most of the empirical ones as well) tacitly assume perfect sequence alignment. Even if increased taxon sampling has no direct effect on phylogenetic accuracy (a debatable point; see cited literature above), it certainly appears to have an effect on the accuracy of alignment prior to the phylogenetic reconstruction process. How alignment accuracy may directly affect phylogeny reconstruction is a topic in need of further study (although see ).
These results also imply a specific strategy for choosing species to sequence when resources are limited. For example, given current estimates of the mammalian phylogeny [31–33], these results suggest that error in human and mouse sequence alignments could be reduced by about 15% if aligned with a third species with an evolutionary position similar to that of the Capuchin (Cebus albifrons) or the blind mole rat (Spalax judaei). In contrast, inclusion of sequences from the completed rat genome would only be expected to decrease error in human and mouse alignments by about 7%.
Careful examination of the strict (greedy) progressive alignment algorithm [17, 18] used by Clustal explains this pattern. When sequences A and B are aligned directly in a pair wise alignment, the hypothesized transitions and transversions are based solely on the properties of these sequences. In the multiple alignment, Clustal begins by aligning the closest pair of sequences (A and C). The more distant sequence C is from sequence A, the more transversions will have occurred since they shared a common ancestor. More transversions are therefore identified (hypothesized) between this pair during alignment. When sequence B is added to the alignment, numerous potential transversions have already been "set" between sequences A and C. Thus, any potential transversional difference between sequences A and B has a lower cost (on average) than would be found in the corresponding pair wise alignment, because a transversion may already have been identified between sequences A and C. The biasing effect of phylogenetic position on distance estimation from pair wise alignment appears to be a consequence of the greedy progressive algorithm implemented in Clustal.
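The order in which a greedy progressive scheme commits to sequences can be illustrated with a short sketch. This is a simplification and not Clustal's actual procedure (which builds a guide tree from pair wise distances); the function name progressive_order and the example distances are hypothetical and chosen only to mirror the A, B, C situation described above.

```python
from itertools import combinations


def progressive_order(distances):
    """Return the order in which sequences are added in a simplified greedy scheme.

    distances maps frozenset({name1, name2}) to a pair wise distance.
    The closest pair is aligned first; each remaining sequence is then
    added in order of its average distance to the sequences already placed.
    Earlier decisions are never revisited, which is what makes it greedy.
    """
    names = sorted({name for pair in distances for name in pair})
    first_pair = min(combinations(names, 2),
                     key=lambda pair: distances[frozenset(pair)])
    placed = list(first_pair)
    remaining = [name for name in names if name not in placed]
    while remaining:
        nearest = min(
            remaining,
            key=lambda name: sum(distances[frozenset((name, p))]
                                 for p in placed) / len(placed),
        )
        placed.append(nearest)
        remaining.remove(nearest)
    return placed


# Hypothetical distances: C is closer to A than either is to B.
example = {
    frozenset(("A", "B")): 0.6,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.6,
}
print(progressive_order(example))  # ['A', 'C', 'B']
```

In this toy case A and C are aligned first and B is added to a fixed A-C alignment, which is the situation in which differences already "set" between A and C can influence how B is placed.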
This study used Clustal because it is one of the most widely used alignment programs, particularly for high-throughput genomic analysis, and tends to be among the most accurate [7, 35]. While it is quite possible that the resulting alignments could be improved by changing alignment parameters (such as the mismatch and gap costs), the purpose of this study is not to optimize the alignment but rather to examine the difference multiple alignment makes under simple conditions and to examine downstream effects of these errors on distance estimation.
To partially examine whether the observed results are specific to Clustal or may be a more general alignment problem, most of the alignments were repeated with T-Coffee 1.37. Using the default parameters, T-Coffee produced significantly worse alignments and distance estimates than Clustal (the purpose of this study was not to compare the relative accuracy of these alignment methods, and the absolute differences may be due to default parameterization choices and not the overall quality of the methods themselves). However, the shape of the alignment accuracy curves was the same (i.e., the peak gain in accuracy during multiple alignment occurred when the third sequence bisected the branch leading from the root to sequence A). The biasing effect of the position of the third sequence appeared to be less severe and, in some cases, completely absent; unfortunately, the differences in accuracy made it difficult to systematically compare these results. Unlike Clustal, T-Coffee uses both global and local pair wise alignments to guide the production of the final global alignment. Although it is also a progressive algorithm, the intermediate alignments it produces make use of more information at early stages, which may help prevent the distance estimate bias.
There are many additional multiple DNA sequence alignment algorithms and programs available, some of which use progressive alignment schemes similar to those of Clustal and T-Coffee but allow for revision of previously aligned sequences, and some of which use very different approaches, including statistical alignments based on maximum likelihood or Bayesian methods (e.g., [37–41]). Some of these methods simultaneously estimate evolutionary distance and alignment [42, 43], while methods for simultaneously estimating phylogenies and alignments are also being developed [30, 44–48]. Comparisons of the overall accuracies of some of these programs in pair wise DNA sequence analysis have recently been conducted. How these programs and algorithms compare under multiple alignment conditions and whether the observed biasing effect is widespread or narrow across algorithms is a task for future investigation.
Multiple sequence alignments do improve upon pair wise sequence alignments. The optimal taxon sampling strategy for maximally improving alignments is to bisect long branches in a balanced framework. Independent of alignment accuracy, however, multiple alignment using a progressive algorithm can bias evolutionary distance estimates, with larger estimates consistently found as intermediate sequences appear deeper in the phylogeny.
In addition to point substitutions under the HKY model, insertions and deletions were allowed to occur, with the expected rate of deletion events being one occurrence every 40 substitutions and the expected rate of insertion events being one occurrence every 100 substitutions (as observed in primates and rodents). The realized numbers of insertions and deletions were drawn from a Poisson distribution with mean equal to the expected value. The lengths of individual insertion and deletion events were also chosen from a truncated (so as not to include zero) Poisson distribution with a mean of 4 bases (as observed from primate and rodent lineages) [52, 53]. Variation in insertion/deletion rate and size can have a large effect on alignment accuracy. However, it is likely that changing the values of these parameters in the present study would have similar effects across all conditions. Each simulation condition was replicated 1000 times.
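The insertion/deletion sampling scheme described above can be sketched as follows. This is an illustrative reconstruction and not the simulation program used in the study: the function names are hypothetical, the number of substitutions is taken as a given input, and the "mean of 4 bases" is assumed here to be the parameter of the underlying Poisson distribution before zeros are excluded (the text does not say which mean is meant).

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_truncated_poisson(mean, rng):
    """Draw from a Poisson distribution, rejecting zeros (zero-truncated)."""
    while True:
        value = sample_poisson(mean, rng)
        if value > 0:
            return value


def sample_indel_events(n_substitutions, rng,
                        deletions_per_substitution=1 / 40,
                        insertions_per_substitution=1 / 100,
                        length_parameter=4.0):
    """Return (deletion_lengths, insertion_lengths) for one lineage.

    The number of events of each type is Poisson with mean equal to the
    expected count; each event length is zero-truncated Poisson.
    length_parameter is ASSUMED to be the pre-truncation Poisson mean.
    """
    n_deletions = sample_poisson(
        n_substitutions * deletions_per_substitution, rng)
    n_insertions = sample_poisson(
        n_substitutions * insertions_per_substitution, rng)
    deletions = [sample_truncated_poisson(length_parameter, rng)
                 for _ in range(n_deletions)]
    insertions = [sample_truncated_poisson(length_parameter, rng)
                  for _ in range(n_insertions)]
    return deletions, insertions


rng = random.Random(2005)  # fixed seed so the sketch is reproducible
deletion_lengths, insertion_lengths = sample_indel_events(200, rng)
```

With 200 substitutions, the expected counts under these rates are 5 deletion events and 2 insertion events; the realized counts vary from replicate to replicate.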
For every simulated data set, the fate of each of the original sites was tracked and an alignment representing the true homology was constructed for each data set (that is, the simulation program produced gapped sequences in which all aligned sites were truly homologous). The gaps were removed from the sequences and data sets consisting of all three sequences and of just sequences A and B were constructed. Each data set was aligned using Clustal W version 1.83 with the default parameters, as is common in high-throughput analysis and comparative studies of this sort [3, 6–7, 36, 55–57]. This produced a hypothesized alignment, just as one would obtain from analysis of real data.
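One way to score a hypothesized alignment of sequences A and B against the true homology is sketched below. The accuracy measure shown (the fraction of truly homologous residue pairs that are placed in the same column of the hypothesized alignment) is an assumption made for illustration; this section does not define the measure used in the study, and the function names are hypothetical.

```python
def residue_pairs(row_a, row_b, gap="-"):
    """Return the set of (position in A, position in B) pairs sharing a column.

    Positions index the ungapped sequences, so the result does not depend
    on how many gap-only columns the two rows happen to share.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = set()
    pos_a = pos_b = 0
    for char_a, char_b in zip(row_a, row_b):
        has_a = char_a != gap
        has_b = char_b != gap
        if has_a and has_b:
            pairs.add((pos_a, pos_b))
        if has_a:
            pos_a += 1
        if has_b:
            pos_b += 1
    return pairs


def pairwise_accuracy(true_a, true_b, hyp_a, hyp_b):
    """Fraction of true homologous pairs recovered by the hypothesized rows."""
    true_pairs = residue_pairs(true_a, true_b)
    if not true_pairs:
        raise ValueError("true alignment contains no homologous pairs")
    hyp_pairs = residue_pairs(hyp_a, hyp_b)
    return len(true_pairs & hyp_pairs) / len(true_pairs)


# Toy example: the true alignment has two indels; the hypothesis has none.
print(pairwise_accuracy("ACGT-A", "AC-TGA", "ACGTA", "ACTGA"))  # 0.75
```

Because positions are counted on the ungapped sequences, the same function can be applied to rows A and B extracted from a three-sequence alignment, where some columns contain gaps in both rows.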
The hypothesized alignments were compared to the true alignment derived from the simulation. Evolutionary distances between sequences A and B were estimated for the correct alignment, the AB hypothesized alignment, and the ABC hypothesized alignment using the Tamura-Nei formula.
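For readers unfamiliar with the Tamura-Nei distance, the following sketch computes the standard analytic estimate from two aligned rows. It is not the code used in the study. Two choices here are assumptions: columns containing a gap or ambiguous character in either sequence are ignored, and base frequencies are estimated by pooling the two sequences over the retained columns.

```python
import math


def tn93_from_proportions(freqs, p1, p2, q):
    """Tamura-Nei distance from base frequencies and difference proportions.

    p1: proportion of sites with an A<->G transitional difference
    p2: proportion of sites with a C<->T transitional difference
    q:  proportion of sites with a transversional difference
    """
    g_a, g_c, g_g, g_t = (freqs[base] for base in "ACGT")
    if min(g_a, g_c, g_g, g_t) <= 0:
        raise ValueError("all four base frequencies must be positive")
    g_r = g_a + g_g  # purine frequency
    g_y = g_c + g_t  # pyrimidine frequency
    k1 = 2 * g_a * g_g / g_r
    k2 = 2 * g_t * g_c / g_y
    k3 = 2 * (g_r * g_y - g_a * g_g * g_y / g_r - g_t * g_c * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    if min(w1, w2, w3) <= 0:
        raise ValueError("distance undefined: sequences too divergent")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


def tamura_nei_distance(row_a, row_b):
    """Estimate the distance between two aligned rows (gap columns skipped)."""
    sites = [(x, y) for x, y in zip(row_a.upper(), row_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    counts = dict.fromkeys("ACGT", 0)
    n_ag = n_ct = n_tv = 0
    for x, y in sites:
        counts[x] += 1
        counts[y] += 1
        if x == y:
            continue
        if {x, y} == {"A", "G"}:
            n_ag += 1
        elif {x, y} == {"C", "T"}:
            n_ct += 1
        else:
            n_tv += 1
    n = len(sites)
    freqs = {base: count / (2 * n) for base, count in counts.items()}
    return tn93_from_proportions(freqs, n_ag / n, n_ct / n, n_tv / n)
```

Applying tamura_nei_distance to rows A and B of the true alignment, the AB alignment, and the ABC alignment would give the three estimates compared in the text. When all base frequencies are equal and the two transition classes are equally frequent, the formula reduces to the Jukes-Cantor distance, which provides a convenient sanity check.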
Thanks to Heath Ogden and anonymous reviewers for comments on this manuscript. This work was partially supported by the NIH R03-LM008637 and Arizona State University.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.