Progressive multiple sequence alignments from triplets
© Kruspe and Stadler; licensee BioMed Central Ltd. 2007
Received: 05 November 2006
Accepted: 15 July 2007
Published: 15 July 2007
The quality of progressive sequence alignments strongly depends on the accuracy of the individual pairwise alignment steps since gaps that are introduced at one step cannot be removed at later aggregation steps. Adjacent insertions and deletions necessarily appear in arbitrary order in pairwise alignments and hence form an unavoidable source of errors.
Here we present a modified variant of progressive sequence alignments that addresses both issues. Instead of pairwise alignments we use exact dynamic programming to align sequence or profile triples. This avoids a large fraction of the ambiguities arising in pairwise alignments. In the subsequent aggregation steps we follow the logic of the Neighbor-Net algorithm, which constructs a phylogenetic network by replacing triples by pairs step by step instead of combining pairs to singletons. To this end the three-way alignments are subdivided into two partial alignments, at which stage all-gap columns are naturally removed. This alleviates the "once a gap, always a gap" problem of progressive alignment procedures.
The three-way Neighbor-Net based alignment program aln3nn is shown to compare favorably with other progressive alignment tools on both protein and nucleic acid sequences. In the latter case one can easily include scoring terms that consider secondary structure features. Overall, the quality of the resulting alignments in general exceeds that of clustalw or other multiple alignment tools, even though our software does not include heuristics for context-dependent (mis)match scores.
The software is freely available for download from http://www.bioinf.uni-leipzig.de/Software/aln3nn.
High quality multiple sequence alignments (MSAs) are a prerequisite for many applications in bioinformatics, from the reconstruction of phylogenies and the assessment of evolutionary rate variations to gene finding and phylogenetic footprinting. A large part of comparative genomics thus hinges on our ability to construct accurate MSAs. Since the multiple sequence alignment problem is NP-hard, with the computational cost growing exponentially with the number of sequences, it has been a long-standing challenge to devise approximation algorithms that are both efficient and accurate. These approaches can be classified into progressive, iterative, and stochastic alignment algorithms. The most widely used tools such as clustalw and pileup utilize the progressive method that was first introduced in [4, 5]. This approach makes explicit use of the evolutionary relatedness of the sequences to build the alignment. The complete multiple sequence alignment of the given sequences is calculated from pairwise alignments of previously aligned sequences by following the branching order of a pre-computed "guide" tree, which reflects (at least approximately) the evolutionary history of the input sequences. It is typically reconstructed from pairwise sequence distances by some clustering method such as Neighbor-Joining or UPGMA.
Progressive sequence alignments, while computationally efficient, suffer from two major shortcomings. First, they are of course not guaranteed to find the optimal alignment. Pairwise comparisons necessarily utilize only a small part of the information that is potentially available in the complete data set. In particular, the relative placement of adjacent insertions and deletions leads to score-equivalent alignments among which the algorithm chooses one by means of a pragmatic rule (e.g. "Always make insertions before deletions"). At a later aggregation step, when profiles are aligned to sequences or with each other, these alternatives are no longer equivalent. Secondly, in contrast to other techniques, there is no mechanism to identify errors that have been made in previous steps and to correct them during later stages.
In this contribution we present a novel approach to progressive sequence alignment that alleviates both shortcomings at the expense of utilizing an exact algorithm to compute alignments of sequence and profile triples. Instead of using a single guide tree, we follow here the logic of phylogenetic networks as constructed by the Neighbor-Net algorithm, which calls for an aggregation step that constructs pairs from triples. As this requires us to subdivide 3-way alignments into pairs of alignments, it provides a chance for the removal of erroneously inserted gaps at later aggregation steps.
The contribution is organized as follows: In the following section we outline the algorithmic aspects of our approach. Furthermore we describe a straightforward way of incorporating RNA secondary structure information. Section 3 summarizes benchmark data in comparison to other multiple alignment tools. We conclude with a brief discussion of future improvements.
2.1 Dynamic Programming
The basic dynamic programming scheme for pairwise sequence comparison, known as the Needleman-Wunsch algorithm, requires quadratic space and time. It easily translates to a cubic space and time algorithm for three sequences. Biologically plausible sequence alignments, however, require the use of non-trivial gap cost functions. While cubic time algorithms are available for arbitrary gap costs, affine gap costs (with a much higher penalty for opening a new gap than for extending an existing one) in general yield good results already. In this contribution we therefore use an affine gap cost model. Gotoh's algorithm solves this problem with quadratic CPU and memory requirements for two sequences. The same author also described a dynamic programming scheme for the alignment of three sequences with affine gap costs that requires O(n³) time and space, which we use here with minor modifications.
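The affine gap recursion is easiest to see in the pairwise case. The following is a minimal sketch of a Gotoh-style score computation for two sequences, not the aln3nn implementation; the function name `gotoh_score` and all scoring parameters are illustrative assumptions. The three-sequence version keeps the same idea but tracks one state per column type (seven non-empty gap patterns) over a cubic lattice.

```python
from math import inf

def gotoh_score(a, b, match=1.0, mismatch=-1.0, gap_open=-3.0, gap_extend=-1.0):
    """Optimal global alignment score of a and b under affine gap costs.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All parameter values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[-inf] * (m + 1) for _ in range(n + 1)]
    X = [[-inf] * (m + 1) for _ in range(n + 1)]
    Y = [[-inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap is cheaper than opening a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Only the score is returned here; a traceback over the three matrices would be needed to recover the alignment itself.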
As in the case of pairwise sequence alignments, the recursions immediately generalize to alignments of profiles so that a single sequence becomes a special case of a profile. Match and gap scores are simply added up over all triples of sequences, one from each profile.
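As a rough illustration of how scores are summed over profile triples, the sketch below scores one alignment column built from three profile columns. It assumes a sum-of-pairs score for each sequence triple and a flat per-gap penalty; both choices, as well as the names `pair_score` and `triple_column_score`, are assumptions made for this example. The affine gap bookkeeping of the actual recursion is omitted.

```python
from itertools import product

def pair_score(x, y, match=1.0, mismatch=-1.0, gap=-2.0):
    """Toy substitution score; a gap facing a gap is scored as neutral."""
    if x == "-" and y == "-":
        return 0.0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def triple_column_score(col_a, col_b, col_c):
    """Score for aligning one column from each of three profiles.

    Each argument holds the characters of one profile column, one per
    sequence in that profile. The score is summed over all triples
    (x, y, z) with one character taken from each profile.
    """
    total = 0.0
    for x, y, z in product(col_a, col_b, col_c):
        total += pair_score(x, y) + pair_score(x, z) + pair_score(y, z)
    return total
```

A single sequence is simply a profile whose columns hold one character, so `triple_column_score("A", "A", "A")` reduces to the plain three-sequence case.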
The resource requirements of this algorithm, in particular the cubic memory consumption, are acceptable only for relatively small sequence lengths n even on modern workstations. Several approaches have been explored in the past to reduce the search space so that long sequences can be dealt with, see e.g. [14–16].
We utilized here the Divide-&-Conquer approach described by Stoye to limit both space and time requirements. Input sequences that exceed a given threshold length l are successively subdivided into smaller sequences until the length criterion is fulfilled. The partial sequences are aligned separately and the resulting alignments are concatenated afterward. The result is an approximate solution of the global multiple sequence alignment problem. The choice of the threshold length depends on sequence properties and the available amount of memory and CPU resources. For the following simulations we have chosen a length of l = 150. The methods described by [14, 15] are known to produce optimal alignments but are much harder to implement.
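The control flow of such a subdivision scheme can be sketched as follows. This is a simplified illustration, not the cited method: it cuts every sequence at its midpoint, whereas a real divide-and-conquer aligner chooses cut positions that are expected to be compatible with a good alignment. The base aligner `pad_align` is a hypothetical placeholder standing in for the exact three-way dynamic programming step.

```python
def pad_align(seqs):
    """Placeholder base aligner: right-pads with gaps to equal length.

    Stands in for an exact aligner; it is NOT a meaningful alignment method.
    """
    width = max(len(s) for s in seqs)
    return [s + "-" * (width - len(s)) for s in seqs]

def divide_and_conquer(seqs, threshold, base_align=pad_align):
    """Split sequences until all are at most `threshold` long, align the
    pieces separately, and concatenate the partial alignments."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if max(len(s) for s in seqs) <= threshold:
        return base_align(seqs)
    # ASSUMPTION: naive midpoint cuts, chosen only to keep the sketch short.
    cuts = [len(s) // 2 for s in seqs]
    left = divide_and_conquer([s[:c] for s, c in zip(seqs, cuts)],
                              threshold, base_align)
    right = divide_and_conquer([s[c:] for s, c in zip(seqs, cuts)],
                               threshold, base_align)
    return [l + r for l, r in zip(left, right)]
```

Because each piece is aligned independently, the concatenated result is only an approximation of the global optimum, which matches the trade-off described above.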
2.2 Alignment order
The order in which sequences and profiles are aligned has an important influence on the performance of progressive alignment algorithms. In programs that are based on pairwise alignments such as clustalw or pileup, binary guide trees, which encapsulate at least an approximation to the phylogenetic relationships of the input sequences, are used to determine the alignment order. The input sequences form the leaves of this tree; each interior node corresponds to an alignment, so that the root of the guide tree represents the desired multiple alignment of all input sequences.
The division of the ABC alignment into AB' and B''C frequently results in all-gap columns in the two parts. These are removed in order to recover valid MSAs. This constitutes a mechanism by which gaps introduced in early agglomeration steps can be removed again in later steps. This removal is guided by the increasing amount of information that is implicit in profiles composed of a larger number of sequences. Our software keeps track of gaps that appear in intermediate alignments but that are not present in the final result to demonstrate that gap removal is not a rare phenomenon in practice.
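The gap-removal effect of the division step can be illustrated with a short sketch. It assumes the alignment is given as a list of equal-length gapped strings and that the caller already knows which rows go into each of the two parts; how the middle group is assigned to the two parts is not shown here, and the helper names are hypothetical.

```python
def drop_all_gap_columns(rows):
    """Remove every column that consists of gap characters only."""
    if not rows:
        return []
    keep = [k for k in range(len(rows[0])) if any(r[k] != "-" for r in rows)]
    return ["".join(r[k] for k in keep) for r in rows]

def split_alignment(rows, left_idx, right_idx):
    """Divide an alignment into two sub-alignments given by row indices.

    ASSUMPTION: the row assignment is supplied by the caller. Columns that
    become all-gap inside a part are dropped, so each part is again a
    valid alignment.
    """
    left = drop_all_gap_columns([rows[i] for i in left_idx])
    right = drop_all_gap_columns([rows[i] for i in right_idx])
    return left, right
```

For example, in the alignment `["AC-GT", "A--GT", "ACTG-"]` the third column contains a residue only in the last row, so it disappears from the part built from the first two rows.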
2.4 Alignments of Structured RNAs
Recent discoveries of a large number of small RNAs with distinctive secondary structures have prompted the development of specialized multiple alignment programs for this class of molecules. Most of these approaches make explicit use of structural alignment techniques such as tree editing (MARNA), tree alignments (RNAforester), or variants of the Sankoff algorithm (foldalign, dynalign, locarna). In contrast, "structure enhanced" approaches utilize standard sequence alignment algorithms but incorporate modified match and mismatch scores designed to take structural information into account. The STRAL program has recently demonstrated that such "structure enhanced" alignments perform comparably to true structural alignments in many cases. We have thus included in our software the possibility to use RNA secondary structure annotation as additional input with nucleic acid alignments.
We use McCaskill's algorithm (as implemented in the Vienna RNA package) to compute the matrix of equilibrium base pairing probabilities $P_{ij}$ for each input sequence and derive for each sequence position $i$ the probabilities $p_1(i) = \sum_{j<i} P_{ij}$, $p_2(i) = \sum_{j>i} P_{ij}$, and $p_3(i) = 1 - p_1(i) - p_2(i)$ that position $i$ is paired with a position $j < i$, paired with a position $j > i$, or remains unpaired, respectively. The $p_x(i)$ values are used as structure annotation. For a pair of annotated input sequences A and B we define a structural score contribution $S_{struct}(i_A, j_B)$ for positions $i$ and $j$ that rewards bases sharing similar structural properties. The total (mis)match score is the weighted sum of the sequence score and the structure score, $S_{final}(i_A, j_B) = \psi \cdot S_{seq}(i_A, j_B) + (1 - \psi) \cdot S_{struct}(i_A, j_B)$, with a balance term ψ that measures the relative contribution of sequence and structure similarity. In the case of very similar sequences one should use ψ ≈ 1, since inaccuracies in the structure prediction are more harmful than the extra information is helpful in this case. Conversely, very dissimilar sequences have to be aligned with a score dominated by the structural component.
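The annotation and the weighted score can be sketched in a few lines. The base pairing probability matrix is taken as given (in practice it would come from a folding package), and the structural score used below is an assumption: a Bhattacharyya-style overlap of the two probability vectors, chosen for illustration only and not taken from the text. All function names are hypothetical.

```python
from math import sqrt

def pairing_annotation(P):
    """Per-position probabilities (p1, p2, p3) from a symmetric matrix P.

    p1: paired with an upstream position (j < i)
    p2: paired with a downstream position (j > i)
    p3: unpaired
    """
    n = len(P)
    annotation = []
    for i in range(n):
        p1 = sum(P[i][j] for j in range(i))
        p2 = sum(P[i][j] for j in range(i + 1, n))
        annotation.append((p1, p2, 1.0 - p1 - p2))
    return annotation

def struct_score(pa, pb):
    """ASSUMED structural score: overlap of two probability vectors.

    Equals 1 for identical vectors and 0 for disjoint ones. Values are
    clamped at 0 to guard against small negative rounding errors.
    """
    return sum(sqrt(max(x, 0.0) * max(y, 0.0)) for x, y in zip(pa, pb))

def final_score(s_seq, s_struct, psi):
    """Weighted sum of sequence and structure score with balance term psi."""
    return psi * s_seq + (1.0 - psi) * s_struct
```

With `psi = 1.0` the structure term vanishes and the score reduces to the plain sequence score, which corresponds to the recommendation for very similar sequences.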
3.1 Pairwise versus Three-Way Alignments
Somewhat surprisingly, three-way alignments also provide a small but significant gain in alignment score even in the cases where the simulated data correspond to a correct alignment that is entirely gap free. This effect is noticeable in particular in comparison with "straight" pairwise progressive clustalw alignments, in which no attempt is made to correct for problems in the initial pairwise alignments (Additional file 1). This introduction of spurious gaps is a well-known problem with pairwise nucleic acid alignments.
3.2 Protein Alignments
Our software does not employ any heuristic rules to alter scoring parameters based on local sequence context or properties of partial profiles. Nevertheless, aln3nn compares well with other common alignment programs, indicating that a simple affine scoring model is sufficient; only ProbCons, a combination of probabilistic modeling and consistency-based alignment techniques specialized for protein alignments, performs systematically better. Elaborate scoring heuristics thus essentially seem to compensate for the algorithmic shortcomings of MSAs based on initial pairwise alignments.
3.3 RNA alignments
As in previous work, we used the structure conservation index (SCI) to assess the quality of the calculated alignments. The SCI is defined as the ratio of the consensus folding energy of a set of aligned sequences (calculated using the RNAalifold program) to the average unconstrained folding energy of the individual sequences. The SCI is close to 0 for structurally divergent sequences and close to 1 for correctly aligned sequences with a common fold. Values larger than one indicate a perfectly conserved RNA structure that is additionally supported by compensatory as well as consistent mutations that preserve the common structure. A benchmark study established that the SCI is an appropriate measure for RNA alignment quality when the sequences are known to have a common fold, since decreased values of the SCI can be attributed to alignment errors. For four of the six test sets with reference alignments we also computed the BAliBase SP score (SPS), which directly measures the similarity of two alignments. For all computations we used a fixed tradeoff between sequence and structure scores of ψ = 0.5.
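The SCI itself is a simple ratio once the folding energies are available. The sketch below takes the energies as plain numbers (in practice they would be produced by a folding program); the function name and the example values used to exercise it are hypothetical.

```python
def structure_conservation_index(consensus_energy, single_energies):
    """Ratio of the consensus folding energy of an alignment to the mean
    unconstrained folding energy of its individual sequences.

    Energies are expected to be negative for folded structures.
    """
    if not single_energies:
        raise ValueError("need at least one single-sequence energy")
    mean_single = sum(single_energies) / len(single_energies)
    if mean_single == 0:
        raise ValueError("mean single-sequence energy is zero; SCI undefined")
    return consensus_energy / mean_single
```

A consensus energy equal to the mean single-sequence energy gives an SCI of 1, while a consensus structure with no stabilizing energy gives an SCI of 0.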
3.4 Gap removals
[Table: mean frequency f and standard deviation σ_f of correctly removed gap columns from the intermediate alignments after the division process; rows include Group II Intron.]
We have presented here a novel progressive alignment tool, aln3nn, that uses exact dynamic programming to construct three-way alignments of sequences and profiles and that uses a three-to-two aggregation procedure in the spirit of Neighbor-Net. A direct comparison of exact three-way alignments with progressive alignments of the same three sequences shows that the progressive approach leads to significantly suboptimal scores. The discrepancy increases with sequence diversity and in/del probability. While incurring significant additional computational costs compared to pairwise, guide-tree based approaches, aln3nn achieves competitive alignment accuracies on both protein and nucleic acid data from the BAliBASE and BRaliBase benchmark data sets. The software furthermore provides an option to compute structure enhanced RNA alignments.
Programs such as clustalw employ a variety of heuristic rules that introduce local modifications of the scoring scheme to (partially) compensate for problematic sections of intermediate alignments. In contrast, aln3nn achieves this encouraging performance without any heuristic modifications of the scoring schemes. This indicates that three-way alignments and the more sophisticated aggregation steps provide a significant advantage over pairwise methods. In particular, the comparison with the performance of t_coffee shows that the shortcomings of initial pairwise alignments cannot be fully overcome even by utilizing consensus information from a collection of pairwise alignments. Furthermore, we observe that the three-to-two aggregation step, with its division procedure, removed up to one fifth of the previously introduced gap characters, emphasizing that the inability to correct misplaced gaps is a major shortcoming of traditional progressive alignment algorithms.
In its present implementation, aln3nn demonstrates that progressive alignment schemes can produce competitive high quality alignments even without sophisticated scoring functions. This leaves ample room for future improvements. One might want to include gap penalties that depend on local sequence context, in particular in the intermediate profile alignment steps. The division step for the three-way alignments could also be modified in several ways. A possible approach would infer a phylogenetic tree that is then subdivided at the longest or the most central edge. In its present implementation, aln3nn is relatively slow compared to many recent multiple alignment methods, although it typically outperforms some of the standard tools. The running time could be reduced in the future, e.g. by improving the branch and bound approach and by anchoring the alignments at very well conserved regions. Overall, aln3nn shows that progressive alignments are a competitive approach that is worthwhile to explore.
Availability and Requirements
Project name: aln3nn
Project homepage: http://www.bioinf.uni-leipzig.de/Software/aln3nn
Operating system(s): platform independent in principle, tested for LINUX and other UNIX dialects. An ANSI C compiler is required.
Programming language: C
License: GNU GPL.
Restrictions to use by non-academics: none.
Fruitful discussions with Dirk Drasdo and Ivo L. Hofacker, as well as valuable comments by anonymous referees are gratefully acknowledged. This work was supported in part by the FP-6 EMBIO project, the SPP 1174 "Deep Metazoan Phylogeny" and the DFG Bioinformatics Initiative (BIZ-6/1-2).
- The aln3nn software. [http://www.bioinf.uni-leipzig.de/Software/aln3nn]
- Wang L, Jiang T: On the Complexity of Multiple Sequence Alignment. J Comp Biol 1994, 1: 337–348.
- Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673
- Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J Mol Evol 1984, 20: 175–186. 10.1007/BF02257378
- Feng D, Doolittle R: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–360. 10.1007/BF02603120
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
- Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 1958, 38: 1409–1438.
- Bryant D, Moulton V: NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. In WABI '02: Proceedings of the Second International Workshop on Algorithms in Bioinformatics. London, UK: Springer-Verlag; 2002:375–391.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol 1970, 48: 443–452. 10.1016/0022-2836(70)90057-4
- Dewey TG: A sequence alignment algorithm with an arbitrary gap penalty function. J Comp Biol 2001, 8: 177–190. 10.1089/106652701300312931
- Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162: 705–708. 10.1016/0022-2836(82)90398-9
- Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J theor Biol 1986, 121: 327–337. 10.1016/S0022-5193(86)80112-6
- Konagurthu A, Whisstock J, Stuckey P: Progressive multiple alignment using sequence triplet optimization and three-residue exchange costs. J Bioinf and Comp Biol 2004, 2(4):719–745. 10.1142/S0219720004000831
- Myers E, Miller W: Optimal alignments in linear space. Bioinformatics 1988, 4: 11–17. 10.1093/bioinformatics/4.1.11
- Lipman D, Altschul S, Kececioglu J: A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the United States of America 1989, 86(12):4412–4415. 10.1073/pnas.86.12.4412
- Gupta S, Kececioglu J, Schaffer A: Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 1995, 2(3):459–462.
- Stoye J: Multiple sequence alignment with the divide-and-conquer method. Gene Combis 1997, 211: 45–56.
- Bryant D, Moulton V: Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Mol Biol Evol 2004, 21: 255–265. 10.1093/molbev/msh018
- Bryant D, Moulton V: Consistency of Neighbor-Net. Alg Mol Biol 2007, 2(1):8. 10.1186/1748-7188-2-8
- Bandelt HJ, Dress AWM: Split Decomposition: A New and Useful Approach to Phylogenetic Analysis of Distance Data. Mol Phyl Evol 1992, 1: 242–252. 10.1016/1055-7903(92)90021-8
- Huson DH: SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 1998, 14: 68–73. 10.1093/bioinformatics/14.1.68
- Wetzel R: Zur Visualisierung abstrakter Ähnlichkeitsbeziehungen. PhD thesis. Bielefeld University, Germany; 1995.
- Hofacker I, Fontana W, Stadler P, Bonhoeffer L, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie 1994, 125: 167–188. 10.1007/BF00818163
- Hofacker I, Fekete M, Stadler P: Secondary structure prediction for aligned RNA sequences. J Mol Biol 2002, 319(5):1059–1066.
- The Vienna RNA package. [http://www.tbi.univie.ac.at/RNA]
- Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics 2005, 21: 3352–3359. 10.1093/bioinformatics/bti550
- Höchsmann M, Töller T, Giegerich R, Kurtz S: Local Similarity in RNA Secondary Structures. Proc of the Computational Systems Bioinformatics Conference, Stanford, CA, August 2003 (CSB 2003) 2003, 2: 159–168.
- Sankoff D: Simultaneous solution of the RNA folding, alignment, and proto-sequence problems. SIAM J Appl Math 1985, 45: 810–825. 10.1137/0145048
- Hull Havgaard JH, Lyngsø R, Stormo GD, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics 2005, 21: 1815–1824. 10.1093/bioinformatics/bti279
- Mathews DH, Turner DH: Dynalign: An Algorithm for Finding Secondary Structures Common to Two RNA Sequences. J Mol Biol 2002, 317: 191–203. 10.1006/jmbi.2001.5351
- Will S, Missal K, Hofacker IL, Stadler PF, Backofen R: Inferring Non-Coding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering. PLoS Comp Biol 2006, in press.
- Bonhoeffer LS, McCaskill JS, Stadler PF, Schuster P: RNA Multi-Structure Landscapes. A Study Based on Temperature Dependent Partition Functions. Eur Biophys J 1993, 22: 13–24. 10.1007/BF00205808
- Dalli D, Wilm A, Mainz I, Steger G: STRAL: progressive alignment of non-coding RNA using base pairing probability vectors in quadratic time. Bioinformatics 2006, 22(13):1593–1599. 10.1093/bioinformatics/btl142
- McCaskill J: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 1990, 29: 1105–1119. 10.1002/bip.360290621
- Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics 1998, 14: 157–163. 10.1093/bioinformatics/14.2.157
- Thompson J, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15: 78–88.
- Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment. Genome Research 2005, 15: 330–340. 10.1101/gr.2821705
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:
- Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. Journal of Molecular Biology 2000, 302:
- Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30: 3059–3066. 10.1093/nar/gkf436
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy S, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research 2005, 33:
- Gardner P, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Research 2005, 33: 2433. 10.1093/nar/gki541
- Hertel J, Lindemeyer M, Missal K, Fried C, Tanzer A, Flamm C, Hofacker I, Stadler P: The Expansion of the Metazoan MicroRNA Repertoire. BMC Genomics 2006, 7(25):
- Hertel J, Hofacker IL, Stadler PF: snoReport: Computational identification of snoRNAs with unknown targets. 2007. [http://www.bioinf.uni-leipzig.de/Publications/07wplist.html] Submitted; preprint BIOINF 07-003
- Washietl S, Hofacker I, Stadler P: Fast and reliable prediction of noncoding RNAs. PNAS 2005, 102(7):2454–2459. 10.1073/pnas.0409169102
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.