Kalign – an accurate and fast multiple sequence alignment algorithm
© Lassmann and Sonnhammer; licensee BioMed Central Ltd. 2005
Received: 30 May 2005
Accepted: 12 December 2005
Published: 12 December 2005
The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.
We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.
Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
The alignment of multiple sequences is essential in the analysis of protein sequences . In contrast to pairwise alignment, multiple sequence alignment (MSA) can reveal subtle similarities among large groups of proteins. Such information can be used in phylogenetic analysis , function prediction , HMM building , finding consensus sequences and in the identification of residues critical to function. Due to the importance of multiple sequence alignments in such a wide range of applications, a considerable amount of interest has been focused on improving the accuracy of MSA algorithms. Two basic modes of sequence alignment exist: global, i.e. over the entire length of the sequences , and local, in which only high-scoring areas are considered . In general, database search algorithms use the local strategy [7–10], while most MSA algorithms use the global strategy. However, two noticeable exceptions are the two local MSA methods Poa  and Dialign [12, 13]. Global methods tend to outperform local methods when sequences are related over their entire length , while local methods are superior in multiple domain cases when sequences may only share one common domain . Since it is rarely known how sequences are related prior to the alignment, a method attempting to combine both local and global features was proposed by Notredame et al. . Although this method produces good alignments [14, 16, 17] it suffers from being time consuming. Recently two new approaches, Mafft  and Muscle  were proposed. Both methods claim to produce alignments as accurate as those by T-Coffee while being considerably faster. The high accuracy is achieved by two iterative refinement strategies that are applied after an initial 'rough' alignment has been found. An important task for multiple sequences alignment algorithms in the future lies in the annotation and analysis of complete proteomes . For an algorithm to be successful in such a setting it must produce high quality alignments consistently and, due to the volume of data, in a computationally feasible way. With these two goals in mind we developed Kalign, a global, progressive alignment method. We enhanced this approach by employing an approximate string-matching algorithm to calculate sequence distances and by incorporating local matches into the otherwise global alignment. We demonstrate that Kalign is well suited both in terms of speed and accuracy to deal with the challenges posed by large-scale comparative genomics.
The Kalign algorithm follows a strategy analogous to the standard progressive method for sequence alignment . Pairwise distances are calculated, a guide tree is constructed and sequences/profiles are aligned in the order given by the tree. In contrast to existing methods, the Wu-Manber approximate string-matching algorithm  is used in the distance calculation and optionally in the dynamic programming used to align the profiles.
Distance estimation and tree building
In progressive alignment the order in which sequences are aligned is crucial for accuracy. The goal is to align the most similar sequences first, to avoid errors in early stages of the alignment that cannot be corrected later on. This is normally accomplished by building a guide tree that dictates the order of pairwise alignments. The tree is typically built from a matrix of pairwise distances between all sequences, e.g. using the UPGMA  or the Neighbour-Joining  method. Performing all pairwise comparisons has quadratic complexity, and hence this step tends to dominate the running time of most progressive methods when aligning many sequences. Performing all pairwise alignments is slow but gives better distance estimates than approximate techniques, e.g. the k-tuple method employed optionally by ClustalW. In Kalign, sequence distances are estimated based on a novel method employing the Wu-Manber approximate string-matching algorithm. This strategy enables Kalign to estimate sequence distances more accurately but as fast as the k-tuple method. We will first discuss the Wu-Manber algorithm in some detail and then explain how it is incorporated into Kalign.
The Wu-Manber algorithm is an extension to the exact Baeza-Yates-Gonnet algorithm  that allows string-matching with mismatches. The distance between two strings is measured by the Levenshtein edit distance. Two strings A and B have an edit distance d if A can be transformed into B by applying d mismatches, insertions or deletions. The Wu-Manber algorithm has a complexity of O(tk) where t is the length of the text string and k the number of errors allowed. Previously, protein sequences were commonly searched with patterns 2–3 long [7, 9, 10], and hence we choose to search with likewise short, three-residue patterns as well, allowing only one error. Since the maximum number of patterns under these setting is 8000 and only one error is allowed, all input sequences can be searched efficiently. A minor modification of the standard algorithm  allows us to search the sequences with 10 patterns at once, which speeds up this step. The benefit of using approximate string-matching when comparing protein sequences is obvious in two scenarios. Firstly, if sequences have an even degree of similarity along their entire length rather than patches of high identity, the standard k-tuple method fails to detect the similarity. For example, imagine comparing two sequences 'AAAAAA' and 'AALAAL'. Even though the sequences are 66% identical, the standard k-tuple method (with a word length of 3) finds no common patterns between the two sequences while the Wu-Manber algorithm finds 67 common patters ('AAA' matches sequence 1 with a mismatch and sequence 2 exactly, etc.). Similarly, sequences 'AAAAAA' and 'ALALAL' (50% sequence identity) share no exact patterns, yet 25 patterns with mismatches. Secondly, sequences that are less than 30% identical often share few, if any strings of length 3, so the resolution of the k-tuple methods starts to fail. However, shared mismatch patterns can still be readily found and enable the Wu-Manber algorithm to report meaningful distances even between highly divergent sequences. A drawback of using pattern matching for distance estimation is that many spurious matches are reported. Kalign overcomes this problem by considering the locality of the matches found as well as their total number. All matches add a score (16 if the pattern matches both sequences exactly, 8 if one has a mismatch, and 1 if both sequences have a mismatch) to the diagonal on the dynamic programming matrix on which they occur. In similar sequences one diagonal will get a very high score compared to the rest since many matches occur on the same diagonal. However, in distantly related sequences the distribution of scores is less clear and many diagonals will receive relatively low scores. Therefore, the sequence similarity in Kalign is defined as the sum of the three highest scoring diagonals found. By including three diagonals, only matches that are likely to be part of the final alignment are considered when estimating similarity, while many spurious matches are excluded. Although this method is marginally slower than the standard k-tuple counting method, it is much faster than performing all pairwise alignments. The pairwise similarity scores computed in this way are used by the UPGMA clustering method to construct the guide tree.
At each internal node of the guide tree two sequences, two profiles or one sequence and one profile are aligned. Optionally, Kalign can use Wu-Manber matches as anchor points during the alignment phase, which requires two extra steps in addition to the dynamic programming. This option is intended to improve accuracy in cases of discontinuous alignments, e.g. if a domain has been inserted or deleted. It speeds up the dynamic programming, but the gain is cancelled out by the time to perform the extra steps.
Kalign employs the global dynamic programming method using affine gap penalties as described in Durbin et al. . Residues are assigned to three states (aligned, gap in sequence a, or gap in sequence b) which are interconnected by arrows representing transitions. The model has been modified to disallow a gap in one sequence to be immediately followed by a gap in the other sequences. Normally, three matrices of size (m + 1) (n + 1) are used to represent these states. When these 'state-matrices' are filled, the final cell contains the maximum alignment score, and a traceback procedure (requiring these matrices) is used to retrieve the actual alignment. To reduce the amount of memory required we combined two known strategies. Firstly, an additional matrix size (n + 1) (m + 1) is used to store which transition occurs in every cell of the dynamic programming matrix . By doing so, the optimal alignment can be read off this 'trace-matrix' by looking at which transitions lead from the final cell to the first cell. Secondly, since the three 'state-matrices' are no longer needed to perform the traceback procedure, they can be reduced into one-dimensional arrays because each cell in the dynamic programming matrix only depends on values from the previous column. Thus, instead of using 3 ((n + 1) (m + 1)) memory we now only need 3 (n + 1) + ((n + 1) (m + 1)) memory. In practice, this translates into a ten-fold reduction in memory requirement. The reduction in memory reduces the number of cache-misses and makes the dynamic programming substantially faster. To our knowledge, the combination of these two methods has not been previously described. If the user wishes to use the option of including Wu-Manber matches as anchor points during the alignment phase, two additional steps are performed:
1. Consistency check
The task of the consistency check is to sieve through the thousands of matches found between two sequences and find the largest set of matches that can be included into an alignment. For example: a pattern matching at position 100 in both sequences is inconsistent with a pattern that matches sequence A at position 20 and sequence B at position 150. Matches found in both sequences (or profiles) are plotted at the corresponding position into the dynamic programming matrix. Since it is possible for several patterns to match at the same position the number of matching patterns is recorded. The filled matrix is analogous to a homology matrix containing substitution scores in standard sequence alignment. A modified version of the Needleman-Wunsch algorithm is then used to find the path through the dynamic programming matrix that contains the highest number of consistent patters. Because we are interested in local matches here, no gap penalties are used. Finally, all shared matches that occur on too short diagonals are considered spurious and are excluded. We found that a cutoff of 22 residues long diagonals worked well.
2. Updating of pattern match positions
Including matches in initial pairwise alignments involving regular sequences is relatively trivial. However, deeper into the guide tree, profiles are aligned both to each other and to sequences. The updating step adjusts the absolute position of matches found within sequences to their relative positions within the profiles generated by the dynamic programming step. For example, a match at position 50 in sequence A can end up in the 55th column in a profile, if 5 gaps were inserted anywhere in sequence A before position 50. The matches initially 'tied' to individual sequences are thus assigned to matches within profiles and can be used in the next pairwise alignment step.
Like other alignment programs based on dynamic programming, Kalign uses a substitution matrix and affine gap penalties. The choice of matrix and gap penalties has been the subject of previous studies . The most commonly used substitution matrices are BLOSUM  and PAM . A common idea is that similar sequences should be aligned with 'hard' matrices like PAM50 or BLOSUM80 while more distantly related sequences align better using 'soft' matrices like PAM250 or BLOSUM40. For instance, the commonly used program ClustalW adjusts the choice of substitution matrix accordingly. However, in agreement with Vogt et al. , we found no significant difference in alignment quality when using a soft matrix instead of a hard matrix on similar sequences. However, in the case of more distantly related sequences, hard matrices generally produce worse results than soft ones. Simply put, similar sequences are easy to align and the choice of substitution matrix does not noticeably affect the alignment quality. However, the correct alignment of dissimilar sequences requires using 'soft' matrices. Therefore we decided not to implement a complicated scheme that adjusts the choice of matrix but use a single soft matrix in all cases. We found little difference in alignment quality between using the BLOSUM50, PAM250 or GONNET250  matrices and arbitrarily chose the GONNET250 matrix in combination with the parameters reported by Vogt et al. .
The Kalign algorithm was implemented in standard C.
The Balibase test set is a collection of alignments derived from structural databases or from manual alignments in the literature. We used the first five categories in Balibase version 2.01, containing 141 alignments. Each category represents some characteristic such as long or short sequences, high or low sequence identity, or large N/C terminal deletions or extensions. Reference alignments in Balibase contain only few, on average 10, and partial sequences. As noted before , this unnatural bias in the test set favours certain methods. We believe the real challenge lies in aligning large numbers of full-length sequences, which is currently not covered in Balibase. The diversity of the test set is further reduced because several sequences appear in more than one reference alignment.
Prefab exploits the abundance of pairwise structural alignment to create a multiple alignment test set. Each case in Prefab consists of a pairwise reference alignment and a set of sequences containing the two reference sequences plus 48 additional sequences that were obtained by querying a database with the reference sequences. To perform a test, the set of 50 sequences are aligned, the pairwise alignment of the two reference sequences is extracted from the resulting MSA, and is compared against the pairwise reference alignment. Prefab version 3 contains 1932 individual test cases, each based on a single pairwise alignment. Compared to the 141 test cases in Balibase this seems impressive, but each Balibase test case is an alignment of more than two sequences and in effect Balibase contains 8053 pairwise alignments.
As Balibase and Prefab alignments are relatively small (10 and 50 sequences), we felt the need to examine performance on larger alignments. Since no real testset with large alignments exists, we were forced to use simulations to create this dataset. We used Rose version 1.3 , a program that simulates the evolution of sequences in a probabilistic fashion. Given a specified number of sequences and a target average evolutionary distance between them, Rose constructs a random phylogenetic tree, a random ancestor sequence at the root, and simulates evolution by applying substitutions, insertions, and deletions to create the sequences along the edges. As all the events in the history of the sequences are known, the true alignment is known and is recorded. Although some aspects of the simulation may be artificial, it is the only method that provides 100% knowledge of the true alignment. Obviously, alignments and sequences cannot be simulated perfectly by an evolutionary model. For example, two sequences modelled at a PAM distance of 200 might resemble real sequences at a PAM250 more closely than at PAM200. However, it is undeniable that also simulated sequences will become more and more difficult to align with increasing PAM distances. Alignment programs will align distant sequences differently, and based on this a meaningful and informative comparison between the programs can be made. A large test set such as this offers the opportunity to analyze the running times of alignment methods in depth.
The alignment quality of each method was determined by calculating the sum-of-pairs score (SP) according to Thompson et al. . This score reflects the percentage of correctly aligned residues determined by comparison to a reference alignment, and has little in common with the likewise called sum-of-pairs score often used as an objective function. We do not use the column score (CS)  as we feel this score is inadequate and does not reflect the biological correctness of alignments. For example, even if 99% of the sequences are correctly aligned, the column score can become zero due to a single misaligned sequence. In practice, the CS score gives lower accuracies than the SP score, but the ranking of the methods remains the same (results not shown).
We compared our method Kalign (with the default parameters) to Mafft version 3.85 , Muscle version 3.0 , ClustalW version 1.83 , Dialign version 2.2.1  and T-Coffee version 1.37 . A comprehensive review of the individual programs and their options is beyond this paper. Unless otherwise stated, we used the programs tested here in their highest accuracy mode. In the case of Mafft, four different scripts govern whether Fast Fourier Transform (FFT) and iterative refinement are used or not. Our experience in using Mafft revealed only small differences in quality and speed between the scripts using FFT or not (results not shown). We used the FFTNSI script throughout because it is slightly faster than the corresponding NWNSI (lacking FFT) script. The Muscle program, similar to the NWNSI script of Mafft, was run using all of the available refinement options.
Results and Discussion
Balibase results. Categories ("Cat.") refer to the five Balibase categories. Average 1 is the average sum-of-pairs score over all 141 alignments, while average 2 is the average across all five categories.
CPU time (sec.)
CPU time (sec.)
In this paper we present Kalign, a novel multiple sequence alignment algorithm based on Wu-Manber approximate pattern matching that combines high quality with high speed. Compared to existing programs, Kalign performed much more robustly when aligning large amounts of sequences or distant sequences in a large-scale benchmark of generated alignments. In terms of computational efficiency, Kalign is superior to the other methods, and readily aligns hundreds of sequences in minutes on a normal desktop computer. Coupled with the fact that Kalign gives very accurate alignments, this makes Kalign a very attractive overall method. The high accuracy of Kalign is due to the innovative use of the approximate Wu-Manber string-matching algorithm. This allows sequence distances to be accurately estimated even in difficult cases. Precise sequence distances generate good quality guide trees that, in turn, lead to good alignments. At the same time, Wu-Manber string-matching is very fast and dramatically cuts down the time required for the distance estimation step that dominates the running time of most alignment programs. The strategy detailed here can, in principle, be applied to any other progressive alignment method. Even when disregarding the results on the new large testset, Kalign's performance on Balibase and Prefab is impressive especially when considering that unlike other methods Kalign was not trained on either test set, and that other methods with similar performance are much slower.
Availability and requirements
The Kalign program and a Kalign server are freely available at http://msa.cgb.ki.se or by request from the authors.
We would like to thank Alistair Chalk for many useful discussions and Robert Edgar for help with the Prefab testset.
- Notredame C: Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 2002, 3: 131–144.View ArticlePubMedGoogle Scholar
- Felsenstein J: PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.Google Scholar
- Sjolander K: Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 2004, 20(2):170–179.View ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, (32 Database):138–141.Google Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453.View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197.View ArticlePubMedGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448.PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 1990, 183: 63–98.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452–464.View ArticlePubMedGoogle Scholar
- Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA 1996, 93(22):12098–12103.PubMed CentralView ArticlePubMedGoogle Scholar
- Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15(3):211–218.View ArticlePubMedGoogle Scholar
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27(13):2682–2690.PubMed CentralView ArticlePubMedGoogle Scholar
- Lassmann T, Sonnhammer ELL: Quality assessment of multiple alignment programs. FEBS Lett 2002, 529: 126–130.View ArticlePubMedGoogle Scholar
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302: 205–217.View ArticlePubMedGoogle Scholar
- Karplus K, Hu B: Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 2001, 17(8):713–720.View ArticlePubMedGoogle Scholar
- Katoh K, Misawa K, Kuma Ki, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059–3066.PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–1797.PubMed CentralView ArticlePubMedGoogle Scholar
- Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O: Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 2001, 270(1–2):17–30.View ArticlePubMedGoogle Scholar
- Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Biol 1987, 25: 351–360.Google Scholar
- Wu S, Manber U: Fast Text Searching Allowing Errors. Communications of the ACM 1992, 35: 83–91.View ArticleGoogle Scholar
- Sokal RR, Michener CD: A statistical method for evaluating systematic relationships. Univ Kansas Sci Bul 1 1958, 38: 1409–1438.Google Scholar
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406–425.PubMedGoogle Scholar
- Baeza-Yates RA, Gonnet GH: A new approach to text searching.In Proceedings of the 12th International Conference on Research and Development in Information Retrieval Edited by: Belkin NJ, van Rijsbergen CJ. Cambridge, MA: ACM Press; 1989, 168–175. [http://citeseer.ist.psu.edu/50265.html]Google Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press; 1998.View ArticleGoogle Scholar
- Goad WB, Kanehisa MI: Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Res 1982, 10: 247–263.PubMed CentralView ArticlePubMedGoogle Scholar
- Vogt G, Etzold T, Argos P: An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol 1995, 249(4):816–831.View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–10919.PubMed CentralView ArticlePubMedGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of protein sequence and structure 1978, 5: 345–358. [National biomedical research foundation Washington DC] [National biomedical research foundation Washington DC]Google Scholar
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256(5062):1443–1445.View ArticlePubMedGoogle Scholar
- Bahr A, Thompson JD, Thierry JC, Poch O: BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res 2001, 29: 323–326.PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15: 87–88.View ArticlePubMedGoogle Scholar
- Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics 1998, 14(2):157–163.View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: Low-complexity multiple sequence alignment with T-Coffee accuracy [Short Paper]. ISMB/ECCB 2004.Google Scholar
- Sonnhammer ELL: Belvu.1999. [http://www.cgb.ki.se/cgb/groups/sonnhammer/Belvu.html]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.