Comprehensive comparison of graph based multiple protein sequence alignment strategies
BMC Bioinformatics volume 13, Article number: 64 (2012)
Alignment of protein sequences (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows to swap different components of the alignment process and, thus, to investigate their contribution to the alignment quality and computation time. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark.
Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal.
This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).
A variety of methods used by modern molecular biology such as structural modeling, function annotation, phylogenetic analysis and similarity searches are based on multiple protein sequence alignments (MPSA). Correct MPSA arranges in one column position amino acids that share a common ancestor or are functionally/structurally equivalent. MPSA provides position-specific information on evolutionary conserved characters, correlation between characters and their distribution. These features can be used in many further applications for which the quality of MPSA is, therefore, crucial . Traditionally, the quality of the alignment is evaluated using a scoring function based on gap penalties and a substitution matrix. When two sequences are aligned, an exact solution can be found using dynamic programming [2, 3]. When many sequences are aligned, the exact solution is computationally expensive and becomes intractable for more than a few sequences . The majority of methods avoid exponential scaling of alignment problem by employing various greedy heuristics, including a widely used progressive alignment technique . Classic progressive methods such as ClustalW  are fast and can produce reasonable results for sequences that are sufficiently similar (e.g., show identity higher then 40%) but become impractical in the so called twilight zone (e.g., identity below 30%) [1, 7]. A vast amount of research has been conducted in order to improve alignment quality for distant sequences. Many methods that have succeeded in this task are based on modifying the scoring function. We use a unified graph framework to compare different scoring functions.
In this article we shall use the term "alignment graph" as it is defined in the graph-based alignment algorithm by Rausch et al.  and the corresponding implementation in the SeqAn sequence alignment library . The alignment graph represents residues of aligned sequences by vertices and various signals on which the alignment is based, such as residue substitutions from the substitution model, structural links etc., by edges connecting the vertices. This provides a flexible model for scoring: any type of information can be introduced by adding edges or weights to the existing edges, while certain transformations of the graph help to avoid local minima during the optimization step.
The separation of the input signal for the alignment from the process that produces the alignment was pioneered by T-Coffee . This program starts by compiling a library of local and global pairwise alignments, which equals to populating the alignment graph with the corresponding edges. T-Coffee was then extended to include structural information  and to combine alignments produced by other methods into a consensus alignment . Another example is MaxFlow , which initializes the alignment graph with information from PSI-BLAST searches. We note that, in spite of these implementations, the contribution of various information sources to the quality of MPSA is yet to be analyzed. Here, we supply this need by presenting a clear comparison of sequence based information sources.
Various consistency transformations have been shown to effectively improve alignments of distant sequences [10, 14, 15]. The idea is to transform scores used to align pairs of sequences to be consistent between different pairs. For example, for three sequences A, B and C, if residue Ai is aligned to residue Bj and residue Bj is aligned to residue Ck, then this implies that a consistent scoring between A and C should lead to the alignment of Ai to Ck. In alignment graph notation consistency implies shifting weights between neighboring edges. Pioneering work on consistency was presented by T-Coffee that implemented consistency transformation referred as triplet library extension. In this transformation the consistency of edge between vertices Ai and Bj is increased by iterating all possible Ck vertices and adding the minimum of (Ai, Ck) and (Bj, Ck) weights to the (Ai, Bj) weight. This idea has been extended in the SeqAn::T-Coffee program, which implements T-Coffee algorithm using SeqAn sequence alignment library . As in the original T-Coffee, existing edges are modified by iteration of neighboring triplets. However, this version also introduces new edges where triplet information requires such for consistency. Tests on benchmark suggest that this approach improves alignment quality .
Consistency transformation was also approached in a probabilistic framework. For example, in the ProbCons  implementation, vertices of the alignment graph are connected by probabilities according to pairwise hidden Markov models (HMMs). These probabilities are then transformed towards a higher degree of consistency by additively combining probabilities of triplet paths trough all possible vertex triplets. This transformation is similar to triplet library extension, but since edges are weighted by a probability function, the contribution of (Ai, Ck) and (Bj, Ck) edge-pair to the (Ai, Bj) edge is equal to the product of the corresponding probabilities. ProbCons allows repeating the consistency transformation several times, in effect extending consistency to sets of four, five or a larger number of vertices.
Consistency and transitivity concepts were unified in MaxFlow program, which was designed to align motifs of remote homologs that have little sequence similarity when compared directly, but can be connected via a path of pairwise alignments . MaxFlow starts with an alignment graph based on a library of PSI-BLAST alignments. This graph is weighted by assigning each vertex pair a consistency score: the number of common neighbors relative to the total number of neighbors for the two vertices in a pair. The graph is then used to align a pair of distant structural homologs. During the alignment residue pairs are weighted using the weakest path link: the maximum over the path scores defined by the weakest consistency score in a path. In comparison with classical methods, MaxFlow demonstrated superior performance in both reliability and coverage of structural motifs aligned.
In addition to the scoring function, alignment quality is largely affected by guiding tree used in the progressive alignment. Our results suggest that single linkage clustering is optimal for this purpose regardless of the benchmark set.
The progressive alignment method has a serious pitfall: subgroups of sequences are aligned independently of one another, which implies that an optimal subalignment produced near the top of the tree can become a source of errors as it is aligned to other subalignments in the later stages . Consistency scoring, discussed earlier, is one way to address this problem; the other is to correct alignment errors in a post processing step called iterative refinement. In one step of iterative refinement MPSA is partitioned horizontally into two subalignments, which are then realigned and the new alignment is kept if it improves scoring function. Iterative methods based on simulated annealing are too slow to be practical . Other methods differ mainly in the way sequences are divided into two groups before being re-aligned . This can be a leave one out , random partitioning , or tree-dependent restricted partitioning . The last option was shown to be effective in terms of both quality and speed, and a variation of this technique has been adopted by several current methods including MUSCLE  and MAFFT .
Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows to swap different components of the alignment process and, thus, to investigate their contribution to the alignment quality. We used our program to see how alignment quality changes depending on the input information, guiding trees, the applied consistency transformations, and the strategy for iterative refinement during post-processing. This is the first time these strategies are comprehensively and clearly compared using a single implementation platform. Our results confirm the existing knowledge on which strategies are efficient. We also show that the best strategy is comparable in accuracy to the best software in the field.
To compare alignment strategies we systematically varied information sources, guiding trees, consistency, clique transformation and iterative refinement options, evaluating the resulting alignments on BAliBASE and SABmark.
Comparison of information sources
We found that using both global and local pairwise alignments to construct the alignment graph, resulted in high quality multiple alignments for both BAliBASE (Figure 1) and SABmark (Figure 2A and 2B) benchmark databases. Adding longest common subsequences to global and local alignments had minor effect on the quality of the multiple alignments. Adding external information in the form of GTG motifs, extracted using motif tracking as described in the original GTG article , increased the quality of the multiple alignments by an average of 1% for both BAliBASE and SABmark (full data available in Additional file 2: Tables S2 and S3).
We found that single linkage clustering is clearly the best option for all reference sets in BAliBASE and SABmark benchmarks (see Tables 1 and 2). To our surprise, it produced significantly better alignments than commonly used neighbor joining. For BAliBASE, single linkage increased accuracy relative to neighbor joining by 3% and for SABmark benchmark by 5%. We note that this comparison is based on alignments produced with no consistency transformation or postprocessing to account for errors introduced by heuristic nature of progressive alignment. It is probable that when consistency or postprocessing is included the particular type of the guiding tree used becomes less relevant.
Consistency and clique transformations
Our results show that triplet library extension improves alignment accuracy for all sets in both benchmarks when compared to alignments produced with no consistency transformation (referred here as the basic strategy). TripletT-Coffee, which is limited to applying consistency to the existing edges, improved BAliBASE alignments by 1.7%, SABmark Twilight Zone alignments by 3.3% and SABmark Superfamily alignments by 1% (see Tables 3 and 4). TripletSeqAn, which introduces new edges when implied by consistency, was considerably better, enhancing BAliBASE accuracy by 3%, SABmark Superfamily accuracy by 3.1% and SABmark Twilight alignment accuracy by a remarkable 11.2%. Reiterating tripletSeqAn transformation for the second time either enhanced or deteriorated alignment accuracy depending on the particular set. The effect was small and has little practical value.
Applying clique transformation decreased alignment accuracy when compared to the basic strategy. Applying clique transformation after the graph has been made consistent by tripletT-Coffee generally provided an additional improvement relative to tripletT-Coffee. However, this is not practical since this transformation multiplied the required computational time by a factor of 8 for SABmark sets and by a factor of 65 for the BAliBASE sets. This is due to the huge amount of edges generated by the clique transformation which makes the progressive alignment step computationally demanding. The task appeared even intractable when applying clique transformation after tripletSeqAn consistency for some of the BAliBASE alignments, for which reason we do not report on these results here. Running clique transformation after tripletSeqAn consistency for SABmark alignments decreased alignment accuracy. This suggests that tripletSeqAn is sufficient to introduce edges that have practical utility for the MPSA. Combining MaxFlow consistency with clique transformation had minor improvement on BAliBASE alignments and deteriorated quality of SABmark alignments.
All iterative refinement strategies improved alignment accuracy, but the tree dependent strategies were more efficient. It made little difference, whether the partitioning was done randomly (treeRandom) or systematically in breath-first order (treeBF). Random partitioning improved accuracy of BAliBASE alignments by 1%, treeRandom by 2.6% and treeBF by 2.8%, relative to alignments produced with no refinement (see Table 5). Improvement of accuracy for SABmark Twilight alignments were 5.9%, 9% and 8.5%, respectively; and for SABmark Superfamily alignments 1.6%, 3.2% and 2.9%, respectively (Table 6). Interestingly, the tree dependent strategies are comparable in time and accuracy to the tripletSeqAn, which was the best consistency transformation as presented above. The mean SP score for BAliBASE alignments produced with tripletSeqAn was 0.856, the corresponding score for treeBF was very similar: 0.854. Computation times for these two strategies were also comparable: tripletSeqAn took on average 45 seconds and treeBF on average 54 seconds per BAliBASE alignment. Moreover, applying treeBF iterative refinement on BAliBASE alignments produced under tripletSeqAn consistency had no effect on accuracy when measured up to three decimal points of the SP score (Table 5). The same tendency can be seen from SABmark alignments: differences between tripletSeqAn and treeBF accuracy are rather cosmetic and treeBF is on average slightly faster (1.2 sec per alignments against two seconds per alignment for Twilight alignments and one sec against 1.6 sec for Superfamily alignments). For this benchmark, however, applying treeBF refinement to alignments produced under tripletSeqAn did result in some improvement (Table 6).
The relative speed of iterative refinement strategies depended largely on the number of sequences in the alignment. For BAliBASE alignments, which on average contain 30 sequences, run times of random and restricted partitioning were comparable. For SABmark alignments, which on average contain 8 sequences, restricted partitioning was slightly faster. Partitioning alignment along all possible edges in the guiding tree was generally faster than random tree dependent partitioning for alignments with a small to moderate number of sequences (BAliBASE sets RV11, RV12, RV40 and RV50 and both SABmark sets), but became slower as the number of aligned sequences approached the number of refinement iterations (BAliBASE sets BB20 and BB30 that have on average 46 and 63 sequences per alignment, respectively).
Comparison to other programs
We compared our optimal strategies to the eight leading multiple alignment programs: (1) ProbCons 1.12 , (2) T-Coffee 8.98 , (3) MUSCLE 3.8 , (4) MAFFT 6.847 , (5) ClustalW 2.0.12 , (6) Clustal Omega 1.0.3 , (7) Kalign 2.03  and (8) MSAProbs 0.9.5 . MAFFT was run with L-INS-i option; other programs were run with default parameters. All programs were run with a single core. The comparison was done on BAliBASE (Table 7), SABmark (Table 8) and PREFAB (Table 9). The optimal strategy for BAliBASE alignments, referred as MMSA in Table 7, was to use global and local pairwise alignments complemented with GTG motifs as input information; to apply consistency transformation tripletSeqAn; and to align the sequences using a single linkage guiding tree. No postprocessing was performed. For SABmark we chose to test two strategies: the same strategy as for BAliBASE, and another strategy, that was augmented with iterative refinement of type treeBF (referred as MMSA::treeBF in Table 8). On BAliBASE our aligner ranked fourth in terms of alignment accuracy. MSAProbs, MAFFT and ProbCons were more accurate, while Kalign, MUSCLE, MAFFT and both Clustals had better time performance. Also for SABmark our aligner ranked fourth in term of accuracy, this time it was MSAProbs, ProbCons and T-Coffee that performed better. The fastest were again Kalign, both Clustals, MUSCLE and MAFFT.
For PREFAB we chose to use the same strategy as for BAliBASE. On the main set this yielded moderate accuracy outperforming only the fast aligners: MUSCLE, Kalign and ClustalW. On the weighted reference set our code performed poorly, since it does not account for the overrepresented families (Table 9).
We conclude that, although our modular implementation of the best alignment strategies does not outperform the best aligners in the field, the performance is at a comparable level.
We have completed a comprehensive comparison of graph based strategies for aligning multiple sequences. Our results suggest clear guidelines for a number of choices made during construction of a multiple protein sequence alignment. First, we showed that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. When reliable external information, the GTG motifs in our case, is available, it will most likely improve the accuracy and thus should also be included. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, tripletSeqAn is the most efficient consistency transformation from those compared. It can have a large improvement on alignment quality, particularly for alignments in the twilight zone. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with tripletSeqAn consistency transformation in both time and accuracy.
As a more subtle result, we found that transitivity, which in principle can increase the sensitivity of the aligner, will in most cases introduce more noise than signal. When we iterated tripletSeqAn two times, in effect transferring edge weights within neighborhoods of four vertices, the quality of the alignments increased slightly. However, when iterations were repeated three times, extending the influenced neighborhood to five vertices, the quality invariably deteriorated. The same was the case for clique transformation, where any two vertices, connected by a path, were connected by an edge and thus were allowed to directly influence each other during the progressive alignment.
We showed that graph based modular implementation allows to compare the contribution of different algorithmic components to the alignment quality and computation time. We demonstrated that shifting edge information within triplets of alignment graph vertices prior to the progressive step and the tree-dependent iterative refinement after the progressive step are equally effective strategies that significantly improve accuracy, particularly for the case of distant homologs. As a negative result we report that extending the neighborhood of edge shifting to five or more vertices introduces more noise than signal.
Our program starts by gathering input information from pairwise alignments into an alignment graph. The graph is then refined, transformed and finally fed to a progressive alignment routine that builds an alignment. The resulting alignment is iteratively refined in a post processing step. Altogether there are six steps, detailed below. Each step was implemented as a combination of SeqAn library methods and our own code. The contribution of SeqAn library is outlined in Additional file 2: Table S1.
1. Computing information sources
Global, local and longest common subsequence alignments were computed using SeqAn library with Blosum62 substitution matrix, gap opening penalty set to -13 and gap extension penalty set to -1. Global alignments were constructed using Gotoh's affine gap cost and local alignments using Smith-Waterman algorithms. From local alignments we selected the best four, possibly overlapping, matches; as our preliminary tests indicate that including a larger number of local matches has little effect on accuracy (data not published). We acknowledge that the selection of substitution matrix and gap penalties will have a large effect on the pairwise alignments and hence on the overall MPSA. The values applied here are based on the literature and other implementations.
Additionally, we compiled a collection of conserved motifs based on the Global Trace Graph (GTG) . These were extracted using motif tracking as described in the original GTG article. Alignments and motifs were converted into edges between aligned sequence segments referred hereafter as segment matches.
2. Initializing the alignment graph
The vertices of the graph represent segments of the sequence in the set of proteins to be aligned. The edges of the graph represent possible alignments between these segments. After the edges are generated in the first step, they are assigned weights using Blosum62 substitution matrix. Blosum62 was offset to include only positive values. After this, the structure of the alignment graph is simplified by refinement. During the refinement segment matches are cut into smaller parts in such a way that none of the matches partly overlap (for details see ).
3. Score transformation
Score transformations can both modify the edge weights of existing edges, and create new edges. An overview of score transformations is presented in Additional file 3: Figure S1.
3A. Consistency transformation
Alignment graph is made consistent using either triplet library extension introduced in T-Coffee, tripletT-Coffee, a variation of this introduced in SeqAn, tripletSeqAn, or the consistency measure used in MaxFlow . TripletSeqAn can be repeated several times, in effect extending consistency to sets of four, five or larger number of vertices.
3B. Clique transformation
Connected components are identified using Kruskal's algorithm and converted to cliques. Kruskal's is used to cluster alignment graph vertices by iterating all edges in the order of decreasing edge weight. This produces a collection of spanning trees, which are used to connect each pair of vertices in a tree with a weight equal to the weakest link of the path connecting the two vertices.
4. Tree construction
When global alignments are included in the list of information sources (see above), they are used to calculate the distance matrix. Otherwise, the distance matrix is calculated using k-mer counting. A guiding tree is constructed from the distance matrix using neighbor joining, single linkage, complete linkage, UPGMA or weighted UPGMA. When the guiding tree is constructed using neighbor joining, the root is placed between the last two taxa to be joined.
5. Progressive alignment
Sequences are progressively aligned in the order defined by the guiding tree. In each step, we apply heaviest common subsequence algorithm to align two subalignments. This is a segment based dynamic programming that finds the maximum weight trace from a set of refined alignment segments.
6. Iterative refinement
We implemented three different refinement strategies: random partitioning introduced by Berget & Munson , and the two tree-dependent restricted partitioning strategies referred as treeRandom and treeBF. TreeRandom is the same strategy as used by MUSCLE and MAFFT: sequences are partitioned by cutting a random edge of the guiding tree. TreeBF was designed to be a time efficient version of treeRandom: each edge of the tree is cut only once in the breath-first order.
We run our program with options corresponding to each of the tested strategies on BAliBASE 3.0  and SABmark 1.65  alignment benchmark databases. The best strategy was also tested on PREFAB 4.0  benchmark. Tests were performed on a 2.93 GHz Intel Xeon X7350 with 66 GB RAM.
The BAliBASE 3.0 benchmark is a collection of 218 reference protein alignments, compiled from FSSP structural alignments , HOMSTRAD  databases and hand constructed alignments. The database is organized into five reference sets: in Reference 1 sequences are equidistant and have two levels of conservation; in References 2 a highly divergent "orphan" is aligned with a family of close homologs; in Reference 3 subgroups with < 25% residue identity between groups are aligned; Reference 4 contains sequences with large N/C-terminal extensions; and Reference 5 sequences with large insertions in the middle of the aligned blocks. Alignments were scored relative to the core blocks of benchmark multiple alignments using quality measures proposed by BAliBASE: the sum of pairs score (SP) and the true column score (TC). SP is equal to the number of residue pairs correctly aligned in the evaluated alignment relative to the total number of residue pairs in the benchmark alignment. TC is the number of columns with correct alignment divided by the total number of columns in the benchmark alignment.
The SABmark 1.65 benchmark contains pairwise structural alignments from SOFI  and CE  databases, that are organized according to SCOP classification. There are two sets: the "Twilight Zone" set contains 1740 domains grouped into 209 SCOP folds, and the "Superfamilies" set that contains 3280 domains grouped into 425 SCOP superfamilies. Sequences in each of the twilight subsets are restricted to have at most 25% and those in superfamilies subsets to have at most 50% pairwise identity. The quality of SABmark alignments was evaluated using FD and FM scores based on reference structural alignments. The FD score measures the number of correct residue pairs relative to the number of paired residues in the reference alignment and FM score relates the number of correct pairs to the total number of paired residues in the test alignment.
PREFAB 4.0 is a large database generated automatically by supplementing structural pairs from FSSP database with homologs found through PSI-BLAST queries. Each alignment set is filtered to have at most 80% identity and limited to a set of 50 random homologs. There are 1682 alignments in the main set and 100 alignments in the weighted set. The weighted set consists of families with one highly over-represented sub-family. The accuracy of the multiple alignment for a given PREFAB alignment set is evaluated relative to the consensus of FSSP and CE alignments of the corresponding structural pair. Alignments were evaluated using PREFAB Q score, which is the analog of BAliBASE SP and SABmark FD scores.
Pei J: Multiple protein sequence alignment. Curr Opin Struct Biol 2008, 18: 382–386. 10.1016/j.sbi.2008.03.007
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4
Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5
Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proc Natl Acad Sci 1989, 86: 4412–4415. 10.1073/pnas.86.12.4412
Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–360. 10.1007/BF02603120
Thompson JD, Higgins DG, Gibson TJ: CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673
Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 19: 56–68.
Rausch T, Emde A-K, Weese D, Döring A, Notredame C, Reinert K: Segment-based multiple sequence alignment. Bioinformatics 2008, 24: i187-i192. 10.1093/bioinformatics/btn281
Döring A, Weese D, Rausch T, Reinert K: SeqAn An efficient, generic C++ library for sequence analysis. BMC Bioinforma 2008, 9: 11. 10.1186/1471-2105-9-11
Notredame C, Higgins DG, Heringa J: T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. J Mol Biol 2000, 302: 205–217. 10.1006/jmbi.2000.4042
O' Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C: 3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments. J Mol Biol 2004, 340: 385–395. 10.1016/j.jmb.2004.04.058
Wallace MI, O'Sullivan O, Higgins DG, Notredame C: M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res 2006, 34(6):1692–1699. 10.1093/nar/gkl091
Heger A, Lappe M, Holm L: Accurate Detection of Very Sparse Sequence Motifs. J Comput Biol 2004, 11(5):843–857. 10.1089/cmb.2004.11.843
Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15: 330–340. 10.1101/gr.2821705
Pei J, Sadreyev R, Grishin NV: PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 2003, 19: 427–428. 10.1093/bioinformatics/btg008
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M: Comprehensive study on iterative algorithms of multiple sequence alignment. CABIOS 1995, 11: 13–15.
Notredame C: Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 2002, 3: 131–144. 10.1517/14622422.214.171.124
Barton GJ, Sternberg MJE: A strategy for the rapid multiple alignment of protein sequences: Confidence levels from tertiary structure comparisons. J Mol Biol 1987, 198(2):327–337. 10.1016/0022-2836(87)90316-0
Berger MP, Munson PJ: A novel randomized iterative strategy for aligning multiple protein sequences. Bioinformatics 1991, 7(4):479–484. 10.1093/bioinformatics/7.4.479
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340
Katoh K, Misawa K, Kuma K-i, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 2002, 30(14):3059–3066. 10.1093/nar/gkf436
Heger A, Mallick S, Wilton C, Holm L: The global trace graph, a novel paradigm for searching protein sequence databases. Bioinformatics 2007, 23: 2361–2367. 10.1093/bioinformatics/btm358
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, et al.: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 2011, 7: 539.
Lassmann T, Frings O, Sonnhammer E: Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res 2009, 37: 858–865. 10.1093/nar/gkn1006
Liu Y, Schmidt B, Maskell DL: MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 2010, 26(16):1958–1964. 10.1093/bioinformatics/btq338
Thompson JD, Plewniak F: O. P: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15: 87–88. 10.1093/bioinformatics/15.1.87
Walle IV, Lasters I, Wyns L: SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21: 1267–1268. 10.1093/bioinformatics/bth493
Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994, 22: 3600–3609.
Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: A database of protein structure alignments for homologous families. Prot Sci 1998, 7: 2469–2471. 10.1002/pro.5560071126
Boutonnet NS, Rooman MJ, Ochagavia M-E, Richelle J, Wodak SJ: Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Engineering 1995, 8(7):647–662. 10.1093/protein/8.7.647
Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 1998, 11: 739–747. 10.1093/protein/11.9.739
We would like to thank Finnish Graduate School in Computational Sciences for funding this study.
The authors declare that they have no competing interests.
IP conceived of the study, carried out implementation, testing and benchmark validation, and drafted the manuscript. LH participated in the design and coordination of the study and helped to draft the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 2:Table S1. Outline of MMSA implementation. This table presents the main alignment steps and the corresponding components and functions. Some functions are referred in the main text using abbreviations listed in the fourth column. Source code for any function can be found in one of the two packages: SeqAn sequence alignment library or MMSA code deposited at http://ekhidna.biocenter.helsinki.fi/MMSA. Table S2. Performance of strategies with different information sources on the BAliBASE benchmark. Columns show the average sum-of-pairs (SP) and true-column (TC) scores achieved by different alignment strategies for each of the six BAliBASE references. The number of sequences in each reference is given in parentheses. Mean values for the entire database are reported in addition to the mean execution time of each strategy. The best results in each column are shown in bold. The strategies tested differ in the combination of pairwise sequence information that is used to construct the alignment graph. Combinations include: C, longest common subsequence, L, the four top scoring local alignments, G, global alignment, M, GTG motifs. Table S3. Performance of strategies with different information sources on the SABmark benchmark Columns show the average developer (FD) score (equivalent to sum-of-pairs [SP] score) and modeller (FM) score achieved by different alignment strategies for the "Superfamily" and "Twilight Zone" sets in the SABmark database. The number of sequences in each reference is given in parentheses. Mean execution time of each strategy is reported in seconds. The best results in each column are shown in bold. The strategies tested differ in the combination of pairwise sequence information that is used to construct the alignment graph. Combinations include: C, longest common subsequence, L, the four top scoring local alignments, G, global alignment, M, GTG motifs. (DOC 396 KB)
Additional file 3:Figure S1. Overview of the score transformations. A) Alignment graph of four sequences after refinement has resolved conflicts between segment matches. B) Applying tripletT-Coffee reinforces connections within triplet cliques. C) Applying tripletSeqAn reinforces connections within triplet cliques and introduces new edges between vertices that have a common neighbor. Note that edges are never introduced between vertices that belong to the same sequence. D) Applying MaxFlow consistency followed by a clique transformation: edges within spanning trees are weighted by the relative number of common neighbors, and new edges are introduced to convert each spanning tree into a clique (or almost a clique, since edges are never introduced between vertices that belong to the same sequence). (DOC 34 KB)
About this article
Cite this article
Plyusnin, I., Holm, L. Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 13, 64 (2012). https://doi.org/10.1186/1471-2105-13-64