Comprehensive comparison of graph based multiple protein sequence alignment strategies

Background Alignment of protein sequences (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows to swap different components of the alignment process and, thus, to investigate their contribution to the alignment quality and computation time. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark. Results Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal. Conclusions This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).


Background
A variety of methods used by modern molecular biology such as structural modeling, function annotation, phylogenetic analysis and similarity searches are based on multiple protein sequence alignments (MPSA). Correct MPSA arranges in one column position amino acids that share a common ancestor or are functionally/structurally equivalent. MPSA provides position-specific information on evolutionary conserved characters, correlation between characters and their distribution. These features can be used in many further applications for which the quality of MPSA is, therefore, crucial [1]. Traditionally, the quality of the alignment is evaluated using a scoring function based on gap penalties and a substitution matrix. When two sequences are aligned, an exact solution can be found using dynamic programming [2,3]. When many sequences are aligned, the exact solution is computationally expensive and becomes intractable for more than a few sequences [4]. The majority of methods avoid exponential scaling of alignment problem by employing various greedy heuristics, including a widely used progressive alignment technique [5]. Classic progressive methods such as ClustalW [6] are fast and can produce reasonable results for sequences that are sufficiently similar (e.g., show identity higher then 40%) but become impractical in the so called twilight zone (e.g., identity below 30%) [1,7]. A vast amount of research has been conducted in order to improve alignment quality for distant sequences. Many methods that have succeeded in this task are based on modifying the scoring function. We use a unified graph framework to compare different scoring functions.
In this article we shall use the term "alignment graph" as it is defined in the graph-based alignment algorithm by Rausch et al. [8] and the corresponding implementation in the SeqAn sequence alignment library [9]. The alignment graph represents residues of aligned sequences by vertices and various signals on which the alignment is based, such as residue substitutions from the substitution model, structural links etc., by edges connecting the vertices. This provides a flexible model for scoring: any type of information can be introduced by adding edges or weights to the existing edges, while certain transformations of the graph help to avoid local minima during the optimization step.
The separation of the input signal for the alignment from the process that produces the alignment was pioneered by T-Coffee [10]. This program starts by compiling a library of local and global pairwise alignments, which equals to populating the alignment graph with the corresponding edges. T-Coffee was then extended to include structural information [11] and to combine alignments produced by other methods into a consensus alignment [12]. Another example is MaxFlow [13], which initializes the alignment graph with information from PSI-BLAST searches. We note that, in spite of these implementations, the contribution of various information sources to the quality of MPSA is yet to be analyzed. Here, we supply this need by presenting a clear comparison of sequence based information sources.
Various consistency transformations have been shown to effectively improve alignments of distant sequences [10,14,15]. The idea is to transform scores used to align pairs of sequences to be consistent between different pairs. For example, for three sequences A, B and C, if residue A i is aligned to residue B j and residue B j is aligned to residue C k , then this implies that a consistent scoring between A and C should lead to the alignment of A i to C k . In alignment graph notation consistency implies shifting weights between neighboring edges. Pioneering work on consistency was presented by T-Coffee that implemented consistency transformation referred as triplet library extension. In this transformation the consistency of edge between vertices A i and B j is increased by iterating all possible C k vertices and adding the minimum of (A i , C k ) and (B j , C k ) weights to the (A i , B j ) weight. This idea has been extended in the SeqAn::T-Coffee program, which implements T-Coffee algorithm using SeqAn sequence alignment library [8]. As in the original T-Coffee, existing edges are modified by iteration of neighboring triplets. However, this version also introduces new edges where triplet information requires such for consistency. Tests on benchmark suggest that this approach improves alignment quality [8].
Consistency transformation was also approached in a probabilistic framework. For example, in the ProbCons [14] implementation, vertices of the alignment graph are connected by probabilities according to pairwise hidden Markov models (HMMs). These probabilities are then transformed towards a higher degree of consistency by additively combining probabilities of triplet paths trough all possible vertex triplets. This transformation is similar to triplet library extension, but since edges are weighted by a probability function, the contribution of (A i , C k ) and (B j , C k ) edge-pair to the (A i , B j ) edge is equal to the product of the corresponding probabilities. ProbCons allows repeating the consistency transformation several times, in effect extending consistency to sets of four, five or a larger number of vertices.
Consistency and transitivity concepts were unified in MaxFlow program, which was designed to align motifs of remote homologs that have little sequence similarity when compared directly, but can be connected via a path of pairwise alignments [13]. MaxFlow starts with an alignment graph based on a library of PSI-BLAST alignments. This graph is weighted by assigning each vertex pair a consistency score: the number of common neighbors relative to the total number of neighbors for the two vertices in a pair. The graph is then used to align a pair of distant structural homologs. During the alignment residue pairs are weighted using the weakest path link: the maximum over the path scores defined by the weakest consistency score in a path. In comparison with classical methods, MaxFlow demonstrated superior performance in both reliability and coverage of structural motifs aligned.
In addition to the scoring function, alignment quality is largely affected by guiding tree used in the progressive alignment. Our results suggest that single linkage clustering is optimal for this purpose regardless of the benchmark set.
The progressive alignment method has a serious pitfall: subgroups of sequences are aligned independently of one another, which implies that an optimal subalignment produced near the top of the tree can become a source of errors as it is aligned to other subalignments in the later stages [16]. Consistency scoring, discussed earlier, is one way to address this problem; the other is to correct alignment errors in a post processing step called iterative refinement. In one step of iterative refinement MPSA is partitioned horizontally into two subalignments, which are then realigned and the new alignment is kept if it improves scoring function. Iterative methods based on simulated annealing are too slow to be practical [16]. Other methods differ mainly in the way sequences are divided into two groups before being re-aligned [17]. This can be a leave one out [18], random partitioning [19], or tree-dependent restricted partitioning [16]. The last option was shown to be effective in terms of both quality and speed, and a variation of this technique has been adopted by several current methods including MUSCLE [20] and MAFFT [21].
Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows to swap different components of the alignment process and, thus, to investigate their contribution to the alignment quality. We used our program to see how alignment quality changes depending on the input information, guiding trees, the applied consistency transformations, and the strategy for iterative refinement during post-processing. This is the first time these strategies are comprehensively and clearly compared using a single implementation platform. Our results confirm the existing knowledge on which strategies are efficient. We also show that the best strategy is comparable in accuracy to the best software in the field.

Results
To compare alignment strategies we systematically varied information sources, guiding trees, consistency, clique transformation and iterative refinement options, evaluating the resulting alignments on BAliBASE and SABmark.

Comparison of information sources
We found that using both global and local pairwise alignments to construct the alignment graph, resulted in high quality multiple alignments for both BAliBASE ( Figure 1) and SABmark (Figure 2A and 2B) benchmark databases. Adding longest common subsequences to global and local alignments had minor effect on the quality of the multiple alignments. Adding external information in the form of GTG motifs, extracted using motif tracking as described in the original GTG article [22], increased the quality of the multiple alignments by an average of 1% for both BAliBASE and SABmark (full data available in Additional file 2: Tables S2 and S3).

Guiding trees
We found that single linkage clustering is clearly the best option for all reference sets in BAliBASE and SABmark benchmarks (see Tables 1 and 2). To our surprise,  it produced significantly better alignments than commonly used neighbor joining. For BAliBASE, single linkage increased accuracy relative to neighbor joining by 3% and for SABmark benchmark by 5%. We note that this comparison is based on alignments produced with no consistency transformation or postprocessing to account for errors introduced by heuristic nature of progressive alignment. It is probable that when consistency or postprocessing is included the particular type of the guiding tree used becomes less relevant.

Consistency and clique transformations
Our results show that triplet library extension improves alignment accuracy for all sets in both benchmarks when compared to alignments produced with no consistency transformation (referred here as the basic strategy). Triplet T-Coffee , which is limited to applying consistency to the existing edges, improved BAliBASE alignments by 1.7%, SABmark Twilight Zone alignments by 3.3% and SABmark Superfamily alignments by 1% (see Tables 3  and 4). Triplet SeqAn , which introduces new edges when implied by consistency, was considerably better, enhancing BAliBASE accuracy by 3%, SABmark Superfamily accuracy by 3.1% and SABmark Twilight alignment accuracy by a remarkable 11.2%. Reiterating triplet SeqAn transformation for the second time either enhanced or deteriorated alignment accuracy depending on the particular set. The effect was small and has little practical value.
Applying clique transformation decreased alignment accuracy when compared to the basic strategy. Applying clique transformation after the graph has been made consistent by triplet T-Coffee generally provided an additional improvement relative to triplet T-Coffee . However, this is not practical since this transformation multiplied the required computational time by a factor of 8 for SABmark sets and by a factor of 65 for the BAliBASE sets. This is due to the huge amount of edges generated by the clique transformation which makes the progressive alignment step computationally demanding. The task appeared even intractable when applying clique transformation after triplet SeqAn consistency for some of the BAliBASE alignments, for which reason we do not report on these results here. Running clique transformation after triplet SeqAn consistency for SABmark alignments decreased alignment accuracy. This suggests that triplet SeqAn is sufficient to introduce edges that have practical utility for the MPSA. Combining Max-Flow consistency with clique transformation had minor improvement on BAliBASE alignments and deteriorated quality of SABmark alignments.

Iterative refinement
All iterative refinement strategies improved alignment accuracy, but the tree dependent strategies were more efficient. It made little difference, whether the partitioning was done randomly (tree Random ) or systematically in breath-first order (tree BF ). Random partitioning improved accuracy of BAliBASE alignments by 1%, tree Random by 2.6% and tree BF by 2.8%, relative to alignments produced with no refinement (see Table 5). Improvement of accuracy for SABmark Twilight alignments were 5.9%, 9% and 8.5%, respectively; and for SABmark Superfamily alignments 1.6%, 3.2% and 2.9%, respectively (Table 6). Interestingly, the tree dependent strategies are comparable in time and accuracy to the  SP  TC  SP  TC  SP  TC  SP  TC  SP  TC  SP  TC  SP  TC  Time(  triplet SeqAn , which was the best consistency transformation as presented above. The mean SP score for BAli-BASE alignments produced with triplet SeqAn was 0.856, the corresponding score for tree BF was very similar: 0.854. Computation times for these two strategies were also comparable: triplet SeqAn took on average 45 seconds and tree BF on average 54 seconds per BAliBASE alignment. Moreover, applying tree BF iterative refinement on BAliBASE alignments produced under triplet SeqAn consistency had no effect on accuracy when measured up to three decimal points of the SP score ( Table 5). The same tendency can be seen from SABmark alignments: differences between triplet SeqAn and tree BF accuracy are rather cosmetic and tree BF is on average slightly faster (1.2 sec per alignments against two seconds per alignment for Twilight alignments and one sec against 1.6 sec for Superfamily alignments). For this benchmark, however, applying tree BF refinement to alignments produced under triplet SeqAn did result in some improvement ( Table 6).
The relative speed of iterative refinement strategies depended largely on the number of sequences in the alignment. For BAliBASE alignments, which on average contain 30 sequences, run times of random and restricted partitioning were comparable. For SABmark alignments, which on average contain 8 sequences, restricted partitioning was slightly faster. Partitioning alignment along all possible edges in the guiding tree was generally faster than random tree dependent partitioning for alignments with a small to moderate number of sequences (BAliBASE sets RV11, RV12, RV40 and RV50 and both SABmark sets), but became slower as the number of aligned sequences approached the number of refinement iterations (BAli-BASE sets BB20 and BB30 that have on average 46 and 63 sequences per alignment, respectively).

Comparison to other programs
We compared our optimal strategies to the eight leading multiple alignment programs: (1) ProbCons 1.12 [14], (2) T-Coffee 8.98 [10], (3) MUSCLE 3.8 [20], (4) MAFFT   Columns show the average developer (FD) score, the percentage value of the FD score relative to the basic strategy (%basic) and the average run time achieved by different alignment strategies for the "Superfamily" and "Twilight Zone" sets in the SABmark database. The number of sequences in each reference is given in parentheses. The best results in each column are shown in bold 6.847 [21], (5) ClustalW 2.0.12 [6], (6) [25].
MAFFT was run with L-INS-i option; other programs were run with default parameters. All programs were run with a single core. The comparison was done on BAli-BASE (Table 7), SABmark (Table 8) and PREFAB ( Table  9). The optimal strategy for BAliBASE alignments, referred as MMSA in Table 7, was to use global and local pairwise alignments complemented with GTG motifs as input information; to apply consistency transformation triplet Se-qAn ; and to align the sequences using a single linkage guiding tree. No postprocessing was performed. For SABmark we chose to test two strategies: the same strategy as for BAliBASE, and another strategy, that was augmented with iterative refinement of type tree BF (referred as MMSA:: tree BF in Table 8 (Table 9).
We conclude that, although our modular implementation of the best alignment strategies does not outperform the best aligners in the field, the performance is at a comparable level.

Discussion
We have completed a comprehensive comparison of graph based strategies for aligning multiple sequences. Our results suggest clear guidelines for a number of choices made during construction of a multiple protein sequence alignment. First, we showed that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. When reliable external information, the GTG motifs in our case, is available, it will most likely improve the accuracy and thus should also be included. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet SeqAn is the most efficient consistency transformation from those compared. It can have a large improvement on alignment quality, particularly for alignments in the

5.75
Columns show the average developer (FD) score, the percentage value of the FD score relative to the basic strategy (%basic) and the average run time achieved by different alignment strategies for the "Superfamily" and "Twilight Zone" sets in the SABmark database. The number of alignments in each reference is given in parentheses. The best results in each column are shown in bold twilight zone. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with triplet SeqAn consistency transformation in both time and accuracy. As a more subtle result, we found that transitivity, which in principle can increase the sensitivity of the aligner, will in most cases introduce more noise than signal. When we iterated triplet SeqAn two times, in effect transferring edge weights within neighborhoods of four vertices, the quality of the alignments increased slightly. However, when iterations were repeated three times, extending the influenced neighborhood to five vertices, the quality invariably deteriorated. The same was the case for clique transformation, where any two vertices, connected by a path, were connected by an edge and thus were allowed to directly influence each other during the progressive alignment.

Conclusions
We showed that graph based modular implementation allows to compare the contribution of different algorithmic components to the alignment quality and computation  Columns show the average developer (FD) score, modeller (FM) score and run time achieved by different aligners for the "Twilight Zone" and "Superfamily" sets in the SABmark database. The number of alignments in each reference is given in parentheses. The best results in each column are shown in bold time. We demonstrated that shifting edge information within triplets of alignment graph vertices prior to the progressive step and the tree-dependent iterative refinement after the progressive step are equally effective strategies that significantly improve accuracy, particularly for the case of distant homologs. As a negative result we report that extending the neighborhood of edge shifting to five or more vertices introduces more noise than signal.

Algorithm overview
Our program starts by gathering input information from pairwise alignments into an alignment graph. The graph is then refined, transformed and finally fed to a progressive alignment routine that builds an alignment. The resulting alignment is iteratively refined in a post processing step. Altogether there are six steps, detailed below. Each step was implemented as a combination of SeqAn library methods and our own code. The contribution of SeqAn library is outlined in Additional file 2: Table S1.

Computing information sources
Global, local and longest common subsequence alignments were computed using SeqAn library with Blosum62 substitution matrix, gap opening penalty set to -13 and gap extension penalty set to -1. Global alignments were constructed using Gotoh's affine gap cost and local alignments using Smith-Waterman algorithms. From local alignments we selected the best four, possibly overlapping, matches; as our preliminary tests indicate that including a larger number of local matches has little effect on accuracy (data not published). We acknowledge that the selection of substitution matrix and gap penalties will have a large effect on the pairwise alignments and hence on the overall MPSA. The values applied here are based on the literature and other implementations. Additionally, we compiled a collection of conserved motifs based on the Global Trace Graph (GTG) [22]. These were extracted using motif tracking as described in the original GTG article. Alignments and motifs were converted into edges between aligned sequence segments referred hereafter as segment matches.

Initializing the alignment graph
The vertices of the graph represent segments of the sequence in the set of proteins to be aligned. The edges of the graph represent possible alignments between these segments. After the edges are generated in the first step, they are assigned weights using Blosum62 substitution matrix. Blosum62 was offset to include only positive values. After this, the structure of the alignment graph is simplified by refinement. During the refinement segment matches are cut into smaller parts in such a way that none of the matches partly overlap (for details see [8]).

Score transformation
Score transformations can both modify the edge weights of existing edges, and create new edges. An overview of score transformations is presented in Additional file 3: Figure S1. 3A. Consistency transformation Alignment graph is made consistent using either triplet library extension introduced in T-Coffee, triplet T-Coffee , a variation of this introduced in SeqAn, triplet SeqAn , or the consistency measure used in MaxFlow [13]. Triplet SeqAn can be repeated several times, in effect extending consistency to sets of four, five or larger number of vertices. 3B. Clique transformation Connected components are identified using Kruskal's algorithm and converted to cliques. Kruskal's is used to cluster alignment graph vertices by iterating all edges in the order of decreasing edge weight. This produces a collection of spanning trees, which are used to connect each pair of vertices in a tree with a weight equal to the weakest link of the path connecting the two vertices.

Tree construction
When global alignments are included in the list of information sources (see above), they are used to calculate the distance matrix. Otherwise, the distance matrix is calculated using k-mer counting. A guiding tree is constructed from the distance matrix using neighbor joining, single linkage, complete linkage, UPGMA or weighted UPGMA. When the guiding tree is constructed using neighbor joining, the root is placed between the last two taxa to be joined.

Progressive alignment
Sequences are progressively aligned in the order defined by the guiding tree. In each step, we apply heaviest common subsequence algorithm to align two subalignments. This is a segment based dynamic programming that finds the maximum weight trace from a set of refined alignment segments.

Iterative refinement
We implemented three different refinement strategies: random partitioning introduced by Berget & Munson [19], and the two tree-dependent restricted partitioning strategies referred as tree Random and tree BF . Tree Random is the same strategy as used by MUSCLE and MAFFT: sequences are partitioned by cutting a random edge of the guiding tree. Tree BF was designed to be a time efficient version of tree Random : each edge of the tree is cut only once in the breath-first order.

Testing methodology
We run our program with options corresponding to each of the tested strategies on BAliBASE 3.0 [26] and SABmark 1.65 [27] alignment benchmark databases.
The best strategy was also tested on PREFAB 4.0 [20] benchmark. Tests were performed on a 2.93 GHz Intel Xeon X7350 with 66 GB RAM. The BAliBASE 3.0 benchmark is a collection of 218 reference protein alignments, compiled from FSSP structural alignments [28], HOMSTRAD [29] databases and hand constructed alignments. The database is organized into five reference sets: in Reference 1 sequences are equidistant and have two levels of conservation; in References 2 a highly divergent "orphan" is aligned with a family of close homologs; in Reference 3 subgroups with < 25% residue identity between groups are aligned; Reference 4 contains sequences with large N/C-terminal extensions; and Reference 5 sequences with large insertions in the middle of the aligned blocks. Alignments were scored relative to the core blocks of benchmark multiple alignments using quality measures proposed by BAliBASE: the sum of pairs score (SP) and the true column score (TC). SP is equal to the number of residue pairs correctly aligned in the evaluated alignment relative to the total number of residue pairs in the benchmark alignment. TC is the number of columns with correct alignment divided by the total number of columns in the benchmark alignment.
The SABmark 1.65 benchmark contains pairwise structural alignments from SOFI [30] and CE [31] databases, that are organized according to SCOP classification. There are two sets: the "Twilight Zone" set contains 1740 domains grouped into 209 SCOP folds, and the "Superfamilies" set that contains 3280 domains grouped into 425 SCOP superfamilies. Sequences in each of the twilight subsets are restricted to have at most 25% and those in superfamilies subsets to have at most 50% pairwise identity. The quality of SABmark alignments was evaluated using FD and FM scores based on reference structural alignments. The FD score measures the number of correct residue pairs relative to the number of paired residues in the reference alignment and FM score relates the number of correct pairs to the total number of paired residues in the test alignment. PREFAB 4.0 is a large database generated automatically by supplementing structural pairs from FSSP database with homologs found through PSI-BLAST queries. Each alignment set is filtered to have at most 80% identity and limited to a set of 50 random homologs. There are 1682 alignments in the main set and 100 alignments in the weighted set. The weighted set consists of families with one highly over-represented sub-family. The accuracy of the multiple alignment for a given PREFAB alignment set is evaluated relative to the consensus of FSSP and CE alignments of the corresponding structural pair. Alignments were evaluated using PREFAB Q score, which is the analog of BAliBASE SP and SABmark FD scores.

Additional material
Additional file 1: Source code, Makefile, installation instructions and test alignments.
Additional file 2: Table S1. Outline of MMSA implementation. This table presents the main alignment steps and the corresponding components and functions. Some functions are referred in the main text using abbreviations listed in the fourth column. Source code for any function can be found in one of the two packages: SeqAn sequence alignment library or MMSA code deposited at http://ekhidna.biocenter.helsinki.fi/ MMSA. Table S2. Performance of strategies with different information sources on the BAliBASE benchmark. Columns show the average sum-ofpairs (SP) and true-column (TC) scores achieved by different alignment strategies for each of the six BAliBASE references. The number of sequences in each reference is given in parentheses. Mean values for the entire database are reported in addition to the mean execution time of each strategy. The best results in each column are shown in bold. The strategies tested differ in the combination of pairwise sequence information that is used to construct the alignment graph. Combinations include: C, longest common subsequence, L, the four top scoring local alignments, G, global alignment, M, GTG motifs. Table S3. Performance of strategies with different information sources on the SABmark benchmark Columns show the average developer (FD) score (equivalent to sum-of-pairs [SP] score) and modeller (FM) score achieved by different alignment strategies for the "Superfamily" and "Twilight Zone" sets in the SABmark database. The number of sequences in each reference is given in parentheses. Mean execution time of each strategy is reported in seconds. The best results in each column are shown in bold. The strategies tested differ in the combination of pairwise sequence information that is used to construct the alignment graph. Combinations include: C, longest common subsequence, L, the four top scoring local alignments, G, global alignment, M, GTG motifs.
Additional file 3: Figure S1. Overview of the score transformations. A) Alignment graph of four sequences after refinement has resolved conflicts between segment matches. B) Applying triplet T-Coffee reinforces connections within triplet cliques. C) Applying triplet SeqAn reinforces connections within triplet cliques and introduces new edges between vertices that have a common neighbor. Note that edges are never introduced between vertices that belong to the same sequence. D) Applying MaxFlow consistency followed by a clique transformation: edges within spanning trees are weighted by the relative number of common neighbors, and new edges are introduced to convert each spanning tree into a clique (or almost a clique, since edges are never introduced between vertices that belong to the same sequence).