MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments
© Collingridge and Kelly; licensee BioMed Central Ltd. 2012
Received: 2 December 2011
Accepted: 30 May 2012
Published: 30 May 2012
The generation of multiple sequence alignments (MSAs) is a crucial step for many bioinformatic analyses. Thus improving MSA accuracy and identifying potential errors in MSAs is important for a wide range of post-genomic research. We present a novel method called MergeAlign which constructs consensus MSAs from multiple independent MSAs and assigns an alignment precision score to each column.
Using conventional benchmark tests we demonstrate that on average MergeAlign MSAs are more accurate than MSAs generated using any single matrix of sequence substitution. We show that MergeAlign column scores are related to alignment precision and hence provide an ab initio method of estimating alignment precision in the absence of curated reference MSAs. Using two novel and independent alignment performance tests that utilise a large set of orthologous gene families we demonstrate that increasing MSA performance leads to an increase in the performance of downstream phylogenetic analyses.
Using multiple tests of alignment performance we demonstrate that this novel method has broad general application in biological research.
The construction of multiple sequence alignments (MSAs) from individual sequences is fundamental to nearly all aspects of post-genomic biological research. In addition to the role these alignments play in progressing our understanding of the evolution and diversity of life, they also provide a platform from which algorithms that predict protein structure and function can be based. Given their pivotal role, the development of improved MSA algorithms and matrices of sequence evolution has been an active area of research for more than 20 years. The result of this research has been the production of many different MSA methods whose performances on diverse data types can vary considerably (for example of comparative analyses of method performance see [1, 2]).
The rationale for the continued improvement of alignment methods is that reduction in error in multiple sequence alignment will thus lead to reduction in error in all subsequent downstream bioinformatic analysis. For example, in phylogenetic analysis it has been demonstrated that alignment error leads to tree inference error regardless of the inference method used downstream of alignment construction . In addition to reducing error by improving alignment accuracy, independent methods have been developed to identify and discard alignment data which may contain errors. As information gaps are a major source of alignment error a common approach adopted in phylogenetic analysis is to discard information when it exceeds a threshold value of gap-characters. Popular methods such as GBLOCKS  have been developed to automate this error reduction process and thereby reduce the amount of possibly erroneous data in MSAs. These strategies are hard-lined with regard to gap characters and hence even correctly aligned insertion and deletion events can be discarded. Due to the high occurrence of insertions and deletions in real biological sequence data, (for example the Pfam  seed alignments of conserved domains contain approximately 38.6% gap-characters) the danger of gap parsing is that one can reduce the usable phylogenetic information to a level which can be insufficient to facilitate a robust resolution of phylogenetic relationships. Moreover, some phylogenetic inference methods ignore or misinterpret gaps as additional character states which can lead to errors in phylogenetic inference. As an alternative strategy to discarding gap-rich data methods such as PRANK [6, 7] have been developed which incorporate the phylogenetic implications of gaps and treat insertions and deletions as separate evolutionary events.
It has been demonstrated that for some MSAs, different methods correctly align different regions whilst no current method correctly aligns the entire sequence . This work showed that it is possible to improve the accuracy of MSAs by creating consensus MSAs based on multiple independent MSAs of the same sequence [8, 9]. M-COFFEE is such a meta-method which uses MSAs generated by individual MSA methods to generate consensus alignments that are more accurate, on average, than any of the individual alignment methods used .
Here, we provide a novel algorithm called MergeAlign that uses a dynamic programming approach to efficiently construct consensus MSAs from any number of independent alignments of the same sets of sequences. To generate input MSAs for this method we use a single alignment methodology and multiple different matrices of amino acid substitution. We use standard alignment benchmark tests to show that the consensus alignments produced in this way are more accurate than those produced using any individual matrix of amino acid substitution. We further show that our method is suitable for combining alignments generated using different methodologies. Creating consensus alignments from large numbers of constituent alignments allows us to assign a score to each column in the final MSA. We show that the MergeAlign column score is related to alignment error rate and hence provide a novel method for data selection based on expected alignment error rate. In addition to improving performance on MSA benchmarks we demonstrate that MergeAlign alignments improve performance of downstream phylogenetic analysis. The MergeAlign algorithm is provided as Additional file 1 and is available at http://www.mergealign.com.
Results and discussion
A novel method for constructing the consensus of multiple sequence alignments
Multiple substitution matrices achieve near equivalent performance on alignment benchmarks
To generate a set of input alignments for use for consensus MSA construction we chose to adopt a novel approach which involves creating multiple MSA using different matrices of amino acid substitution. We first assayed the MSA performance of a set of 142 previously characterised amino acid substitution matrices to identify how each substitution matrix performed on benchmark MSAs. The MAFFT  MSA method was used with the FFT-NS-2 strategy to align each of a randomly selected set of 1000 benchmark sequences ( Additional file 2), using each amino acid substitution matrix. The F-score for each alignment was determined by comparison of the inferred alignment to the benchmark MSA, which is assumed for this purpose to represent the true alignment (see methods). The amino acid substitution matrices were then ranked according to the mean F-score obtained over all MSAs (Figure 2, Additional file 3).
MergeAlign alignments are more accurate that alignments based on any individual amino acid substitution matrix
In order to test how MergeAlign consensus MSAs perform compared to MSAs generated with a single substitution matrix, we selected a second set of 1000 randomly selected benchmark MSAs ( Additional files 4 & 5). We aligned each member of this set using both MAFFT FFT-NS-2 and the more accurate MAFFT L-INS-i method . Individual MSAs were generated for both methods using each of the 91 top-scoring substitution matrices from the previous test. The mean F-score observed using each substitution matrix is highly correlated between FFT-NS-2 and L-INS-i alignment methods and the rank order of amino acid substitution matrices is near identical (Spearman r = 0.959, p < 1 × 10−10Additional file 5). Having selected 91 MSAs for each benchmark MSA, we used MergeAlign to generate a consensus of these individual MSAs. Consensus MergeAlign MSAs were generated from all 91 MSAs or subsets of the 91 MSAs. Subsets consisted of those MSAs generated by the top 10, or multiples of 10 up to 80, substitution matrices according to rank order of the mean F-score shown in Figure 2. We scored each MSA, including the MergeAlign consensus MSAs, using the mean F-score of the 1000 independent MSA inferences.
To provide a direct comparison to an existing method which produces a consensus from MSAs in a different way, we aligned the 1000 benchmark MSAs above with M-Coffee . Here, M-Coffee creates a consensus alignment using 8 independent MSA methods . To compare the methods directly the same 8 constituent MSAs were also combined using MergeAlign. Starting from the same raw input alignments the mean time taken to construct a consensus alignment was 8.60 s (standard deviation 16.362 s) and 0.10 s (standard deviation 0.164 s) for M-Coffee and MergeAlign respectively. The mean score for M-Coffee was 0.7217 (Figure 3) whereas the mean score for MergeAlign was 0.7237 (Figure 3). Thus on small numbers of input alignments MergeAlign is significantly faster than M-Coffee (p < 1 × 10−100) however the accuracy of the two methods is not significantly different (p = 0.253).
MergeAlign column score is related to alignment accuracy
Increasing alignment F-score on benchmark tests improves phylogenetic tree performance
Merge Align improves performance on structural measure of multiple sequence alignment accuracy
Performance of top scoring methods across all alignment performance tests
Rank in Benchmark Test
Rank in Tree Resolution Test
Rank in Topological Congruence Test
Rank in Structural Accuracy Test
We present a novel method for improving multiple sequence alignment (MSA) performance by constructing consensus MSAs from multiple individual MSAs. In all tests performed on benchmark MSAs the precision, recall (Q-score) and iRMSD score of the MergeAlign method is better than that observed for any individual matrix of amino acid substitution. This is the first demonstration that combining of MSAs using different models of amino acid substitution can improve MSA accuracy. We also demonstrated in two independent tests that did not use benchmark MSAs that MergeAlign alignments increase the resolution of phylogenetic trees and increase the topological agreement between phylogenetic trees inferred from individual gene families.
This method for improving MSA has broad general applicability in biological research. The method is also extensible: as more matrices of sequence substitution become available they can be incorporated into the MergeAlign procedure. Similarly, MergeAlign will benefit from improvements in underlying MSA technology; as these improve MergeAlign will improve. Independent of its use for inferring consensus MSAs, MergeAlign is also suitable for identifying the consensus columns between MSAs generated under different alignment methods. MergeAlign also provides an estimate of the precision of a MSA on a column-by-column basis. We demonstrated that the MergeAlign column score is related to MSA precision and propose this as an additional and independent method for data selection for downstream bioinformatic applications.
Constructing MergeAlign consensus multiple sequence alignments
In order to construct consensus MSAs we create a weighted directed acyclic graph that represents all the constituent MSAs (Figure 1). In this graph, each node represents a column present in at least one constituent MSA; each edge represents the transition between two columns in at least one constituent MSA; each edge weight represents the number of MSAs containing that transition. We then use a dynamic programming approach to find the path through the graph, from source vertex to sink vertex, that maximises the mean weight of the edges traversed (Figure 1). This path is converted into a consensus MSA and the edge weights used to score each column.
where s(x i ) is the path score of the node x i , l(x i ) is the path length of node x i , and w(e i ) is the weight of edge e i . The edge with the highest score, e s = (x s , n), is then selected. The path score of node n is set to s(x s ) + w(e s ), and the path length of node y is set to (x s ) + 1 (See Figure 1 E; nodes are labelled with path score/path length). A traceback edge t = (n, x s ) is created with the same weight as e s for the traceback process. The path score of any node is thus the maximum sum of edge weights for a path from the source node to that node; the path length is the length of this optimal path. The optimum path from source to sink is found by starting at the sink node and following the traceback edges. This path is used to reconstruct the consensus MSA while the weights of the traceback edges are used to score each column. The column score can be normalised by dividing it by the total number of input MSAs. A python implementation of the above algorithm is presented as Additional file 1 and is available online at http://www.mergealign.com.
Since MergeAlign only considers columns present in constituent MSAs it is very efficient and scalable. Generating a consensus MSA for 834 unique benchmark MSAs (on average consisting of 37 sequences each 256 amino acids long) from 91 constituent MSAs took an average of 1.1 s per consensus MSA on one processor of a standard laptop. A graph built by MergeAlign has a maximum of kN vertices, where k is the number of constituent MSAs, and N is the maximum number of columns in any constituent MSA. The maximum number of edges is entering or leaving a vertex is k. The speed at which MergeAlign can build and traverse the graph is therefore given by O(Nk 2 ), thus linear with respect to MSA length and independent of the number of sequences in an MSA.
This method can be viewed as an extension of the classical dynamic programming approach used to construct pairwise sequence alignments [13, 14]. These algorithms represent all possible alignments of two sequences as a 2D array and use dynamic programming to find the optimum path through the matrix. These methods require scoring every possible column in an MSA and hence run in O(N k ) time, thus rendering them unusable for more than a few sequences . MergeAlign is able to extend this approach to MSA by limiting its search to only columns present in one or more constituent alignment.
Amino acid substitution matrix and benchmark MSAs
The source data used in this work comes from a number of publicly available databases. A set of 137 amino acid substitution matrices was downloaded from AAindex http://www.genome.jp/aaindex/. This set comprised amino acid substitution matrices and statistical protein contact potentials derived from different source datasets using a variety of different methods described in . The matrix VOGG950101 was removed from this set because it resulted in alignments which were identical to those generated when using matrix GONG920101. In addition to these matrices we also included commonly used matrices which were absent from this list including BLOSUM62 (as used by BLAST ), BLOSUM80, BLOSUM90, PAM30, PAM70 and VTML200. This resulted in a final set of 142 substitution matrices ( Additional file 3). Benchmark MSAs were obtained from http://www.drive5.com/bench. These comprised the complete set of benchmark MSAs from Balibase v3 , OXBENCH , PREFAB v4  and SABRE [20, 21]. In total this set contained 2724 benchmark MSAs.
Substitution matrix selection and alignment inference
To select the optimal substitution matrices for use in MergeAlign a random subset of 1000 benchmark MSAs was sampled with replacement from the complete set of 2724. This set contained 804 different benchmark MSAs ( Additional file 2). This training set of sequences were un-aligned and MSAs were inferred for each member using each of the 142 amino acid substitution matrices above. MSAs were generated using MAFFT  with the FFT-NS-2 method, which has been shown to be as accurate as the clustalw method [2, 22]. Different matrices of amino acid substitution were utilised by MAFFT by using the “--aamatrix” option and specifying one of the matrices listed in Additional file 2 described above.
Assessment of MSA performance
The F-score was evaluated as the harmonic mean of the above precision and recall scores. The recall score presented here is mathematically equivalent to the alignment Q-score which has been previously described . Precision scores for each column in the inferred MSAs were also calculated. In this case each column is treated independently and only the aligned letter-pairs present in the inferred column were used to calculate the precision score as described above. The combination of precision and recall (Q-score) to generate the F-score hence represents an incremental improvement to existing scoring methods. A Perl algorithm which computes column-by-column precision scores as well as whole alignment precision, recall and F-scores is presented as Additional file 6. In all cases concerning MSA benchmarks the true alignment is not known and the benchmark alignments contain FP and FN errors. For the purposes of this test the benchmark alignment is assumed to be true. In support of these benchmark based scores an independent measure of accuracy derived from structural data is also provided using the iRMSD method . An additional and independent test that does not involve benchmark alignments is provided below.
Evaluation of MergeAlign MSA performance
To evaluate the performance of MergeAlign and compare it to MSAs inferred from single matrices of amino acid substitution a second, independent test set of 1000 benchmark MSAs were randomly selected with replacement from the original set of 2724. This test set contained 834 discrete benchmark MSAs ( Additional file 4). Each member of this test set was aligned using each substitution matrix using MAFFT and the FFT-NS-2 method. FFT-NS-2 is a fast progressive method which performs two iterations of tree-guided progressive multiple sequence alignment . This method has been demonstrated to be about as accurate as the well established clustal methods [2, 10], though significantly faster. To provide an additional test, MSAs were inferred for each benchmark MSA using the L-INS-i method with each substitution matrix. L-INS-I has been repeatedly demonstrated to be one of the most accurate MSA algorithms currently available [2, 10]. In brief, this method has a different objective function to FFT-NS-2, which evaluates the consistency between pairwise alignments and global multiple sequence alignment . In this method flanking sequences surrounding the alignable regions are ignored and iterative refinement is allowed to proceed for a maximum of 1000 cycles . For each benchmark, MergeAlign consensus MSAs were constructed from alignments generated by either method. The performance of MergeAlign was assayed by calculating the F-score of the consensus MSAs as described above and comparing it to the F-score for each of the individual MSAs generated using a single substitution matrix. To provide a comparison with a similar, consensus, methodology the same set of 1000 benchmark MSAs was also aligned using M-Coffee .
To determine the relationship between MergeAlign column score and alignment accuracy the individual columns from all MSAs inferred using MergeAlign were grouped into categories based on their MergeAlign column score. For each category, the precision of the aligned letter-pairs for the constituent columns was evaluated using the method described above. To determine the relationship between the MergeAlign column score and MSA precision this data was then fit to a power function of the form f(x) = x m using a least squares method.
Benchmark-free MSA performance evaluation
To provide an additional and independent assay for MSA performance we developed a novel test based on increasing the performance of downstream phylogenetic analysis. To do this we selected a large dataset of orthologous gene families which comprises 3537 orthologous gene families differentially distributed across 48 species of Archaea , where each gene family has no more than one sequence representative per taxa. It is assumed that the majority of these gene families will share the same evolutionary history and hence a more accurate MSA should improve topological agreement between individual gene family phylogenetic trees. For each of the orthologous groups containing four or more sequences (n = 1688) we constructed a MSA using each of the 91 selected amino acid substitution matrices. MSAs were also constructed using MergeAlign and a subset of substitution matrices which were selected according to their ranked performance of the benchmark tests. The top ranked 10, 20, 30, 40, 50, 60, 70, 80 or all 91 substitution matrices were used to construct MergeAlign alignments. Each alignment was then subject to phylogenetic inference using the FastTree program  with CAT rates. The proportion of partitions in each tree which received greater than 0.5 support by SH-like test  were recorded. All possible pairwise comparisons between trees (without replication) were performed for each MSA method (n = 7.06x1010). For each pairwise comparison between two trees, both trees were pruned to contain the identical set of taxa using the dendropy python module . Only those trees which had four or more taxa in common and only partitions which received SH support of >0.5 were used for the analysis. The two trimmed trees were then compared using the dendropy python module  and the fraction of bi-partitions which were in agreement was evaluated.
SK is supported by the BBSRC through BB/D020190/1. The Authors would like to thank Keith Gull for his support and advice on this manuscript.
- Edgar RC: Quality measures for protein alignment benchmarks. Nucleic Acids Res 2010, 38(7):2145–2153. 10.1093/nar/gkp1196PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Linard B, Lecompte O, Poch O: A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 2011, 6(3):e18093. 10.1371/journal.pone.0018093PubMed CentralView ArticlePubMedGoogle Scholar
- Ogdenw TH, Rosenberg MS: Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol 2006, 55(2):314–328. 10.1080/10635150500541730View ArticleGoogle Scholar
- Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 2007, 56(4):564–577. 10.1080/10635150701472164View ArticlePubMedGoogle Scholar
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al.: The Pfam protein families database. Nucleic Acids Res 2010, 38(Database issue):D211-D222.PubMed CentralView ArticlePubMedGoogle Scholar
- Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 2005, 102(30):10557–10562. 10.1073/pnas.0409137102PubMed CentralView ArticlePubMedGoogle Scholar
- Loytynoja A, Goldman N: Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 2008, 320(5883):1632–1635. 10.1126/science.1158395View ArticlePubMedGoogle Scholar
- Bucka-Lassen K, Caprani O, Hein J: Combining many multiple alignments in one improved alignment. Bioinformatics 1999, 15(2):122–130. 10.1093/bioinformatics/15.2.122View ArticlePubMedGoogle Scholar
- Wallace IM, O’Sullivan O, Higgins DG, Notredame C: M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res 2006, 34(6):1692–1699. 10.1093/nar/gkl091PubMed CentralView ArticlePubMedGoogle Scholar
- Katoh K, Kuma K, Miyata T, Toh H: Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform 2005, 16(1):22–33.PubMedGoogle Scholar
- Kelly S, Wickstead B, Gull K: Archaeal phylogenomics provides evidence in support of a methanogenic origin of the Archaea and a thaumarchaeal origin for the eukaryotes. Philos Trans R Soc Lond B Biol Sci 2011, 278(1708):1009–1018. 10.1098/rspb.2010.1427View ArticleGoogle Scholar
- Armougom F, Moretti S, Keduas V, Notredame C: The iRMSD: a local measure of sequence alignment accuracy using structural information. Bioinformatics 2006, 22(14):e35-e39. 10.1093/bioinformatics/btl218View ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS, Fitch WM: Comparative biosequence metrics. J Mol Evol 1981, 18(1):38–46. 10.1007/BF01733210View ArticlePubMedGoogle Scholar
- Gotoh O: Alignment of three biological sequences with an efficient traceback procedure. J Theor Biol 1986, 121(3):327–337. 10.1016/S0022-5193(86)80112-6View ArticlePubMedGoogle Scholar
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2008, 36(Database issue):D202-D205.PubMed CentralPubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15(1):87–88. 10.1093/bioinformatics/15.1.87View ArticlePubMedGoogle Scholar
- Raghava GP, Searle SM, Audley PC, Barber JD, Barton GJ: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinforma 2003, 4: 47. 10.1186/1471-2105-4-47View ArticleGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792–1797. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Van Walle I, Lasters I, Wyns L: SABmark–a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21(7):1267–1268. 10.1093/bioinformatics/bth493View ArticlePubMedGoogle Scholar
- Thompson JD, Gibson TJ, Higgins DG: Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics 2002, Chapter 2: Unit 2 3.PubMedGoogle Scholar
- Price MN, Dehal PS, Arkin AP: FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One 2010, 5(3):e9490. 10.1371/journal.pone.0009490PubMed CentralView ArticlePubMedGoogle Scholar
- Sukumaran J, Holder MT: DendroPy: a Python library for phylogenetic computing. Bioinformatics 2010, 26(12):1569–1571. 10.1093/bioinformatics/btq228View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.