Constructing MergeAlign consensus multiple sequence alignments
In order to construct consensus MSAs we create a weighted directed acyclic graph that represents all the constituent MSAs (Figure 1). In this graph, each node represents a column present in at least one constituent MSA; each edge represents the transition between two columns in at least one constituent MSA; each edge weight represents the number of MSAs containing that transition. We then use a dynamic programming approach to find the path through the graph, from source vertex to sink vertex, that maximises the mean weight of the edges traversed (Figure 1). This path is converted into a consensus MSA and the edge weights used to score each column.
In order to identify columns common between MSAs, we first convert each sequence within each alignment into an array of indices (Figure 1
). This is done by replacing the letter representing each amino acid with a number representing that amino acid’s position in the sequence and by replacing each gap character by the number of the preceding amino acid. All sequences are assumed to start with a 0. Each column in an MSA can then be converted into an N
-tuple, where N
is the number of sequences in that MSA. Once a graph has been constructed, each node is given a path score and a path length. The path score and path length of the source node are set at zero and all other nodes are scored recursively as follows. With the exception of the source node, each node n
, has at least one incoming edge, e
), where x
is the source of that edge. Each incoming edge, e
, is given a score equal to:
) is the path score of the node x
) is the path length of node x
, and w(e
) is the weight of edge e
. The edge with the highest score, e
, n), is then selected. The path score of node n is set to s(x
) + w(e
), and the path length of node y is set to (x
) + 1 (See Figure 1 E; nodes are labelled with path score/path length). A traceback edge t = (n, x
) is created with the same weight as e
for the traceback process. The path score of any node is thus the maximum sum of edge weights for a path from the source node to that node; the path length is the length of this optimal path. The optimum path from source to sink is found by starting at the sink node and following the traceback edges. This path is used to reconstruct the consensus MSA while the weights of the traceback edges are used to score each column. The column score can be normalised by dividing it by the total number of input MSAs. A python implementation of the above algorithm is presented as Additional file 1 and is available online at http://www.mergealign.com.
Since MergeAlign only considers columns present in constituent MSAs it is very efficient and scalable. Generating a consensus MSA for 834 unique benchmark MSAs (on average consisting of 37 sequences each 256 amino acids long) from 91 constituent MSAs took an average of 1.1 s per consensus MSA on one processor of a standard laptop. A graph built by MergeAlign has a maximum of kN vertices, where k is the number of constituent MSAs, and N is the maximum number of columns in any constituent MSA. The maximum number of edges is entering or leaving a vertex is k. The speed at which MergeAlign can build and traverse the graph is therefore given by O(Nk
), thus linear with respect to MSA length and independent of the number of sequences in an MSA.
This method can be viewed as an extension of the classical dynamic programming approach used to construct pairwise sequence alignments [13, 14]. These algorithms represent all possible alignments of two sequences as a 2D array and use dynamic programming to find the optimum path through the matrix. These methods require scoring every possible column in an MSA and hence run in O(N
) time, thus rendering them unusable for more than a few sequences . MergeAlign is able to extend this approach to MSA by limiting its search to only columns present in one or more constituent alignment.
Amino acid substitution matrix and benchmark MSAs
The source data used in this work comes from a number of publicly available databases. A set of 137 amino acid substitution matrices was downloaded from AAindex http://www.genome.jp/aaindex/. This set comprised amino acid substitution matrices and statistical protein contact potentials derived from different source datasets using a variety of different methods described in . The matrix VOGG950101 was removed from this set because it resulted in alignments which were identical to those generated when using matrix GONG920101. In addition to these matrices we also included commonly used matrices which were absent from this list including BLOSUM62 (as used by BLAST ), BLOSUM80, BLOSUM90, PAM30, PAM70 and VTML200. This resulted in a final set of 142 substitution matrices ( Additional file 3). Benchmark MSAs were obtained from http://www.drive5.com/bench. These comprised the complete set of benchmark MSAs from Balibase v3 , OXBENCH , PREFAB v4  and SABRE [20, 21]. In total this set contained 2724 benchmark MSAs.
Substitution matrix selection and alignment inference
To select the optimal substitution matrices for use in MergeAlign a random subset of 1000 benchmark MSAs was sampled with replacement from the complete set of 2724. This set contained 804 different benchmark MSAs ( Additional file 2). This training set of sequences were un-aligned and MSAs were inferred for each member using each of the 142 amino acid substitution matrices above. MSAs were generated using MAFFT  with the FFT-NS-2 method, which has been shown to be as accurate as the clustalw method [2, 22]. Different matrices of amino acid substitution were utilised by MAFFT by using the “--aamatrix” option and specifying one of the matrices listed in Additional file 2 described above.
Assessment of MSA performance
Each MSA inferred in this work was compared to its corresponding benchmark reference MSA and the number of true positive (TP) false positive (FP) and false negative (FN) aligned letter pairs were recorded. TP is the number of correctly aligned letter-pairs. FP is the number of aligned letter-pairs present in the inferred MSA that are not found in the benchmark MSA. FN is the number aligned letter-pairs in the benchmark MSA that are not found in the inferred MSA. These scores were then used to evaluate the precision and recall of the inferred MSAs where
The F-score was evaluated as the harmonic mean of the above precision and recall scores. The recall score presented here is mathematically equivalent to the alignment Q-score which has been previously described . Precision scores for each column in the inferred MSAs were also calculated. In this case each column is treated independently and only the aligned letter-pairs present in the inferred column were used to calculate the precision score as described above. The combination of precision and recall (Q-score) to generate the F-score hence represents an incremental improvement to existing scoring methods. A Perl algorithm which computes column-by-column precision scores as well as whole alignment precision, recall and F-scores is presented as Additional file 6. In all cases concerning MSA benchmarks the true alignment is not known and the benchmark alignments contain FP and FN errors. For the purposes of this test the benchmark alignment is assumed to be true. In support of these benchmark based scores an independent measure of accuracy derived from structural data is also provided using the iRMSD method . An additional and independent test that does not involve benchmark alignments is provided below.
Evaluation of MergeAlign MSA performance
To evaluate the performance of MergeAlign and compare it to MSAs inferred from single matrices of amino acid substitution a second, independent test set of 1000 benchmark MSAs were randomly selected with replacement from the original set of 2724. This test set contained 834 discrete benchmark MSAs ( Additional file 4). Each member of this test set was aligned using each substitution matrix using MAFFT and the FFT-NS-2 method. FFT-NS-2 is a fast progressive method which performs two iterations of tree-guided progressive multiple sequence alignment . This method has been demonstrated to be about as accurate as the well established clustal methods [2, 10], though significantly faster. To provide an additional test, MSAs were inferred for each benchmark MSA using the L-INS-i method with each substitution matrix. L-INS-I has been repeatedly demonstrated to be one of the most accurate MSA algorithms currently available [2, 10]. In brief, this method has a different objective function to FFT-NS-2, which evaluates the consistency between pairwise alignments and global multiple sequence alignment . In this method flanking sequences surrounding the alignable regions are ignored and iterative refinement is allowed to proceed for a maximum of 1000 cycles . For each benchmark, MergeAlign consensus MSAs were constructed from alignments generated by either method. The performance of MergeAlign was assayed by calculating the F-score of the consensus MSAs as described above and comparing it to the F-score for each of the individual MSAs generated using a single substitution matrix. To provide a comparison with a similar, consensus, methodology the same set of 1000 benchmark MSAs was also aligned using M-Coffee .
To determine the relationship between MergeAlign column score and alignment accuracy the individual columns from all MSAs inferred using MergeAlign were grouped into categories based on their MergeAlign column score. For each category, the precision of the aligned letter-pairs for the constituent columns was evaluated using the method described above. To determine the relationship between the MergeAlign column score and MSA precision this data was then fit to a power function of the form f(x) = x
using a least squares method.
Benchmark-free MSA performance evaluation
To provide an additional and independent assay for MSA performance we developed a novel test based on increasing the performance of downstream phylogenetic analysis. To do this we selected a large dataset of orthologous gene families which comprises 3537 orthologous gene families differentially distributed across 48 species of Archaea , where each gene family has no more than one sequence representative per taxa. It is assumed that the majority of these gene families will share the same evolutionary history and hence a more accurate MSA should improve topological agreement between individual gene family phylogenetic trees. For each of the orthologous groups containing four or more sequences (n = 1688) we constructed a MSA using each of the 91 selected amino acid substitution matrices. MSAs were also constructed using MergeAlign and a subset of substitution matrices which were selected according to their ranked performance of the benchmark tests. The top ranked 10, 20, 30, 40, 50, 60, 70, 80 or all 91 substitution matrices were used to construct MergeAlign alignments. Each alignment was then subject to phylogenetic inference using the FastTree program  with CAT rates. The proportion of partitions in each tree which received greater than 0.5 support by SH-like test  were recorded. All possible pairwise comparisons between trees (without replication) were performed for each MSA method (n = 7.06x1010). For each pairwise comparison between two trees, both trees were pruned to contain the identical set of taxa using the dendropy python module . Only those trees which had four or more taxa in common and only partitions which received SH support of >0.5 were used for the analysis. The two trimmed trees were then compared using the dendropy python module  and the fraction of bi-partitions which were in agreement was evaluated.