Fast and robust multiple sequence alignment with phylogeny-aware gap placement
© A M Szalkowski; licensee BioMed Central Ltd. 2012
Received: 3 November 2011
Accepted: 13 June 2012
Published: 13 June 2012
Skip to main content
© A M Szalkowski; licensee BioMed Central Ltd. 2012
Received: 3 November 2011
Accepted: 13 June 2012
Published: 13 June 2012
ProGraphMSA is a state-of-the-art multiple sequence alignment tool which produces phylogenetically sensible gap patterns while maintaining robustness by allowing alternative splicings and errors in the branching pattern of the guide tree.
This is achieved by incorporating a graph-based sequence representation combined with the advantages of the phylogeny-aware gap placement algorithm of Prank. Further, we account for variations in the substitution pattern by implementing context-specific profiles as in CS-Blast and by estimating amino acid frequencies from input data.
ProGraphMSA shows good performance and competitive execution times in various benchmarks.
Multiple sequence alignment (MSA) is often the first step for evolutionary analyses of protein families. Its role is to detect homologous characters and to reconstruct the evolutionary history relating a set of sequences.
ProGraphMSA combines the advantages of several state-of-the-art methods  with an efficient implementation to provide fast and accurate multiple sequence alignments. This tool includes methods like progressive partial order alignment  combined with phylogeny-aware gap placement , which causes the gaps in the multiple sequence alignment to principally follow the branching pattern of the guide tree, but still allows for exceptions to account for alternative splicings and errors in the guide tree. This work was motivated by discussions with Dr. Löytynoja, the author of Prank who is also working on a graph-alignment algorithm combined with phylogeny-aware gap placement  with a focus on the placement of sequenced data onto a reference alignment/sequence.
To account for the uncertainty in pair-wise distance estimates a BioNJ  guide tree is used. ProGraphMSA achieves competitive execution times thanks to alignment-free distances  for constructing an approximate initial guide tree.
As evolution is not uniform along a sequence, a site-independent Markov model is often not able to account for specific substitution patterns and evolutionary rates in e.g. secondary structure elements, low complexity regions, or intrinsically disordered proteins. Several specific substitution matrices have been proposed for these structures [7, 8], which could be combined into e.g. a Hidden Markov Model (HMM), but adding states to the alignment HMM would significantly affect the execution time.
Instead, we implement context-specific profiles which directly infer the substitution pattern of a site from the site's context. The method uses a library containing 4000 context profiles covering a large spectrum of possible evolutionary scenarios. To our knowledge, ProGraphMSA is the first software to apply context-specific profiles to the alignment of multiple sequences and thereby significantly increasing alignment accuracy.
ProGraphMSA is based on progressive alignment as this has the advantage of having a linear time complexity with respect to the number of sequences and implies phylogenetically sensible evolutionary events. Unfortunately, this can also be a disadvantage, as errors in the guide tree or unexpected events such as alternative splicings might induce errors in the alignment. A graph-based sequence representation is able to store the whole history of indel events and thus allows for handling these cases by reverting an indel introduced by an earlier step of the progressive alignment.
The knowledge of all past indel events prevents the repeated penalization of insertions and alternative splicings . Further, the graph-based representation is able to attenuate a weak point of progressive alignment. This allows for wrongly inferred indels to be revoked  rendering the algorithm more robust against small errors in the guide tree.
At each step of the progressive alignment two graphs are aligned using a variant of the Needleman-Wunsch algorithm  with affine gap penalties . These algorithms are instances of the Viterbi dynamic programming algorithm  and are originally designed for the alignment of sequences. The alignment score in each cell of a dynamic programming matrix is computed as the maximum of possible transitions from three adjacent cells in diagonal, horizontal or vertical direction. These transitions correspond to either matching two homologous characters of the sequences or a gap in one of the sequences.
The leaves of the guide tree are assigned linear graphs corresponding to the input sequences where every graph node but the start node has exactly one predecessor. In general, the inner nodes of the guide tree can contain arbitrary directed acyclic graphs where graph nodes can have multiple predecessor nodes. Thus for graphs the alignment algorithm has to be extended to consider all combinations of preceding graph nodes for each cell in the dynamic programming matrix. While the alignment of sequences considers three preceding cells, the alignment of graphs has to consider n*m + n + mpreceding cells, if the corresponding graph nodes have n and m preceding nodes, respectively. This is n + m for the diagonal direction, when matching two nodes, and n or m for horizontal or vertical gaps.
Analogous to sequence alignment, the alignment algorithm identifies a homologous path in each graph by backtracking in the dynamic programming matrix. New gaps are created for unmatched nodes along the homologous paths in both graphs but are not immediately distinguished into insertions and deletions. Instead, for each indel two alternative paths are added to the ancestral graph and the decision is postponed. In the original phylogeny-aware gap placement this procedure corresponds to flagging unresolved gap positions in the ancestral sequence .
Unlike e.g. in Ortheus , the distinction between insertions and deletions is not optimized over all ancestral sequences. Instead the decision is made with the help of the closest outgroup i.e. in the alignment at the next guide tree node towards the root of the tree. Whichever of the alternative paths is aligned to the outgroup graph is considered part of the ancestral sequence (Figure 1). Thus, aligned paths are considered deletions and unmatched paths are considered insertions.
In principle we implement the progressive partial order alignment algorithm  augmented with edge weights. These are used to realize a "relaxed" variant of phylogeny-aware gap placement by penalizing paths, which are believed not to be part of the ancestral sequence . Thus, unmatched paths in a graph are penalized with a cost proportional to the evolutionary distance separating the current internal tree node from the last use or the introduction of the path. This corresponds to an exponentially declining probability of the insertion/deletion having been inferred incorrectly. Therefore all indels from previous alignments can be reused, however with an increasing penalty if they have not been matched in a recent alignment.
Here, match_init, gap_init, and gap_ext are computed from the transition probabilities in the pair-HMM, depending on the specific evolutionary distance separating the aligned graphs as defined by the guide tree [3, 16, 17]. E is a matrix with edge penalties and S is a pre-computed matrix of match scores for each pair of graph nodes computed using probabilistic ancestral sequences.
Profiling Prank  showed that most of its execution time is spent during the all-against-all alignment for the estimation of distances for the initial guide tree. Similar to Muscle  we overcome this limitation by using alignment-free distances  and simple estimates of variances for the initial BioNJ  guide tree. These distances and variances are re-estimated by maximum-likelihood from the resulting MSA using the induced pair-wise alignments. This estimation of distances, guide tree, and alignment is iterated until convergence or until a maximum number of iterations is reached. For typical problem sizes this procedure is still much faster than an all-against-all alignment.
Context-specific profiles are a method to generate position-specific substitution matrices from a sequence . The method is based on the assumption that the substitution pattern of a site may depend on the neighbouring sites. Originally, the computation and the alignment of context-specific profiles has been applied to pair-wise sequence alignment and homology search, effecting in increased sensitivity especially for distant homologs. In the following we will briefly describe the original algorithm  to compute a context-sensitive profile from a sequence and our adaption of this algorithm for the alignment of multiple sequences.
This is the probability of the characters in the sequence window x i + j being emitted by profile column p k (j,»). This product is multiplied with the prior P(p k ) of the profile. As the match probability is to be representative for the center column x i of the sequence window X i , this product is weighted by w j according to the declining importance of a site with increasing distance to the center column. As suggested by the authors we use .
i.e. the mutation probabilities are a weighted average of the center columns (p k (0,»)) of all profiles in the profile library. A context-specific profile is obtained by applying equation 6 to each position of a sequence.
ProGraphMSA adopts this method and computes context-specific profiles for the input sequences which are placed at the leaves of the guide tree. In this way the expected context-specific evolution along the terminal branches is encoded in the leaf sequences. However, ProGraphMSA's scoring function relies on probabilistic ancestral sequences. Using Bayes' theorem, context-sensitive profiles can be converted into probabilistic ancestral sequences: . Again, and Π y denote the equilibrium amino acid frequencies.
Alignments at internal tree nodes are computed using these probabilistic ancestral sequences at the leaves with the exception that terminal branch lengths are ignored (=0) with respect to the substitution model as the expected evolution along those branches is already encoded in the terminal probabilistic ancestral sequences.
We evaluated the alignments produced by Mafft , Muscle , ClustalW , Prank , POA , and variants of ProGraphMSA using the BAliBASE  collection of reference alignments and two simulated data sets. Further, the quality of the MSAs is measured by analyzing phylogenies reconstructed from these MSAs . For this we built maximum likelihood and gap phylogeny trees from MSAs of orthologous protein groups with known phylogenetic relations and compared them to reference species trees.
Two versions of ProGraphMSA with different evolutionary models were evaluated:
Versions and additional options
Versions and additional command-line parameters
-do\_global -do\_progressive blosum80.mat
--retree 2 --maxiterate 1000 --globalpair
--retree 1 --maxiterate 0
--mldist --cs_profile K4000.lib
(--estimate_aafreqs --end_indel_prob -1 --no_force_align_m)
--mldist --darwin --cs_profile K4000.lib
(--estimate_aafreqs --end_indel_prob -1 --no_force_align_m)
From the BAliBASE 3.0 benchmark suite we only use the subset of tests that are compatible with the evolutionary models of the tested tools, namely we use the trimmed (BBS*) tests in RV11 (close equidistant sequences), RV12 (more divergent equidistant sequences), RV20 (families with "orphans") and RV30 (divergent subfamilies). All others involve evolutionary events like duplications and rearrangements which are not accounted for in any of the tools in our benchmark and would lead to an arbitrary ranking.
Performance comparison on BAliBASE 3.0
Ranking of MSA tools on BAliBASE
ProGraphMSA D (noCS)
For further means of ranking the performance of ProGraphMSA, 10000 protein MSAs with 10 taxa were simulated using ALF . To represent a realistic evolutionary scenario, we chose gamma distributed sequence lengths with a mean length of 300 amino acids and used the WAG model  with gamma rate variation among sites. The maximum distance between two sequences was 2 expected substitutions/site. Insertions and deletions were each inserted with a rate of 0.005/substitution and having Zipfian distributed lengths  with a mean of 3.5 amino acids and a maximum of 50.
Again, Mafft and Muscle produced more precise alignments than either version of ProGraphMSA, but ProGraphMSA outperforms its forefathers POA and Prank. On this simulated data set ProGraphMSA D performs worse than the other variant and in terms of alignment length ProGraphMSA's results are closest to the reference alignments. Surprisingly, Prank significantly over-estimates alignment length, which is also reflected in its fM score. This might be an artefact of errors in the reconstructed guide trees or of Prank not detecting distant homologies due to using p-distances for its guide-tree construction and alignment, and thus underestimating evolutionary distances.
Overall, both tools exhibit a similar performance in indel reconstruction, with ProGraphMSA+F on average reconstructing alignments with the most accurate length and ProGraphMSA notably reconstructing more indel events than the other tools. The latter can be best explained by ProGraphMSA's feature to revoke erroneous inferences of indel events which appear in the alignment as multiple independent events in the same column leading to a higher error rate.
These combined results indicate that ProGraphMSA is indeed able to compensate errors in the guide tree (Figure 3) while maintaining a comparable precision under ideal conditions, where the true guide tree is provided and gap patterns are congruent with the phylogeny (Figure 4).
The real-data phylogeny reconstruction test  uses the precision of phylogenetic tree reconstruction as proxy for MSA quality. The test set consists of more than 10000 groups, each having six sequences sampled from orthologous groups  according to established reference topologies of Bacteria, Fungi, and Eukaryota. A MSA program is evaluated by computing an alignment for each of these groups. As indirect quality measure of the alignments, the Robinson-Foulds  distance of the reference tree to a PhyML tree reconstructed from the MSAs in question is used.
The authors of the above phylogeny reconstruction test further propose a minimum-duplication test based on larger groups . Here, a MSA tool is considered better, if the reconstructed phylogenies explain the evolution of the leaf sequences with less gene duplications. This test did not yield any significant results and the results have been included in Additional file 1: Figure S1.
All methods exhibit the usual problems of placing characters at the correct side of long insertions (region B). Due to its heuristic, ProGraphMSA correctly aligned the Methionines (M) at the beginning of the sequence, and the graph-based representation allows for a correct alignment of the alternative splicings (regions C+D) including the insertion inside the alternatively spliced region. Prank+F enforces phylogenetic gap patterns and was thus not able to correctly reconstruct the alternative splicing. Without this feature the regions C and D were aligned almost correctly except for a single Aspartic acid (D) from region C which was aligned with region D. ProGraphMSA aligns this region consistently because it maintains a history of all indel events in its graph structure.
In region A, Prank+F and ProGraphMSA reconstruct the two independent insertions correctly whereas Prank merges these two events. Here it is the penalization of unused graph paths that prevents ProGraphMSA from merging these insertions.
Average execution times
In comparison, Prank's performance is dominated by pair-wise alignment and distance estimation for the initial guide tree. We avoid this performance bottleneck by using alignment-free distances  for the initial guide tree and compensate for the slightly lower alignment quality by performing an additional iteration of guide tree estimation and progressive alignment.
ProGraphMSA is a progressive multiple sequence alignment method that combines phylogeny-aware gap placement  with a graph-based sequence representation to produce phylogenetically sensible gap patterns while maintaining the flexibility to handle alternative splicings and errors in the guide tree. Our benchmarks reveal that ProGraphMSA presents an unprecedented combination of accuracy on BAliBASE and simulated data, phylogenetically sensible gap patterns, and quality of trees built from the resulting MSAs.
We have successfully applied context-specific profiles  to the alignment of multiple sequences. Although the profile generation has only linear time complexity with respect to sequence length, due to the size of the context library the execution time is significantly increased. Nevertheless, we recommend using this feature in ProGraphMSA by default, as context-specific profiles significantly improve alignment quality and the execution time remains competitive in comparison to other tools.
In the future we are planning to implement codon and DNA models and to explore methods of iterative refinement for our alignments. The graph representation allows for adding additional information to the sequences which we intend to adopt for the alignment of proteins with tandem-repeats.
Project name: ProGraphMSA
Project home page: http://www.inf.ethz.ch/personal/sadam/ProGraphMSA
Operating system(s): Platform independent
Programming language: C++
Other requirements: Eigen 3.0, TCLAP 1.1 or higher
License: GNU GPLv3
a δ xy =1 if x=y else 0
I would like to thank Ari Löytynoja for discussing his preliminary results on graph-based alignment with phylogeny-aware gap placement. Further, I thank the reviewers, and Daniel Dalquen, Manuel Gil, and Maria Anisimova for giving me feedback on the manuscript.
This work was partially supported by the Swiss National Science Foundation grant to Maria Anisimova (ref. 31003A/127325). The author is also funded by the Eidgenössische Technische Hochschule (ETH) Zürich. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.