The alignment of multiple DNA, RNA or protein sequences is of major importance for a variety of biological modelling methods, including the estimation of the phylogenetic tree of the sequences and the prediction of their structural, functional and/or evolutionary relationships. In addition, recent advances in rapid, low-cost sequencing methods have resulted in the accumulation of large amounts of molecular data to be processed, making the need for fast and accurate multiple sequence aligners even more imperative.
A widely used approach to the Multiple Sequence Alignment (MSA) problem is a computational formulation with two major components: an objective function that quantifies the degree of similarity of a given alignment, and an optimization procedure that aims to identify the optimal alignment under that objective function. For the former component, the Sum-of-Pairs (SP) scoring model [4, 5] remains among the most popular choices.
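As an illustration of the SP objective, the following minimal sketch sums a pairwise score over every pair of rows in every column of an alignment. The function name `sp_score` and the match, mismatch and gap values are assumptions chosen for the example, not the parameters of any particular aligner; gap-gap pairs are assumed to contribute nothing.

```python
from itertools import combinations

GAP = "-"

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length gapped strings.

    The scoring values are illustrative assumptions, not published parameters.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        # Score every unordered pair of characters in this column.
        for a, b in combinations(column, 2):
            if a == GAP and b == GAP:
                continue  # assumed convention: gap-gap pairs score zero
            if a == GAP or b == GAP:
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total
```

For example, `sp_score(["ACGT", "AC-T", "ACGA"])` scores the four columns as 3, 3, -3 and -1 under these assumed values.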
The maximization of the SP score is usually performed using dynamic programming. For the pairwise alignment case, an optimal (numerically but not necessarily biologically) solution can be found within reasonable time. However, this does not hold for the multiple sequence alignment case, where it has been shown [9–12] that obtaining the optimal alignment under the SP score is NP-hard. To overcome the computational intractability of the MSA problem, a large number of efficient heuristic algorithms have been proposed, the most popular being the progressive alignment approach.
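The pairwise case mentioned above can be sketched with the classic global dynamic programming recurrence (Needleman-Wunsch). This is a minimal illustration assuming a linear gap penalty and simple match/mismatch scores; the function name and the scoring values are hypothetical and not taken from any specific tool.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; returns (score, aligned_s, aligned_t).

    Linear gap penalty and scoring values are assumptions for illustration.
    """
    n, m = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in t
                              score[i][j - 1] + gap)      # gap in s
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))
```

The table has (n+1)(m+1) cells, so time and memory grow with the product of the two sequence lengths. Extending the same table to k sequences would require a k-dimensional grid, which is why exact SP optimization becomes impractical beyond a handful of sequences.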
In progressive alignment, the sequences are initially placed on a bifurcating tree according to their degree of similarity. Then, they are progressively aligned in pairs, following the resulting guide tree in bottom-up order until its root is reached. At each step, two nodes of the tree (i.e. two sequences, a sequence and an alignment, or two alignments) are aligned by a standard pairwise alignment algorithm, and the resulting subalignment is retained for use at a subsequent step. One important aspect of the progressive alignment strategy is the “once a gap, always a gap” rule, first introduced in
. Based on this policy, once a group of sequences is aligned, all gaps in the alignment are replaced by a neutral ‘X’ symbol, ensuring that all subsequent pairwise alignments will be consistent with the pre-existing alignment of the group. This rule by definition implies that once a group of sequences has been aligned, their subalignment remains fixed even in view of new sequences that could potentially improve the overall alignment. Consequently, early errors in the progressive alignment steps accumulate and propagate to later alignment stages, thus compromising the alignment quality.
This problem is often tackled using iterative refinement techniques [18, 19]. In iterative refinement, one sequence (or a group of sequences) is removed from the alignment and realigned against the alignment of the remaining sequences. Via this sequence-profile or profile-profile realignment, a new alignment is obtained, which is then used for the next iteration of the algorithm. The refinement terminates when a fixed number of iterations is reached or when the alignment remains unchanged between consecutive iterations. Although these methods are very effective at correcting early alignment errors, they only partially address the “frozen subalignments” issue, since at each iteration only one sequence (or one group of sequences) is realigned while the alignment of the remaining sequences is kept fixed.
In this article, we present a variation of the aforementioned iterative refinement strategy in which all sequences may be simultaneously and independently realigned against a summarization profile that encapsulates the information of the starting alignment. The process begins by constructing a standard profile from the initial alignment. Then, a series of individual (and possibly concurrent) sequence-profile pairwise comparisons takes place, recording how each sequence is aligned against the profile. The new alignment is then indirectly inferred by merging all the individual subalignments into a unified group. The proposed approach is implemented as part of a newly introduced meta-aligner under the name ReformAlign (Reformed Alignments) and is freely available to the public from http://evol.bio.lmu.de/_statgen/software/reformalign/ under the GNU General Public License (version 3 or later). We call ReformAlign a meta-aligner, in the sense described in
, meaning that an initial alignment is required and it can be used with a variety of alignment programs.
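The three steps described above (profile construction, independent sequence-profile alignment, and merging) can be sketched as follows. This is only an illustrative toy under stated assumptions and not ReformAlign's actual implementation: the frequency-based profile, the expected-score column scoring, the linear gap penalty, the left-justified placement of inserted residues, and all function names are hypothetical choices made for the example.

```python
from collections import Counter

GAP = "-"

def build_profile(alignment):
    """Per-column character frequencies (gaps included) of a gapped alignment."""
    n = len(alignment)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*alignment)]

def align_to_profile(seq, profile, match=1.0, mismatch=-1.0, gap=-2.0):
    """Globally align one ungapped sequence to the profile (assumed scoring).

    Returns (cols, ins): cols[k] is the residue placed in profile column k
    (or a gap), ins[k] holds residues inserted immediately before column k.
    """
    def residue_score(x, col):  # expected score of x against a profile column
        return sum(f * (gap if c == GAP else match if c == x else mismatch)
                   for c, f in col.items())

    def skip_score(col):  # cost of leaving a profile column unmatched
        return sum(f * gap for c, f in col.items() if c != GAP)

    n, m = len(seq), len(profile)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0], back[i][0] = S[i - 1][0] + gap, "I"
    for j in range(1, m + 1):
        S[0][j], back[0][j] = S[0][j - 1] + skip_score(profile[j - 1]), "D"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [(S[i - 1][j - 1] + residue_score(seq[i - 1], profile[j - 1]), "M"),
                       (S[i - 1][j] + gap, "I"),
                       (S[i][j - 1] + skip_score(profile[j - 1]), "D")]
            S[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back, recording where each residue lands relative to the profile.
    cols, ins = [GAP] * m, [""] * (m + 1)
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "M":
            cols[j - 1] = seq[i - 1]; i -= 1; j -= 1
        elif move == "I":
            ins[j] = seq[i - 1] + ins[j]; i -= 1
        else:
            j -= 1  # profile column j-1 stays a gap for this sequence
    return cols, ins

def merge(paths, m):
    """Combine the independent sequence-profile alignments into one alignment."""
    width = [max(len(ins[k]) for _, ins in paths) for k in range(m + 1)]
    rows = []
    for cols, ins in paths:
        parts = []
        for k in range(m):
            parts.append(ins[k].ljust(width[k], GAP))  # pad insertion block
            parts.append(cols[k])
        parts.append(ins[m].ljust(width[m], GAP))
        rows.append("".join(parts))
    # Drop columns that ended up as gaps in every row.
    keep = [k for k, col in enumerate(zip(*rows)) if any(ch != GAP for ch in col)]
    return ["".join(row[k] for k in keep) for row in rows]

def realign_all(alignment):
    """One pass: profile from the starting alignment, then realign every row."""
    profile = build_profile(alignment)
    paths = [align_to_profile(row.replace(GAP, ""), profile) for row in alignment]
    return merge(paths, len(profile))
```

Because each call to `align_to_profile` depends only on the shared profile and one sequence, the calls are independent of one another and could be run concurrently, which is the property the text highlights. In this sketch the merge step uses the profile columns as a common coordinate system, so no sequence's placement is frozen by the placement of another.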
Because ReformAlign can currently align only DNA/RNA sequences, our performance evaluations used real nucleic acid datasets of varying length and average sequence identity. Our experimental results demonstrate that the suggested profile-based modification of the classic progressive alignment and iterative refinement strategies is able to overcome the challenges posed by the propagation of early pairwise alignment errors, and that ReformAlign is an efficient, well-suited approach that may improve on the performance of a wide variety of existing alignment software.