Reticular alignment: A progressive corner-cutting method for multiple sequence alignment
 Adrienn Szabó^{1, 2},
 Ádám Novák^{3, 4},
 István Miklós^{1, 5} and
 Jotun Hein^{3}
https://doi.org/10.1186/1471-2105-11-570
© Szabó et al; licensee BioMed Central Ltd. 2010
Received: 20 June 2010
Accepted: 23 November 2010
Published: 23 November 2010
Abstract
Background
In this paper, we introduce a progressive corner-cutting method called Reticular Alignment for multiple sequence alignment. Unlike previous corner-cutting methods, our approach does not define a compact part of the dynamic programming table. Instead, it defines a set of optimal and suboptimal alignments at each step during the progressive alignment. The set of alignments is represented as a network so that the alignments can be stored and used efficiently during the progressive alignment. The program has a threshold parameter on which the size of the network depends. The larger the threshold parameter, and thus the network, the deeper the search in the alignment space for better scored alignments.
Results
We implemented the program in the Java programming language and tested it on the BAliBASE database. Reticular Alignment can outperform ClustalW even if a very simple scoring scheme (BLOSUM62 and affine gap penalty) is implemented and merely the threshold value is increased. However, this setup is not sufficient for outperforming other cutting-edge alignment methods. On the other hand, the reticular alignment search strategy together with sophisticated scoring schemes (for example, differentiating gap penalties for hydrophobic and hydrophilic amino acids) outperforms FSA and, in some accuracy measures, even MAFFT. The program is available from http://phylogenycafe.elte.hu/RetAlign/
Conclusions
Reticular alignment is an efficient search strategy for finding accurate multiple alignments. The highest accuracy is achieved when this search strategy is combined with sophisticated scoring schemes.
Background
The multiple sequence alignment problem is still the Holy Grail of bioinformatics [1]. There are 517100 sequences in the UniProtKB/SwissProt release of the 18th of May 2010 http://expasy.org/sprot/, while on the other hand, there are only 65802 known structures in the last PDB database release of the 8th of June 2010 http://www.pdb.org/pdb/home/home.do. Therefore, the in silico prediction of protein structures is still in demand, and the majority of protein structure prediction methods need accurate alignments. There are two major technical hurdles in the multiple sequence alignment problem. The first is the scoring problem: how to score the alignments such that the best scored alignment is the most accurate one. The second is the algorithmic problem: how to find the best scored alignment.
Significantly more effort has been put into research on the second challenge. Although the number of possible alignments of two sequences grows exponentially with the length of the sequences, finding the best scoring alignment of two sequences is computationally feasible, since such an alignment can be found by iteratively comparing the prefixes of the two sequences [2]. The optimal alignment of longer prefixes can be calculated quickly from shorter prefixes, and hence the algorithm needs only memory and running time that are both proportional to the product of the lengths of the sequences. This dynamic programming algorithm can be extended to many sequences [3]; however, it becomes computationally infeasible, since analysing all possible combinations of prefixes requires O(L^{ N }) memory and running time for N sequences of length L. It has been proven that finding the best scoring multiple alignment under the sum-of-pairs scoring scheme is NP-hard [4], therefore it is very unlikely that any fast algorithm exists for the exact multiple sequence alignment problem.
The memory requirement and running time can be reduced by corner-cutting methods. Corner-cutting algorithms define a narrow strip in the dynamic programming table which contains the optimal alignment. Some methods use an a priori estimated upper limit for the score of the optimal alignment to define such a strip [5–7]. Hein et al. obtained a strip around the parsimony-based optimal alignment for HMM-based calculations [8]. The strip can also be defined on the fly using the so-called diagonal extension method [9]. The corner-cutting method has been extended to multiple sequence alignment, too [10, 11], with which the optimal alignment of 4-10 sequences, each of length 200-300, can be found in reasonable time [12]. However, even the size of the narrowest possible strip (which has a unit hypercube transverse section) grows exponentially with the number of sequences to be aligned; hence, this approach eventually becomes infeasible for a large number of sequences.
In addition to exact methods, approximation methods for the multiple sequence alignment problem are also widespread. The most commonly used approximation to multiple sequence alignment is the progressive alignment approach [13–17], which builds multiple sequence alignments bottom-up along a guide tree, through a series of pairwise alignments of two sequences (leaves of the guide tree), two alignments (inner nodes), or a sequence and an alignment. The guide tree is typically constructed from the pairwise distance matrix of the sequences, which is computed using pairwise sequence alignments. These methods apply the "once a gap, always a gap" rule [14]: gaps inserted into an alignment at an inner node of the guide tree cannot be removed or modified further up in the guide tree. Although one can place more trust in gaps introduced at the lower nodes of the guide tree, there is no guarantee that these gaps are correct, and a gap that has incorrectly been inserted into a subalignment based on local information cannot be corrected later on.
There have been successful attempts in other directions to reduce the computational time required to align sequences. MAFFT employs the Fast Fourier Transformation (FFT) technique to rapidly identify homologous regions by converting the amino acid sequence into a sequence of volume and polarity values [18]. The two basic optimisation heuristics (progressive and iterative alignment) have been substituted by more advanced iterative methods in the most recent version of the software where pairwise alignment information is incorporated into the objective function, thus making MAFFT one of the most accurate alignment tools available.
One artifact shared by all of the previously mentioned methods is that evolutionary events are scored using user-specified values (gap penalties and substitution matrices). The accuracy of the alignments largely depends on the selection of these parameter values. To overcome these difficulties, the statistical alignment approach has been introduced, where evolutionary models describe the type of events that transform the sequences and provide a means of calculating the probability of a sequence of events. The alignment is then produced in an optimisation framework such as maximum likelihood or Markov chain Monte Carlo by finding the set of events explaining the evolution of the sequences with a high probability, and the parameters of the evolutionary model are estimated from the data. This approach is taken by computationally expensive methods such as BAli-Phy [19] and StatAlign [20] that integrate over all possible tree topologies. To make this more practical, FSA uses only pairwise comparisons in a statistical alignment framework and so reduces the running time drastically while sacrificing some of the accuracy [21].
In this paper, we introduce a novel corner-cutting method combined with progressive sequence alignment. Unlike former corner-cutting methods, our method does not define a compact part of the dynamic programming table to be filled in. The rationale behind the idea is the following. It is easy to see that any high-scored alignment is surrounded by a large set of low-scored alignments, and the number of low-scored alignments increases exponentially with the number of sequences. Indeed, a high-scored alignment contains several alignment columns containing homologous amino acids. There are 2^{ k } − 2 ways to split an alignment column containing k characters into two columns with gaps: each character goes to one of the two new columns, and the two assignments that leave one column empty are excluded. Any alignment containing such a pair of alignment columns will be a neighbour of the high-scored alignment in the dynamic programming table. Furthermore, the score of these alignments will be significantly lower than the score of the high-scored alignment, since the scores differ in two gap scores and the scores missing due to not aligning homologous amino acids.
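The count of 2^{ k } − 2 splits can be checked by brute force. The following is a minimal Python sketch, not part of the RetAlign implementation; the function name `split_column` is hypothetical, and the input column is assumed to contain k non-gap characters.

```python
from itertools import product

def split_column(column):
    """Return all ways to split a gap-free `column` into two columns that each contain gaps.

    Each character is assigned either to the left (0) or to the right (1) column;
    the other column receives a gap ('-') in that row.
    """
    splits = []
    for sides in product((0, 1), repeat=len(column)):
        left = "".join(c if s == 0 else "-" for c, s in zip(column, sides))
        right = "".join(c if s == 1 else "-" for c, s in zip(column, sides))
        # Skip the two assignments that would leave one column consisting of gaps only.
        if set(left) == {"-"} or set(right) == {"-"}:
            continue
        splits.append((left, right))
    return splits

# The number of splits grows as 2^k - 2 with the number of characters k.
for k in range(1, 7):
    assert len(split_column("A" * k)) == 2 ** k - 2
```

For a column of three characters this yields six neighbouring column pairs, which illustrates why the set of low-scored neighbours grows exponentially with the number of sequences.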
The method has been implemented in the Java programming language, tested on the BAliBASE database [23], and compared with ClustalW [16], MAFFT [18] and Fast Statistical Alignment (FSA) [21]. Several scoring schemes have been implemented and assessed in the Reticular Alignment algorithm. We show that Reticular Alignment outperforms ClustalW even if a simple scoring scheme is applied. When sophisticated scoring models are applied (like sequence weighting in sum-of-pairs scoring, decreasing gap penalties for runs of hydrophilic amino acids, etc.), Reticular Alignment outperforms FSA and, in some accuracy measures, even MAFFT.
Methods
In this section, we describe the algorithms and theorems which are the theoretical background of the Reticular Alignment algorithm.
The Waterman-Byers algorithm and x-networks
Let A and B be two sequences over an alphabet Σ, of lengths n and m, respectively. Let A_{ i } denote the prefix of length i of sequence A, and let A^{ i } denote the suffix of A starting in the (i + 1)^{ st } position. In this way, A_{ i } ◦ A^{ i } = A, where ◦ denotes concatenation. Let a_{ i } denote the character of A in position i.
Let s : Σ × Σ → R be a similarity function. g_{ o } will denote the gap opening and g_{ e } the gap extension penalty. The score of any alignment, and thus all concepts introduced on the basis of alignment scores, depend on the choice of similarity function, gap opening and gap extension penalty. However, for the sake of simplicity, we omit this dependence from the notation.
The Waterman-Byers algorithm [24] produces all alignments that have a score no less than the score of the optimal alignment minus some constant value. Here we show a variant of the algorithm that our method is based on. The algorithm consists of 3 parts: a forward-align algorithm, a backward-align algorithm, and the alignment search algorithm that finds all alignments above a given score using the scores calculated by the forward and backward algorithms.
The forward-align algorithm calculates the score of the best alignment of prefixes A_{ i } and B_{ j } within the following 3 subsets of alignments:
- alignments ending in two aligned (matched) characters. The score of the best alignment of prefixes A_{ i } and B_{ j } in this set is denoted by M_{ f }(i, j).
- alignments ending in an insertion of character b_{ j }. The score of the best alignment of prefixes A_{ i } and B_{ j } in this set is denoted by I_{ f }(i, j).
- alignments ending in a deletion of character a_{ i }. The score of the best alignment of prefixes A_{ i } and B_{ j } in this set is denoted by D_{ f }(i, j).
The backward-align algorithm calculates the following 3 scores for the suffixes A^{ i } and B^{ j }:
1. M_{ b }(i, j) denotes the score of the best alignment of suffixes A^{ i } and B^{ j } whose score is calculated as if the a_{ i }/b_{ j } alignment column was before it. Namely, if the alignment starts with a gap, it will be scored with the gap opening penalty.
2. I_{ b }(i, j) denotes the score of the best alignment of suffixes A^{ i } and B^{ j } whose score is calculated as if the -_{ i }/b_{ j } alignment column was before it. Namely, if the alignment starts with an insertion, it will be scored with the gap extension penalty.
3. D_{ b }(i, j) denotes the score of the best alignment of suffixes A^{ i } and B^{ j } whose score is calculated as if the a_{ i }/-_{ j } alignment column was before it. Namely, if the alignment starts with a deletion, it will be scored with the gap extension penalty.
Using the forward and the backward scores it is possible to find all alignment columns that appear in an alignment with a score above a given threshold. This is based on the following theorem:
Theorem 1
The score of the best alignment containing alignment column a_{ i }/b_{ j } (or -_{ i }/b_{ j }, a_{ i }/-_{ j }, respectively) is M_{ f }(i, j) + M_{ b }(i, j) (I_{ f }(i, j) + I_{ b }(i, j), D_{ f }(i, j) + D_{ b }(i, j), respectively).
Proof
We give a proof for the first case; the proof for the other two cases goes in the same way. If an alignment contains a_{ i }/b_{ j }, then cutting the alignment after this alignment column will create two alignments. The left one is an alignment of prefixes A_{ i } and B_{ j } in which the last alignment column is a_{ i }/b_{ j }. The right one is an alignment of suffixes A^{ i } and B^{ j } whose score is calculated as if the alignment column a_{ i }/b_{ j } was before it. The best scored alignment containing a_{ i }/b_{ j } is cut into the best scored left and right alignments, by definition, with scores M_{ f }(i, j) and M_{ b }(i, j). The score of the alignment is the sum of these two values.
Theorem 1 provides the means to collect the alignment columns that participate in an alignment having a score above a given threshold. The score of the best alignment containing an alignment column α will be denoted by b(α). We define the x-network of the alignments in the following way.
Definition
For any sequences A and B, x ≥ 0, the x-network of the alignments of A and B is a directed graph G(V, E). The vertex set consists of alignment columns α for which b(α) ≥ opt − x, where opt is the score of the optimal alignment of A and B; plus two auxiliary vertices, representing the beginning and the end of the alignment. These two auxiliary vertices are denoted by Start and End. An edge goes from vertex α_{1} to vertex α_{2} if there is an alignment in which α_{1} is followed by α_{2}. The outgoing edges from the Start vertex go to the alignment columns with which the alignment might start, and the incoming edges of the End vertex come from the alignment columns that might be at the end of an alignment.
The following theorem states that an x-network never contains dead ends.
Theorem 2
For any sequences A, B, x ≥ 0, and any vertex α of the x-network, there is a directed path from Start to α and also from α to End.
Proof
Since α is in the x-network, b(α) ≥ opt − x. Consider an alignment containing α with score b(α). Any column α' of this alignment has a best score greater than or equal to b(α), hence they are all in the vertex set of the x-network. This alignment defines one possible directed path from Start to α and also from α to End.
An x-network can be constructed using an algorithm that first runs the forward and backward algorithms to calculate b(α) for each possible alignment column α, selects those columns for which b(α) ≥ opt − x, and builds the network from them.
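The construction just described can be sketched in Python as follows. This is an illustrative simplification rather than the RetAlign implementation: it assumes a linear gap penalty (a single `gap` score per gap character), so one forward and one backward table suffice instead of the three M, I and D tables needed for affine gaps. The function name `x_network_columns`, the toy match/mismatch scores, and the column keys `("M", i, j)` for a_{ i }/b_{ j }, `("I", i, j)` for -_{ i }/b_{ j } and `("D", i, j)` for a_{ i }/-_{ j } are all assumptions made for the example. Only the vertex set is computed; edges are omitted.

```python
def x_network_columns(a, b, x, match=1, mismatch=-1, gap=-2):
    """Return (opt, columns) where columns maps each alignment column with
    b(alpha) >= opt - x to its best score b(alpha). Linear gap penalty assumed."""
    n, m = len(a), len(b)
    neg_inf = float("-inf")

    def s(p, q):
        return match if p == q else mismatch

    # fwd[i][j]: best score of aligning the prefixes a[:i] and b[:j].
    fwd = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    fwd[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                fwd[i][j] = max(fwd[i][j], fwd[i - 1][j - 1] + s(a[i - 1], b[j - 1]))
            if i > 0:
                fwd[i][j] = max(fwd[i][j], fwd[i - 1][j] + gap)
            if j > 0:
                fwd[i][j] = max(fwd[i][j], fwd[i][j - 1] + gap)

    # bwd[i][j]: best score of aligning the suffixes a[i:] and b[j:].
    bwd = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    bwd[n][m] = 0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                bwd[i][j] = max(bwd[i][j], bwd[i + 1][j + 1] + s(a[i], b[j]))
            if i < n:
                bwd[i][j] = max(bwd[i][j], bwd[i + 1][j] + gap)
            if j < m:
                bwd[i][j] = max(bwd[i][j], bwd[i][j + 1] + gap)

    opt = fwd[n][m]
    best = {}
    for i in range(n + 1):
        for j in range(m + 1):
            # Best alignment through a column = best prefix part + column + best suffix part.
            if i > 0 and j > 0:
                best[("M", i, j)] = fwd[i - 1][j - 1] + s(a[i - 1], b[j - 1]) + bwd[i][j]
            if j > 0:
                best[("I", i, j)] = fwd[i][j - 1] + gap + bwd[i][j]
            if i > 0:
                best[("D", i, j)] = fwd[i - 1][j] + gap + bwd[i][j]
    return opt, {col: score for col, score in best.items() if score >= opt - x}

opt, columns = x_network_columns("ACGT", "ACT", x=0)
```

With x = 0 only columns of optimal alignments are kept; increasing x can only add columns, which mirrors the growth of the network with the threshold.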
Aligning a network of alignments to a network of alignments
We are going to extend the Waterman-Byers algorithm to align a network of alignments to another network of alignments. First we define the network of alignments.
Definition
A network of alignments of sequences A_{1}, A_{2}, . . . A_{ k } , k ≥ 1 is a directed acyclic graph whose vertices are alignment columns of the set of sequences together with a unique source (denoted by Start) and a unique sink (denoted by End). The vertices along any path from the source to the sink form a multiple sequence alignment of the set of sequences.
Obviously, an x-network is a network of alignments. Moreover, any single sequence (meaning k = 1) can be considered a simple, formal network. In that case, the formal alignment columns contain only one character, and the network is a single line containing only one alignment. We can generalise the definition of the x-network of two sequences to the x-network of two networks of alignments. For this, we first have to define the alignment of alignments.
Definition
An alignment of two alignments $\mathcal{A}$ and ℬ of sequences A_{1}, A_{2}, . . . A_{ k } and B_{1}, B_{2}, . . . B_{ l } is a multiple sequence alignment of these k + l sequences such that the non-all-gap columns of the first k rows give back $\mathcal{A}$, and the non-all-gap columns of the last l rows give back ℬ.
When we take an alignment column containing only gap characters in the first k rows or in the last l rows, we indicate what the previous non-all-gap alignment column from $\mathcal{A}$ or ℬ was. For example, a_{ i }/-_{ j } indicates an alignment column in which the first k rows are the i^{ th } alignment column from $\mathcal{A}$, the last l rows contain only gaps, and the first alignment column to the left of this alignment column in which the last l rows contain not only gaps is the j^{ th } alignment column from ℬ. In the next definition, when we talk about alignment columns from the multiple alignment of all the A and B sequences, we always mean alignment columns containing this additional information. Similarly, from now on, we always assume that the alignment is an alignment of two alignments, the first containing k rows, the second containing l rows.
Definition
For any two networks of alignments A and B and x ≥ 0, the x-network of A and B is a directed graph G(V, E). The vertex set consists of two auxiliary vertices, representing the beginning and the end of the alignment, and all alignment columns α for which b(α) ≥ opt − x, where b(α) is the maximal score that can be achieved by aligning an alignment $\mathcal{A}$ ∈ A to an alignment ℬ ∈ B so that the result contains the column α, and opt is the maximal score that can be achieved by aligning any alignment $\mathcal{A}$ ∈ A to any alignment ℬ ∈ B. An edge goes from α_{1} to α_{2} if there is an alignment in which α_{1} and α_{2} are neighbour columns. The outgoing edges from Start go to the vertices that might be the first alignment column in an alignment, and the incoming edges of the End vertex come from the vertices that might be the last alignment column in an alignment.
When we align a network to a network using a dynamic programming algorithm, it is important to visit the alignment columns of the network in an order such that the entries are already calculated by the time we want to use them in the dynamic programming recursion. Therefore we introduce the linear extension of networks that can be used for traversing the network.
Definition
A linear extension of a directed acyclic graph is a total ordering, <, on the vertices such that for any two vertices v and u, if there is a directed path from v to u then v < u.
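A linear extension in this sense is what is commonly called a topological ordering. As a minimal sketch (not the RetAlign code; the function name `linear_extension` and the vertex labels are hypothetical), one can be computed with Kahn's algorithm:

```python
from collections import deque

def linear_extension(vertices, edges):
    """Return the vertices of a directed acyclic graph in an order such that
    every edge (u, v) has u placed before v (Kahn's algorithm)."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Vertices without incoming edges can be placed first (here: Start).
    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle, so no linear extension exists")
    return order

# A toy network with two alternative columns between Start and a shared column.
toy_vertices = ["Start", "c1", "c2", "c3", "End"]
toy_edges = [("Start", "c1"), ("Start", "c2"), ("c1", "c3"), ("c2", "c3"), ("c3", "End")]
toy_order = linear_extension(toy_vertices, toy_edges)
```

Visiting the columns in such an order guarantees that all predecessor entries of the dynamic programming table are filled in before they are needed.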
The forward-align and the backward-align algorithms work with prefix-alignments and suffix-alignments defined in the following way.
Definition
A prefix-alignment is a prefix of an alignment achievable by aligning an alignment $\mathcal{A}$ ∈ A to an alignment ℬ ∈ B. Similarly, a suffix-alignment is a suffix of an alignment achievable by aligning an alignment $\mathcal{A}$ ∈ A to an alignment ℬ ∈ B.
The generalisation of the Waterman-Byers algorithm is the following. The input consists of a threshold value x ≥ 0 and a pair of networks of alignments, A and B, together with a linear extension for each network. The output is the x-network of A and B together with a linear extension of it.
The algorithm uses a forward and a backward dynamic programming algorithm. The forward-align algorithm calculates the score of the best prefix-alignment in which the last non-all-gap column in the first k lines is a_{ i } and in the last l lines is b_{ j }, for each of the following subsets of alignments:
- alignments ending with a_{ i }/b_{ j }. The score will be denoted by M_{ f }(i, j).
- alignments ending with -_{ i }/b_{ j }. The score will be denoted by I_{ f }(i, j).
- alignments ending with a_{ i }/-_{ j }. The score will be denoted by D_{ f }(i, j).
In the dynamic programming recursions that compute these scores, ${\mathcal{N}}^{+}$(i) is the set of indices of vertices sending an edge to the vertex indexed by i, and g(a, b, c, d) is the gap penalty function for alignment column b/d preceded by alignment column a/c. We assume that the gap penalty for a given alignment column can be calculated from the alignment column in question and its preceding alignment column. See details in the subsection Gap penalties below. The maximum of an empty set is defined to be −∞.
The backward algorithm calculates the following scores:
- M_{ b }(i, j) denotes the score of the best suffix-alignment that can follow the alignment column a_{ i }/b_{ j }. Furthermore, the gap score of the first alignment column is calculated as if a_{ i }/b_{ j } was inserted before the first alignment column.
- I_{ b }(i, j) denotes the score of the best suffix-alignment that can follow the alignment column -_{ i }/b_{ j }. Furthermore, the gap score of the first alignment column is calculated as if -_{ i }/b_{ j } was inserted before the first alignment column.
- D_{ b }(i, j) denotes the score of the best suffix-alignment that can follow the alignment column a_{ i }/-_{ j }. Furthermore, the gap score of the first alignment column is calculated as if a_{ i }/-_{ j } was inserted before the first alignment column.
Here End_{ A } and End_{ B } are the sink vertices of networks A and B, respectively, so ${\mathcal{N}}^{+}$(End_{ A }) and ${\mathcal{N}}^{+}$(End_{ B }) are the sets of indices of vertices sending an edge to these sinks.
Similarly to Theorem 2, it is easy to show that there are no dead ends in the network constructed in this way. The following theorem states that visiting the alignment columns in lexicographical order will provide a linear extension for the constructed x-network.
Theorem 3
The lexicographical ordering of alignment columns, together with an arbitrary ordering of a_{ i }/b_{ j }, -_{ i }/b_{ j } and a_{ i }/-_{ j }, is a linear extension for the x-network of networks A and B if the indices i and j are linear extensions for the networks A and B, respectively.
Proof
The preceding alignment columns for a_{ i }/b_{ j } might be a_{ i' }/b_{ j' }, -_{ i' }/b_{ j' } or a_{ i' }/-_{ j' } for some i' ∈ ${\mathcal{N}}^{+}$(i) and j' ∈ ${\mathcal{N}}^{+}$(j). Since indices in A are linear extensions, i' < i for any i' ∈ ${\mathcal{N}}^{+}$(i), and thus, in the lexicographical order, all possible preceding alignment columns are smaller than a_{ i }/b_{ j }.
The preceding alignment columns for -_{ i }/b_{ j } might be a_{ i }/b_{ j' }, -_{ i }/b_{ j' } or a_{ i }/-_{ j' } for some j' ∈ ${\mathcal{N}}^{+}$(j). Since indices in B are linear extensions, j' < j for any j' ∈ ${\mathcal{N}}^{+}$(j), and thus, in the lexicographical order, all possible preceding alignment columns are smaller than -_{ i }/b_{ j }.
The preceding alignment columns for a_{ i }/-_{ j } might be a_{ i' }/b_{ j }, -_{ i' }/b_{ j } or a_{ i' }/-_{ j } for some i' ∈ ${\mathcal{N}}^{+}$(i). Since indices in A are linear extensions, i' < i for any i' ∈ ${\mathcal{N}}^{+}$(i), and thus, in the lexicographical order, all possible preceding alignment columns are smaller than a_{ i }/-_{ j }.
The Reticular Alignment algorithm
1. Build or load a guide tree for the sequences.
2. Transform the sequences at the leaves of the guide tree into simple 'linear' networks.
3. Visit the internal nodes of the guide tree in reverse traversal order. For each internal node v with children u_{1} and u_{2}, labelled with the networks of alignments A_{ 1 } and A_{ 2 }, respectively, calculate the x_{ v }-network of A_{ 1 } and A_{ 2 } using the generalised Waterman-Byers algorithm.
4. Return the best scored alignment from the x-network calculated at the root of the guide tree.
When x is set to 0, only the (locally) optimal multiple alignments are stored in the x-network. In this case, the Reticular Alignment algorithm mimics a standard progressive alignment method. When x is set to ∞, the Reticular Alignment method performs an exhaustive search in the space of multiple alignments, namely, it finds the best scored alignment. As x increases, the size of the network also increases, with a corresponding increase in running time and memory usage. Along with the x value, the Reticular Alignment algorithm can be parameterised in several ways:
- guide tree construction method
- similarity scoring of alignment columns
- gap scoring model and gap penalties
- strategy to select threshold values at the internal nodes
Here we briefly describe the choices we had and the decisions we made considering these aspects of the algorithm.
Building the guide tree
Standard methods for constructing a guide tree using pairwise comparisons of the input sequences include UPGMA and Neighbour Joining (NJ) [25, 26]. We implemented both and allow the user to choose between the two algorithms or to provide their own guide tree.
In a future version of RetAlign we plan to implement a basic optimisation such as the Hirschberg algorithm [27] to reduce the memory usage of this initial alignment phase from Θ(L^{2}) to Θ(L), where L is the length of the longest sequence.
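To illustrate how a guide tree topology can be derived from pairwise distances, the following is a minimal UPGMA sketch in Python. It is not the RetAlign implementation: the function name `upgma`, the nested-tuple tree representation and the toy distances are assumptions made for the example, and branch lengths are omitted for brevity.

```python
def upgma(names, dist):
    """Build a UPGMA tree topology as nested 2-tuples of sequence names.

    `dist` maps pairs (name_a, name_b) to the distance between two sequences.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Merge the two closest clusters.
        a, b = min(d, key=d.get)
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Distance to the new cluster: size-weighted average of the old distances.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        # Drop all distances that involve the two clusters just merged.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

toy_dist = {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}
toy_tree = upgma(["A", "B", "C", "D"], toy_dist)
```

In the toy example A and B are joined first, then C and D, and finally the two pairs; the order of the children within each tuple is arbitrary.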
Gap penalties
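Insertion-deletion score matrix used by RetAlign. Rows and columns are indexed by the pattern formed by two successive characters in each of the two sequences of a pair (- denotes a gap, * a non-gap character).

        | - -   | - *   | * -   | * *
  - -   | 0     | OP/2  | OP/2  | EX
  - *   | OP/2  | 0     | OP    | OP/2
  * -   | OP/2  | OP    | 0     | OP/2
  * *   | EX    | OP/2  | OP/2  | 0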

The scores in this matrix were chosen to satisfy the following requirements:
- consistency: the score does not depend on the order of the sequences within the selected pairs (the indel matrix is symmetric)
- symmetry: the reverse of a multiple alignment has the same score (the score of a pattern and its horizontally flipped variant is the same)
- best approximation of the sum-of-pairs scores
The simplest case to consider is when there is an insertion in one of the sequences: $\begin{array}{ccccc}*& -& -& -& *\\ *& *& *& *& *\end{array}$
The sum-of-pairs indel score of this alignment is OP + 2EX where OP and EX are the gap opening and extension penalty. This, and any similar case where the length of the insertion is different, can be mimicked precisely if (and only if) the score of the pattern $\begin{array}{cc}-& -\\ *& *\end{array}$ is EX while the scores of $\begin{array}{cc}*& -\\ *& *\end{array}$ and $\begin{array}{cc}-& *\\ *& *\end{array}$ are both OP/2 due to the symmetric property. The score of $\begin{array}{cc}*& -\\ -& *\end{array}$ is OP for similar reasons, as it starts a new sequence of gaps in both directions. With this choice, cases such as $\begin{array}{cccccc}*& -& -& *& *& *\\ *& *& *& -& -& *\end{array}$ are handled properly. To avoid significant overestimation of the score of $\begin{array}{ccccc}*& -& -& -& *\\ *& -& -& -& *\end{array}$, which is 0 in the sum-of-pairs scheme, a score of 0 must be assigned to both $\begin{array}{cc}-& -\\ -& -\end{array}$ and $\begin{array}{cc}-& *\\ -& *\end{array}$. Then the only pattern left to assign a score to is $\begin{array}{cc}*& -\\ -& -\end{array}$ (and its 3 mirror images). The problem with this one is that the score should depend on whether the gap in $\begin{array}{c}*\\ -\end{array}$ is extended in the next non-gap-only column to the right. The three possibilities are (1) $\begin{array}{cccc}*& -& -& *\\ -& -& -& -\end{array}$, (2) $\begin{array}{cccc}*& -& -& -\\ -& -& -& *\end{array}$ and (3) $\begin{array}{cccc}*& -& -& *\\ -& -& -& *\end{array}$, so it is easy to see that a score of EX/2 suffices for (1) and OP/2 does for (2,3) (note that the pattern in question repeats twice in (1,2), so the total scores of EX, OP and OP/2 are obtained that match the scores of the corresponding patterns formed by removing the gap-only columns). We resolved the ambiguity by choosing OP/2 as the score because we expected this to provide the best approximation of the sum-of-pairs scores, with a systematic overscoring in cases such as (1).
Other alternatives include EX or EX/ 2, both of which have been later shown to yield slightly lower overall accuracy as measured on the BAliBASE reference database.
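The pattern-based indel scoring can be illustrated with a short Python sketch. It is a hedged illustration rather than the RetAlign code: the function names `indel_matrix` and `pairwise_indel_score` are hypothetical, the penalty values used in the example are arbitrary, and the start and end of the alignment receive no special treatment.

```python
def indel_matrix(op, ex):
    """Indel scores for a pair of sequences, indexed by the two-character
    gap patterns ('-' gap, '*' residue) of the first and second sequence."""
    return {
        ("--", "--"): 0,      ("--", "-*"): op / 2, ("--", "*-"): op / 2, ("--", "**"): ex,
        ("-*", "--"): op / 2, ("-*", "-*"): 0,      ("-*", "*-"): op,     ("-*", "**"): op / 2,
        ("*-", "--"): op / 2, ("*-", "-*"): op,     ("*-", "*-"): 0,      ("*-", "**"): op / 2,
        ("**", "--"): ex,     ("**", "-*"): op / 2, ("**", "*-"): op / 2, ("**", "**"): 0,
    }

def pairwise_indel_score(row1, row2, op, ex):
    """Sum the indel scores over all pairs of adjacent columns of two aligned rows."""
    matrix = indel_matrix(op, ex)

    def mask(row):
        # Reduce a row to its gap pattern: residues become '*'.
        return "".join("-" if c == "-" else "*" for c in row)

    r1, r2 = mask(row1), mask(row2)
    return sum(matrix[(r1[c:c + 2], r2[c:c + 2])] for c in range(len(r1) - 1))

OP, EX = -10, -1  # arbitrary example penalties
single_gap = pairwise_indel_score("A---C", "ADEFC", OP, EX)
assert single_gap == OP + 2 * EX
```

A single gap of length three is scored exactly as OP + 2EX, while case (1) from the text, `*--*` over `----`, receives OP instead of EX, which is the systematic overscoring mentioned above.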
In addition to RetAlign's default pairwise indel score model presented above, we also implemented the simplified, non-pairwise indel scoring method used in ClustalW. In this scheme, when two sets of sequences are aligned, each insertion or deletion of a full alignment column receives a single gap penalty, even if the alignment column contains several gaps. This score can be computed considerably faster (although we implemented tricks allowing the calculation of the pairwise indel scores in linear time in the number of sequences) but creates anomalies when suboptimal alignments are inserted or deleted: the gaps 'hidden' in the columns of the suboptimal alignments are not penalised and these columns become overly represented in the final alignment (see Results). Unlike the pairwise indel scoring, this score cannot be used as an accuracy measure of multiple alignments because it depends on which sets of sequences are being aligned in the last step.
Scoring similarities
We score substitutions in accordance with the sum-of-pairs scoring scheme. A similarity score is computed for each alignment column as the sum of similarity values for each pair of non-gap characters in the column (in all experiments, the BLOSUM62 matrix was used for scoring pairwise character similarities [29]). The similarity score is also computed for columns where an insertion or deletion occurs and creates a stack of gaps in the ongoing alignment step. The total similarity score of an alignment is simply the sum of similarity scores for all columns.
This pairwise scoring method is slightly different from ClustalW's approach, where the substitution score depends on which two sets of sequences are being aligned: only 'cross-scores' are taken into account (scores for pairs of non-gap characters where the first element of the pair is from a sequence in the first set and the second from the second set). The similarity score of columns with insertion or deletion in the ongoing alignment is thus zero. We also implemented this modified similarity scoring method and combined it with the non-pairwise indel scoring shown in the previous section to imitate ClustalW's scoring model.
Internal score and sequence weighting
We introduced the indel and similarity scoring models of RetAlign in the last two sections. The total (internal) score of a multiple alignment as produced by RetAlign is the sum of the indel and similarity scores, both of which are calculated pairwise. Though very similar, this score slightly deviates from the sum-of-pairs score as a result of the approximation of the indel score using adjoining alignment columns only; this is explained in detail above. In practice we did not encounter any situation where the difference was significant, and the optimisation targeted to maximise the internal score efficiently boosted the sum-of-pairs score, too. Both scores are completely independent from the phylogenetic tree connecting the sequences and can be used as an accuracy measure.
One inherent weakness of the sum-of-pairs scoring, though, is that evolutionary events separating a distant sequence from a number of closely related (overrepresented) sequences are overscored: one evolutionary event in time might be penalised several times in the alignment score. To overcome this, we introduced sequence weighting, based on principles set out in ClustalW. First, a list of weights is calculated and assigned to sequences using the topology and edge lengths of the guide tree, precisely as described in [16]. The weight of a sequence is calculated from the edge lengths of the branches leading to the sequence from the root node, and edge lengths of branches that are shared by two or more sequences are divided up equally between them. The weights are the sums of these partial lengths. Once the weights are available, the pairwise scoring method can be applied with the modification that whenever a score is calculated for a pair of sequences it is multiplied by both sequence weights. The sum of these weighted scores gives rise to an overall score that is much less biased by the overrepresented sequences.
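The weighting rule described above can be sketched as follows. This is a simplified Python illustration, not the exact procedure of [16] or of RetAlign: the tree representation (a leaf is a sequence name, an internal node is a list of `(child, branch_length)` pairs), the function names and the toy branch lengths are assumptions, and any normalisation of the weights is left out.

```python
def leaf_count(node):
    """Number of sequences (leaves) below a node."""
    if isinstance(node, str):
        return 1
    return sum(leaf_count(child) for child, _ in node)

def sequence_weights(node, inherited=0.0, weights=None):
    """Weight of each leaf: sum over the branches on the root-to-leaf path of
    branch length divided by the number of leaves sharing that branch."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        share = length / leaf_count(child)  # branch length split equally among its leaves
        sequence_weights(child, inherited + share, weights)
    return weights

def weighted_pair_score(score, weight_a, weight_b):
    """Pairwise score multiplied by both sequence weights."""
    return score * weight_a * weight_b

# Toy guide tree: A and B are close relatives, C is distant.
toy_guide_tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.4)]
toy_weights = sequence_weights(toy_guide_tree)
```

In the toy tree the shared branch of length 0.2 contributes 0.1 to each of A and B, so the two close relatives end up with half the weight of the distant sequence C.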
Threshold values
The size of the alignment space that is being explored by the Reticular Alignment algorithm and consequently, the accuracy of the alignments created depends on the strategy for choosing an x value for each alignment step at the internal nodes. We chose to set the x value dynamically such that the final size of the alignment network is at most (t + 100)% of the length (number of columns) of one of the best scored multiple alignments, where t is a threshold parameter set by the user. Note that this is the "Threshold to be passed for computation" value on the GUI of our application. The slider underneath is just for convenience, with a log transformation. For any such t value, the corresponding x value can be found by building the network gradually: at first, alignment columns are placed in a priority queue where the key is the score of the best alignment they appear in; then groups of columns having equal score are removed from the queue iteratively, starting with the ones having the highest score, and added to the growing network en masse while the size limit permits.
This approach is more advantageous than keeping x constant or setting it to a fixed proportion of the optimum score, because the latter would have an unpredictable effect on the running time and memory usage and could also cause the alignment networks to vary considerably in relative size at the internal nodes. In contrast, with our method, the memory requirement can be estimated from t, and the proportion of the number of suboptimal alignment columns to optimal columns is uniform over the whole tree.
Efficient score calculation
The scoring scheme of RetAlign involves summing similarity values and gap penalties for all sequence pairs. These pairwise summations are carried out repeatedly on alignment columns (for similarity) and pairs of columns (for gap scores) to fill each element of the dynamic programming tables. The straightforward implementation can thus have a huge impact on the running time when many sequences are aligned. For this reason we developed techniques to speed up these calculations.
The two problems are essentially the same: given a list of pattern values $v_1, \dots, v_n$, pairwise pattern scores must be summed for all pairs: $S = \sum_{i=1}^{n} \sum_{j=1}^{n} M(v_i, v_j)$, $v_i \in \{p_1, \dots, p_k\}$. In the similarity score case, the patterns are the residues (of 20 different types when aligning proteins) and the matrix $M$ is the similarity matrix, while in the indel score case there are 4 different patterns formed by two successive characters, each either gap or non-gap. The trick is simple: first count how many of each pattern type is present in the column, then sum the score value for each pair of types multiplied by the counts: $S = \sum_{a=1}^{k} \sum_{b=1}^{k} c_a c_b M(p_a, p_b)$, where $c_a$ denotes the number of occurrences of pattern $p_a$ among $v_1, \dots, v_n$.
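The counting trick can be checked with a short sketch that compares the direct double sum with the count-based version. The function names and the two-letter toy matrix are assumptions of this example; the double sum runs over all ordered pairs, as in the formula for S.

```python
from collections import Counter

def pairwise_sum_naive(values, M):
    """Direct double sum over all ordered pairs: O(n^2) matrix lookups."""
    return sum(M[a][b] for a in values for b in values)

def pairwise_sum_counted(values, M):
    """Count each pattern once, then combine counts: O(n + k^2) work."""
    counts = Counter(values)
    return sum(ca * cb * M[a][b]
               for a, ca in counts.items()
               for b, cb in counts.items())

# Toy similarity matrix over two residue types (illustrative values only).
M = {"A": {"A": 4, "L": -1}, "L": {"A": -1, "L": 4}}
column = list("AAALL")
total = pairwise_sum_counted(column, M)
```

For n sequences and k pattern types the counted version needs one pass over the column and k^2 multiplications, which matters because k is fixed (20 residues, or 4 gap patterns) while n grows with the number of sequences aligned.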
Results and Discussion
We implemented the Reticular Alignment method in the Java programming language with all features for the choices of parameters as described in the previous section. The method was tested on the BAliBASE database [23], and compared with ClustalW [16], MAFFT [18] and Fast Statistical Alignment (FSA) [21]. BAliBASE is a database of manually refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C terminal extensions. Core blocks are identified excluding non-superposable regions.
BAliBASE provides a scoring tool (bali_score.c) to measure the accuracy of sequence alignments based on the reference alignments in the database. This tool offers two accuracy measures (SP and TC) and allows assessment based on either all or a subset of alignment columns, thus essentially giving four different accuracy scores. SP is the number of correctly aligned residue pairs divided by the number of aligned residue pairs in the reference alignment; TC is the number of correctly aligned columns divided by the number of columns in the reference alignment. SP and TC can also be calculated on columns of the core blocks only; these feature columns are described in the BAliBASE database by separate files. We denote the so-obtained scores 'Feature SP' and 'Feature TC'. All four of these scores can be regarded as sensitivity measures (in classification terminology) because characters/columns incorrectly shown as homologous do not decrease the score.
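The two definitions can be made concrete with a simplified sketch. It is not a reimplementation of bali_score.c: the representation of an alignment as a list of equal-length gapped strings with rows in the same order, the function names, and the treatment of every reference column (rather than only core blocks) are assumptions of this example.

```python
from itertools import combinations

def columns(alignment):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        col = []
        for row, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[row])
                counters[row] += 1
        cols.append(tuple(col))
    return cols

def residue_pairs(cols):
    """Collect every pair of residues placed in the same column."""
    pairs = set()
    for col in cols:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs

def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = residue_pairs(columns(reference))
    return len(ref_pairs & residue_pairs(columns(test))) / len(ref_pairs)

def tc_score(test, reference):
    """Fraction of reference columns that the test alignment reproduces exactly."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)

reference = ["AC-G", "ACTG", "A-TG"]
test = ["AC-G", "ACTG", "AT-G"]
sp, tc = sp_score(test, reference), tc_score(test, reference)
```

In this toy case the test alignment also pairs residues that the reference keeps apart, but only the reference pairs appear in the denominator, which is why both measures behave as sensitivities.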
We were interested in how the accuracy of our method depends on different parameters. Since the parameter space is four dimensional (guide tree building, similarity scoring, gap scoring, threshold value in the generalised Waterman-Byers algorithm) with several choices along each dimension, we do not show the results for each possible combination of parameters. Two parameters (similarity scores, gap scores) influence only the score of the multiple alignments, one parameter (threshold value for the generalised Waterman-Byers algorithm) influences how to search in the search space, and one parameter (how to build the guide tree) influences the search strategy (when the tree-constructing strategy changes the topology of the tree, since our method is a progressive alignment method), and might also influence the way of scoring alignments (if the sequences are weighted by the guide tree). For each fixed score function, we tested how the alignment accuracy changes with the t parameter, namely, how much the accuracy can be improved by a deeper search in the alignment space. Some of our findings, discussed in the following subsections, are quite surprising.
Alignment accuracy generally increases with the deepened search in the alignment space
Comparing single and pairwise gap penalties
Clustal uses a simple non-pairwise gap penalty for multiple alignments as described in the Methods section. This seems a rational choice for Clustal, as this gap scoring scheme indeed generates better alignments for Clustal than the pairwise scoring scheme.
The importance of a good guide tree, sequence weighting and gap scoring
Although Reticular Alignment outperformed ClustalX with a simple sum-of-pairs scoring scheme, and without any sophisticated gap scoring scheme, its performance with the less sophisticated scoring schemes was worse than the performance of the cutting-edge multiple sequence alignment methods. Therefore, we improved the scoring scheme both for similarity scoring and for gap scoring.
It is well known that the relative difference between the score of the fully conserved alignment column and the score of the alignment column with a single mismatch decreases with the number of sequences [30]. This artefact can be reduced by weighting the sequences according to the evolutionary tree showing their relationship. Such a weighting also improves alignment accuracy [30]. We implemented the same sequence weighting method that ClustalX uses.
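The shrinking relative difference can be seen in a small unweighted example. Assuming, for illustration only, a uniform match score m and mismatch score s, a fully conserved column of n sequences scores n(n-1)m/2, a column with one mismatch scores (n-1)(n-2)m/2 + (n-1)s, and the relative difference is therefore 2(m-s)/(nm), which falls as 1/n. The sketch below checks this numerically; the function names and the values m = 4, s = -1 are assumptions of the example.

```python
from itertools import combinations

def sum_of_pairs(column, match=4, mismatch=-1):
    """Unweighted sum-of-pairs score of one gap-free column."""
    return sum(match if a == b else mismatch
               for a, b in combinations(column, 2))

def relative_drop(n, match=4, mismatch=-1):
    """Relative score loss caused by a single mismatch among n sequences."""
    conserved = sum_of_pairs("L" * n, match, mismatch)
    one_off = sum_of_pairs("L" * (n - 1) + "V", match, mismatch)
    return (conserved - one_off) / conserved

# The drop shrinks as more sequences are added to the column.
drops = {n: relative_drop(n) for n in (2, 10, 100)}
```

With these toy scores the loss is 125% of the conserved score for two sequences, 25% for ten and 2.5% for a hundred, so a single divergent sequence is increasingly hard to distinguish by score alone.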
Since our sequence weighting method uses the guide tree, it is also important to construct a good guide tree. We found that NJ outperforms UPGMA measured in alignment accuracy (data not shown). Since the NJ algorithm generates an unrooted tree, and the Reticular Alignment method needs a rooted tree, the NJ tree must be rooted. Changing the root of the guide tree also changes the progression of the multiple alignment. The more balanced the tree, the closer the numbers of sequences in the two alignment networks. We found that balanced trees generated by the 'midpoint' method as described in [16] generate more accurate alignments than unbalanced trees where one of the subtrees of the root contains only a single sequence.
Finally, it is also important to distinguish gap scores based on whether hydrophilic or hydrophobic amino acids are inserted and/or deleted. Applying the same scoring scheme that ClustalX uses improved the alignment accuracy.
Fortifying the Reticular Alignment method with these sophisticated scoring schemes yielded a method that generated highly accurate alignments. Reticular Alignment outperformed all of ClustalX, MAFFT and FSA in SP values on the BAliBASE v1.0 database, and only MAFFT outperformed Reticular Alignment in the TC values, see Figure 3. On BAliBASE v2.0, Reticular Alignment outperformed ClustalX and FSA in all accuracy measurements, and it had a higher feature SP value than MAFFT, see Figure 2.
Memory and Computational demand
Conclusions
Previous corner-cutting methods define a compact part of the dynamic programming table for searching the best scored alignment. These methods become very inefficient when the number of sequences increases. We introduced a new progressive alignment method called Reticular Alignment, which obtains a set of optimal and suboptimal alignments at each step of the progressive alignment procedure. This set of alignments is represented by a network and is not directly embedded into the high-dimensional dynamic programming table. The set typically contains high-scored alignments that are usually not neighbours in the dynamic programming table (see, for example, the already mentioned Figure 2 in [22]). Therefore, the convex hull of the set of these alignments in the high-dimensional dynamic programming table contains a significantly larger set of alignments. Any previous corner-cutting method setting a convex part containing the set of alignments found by the Reticular Alignment method would need significantly more memory and running time.
This novel corner-cutting approach allows the efficient search of the space of multiple sequence alignments for high-scored alignments. The method has a parameter which affects how much of the alignment space is explored. Furthermore, the Reticular Alignment method can be combined with any scoring scheme, and in this way, we were able to assess the relative importance of sophisticated scoring schemes and more exhaustive searches in finding accurate multiple sequence alignments.
The conclusion is that it is important to increase the search space for finding high-scored alignments. The Reticular Alignment method could find more accurate alignments than ClustalW even when the gap-scoring scheme was significantly less sophisticated than the scoring scheme of ClustalW. For example, ClustalW gives different gap scores for hydrophilic and hydrophobic amino acids. This is considered to improve the alignment quality as hydrophilic amino acids are on the surface of globular proteins forming loops, and these loops undergo significantly more insertion and deletion events than other parts of the proteins. Still, Reticular Alignment could generate more accurate alignments than ClustalW by merely extending the search space and without applying the above-mentioned sophisticated scoring scheme of ClustalW. On the other hand, sophisticated scoring schemes are also necessary to get highly accurate multiple alignments. Combining sophisticated scoring schemes with the Reticular Alignment progressive alignment approach yielded a method whose accuracy is comparable to that of cutting-edge alignment methods. Without such sophisticated methods, the Reticular Alignment method only outperformed the ClustalX method, and was beaten by MAFFT and FSA in all accuracy measurements. Therefore, how to find the scoring function that provides the most accurate multiple alignments is also an important question. Kececioglu and Kim gave a fast linear programming-based method that finds parameter values that make given example alignments be optimal-scoring alignments of their strings [32]. An extension of that approach to multiple sequence alignments would be desirable.
Declarations
Acknowledgements
Ádám Novák gratefully thanks BBSRC for the continued support and funding. István Miklós is supported by OTKA grant NK 78439.
Authors’ Affiliations
References
1. Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press; 1997.
2. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–53. 10.1016/0022-2836(70)90057-4
3. Sankoff D, Cedergren RJ: Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Massachusetts; 1983:253–263. chap. Simultaneous comparison of three or more sequences related by a tree
4. Wang L, Jiang T: On the complexity of multiple sequence alignment. J Comp Biol 1994, 1(4):337–348. 10.1089/cmb.1994.1.337
5. Fickett J: Fast optimal alignment. Nucleic Acids Research 1984, 12: 175–180. 10.1093/nar/12.1Part1.175
6. Ukkonen E: Algorithms for approximate string matching. Inform Control 1985, 64: 100–118. 10.1016/S0019-9958(85)80046-2
7. Spouge J: Fast optimal alignment. CABIOS 1991, 7: 1–7.
8. Hein J, Wiuf C, Knudsen B, Moller MB, Wibling G: Statistical alignment: computational properties, homology testing and goodness-of-fit. J Mol Biol 2000, 302: 265–279. 10.1006/jmbi.2000.4061
9. Wu S, Manber U, Myers G, Miller W: An O(NP) sequence comparison algorithm. Information Processing Letters 1990, 35(6):317–323. 10.1016/0020-0190(90)90035-V
10. Carrillo H, Lipman D: The multiple sequence alignment problem in biology. SIAM Journal of Applied Mathematics 1988, 48: 1073–1082. 10.1137/0148063
11. Lipman D, Altschul S, Kececioglu J: A tool for multiple sequence alignment. PNAS 1989, 86: 4412–4415. 10.1073/pnas.86.12.4412
12. Gupta S, Kececioglu J, Schäffer A: Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J Comp Biol 1995, 2(3):459–472. 10.1089/cmb.1995.2.459
13. Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phyletic trees: An integrated method. J Mol Evol 1984, 20(2):175–186. 10.1007/BF02257378
14. Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–360. 10.1007/BF02603120
15. Higgins D, Sharp P: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988, 73: 237–44. 10.1016/0378-1119(88)90330-7
16. Thompson J, Higgins D, Gibson T: ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 1994, 22: 4673–4690. 10.1093/nar/22.22.4673
17. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302: 205–17. 10.1006/jmbi.2000.4042
18. Katoh K, Misawa K, Kuma Ki, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl Acids Res 2002, 30(14):3059–3066. 10.1093/nar/gkf436
19. Suchard MA, Redelings BD: BAli-Phy: Simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 2006, 22(16):2047–2048. 10.1093/bioinformatics/btl175
20. Novák A, Miklós I, Lyngsø R, Hein J: StatAlign: An Extendable Software Package for Joint Bayesian Estimation of Alignments and Evolutionary Trees. Bioinformatics 2008, 24(20):2403–2404. 10.1093/bioinformatics/btn457
21. Bradley R, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L: Fast Statistical Alignment. PLoS Computational Biology 2009, 5: e1000392. 10.1371/journal.pcbi.1000392
22. Zhu J, Liu J, Lawrence C: Bayesian adaptive sequence alignment algorithms. Bioinformatics 1998, 14: 25–39. 10.1093/bioinformatics/14.1.25
23. Thompson J, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61: 127–136. 10.1002/prot.20527
24. Waterman MS, Byers TH: A dynamic programming algorithm to find all solutions in the neighborhood of the optimum. Math Biosci 1985, 77: 179–188. 10.1016/0025-5564(85)90096-3
25. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406–425.
26. Studier J, Keppler K: A note on the Neighbor-Joining algorithm of Saitou and Nei. Mol Biol Evol 1988, 5(6):729–731.
27. Hirschberg DS: A linear space algorithm for computing maximal common subsequences. Commun ACM 1975, 18(6):341–343. 10.1145/360825.360861
28. Ma B, Wang Z, Zhang K: Alignment between Two Multiple Alignments. Lecture Notes in Computer Science 2003, 2676: 254–265.
29. Henikoff S, Henikoff J: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915
30. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998.
31. Tarnas C, Hughey R: Reduced space hidden Markov model training. Bioinformatics 1998, 14: 401–406. 10.1093/bioinformatics/14.5.401
32. Kececioglu J, Kim E: Simple and Fast Inverse Alignment. Lecture Notes in Computer Science 2006, 3909: 441–455.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.