 Methodology Article
 Open Access
Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
 Joseph L Herman^{1, 2}Email author,
 Ádám Novák^{1},
 Rune Lyngsø^{1},
 Adrienn Szabó^{3},
 István Miklós^{3, 4} and
 Jotun Hein^{1}
https://doi.org/10.1186/s1285901505161
© Herman et al.; licensee BioMed Central. 2015
 Received: 24 March 2014
 Accepted: 24 February 2015
 Published: 1 April 2015
Abstract
Background
A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.
Results
In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables samplebased estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.
Conclusions
The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
Keywords
 Alignment graphs
 Statistical alignment
 Alignment uncertainty
 Multiple sequence alignment
Background
Sequence alignment is one of the most intensely studied problems in bioinformatics, and is an important step in a wide range of different analyses, including identification of conserved motifs [1], analysis of molecular coevolution [24], estimation of phylogenies [5], and homologybased protein structure prediction [6,7].
Many of the most popular alignment methods seek to compute a single optimal alignment, using dynamic programming algorithms [8,9] as well as a variety of heuristic procedures [1015]. Similar approaches can be used to find maximum likelihood alignments under certain probabilistic models of insertion, deletion and substitution events [1620].
Effect of alignment on downstream inference
It has become increasingly clear in recent years that downstream analyses are often highly sensitive to the specific choice of alignment. There may be many plausible but suboptimal alignments within the vicinity of the optimum, containing additional—often complementary—information regarding the evolutionary relationships between the sequences [21]; selecting a single point estimate results in the loss of this additional information, and fails to account for the statistical uncertainty associated with different regions of the alignment [22].
A number of studies have highlighted the impact of the choice of alignment on subsequent phylogenetic inference [2331]; in many cases different alignment methods, or different guide trees, can give rise to very different phylogenies [23,3236]. Sensitivity to the alignment is also observed in the context of many other types of downstream analysis, including homology modelling of protein structures [3739], detection of correlated evolution [40,41], prediction of RNA secondary structure [42], and inference of positive selection [36,4345].
Filtering methods
A common approach to tackling the issue of alignment uncertainty has been to attempt to annotate particular regions of the alignment as unreliable, and to remove these before carrying out subsequent analysis. Filtering methods have in some cases been observed to yield improved inference for phylogenies [4648] and positive selection [44,45].
However, the specific choice of filtering method may have a strong influence on the results [49], and uncertain regions of the alignment may also contain important information that is lost through the use of such methods. For example, tree accuracy is not related in a straightforward fashion to alignment uncertainty [27], and seemingly unreliable regions may be important for accurately resolving phylogenies [50,51]. Regions of high alignment uncertainty can also correspond to sites with higher indel rates [22,52], as well as regions of structural variability [53] or intrinsic disorder [54] in protein structures, and filtering these out may lead to unpredictable biases in subsequent analysis.
Joint sampling approaches
Within the Bayesian paradigm, alignment uncertainty can be addressed in a more methodical fashion by considering alignments, along with other parameters of interest, as samples from an unknown posterior distribution. In this framework, regions of high alignment variability then correspond to regions of high variance in the posterior. The last decade has seen the development of several fully Bayesian approaches for performing joint inference on alignments along with other objects of interest, such as mutation rates [55], phylogenetic trees [5658], information about the evolution of protein structure [5962], and the locations of putative regulatory elements [6365]; inference on these quantities after accounting for alignment uncertainty can then be obtained by averaging over alignments according to their posterior probability under the joint model.
However, although such approaches may be analytically tractable for comparison of a small number of sequences [63,64,66], the computational complexity involved in analysing these hierarchical joint models typically does not scale well with the number of sequences; procedures such as Markov chain Monte Carlo can only increase the range of tractability to a limited extent [56,57,65]. Moreover, adding in another level of annotation or information may require a new model to be formulated, such that in many cases this fully Bayesian approach may be impractical for problems of interest.
Alternatives to joint sampling
In this work we focus on a tractable alternative that can be used when joint sampling approaches are impractical. This approach takes a collection of alignments sampled according to a particular model, and uses an efficient graphbased representation to generate a much larger set of possible alignments from the initial collection. The acyclic structure of the graph allows many types of analysis to be easily carried out on the whole ensemble of alignments rather than just a single representative, such that the alignment uncertainty quantified by the ensemble can be incorporated into downstream analysis without the need for designing computationally intensive joint sampling approaches. If a single representative of the ensemble is required, this framework also allows for the efficient computation of the single alignment that maximises the expected value of a variety of different accuracy scores.
The simple and computationally efficient nature of this representation makes it practical to adopt a more principled, probabilistic approach to quantifying and making use of alignment uncertainty, and we discuss examples of cases where this may prove particularly useful.
Quantifying alignment uncertainty
A number of different approaches have been developed for quantifying the uncertainty associated with a multiple sequence alignment. Many of these methods focus on the notion of alignment reliability, i.e. the degree to which a particular alignment (or regions thereof) can be trusted as a prediction of the homology between the sequences.
One set of approaches involves computing scores or summary statistics on a single alignment of interest, using these as a measure of reliability of the alignment. Some of these approaches equate reliability of a particular alignment column with a high score under the model used to generate the alignment [67], the justification being that lowscoring columns are harder to distinguish from random noise, and so are more likely to contain erroneous homology statements; others generate the alignment using one scoring scheme, and measure its ‘reasonableness’ based upon another set of criteria [68,69], which may involve looking at the deviation of summary statistics from their expected background distribution under the null hypothesis of no homology [70,71]. One potential issue with some of these approaches is that they introduce a bias towards highly conserved regions, since they do not distinguish between evolutionary variability and statistical uncertainty, often using the term alignment quality as a synonym for reliability.
An alternative approach, first mentioned by [49], involves generating a set of plausible alignments, and assessing the alignment uncertainty by measuring the similarity between the alignments in this set. This type of consistency or congruencebased approach has a more natural statistical interpretation, but requires a method of generating alternative alignments, as well as a measure of alignment similarity or distance; the interpretation of the resulting measures of uncertainty may depend heavily on these two factors.
Generating sets of alignments
A variety of heuristic methods have been developed in order to generate sets of alignments for the purposes of measuring uncertainty. Perhaps the simplest of these is to align the same sequences with the residue order reversed [72], although the efficacy of this technique is questionable [73,74]. Another class of methods generates alternative alignments by perturbing parameters such as the guide tree [75,76], gap opening and extension penalties [77,78], and substitution matrices [79,80], and recomputing the optimal alignment with these alternative parameters. However, in all these cases the types of perturbations applied to the parameters will affect the resulting estimates of uncertainty in an unpredictable fashion [70].
Another approach is to look at a set of suboptimal alignments under a particular scoring scheme, given fixed parameters [8183], using these to search for regions of consistency [8486]. The variability among these suboptimal alignments can then be converted into a measure of statistical uncertainty, using an approximation to the distribution of scores, for example using an extreme value distribution [87].
A Bayesian approach
Within a Bayesian framework, the collection of plausible alignments can be identified with the posterior distribution of the alignment given the sequences and other model parameters; this leads to a probabilistic interpretation of alignment uncertainty, whereby the fraction of alignments containing a particular homology statement is a measure of the posterior probability of that homology statement.
For the pairwise case, alignments can often be sampled exactly from their posterior distribution under a particular evolutionary model using a dynamic programming approach [8890]. However, for multiple sequences such approaches rapidly become computationally infeasible, and other types of procedures must be used. A popular option is to use Markov chain Monte Carlo (MCMC) in order to sample from the posterior distribution of alignments [5558,60,61,65,9194]. The main advantage of the MCMC approach is that it is guaranteed to sample alignments from the correct probability distribution, provided that the simulation is run for long enough to ensure convergence, although this may require significant amounts of runtime.
Representing the distribution of sampled alignments
Once a set of plausible alignments has been generated, a common issue that arises is how to represent and/or summarise this set in a useful fashion. In a Bayesian context this entails representing the approximation to the posterior distribution over alignments, given a collection of samples. We shall present here a graphbased formulation that allows for a compact representation of this distribution, permitting algorithms to be designed for efficient inference on exponentially large sets of alignments derived from a collection of samples.
Mapping columns to dynamic programming tables
A multiple sequence alignment can be represented as a path through a multidimensional matrix; an edge from one cell of the matrix to an adjacent cell represents a particular set of homology statements, synonymous with a column in the alignment. It is a straightforward extension to consider a set of alignments as a set of paths in such a matrix [95].
where \(s^{(i)}_{j}\) is the jth character of the ith sequence, such that C(X) corresponds to the coordinates of the midpoint of an edge connecting two cells in the matrix. We will also introduce initial and terminal columns, X ^{(0)} and X ^{(T)}, which can be thought of as allgap columns preceding the first characters and following the last characters of the sequences, respectively. These will therefore be encoded as C(X ^{(0)})=(0,…,0) and C(X ^{(T)})=(2L _{1},…,2L _{ m }) where L _{ i } is the length of the i ^{ t h } sequence.
Intersections between alignments
The paths corresponding to a particular set of alignments may intersect at one or more points in the matrix; as first discussed by BuckaLassen et al. [95], subpaths can be ‘spliced’ at these points in order to generate new alignments. This approach was originally used to create an augmented search space for locating an optimal alignment [95,96], and more recently has been used as part of a progressive alignment algorithm that keeps track of suboptimal alignments [97].
As a result of this, an initial set of alignments sampled according to a particular model can be used to generate a much larger set of alignments sampled according to the same distribution, as we shall examine in further detail in the subsequent section.
Equivalence classes of columns
The equivalence class E _{ P }(X) is then defined as the set of columns, {X ^{′}∣f _{ P }(X ^{′})=f _{ P }(X)}, with E _{ S }(X) similarly defined.
Using the definitions above, a column X ^{′} is a predecessor of X if and only if f _{ S }(X ^{′})=f _{ P }(X), since any path connecting them must pass through the separating equivalence class E _{ S }(X ^{′})≡E _{ P }(X). We will use the notation \(\mathcal {P}(X) \equiv \{ X^{\prime } \mid f_{S}(X^{\prime }) = f_{P}(X) \}\) to denote the set of predecessors of X.
The alignment column graph
We can then define the alignment column graph, \(\mathcal {D}(\Xi)\), of a set of columns, Ξ, as a graph whose nodes are the columns in Ξ, with a directed edge from column X to column X ^{′} if and only if f _{ S }(X)=f _{ P }(X ^{′}), which we write as \(X \ltimes X^{\prime }\). From the definitions in Equations (2) and (3), we have f _{ P }(X)<f _{ S }(X) for all X, in the sense that f _{ P }(X)_{ i }≤f _{ S }(X)_{ i } for all i, with no column having f _{ S }(X)=f _{ P }(X) unless it consists of all gaps. This ensures that the alignment column graph is acyclic, since it is never possible to return to the same equivalence class by following a set of directed edges in the graph.
Implicit in the definition of the mapping in Equation (1) is a distinction between gaps based on their position in the alignment, such that the two situations shown in Figure 1 represent distinct alignments, each yielding two different pairs of columns. This assumption is necessary in order to generate a sparse graph; treating all gaps as equivalent is tantamount to replicating each gapcontaining column onto all parallels, such that the graph in general becomes maximally dense, making efficient algorithms difficult to implement (see Additional file 1 : Figure S2).
Probability distributions on alignment DAGs
Due to the highdimensional nature of the alignment space, in any particular set each alignment will typically occur with a very low frequency; even the most likely alignment may only be sampled once, if at all [93,98]. As such, the relative probabilities of entire alignments are difficult—if not impossible—to estimate directly by their observed frequencies. However, a particular column may occur in many different alignments, allowing the marginal probability of each column, averaged over all alignments, to be estimated much more efficiently [93,99]. As we shall discuss, they also represent useful summary statistics of the full distribution.
Alignment probabilities in terms of pair marginals
For evolutionary models based on firstorder hidden Markov models (HMMs) (such as the one shown in Additional file 1: Figure S4), the pairmarginal representation is exact, since the dependencies in the model are equivalent to those in the DAG. For models with nonlocal dependencies between columns, simply setting the pair marginals to be equal to the observed pair marginals minimises the KullbackLiebler divergence from the full empirical distribution to the pairmarginal approximation (see Additional file 1 : Section S4).
Motivations for using factored approximations

The number of possible column pairs is many orders of magnitude lower than the number of alignments, such that pair marginals can be estimated much more reliably from observed frequencies. These can then be used to construct more accurate estimates of the overall joint probability.

Expression of the joint in terms of pairmarginals allows for interchanges in the alignment DAG (cf. Figure 2), allowing many alternative alignments to be generated from an initial collection of samples.

Factorisation of the probability into a product of local terms allows for efficient algorithms to be implemented on the DAG structure.
We discuss these factors in further detail below.
Meanfield approximation
where \(p(X \mid \mathcal {P}(X))\) is the probability of column X given that any one of its possible predecessors is in the alignment. The second line uses the identities \(p(X, \mathcal {P}(X)) \equiv p(X)\) (since a column can only be present if one of its predecessors is present), and \(p(\mathcal {P}(X)) \equiv \sum _{X^{\prime }\ltimes X} p(X)\) (since only one member of an equivalence class can be present in any particular alignment).
An important corollary of the expression in Equation (8) is that singlecolumn marginals are sufficient to reconstruct the meanfield approximation to the joint probability; this has several important consequences, as we shall discuss below.
Motivations for using the meanfield approximation

Since the number of possible columns is substantially less than the number of possible column pairs, it is easier to obtain reliable estimates of singlecolumn marginals from a collection of alignment samples. Hence, the meanfield approximation is likely to be more accurate for lower sample sizes.

The use of singlecolumn marginals allows for crossovers in the alignment DAG (cf. Figure 3), whereas the pairmarginal expression will assign a weight of zero to any pairs that are not observed, hence only permitting interchanges of the form shown in Figure 2. This allows for a higher effective sample size for the alignments under the meanfield approximation, with more alternative alignments generated from the same collection of samples.

Restricting to singlecolumn marginals more efficient algorithms to be constructed, involving onestep rather than twostep recursions.
In the rest of this section, we examine these points in further detail.
Estimating marginal probabilities
For a pairwise alignment, column marginals can be easily represented using a matrix in which the (i,j) entry contains the marginal probability \(p(s^{(1)}_{i} \diamond s^{(2)}_{j})\), where \(s^{(1)}_{i}\) and \(s^{(2)}_{j}\) are the ith and jth characters in two sequences s ^{(1)} and s ^{(2)}, and the symbol ◇ denotes homology. When only two sequences are under comparison, dynamic programming recursions allow for the exact computation of these marginal probabilities under certain types of evolutionary models [55,100,101].
with \(n_{C}(X,\mathcal {A}^{+})\) denoting the number of occurrences of C(X) across all the alignments contained in the multiset \(\mathcal {A}^{+}\). If enough alignments are sampled from the correct distribution, the above estimator will converge to the true value p _{ C }(X). Although conditional marginals can also be computed from local alignments (see Additional file 1 : Section S1), in this work we will consider only global alignments, in the interests of simplicity.
Since in most cases each sampled alignment will be unique, due to the high dimensional nature of the state space, in the rest of this manuscript we will refer only to the set rather than the multiset \(\mathcal {A}^{+}\). However, for cases where uncertainty is low, and the same alignment may be sampled more than once, it is important to treat each replica as an independent sample when computing marginal probabilities.
Marginal probabilities can also be estimated for pairs of columns using observed pair frequencies. However, the space of possible pairs of columns can be much larger than the space of columns; in the worst case this will be by a factor of \(\mathcal {O}(2^{N})\), where N is the number of sequences, since this is the maximum size of an equivalence class. Hence, a larger number of alignment samples will be needed to obtain accurate estimates for pair marginals. As we shall see, this means that pairbased reconstructions of joint probabilities are typically less accurate unless a very large number of samples is used.
Reconstructing alignment probabilities from marginals
Generally, with samplingbased procedures such as MCMC, posterior probabilities are estimated via sampled frequencies. However, in the case of a very high dimensional parameter such as a multiple sequence alignment, each point in the space may only be visited once, such that it is not possible to estimate posterior probabilities based on these frequencies.
As discussed above, the set of marginal probabilities for each column (or pair of neighbouring columns) can be used to reconstruct the posterior probability for any particular alignment, via Equation (5). Although the likelihood for each sampled alignment will often be known as a byproduct of the sampling procedure, the marginal posterior probability of each alignment after integrating over other unknown parameters (for example indel rates), will typically not be known. Hence, the DAGbased approach presented here represents a useful way to calculate posterior probabilities in such cases. A similar approach has been used recently to compute the posterior probabilities of phylogenetic trees based on the probabilities of each of the constituent clades, under the assumption of conditional independence between clades [102].
For lower numbers of samples, the estimates are more accurate for the more probable alignments, since the more extreme regions of the space are sampled with lower probability, and hence converge more slowly.
Although both pairmarginal and meanfield estimates converge in this case at a similar rate, closer analysis shows that the mean squared error in the approximation to the true posterior is considerably less for the meanfield approximation. This suggests that the improvement obtained by summing over a larger number of paths (see Figure 5) outweighs the approximation introduced by averaging over predecessor states, although eventually at around 2000 samples the pairmarginal estimates begin to dominate the meanfield approximation (see Figure 6), since the true pairHMM involves neighbourdependent terms. The precise location of this crossover point will depend on the degree of neighbour dependency; for a completely siteindependent model (e.g. the pairHMM in Additional file 1: Figure S4 with δ=ε=σ), the singlecolumn marginal estimate always dominates (see Additional file 1 : Figure S7).
Approximate summation over all alignments
As well as computing the probability of individual paths in the DAG, it is possible to sum over all alignments contained within the DAG using a standard dynamic programming algorithm (see Additional file 1 : Section S5).
In the pairwise case, where it is possible to analytically compute the sum over all alignments (by filling out the full dynamic programming table), it is possible to examine how much of the posterior mass is contained within the DAG resulting from a particular set of samples.
While the probability mass contained within the individual samples increases relatively slowly, and encapsulates only a very small fraction of the total, the proportion of the posterior mass encapsulated in the set of paths through the alignment DAG increases much more rapidly; the DAG contains in the order of 10 15% of the total posterior mass over the entire set of possible alignments with just 100 samples, increasing to around 80% after including 2000 samples (see Figure 9 and Additional file 1 : Figure S1).
A similar dynamic programming algorithm can be used to calculate the total number of paths (i.e. alignments) contained within the DAG. Examining the number of paths in the DAG as a function of the number of alignment samples shows a superexponential relationship when crossovers are allowed, whereas restricting to observed column pairings increases close to exponentially (see Figure 5). In the pairwise case, the theoretical maximum can be computed analytically; for the pairwise example discussed above, the total number of paths in the DAG has an upper bound in the order of 10^{113}.
Summarising the alignment distribution
Although the set of alignments encoded by the DAG contains a great deal of additional information beyond that contained in any one alignment, there may be situations where a single alignment is desired as a summary of the distribution. Due to the highdimensional and constrained nature of the state space, standard summary statistics such as the mean are not applicable in this case [104].
Finding the MAP alignment
such that the path with the maximum posterior can be found using standard dynamic programming algorithms for DAGs (see Algorithm 1).
Nevertheless, due to large size of the space of possible alignments, there may be a large number of very similar alignments with very similar posterior probability. Hence, quantities such as the MAP can be poor summary statistics of the distribution [58,93,94]. Instead, we will consider alternative types of summary alignments that account for the uncertainty contained within the DAG.
Loss function formulation
The problem of choosing a single summary alignment can be approached within a decision theoretical framework, whereby the choice of summary is designed to minimise the expected value of a particular loss function, also known as the posterior risk [104]. For a loss function defined in terms of alignment accuracy, minimising the posterior risk is equivalent to selecting the maximum expected accuracy alignment [98,105,106].
where the sum over A ^{′} includes all alignments. The minimumrisk alignment is then \(\hat {A} = \arg \min _{A} \mathcal {R}(A)\).

True positives (TP) = Columns correctly present

False positives (FP) = Columns incorrectly present

True negatives (TN) = Columns correctly absent

False negatives (FN) = Columns incorrectly absent
such that T P∪F P∪T N∪F N=Ξ, the set of all observed columns.
where f is a bijective function operating on columns, with f(A)=(f(A ^{(1)},…,f(A ^{(L)})), and λ _{ FP } and ρ _{ TP } are loss/reward functions associated with false positives and true positives respectively.
where is the marginal probability of column X being present according to the mapping specified by f, and g=λ _{ FP }/(ρ _{ TP }+λ _{ FP }) is penalty term that penalises longer alignments by a factor proportional to the penalty on false positives. In contrast to an arbitrarily chosen gap penalty, the penalty, g, has a direct interpretation in this case. It is also a straightforward extension to allow λ _{ FP } and ρ _{ TP }, and hence g, to depend on the specific column, X, for example penalising a false positive proportionally to the number of nongap characters contained in the column.
Loss functions corresponding to common accuracy measures
As discussed in Additional file 1: Section S3, the above approach can easily be extended to make use of a function, f, which splits a column up into a set of pairwise homology statements. This allows various pairwise accuracy scores to be expressed in terms of similar types of loss functions.
Modeller scores
One other class of loss function worth mentioning here is the socalled modeller version of each of the aforementioned scores, \(\mathcal {L}^{m}_{f}(A\, \, A^{\prime })\), which involve normalising \(\mathcal {L}_{f}(A\, \, A^{\prime })\) by the length of the predicted alignment, A. For example, the modeller Cscore, corresponding to \(\mathcal {L}^{m}_{C}(A\, \, A^{\prime })\), was considered by Collingridge and Kelly [79]; as we shall see, the dependence on the length of the predicted alignment precludes the use of exact optimisation algorithms for loss functions such as this.
Efficient algorithms
In general, minimising the expectation of any of the aforementioned loss functions over the space of all possible multiple alignments is a problem whose complexity grows exponentially with the number of sequences [107]. For the pairwise case, the minimumrisk/maximum expected accuracy problem can be implemented efficiently using standard dynamic programming algorithms [22,60,61,88,94,98,108110]; for multiple sequences approximate techniques have generally been used, including simulated annealing [20,111,111,112], and greedy [113] or progressive alignment algorithms [105,114116].
The same approach cannot be applied to minimise the risk under modeller variants, however, since the contribution of each column to the partial sum at each step in the dynamic programming algorithm depends on the unknown final alignment length. Collingridge and Kelly recently presented an algorithm, entitled MergeAlign, that proposed to optimise a score of this type, but as shown in Additional file 1: Figure S5, it is possible to construct counterexamples for which the algorithm does not compute the optimal solution. As we shall illustrate, this lack of optimality can result in significant losses when summarising a set of alignments. Moreover, the same objective, i.e. penalising longer alignments, can be achieved through the use of a nonzero g parameter as described above, such that the use of modeller variant loss functions is unnecessary.
Efficient data structures
In representing the alignment DAG, it is essential to ensure that the space complexity of the data structure is less than the total number of paths through the graph, which increases very rapidly with the number of columns. The obvious way to represent a graph is via a list of neighbours for each node, which requires \(\mathcal {O}(\bar {d} \Xi )\) storage, where Ξ is the number of observed columns and \(\bar {d}\) is the average node indegree.
However, within the meanfield setting, we can use the predecessor and successor equivalence classes to significantly increase the space efficiency, since each column need only record its predecessor and successor equivalence class. Given the definitions of the predecessor and successor equivalence classes, we can see that each equivalence class is of size at most 2^{ N }−1, where N is the number of sequences, since each row can take one of two possible values (gap/character) in each equivalence class, with the restriction that the column cannot be all gaps. In general, the number of equivalence classes is therefore somewhat less than the number of columns, with \(\Xi  = \bar {d} \mathcal {E}\), where \(1 \leq \bar {d} \leq 2^{N}  1\). Using an equivalenceclass representation of the DAG structure therefore results in \(\mathcal {O}(\bar {d} \mathcal {E}) = \mathcal {O}(\Xi )\) space requirements, saving a factor of \(\bar {d}\).
Similar gains can be made in time complexity. Since any column in a particular f _{ P }equivalence class will have the same set of possible predecessors, and similarly for successors, the partial sums required in dynamic programming algorithms can be stored per equivalence class rather than per node, which results in algorithms of \(\mathcal {O}(\Xi )\) time complexity rather than \(\mathcal {O}(\bar {d} \Xi )\) (see Algorithms 1 and 2 for examples). In the limit of a large number of short sequences with high uncertainty, this results in going from approximately quadratic time, to time linear in the number of columns.
Example application: summary alignments for simulated and benchmark datasets
In order to illustrate the utility of the aforementioned procedure, we first simulated sequence data using the program DAWG [117], yielding sets of sequences for which the true alignment is known. Details of the simulation are provided in Additional file 1: Section S7. Data were simulated under three parameter regimes, with indel rates set to low, medium and high (see Additional file 1 : Section S7 for further details); 50 datasets were generated for each regime, yielding 150 datasets overall, each containing 10 sequences, with average sequence length equal to 905 nucleotides.
As a biologically relevant example, we also considered a set of 78 alignments taken from the BAliBASE database, comprising the fulllength alignments from the Reference 1 set [118]. This set further comprises two subsets, consisting of low sequence identity (Ref 1a, ID <25%) (short: 14, medium: 12, long: 12; average 6.8 sequences per alignment; average sequence length 309), and medium sequence identity (Ref 1b, ID =20−40%) (short: 14, medium: 16, long: 10; average 9.0 sequences per alignment; average sequence length 351). The simulated and BAliBASE datasets can be found in Additional file 2.
For each of these datasets, we ran the statistical alignment software StatAlign [56], which jointly samples alignments and trees under a stochastic model of substitution, insertion and deletion [93]. 1000 alignment samples were generated from the posterior distribution, and a Javabased implementation of Algorithm 2 was used to compute a summary alignment minimising the risk under the C and C ^{+}based loss functions.
It is also of interest to consider how the minimumrisk summary approach scales to alignments containing larger numbers of sequences. As a test dataset containing larger alignments, we selected one of the largest alignments from the OXBench suite [119], consisting of 122 immunoglobulin sequences, with average length 113. To assess how the method scaled with the number of sequences after controlling for other factors (such as amino acid content and sequence length), we subsampled smaller datasets from this alignment, yielding datasets with 15, 33, 60 and 122 sequences. These subsets were sampled so as to maximise dissimilarity within the subset, since the original alignment contained several welldefined subgroups that would otherwise skew the analysis. Since full posterior sampling of alignments is only feasible for around 2030 sequences, we made use of an approximate method for sampling alignments for these datasets [80], generating 2000 alignment samples for each dataset (see Additional file 1 : Section S7 for further details).
Comparison to other methods
For comparison, we also generated summary alignments for each dataset using the MergeAlign method of Collingridge and Kelly [79], and a consistencybased approach whereby the alignment samples are used as a library for input to the program TCoffee [114], using the aln option [120]. We call the latter approach SCoffee, with the ‘S’ signifying that the TCoffee method is being used on a library derived from a set of sampled alignments.
Average time (in seconds) taken to generate a summary alignment from 1000 samples, for the three simulated datasets
Indel rate  

Low  Medium  High  
MinRisk (C)  1.5  1.8  2.2 
MinRisk (C ^{+})  1.9  2.4  2.8 
MergeAlign  12.0  17.6  22.9 
SCoffee  43.0  48.4  50.9 
Accuracy metrics
For the simulated data, accuracy is computed relative to the known true alignments, and for the BAliBASE datasets, relative to the benchmark alignment provided.
Since the minimal \(\mathcal {R}_{C}\) and \(\mathcal {R}_{C^{+}}\) alignments maximise the expectation of the C and C ^{+}score respectively, it would be expected that these methods perform best under the corresponding scores. The MergeAlign method seeks to maximise the Modeller C score, although as mentioned earlier, the algorithm cannot guarantee an optimal solution. As a pairwise progressive algorithm, the SCoffee method might be expected to perform best under a sumofpairs score, such as the AMA metric.
A rank of 1 therefore indicates an alignment that is more accurate under measure α than each of the individual samples, whereas a rank of 0 indicates an accuracy lower than any of the individual samples.
Results: simulated data
Average rank scores for the different methods on simulated datasets, using the accuracy metrics described in the main text and in Table 2
Low indel rate  Medium indel rate  High indel rate  

α _{ C }  \(\protect \phantom {\dot {i}\!}\boldsymbol {\alpha _{C^{+}}}\)  AMA  α _{ C }  \(\protect \phantom {\dot {i}\!}\boldsymbol {\alpha _{C^{+}}}\)  AMA  α _{ C }  \(\protect \phantom {\dot {i}\!}\boldsymbol {\alpha _{C^{+}}}\)  AMA  
MinRisk (C), g=0  0.91  0.89  0.90  0.96  0.92  0.93  0.89  0.88  0.88 
MinRisk (C), g=0.5  0.89  0.73  0.84  0.93  0.50  0.78  0.84  0.09  0.40 
MinRisk (C), g=1  0.88  0.63  0.80  0.90  0.30  0.65  0.79  0.03  0.28 
MinRisk (C ^{+}), g=0  0.86  0.98  0.96  0.87  1.00  1.00  0.76  1.00  1.00 
MinRisk (C ^{+}), g=0.5  0.89  0.92  0.92  0.93  0.94  0.94  0.86  0.94  0.94 
MinRisk (C ^{+}), g=1  0.89  0.84  0.88  0.91  0.74  0.85  0.83  0.34  0.55 
MergeAlign  0.65  0.40  0.48  0.80  0.46  0.58  0.73  0.36  0.45 
SCoffee  0.08  0.02  0.10  0.15  0.01  0.10  0.29  0.00  0.04 
In contrast, on these datasets MergeAlign typically yields a summary alignment whose accuracy is close to the median, with a rank score close to 0.5, although performance is more reasonable under the α _{ C } measure. The progressive heuristic SCoffee algorithm performs consistently badly in all cases, yielding summary alignments that are typically worse than the majority of the samples used to build the library, suggesting a conflict between the information contained in the samples, and the heuristics used to construct the alignment.
Average rank scores for the different methods on simulated datasets, measured using the modeller scores
Low indel rate  Medium indel rate  High indel rate  

\(\boldsymbol {{\alpha ^{m}_{C}}}\)  \(\boldsymbol {\alpha ^{m}_{C^{+}}}\)  \(\boldsymbol {{\alpha ^{m}_{C}}}\)  \(\boldsymbol {\alpha ^{m}_{C^{+}}}\)  \(\boldsymbol {{\alpha ^{m}_{C}}}\)  \(\boldsymbol {\alpha ^{m}_{C^{+}}}\)  
MinRisk (C), g=0  0.92  0.91  0.96  0.95  0.89  0.92 
MinRisk (C), g=0.5  0.93  0.88  0.97  0.80  0.90  0.35 
MinRisk (C), g=1  0.95  0.85  0.96  0.65  0.87  0.23 
MinRisk (C ^{+}), g=0  0.69  0.88  0.62  0.96  0.56  1.00 
MinRisk (C ^{+}), g=0.5  0.90  0.94  0.95  0.97  0.88  0.96 
MinRisk (C ^{+}), g=1  0.93  0.92  0.95  0.91  0.88  0.74 
MergeAlign  0.74  0.57  0.85  0.67  0.78  0.63 
SCoffee  0.15  0.05  0.22  0.03  0.37  0.00 
As might be expected, the performance of MergeAlign improves when the accuracy is measured using the modeller scores. However, better performance can still be obtained under the modeller variants by using the MinRisk method and a nonzero g parameter (see Table 4). As discussed earlier, the g parameter accomplishes the key aim of the modeller score (i.e. to penalise longer alignments) while maintaining computational tractability, and a meaningful statistical interpretation.
Results: BAliBASE
Average rank scores for the different methods on BAliBASE datasets, using the accuracy metrics described in the main text and in Table 2
Ref 1a (<25 % )  Ref 1b (20−40 % )  

α _{ C }  \(\phantom {\dot {i}\!}\alpha _{C^{+}}\)  AMA  α _{ C }  \(\phantom {\dot {i}\!}\alpha _{C^{+}}\)  AMA  
MinRisk (C), g=0  0.94  0.77  0.88  0.88  0.85  0.82 
MinRisk (C), g=0.5  0.90  0.41  0.66  0.92  0.81  0.90 
MinRisk (C), g=1  0.88  0.41  0.63  0.94  0.83  0.93 
MinRisk (C ^{+}), g=0  0.67  0.92  0.77  0.71  0.87  0.66 
MinRisk (C ^{+}), g=0.5  0.86  0.86  0.88  0.85  0.91  0.89 
MinRisk (C ^{+}), g=1  0.88  0.64  0.78  0.90  0.88  0.93 
MergeAlign  0.91  0.59  0.74  0.80  0.75  0.84 
SCoffee  0.45  0.14  0.26  0.52  0.32  0.52 
Average rank scores for the different methods on BAliBASE datasets, measured using the modeller scores
Ref 1a (<25 % )  Ref 1b (20−40 % )  

\(\alpha ^{m}_{C}\)  \(\alpha ^{m}_{C^{+}}\)  \(\alpha ^{m}_{C}\)  \(\alpha ^{m}_{C^{+}}\)  
MinRisk (C), g=0  0.93  0.74  0.82  0.78 
MinRisk (C), g=0.5  0.95  0.70  0.96  0.96 
MinRisk (C), g=1  0.92  0.68  0.97  0.97 
MinRisk (C ^{+}), g=0  0.40  0.50  0.34  0.33 
MinRisk (C ^{+}), g=0.5  0.86  0.88  0.83  0.85 
MinRisk (C ^{+}), g=1  0.90  0.86  0.93  0.96 
MergeAlign  0.93  0.74  0.85  0.86 
SCoffee  0.59  0.46  0.76  0.75 
We can see also that in general the optimal value of g for the MinRisk method is higher for the Ref 1b dataset reflecting the fact that these sequences are less diverged, and hence likely to contain fewer indels. However, as with the simulated data, a value of g=0.5 gives results that are close to optimal in all scenarios with the BAliBASE datasets.
Results: approximate sampling on larger OXBench alignments
Using the OXBench datasets, we can examine how the above conclusions scale to alignments with larger numbers of sequences. As discussed by BuckaLassen et al. [95], the number of intersections between sampled alignments may be expected to decrease as the number of sequences is increased, due to the increased size of the state space. Similarly, since the number of possible columns increases exponentially with the number of sequences, it might be expected that the marginal probabilities of each column would decrease as the number of sequences is increased, thereby making the minimumrisk alignment less reliable.
Results on OXBench datasets
Number of sequences  15  33  60  122 

Benchmark alignment length  144  150  152  157 
Mean eq. class size  15.2  11.8  12.4  11.1 
Average marginal  0.19  0.21  0.25  0.23 
MinRisk rank, g=0  0.67  0.85  0.84  0.92 
MinRisk rank, g=0.5  0.85  0.95  0.69  0.74 
# columns in DAG  20288  26782  26221  30305 
Time to read alignments (s)  0.5  0.8  1.2  2.1 
Total runtime (s)  0.9  1.3  1.9  3.0 
Since the alignments are of length around 150, and the DAGs contain in the region of 30,000 unique columns, 2000 samples is approximately 10 observations per column. While this appears to be sufficient for estimating the minimumrisk alignment, more samples will be needed in order to accurately estimate the probabilities of the less likely alignments, since these tend to converge more slowly (cf. Figures 7 and 8).
Overall the rank scores are of comparable magnitude to those observed with the BAliBASE datasets. Moreover, the performance does not appear to degrade as the number of sequences is increased, although the optimal value of g does switch from 0.5 to 0 as the number of sequences is increased to 60 and 122. This is likely due to the fact that the benchmark alignment increases in length as the number of sequences is increased, and a lower value of g favours longer alignments.
Computational considerations
While the runtime does increase with the number of sequences, a breakdown of the contributions to these timings shows that the majority of the time is spent reading in the alignments, which scales linearly with the number of alignments multiplied by the number of sequences (cf. Additional file 1 : Figure S9). As discussed earlier, the minimumrisk algorithm scales linearly with the number of columns in the DAG, but this step contributes a very small proportion of the total runtime in the examples shown in Table 7. On our test systems the overall time taken to process and summarise 2000 alignments is only 3 seconds for the 122sequence dataset (see Table 7), and around 10 seconds for 10,000 alignments (data not shown). For a 20sequence dataset, analysing 500,000 alignments takes 150 seconds (see Additional file 1 : Figure S9). Memory usage is also generally low, requiring less than 2Gb in all the cases we have tested, even for 500,000 alignments.
In all cases we have examined, the time taken to actually generate the alignment samples is significantly larger than the time required to analyse the samples. As such, large gains in efficiency can be obtained by generating one set of alignment samples and carrying out multiple downstream analyses on this same set, compared to carrying out a full joint sampling analysis.
Effect of alignment accuracy on tree estimation
As discussed in the introduction, a number of studies have highlighted how biases in alignments may lead to misleading conclusions in the context of downstream tree inference. As such, any methodology that has the potential to improve alignment accuracy, particularly in the presence of high uncertainty, has the potential to improve subsequent phylogenetic inference. Here we will provide a brief example to reiterate this point.
For each of the simulated datasets discussed earlier, we performed tree inference using the program DNAML from version 3.69 of the PHYLIP package [122], using alignments generated by four commonly used programs, as well as the summary alignments generated using the minimumrisk procedure presented here. DNAML was run with the default settings in each case, and the distance to the known true tree was computed using the RobinsonFoulds distance, equal to the number of bipartitions that differ from the true tree, with maximum value of 2(n−3), where n is the number of leaves in the tree [123].
Results for tree inference on alignments generated using different methods, on the simulated datasets, as shown in Figure 15
Low  Medium  High  

AMA  RF  AMA  RF  AMA  RF  
MinRisk (C ^{+}), g=0  0.84  0.12  0.67  0.12  0.59  0.44  
MinRisk (C), g=0  0.83  0.20  0.64  0.28  0.53  0.44  
MAFFT  0.81  0.16  0.60  0.44  0.50  0.64  
MUSCLE  0.80  0.36  0.61  0.48  0.49  0.48  
TCoffee  0.67  1.00  0.52  0.92  0.42  1.24  
CLUSTALW2  0.62  1.04  0.44  1.32  0.35  1.36 
Predictive power of column marginals
Accuracy of marginal probabilities in predicting column presence/absence, as measured by the area under a ROC curve (AUC), including a comparison to results generated using the program GUIDANCE [ 76 ] (indicated by the p _{ G } row in the table)
Comparison to results generated by the widelyused program GUIDANCE [76] indicate that column marginals are typically a more reliable predictor of column presence/absence. However, it is important to note that the predictive power of these column marginals is dependent on the quality of the alignments used to construct the DAG.
Propagating alignment uncertainty into downstream inference
So far we have examined how the DAG facilitates the efficient generation of accurate summary alignments, which can then be used for subsequent analyses. However, for many types of analyses it may be advantageous to jointly sample alignments and other parameters of interest, such as trees [56,57], or sequence annotations [65], in order to account for the interdependence of these different quantities. Since joint sampling approaches are typically computationally intensive, it is also desirable to explore alternative ways in which alignment uncertainty can be incorporated into downstream inference in cases where joint analysis is not feasible [29,124].
Sequential approach
One way of accomplishing this is to carry out the downstream analyses separately on each of the sampled alignments, averaging or summarising the results as appropriate. This type of sequential approach has been used to assess the sensitivity of phylogenetic inference to the starting alignment [26,29,33], as well as examining the effect of alignment uncertainty on estimates of positive selection [36] and RNA secondary structure prediction [125].
However, as discussed earlier, a set of alignment samples will typically contain only a small portion of the total probability mass, even for pairwise alignments with relatively low uncertainty (cf. Additional file 1 : Figure S3). Hence, the uncertainty quantified in the individual samples will be a significant underestimate of the true alignment uncertainty.
Moreover, since the relative frequencies of whole alignments are a very poor estimator of posterior probabilities, simply carrying out an independent analysis on each sampled alignment and then averaging is likely to yield unreliable results. Reweighting procedures such as those discussed by Blackburne and Whelan [36] are only feasible when the posterior probability of each alignment can be computed exactly, which is not the case for many models of interest.
DAGbased approach
In order to address these issues, we can make use of the alignment DAG, making use of intersections between alignments to increase the effective sample size.
Due to the acyclic structure of the graph, it is possible to adapt many standard algorithms, such as forwardbackward algorithms for HMMs, to operate on the DAG structure rather than an individual alignment. This allows for downstream inference to be averaged over a very large number of alignments, weighted according to a more reliable estimate of the posterior probability for each alignment, rather than analysing only a small collection of individual samples.
Example application: marginal probabilities for topologies
As an illustration of the utility of this approach, we consider here a 4sequence example, for which there are three possible unrooted topologies relating the sequences. The specific example we consider consists of three human globin sequences, αhaemoglobin (HbA), myoglobin (Mb), and cytoglobin (Cygb), as well as a plant leghaemoglobin (LegHb) (datasets can be found in Additional file 2). Previous studies have shown significant uncertainty as to the phylogenetic relationship between these different types of globins [62], hence this represents a good test case to analyse the effect of alignment uncertainty on topology inference. Here we restrict our analysis to four sequences for the purposes of simplifying the example.
where indicates that tree Υ conforms to topology τ. These marginal posteriors can then be compared to the topology posterior computed on each alignment individually, replacing \(\mathcal {D}(\mathcal {A})\) with A in Equation (26) above.
The bottom panel of Figure 16 shows posterior probabilities computed using Equation (26), indicating significant variability depending on which alignment is used. While some alignments result in a posterior probability of more than 0.9 for the most favourable topology, others result in a probability of less than 0.2 for this topology. Simply taking the mean posterior over all the individual alignments in this case results in a posterior probability of only 0.56 for the most favourable topology. However, combining all the alignment samples into the DAG leads to a posterior probability of 0.94. This illustrates the fact that combining the alignments into a DAG may result in additional information being extracted from the same set of alignments, due to the increased effective sample size arising from intersections in the DAG.
Since the same DAG is used to compute the likelihood for all trees in the set , the majority of the runtime for this procedure is not spent reading in the alignments from disk (as it was for the minimumrisk summary procedure). As such, the runtime scales linearly with the number of columns in the DAG, as expected (see Additional file 1 : Figure S10).
Conclusions
The approaches illustrated here provide a general framework for dealing with alignment uncertainty in a statistically meaningful fashion. Encoding a set of sampled alignments in a DAG structure allows for more accurate estimation of posterior probabilities based on column or pair marginals. Due to interchanges and crossovers in the DAG, the number of alignments encoded in the graph is typically many orders of magnitude greater than the number of samples used to generate the DAG, such that the effective sample size is greatly increased by this representation.
Since the graph is acyclic, efficient algorithms can be developed for summation over this very large number of alignments, each weighted according to its probability. As a specific example, we have considered algorithms for generating summary alignments that minimise the expected value of various types of loss functions, observing that this type of algorithm is generally very successful at minimising the loss on a set of test cases.
This approach provides a way to conduct many types of sequence analysis on the very large set of alignments encoded in the DAG structure, allowing for alignment uncertainty to be propagated into downstream inference in cases where computationally expensive joint sampling procedures are infeasible. In addition to the tree inference example illustrated here, we are currently working on adapting several other common algorithms to the alignment DAG structure.
Combining the output of other alignment programs
The approaches detailed here are in theory applicable to a set of alignments generated by any type of method, although the quality of the probability estimates generated by the DAG will depend on the quality of the underlying model used to generate the alignments. Although this type of method can be used to combine the output of several different alignment programs, in a similar fashion to the MCoffee procedure [120], such an approach does not have a probabilistic interpretation, and will depend heavily on the choice of programs used to generate the input.
We have observed that this type of procedure usually yields summary alignments that are similar in accuracy to the program that typically generates the most accurate alignments (data not shown); however, since the most accurate alignment method is usually known from the outset, based on benchmarking results, there is not much to be gained by employing such a procedure. Moreover, the reliability of such an approach as a heuristic will depend strongly on the degree of similarity between the different alignment programs, hence we would recommend against using alignment DAGs as a way of combining the output of nonprobabilistic alignment programs.
Alignment DAGs as generators of alignment samples
One other obvious application of the alignment DAG is as a way of generating additional alignment samples, which can be sampled by using a DAGbased version of the traditional stochastic traceback algorithm (cf. Additional file 1 : Section S6).
One potential use for these alignment samples could be as a source of proposals within an MCMC alignment sampler, allowing for a new state to be efficiently generated, along with a known proposal probability for use in a MetropolisHastings accept/reject step. Although this type of approach does not allow for the exploration of previously unobserved columns, it could be useful as way to improve mixing, particularly once the key regions of the space have already been explored.
Software availability
Java software implementing the minimumrisk alignment summary algorithm and computation of marginal topology probabilities is available for download at http://statalign.github.io/WeaveAlign. A platformindependent jar archive containing version 1.2.1 of WeaveAlign is included in Additional file 2, along with datasets and example results.
Declarations
Acknowledgements
This work was supported by grants from the EPSRC (JLH) and BBSRC (ÁN). The authors thank Ian Holmes and Benjamin Redelings for productive discussions.
Authors’ Affiliations
References
 Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005; 15(8):1034–50.PubMedPubMed CentralView ArticleGoogle Scholar
 Altschuh D, Vernet T, Berti P, Moras D, Nagai K. Coordinated amino acid changes in homologous protein families. Protein Eng. 1988; 2(3):193–9.PubMedView ArticleGoogle Scholar
 Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS. Threedimensional structures of membrane proteins from genomic sequencing. Cell. 2012; 149(7):1607–21.PubMedPubMed CentralView ArticleGoogle Scholar
 Knudsen B, Hein J. RNA secondary structure prediction using stochastic contextfree grammars and evolutionary history. Bioinformatics. 1999; 15(6):446–54.PubMedView ArticleGoogle Scholar
 Höhl M, Ragan MA. Is multiplesequence alignment required for accurate inference of phylogeny?Syst Biol. 2007; 56(2):206–21.PubMedView ArticleGoogle Scholar
 Blundell TL, Sibanda B L, Sternberg M J E Thornton J M. Knowledgebased prediction of protein structures and the design of novel molecules. Nature. 1987; 326(6111):347–52.PubMedView ArticleGoogle Scholar
 Sali A, Blundell T. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993; 234(3):779–815.PubMedView ArticleGoogle Scholar
 Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443–53.PubMedView ArticleGoogle Scholar
 Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162(3):705–8.PubMedView ArticleGoogle Scholar
 Edgar RC. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004; 5:113.PubMedPubMed CentralView ArticleGoogle Scholar
 Lupyan D, LeoMacias A, Ortiz AR. A new progressiveiterative algorithm for multiple structure alignment. Bioinformatics. 2005; 21(15):3255–63.PubMedView ArticleGoogle Scholar
 Notredame C, Higgins DG. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 1996; 24(8):1515–24.PubMedPubMed CentralView ArticleGoogle Scholar
 Kim J, Pramanik S, Chung MJ. Multiple sequence alignment using simulated annealing. Comput Appl Biosci CABIOS. 1994; 10(4):419–26.PubMedGoogle Scholar
 Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987; 25(4):351–60.PubMedView ArticleGoogle Scholar
 Löytynoja A, Goldman N. Phylogenyaware gap placement prevents errors in sequence alignment and evolutionary analysis. Science. 2008; 320(5883):1632–5.PubMedView ArticleGoogle Scholar
 Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991; 33(2):114–24.PubMedView ArticleGoogle Scholar
 Thorne JL, Kishino H, Felsenstein J. Inching toward reality: An improved likelihood model of sequence evolution. J Mol Evol. 1992; 34:3–16.PubMedView ArticleGoogle Scholar
 Hein J, Wiuf C, Knudsen B, Møller MB, Wibling G. Statistical alignment: computational properties, homology testing and goodnessoffit. J Mol Biol. 2000; 302:265–79.PubMedView ArticleGoogle Scholar
 Miklós I, Lunter GA, Holmes I. A “long indel"? model for evolutionary sequence alignment. Mol Biol Evol. 2004; 21(3):529–40.PubMedView ArticleGoogle Scholar
 Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, et al. Fast statistical alignment. PLoS Comput Biol. 2009; 5(5):e1000392.PubMedPubMed CentralView ArticleGoogle Scholar
 Godzik A. The structural alignment between two proteins: is there a unique answer?Protein Sci. 1996; 5(7):1325–38.PubMedPubMed CentralView ArticleGoogle Scholar
 Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: Assessing and improving genomic sequence alignment. Genome Res. 2008; 18(2):298–309.PubMedPubMed CentralView ArticleGoogle Scholar
 Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991; 8(3):378–85.PubMedGoogle Scholar
 Morrison DA, Ellis JT. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol. 1997; 14(4):428–41.PubMedView ArticleGoogle Scholar
 Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006; 55(2):314–28.PubMedView ArticleGoogle Scholar
 Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T. Rapid and accurate largescale coestimation of sequence alignments and phylogenetic trees. Science. 2009; 324(5934):1561–4.PubMedView ArticleGoogle Scholar
 Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 2010; 11(4):1–9.View ArticleGoogle Scholar
 Wang LS, LeebensMack J, Wall PK, Beckmann K, de Pamphilis CW, Warnow T. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM Trans Comput Biol Bioinformatics. 2011; 8(4):1108–19.View ArticleGoogle Scholar
 Liu K, Warnow TJ, Holder MT, Nelesen SM, Yu J, Stamatakis AP, Linder CR. SATéII: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol. 2012; 61:90–106.PubMedView ArticleGoogle Scholar
 Simmons MP, Müller KF, Norton AP. Alignment of, and phylogenetic inference from, random sequences: The susceptibility of alternative alignment methods to creating artifactual resolution and support. Mol Phylogenet Evol. 2010; 57(3):1004–16.PubMedView ArticleGoogle Scholar
 Levy Karin E, Susko E, Pupko T. Alignment errors strongly impact likelihoodbased tests for comparing topologies. Mol Biol Evol. 2014; 31(11):3057–67.PubMedView ArticleGoogle Scholar
 Thorne JL, Kishino H. Freeing phylogenies from artifacts of alignment. Mol Biol Evol. 1992; 9(6):1148–62.PubMedGoogle Scholar
 Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008; 319(5862):473–6.PubMedView ArticleGoogle Scholar
 Dwivedi B, Gadagkar S. Phylogenetic inference under varying proportions of indelinduced alignment gaps. BMC Evol Biol. 2009; 9:211.PubMedPubMed CentralView ArticleGoogle Scholar
 CapellaGutiérrez S, Gabaldón T. Measuring guidetree dependency of inferred gaps in progressive aligners. Bioinformatics. 2013; 29(8):1011–7.PubMedPubMed CentralView ArticleGoogle Scholar
 Blackburne BP, Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol. 2013; 30(3):642–53.PubMedView ArticleGoogle Scholar
 Tramontano A, Leplae R, Morea V. Analysis and assessment of comparative modeling predictions in CASP4. Proteins: Struct Funct Bioinformatics. 2001; 45(S5):22–38.View ArticleGoogle Scholar
 Schwarzenbacher R, Godzik A, Grzechnik SK, Jaroszewski L. The importance of alignment accuracy for molecular replacement. Acta Crystallographica Section D. 2004; 60(7):1229–36.View ArticleGoogle Scholar
 Chivian D, Baker D. Homology modeling using parametric alignment ensemble generation with consensus and energybased model selection. Nucleic Acids Res. 2006; 34(17):e112.PubMedPubMed CentralView ArticleGoogle Scholar
 Dickson RJ, Wahl LM, Fernandes AD, Gloor GB. Identifying and seeing beyond multiple sequence alignment errors using intramolecular protein covariation. PLoS ONE. 2010; 5(6):e11082.PubMedPubMed CentralView ArticleGoogle Scholar
 Dickson RJ, Gloor GB. Protein sequence alignment analysis by local covariation: Coevolution statistics detect benchmark alignment errors. PLoS ONE. 2012; 7(6):e37645.PubMedPubMed CentralView ArticleGoogle Scholar
 Gardner PP, Wilm A, Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005; 33(8):2433–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Fletcher W, Yang Z. The effect of insertions, deletions, and alignment errors on the branchsite test of positive selection. Mol Biol Evol. 2010; 27(10):2257–67.PubMedView ArticleGoogle Scholar
 Privman E, Penn O, Pupko T. Improving the performance of positive selection inference by filtering unreliable alignment regions. Mol Biol Evol. 2012; 29:1–5.PubMedView ArticleGoogle Scholar
 Jordan G, Goldman N. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol. 2012; 29(4):1125–39.PubMedView ArticleGoogle Scholar
 Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000; 17(4):540–52.PubMedView ArticleGoogle Scholar
 Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007; 56(4):564–77.PubMedView ArticleGoogle Scholar
 Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS ONE. 2012; 7:e30288.PubMedPubMed CentralView ArticleGoogle Scholar
 Gatesy J, DeSalle R, Wheeler W. Alignmentambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenet Evol. 1993; 2(2):152–7.PubMedView ArticleGoogle Scholar
 Lee MSY. Unalignable sequences and molecular evolution. Trends Ecol Evol. 2001; 16(12):681–5.View ArticleGoogle Scholar
 Ajawatanawong P, Atkinson GC, WatsonHaigh NS, MacKenzie B, Baldauf SL. SeqFIRE: A web application for automated extraction of indel regions and conserved blocks from protein multiple sequence alignments. Nucleic Acids Res. 2012; 40(W1):W340–7.PubMedPubMed CentralView ArticleGoogle Scholar
 Lunter G. Probabilistic wholegenome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007; 23(13):289–96.View ArticleGoogle Scholar
 Miklós I, Novák A, Dombai B, Hein J. How reliably can we predict the reliability of protein structure predictions?BMC Bioinformatics. 2008; 9:137.PubMedPubMed CentralView ArticleGoogle Scholar
 Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: Current challenges and future perspectives. PLoS ONE. 2011; 6(3):e18093.PubMedPubMed CentralView ArticleGoogle Scholar
 Metzler D, Fleissner R, Wakolbinger A, von Haeseler A. Assessing variability by joint sampling of alignments and mutation rates. J Mol Evol. 2001; 53(6):660–9.PubMedView ArticleGoogle Scholar
 Novák A, Miklós I, Lyngsø R, Hein J. StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics. 2008; 24(20):2403–4.PubMedView ArticleGoogle Scholar
 Suchard MA, Redelings BD. BAliPhy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006; 22(16):2047–8.PubMedView ArticleGoogle Scholar
 Redelings BD, Suchard MA. Joint Bayesian estimation of alignment and phylogeny. Syst Biol. 2005; 54(3):401–18.PubMedView ArticleGoogle Scholar
 Dryden IL, Hirst JD, Melville JL. Statistical analysis of unlabeled point sets: Comparing molecules in chemoinformatics. Biometrics. 2007; 63:237–51.PubMedView ArticleGoogle Scholar
 Green PJ, Mardia KV, Nyirongo VB, Ruffieux Y. Bayesian modelling for matching and alignment of biomolecules. Oxford: Oxford University Press. The Oxford Handbook of Applied Bayesian Analysis; 2010, pp. 27–50.Google Scholar
 Ruffieux Y, Green PJ. Alignment of multiple configurations using hierarchical models. J Comput Graphical Stat. 2009; 18(3):756–73.View ArticleGoogle Scholar
 Herman J L, Challis CJ, Novák A, Hein J, Schmidler SC. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol. 2014; 31(9):2251–66.PubMedPubMed CentralView ArticleGoogle Scholar
 Sinha S, He X. MORPH: Probabilistic alignment combined with hidden Markov models of cisregulatory modules. PLoS Comput Biol. 2007; 3(11):e216.PubMedPubMed CentralView ArticleGoogle Scholar
 Satija R, Pachter L, Hein J. Combining statistical alignment and phylogenetic footprinting to detect regulatory elements. Bioinformatics. 2008; 24(10):1236–42.PubMedView ArticleGoogle Scholar
 Satija R, Novák A, Miklós I, Lyngsø R, Hein J. BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol Biol. 2009; 9:217.PubMedPubMed CentralView ArticleGoogle Scholar
 Hamada M, Sato K, Kiryu H, Mituyama T, Asai K. CentroidAlign: fast and accurate aligner for structured RNAs by maximizing expected sumofpairs score. Bioinformatics. 2009; 25(24):3236–43.PubMedView ArticleGoogle Scholar
 CapellaGutiérrez S SillaMartínez JM. trimAl: a tool for automated alignment trimming in largescale phylogenetic analyses. Bioinformatics. 2009; 25(15):1972–3.PubMedPubMed CentralView ArticleGoogle Scholar
 Ahola V, Aittokallio T, Vihinen M, Uusipaikka E. Modelbased prediction of sequence alignment quality. Bioinformatics. 2008; 24(19):2165–71.PubMedView ArticleGoogle Scholar
 DeBlasio D, Wheeler T, Kececioglu J. Estimating the accuracy of multiple alignments and its use in parameter advising In: Chor B, editor. Research in Computational Molecular Biology, Volume 7262 of Lecture Notes in Computer Science. Berlin Heidelberg: Springer: 2012. p. 45–59.Google Scholar
 Misof B, Misof K. A Monte Carlo approach successfully identifies randomness in multiple sequence alignments: A more objective means of data exclusion. Syst Biol. 2009; 58(1):21–34.PubMedView ArticleGoogle Scholar
 Dress A, Flamm C, Fritzsch G, Grunewald S, Kruspe M, Prohaska S, Stadler P. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008; 3:7.PubMedPubMed CentralView ArticleGoogle Scholar
 Landan G, Graur D. Heads or Tails: A simple reliability check for multiple sequence alignments. Mol Biol Evol. 2007; 24(6):1380–3.PubMedView ArticleGoogle Scholar
 Hall B G. How well does the HoT score reflect sequence alignment accuracy?Mol Biol Evol. 2008; 25(8):1576–80.PubMedView ArticleGoogle Scholar
 Wise MJ. Not so HoT? Heads or tails is not able to reliably compare multiple sequence alignments. Cladistics. 2010; 26(4):438–43.View ArticleGoogle Scholar
 Penn O, Privman E, Landan G, Graur D, Pupko T. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol. 2010; 27(8):1759–67.PubMedPubMed CentralView ArticleGoogle Scholar
 Penn O, Privman E, Ashkenazy H, Landan G, Graur D, Pupko T. GUIDANCE a web server for assessing alignment confidence scores. Nucleic Acids Res. 2010; 38(suppl 2):W23–8.PubMedPubMed CentralView ArticleGoogle Scholar
 Löytynoja A, Milinkovitch M C. SOAP: cleaning multiple alignments from unstable blocks. Bioinformatics. 2001; 17(6):573–4.PubMedView ArticleGoogle Scholar
 Wheeler WC. Sequence alignment, parameter sensitivity, and the phylogenetic analysis of molecular data. Syst Biol. 1995; 44(3):321–31.View ArticleGoogle Scholar
 Collingridge P, Kelly S. MergeAlign: Improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments. BMC Bioinformatics. 2012; 13:117.PubMedPubMed CentralView ArticleGoogle Scholar
 Herman JL, Szabó A, Miklós I, Hein J. Approximate posterior sampling of multiple sequence alignments by iterative perturbation of substitution matrices. 2015. arXiv: arXiv:1501.04986.Google Scholar
 Waterman MS, Byers TH. A dynamic programming algorithm to find all solutions in a neighborhood of the optimum. Math Biosci. 1985; 77(12):179–88.View ArticleGoogle Scholar
 Zuker M. Suboptimal sequence alignment in molecular biology: Alignment with error analysis. J Mol Biol. 1991; 221(2):403–20.PubMedView ArticleGoogle Scholar
 Vingron M. Nearoptimal sequence alignment. Curr Opinion Struct Biol. 1996; 6(3):346–52.View ArticleGoogle Scholar
 Vingron M, Argos P. Determination of reliable regions in protein sequence alignments. Protein Eng. 1990; 3(7):565–9.PubMedView ArticleGoogle Scholar
 Mevissen HT, Vingron M. Quantifying the local reliability of a sequence alignment. Protein Eng. 1996; 9(2):127–32.PubMedView ArticleGoogle Scholar
 Landan G, Graur D. Local reliability measures from sets of cooptimal multiple sequence alignments. In: Pacific Symposium on Biocomputing., Volume 13. Kohala Coast, HI, USA: 2008. p. 15–24.Google Scholar
 Karlin S, Altschul SF. Applications and statistics for multiple highscoring segments in molecular sequences. Proc Nat Acad Sci. 1993; 90(12):5873–7.PubMedPubMed CentralView ArticleGoogle Scholar
 Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.View ArticleGoogle Scholar
 Zhu J, Liu JS, Lawrence CE. Bayesian adaptive sequence alignment algorithms. Bioinformatics. 1998; 14:25–39.PubMedView ArticleGoogle Scholar
 Webb BJM, Liu JS, Lawrence CE. BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res. 2002; 30(5):1268–77.PubMedPubMed CentralView ArticleGoogle Scholar
 Churchill GA. Monte Carlo sequence alignment. In: Proceedings of the First Annual International Conference on Computational Molecular Biology. Santa Fe, NM, USA: ACM: 1997. p. 93–97.Google Scholar
 Metzler D. Statistical alignment based on fragment insertion and deletion models. Bioinformatics. 2003; 19(4):490–99.PubMedView ArticleGoogle Scholar
 Lunter GA, Miklós I, Drummond A, Jensen JL, Hein J. Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics. 2005; 6:83.PubMedPubMed CentralView ArticleGoogle Scholar
 Green PJ, Mardia KV. Bayesian alignment using hierarchical models, with applications in protein bioinformatics. Biometrika. 2006; 93(2):235–54.View ArticleGoogle Scholar
 BuckaLassen K, Caprani O, Hein J. Combining many multiple alignments in one improved alignment. Bioinformatics. 1999; 15(2):122–30.PubMedView ArticleGoogle Scholar
 Schwikowski B, Vingron M. Weighted sequence graphs: boosting iterated dynamic programming using locally suboptimal solutions. Discrete Appl Math. 2003; 127:95–117.View ArticleGoogle Scholar
 Szabó A, Novák A, Miklós I, Hein J. Reticular alignment: A progressive cornercutting method for multiple sequence alignment. BMC Bioinformatics. 2010; 11:570.PubMedPubMed CentralView ArticleGoogle Scholar
 Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol. 2012; 19(5):532–49.PubMedPubMed CentralView ArticleGoogle Scholar
 Redelings BD, Suchard MA. Robust inferences from ambiguous alignments, Sequence, Alignment: Methods, Models, Concepts and Strategies. Oakland, CA: University of California Press; 2011, pp. 209–271.Google Scholar
 Thorne JL, Churchill GA. Estimation and reliability of molecular sequence alignments. Biometrics. 1995; 51:100–13.PubMedView ArticleGoogle Scholar
 Yu L, Smith T. Positional statistical significance in sequence alignment. J Comput Biol. 1999; 6(2):253–9.PubMedView ArticleGoogle Scholar
 Larget B. The estimation of tree posterior probabilities using conditional clade probability distributions. Syst Biol. 2013; 62(4):501–11.PubMedPubMed CentralView ArticleGoogle Scholar
 Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. Atlas Protein Seq Struct. 1978; 5(suppl 3):345–51.Google Scholar
 Carvalho LE, Lawrence CE. Centroid estimation in discrete highdimensional spaces with applications in biology. Proc Nat Acad Sci. 2008; 105(9):3209–14.PubMedPubMed CentralView ArticleGoogle Scholar
 Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006; 22(22):2715–21.PubMedView ArticleGoogle Scholar
 Hamada M, Kiryu H, Iwasaki W, Asai K. Generalized centroid estimators in bioinformatics. PLoS ONE. 2011; 6(2):e16450.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994; 1(4):337–48.PubMedView ArticleGoogle Scholar
 Miyazawa S. A reliable sequence alignment method based on probabilities of residue correspondences. Protein Eng. 1995; 8(10):999–1009.PubMedView ArticleGoogle Scholar
 Holmes I, Durbin R. Dynamic programming alignment accuracy. J Comput Biol. 1998; 5(3):493–504.PubMedView ArticleGoogle Scholar
 Wolfsheimer S, Hartmann A, Rabus R, Nuel G. Computing posterior probabilities for scorebased alignments using ppALIGN. Stat Appl Genet Mol Biol. 2012; 11(4). Article 1.Google Scholar
 Schwartz AS, Pachter L. Multiple alignment by sequence annealing. Bioinformatics. 2007; 23(2):e24–9.PubMedView ArticleGoogle Scholar
 Schwartz AS. Posterior decoding methods for optimization and accuracy control of multiple alignments. PhD thesis. Berkeley: University of California; 2007.Google Scholar
 Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res. 2010; 38(15):4917–28.PubMedPubMed CentralView ArticleGoogle Scholar
 Notredame C, Higgins DG, Heringa J. TCoffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000; 302:205–17.PubMedView ArticleGoogle Scholar
 Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistencybased multiple sequence alignment. Genome Res. 2005; 15(2):330–40.PubMedPubMed CentralView ArticleGoogle Scholar
 Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010; 26(16):1958–64.PubMedView ArticleGoogle Scholar
 Cartwright RA. DNA assembly with gaps (DAWG): Simulating sequence evolution. Bioinformatics. 2005; 21(Suppl 3):31–8.View ArticleGoogle Scholar
 Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins: Struct Funct Bioinformatics. 2005; 61:127–36.View ArticleGoogle Scholar
 Raghava G, Searle S, Audley P, Barber J, Barton G. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics. 2003; 4:47.PubMedPubMed CentralView ArticleGoogle Scholar
 Wallace IM, O’Sullivan O, Higgins DG, Notredame C. MCoffee: combining multiple sequence alignment methods with TCoffee. Nucleic Acids Res. 2006; 34(6):1692–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Schwartz AS, Myers EW, Pachter L. Alignment metric accuracy. arXiv:qbio/0510052. 2005.Google Scholar
 Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981; 17(6):368–376.PubMedView ArticleGoogle Scholar
 Robinson D, Foulds L. Comparison of phylogenetic trees. Math Biosci. 1981; 53(12):131–47.View ArticleGoogle Scholar
 Lunter G, Drummond AJ, Miklós I, Hein J. Statistical Alignment Recent progress, new applications, and challenges. In: Statistical Methods in, Molecular Evolution, Statistics for Biology and Health. New York: Springer: 2005. p. 375–405.Google Scholar
 Arunapuram P, Edvardsson I, Golden M, Anderson JWJ, Novák A, Sükösd Z, et al. StatAlign 2.0: combining statistical alignment with RNA secondary structure prediction. Bioinformatics. 2013; 29(5):654–5.PubMedView ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.