Parameterizing sequence alignment with an explicit evolutionary model

Rivas, Elena; Eddy, Sean R.

doi:10.1186/s12859-015-0832-5

Research Article
Open access
Published: 10 December 2015

Parameterizing sequence alignment with an explicit evolutionary model

Elena Rivas¹ &
Sean R. Eddy^1,2,3,4

BMC Bioinformatics volume 16, Article number: 406 (2015) Cite this article

4166 Accesses
18 Citations
12 Altmetric
Metrics details

Abstract

Background

Inference of sequence homology is inherently an evolutionary question, dependent upon evolutionary divergence. However, the insertion and deletion penalties in the most widely used methods for inferring homology by sequence alignment, including BLAST and profile hidden Markov models (profile HMMs), are not based on any explicitly time-dependent evolutionary model. Using one fixed score system (BLOSUM62 with some gap open/extend costs, for example) corresponds to making an unrealistic assumption that all sequence relationships have diverged by the same time. Adoption of explicit time-dependent evolutionary models for scoring insertions and deletions in sequence alignments has been hindered by algorithmic complexity and technical difficulty.

Results

We identify and implement several probabilistic evolutionary models compatible with the affine-cost insertion/deletion model used in standard pairwise sequence alignment. Assuming an affine gap cost imposes important restrictions on the realism of the evolutionary models compatible with it, as single insertion events with geometrically distributed lengths do not result in geometrically distributed insert lengths at finite times. Nevertheless, we identify one evolutionary model compatible with symmetric pair HMMs that are the basis for Smith-Waterman pairwise alignment, and two evolutionary models compatible with standard profile-based alignment.

We test different aspects of the performance of these “optimized branch length” models, including alignment accuracy and homology coverage (discrimination of residues in a homologous region from nonhomologous flanking residues). We test on benchmarks of both global homologies (full length sequence homologs) and local homologies (homologous subsequences embedded in nonhomologous sequence).

Conclusions

Contrary to our expectations, we find that for global homologies a single long branch parameterization suffices both for distant and close homologous relationships. In contrast, we do see an advantage in using explicit evolutionary models for local homologies. Optimal branch parameterization reduces a known artifact called “homologous overextension”, in which local alignments erroneously extend through flanking nonhomologous residues.

Background

Despite the apparent maturity of both the sequence similarity searching and phylogenetic inference fields, and despite the fact that inference of sequence homology via sequence alignment is itself obviously an evolutionary question, standard sequence comparison methods for homology search such as BLAST [1] or profile HMMs [2] still do not depend on an explicitly divergence-dependent evolutionary model for their parameterization. Instead standard sequence alignment methods use parameterizations that assume fixed evolutionary divergence times either implicitly or explicitly. The basis for deriving standard log-odds substitution scores from a rate-dependent continuous time Markov model of the substitution process is straightforward and well known [3]. The difficulty has been in reconciling the standard affine gap penalties used in fast, efficient sequence alignment and database search methods with a satisfactory continuous time evolutionary model of the insertion and deletion process.

Imposing an evolutionary model in standard affine pair and profile HMMs would allow us to optimize the parameterization (branch length) to the apparent relatedness of each comparison, instead of assuming that all sequences are at the same evolutionary distance from each other. We expect to see a gain from such parameter optimization. For example, one advantage of optimizing score systems for evolutionary distance has been shown when choosing substitution matrices for scoring ungapped alignments of different lengths [4]. Using a divergent scoring system, a long divergent homologous region may be detected by summing many weak positive scores, whereas a short conserved homologous region may not accumulate enough total positive score to rise to significance. On the other hand, using a conserved scoring system, a short conserved homologous region may be detected because conserved residues received higher positive score per position, whereas a long divergent homologous region may go unrecognized because mismatches are heavily penalized. An optimal scoring system can find the best compromise given the length of the compared sequences and their relatedness.

A large body of work has established the feasibility of evolutionary models of insertion and deletion processes. Several models have been proposed including the well-known TKF91 and TKF92 models [5, 6], some precursor models [7, 8], and other related models [9, 10]; tree HMMs [11–13]; models that treat gaps as an extra residue [14, 15]; pair HMMs implementing an approximate evolutionary model [16]; and other more complex but analytically intractable models [17, 18]. The desire to synchronize the evolutionary distance of a given substitution matrix to that of the indel parameters has been recognized, and sets of score parameters (including gap costs) at different discrete evolutionary distances have been proposed [19]. Nevertheless, standard sequence comparison methods such as BLAST [1] or SSEARCH [20] that implement a standard three-state (Match/Insert/Delete) affine gap cost recursion (with symmetric treatment of the Insert/Delete states) do not use explicit evolutionary models that would automatically set the parameters to a variable divergence time.

We are most interested in assessing the value of evolutionary models for profile HMMs. Profile HMMs can be understood as a generalization of sequence/sequence comparison methods to a sequence/profile comparison in which the profile compiles information about one sequence or a collection of homologous sequences. Groups of sequences can be efficiently aligned to the profile avoiding a costly all-to-all comparison. Profile HMM methods are used extensively for protein and DNA homology analysis [21–24]. Standard profile HMM packages (like other standard sequence similarity methods) do not explicitly utilize any evolutionary model [25, 26].

Profile HMMs are usually parameterized “long branched” to optimize for remote homology detection. Thus, the advantages of an optimal-branch model would be in situations where having a higher score per conserved position is advantageous. Such situations include: (1) to reduce alignment overextension artifacts in the case of homologous sequences embedded in nonhomologous sequence [27, 28]; (2) to improve the identification of short homologies (such as in metagenomic reads); (3) to improve alignment accuracy by better discriminating mutations from indels, avoiding the artifact of indel collapse, in which two independent insertions are aligned together as if they were homologous [29].

In this paper, first we investigate the constraints imposed by the affine-gap architecture in probabilistic evolutionary models. In addition to the TKF91 and TKF92 models [5, 6], we identify new affine evolutionary models that have richer parameterizations with more variables. Importantly (and beyond what the TKF models can do), we identify one evolutionary model compatible with standard pair HMMs with a symmetric treatment of insertions and deletions (as in the Smith-Waterman algorithm), and two evolutionary models compatible with profile HMMs.

We have implemented these affine evolutionary models (including TKF91 and TKF92) into a pair HMM-based probabilistic local alignment program called e2msa. The e2msa method provides a platform for the level comparison of different affine evolutionary models.

We use e2msa to test the effect of optimizing branch lengths on alignment accuracy and homology coverage (fraction of the true homology incorporated into the alignment). We produce a benchmark of global (full-length) homologies. To test for nonhomologous overextension artifacts, we produce a second alignment benchmark of local homologies embedded into nonhomologous sequences.

Results and discussion

Constraints on evolutionary models compatible with affine gap cost

The frequency of insertions and deletions (indels) in biological sequences deviates significantly from the geometric length distribution implicit in any affine gap cost model [30, 31]. However, affine gap cost has become standard because it is a good compromise between realism and computational efficiency. Here, we have tried to elucidate what kind of evolutionary model is compatible with well-established computationally efficient comparative models, such as pair HMMs [3] or profile HMMs [2, 3, 25, 32].

An evolutionary model starts by proposing a microscopic model that dictates the rules describing how inserted residues appear or disappear instantaneously, as well as how ancestral residues can mutate or disappear in one single event. The microscopic model’s repertoire of single events are described in terms of some constant parameters, the so-called rates. Solving the differential equations derived from the microscopic model results in a macroscopic model described in terms of time-dependent conditional probabilities (or log-odds scores derived from those) which become functions depending on the rates and a divergence time parameter t. The macroscopic model describes the ancestral/descendant sequence relationship after millions of years of evolution, while the microscopic model describes the same relationship only after very short times.

Throughout this paper, the word “insert” has a special meaning (as opposed to the terms insertion or inserted residues). “Insert” specifically stands for the collection of all inserted residues in between any two ancestral positions (at a given time), regardless of whether the flanking ancestral residues are alive or not at the time. This concept arises naturally in the framework of an evolutionary profile HMM where each position in the profile represents an ancestral residue, such that in between any two profile positions an arbitrary number of residues (an insert) can exist. In an evolutionary profile HMM, residues in a given insert have been generated using the same rate parameters (possibly at different times), and different inserts use different rates.

In a macroscopic affine model, the cost of an insert of length n is a + bn, consisting of the cost of opening (a) and the cost of extending the insert (b) as two independent parameters. When the gap open cost is set to zero, it is called a linear model. Affine models fit biological data better than linear models, by making it more costly to start an indel than to extend it. For instance, NCBIBLAST typically uses a default gap-open cost of -11, and a gap-extend cost of -1; in PHMMER [21], the gap-open probability is 0.03, while the gap-extend probability is 0.40.

The seminal TKF91 evolutionary model assumes a microscopic model in which inserted residues are all added (or removed) identically to each other and one at a time (with rate λ for insertions and μ for deletions). As a result, the macroscopic model for TKF91 is essentially a linear gap cost model¹. In addition, because TKF91 is meant for sequence/sequence comparisons, it assumes that when an ancestral residue is deleted it does not leave any trace behind. In a sequence/profile comparison scenario, the profile plays the role of the ancestor, and even if an ancestral residue might have died at some point, inserted residues associated to that profile position retain their own position-specific parameterization. Thus, when using position-specific scores in profile evolutionary models, it is not entirely unreasonable to maintain at all times a “memory” of all the consensus (ancestral) positions in the profile.

In an attempt to make affine evolutionary models, one might imagine making two modifications (in addition to the memory effect) to the microscopic model of TKF91. One modification is to allow residues to appear and disappear not one at a time but in groups according to a geometric distribution (with probability (1−v _I)(v _I)ⁿ⁻¹ and (1−v _D)(v _D)ⁿ⁻¹ for n appearing or disappearing residues respectively). The second modification is to allow the rates at which these single-event group insertions appear to be different whether they open a new insert (with rate λ) or just expand an existing insert (with rate λ _I), and similarly for the deletion of inserted residues, the model distinguishes whether the insert disappears completely (with rate μ) or it just shrinks (with rate μ _I). Because of the geometric nature of the single events allowed for insertions (addition and removal of groups of them), we refer to this model as the Geometric (GM) model.

Unfortunately, the GM model does not have an affine macroscopic solution. Allowing elementary events with geometric length distributions does not result in geometrically distributed insert lengths. In the Additional file 1, we present an explicit description of the GM model, as well as a proof that the macroscopic GM model cannot have a geometric form. The result comes from proposing a geometric distribution as an ansatz for solving the differential equations of the GM model, and showing that a solution cannot be reached.

In addition, in Fig. 1 we confirm by numerical integration the non-geometric nature of the macroscopic GM model with several examples (see Fig. 1 a). The macroscopic model is not geometric even when the open/extend rates for insertions and deletions are identical to each other (λ=λ _I and μ=μ _I) as shown in Fig. 1 b. Furthermore, even without geometrically distributed elementary events (v _I=v _D=0), assuming that the rate of an emerging/disappearing insert is different from that of an expanding/contracting insert (λ≠λ _I and μ≠μ _I) still falls outside affine gap cost models, as shown in Fig. 1 c.

It is only in the absence of both modifications (but maintaining the memory effect) that we obtain geometric distributions of insert lengths (Fig. 1 d). However, the resulting insert distribution is linear, not affine. This linear for insertions model can be made affine for the deletion of ancestral residues by assuming that a different rate is used depending on whether the deletion happens after an ancestral substitution, an ancestral deletion, or after an inserted residue. We name this model the affine ancestral and linear inserts (AALI) model. The AALI model can be solved analytically. Details of the differential equations and solutions of the AALI model are given in the Additional file 1. In Fig. 2, we provide as a summary a state representation of the macroscopic AALI model as a standard three-state pair HMM. The AALI model is not constrained by reversibility, which allows some richness of parameters. A particular tying of the AALI parameters produces a reversible model, which we refer to as the Linear Reversible (LR) model. Details of the LR model are given in the Additional file 1.

The AALI model (more specifically, the LR model) is comparable to the TKF91 model since both assume at the microscopic level that residues are added/deleted one at a time. The two models have an important difference: the AALI microscopic model, influenced by the fact that it aims at parameterizing position-specific models, maintains a memory throughout the evolutionary process of all ancestral positions even after their death. In the TKF91 model on the other hand, because it assumes a non-position-specific model, inserted residues after a deleted ancestral residue are collapsed with any other previous inserted residues associated to a different ancestral residue. This one microscopic difference results in some important macroscopic differences between the two models. The macroscopic TKF91 model also admits a three-state HMM representation [9, 33], as shown in Fig. 2. We present a comparison between the LR model and the TKF91 model in the Additional file 1.

Affine evolutionary models by fragments

The AALI model, though it is only linear for insertions, can be used as a starting point to generate affine variants using the “fragments” technique introduced with the TKF92 evolutionary model (a fragment-derivative of TKF91) [6]. From a macroscopic perspective, a linear insert state can be made affine by replacing the emitting self-looping state with another geometric state machine (adding a new geometric parameter r). The modified HMM can be converted back to the original state machine, but now with an affine transition probability (depending on the added geometric parameter) as described in Fig. 3 a. The concept “fragment” was coined with the introduction of the TKF92 model because the microscopic interpretation of this procedure requires assuming that all the residues created instantaneously by the added state machine have to be removed instantaneously as they were created without further subdivisions. The affine for insertions fragment-derivative of our AALI model is named the AFG model (see Fig. 3 b). Because we do not insist on reversibility, the AFG model has the freedom to use independent fragment parameters for the different states (Match, Delete or Insert).

Table 1 Evolutionary models compatible with pair and profile HMMs

Full size table

An evolutionary model (AFR) compatible with the Smith-Waterman algorithm

Sequence to sequence comparison methods based on the Smith-Waterman algorithm [34] such as BLAST or SSEARCH have a symmetric treatment of insertions and deletions. Finding an evolutionary parameterization of a Smith-Waterman-based comparative method requires having an affine gap cost and a reversible model.

There is a fragment derivative of the Linear Reversible (LR) model that is also reversible, thus compatible with the insertion/deletion symmetry required by the Smith-Waterman algorithm. We call this model the Affine Fragment Reversible (AFR) evolutionary model. To preserve the insertion/deletion symmetry, the AFR model must use the same fragment parameter for insertions and for deletions (r _I=r _D in Table 1). See Fig. 4 for details about the AFR model. The AFR evolutionary model is described in the Additional file 1.

In the TKF models, the joint transition probabilities M→I and M→D are different from each other, and the joint transition probabilities I→I and D→D are also different from each other, as can be seen by inspection of Figs. 2 and 3. Thus, the TKF models cannot be used to parameterize symmetric three-state pair HMMs.

Two evolutionary models (AIF and AGA) compatible with profile HMMs

The fragment models AFG and TKF92 are not compatible with profile HMMs. The AFG and TKF92 models add fragments to all three self-looping states in the pair HMM, but in a profile HMM only the Insert is a self-looping state. We can create a variant of the AFG model in which fragments are only added to the Insert state, which we call the Affine Insert Fragment (AIF) model. (A similar variant cannot be made for TKF92, since all fragments are tied to use the same Bernoulli parameter.) The AIF evolutionary model is affine both for insertions and deletions, although for different reasons. Affine insertions occur because of the fragments, affine deletions occur because of the position-specific deletion rate constants. The AIF evolutionary model is also not reversible, and compatible with standard profile HMMs (see Fig. 5).

There is an alternative to the fragment method used by the AIF model in order to introduce evolutionary models for profile HMMs. It requires assuming that the length distribution of inserts does not change with time. We call this model the AGA model. (A pair HMM with time-independent insert length distribution has been introduced before [16]). Despite the simplification, we later verify that the AGA model performs comparably to the AIF model. The AIF and the AGA state machine representations are given in Fig. 5, and details about the models are given in the Additional file 1.

The AIF and the AGA models are compatible with profile HMMs that contain D→I and I→D transitions, including the original profile HMMs proposed by Krogh et al. [32]. The HMMER HMM profile software uses a modified “plan 7” architecture that disallows D→I and I→D transitions [21]. It seems quite forced to propose an evolutionary model that would not allow one to find inserted residues after deleted ancestral residues. The HMMER architecture would need to be revisited to take advantage of the AIF and AGA models.

In summary, we report that: (1) In general, creating evolutionary models compatible with affine-cost alignment is not easy; geometrically distributed microscopic insertions do not yield geometrically distributed macroscopic inserts. (2) There is an affine gap-cost model (the AFR model) compatible with the insertion/deletion symmetry assumptions of Smith-Waterman-based comparative methods. (3) There are two ways of obtaining the affine costs of profile HMMs, either by adding fragments to the Insert state (the AIF model), or by assuming that inserts are an indivisible unit that follow a time-independent geometric length distribution (the AGA model).

A compilation of affine probabilistic evolutionary models with analytic solutions identified by us and others and their properties is given in Table 1.

A time-dependent parameterization for BLAST

A typical parameterization of BLAST is to use the BLOSUM62 substitution scoring matrix, together with an open cost of -11 and a extend cost of -1 (all given in 1/2 bit units), which has been obtained after years of empirical determination by trial and error [35]. Alternative sets of affine gap cost parameters appropriate for different substitution matrices modeling different degrees of sequence similarity have been obtained empirically [36]. Here we propose a time dependent parameterization of BLAST (or any other sequence to sequence comparison method based on the Smith-Waterman algorithm) based on the AFR evolutionary model. We also propose a method to estimate the evolutionary parameters based on the requirement that it yields the (–11/-1) particular values at one particular time point.

We have shown that the affine fragment reversible (AFR) model is compatible with the particular insertion/deletion symmetry implied in Smith-Waterman-type comparisons. From an evolutionary perspective, we could assume that a set of specific parameter values (such as the above mentioned -11/-1) corresponds to the parameter values under the AFR evolutionary model at one specific (and arbitrary) fixed-time point (t ^⋆). Then, we will show that calculating the probabilities at any other time becomes an algebraic problem of solving for the rates and Bernoulli parameters of the evolutionary model given particular fixed time values of the probabilities. (We have taken a similar approach elsewhere [37]).

Smith-Waterman’s “open+extend” cost is the score of the first residue in an indel, and the “extend” cost is the score of any other residue in the indel. Based on the probabilistic symmetric AFR pair HMM (Fig. 4), we propose to use

$$\begin{array}{@{}rcl@{}} \text{extend} &=& \log\mathrm{T}_{\text{XX}},\\ \text{open} + \text{extend} &=& \log\mathrm{T}_{\text{MX}} + \log(1-\mathrm{T}_{\text{XX}}), \end{array} $$

where in the open score in addition to the cost of entering an indel (logT_MX), we need to include the cost associated with exiting the indel state which has no counterpart in the affine gap cost Smith-Waterman scheme. This particular mapping is an approximation.

In fact, it can only be an approximation. In Smith-Waterman, all indels are scored equally independently of their context. In its symmetric pair HMM counterpart, indels have different scores (log probabilities) depending on where they occur (flanked by conserved regions and/or indels in the other sequence). In the probabilistic Smith-Waterman, it is hard to make the different possible costs equal to each other. From an evolutionary perspective, one does not want to set them identical to each other. For more conserved parameterizations, one would expect to have more indels flanked by homologous residues. For more divergent parameterizations, one would expect see more instances of indels flanked by indels. The term log(1−T_XX) includes both contributions ($\mathrm {T}^{\mathrm {t}}_{\text {XM}}$ and $\mathrm {T}^{\mathrm {t}}_{\text {XY}}$) which will become alternatively dominant at different divergence degrees.

We propose the following continuous time parameterization for Smith-Waterman:

$$\begin{aligned} \text{extend(t)} &= \log\mathrm{T}^{\mathrm{t}}_{\text{XX}},\\ \text{open(t)} &= \log\left(\mathrm{T}^{\mathrm{t}}_{\text{MX}}\right) + \log\left(1-\mathrm{T}^{\mathrm{t}}_{\text{XX}}\right) \\ &\quad-\log\left(\mathrm{T}^{\mathrm{t}}_{\text{XX}}\right),\\ \text{substitution score}\,\, S_{t}(\text{\textit{a,b}}) &= \log \frac{P_{t}(\text{\textit{a,b}})}{f_{a}\, f_{b}} = \log \frac{P_{t}(b\mid a) f_{a}}{f_{a}\, f_{b}} \\&= \log \frac{P_{t}(b\mid a)}{f_{b}} = \log \frac{\left(e^{t Q}\right)(\text{\textit{a,b}})}{f_{b}}. \end{aligned} $$

where the functions $\mathrm {T}^{\mathrm {t}}_{\text {MX}}$ and $\mathrm {T}^{\mathrm {t}}_{\text {XX}}$ are the time-dependent joint transition probabilities of the AFR model given in Fig. 4. The conditional probabilities are given by P _t(b∣a)=(e ^tQ)(a,b), where the K×K rate matrix Q is such that $\sum _{b=0}^{K-1} Q(\text {\textit {a,b}}) = 0$, for each residue 0<a<K, for an alphabet of size K, and has f _b as its saturation probabilities (that is, ${\lim }_{t\rightarrow \infty }\left (e^{t\,Q}\right)(\text {\textit {a,b}}) = f_{b}$).²

The standard BLAST parameterization (-11/-1/BLOSUM62) can be cast as a fixed-time instance of the previous parameterization as,

$$\begin{aligned} \text{extend}(t^{\star}) &\,=\,& -1 &= 2 \log_{2}\left(\mathrm{T}^{\star}_{\text{XX}}\right). \\ \text{open}(t^{\star}) &\,=\,& -11 &= 2 \log_{2}\frac{\mathrm{T}^{\star}_{\text{MX}}\left(1-\mathrm{T}^{\star}_{\text{XX}}\right)}{\mathrm{T}^{\star}_{\text{XX}}},\\ S_{t^{\star}}(\text{\textit{a,b}}) &=& 2\log_{2} \frac{P^{B62}(b\mid a)}{f_{b}} &= 2\log_{2} \frac{\left(e^{t^{\star} Q}\right)(\text{\textit{a,b}})}{f_{b}}, \end{aligned} $$

where P ^B62 are the conditional probabilities corresponding to BLOSUM62, and $\mathrm {T}^{\star }_{\text {MX}}$ and $\mathrm {T}^{\star }_{\text {XX}}$ stand for the particular values of the corresponding time-dependent joint transition probabilities at a fixed and arbitrary time t=t ^⋆.

We can rewrite the above equations as,

$$\begin{aligned} &\,\mathrm{T}^{\star}_{\text{XX}} = 2^{-1/2} &= 0.7017,\\ &\mathrm{T}^{\star}_{\text{MX}} = \frac{\mathrm{T}^{\star}_{\text{XX}}}{1-\mathrm{T}^{\star}_{\text{XX}}}\, 2^{-11/2} &= 0.0534,\\ &Q^{B62} = \log P^{B62}, \end{aligned} $$

where we assume with all generality that t ^⋆=1, and the rate Q ^B62 is the logarithm of the BLOSUM62 substitution matrix.

Under the AFR evolutionary model, we can rewrite (see Fig. 4),

$$\begin{aligned} &\,\mathrm{T}^{\star}_{\text{XX}} = \beta^{\textsc{lr}}_{\star} \,(1-r_{X}) + r_{X} &= 0.7017, \\ &\mathrm{T}^{\star}_{\text{MX}} = \beta^{\textsc{lr}}_{\star} \,(1-r_{M}) &= 0.0534. \end{aligned} $$

Not all parameters of the AFR model (λ,μ _A,r _X,r _M) can be determined by the above conditions.

Introducing $\beta ^{\textsc {lr}}_{\infty } =\lambda /(\lambda +\mu _{A})$, which is the value of $\beta ^{\textsc {lr}}_{t} $ at infinite divergence, we have

$$\beta^{\textsc{lr}}_{t} = \beta^{\textsc{lr}}_{\infty} \frac{1-e^{-\mu_{A}\,t}}{1- \beta^{\textsc{lr}}_{\infty} \,e^{-\mu_{A}\,t}}. $$

Then, we can solve the above equations to obtain the values of the parameters of the AFR model given the fixed-time values $\mathrm {T}^{\star }_{\text {MX}}, \mathrm {T}^{\star }_{\text {XX}}$, and $\beta ^{\textsc {lr}}_{\infty } $ as,

$$ \begin{aligned} &\mu_{A} = \log \frac{\beta^{\textsc{lr}}_{\infty} \,\left(1-\beta^{\textsc{lr}}_{\star} \right)}{(\beta^{\textsc{lr}}_{\infty} -\beta^{\textsc{lr}}_{\star})}, \\ &\,\ \ \lambda = \mu_{A}\,\frac{\beta^{\textsc{lr}}_{\infty} }{1-\beta^{\textsc{lr}}_{\infty} },\\ &r_{X} \,= \frac{\mathrm{T}^{\star}_{\text{XX}}-\beta^{\textsc{lr}}_{\star} }{1-\beta^{\textsc{lr}}_{\star} }, \end{aligned} $$

((1))

where

$$\beta^{\textsc{lr}}_{\star} = \frac{\mathrm{T}^{\star}_{\text{MX}}}{1-r_{M}}, $$

and $\beta ^{\textsc {lr}}_{\infty } $ and r _M are still undetermined.

In order to infer positive valued rates such that λ≤μ _A (so that p ^LR=λ/μ _A<1), we need to impose the conditions $\beta ^{\textsc {lr}}_{\star } < \beta ^{\textsc {lr}}_{\infty } < 0.5$. The Bernoulli parameter r _X is properly parameterized as long as $\mathrm {T}^{\star }_{\text {XX}}>\beta ^{\textsc {lr}}_{\star } $, which is usually the case, and in particular it is satisfied by this parameterization. The two remaining parameters r _M and $\beta ^{\textsc {lr}}_{\infty } $ are not constrained by the fixed-time point values of -11/-1/BLOSUM62. Those two parameters can be estimated from data.

For a given set of values of the four parameters of the AFR model (μ _A,λ,r _X,r _M), we can calculate the open and extend costs at any other time given the expressions for the AFR model (Fig. 4). The continuous-time affine-gap and substitution score functions are (in half bits)

$$ \begin{aligned} &\text{extend}(t) = 2\,\log_{2}\mathrm{T}^{\mathrm{t}}_{\text{XX}}\\ &\qquad\qquad= 2\,\log_{2}\left(\beta^{\textsc{lr}}_{t} (1-r_{X}) + r_{X}\right), \\ &\ \ \ \text{open}(t) = 2\, \log_{2}\frac{\mathrm{T}^{\mathrm{t}}_{\text{MX}}(1-\mathrm{T}^{\mathrm{t}}_{\text{XX}})}{\mathrm{T}^{\mathrm{t}}_{\text{XX}}} &\\ &\qquad\qquad= 2\, \log_{2}\frac{\beta^{\textsc{lr}}_{t} \left(1-\beta^{\textsc{lr}}_{t} \right)(1-r_{X})(1-r_{M})}{\beta^{\textsc{lr}}_{t} (1-r_{X}) + r_{X}},\\ &\ \ \ \,S_{t}(\text{\textit{a,b}}) = 2\,\log_{2} \frac{P_{t}(\text{\textit{a,b}})}{f_{a} f_{b}} &\\ &\qquad\qquad= 2\,\log_{2} \frac{\left(e^{t\,Q^{B62}}\right)(\text{\textit{a,b}})}{f_{b}}. \end{aligned} $$

((2))

Homology programs such as SSEARCH and FASTA provide a handful of different substitution matrices at different percentage identity with their corresponding recommended open and extend penalties, which have been estimated by numerical testing [28, 36]. Here for demonstration, we fix the values of the remaining free parameters r _M and $\beta ^{\textsc {lr}}_{\infty } $ such that they produce a relatively good agreement with the empirical values collected for SSEARCH and FASTA in [28]. We use the values r _M=0.75, and $\beta ^{\textsc {lr}}_{\infty } =0.30$, which result in $\beta ^{\textsc {lr}}_{\star } = 0.2134$. Applying these specific values into the algebraic expression of the AFR model parameters in Eq. 1, we obtain,

$$\begin{array}{@{}rcl@{}} \mu_{A} &=& 1.0023,\\ \lambda &=& 0.4296,\\ r_{X} &=& 0.6276,\\ r_{M} &=& 0.7500. \end{array} $$

The continuous-time affine-gap costs for this particular parameterization, and their correspondence to the empirical discrete values used by FASTA as described in [28] are given in Fig. 6. A collection of particular point values for the “open” and “extend” functions for notable evolutionary distances is given in Table 2.

Table 2 Several point values for the affine gap cost continuous-time functions open(t) and extend(t) for time instances associated to well known substitution matrices. The analytic functions are given in Eq. 2 and depicted in Fig. 6. The parameter values are: $\mu _{A}=1.0023,\lambda =0.4296,r_{X}=0.6276,r_{M}=0.7500,\beta ^{\textsc {lr}}_{\infty } =0.3000,\beta ^{\textsc {lr}}_{\star } =0.2134$. The divergence parameter t has been normalized to BLOSUM62

Full size table

By construction and for this particular set of parameters of the AFR model, the continuous-time functions valued at t=1 reproduce the values of the the standard -11/-1/BLOSUM62 parameterization of BLAST and SSEARCH.³

Alignment accuracy using explicit evolutionary models

Profile methods for sequence homology search and alignment, unlike standard sequence/sequence comparison methods, do not assume any equivalence between deletions and insertions. Consequently, the evolutionary models compatible with profile HMMs such as the AIF and AGA model are not reversible and cannot be incorporated into the symmetric pair HMM of Fig. 4. We are interested in investigating the potential of nonreversible evolutionary models but that are still compatible with standard three-state pair HMMs, such as AIF and AGA.

We have implemented an alignment algorithm (named e2msa) that uses an explicit evolutionary model and standard pair HMMs. Having a model with a continuous variable parameterization allows us to compare the performance of the model at different parameterizations describing different branch lengths, that is, different degrees of sequence similarity. The e2msa alignment algorithm also allows us to compare the performance of the different evolutionary models. All evolutionary models described in Table 1 have been implemented as part of e2msa. We give a detailed description of the program e2msa and the training of the evolutionary parameters in the ‘Methods’ section.

Next we test the performance (for alignment accuracy and homology coverage in pairwise alignments) of a short-branch parameterization and a long-branch parameterization, compared to an optimal-branch parameterization in which the divergence time is the one that maximizes the score of the sequences being compared. We expected to see that optimal-branch parameterized models produce more accurate alignments.

A fixed long-branch parameterization is sufficient to align global homologies

We created an alignment benchmark composed of a mixture of structural pairwise alignments (the reference alignments from databases SABmark and PREFAB), as well as conserved (≥ 40 % identity) pairwise alignments selected at random from random Pfam families. All sequences in this benchmark are full-length homologies. Details about this large benchmark (more than 35,000 pairwise alignments) named the “Global Homology set” are provided in the ‘Methods’ section. While the alignments in SABmark and PREFAB are very divergent (all are less than 50 % identity), due to the Pfam contribution, the “Global Homology set” covers all ranges of percentage identities, with at least 100 alignments in each of the more divergent identity ranges 80–85 %, 85–90 %, 90–95 % and 95–100 %, and many more in the other less divergent ranges.

In Fig. 7, we compare two single fixed-branch parameterizations: a “long-branch” parameterization that (by sampling from the model) produces alignments with an average percentage identity of 27 % (similar to that of BLOSUM62), and a “short-branch” parameterization with 71 % average identity alignments (similar to PAM30). We calculate alignment accuracy by measuring sensitivity (fraction of aligned positions that are inferred correctly), and positive predictive value (PPV or fraction of predicted aligned positions that are correct). The two measures can be combined in one single measure, the F value, the harmonic mean of sensitivity and positive predictive value. As a single measure that conveys information about a given method for alignments at all percentages of identity, we also provide the area under the curve (AUC) for a given measure (sensitivity, positive predictive value, or F).

Alignments are binned by percentage identity (here we use 5 % identity bins). Alignment accuracy measures are calculated for each identity bin independently of of any of the other bins.

For the two fixed-branch models, we also calculate (for each percentage identity bin) the mean score efficiency (based in a similar measure introduced for time-dependent substitution matrices [4]) which corresponds to the fraction of the maximal information of the evolutionary model that is captured by a single fixed-branch parameterization. The score efficiency is the ratio of the probability of the sequences given the e2msa model (the Forward score) for the fixed-branch parameterization divided by the Forward score of the optimal parameterization.

We observe in Fig. 7, by comparing the score efficiency of the long-branch and short-branch parameterizations, that the two models are tuned to the alignment’s divergence: the long-branch parameterization produces more efficient scores for more divergent alignments, while the short-branch parameterization produces better scores for less divergent alignments. However, contrary to our expectations, this selective score efficiency does not translate into alignment accuracy: the long-branch parameterization which (as expected) performs better for alignments of divergent sequences, it also performs well (comparable to a short-branch parameterization) for alignments of very conserved sequences. This result can be seen in Fig. 7 b for the composite F measure. We have also included the two components of F (sensitivity and positive predictive value) in Fig. 7 c and d respectively for completeness. We have also tested homology coverage, which produces comparable results to those of alignment accuracy.

We have performed the same experiments using several different evolutionary models in e2msa. Results are presented in Table 3 for alignment accuracy (and in Table 4 for homology coverage). Evolutionary models with more parameters tend to perform better (unsurprisingly). Fragment models perform better than their corresponding single-residue counterparts. The two models compatible with profile HMMs (the AIF and the AGA models) perform similarly to each other, and to the AFG model. The AFR model performs comparably to TKF92, and it has the advantage of allowing a symmetric pair HMM implementation.

Table 3 Alignment accuracy for global and local homologies of different evolutionary models implemented under the e2msa local alignment algorithm

Full size table

Table 4 Homology coverage performance for global and local homologies of different evolutionary models implemented under the e2msa local alignment algorithm

Full size table

From these results, one would be tempted to conclude that the best strategy is to always use a long-branch parameterization and never bother with implementing an explicit evolutionary model which carries the additional expense of optimizing the branch length. However, there is a common scenario not addressed by this benchmark of global homologies that we investigate next, the reported artifact of overextension of local alignments into flanking nonhomologous sequence when using a long-branch parameterized method [19, 27, 28].

A fixed short-branch parameterization reduces nonhomologous overextension

From the “Global Homology set”, we have constructed a “Local Homology set” with the same number of alignments, but where we have preserved only a small part of the original global alignment (50 positions on average), and have replaced the rest of the original homologous sequences with random sequence (details are given in the ‘Methods’ section).

We perform the same alignment accuracy benchmark for this “Local Homology set”. Results are presented in Fig. 8 a. Unlike the situation for the “Global Homology set” described in Fig. 7, we observe that in the presence of local homologies the short-branch parameterization performs better (as given by the F measure) than the long-branch parameterization for more conserved homologies, while the long-branch parameterization performs better for more divergent homologies.

What makes the local homology case different is that the inferred alignment can extend into the nonhomologous flanking regions. The alignment accuracy PPV measure penalizes both homologous residues that are improperly align to other homologous residues as well as aligned positions that include nonhomologous residues. In order to disentangle these two effects, we calculated measures of homology coverage. Coverage measures isolate the nonhomologous overextension problem because they penalize nonhomologous positions that are included in the alignment, but they do not penalized homologous positions misaligned to other homologous positions. Coverage sensitivity calculates the fraction of homologous positions that are included in the alignment, and coverage positive predictive value measures the fraction of aligned positions that are actually homologous. The coverage F measure is the harmonic mean of its SEN and PPV.

Homology coverage results are presented in Fig. 8 b. For close relationships (40 % identity or more), we observe that coverage measures are similar to those of alignment accuracy. Thus, the low alignment accuracy PPV observed for the long-branch parameterization (relative to that of the short-branch parameterization) is mostly associated with alignment overextension. For distant relationships (less than 40 % identity), the poorer values for alignment accuracy relative to those of coverage are mostly due to misalignment of homologous residues.

Both for alignment accuracy and homology coverage, the sensitivity of the short-branch parameterization is uniformly below that of the long-branch parameterization. However the consistently better PPV for the short-branch parameterization more than compensates for the loss in sensitivity. The result is that for the F measure (the harmonic mean of sensitivity and PPV) there is a crossover such that the short-branch parameterization is better for close relationships and the long-branch parameterization is better for distant relationships.

Regarding the behavior of the different affine evolutionary models for the Local Homology set, we observe the same trends as as for the Global Homology set. In particular, the AIF and AGA models compatible with profile HMMs perform similarly to each other and to the AFG model, and the AFR model performs similarly to the TKF92 model. See Table 3 for alignment accuracy of the different evolutionary models, and Table 4 for homology coverage.

A variable optimal-branch parameterization is best to align local homologies

An explicit evolutionary model allows us to optimize for the parameterization that best suits a given homology comparison. The optimal-branch parameterization of e2msa uses a simple line-optimizer to select the branch length that achieves the best probability for the compared sequences summing to all possible alignments.

We have applied the optimal-branch parameterization of e2msa to both the Global and Local Homology sets. Results for alignment accuracy are presented in Fig. 9. As expected, for the Global Homology set, the optimal-branch model performs almost identically to the long-branch model for all degrees of sequence divergence. For the Local Homology set, on the other hand, the optimal-branch model follows the short-branch model for more conserved sequences, and follows the long-branch model for more divergent sequences, thus combining the best of both regimes. Results for homology coverage (Fig. 10) are similar to those of alignment accuracy.

Conclusions

We have identified and implemented probabilistic models of biological sequence evolution that provide an affine treatment of insertions and deletions, and that can be compactly described as a standard three-state pair HMM in which the emission and transition probabilities are continuous time functions depending on a small number of independent rate parameters. An unexpected difficulty is the realization that macroscopic inserts with geometrically distributed lengths force quite constrained and unrealistic microscopic events such as indivisible fragments; and that for instance, single event geometrically distributed insertions do not produce geometrically distributed inserts. One of the evolutionary models (the AFR model) is compatible with the symmetric pair HMMs that treat insertions and deletions indistinguishably used routinely in sequence/sequence comparisons. Two of the evolutionary models (AIF and AGA) are compatible with standard profile HMMs for which there is a set of independent rate parameters for each position in the profile.

A continuous time evolutionary model, including a model of indels, allows us to optimize the parameterization of a profile/sequence or sequence/sequence comparison to match the apparent divergence time. Using a single alignment program implementing pair-HMMs (e2msa), we asked what gain (if any) this could bring. Our results show that the most important benefit of using an optimal-branch parameterization is the reduction of homologous overextension artifacts in which non homologous regions become part of the alignment and are treated as homologous. For global homologies with no risk of overextension, a fixed long-branch parameterization is the most economical choice to provide the best possible performance. Optimal-branch parameterized models could improve iterative methods such as JACKHMMER or PSI-BLAST in which it is important to avoid overextension of hits in the early stages of the search. Another advantage of a short-branch parameterization that we can anticipate, but have not tested in this manuscript, is the detection of very short homologies such as in the analysis of short metagenomic reads, although preliminary results suggest this effect would only be relevant for ORFs shorter than 10 aa.

Because e2msa implements several different evolutionary models, we can conduct a level comparison of those models keeping the rest of the details of the alignment algorithm constant. We observe that the AIF and AGA evolutionary models compatible with profile HMMs are amongst the best performing model in a pairwise alignment accuracy benchmark of both global and local homologies.

The AIF and AGA models are not applicable to some specific profile HMMs such as HMMER that disallow transitions between Delete and Insert states (named Plan7, for using only 7 transitions for a given profile position). HMMER’s original design was purposely non-evolutionary, favoring structural alignments where nonhomologous positions may appear the same alignment column as they occupy the same nonconserved region in between structurally well defined conserved regions. Constructing an evolutionary version of HMMER is going to require changing the architecture of the HMM to include all possible transitions from one profile position to the next (a Plan9 profile HMM).

In an evolutionary profile HMM, since we want to obtain position-specific scores, we require position-specific rates both for the substitution and insertion/deletion process. We anticipate replacing the standard profile training method by using a mixture of HMM rates trained in a large protein domain database that would replace the use of mixture Dirichlet priors, a phylogenetic tree that would replace sequence weighting, and the evolutionary time that would replace entropy weighting. It remains an open question whether it is worth implementing an evolutionary profile HMM to improve training (by removing the use of Dirichlet priors, sequence and entropy weighting), when the method is going to be used mainly with a very long branch parameterization.

Methods

e2msa: a pair HMM alignment algorithm implementing explicit evolutionary models

In order to compare the different evolutionary models while maintaining other variables equal, and to measure the effect of using an explicit evolutionary model on alignment accuracy, we have implemented a pairwise or multiple sequence alignment program (named e2msa) that can adopt any of the evolutionary models described in Table 1.

Constructing evolutionary alignments using any of the models in Table 1 simply requires that in the Forward and Viterbi algorithms of a standard local pair HMM described in [3], one replaces the constant emission and transition probabilities with continuous time functions dictated by the evolutionary model. A pair HMM describes the probability of an ancestral sequence and one of its descendants. Thus, a pair HMM can only be used to align two extant sequences for reversible evolutionary model such as the AFR model (where one sequence can be taken as the ancestor of the other one with all generality). A profile HMM on the other hand describes the relationship between an extant sequence and the (ancestral) profile. Thus, profile HMMs do not care about reversibility, and in fact the evolutionary models (AIF and AGA) compatible with profile HMMs are nonreversible. In order to test non reversible evolutionary models using pair HMMs, we need to calculate the probability of two extant sequences being generated by a unknown common ancestor, each of them according to a standard pair HMM.

We have implemented the so-called E2pair algorithm to calculate the probability of two sequences being related by a common ancestor, as described in Fig. 11. The algorithm sums over all possible evolutionary histories and ancestral sequences. In Fig. 11 a, we provide an example of two extant sequences related by one particular ancestral sequence and one particular evolutionary history. In Fig. 11 b, we provide a full description of the E2pair model and its time-dependent parameters, where the time dependencies are given by a specific evolutionary model. The E2pair dynamic programming algorithm calculates P(s ₁,s ₂∣evomodel,t ₁,t ₂), that is, the probability of two sequences s ₁,s ₂ both descend from the same (unknown) common ancestor after times t ₁ and t ₂ respectively. The E2pair model does not assume that all the extent of the sequences being compared have to be homologous. By default as described in Fig. 12, the E2pair model allows for nonhomologous regions flanking the (possibly more than one) homologous regions. We refer to this as the local mode of the E2pair model. Under a particular parameterization described in Fig. 12, the E2pair model can force the two sequences to align in full; we refer to that as the global mode parameterization.

The Forward algorithm for the E2pair model is described in detail in Fig. 13. It has a complexity of O(l1×l2) both in time and memory for two sequences of lengths l1 and l2.

The e2msa method can be run at a fixed evolutionary distance (fixed branch) or using variable branch lengths obtained by optimizing the probability of the sequences being compared given the model. The e2msa method can be run in local or global mode. All experiments in this manuscript are run in local mode.

Training of rate parameters

The rate parameters of the evolutionary model are an input to the e2msa program. In all experiments performed here, e2msa uses evolutionary rate parameters derived from a training set of 1000 trusted pairwise alignments obtained at random from Pfam. This training set is independent of any other Pfam family used in the alignment benchmarks. The training set file is named “Pfam.seed.S1000.sto”, and it is included as part of the Additional file 2.

The rate parameters are obtained by optimizing the total probability of the sequences in the training set given the evolutionary model using a gradient descent optimization method implemented in the program e2train. For the emission probabilities, we constructed a rate matrix (scaled to one substitution per site) derived from the BLOSUM62 substitution matrix. In the Additional file 2, we provide evolutionary parameters (both the rates and the Bernoulli geometric parameters) trained on the same training set (“Pfam.seed.S1000.sto”) for all evolutionary models given in Table 1.

Alignment accuracy benchmark

The alignment benchmark used in Fig. 7 consists of a total of 36,484 global homology pairwise alignments including the whole reference sets of SABmark (29,756 alignments) [38] and PREFAB (1682 alignments) [39], as well as a collection of 5043 pairwise alignment selected each from a random Pfam seed alignment, and such that these selected families are different from those used to generate the Pfam training set. Because standard alignment benchmarks like SABmark and PREFAB tend to select for more divergent alignments, we used Pfam to include more closely related alignments. For the Pfam alignments, we required at least 40 % identity between the sequences. The result is a large collection of trusted pairwise alignments of global homologies representing all degrees of sequence diversity. We refer to this set as the “Global Homology set”, and it is available in the Additional file 2.

In the “Global Homology set”, all 5 % percentage identity bins in which we report results include at least 100 alignments. The length of the sequences in the “Global Homology set” is quite similar in all identity bins, with mean and standard deviation that range from 199 +119 nts for the alignments in the 0–5 % identity bin, to 187 +/- 125 nts for the alignments in the 90–100 % identity bin. The absolute range of sequence lengths is 7 nts to 2196 nts.

Derived from the “Global Homology set”, we construct a set of local homologies. For each global homology alignment, we preserve just a section of the original alignment. The homologous fragments have lengths normally distributed around 50 positions, and are selected from a random starting position in the alignment. The original homologous residues not part of the selected homology fragment are replaced with an identical number of nonhomologous residues taken at random from UniProt and randomized by single-residue shuffling. We further require that the nonhomologous regions have to be at least a fifth of the total alignment length. The result is a collection of local homology alignments that relative to the original “Global Homology set” has a similar number of alignments (36,404) with sequences of identical length, but where we only preserve 20 % of the total number of homologous positions present in the original set. We refer to this set of alignments as the “Local Homology set”. The observed average length of the homologous regions in this set is 49 ± 10 amino acids. The “Local Homology set” is also available as part of the Additional file 2.

Measure of alignment identity

For two aligned sequences, their percentage identity is calculated as the ratio of identical positions divided by the minimum length of the two sequences. We calculate multiple alignment identity as the average percentage identity of the alignment’s pairwise comparisons.

Measures of alignment accuracy

We report alignment accuracy by pairwise comparison. For each pair of aligned sequences, we measure alignment accuracy by its sensitivity (SEN_A) and positive predictive value (PPV_A)

$$\begin{aligned} \text{SEN}_{\mathrm{A}}=\frac{\text{Aligned Positions Inferred Correctly}}{\text{True Aligned Positions}}, \quad\quad \text{PPV}_{\mathrm{A}}=\frac{\text{Aligned Positions Inferred Correctly}}{\text{Inferred Aligned Positions}}. \end{aligned} $$

Sometimes for simplicity, we use the F measure (the harmonic mean of the SEN and PPV [40]) as a proxy for prediction accuracy,

$$ \mathrm{F}_{\mathrm{A}}=\frac{1}{\frac{1}{2}\,\left(\frac{1}{\text{SEN}_{\mathrm{A}}} + \frac{1}{\text{PPV}_{\mathrm{A}}}\right)}. $$

((3))

In this work, the alignment sensitivity (SEN_A) also referred to as the “SP-Score”, and positive predictive value (PPV_A) also referred to as the “Modeler Score” have been calculated using the program FastSP v1.6 [41].

Measures of homology coverage

We measure homology coverage performance by its sensitivity (SEN_C) and positive predictive value (PPV_C). For a predicted multiple alignment of some embedded homologous regions, we define,

$$\begin{aligned} \text{SEN}_{\mathrm{C}}&=\frac{\text{\# of homologous residues aligned}}{\text{\# of homologous residues}},\\ \quad\quad \text{PPV}_{\mathrm{C}}&=\frac{\text{\# of homologous residues aligned}}{\text{\# of aligned residues}}. \end{aligned} $$

Similarly to alignment accuracy, the F measure of coverage (F_C) is introduced as in Eq. (3).

Differently from alignment accuracy, the coverage measures do not penalize homologous positions aligned incorrectly to other homologous positions.

Software implementations and availability

We provide the ANSI C source code for the programs e2msa (to produce a pairwise or progressive multiple sequence alignment using the E2pair algorithm), e2train (for training the evolutionary parameters of the pair-HMM), e2sim (for generating aligned sequences according to an evolutionary model and an arbitrary phylogenetic tree), and e1sim (to evolve sequences according to the different evolutionary models). All programs accept under one single implementation all the evolutionary models described in Table 1.

The ANSI C source code for programs e2msa, e2train, e2sim, and e1sim is freely available available under the GNU General Public License (GPL) from eddylab.org. All programs use the packages HMMER [21] and EASEL (S.R. Eddy, unpublished) as libraries, which are also provided under the same license.

The source code for the above programs, as well as the Perl scripts used to generate the experiments in this work and the results of the experiments have been collected in a tarball available as part of the Additional file 2.

Software and database versions used

We used the following versions of programs: NCBIBLAST 2.2.26+, [42], SSEARCH 36.3.6d [20], MSAProbs 0.9.7 [43], MUSCLE 3.8.31 [44], and FastSP v1.6 [41].

The VT amino acid substitution score matrices [45] used in Fig. 6 were created using the Perl script “vt_scores.pl” written by K. Kneutgen and T. Mueller (May 2002), and provided to us by courtesy of W.R. Pearson. The VT conditional probability matrices have been generated using a modified version of the original script, named “vt_scores_modER.pl” which also provides the target amino acid frequencies, and the percentage identity. The scripts “vt_scores.pl”, “vt_scores_modER.pl”, and several VT score and transition matrices are provided as part of the Additional file 2.

We used the following versions of databases: Pfam v.27 [22], UniProt 2013_06 [46], PREFAB 4.0 [38], and SABmark 1.65 [39].

Endnotes

¹ There is an exception for a residue inserted after a deleted ancestral residue that makes the macroscopic TKF91 model not a linear model, strictly speaking.

² We could propose that the substitution score also adds the contribution corresponding to $\log \mathrm {T}^{t}_{\text {MM}}$.

³ This does not mean that a dynamic programming algorithm for the AFR model at t=1 would be identical to that of the (-11/-1/BLOSUM62) Smith-Waterman algorithm. A one-to-one correspondence between the Smith-Waterman algorithm and a symmetric pair HMM is not possible, not even for a fixed-time parameterization. The reason is that while Smith-Waterman uses arbitrary scores, in a pair HMM each transition has to be accounted for as a probability. In addition to not being able to use one unique open score, another example of the lack of correspondence is in the score of a match. Smith-Waterman assumes that the score of a match is just the emission substitution score S(a,b); its probabilistic counterpart requires adding the contribution of the match-to-match transition, S(a,b)+logT_MM. It does not appear that these two schemes can be reconciled with each other.

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl Acids Res. 1997; 25:3389–402.
Article CAS PubMed PubMed Central Google Scholar
Eddy SR. Profile hidden Markov models. Bioinformatics. 1998; 14:755–63.
Article CAS PubMed Google Scholar
Durbin R, Eddy SR, Krogh A, Mitchison GJ. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge UK: Cambridge University Press; 1998.
Book Google Scholar
Altschul SF. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol. 1993; 36:290–300.
Article CAS PubMed Google Scholar
Thorne JL, Kishino H, Felsenstein J. An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol. 1991; 33:114–24.
Article CAS PubMed Google Scholar
Thorne JL, Kishino H, Felsenstein J. Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol. 1992; 34:3–16.
Article CAS PubMed Google Scholar
Bishop MJ, Friday AE. Evolutionary trees from nucleic acid and protein sequence. Proc R Soc B. 1985; 226:271–302.
Article CAS Google Scholar
Bishop MJ, Thompson EA. Maximum likelihood alignment of DNA sequences. J Mol Biol. 1986; 190:159–65.
Article CAS PubMed Google Scholar
Metzler D, Fleissner D, Wakolbinger A, von Haeseler A. Assessing variability by joint sampling of alignments and mutation rates. J Mol Evol. 2001; 53:660–9.
Article CAS PubMed Google Scholar
Bouchard-Côté A, Jordan MI. Evolutionary inference via the Poisson indel process. 2012. PNAS 10.1073/pnas.1220450110.
Mitchison GJ, Durbin RM. Tree-based maximal likelihood substitution matrices and hidden Markov models. J Mol Evol. 1995; 41:1139–51.
Article CAS Google Scholar
Mitchison GJ. A probabilistic treatment of phylogeny and sequence alignment. J Mol Evol. 1999; 49:11–22.
Article CAS PubMed Google Scholar
Qian B, Goldstein RA. Detecting distant homologs using phylogenetic tree-based HMMs. Proteins. 2003; 52:446–53.
Article CAS PubMed Google Scholar
McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res. 2000; 10:744–57.
Article CAS PubMed Google Scholar
Rivas E, Eddy SR. Probabilistic phylogenetic inference with insertions and deletions. PLoS Comput Biol. 2008; 4:1000172.
Article CAS Google Scholar
Knudsen B, Miyamoto MM. Sequence alignments and pair hidden Markov models using evolutionary history. J Mol Biol. 2003; 333:453–60.
Article CAS PubMed Google Scholar
Miklós I, Toroczkai Z. An improved model for statistical aligment In: Gascuel O, Moret BME, editors. WABI 2001. Berlin Heidelberg: Springer: 2001. p. 1–10.
Google Scholar
Miklós I, Lunter GA, Holmes I. A “Long Indel” model for evolutionary sequence alignment. Mol Biol Evol. 2004; 21:529–40.
Article CAS PubMed Google Scholar
Reese JT, Pearson WR. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics. 2002; 18:1500–7.
Article CAS PubMed Google Scholar
Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. Meth Mol Biol. 2000; 132:185–219.
CAS Google Scholar
Eddy SR. Accelerated profile HMM searches. PLoS Comp Biol. 2011; 7:1002195.
Article CAS Google Scholar
Finn RD, Clements J, Eddy SR. HMMER web server: Interactive sequence similarity searching. Nucl Acids Res. 2011; 39:29–37.
Article CAS Google Scholar
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. NAR. 2012; 40:290–301.
Article CAS Google Scholar
Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucl Acids Res. 2013; 41:70–82.
Article CAS Google Scholar
Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 2008; 4:1000069.
Article CAS Google Scholar
Karplus K. SAM-T08, HMM-based protein structure prediction. Nucleic Acids Res. 2009; 21:492–7.
Article CAS Google Scholar
Gonzalez MW, Pearson WR. Homologous over-extension: a challenge for iterative similarity searches. Nucl Acids Res. 2010; 38:2177–89.
Article CAS PubMed PubMed Central Google Scholar
Mills LJ, Pearson WR. Adjusting scoring matrices to correct overextended alignments. Bioinformatics. 2013; 29:3007–13.
Article CAS PubMed PubMed Central Google Scholar
Lunter G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics. 2007; 23:289–96.
Article CAS Google Scholar
Wang J, Keightley PD, Johnson T. MCALIGN2: Faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution. BMC Bioinformatic. 2006; 7:292.
Article CAS Google Scholar
Cartwright RA. Problems and solutions for estimating indel rates and length distributions. Mol Biol Evol. 2009; 26(2):473–80.
Article CAS PubMed Google Scholar
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol. 1994; 235:1501–31.
Article CAS PubMed Google Scholar
Hein J. An algorithm for statistical alignment of sequences related by a binary tree. Pac Symp Biocomput. 2001; 6:179–90.
Google Scholar
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.
Article CAS PubMed Google Scholar
Pearson WR. Comparison of methods for searching protein sequence databases. Protein Sci. 1995; 4:1145–60.
Article CAS PubMed PubMed Central Google Scholar
Pearson WR. Selecting the right similarity-scoring matrix. Curr Protocol Bioinform. 2013; 3:3–5351359.
Google Scholar
Rivas E. Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics. 2005; 6:63.
Article CAS PubMed PubMed Central Google Scholar
Edgar RC. Quality measures for protein alignment benchmarks. Nucleic Acids Res. 2010; 38:2145–53.
Article CAS PubMed PubMed Central Google Scholar
Van Walle I, Lasters I, Wyns L. SABmark–a benchmark for sequence alingnment that covers the entire known fold space. Bioinformatics. 2005; 1:293–303.
Google Scholar
van Rijsbergen CJ. Information Retrival. London: London Butterworths; 1979.
Google Scholar
Mirarab S, Warnow T. FastSP: Linear time calculation of alignment accuracy. Bioinformatics. 2011; 27:3250–8.
Article CAS PubMed Google Scholar
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: Architecture and applications. BMC Bioinformatics. 2009; 10:421.
Article CAS PubMed PubMed Central Google Scholar
Liu Y, Schmidt B, Maskell DL. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics. 2010; 26:1958–64.
Article CAS PubMed Google Scholar
Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004; 5:113.
Article CAS PubMed PubMed Central Google Scholar
Müller T, Spang R, Vingron M. A comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002; 19:8–13.
Article PubMed Google Scholar
The UniProt Consortium. UniProt: a hub for protein information. Nucl. Acids Res. 2015; 43(D1):D204–D212. [doi:10.1093/nar/gku989].
Article Google Scholar

Download references

Acknowledgments

Thanks to William R. Pearson for discussions about alignment overextension. Thanks to Fred P. Davis and Eric P. Nawrocki for comments on the manuscript. Thanks to the Centro de Ciencias de Benasque Pedro Pascual, Benasque, Spain, for its hospitality, where part of this manuscript was drafted.

Author information

Authors and Affiliations

Department of Molecular and Cellular Biology, Harvard University, Cambridge, 02138, MA, USA
Elena Rivas & Sean R. Eddy
Howard Hughes Medical Institute, 4000 Jones Bridge Rd, Chevy Chase, 20815, MD, USA
Sean R. Eddy
John A. Paulson School of Engineering and Applied Sciences, 16 Divinity Avenue, Cambridge, 02138, MA, USA
Sean R. Eddy
FAS Center for Systems Biology, Harvard University, 16 Divinity Avenue, Cambridge, 02138, MA, USA
Sean R. Eddy

Authors

Elena Rivas
View author publications
You can also search for this author in PubMed Google Scholar
Sean R. Eddy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Elena Rivas.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

Both authors contributed to the development of ideas, design of experiments, and to the writing of the manuscript. The methods and experiments were carried out by ER. Both authors read and approved the final manuscript.

Additional files

Additional file 1

This supplement provides details about the sequence evolutionary models described in the manuscript, their differential equations and analytic solutions (when existing). This supplement also collects our observations regarding previously existing evolutionary models, mainly tkf91 and tkf92. (PDF 972 kb)

Additional file 2

Tarball that contains the source code for the programs introduced in this manuscript, as well as the Perl scripts used to generate the experiments in this work and the results of the experiments. (ZIP 42,496 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Rivas, E., Eddy, S.R. Parameterizing sequence alignment with an explicit evolutionary model. BMC Bioinformatics 16, 406 (2015). https://doi.org/10.1186/s12859-015-0832-5

Download citation

Received: 27 May 2015
Accepted: 20 November 2015
Published: 10 December 2015
DOI: https://doi.org/10.1186/s12859-015-0832-5

Parameterizing sequence alignment with an explicit evolutionary model

Abstract

Background

Results

Conclusions

Background

Results and discussion

Constraints on evolutionary models compatible with affine gap cost

Affine evolutionary models by fragments

An evolutionary model (AFR) compatible with the Smith-Waterman algorithm

Two evolutionary models (AIF and AGA) compatible with profile HMMs

A time-dependent parameterization for BLAST

Alignment accuracy using explicit evolutionary models

A fixed long-branch parameterization is sufficient to align global homologies

A fixed short-branch parameterization reduces nonhomologous overextension

A variable optimal-branch parameterization is best to align local homologies

Conclusions

Methods

e2msa: a pair HMM alignment algorithm implementing explicit evolutionary models

Training of rate parameters

Alignment accuracy benchmark

Measure of alignment identity

Measures of alignment accuracy

Measures of homology coverage

Software implementations and availability

Software and database versions used

Endnotes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Additional files

Additional file 1

Additional file 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us