Evolutionary models for insertions and deletions in a probabilistic modeling framework

Background Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time. Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events. Results I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively. Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence. I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA). Conclusion These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.

Sequence similarity methods based on HMMs or SCFGs can take the form of profile or pair models and are very important for comparative genomics. These probabilistic methods for sequence comparison assume a certain degree of sequence divergence. For instance, in profile models (either profile HMMs [2][3][4] or profile SCFGs [12,13]) a sequence is compared to a consensus model. Profile models must allow for the occurrence of insertions and deletions with respect to the consensus, and they do so by using state transition probabilities that assign some position-dependent penalties for modifying the consensus with insertions or deletions. Similarly, in pair probabilistic models [8,16] two related sequences are compared (aligned and/or scored). Pairwise alignments need to allow for substitution, insertion and deletion events between the two related sequences. Substitutions are taken care of by residue emission probabilities, while insertion and deletion events are generally taken care of by state transition probabilities as in the case of profile HMMs.
In the BLAST programs [17], the score of a pairwise alignment is determined using substitution matrices which measure the degree of similarity between two aligned residues. Similarly, in pair probabilistic models, residue emission probabilities are based on substitution matrices. The evolution of substitution matrices has been studied at large for many different kinds of processes: nucleotides, amino acids, codons, or RNA basepairs [18][19][20][21][22][23]. The evolution of emission probabilities using substitution matrices is easily integrated into probabilistic models both for HMMs [24][25][26][27][28][29] and for SCFGs [14].
In probabilistic models, insertion and deletion events (indels) are sometimes described by treating indels as an additional residue (gap characters) in a substitution matrix. More often they are described using additional hidden states, where transition probabilities into those states represent the cost of gap initiation and transitions within those states represent the cost of gap extension. If the cost of gap initiation and gap extension are identical, it is referred to as a linear gap cost model. Hidden states allow arbitrary costs for gap initiation and gap extension, which is traditionally referred to as an affine gap cost model. Treating gaps as an extra character in a substitution matrix is equivalent to assuming a linear gap cost model. The parameters that modulate those processes should be allowed to change as the divergence time for the sequences being compared is varied. It has been difficult to combine probabilistic models such as profile and pair HMMs or SCFGs with evolutionary models for insertion and deletions [30][31][32][33]. Methods to evolve transition probabilities are not as well developed as those describing substitution matrices, but significant effort is currently aimed at this problem [34][35][36][37][38][39][40][41]. Models incorporating the evolution of insertions and deletions in the context of probabilistic models such as profile HMMs or pair models are a very important goal in order to make those probabilistic models more realistic.
I encountered this problem in working on QRNA, a computational program to identify noncoding RNA genes de novo. QRNA uses probabilistic comparative methods to analyze the pattern of mutation present in a pairwise alignment in order to decide whether the compared nucleic acid sequences are more likely to be protein-coding, structural RNA encoding, or neither. Originally QRNA was parameterized at a fixed divergence time. Motivated by the goal of making QRNA a time dependent parametric family of models, I investigated the possibility of evolving the transition and emission probabilities associated with a given probabilistic model. Since I already had the model parameterized for a given time, I aimed to use that model as a generating point of the whole timeparameterized family of models.
Because QRNA includes both linear and affine gap models in different places, in this paper I propose algorithms to describe the evolution of indels as a (N + 1)-th character in a substitution matrix, and algorithms to describe the evolution of the transition probabilities associated with a probabilistic model. The purpose of this paper is to describe the general theoretical framework behind these methods. A detailed description of the particular implementation of these algorithms in QRNA and a discussion of the results obtained with "evolutionary QRNA" (eQRNA) will appear in a complementary publication.

Evolutionary models for emission probabilities
The evolution of emission probabilities without gaps In order to introduce notation, I will start with a brief review of the current methods for calculating joint probabilities conditional on time, P(i, j|t), where i, j are two residues (for instance, nucleotides, amino acids, RNA basepairs, or codons). P(i, j|t) gives us the probability that residues i and j are observed at a homologous site in a pairwise alignment after a divergence time t. Pairwise sequence comparison methods score aligned residue pairs with these joint probabilities either explicitly or implicitly [17]. In explicit generative pair probabilistic models, like the pair-HMMs and pair-SCFG in QRNA, the P(i, j|t) terms are referred to as pair emission probabilities.
The evolution of joint probabilities is usually obtained by modeling the corresponding conditional probabilities P(j|i, t)as a substitution process in which residue i has been substituted by residue j over time t. Probabilistic models for nucleotide substitutions [18,19,[42][43][44] assume that nucleotide substitution follows a model of evolution that depends on an instantaneous rate matrix, (1) where t is the divergence time, R is the instantaneous rate matrix, and Q t is the substitution matrix of conditional probabilities, that is Q t (ij) ≡ P(j|i, t). This is a reasonable model used, for instance, to describe nucleotide substitutions in the Jukes-Cantor [42] or Kimura [43] models, or the more general REV model [44]; this is also the evolutionary model used for amino acid substitutions [18,19,21,45,46], codon to codon substitutions [20,47], and RNA basepair to basepair substitutions [14,22,23].
Throughout this paper, I will use the words "divergence time", "divergence", or "time" equivalently to describe the amount of dissimilarity between biological sequences measured as the number of mutations and gaps introduced in the alignment of the sequences. I will never refer to "time" as representing an actual number of years of divergence, since this number cannot be determined intrinsically from sequence data.
Thus, given a rate matrix R, Q t (and therefore the desired joint emission probabilities) can be inferred for any desired time using the Taylor expansion for the matrix exponential, This Taylor series converges in all cases.
There are several ways in which the rate matrix R can be determined. One approach is to use analytically inferred rate matrices that depend on a small number of external parameters [42][43][44]48]. For instance, the HKY model for nucleotide substitutions [48] depends on six parameters: the four stationary nucleotide frequencies, a rate of transitions, and a rate of transversions, which have to be provided externally. Another type of approach uses maximum likelihood methods [21,49,50] in order to estimate a rate matrix numerically from a training set of sequence alignments.
A third approach arises naturally in cases where suitable joint probabilities have already been estimated for a pair model, and we wish to make that model conditional on evolutionary divergence time. This approach starts from the assumption that our point estimate represents sequences at a particular arbitrary divergence time t * . For example, a similar assumption was taken to construct the BLOSUM matrices [51], which were obtained as joint probabilities at discrete point estimates from clusters of aligned sequences.
In this third approach the parameters at the generating time t * will be used to construct a rate matrix for the process. This approach is motivated by the kind of situation in which we find ourselves with probabilistic methods based on homology such as QRNA: a model has been trained in one kind of data, and the resulting probabilities represent some effective but fixed divergence time, and we wish to extend that model to a time-dependent parameterization.
For residue substitution processes, the rate matrix R and Q * , defined as the substitution matrix at the generating time t * [Q * ≡ Q t * ], convey exactly the same information. More explicitly, assuming the evolutionary model given in equation (1) we can calculate the rate matrix of the process as a function of Q * as Kishino et al [52] introduced the idea of calculating the rate matrix starting from a given substitution matrix using equation (3) and an eigenvalue decomposition of Q * . It is worth noting that the matrix equation for the rate R can be expressed as a Taylor expansion of the form which allows for a direct numerical calculation of the rate matrix. The convergence of this series requires only that for every (real or complex) eigenvalue λ of matrix Q * , then |λ -1| < 1. In addition, for any valid substitution matrix the eigenvalues have to be real and |λ| ≤ 1 (see Appendix A). Under these two conditions, the above Taylor series converges so long as the eigenvalues of Q * are positive. Therefore the three properties required of Q * in order to be able to obtain a rate matrix using the Taylor expansion in equation (4) are that its eigenvalues are all smaller than one (but one that is strictly one), real, and positive. Complex or negative eigenvalues would correspond to oscillatory behaviors, which do not seem to reflect the biology. All the substitution processes I have tested so far for nucleotides, amino acids, and RNA basepairs correspond to real and positive eigenvalues for which the above method is applicable.
It is relevant to compare instantaneous rate matrix approaches to the approach used in the PAM amino-acid  substitution matrices [53]. The PAM matrices were not generated by calculating a rate matrix, but by estimating from a collection of highly similar sequences the substitution matrix for the time of one substitution per site ≡ Q t = 0.01 , and then calculating Q t at any other (integer t) time by multiplication. This is a discrete approximation that converges to the same answer given by the rate method for very small time units. PAM matrices have been criticized for not being able to capture the substitutions that are observed for more dissimilar sequences. BLOSUM matrices empirically outperform PAM in sequence homology searches, presumably because sequences at larger divergence times were used to calculate the BLO-SUM matrices. However, the BLOSUM method is not a time dependent continuous model but a very coarsegrained discretization. There are ways of combining the best of both approaches (more divergent sequence for training and a continuous-time model) to generate rate matrices, for instance by using the resolvent method [54], or using maximum likelihood methods as in the WAG matrices [21]. However, it is also possible to take a discrete BLOSUM matrix, for instance BLOSUM62, and convert it to an underlying rate matrix. The BLOSUM62generated rate matrix obtained using equation (4) is shown in Figure 1.
A rate matrix can also be derived from the PAM data by various methods. One exact method is to do an eigenvalue decomposition as presented in [52]. Recently, other methods have been proposed to calculate a rate matrix from the Dayhoff data [55]. These methods still assume that R ≈ ( -I) which corresponds to taking only the first term in the Taylor series for the logarithm in equation (4). This assumption is good only for very closely related sequences. Using the Taylor series allows one to estimate, using the same input data and avoiding the calculation of eigenvalues, the rate matrix to any desired level of precision, independent of the degree of similarity in the training set.
Notice that the rate matrix obtained using BLOSUM62 ( Figure 1) has two off-diagonal negative entries (and if we use more divergent BLOSUM matrices we have more negative off-diagonals). Off-diagonal entries of the rate matrix have to be positive so that I + δtR can be interpreted as a substitution matrix for very small times δt. This problem is not unique to sequence data. The construction of rate matrices for a Markov process from empirical data using a generating time is also used in mathematical modeling of financial processes such as credit risk modeling [56,57]. In the world of mathematical finances the problem is referred to as the regularization problem. I will use one of following regularization algorithms presented in [57]. The QOG algorithm (quasi-optimization of the generator) regularizes the rate matrix. The QOM algorithm (quasi-optimization of the root matrix) leaves the rate matrix unchanged and regularizes the conditional matrix at a given time if any negative probability appears. Using Rate matrix generated from BLOSUM62 Figure 1 Rate matrix generated from BLOSUM62. Rate matrix obtained from the amino-acid substitution matrix BLOSUM62, rescaled to have an average number of one substitution per amino acid. Notice in bold the two off diagonal negative entries.
the QOG algorithm we obtain a regularized version of the rate matrix using BLOSUM62, which is given in Figure 2.

Regularization algorithms
Here I reproduce the QOG and QOM regularization algorithms. The proofs for these algorithms can be found in [57]. The QOG algorithm regularizes each row of a rate matrix independently. Given a row in a rate matrix R, r = (r 1 ,...,r n ) ≡ (R(i, 1) ..., R(i, n)), (5) the QOG algorithm solves the problem of finding the vector at the minimal Euclidean distance from r such that the sum of all its elements is zero, and all elements but one are positive.
The steps of the QOG algorithm are: 1. Permute the row vector so that r 1 = R(i, i).
2. Construct the vector w, such that w i = r i -λ, where .
6. Construct the vector 7. The regularized row is given by r ← P -1 ( ). Finally reverse the permutation of step (1).
The QOM algorithm regularizes each row of a conditional matrix independently. Given a row in a conditional matrix Q t r = (r 1 ,..., r n ) ≡ (Q t (i, 1),..., Q t (i, n)), (6) Regularized rate matrix generated from BLOSUM62 Figure 2 Regularized rate matrix generated from BLOSUM62 Regularized rate matrix generated from BLOSUM62 after the QOG algorithm has been applied. The matrix has been rescaled to have an average number of one substitution per amino acid. In this simple case in which there was at most one negative off-diagonal entry per row, the regularization process requires the negative off-diagonal value to be set to zero (represented in bold in this Figure), and to shift the rest of the elements in that row by the corresponding amount so that the sum of all elements is zero. Rows without any off-diagonal negative values remain unchanged from the values obtained in Figure 1.  the QOM algorithm solves the problem of finding the vector at the minimal Euclidean distance from r such that the sum of all its elements is one, and all elements are positive.
The steps of the QOM algorithm are: 1. Construct the vector w, such that w i = r i -λ, where .
2. If all w i are non negative, r ← w is the new regularized row.
3. Otherwise, obtain the permutation , such that .

A 4 × 4 example starting from joint probabilities at a given generating time
As an review of these techniques, I will use a set of 4 × 4 single-nucleotide joint probabilities P(i, j|t * ) for i, j = {a, c, g, t} at a particular generating time t * to construct the corresponding rate matrix.
In this example, the joint probabilities at the generating time using the matrix notation P * (ij) ≡ P(i, j|t * ) are given by, These 4 × 4 pair-nucleotide probabilities are taken from the program QRNA. They were calculated according to [16] by marginalizing codon-codon joint probabilities which were constructed from the BLOSUM62 matrix of amino acid substitutions. These 4 × 4 probabilities can be viewed as a particular example of the REV model [44]. Note that the sum of all elements of P * adds up to one, and the matrix is symmetric.
The marginal probabilities defined as p i = ∑ j P(i, j|t * ) can be calculated from the joint probabilities to be, p = (p a , p c , p g , p t ) = (0.2836, 0.2311, 0.2531, 0.2322). (8) Similarly, the conditional probabilities P(j|i, t * ) can be calculated from the previous joint and marginal probabilities using the relationship P(i, j|t * ) = P(j|i, t * ) p i . Using the matrix representation Q * (ij) ≡ P(j|i, t * ) we have, Notice how the sum of the elements in each row adds up to one. Notice also how Q * is quite different from the identity matrix, which means that we have started with a quite divergent generating time.
If we assume a homogeneous Markov substitution process, we can interpret the conditional probabilities Q * as the matrix of substitution probabilities at the generating time. Thus, we can characterize the underlying evolutionary process by its instantaneous rate of evolution, which can be calculated from Q * using equation (4). The resulting rate matrix R (up to an arbitrary scaling factor t * ) is given by, This rate matrix has all the good properties: (i)"Normalization": the sum of the elements of each row is zero. (ii) "Reversibility": p i R ij = p j R ji . The process is reversible by construction because we started with symmetric joint probabilities. (iii) "Saturation". The rate matrix converges at time infinity to the given marginal probabilities in equation (8). We can test saturation by using equation (2) and calculating the substitution matrix for a very large time. For instance, for t = 10t * we have Saturation (or stationarity) of a Markov process is a necessary consequence of (i) normalization and (ii) reversibility. Appendix A shows a derivation of the previous statement which was useful for me (and hopefully for some readers) when studying the behavior at t = ∞ of more complicated evolutionary models. Therefore, starting from joint probabilities as in this example, we can always interpret the marginal probabilities as the stationary probabilities of the evolutionary process.
In summary, starting with a single set of joint probabilities at one particular generating divergence time t * , we calculate the joint probabilities at any other arbitrary time, assuming an exponential model of evolution. To that effect, given the particular set of joint probabilities (7) we have calculated the corresponding rate matrix (10) by Taylor expansion. Thus we can estimate the substitution matrix/conditional probabilities at any other arbitrary time, simply using equation (2), and reconstruct the joint probabilities at any other arbitrary time. For instance, for t = 0.3t * we obtain, This method allows us to evolve pair emission probabilities corresponding to different processes (in addition to the 4 × 4 nucleotide emissions) for instance 20 × 20 amino acid-to-amino acid joint emission probabilities, 64 × 64 codon-to-codon joint emission probabilities, or 16 × 16 RNA basepair-to-basepair joint emission probabilities. Thus, this method is useful to be applied in combination with pair HMMs or pair SCFGs already parameterized at one fixed divergence time to make their emission probabilities a time-dependent family.

The evolution of emission probabilities with indels treated as an extra character
Substitution processes (even if describing multi-nucleotide events such as codon evolution or RNA basepair evolution) are not enough to describe the full evolutionary relationship between two biological sequences. We also need to consider indels, for which we need to introduce more complicated models of evolution than the one described so far.
Indels have traditionally been a problem for phylogenetic methods. Programs to construct phylogenetic trees from data such as PHYLIP [58], PAUP * [59], and other phylogeny packages [60][61][62][63][64] treat gaps as missing data. The theoretical description of the evolution of gaps in a probabilistic fashion reached a landmark with the Thorne/Kishino/Felsenstein (TKF) model [30,31]. The TKF model however is hard to implement in combination with a probabilistic model such as an HMM, although an active area of research exists in that direction [36,39,40]. A more direct attack to the problem of introducing phylogeny into existing probabilistic models originated with the concept of tree HMMs [34,35]. The tree HMM method models the evolution of the parsing of different sequences through an HMM. This approach is more related with the evolution of transition probabilities, and I will discuss it later on in this paper.
Here I am going to describe a method for the evolution of indels under the assumption that they behave like an additional residue added to a N × N residue substitution matrix. This is a simplification of the problem because it forces indels to have linear penalties (that is, the cost of opening an indel in an alignment or the cost of extending it with one more indel character is the same) and to behave independently of each other (that is, successive indel characters in one sequence will be treated as independent events, rather than as a single indel of n residues long). Despite its apparent simplicity, this approach poses interesting problems in parameterizing evolution.
Let us review some of the implications of insertion and deletion processes. The treatment that pair models give to pairwise alignments can be interpreted (if we assume reversibility, as is the case here) with all generality as if one of the sequences is the ancestor of the other one. For any two aligned residues we assume that they can be related by a substitution process. For a residue aligned to a gap we assume that either a residue in the ancestor was deleted in the descendent sequence, or that a residue not present in the ancestor appeared in the descendent sequence.
An stochastic insertion-deletion process also involves insertions followed by subsequent deletions. These events leave no trace in pairwise alignments because alignments usually do not retain gaps aligned to gaps. However, when we are treating indels as an extra character, we have to account for such events.
If we were given ideal alignments with all their gap-to-gap aligned columns we could estimate from data the (N + 1) × (N + 1) extended joint probabilities at a generating time, . Because that is not the case, we need to make some inference about . Let us represent with ∆, such that 0 ≤ ∆ ≤ 1/2, the expected frequency of observed gaps with respect to the total number of residues in pairwise alignments at a particular time t * . The parameter ∆, can be estimated from data, or it could be estimated according to the TKF model [30] as .
if we knew the values for the rate of insertions λ and the rate of deletions µ, such that 0 <λ <µ.
Let us represent with ∆' the expected frequency of missing gap-to-gap aligned columns in a pairwise alignment at a particular time t * . One can estimate ∆' as the expected length of insertions that were later deleted without leaving any trace in current sequences. The probability of a stretch of l gap-to-gap characters is given by the geometric distribution density ρ(l) = (1 -∆ 2 ) ∆ 2l . Therefore ∆' is given by, Using these two parameters and the joint probabilities in the absence of gaps at the generating time P * ( ) we can construct the set of (N + 1) × (N + 1) extended joint probabilities at t * as where we have assumed independence for the joint probability of a residue and a gap. The normalization factor Ω = 1/(1 + 2∆ + ∆') represents the fact that the observed ∆ is different from the value we would have obtained had we known the complete alignment.
Another implication of insertion and deletions appears in the behavior of the marginal probabilities of single residues and indels. At t = 0 when sequences have not yet diverged, the marginal probability of finding a gap in an alignment should be zero. In the limit t = ∞, the pairwise alignment of two finite-length sequences is going to be dominated by gap-to-gap alignments, which implies that as the divergence time increases the marginal probability of a residue becomes negligible, while the marginal probability of a gap becomes one in the limit t = ∞. Our evolutionary model has to be able to accommodate such saturation frequencies.
A step-by-step description of the algorithm for the evolution of gaps as an extra residue I will start by describing the steps to implement the method before explaining how to derive those steps. This method can be applied starting from two different situations: starting from a N × N set of joint probabilities at a generating time that need to be extended to allow indel characters and evolved with time; or starting from a given N × N rate matrix that needs to be extended to allow indel characters.
Suppose we start with a N × N set of joint probabilities P * at a generating time t * , where p stands for the marginal probabilities and Q * represents the set of conditional probabilities associated with P * .
1. Extend the joint probabilities at the generating time t * to a (N + 1) × (N + 1) matrix of joint probabilities of the form, where ∆ is a parameter which represents the expected frequency of gaps with respect to the total number of residues in an pairwise alignment at t * , and which satisfies the condition 0 ≤ ∆ ≤ 1/2. The parameter ∆' is given in terms of ∆ as , and the normalization constant is The indices with hats ( ) stand for the N residues, and exclude the gap character, which I represent with the symbol -.  On the other hand, while remaining "quasi-stationary" the background frequencies evolve from ( , 0) at time zero towards "all gaps" at time infinity, i.e. lim t → ∞ Λ t = 1. This behaviour at time infinity is the consequence of the particular value of q 0 selected in the previous step. 6. Finally, construct the evolved (N + 1) × (N + 1) joint probabilities at arbitrary time as The expression for Λ t in equation (24) guarantees reversibility, that is, that the extended constructed according to the above expression are symmetric.
For the other starting situation, in which we have a N × N rate marix R, the procedure to generate a (N + 1) × (N + 1) quasi-stationary reversible evolutionary model is the following: The instantaneous rate is given by, Thus β is the instantaneous rate of deletion of a character, while -β(1 -q 0 ) is the rate of insertion of character . (More complicated models in which the rate of deletion is different for different characters are also possible.) Notice that q 0 = 1 corresponds to the case in which the rate of insertions is zero.

Find
analytically, if an analytic expression for R ε is given by solving the differential equation d ( , or numerically, proceeding as in step (3) of the previous procedure.

A 5 × 5 example starting from joint probabilities at a given generating time
We start with the generating joint probabilities P * in the 4 × 4 example in equation (7), which we want to extend to a 5 × 5 matrix by adding a gap character. For this example, I have selected the arbitrary value for the gap parameter ∆ = 0.18.
The matrix of conditional probabilities at time zero using expression (22) is given by, The rate matrix for this example, calculated using the Taylor expansion described in equation (18) takes the value, One should not be concerned to see a whole row of zeros for this rate matrix. For this generalized model the instantaneous rate of evolution is not directly given by the rate matrix; instead, the instantaneous rate of evolution is given by, In this example, the instantaneous rate of evolution takes the form, One should not be concerned either by having some negative off diagonal components. For small times δt, the conditional matrix is given by, Therefore, in order to have a proper matrix of conditional probabilities for sufficiently small δt, it is necessary to satisfy the following condition for each pair of indices i, j, (36) In this case, the off-diagonal components of the last row of Q 0 are non-zero, which allows us to have negative offdiagonal elements for that row in the instantaneous rate matrix Q 0 R ε .
With the 5 × 5 rate matrix in hand, we can apply steps (3) and (4) to obtain the conditional probabilities at any arbitrary time . For instance for t = 0.3t * we obtain the following evolved conditional probabilities: The quasi-stationary marginal probabilities are con-   .   Notice that this matrix is symmetric, which is the result of having imposed reversibility for any arbitrary divergence time.
We can also see by calculating the conditional probabilities at large divergence times how these probabilities evolve towards their saturation values given by (0, 0, 0, 0, 1). For instance, for t = 30t * we have,

An example starting from a rate matrix: The Jukes-Cantor model extended to gaps
As an example of a situation in which we start with a rate matrix, let us consider the generalization of the Jukes-Cantor model [42] to a 5 × 5 evolutionary model with a gap character. The original Jukes-Cantor model assumes that all nucleotides mutate at the same rate α > 0 which is represented by the rate matrix In this simple case the conditional matrix Q t = e tR can be found analytically by solving the matrix differential equation . Because of the symmetries of the problem we can write with the condition r t + 3s t = 1. We then obtain the following differential equations and the solutions are, By taking the limit t = ∞ in the previous two equations, one can see that the saturation frequencies of the Jukes-Cantor model are p i = 0.25 for i = a, c, g, t.
The 5 × 5 extended Jukes-Cantor rate matrix R ε is constructed by adding a rate of mutation to a gap represented by the quantity β ≥ 0 which in principle we will assume is different from the rate of substitutions α, We also introduce the matrix at time zero Q 0 which depends on the probability parameter 1 ≥ q 0 > 0, where the particular case q 0 = 1 is only allowed if simultaneously β = 0, and corresponds to a trivial extension of the original Jukes-Cantor model in which the gap character does not evolve.
The conditional matrix at arbitrary time is given by = Q 0 . The symmetries of the problem in this case allow us to parameterize as with the conditions r t + 3s t + γ t = 1 and 4ξ t + σ t = 1.
Introducing the matrix M t ≡ , we can parameterize . .
The differential equation to calculate M t takes the form = R ε M t , which translates into the differential equations, Which are satisfied by γ t = 1 -e -βt . (58) And in addition we have (60) In the limit case β = 0, the solutions for r t and s t reduce to those of the original Jukes-Cantor model with the trivial additions of σ t = 1, ξ t = 0 and γ t = 0, after setting q 0 = 1.
The extended Jukes-Cantor model depends on three parameters: the rate of nucleotide substitution α > 0, the rate of nucleotide deletion β ≥ 0, and the parameter 1 ≥ q 0 > 0. What is the meaning of q 0 ? q 0 controls the saturation frequencies (i.e. the background frequencies at time infinity), as well as the background frequencies at any other finite time. For β > 0 and 1 >q 0 > 0, taking the limit t = ∞ in equations (56)-(60), one can see that the saturation probabilities are given by (0, 0, 0, 0, 1). At any other finite time, the background frequencies of the model are quasistationary with respect to the background frequencies of the original Jukes-Cantor model, and are given by Imposing the reversibility condition in partic- Therefore q 0 controls how fast the background frequencies approach the saturation probabilities (0, 0, 0, 0, 1) through the factor Λ t . For a given β, the larger q 0 , the faster Λ t approaches one. (Note that Λ t always approaches one as t goes to infinity.) At first glance, it looks like q 0 could take any value including one in the solution for the extended Jukes-Cantor model. q 0 = 1 would result in fixed background frequencies of the form (0, 0, 0, 0, 1), which is an undesirable result, and the value q 0 = 1 would have to be excluded when β > 0. In fact, the limit to the ungapped Jukes-Cantor model has to be taken by setting β = 0 first, and then q 0 = 1. In that way, Λ t = 0 for all times, which is the correct result for the original Jukes-Cantor model.

Derivation of the algorithm for a (N + 1) × (N + 1) quasi-stationary and reversible evolutionary process
Unlike the ungapped N × N case in which the marginal probabilities are time independent, in the presence of gaps the marginal probabilities have to evolve with time.
In fact, as I discussed earlier, the marginal probability of a gap (-) has to evolve from zero at time zero to one at time infinity. As a result of that observation, probabilistic evolutionary models with Q 0 ≠ I are necessary in the presence of gaps in order to maintain reversibility. The reason for this requirement is the following: for an evolutionary model of the form e tR , reversibility implies that there is some p * such that p * Q * = p * [see Appendix B, equation  where Notice that Q 0 may be inverted as long as 0 <q 0 ≤ 1.
With respect to the marginal probabilities we have that at the generating time t * because of the way the extended probabilities P * were constructed we imposed a quasi-stationary behavior of the form, where The generalized conditional matrix in (64) also saturates at very large times, and the saturation probabilities (i.e. the marginal probabilities at infinity) are given by those of the rate matrix, that is lim t→∞ = lim t→∞ (see Corollary A.1). Because of the relationship in equation (66) between the rate matrix and the matrix , the saturation probabilities are given by the condition (see Appendix B), A pair-HMM model Then using equation (71) we can see that the saturation probabilities maintain the quasi-stationary property that was imposed at the time instance t * , and are given by, where As I discussed before, it is reasonable to impose that at infinity all we find is gaps, i.e Λ ∞ = 1.0, which implies η = 1 and Notice that because the relationship given in equation (14) between ∆' and ∆, then 0 <q 0 < 1.

For an arbitrary time we have the reversibility relationship
This equation is satisfied by construction in the N × N subspace. By inspecting the implications of the above equation for the gap index, we obtain an expression of Λ t (the marginal probability of a gap) at arbitrary time that allows us to have quasi-stationary reversible evolution. The function Λ t is given by,

Evolutionary model for transition probabilities
The standard way in which comparative probabilistic models allow for insertions and deletions is by introducing several additional states with their corresponding transition probabilities. For instance, in a pair-HMM for sequence alignment ( Figure 3) the presence of gaps requires the introduction of two states ("X" and "Y") which emit a nucleotide in only one of the two sequences. The probabilities associated with transitioning in and out of those states control the "gappiness" of the alignment. Therefore the evolution of these parameters with time is necessary in order to model different degrees of sequence divergence.
There has been a continued effort on improving the accuracy of the evolution of emission probabilities (i.e. substitution matrices) such as allowing correlations between the rates at different sites [65,66], improvements in the deri-vation of rate matrices from sequence data [23,67], or estimating multiple nucleotide changes [68]. In comparison, the ideas to describe the evolution of transition parameters in probabilistic models are much less standardized [34][35][36][37][38][39][40].
The goal of this section is to describe the evolution of transition probabilities. For instance, in the pair-HMM of Figure 3 the transition probabilities from the "XY" state to the "X" or "Y" states describe the introduction of gaps in one of the two aligned sequences, using an affine penalty. These transitions should be zero when the sequences have not yet diverged (time zero), but they should be maximal at infinite divergence. In between these two extremes, it is desirable to model the transition probabilities changing with divergence time. These methods are termed "evolutionary" because the transition probabilities will be parameterized with time, using functions that are generalizations of the Markov process that probabilistic evolutionary models assume for substitutions. Unlike the TKF model [30,31] and other related evolutionary models [32,33,41], the approach presented here will not describe the actual underlying evolutionary process that may have generated one sequence from another.
The tree-HMM method [34,35,37] is possibly the method closest to what I develop here. A tree HMM tries to model the phylogenetic relationship between related sequences by modeling the parsings of different sequences through the model. In a tree HMM it is not the actual transition probabilities of the HMM, but the parsing of the different sequences through the models that are evolved using rate matrices that resemble the diagonal rate matrices introduced in the first of the methods described below. Here I want to generate pair or profile probabilistic models that when comparing two related sequences are able to accommodate to the degree of divergence observed between the two sequences, and I intend to do that in a continuous-intime and probabilistic fashion, using the smallest possible number of free parameters. No evolutionary history of individual insertion/deletion events will be generated; only a posteriori would an evolutionary history be established by comparing sequences (in the case of a profile model) or alignments (in the case of pair models) generated by the model at different times.
I present two methods to evolve transition probabilities. One of the methods considers the evolution of a vector of transition probabilities. In this method, the value of the transition probabilities at time zero and time infinity are input parameters, which gives a relatively large number of free parameters. In the second, more restrictive, method the transitions associated with several states are assumed to evolve under the same evolutionary process. This condition constrains some of the free parameters, but does ( ) ε ε ε 1 76 not fix them all completely. When the more restrictive conditions are used, both algorithms give the same results. These two algorithms are applicable to most pair and profile probabilistic models, be they HMMs or SCFGs, generalized or not. I present an example of the evolution of a vector probability vector for a pair HMM, and an example of the evolution of a matrix of transition substitutions for a profile HMM.

Evolution of a vector of transition probabilities A step-by-step description of the algorithm
Let us start by providing the recipe to apply the algorithm: 1. Given a transition probability vector 2. Assume its set of values is known at the three particular time instances of t = 0, t = t * , and t = ∞, named q 0 , q * , and q ∞ . Assume each component i in these probability vectors satisfies one of the following three conditions, 3. If the three input vectors satisfy the condition, where r > 0 is a real number independent of i, then calculate q t at an arbitrary time t (0 <t <t ∞ ) as Normalization of the vector q t is guaranteed by equation (79).

Otherwise q t is given by the following expression
where the function w t is given by

An example: evolution of the transition probabilities of a pair-HMM "XY" state
Consider the transition probability vector associated to the "XY" state of the pair-HMM given in Figure   applied to any full set of transition probabilities emerging from a particular state in a given probabilistic model that must evolve with time.

Connection with a tree-HMM 2×2 match-transition matrix
In the original representation of a tree-HMM [35] the idea of a match-transition matrix is introduced. If one parse through the HMM generated a Match to Match (MM) transition, while another parse through the model generated a Match-to-Delete (MD) transition, one can consider the substitution of MM by MD similarly to a substitution of residues by the conditional probability P (MD|MM, t). This leads to the concept of a 2 × 2 match-transition matrix given by, which in [35] is parameterized with two real numbers r ≥ 0, and 0 ≤ a ≤ 1 as Tree-HMMs model the evolution of paths though the HMM. In contrast, the method proposed here models the evolution of the transition probabilities of the model themselves. However, one can see that the match-transition matrix is closely related in form to the model we have proposed here. Introduce the probability vectors, For t = 0 and t = ∞ they have the following values, It is easy to see that the match-transition matrix given in equation (87)

Derivation of the algorithm to evolve a vector of transition probabilities
To describe the evolution of transition probabilities, the simple exponential models used for substitution matrices are not sufficient. I propose to adopt a generalization of the exponential model of the form, where I is the n × n identity matrix, r is a vector still to be identify, and a R is a n × n rate matrix.
This model simply adds to the exponential term a time independent vector a, Because q t=0 = q 0 , then it is necessary that a = q 0 -r, thus giving the expression in equation (92). Note that this is the most general solution of a differential equation of the form ∝ (q t -a). Until now it was always assumed that the constant term was zero, that is r = q 0 . The freedom added by including a term constant in time is that, while before the behavior at t = ∞ was solely controlled by e tR , now the additional term also contributes to that limit.
An immediate consequence of this generalization is that the rate matrix is not now sufficient to determine the whole evolutionary process. In addition to the rate matrix, the probability vector must also be specified at time zero (no divergence) and at time infinity (all mutations have saturated) such that, The exponential of the rate matrix R has the general form, for some real eigenvalues . If conditions are restricted to the case in which k i > 0, ∀i, the immediate consequence of working with positive eigenvalues is that, There is then a simple relationship between the vector r and the values of the probabilities at time zero and saturation, Therefore, we can write with all generality However, for the given information (q 0 , q * , q ∞ ), the timeparameterized vector q t in (98) is still underdetermined.
In order to reduce the amount of freedom, I assume that e tR is diagonal (i.e. U = I). Diagonal rate matrices have been used in other contexts of generalized evolution such as the tree-HMM model [34,35]. Then we have, At this time the known probabilities at the generating time t * , q * , have not yet been used. These are, Thus we obtain which can be solved for k i , The condition k i > 0 translates into 0 < < 1, or This condition has two solutions: q 0 (i) <q * (i) <q ∞ (i) or q 0 (i) >q * (i) >q ∞ (i). (104) Even though this model was derived under the conditions of equation (104), it also extends to the degenerate case where for some i we have since this simply corresponds to these parameters undergoing no evolution at all.
Therefore if the input column vectors satisfy one of the three previous conditions for each one of their elements, the parametric expression is A normalization condition has not yet been imposed.
Using the unity vector u T = (1,..., 1), normalization requires that For an evolutionary model of the form q t ∝ e tR normalization requires that e tR u = u for arbitrary times. I refer to this property as the strong normalization condition. The normalization of a generalized evolutionary model of the form requires the weaker condition (q 0 -q ∞ ) T e tR u = 0. This property is always true for a rate matrix that satisfies the strong normalization condition. I refer to this property as the weak normalization condition.
In order to obtain the strong normalization condition automatically it is necessary to have a rate matrix of the form R = U diag (0, -λ 2 ,..., -λ n ) U -1 (see Appendix A, equation (193)). Such a type of rate matrix is not appropriate to describe the evolution of a probability vector, since such rate matrix cannot be uniquely inferred from the three input probability vectors q 0 , q * , q ∞ . For that reason, I have explored the use of rate matrices of the diagonal form R = diag (-k 1 ,..., -k n ), which can be inferred from the three input probability vectors q 0 , q * , q ∞ using expression (102). Such diagonal rate matrices, however, in general do not satisfy the strong normalization condition, thus the weak normalization normalization condition must be obtained by other means.
If the input vectors satisfy the condition for all i, for some real number r > 0, then the rate matrix has the particular form R = diag(-r,..., -r), the function w t = 0 for arbitrary times, and the weak normalization condition is satisfied automatically.
The previous condition is in general too restrictive. If the previous condition is not satisfied, by construction w t is zero at t = 0, t = t * and t = t ∞ , but in general w t ≠ 0 for n > 2. Normalization is then achieved (107) by modifying our definition of q t to The final expression is

Evolution of a matrix of transition probabilities
In some cases, several states of a model correspond to a particular evolutionary event, and it seems natural to expect that their transitions would evolve under the control of the same rate matrix. For instance, in a profile HMM ( Figure 4) I will consider the joint evolution of the transitions of three states associated with a given consen-  We want to describe the evolution with time of the transition probabilities. For that purpose, I will use the m × n matrix of transitions Q t such that, Note that an evolutionary model of the form Q t = Q 0 e tR , like that used for evolving gaps as extra characters, is not sufficient. A model of the form Q t = Q 0 e tR in the limit of infinite divergence would necessarily result in transitions that for a given end state are all identical and independent of the previous state. That is clearly too restrictive for most models, for instance in a profile HMM, in which some of the transitions are not evolved and are set to zero.
In order to allow for more general saturation properties of the transition probabilities, I propose the following model for the evolution of the matrix of transition probabilities, where R is the n × n rate matrix, and the m × n matrix K is still to be determined. This extension (as in the vector model proposed before) corresponds to adding a constant term A = Q 0 -K, and it is is the more general solution of a differential equation of the form ∝ (Q t -A).
We will see that K(e tR -I) = (Q 0 -Q ∞ )(e tR -I), thus As in the previous case, a freedom provided by the additional constant-in-time term is that while the saturation behavior of Q 0 e tR is controlled by the saturation probabilities of e tR , the model given by equation (116) is independent of those saturation probabilities so that the probabilities at infinity can be set arbitrarily. That is, assuming that ψ is the n dimensional vector of saturation probabilities of e tR , Notice that while Q t , Q 0 , Q ∞ and K are m × n matrices operating in the S × E space, the matrices R and e tR are square n × n matrices operating in the E × E space. In fact, e tR determines the change in time that a transition probability into one of the E states experiences and in which fashion that change is absorbed by the transition probabilities into any other E state.

A step-by-step description of the algorithm
The recipe to implement the algorithm is as follows: 1. Assume we know the m × n (m ≤ n) matrices of transition probabilities at time zero Q 0 and at time infinity Q ∞ , such that the rank of Q 0 -Q ∞ is m.
2. If an analytic n × n rate matrix R is given, one can find the analytic expression for e tR by solving the differential equation d(e tR )/dt = Re tR , and jump to step (6). For a numerical solution jump to step (5).
3. If the information given is the set of transition probabilities at a generating time t * , calculate the rate matrix R as, The n × m matrix O is obtained by solving the set of linear equations This condition [necessary so that ψ T R = 0] imposes constraints between the set of probabilities at time zero, at time t * , and at time infinity. 6. Finally, calculate the set of evolved transition probabilities as,

Consider the following the transition matrix
This transition matrix, like a nucleotide substitution matrix, adds up to one by rows. We assume as in HMMER [4] that there are no transitions between the Insert and the Delete states, but the model could work under more general conditions.
The matrix at time zero is given by Let us assume that the rate matrix is given by for some parameter α > 0. This rate matrix assumes that the rate of change in the occurrence of state M' is similar to that of state D' and that of state I, and that this change reverts equally into the other two states. More realistic situations can be achieved using rate matrices depending on more parameters.
The instantaneous rate of transition change is given by, This matrix gives the instantaneous change that a transition probability experiences under this model and describes how that change is transferred to the other allowed transition probabilities.
The matrix e tR can be obtained analytically by solving the differential equation d(e tR )/dt = Re tR . This is a 3-dimen- Putting all together, we obtain the following evolved transition probabilities for a profile HMM under a Jukes-Cantor like assumption for the rate matrix: Substituting the values for s t given in equation (133), we have the following evolved transition probabilities for a profile HMM, The evolution of different paths through the HMM In a tree-HMM one assumes that the different paths through the model are the objects that are subject to evolution [34]. Here we have directly modeled the evolution of the transition probabilities of the HMM. We can get an intuition for the meaning of these evolved transition probabilities by estimating how these evolved transition probabilities induce the evolution of different paths through the model. A process that is similarly to that modeled by a one-branch tree-HMM.
Suppose that at time zero, we emitted residue a from state M, and residue b from state M'. The model assigns to such sequence a probability given by, where p M (a) and p M' (b) represent the emission probabilities associated to the M and M' states respectively. Now suppose that at time t there has been an insertion of n res-idues in between the two matches a and b; the model assigns to such sequence a probability given by, where p I (i i ) represent the emission probabilities associated to the Insert state.
We can interpret that in time t the path through the model that generated ab has evolved into the path through the model that generated ai 1 ...i n b with probability given by To get a better intuition of what this means, take as an example the case in which the time interval t is very small.
Then the probability that a path between two matches in the HMM inserts n residues in time t ≈ 0 is This probability is proportional to 3αm I the rate of substituting a Match-to-Match transition for a Match-to-Insert transition, and to (1 -q I ), which is the geometric factor associated to an insert of length n at time zero.

An example: Evolution of the transitions of a profile HMM given the transitions at a generating time
In this case, we maintain the same values for the transition probabilities at time zero Q 0 and at time infinity Q ∞ , but the rate matrix will be obtained from a generating time for which we know the transition probabilities.
The set of linear equations in step (3) of this algorithm that determine the vector v T = (v 1 , v 2 , v 3 ) are The solution of these linear equations is   D I  I D   M  I I  I  I D  D  M  I I  I  I  I  In practice when I have tested that kind of situation, the rate matrices obtained are always not real, and therefore they lack any biological interpretation.

Derivation of the algorithm to evolve a matrix of transition probabilities
We start with a model of the general form where Q 0 is the known m × n matrix of probabilities at time zero, and the m × n matrix K must still be determined.
Assume that we know the transition probabilities at time infinity, which we represent by the m × n matrix Q ∞ . Then, because of the asymptotic behavior of the exponential family e tR , lim t→∞ e tR = u n ψ T for some n dimensional saturation probabilities, where ψ T = (ψ i ,...,ψ n ), and the n dimensional unity vector = (1, ..., 1), we have This equation implies that where is a m dimensional vector that represents the sums by rows of K, i.e. ∑ j K(i, j) = which we impose to be different from zero.
Because for the exponential family e tR we have the reversibility condition ψ T e tR = ψ T for arbitrary time, introducing the expression for K in equation (170)  Therefore if given Q 0 , Q ∞ and a n × n rate matrix R, which satisfy the reversibility conditions ψ T R = 0, we can calculate the evolved transition probabilities using the equation (171).
In the case in which the information given is the set of transition probabilities at a generating time t * , designated by Q * , the calculation of the rate matrix R involves the following steps: (a) The m × n matrix K is by construction invertible because we have imposed ≠ 0, for all rows i.
A little aside with respect to matrix inversions is in order here. The (unique) inverse of a matrix is defined only for square matrices. One can introduce a inverse-like matrix for a non-square matrix; these are called pseudoinverses [69]. The pseudoinverse of a non-square matrix is not unique and many pseudoinverses can be defined; one of the best known is the Moore-Penrose matrix inverse [70].
We will see how despite the fact that the pseudoinverse of K is not unique, we can still define Q t uniquely.
Therefore solving for R in equation (114) (177) is a set of homogeneous linear equations that together with the normalization conditions in equation (176) uniquely determine the vector v.
On the other hand, in order to satisfy KK -1 = I m×m , the following must apply: Equation (180)  is what makes the vector ψ that appears in the pseudoinverse K -1 irrelevant. Even though lim t→∞ e tR = uψ T , it is also true that lim t→∞ (Q ∞ -Q 0 )e tR = 0, so that the dependence on ψ disappears from the final expression of Q ∞ .

Reversibility and multiplicativity
For a given probabilistic model, imposing reversibility has different implications for its emission and transition probabilities. In pair models, we assume that the emission probabilities are reversible by imposing P(a t , b t+t' ) = P(a t+t' ,b t ), which corresponds to using symmetric joint probabilities represented by the shorthand notation P(a, b|t'). If the emissions do not involve gaps, the marginal probabilities do not evolve, and the evolved joint probabilities are obtained from the evolved conditionals and the saturation probabilities. In the presence of gaps, I have described how to construct the evolved conditionals and the corresponding evolved marginals in a way that maintains reversibility for any arbitrary time, so that we can construct evolved symmetric joint probabilities.
For transition probabilities the situation is different. Mathematically, a matrix of transition probabilities is like a substitution matrix (i.e. conditional probabilities) but there is not the equivalent of "joint" probabilities for transitions. To maintain reversibility for the transitions of a probabilistic model, one has to build reversibility in the design of the model. In particular, one needs to be sure that the transition probabilities that involve gaps lack any directionality. For instance, in the pair-HMM of Figure 3 we need to impose that for arbitrary times. That is achieved by making sure that the input transition probabilities at time t * , zero and infinity do lack directionality.
Another property of probabilistic models of evolution for residue substitutions is multiplicativity. Multiplicativity is an immediate property for evolutionary models of the form e tR . For residue-substitution evolutionary processes, multiplicativity implies that the transition from one given event (say residue a) to another event (say residue b) in a finite time, if it goes through any intermediate state, has to be of the form of any other possible substitution. In mathematical terms, However, when allowing gaps, any intermediate evolutionary step can go through processes of deletions or insertions in addition to substitutions; therefore multiplicativity as described in the previous equation does not hold anymore. There is a natural explanation of why "substitutions-only multiplicativity" is modified when considering insertion and deletion events. Consider the evolution of gaps as single characters, which was introduced previously in this paper. The substitution matrix with gaps satisfies the relationship Analyzing this matrix equation by components and using the expression for Q 0 given in equation (22), the substitution of residue a into residue b in finite time t + t' has the following terms: The first term corresponds to pure substitution events of the form , and it is identical to equation (184). The second term modulated by the coefficient 1/q 0 (introduced in equation (65), which is part of the non trivial matrix Q 0 ) represents the event in which . The third term (preceded by coefficient (1 -1/q 0 )) represents the event in which . Note that this model would align at time t + t' residues which could have been derived by a gap intermediate. This is usually discouraged by evolutionary models that describe the evolutionary history of insertions and deletions, in which such event would be represented as . For the model at hand, the fact that a gap can revert into a residue is a consequence of treating gaps as an additional residue in a substitution matrix.
For the particular case of the generalized Jukes-Cantor model introduced before, it turns out that the two extra terms in equation (186) are independent of the particular substitutions and cancel, such that Therefore the generalized Jukes-Cantor model preserves multiplicativity. This results from the extreme simplicity of the model and is not true for more complicated models. For instance, for the rate matrix created from a particular Q * in the other example presented in this paper (which is a particular case of the REV model [44]), the two extra terms in equation (186) are different for the different nucleotide substitutions, and do not cancel out.
A more complicated situation appears for probabilistic models that introduce gaps in an affine manner. A given residue-to-residue substitution process that occurred in a finite time could have appeared from a very large number of intermediate situations in which stretches of other nucleotides could have been added or removed. The simple one-to-one correspondence that models of substitutions maintain through evolution does not exist in the presence of insertion and deletion events. This does not mean that evolutionary models with gaps are inconsistent, however some traditional properties of phylogenetic trees of single residue evolution such as the pulley principle [71] cannot be applied under the transition probability evolution models.

Conclusion
Motivated by the goal of making QRNA (a comparative probabilistic method for RNA genefinding) an evolutionary model, I have introduced several probabilistic methods to describe the evolution of insertion and deletion events. The methods introduced here have a larger scope than this program alone, and they can be applied to other pair probabilistic models and to profile HMMs and SCFGs as well.
I described an algorithm which addresses the evolution of gaps as an extra residue in a (N + 1) × (N + 1) substitution matrix. This method can be applied to the joint emission probabilities of pair models. This method allows us to maintain a stationary N-dimensional background distribution, while the actual (N + 1)-dimensional background frequencies evolve towards all gaps at time infinity. I call this process quasi-stationary. As an example, I showed an analytic solution for the Jukes-Cantor model extended to gaps.
I also presented two methods for the evolution of transition probabilities in a profile or pair HMM or SCFG, that are applicable to any probabilistic model that uses transitions between states to model insertions and deletions. In the first algorithm, the transition probabilities associated with one state in the model are evolved as a vector independently of the transition probabilities associated to any other state in the model. I also presented a second algorithm in which the transition probabilities associated with a given set of states co-evolve under the control of a single rate matrix. I presented an example of the application of these methods to a pair-HMM and to a profile HMM.
I have applied these methods to the program QRNA, which was the motivation for the development of the algorithms in the first place. QRNA contains three probabilistic models (the oth, cod, and rna models) that analyze the pattern of mutation of a given pairwise alignment to decide which of the three models best classifies the alignment. These models are a combination of generalized pair-HMMs and a pair-SCFG. Originally, this program assumed a fixed divergence time, and all the ( ) emission probabilities of the different models were tied to those of BLOSUM62. That produced a QRNA parameterized for highly diverse sequences, which in turn produced a large number of false positives for highly similar sequences. In the new program eQRNA, all emission and transition probabilities are a continuous time-dependent family able to match any possible degree of sequence divergence.
The three models of QRNA (the OTH, COD, and RNA models) need to be at approximately the same evolutionary distance, so that when a pairwise alignment is analyzed, the differences in scores of the models result from observing a different pattern of mutations (coding, RNA, or none in particular) rather than because one model favors more closely related sequences than the other. This model synchronization requires a number of QRNA-specific design elements which are tangential to the implementation of the evolutionary models for indels and transition probabilities presented in this paper. For reasons of clarity, I leave for another paper a detailed description of the particular implementation designs that went into eQRNA in order to make it fully evolutionary. In a nutshell, the transition probabilities of the OTH and COD models are evolved according to the algorithm to evolve vectors of transition probabilities, while the emission probabilities of those two models were evolved using the original QRNA parameters as the generating time of the respective rate matrix. In the RNA model, for the contextfree grammar component of the model, the transitions are fixed, and the evolution of gaps is accommodated by treating gaps as extra characters according to the method presented here for that purpose. The HMM component of the RNA model is parameterized with time similarly to the OTH and COD models. Preliminary results show an important improvement compared with the previous fixed-time implementation. The application of these evolutionary methods for other probabilistic models for sequence comparison beyond eQRNA should be tractable.
So far the methods presented here have been introduced only in profile and pair models. They could also be applied to probabilistic models where, instead of aligning two contemporary sequences, one aligns a sequence to an ancestor. The only difference with respect to an evolutionary pair model is that, in this case, the emission probabilities will be the substitution (conditional) matrices themselves instead of joint conditional-on-time probabilities. One important limitation of the methods presented here is that, in general, they lack the property of multiplicativity. In consequence, in order to extend the methods presented here to more than two sequences related by a phylogenetic tree, one would have to work with rooted trees. A future challenge is to incorporate these evolution-ary methods into multiple sequence probabilistic models that explicitly describe the phylogenetic relationship between the sequences.

Availability
The different models presented in this paper have been implemented in several small ANSI C programs. These are not fully developed software applications, but demonstrations (for those who want to avoid the mathematical descriptions) of how the different algorithms work. The programs are freely available at http://selab.wustl.edu/ publications/Rivas05/evolve.tar.gz.

Appendix A. Conditions for the saturation of a generalized substitution matrix
In this appendix I provide the conditions for saturation of a generalized evolutionary model of the form Q t = Q 0 e tR . Saturation can be described as for the unitary vector u, and a set of saturation frequencies at time infinity denoted by q ∞ , such that .
Here I show that saturation of Q t = Q 0 e tR is a necessary condition of two properties of the matrix Q = {Q(ij)}, normalization and positivity. I also show that the saturation probabilities of Q t are the same as those of e tR .
Proposition A.1. Consider first the simplest case Q t = e tR . Normalization, i.e. ∑ j Q(ij) = 1, together with positivity, i.e. Q(ij) > 0 ∀i, j, imply that a substitution matrix of the form Q t = e tR saturates to a set of probabilities at time infinity.
Proof. Normalization of the rate matrix, ∑ j Q(ij) = 1 implies that That is, λ = 1 is an eigenvalue of Q. It also has implications for the norm of Q, defined as the largest row sum of absolute values lim , , On the other hand there is an eigenvalue λ = 1 therefore σ(Q) = 1. (192) In consequence, Q has one eigenvalue, λ = 1, and all other eigenvalues are smaller than one.
Therefore because the substitution matrix is of the form Q t = e tR , it implies that the instantaneous rate matrix R has one null eigenvalue, and all the other are negative. If we assume that the null eigenvalue is not degenerate and that the negative eigenvalues are real, we can write with all generality, for some matrix U, and such that λ i > 0 for i = 2, ..., n.
On the other hand using equation (194) we obtain which implies that UΨ 0 is the eigenvector of Q corresponding to the eigenvalue λ = 1. According to (189) that is, Substituting in equation (195) we finally obtain, This is the saturation condition (188) for some saturation probabilities defined by q ∞ = U -1 .
Corollary A.1. For a generalized evolutionary model of the form Q t = Q 0 e tR , Q t also saturates at infinity, and the saturation probabilities of Q t are given by those of e tR , that is, Proof. Note that by construction Q 0 has to have the same normalization and positivity conditions as Q t . It can be shown that under those conditions, Q t = e tR also has to add up to one, summing by rows, and all its elements have to be positive. Therefore, using the result of Proposition A.1, Therefore which proves saturation for an evolutionary probabilistic process of the form Q t = Q 0 e tR . Then one can see that reversibility implies Proof. Summing one of the indices in the reversibility conditions and taking into account the normalization condition for the Q * matrix results in, which in vectorial notation takes the form Lemma B.2. If R = log Q * then the reversibility condition (202) for Q * implies that Proof. If R = log Q * then because of the Taylor series we have Because of the reversibility condition for Q * (202) it is also true that p * (i) (Q * -I) n (ij) = p * (j)(Q * -I) n (ji), (207) for n ≥ 1. Therefore it follows that p * (i) R(ij) = p * (j)R(ji). (208) In addition we can also see by inspecting equation (206) that the normalization condition for Q * translates into ∑ j R(ij) = 0 ∀i, which implies that Lemma B.3. If Q * is a conditional matrix that satisfies the reversibility condition (202) and R = log Q * then the saturation probabilities of R are given by the p * vector in (202), that is, proof. In order to prove that this is the case, we just need to find an example in which that statement is true. Consider again one particular instance with its corresponding marginal probabilities p * . Because the model is reversible for arbitrary divergence times, in particular there should be some p 0 probabilities such that . For this generalized model, the rate matrix is given by . Therefore it follow by Lemma B.3 that the saturation probabilities of R are given by the condition Therefore the saturation probabilities p ∞ are different from p * as long as p 0 ≠ p * . Therefore, we have constructed a parametric family, Q t = Q 0 e tR , in which the marginal probabilities for reversibility are p 0 at time zero, p * at t * , and p ∞ at time infinity, with p 0 ≠ p * ≠ p ∞ . Therefore if there is reversibility at arbitrary time, the marginals have to be time dependent, p t (i)Q t (ij) = p t (j)Q t (ji). (216) In particular in the Section "The evolution of emission probabilities with indels treated as an extra character" we have constructed a system in which the time-dependent reversibility condition (216) is satisfied by marginal probabilities that are quasi-stationary with respect to some (n -1) p 0 probabilities, p t (i) = p 0 (i) (1 -Λ t ), for i = 1,..