XRate: a fast prototyping, training and annotation tool for phylo-grammars

Background Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists. Results We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures. Conclusion Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.


The PROT3 grammar
The HOMSTRAD database is grouped into ten secondary structure classes and contains no two sequences with greater than 90% sequence identity. The 1032 alignments in the HOMSTRAD database were converted to Stockholm format. The DSSP secondary structure annotation was converted into a consensus annotation containing the three secondary structure states using the same mapping as described in the GTJ paper. The consensus was determined by majority; ie, the annotation within a column with the greatest frequency was declared the consensus.
The GTJ program and associated files were downloaded from the GTJ FTP site. The GTJ program and xrate were both evaluated on the xylanase alignment, here referred to as gtjxyl. In addition they were both evaluated on the "glycosyl hydrolases family 10" (referred to here as ghf10) alignment from HOMSTRAD (data not shown). This alignment is in the alpha-beta barrel class and includes the endo-1,4-beta-xylanase A protein from P. fluorescens. This alignment is very similar to the psefl alignment and so was used as an additional test of the performance of the GTJ method and xrate. The ghf10 alignment is part of the beta-glycanase SCOP family which also includes the ghf5 and ghf17 alignments in HOMSTRAD. All three alignments were removed when creating the training databases. Furthermore, the training databases were queried using BLAST to ensure that no homologs to the test sequences were found.
In order to train on the databases, xrate was first used to estimate a phylogenetic tree for each alignment. These trees were estimated by neighbor-joining followed by EM to optimize the branch lengths. A point substitution rate matrix, estimated from a 200-family subset of PFAM, was used for this step. This point substitution matrix was also used as the "seed" for training the rate matrices of the PROT3 grammar.
For most of the EM training, the "mininc" parameter was set to the default value of 0.001, so that for each iteration of EM, the fractional increment in the log-likelihood was required to be at minimum 0.001 if xrate was to regard the increase as an improvement. The "forgive" parameter was set to 5, meaning that xrate would allow five iterations without such an improvement before terminating.
The exact training set used by GTJ was not available, but a closely similar dataset was used for the purposes of comparison.
The model parameters used by the GTJ program were converted to phylo-grammar format. The substitution rates were scaled so that the average substitution rate for the loop state is 1, as per GTJ. The secondary structures predicted by xrate and the GTJ program were compared to the annotated "true" secondary structure. Statistics were calculated for the sensitivity and specificity of the predictions for the three secondary structure states as well as overall prediction accuracy.

The PFOLD grammar
We attempted to replicate the Pfold program using xrate by inputting the Pfold phylo-grammar (which consists of the SCFG, its production probabilities, the initial frequencies and substitution rates of base pairs and unpaired nucleotides, and a parameter denoting that the model is reversible) to xrate as a file in the S-expression format, then using xrate with this grammar to perform secondary structure annotation on a testing set containing multiple sequence alignments from Rfam. The results were compared with annotation done by the Pfold program (2003 version).
The annotation of the data by a stochastic grammar is often determined from the most optimal parse of the data according to the grammar. In the case of SCFGs, this parse is computed by the CYK algorithm. However, Pfold uses posterior decoding theory to find the annotation containing the highest expected number of correct predictions, summed over all parses. Our benchmarks indicate that this approach produces a greater basepair PPV, but lower sensitivity, than the CYK approach (data not shown). To enable consistent testing, we modified Pfold to revert to outputting the secondary structure using CYK, as xrate does. Conceivably, xrate could be modified to produce an annotation in some manner other than CYK. However, due to the limitless variety of objective functions which can be used to do posterior decoding, we have opted not to implement any single posterior decoding method, but rather included an option to print out the entire table of column posterior probabilities in the xrate output. This allows such methods to be implemented as an additional layer on top of the xrate software.
For both Pfold and xrate, a phylogeny tree is required for each alignment showing the evolutionary relationship among the sequences. These trees and their branch lengths were estimated by xrate using the Jukes-Cantor model. We then used these trees for both the Pfold and the xrate predictions. Pfold implements a slightly different tree construction method; we use this model here for simplicity and consistency between the two methods.
One final measure of data preparation must be done to make Pfold and xrate perform similarly enough for a benchmark: the difference in their handling of alignment columns containing gaps must be taken into account. Pfold employs a heuristic where columns containing gaps in 25% of the sequences or more are not considered for annotation, which generally increases PPV, but lowers sensitivity of base pair predictions. For consistency, we removed all such columns before showing the alignment data to xrate or Pfold.
The training set and testing set of multiple sequence alignments were prepared from Rfam version 7 as follows. 148 alignments of RNA families, including only experimentally-determined secondary structure annotations curated from published articles in the literature, were partitioned into 13 superfamilies according to the Rfam grouping: cis-reg (frameshift, IRES, riboswitch, thermoregulator, other) and gene (antisense, miRNA, ribozyme, rRNA, guide snRNA, splicing snRNA, sRNA, other). Note that tRNA and Group I catalytic intron alignments (Rfam IDs RF00005 and RF00028, respectively) were not part of these sets because they only had one alignment per superfamily. Alignments in each of the 13 superfamilies were randomly partitioned into two sets, with one added to the training set, the other added to the testing set. Pseudoknots, comprising a small fraction (175 of the 5780 base pairs) of the two sets, were removed from all annotations. Additionally, alignment RF00061 (containing 98 base pairs after pseudoknot removal) was removed from the test set because it was causing Pfold's implementation of the CYK algorithm to crash. Lastly, because Pfold's SCFG cannot generate single-nucleotide hairpin loops, the only instance of such a loop in the two sets -in RF00232 -was changed to a hairpin loop of 3 unpaired nucleotides by removing the annotation for the closing base pair.
The above procedure yielded a training set of 71 alignments and a testing set of 77 alignments. Trees for each alignment/family in the training set and testing set were estimated with xrate using the Jukes-Cantor model of nucleotide substitution.
Various pseudocounts (for the EM update statistics) and EM convergence criteria were tried to prevent division-by-zero errors and to maximize accuracy of the Rfam-trained Pfold phylo-SCFG. Empirically, we found that adding a pseudocount of 0.001 to start times and mutation counts gave the best results.
As the EM algorithm requires an initial substitution rate matrix (or, as in this case, matrices) to improve upon over its iterations, a seed phylo-grammar must be created. For the seed, we chose rate matrices where the mutation rates are all equal but low, so that the total mutation rate from any state is 1; empirically, we find that "low-balling" the seed like this improves the rate of convergence, as fewer spurious mutations are estimated in the initial rounds of EM.
xrate and Pfold predict slightly different basepair sets. xrate predicts 490 base pairs that Pfold does not, Pfold predicts 698 base pairs that xrate does not, and both predict 1487 same base pairs. The accuracy of the methods is nonetheless comparable.

The Phylo-EM algorithm
This appendix covers the spectral theory of continuous-time finite Markov chains (reversible and irreversible) for comparative sequence analysis, including calculations of the matrix exponential (using either Taylor series or Padé approximant with scaling and squaring, or eigenvector decomposition), Felsenstein's pruning and peeling algorithms for phylogenetic trees, and the Expectation Maximisation algorithm for maximum likelihood parameterisation.

Notation
Vectors are written in bold-face and lower-case (p, q) with single-subscripted elements in normal-face (p i , q j ).
Matrices are written in bold-face and upper-case (Q, R) with double-subscripted elements in normal-face (Q ij , R kl ).

The state space
In this section, "state" refers to a state of a continuous-time Markov chain, rather than a nonterminal symbol in a stochastic grammar (for which the term "state" is occasionally also used, particularly if the stochastic grammar is a Hidden Markov Model).
Suppose there are N states. (We assume for now that the state space is finite, though this can be relaxed. The state space must always, however, be discrete.) Let σ(t) be the state of the system at time t.
Examples of finite state spaces include the set of all nucleotides at a given site in an RNA sequence (N = 4), the set of all dinucleotide pairs (N = 4 2 ), the set of all tetranucleotides (N = 4 4 ) or the set of all stems of length L (N = 4 2L ).
Examples of infinite state spaces include the set of all nucleotide sequences and the set of all RNA secondary structures.

The rate matrix and the equation of state
Let Q ij be the instantaneous rate of mutation from state i to state j, where i = j. Let −Q ii = j =i Q ij be the exit rate from state i. Note that Q ii is negative. The N × N matrix, Q, is called the rate matrix.
Let p i (t) = P (σ(t) = i) be the probability that, at time t, the system is in state i. The N -element vector, p, is called the state vector.
We will be right-multiplying p by Q, forming the product pQ rather than the product Qp. This is a result of the way we have chosen to order Q (the rate from i to j is Q ij rather than Q ji ). Consequently we must treat p as a row vector rather than a column vector.
The equation of state for the continuous-time discrete-state Markov chain is This is a matrix ODE (ordinary differential equation). Expanding this for the j'th state, we obtain The first term represents mutations entering state j, while the second term represents mutations leaving state j.

Discrete-time analogy
We can get some insight into the behaviour of the equation of state (1) if we consider breaking up the time axis into short discrete steps of size ∆t. Let σ(m) be the state at the m'th step, i.e. at time T = m∆t, and let p(m) be the state vector. The process σ(m) is a discrete-time Markov chain, which may be more familiar as it is similar in some ways to the kind of Hidden Markov Model (HMM) that is used for bioinformatic sequence analysis (although in bioinformatics, the time variable actually means "position along the sequence").
If ∆t is small, the probability of transiting from state i to a different state j in one time-step is R ij = Q ij ∆t, while the probability of staying in state i is R ii = 1 − j =i Q ij ∆t = 1 + Q ii ∆t. Thus, the discrete version of the equation of state is p(m + 1) = p(m)R with R = I + Q∆t, where I is the identity matrix. Here R is the discrete-time transition probability matrix for short time intervals ∆t.
(Note that R = I + Q∆t is the first-order Taylor series approximation to the true discrete-time transition matrix, which is exp(Q∆t); see section 3.5. The approximation will be best for small ∆t.) The general solution to the discrete equation of state is p(m) = p(0)R m . Note that the continuous-time matrix exponential exp(Qt) is replaced by R m = (I + Q∆t) m in the discrete case. An implication of this is that we can define the matrix exponential as the following continuous-time limit exp(QT ) = lim ∆t→0 (I + Q∆t) T /∆t where we have used m = T /∆t. We will revisit this continuous-time limit later, in the context of parameter estimation.

The matrix exponential
By analogy with the corresponding scalar ODE, we can write the solution to the equation of state as This should not seem strange-since, for example, the scalar exponential x = exp(kt) is often (formally) defined as the solution to the equation dx dt = kx with boundary condition x(0) = 1. We have just done the same thing, but with a matrix Q instead of a scalar k. Of course, writing down a symbolic solution is one thing; in practise we have to figure out some means of computing the actual entries of M.
One way of doing this is via the Taylor series. For scalars, this is the following infinite series.
It can be verified that this satisfies the scalar ODE dx dt = kx. Furthermore, this remains true when the scalars x, k are replaced with matrices M, Q.
Of course, we can't evaluate an infinite series in finite compute time. However, if kt (or |Qt|) is small, then only the first few terms in the Taylor series will contribute much to the result. We can think of the n'th term, 1 n! (Qt) n , as representing the effect of n successive mutations in the time interval [0, t]. If t is small compared to 1/ρ, where ρ is the expected mutation rate, then only a few mutations are likely to have occurred. Thus, we can expect the Taylor series to converge at quite small n.
In the case where kt (or |Qt|) is not small, we use the method of scaling and squaring, as described in Moler and Van Loan's Nineteen Dubious Ways to Compute the Exponential Matrix. This exploits a fundamental property of the exponential function: e A = (e A/m ) m . m is chosen to be a power of two for which e A/m can be efficiently computed; (e A/m ) m is computed by repeated squaring. In practice, the Padé approximant is more efficient than the Taylor series; Gerton Lunter has kindly provided an implementation of matrix exponentiation using the Padé approximant with scaling and squaring.
In general, each matrix multiplication (& hence each term of the series) takes compute time O(N 3 ). However, if the rate matrix Q is sparse, this can be improved on.

Diagonalising the rate matrix
There is an exact way of evaluating the matrix exponential M = exp(Qt), by diagonalising the rate matrix. To do this we first need to find the eigenvalues and eigenvectors of Q. This is a straightforward application of matrix algebra.
An eigenvalue and the associated (right) eigenvector of the matrix Q are a scalar µ and a column vector u satisfying Qu = µu. This implies that (Q − µI)u = 0 where I is the identity matrix. Therefore where |X| represents the determinant of matrix X. This is the characteristic equation. In general the polynomial in this equation (the characteristic polynomial) has degree N , so there are N solutions and hence N eigenvalues (though some of these may be identical, or degenerate).
Once we have found an eigenvalue, by solving the characteristic equation or by other means, we can find the corresponding eigenvector. Let D be the diagonal matrix whose on-diagonal entries are the N eigenvalues, µ k , and let U be the matrix whose columns are the corresponding (right) eigenvectors, u (k) . It follows that QU = UD.
We here consider diagonalizable matrices Q, for which the inverse of U, U −1 , exists. The rows of U −1 are left eigenvectors of Q.
Using the identities Q = UDU −1 and UU −1 = I, we can rewrite the equation of state as follows Let us change co-ordinates to q(t) = p(t)U. (Hence p(t) = q(t)U −1 .) In this co-ordinate scheme, This separates out into N independent scalar ODEs of the form where we recall that µ k is the k'th eigenvalue. Clearly the solution is q k (t) = q k (0) exp(µ k t). We can write this in matrix form as q(t) = q(0) exp(Dt) if we recognise exp(Dt) as the diagonal matrix whose on-diagonal elements are obtained by exponentiating the diagonal elements of Dt.
We conclude this analysis by right-multiplying q(t) by U −1 to recover which is the analytic result we sought. We have chosen to obtain M(t) by diagonalization (as described in this section) rather than by the Padé approximant with scaling and squaring (section 3.5). We need to exponentiate a given matrix at several different timepoints t; the algebraic expression provided by diagonalization can be manipulated for increased efficiency, such as the eigenbasis projection described below. The Taylor and Padé approaches are more universal, since they do not require that the rate matrix be non-singular (have an inverse); singular rate matrices are not biologically meaningful.

Reversible models, equilibrium transformation and symmetry
Things become considerably simpler if Q is a symmetric matrix, i.e. Q T = Q; then U = U −1 T , so the left and right eigenvectors are the same. Furthermore, the eigenvalues are real (sketch of proof: let x * be the complex conjugate of x; then show that ij u is real, by taking the complex conjugate of the LHS). The eigenvectors can also be chosen to be real.
Symmetry of Q means that Q ij = Q ji for all i, j. In practise, this is rare. A slightly less restrictive constraint is that π i Q ij = π j Q ji , where π is the equilibrium state vector (satisfying πQ = 0); this is called the detailed balance condition and implies that the model is time-reversible. In this case, Q is related to a symmetric matrix by a simple co-ordinate transformation, so that the abovementioned benefits of a symmetric matrix can still be obtained.
Phylogeny also becomes easier for time-reversible models, since (as Felsenstein showed) we can then use the so-called pulley principle to slide the root node around at will, without affecting the likelihood of the tree. However, the irreversible model is more general and more realistic, and algorithms that work with an irreversible model can still be used with a reversible model.
The equilibrium co-ordinate transformation goes as follows. Let S ij = Q ij π i /π j . We can also write this as S = P 1 2 QP − 1 2 where P is a diagonal matrix whose nonzero elements are P ii = π i . Suppose that S = WDW −1 is the diagonalization of S (it's straightforward to prove that S's diagonal matrix D is the same as Q's). Then from which, by inspection, we can infer that U = P − 1 2 W and U −1 = P 1 2 W −1 .

Trees
We now extend the treatment to phylogenetic trees, describing the pruning and peeling algorithms of Felsenstein and others. In the context of machine learning theory, these can be viewed as message-passing algorithms on a Bayesian edge-labeled graphical model (the tree itself). The tree has K nodes, numbered 1 . . . K with children lower than parents (what Knuth calls postorder). Thus node K is the root. Each node n of the tree has an associated state φ n .
For a given node n, let C n be the set of its children (if the tree is binary, this set will have zero or two members) and let (p, g) be the parent and grandparent of n, respectively. For two adjoining nodes m, n, let t mn be the branch length from m to n (i.e. the evolutionary timespan separating m and n).
Let T n represent the subtree containing node n and all its descendants. Let T n be the complement of T n , i.e. all nodes except n and its descendants.
We also have to specify some initial probability distribution over states for the root node of the tree. Let π be this initial state vector. For time-reversible models, we often take π = s where s is the equilibrium state vector, but this is not a requirement in general.

Pruning
A typical situation is that we observe the state of the leaf nodes (having performed some sequencing experiments) but we don't know the state of the ancestral internal nodes. Felsenstein's pruning algorithm allows us to find the likelihood of the observed data at the leaves of the tree, summing over all possible states of the missing data at the internal nodes. This is achieved by a kind of dynamic programming on the tree (it can be viewed as an instance of belief propagation on the underlying graphical model).
Define α to be the likelihood of the subtree T n (i.e. node n and all its descendants) conditioned on node n being in state b. This likelihood is summed over all possible states at the internal nodes. Felsenstein computes this recursively The α b . The summed-over-internal-states likelihood of the observed states at the leaf nodes is given by

Peeling
We can also find the posterior probability that a particular node (or branch) was in a particular state. Again, this is achieved by a dynamic programming algorithm, called peeling.
Recall that T n represents all nodes except n and its descendants. For n < K, T n contains at least one member: p, the parent of n.
Define β (n) a to be the likelihood of T n , again summed over all states at the internal nodes, with node p in state a.
This time, the β (n) a are filled in descending order, i.e. from root-to-leaves. Because of the fill order, we sometimes refer to the α (n) as the "up likelihoods" and the β (n) as the "down likelihoods".
We can now write down the posterior probability q (n) ab of a given branch (p, n) being in states (a, b)

Parameter estimation and EM
Having observed a sequence alignment, and assuming some prior over trees and rate matrices, Bayes' theorem implies a posterior probability distribution over rate matrices Q. Such a distribution can be described algebraically for pairwise sequence analysis (it's a gamma-like distribution), but gets more complex for multiple sequences. Here, we consider the simpler compromise of a single point estimate for Q. A more rigorous inference, yielding confidence limits for Q, can be obtained by MCMC.
The challenge we address is as follows. Given some set of observations of the state of the system at the leaves of the tree (e.g. a multiple alignment of present-day sequences), what is the best rate matrix Q that explains these observations?
The traditional approach (used by Dayhoff et al to construct the PAM matrices) is to take a pairwise alignment of two closely related sequences, count the number of instances C ij of each aligned residue-pair (i, j), estimate the evolutionary distance ∆t separating the two sequences, and then set Q ij ← C ij /∆t.
Because this approach ignores multiple substitutions at the same site (e.g. back-substitutions), it requires that only closely related sequences can be considered. (Effectively, the approach is using the discrete-time approximation introduced in section 3.4.) This means that a lot of information in the multiple alignment gets thrown away, as the sequences are too distant to consider. There are also problems with over-counting sequences.
A better, more recently popularised approach is to use maximum likelihood (ML); that is, to seek the rate matrix Q that maximises the Felsenstein likelihood of equation (4). This is the approach that will now be described.
In general, we can't write down an analytic formula for the ML rate matrix, because the choice of rate matrix Q affects our imputation of the state of the system at the ancestral nodes in the tree. In other words, suppose we start with some initial guess at Q. From this, we can use the above-described peeling algorithm to make some probabilistic estimate of the missing ancestral sequences; let's call this estimate ψ. Now, given ψ, we could make an improved guess at the rate matrix; call this improved guess Q . However, if we then plug this new Q into the peeling algorithm, we'll end up with a different estimate ψ of the ancestral sequences. We could use this new ψ to estimate another rate matrix Q , and thence another ψ , and so on. It seems the best we can hope for is to keep iterating this procedure and hope that it converges.
It turns out that the iterative procedure just described does converge, and that it's a case of the Expectation Maximisation (EM) algorithm. In the following sections, we will describe how this works in more detail.
One should note that there are other methods of estimating rate matrices, based e.g. on resolvents (c.f. Vingron et al). We prefer EM for several reasons. First, the resolvent method is not a maximum likelihood method; it's an approximation. Second, the EM method plays better with the EM algorithms we already use to train HMMs and stochastic context-free grammars (SCFGs). Third, we think EM is more intuitive: it is basically just the same as the Dayhoff method, except that it uses unbiased posterior estimates of the counts, C ij (more on this below). Fourth, EM extends naturally to ML parameterisation of more complex evolutionary models, including (for example) indel events on whole sequences.

EM approach: discrete-time case
We can get some insight into how EM works for rate matrices by revisiting the discrete-time setup of section 3.4. Recall that, in the discrete-time case, the evolutionary problem is similar to the HMM problem in that it assumes a partial observation of a discrete-time Markov process. Therefore, we can re-use the well-known version of EM for HMMs: namely, the Baum-Welch training algorithm.
Let's forget about phylogenetic trees for the moment, and suppose that we make two observations of the process, initially at time 0 and then at the later time T = m∆t. Let the initial observation be σ(0) = a and the later observation be σ(m) = b.
In Baum-Welch training, the underlying iterative procedure is as follows: • E-step: Given the observed data (a, b, m) and the current estimate of the (discrete-time) transition probability matrix, R, estimate the counts matrix C, where C ij is the expected number of times that the Markov model made a transition from state i to state j.
• M-step: Set R ij ← C ij / k C ik .
Recall that the relationship between the discrete-time transition probability matrix R and the continuoustime rate matrix Q is (for small ∆t) Thus, given a Baum-Welch estimate of R, we can obtain Q by setting Q ij ← R ij /∆t (for i = j) and Q ii = − j =i Q ij . In terms of Q, the Baum-Welch-like algorithm for estimating transition rates therefore goes something like this: • E-step: Given the observed data (a, b, m) and the current Q, estimate the counts matrix C.

EM approach: continuous-time limit
Now think about what happens when we take the continuous-time limit ∆t → 0 while keeping the overall time interval T constant. The number of time-steps is m = T /∆t. Thus, as ∆t → 0, we have m → ∞: as the size of the time-step tends to zero, the number of discrete steps required to span the time interval tends to infinity. For two distinct states i, j (with i = j), the expected number C ij of i → j transitions should not depend too much on ∆t. This is because C ij represents something "real" (the number of times that an i mutated to a j), which is independent of the time-step.
However, if we look at the expected number of self-transitions, C ii , we find a problem. Namely, C ii ∼ T /∆t, so that as ∆t → 0, the number of self-transitions will blow up. This is because C ii does not represent a "real" quantity. What it represents is the number of time-steps during which the system stayed in state i and did not mutate, and this totally depends on the size we have chosen for the time-step, which of course is an artificial quantity (and one we hope to eliminate).
To fix this, let us define w i = ∆tC ii , the expected amount of time that the system spent in state i. Let's call this the wait time for state i. We expect that the wait time will not, ultimately, depend on ∆t.
If we now look at the denominator on the right-hand-side of equation (6), we see that the continuous-time limit lim ∆t→0 ∆tC ik is equal to 0 if i = k, and w i if i = k. In other words, the continuous-time limit of the Baum-Welch M-step is The interpretation of this is that the best estimate for the rate from i to j is equal to C ij , the expected number of i → j mutations, divided by w i , the expected amount of time spent in state i. This is exactly equivalent to the Dayhoff approach, except that C ij and w i are now unbiased posterior estimates of the mutation count and the wait time, rather than naive (biased) counts.

Estimating the counts and the wait times
It remains to be shown how to estimate the mutation counts C and wait times w. In this section we show how to do this for a single pair of observations using the eigenvector decomposition of Q.
Recall our observed data are σ(0) = a and σ(T ) = b. The likelihood of such an observation is M ab (T ). Suppose that a mutation i → j occurs at a particular time t, where 0 ≤ t ≤ T . The probability of going from a to i in time t is M ai (t); the probability of the mutation i → j occuring in a time interval of size dt is Q ij dt; and the probability of going from j to b in time T − t is M jb (t). Dividing by M ab (T ) to obtain the posterior probability of this mutation, and integrating over all times t, we obtain Now suppose that the process is in state i at time t, and no mutation occurs. The probability of going from a to i in time t is M ai (t) and the probability of going from i to b in time T − t is M ib (t). This instant contributes dt to the total wait-time in state i. Therefore We can write these results as Plugging in the definition of M ij (t) from equation (2), we obtain where J (T ) is the following kernel matrix 3.14 Putting it together: EM for rates on trees In this section, we outline how to combine the results of sections 3.9, 3.12 and 3.13 to estimate Q by EM from a multiple alignment. The algorithm proceeds as follows.
• For each column c in the multiple alignment: -Use the peeling algorithm of section 3.8 to compute the posterior probability q (n) ab for each adjoining pair of nodes (p, n) of the tree being in states (a, b).
-For each branch (p, n) with length T = T pn and each a, b, i, j ∈ {1 . . . N }: * Set C ij ← C ij + q • Calculate the next estimate of the rate matrix, Q , as follows: • Repeat until the joint likelihood, L, converges.
As with the Baum-Welch algorithm, we can inform the parameter estimation using prior distributions. With Baum-Welch, one typically uses (mixtures of) Dirichlet distributions, as these are naturally conjugate to the multinomial distribution underlying the counts. With the rate EM algorithm, the natural conjugate priors are (mixtures of) gamma distributions over the Q ij (which can, equivalently, be viewed as a gamma distribution for the exit rates −Q ii combined with a Dirichlet distribution over the probabilities −Q ij /Q ii ).

Accumulating the counts in eigenvector space
The inner loop of the algorithm in section 3.14 is over four state variables (a, b, i, j) each of which takes N values. If I ab ij (T pn ) is not cached (which would take O(N 4 ) memory per branch and so is impractical) then a further loop over two state variables (m, n) is required. The complexity of all of these loops together is O(N 6 ), as is evident from the following expanded expression for C ij where A represents each multiple alignment, c each column of A and T (c) the tree of states corresponding to column c.
We can't eliminate this O(N 6 ) time hit, but we can rearrange the sums in a more efficient form.
This can be described as accumulating the counts in eigenvector space, then transforming back. Finally we can use the expression for q where L c is the likelihood of column c. Substituting this into the equation for C ij , and rearranging, gives where U and D are (respectively) the left and right eigenbasis projections of the "up" and "down" peeling likelihoods, α and β.
In practice the projections U and D are calculated at the same time that the peeling likelihoods α and β are; the tree need only be traversed up and down once per EM cycle.
Finally, we can rewrite the expression for C ij as where E is the matrix of "eigencounts".

Complex eigensystems
Recall that the characteristic equation is a degree-N polynomial with real coefficients (since the Q ij ) are real. It follows that the roots of this polynomial must either be real, or complex conjugate pairs (sketch of proof: complex conjugativity is distributive over multiplication and addition, therefore if x is a zero of the characteristic then x * must also be a zero. Intuitively, there is no way the real coefficients can "distinguish" between two conjugate roots ± √ −1.) Let k be the index of an eigenvalue, µ k . If µ k is complex, then let the index of the conjugate eigenvalue be k . Likewise let l be the index of a (possibly different) eigenvalue and l the index of the conjugate (if µ l is complex). Since each of µ k and µ l can be either real or complex, there are four possibilities for complex conjugate symmetries:

Condition
Eigenvalues Eigenvectors Eigenprojections Kernel Eigencounts Summary Write the following complex expansions where ı = √ −1, κ k , λ k are real and V, W, X and Y are real N × N matrices. Using the identities (xyz) * = x * y * z * and x + x * = 2Re [x], we can write equation (2) as where c k = exp(κ k T ) cos(λ k T ) and s k = exp(κ k T ) sin(λ k T ) The complex form of J kl (T ), from equation (8), is as follows. For repeated eigenvalues, µ k = µ l : For unique eigenvalues, µ k = µ l :

Implementation of complex matrix algebra
The following is a fairly detailed discussion of an implementation of the concepts described above. This section may not be of great interest to a general audience; it was written as a guide to aid in code maintenance. Our implementation has made heavy use of Gerton Lunter's EISPACK-based matrix diagonalization and matrix exponentiation code. This code returns the eigenvalues and eigenvectors in a condensed fashion, implicitly using the fact that complex eigenvalues of real matrices come in conjugate pairs. The eigenvalues are contained in a single vector; for complex eigenvalue µ k , the real part of the pair is in position k and the positive imaginary part in position k + 1. The corresponding pair of conjugate (right) eigenvectors are similarly returned in columns k (real part) and k + 1 (positive imaginary part) of the eigenvector matrix U.
The second matrix returned, U −1 , which in the real case is the inverse of the eigenvector matrix and contains left eigenvectors as rows, does not directly correspond to the complex matrix inverse; it is simply the inverse of the eigenvector matrix, viewed as real. The procedure to determine the actual complex inverse matrix from U −1 is straightforward. Rows corresponding to real eigenvalues µ k are correct. As before, rows corresponding to complex eigenvalues µ k are treated as pairs, k and k + 1, with row k containing the real and k + 1 the imaginary part. In addition, these real and imaginary parts must be divided by two; the sign of the imaginary part must also be reversed, i.e., the imaginary value in row k + 1 corresponds to the eigenvalue with negative imaginary part. This translation procedure is easily validated by calculating the product matrix, (UU −1 ) ik = N j=1 U ij U −1 jk using the compaction rules. Entries corresponding to real eigenvalues contribute U ij U −1 jk ; entries corresponding to a conjugate pair of eigenvalues j and j +1 contribute 2U ij U −1 jk −2U i,j+1 U −1 j+1,k , which gives the desired result upon dividing by 2 and reversing the sign of U −1 j+1,k . The following describes how we use this code in calculating the matrix exponential, equation (2), making use of an intermediate matrix Z mj as defined here. This procedure is equivalent to equation(11) after converting U −1 to the actual complex inverse.
The conjugate symmetries make it both possible and advantageous to store the complex vectors and arrays in condensed form; in the vectors the entries corresponding to complex eigenvalue µ k have real part in position k and the imaginary part corresponding to the eigenvalue with positive imaginary part in position k + 1. The complex arrays J kl (T ) and E kl have slightly more interesting conjugate symmetry in the case where both µ k and µ l are complex. Here, e.g. for E kl , the conjugate pairs are E kl ↔ E k+1,l+1 and E k+1,l ↔ E k,l+1 . We adopt the convention that, in the (real) tables used in the implementation, entries E kl and E k+1,l contain the real part of the pair; entries E k+1,l+1 and E k,l+1 contain the imaginary part corresponding to the eigenvalue with positive imaginary part.
The implementation includes four distinct cases, one for each of the possible combinations of real and complex µ k and µ l . If both are real, there is only one corresponding E kl table entry: If µ k is real and µ l is complex, we calculate two table entries together: E kl , containing the real part of the corresponding conjugate pair, and E k,l+1 , containing the imaginary part of the complex entry corresponding to the eigenvalue µ l with positive imaginary part: As one might expect, the case where µ k is complex and µ l is real is similar: In the case where µ k and µ l are both complex, there are two real parts: E kl for the E kl ↔ E k+1,l+1 pair and E k+1,l for the E k+1,l ↔ E k,l+1 pair, and two imaginary parts: E k+1,l+1 for the former pair and E k,l+1 for the latter.
In the final calculation, of C ij , the conjugate symmetries lead to a simpler form. Since the sums in equation (9) include both entries of conjugate pairs, the imaginary part is zero, as expected for transition counts. In practice, the double sum (9) is calculated as a single nested sum Because the elements of the inverse matrix U −1 are left eigenvectors as rows, the U −1 ki entries for a given k contain either strictly real elements, in the case of a real eigenvector µ k , or either the real part (row k) or the imaginary part (row k + 1) in the case of a complex µ k , regardless of subscript i. Similarly the elements of matrix U are (right) eigenvectors as columns, so the U jl entries for a given l are also either strictly real, for real µ l , or either the real or imaginary part of a complex pair l and l + 1 for complex µ l , regardless of subscript j. This means that the evaluation of expression (13) need only be concerned with whether indices k and l correspond to real or complex µ k or µ l and not with i or j.
In evaluating the sum, we again have four distinct cases. If both µ k and µ l are real, we include in the sum only the term: U −1 ki E kl U jl For µ k real and µ l complex, we calculate terms for indices l and l + 1 together, which contribute the following (remembering that we have taken into account the relationship between the U −1 ki entries and the actual complex matrix inverse, which here removes a factor of two): The expression for µ k complex and µ l real is similar; here we process terms k and k + 1 together. The translation of the U −1 k+1,i entry reverses the sign of the second term: (U −1 ki E kl + U −1 k+1,i E k+1,l )U jl In the calculation in the case where both µ k and µ l are complex, we process together terms k and k + 1, l and l + 1, which together give: where E r kl and E i kl are the sum of the real and imaginary E kl terms, respectively: In this calculation the pesky signs are easily validated: the product of two imaginary terms gives a factor of −1, as does the term U −1 k+1,i .

Application to phylo-grammars
Notationally, the application of the above EM algorithm to phylo-grammars is slightly involved, but the gist is as follows.
The "standard" biological formulation, as described in e.g. the textbook by Durbin et al, of the dynamic programming algorithm for aligning sequences to a hidden Markov model (HMM) or stochastic context-free grammar (SCFG), can be sketched (non-rigorously) as follows: • For each subsequence S of the full sequence, iterating from shortest to longest: -For each nonterminal φ in the grammar: * Let E denote the final residue(s) of S and let S denote the remainder of S. * ( * ) Calculate the likelihood of emitting residues E from state φ. * Multiply this by the max (or sum) of all incoming transitions from S . * Store this result in cell (φ, S).
Note that the emission E is not restricted to be a single residue. Examples of models that typically make use of multi-residue emissions are RNA structure (the two nucleotides of a base-pair) and protein-coding genes (the three nucleotides of a codon). In the single-sequence grammar, it is not essential to aggregate these emissions to a single state: one can achieve the same effect by staggering the multi-residue emission over several states (an emission of N correlated residues from an alphabet of size A requires ∼ A N −1 "memory" states, each emitting one symbol). This trick doesn't work for phylo-grammars, where we are interested in the likelihood of not just one N -mer but rather multiple observations of this N -mer in homologous sequences, correlated according to some underlying phylogenetic tree. Thus, while many implementations of singlesequence grammars allow at most one residue to be emitted per state without loss of generality (see e.g. Durbin et al, "Biological Sequence Analysis"). implementations of phylo-grammars must allow multi-residue emissions.
To adapt the above kind of algorithm for phylo-grammars, it suffices merely to modify the emit likelihood calculation, flagged with an asterisk ( * ) in the above text. Specifically, the emission E is no longer a residue (or residues), but rather an alignment column (or columns). Instead of calculating the likelihood of emitting a single symbol (or a set of covariant symbols) one must now calculate the likelihood of an entire alignment column (or set of covariant columns). This is easily achieved using the phylogenetic "pruning" algorithms of section 3.7.
The Inside-Outside and Forward-Backward algorithms are similarly easy to modify. Indeed the calculation of posterior probabilities is unchanged; it is only the accumulation of counts that must be modified. Again sacrificing exactness for brevity, the single-sequence case may be written as follows: • Let C(φ, E) be the expected number of times that state φ emitted residue(s) E.
• For each subsequence S of the full sequence: -For each nonterminal φ in the grammar: * Let E denote the final residue(s) of S and let S denote the remainder of S. * Calculate p, the posterior probability that E was emitted by state φ. * ( * ) Set C(φ, E) ← C(φ, E) + p.
The step that must be modified is again flagged with an asterisk ( * ). Again, for the phylo-grammar E represents a column (or set of columns) rather than a single residue (or set of residues). Rather than simply incrementing a single count, one now runs the phylo-EM algorithm of section 3.13 on column E in order to estimate a full set of expected substitution transitions and wait times (the C ij and w i of section 3.13). These counts, weighted by the posterior probability p, are then added to running totals for C ij and w i for state φ. These running totals are used to update the rate matrix for φ in the M-step of EM.