#### Application 1: Rate matrix estimation

Our first application is the problem of estimating the parameters in a CTMC for evolution of coding DNA sequences which we describe using the 61 × 61 rate matrix (excluding stop codons) given by Goldman and Yang [

14]:

where *π* is the stationary distribution, *κ* is the transition/transversion rate ratio, *ω* is the non-synonymous/synonymous ratio and *α* is a scaling factor. The stationary distribution *π* is determined directly from the data using the codon frequencies. We estimate the remaining parameters *θ* = (*α*, *κ*, *ω*) using the expectation-maximization (EM) algorithm [16] as described below.

Suppose the complete data

**x** is available, consisting of times and types of substitutions in all sites and in all branches of the tree. The complete data log likelihood is, using (1) and (6),

where we use the notation

A similar notation applies for *L*
_{s,tv}, *L*
_{ns,ts}, *L*
_{ns},_{tv}, *N*
_{ns} and *N*, where the last statistic is the sum of substitutions between all states (*i*, *j*) that differ at one position and s, ns, ts and tv subscripts stand for synonymous, non-synonymous, transition and transversion.

The complete data log likelihood can be maximized easily by making the re-parametrization

*β* =

*ακ*. We find that

where *a* = -*L*
_{ns,tv}
*L*
_{ns,ts}
*N*
_{s}, *b* = *L*
_{ns,tv}
*L*
_{s,ts}(*N*
_{ns} - *N*
_{tv}) + *L*
_{ns,ts}
*L*
_{s,tv}(*N*
_{ns} - *N*
_{ts}) and *c* = *L*
_{s,tv}
*L*
_{s,ts}
*N*
_{ns}.

In reality the data **y** is only available in the leaves and the times and types of substitutions in all sites and all branches of the tree are unaccessible. The EM algorithm is an efficient tool for maximum likelihood estimation in problems where the complete data log likelihood is analytically tractable but full information about the data is missing.

The EM algorithm is an iterative procedure consisting of two steps. In the E-step the expected complete log likelihood

conditional on the data **y** and the current estimate of the parameters *θ*
_{0} is calculated. In the M-step the parameters are updated by maximizing *G*(*θ*; *θ*
_{0},**y**). The parameters converge to a local maximum of the likelihood for the observed data.

The expected log likelihood conditional on the data

**y** and under the three parameters

*α*,

*κ* and

*ω* is

Therefore the E-step requires expectations of linear combinations of waiting times in a set of states and number of jumps between certain states. Because of the Markov property this calculation can be divided in two parts. First we use the peeling algorithm [

17,

18] to obtain the probability

that a branch

*k* of length

*t*
_{
k
}with nodes

*γ*
_{
k
}and

*β*
_{
k
}above and below the branch, respectively, has end-points

*a* and

*b*. Second, we calculate the desired summary statistic by summing over all branches. For example we have

The E-step thus consists of calculating conditional expectations of linear combinations of times such as
and substitutions such as
where *L*
_{s,ts} and *N*
_{ts} are given by (8). In our application *n* = 61 and the first type of statistics
is (up to a factor *p*
_{
ab
}(*t*)) on the form (5) with diagonal entries
and all off diagonal entries equal to zero. The second type of statistics
is also on the form (5) with off-diagonal entries
and zeros on the diagonal.

#### Application 2: Robust distance estimation

The second application is a new approach for estimating labeled evolutionary distance, entitled robust counting and introduced in [8]. The purpose is to calculate a distance that is robust to model misspecification. The method is applied to labeled distances, for example, the synonymous distance between two coding DNA sequences. As it is believed that selection mainly acts at the protein level, synonymous substitutions are neutral and phylogenies built on these type of distances are more likely to reveal the true evolutionary history. The distance is calculated using the mean numbers of labeled substitutions conditioned on pairwise site patterns averaged over the empirical distribution of site patterns observed in the data. In the conventional method the average is done over the theoretical distribution of site patterns. The robustness is therefore achieved through the usage of more information from the data and less from the model.

Let *Q* be the rate matrix of the assumed model, *P*(*t*) = *e*
^{
Qt
}, the labeling be given through a set of pairs ℒ and the data be represented by a pairwise alignment **y** = (**y**
_{1}, **y**
_{2}) of length *m*. As data only contains information about the product *Qt*, where *t* is the time distance between the sequences, we can set *t* = 1.

Suppose we observe the complete data consisting of the types of substitutions that occurred in all sites and let

be the labeled number of substitutions. A natural labeled distance is given by

. The labeled distance is estimated as the average across all sites of the expected number of labeled substitutions conditioned on the observed end points:

Therefore this application requires evaluating a sum on the form (5) with off-diagonal entries
and zeros on the diagonal.