The nucleotide substitution rate matrix, conventionally denoted *Q*, holds the rate of change from each type of nucleotide to each other nucleotide in the Markov model of molecular evolution [1] (see [2] for review). *Q* is often treated as a nuisance parameter in phylogenetic reconstruction, and is often assumed to be constant over a phylogenetic tree. However, if *Q* were really constant, all genomes would have identical composition, assuming that sufficient time had passed for the process to reach equilibrium. Neutrally-evolving sites reach equilibrium rapidly (e.g. within individual genera of nematodes [3]). However, genomes range in GC content (the fraction of nucleotides that are G or C, as opposed to A or T) from about 25% to 75% [4, 5], and members of all three domains of life have a wide range of GC content [6]. Consequently, discovering how *Q* varies is key to building better phylogenies and to understanding fundamental processes of molecular evolution. However, there are many methods for providing an estimate of *Q*, i.e. *Q̂*, that may differ substantially in accuracy and performance, and lack of information about these performance characteristics has hindered discovery of patterns of *Q* on a large scale within and between genomes. Several factors are known to influence the proportions of the four nucleotides in the genome (hereafter, genome composition), and hence presumably *Q*, under specific conditions. For example, in the spirochete *Borrelia burgdorferi*, the strand-specific compositional biases induced by deamination are so great that the strand on which each gene is coded can be determined simply by examining its composition [7, 8]. Other examples include the influence of CpG doublets on mammalian gene composition [9], deamination biases in mitochondria [10], and the origin of mammalian isochores through variation in mutation patterns [11]. Within vertebrate genomes, the balance between oxidation and deamination mutations can also differ radically [12]. However, despite the potential for relating changes in *Q* to specific mutational processes, little attention has been paid to obtaining empirical estimates *Q̂* in specific lineages [13, 14].

That *Q* does vary, and that variations in *Q* can tell us something interesting about the underlying processes in molecular evolution, has thus drawn substantial attention only in the last 10–15 years. Most interest in obtaining *Q̂* has come from studies of phylogenetic reconstruction: for example, likelihood methods for tree-building typically involve optimizing a single *Q̂* matrix for a phylogenetic tree [15], and testing for the symmetries in *Q̂* that best explain the sequence data [16] is a standard part of modern phylogenetic inference workflows [17]. Although some work has been done on non-stationary models [18–23], and recent evidence suggests that non-stationarity may have a large influence on the accuracy of phylogenetic reconstruction [24, 25], no comparison of the different methods for obtaining *Q̂* has yet been performed. Available methods for obtaining *Q̂* are based on different underlying assumptions, and are hence expected to differ both in accuracy and in speed of calculation. Most available methods first obtain *P̂*, where the entries of *P̂* are the estimated probabilities of changing from each base to each other base after a defined time interval of duration *t*.
*P* is related to *Q* by the equation *P* = *e*^{*Qt*}. Therefore, *Q̂t* can be obtained from *P̂* in two ways: either by simply taking the matrix logarithm of *P̂*, or by using constrained optimization to find a valid *Q̂t* without negative off-diagonal elements that, when exponentiated, produces *P̂* with minimal error. The time factor can be removed by scaling *Q̂t* so that the trace is (-1), leaving *Q̂* on a standard scale (but with arbitrary units). For equilibrium processes, the weighted trace can be scaled such that −Σ_{i} *π*_{i} *Q̂*_{ii} = 1, taking into account the equilibrium base frequencies [26] to standardize to a fixed number of substitutions per site, but for nonstationary processes this procedure is not justified.

The simplest method for obtaining *Q̂* is to count the frequencies with which a base *i* ∈ {*A*, *C*, *G*, *T*} in one sequence is aligned with base *j* ∈ {*A*, *C*, *G*, *T*} in the other sequence, and denote this count by *n*_{ij}. To isolate the effects of mutation, regions of the sequence that are thought to be under no selection or very weak selection, e.g. introns or 4-fold degenerate synonymous sites, are used for this purpose. If the substitution process is time-reversible, *n*_{ij} = *n*_{ji} in expectation. It suffices to consider the special case where the matrix of observed counts *n* is symmetric; if *n* is not symmetric, we replace *n* by the average of *n* and its transpose *n*′ [26]. Then the entries of *P̂* are given by *P̂*_{ij} = *n*_{ij}/Σ_{k} *n*_{ik}, which can then be used to obtain *Q̂* as described above [2, 26, 27]. Alternatively, since this method of obtaining *Q̂* assumes the divergence between sequences is small, *P̂* ≈ *I* + *Q̂t*. Then an estimate for the rate of change of *i* to *j* is given by (*Q̂t*)_{ij} = *n*_{ij}/Σ_{k} *n*_{ik} for *i* ≠ *j*, and the diagonal elements are set such that each row of *Q̂t* sums to 0 [28, 29]. We refer to this method as the 'Goldman' method [29]. Again the time factor can be removed by scaling *Q̂t*. However, both of these methods are based on the assumption that the evolutionary process is time-reversible. A method that can better recapture non-time-reversible substitution processes uses rooted triples of sequences, where X and Y are the sister taxa and Z is the outgroup [30]. If X has the same base at a given position as Z, the assumption is that any change at that position most likely occurred between Y and its common ancestor with X. This method thus leads to a directional and not necessarily time-reversible *P̂*, which can then be used to obtain *Q̂* (see [31] for additional details, and for a comparison of directional and undirectional estimates of *Q*). We refer to this three-sequence method as the 'Gojobori' method of obtaining *Q̂* [30], and use the implementation in the PyCogent toolkit [32].
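To make the counting and 'Goldman' estimators concrete, a minimal sketch in Python/NumPy follows. The toy sequence pair, the helper names, and the use of `scipy.linalg.logm` for the matrix-log route are our illustrative assumptions, not the implementations compared in this study:

```python
# Sketch of the counting and 'Goldman' estimators described above, plus the
# matrix-log route from P-hat to Q-hat*t.  Sequences and names are toy
# illustrations only.
import numpy as np
from scipy.linalg import logm

BASES = "ACGT"

def count_matrix(seq1, seq2):
    """n[i, j] = number of sites where base i in seq1 aligns with base j in seq2."""
    idx = {b: k for k, b in enumerate(BASES)}
    n = np.zeros((4, 4))
    for a, b in zip(seq1, seq2):
        n[idx[a], idx[b]] += 1
    return n

def p_hat_counting(n):
    """Symmetrize the counts, then row-normalize to obtain P-hat."""
    n_sym = (n + n.T) / 2.0
    return n_sym / n_sym.sum(axis=1, keepdims=True)

def q_hat_t_from_p(p):
    """Q-hat*t via the matrix logarithm, rescaled so the trace is -1."""
    qt = np.real(logm(p))
    return qt / -np.trace(qt)

def q_hat_t_goldman(n):
    """Goldman estimator: off-diagonals n_ij / row totals, rows sum to zero."""
    qt = n / n.sum(axis=1, keepdims=True)
    np.fill_diagonal(qt, 0.0)
    np.fill_diagonal(qt, -qt.sum(axis=1))
    return qt / -np.trace(qt)

# Toy aligned pair (low divergence, as the Goldman approximation assumes).
seq1 = "ACGTACGTACGGTTTTACGAACGT" * 20
seq2 = "ACGTACCTACGGTTCTACGAATGT" * 20
n = count_matrix(seq1, seq2)
qt_log = q_hat_t_from_p(p_hat_counting(n))
qt_gold = q_hat_t_goldman(n)
```

At low divergence the two estimates agree closely, which is exactly the regime in which the approximation *P̂* ≈ *I* + *Q̂t* is justified.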

More elaborate methods include maximum likelihood methods and Markov triple analysis [13, 33, 34]. Each of these methods assumes that nucleotide sites are independent and identically distributed, and none requires the evolutionary process to be stationary or reversible (recall that the sequence composition produced by evolution according to a Markov process over time need not match the composition of the starting sequence, which might be modeled differently). To explain the methods, we introduce the following notation. Consider a rooted triple of sequences from taxa *X*, *Y*, *Z*. Define *R* to be the root and let *π̂* denote the estimated vector of nucleotide probabilities for the sequence at *R*. Let *P̂*_{U} denote the estimated nucleotide probability transition matrix from *R* to taxon *U*, for *U* ∈ {*X*, *Y*, *Z*}, i.e.,

(*P̂*_{U})_{ij} = P(*U* = *j* | *R* = *i*), for *i*, *j* ∈ {*A*, *C*, *G*, *T*}.

The transition probabilities in each row of *P̂*_{U} sum to 1, therefore each matrix *P̂*_{U} consists of 12 unknown parameters, and similarly the vector *π̂* consists of 3 unknown parameters. To determine the unknown parameters, the methods each make use of the observed joint probability distribution between the three taxa. Let *F̂* denote this distribution. That is,

*F̂*(*i*, *j*, *k*) = P(*X* = *i*, *Y* = *j*, *Z* = *k*).

Each state of *F̂* can be written in terms of *π̂*, *P̂*_{X}, *P̂*_{Y} and *P̂*_{Z}. For example,

*F̂*(*i*, *j*, *k*) = Σ_{r} *π̂*_{r} (*P̂*_{X})_{ri} (*P̂*_{Y})_{rj} (*P̂*_{Z})_{rk}.
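As a concrete check of this relationship, the joint distribution can be assembled numerically from toy parameters; the values and names below are arbitrary illustrative assumptions, not estimates from data:

```python
# Numerical sketch of F(i,j,k) = sum_r pi[r] * PX[r,i] * PY[r,j] * PZ[r,k].
# Parameter values are arbitrary illustrative choices.
import numpy as np

pi = np.array([0.3, 0.2, 0.25, 0.25])   # root base probabilities (A, C, G, T)

def toy_transition(eps):
    """A diagonally dominant 4x4 transition matrix with off-diagonal mass eps."""
    p = np.full((4, 4), eps / 3.0)
    np.fill_diagonal(p, 1.0 - eps)
    return p

PX, PY, PZ = toy_transition(0.05), toy_transition(0.10), toy_transition(0.15)

# F[i, j, k] = P(X = i, Y = j, Z = k), marginalizing over the root state r.
F = np.einsum("r,ri,rj,rk->ijk", pi, PX, PY, PZ)
```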
Maximum likelihood methods obtain *π̂*, *P̂*_{X}, *P̂*_{Y} and *P̂*_{Z} by finding the values for the parameters that maximize the likelihood of the observed joint frequencies [33, 34]. That is, align the *X*, *Y*, and *Z* sequences and let *n*_{XYZ}(*i*, *j*, *k*) denote the number of times *X* = *i*, *Y* = *j*, *Z* = *k* occurs in the alignment. The likelihood function is proportional to the conditional probability, given by

L ∝ Π_{i,j,k} *F̂*(*i*, *j*, *k*)^{*n*_{XYZ}(*i*, *j*, *k*)}.

One method for finding the 39 unknown parameters (12 unknowns in each of the three *P̂*_{U} matrices and three unknowns in *π̂*) is to use a global optimization algorithm to search parameter space and find the *π̂*, *P̂*_{X}, *P̂*_{Y} and *P̂*_{Z} that maximize the likelihood. We refer to this method as the 'MW' method below. Since the chosen ordering of the nucleotides is arbitrary, there are 4! permutations of the elements of *π̂* (while keeping the nucleotide ordering fixed), and 4! corresponding permutations of the rows of the *P̂*_{U} matrices, that will result in the same *F̂* and therefore the same likelihood. Thus, once the global optimization algorithm converges to one of the maximum likelihood solutions, a unique solution can be determined by choosing the permutation that maximizes the sum of the traces of the *P̂*_{X}, *P̂*_{Y} and *P̂*_{Z} matrices [34], which provides the solution that implies least change in the overall sequence (the diagonal elements represent the probability that a nucleotide remains unchanged after *t* time units). If the underlying process is not one that maximizes the diagonal entries, the process is unidentifiable using this method (i.e. it cannot be recovered even with infinite sequence data). One solution might be to restrict the class of processes to those with diagonally dominant *P* [35], although we did not impose such a restriction in this study.
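The trace-based resolution of this label ambiguity can be sketched as follows. This is a toy illustration with arbitrary example matrices and our own helper name, not the MW implementation:

```python
# Resolving the 4! label ambiguity: permuting the elements of pi-hat and the
# rows of each P-hat matrix leaves F-hat (and hence the likelihood) unchanged,
# so we pick the permutation maximizing trace(PX) + trace(PY) + trace(PZ).
import itertools
import numpy as np

def resolve_by_trace(pi, PX, PY, PZ):
    """Return the relabeling of the root state with the largest summed trace."""
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(4)):
        p = list(perm)
        score = sum(np.trace(M[p, :]) for M in (PX, PY, PZ))
        if score > best_score:
            best_score = score
            best = (pi[p], PX[p, :], PY[p, :], PZ[p, :])
    return best

# Scramble a diagonally dominant toy solution and check the ambiguity is undone.
rng = np.random.default_rng(0)
P = np.full((4, 4), 0.02)
np.fill_diagonal(P, 0.94)
pi = np.array([0.1, 0.2, 0.3, 0.4])
shuffle = rng.permutation(4)
pi_s, P_s = pi[shuffle], P[shuffle, :]
pi_r, PX_r, PY_r, PZ_r = resolve_by_trace(pi_s, P_s, P_s, P_s)
```

Because every off-diagonal entry of the toy matrix is smaller than every diagonal entry, exactly one of the 4! relabelings restores the diagonally dominant solution.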

An alternate approach for finding the maximum likelihood parameters that does not involve a global optimization routine was introduced by Barry and Hartigan and is referred to as the 'BH' method [33, 34] below (we ran this method using the default parameters as supplied). BH rewrites the likelihood function in terms of joint probability matrices along each edge of the tree. Solving for the maximum value of the likelihood leads to a system of iterative equations for the joint probability matrices along each edge. Thus, given initial starting guesses for the joint distributions along each edge, the system can be iterated until convergence is reached [33, 34]. The joint probability matrices along each edge can then be used to obtain their corresponding *P̂* matrices.

BH is valid on unrooted trees of any size, but for simplicity we briefly outline it here using the unrooted version of the rooted triple defined above. Let *u*_{i} denote the *i*th element in the sequence from taxon *U*, with *U* ∈ {*X*, *Y*, *Z*}. Similarly, let *r*_{i} denote the *i*th element in the sequence at node *R*. Then

P(*x*_{i}, *y*_{i}, *z*_{i}) = Σ_{r_{i}} P(*r*_{i}) P(*x*_{i} | *r*_{i}) P(*y*_{i} | *r*_{i}) P(*z*_{i} | *r*_{i}).
Let *N* denote the length of the sequences. Then the conditional probability can be written as

P(*X*, *Y*, *Z*) = Π_{i=1}^{N} P(*x*_{i}, *y*_{i}, *z*_{i}).
The maximum likelihood solutions are therefore the values of the edge joint probability matrices *Ĵ*_{X}, *Ĵ*_{Y} and *Ĵ*_{Z}, where (*Ĵ*_{U})_{ru} = P(*R* = *r*, *U* = *u*), that maximize the likelihood subject to the constraint that the entries in each *Ĵ*_{U} sum to 1, as it is a joint probability matrix. In other words, for the edge to *X* (the edges to *Y* and *Z* are handled identically), BH maximizes

log P(*X*, *Y*, *Z*) + *λ* (1 − Σ_{r,x} (*Ĵ*_{X})_{rx}).

Differentiating with respect to (*Ĵ*_{X})_{rx} and *λ* and setting equal to zero leads to the following iterative equation for *Ĵ*_{X},

(*Ĵ*_{X})_{rx} = (1/*N*) Σ_{i=1}^{N} *I*(*x* ∈ {*x*_{i}}) P(*r* | *x*_{i}, *y*_{i}, *z*_{i}),

where *I*(*x* ∈ {*x*_{i}}) is an indicator function that takes the value 1 if *x* = *x*_{i} and 0 otherwise [33, 34]. This iterative equation can be written entirely in terms of joint distributions along each edge by using the relation

P(*r* | *x*_{i}, *y*_{i}, *z*_{i}) = (*Ĵ*_{X})_{rx_{i}} (*Ĵ*_{Y})_{ry_{i}} (*Ĵ*_{Z})_{rz_{i}} / (P(*r*)^{2} P(*x*_{i}, *y*_{i}, *z*_{i})),

where P(*r*) = Σ_{x} (*Ĵ*_{X})_{rx} and P(*x*_{i}, *y*_{i}, *z*_{i}) is obtained by summing the numerator over *r*.
These steps are repeated to derive iterative equations for the joint matrices *Ĵ*_{Y} and *Ĵ*_{Z}, leading to an iterative system for *Ĵ*_{X}, *Ĵ*_{Y} and *Ĵ*_{Z}. Suitable initial values are chosen and the system is iterated until it converges [33].
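The resulting scheme behaves like an expectation-maximization loop. The following is a minimal sketch of such an iteration for a triple; it is our own simplified reimplementation working with root probabilities and conditional matrices (equivalent to iterating the edge joint matrices), not the published BH code:

```python
# EM-style sketch of an iterative fit for a triple, given site-pattern counts
# counts[i, j, k] = n_XYZ(i, j, k).  Names and starting values are our own.
import numpy as np

def bh_iterate(counts, n_iter=200):
    N = counts.sum()
    pi = np.full(4, 0.25)
    # Diagonally dominant starting guesses, as in the BH defaults.
    Ps = [np.full((4, 4), 0.02) + np.eye(4) * 0.92 for _ in range(3)]
    for _ in range(n_iter):
        PX, PY, PZ = Ps
        # E-step: posterior over the internal state r for every pattern (i,j,k).
        w = np.einsum("r,ri,rj,rk->rijk", pi, PX, PY, PZ)
        w /= w.sum(axis=0, keepdims=True)              # P(r | i, j, k)
        # M-step: re-estimate the edge joint matrices J_U[r, u], then condition.
        JX = np.einsum("rijk,ijk->ri", w, counts) / N
        JY = np.einsum("rijk,ijk->rj", w, counts) / N
        JZ = np.einsum("rijk,ijk->rk", w, counts) / N
        pi = JX.sum(axis=1)                            # P(r) = row sums of J_U
        Ps = [J / pi[:, None] for J in (JX, JY, JZ)]
    return pi, Ps

# Example: counts generated from a diagonally dominant toy model.
pi0 = np.array([0.1, 0.2, 0.3, 0.4])
P0 = np.full((4, 4), 0.03)
np.fill_diagonal(P0, 0.91)
counts = 1e4 * np.einsum("r,ri,rj,rk->ijk", pi0, P0, P0, P0)
pi_hat, (PX_hat, PY_hat, PZ_hat) = bh_iterate(counts)
```

Each pass through the loop cannot decrease the likelihood, and the constraints (rows of each *P̂* summing to 1, *π̂* summing to 1) hold exactly at every iteration.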

The joint probability matrices *Ĵ*_{U} can then be converted to *P̂*_{U} matrices along each edge using the relationship between joint and conditional probabilities, (*P̂*_{U})_{ru} = (*Ĵ*_{U})_{ru}/P(*r*). Finally, the *P̂* matrices can then be used to obtain *Q̂* matrices using the relationship *P̂* = *e*^{*Q̂t*}, as described above.

Both the global optimization and BH methods maximize the same likelihood function, but they differ in the computational methods used to find the maximum and in whether they write the unknown parameters in terms of joint probability matrices or probability transition matrices. Maximizing the likelihood does not result in a unique solution without adding further restrictions. MW chooses the *P̂* matrices with the largest sum of the traces, while in the BH method, the initial starting values determine which maximum is chosen. The BH method's default initial joint probability matrices are diagonally dominant, so we expect the iteration to converge to a diagonally dominant solution (if one exists). Thus, in theory the BH and MW methods will result in the same *P̂* matrices. In practice, their answers can differ if either converges instead to a local maximum, if the initial starting points used in either algorithm lead to too slow a rate of convergence, or if the original *P* matrices used to generate the sequences are not diagonally dominant.

Another method, which we refer to as the 'Lake' method, uses Markov triple analysis (MTA) as a different approach to finding the *P̂* matrices for rooted triples of sequences [13]. This method uses the fact that the conditional joint probabilities P(*X* = *i*, *Y* = *j* | *Z* = *k*) can be written in terms of *P̂*_{X}, *P̂*_{Y}, and *P̂*_{ZR}, where *P̂*_{ZR} is the probability transition matrix from taxon *Z* to *R*. (Note that because the evolutionary process is not required to be reversible, *P̂*_{ZR} differs from *P̂*_{Z}.) For example,

P(*X* = *i*, *Y* = *j* | *Z* = *k*) = Σ_{r} (*P̂*_{ZR})_{kr} (*P̂*_{X})_{ri} (*P̂*_{Y})_{rj}.
MTA uses the observed conditional joint probabilities as estimates of the true conditional joint probabilities. This results in a system of 48 nonlinear equations in terms of the 36 unknowns in *P̂*_{X}, *P̂*_{Y} and *P̂*_{ZR}. MTA solves this system by first using the properties of determinants to set up a simpler system in terms of a single one of these transition matrices. Specifically, determinants of the conditional joint probabilities are used to derive a system of 6 quartic polynomials. The matrix is determined by first finding the roots of the polynomials and then searching through 24^{5} possible orderings of the roots to find the ordering that results in a consistent system. In the case that the original system of 48 equations is inconsistent, the Lake method chooses the ordering that minimizes an inconsistency function defined by the author [13]. These ordered roots are coefficients in a system of equations that is linear in the unknown parameters of the matrix; thus once the ordered roots are found, the matrix can be determined by solving the linear system. Again, any of the 4! permutations of the columns of this matrix will also be a valid solution, so the Lake method chooses the ordering with a positive determinant that maximizes the trace. Once this matrix has been uniquely determined, 3 systems of equations that are linear in the parameters of *P̂*_{X}, *P̂*_{Y} and *P̂*_{ZR} can be solved to determine these matrices. Again, the *P̂* matrices estimated using the Lake method can then be used to obtain *Q̂* matrices.

If the observed conditional frequencies do in fact accurately estimate the true conditional frequencies and result in a consistent system of 48 equations, the Lake method provides a computationally feasible way to find the solutions of the system. However, since this method estimates the joint probability distribution from observed frequencies, the accuracy of this estimation depends on sequence length and sequence alignment. Therefore, the system of 48 equations can often be inconsistent (i.e. values for the 36 parameters that satisfy all of the equations and lie between 0 and 1 do not exist). In this case, it is possible that the Lake method will be unable to find valid estimates of the probability transition matrices, as the linear systems in the method may also be inconsistent or may result in *P̂*_{X}, *P̂*_{Y} or *P̂*_{ZR} having negative elements.

In this paper, we present a comparison of the speed and accuracy of the different methods described above, using sequences of different lengths and divergences. We also compare two methods for obtaining *Q̂* matrices from the resulting *P̂* matrices: using the matrix log [36] and using constrained optimization.
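As a sketch of the constrained-optimization route, the following minimizes the error between *e*^{*Q̂t*} and *P̂* subject to nonnegative off-diagonal rates. The parameterization, the starting point, and the use of `scipy.optimize.minimize` are our illustrative choices, not a description of the software benchmarked here:

```python
# Sketch: recover Q*t from P-hat by constrained optimization rather than the
# matrix log.  We optimize the 12 off-diagonal entries of Q*t (bounded below
# by 0), set diagonals so rows sum to 0, and minimize the Frobenius error
# between expm(Q*t) and P-hat.  Illustrative only.
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

OFF = ~np.eye(4, dtype=bool)        # positions of the 12 off-diagonal entries

def build_qt(params):
    """Assemble a valid rate matrix Q*t from its off-diagonal entries."""
    qt = np.zeros((4, 4))
    qt[OFF] = params
    np.fill_diagonal(qt, -qt.sum(axis=1))
    return qt

def qt_by_optimization(p_hat):
    def loss(params):
        return np.linalg.norm(expm(build_qt(params)) - p_hat) ** 2
    x0 = np.full(12, 0.05)          # small positive rates as a starting point
    res = minimize(loss, x0, bounds=[(0.0, None)] * 12, method="L-BFGS-B")
    return build_qt(res.x)

# Example: a P-hat generated from a known toy rate matrix.
qt_true = build_qt(np.array([.02, .05, .03, .04, .03, .02,
                             .06, .01, .02, .03, .05, .04]))
p_hat = expm(qt_true)
qt_est = qt_by_optimization(p_hat)
```

Unlike the matrix log, this route can never return negative off-diagonal rates, at the cost of an iterative optimization for every matrix estimated.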