Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains

Tataru, Paula; Hobolth, Asger

doi:10.1186/1471-2105-12-465

Methodology article
Open access
Published: 05 December 2011

BMC Bioinformatics volume 12, Article number: 465 (2011)

Abstract

Background

Continuous time Markov chains (CTMCs) are a widely used model for describing the evolution of DNA sequences on the nucleotide, amino acid or codon level. The sufficient statistics for CTMCs are the time spent in a state and the number of changes between any two states. In applications, past evolutionary events (exact times and types of changes) are inaccessible and the past must be inferred from DNA sequence data observed in the present.

Results

We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the end-points of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation in R of the algorithms is available at http://www.birc.au.dk/~paula/.

Conclusions

We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore we find that UNI is usually faster than EVD.

Background

In this paper we consider the problem of calculating the expected time spent in a state and the expected number of jumps between any two states in discretely observed continuous time Markov chains (CTMCs). The case where the CTMC is only recorded at discretely observed time points arises in molecular evolution where DNA sequence data is extracted at present day and past evolutionary events are missing. In this situation, efficient methods for calculating these types of expectations are needed. In particular, two classes of applications can be identified.

The first class of applications is concerned with rate matrix estimation. [1] describes how the expectation-maximization (EM) algorithm can be applied to estimate the rate matrix from DNA sequence data observed in the leaves of an evolutionary tree. The EM algorithm is implemented in the software XRate [2] and has been applied in [3] for estimating empirical codon rate matrices. [1] uses the eigenvalue decomposition of the rate matrix to calculate the expected time spent in a state and the expected number of jumps between states.

The second class of applications is concerned with understanding and testing various aspects of evolutionary trajectories. In [4] it is emphasized that analytical results for jump numbers are superior to simulation approaches and various applications of jump number statistics are provided, including a test for the hypothesis that a trait changed its state no more than once in its evolutionary history and a diagnostic tool to measure discrepancies between the data and the model. [4] assumes that the rate matrix is diagonalizable and that the eigenvalues are real, and applies a spectral representation of the transition probability matrix to obtain the expected number of state changes.

[5] and [6] describe a method, termed substitution mapping, for detecting coevolution of evolutionary traits, and a similar method is described in [7]. The substitution mapping method is based on the expected number of substitutions while [7] base their statistics on the expected time spent in a state. Furthermore [7] describes an application concerned with mapping synonymous and non-synonymous mutations on branches of a phylogenetic tree and employs the expected number of changes between any two states for this purpose. [8] uses the expected number of state changes to calculate certain labeled evolutionary distances. A labeled evolutionary distance could for example be the number of state changes from or to a specific nucleotide. In [9] substitution mapping is invoked for identifying biochemically constrained sites. In [7] and [8] the summary statistics are calculated using the eigenvalue decomposition method suggested by [1]. In [5, 6] and [9] the substitution mapping is achieved using a more direct formula for calculating the number of state changes. In this direct approach an infinite sum must be truncated and it is difficult to control the error associated with the truncation. An alternative is described in [10] where uniformization is applied to obtain the expected number of jumps. [10] uses the expected number of jumps on a branch to detect lineages in a phylogenetic tree that are under selection.

A third algorithm for obtaining the number of changes or time spent in a state is outlined in [11]. The algorithm is based on [12] where a method for calculating integrals of matrix exponentials is described. A natural question arises: which of the three methods (eigenvalue decomposition, uniformization or matrix exponentiation) for calculating conditional expectations of summary statistics for a discretely observed CTMC should be preferred? The aim of this paper is to provide an answer to this question. We describe and compare the three methods. Our implementations in R [13] are available at http://www.birc.au.dk/~paula/. (Furthermore the eigenvalue decomposition and uniformization methods are also available as a C++ class in the bio++ library at http://biopp.univ-montp2.fr/.) The performance and discussion of the algorithms are centered around two applications. The first application is concerned with rate matrix estimation; we estimate the Goldman-Yang codon model [14] using the expectation-maximization algorithm. The second application is based on the labeled distance estimation presented in [8].

Consider a stochastic process {X(s): 0 ≤ s ≤ t} which can be described by a CTMC with n states and an n × n rate matrix Q = (q_cd). The off-diagonal entries in Q are non-negative and rows sum to zero, i.e. q_cc = -Σ_{d≠c} q_cd = -q_c. Maximum likelihood estimation of the rate matrix from a complete observation of the process is straightforward. The likelihood of the process, conditional on the beginning state X(0), is given by (e.g. [15])

L (Q; {X (s) : 0 \leq s \leq t}) = exp (- \sum_{c} q_{c} T_{c}) (\prod_{c = 1}^{n} \prod_{d \neq c} q_{c d}^{N_{c d}}),

(1)

where T_c is the total time spent in state c and N_cd is the number of jumps from c to d. The sufficient statistics for a CTMC are thus the time spent in each state and the number of jumps between any two states. In applications, however, access is limited to DNA data from extant species. The CTMC is discretely observed and we must estimate the mean values of T_c and N_cd conditional on the end-points X(0) = a and X(t) = b. From [15] we have that

E [T_{c} | X (0) = a, X (t) = b] = E [T_{c} | t, a, b] = \frac{I_{c c}^{a b} (t)}{p_{a b} (t)}

(2)

E [N_{c d} | X (0) = a, X (t) = b] = E [N_{c d} | t, a, b] = \frac{q_{c d} I_{c d}^{a b} (t)}{p_{a b} (t)}

(3)

where P(t) = (p_ij(t)) = e^{Qt} is the transition probability matrix and

I_{c d}^{a b} (t) = \int_{0}^{t} p_{a c} (u) p_{d b} (t - u) d u .

(4)

Many applications require a linear combination of certain substitutions or times. Examples include the number of transitions, transversions, synonymous and non-synonymous substitutions. In the two applications described below the statistic of interest is a linear combination of certain substitutions and times. Let therefore C be an n × n matrix and denote by Σ(C; t) the matrix with entries

\sum (C; a, b, t) = \sum_{c, d} C_{c d} I_{c d}^{a b} (t) .

(5)

We describe, compare and discuss three methods for calculating Σ(C; t). The evaluation of a single integral (4) takes O(n³) time, and therefore a naive calculation over all n² end-point pairs (a, b), assuming that C contains just one non-zero entry, has an O(n⁵) running time. Even worse, if C contains O(n²) non-zero entries, then the naive implementation has an O(n⁷) running time. For all three methods our implementations of Σ(C; t) run in O(n³) time.

Results

Applications

Application 1: Rate matrix estimation

Our first application is the problem of estimating the parameters in a CTMC for evolution of coding DNA sequences which we describe using the 61 × 61 rate matrix (excluding stop codons) given by Goldman and Yang [14]:

q_{i j} = \begin{cases} 0 & \text{if there is more than one difference between codons } i \text{ and } j \\ α κ π_{j} & \text{if } j \text{ is obtained from } i \text{ by a synonymous transition} \\ α π_{j} & \text{if } j \text{ is obtained from } i \text{ by a synonymous transversion} \\ α ω κ π_{j} & \text{if } j \text{ is obtained from } i \text{ by a non-synonymous transition} \\ α ω π_{j} & \text{if } j \text{ is obtained from } i \text{ by a non-synonymous transversion} \end{cases}

(6)

where π is the stationary distribution, κ is the transition/transversion rate ratio, ω is the non-synonymous/synonymous ratio and α is a scaling factor. The stationary distribution π is determined directly from the data using the codon frequencies. We estimate the remaining parameters θ = (α, κ, ω) using the expectation-maximization (EM) algorithm [16] as described below.

Suppose the complete data x is available, consisting of times and types of substitutions in all sites and in all branches of the tree. The complete data log likelihood is, using (1) and (6),

\begin{align} ℓ (α, κ, ω; x) = {} & - α L_{s,tv} - α ω L_{ns,tv} - α κ L_{s,ts} - α κ ω L_{ns,ts} \\ & + N \log α + N_{ts} \log κ + N_{ns} \log ω, \end{align}

(7)

where we use the notation

L_{s,ts} = \sum_{i} T_{i} \sum_{j} π_{j} 1 ((i, j) \in ℒ_{s,ts}) \quad \text{and} \quad N_{ts} = \sum_{i, j} N_{i j} 1 ((i, j) \in ℒ_{ts})

(8)

where e.g.

ℒ_{s,ts} = \{(i, j) : i \text{ and } j \text{ differ at one position and the substitution of } i \text{ with } j \text{ is a synonymous transition}\} .

A similar notation applies for L_s,tv, L_ns,ts, L_ns,tv, N_ns and N, where the last statistic is the sum of substitutions between all states (i, j) that differ at one position, and the subscripts s, ns, ts and tv stand for synonymous, non-synonymous, transition and transversion.

The complete data log likelihood can be maximized easily by making the re-parametrization β = ακ. We find that

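\hat{α} = \frac{N_{tv}}{L_{s,tv} + \hat{ω} L_{ns,tv}}, \hat{β} = \frac{N_{ts}}{L_{s,ts} + \hat{ω} L_{ns,ts}} and \hat{ω} = \frac{- b - \sqrt{b^{2} - 4 a c}}{2 a},

(9)

where a = -L_ns,tv L_ns,ts N_s, b = L_ns,tv L_s,ts (N_ns - N_tv) + L_ns,ts L_s,tv (N_ns - N_ts), c = L_s,tv L_s,ts N_ns, and N_s = N - N_ns is the number of synonymous substitutions. When all counts and times are positive we have a < 0 < c, so the quadratic equation aω² + bω + c = 0 has exactly one positive root, which is the one given in (9).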
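The closed-form update can be sketched in a few lines of Python. This is an illustration rather than the authors' R code: the function name `m_step` and its argument names are hypothetical, it assumes N = N_tv + N_ts and N_s = N - N_ns, and it assumes all statistics are strictly positive. To stay independent of sign conventions it computes both roots of aω² + bω + c = 0 and keeps the positive one.

```python
import math

def m_step(L_s_tv, L_ns_tv, L_s_ts, L_ns_ts, N_tv, N_ts, N_ns):
    """Maximize the complete-data log likelihood (7) given its statistics.

    Returns (alpha, kappa, omega). Assumes every argument is positive and
    that each counted substitution is either a transition or a transversion.
    """
    N_s = (N_tv + N_ts) - N_ns            # assumed: N = N_tv + N_ts, N_s = N - N_ns
    a = -L_ns_tv * L_ns_ts * N_s
    b = L_ns_tv * L_s_ts * (N_ns - N_tv) + L_ns_ts * L_s_tv * (N_ns - N_ts)
    c = L_s_tv * L_s_ts * N_ns
    disc = math.sqrt(b * b - 4.0 * a * c)
    # a < 0 < c, so the two roots have opposite signs; keep the positive one.
    omega = max((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a))
    alpha = N_tv / (L_s_tv + omega * L_ns_tv)
    beta = N_ts / (L_s_ts + omega * L_ns_ts)   # beta = alpha * kappa
    return alpha, beta / alpha, omega
```

In an EM iteration the arguments would be the conditional expectations of these statistics rather than observed values.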

In reality the data y is only available in the leaves, and the times and types of substitutions in all sites and all branches of the tree are inaccessible. The EM algorithm is an efficient tool for maximum likelihood estimation in problems where the complete data log likelihood is analytically tractable but full information about the data is missing.

The EM algorithm is an iterative procedure consisting of two steps. In the E-step the expected complete log likelihood

G (θ; θ_{0}, y) = E_{θ_{0}} [ℓ (θ; x) | y]

(10)

conditional on the data y and the current estimate of the parameters θ₀ is calculated. In the M-step the parameters are updated by maximizing G(θ; θ₀,y). The parameters converge to a local maximum of the likelihood for the observed data.

The expected log likelihood conditional on the data y and under the three parameters α, κ and ω is

\begin{align} E [ℓ (α, κ, ω; x) | y] = {} & - α E [L_{s,tv} | y] - α ω E [L_{ns,tv} | y] \\ & - α κ E [L_{s,ts} | y] - α κ ω E [L_{ns,ts} | y] \\ & + E [N | y] \log α + E [N_{ts} | y] \log κ + E [N_{ns} | y] \log ω . \end{align}

(11)

Therefore the E-step requires expectations of linear combinations of waiting times in a set of states and numbers of jumps between certain states. Because of the Markov property this calculation can be divided into two parts. First we use the peeling algorithm [17, 18] to obtain the probability $ℙ (γ_{k} = a, β_{k} = b | y, t_{k})$ that a branch k of length t_k, with nodes γ_k and β_k above and below the branch, respectively, has end-points a and b. Second, we calculate the desired summary statistic by summing over all branches. For example we have

E [L_{s,ts} | y] = \sum_{branch k} \sum_{a, b} ℙ (γ_{k} = a, β_{k} = b | y, t_{k}) E [L_{s,ts} | t_{k}, a, b]

(12)

E [N_{ts} | y] = \sum_{branch k} \sum_{a, b} ℙ (γ_{k} = a, β_{k} = b | y, t_{k}) E [N_{ts} | t_{k}, a, b] .

(13)

The E-step thus consists of calculating conditional expectations of linear combinations of times such as $E [L_{s,ts} | t_{k}, a, b]$ and substitutions such as $E [N_{ts} | t_{k}, a, b]$ where L_s,ts and N_ts are given by (8). In our application n = 61 and the first type of statistic $E [L_{s,ts} | t, a, b]$ is (up to a factor p_ab(t)) of the form (5) with diagonal entries $C_{i i} = \sum_{j} π_{j} 1 ((i, j) \in ℒ_{s,ts})$ and all off-diagonal entries equal to zero. The second type of statistic $E [N_{ts} | t, a, b]$ is also of the form (5) with off-diagonal entries $C_{i j} = q_{i j} 1 ((i, j) \in ℒ_{ts})$ and zeros on the diagonal.

Application 2: Robust distance estimation

The second application is a new approach for estimating labeled evolutionary distances, entitled robust counting and introduced in [8]. The purpose is to calculate a distance that is robust to model misspecification. The method is applied to labeled distances, for example, the synonymous distance between two coding DNA sequences. As it is believed that selection mainly acts at the protein level, synonymous substitutions are neutral and phylogenies built on these types of distances are more likely to reveal the true evolutionary history. The distance is calculated using the mean numbers of labeled substitutions conditioned on pairwise site patterns, averaged over the empirical distribution of site patterns observed in the data. In the conventional method the average is taken over the theoretical distribution of site patterns. The robustness is therefore achieved by using more information from the data and less from the model.

Let Q be the rate matrix of the assumed model, P(t) = e^Qt, the labeling be given through a set of pairs ℒ and the data be represented by a pairwise alignment y = (y₁, y₂) of length m. As data only contains information about the product Qt, where t is the time distance between the sequences, we can set t = 1.

Suppose we observe the complete data consisting of the types of substitutions that occurred in all sites and let $N_{ℒ} = \sum_{i, j} N_{i j} 1 ((i, j) \in ℒ)$ be the labeled number of substitutions. A natural labeled distance is given by $d_{ℒ} = E (N_{ℒ})$ . The labeled distance is estimated as the average across all sites of the expected number of labeled substitutions conditioned on the observed end points:

\begin{align} {\hat{d}}_{ℒ} & = \frac{1}{m} \sum_{s = 1}^{m} E [N_{ℒ} | X (0) = y_{1 s}, X (1) = y_{2 s}] \\ & = \frac{1}{m} \sum_{s = 1}^{m} E [\sum_{i, j} N_{i j} 1 ((i, j) \in ℒ) | 1, y_{1 s}, y_{2 s}] . \end{align}

(14)

Therefore this application requires evaluating a sum of the form (5) with off-diagonal entries $C_{i j} = q_{i j} 1 ((i, j) \in ℒ)$ and zeros on the diagonal.
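As an illustration of (14), the following Python sketch builds C from a labeling and averages the conditional expectations over the sites of a pairwise alignment. It is a sketch under stated assumptions, not the implementation from [8]: the function name `robust_distance` is hypothetical, states are assumed to be encoded as integers 0, ..., n-1, and Σ(C; t) and P(t) are both read off a single matrix exponential using the block-matrix identity (23) described under Exponentiation (EXPM), here computed with `scipy.linalg.expm`.

```python
import numpy as np
from scipy.linalg import expm

def robust_distance(Q, labeled_pairs, y1, y2, t=1.0):
    """Estimate the labeled distance (14) for a pairwise alignment (y1, y2).

    labeled_pairs: iterable of (i, j) state pairs that make up the labeling.
    y1, y2: sequences of integer-encoded states of equal length m.
    """
    n = Q.shape[0]
    C = np.zeros((n, n))
    for i, j in labeled_pairs:
        C[i, j] = Q[i, j]                 # C_ij = q_ij on labeled pairs, 0 elsewhere
    A = np.block([[Q, C], [np.zeros((n, n)), Q]])
    E = expm(A * t)
    P = E[:n, :n]                         # transition probabilities P(t)
    Sigma = E[:n, n:]                     # Sigma(C; t), upper right block
    y1 = np.asarray(y1)
    y2 = np.asarray(y2)
    # E[N_L | t, a, b] = Sigma(C; a, b, t) / p_ab(t), averaged over sites
    return float(np.mean(Sigma[y1, y2] / P[y1, y2]))
```

For a two-state chain with unit rates in both directions and both substitutions labeled, the expected number of jumps given identical end-points at t = 1 is tanh(1), which gives a quick sanity check.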

Algorithms

The calculation of Σ(C; t) is based on the integrals $I_{c d}^{a b} (t)$ . In this section we present three existing methods for obtaining the integrals and extend them to obtain Σ(C; t).

Eigenvalue decomposition (EVD)

When the rate matrix Q is diagonalizable, the computation of transition probabilities p_ab(t) and integrals $I_{c d}^{a b} (t)$ can be done via the eigenvalue decomposition (EVD). EVD is a widely used method for calculating matrix exponentials. Let Q = U ΛU^-1 be the eigenvalue decomposition, with Λ = diag(λ₁, ..., λ_n). It follows that

P (t) = e^{Q t} = e^{(U Λ U^{- 1}) t} = U e^{Λ t} U^{- 1} .

(15)

Because Λ is diagonal, e^{Λt} is also diagonal with ${(e^{Λ t})}_{i i} = e^{λ_{i} t}$ .

The integral (4) becomes

I_{c d}^{a b} (t) = \sum_{i} U_{a i} {(U^{- 1})}_{i c} \sum_{j} U_{d j} {(U^{- 1})}_{j b} J_{i j} (t)

(16)

where J_{i j} (t) = \begin{cases} t e^{λ_{i} t} & \text{if } λ_{i} = λ_{j} \\ \frac{e^{λ_{i} t} - e^{λ_{j} t}}{λ_{i} - λ_{j}} & \text{if } λ_{i} \neq λ_{j} . \end{cases}

(17)

Replacing $I_{c d}^{a b} (t)$ with (16) in (5), rearranging the sums and using $A_{c j} = \sum_{d} C_{c d} U_{d j}, B_{i j} = J_{i j} (t) \sum_{c} {(U^{- 1})}_{i c} A_{c j}, D_{i b} = \sum_{j} B_{i j} {(U^{- 1})}_{j b}$ and $Σ (C; a, b, t) = \sum_{i} U_{a i} D_{i b}$ we find

Σ (C; t) = U [J (t) \circ (U^{- 1} C U)] U^{- 1}

(18)

where ○ represents the entry-wise product.

The eigenvalues and eigenvectors might be complex, but they come in complex conjugate pairs and the final result is always real; for more information we refer to the Supplementary Information in [2]. If the CTMC is reversible, the decomposition can be done on a symmetric matrix obtained from Q (e.g. [15]), which is faster and tends to be more robust. Let π be the stationary distribution. Due to reversibility, π_a q_ab = π_b q_ba, which can be written as ΠQ = Q* Π where Π = diag(π). Let S = Π^{1/2} Q Π^{-1/2}.

We have that

\begin{align} S^{*} & = Π^{- 1 ∕ 2} Q^{*} Π^{1 ∕ 2} = Π^{- 1 ∕ 2} (Q^{*} Π) Π^{- 1 ∕ 2} \\ & = Π^{- 1 ∕ 2} (Π Q) Π^{- 1 ∕ 2} = Π^{1 ∕ 2} Q Π^{- 1 ∕ 2} = S \end{align}

(19)

where S* is the transpose of S. Hence S is symmetric. Let Λ and V be its eigenvalues and eigenvectors, respectively. Then V Λ V^{-1} = S = Π^{1/2} Q Π^{-1/2}, which implies Q = (Π^{-1/2} V) Λ (V^{-1} Π^{1/2}), and it follows that Q has the same eigenvalues as S and the columns of Π^{-1/2} V as eigenvectors.
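The symmetrization can be sketched with NumPy as follows. This is a minimal illustration, assuming Q is reversible with a known stationary distribution `pi`; the function name `evd_reversible` is hypothetical. Because V is orthogonal for a symmetric matrix, V^{-1} is simply the transpose of V, so no general matrix inversion is needed.

```python
import numpy as np

def evd_reversible(Q, pi):
    """Eigen-decomposition Q = U diag(lam) U_inv for a reversible rate matrix."""
    sqrt_pi = np.sqrt(pi)
    # S = Pi^{1/2} Q Pi^{-1/2}: scale row i by sqrt(pi_i), column j by 1/sqrt(pi_j)
    S = (sqrt_pi[:, None] * Q) / sqrt_pi[None, :]
    S = (S + S.T) / 2.0                 # remove round-off asymmetry
    lam, V = np.linalg.eigh(S)          # real eigenvalues, orthogonal V
    U = V / sqrt_pi[:, None]            # U = Pi^{-1/2} V
    U_inv = V.T * sqrt_pi[None, :]      # U^{-1} = V^T Pi^{1/2}
    return lam, U, U_inv
```

The returned triple can be used wherever U, Λ and U^{-1} are needed, for example in (15) and (18).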

The results can be summarized in the following algorithm:

Algorithm 1: EVD

Input: Q, C, t

Output: Σ(C; t)

Step 1: Determine eigenvalues λ_i.

Determine the eigenvectors U_i for Q and compute U^{-1}.

Step 2: Determine matrix J(t) from (17).

Step 3: Determine matrix Σ(C;t) from (18).
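Algorithm 1 can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' R implementation: the name `sigma_evd` is hypothetical, Q is assumed diagonalizable, and the general `numpy.linalg.eig` routine is used so that complex eigenvalues are handled. As a numerical detail that is not part of the source, the quotient in (17) is evaluated through `expm1` to limit cancellation when two eigenvalues are close, and eigenvalues are treated as equal when `numpy.isclose` says so.

```python
import numpy as np

def sigma_evd(Q, C, t):
    """Sigma(C; t) = U [J(t) o (U^{-1} C U)] U^{-1}, equation (18)."""
    # Step 1: eigenvalues, eigenvectors and the inverse eigenvector matrix
    lam, U = np.linalg.eig(Q)
    U_inv = np.linalg.inv(U)
    # Step 2: J(t) from (17)
    diff = lam[:, None] - lam[None, :]          # lambda_i - lambda_j
    same = np.isclose(diff, 0.0)
    safe = np.where(same, 1.0, diff)            # avoid division by zero
    J_equal = t * np.exp(lam[:, None] * t)      # t e^{lambda_i t}
    # (e^{l_i t} - e^{l_j t}) / (l_i - l_j) = e^{l_j t} expm1((l_i - l_j) t) / (l_i - l_j)
    J_diff = np.exp(lam[None, :] * t) * np.expm1(safe * t) / safe
    J = np.where(same, J_equal, J_diff)
    # Step 3: entry-wise product and back-transformation
    S = U @ (J * (U_inv @ C @ U)) @ U_inv
    return S.real                               # imaginary parts are round-off
```

A convenient check is C = I, for which Σ(I; t) = t P(t), so every row of the result sums to t.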

Uniformization (UNI)

The uniformization method was first introduced in [19] for computing the matrix exponential P(t) = e^Qt. In [11] it was shown how this method can be used for calculating summary statistics, even for statistics that cannot be written in integral form. Let μ = max_i(q_i) and $R = \frac{1}{μ} Q + I$ , where I is the identity matrix.

Then

P (t) = e^{μ (R - I) t} = \sum_{m = 0}^{\infty} R^{m} \frac{{(μ t)}^{m}}{m!} e^{- μ t} = \sum_{m = 0}^{\infty} R^{m} Pois (m; μ t)

(20)

where Pois(m; λ) is the probability of m occurrences from a Poisson distribution with mean λ. Using (20) we also have

\begin{align} I_{c d}^{a b} (t) & = \int_{0}^{t} p_{a c} (u) p_{d b} (t - u) d u \\ & = \int_{0}^{t} [\sum_{i = 0}^{\infty} {(R^{i})}_{a c} \frac{{(μ u)}^{i}}{i!} e^{- μ u}] [\sum_{j = 0}^{\infty} {(R^{j})}_{d b} \frac{{(μ (t - u))}^{j}}{j!} e^{- μ (t - u)}] d u \\ & = \sum_{i = 0}^{\infty} \sum_{j = 0}^{\infty} {(R^{i})}_{a c} {(R^{j})}_{d b} \frac{μ^{i + j}}{i! j!} e^{- μ t} \int_{0}^{t} u^{i} {(t - u)}^{j} d u \\ & = \sum_{i = 0}^{\infty} \sum_{j = 0}^{\infty} {(R^{i})}_{a c} {(R^{j})}_{d b} \frac{μ^{i + j}}{i! j!} e^{- μ t} \frac{i! j!}{(i + j + 1)!} t^{i + j + 1} \\ & = \frac{1}{μ} \sum_{i = 0}^{\infty} \sum_{j = 0}^{\infty} {(R^{i})}_{a c} {(R^{j})}_{d b} \frac{{(μ t)}^{i + j + 1}}{(i + j + 1)!} e^{- μ t} \\ & = \frac{1}{μ} \sum_{m = 0}^{\infty} Pois (m + 1; μ t) \sum_{l = 0}^{m} {(R^{l})}_{a c} {(R^{m - l})}_{d b} . \end{align}

(21)

Replacing (21) in (5), rearranging the sums and using that $\sum_{d} C_{c d} {(R^{m - l})}_{d b} = {(C R^{m - l})}_{c b}$ and $\sum_{c} {(R^{l})}_{a c} {(C R^{m - l})}_{c b} = {(R^{l} C R^{m - l})}_{a b}$ we derive

Σ (C; t) = \frac{1}{μ} \sum_{m = 0}^{\infty} Pois (m + 1; μ t) \sum_{l = 0}^{m} R^{l} C R^{m - l} .

(22)

The main challenge with this method is the infinite sum and we use (20) to determine a truncation point. In particular if we let λ = μt and truncate at s(λ) we can bound the error using the tail of the Poisson distribution:

|p_{a b} (t) - \sum_{m = 0}^{s (λ)} {(R^{m})}_{a b} Pois (m; μ t)| = \sum_{m = s (λ) + 1}^{\infty} {(R^{m})}_{a b} Pois (m; μ t) \leq \sum_{m = s (λ)}^{\infty} Pois (m; μ t) .

We have that, for large values of λ, $Pois (λ) \approx \mathcal{N} (λ, λ)$ , where $\mathcal{N} (μ, σ^{2})$ is the normal distribution with mean μ and variance σ². Therefore, for large λ, the error bound

b = \sum_{m = s (λ)}^{\infty} Pois (m; μ t) \approx 1 - Φ (\frac{s (λ) - λ}{\sqrt{λ}}),

where Φ(·) is the cumulative distribution function for the standard normal distribution. Consequently we can approximate the truncation point s(λ) with $\sqrt{λ} Φ^{- 1} (1 - b) + λ$ . If b = 10^-8 we obtain Φ^-1 (1 - b) = 5.6.

Another way to determine s(λ) is to use R to evaluate Pois(m; λ) for values of m that gradually increase, until the tail is at most b = 10^-8. Combining these two approaches, we performed a linear regression, approximating the tails from R by $c_{1} + c_{2} \sqrt{λ} + c_{3} λ$ . We obtained c₁ = 4.0731, c₂ = 5.6469, c₃ = 0.9963 but, in order to be conservative, we use $s (λ) = ⌈4 + 6 \sqrt{λ} + λ⌉$ where ⌈x⌉ is the smallest integer greater than or equal to x. In Figure 1 we compare the exact truncation value and the linear regression approximation.

The linear regression provides an excellent fit to the tail of the distribution.
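The truncation rule and the Poisson tail it is meant to control can be explored with a short Python sketch. The function names are hypothetical, and the tail is approximated by summing a fixed number of consecutive Poisson terms, which is an assumption that is adequate for moderate λ but not a general-purpose tail routine.

```python
import math

def truncation_point(lam):
    """Rule from the text: s(lam) = ceil(4 + 6 * sqrt(lam) + lam)."""
    return math.ceil(4 + 6 * math.sqrt(lam) + lam)

def poisson_tail(s, lam, terms=200):
    """Approximate sum_{m >= s} Pois(m; lam) using `terms` consecutive terms."""
    # log Pois(s; lam) = -lam + s log(lam) - log(s!)
    p = math.exp(-lam + s * math.log(lam) - math.lgamma(s + 1))
    total = 0.0
    for m in range(s, s + terms):
        total += p
        p *= lam / (m + 1)          # Pois(m + 1; lam) = Pois(m; lam) * lam / (m + 1)
    return total
```

Calling `poisson_tail(truncation_point(lam), lam)` for a grid of λ values shows how much probability mass remains beyond the chosen truncation point.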

In summary we have the following algorithm:

Algorithm 2: UNI

Input: Q, C, t

Output: Σ(C; t)

Step 1: Determine μ, s(μt) and R.

Step 2: Calculate R^m for 2 ≤ m ≤ s(μt).

Step 3: Calculate $A (m) = \sum_{l = 0}^{m} R^{l} C R^{m - l}$ for 0 ≤ m ≤ s(μt),

using the recursion A(m + 1) = A(m) R + R^{m+1} C.

Step 4: Determine Σ(C; t) from (22).
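Algorithm 2 can be sketched in NumPy as below. The name `sigma_uni` is hypothetical and the code is an illustration rather than the authors' implementation. It assumes μt is moderate: the Poisson weights are built up from e^{-μt}, which would underflow for very large μt.

```python
import numpy as np

def sigma_uni(Q, C, t):
    """Sigma(C; t) by uniformization, equation (22), truncated at s(mu * t)."""
    n = Q.shape[0]
    # Step 1: mu, truncation point and R
    mu = np.max(-np.diag(Q))
    R = Q / mu + np.eye(n)
    lam = mu * t
    s = int(np.ceil(4 + 6 * np.sqrt(lam) + lam))
    # Steps 2-4, interleaved: powers of R, A(m), and the weighted sum
    R_pow = np.eye(n)                   # R^0
    A = np.array(C, dtype=float)        # A(0) = C
    pois = lam * np.exp(-lam)           # Pois(1; lam)
    total = pois * A
    for m in range(1, s + 1):
        R_pow = R_pow @ R               # R^m
        A = A @ R + R_pow @ C           # A(m) = A(m-1) R + R^m C
        pois *= lam / (m + 1)           # Pois(m + 1; lam)
        total += pois * A
    return total / mu
```

As with any method for Σ(C; t), setting C = I gives a result whose rows sum to t, up to the truncation error.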

Exponentiation (EXPM)

This method for calculating the integral (4) was developed in [12] and emphasized in [11]. Suppose we want to evaluate $\int_{0}^{t} e^{Q u} B e^{Q (t - u)} d u$ , where Q and B are n × n matrices. To calculate this integral, we use an auxiliary matrix $A = [\begin{matrix} Q & B \\ 0 & Q \end{matrix}]$ and the desired integral can be found in the upper right corner of the matrix exponential of A:

\int_{0}^{t} e^{Q u} B e^{Q (t - u)} d u = {(e^{A t})}_{1 : n, (n + 1) : 2 n} .

(23)

We are interested in

\begin{align} I_{c d}^{a b} (t) & = \int_{0}^{t} p_{a c} (u) p_{d b} (t - u) d u = \int_{0}^{t} {(e^{Q u})}_{a c} {(e^{Q (t - u)})}_{d b} d u \\ & = {(\int_{0}^{t} e^{Q u} 1_{{(c, d)}} e^{Q (t - u)} d u)}_{a b} \end{align}

(24)

where $1_{{(c, d)}}$ is a matrix with 1 in entry (c, d) and zero otherwise. We can use this method to determine $I_{c d}^{a b} (t)$ by setting $B = 1_{{(c, d)}}$ , constructing the auxiliary matrix A, calculating the matrix exponential of At, and finally reading off the integral in entry (a, b) of the upper right corner of the matrix exponential.

Replacing (24) in (5) and rearranging the terms we have

Σ (C; t) = \int_{0}^{t} e^{Q u} \sum_{c, d} C_{c d} 1_{{(c, d)}} e^{Q (t - u)} d u and \sum_{c, d} C_{c d} 1_{{(c, d)}} = C .

(25)

Therefore by setting B = C in the auxiliary matrix we obtain Σ(C;t).

The EXPM algorithm is as follows:

Algorithm 3: EXPM

Input: Q, C, t

Output: Σ(C; t)

Step 1: Construct $A = [\begin{matrix} Q & C \\ 0 & Q \end{matrix}]$ .

Step 2: Calculate the matrix exponential e^{At}.

Step 3: Σ(C; t) is the upper right corner of the matrix exponential.
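Algorithm 3 is short enough to sketch directly. The version below is an illustration, not the authors' R code: the name `sigma_expm` is hypothetical, and it uses `scipy.linalg.expm` for the matrix exponential instead of the R expm package.

```python
import numpy as np
from scipy.linalg import expm

def sigma_expm(Q, C, t):
    """Sigma(C; t) as the upper right block of exp(A t), A = [[Q, C], [0, Q]]."""
    n = Q.shape[0]
    # Step 1: auxiliary 2n x 2n matrix
    A = np.block([[Q, C], [np.zeros((n, n)), Q]])
    # Steps 2 and 3: exponentiate and read off rows 1..n, columns n+1..2n
    return expm(A * t)[:n, n:]
```

Passing a matrix C with a single 1 in entry (c, d) returns the integrals $I_{c d}^{a b} (t)$ for all end-points (a, b) at once.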

Testing

We implemented the presented algorithms in R and tested them with respect to accuracy and speed.

Accuracy

The accuracy of the methods depends on the size of the rate matrix and the time t. To investigate how these factors influence the result, we used two different CTMCs that allow an analytical expression for (4). The first investigation is based on the Jukes-Cantor model where the rate matrix has uniform rates and variable size n:

q_{i j} = \begin{cases} - 1 & \text{if } i = j \\ \frac{1}{n - 1} & \text{if } i \neq j . \end{cases}

Q has two unique eigenvalues: 0 with multiplicity 1 and $- \frac{n}{n - 1}$ with multiplicity n-1. We obtain

\begin{gathered} p_{i j} (t) = \begin{cases} \frac{1}{n} + \frac{n - 1}{n} \exp (- \frac{n t}{n - 1}) & \text{if } i = j \\ \frac{1}{n} - \frac{1}{n} \exp (- \frac{n t}{n - 1}) & \text{if } i \neq j \end{cases} \\ \text{and} \quad I_{c d}^{a b} (t) = \frac{1}{n^{2}} \begin{cases} t + t \exp (- \frac{n t}{n - 1}) - \frac{2 (n - 1)}{n} (1 - \exp (- \frac{n t}{n - 1})) & \text{if } a \neq c, d \neq b \\ t + {(n - 1)}^{2} t \exp (- \frac{n t}{n - 1}) + \frac{2 {(n - 1)}^{2}}{n} (1 - \exp (- \frac{n t}{n - 1})) & \text{if } a = c, d = b \\ t - (n - 1) t \exp (- \frac{n t}{n - 1}) + \frac{(n - 2) (n - 1)}{n} (1 - \exp (- \frac{n t}{n - 1})) & \text{otherwise} . \end{cases} \end{gathered}

We compared the result from all three methods against the true value of (5) for size n ranging from 5 to 100, t = 0.1 and random binary matrices C. Entries in C are 1 with probability $\frac{1}{2}$ . For each fixed size, we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.

The second CTMC is the HKY model:

Q = (\begin{matrix} \cdot & κ π_{G} & π_{C} & π_{T} \\ κ π_{A} & \cdot & π_{C} & π_{T} \\ π_{A} & π_{G} & \cdot & κ π_{T} \\ π_{A} & π_{G} & κ π_{C} & \cdot \end{matrix})

where π = (0.2,0.2,0.3,0.3) is the stationary distribution and κ = 2.15 is the transition/transversion rate ratio. This rate matrix has an analytical result for (4) which can be obtained through the eigenvalue decomposition. The eigenvalues and eigenvectors of Q are

\begin{gathered} λ = (0, - 1, - π_{Y} κ - π_{R}, - π_{R} κ - π_{Y}) \quad \text{where } π_{R} = π_{A} + π_{G} \text{ and } π_{Y} = π_{C} + π_{T}, \\ U = \begin{pmatrix} 1 & - \frac{π_{Y}}{π_{R}} & 0 & - \frac{π_{G}}{π_{A}} \\ 1 & - \frac{π_{Y}}{π_{R}} & 0 & 1 \\ 1 & 1 & - \frac{π_{T}}{π_{C}} & 0 \\ 1 & 1 & 1 & 0 \end{pmatrix}, \quad U^{- 1} = \begin{pmatrix} π_{A} & π_{G} & π_{C} & π_{T} \\ - π_{A} & - π_{G} & \frac{π_{C} π_{R}}{π_{Y}} & \frac{π_{T} π_{R}}{π_{Y}} \\ 0 & 0 & - \frac{π_{C}}{π_{Y}} & \frac{π_{C}}{π_{Y}} \\ - \frac{π_{A}}{π_{R}} & \frac{π_{A}}{π_{R}} & 0 & 0 \end{pmatrix} . \end{gathered}

From this, using the symbolic operations in Matlab [20], we obtained the final analytic expression for (4). Using this model we compared the results of all three methods against the true value of (5) for various values of t and randomly generated binary matrices C. For each t we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.

In both cases, all methods showed good accuracy, as the normalized deviation was no bigger than 3 × 10^-9. We also note that EXPM tended to be the most precise while UNI provided the worst approximation. To further investigate the accuracy, we performed calculations on randomly generated reversible rate matrices: we first drew the stationary distribution from the Dirichlet distribution with shape parameters equal to 1, then drew all entries q_ij with i ≥ j from the exponential distribution with parameter 1, and finally calculated the remaining entries using the reversibility property. In all the runs the relative difference between EVD, UNI and EXPM was less than 10^-5. This indicates that all three methods have a similar performance in a wide range of applications.
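The construction of random reversible rate matrices can be sketched with NumPy as follows. The function name is hypothetical, and one detail is an assumption: only the entries strictly below the diagonal are drawn from the exponential distribution, while the diagonal is set so that each row sums to zero.

```python
import numpy as np

def random_reversible_rate_matrix(n, rng):
    """Draw a reversible rate matrix Q and its stationary distribution pi."""
    pi = rng.dirichlet(np.ones(n))              # Dirichlet with all shapes equal to 1
    Q = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            Q[i, j] = rng.exponential(1.0)      # lower triangle: Exp(1) rates
            Q[j, i] = pi[i] * Q[i, j] / pi[j]   # detailed balance: pi_j q_ji = pi_i q_ij
    np.fill_diagonal(Q, -Q.sum(axis=1))         # rows sum to zero
    return Q, pi

# Example draw with a fixed seed for reproducibility
Q_example, pi_example = random_reversible_rate_matrix(5, np.random.default_rng(0))
```

Matrices drawn this way can be passed to different implementations of Σ(C; t) to compare their outputs.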

Speed

Partition of computation

Assume we need to evaluate Σ(C; t) for a fixed matrix C and multiple time points t ∈ {t₁,...t_k}. In each iteration of the EM-algorithm in Application 1 we need this type of calculation while in order to calculate the labeled distance in Application 2 just one time point is required. Using EVD (Algorithm 1) we do the eigenvalue decomposition (Step 1) once and then, for each time point t_i, we apply Step 2 and Step 3. The eigenvalue decomposition, achieved through the R function eigen, has a running time of O(n³). In Step 2 we determine J(t) and this takes O(n²) time. Step 3 has a running time of O(n³) due to the matrix multiplications.

If instead we apply UNI (Algorithm 2), we run Steps 1-3 for the largest time point max(t_i) and then, for each time point t_i, we apply Step 4. Steps 1-3 take O (s(μ max (t_i)) n³) time, and Step 4 takes O(s(μt_i)n²) time for each i ∈ {1,..., k}. Therefore, even though the total time for both methods is O(n³), the addition of one time point contributes with O(n³) for EVD, but only O(s(μt)n²) for UNI. Recall that the constant s(μt) is the truncation point for the infinite sum in the uniformization method.
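The split between precomputation and per-time-point work for UNI can be sketched as follows. The names `uni_precompute` and `uni_evaluate` are hypothetical and the code is an illustration under the same assumptions as Algorithm 2: the matrices A(m) are computed once for the largest time point, and each further time point only needs a weighted sum of stored n × n matrices.

```python
import numpy as np

def uni_precompute(Q, C, t_max):
    """Steps 1-3 of Algorithm 2 for the largest time point: returns mu and A(0..s)."""
    n = Q.shape[0]
    mu = np.max(-np.diag(Q))
    R = Q / mu + np.eye(n)
    lam_max = mu * t_max
    s_max = int(np.ceil(4 + 6 * np.sqrt(lam_max) + lam_max))
    A_list = [np.array(C, dtype=float)]          # A(0) = C
    R_pow = np.eye(n)
    for m in range(1, s_max + 1):
        R_pow = R_pow @ R                        # R^m, O(n^3) per step
        A_list.append(A_list[-1] @ R + R_pow @ C)
    return mu, A_list

def uni_evaluate(mu, A_list, t):
    """Step 4 for one time point t <= t_max: O(s(mu t) n^2) work."""
    lam = mu * t
    s = min(int(np.ceil(4 + 6 * np.sqrt(lam) + lam)), len(A_list) - 1)
    pois = lam * np.exp(-lam)                    # Pois(1; lam)
    total = pois * A_list[0]
    for m in range(1, s + 1):
        pois *= lam / (m + 1)                    # Pois(m + 1; lam)
        total = total + pois * A_list[m]
    return total / mu
```

Storing all A(m) costs O(s n²) memory, which is the price paid for avoiding matrix multiplications at each additional time point.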

In the case of EXPM (Algorithm 3) we need to calculate the matrix exponential at every single time point. We used the expm R package [21] with the Higham08 method. This is a Padé approximation combined with an improved scaling and squaring [22] and balancing [23]. The running time is O(n³).

Table 1 provides an overview of the running times for each of the methods. The algorithms are divided into precomputation and main computation where the precomputation consists of steps that must be executed only once, while in the main computation we calculate the value of Σ(C;t) for every time point under consideration.

Table 1 Running time complexity

Experiments

We tested the speed of the algorithms in six experiments based on the presented applications and two more experiments using a non-reversible matrix.

GY

The first experiment corresponded to running the EM algorithm on real data consisting of DNA sequences from the HIV pol gene described in [24]. HIV has been extensively studied with respect to selection pressure and drug resistance, and in [24] the authors document convergent evolution in the pol gene caused by drug resistance mutations. The observed data y was a multiple codon alignment of the sequences. For simplicity, we did not consider the columns with gaps or ambiguous nucleotides. To compare the performance of the methods as a function of the size of the data set, we applied the EM algorithm to 15 data sets containing from 2 up to 16 sequences each, extracted from the HIV pol gene data. For each set we assumed the sequences were related according to a fixed tree; we reconstructed the phylogenies in Mega [25] using the Jukes-Cantor model and Neighbor-Joining. We ran the EM algorithm until all three parameters converged. Experiments two and three used the previously estimated matrix Q given by (6) with α = 10.5, κ = 4.27 and ω = 0.6. We let C_ij = q_ij and C_ii = 0, corresponding to calculating the total number of expected substitutions $E [N | t, a, b]$ , and computed the value of Σ(C; t_k) for 10 equidistant sorted time points t_k with 1 ≤ k ≤ 10 (Table 2).

Table 2 Experimental design

GTR

In the fourth experiment we estimated the robust labeled distance of two sequences, using the same set-up as in [8]. For each considered evolutionary distance t between 0.1 and 1, we generated 50 pairwise sequence data sets of length 2000 which have evolved for time t under the general time reversible (GTR) model with

Q = (\begin{matrix} \cdot & r_{1} π_{G} & r_{2} π_{C} & r_{3} π_{T} \\ r_{1} π_{A} & \cdot & r_{4} π_{C} & r_{5} π_{T} \\ r_{2} π_{A} & r_{4} π_{G} & \cdot & r_{6} π_{T} \\ r_{3} π_{A} & r_{5} π_{G} & r_{6} π_{C} & \cdot \end{matrix})

where r = (0.5, 0.3, 0.6, 0.2, 0.3, 0.2) and π = (0.2, 0.2, 0.3, 0.3). For labeling, we considered the jumps to and from nucleotide A, leading to C_ij = q_ij if i or j represents nucleotide A. For each data set, we estimated the GTR parameters as described in [8] and calculated the robust distance. Experiments 5 and 6 used the same GTR matrix with C_ij = q_ij if i or j represents nucleotide A and zero otherwise, and computed the value of Σ(C; t_k) for 10 equidistant sorted time points t_k with 1 ≤ k ≤ 10 (Table 2).

UNR

In the last two experiments we used the same set-up as in experiments 5 and 6 but with a different matrix and time points (Table 2). As the speed of EVD is influenced by the type of the model, we decided to employ a non-reversible matrix. We chose the unrestricted model and carefully set the rates such that the matrix has a complex decomposition:

Q = (\begin{matrix} - 4 & 2 & 1 & 1 \\ 0 & - 3 & 2 & 1 \\ 1 & 0 & - 3 & 2 \\ 2 & 1 & 1 & - 4 \end{matrix}) .

Figure 3 shows the results. For experiments 1 and 4, the plots show the recorded running time under each set-up (different number of sequences or different evolutionary distance). For the remaining experiments each plot starts with the running time of the precomputation which, for UNI, is done on the largest time point t₁₀. Then, at position k, we plot the cumulative running time for the precomputation and the evaluation of Σ(C; t_i) for all i ≤ k. Since EVD and EXPM have running times that are independent of t_k, the running times for these two algorithms are the same in experiments 2 and 3, 5 and 6, and 7 and 8. Moreover, as EXPM depends only on the size of the matrix, its running times in experiments 5-8 are the same.

We observe that in all our experiments EXPM is the slowest method. Whether EVD or UNI is faster depends on the size and type of the matrix, the number of time points and the values of s(μt). As the main computation for UNI has a running time of O(n²) as opposed to O(n³) for EVD (Table 1), UNI should have an increased advantage when the rate matrix is bigger. This means that if many time points are considered, then UNI is generally the faster method. Importantly, we note that the EVD precomputation tends to be faster than the UNI precomputation. We remark that, in the first experiment, UNI proved to be the fastest method while, in the fourth experiment, UNI became slower as the evolutionary distance between the sequences increased and was only faster than EVD for small distances (< 0.2).

By setting t_k in an appropriate manner (Table 2), we obtain the same running time for UNI and EXPM in experiment 7 as in experiment 6. Because experiment 7 used the UNR matrix, EVD is slower than in experiment 6. In this case the difference is observable but not very big; as the size of the matrix increases, this discrepancy increases too. We also note that the difference between the reversible and non-reversible cases is enough to make UNI faster than EVD in the latter case.
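The dependence of UNI on the product μt can be made concrete with a small sketch. It assumes that μ is the largest exit rate max_i(-q_ii) and uses a generic stopping rule, namely the smallest s for which the Poisson(μt) tail mass beyond s falls below a tolerance ε. This rule is an illustrative stand-in and is not necessarily the truncation point s(μt) used in the paper; the function names are hypothetical.

```python
import math


def uniformization_rate(Q):
    """Assumed uniformization rate: the largest exit rate max_i(-q_ii)."""
    return max(-Q[i][i] for i in range(len(Q)))


def poisson_truncation(lam, eps=1e-8):
    """Smallest s with P(Poisson(lam) > s) <= eps (intended for moderate lam)."""
    term = math.exp(-lam)  # P(N = 0)
    cdf = term
    s = 0
    while 1.0 - cdf > eps:
        s += 1
        term *= lam / s    # P(N = s) from P(N = s - 1)
        cdf += term
    return s


Q_UNR = [[-4, 2, 1, 1],
         [0, -3, 2, 1],
         [1, 0, -3, 2],
         [2, 1, 1, -4]]
mu = uniformization_rate(Q_UNR)
terms_needed = {t: poisson_truncation(mu * t) for t in (0.1, 1.0, 10.0)}
```

The number of terms grows with μt, which is consistent with UNI becoming slower for larger evolutionary distances.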

Discussion

The EVD algorithm assumes that the rate matrix is diagonalizable. However, a direct calculation of the integral (4) in the non-diagonalizable case is actually possible using the Jordan normal form of the rate matrix. Let Q = PJP^{-1}, where J is the Jordan normal form of Q and P consists of the generalized eigenvectors (we used P and J for other quantities earlier, but this should not cause confusion here and we prefer to use standard notation). That is, J has the block diagonal form J = diag(J₁, ..., J_K), where J_k = λ_k I + N is a matrix with λ_k on the diagonal and 1 on the superdiagonal. We have

e^{Q t} = P diag (e^{J_{1} t}, \dots, e^{J_{K} t}) P^{- 1},

(26)

and noting that N is a nilpotent matrix with degree d_k (equal to the size of block J_k) we obtain

e^{J_{k} t} = e^{t λ_{k}} e^{t N} = e^{λ_{k} t} (I + t N + \frac{t^{2}}{2} N^{2} + \dots + \frac{t^{d_{k} - 1}}{(d_{k} - 1)!} N^{d_{k} - 1}) .

(27)

In order to calculate the integral (4) the expressions (26) and (27) are used. It is evident that this procedure is feasible but also requires much bookkeeping.
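To make (27) concrete, the following sketch evaluates the exponential of a single Jordan block from the finite sum and compares it with a truncated power series of the matrix exponential. It relies on the fact that N^m has ones on the m-th superdiagonal, so entry (i, j) of e^{tN} is t^{j-i}/(j-i)! for j ≥ i. The function names and the numerical values are illustrative assumptions.

```python
import math


def jordan_block_exp(lam, d, t):
    """e^{J t} for a d x d Jordan block J = lam*I + N, using the finite sum (27)."""
    scale = math.exp(lam * t)
    return [[scale * t ** (j - i) / math.factorial(j - i) if j >= i else 0.0
             for j in range(d)] for i in range(d)]


def taylor_exp(A, t, terms=60):
    """Reference value: truncated power series sum_k (tA)^k / k!."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]
    for k in range(1, terms):
        # power becomes (tA)^k / k!
        power = [[sum(power[i][l] * A[l][j] for l in range(n)) * t / k
                  for j in range(n)] for i in range(n)]
        result = [[result[i][j] + power[i][j] for j in range(n)]
                  for i in range(n)]
    return result


lam, d, t = -2.0, 3, 0.7
J = [[lam if i == j else (1.0 if j == i + 1 else 0.0) for j in range(d)]
     for i in range(d)]
closed_form = jordan_block_exp(lam, d, t)
reference = taylor_exp(J, t)
max_diff = max(abs(closed_form[i][j] - reference[i][j])
               for i in range(d) for j in range(d))
```

The bookkeeping mentioned above arises because the integral (4) then mixes polynomial and exponential factors from every pair of blocks.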

In [26] an extension of uniformization, adaptive uniformization, is described for calculating transition probabilities in a CTMC. The basic idea is to perform a local uniformization instead of a global uniformization of the rate matrix and thereby have fewer jumps in the jump process. [26] considers a model with rate matrix

Q = (\begin{matrix} - 3 ν & 3 ν & 0 & 0 \\ μ & - (μ + 2 ν) & 2 ν & 0 \\ 0 & μ & - (μ + ν) & ν \\ 0 & 0 & 0 & 0 \end{matrix})

(state 4 is an absorbing state). If this process starts in state 1 then the first jump is to state 2 and the second is from state 2 to either state 1 or state 3. This feature can be taken into account by having a so-called adaptive uniformized (AU) jump process where the rate for the first jump is 3ν, for the second is μ + 2ν and, assuming μ + ν > 3ν, the rate for the third jump is μ + ν. After the third jump the rate in the AU jump process is μ + 2ν, as in the standard uniformized jump process. The AU jump process has a closed-form expression for the jump probabilities (it is a pure birth process) but is of course more complicated than a Poisson jump process. The advantage is that the AU jump process exhibits fewer jumps. This procedure could very well be useful for codon models, where the set of states that the process can be in after one or two jumps is limited because only one nucleotide change is allowed in each state change.
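The sequence of rates in this example can be reproduced with a short sketch. It assumes the usual active-set reading of adaptive uniformization: the rate for the next jump is the largest exit rate among the states the jump process can currently occupy, and a state whose exit rate is below that rate may also stay where it is (a self-loop). The values μ = 3 and ν = 1 are illustrative choices satisfying μ + ν > 3ν, and the function name is hypothetical.

```python
def adaptive_rates(Q, start, n_jumps):
    """Rates of the first n_jumps jumps of an adaptive uniformized jump process."""
    active = {start}
    rates = []
    for _ in range(n_jumps):
        rate = max(-Q[i][i] for i in active)
        if rate == 0:          # only absorbing states remain, no further jumps
            break
        rates.append(rate)
        nxt = set()
        for i in active:
            if -Q[i][i] < rate:            # self-loop in the uniformized chain
                nxt.add(i)
            for j, q in enumerate(Q[i]):
                if j != i and q > 0:       # real transition i -> j
                    nxt.add(j)
        active = nxt
    return rates


mu, nu = 3, 1
Q_AU = [[-3 * nu, 3 * nu, 0, 0],
        [mu, -(mu + 2 * nu), 2 * nu, 0],
        [0, mu, -(mu + nu), nu],
        [0, 0, 0, 0]]
rates = adaptive_rates(Q_AU, start=0, n_jumps=5)
```

With these values the first three rates are 3ν, μ + 2ν and μ + ν, after which the rate stays at the global uniformization rate μ + 2ν.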

In an application concerned with modeling among-site rate variation, [27] applies the uniformization procedure (20) to calculate the transition probabilities instead of the eigenvalue decomposition method (15). [27] shows, in agreement with our results, that uniformization is a faster computational method than eigenvalue decomposition.

The presented methods are not the only ones for calculating the desired summary statistics. For example, in [5] it is suggested to determine the expected number of jumps from the direct calculation

\begin{align} p_{a b} (t) E [N_{c d} | t, a, b] & = \int_{0}^{t} {(e^{Q s})}_{a c} q_{c d} {(e^{Q (t - s)})}_{d b} d s \\ & = \sum_{i = 0}^{\infty} \sum_{j = 0}^{\infty} {(Q^{i})}_{a c} q_{c d} {(Q^{j})}_{d b} \int_{0}^{t} \frac{s^{i} {(t - s)}^{j}}{i! j!} d s \\ & = \sum_{k = 1}^{\infty} \frac{t^{k}}{k!} \sum_{m = 0}^{k - 1} {(Q^{m})}_{a c} q_{c d} {(Q^{k - m - 1})}_{d b}, \end{align}

where the infinite sum is truncated at k = 10. The problem with this approach is that it is difficult to bound the error introduced by the truncation. In UNI a similar type of calculation applies but the truncation error can be controlled.
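The truncated series can be written down directly, as in the sketch below. It evaluates the last line of the expression above up to a chosen K, using the UNR matrix from the experiments purely as an example input; the function names are illustrative. The last step of the derivation uses the Beta integral ∫_0^t s^i (t - s)^j ds = i! j! t^{i+j+1} / (i + j + 1)! together with the substitution k = i + j + 1, m = i.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_jump_series(Q, a, b, c, d, t, K=10):
    """Sum over k = 1..K of t^k/k! * sum_m (Q^m)_ac q_cd (Q^(k-m-1))_db."""
    n = len(Q)
    # powers[m] holds Q^m for m = 0..K-1
    powers = [[[int(i == j) for j in range(n)] for i in range(n)]]
    for _ in range(K - 1):
        powers.append(mat_mul(powers[-1], Q))
    total, coef = 0.0, 1.0
    for k in range(1, K + 1):
        coef *= t / k  # t^k / k!
        inner = sum(powers[m][a][c] * Q[c][d] * powers[k - m - 1][d][b]
                    for m in range(k))
        total += coef * inner
    return total


Q_UNR = [[-4, 2, 1, 1],
         [0, -3, 2, 1],
         [1, 0, -3, 2],
         [2, 1, 1, -4]]
short_10 = truncated_jump_series(Q_UNR, 0, 1, 0, 1, t=0.05, K=10)
short_25 = truncated_jump_series(Q_UNR, 0, 1, 0, 1, t=0.05, K=25)
```

For a short time interval the two truncation levels agree closely. For longer intervals the terms of the series first grow in magnitude and alternate in sign, so a fixed cut-off at k = 10 gives no guarantee on the error.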

Conclusion

Recall that EVD assumes that the rate matrix is diagonalizable, and this constraint means that EVD is less general than the other two algorithms. We have shown in the Discussion how a direct calculation of the integral (4) is still possible in the non-diagonalizable case but requires much bookkeeping. On top of being less general, EVD depends on the type of the matrix: reversible or non-reversible. We have shown how this discrepancy can make EVD slower than UNI even when the state space is only of size 4.

We found that the presented methods have similar accuracy and that EXPM is the most accurate one. With respect to running time, it is not straightforward to say which method is best. We found that both the eigenvalue decomposition (EVD) and uniformization (UNI) are faster than the matrix exponentiation method (EXPM). The main reason for EVD and UNI being faster is that they can be decomposed into a precomputation and a main computation. The precomputation only depends on the rate matrix for EVD, while for UNI it also depends on the largest time point and the matrix C. We also remark that EXPM involves the exponentiation of a matrix double in size. UNI is particularly fast when the product μt is small because in this case only a few terms in the sum (22) are needed.
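The remark about a matrix double in size can be illustrated with the block-matrix identity for integrals of matrix exponentials of the kind treated in [12]: the upper-right block of exp(t [[Q, C], [0, Q]]) equals ∫_0^t e^{Qs} C e^{Q(t-s)} ds. The sketch below is a plain-Python illustration under that identity. The naive scaling-and-squaring Taylor routine stands in for a production matrix exponential, and all function names are hypothetical.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(A, terms=20):
    """Naive matrix exponential: scale down, Taylor series, square back up."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    squarings = 0
    while norm > 0.5:
        norm /= 2.0
        squarings += 1
    scaled = [[x / 2.0 ** squarings for x in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def van_loan_integral(Q, C, t):
    """Upper-right n x n block of exp(t * [[Q, C], [0, Q]])."""
    n = len(Q)
    block = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            block[i][j] = t * Q[i][j]
            block[i][n + j] = t * C[i][j]
            block[n + i][n + j] = t * Q[i][j]
    E = mat_exp(block)
    return [row[n:] for row in E[:n]]


Q_UNR = [[-4, 2, 1, 1],
         [0, -3, 2, 1],
         [1, 0, -3, 2],
         [2, 1, 1, -4]]
identity = [[float(i == j) for j in range(4)] for i in range(4)]
t = 0.5
integral = van_loan_integral(Q_UNR, identity, t)
P_t = mat_exp([[t * x for x in row] for row in Q_UNR])
```

With C equal to the identity the integrand is e^{Qt} for every s, so the result should equal t e^{Qt}; this gives a simple consistency check. The cost of exponentiating a 2n × 2n matrix at every time point is one plausible reason why EXPM is the slowest of the three methods.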

References

1. Holmes I, Rubin GM: An expectation maximization algorithm for training hidden substitution models. J Mol Bio 2002, 317: 753–764. 10.1006/jmbi.2002.5405
2. Klosterman PS, Holmes I: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinf 2006, 7: 428. 10.1186/1471-2105-7-428
3. Kosiol C, Holmes I, Goldman N: An empirical codon model for protein sequence evolution. Mol Biol Evol 2007, 24: 1464–79. 10.1093/molbev/msm064
4. Minin VN, Suchard MA: Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol 2008, 56: 391–412.
5. Dutheil J, Pupko T, Jean-Marie A, Galtier N: A Model-Based Approach for Detecting Co-evolving Positions in a Molecule. Mol Biol Evol 2008, 22: 1919–1928.
6. Dutheil J, Galtier N: Detecting groups of co-evolving positions in a molecule: a clustering approach. BMC Evol Biol 2007, 7: 242. 10.1186/1471-2148-7-242
7. Minin VN, Suchard MA: Fast, accurate and simulation-free stochastic mapping. Phil Trans R Soc B 2008, 363(1512):3985–3995. 10.1098/rstb.2008.0176
8. O'Brien JD, Minin VN, Suchard MA: Learning to count: robust estimates for labeled distances between molecular sequences. Mol Biol Evol 2009, 26: 801–814. 10.1093/molbev/msp003
9. Dutheil J: Detecting site-specific biochemical constraints through substitution mapping. J Mol Evol 2008, 67: 257–65. 10.1007/s00239-008-9139-8
10. Siepel A, Pollard KS, Haussler D: New methods for detecting lineage-specific selection. Proceedings of the 10th International Conference on Research in Computational Molecular Biology (RECOMB) 2006, 190–205.
11. Hobolth A, Jensen JL: Summary statistics for end-point conditioned continuous-time Markov chains. J Appl Prob 2011, 48: 1–14. 10.1239/jap/1300198132
12. Van Loan CF: Computing integrals involving the matrix exponential. IEEE Transactions on Automatic Control 1978, 23: 395–404. 10.1109/TAC.1978.1101743
13. R Development Core Team: R: A Language and Environment for Statistical Computing. [http://www.R-project.org] R Foundation for Statistical Computing
14. Goldman N, Yang Z: A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences. Mol Biol Evol 1994, 11: 725–736.
15. Hobolth A, Jensen JL: Statistical Inference in Evolutionary Models of DNA Sequences via the EM Algorithm. Stat App Gen Mol Biol 2005, 4: 18.
16. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Statist Soc B 1977, 39: 1–38.
17. Yap VB, Speed T: Estimating Substitution Matrices. In Statistical Methods in Mol Evolution. Edited by: Nielsen R. Springer; 2005:420–422.
18. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17: 368–376. 10.1007/BF01734359
19. Jensen A: Markov chains as an aid in the study of Markov processes. Skand Aktuarietidskr 1953, 36: 87–91.
20. MATLAB R2010a Natick, Massachusetts: The MathWorks Incorporated;
21. Goulet V, et al.: expm: Matrix exponential. [http://CRAN.R-project.org/package=expm]
22. Higham J: The Scaling and Squaring Method for the Matrix Exponential Revisited. SIAM Review 2003, 51: 747–764.
23. Stadelmann M: Matrixfunktionen. Analyse und Implementierung. In Master thesis. ETH Zurich, Mathematics Department; 2009.
24. Lemey P, et al.: Molecular footprint of drug-selective pressure in a human immunodeficiency virus transmission chain. J Virol 2005, 79: 11981–11989. 10.1128/JVI.79.18.11981-11989.2005
25. Tamura K, et al.: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol Biol Evol Advance Access 2011.
26. Van Moorsel APA, Sanders WH: Adaptive uniformization. Stochastic Models 1994, 10: 619–647. 10.1080/15326349408807313
27. Mateiu L, Rannala B: Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Systematic Biol 2006, 55: 259–269. 10.1080/10635150500541599

Acknowledgements

We are grateful to Thomas Mailund and Julien Y. Dutheil for very useful discussions on the presentation and implementation of the algorithms. We would also like to thank the anonymous reviewers for constructive comments and suggestions that helped us improve the paper.

Author information

Authors and Affiliations

Bioinformatics Research Center, Aarhus University, Aarhus, Denmark
Paula Tataru & Asger Hobolth

Corresponding author

Correspondence to Paula Tataru.

Additional information

Authors' contributions

PT extended the existing methods to linear combinations of statistics, implemented the algorithms and performed the testing. AH conceived the study and guided the development and evaluation of the methods. Both authors wrote the paper. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Tataru, P., Hobolth, A. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics 12, 465 (2011). https://doi.org/10.1186/1471-2105-12-465

Received: 01 July 2011
Accepted: 05 December 2011
Published: 05 December 2011
DOI: https://doi.org/10.1186/1471-2105-12-465
