Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains

  • Paula Tataru1Email author and

    Affiliated with

    • Asger Hobolth1

      Affiliated with

      BMC Bioinformatics201112:465

      DOI: 10.1186/1471-2105-12-465

      Received: 1 July 2011

      Accepted: 5 December 2011

      Published: 5 December 2011

      Abstract

      Background

      Continuous time Markov chains (CTMCs) is a widely used model for describing the evolution of DNA sequences on the nucleotide, amino acid or codon level. The sufficient statistics for CTMCs are the time spent in a state and the number of changes between any two states. In applications past evolutionary events (exact times and types of changes) are unaccessible and the past must be inferred from DNA sequence data observed in the present.

      Results

      We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the end-points of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation in R of the algorithms is available at http://​www.​birc.​au.​dk/​~paula/​.

      Conclusions

      We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore we find that UNI is usually faster than EVD.

      Background

      In this paper we consider the problem of calculating the expected time spent in a state and the expected number of jumps between any two states in discretely observed continuous time Markov chains (CTMCs). The case where the CTMC is only recorded at discretely observed time points arises in molecular evolution where DNA sequence data is extracted at present day and past evolutionary events are missing. In this situation, efficient methods for calculating these types of expectations are needed. In particular, two classes of applications can be identified.

      The first class of applications is concerned with rate matrix estimation. [1] describes how the expectation-maximization (EM) algorithm can be applied to estimate the rate matrix from DNA sequence data observed in the leaves of an evolutionary tree. The EM algorithm is implemented in the software XRate [2] and has been applied in [3] for estimating empirical codon rate matrices. [1] uses the eigenvalue decomposition of the rate matrix to calculate the expected time spent in a state and the expected number of jumps between states.

      The second class of applications is concerned with understanding and testing various aspects of evolutionary trajectories. In [4] it is emphasized that analytical results for jump numbers are superior to simulation approaches and various applications of jump number statistics are provided, including a test for the hypothesis that a trait changed its state no more than once in its evolutionary history and a diagnostic tool to measure discrepancies between the data and the model. [4] assumes that the rate matrix is diagonalizable and that the eigenvalues are real, and applies a spectral representation of the transition probability matrix to obtain the expected number of state changes.

      [5] and [6] describe a method, termed substitution mapping, for detecting coevolution of evolutionary traits, and a similar method is described in [7]. The substitution mapping method is based on the expected number of substitutions while [7] base their statistics on the expected time spent in a state. Furthermore [7] describes an application concerned with mapping synonymous and non-synonymous mutations on branches of a phylogenetic tree and employs the expected number of changes between any two states for this purpose. [8] uses the expected number of state changes to calculate certain labeled evolutionary distances. A labeled evolutionary distance could for example be the number of state changes from or to a specific nucleotide. In [9] substitution mapping is invoked for identifying biochemically constrained sites. In [7] and [8] the summary statistics are calculated using the eigenvalue decomposition method suggested by [1]. In [5, 6] and [9] the substitution mapping is achieved using a more direct formula for calculating the number of state changes. In this direct approach an infinite sum must be truncated and it is difficult to control the error associated with the truncation. An alternative is described in [10] where uniformization is applied to obtain the expected number of jumps. [10] uses the expected number of jumps on a branch to detect lineages in a phylogenetic tree that are under selection.

      A third algorithm for obtaining the number of changes or time spent in a state is outlined in [11]. The algorithm is based on [12] where a method for calculating integrals of matrix exponentials is described. A natural question arises: which of the three methods (eigenvalue decomposition, uniformization or matrix exponentiation) for calculating conditional expectations of summary statistics for a discretely observed CTMC should be preferred? The aim of this paper is to provide an answer to this question. We describe and compare the three methods. Our implementations in R [13] are available at http://​www.​birc.​au.​dk/​~paula/​. (Furthermore the eigenvalue decomposition and uniformization methods are also available as a C++ class in the bio++ library at http://​biopp.​univ-montp2.​fr/​.) The performance and discussion of the algorithms are centered around two applications. The first application is concerned with rate matrix estimation; we estimate the Goldman-Yang codon model [14] using the expectation-maximization algorithm. The second application is based on the labeled distance estimation presented in [8].

      Consider a stochastic process {X(s): 0 ≤ st} which can be described by a CTMC with n states and an n × n rate matrix Q = (q cd ). The off-diagonal entries in Q are non-negative and rows sum to zero, i.e. q cc = - Σ dc q cd = -q c . Maximum likelihood estimation of the rate matrix from a complete observation of the process is straight forward. The likelihood of the process, conditional on the beginning state X(0), is given by (e.g. [15])
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ1_HTML.gif
      (1)
      where T c is the total time spent in state c and N cd is the number of jumps from c to d. The necessary sufficient statistics for a CTMC are thus the time spent in each state and the number of jumps between any two states. In applications, however, access is limited to DNA data from extant species. The CTMC is discretely observed and we must estimate the mean values of T c and N cd conditional on the end-points X(0) = a and X(t) = b. From [15] we have that
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ2_HTML.gif
      (2)
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ3_HTML.gif
      (3)
      where P(t) = (p ij (t)) = e Qt is the transition probability matrix and
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ4_HTML.gif
      (4)
      Many applications require a linear combination of certain substitutions or times. Examples include the number of transitions, transversions, synonymous and non-synonymous substitutions. In the two applications described below the statistics of interest is a linear combination of certain substitutions and times. Let therefore C be an n × n matrix and denote by Σ(C; t) the matrix with entries
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ5_HTML.gif
      (5)

      We describe, compare and discuss three methods for calculating Σ(C; t). The evaluation of the integrals (4) takes O(n 3) time and therefore a naive calculation, assuming that C contains just one entry different from zero has a O(n 5) running time. Even worse, if C contains O(n 2) entries different from zero, then the naive implementation has a O(n 7) running time. For all three methods our implementations of Σ(C; t) run in O(n 3) time.

      Results

      Applications

      Application 1: Rate matrix estimation

      Our first application is the problem of estimating the parameters in a CTMC for evolution of coding DNA sequences which we describe using the 61 × 61 rate matrix (excluding stop codons) given by Goldman and Yang [14]:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ6_HTML.gif
      (6)

      where π is the stationary distribution, κ is the transition/transversion rate ratio, ω is the non-synonymous/synonymous ratio and α is a scaling factor. The stationary distribution π is determined directly from the data using the codon frequencies. We estimate the remaining parameters θ = (α, κ, ω) using the expectation-maximization (EM) algorithm [16] as described below.

      Suppose the complete data x is available, consisting of times and types of substitutions in all sites and in all branches of the tree. The complete data log likelihood is, using (1) and (6),
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ7_HTML.gif
      (7)
      where we use the notation
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ8_HTML.gif
      (8)
      where e.g.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equa_HTML.gif

      A similar notation applies for L s,tv, L ns,ts, L ns,tv, N ns and N, where the last statistic is the sum of substitutions between all states (i, j) that differ at one position and s, ns, ts and tv subscripts stand for synonymous, non-synonymous, transition and transversion.

      The complete data log likelihood can be maximized easily by making the re-parametrization β = ακ. We find that
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ9_HTML.gif
      (9)

      where a = -L ns,tv L ns,ts N s, b = L ns,tv L s,ts(N ns - N tv) + L ns,ts L s,tv(N ns - N ts) and c = L s,tv L s,ts N ns.

      In reality the data y is only available in the leaves and the times and types of substitutions in all sites and all branches of the tree are unaccessible. The EM algorithm is an efficient tool for maximum likelihood estimation in problems where the complete data log likelihood is analytically tractable but full information about the data is missing.

      The EM algorithm is an iterative procedure consisting of two steps. In the E-step the expected complete log likelihood
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ10_HTML.gif
      (10)

      conditional on the data y and the current estimate of the parameters θ 0 is calculated. In the M-step the parameters are updated by maximizing G(θ; θ 0,y). The parameters converge to a local maximum of the likelihood for the observed data.

      The expected log likelihood conditional on the data y and under the three parameters α, κ and ω is
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ11_HTML.gif
      (11)
      Therefore the E-step requires expectations of linear combinations of waiting times in a set of states and number of jumps between certain states. Because of the Markov property this calculation can be divided in two parts. First we use the peeling algorithm [17, 18] to obtain the probability http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq1_HTML.gif that a branch k of length t k with nodes γ k and β k above and below the branch, respectively, has end-points a and b. Second, we calculate the desired summary statistic by summing over all branches. For example we have
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ12_HTML.gif
      (12)
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ13_HTML.gif
      (13)

      The E-step thus consists of calculating conditional expectations of linear combinations of times such as http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq2_HTML.gif and substitutions such as http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq3_HTML.gif where L s,ts and N ts are given by (8). In our application n = 61 and the first type of statistics http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq4_HTML.gif is (up to a factor p ab (t)) on the form (5) with diagonal entries http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq5_HTML.gif and all off diagonal entries equal to zero. The second type of statistics http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq6_HTML.gif is also on the form (5) with off-diagonal entries http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq7_HTML.gif and zeros on the diagonal.

      Application 2: Robust distance estimation

      The second application is a new approach for estimating labeled evolutionary distance, entitled robust counting and introduced in [8]. The purpose is to calculate a distance that is robust to model misspecification. The method is applied to labeled distances, for example, the synonymous distance between two coding DNA sequences. As it is believed that selection mainly acts at the protein level, synonymous substitutions are neutral and phylogenies built on these type of distances are more likely to reveal the true evolutionary history. The distance is calculated using the mean numbers of labeled substitutions conditioned on pairwise site patterns averaged over the empirical distribution of site patterns observed in the data. In the conventional method the average is done over the theoretical distribution of site patterns. The robustness is therefore achieved through the usage of more information from the data and less from the model.

      Let Q be the rate matrix of the assumed model, P(t) = e Qt , the labeling be given through a set of pairs ℒ and the data be represented by a pairwise alignment y = (y 1, y 2) of length m. As data only contains information about the product Qt, where t is the time distance between the sequences, we can set t = 1.

      Suppose we observe the complete data consisting of the types of substitutions that occurred in all sites and let http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq8_HTML.gif be the labeled number of substitutions. A natural labeled distance is given by http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq9_HTML.gif . The labeled distance is estimated as the average across all sites of the expected number of labeled substitutions conditioned on the observed end points:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ14_HTML.gif
      (14)

      Therefore this application requires evaluating a sum on the form (5) with off-diagonal entries http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq10_HTML.gif and zeros on the diagonal.

      Algorithms

      The calculation of Σ(C; t) is based on the integrals http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq11_HTML.gif . In this section we present three existing methods for obtaining the integrals and extend them to obtain Σ(C; t).

      Eigenvalue decomposition (EVD)

      When the rate matrix Q is diagonalizable, the computation of transition probabilities p ab (t) and integrals http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq11_HTML.gif can be done via the eigenvalue decomposition (EVD). EVD is a widely used method for calculating matrix exponentials. Let Q = UΛU -1 be the eigenvalue decomposition, with Λ = diag(λ1, ..., λ n ). It follows that
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ15_HTML.gif
      (15)

      Because Λ is diagonal, e Λt is also diagonal with http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq12_HTML.gif .

      The integral (4) becomes
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ16_HTML.gif
      (16)
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ17_HTML.gif
      (17)
      Replacing http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq11_HTML.gif with (16) in (5), rearranging the sums and using http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq13_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq14_HTML.gif we find
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ18_HTML.gif
      (18)

      where ○ represents the entry-wise product.

      The eigenvalues and eigenvectors might be complex, but they come in complex conjugate pairs and the final result is always real; for more information we refer to the Supplementary Information in [2]. If the CTMC is reversible, the decomposition can be done on a symmetric matrix obtained from Q (e.g. [15]), which is faster and tends to be more robust. Let π be the stationary distribution. Due to reversibility, π a q ab = π b q ba , which can be written as ΠQ = Q*Π where Π = diag(π). Let S = Π1/2 QΠ-1/2.

      We have that
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ19_HTML.gif
      (19)

      where S* is the transpose of S. Then S is symmetric. Let Λ, V be its eigenvalues and eigenvectors, respectively. Then VΛV -1 = S = Π1/2 QΠ-1/2, which implies Q = (Π-1/2 V)Λ(V -1Π1/2) and it follows that Q has the same eigenvalues as S and Π-1/2 V for eigenvectors.

      The results can be summarized in the following algorithm:

      Algorithm 1: EVD

      Input: Q, C, t

      Output: Σ(C; t)

      Step 1: Determine eigenvalues λ i .

                  Determine the eigenvectors U i for Q and compute U -1.

      Step 2: Determine matrix J(t) from (17).

      Step 3: Determine matrix Σ(C;t) from (18).

      Uniformization (UNI)

      The uniformization method was first introduced in [19] for computing the matrix exponential P(t) = e Qt . In [11] it was shown how this method can be used for calculating summary statistics, even for statistics that cannot be written in integral form. Let μ = max i (q i ) and http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq15_HTML.gif , where I is the identity matrix.

      Then
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ20_HTML.gif
      (20)
      where Pois(m; λ) is the probability of m occurrences from a Poisson distribution with mean λ. Using (20) we also have
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ21_HTML.gif
      (21)
      Replacing (21) in (5), rearranging the sums and using that http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq16_HTML.gif and http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq17_HTML.gif we derive
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ22_HTML.gif
      (22)
      The main challenge with this method is the infinite sum and we use (20) to determine a truncation point. In particular if we let λ = μt and truncate at s(λ) we can bound the error using the tail of the Poisson distribution:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equb_HTML.gif
      We have that, for large values of λ, http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq18_HTML.gif , where http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq19_HTML.gif is the normal distribution with mean μ and variance σ 2. Therefore, for large λ, the error bound
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equc_HTML.gif

      where Φ(·) is the cumulative distribution function for the standard normal distribution. Consequently we can approximate the truncation point s(λ) with http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq20_HTML.gif . If b = 10-8 we obtain Φ-1 (1 - b) = 5.6.

      Another way to determine s(λ) is to use R to evaluate Pois(m; λ) for values of m that gradually increase, until the tail is at most b = 10-8. Combining these two approaches, we performed a linear regression, approximating the tails from R by http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq21_HTML.gif . We obtained c 1 = 4.0731, c 2 = 5.6469, c 3 = 0.9963 but, in order to be conservative, we use http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq22_HTML.gif where ⌈x⌉ is the smallest integer greater than or equal to x. In Figure 1 we compare the exact truncation value and the linear regression approximation.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Fig1_HTML.jpg
      Figure 1

      Poisson truncation. Comparison between the exact truncation value and the http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq23_HTML.gif approximation. In the plot on the left, λ ranges from 0 to 30, while the plot on the right is a zoom-in for values between 0 to 0.5. The plot shows that the approximation is a conservative fit of the Poission tail.

      The linear regression provides an excellent fit to the tail of the distribution.

      In summary we have the following algorithm:

      Algorithm 2: UNI

      Input: Q, C, t

      Output: Σ(C; t)

      Step 1: Determine μ, s(μt) and R.

      Step 2: Calculate R m for 2 ≤ ms(μt).

      Step 3: Calculate http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq24_HTML.gif for 0 ≤ ms(μt).

                  using the recursion A(m + 1) = A(m)R + R m+1 C.

      Step 4: Determine Σ(C; t) from (22).

      Exponentiation (EXPM)

      This method for calculating the integral (4) was developed in [12] and emphasized in [11]. Suppose we want to evaluate http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq25_HTML.gif , where Q and B are n × n matrices. To calculate this integral, we use an auxiliary matrix http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq26_HTML.gif and the desired integral can be found in the upper right corner of the matrix exponential of A:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ23_HTML.gif
      (23)
      We are interested in
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ24_HTML.gif
      (24)

      where http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq27_HTML.gif is a matrix with 1 in entry (c, d) and zero otherwise. We can use this method to determine http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq11_HTML.gif by simply setting http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq28_HTML.gif , construct the auxiliary matrix A, calculate the matrix exponential of At, and finally read off the integral in entry (a, b) in the upper right corner of the matrix exponential.

      Replacing (24) in (5) and rearranging the terms we have
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ25_HTML.gif
      (25)

      Therefore by setting B = C in the auxiliary matrix we obtain Σ(C;t).

      The EXPM algorithm is as follows:

      Algorithm 3: EXPM

      Input: Q, C, t

      Output: Σ(C; t)

      Step 1: Construct http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq29_HTML.gif .

      Step 2: Calculate the matrix exponential e At .

      Step 3: Σ(C; t) is the upper right corner of the matrix exponential.

      Testing

      We implemented the presented algorithms in R and tested them with respect to accuracy and speed.

      Accuracy

      The accuracy of the methods depends on the size of the rate matrix and the time t. To investigate how these factors influence the result, we used two different CTMCs that allow an analytical expression for (4). The first investigation is based on the Jukes-Cantor model where the rate matrix has uniform rates and variable size n:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equd_HTML.gif
      Q has two unique eigenvalues: 0 with multiplicity 1 and http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq30_HTML.gif with multiplicity n-1. We obtain
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Eque_HTML.gif
      We compared the result from all three methods against the true value of (5) for size n ranging from 5 to 100, t = 0.1 and random binary matrices C. Entries in C are 1 with probability http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq31_HTML.gif . For each fixed size, we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Fig2_HTML.jpg
      Figure 2

      Accuracy results. Accuracy has been tested using JC and HKY models. For each run, the normalized deviation is calculated: http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq32_HTML.gif where Σ is the correct value and http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq33_HTML.gif is the calculated one. Each plotted point represents the average over a, b and 5 different randomly generated matrices C as described in the main text.

      The second CTMC is the HKY model:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equf_HTML.gif
      where π = (0.2,0.2,0.3,0.3) is the stationary distribution and κ = 2.15 is the transition/transversion rate ratio. This rate matrix has an analytical result for (4) which can be obtained through the eigenvalue decomposition. The eigenvalues and eigenvectors of Q are
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equg_HTML.gif

      From this, using the symbolic operations in Matlab [20], we obtained the final analytic expression for (4). Using this model we compared for all three methods the true value of (5) for various values of t and randomly generated binary matrices C. For each t we generated 5 different matrices C. The average normalized deviation is shown in Figure 2.

      In both cases, all methods showed good accuracy as the normalized deviation was no bigger than 3 × 10-9. We also note that EXPM tended to be the most precise while UNI provided the worst approximation. To further investigate the accuracy, we performed calculations on randomly generated reversible rate matrices: we first obtained the stationary distribution from the Dirichlet distribution with shape parameters equal to 1, then all entries q ij with ij from the exponential distribution with parameter 1 and finally calculated the remaining entries using the reversibility property. In all the runs the relative difference between EVD, UNI and EXPM was less than 10-5. This indicated that all three methods have a similar performance in a wide range of applications.

      Speed

      Partition of computation

      Assume we need to evaluate Σ(C; t) for a fixed matrix C and multiple time points t ∈ {t 1,...t k }. In each iteration of the EM-algorithm in Application 1 we need this type of calculation while in order to calculate the labeled distance in Application 2 just one time point is required. Using EVD (Algorithm 1) we do the eigenvalue decomposition (Step 1) once and then, for each time point t i , we apply Step 2 and Step 3. The eigenvalue decomposition, achieved through the R function eigen, has a running time of O(n 3). In Step 2 we determine J(t) and this takes O(n 2) time. Step 3 has a running time of O(n 3) due to the matrix multiplications.

      If instead we apply UNI (Algorithm 2), we run Steps 1-3 for the largest time point max(t i ) and then, for each time point t i , we apply Step 4. Steps 1-3 take O (s(μmax (t i )) n 3) time, and Step 4 takes O(s(μt i )n 2) time for each i ∈ {1,..., k}. Therefore, even though the total time for both methods is O(n 3), the addition of one time point contributes with O(n 3) for EVD, but only O(s(μt)n 2) for UNI. Recall that the constant s(μt) is the truncation point for the infinite sum in the uniformization method.

      In the case of EXPM (Algorithm 3) we need to calculate the matrix exponential at every single time point. We used the expm R package [21] with the Higham08 method. This is a Padé approximation combined with an improved scaling and squaring [22] and balancing [23]. The running time is O(n 3).

      Table 1 provides an overview of the running times for each of the methods. The algorithms are divided into precomputation and main computation where the precomputation consists of steps that must be executed only once, while in the main computation we calculate the value of Σ(C;t) for every time point under consideration.
      Table 1

      Running time complexity

      Method

      EVD

      UNI

      EXPM

      Precomputation

         

      Steps

      1

      1-3

      none

      Order

      O(n 3)

      O(s(μt)n 3)

       

      Main Computation

         

      Steps

      2-3

      4

      1-3

      Order

      O(n 3)

      O(s(μt)n 2)

      O(n 3)

      Experiments

      We tested the speed of the algorithms in six experiments based on the presented applications and two more experiments using a non-reversible matrix.

      GY
      The first experiment corresponded to running the EM algorithm on real data consisting of DNA sequences from the HIV pol gene described in [24]. HIV has been extensively studied with respect to selection pressure and drug resistance and in [24] the authors document convergent evolution in pol gene caused by drug resistance mutations. The observed data y was a multiple codon alignment of the sequences. For simplicity, we did not consider the columns with gaps or ambiguous nucleotides. To compare the performance of the methods as a function of the size of the data set, we applied the EM algorithm for 15 data sets containing from 2 up to 16 sequences each, extracted from the HIV pol gene data. For each set we assumed the sequences were related according to a fixed tree; we have reconstructed the phylogenies in Mega [25] using the Jukes-Cantor model and Neighbor-Joining. We ran the EM algorithm until all three parameters converged. Experiments two and three used the previously estimated matrix Q given by (6) with α = 10.5, κ = 4.27 and ω = 0.6. We let C ij = q ij and C ii = 0, corresponding to calculating the total number of expected substitutions http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_IEq34_HTML.gif , and computed the value of Σ(C; t k ) for 10 equidistant sorted time points t k with 1 ≤ k ≤ 10 (Table 2).
      Table 2

      Experimental design

      GY

            

      Experiment

       

      2

        

      3

       

      k

      t k

      μt k

      s ( μ tk )

      t k

      μt k

      s ( μ tk )

      1

      0.0017

      0.0045

      5

      0.1

      0.2668

      8

      2

      0.0032

      0.0085

      5

      0.2

      0.5337

      9

      3

      0.0046

      0.0124

      5

      0.3

      0.8005

      11

      4

      0.0061

      0.0163

      5

      0.4

      1.0674

      12

      5

      0.0076

      0.0202

      5

      0.5

      1.3342

      13

      6

      0.0090

      0.0241

      5

      0.6

      1.6010

      14

      7

      0.0105

      0.0281

      6

      0.7

      1.8679

      15

      8

      0.0120

      0.0320

      6

      0.8

      2.1347

      15

      9

      0.0135

      0.0359

      6

      0.9

      2.4015

      16

      10

      0.0150

      0.0398

      6

      1.0

      2.6684

      17

      GTR

            

      Experiment

       

      5

        

      6

       

      k

      t k

      μt k

      s(μt k )

      t k

      μt k

      s(μt k )

      1

      0.1760

      0.2668

      8

      0.1

      0.1516

      7

      2

      0.3520

      0.5337

      9

      0.6

      0.9098

      11

      3

      0.5280

      0.8005

      11

      1.1

      1.6680

      14

      4

      0.7039

      1.0674

      12

      1.6

      2.4262

      16

      5

      0.8798

      1.3342

      13

      2.1

      3.1844

      18

      6

      1.0558

      1.6010

      14

      2.6

      3.9426

      20

      7

      1.2318

      1.8679

      15

      3.1

      4.7008

      22

      8

      1.4077

      2.1347

      15

      3.6

      5.4590

      24

      9

      1.5837

      2.4015

      16

      4.1

      6.2172

      26

      10

      1.7597

      2.6684

      17

      4.6

      6.9754

      27

      UNR

            

      Experiment

       

      7

        

      8

       

      k

      t k

      μt k

      s(μ tk )

      t k

      μt k

      s(μ tk )

      1

      0.0379

      0.1516

      7

      0.1

      0.4

      9

      2

      0.2275

      0.9098

      11

      0.6

      2.4

      16

      3

      0.4170

      1.6680

      14

      1.1

      4.4

      21

      4

      0.6066

      2.4262

      16

      1.6

      6.4

      26

      5

      0.7961

      3.1844

      18

      2.1

      8.4

      30

      6

      0.9857

      3.9426

      20

      2.6

      10.4

      34

      7

      1.1752

      4.7008

      22

      3.1

      12.4

      38

      8

      1.3648

      5.4590

      24

      3.6

      14.4

      42

      9

      1.5543

      6.2172

      26

      4.1

      16.4

      45

      10

      1.7439

      6.9754

      27

      4.6

      18.4

      49

      The table shows the time points t k , μ tk and the approximation of the Poission tail s(μt k ). For experiment 2, t k spanned the interval that contains the 10 longest branch lengths from the phylogeny of the 16 HIV pol sequences. In experiment 3 we started at 0.1 and ended at 1. We wished to design experiment 5 such that the corresponding s(μt k ) was the same as s(μt k ) from experiment 3. This allowed us to illustrate the relative performance of the methods when running on different sizes of the rate matrix. Experiment 6 was done on time points starting from 0.1 and ending at 4.6. As before, we wished to design experiment 7 such s(μt k ) corresponded to experiment 6. Experiment 8 used the same t k as experiment 6.

      GTR
      In the fourth experiment we estimated the robust labeled distance of two sequences, using the same set-up as in [8]. For each considered evolutionary distance t between 0.1 and 1, we generated 50 pairwise sequence data sets of length 2000 which have evolved for time t under the general time reversible (GTR) model with
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equh_HTML.gif

      where r = (0.5, 0.3,0.6, 0.2,0.3, 0.2) and π = (0.2, 0.2,0.3, 0.3). For labeling, we considered the jumps to and from nucleotide A, leading to C ij = q ij if i or j represents nucleotide A. For each data set, we estimated the GTR parameters as described in [8] and calculated the robust distance. Experiments 5 and 6 used the same GTR matrix and C ij = q ij if i or j represents nucleotide A and zero otherwise, and computed the value of Σ(C;t k ) for 10 equidistant sorted time points t k with 1 ≤ k ≤ 10 (Table 2).

      UNR
      In the last two experiments we used the same set-up as in experiments 5 and 6 but with a different matrix and time points (Table 2). As the speed of EVD is influenced by the type of the model, we decided to employ a non-reversible matrix. We chose the unrestricted model and carefully set the rates such that the matrix has a complex decomposition:
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equi_HTML.gif
      Figure 3 shows the results. For experiments 1 and 4, the plots show the recorded running time under each set-up (different number of sequences or different evolutionary distance). For the remaining experiments each plot starts with the running time of the precomputation which, for UNI, is done on the largest time point t 10. Then, at position k, we plot the cumulative running time for precomputation and the evaluation of Σ(C;t i ) for all ik. Since EVD and EXPM have running times that are independent of t k , the running times for these two algorithms are the same in experiments 2 and 3, 5 and 6, and 7 and 8. Even more, as EXPM is dependent only on the size of the matrix, the running times in experiments 5-8 are the same. We observe that in all our experiments EXPM is the slowest method. Deciding if EVD or UNI is faster depends on the size and type of the matrix, the number of time points and the values of s(μt). As the main computation for UNI has a running time of O(n 2) as opposed to O(n 3) for EVD (Table 1), this method should have an increased advantage when the rate matrix is bigger. This means that if many time points are considered, then UNI is generally the faster method. Importantly, we note that the EVD precomputation tends to be faster than the UNI precomputation. We remark that, in the first experiment, UNI proved to be the fastest method while, in the fourth experiment, UNI became slower with the increase of the evolutionary distance between the sequences and it was only faster than EVD for small distances (< 0.2). By setting t k in an appropriate manner (Table 2), we have the same running time for UNI and EXPM for experiment 7 compared to experiment 6. Due to the fact that in experiment 7 we used the UNR matrix, EVD is slower as opposed to experiment 6. In this case, the difference is observable but not very big, but as the size of the matrix increases, this discrepancy increases too. We also note that the difference between the reversible and non-reversible cases is enough to make UNI faster than EVD in the latter case.
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Fig3_HTML.jpg
      Figure 3

      Experiments results. Running times for the eight experiments. Experiment 1: rate matrix estimation using EM. The plot shows the running time for calculating the statistics for each method, as a function of the number of sequences included in the data set. For experiments 2 and 3 we calculated the value of Σ(C;t k ) for 10 time points t k . Each plot starts with the running time of the precomputation and at position k we plot the cumulative running time for precomputation and the evaluation of Σ(C;t 1 ) for all t 1 ∈ {t 1 ,...,t k }. The values of t k are provided in Table 2. Experiment 4: robust distance estimation. The plot shows the running time for computing the robust distance as a function of the evolutionary distance t. Experiments 5-8: similar as for experiments 2 and 3 but with a GTR model (experiments 5 and 6) and UNR model (experiments 7 and 8) instead of a GY model.

      Discussion

      The EVD algorithm assumes that the rate matrix is diagonalizable. However, a direct calculation of the integral (4) in the non-diagonalizable case is actually possible using the Jordan normal form for the rate matrix. Let Q = PJP -1 where J is the Jordan normal form of Q and P consists of the generalized eigenvectors (we recognize that we used P and J for other quantities earlier but for this discussion this should not cause any confusion and we prefer to use standard notation), i.e. J has a block diagonal form J = diag(J 1,..., J κ ) where J k = λ k I + N is a matrix with λ k on the diagonal and 1 on the superdiagonal. We have
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ26_HTML.gif
      (26)
      and noting that N is a nilpotent matrix with degree d k (equal to the size of block J k ) we obtain
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equ27_HTML.gif
      (27)

      In order to calculate the integral (4) the expressions (26) and (27) are used. It is evident that this procedure is feasible but also requires much bookkeeping.

      In [26] an extension of uniformization, adaptive uniformization, is described for calculating transition probabilities in a CTMC. The basic idea is to perform a local uniformization instead of a global uniformization of the rate matrix and thereby have fewer jumps in the jump process. [26] considers a model with rate matrix
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equj_HTML.gif

      (state 4 is an absorbing state). If this process starts in state 1 then the first jump is to state 2 and the second is from state 2 to either state 1 or state 3. This feature can be taken into account by having a so-called adaptive uniformized (AU) jump process where the rate for the first jump is 3ν, for the second is μ + 2ν and, assuming μ + ν > 3ν, the rate for the third jump is μ + ν. From the third jump the rate in the AU jump process is μ + 2ν as in the standard uniformized jump process. The AU jump process has a closed-form expression for the jump probabilities (it is a pure birth process) but is of course more complicated than a Poisson jump process. The advantage is that the AU jump process exhibits fewer jumps. This procedure could very well be useful for codon models where the set of states that the process can be in after one or two jumps are limited because only one nucleotide change is allowed in each state change.

      In an application concerned with modeling among-site rate variation, [27] applies the uniformization procedure (20) to calculate the transition probabilities instead of the eigenvalue decomposition method (15). [27] shows, in agreement with our results, that uniformization is a faster computational method than eigenvalue decomposition.

      The presented methods are not the only ones for calculating the desired summary statistics. For example, in [5] it is suggested to determine the expected number of jumps from the direct calculation
      http://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-465/MediaObjects/12859_2011_5136_Equk_HTML.gif

      where the infinite sum is truncated at k = 10. The problem with this approach is that it is difficult to bound the error introduced by the truncation. In UNI a similar type of calculation applies but the truncation error can be controlled.

      Conclusion

      Recall that EVD assumes that the rate matrix is diagonalizable and this constraint means that EVD is less general than the other two algorithms. We have shown in the Discussion how a direct calculation of the integral (4) is actually still possible but requires much bookkeeping. On top of being less general, EVD is dependent on the type of the matrix: reversible or non-reversible. We have shown how this discrepancy can make EVD slower than UNI even when the state space has size of only 4.

      We found that the presented methods have similar accuracy and EXPM is the most accurate one. With respect to running time, it is not straightforward which method is best. We found that both the eigenvalue decomposition (EVD) and uniformization (UNI) are faster than the matrix exponentiation method (EXPM). The main reason for EVD and UNI being faster is that they can be decomposed into a precomputation and a main computation. The precomputation only depends on the rate matrix for EVD while for UNI it also depends on the largest time point and the matrix C. We also remark that EXPM involves the exponentiation of a matrix double in size. UNI is particularly fast when the product μt is small because in this case only a few terms in the sum (22) are needed.

      Declarations

      Acknowledgements

      We are grateful to Thomas Mailund and Julien Y. Dutheil for very useful discussions on the presentation and implementation of the algorithms. We would also like to thank the anonymous reviewers for constructive comments and suggestions that helped us improve the paper.

      Authors’ Affiliations

      (1)
      Bioinformatics Research Center, Aarhus University

      References

      1. Holmes I, Rubin GM: An expectation maximization algorithm for training hidden substitution models. J Mol Bio 2002, 317:753–764.View Article
      2. Klosterman PS, Holmes I: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinf 2006, 7:428.View Article
      3. Kosiol C, Holmes I, Goldman N: An empirical codon model for protein sequence evolution. Mol Biol Evol 2007, 24:1464–79.PubMedView Article
      4. Minin VN, Suchard MA: Counting labeled transitions in continuous-time Markov models of evolution. J Math Biol 2008, 56:391–412.PubMedView Article
      5. Dutheil J, Pupko T, Jean-Marie A, Galtier N: A Model-Based Approach for Detecting Co-evolving Positions in a Molecule. Mol Biol Evol 2008, 22:1919–1928.View Article
      6. Dutheil J, Galtier N: Detecting groups of co-evolving positions in a molecule: a clustering approach. BMC Evol Biol 2007, 7:242.PubMedView Article
      7. Minin VN, Suchard MA: Fast, accurate and simulation-free stochastic mapping. Phil Trans R Soc B 2008,363(1512):3985–3995.PubMedView Article
      8. O'Brien JD, Minin VN, Suchard MA: Learning to count: robust estimates for labeled distances between molecular sequences. Mol Biol Evol 2009, 26:801–814.PubMedView Article
      9. Dutheil J: Detecting site-specific biochemical constraints through substitution mapping. J Mol Evol 2008, 67:257–65.PubMedView Article
      10. Siepel A, Pollard KS, Haussler D: New methods for detecting lineage-specific selection. Proceedings of the 10th International Conference on Research in Computational Molecular Biology (RECOMB) 2006, 190–205.
      11. Hobolth A, Jensen JL: Summary statistics for end-point conditioned continuous-time Markov chains. J Appl Prob 2011, 48:1–14.View Article
      12. Van Loan CF: Computing integrals involving the matrix exponential. IEEE Transactions on Automatic Control 1978, 23:395–404.View Article
      13. R Development Core Team: R: A Language and Environment for Statistical Computing. [http://​www.​R-project.​org] R Foundation for Statistical Computing
      14. Goldman N, Yang Z: A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences. Mol Biol Evol 1994, 11:725–736.PubMed
      15. Hobolth A, Jensen JL: Statistical Inference in Evolutionary Models of DNA Sequences via the EM Algorithm. Stat App Gen Mol Biol 2005, 4:18.
      16. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Statist Soc B 1977, 39:1–38.
      17. Yap VB, Speed T: Estimating Substitution Matrices. In Statistical Methods in Mol Evolution. Edited by: Nielsen R. Springer; 2005:420–422.
      18. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17:368–376.PubMedView Article
      19. Jensen A: Markov chains as an aid in the study of Markov processes. Skand Aktuarietidskr 1953, 36:87–91.
      20. MATLAB R2010a Natick, Massachusetts: The MathWorks Incorporated;
      21. Goulet V, et al.: expm: Matrix exponential. [http://​CRAN.​R-project.​org/​package=​expm]
      22. Higham J: The Scaling and Squaring Method for the Matrix Exponential Revisited. SIAM Review 2003, 51:747–764.View Article
      23. Stadelmann M: Matrixfunktionen. Analyse und Implementierung. In Master thesis. ETH Zurich, Mathematics Department; 2009.
      24. Lemey P, et al.: Molecular footprint of drug-selective pressure in a human immunodeficiency virus transmission chain. J Virol 2005, 79:11981–11989.PubMedView Article
      25. Tamura K, et al.: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Mol Biol Evol Advance Access 2011.
      26. Van Moorsel APA, Sanders WH: Adaptive uniformization. Stochastic Models 1994, 10:619–647.View Article
      27. Mateiu L, Rannala B: Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Systematic Biol 2006, 55:259–269.View Article

      Copyright

      © Tataru and Hobolth; licensee BioMed Central Ltd. 2011

      Advertisement