Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences

Dessimoz, Christophe; Gil, Manuel; Schneider, Adrian; Gonnet, Gaston H

doi:10.1186/1471-2105-7-529

Methodology article
Open access
Published: 05 December 2006

Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences

Christophe Dessimoz¹,
Manuel Gil¹,
Adrian Schneider¹ &
…
Gaston H Gonnet¹

BMC Bioinformatics volume 7, Article number: 529 (2006) Cite this article

7482 Accesses
6 Citations
Metrics details

Abstract

Background

The estimation of the difference between two evolutionary distances within a triplet of homologs is a common operation that is used for example to determine which of two sequences is closer to a third one. The most accurate method is currently maximum likelihood over the entire triplet. However, this approach is relatively time consuming.

Results

We show that an alternative estimator, based on pairwise estimates and therefore much faster to compute, has almost the same statistical power as the maximum likelihood estimator. We also provide a numerical approximation for its variance, which could otherwise only be estimated through an expensive re-sampling approach such as bootstrapping. An extensive simulation demonstrates that the approximation delivers precise confidence intervals. To illustrate the possible applications of these results, we show how they improve the detection of asymmetric evolution, and the identification of the closest relative to a given sequence in a group of homologs.

Conclusion

The results presented in this paper constitute a basis for large-scale protein cross-comparisons of pairwise evolutionary distances.

Background

The estimation of evolutionary distances between biological sequences is at the basis of many bioinformatics problems: it plays a particularly important role in phylogenetic tree inference [1, 2] and in an increasing number of comparative genomics analyses over large sets of genes or proteins (e.g. [3–5]). The most accurate way of estimating evolutionary distances is currently maximum likelihood, but the procedure is so time-consuming that is hardly practical when dealing with large datasets. In such cases, complexity is often tackled by working on the basis of individual pairs, such as in distance tree methods or in the "all-against-all" at the beginning of many comparative genomics analyses. However, by estimating an evolutionary distance for each pair individually, no knowledge about the covariance of distance estimates with common evolution can be directly obtained. Thus, when comparing pairwise distances among related sequences, for instance to infer which of two homologs is closer to a third one, confidence intervals cannot be derived directly from the pairwise estimates.

The present article investigates this fundamental problem of estimating the difference between two distances in a triplet of homologs (Fig. 1). We compare the standard multivariate maximum likelihood approach with a much faster estimator based on pairwise distances, and present a formula to estimate its variance. As two examples of applications, we show how our results improve the detection of asymmetric evolution and the identification of the closest relative in a group of homologs. But first, we briefly review the Markovian model of evolution and maximum likelihood estimation of distances.

PAM model of sequence evolution

The evolutionary distance between two biological sequences is generally based on the assumption of a first-order Markovian process of amino acid evolution. This implies two biological assumptions, common to all standard models of evolution: no memory and position-independence. The substitutional processes are described in the form of substitution matrices, defining mutation probabilities from each character to every other character for a given evolutionary distance. These matrices are either parametrical models of sequence evolution or empirically based substitution matrices. Parametrical models are often employed for nucleotide substitution (e.g. Jukes-Cantor [6] or Hasegawa-Kishino-Yano [7]), while empirical matrices (based on counted substitutions of large sets of sequence alignments) are widely used for peptide replacements in proteins. Pioneered by Dayhoff in the 1970s [8], these models have been improved with more sequence data becoming available in the 1990s (e.g. the updated Dayhoff matrices by Gonnet-Cohen-Benner [9] or Jones-Taylor-Thornton (JTT) [10]). Codon substitutions have been described by parametrical (e.g. [11]) as well as empirical (e.g. [12]) matrices.

Because of the additivity of distances computed under the Markovian model of sequence evolution. substitution matrices for a wide range of evolutionary distances can be derived from a single substitution matrix M(d₀) through the equation M(d₀)^x= M(xd₀), which is a special form of the Chapman-Kolmogorov equation for Markov chains. It is common and computationally more efficient to formulate this process in terms of a rate matrix Q from which the probability matrices for distance d are derived as M(d) = e^dQ. We normally measure d in PAM units [8], which completely defines Q.

Maximum likelihood estimation

Evolutionary distances are best estimated by maximum likelihood (ML). In case of a pair of sequences, the ML estimation is well known and practical (see Methods part). When more sequences are under consideration, the complexity of distance estimation by ML increases very steeply, mainly because it requires a multiple sequence alignment (MSA) and the inference of the phylogenetic tree topology, two difficult procedures for which the optimal solution can currently only be computed in exponential time with respect to the number of sequences. A common strategy for tackling this problem is to work on the basis of pairs, such as in distance tree methods. In this article, we focus on the specific problem of estimating, in a triplet of homologs X,Y,Z (Fig. 1). the difference ? between two distances d_XYand d_XZ. In such case, the multidimensional ML approach over the triplet is still practical. We call the estimator of ? obtained by this method $\hat{?}$ _triplet. Alternatively, ? can be estimated by a simple algebraic relation over pairwise distances over X, Y, Z estimated individually. We call this alternative estimator $\hat{?}$ _pairwise. Details about the computation of $\hat{?}$ _tripletand $\hat{?}$ _pairwiseare provided in the Methods section.

Results and discussion

In the present section, we compare the estimators $\hat{?}$ _tripletand $\hat{?}$ _pairwise, and introduce a numerical approximation to estimate the variance of $\hat{?}$ _pairwise, and show that it gives accurate confidence intervals. Finally, we describe two applications of the results.

Comparison between the two estimators

In terms of computational complexity, the two estimators differ significantly. Given m sequences of length n, $\hat{?}$ _tripletrequires the separate treatment of each O(m³) triplet, and considering that an optimal 3-way alignment by dynamic programming (DP) is O(n³), the time complexity is O(m³n³). In contrast, all $\hat{?}$ _pairwisecan be computed on the basis of O(m²) pairs of sequences aligned by DP in O(n²), yielding a time complexity of O(n²m²). Typically, whenever an analysis involves more than a few thousand proteins, millions of triplets have to be considered and $\hat{?}$ _pairwiseis the only practical approach of the two. In terms of accuracy, both estimators are asymptotically unbiased: in the case of $\hat{?}$ _triplet, it is a property of the ML estimator, while in the case of $\hat{?}$ _pairwise, it is the consequence of the linearity of the expected value (see Methods). We compared the two estimators by simulation over a large number of triplets (length: 300 AA), generated randomly according to the PAM model of evolution with different distances d_OX, d_OY, d_OZ(Fig. 1). In each experiment, both estimators were converging toward the correct value for the difference, which confirms that the asymptotic behavior is a reasonable assumption for protein sequences of typical length. In terms of statistical power; surprisingly, the observed variance of the estimates obtained by $\hat{?}$ _pairwisewas on average less than 1% larger than the observed variance of the ML estimator over the triplet, suggesting that $\hat{?}$ _pairwise, although much faster to compute, is on average almost as accurate as $\hat{?}$ _triplet.

The variance of $\hat{?}$ _tripletcan be computed exactly (see Methods section). But there is no direct estimator of the variance of $\hat{?}$ _pairwise, since it results from an algebraic relation over pairwise distances estimated individually, whose covariances are therefore unknown. There are indirect ways of estimating that variance, through the sampling distribution when doing simulation such as the one mentioned above, or bootstrapping when handling real data. However, such procedures are very time consuming. To overcome this problem, we devised a numerical approximation of s²( $\hat{?}$ _pairwise) as function of the pairwise distance estimates.

Numerical approximation of s²( $\hat{?}$ _pairwise)

In essence, the numerical approximation described here was obtained through regression over a large number of samples. We settled for this approach after discovering that the analytical solution to this problem, even when using a simpler model of evolution (all amino-acid mutations with equal probability). requires solving a polynomial of degree 23. The details of this investigation are reported in the Appendix. In view of this inherent complexity, the regression cannot be exact, but it turns out to be a surprisingly precise numerical approximation for comparisons that involve proteins that have an evolutionary distance smaller than 250 PAM units, which corresponds to percentage sequence identity greater or equal to 19.68%. We generated random triplets in the following way: a random-length (uniform 100..500) sequence was chosen as the origin O. Three random PAM distances (uniform 1..125) were selected for d_OX, d_OYand d_OZ. The sequence O was mutated according to these distances to obtain X,Y and Z, our triplet. We generated about 30,000 triplets for three types of scoring matrix: updated Dayhoff matrices [9], DNA for coding genes and JTT [10]. The DNA scoring matrices were computed from a very large set of entire coding gene alignments from mammals. It is used in the OMA project [4] to align entire coding genes and is based on a 4-symbol alphabet. For each triplet, we computed pairwise distance estimates and their variances as input for the approximation. Given that $\hat{?}$ _pairwiseis almost as powerful as $\hat{?}$ _triplet, we computed and used s² ( $\hat{?}$ _triplet) as reference value for s²( $\hat{?}$ _pairwise).

We examined a large number of regressions and one approximation stood out of the rest due to its efficiency, low average error and other minor indications. Table 1 shows the coefficients of the approximation for the three types of scoring matrices.

Table 1 Coefficient of the approximation of s²( $\hat{?}$ _pairwise)

Full size table

For example, the approximation for DNA variances is

\begin{matrix} {\tilde{s}}^{2} ({\hat{?}}_{p a i r w i s e}) = \frac{{\hat{d}}_{Y Z}^{1.3182}}{s^{2} {({\hat{d}}_{Y Z})}^{0.3026}} . \\ \frac{{(s^{2} ({\hat{d}}_{X Y}) + s^{2} ({\hat{d}}_{X Z}))}^{1.0933} {(s^{2} ({\hat{d}}_{X Y}) s^{2} ({\hat{d}}_{X Z}))}^{0.1181}}{{({\hat{d}}_{X Y} + {\hat{d}}_{X Z})}^{1.2449}} \end{matrix}

Readers familiar with numerical analysis will find an analogy between the approximation presented here and standard approximations for transcendental functions. For example, it is customary to approximate exp(x) through a quotient of polynomials p(x)/q(x), for some limited range of x.

The relative error is in all the three cases less than 10%. Furthermore, since we normally use the square root of the variance, the relative error is in such cases half of the indicated. The last column indicates the dimension of the approximation which should be 2 in perfect conditions, and is indeed quite close.

The fact that very different matrices have very similar coefficients, the low error and the almost correct dimensionality reassures us of the quality of the approximation.

To test the accuracy/applicability of the approximation, as well as the other two methods to obtain the variance, we compared the 95 and 99% confidence level obtained using the appropriate number of standard deviations to the actual percentage of correct decisions obtained in a simulation over 400, 000 protein triplets generated as described above. The results are shown in Table 2.

Table 2 Verification of accuracy of confidence intervals

Full size table

As expected, the ML estimator over the entire triplet (first row) yields a precise variance estimate. On the other hand, we see that assuming independence for the estimation of the variance (last row) leads to very inaccurate confidence intervals. Estimating the variance of $\hat{?}$ _pairwiseby bootstrapping (10,000 re-samples) gives good confidence intervals, but the procedure is even more computationally intensive than $\hat{?}$ _triplet, and therefore of little practical use in the present context. Using $\tilde{s}$ ²( $\hat{?}$ _pairwise) in conjunction with the variance of the ML estimator works remarkably well (third and fourth row). And surprisingly, applying the numerical approximation (fourth row) happened to give slightly more accurate results than the exact triplet variance (third row).

Finally, we compared the different estimators on real biological sequences, using data obtained from the OMA orthologs project [4], Triplets of orthologous sequences from various eukaryotes were randomly selected and aligned using the multiple sequence alignment package from Darwin [13]. All positions containing gaps were excluded, and variances were then estimated on the ungapped triplets using the various estimators (Fig. 2). The variance estimates from the approximation formula deviate very little from the results obtained by the two more expensive methods – for simulated as well as empirical alignments. Additionally, the plots illustrate the high correspondence between the results from the ML estimation and the bootstrapping, and show that the estimator based on an assumption of independence often yields overestimates of the variance. The difference between simulated and empirical data probably arises from the limitations of the Markovian model of evolution. Worth noticing is that the agreement of our estimator with bootstrapping is comparable to the one of the ML variance estimator: this implies that our approximation has a similar robustness when applied to real data.

Applications

In the following, we provide two examples of applications that benefit from the increase in statistical power of the estimator $\hat{?}$ _pairwiseenabled by the approximation: detection of asymmetric evolution and identification of the closest relative in a set of homologs. Furthermore, in [14], we show how our result can be used in the context of paralogy detection.

We first define three indicator functions that will be used in these comparisons. They decide whether the pair of proteins X, Y is significantly closer than X, Z at the confidence level expressed by the number of standard deviations k. The first and second ones both use the estimator $\hat{?}$ _pairwise, but the first definition uses as variance of the estimate the upper bound that is obtained by assuming independence of $\hat{d}$ _XYand $\hat{d}$ _XZ(see Methods), whereas the second use the approximation $\tilde{s}$ ²( $\hat{?}$ _pairwise) of the variance. The third indicator function uses the estimator $\hat{?}$ _triplet.

c l o s e r_{i n d} (X, Y, Z, k) = {\begin{array}{l} true & if {\hat{?}}_{p a i r w i s e} < - k \cdot s_{i n d} ({\hat{?}}_{p a i r w i s e}) \\ false & otherwise \end{array}

c l o s e r_{a p p} (X, Y, Z, k) = {\begin{array}{l} true & if {\hat{?}}_{p a i r w i s e} < - k \cdot \tilde{s} ({\hat{?}}_{p a i r w i s e}) \\ false & otherwise \end{array}

c l o s e r_{t r i p l e t} (X, Y, Z, k) = {\begin{array}{l} true & if {\hat{?}}_{t r i p l e t} < - k \cdot s ({\hat{?}}_{t r i p l e t}) \\ false & otherwise \end{array}

Asymmetric evolution

After a gene duplication, the two copies can evolve independently. It has been suggested that in many cases, one duplicate maintains the ancestral function while the other is free to evolve and acquire novel functionality [15]. This scenario implies that the protein with conserved functionality will undergo less sequence evolution than the one exploring new functionalities.

Detecting this asymmetric evolution after duplication is an important factor not only for function prediction or orthologs assignment, but also for bringing new insights in our understanding of genome evolution in general (e.g. [16–19]).

In order to identify cases of asymmetric evolution, one typically considers three sequences – the two duplicates (Y and Z)and an out-group (X). Several methods have been developed to test the significance of the unequal lengths of the branches leading from the common ancestor to the two duplicated sequences. Tests on simulated and real data from Arabidopsis thaliana for two of such methods have suggested very low statistical power to detect asymmetric evolution of duplicates [20].

The closer indicator function can be used to detect asymmetric evolution. With d_XYbeing the distance from the out-group to the closer of the two duplicates and d_XZthe distance to the other one, closer (X, Y, Z, k) decides if the two duplicated proteins have evolved at significantly different rates. The parameter k can be chosen to reflect the confidence level, e.g. 1.96 for the 95% level.

We tested the method using all three variants of closer (k = 1.96) on a protein set from a recent publication about whole genome duplication in S. cerevisiae [21]. From a set of 450 genes pairs that arose by whole genome duplication, they report 115 cases of one paralog evolving at least 50% faster than the other paralog. The position of the ancestral gene was determined by an out-group gene from K. waltii. Additionally, a set of 76 gene pairs is given where at least one of the S. cerevisiae genes evolved at least 50% faster than the K. waltii homolog.

The results are summarized in Fig. 3. We first discuss the differences among three variants of closer. As expected, the over estimation of the variance of the estimator in closer_indconsiderably reduces the cases of asymmetry detected in comparison with closer_app. As for closer_appand closer_triplet, they agree on 400 of 450 cases, with 21 cases only reported by closer _appand 29 only by closer _triplet. This discrepancy results from the error introduced by the approximation for the estimation of the variance of $\hat{?}$ _pairwise, but mostly from the inherent differences in the predictions of the two estimators $\hat{?}$ _pairwiseand $\hat{?}$ _triplet.

If we now compare the predictions of Kellis and colleagues with our results, it appears that in 98 out of 115 cases, their prediction of asymmetric evolution could be confirmed by closer_app, while with the remaining 17 pairs, our method did not support the asymmetry prediction. It is remarkable, however, that all these 17 pairs belong to the group of the 76 pairs with a fast evolving K. waltii homolog. It seems likely that the uncertainty in placing the origin of the triplet (arising from a longer branch to the out-group) causes rate-based methods as used in [21] to report asymmetric divergence despite the unclear situation. As opposed to that, the distance-based methods presented here, by incorporating the variance of the estimates explicitly, take the uncertainty about the point of origin into account, and therefore give more conservative predictions in these cases.

Furthermore, closer_appfound 134 additional cases of asymmetry among the remaining 335 gene pairs in the data set. Together with the 98 cases above, this results in 51.6% of all genes arising from the genome duplication event. This is clearly more than the 5% that could be expected from random chance and agrees with previous studies were significant amounts of asymmetrically evolving duplicates have been reported (e.g. [22, 23]).

Closest homolog without phylogenetic reconstruction

The identification of the closest relative of a protein (or gene) in a set of homologs traditionally requires the reconstruction of the corresponding phylogenetic tree. However, building gene trees remains a time consuming and error-prone task, thus methods based on pairwise evolutionary distance estimates are attractive. In this section, we show that using the variance approximation presented above can boost the statistical power of PAM distance comparisons to determine the closest homolog.

In simple contexts, or when accuracy is not a concern, the problem of identifying the closest relative can be solved reasonably well by coarse approaches, such as the top blast hit, or even the sequence with highest percentage identity. As the number of proteins grows larger and the number of homologs with similar distances increase, these methods show their limits. Indeed, it has been previously shown that the top blast hit is often not the closest relative [24]. At least two ideas lead to better results: the use of evolutionary distance estimates such as PAM distances, and accounting for confidence intervals, so that whenever there is not enough information to reliably discriminate among several distances, all of them are kept, presumably for further analysis.

Since the comparison of the methods requires precise and unbiased knowledge of the closest homolog, we use simulated data generated in the same way as in the section above, according to the PAM model. Families of homologs were created through mutation and duplication following random phylogenetic trees (Fig. 4) with the following properties: (i) each branch has a random mutation rate from a uniform distribution between 0 and 1, (ii) duplication occurs only along the leftmost branch, at random intervals, on average about every 6 PAM units, (iii) the generation is performed in 60 steps and results in trees with an average number of leaves of 13.04 (s = 3.1). The very asymmetric duplication process is used to explore efficiently the parameter space, both in terms of distance magnitude to the closest homolog as in the number of homologs with similar distances.

For each protein X belonging to such a family, the closest homolog predictions using the following three criteria were compared to the actual closest homolog. The first one computes the subset of homologous sequences H that align with X with score higher than a particular fraction of the top score.

S e t_{1} = {Y ? H | S c o r e (X, Y) = (1 - k_{1}) \cdot \underset{Z ? H}{\max ?} (S c o r e (X, Z))}

The second method computes the set of closest homologs, without using our variance approximation, formally

Set₂ = {Y ? H | ? Z ? H, Z ? Y, closer_ind(X, Z, Y, k₂)}

The third method computes the set of closest homologs using our approximation, formally

Set₃ = {Y ? H | ? Y ? H, Z ? Y, closer_app(X, Z, Y, k₃)}

The cut-off parameters k₁, k₂, k₃ can be set according to the desired level of confidence. At k = 0, only the top score, respectively the shortest expected distance, is returned. Higher k values correspond to more conservative predictions, with increasing number of closest homolog candidates. For the evaluation of the methods, we vary k₁ between 0 and 0.25, while k₂, k₃ are varied between 0 and 3. Note that only k₃ corresponds to the number of standard deviations from the expected value.

The resulting curves are presented in Fig. 5. At low cut-off values, all three methods perform similarly, but as k increases, the method using closer_appgives better results.

Conclusion

Computing the difference of two evolutionary distances that are not independent is a common operation in an increasing number of bioinformatics analyses. We presented and compared two estimators for the difference of two evolutionary distances in a triplet of homologs, one estimator based on pairwise distance estimates and the maximum likelihood estimator. Surprisingly, the estimator based on pairwise distance is almost as powerful as the ML estimator. But in terms of time complexity, it scales much better than the ML estimator and is therefore better suited at large-scale analyses. However, since its variance is not easy to estimate, we introduced a numerical approximation that allows the computation of accurate confidence intervals. Finally, we showed how these results can be used to test for asymmetrical evolution, and to identify the closest relative of a sequence in a group of homologs without phylogenetic reconstruction. As of future work, we plan to extend these results to models of evolution allowing rate variations, as well as insertion-deletions.

Methods

PAM distance estimator for a pair

The likelihood of an alignment A at an evolutionary distance d is defined [25–27] as

\begin{matrix} L (A | d) & = & ?_{[x, y] ? A} f (x) M_{x, y} (d) \\ = & ?_{[x, y] ? A} f (x) {[e^{d Q}]}_{x, y} \end{matrix}

with x and y being aligned characters (e.g. amino acids, bases, but no deletions), and f(x) the background frequency of the character x. Maximizing L(A | d) in terms of d gives the ML estimator $\hat{d}$ of the evolutionary distance. This is usually done numerically using the Newton-Raphson method. The variance of the ML estimator $\hat{d}$ can be computed from the second derivative of the log-likelihood:

s^{2} (\hat{d}) = - {(\frac{?^{2} L (A | \hat{d})}{? d^{2}})}^{- 1}

Notice that the variance is obtained for free as it is already computed in Newton's iteration.

PAM distance estimator for a triplet

Estimator based on pairwise distances

One can estimate ? by performing pairwise alignments between X and Y, and between X and Z. The ML method for pairs of homologs, which was described above, computes the estimates $\hat{d}$ _XYand $\hat{d}$ _XZ. By subtracting the first from the second, an estimator for the difference is obtained:

\hat{?}

(1)

_pairwise= $\hat{d}$ _XY- $\hat{d}$ _XZ

Since the two pairwise distance estimators are asymptotically unbiased and normally distributed, and considering the linearity of the expected value and the fact that the difference of two normally distributed variables is also normally distributed, the pairwise estimator $\hat{?}$ _pairwiseis also asymptotically unbiased and normally distributed, with variance

s^{2} ({\hat{?}}_{p a i r w i s e}) = s^{2} ({\hat{d}}_{X Y}) + s^{2} ({\hat{d}}_{X Z}) - 2 c o v ({\hat{d}}_{X Y}, {\hat{d}}_{X Z})

As described above, we obtain s²( $\hat{d}$ _XY) and s²( $\hat{d}$ _XZ) from the ML distance estimation, but the process does not say anything about their covariance. If the two distances are independent, which is only the case if d_OX= 0, the covariance is zero and the variance $s_{i n d}^{2}$ ( $\hat{?}$ _pairwise) = s²( $\hat{d}$ _XY) + s²( $\hat{d}$ _XZ) can be computed. In all other cases, $\hat{d}$ _XYand $\hat{d}$ _XZcovary and the variance of their difference is smaller than the sum of their variances. Therefore, we only have an upper bound for the variance of our estimator:

s^{2} ({\hat{?}}_{p a i r w i s e}) = s_{i n d}^{2} ({\hat{?}}_{p a i r w i s e})

Note that previous work on covariance estimation (e.g. [7, 28]) do not apply here, because they require 3-way sequence alignments and are constrained to parametric models of evolution such as Jukes-Cantor and its generalizations.

Estimator based on triplet

Alternatively, we can estimate ? by subtracting estimates of the distances d_OYand d_OZ

\hat{?}

(2)

_triplet= $\hat{d}$ _OY- $\hat{d}$ _OZ

The estimates $\hat{d}$ _OYand $\hat{d}$ _OZcan be obtained by maximum likelihood over the multiple sequence alignment of X, Y, Z [25], in a manner analogous to the ML estimation for a pair. The likelihood L of a multiple sequence alignment (MSA) is the product, over all positions of the MSA, of the probability of observing characters x, y, z at distance d_OX, d_OY, d_OZof the origin, where such a probability is obtained by marginalizing over every character o at the origin:

L (M S A | d_{O X}, d_{O Y}, d_{O Z}) = \underset{[x, y, z]}{?} \underset{o ? C}{?} f (o) {[e^{d_{O X} Q}]}_{o, x} {[e^{d_{O Y} Q}]}_{o, y} {[e^{d_{O Z} Q}]}_{o, z}

where C is the set of characters – the 20 amino-acids in the present case, and f(o) the background frequency of the character o. Consequently, the log-likelihood function l is

l (M S A | d_{O X}, d_{O Y}, d_{O Z}) = \underset{[x, y, z]}{?} \log ? \underset{o ? C}{?} f (o) {[e^{d_{O X} Q}]}_{o, x} {[e^{d_{O Y} Q}]}_{o, y} {[e^{d_{O Z} Q}]}_{o, z}

The log-likelihood is maximum where its gradient disappears:

? l = (\begin{matrix} ? l / ? d_{O X} \\ ? l / ? d_{O Y} \\ ? l / ? d_{O Z} \end{matrix}) = (\begin{matrix} 0 \\ 0 \\ 0 \end{matrix})

There again, the problem can be solved efficiently by Newton's iteration

{(\begin{matrix} {\hat{d}}_{O X} \\ {\hat{d}}_{O Y} \\ {\hat{d}}_{O Z} \end{matrix})}_{i + 1} = {(\begin{matrix} {\hat{d}}_{O X} \\ {\hat{d}}_{O Y} \\ {\hat{d}}_{O Z} \end{matrix})}_{i} - {(?^{2} l_{i})}^{- 1} ? l_{i}

where (?²l)^-1 is the inverse of the Hessian (derivable in the same fashion as the gradient, not shown here). The inverse of the Hessian also yields the variance-covariance matrix of the estimates $\hat{d}$ _OX, $\hat{d}$ _OY, $\hat{d}$ _OZwhen multiplied by -1. A final use of the Hessian is to check that its complement is positive definite, a condition necessary to ensure that the solution found is indeed a maximum and not a minimum or a saddle point. Therefore, we obtain the variance of $\hat{?}$ _tripletfrom the variance-covariance matrix:

\begin{matrix} s^{2} ({\hat{?}}_{t r i p l e t}) = s^{2} ({\hat{d}}_{O Y}) + s^{2} ({\hat{d}}_{O Z}) - 2 c o v ({\hat{d}}_{O Y}, {\hat{d}}_{O Z}) \\ = [0, 1, - 1] (- \nabla^{2} l) - 1 [\begin{matrix} 0 \\ 1 \\ - 1 \end{matrix}] \end{matrix}

Appendix

Complexity of the analytical solution of k-states model for triplets

In the following, we show that the analytical solution of the maximum-likelihood estimator for the distances of a triplet is very complex, even for a simplified model of mutation. The k-state model [29] is an idealized situation where each position has k possible states and the transition probabilities are all identical and only depend on the time t. For k = 4 this is equivalent to the Jukes-Cantor model [6]. Whatever is the initial state, the probability of a mutation after time t is given by

p (t) = \frac{k - 1}{k} (1 - r^{t})

where r is

r = 1 - \frac{k}{100 (k - 1)}

so that t is measured in PAM units. (Measuring in PAM units is proportional to any other measure, and it means that at t = 1 one percent of the characters are changed, i.e. p(1) = 1/100.) and that all transitions are equally likely, and only depend on the PAM distance. Under this model, the log-likelihood can be expressed in terms of the counts of matches/mismatches of the triplet (X, Y, Z), i.e. N_xxxis the number of positions where all the characters are identical, N_xxzis the number of positions where X and Y coincide but Z differs, etc.

\begin{matrix} l (A | t) = N_{x x x} \log ? (P_{x x x}) + N_{x x z} \log ? (P_{x x z}) + N_{x y x} \log ? (P_{x y x}) + \\ N_{x y y} \log ? (P_{x y y}) + N_{x y z} \log ? (P_{x y z}) \\ P_{x x x} = (1 - p_{x}) (1 - p_{y}) (1 - p_{z}) + \frac{p_{x} p_{y} p_{z}}{{(k - 1)}^{2}} \\ P_{x x z} = (1 - p_{x}) (1 - p_{y}) p_{z} + \frac{p_{x} p_{y} (1 - p_{z})}{k - 1} + \frac{(k - 2) p_{x} p_{y} p_{z}}{{(k - 1)}^{2}} \\ P_{x y x} = (1 - p_{x}) p_{y} (1 - p_{z}) + \frac{p_{x} (1 - p_{y}) p_{z}}{k - 1} + \frac{(k - 2) p_{x} p_{y} p_{z}}{{(k - 1)}^{2}} \\ P_{x y y} = p_{x} (1 - p_{y}) (1 - p_{z}) + \frac{(1 - p_{x}) p_{y} p_{z}}{k - 1} + \frac{(k - 2) p_{x} p_{y} p_{z}}{{(k - 1)}^{2}} \\ P_{x y z} = \frac{k - 2}{k - 1} ((1 - p_{x}) p_{y} p_{z} + p_{x} (1 - p_{y}) p_{z} + p_{x} p_{y} (1 - p_{z}) + \frac{(k - 3) p_{x} p_{y} p_{z}}{k - 1}) \end{matrix}

where p_xis the probability of mutating from the origin to X and similarly for p_yand p_z. Taking partial derivatives of the likelihood with respect to p_x, p_yand p_zgives a system of 3 rational polynomial equations (all the logarithms disappear) in 3 unknowns and 6 parameters. Such a system of equations has a solution that will be an algebraic function of the parameters (a root of a polynomial, where the coefficients of the polynomial involve the parameters). Despite its simple appearance, this system of equations is beyond the capabilities of current computer algebra systems to resolve. And this is not a complete surprise, as the algebraic numbers/functions involved are at least of degree 23. The special case where two of the branches have the same length, has been solved exactly in [30], they find that their solution is an algebraic function of degree 11. This unfortunately is not applicable as we are interested in the cases where the branches away from the origin are of different lengths.

We have computed the exact solution for concrete values of the parameters, in particular N_xxx= 10, N_xxz= 5, N_xyx= 4, N_xyy= 3, N_xyz= 2, k = 3 using Maple and the value of p_xis a root of the irreducible polynomial

6582435840000 + 189590785228800 z - 2438333515038720 z² + ...

... + 10304020514917800 z²¹ - 1635488137841976 z²² + 99990709180560 z²³

This means that the general solution will be an algebraic function of degree 23 or higher, it cannot be lower. If an instantiation of the polynomial with values gives this irreducible polynomial, then the general polynomial must be irreducible of degree 23 or higher (some terms could have simplified in the instantiation). This makes the usefulness of an exact solution inexistent. it is more difficult to solve the polynomial and select the right root than to maximize the likelihood and/or solve the system of equations by numerical methods.

References

Swofford DL, Olsen GL, Waddell PJ, Hillis DM: Phylogenetic inference. 2nd edition. Sunderland, Massachusetts: Sinauer Associates; 1996:407–514.
Google Scholar
Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. University of Washington. Seattle., Department of Genome Sciences; 2004.
Google Scholar
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, (32 Database):277–280. 10.1093/nar/gkh063
Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G: OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. In RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysath A, Huson DH. Springer-Verlag; 2005:61–72.
Google Scholar
DeLuca TF, Wu IH, Pu J, Monaghan T, Peshkin L, Singh S, Wall DP: Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics 2006, 22(16):2044–2046. 10.1093/bioinformatics/btl286
Article CAS PubMed Google Scholar
Jukes T, Cantor C: Evolution of protein molecules. In Mammalian protein metabolism III. Edited by: Munro H. New York: Academic Press; 1969:21–132.
Chapter Google Scholar
Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 22: 160–174. 10.1007/BF02101694
Article CAS PubMed Google Scholar
Dayhoff MO, Schwartz RM, Orcutt BC: A model for evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Volume 5. Edited by: Dayhoff MO. National Biomedical Research Foundation; 1978:345–352.
Google Scholar
Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256(5003):1443–1445. 10.1126/science.1604319
Article CAS PubMed Google Scholar
Jones DT, Taylor WR, Thornton JM: The Rapid Generation of Mutation Data Matrices from Protein Sequences. Comput Applic Biosci 1992, 8: 275–282.
CAS Google Scholar
Goldman N, Yang Z: A Codon-based Model of Nucleotide Substitution for Protein-coding DNA Sequences. Mol Biol Evol 1994, 11(5):725–736.
CAS PubMed Google Scholar
Schneider A, Cannarozzi GM, Gonnet GH: Empirical codon substitution matrix. BMC Bioinformatics 2005., 6(134):
Gonnet GH, Hallett MT, Korostensky C, Bernardin L: Darwin v. 2.0: An Interpreted Computer Language for the Biosciences. Bioinformatics 2000, 16(2):101–103. 10.1093/bioinformatics/16.2.101
Article CAS PubMed Google Scholar
Dessimoz C, Boeckmann B, Roth A, Gonnet GH: Detecting Non-Orthology in the COG Database and Other Approaches Grouping Orthologs Using Genome-Specific Best Hits. Nucleic Acids Res 2006, 34(11):3309–3316. 10.1093/nar/gkl433
Article PubMed Central CAS PubMed Google Scholar
Ohno S: Evolution by Gene Duplication. Springer-Verlag, New York; 1970.
Book Google Scholar
Van de Peer Y, Taylor JS, Braasch I, Meyer A: The Ghost of Selection Past: Rates of Evolution and Functional Divergence of Anciently Duplicated Genes. J Mol Evol 2001, 53(4):436–446. 10.1007/s002390010233
Article CAS PubMed Google Scholar
Dermitzakis ET, Clark AG: Differential Selection After Duplication in Mammalian Developmental Genes. Mol Biol Evol 2001, 18: 557–562.
Article CAS PubMed Google Scholar
Li YJ, Tsoi SCM: Phylogenetic analysis of vertebrate lactate dehydrogenase (LDH) multigene families. J Mol Evol 2002, 54(5):614–24. 10.1007/s00239-001-0058-1
Article CAS PubMed Google Scholar
Wagner A: Asymmetric Functional Divergence of Duplicate Genes in Yeast. Mol Biol Evol 2002, 19: 1760–1768.
Article CAS PubMed Google Scholar
Seoighe C, Scheffler K: Very Low Power to Detect Asymmetric Divergence of Duplicated Genes. In RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by: McLysath A, Huson DH. Springer-Verlag; 2005:142–152.
Google Scholar
Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428: 617–624. 10.1038/nature02424
Article CAS PubMed Google Scholar
Blanc G, Barakat A, Guyot R, Cooke R, Delseny M: Extensive Duplication and Reshuffling in the Arabidopsis Genome. Plant Cell 2000, 12: 1093–1102. 10.1105/tpc.12.7.1093
Article PubMed Central CAS PubMed Google Scholar
Conant GC, Wagner A: Asymmetric Sequence Divergence of Duplicate Genes. Genome Res 2003, 13: 2052–2058. 10.1101/gr.1252603
Article PubMed Central CAS PubMed Google Scholar
Koski LB, Golding GB: The Closest BLAST Hit Is Often Not the Nearest Neighbor. J Mol Evol 2001, 52(6):540–542.
Article CAS PubMed Google Scholar
Felsenstein J: Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 1981, 35: 1229–1242. 10.2307/2408134
Article Google Scholar
Gonnet GH: A Tutorial Introduction to Computational Biochemistry Using Darwin. Tech. rep., Informatik, ETH Zurich, Switzerland; 1994.
Google Scholar
Muller T, Vingron M: Modeling amino acid replacement. J Comput Biol 2000, 7(6):761–776. 10.1089/10665270050514918
Article CAS PubMed Google Scholar
Bulmer M: Use of the method of generalized least-squares in reconstructing phylogenies from sequence data. Mol Biol Evol 1991, 8(6):868–883.
CAS Google Scholar
Cannarozzi GM, Gonnet GH: Idealized Mutational Clocks.Tech. rep., Informatik, ETH, Zurich; 2005. [http://www.biorecipes.com/IdealMut/code.html]
Google Scholar
Chor B, Hendy MD, Snir S: Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions. Mol Biol Evol 2006, 23(3):626–632. 10.1093/molbev/msj069
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors thank Dan Graur and two anonymous reviewers for helpful comments and ideas.

Author information

Authors and Affiliations

ETH Zurich, Institute of Computational Science, 8092, Zürich, Switzerland
Christophe Dessimoz, Manuel Gil, Adrian Schneider & Gaston H Gonnet

Authors

Christophe Dessimoz
View author publications
You can also search for this author in PubMed Google Scholar
Manuel Gil
View author publications
You can also search for this author in PubMed Google Scholar
Adrian Schneider
View author publications
You can also search for this author in PubMed Google Scholar
Gaston H Gonnet
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Christophe Dessimoz.

Additional information

Authors' contributions

CD was the main investigator and writer. MG contributed ideas, wrote part of the method section, and performed simulations. AS contributed the introduction to PAM distances, and the section on asymmetrical evolution. GG devised the numerical approximation and contributed the appendix. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Dessimoz, C., Gil, M., Schneider, A. et al. Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences. BMC Bioinformatics 7, 529 (2006). https://doi.org/10.1186/1471-2105-7-529

Download citation

Received: 14 August 2006
Accepted: 05 December 2006
Published: 05 December 2006
DOI: https://doi.org/10.1186/1471-2105-7-529

Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences

Abstract

Background

Results

Conclusion

Background

PAM model of sequence evolution

Maximum likelihood estimation

Results and discussion

Comparison between the two estimators

Applications

Asymmetric evolution

Closest homolog without phylogenetic reconstruction

Conclusion

Methods

PAM distance estimator for a pair

PAM distance estimator for a triplet

Estimator based on pairwise distances

Estimator based on triplet

Appendix

Complexity of the analytical solution of k-states model for triplets

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us