Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences

Background The estimation of the difference between two evolutionary distances within a triplet of homologs is a common operation that is used for example to determine which of two sequences is closer to a third one. The most accurate method is currently maximum likelihood over the entire triplet. However, this approach is relatively time consuming. Results We show that an alternative estimator, based on pairwise estimates and therefore much faster to compute, has almost the same statistical power as the maximum likelihood estimator. We also provide a numerical approximation for its variance, which could otherwise only be estimated through an expensive re-sampling approach such as bootstrapping. An extensive simulation demonstrates that the approximation delivers precise confidence intervals. To illustrate the possible applications of these results, we show how they improve the detection of asymmetric evolution, and the identification of the closest relative to a given sequence in a group of homologs. Conclusion The results presented in this paper constitute a basis for large-scale protein cross-comparisons of pairwise evolutionary distances.


Background
The estimation of evolutionary distances between biological sequences is at the basis of many bioinformatics problems: it plays a particularly important role in phylogenetic tree inference [1,2] and in an increasing number of comparative genomics analyses over large sets of genes or proteins (e.g. [3][4][5]). The most accurate way of estimating evolutionary distances is currently maximum likelihood, but the procedure is so time-consuming that is hardly practical when dealing with large datasets. In such cases, complexity is often tackled by working on the basis of individual pairs, such as in distance tree methods or in the "all-against-all" at the beginning of many comparative genomics analyses. However, by estimating an evolution-ary distance for each pair individually, no knowledge about the covariance of distance estimates with common evolution can be directly obtained. Thus, when comparing pairwise distances among related sequences, for instance to infer which of two homologs is closer to a third one, confidence intervals cannot be derived directly from the pairwise estimates.
The present article investigates this fundamental problem of estimating the difference between two distances in a triplet of homologs (Fig. 1). We compare the standard multivariate maximum likelihood approach with a much faster estimator based on pairwise distances, and present a formula to estimate its variance. As two examples of Because of the additivity of distances computed under the Markovian model of sequence evolution. substitution matrices for a wide range of evolutionary distances can be derived from a single substitution matrix M(d 0 ) through the equation M(d 0 ) x = M(xd 0 ), which is a special form of the Chapman-Kolmogorov equation for Markov chains. It is common and computationally more efficient to formulate this process in terms of a rate matrix Q from which the probability matrices for distance d are derived as M(d) = e dQ . We normally measure d in PAM units [8], which completely defines Q.

Maximum likelihood estimation
Evolutionary distances are best estimated by maximum likelihood (ML). In case of a pair of sequences, the ML estimation is well known and practical (see Methods part). When more sequences are under consideration, the complexity of distance estimation by ML increases very steeply, mainly because it requires a multiple sequence alignment (MSA) and the inference of the phylogenetic tree topology, two difficult procedures for which the optimal solution can currently only be computed in exponential time with respect to the number of sequences. A common strategy for tackling this problem is to work on the basis of pairs, such as in distance tree methods. In this article, we focus on the specific problem of estimating, in a triplet of homologs X,Y,Z (Fig. 1). the difference Δ between two distances d XY and d XZ . In such case, the multidimensional ML approach over the triplet is still practical. We call the estimator of Δ obtained by this method triplet . Alternatively, Δ can be estimated by a simple algebraic relation over pairwise distances over X, Y, Z estimated individually. We call this alternative estimator pairwise . Details about the computation of triplet and pairwise are provided in the Methods section.

Results and discussion
In the present section, we compare the estimators triplet and pairwise , and introduce a numerical approximation to estimate the variance of pairwise , and show that it gives accurate confidence intervals. Finally, we describe two applications of the results.

Comparison between the two estimators
In terms of computational complexity, the two estimators . In contrast, all pairwise can be computed on the basis of O(m 2 ) pairs of sequences aligned by DP in O(n 2 ), yielding a time complexity of O(n 2 m 2 ). Typically, whenever an analysis involves more than a few thousand proteins, millions of triplets have to be considered and pairwise is the only practical approach of the two. In terms of accuracy, both estimators are asymptotically unbiased: in the case of triplet , it is a property of the ML estimator, while in the case of pairwise , it is the consequence of the linearity of the expected value (see Methods). We compared the two estimators by simulation over a large number of triplets (length: 300 AA), generated randomly according to the PAM model of evolution with different distances d OX , d OY , d OZ (Fig. 1). In each experiment, both estimators were converging toward the correct value for the difference, which confirms that the asymptotic behavior is a reasonable assumption for protein sequences of typical length. In terms of statistical power; surprisingly, the observed variance of the estimates obtained by pairwise was on average less than 1% larger than the observed variance of the ML estimator over the triplet, suggesting that pairwise , although much faster to compute, is on average almost as accurate as triplet .
The variance of triplet can be computed exactly (see Methods section). But there is no direct estimator of the variance of pairwise , since it results from an algebraic relation over pairwise distances estimated individually, whose covariances are therefore unknown. There are indirect ways of estimating that variance, through the sampling distribution when doing simulation such as the one mentioned above, or bootstrapping when handling real data. However, such procedures are very time consuming. To overcome this problem, we devised a numerical approxi-mation of σ 2 ( pairwise ) as function of the pairwise distance estimates.

Numerical approximation of σ 2 ( pairwise )
In essence, the numerical approximation described here was obtained through regression over a large number of samples. We settled for this approach after discovering that the analytical solution to this problem, even when using a simpler model of evolution (all amino-acid mutations with equal probability). requires solving a polynomial of degree 23. The details of this investigation are reported in the Appendix. In view of this inherent complexity, the regression cannot be exact, but it turns out to be a surprisingly precise numerical approximation for comparisons that involve proteins that have an evolutionary distance smaller than 250 PAM units, which corresponds to percentage sequence identity greater or equal to 19.68%. We generated random triplets in the following way: a random-length (uniform 100..500) sequence was chosen as the origin O. Three random PAM distances (uniform 1..125) were selected for d OX , d OY and d OZ . The sequence O was mutated according to these distances to obtain X,Y and Z, our triplet. We generated about 30,000 triplets for three types of scoring matrix: updated Dayhoff matrices [9], DNA for coding genes and JTT [10]. The DNA scoring matrices were computed from a very large set of entire coding gene alignments from mammals. It is used in the OMA project [4] to align entire coding genes and is based on a 4-symbol alphabet. For each triplet, we computed pairwise distance estimates and their variances as input for the approximation. Given that pairwise is almost as powerful as triplet , we computed and used σ 2 ( triplet ) as reference value for σ 2 ( pairwise ).
We examined a large number of regressions and one approximation stood out of the rest due to its efficiency, low average error and other minor indications.  Coefficients of the regression on the logarithms for the three types of scoring matrices. The error column shows the mean error, which by virtue of being a regression on logarithms is very close to the relative error.
shows the coefficients of the approximation for the three types of scoring matrices.
For example, the approximation for DNA variances is Readers familiar with numerical analysis will find an analogy between the approximation presented here and standard approximations for transcendental functions. For example, it is customary to approximate exp(x) through a quotient of polynomials p(x)/q(x), for some limited range of x.
The relative error is in all the three cases less than 10%. Furthermore, since we normally use the square root of the variance, the relative error is in such cases half of the indicated. The last column indicates the dimension of the approximation which should be 2 in perfect conditions, and is indeed quite close.
The fact that very different matrices have very similar coefficients, the low error and the almost correct dimensionality reassures us of the quality of the approximation.
To test the accuracy/applicability of the approximation, as well as the other two methods to obtain the variance, we compared the 95 and 99% confidence level obtained using the appropriate number of standard deviations to the actual percentage of correct decisions obtained in a simulation over 400, 000 protein triplets generated as described above. The results are shown in Table 2.
As expected, the ML estimator over the entire triplet (first row) yields a precise variance estimate. On the other hand, we see that assuming independence for the estimation of the variance (last row) leads to very inaccurate confidence intervals. Estimating the variance of pairwise by bootstrapping (10,000 re-samples) gives good confidence intervals, but the procedure is even more computationally intensive than triplet , and therefore of little practical use in the present context. Using 2 ( pairwise ) in conjunction with the variance of the ML estimator works remarkably well (third and fourth row). And surprisingly, applying the numerical approximation (fourth row) happened to give slightly more accurate results than the exact triplet variance (third row).
Finally, we compared the different estimators on real biological sequences, using data obtained from the OMA orthologs project [4], Triplets of orthologous sequences from various eukaryotes were randomly selected and aligned using the multiple sequence alignment package from Darwin [13]. All positions containing gaps were excluded, and variances were then estimated on the ungapped triplets using the various estimators (Fig. 2). The variance estimates from the approximation formula deviate very little from the results obtained by the two more expensive methods -for simulated as well as empirical alignments. Additionally, the plots illustrate the high correspondence between the results from the ML estimation and the bootstrapping, and show that the estimator based on an assumption of independence often yields overestimates of the variance. The difference between simulated and empirical data probably arises from the limitations of the Markovian model of evolution. Worth noticing is that the agreement of our estimator with boot-  strapping is comparable to the one of the ML variance estimator: this implies that our approximation has a similar robustness when applied to real data.

Applications
In the following, we provide two examples of applications that benefit from the increase in statistical power of the estimator pairwise enabled by the approximation: detection of asymmetric evolution and identification of the closest relative in a set of homologs. Furthermore, in [14], we show how our result can be used in the context of paralogy detection.
We first define three indicator functions that will be used in these comparisons. They decide whether the pair of

Asymmetric evolution
After a gene duplication, the two copies can evolve independently. It has been suggested that in many cases, one duplicate maintains the ancestral function while the other is free to evolve and acquire novel functionality [15]. This scenario implies that the protein with conserved functionality will undergo less sequence evolution than the one exploring new functionalities.
Detecting this asymmetric evolution after duplication is an important factor not only for function prediction or orthologs assignment, but also for bringing new insights in our understanding of genome evolution in general (e.g. [16][17][18][19]).
In order to identify cases of asymmetric evolution, one typically considers three sequences -the two duplicates (Y and Z)and an out-group (X). Several methods have been developed to test the significance of the unequal lengths of the branches leading from the common ancestor to the two duplicated sequences. Tests on simulated and real data from Arabidopsis thaliana for two of such methods have suggested very low statistical power to detect asymmetric evolution of duplicates [20].
The closer indicator function can be used to detect asymmetric evolution. With d XY being the distance from the out-group to the closer of the two duplicates and d XZ the distance to the other one, closer (X, Y, Z, k) decides if the two duplicated proteins have evolved at significantly dif-ferent rates. The parameter k can be chosen to reflect the confidence level, e.g. 1.96 for the 95% level.
We tested the method using all three variants of closer (k = 1.96) on a protein set from a recent publication about whole genome duplication in S. cerevisiae [21]. From a set of 450 genes pairs that arose by whole genome duplication, they report 115 cases of one paralog evolving at least 50% faster than the other paralog. The position of the ancestral gene was determined by an out-group gene from K. waltii. Additionally, a set of 76 gene pairs is given where at least one of the S. cerevisiae genes evolved at least 50% faster than the K. waltii homolog.
The results are summarized in Fig. 3. We first discuss the differences among three variants of closer. As expected, the over estimation of the variance of the estimator in closer ind considerably reduces the cases of asymmetry detected in comparison with closer app . As for closer app and closer triplet , If we now compare the predictions of Kellis and colleagues with our results, it appears that in 98 out of 115 cases, their prediction of asymmetric evolution could be confirmed by closer app , while with the remaining 17 pairs, our method did not support the asymmetry prediction. It is remarkable, however, that all these 17 pairs belong to the group of the 76 pairs with a fast evolving K. waltii homolog. It seems likely that the uncertainty in placing the origin of the triplet (arising from a longer branch to the out-group) causes rate-based methods as used in [21] to report asymmetric divergence despite the unclear situation. As opposed to that, the distance-based methods presented here, by incorporating the variance of the estimates explicitly, take the uncertainty about the point of origin into account, and therefore give more conservative predictions in these cases.
Furthermore, closer app found 134 additional cases of asymmetry among the remaining 335 gene pairs in the data set. Together with the 98 cases above, this results in 51.6% of all genes arising from the genome duplication event. This is clearly more than the 5% that could be expected from random chance and agrees with previous studies were significant amounts of asymmetrically evolving duplicates have been reported (e.g. [22,23]).

Closest homolog without phylogenetic reconstruction
The identification of the closest relative of a protein (or gene) in a set of homologs traditionally requires the reconstruction of the corresponding phylogenetic tree. However, building gene trees remains a time consuming and error-prone task, thus methods based on pairwise evolutionary distance estimates are attractive. In this section, we show that using the variance approximation presented above can boost the statistical power of PAM distance comparisons to determine the closest homolog.
In simple contexts, or when accuracy is not a concern, the problem of identifying the closest relative can be solved reasonably well by coarse approaches, such as the top blast hit, or even the sequence with highest percentage identity. As the number of proteins grows larger and the number of homologs with similar distances increase, these methods show their limits. Indeed, it has been previously shown that the top blast hit is often not the closest relative [24]. At least two ideas lead to better results: the use of evolutionary distance estimates such as PAM distances, and accounting for confidence intervals, so that whenever there is not enough information to reliably discriminate among several distances, all of them are kept, presumably for further analysis.
Since the comparison of the methods requires precise and unbiased knowledge of the closest homolog, we use simulated data generated in the same way as in the section above, according to the PAM model. Families of homologs were created through mutation and duplication following random phylogenetic trees (Fig. 4) with the following properties: (i) each branch has a random mutation rate from a uniform distribution between 0 and 1, (ii) duplication occurs only along the leftmost branch, at random intervals, on average about every 6 PAM units, (iii) the generation is performed in 60 steps and results in trees with an average number of leaves of 13.04 (σ = 3.1). The very asymmetric duplication process is used to explore efficiently the parameter space, both in terms of distance magnitude to the closest homolog as in the number of homologs with similar distances.
For each protein X belonging to such a family, the closest homolog predictions using the following three criteria were compared to the actual closest homolog. The first one computes the subset of homologous sequences H that align with X with score higher than a particular fraction of the top score.
The second method computes the set of closest homologs, without using our variance approximation, formally The third method computes the set of closest homologs using our approximation, formally The cut-off parameters k 1 , k 2 , k 3 can be set according to the desired level of confidence. At k = 0, only the top score, respectively the shortest expected distance, is returned. Higher k values correspond to more conservative predictions, with increasing number of closest homolog candidates. For the evaluation of the methods, we vary k 1 between 0 and 0.25, while k 2 , k 3 are varied between 0 and 3. Note that only k 3 corresponds to the number of standard deviations from the expected value.
The resulting curves are presented in Fig. 5. At low cut-off values, all three methods perform similarly, but as k increases, the method using closer app gives better results.

Conclusion
Computing the difference of two evolutionary distances that are not independent is a common operation in an increasing number of bioinformatics analyses. We presented and compared two estimators for the difference of two evolutionary distances in a triplet of homologs, one estimator based on pairwise distance estimates and the )} maximum likelihood estimator. Surprisingly, the estimator based on pairwise distance is almost as powerful as the ML estimator. But in terms of time complexity, it scales much better than the ML estimator and is therefore better suited at large-scale analyses. However, since its variance is not easy to estimate, we introduced a numerical approximation that allows the computation of accurate confidence intervals. Finally, we showed how these results can be used to test for asymmetrical evolution, and to identify the closest relative of a sequence in a group of homologs without phylogenetic reconstruction. As of future work, we plan to extend these results to models of evolution allowing rate variations, as well as insertion-deletions.

PAM distance estimator for a pair
The likelihood of an alignment A at an evolutionary distance d is defined [25][26][27] as   Notice that the variance is obtained for free as it is already computed in Newton's iteration.

PAM distance estimator for a triplet Estimator based on pairwise distances
One can estimate Δ by performing pairwise alignments between X and Y, and between X and Z. The ML method for pairs of homologs, which was described above, computes the estimates XY and XZ . By subtracting the first from the second, an estimator for the difference is obtained: Since the two pairwise distance estimators are asymptotically unbiased and normally distributed, and considering the linearity of the expected value and the fact that the difference of two normally distributed variables is also normally distributed, the pairwise estimator pairwise is also asymptotically unbiased and normally distributed, with variance As described above, we obtain σ 2 ( XY ) and σ 2 ( XZ ) from the ML distance estimation, but the process does not say anything about their covariance. If the two distances are independent, which is only the case if d OX = 0, the covari-ance is zero and the variance ( pairwise ) = σ 2 ( XY ) + σ 2 ( XZ ) can be computed. In all other cases, XY and XZ covary and the variance of their difference is smaller than the sum of their variances. Therefore, we only have an upper bound for the variance of our estimator: Note that previous work on covariance estimation (e.g. [7,28]) do not apply here, because they require 3-way sequence alignments and are constrained to parametric models of evolution such as Jukes-Cantor and its generalizations.  Identification of the closest homolog. Identification of the closest homolog: comparison between methods using alignment score (1), distance with assumption of independence (2) and distance using our variance approximation (3), on simulated data. # missing closest homolog in return sets # non-closest homolog candidates in return sets using Set 1 (score) using Set 2 (closer ind ) using Set 3 (closer app )

Estimator based on triplet
where (∇ 2 l) -1 is the inverse of the Hessian (derivable in the same fashion as the gradient, not shown here). The inverse of the Hessian also yields the variance-covariance matrix of the estimates OX , OY , OZ when multiplied by -1. A final use of the Hessian is to check that its complement is positive definite, a condition necessary to ensure that the solution found is indeed a maximum and not a minimum or a saddle point. Therefore, we obtain the variance of triplet from the variance-covariance matrix: This means that the general solution will be an algebraic function of degree 23 or higher, it cannot be lower. If an instantiation of the polynomial with values gives this irreducible polynomial, then the general polynomial must be irreducible of degree 23 or higher (some terms could have simplified in the instantiation). This makes the usefulness of an exact solution inexistent. it is more difficult to solve the polynomial and select the right root than to maximize the likelihood and/or solve the system of equations by numerical methods.