Make the most of your samples: Bayes factor estimators for high-dimensional models of sequence evolution
© Baele et al.; licensee BioMed Central Ltd. 2013
Received: 2 July 2012
Accepted: 22 January 2013
Published: 6 March 2013
Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model’s marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes.
We here assess the original ‘model-switch’ path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model’s marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process.
We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational efforts in order to approximate the true value of the (log) Bayes factor compared to the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.
An increasing number of studies in phylogenetics have demonstrated, using a range of different inferential methods of varying complexity, that the assumption of site-independent evolution is overly restrictive and that evolutionary models that take into account context-dependencies may greatly improve model fit. Specifically, context-dependent models are useful when studying mammalian genomes due to the extensive methylation of C in CG doublets, which could make such Cs hotspots for mutation (see , for a review). Indeed, more accurate mathematical models of molecular sequence evolution continue to be developed for good reasons as the additional complexity of such models can lead to the identification of important evolutionary processes that would be missed with simpler models. These models however come at a drastically elevated computational cost due to their increase in number of parameters and the need for data augmentation to make the likelihood calculations feasible .
Kass and Raftery  introduce different gradations to assess the log Bayes factor as evidence against M0. A value between 0 and 1 is not worth more than a bare mention, whereas a value between 1 and 3 is considered as positive evidence against M0. Values larger than 3 and 5 are considered to respectively give strong and very strong evidence against M0. The Bayes factor offers advantages over likelihood-ratio-tests comparing nested models in which one garners evidence only in favor of rejecting less complex models. Instead, the Bayes factor evaluates the relative merits of both competing models and does not require nested models. Further, when the individual (log) marginal likelihoods are estimated correctly, the (log) Bayes factor takes into account differences in dimensions, so higher dimensional models are not automatically preferred.
Several useful methods have been proposed to evaluate marginal likelihoods (and by extension Bayes factors) in phylogenetics, but they are often limited to specific model selection situations, see  for an overview. Comparing a few of the methods of potentially general applicability, Lartillot and Philippe  test a (simple) Monte Carlo estimator of integrating the likelihood against the model prior and two variants of importance sampling (IS): the posterior harmonic mean estimator (HME) and the stabilized HME to a path sampling (PS) approach [9-11]. Of these approaches, the HME  is by far the most popular method in the field of phylogenetics, only requiring samples from the posterior distribution (see e.g. ). The HME is however often severely biased and overestimates the true marginal likelihood . Because the HME estimator’s variance may be infinite, a modified, stabilized version has been proposed  with extensions to quantify its Monte Carlo error in phylogenetics . PS is shown to outperform the other methods across all scenarios, remaining well-behaved in cases with high dimensions where all three IS methods fail, even when these IS methods use a huge number of posterior samples .
Recently, Xie et al.  introduced stepping-stone sampling (SS), which employs ideas from both importance sampling and path sampling to estimate the marginal likelihood using a path that bridges the posterior and prior distribution of a model. Using a Gaussian model example, the authors demonstrate that SS yields a substantially less biased estimator than PS, while requiring significantly fewer path steps (called ratios in the context of SS) to reliably estimate the marginal likelihood. Further, SS outperforms the HME in terms of accuracy, consistency and repeatability . While the performance of PS and SS was originally assessed using Gaussian model examples and small phylogenetic examples, these approaches have recently been shown to considerably outperform the HME and a posterior simulation-based analogue of Akaike’s information criterion (AICM) using extensive simulations and empirical analyses in the context of demographic and relaxed molecular clock model comparison .
As a consequence of these developments, path sampling (PS; ) and stepping-stone sampling (SS; ) estimators are currently being integrated in popular software packages such as BEAST . These methods represent very general estimators as they can be applied to any model for which MCMC samples can be obtained. Despite the increased computational demands associated with these estimators, they are particularly suited to assess the performance of high-dimensional models, since the HME systematically favors parameter-rich models (see the corresponding section in this paper) .
In this paper, we focus on direct (log) Bayes factor estimation between two competing models instead of estimating each individual model’s marginal likelihood. Direct (log) Bayes factor estimation, available in the path sampling paradigm as ‘model-switch path sampling’ and connecting the two models under comparison in the space of unnormalized densities, is more accurate and less prone to errors as we show here, which is especially important for cases where the difference between the logarithm of the marginal likelihoods of the two models is small compared to the separate marginal likelihoods themselves . Given the recent introduction of stepping-stone sampling to estimate marginal likelihoods more efficiently , we here introduce ‘model-switch stepping-stone sampling’ to perform direct (log) Bayes factor estimation and provide evidence that this represents the more reliable approach.
We test the mentioned approaches on three data sets (see Methods section) with distinctive properties to show the advantages of our presented approach. Two mammalian data sets are analyzed, which are expected to yield increases in model fit due to the CpG-methylation-deamination process acting upon mammalian sequences and the ability of the proposed context-dependent model to incorporate this process. The largest of these data sets contains large amounts of sites for each of the parameters that need to be estimated and is hence expected to yield accurate context-dependent parameter estimates that will lead to a substantial increase in model fit over a traditional site-independent model. The smaller mammalian data set is likely to yield context-dependent parameter estimates with large variance, rendering the process of assessing differences in model fit challenging and hence presenting an interesting case for the different model selection approaches that are being compared. Finally, we test a green plant data set of which the substitution processes are not expected to conform to those accounted for by the context-dependent model and which is hence expected to yield a decrease in model fit compared to a site-independent evolutionary model.
A first (mammalian) data set is a subset of the sequence data set analyzed in Prasad et al. , which is an expanded version of that reported by Thomas et al. (2003). All sequences are orthologous to a 1.9-Mb region of human chromosome 7 (build hg18, chr7:115,597,757-117,475,182) that includes 10 known genes (e.g., CFTR, ST7, and CAV1). We selected five sequences from the Laurasiatheria, i.e. the sequences of domestic pig (Sus scrofa domestica), Indian muntjac (Muntiacus muntjak vaginalis), sheep (Ovis aries), cow (Bos taurus) and horse (Equus caballus) assuming the following topology ((((cow,sheep),muntjak),pig),horse). The reported analyses in this manuscript were performed on the conserved non-coding part of the data set of Prasad et al. , comprising 97,682 nucleotides per sequence. We refer to this data set as the “Laurasiatheria” data set.
A second (mammalian) data set consists of the ψ η-globin pseudogene sequences of six primates: human (Homo Sapiens), chimpanzee (Pan Troglodytes), gorilla (Gorilla Gorilla), orang-utan (Pongo Pygmaeus), rhesus monkey (Macaca Mulatta) and spider monkey (Ateles Geoffroyi), containing 6,166 nucleotides in each sequence. We have used the fixed consensus tree shown in the work of Yang  and refer to this dataset as the “Pseudogenes” data set.
A third (plant) data set consists of 20 small subunit (SSU) rRNA genes (nuclear) obtained from the alignment of Karol et al. . We have used the following sequences: Cyanophora paradoxa, Nephroselmis olivacea, Chlamydomonas moewusii, Volvox carteri, Paulschulzia pseudovolvox, Coleochaete orbicularis 2651, Coleochaete solute 32d1, Coleochaete irregularis 3d2, Coleochaete sieminskiana 10d1, Zygnema peliosporum, Mougeotia sp758, Gonatozygon monotaenium 1253, Onychonema sp832, Cosmocladium perissum 2447, Lychnothamnus barbatus 159, Nitellopsis obtusa F131B, Chara connivens F140, Lamprothamnium macropogon X695, Arabidopsis thaliana and Taxus mairei. We used the 50% majority-rule posterior consensus tree under the general time-reversible model. We refer to this dataset as the “Nuclear SSU rRNA” data set, which contains 1,619 nucleotides per sequence.
Context-dependent evolutionary model
We allow the substitution probabilities for a given site to depend on its “evolutionary context”, i.e. the combination of neighboring bases of that site. We assume that evolution occurs independently on each tree branch (see e.g. ) but do not employ a branch-specific (or lineage-specific) context-dependent evolutionary model. We do not follow the approach of Hwang and Green , who have approximated the continuous-time Markov substitution process by partitioning each branch into two or more discrete time units so that the average substitution rate per time unit is ≤ 0.005, because our previous work demonstrated that this does not result in different parameter estimates . We here present a first-order (i.e. depending on the two immediate neighbors) context-dependent evolutionary model which mimics the model of Hwang and Green .
For bases x≠z in the first-order context-dependent model, let ψ i (x→z∣w y) be the probability that, in one unit of time, the base x at position i mutates to z, given neighboring bases w (at position i−1) and y (at position i+1). Should the branch not be partitioned, it is hence assumed that the neighboring bases of position i remain constant across the branch. Partitioning the branch into two or more parts allows the substitution probabilities to depend on more recent ancestral sequences than at the start of the branch. The probability of no substitution is . If w, x, y or z is a gap, ψ i (x→z∣w y)=1. Each combination of two neighbors yields its own set of context-dependent frequencies fw−y, i.e. the substitution probabilities for a given set of immediate neighbors w and y yield a distribution of frequencies for those neighbors. These frequencies are used to scale the substitution probabilities ψ i (x→z∣w y) so that one unit of time is the time in which we expect to see one change per base, and this for each of the 16 evolutionary contexts, so that . This first-order context-dependent substitution model requires 192 parameters to be estimated when it is assumed that complementary events are unequal. Here, we assume that such events are equal, i.e. that ψ i (x→z∣w y)=ψ i (x c →z c ∣y c w c ), where x c denotes the complement of x, which reduces the number of parameters for this model to 96.
For the Laurasiatheria data set, the distribution of bases x that are at the ancestral root sequence is modeled as an inhomogeneous second-order Markov chain with transition parameters Π(x∣v,w), where v and w are the bases that immediately precede x (as in ). If x, v or w is a gap, Π(x∣v,w)=1. For the pseudogenes data set, an inhomogeneous first-order Markov chain with transition parameters Π(x∣v) is assumed (see ), where v is the base that immediately precedes x. If x or v is a gap, Π(x∣v)=1. For the nuclear SSU data set , an zero-order Markov chain Π(x) is assumed (see Results section). No symmetry conditions are imposed on the Π values.
When the model allows for the presence of multiple contexts of evolution, each context is assumed to have its own prior, independently of other contexts. The use of a hyperparameter for the branch-length priors is to reduce sensitivity of the posterior to the prior [4, 29].
Harmonic mean estimators
Among the available approaches to calculate the marginal likelihood of a model, the harmonic mean estimator (HME) is by far the most used in the field of phylogenetics, only requiring samples from the posterior distribution. Because HME variance may be infinite, a modified stabilized version (sHME) has been proposed . The HME is however often severely biased and results in overestimating the true marginal likelihood [8, 14]. Lartillot and Philippe  suggest an intuitive reasoning for this by stating that, if the likelihood is unimodal, the marginal likelihood is more or less the product of two factors: the likelihood reached in the high-likelihood region (the mode height) and the relative size of this region (the mode width, which tends to be smaller for higher dimensional models). Estimators such as the HME have difficulties in assessing the mode width, which is estimated by measuring the relative frequency at which points of the sample fall inside and outside the mode, requiring that a sufficient number of points outside the mode be included in the sample. Lartillot and Philippe  state that, in practice, the contrast between the low and the high likelihoods is in general so large that even a posterior sample of astronomical size will be virtually confined within the mode. The estimated frequency at which the low-likelihood region is visited is then 0, which means that, in effect, the HME behaves as if the mode was occupying the entire parameter space, and therefore, completely underestimates the dimensional penalty. As a result, the HME overestimates the marginal likelihood, an overestimation that is more pronounced in the case of higher dimensional models, leading to the HME being biased in favor of such models.
Path sampling is considered a natural generalization of importance sampling and uses many and continuously connected “bridge” densities (called a “path”) to compute (ratios of) normalizing constants . Path sampling is hence an extension of bridge sampling, which generalizes importance sampling through the use of a single “bridge” density. In a comparative study on a variety of methods for computing Bayes factors, from Laplace approximation to bridge sampling, DiCiccio et al.  show that bridge sampling typically provides an order of magnitude of improvement. Path sampling, which was not part of their study, has been demonstrated to yield even more dramatic improvement .
Monte Carlo simulation is widely used in statistics, mainly because of its general applicability, to approximate such analytically intractable normalizing constants. Arguably, it is also the only general method available for dealing with complex, high-dimensional problems . Up until the introduction of bridge sampling and path sampling, estimation methods in statistics often relied on the scheme of importance sampling, either using draws from an approximate density or from one of p i (θ). Theoretical  and empirical evidence [30, 31] provided in the context of bridge sampling, show that substantial reductions of Monte Carlo errors can be achieved with little or minor increase in computational effort, by using draws from more than one p i (θ). The key idea is to use “bridge” densities to effectively shorten the distances among target densities, distances that are responsible for large Monte Carlo errors with the standard importance sampling methods. In fact, Gelman and Meng  show that importance sampling, bridge sampling and path sampling represent a natural methodological evolution, from using no bridge densities to using a (potentially) infinite number of them.
Equation 8 is restricted to a sample from the last iteration of each β to calculate the marginal likelihood, as in the original paper on path sampling . Therefore, this approach only uses a limited amount of the available samples and more information can potentially be retrieved from the MCMC iterations in order to improve the estimation of the marginal likelihood. One possibility lies in using the mean of multiple values for each β, collected at fixed intervals, which requires replacing U(θ k ) and U(θk+1) in equation 8 by and respectively, i. e. the mean of a collection of samples at β k and βk+1 respectively. We propose to collect a sample every 10 update cycles, where an update cycle involves an update for every model parameter, branch length and ancestral site. Such a scenario is valid since, for large values of K, is only slightly more dispersed than (i.e. in such a case, the subsequent power posteriors will be very similar) and hence serves as an excellent importance distribution . In other words, for large values of K, the MCMC chain will smoothly transition between different values of β, and hence the likelihood and parameter values in subsequent MCMC chains will not be all that different from one another, which allows for early sampling at each new value of β. We refer to Appendix A for a derivation of the discretization and sampling error corresponding to the path sampling estimator discussed here.
Recently, Xie et al.  presented a novel approach to estimate marginal likelihoods called ‘stepping-stone sampling’. The authors show that their approach yields an unbiased estimate of the marginal likelihood, as opposed to PS, and that their calculations can be performed more efficiently than PS. Using a simulated Gaussian example data set, which is instructive because of the fact that the true value of the marginal likelihood is available analytically, Xie et al.  show that PS and SS perform much better (with SS being the best) than the HME at estimating the marginal likelihood. The authors go on to analyze a 10-taxon green plant data set using DNA sequences of the chloroplast-encoded large subunit of the RuBisCO gene (rbcL) and establish that PS requires a larger number of power posteriors to be explored compared to SS to overcome its additional bias. Using the HME to estimate the marginal likelihood is reported to yield higher values than using both PS and SS.
Although is unbiased, changing to the log scale introduces a bias . Note that direct Bayes factor estimation (i.e. constructing a path between two competing models) using stepping-stone sampling yields lower variance than calculating the ratio of two independently estimated marginal likelihoods (with each marginal likelihood estimator constructing a path between its prior and posterior). We refer to Appendix B for the derivation of this variance and its comparison to the ratio of variances accompanying marginal likelihood estimation.
with α0=0<α1<…<α n <1=αn+1 dividing the interval [0,1] into n subintervals. Each of these integrals can be calculated independently in parallel, allowing the calculation of the log Bayes factor to be distributed across multiple computer nodes (20 in this case), yielding results much faster than when running on a single node. While the method discussed in  was applied to path sampling, it can easily be adapted for stepping-stone sampling (not shown) and hence each integral shown in equation 19 can be calculated using both path sampling and stepping-stone sampling. We calculate the difference in contribution to the log Bayes factor for each of the calculated 20 sub-integrals and consider the sum of the absolute values of these differences to be a bidirectional error (or repeatability error). The higher this error, the larger the difference between both annealing and melting calculations of the same sub-integral, indicating that more stringent computational settings are needed to accurately estimate the contribution to the total log Bayes factor.
Even using very demanding computational settings to calculate the log Bayes factor in both directions, small differences between the two estimates are to be expected for two reasons. One is the repeatability issue discussed in , which results in small deviations of the marginal likelihood depending on the starting seeds of the analyses, a second is the so-called “thermic lag” of the MCMC chain . Indeed, as β changes continuously during sampling, the chain is never exactly at equilibrium, which will cause a “thermic lag” of the MCMC chain. When sampling a value of Θ at the current value of β, one is in effect sampling from , with β′ slightly smaller than β. Because U(θ) is an increasing function of β, one expects this lag to result on average in an underestimation of the true marginal likelihood when β goes from 0 to 1 (i.e. the ‘annealing’ integration) and in an overestimation when β goes from 1 to 0 (i.e. the ‘melting’ integration) . This thermic lag bias results from the fact that when the value of β is adjusted, the Markov chain takes some time to adjust to the new value , i.e. needs to be equilibrated before taking samples. We here check how consistent the estimators are in yielding an underestimation for the annealing calculations and an overestimation for the melting calculations.
Laurasiatheria data set
Model comparison using the harmonic mean estimator (HME) for each of the three data sets
Nuclear SSU rRNA
As a baseline for the comparison of our analyses of different sigmoid shape parameters to determine which power posteriors to estimate, we first used a flexible-increment approach as introduced in previous work  and determined the annealing and melting estimates for the log Bayes factor and the accompanying bidirectional error (see Methods section). The flexible-increment approach is an extension of the original (constant-increment) path sampling method, where each integration interval shown in equation 19 employs a different but constant increment size for β. An equal number of iterations were run across all 20 integration intervals, summing up to a total of K=2.000 path steps, with Q=200 MCMC iterations being run per path step (see Additional file 1: Table S1, settings ‘F’ - ‘2.000’ - ‘200’). Our flexible-increment approach yields lower bidirectional errors for all three log Bayes factor estimators compared to the original path sampling method  (see Additional file 1: Table S1, settings ‘C’ - ‘2.000’ - ‘200’) and yields more similar results for the annealing and melting integrations. Using these settings, it is immediately apparent that the estimated log Bayes factor using PS or SS is drastically different from the one estimated earlier in the manuscript using the HME and sHME, with the log Bayes factor estimates differing by about 600 log units.
Given that Xie et al.  have shown that a value of K=8 path steps is sufficient for marginal likelihood estimation using stepping-stone sampling for a data set of 10 green plant rbcL sequences (and that K=32 is sufficient for path sampling), we have first reversed the integration settings for two shape values (i.e. K=200 and Q=2.000 for α=10.0 and α=12.0; see Additional file 1: Table S1). The results seem to indicate that both versions of the path sampling estimator converge really well, with reasonable bidirectional errors. This result is misleading however, as shown by our stepping-stone sampling estimator (which is a less biased estimator) which converges to a different value for the log Bayes factor. In fact, only stepping-stone sampling offers an indication that something is amiss, given its higher bidirectional error, but mainly due to its melting estimate of the log Bayes factor, which is much higher than the annealing estimate. Hence, K=200 path steps are clearly insufficient to obtain reliable estimates of the log Bayes factor and may even yield fairly accurate but unreliable results. Increasing K to 400 solves the problem of the previous integration settings for a sigmoid shape value α=10.0 for all three estimators, albeit that the associated bidirectional error estimates are very large. However, for a shape value α=12.0, both path sampling estimators are still unable to bracket the true log Bayes factor value, hence requiring that the number of path steps K be further increased. In other words, even a number of path steps of K=400, which seems to be excessive based on the work of Xie et al. , is only sufficient when a suitable sigmoid shape value is chosen.
Since doubling the number of iterations per path step to Q=400, when K=2.000, only leads to limited improvements of the bidirectional error (see Figure 2 and Additional file 1: Table S1), we now gradually increase the number of path steps, starting with K=4.000 or twice the amount of our first series of analyses, since increasing K alleviates the bias of stepping-stone sampling . Further increasing the number of path steps to K=8.000 confirms this conclusion. Because of the computational demands, we only apply a final doubling of the number of path steps (K=16.000) to the sigmoid shape values α=10.0 and α=12.0 (see Figure 2). Bidirectional errors continue to decrease, leading to even better bracketing of the true value of the log Bayes factor (see Figure 3). Further, comparing the estimates for both sigmoid shapes shows that the different shapes for the first time lead to very similar results for both annealing and melting calculations, allowing us to accurately infer the true value. For α=10.0, the mean bidirectional log Bayes factor equals 2921.09 and for α=12.0 this equals 2921.16.
Pseudogenes data set
Model comparison using path sampling (PS) and stepping-stone sampling (SS) for the pseudogenes data set
Of the different sigmoid shape values tested, a shape value of α=6.0 performs much worse than all other values tested, which perform quite similarly. The original path sampling method  performed the worst (3 cases) in terms of yielding an underestimation of the log Bayes factor using the annealing integration and an overestimation using the melting integration, with our adaptation of path sampling and our proposed stepping-stone sampling approaching only reporting 1 comparison for which this is the case (i.e. for α=13.0). A shape value of α=9.0 yields the lowest bidirectional error for SS, yielding a mean bidirectional log Bayes factor of 175.12, confirming the conclusion of the HME and sHME that the proposed context-dependent model HG04 provides a significantly better model fit than the GTR model. Once again however, HME and sHME seem to clearly be biased towards higher-dimensional model as the log Bayes factor is again overestimated. This result for the pseudogenes data set confirms that a context-dependent model offers increased performance over site-independent models in mammalian data sets, even in the case of (relatively) short sequences and a small number of sequences.
Nuclear SSU rRNA data set
A land plant data set, such as the one analyzed here , offers a real challenge for marginal likelihood and (log) Bayes factor estimators using the models compared in this manuscript. The HG04 model is aimed at capturing specific context-dependent processes, such as the CpG-methylation-deamination process in mammalian sequences. Since we are unaware of any such processes being present in land plant sequences, we expect that the context-dependent HG04 model will actually be outperformed by the site-independent GTR model in terms of model fit, on the premise that a model selection approach is used that is able to penalize the model for its excessive amount of parameters that are not accompanied by fitting site patterns.
Given the probably low amount of context-dependent substitution processes that are able to significantly increase model fit, the ancestral root distribution will most likely also not contain any dependencies. We now test this assumption using both the HME, sHME and SS approach, comparing a zero-order, first-order and second-order ancestral root distribution to the GTR model (which uses its base frequencies to determine the independent stationary distribution at the ancestral root). Note that Table 1 shows 3 estimates of the HME and sHME for the GTR model, depending on the actual ancestral root distribution used with the context-dependent model to ensure that the same set of likelihood contributions are used for the different models. According to the HME and sHME (see Table 1), the most parameter-rich ancestral root distribution also explains the data the best, with a second-order distribution being preferred with a log Bayes factor of 54.64 log units over the GTR model (45.48 log units for a first-order and 35.20 log units for a zero-order ancestral root distribution). This result is questionable, since the pseudogene mammalian data set (which is a clear example of a data set that should be analyzed using a context-dependent model and accompanying ancestral root distribution) benefits the most from a first-order ancestral root distribution .
We now turn our attention to calculating the log Bayes factor using the proposed SS approach. Assuming K=2.000 ratios with Q=200 iterations each, a zero-order ancestral root distribution yielded a bidirectional mean log Bayes factor of -58.56 log units versus the GTR model, a first-order distribution a bidirectional mean log Bayes factor of -64.21 log units and a second-order distribution a bidirectional mean log Bayes factor of -94.23 log units. The outcome here is much more biologically plausible than what was obtained using the HME and sHME, with no sign of dependencies being detected, not at the ancestral root nor throughout the remainder of the underlying tree. These results hence support the claim that the HME tends to be biased towards higher-dimensional models, both on the level of the context-dependent model and the ancestral root distribution used, and is hence unable to accurately perform model selection (especially for models with higher dimensions).
Model comparison using path sampling (PS) and stepping-stone sampling (SS) for the nuclear SSU rRNA data set
Discussion and conclusion
In this paper we have compared the performance of two versions of path sampling, an accurate (in that it is able to reliably estimate marginal likelihoods and hence Bayes factors) but computationally demanding model comparison approach, with that of stepping-stone sampling, for which we provide a so-called model-switch version to directly estimate (log) Bayes factors. We have shown that our adaptation of stepping-stone sampling for direct (log) Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Further, we have demonstrated that the (log) Bayes factor estimator proposed in this manuscript generally has lower variance than the (log) Bayes factor estimator obtained through the ratio of marginal likelihoods estimated using stepping-stone sampling.
The large number of combinations of number of path steps / ratios and chain lengths we investigated leads to the recommendation of stepping-stone sampling over path sampling. Indeed, for a relatively small number of path steps / ratios, path sampling tends to converge towards an entirely different value for the (log) Bayes factor, whereas only stepping-stone sampling is able to provide indications that more stringent analyses need to be performed to better approximate the (log) Bayes factor. Stepping-stone sampling is hence better-suited to provide rough initial estimates, using shorter and hence less time-consuming runs, of the magnitude of modeling assumptions. Both path sampling and stepping-stone sampling methods to estimate (log) marginal likelihoods and (log) Bayes factors are much more reliable approaches to perform model selection than the harmonic mean estimator, which is often employed because of its simplicity and computationally appealing properties.
Given that at both ends of the integration interval when performing model-switch path sampling and stepping-stone sampling, one of the models requires sampling from its prior distribution, we have opted for a sigmoid function to determine the necessary power posterior distributions from which sampling is required. Of the shape values compared for the different data sets, a value of between 9.0 and 12.0 is the most appropriate to accurately determine the difference in model fit for high-dimensional models as it is able to accurately bracket the (log) Bayes factor and is accompanied by a low bidirectional error.
Given that path sampling and stepping-stone sampling are far more reliable approaches when estimating the (log) marginal likelihood, we have checked whether this affects the outcome when performing model comparison. Whereas for the Laurasiatheria data set, there is a large difference in the log Bayes factor estimated by path sampling and stepping-stone sampling on one hand, and the harmonic mean estimator on the other hand, these approaches still reach the same conclusion, i.e. that the context-dependent model presented here yields a much better model fit than a site-independent evolutionary model. While this is also the case for a smaller mammalian pseudogene data set, albeit with lower log Bayes factors, we show that for a plant nuclear rRNA SSU data set, the conclusions of the harmonic mean estimator and the path sampling and stepping-stone sampling estimators do not concur, with the former method being unable to accurately penalize the context-dependent model for its excess parameters.
High-dimensional models are typically used in, for example, studies of context-dependence in mammalian sequences. Context-dependent models have been shown to yield much larger increases in model fit than the assumption of among-site rate variation  or using mixture models . The so-called first-order context-dependent evolutionary model (i.e. assuming an influence from its two immediate flanking bases), which we analyze in this paper, offers a fair balance between parameter complexity and performance, with further improvements in model fit appearing quite challenging . In other words, the drastic increase in number of parameters, from 12 to 96, is justified by the context-dependent evolutionary processes present in the data, even though this means that far less data per parameter is available. In non-mammalian sequences, these models may be prove to be less useful, as demonstrated in this manuscript, as the drastic increase in number of parameters is not accompanied by fitting context-dependent evolutionary patterns in the underlying data.
While stepping-stone sampling outperforms both versions of path sampling, as we have shown in this paper, it is not a silver bullet and still requires massive computation times, especially for the high-dimensional models tested in this manuscript. Despite recent claims that data augmentation can speed up (log) Bayes factor calculation , the use of data augmentation is often a prerequisite to evaluate the likelihood for a context-dependent model and hence does not yield additional increases in speed here. For the Laurasiatheria data set, the most demanding settings we tested (K=16.000 and Q=200) require about 13 days of calculation running across 40 cores simultaneously using 8-core Intel(R) Xeon(R) CPU X7560 processors running at 2.27GHz. This yields a total computational effort for this data set of about 4.300 days or close to 12 computation years. The other two data sets analyzed in this manuscript require little more than 1 day of calculation running across 40 cores simultaneously on the reported system.
One way to circumvent the added complexity of model-switch path sampling and stepping-stone sampling is to shorten the path from posterior to prior whilst still calculating the marginal likelihood for each model separately. Recently, Fan et al.  propose a more general version of stepping-stone sampling that introduces an arbitrary “working” prior distribution parameterized using MCMC samples from the posterior distribution. The authors show that if this reference distribution exactly equals the posterior distribution, the marginal likelihood can be estimated exactly. The authors show that generalized stepping-stone sampling is considerably more efficient and does not require sampling from distributions close to the true prior, currently still an issue with path sampling and stepping-stone sampling for many models. However, at the moment this method is restricted to evaluations on a fixed phylogenetic tree topology. Integrating over plausible tree topologies complicates generalized stepping-stone sampling because of the need to define a reference distribution for topologies that provides a good approximation to the posterior. However, most context-dependent modeling approaches already make the assumption of a fixed underlying tree to make likelihood calculations feasible, making this last requirement less of an issue. The extension of generalized stepping-stone sampling towards constructing a direct path between two competing models is currently the subject of ongoing work.
Appendix A: discretization and sampling error for path sampling
Again, the presented formulas are valid only if the points are truly independent draws from their respective distributions. If this is not the case, then a factor τ=K/K eff (i.e. the decorrelation time) needs to be taken into account in equation 23, to account for the effective sample size . Given that β moves between 0 and 1, the decorrelation time might change. While Lartillot and Philippe  did not observe large variations in the decorrelation time for different values of β for the models they compared, this is not generally so, as shown in .
Appendix B: variance for stepping-stone sampling
Note that the two equations above are not used in the remainder of this appendix.
Comparing expressions 30 and 34 shows that the variance of our direct Bayes factor estimation differs from the variance of the ratio of marginal likelihoods only in the last term of expression 30 (i.e. the “covariance term”). This term is not zero because the densities are evaluated in the same parameter draws. Furthermore, we can logically expect this covariance to be generally positive. This formalizes the idea that direct Bayes factor estimation (i.e. constructing a path between two competing models) yields lower variance than calculating the ratio of two independently estimated marginal likelihoods (with each marginal likelihood estimator constructing a path between its prior and posterior), if the two model priors are the same. From expression 30, it follows that the variance of our proposed approach is finite as long as the denominators are not zero, which is to be expected as otherwise the marginal likelihoods themselves would have to be zero.
The research leading to these results has received funding from the European Union Seventh Framework Programme [FP7/2007-2013] under Grant Agreement no. 278433 and ERC Grant agreement no. 260864. Stijn Vansteelandt acknowledges support from Ghent University (Multidisciplinary Research Partnership ”Bioinformatics: from nucleotides to networks”) and IAP research network grant no. P06/03 from the Belgian government (Belgian Science Policy). The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by Ghent University, the Hercules Foundation and the Flemish Government - department EWI. We acknowledge the support of the National Evolutionary Synthesis Center (NESCent) through a working group (Software for Bayesian Evolutionary Analysis).
- Baele G: Context-dependent evolutionary models for non-coding sequences: an overview of several decades of research and an analysis of Laurasiatheria and Primate evolution. Evol Biol 2012, 39: 61-82. 10.1007/s11692-011-9139-2View ArticleGoogle Scholar
- Baele G, Van de Peer Y, Vansteelandt S: A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences. Syst Biol 2008,57(5):675-692. 10.1080/10635150802422324View ArticlePubMedGoogle Scholar
- Yang Z, Rannala B: Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol 1997,14(7):717-724. 10.1093/oxfordjournals.molbev.a025811View ArticlePubMedGoogle Scholar
- Suchard MA, Weiss RE, Sinsheimer JS: Bayesian selection of continuous-time Markov chain evolutionary models. Mol Biol Evol 2001,18(6):1001-1013. 10.1093/oxfordjournals.molbev.a003872View ArticlePubMedGoogle Scholar
- Steel MA: Should phylogenetic models be trying to ‘fit an elephant’? Trends Genet 2005,21(6):307-309. 10.1016/j.tig.2005.04.001View ArticlePubMedGoogle Scholar
- Jeffreys H: Some tests of significance treated by theory of probability. In Proceedings of the Cambridge Philosophical Society, Volume 31. 1935; 203-222.Google Scholar
- Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc 1995,90(430):773-795. 10.1080/01621459.1995.10476572View ArticleGoogle Scholar
- Lartillot N, Philippe H: Computing Bayes factors using thermodynamic integration. Syst Biol 2006,55(2):195-207. 10.1080/10635150500433722View ArticlePubMedGoogle Scholar
- Ogata Y: A Monte Carlo method for high dimensional integration. Num Math 1989,55(2):137-157. 10.1007/BF01406511View ArticleGoogle Scholar
- Gelman A, Meng XL: Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat Sci 1998,13(2):163-185.View ArticleGoogle Scholar
- Meng XL, Wong WH: Simulating ratios of normalizing constants via simple identity: a theoretical exploration. Stat Sinica 1996, 6: 831-860.Google Scholar
- Newton MA, Raftery AE: Approximating Bayesian inference with the weigthed likelihood bootstrap. J R Stat Soc B 1994, 56: 3-48.Google Scholar
- Nylander JA, Ronquist F, Huelsenbeck JP, Nieves-Aldrey JL: Bayesian phylogenetic analysis of combined data. Syst Biol 2004, 53: 47-67. 10.1080/10635150490264699View ArticlePubMedGoogle Scholar
- Xie W, Lewis PO, Fan Y, Kuo L, Chen MH: Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol 2011,60(2):150-160. 10.1093/sysbio/syq085PubMed CentralView ArticlePubMedGoogle Scholar
- Redelings BD, Suchard MA: Joint Bayesian estimation of alignment and phylogeny. Syst Biol 2005,54(3):401-418. 10.1080/10635150590947041View ArticlePubMedGoogle Scholar
- Fan Y, Wu R, Chen MH, Kuo L, Lewis PO: Choosing among partition models in Bayesian phylogenetics. Mol Biol Evol 2011, 28: 523-532. 10.1093/molbev/msq224PubMed CentralView ArticlePubMedGoogle Scholar
- Baele G, Lemey P, Bedford T, Rambaut A, Suchard MA, Alekseyenko AV: Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol Biol Evol 2012,29(9):2157-2167. 10.1093/molbev/mss084PubMed CentralView ArticlePubMedGoogle Scholar
- Drummond AJ, Suchard MA, Xie D, Rambaut A: Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 2012,29(8):1969-1973. 10.1093/molbev/mss075PubMed CentralView ArticlePubMedGoogle Scholar
- Prasad AB, Allard MW, NISC Comparative Sequencing Program, Green ED: Confirming the phylogeny of mammals by use of large comparative sequence data sets. Mol Biol Evol 2008,25(9):1795-1808. 10.1093/molbev/msn104PubMed CentralView ArticlePubMedGoogle Scholar
- Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol 1994, 39: 105-111.PubMedGoogle Scholar
- Karol KG, McCourt RM, Cimino MT, Delwiche CF: The closest living relatives of land plants. Science 2001, 294: 2351-2353. 10.1126/science.1065156View ArticlePubMedGoogle Scholar
- Felsenstein J: Inferring Phylogenies. Sinauer Associates: Sunderland; 2004.Google Scholar
- Hwang DG, Green P: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA 2004, 101: 13994-14001. 10.1073/pnas.0404142101PubMed CentralView ArticlePubMedGoogle Scholar
- Baele G, Van de Peer Y, Vansteelandt S: Modelling the ancestral sequence distribution and model frequencies in context-dependent models for non-coding sequences. BMC Evol Biol 2010, 10: 244. 10.1186/1471-2148-10-244PubMed CentralView ArticlePubMedGoogle Scholar
- Baele G, Van de Peer Y, Vansteelandt S: Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-coding sequences. J Mol Evol 2010, 71: 34-50. 10.1007/s00239-010-9362-yView ArticlePubMedGoogle Scholar
- Baele G, Li WLS, Drummond AJ, Suchard MA, Lemey P: Accurate model selection of relaxed molecular clocks in Bayesian phylogenetics. Mol Biol Evol 2012, 30: 239-243.PubMed CentralView ArticlePubMedGoogle Scholar
- Huelsenbeck JP, Bollback JP, Levine AM: Inferring the root of a phylogenetic tree. Syst Biol 2002, 51: 32-43. 10.1080/106351502753475862View ArticlePubMedGoogle Scholar
- Zwickl DJ, Holder MT: Model parameterization, prior distributions, and the general time-reversible model in Bayesian phylogenetics. Syst Biol 2004, 53: 877-888. 10.1080/10635150490522584View ArticlePubMedGoogle Scholar
- Yang Z, Rannala B: Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol 2005, 54: 455-470. 10.1080/10635150590945313View ArticlePubMedGoogle Scholar
- DiCiccio TJ, Kass RE, Raftery A, Wasserman L: Computing Bayes factors by combining simulation and asymptotic approximations. J Am Statist Assoc 1997, 92: 903-915. 10.1080/01621459.1997.10474045View ArticleGoogle Scholar
- Meng XL, Schilling S: Fitting full-information factor models and an empirical investigation of bridge sampling. J Am Statist Assoc 1996, 91: 1254-1267. 10.1080/01621459.1996.10476995View ArticleGoogle Scholar
- Lartillot N, Philippe H: A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 2004, 21: 1095-1109. 10.1093/molbev/msh112View ArticlePubMedGoogle Scholar
- Chen MH, Shao QM, Ibrahim JG: Monte Carlo methods in Bayesian Computation. New York: Statistics Springer; 2000.View ArticleGoogle Scholar
- Van de Peer Y, Baele G: Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evol Biol 2009.,9(87):Google Scholar
- Yang Z: Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol 1996,11(9):367-372. 10.1016/0169-5347(96)10041-0View ArticlePubMedGoogle Scholar
- Rodrigue N, Philippe H, Lartillot N: Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol 2006,23(9):1762-1775. 10.1093/molbev/msl041View ArticlePubMedGoogle Scholar
- Oehlert GW: A note on the Delta method. Am Stat 1992, 46: 27-29.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.