Bayesian estimation of scaled mutation rate under the coalescent: a sequential Monte Carlo approach

Background Samples of molecular sequence data of a locus obtained from random individuals in a population are often related by an unknown genealogy. More importantly, population genetics parameters, for instance, the scaled population mutation rate Θ=4N e μ for diploids or Θ=2N e μ for haploids (where N e is the effective population size and μ is the mutation rate per site per generation), which explains some of the evolutionary history and past qualities of the population that the samples are obtained from, is of significant interest. Results In this paper, we present the evolution of sequence data in a Bayesian framework and the approximation of the posterior distributions of the unknown parameters of the model, which include Θ via the sequential Monte Carlo (SMC) samplers for static models. Specifically, we approximate the posterior distributions of the unknown parameters with a set of weighted samples i.e., the set of highly probable genealogies out of the infinite set of possible genealogies that describe the sampled sequences. The proposed SMC algorithm is evaluated on simulated DNA sequence datasets under different mutational models and real biological sequences. In terms of the accuracy of the estimates, the proposed SMC method shows a comparable and sometimes, better performance than the state-of-the-art MCMC algorithms. Conclusions We showed that the SMC algorithm for static model is a promising alternative to the state-of-the-art approach for simulating from the posterior distributions of population genetics parameters. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1948-6) contains supplementary material, which is available to authorized users.

where P (t) ii is the probability of a nucleotide to remain the same during the evolutionary time t and P (t) ij is the probability of a nucleotide changing from initial state i to final state j during the evolutionary time t.
F84 model: F84 model distinguishes between the rate of transitions and transversions using the κ parameter and in addition, it allows unequal nucleotide frequencies i.e., π A = π C = π G = π T . Thus, both κ and the nucleotide frequencies are to be estimated when F84 model is considered. The three types of changes are: where Π j = π A + π G if j is a purine (A or G) and Π j = π C + π T if j is a pyrimidine (C or T).

Note S3
Here, we present the prior distributions for all the unknown parameters in the Bayesian model that we considered.

Prior density of Θ:
Θ is a positive parameter and we impose a uniform distribution in the interval [0, Θ max ] ,i.e., with the probability density function (PDF) given as Table 1 shows the values of Θ max considered in the experiments with simulated datasets and the mitochondrial DNA sequence data (mtDNA data). When analyzing real biological datasets, results obtained from some other estimation approaches can provide knowledge about the range of Θ max .
Θ max for the prior distribution Simulated data: Θ = 0.01 0.10 Simulated data: Θ = 0.10 1.00 Simulated data: Θ = 0.50 1.00 mtDNA data 1.00 Prior density of λ: λ is the set of unknown parameters associated with each mutational model and as stated earlier, the mutational models considered are K80 and F84 models. K80 model has only one unknown parameter i.e., the transitiontransversion ratio κ while the F80 model has two unknown parameters i.e., κ and the nucleotide frequencies π.
Prior density of κ: Like Θ, κ is a positive parameter and we impose a uniform distribution in the interval [0, κ max ] ,i.e., κ ∼ U(0, κ max ), and the PDF is given as Table 2 shows the values of κ max considered in the experiments with simulated datasets and the mtDNA dataset.
Prior density of π: Here, we impose Dirichlet distribution on the nucleotide frequencies π i.e., π ∼ Dir(α), and the PDF is given by where Γ(·) is the gamma function, α = (α A , α C , α G , α T ) are the concentration parameters and π A + π C + π G + π T = 1. In our experiments, we chose α = β 0 π emp , where π emp is the empirical value of the nucleotide frequencies calculated from the data and β 0 , a scalar value, chosen to be 10, is used to control the peakedness of the prior distribution of π around π emp [1].
Prior density of the genealogy: Algorithm 1 describes the procedure to sample from the distribution of the possible genealogies of m copies in a population with effective population size N e .

Algorithm 1
The coalescent: Simulation of a random genealogy.
1: Start with m lineages. 2: Set k = m and T = 0. 3: Draw a random quantity t k from an exponential distribution with expectation 4N e /(k(k − 1)). 4: Randomly pick two of the k copies of the gene, without replacement. 5: Create a node of the genealogical tree which is the immediate common ancestor of these two copies, and which existed T + t k generations before the present. 6: Set T = T + t k . 7: Replace these two copies by this common ancestor and set k = k − 1. 8: If k = 1 we are done. Otherwise return to step 3.

Note S4
Obtaining the weight at time t from the weights at time t-1 In the main text, it was shown that the set of weighted samples {H n approximates the joint target distributionπ t−1 . After propagating the sam-ples at time t − 1 using a forward Markov kernel K t (H t−1 , H t ), the unnormalized weights at time t are calculated as follows: , from the definitions of π t and π t−1 in the main text and noticing that Z t and Z t−1 are constants with respect to H n t and H n t−1 , theñ where {w n t−1 } N n=1 are the unnormalized weights at time t−1 and {W t (H n t−1 , H n t )} N n=1 , the unnormalized incremental weights, calculated as , n = 1, ..., N.

Target Distributions, Forward and Backward Kernels Specification Population Parameters Estimation
Here, we specify the exact form of the sequence of target distributions {π t } T t=1 , the forward kernels, {K t } T t=2 and the backward kernels {L t−1 } T t=2 .

Sequence of target distributions and forward kernels:
Since we are interested in the likelihood tempered target sequence we sample initially from the prior distribution π 1 = p(H) and gradually introduce the effect of the likelihood in order to obtain an approximation of the posterior distribution p(H|D), i.e, π T at t = T . Although, there are a few choices of K t that have been proposed, we employ MCMC kernels of invariant distribution π t , as K t . There is more freedom in specifying the forward kernels in SMC since kernels do not need to be reversible or even Markov. As such, we will employ a hybrid Gibbs and Metropolis-Hastings (M-H) samplers that allows local moves in order to successively move all the parameters of interest in H i.e., Θ, Υ and λ. The set of proposal distributions for the genealogy and other parameters are discussed below.

Sequence of backward kernels:
To obtain a good performance of the SMC samplers algorithm, the backward kernel L t is often optimized with respect to the forward kernel K t . Thus, if an MCMC kernel is used as forward kernel, then the following L t is employed: a good approximation of the optimal backward kernel when the discrepancy between π t and π t−1 is small. Thus, since the chosen L t is the reversal Markov kernel associated with K t , then the unnormalized incremental weights become :

Resampling Procedure
Resampling is performed when the ESS is significantly less than the number of samples, discarding the ineffective samples and then multiply the effective ones. In all our experiments, we performed resampling when the ESS is less than N/10. The resampling procedure is briefly summarized as follows: • Interpret each weight w n t as the probability of obtaining the sample index n in the set {H n t : n = 1, ..., N }. • Draw N samples from the discrete probability distribution and replace the old sample set with this new one.
• Set all weights to the constant value w n t = 1/N .

Note S5
Here, we present the hybrid Gibbs and Metropolis-Hastings algorithm of invariant distribution π t employed to move the particles in the SMC algorithm presented in the main text. In the algorithm, V is the total number of parameters and R M CM C is the chain length for each sample (particle). for is the proposal distribution for the v th parameter (See below for details).

Proposal distributions:
Next, we present the proposal distributions h v (·|·) in Algorithm 2 for all the parameters in H.
Proposal distribution for genealogy: The conditional coalescent proposal for the genealogy is well described in [2,3,4] and the algorithms can easily be implemented.