ABCtoolbox: a versatile toolkit for approximate Bayesian computations
© Wegmann et al; licensee BioMed Central Ltd. 2010
Received: 30 September 2009
Accepted: 4 March 2010
Published: 4 March 2010
The estimation of demographic parameters from genetic data often requires the computation of likelihoods. However, the likelihood function is computationally intractable for many realistic evolutionary models, and the use of Bayesian inference has therefore been limited to very simple models. The situation changed recently with the advent of Approximate Bayesian Computation (ABC) algorithms allowing one to obtain parameter posterior distributions based on simulations not requiring likelihood computations.
Here we present ABCtoolbox, a series of open source programs to perform Approximate Bayesian Computations (ABC). It implements various ABC algorithms including rejection sampling, MCMC without likelihood, a Particle-based sampler and ABC-GLM. ABCtoolbox is bundled with, but not limited to, a program that allows parameter inference in a population genetics context and the simultaneous use of different types of markers with different ploidy levels. In addition, ABCtoolbox can also interact with most simulation and summary statistics computation programs. The usability of the ABCtoolbox is demonstrated by inferring the evolutionary history of two evolutionary lineages of Microtus arvalis. Using nuclear microsatellites and mitochondrial sequence data in the same estimation procedure enabled us to infer sex-specific population sizes and migration rates and to find that males show smaller population sizes but much higher levels of migration than females.
ABCtoolbox allows a user to perform all the necessary steps of a full ABC analysis, from parameter sampling from prior distributions, data simulations, computation of summary statistics, estimation of posterior distributions, model choice, validation of the estimation procedure, and visualization of the results.
Bayesian statistics has gained popularity in scientific inference, especially in population genetics and genomics . Consider a model creating data D determined by parameters θ whose joint prior density is denoted by π (θ). The quantity of interest is the posterior distribution of the parameters θ, which is given by Bayes theorem as π (θ|D)~f(D|θ)π (θ), where f(D|θ) is the likelihood of the data. Unfortunately the evaluation of likelihoods is often difficult or even impossible for complex models. However, Monte-Carlo simulations can be used to approximate the likelihood function. For instance, a simple rejection algorithm has been proposed [2–4] to estimate the likelihood: a candidate parameter vector θ is simulated from a prior distribution and accepted if the corresponding vector of summary statistics S is sufficiently "close" to the observed summary statistics Sobs with respect to some metric in the space of S (i.e., if ||S - Sobs|| ≤ ε for a fixed tolerance ε). The precision of the posterior estimate will improve with smaller values of ε, but small ε values are often associated with very small acceptance rates (that are proportional to the likelihood) and will thus require many more computations.
More recently, Beaumont et al.  introduced a simple post-sampling adjustment to improve the estimation of posteriors from the joint distribution of accepted parameter and summary statistics vectors. They introduced a locally-weighted linear regression of parameters on summary statistics around the observed summary statistics, which allows one to obtain good estimations with relatively large acceptance intervals. Additional improvements have been proposed to more efficiently explore the parameter space, including several variations of likelihood-free MCMC [6–9] or Particle-based samplers [10–12]. All these methods are generally referred to as Approximate Bayesian Computations (ABC) and reviewed in [13, 14].
Here, we present ABCtoolbox, a series of computer programs that can be pipelined to estimate parameters of complex models. In complement to available ABC packages [15–17], ABCtoolbox incorporates several ABC algorithms, handles various types of data by interacting with external simulation programs and includes tools to rigorously validate estimations, a necessary step of any Bayesian inference. ABCtoolbox comes with a detailed manual introducing all implemented algorithms and giving hints on how to successfully apply ABC inference. We demonstrate the usability of ABCtoolbox by inferring sex-specific migration rates and population sizes of the common vole Microtus arvalis.
ABCsampler (i) samples model parameter values (ii) calls an external simulation program, (iii) calls an external program computing summary statistics on simulated data, and (iv) writes parameter values and resulting summary statistics to a file. These four basic steps are then iterated many times until a large set of simulations is generated. The main parameters for ABCsampler are the names of the simulation program and the program to compute summary statistics, and the ABC sampler to be used. The sampling of parameter values can either be done using a rejection algorithm , a likelihood free MCMC  or a Population Monte Carlo sampler . We refer the reader to the original papers and to the ABCtoolbox user manual for details on the implemented algorithms. Due to the fully object orientated implementation of ABCsampler, it would be straightforward to include additional sampling algorithms such as likelihood free MCMC with state-space augmentation  or to adopt a tempering scheme for the likelihood free MCMC .
Interaction of ABCsampler with external programs is done via the command line or via input files, which allows one to use most of the many genetic data simulation programs available, such as SIMCOAL2 , SPLATCHE , ms  or FREGENE . ABCsampler is also fully flexible concerning the number of programs to be called per iteration, which allows one to generate different types of data for the same model parameters (Figure 1). For instance, DNA sequences and microsatellites data with different ploidy levels can be generated by calling SIMCOAL2 twice with different parameter files as input. ABCsampler also offers the possibility to call any script or program to modify the output of the simulation program (Figure 1). This feature allows, for instance, to simulate observed patterns of missing data by simply removing simulated genotypes from the output of SIMCOAL2. ABCsampler can also deal with linear combinations of summary statistics like Partial Least Squares (PLS), where one computes a few orthogonal components in the summary statistics space best explaining the parameter variability [9, 22].
The program ABCestimator directly reads the output of ABCsampler and computes posterior distributions based on a fraction of the simulations closest to the observed data. The regression adjustment implemented in ABCestimator follows a recent formulation termed ABC-GLM  that differs from that proposed by Beaumont et al. , but it also leads to the marginal posterior distribution of each parameter. In addition, it computes the marginal density of a model, which naturally leads to the computation of Bayes factors and performing Bayesian model choice . ABCestimator further offers two ways to validate the estimation procedure. First, one can test the ability of ABC to estimate parameters by analyzing a large number of simulated data sets with known parameter values drawn from the prior distributions (generated using ABCsampler) . Based on such a test dataset, ABCestimator computes accuracy measures, such as bias, mean squared errors and coverage properties . Secondly, ABCestimator offers a new way to check if the observed data are in strong disagreement with the assumed model. The idea is to compute the distribution of the marginal densities for all simulations retained for posterior estimation. The marginal density of the observed data is then compared against this distribution to compute a p-value indicative of the ability of the model to reproduce the data.
ABCtoolbox is also bundled with several additional command line utilities: the command-line version of the program ARLEQUIN , arlsumstat, to compute summary statistics, an additional program to compute simple summary statistics on microsatellite data (strStats), a program to linear-transform summary statistics (transformer), a program to simulate data from general linear models (GLM) and R scripts to estimate PLS components and to visualize results. The functionalities and usage of ABCtoolbox are fully described in a user manual, which also includes a complete methodological references of all algorithms implemented in ABCtoolbox and many hints and suggestions on how to perform successful ABC parameter estimations.
Results and Discussion
We demonstrate the use of ABCtoolbox by studying the history of the common vole Microtus arvalis. This small rodent is probably the most abundant European mammal  and it has become a model species to study recolonization processes in Europe after the last ice age [26–28]. The common vole has generally very limited dispersal abilities , and differences in migration patterns between males and females have been reported . Here we use 11 available population samples from the Central (8 samples) and Eastern (3 samples) M. arvalis evolutionary lineages to infer sex-specific migration rates and population sizes, along with key parameters of the demographic history of these lineages. All 218 sampled individuals (between 16 and 25 per population) have been typed for 11 nuclear microsatellites and for 320 bp of the female transmitted mitochondrial DNA (mtDNA) control region [26, 30].
Assumed Model of Microtus arvalis evolution
Characteristics of the prior and obtained posterior distributions.
U [10, 500]
U [10, 200]
N A a
Nm males a
T DIV b
U [40,000, 80,000]
μ DNA × 108
U [10-8, 5*10-7]
μ STR × 104
U [10-5, 5*10-4]
U 8.00 12.00
Estimation Procedure and Validation
Microtus arvalis recent demographic history
ABCtoolbox allows a user to perform all the necessary steps of a full ABC analysis, like parameter sampling from prior distributions, data simulations, computation of summary statistics, estimation of posterior distributions, model choice, validation of the estimation procedure, and visualization of the results. It includes various ABC algorithms, several of which are not found in any other available software package so far (e.g. likelihood-free MCMC [8, 9], Population Monte Carlo [10–12] and ABC-GLM ), and the future addition of new algorithms is straightforward due to ABCtoolbox's object-oriented design. In combination with SIMCOAL2 , ABCtoolbox is more flexible than available programs to perform ABC estimation of demographic parameters from genetic data [15, 17] in that it allows (i) the simultaneous use of several data types (i.e. microsatellites, DNA sequences, or SNP data), each with potentially different ploidy levels and modes of transmission (ii) the inference of parameters under complex demographic scenarios including any combination of admixture, divergence and migration between an arbitrary number of populations with dynamic size changes, and (iii) the incorporation of additional features of the data, such as varying levels of missing data or population-specific inbreeding levels. However, ABCtoolbox is not tied to SIMCOAL2 and may equally well be used with other genetic simulation programs, like ms , SPLATCHE  or FREGENE . It can thus allow parameter inference under spatially explicit models or models with natural selection. In fact, ABCtoolbox can interact with virtually any command-line simulation software and is capable of using several simulation programs per iteration, which could allow for instance the simultaneous analysis of different data sets or different types of data. The additional ability to use linear combinations of summary statistics best explaining parameter variability is also of interest when a prior selection of the most informative statistics is difficult.
The flexibility of ABCtoolbox has a cost in performance. Indeed, for each simulation, external programs may need to be launched, and communication between programs is mediated by text files, which is not optimal, and therefore simple models may be more quickly analyzed with canned ABC sampling software (e.g. [15, 17]). However, this disadvantage may become negligible when considering the possibility of distributing simulations over the nodes of a cluster, or when simulation time of large data sets under complex models are much more costly than communication between subprograms. In any case, the overall speed of the computations primarily depends on the speed of the genetic simulations program. In this respect, we have used here SIMCOAL2 for its flexibility and ease of parameterization rather than for its speed, and the use of continuous-time coalescent simulations (such as ms ) should lead to much faster inferences.
Availability and requirements
Project name: ABCtoolbox
Project home page: http://www.cmpg.iee.unibe.ch/
Operating system(s): Platform independent with supported Linux and Windows binaries
Other requirements: Windows users need to install the CYGWIN Linux-like environment for Windows, available on http://www.cygwin.com.
Programming language: C++ and R
License: GNU GPL version 3 or later
We thank Gerald Heckel for helpful comments on an earlier draft of the manuscript.
Funding: This work has been supported by Swiss National Foundation grants No 3100A0-112072 and 3100A0-126074 to LE.
- Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet 2004, 5(4):251–261. 10.1038/nrg1318View ArticlePubMedGoogle Scholar
- Tavaré S, Balding DJ, Griffiths RC, Donnelly P: Inferring coalescence times from DNA sequence data. Genetics 1997, 145(2):505–518.PubMedPubMed CentralGoogle Scholar
- Weiss G, von Haeseler A: Inference of population history using a likelihood approach. Genetics 1998, 149: 1539–1546.PubMedPubMed CentralGoogle Scholar
- Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW: Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 1999, 16(12):1791–1798.View ArticlePubMedGoogle Scholar
- Beaumont MA, Zhang W, Balding DJ: Approximate bayesian computation in population genetics. Genetics 2002, 162(4):2025–2035.PubMedPubMed CentralGoogle Scholar
- Ratmann O, Jorgensen O, Hinkley T, Stumpf M, Richardson S, Wiuf C: Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum . PLoS Comput Biol 2007, 3(11):2266–2278. 10.1371/journal.pcbi.0030230View ArticleGoogle Scholar
- Bortot P, Coles SG, Sisson SA: Inference for stereological extremes. J Am Stat Assoc 2007, 102(477):84–92. 10.1198/016214506000000988View ArticleGoogle Scholar
- Marjoram P, Molitor J, Plagnol V, Tavare S: Markov chain Monte Carlo without likelihoods. Proceedings Of The National Academy Of Sciences Of The United States Of America 2003, 100(26):15324–15328. 10.1073/pnas.0306899100View ArticlePubMedPubMed CentralGoogle Scholar
- Wegmann D, Leuenberger C, Excoffier L: Efficient Approximate Bayesian Computation coupled with Markov Chain Monte Carlo without likelihood. Genetics 2009, 182(4):1207–1218. 10.1534/genetics.109.102509View ArticlePubMedPubMed CentralGoogle Scholar
- Sisson SA, Fan Y, Tanaka MM: Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(6):1760–1765. 10.1073/pnas.0607208104View ArticlePubMedPubMed CentralGoogle Scholar
- Beaumont M, Cornuet JM, Marin JM, Robert C: Adaptivity for ABC algorithms: the ABC-PMC scheme. Biometrika 2009, 96(4):983–990. 10.1093/biomet/asp052View ArticleGoogle Scholar
- Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MPH: Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface 2009, 6(31):187–202. 10.1098/rsif.2008.0172View ArticlePubMed CentralGoogle Scholar
- Marjoram P, Tavare S: Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet 2006, 7(10):759–770. 10.1038/nrg1961View ArticlePubMedGoogle Scholar
- Sisson SA, Fan Y: Likelihood-free Markov chain Monte Carlo. In Handbook of Markov chain Monte Carlo. Edited by: SP Brooks AG, Jones G, Meng X-L. Chapman and Hall/CRC Press; 2010:in press.Google Scholar
- Cornuet JM, Santos F, Beaumont MA, Robert CP, Marin JM, Balding DJ, Guillemaud T, Estoup A: Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics 2008, 24(23):2713–2719. 10.1093/bioinformatics/btn514View ArticlePubMedPubMed CentralGoogle Scholar
- Thornton K: Automating approximate Bayesian computation by local linear regression. BMC Genetics 2009, 10(1):35. 10.1186/1471-2156-10-35View ArticlePubMedPubMed CentralGoogle Scholar
- Lopes JS, Balding D, Beaumont MA: PopABC: a program to infer historical demographic parameters. Bioinformatics 2009, 25(20):2747–2749. 10.1093/bioinformatics/btp487View ArticlePubMedGoogle Scholar
- Laval G, Excoffier L: SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics 2004, 20(15):2485–2487. 10.1093/bioinformatics/bth264View ArticlePubMedGoogle Scholar
- Currat M, Ray N, Excoffier L: SPLATCHE: a program to simulate genetic diversity taking into account environmental heterogeneity. Molecular Ecology Notes 2004, 4(1):139–142. 10.1046/j.1471-8286.2003.00582.xView ArticleGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18(2):337–338. 10.1093/bioinformatics/18.2.337View ArticlePubMedGoogle Scholar
- Chadeau-Hyam M, Hoggart CJ, O'Reilly PF, Whittaker JC, De Iorio M, Balding DJ: Fregene: Simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics 2008, 9: 11. 10.1186/1471-2105-9-364View ArticleGoogle Scholar
- Tenehaus M, Gauchi J-P, Ménardo C: Régression PLS et Applications. Revue de Statistique Appliquée 1995, 43(1):57.Google Scholar
- Leuenberger C, Wegmann D: Bayesian Computation and Model Selection Without Likelihoods. Genetics 2010, 184(1):243–52. 10.1534/genetics.109.109058View ArticlePubMedPubMed CentralGoogle Scholar
- Excoffier L, Estoup A, Cornuet JM: Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 2005, 169(3):1727–1738. 10.1534/genetics.104.036236View ArticlePubMedPubMed CentralGoogle Scholar
- Excoffier L, Lischer HEL: Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, in press.Google Scholar
- Heckel G, Burri R, Fink S, Desmet JF, Excoffier L: Genetic structure and colonization processes in European populations of the common vole, Microtus arvalis. Evolution 2005, 59(10):2231–2242.View ArticlePubMedGoogle Scholar
- Braaker S, Heckel G: Transalpine colonisation and partial phylogeographic erosion by dispersal in the common vole ( Microtus arvalis ). Molecular Ecology 2009, 18(11):2518–2531. 10.1111/j.1365-294X.2009.04189.xView ArticlePubMedGoogle Scholar
- Hamilton G, Currat M, Ray N, Heckel G, Beaumont M, Excoffier L: Bayesian estimation of recent migration rates after a spatial expansion. Genetics 2005, 170(1):409–417. 10.1534/genetics.104.034199View ArticlePubMedPubMed CentralGoogle Scholar
- Boyce CCK, Boyce JL: Population Biology of Microtus arvalis . II. Natal and Breeding Dispersal of Females. Journal of Animal Ecology 1988, 57(3):723–736. 10.2307/5089View ArticleGoogle Scholar
- Fink S, Excoffier L, Heckel G: Mitochondrial gene diversity in the common vole Microtus arvalis shaped by historical divergence and local adaptations. Mol Ecol 2004, 13(11):3501–3514. 10.1111/j.1365-294X.2004.02351.xView ArticlePubMedGoogle Scholar
- Schweizer M, Excoffier L, Heckel G: Fine-scale genetic structure and dispersal patterns in the common vole Microtus arvalis . Mol Ecol 2007, 16(12):2463–2473. 10.1111/j.1365-294X.2007.03284.xView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.