Software  Open  Published:
msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation
BMC Bioinformaticsvolume 8, Article number: 268 (2007)
Abstract
Background
Although testing for simultaneous divergence (vicariance) across different populationpairs that span the same barrier to gene flow is of central importance to evolutionary biology, researchers often equate the gene tree and population/species tree thereby ignoring stochastic coalescent variance in their conclusions of temporal incongruence. In contrast to other available phylogeographic software packages, msBayes is the only one that analyses data from multiple species/population pairs under a hierarchical model.
Results
msBayes employs approximate Bayesian computation (ABC) under a hierarchical coalescent model to test for simultaneous divergence (TSD) in multiple codistributed populationpairs. Simultaneous isolation is tested by estimating three hyperparameters that characterize the degree of variability in divergence times across codistributed population pairs while allowing for variation in various within populationpair demographic parameters (subparameters) that can affect the coalescent. msBayes is a software package consisting of several C and R programs that are run with a Perl "frontend".
Conclusion
The method reasonably distinguishes simultaneous isolation from temporal incongruence in the divergence of codistributed population pairs, even with sparse sampling of individuals. Because the estimate step is decoupled from the simulation step, one can rapidly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. Given the complex and idiosyncratic nature of testing multispecies biogeographic hypotheses, we envision msBayes as a powerful and flexible tool for tackling a wide array of difficult research questions that use population genetic data from multiple codistributed species. The msBayes pipeline is available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline is comprised of several C and R programs that are run with a Perl "frontend" and runs on Linux, Mac OSX, and most POSIX systems. Although the current implementation is for a single locus per speciespair, future implementations will allow analysis of multiloci data per species pair.
Background
Testing for simultaneous divergence (vicariance) across different populationpairs that span the same historical barrier to gene flow is of central importance to evolutionary biology, biogeography and community ecology [1–4]. Such inferences inform processes underlying speciation, community composition, range delineation, and the ecological consequences of climatic changes. Estimating a population divergence time with an appropriate statistical model [5] can be accomplished in a variety of ways [6–8], yet analyzing comparative phylogeographic data with multiple cooccurring species pairs that vary with respect to demographic parameters and pairwise coalescent times is less straightforward.
Instead of conducting an independent analysis on every populationpair and testing the hypothesis of temporal concordance based on this set of independent parameter estimates of divergence time, the hierarchical model employed by msBayes follows the suggestion of [9] by concurrently estimating three hyperparameters that characterize the mean, variability and number of different divergence events across a set of populationpairs. The model employed in msBayes allows estimation of these hyperparameters across a multispecies data set while explicitly incorporating uncertainty and variation in the subparameters that independently describe each populationpair's demographic history (divergence time, current, ancestral and founding effective population sizes), postdivergence migration rate and recombination rate. The msBayes software pipeline is based on the introduction of the approximate Bayesian computation (ABC) method for sampling from the hyperposterior distribution for testing for simultaneous divergence [10]. We review the important features here. Although the current implementation is for a single locus per speciespair, future implementations will allow analysis of multiloci data per species/population pair.
In contrast to previous ABClike models [11–15], our TSD is accomplished by implementing a hierarchical Bayesian model in which the subparameters (Φ; within populationpair parameters) are conditional on "hyperparameters" (ϕ) that describe the variability of Φ among the Y populationpairs. For example, divergence times (Φ) can vary across a set of population pairs conditional on the set of hyperparameters (ϕ) that varies according to their hyperprior distribution. Instead of explicitly calculating the likelihood expression P(Data  ϕ,Φ) to get a posterior distribution, we sample from the posterior distribution P((ϕ,Φ)  Data) by simulating the data K times under the coalescent model using candidate parameters drawn from the prior distribution P(ϕ,Φ). A summary statistic vector D for each simulated dataset is then compared to the observed summary statistic vector in order to generate random observations from the joint posterior distribution f(ϕ_{ i },Φ_{ i }D_{ i }) by way of a rejection/acceptance algorithm [16] followed by an optional weighted local regression step [15]. Loosely speaking, hyperparameter values are accepted and used to construct the posterior distribution with probabilities proportional to the similarity between the summary statistic vector from the observed data and the summary statistic vector calculated from simulated data.
The hierarchical model consists of ancestral populations that split at divergence times T_{ Y }= {τ_{1}...τ_{ Y }} in the past. The hyperparameter set, ϕ quantifies the degree of variability in these Y divergence times across the Y ancestral populations and their Y descendent population pairs: (1) Ψ, the number of possible divergence times (1 ≤ Ψ ≤ Y); (2) E(τ), the mean divergence time; and (3) Ω, the ratio of variance to the mean in these Y divergence times, Var(τ)/E(τ). The subparameters for the ith populationpair (Φ_{ i }) are allowed to vary independently across Y population pairs and include divergence time (τ_{ i }), current population sizes, ancestral population sizes, postdivergence founding population sizes, durations of postdivergence population growth, recombination rates, and postdivergence migration rates. The multiple populationpair splitting model is depicted in Figure 1. Each divergence time parameter τ is scaled by 2N_{ AVE }generations, where N_{ AVE }is the parametric expectation of N (effective population size) across Y population pairs given the prior distribution.
The summary statistic vector D employed in msBayes currently consists of up to six summary statistics collected from each of the Y population pairs (π,θ_{ W }, Var(π  θ_{ W }), π_{ net }, π_{ b }, and π_{ w }). This includes π, the average number of pairwise differences among all sequences within each population pair, θ_{ W }the number of segregating sites within each population pair normalized for sample size, [17], Var(π  θ_{ W }) in each population pair, and π_{ net }, Nei and Li's net nucleotide divergence between each pair of populations [18]. This last summary statistic is the difference (π_{ b } π_{ w }) where π_{ b }is the average pairwise differences between each population pair and π_{ w }is the average pairwise differences within a sister pair of descendent populations. The default setting includes the first four aforementioned summary statistics because they were found to be a least correlated subset of a larger group [19], however, future versions of msBayes will allow users to choose other summary statistics.
An extensive simulation study was conducted in [10] to evaluate the performance of our hierarchical ABC model. Because comparative phylogeographic studies are often conducted on multispecies data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we extend the previous evaluation to explore the effectiveness of msBayes in conducting a TSD given small sample sizes (≤ 5 individuals per population pair).
Implementation
After preparation of a sample size file and the input files from DNA sequence data, running msBayes is a three step process that includes: (A) calculating the observed summary statistic vector from the DNA sequence input files and the sample size file; (B) running coalescent simulations of the DNA sequence data using parameters drawn from the hyperprior (ϕ) and prior (Φ); and (C) sampling from the posterior distribution and obtaining posterior estimates of Ψ, E(τ), and Ω across the Y population pairs (Figure 2).
Step A is accomplished by a commandline Perl program (obsSumStats.pl) which uses two C programs to calculate the observed summary statistic vector file. Specifically, the user runs obsSumStats.pl after collecting separate aligned DNA sequence files from each populationpair in FASTA format, and constructing an additional text file that describes the samples sizes, length of genes and transition/transversion rate ratios.
Step B iteratively simulates data sets under the hierarchical model by: 1.) randomly drawing hyperparameters and subparameters from the hyperprior and subprior distributions; 2.) using these hyperparameters and subparameters to simulate finite sites DNA sequence data from Y populationpairs; and 3.) calculating a summary statistic vector D from the simulated data set of Y populationpairs. This is accomplished with several C programs that are run with a Perl "frontend" (msbayes.pl) that either prompts the user for the upperbounds of various priors and the number of iterations to simulate or alternatively uses a batch configuration file with equivalent information. The first C program draws hyperparameters and subparameters from their hyperprior and subprior distributions. These parameters are then fed into several C programs that simulate finitesites DNA sequence data using msarbpopQH a modified version of Hudson's classic coalescent simulator ms [20], which incorporates finite sites mutation and arbitrary population structure and dynamics. Another set of C programs calculates a summary statistic vector (D) for every simulated data set of Y population pairs.
Step C is accomplished by our commandline userinterface program (acceptRej.pl). This Perl program internally uses R for the calculation. The algorithim is a simple extension of the original R scripts which are kindly provided by M. Beaumont [15]. This step does the rejection/acceptance sampling and local regression to produce the approximate sample of the posterior distribution. This third step uses the output of step B as the input and produces an output file that contains multiple graphical depictions of the posterior distributions and a text output file with various summaries of the posterior distributions (estimates of Ψ, E(τ), and Ω across the Y population pairs). The user can choose which summary statistics to include within D (the summary statistic vector), choose the proportion of accepted draws from the prior, and can optionally choose to perform simple rejection sampling without the additional local regression step.
We distribute msBayes as C source code and precompiled binaries that run on Linux or Mac OS X operating systems. The msBayes package also includes the R functions, and Perl scripts, as well as installation/running instructions.
Results and Discussion
Performance of estimator with small sample sizes
At the present time, there are no other available coalescentbased tools for analyzing multiple population pairs simultaneously to yield hyperparameter estimates. Although IM and IMa are most similar to msBayes [8, 21] because they estimate divergence times and population sizes from single pairs of populations under a coalescent model, these do not employ a hierarchical model and therefore can only do so one pair at a time. The program MCMCcoal can estimate divergence times of a known phylogeny under a coalescent model, but can only use the separate divergence time estimates to test for phylogeographic congruence [7]. The program BEST [6] infers a species phylogeny and demographic parameters (e.g. divergence times and population sizes) using a population coalescent model, but likewise can only use the individual divergence time estimates to test for phylogeographic congruence across a multispecies dataset. On the other hand, the hierarchical model employed in msBayes not only can estimate hyperparameters but also comes with the benefit of additional information gained from the "borrowing strength" across datasets [22–24]. In this case, the resulting emergent multispecies hyperestimates use more of the information than the sum of their parts (within speciespair estimates).
Although the hierarchical ABC model employed in msBayes was extensively evaluated in [10], the behavior of the ABC estimator given minimal sampling of individuals was not examined. Because comparative phylogeographic studies are often conducted on multispecies data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we evaluate how low sample sizes can affect inference. To this end, we explored the performance in scenarios where ≤ 5 per population pair were sampled from each of 10 population pairs. We created 1,000 simulated data sets under each of two different histories: (1) simultaneous divergence history and (2) variable divergence history among population pairs. In the simultaneous divergence history (true Ω = Var(τ)/E(τ) = 0), all ten population pairs arose from ancestral populations at τ = 1.8 before the present. In the variable divergence history (true Ω = 0.1), two population pairs arose at τ = 1.0 and eight population pairs arose at τ = 2.0 before the present. We simulated these two histories with small sample sizes (2–5 individual per populationpair) and with larger sample sizes (20 individuals per population pair; 10 per descendent population). The simulated data sets consisted of haploid mtDNA samples from ten population pairs that were 550–600 base pairs in length. From each of the four sets of 1,000 simulated data sets, we used msBayes to obtain 1,000 ABC estimates of the hyperparameter, Ω, with the goal of assessing the effects of sample sizes on the root mean square error (RMSE) of the ABC Ω estimator. Each estimate of Ω was obtained from the mode of 1,000 accepted draws (after the local regression step) from 500,000 random draws from the hyperprior, as these conditions were found to be optimal in [10]. For the larger sample sizes we use four classes of summary statistics (π, θ_{ W }, Var(π  θ_{ W }) and π_{ net }), while for the smaller sample sizes we only use π_{ b }to avoid null or n.a.n. values (not a number) that are yielded when only one sample is collected from a descendent population.
The simulation analysis demonstrated that msBayes can usually distinguish simultaneous divergence from temporal incongruence in divergence, even with sparse sampling of individuals. The estimates of Ω were not markedly improved by sampling 20 individuals per population pair (10 each population) when compared to sampling 2–5 individuals per population pair (1–3 each population; Figure. 3). However, Ω is being overestimated under both sample sizes and this upward bias is stronger with larger sample sizes when true Ω = 1. Therefore, simultaneous divergence is easier to correctly reject with larger sample sizes. Root mean square error (RMSE) for estimating Ω was < 0.12 when the true history was simultaneous divergence (Ω = 0), and RMSE was < 0.18 when the true history involved 2 different divergence events across 10 population pairs (Ω = 0.1). It is encouraging that one can obtain fair estimates with so few samples per population pair and that two samples per population pair can be analyzed by msBayes.
An attractive benefit of an ABC method such as msBayes is that one can perform this estimator evaluation relatively quickly. Simulating data from parameters drawn from the prior is only done once per set of conditions (sample size/history) and can be done in approximately 5 hours per population pair on a 2 GHz linux computer. The computational time can be further reduced as the simulations can be run parallel on multiple processors. Because the acceptance/rejection step is decoupled from simulating the prior, multiple estimates from a series of simulated datasets can be accomplished without resimulating the prior each time. The acceptance/rejection step for a single estimate is accomplished in one second to well under a minute such that 1,000 estimates can be obtained from 1,000 data sets simulated from fixed known parameter values in under an hour to within 24 hours on a single processor.
General use and future development
The most important aspect of msBayes is that its flexible and modular nature will allow us and others to add in new features. This characteristic is essential for a phylogeographic software tool because phylogeographic studies are highly idiosyncratic. Using population genetic data to test how climate and/or geological changes result in biogeographic shifts, speciation, extinction and consequent changes in ecological interactions can involve a wide array of hypotheses and models that conform to no generality with regards to model complexity, parameterization and sampling. We therefore anticipate making several extensions to msBayes, and will encourage other bioinformaticians to make versions that suite particular difficult research questions. Furthermore, phylogeographic studies are most powerful when combined with independent evidence (or hypotheses) about past habitat distributions that are generated from other types of historic data and ecological distribution models [25]. Particular historical hypotheses can then be directly parameterized by paleodistribution models and tested with genetic data within the msBayes framework using Bayes factors [26].
One feature we plan to include in future versions of msBayes is an option to simulate from the prior after constraining the number of divergence events per Y population pairs to one fixed number. This will then allow getting estimates for when these different isolation events took place as well as estimating which population pairs originated at either of the particular divergence events. Other upcoming features to be included are: 1.) multiple loci per population pair by expanding the summary statistic vector and adding additional hyperparameters controlling mutation rate variation across loci; 2.) having more summary statistics available; 3.) allowing analysis of only one population pair at a time; 4.) testing multispecies colonization hypotheses; 5.) three or more population models (as opposed to two population models); 6.) allowing microsatellite data and 7.) automating the simulation testing procedure used to obtain estimator bias.
It should be noted that migration could hinder the ability of this method to correctly infer simultaneous divergence. Moderate migration in a subset of species/population pairs could cause the method to incorrectly support temporal discordance in divergence when the true history involved temporal congruence because migration can erase the genetic signal of isolation [27, 28]. Although the Bayesian support for temporal concordance in divergence times would likely weaken if this happens in a subset of species/population pairs, we will explore using the summary statistic Var(π) as a means to tease apart migration from isolation as suggested in [29, 30].
Conclusion
The msBayes software pipeline will increasingly become an important tool as the field of comparative phylogeography progresses to become a more rigorous and statistical enterprise [5]. The program can obtain hyperparameter estimates using hierarchical models in a reasonable amount of time without having the problems associated with convergence and mixing found in MCMC methods (Markov chain Monte Carlo). Because the estimation step is decoupled from the simulation step, one can quickly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. The method can reasonably distinguish biogeographic congruence from temporal incongruence, even with sparse sampling of individuals. Given the complex and idiosyncratic nature of testing multispecies biogeographic hypotheses, we envision msBayes as a powerful and flexible tool that is open for modification when faced with particularly difficult research questions. Finally, due to its flexible and modular design, msBayes will be a wellsuited tool for the heterogeneous data sets that are emerging and being combined to test complex historical hypotheses.
Availability and requirements
The installation instructions, documentation, source code and precompiled binary for msBayes are all available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline is comprised of several C and R programs that are run with a Perl "frontend" and runs on Linux, Mac OSX, and most POSIX systems.
Abbreviations
 ABC:

Approximate Bayesian Computation
 TSD:

Test of simultaneous divergence
 mtDNA:

Mitochondrial DNA
References
 1.
Avise JC: Phylogeography: The history and formation of species. Cambridge: Harvard University Press; 2000.
 2.
Wen J: Origin and the evolution of the eastern North American distinct distributions of flowering plants. Annu Rev Ecol Syst 1999, 30: 421–455. 10.1146/annurev.ecolsys.30.1.421
 3.
Lessios HA: The first stage of speciation as seen in organisms separated by the Isthmus Panama. In Endless Forms: species and speciation. Edited by: Howard DJ, Berlocher S. Oxford, U.K.: Oxford University Press; 1998:186–201.
 4.
Barraclough TG, Nee S: Phylogenetics and speciation. Trends Ecol Evol 2001, 16(7):391–399. 10.1016/S01695347(01)021619
 5.
Knowles LL, Maddison WP: Statistical phylogeography. Mol Ecol 2002, 11(12):2623–2635. 10.1046/j.1365294X.2002.01637.x
 6.
Edwards SV, Liu L, Pearl DK: Highresolution species trees without concatination. Proc Natl Acad Sci USA 2007, 104: 5936–5941. 10.1073/pnas.0607004104
 7.
Rannala B, Yang ZH: Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci. Genetics 2003, 164: 1645–1656.
 8.
Hey J, Nielsen R: Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA 2007, 104: 2785–2790. 10.1073/pnas.0611164104
 9.
Edwards SV, Beerli P: Perspective: Gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 2000, 54(6):1839–1854.
 10.
Hickerson MJ, Stahl E, Lessios HA: Test for simultaneous divergence using approximate Bayesian computation. Evolution 2006, 60: 2435–2453.
 11.
Thornton K, Andolfatto P: Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics 2005, 172: 1607–1619. 10.1534/genetics.105.048223
 12.
Excoffier L, Estoup A, Cornuet JM: Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 2005, 169: 1727–1738. 10.1534/genetics.104.036236
 13.
Estoup A, Beaumont BA, Sennedot F, Moritz C, Cornuet JM: Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus . Evolution 2004, 58: 2021–2036.
 14.
Tallmon DA, Luikart G, Beaumont BA: Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 2004, 167: 977–988. 10.1534/genetics.103.026146
 15.
Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics 2002, 162: 2025–2035.
 16.
Weiss G, von Haeseler A: Inference of population history using a likelihood approach. Genetics 1998, 149: 1539–1546.
 17.
Watterson GA: On the number of segregating sites in genetic models without recombination. Theor Popul Biol 1975, 7: 256–276. 10.1016/00405809(75)900209
 18.
Nei M, Li W: Mathematical model for studying variation in terms of restriction endonucleases. Proc Natl Acad Sci USA 1979, 76: 5269–5273. 10.1073/pnas.76.10.5269
 19.
Hickerson MJ, Dolman G, Moritz C: Comparative phylogeographic summary statistics for testing simultaneous vicariance across taxonpairs. Mol Ecol 2006, 15: 209–224. 10.1111/j.1365294X.2005.02718.x
 20.
Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338. 10.1093/bioinformatics/18.2.337
 21.
Hey J, Nielsen R: Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis . Genetics 2004, 167: 747–760. 10.1534/genetics.103.024182
 22.
Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet 2004, 5: 251–261. 10.1038/nrg1318
 23.
James W, Stein C: Estimation with quadradic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability: 1960. Berkeley, CA: University of California Press; 1960.
 24.
Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. London, UK: Chapman and Hall/CRC; 1995.
 25.
Hugall A, Moritz C, Moussalli A, Stanisic J: Reconciling paleodistribution models and comparative phylogeography in the Wet Tropics rainforest land snail Gnarosophia bellendenkerensis (Brazier 1875). Proc Natl Acad Sci USA 2002, 99(9):6112–6117. 10.1073/pnas.092538699
 26.
Kass RE, Raftery A: Bayesian factors. Journal of the American Statistical Association 1995, 90: 773–795. 10.2307/2291091
 27.
Kalinowski ST: Evolutionary and statistical properties of three genetic distances. Mol Ecol 2002, 11(8):1263–1273. 10.1046/j.1365294X.2002.01520.x
 28.
Slatkin M: Gene flow in natural populations. Ann Rev Ecol Syst 1985, 16: 393–430. 10.1146/annurev.ecolsys.16.1.393
 29.
Wakeley J: The variance of pairwise nucleotide differences in two populations with migration. Theor Popul Biol 1996, 49: 39–57. 10.1006/tpbi.1996.0002
 30.
Wakeley J: Distinguishing migration from isolation using the variance of pairwise differences. Theor Popul Biol 1996, 49: 369–386. 10.1006/tpbi.1996.0018
Acknowledgements
We thank A. Lancaster and three anonymous reviewers for making helpful suggestions and E. Andersson for handling the manuscript. We thank M. Beaumont for kindly providing R scripts and critically useful discussions. We thank D. Hudson for permission to use E. Stahl's finite sites version of his ms coalescent simulator under GNU Public License. We thank J. McGuire and C. Moritz for use of the linux parallel computing cluster housed at the Museum of Vertebrate zoology (University of California, Berkeley). We thank MBI (Mathematical Biosciences Institute) for hosting the workshop on Phylogenetics and Phylogeography. Support for M. J. Hickerson was provided by a NSF postdoctoral fellowship in interdisciplinary informatics. N. Takebayashi was supported by NIH Grant Number 2P20RR16466 from the INBRE program of the National Center for Research Resources and NSF DEB0640520. E. Stahl, was supported by Sloan/DOE Fellowship award DEFG0200ER62993.
Author information
Additional information
Authors' contributions
MJH developed the idea for using ABC within a hierarchical model to analyze multiple population pairs simultaneously. ES developed the finite sites version of D. Hudson's classic coalescent simulator (ms). MJH and NT developed C, R, and Perl routines and modified preexisting R and C routines to comprise an ABC algorithm that makes use of a Hierarchical model. NT extensively developed the C and Perl routines that comprise the user version of msBayes now available. NT and MJH maintains the msBayes website and NT developed the installation configurations and precompiled binaries. All authors read and approved the final version of the manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Root Mean Square Error
 Posterior Distribution
 Approximate Bayesian Computation
 Population Pair
 Simultaneous Divergence