msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation
© Hickerson et al; licensee BioMed Central Ltd. 2007
Received: 06 April 2007
Accepted: 26 July 2007
Published: 26 July 2007
Although testing for simultaneous divergence (vicariance) across different population-pairs that span the same barrier to gene flow is of central importance to evolutionary biology, researchers often equate the gene tree and population/species tree thereby ignoring stochastic coalescent variance in their conclusions of temporal incongruence. In contrast to other available phylogeographic software packages, msBayes is the only one that analyses data from multiple species/population pairs under a hierarchical model.
msBayes employs approximate Bayesian computation (ABC) under a hierarchical coalescent model to test for simultaneous divergence (TSD) in multiple co-distributed population-pairs. Simultaneous isolation is tested by estimating three hyper-parameters that characterize the degree of variability in divergence times across co-distributed population pairs while allowing for variation in various within population-pair demographic parameters (sub-parameters) that can affect the coalescent. msBayes is a software package consisting of several C and R programs that are run with a Perl "front-end".
The method reasonably distinguishes simultaneous isolation from temporal incongruence in the divergence of co-distributed population pairs, even with sparse sampling of individuals. Because the estimate step is decoupled from the simulation step, one can rapidly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. Given the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we envision msBayes as a powerful and flexible tool for tackling a wide array of difficult research questions that use population genetic data from multiple co-distributed species. The msBayes pipeline is available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline is comprised of several C and R programs that are run with a Perl "front-end" and runs on Linux, Mac OS-X, and most POSIX systems. Although the current implementation is for a single locus per species-pair, future implementations will allow analysis of multi-loci data per species pair.
Testing for simultaneous divergence (vicariance) across different population-pairs that span the same historical barrier to gene flow is of central importance to evolutionary biology, biogeography and community ecology [1–4]. Such inferences inform processes underlying speciation, community composition, range delineation, and the ecological consequences of climatic changes. Estimating a population divergence time with an appropriate statistical model  can be accomplished in a variety of ways [6–8], yet analyzing comparative phylogeographic data with multiple co-occurring species pairs that vary with respect to demographic parameters and pairwise coalescent times is less straightforward.
Instead of conducting an independent analysis on every population-pair and testing the hypothesis of temporal concordance based on this set of independent parameter estimates of divergence time, the hierarchical model employed by msBayes follows the suggestion of  by concurrently estimating three hyper-parameters that characterize the mean, variability and number of different divergence events across a set of population-pairs. The model employed in msBayes allows estimation of these hyper-parameters across a multi-species data set while explicitly incorporating uncertainty and variation in the sub-parameters that independently describe each population-pair's demographic history (divergence time, current, ancestral and founding effective population sizes), post-divergence migration rate and recombination rate. The msBayes software pipeline is based on the introduction of the approximate Bayesian computation (ABC) method for sampling from the hyper-posterior distribution for testing for simultaneous divergence . We review the important features here. Although the current implementation is for a single locus per species-pair, future implementations will allow analysis of multi-loci data per species/population pair.
In contrast to previous ABC-like models [11–15], our TSD is accomplished by implementing a hierarchical Bayesian model in which the sub-parameters (Φ; within population-pair parameters) are conditional on "hyper-parameters" (ϕ) that describe the variability of Φ among the Y population-pairs. For example, divergence times (Φ) can vary across a set of population pairs conditional on the set of hyper-parameters (ϕ) that varies according to their hyper-prior distribution. Instead of explicitly calculating the likelihood expression P(Data | ϕ,Φ) to get a posterior distribution, we sample from the posterior distribution P((ϕ,Φ) | Data) by simulating the data K times under the coalescent model using candidate parameters drawn from the prior distribution P(ϕ,Φ). A summary statistic vector D for each simulated dataset is then compared to the observed summary statistic vector in order to generate random observations from the joint posterior distribution f(ϕ i ,Φ i |D i ) by way of a rejection/acceptance algorithm  followed by an optional weighted local regression step . Loosely speaking, hyper-parameter values are accepted and used to construct the posterior distribution with probabilities proportional to the similarity between the summary statistic vector from the observed data and the summary statistic vector calculated from simulated data.
The summary statistic vector D employed in msBayes currently consists of up to six summary statistics collected from each of the Y population pairs (π,θ W , Var(π - θ W ), π net , π b , and π w ). This includes π, the average number of pairwise differences among all sequences within each population pair, θ W the number of segregating sites within each population pair normalized for sample size, , Var(π - θ W ) in each population pair, and π net , Nei and Li's net nucleotide divergence between each pair of populations . This last summary statistic is the difference (π b - π w ) where π b is the average pairwise differences between each population pair and π w is the average pairwise differences within a sister pair of descendent populations. The default setting includes the first four aforementioned summary statistics because they were found to be a least correlated subset of a larger group , however, future versions of msBayes will allow users to choose other summary statistics.
An extensive simulation study was conducted in  to evaluate the performance of our hierarchical ABC model. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we extend the previous evaluation to explore the effectiveness of msBayes in conducting a TSD given small sample sizes (≤ 5 individuals per population pair).
Step A is accomplished by a command-line Perl program (obsSumStats.pl) which uses two C programs to calculate the observed summary statistic vector file. Specifically, the user runs obsSumStats.pl after collecting separate aligned DNA sequence files from each population-pair in FASTA format, and constructing an additional text file that describes the samples sizes, length of genes and transition/transversion rate ratios.
Step B iteratively simulates data sets under the hierarchical model by: 1.) randomly drawing hyper-parameters and sub-parameters from the hyper-prior and sub-prior distributions; 2.) using these hyper-parameters and sub-parameters to simulate finite sites DNA sequence data from Y population-pairs; and 3.) calculating a summary statistic vector D from the simulated data set of Y population-pairs. This is accomplished with several C programs that are run with a Perl "front-end" (msbayes.pl) that either prompts the user for the upper-bounds of various priors and the number of iterations to simulate or alternatively uses a batch configuration file with equivalent information. The first C program draws hyper-parameters and sub-parameters from their hyper-prior and sub-prior distributions. These parameters are then fed into several C programs that simulate finite-sites DNA sequence data using msarbpopQH a modified version of Hudson's classic coalescent simulator ms , which incorporates finite sites mutation and arbitrary population structure and dynamics. Another set of C programs calculates a summary statistic vector (D) for every simulated data set of Y population pairs.
Step C is accomplished by our command-line user-interface program (acceptRej.pl). This Perl program internally uses R for the calculation. The algorithim is a simple extension of the original R scripts which are kindly provided by M. Beaumont . This step does the rejection/acceptance sampling and local regression to produce the approximate sample of the posterior distribution. This third step uses the output of step B as the input and produces an output file that contains multiple graphical depictions of the posterior distributions and a text output file with various summaries of the posterior distributions (estimates of Ψ, E(τ), and Ω across the Y population pairs). The user can choose which summary statistics to include within D (the summary statistic vector), choose the proportion of accepted draws from the prior, and can optionally choose to perform simple rejection sampling without the additional local regression step.
We distribute msBayes as C source code and pre-compiled binaries that run on Linux or Mac OS X operating systems. The msBayes package also includes the R functions, and Perl scripts, as well as installation/running instructions.
Results and Discussion
Performance of estimator with small sample sizes
At the present time, there are no other available coalescent-based tools for analyzing multiple population pairs simultaneously to yield hyper-parameter estimates. Although IM and IMa are most similar to msBayes [8, 21] because they estimate divergence times and population sizes from single pairs of populations under a coalescent model, these do not employ a hierarchical model and therefore can only do so one pair at a time. The program MCMCcoal can estimate divergence times of a known phylogeny under a coalescent model, but can only use the separate divergence time estimates to test for phylogeographic congruence . The program BEST  infers a species phylogeny and demographic parameters (e.g. divergence times and population sizes) using a population coalescent model, but likewise can only use the individual divergence time estimates to test for phylogeographic congruence across a multi-species dataset. On the other hand, the hierarchical model employed in msBayes not only can estimate hyper-parameters but also comes with the benefit of additional information gained from the "borrowing strength" across datasets [22–24]. In this case, the resulting emergent multi-species hyper-estimates use more of the information than the sum of their parts (within species-pair estimates).
Although the hierarchical ABC model employed in msBayes was extensively evaluated in , the behavior of the ABC estimator given minimal sampling of individuals was not examined. Because comparative phylogeographic studies are often conducted on multi-species data sets that include rare taxa from which it is difficult to obtain samples from many individuals, we evaluate how low sample sizes can affect inference. To this end, we explored the performance in scenarios where ≤ 5 per population pair were sampled from each of 10 population pairs. We created 1,000 simulated data sets under each of two different histories: (1) simultaneous divergence history and (2) variable divergence history among population pairs. In the simultaneous divergence history (true Ω = Var(τ)/E(τ) = 0), all ten population pairs arose from ancestral populations at τ = 1.8 before the present. In the variable divergence history (true Ω = 0.1), two population pairs arose at τ = 1.0 and eight population pairs arose at τ = 2.0 before the present. We simulated these two histories with small sample sizes (2–5 individual per population-pair) and with larger sample sizes (20 individuals per population pair; 10 per descendent population). The simulated data sets consisted of haploid mtDNA samples from ten population pairs that were 550–600 base pairs in length. From each of the four sets of 1,000 simulated data sets, we used msBayes to obtain 1,000 ABC estimates of the hyper-parameter, Ω, with the goal of assessing the effects of sample sizes on the root mean square error (RMSE) of the ABC Ω estimator. Each estimate of Ω was obtained from the mode of 1,000 accepted draws (after the local regression step) from 500,000 random draws from the hyper-prior, as these conditions were found to be optimal in . For the larger sample sizes we use four classes of summary statistics (π, θ W , Var(π - θ W ) and π net ), while for the smaller sample sizes we only use π b to avoid null or n.a.n. values (not a number) that are yielded when only one sample is collected from a descendent population.
An attractive benefit of an ABC method such as msBayes is that one can perform this estimator evaluation relatively quickly. Simulating data from parameters drawn from the prior is only done once per set of conditions (sample size/history) and can be done in approximately 5 hours per population pair on a 2 GHz linux computer. The computational time can be further reduced as the simulations can be run parallel on multiple processors. Because the acceptance/rejection step is decoupled from simulating the prior, multiple estimates from a series of simulated datasets can be accomplished without re-simulating the prior each time. The acceptance/rejection step for a single estimate is accomplished in one second to well under a minute such that 1,000 estimates can be obtained from 1,000 data sets simulated from fixed known parameter values in under an hour to within 24 hours on a single processor.
General use and future development
The most important aspect of msBayes is that its flexible and modular nature will allow us and others to add in new features. This characteristic is essential for a phylogeographic software tool because phylogeographic studies are highly idiosyncratic. Using population genetic data to test how climate and/or geological changes result in biogeographic shifts, speciation, extinction and consequent changes in ecological interactions can involve a wide array of hypotheses and models that conform to no generality with regards to model complexity, parameterization and sampling. We therefore anticipate making several extensions to msBayes, and will encourage other bioinformaticians to make versions that suite particular difficult research questions. Furthermore, phylogeographic studies are most powerful when combined with independent evidence (or hypotheses) about past habitat distributions that are generated from other types of historic data and ecological distribution models . Particular historical hypotheses can then be directly parameterized by paleo-distribution models and tested with genetic data within the msBayes framework using Bayes factors .
One feature we plan to include in future versions of msBayes is an option to simulate from the prior after constraining the number of divergence events per Y population pairs to one fixed number. This will then allow getting estimates for when these different isolation events took place as well as estimating which population pairs originated at either of the particular divergence events. Other upcoming features to be included are: 1.) multiple loci per population pair by expanding the summary statistic vector and adding additional hyper-parameters controlling mutation rate variation across loci; 2.) having more summary statistics available; 3.) allowing analysis of only one population pair at a time; 4.) testing multi-species colonization hypotheses; 5.) three or more population models (as opposed to two population models); 6.) allowing microsatellite data and 7.) automating the simulation testing procedure used to obtain estimator bias.
It should be noted that migration could hinder the ability of this method to correctly infer simultaneous divergence. Moderate migration in a subset of species/population pairs could cause the method to incorrectly support temporal discordance in divergence when the true history involved temporal congruence because migration can erase the genetic signal of isolation [27, 28]. Although the Bayesian support for temporal concordance in divergence times would likely weaken if this happens in a subset of species/population pairs, we will explore using the summary statistic Var(π) as a means to tease apart migration from isolation as suggested in [29, 30].
The msBayes software pipeline will increasingly become an important tool as the field of comparative phylogeography progresses to become a more rigorous and statistical enterprise . The program can obtain hyper-parameter estimates using hierarchical models in a reasonable amount of time without having the problems associated with convergence and mixing found in MCMC methods (Markov chain Monte Carlo). Because the estimation step is decoupled from the simulation step, one can quickly evaluate different ABC acceptance/rejection conditions and the choice of summary statistics. The method can reasonably distinguish biogeographic congruence from temporal incongruence, even with sparse sampling of individuals. Given the complex and idiosyncratic nature of testing multi-species biogeographic hypotheses, we envision msBayes as a powerful and flexible tool that is open for modification when faced with particularly difficult research questions. Finally, due to its flexible and modular design, msBayes will be a well-suited tool for the heterogeneous data sets that are emerging and being combined to test complex historical hypotheses.
Availability and requirements
The installation instructions, documentation, source code and precompiled binary for msBayes are all available for download at http://msbayes.sourceforge.net/ under an open source license (GNU Public License). The msBayes pipeline is comprised of several C and R programs that are run with a Perl "front-end" and runs on Linux, Mac OS-X, and most POSIX systems.
List of Abbreviations used
Approximate Bayesian Computation
Test of simultaneous divergence
We thank A. Lancaster and three anonymous reviewers for making helpful suggestions and E. Andersson for handling the manuscript. We thank M. Beaumont for kindly providing R scripts and critically useful discussions. We thank D. Hudson for permission to use E. Stahl's finite sites version of his ms coalescent simulator under GNU Public License. We thank J. McGuire and C. Moritz for use of the linux parallel computing cluster housed at the Museum of Vertebrate zoology (University of California, Berkeley). We thank MBI (Mathematical Biosciences Institute) for hosting the workshop on Phylogenetics and Phylogeography. Support for M. J. Hickerson was provided by a NSF post-doctoral fellowship in interdisciplinary informatics. N. Takebayashi was supported by NIH Grant Number 2P20RR16466 from the INBRE program of the National Center for Research Resources and NSF DEB-0640520. E. Stahl, was supported by Sloan/DOE Fellowship award DE-FG02-00ER62993.
- Avise JC: Phylogeography: The history and formation of species. Cambridge: Harvard University Press; 2000.Google Scholar
- Wen J: Origin and the evolution of the eastern North American distinct distributions of flowering plants. Annu Rev Ecol Syst 1999, 30: 421–455. 10.1146/annurev.ecolsys.30.1.421View ArticleGoogle Scholar
- Lessios HA: The first stage of speciation as seen in organisms separated by the Isthmus Panama. In Endless Forms: species and speciation. Edited by: Howard DJ, Berlocher S. Oxford, U.K.: Oxford University Press; 1998:186–201.Google Scholar
- Barraclough TG, Nee S: Phylogenetics and speciation. Trends Ecol Evol 2001, 16(7):391–399. 10.1016/S0169-5347(01)02161-9View ArticlePubMedGoogle Scholar
- Knowles LL, Maddison WP: Statistical phylogeography. Mol Ecol 2002, 11(12):2623–2635. 10.1046/j.1365-294X.2002.01637.xView ArticlePubMedGoogle Scholar
- Edwards SV, Liu L, Pearl DK: High-resolution species trees without concatination. Proc Natl Acad Sci USA 2007, 104: 5936–5941. 10.1073/pnas.0607004104PubMed CentralView ArticlePubMedGoogle Scholar
- Rannala B, Yang ZH: Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci. Genetics 2003, 164: 1645–1656.PubMed CentralPubMedGoogle Scholar
- Hey J, Nielsen R: Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA 2007, 104: 2785–2790. 10.1073/pnas.0611164104PubMed CentralView ArticlePubMedGoogle Scholar
- Edwards SV, Beerli P: Perspective: Gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 2000, 54(6):1839–1854.PubMedGoogle Scholar
- Hickerson MJ, Stahl E, Lessios HA: Test for simultaneous divergence using approximate Bayesian computation. Evolution 2006, 60: 2435–2453.View ArticlePubMedGoogle Scholar
- Thornton K, Andolfatto P: Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics 2005, 172: 1607–1619. 10.1534/genetics.105.048223View ArticlePubMedGoogle Scholar
- Excoffier L, Estoup A, Cornuet J-M: Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 2005, 169: 1727–1738. 10.1534/genetics.104.036236PubMed CentralView ArticlePubMedGoogle Scholar
- Estoup A, Beaumont BA, Sennedot F, Moritz C, Cornuet J-M: Genetic analysis of complex demographic scenarios: spatially expanding populations of the cane toad, Bufo marinus . Evolution 2004, 58: 2021–2036.View ArticlePubMedGoogle Scholar
- Tallmon DA, Luikart G, Beaumont BA: Comparative evaluation of a new effective population size estimator based on approximate Bayesian computation. Genetics 2004, 167: 977–988. 10.1534/genetics.103.026146PubMed CentralView ArticlePubMedGoogle Scholar
- Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics 2002, 162: 2025–2035.PubMed CentralPubMedGoogle Scholar
- Weiss G, von Haeseler A: Inference of population history using a likelihood approach. Genetics 1998, 149: 1539–1546.PubMed CentralPubMedGoogle Scholar
- Watterson GA: On the number of segregating sites in genetic models without recombination. Theor Popul Biol 1975, 7: 256–276. 10.1016/0040-5809(75)90020-9View ArticlePubMedGoogle Scholar
- Nei M, Li W: Mathematical model for studying variation in terms of restriction endonucleases. Proc Natl Acad Sci USA 1979, 76: 5269–5273. 10.1073/pnas.76.10.5269PubMed CentralView ArticlePubMedGoogle Scholar
- Hickerson MJ, Dolman G, Moritz C: Comparative phylogeographic summary statistics for testing simultaneous vicariance across taxon-pairs. Mol Ecol 2006, 15: 209–224. 10.1111/j.1365-294X.2005.02718.xView ArticlePubMedGoogle Scholar
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338. 10.1093/bioinformatics/18.2.337View ArticlePubMedGoogle Scholar
- Hey J, Nielsen R: Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis . Genetics 2004, 167: 747–760. 10.1534/genetics.103.024182PubMed CentralView ArticlePubMedGoogle Scholar
- Beaumont MA, Rannala B: The Bayesian revolution in genetics. Nat Rev Genet 2004, 5: 251–261. 10.1038/nrg1318View ArticlePubMedGoogle Scholar
- James W, Stein C: Estimation with quadradic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability: 1960. Berkeley, CA: University of California Press; 1960.Google Scholar
- Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. London, UK: Chapman and Hall/CRC; 1995.Google Scholar
- Hugall A, Moritz C, Moussalli A, Stanisic J: Reconciling paleodistribution models and comparative phylogeography in the Wet Tropics rainforest land snail Gnarosophia bellendenkerensis (Brazier 1875). Proc Natl Acad Sci USA 2002, 99(9):6112–6117. 10.1073/pnas.092538699PubMed CentralView ArticlePubMedGoogle Scholar
- Kass RE, Raftery A: Bayesian factors. Journal of the American Statistical Association 1995, 90: 773–795. 10.2307/2291091View ArticleGoogle Scholar
- Kalinowski ST: Evolutionary and statistical properties of three genetic distances. Mol Ecol 2002, 11(8):1263–1273. 10.1046/j.1365-294X.2002.01520.xView ArticlePubMedGoogle Scholar
- Slatkin M: Gene flow in natural populations. Ann Rev Ecol Syst 1985, 16: 393–430. 10.1146/annurev.ecolsys.16.1.393View ArticleGoogle Scholar
- Wakeley J: The variance of pairwise nucleotide differences in two populations with migration. Theor Popul Biol 1996, 49: 39–57. 10.1006/tpbi.1996.0002View ArticlePubMedGoogle Scholar
- Wakeley J: Distinguishing migration from isolation using the variance of pairwise differences. Theor Popul Biol 1996, 49: 369–386. 10.1006/tpbi.1996.0018View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.