Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees

Background There has been increasing interest in coalescent models which admit multiple mergers of ancestral lineages; and to model hybridization and coalescence simultaneously. Results Hybrid-Lambda is a software package that simulates gene genealogies under multiple merger and Kingman’s coalescent processes within species networks or species trees. Hybrid-Lambda allows different coalescent processes to be specified for different populations, and allows for time to be converted between generations and coalescent units, by specifying a population size for each population. In addition, Hybrid-Lambda can generate simulated datasets, assuming the infinitely many sites mutation model, and compute the F ST statistic. As an illustration, we apply Hybrid-Lambda to infer the time of subdivision of certain marine invertebrates under different coalescent processes. Conclusions Hybrid-Lambda makes it possible to investigate biogeographic concordance among high fecundity species exhibiting skewed offspring distribution.


Background
Species trees describe ancestral relations among species. Gene genealogies describe the random ancestral relations of alleles sampled within species. Species trees are often assumed to be bifurcating [6], and gene genealogies to follow the Kingman coalescent [23,27] in allowing at most two lineages to coalesce at a time.
Recently, there has been increasing interest in coalescent models which admit multiple mergers of ancestral lineages [1,2,9,12,36,38,39] and to model hybridization and coalescence simultaneously [3,25,26,28,46]. For high fecundity species exhibiting sweepstake-like reproduction, such as oysters and other marine organisms [1,4,9,11,17,18,38], the Kingman coalescent may not be appropriate, as it is based on low offspring number population models (see recent reviews by [19] and [42]). Thus, we consider coalescents [8,35,36] derived from *Correspondence: joe.zhu@well.ox.ac.uk 1 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK Full list of author information is available at the end of the article sweepstake-like reproduction models, and allow more than two lineages to coalesce at a time.
We introduce the software Hybrid-Lambda for simulating gene trees under two models of -coalescents within rooted species trees and rooted species networks. Our program differs from existing software which also allows multiple mergers, such as SIMCOAL 2.0 [29] -which allows multiple mergers in gene trees due to small population sizes under the Wright-Fisher model -in that we apply coalescent processes that are obtained from population models explicitly modelling skewed offspring distributions, as opposed to bottlenecks.
Species trees may also fail to be bifurcating due to either polytomies or hybridization events. The simulation of gene genealogies within a species network which admits hybridization is another application of Hybrid-Lambda. The package ms [24] can also simulate gene genealogies within species networks under Kingman's coalescent. However the input of ms is difficult to automate when the network is sophisticated or generated from other software. Other simulation studies using species networks have either used a small number of network topologies coded individually (for example, in phylonet [43,45,46]) or have assumed that gene trees have evolved on species trees embedded within the species network [22,28,31]. Hybrid-Lambda will help to automate simulation studies of hybridization by allowing for a large number of species network topologies and allowing gene trees to evolve directly within the network. Hybrid-Lambda can simulate both Kingman and -coalescent processes within species networks. A comparison of features of several software packages that output gene genealogies under coalescent models is given in Table 1.

Implementation
The program input file for Hybrid-Lambda is a character string that describes relationships between species. Standard Newick format [33] is used for the input of species trees and the output of gene trees, whose interior nodes are not labelled. An extended Newick formatted string [5,25] labels all internal nodes, and is used for the input of species networks (see Fig. 1).

Parameters
Hybrid-Lambda can use multiple lineages sampled from each species and simulate Kingman or multiple merger ( )-coalescent processes within a given species network. In addition, separate coalescent processes can be specified on different branches of the species network. The coalescent is a continuous-time Markov process, in which times between coalescent events are independent exponential random variables with different rates. The rates are determined by a so-called coalescent parameter that can be input via command line, or a(n) (extended) Newick formatted string with specific coalescent parameters as branch lengths. By default, the Kingman coalescent is used, for which two of b active lineages coalesce at rate λ b,2 = b 2 . One can choose between two different examples of a -coalescent, whose parameters have clear biological interpretation. While we cannot hope to cover the huge class of Lambdacoalescents, our two examples are the ones that have been most studied in the literature [2,7,13]. If the coalescent parameter is between 0 and 1, then we use ψ for the coalescent parameter, and the rate λ b,k at which Eldon and Wakeley [9]. If the coalescent parameter is between 1 and 2, then we use α for the coalescent parameter, and the rate of k-mergers ( where B(·, ·) is the beta function [39]. Hybrid-Lambda assumes by default that the input network (tree) branch lengths are in coalescent units. However, this is not essential. Coalescent units can be converted through an alternative input file with numbers of generations as branch lengths, which are then divided by their corresponding effective population sizes. By default, effective population sizes on all this parameter using the command line, or using a(n) (extended) Newickbranches are assumed to be equal and unchanged. Users can change formatted string to specify population sizes on all branches through another input file.
The simulation requires ultrametric species networks, i.e. equal lengths of all paths from tip to root.

Results and discussion
Hybrid-Lambda outputs simulated gene trees in three different files: one contains gene trees with branch lengths in coalescent units, another uses the number of generations as branch lengths, and the third uses the number of expected mutations as branch lengths. Besides outputting gene tree files, Hybrid-Lambda also provides several functions for analysis purposes: • user-defined random seed for simulation, • output simulated data in 0/1 format assuming the infinitely many sites mutation model, • a frequency table of gene tree topologies, • a figure of the species network or tree (this function only works when L A T E X or dot is installed) (Fig. 2), • the expected F ST value for a split model between two populations, • when gene trees are simulated from two populations, the software Hybrid-Lambda can generate a table of relative frequencies of reciprocal monophyly, paraphyly, and polyphyly.

Simulation example
We give a simulation example showing the impact of the particular coalescent model on estimating the divergence time for two populations. Results can be confirmed using analytic approximations to F ST . This is shown in the Appendix along with example code for using Hybrid-Lambda for this example. Eldon B and Wakeley J [10] showed that population subdivision can be observed in genetic data despite high migration between populations. One of the most widely used measures of population differentiation is the F ST statistic. The relationship between F ST and biogeography depends on the underlying coalescent process, which might be especially important for the interpretation of divergence and demographic history of many marine species. Here we used Hybrid-Lambda to simulate divergence between two populations based on different -coalescents, as well as the standard Kingman coalescent. Mutations were simulated in Hybrid-Lambda under the infinite-sites model. The summary statistic F ST was estimated for these data and was used to compare F ST estimated from mtDNA from five species of marine invertebrates. These species were used in previous studies to test the hypothesis that contemporary oceanic conditions are creating subdivisions between the North Island and South Island reef populations of New Zealand [16,34,44]. These studies represent some of the earliest mitochondrial studies on the marine disjunction between the North and South Islands of New Zealand.
The F ST statistic between North Island and South Island populations reported for these species ranges from approximately 0.07 to 0.8 (Fig. 3). Cellana ornata displays a very strong split, which was estimated to have occurred around 0.2-0.3 million years ago based on published estimates of divergence rates and reciprocal monophyly displayed in the data set. This result may be supported by our simulations using the Kingman coalescent. However, when multiple mergers and a higher fraction of replacement by a single parent is allowed to occur then our simulations support much younger splits between the populations ∼ 9,000 generations or ∼ 48,000 generations ago (Fig. 3). Similarly, the strong split observed for Coscinasterias muricata could be placed anywhere from ∼ 9,000 to 45,000 generations ago depending on the degree to which multiple mergers are allowed to occur. While the range for Patiriella regularis, Cellana radians and C. flava is much smaller, it is still not clear cut as to whether divergence would be observed under different coalescent models. Here we used ψ = 0.01 and ψ = 0.23, and α = 1.5 and α = 1.9, with larger values of ψ and smaller values of α corresponding to higher probabilities of multiple mergers. Our choice of parameter values corresponds to the estimated values obtained for mtDNA of oysters and Atlantic cod. An estimate for Pacific oysters based on mitochondrial DNA for ψ was 0.075 [9]. The results for our choice of parameter values suggest that our conclusions about a much earlier split of the populations than previously estimated are robust with regard to parameter choice. A recent study of Atlantic cod [2] estimated ψ between 0.07 and 0.23 for nuclear genes and near 0.01 for mitochondrial genes. The same study estimated α to be 1.0 and 1.28 for nuclear genes and between 1.53 and 2.0 for mitochondrial genes.

Conclusions
The implications for using alternative coalescent models are far reaching. Many marine organisms reproduce through broadcast spawning of thousands to millions of gametes, and while the expected survival of these offspring is low, there is the potential for a small subset of the adults to have a greater contribution to the next generation than assumed by the Kingman coalescent. Hybrid-Lambda makes it possible to investigate the effect of high fecundity on biogeographic concordance among species that exhibit high fecundity and high offspring mortality, including in complex demagraphic scenarios that allow hybridization.

Availability and requirements
Hybrid-Lambda can be downloaded from http:// hybridlambda.github.io/. The program is written in C++ (requires compilers that support C++11 standard to build), and released under the GNU General Public License (GPL) version 3 or later. Users can modify and make new distributions under the terms of this license. For full details of this license, visit http://www.gnu.org/licenses/. Hybrid-Lambda works on Unix-like operating systems. We have used travis continuous integration to test compiling the program on Linux and Mac OS. An API in R [37] is currently under development.

Appendix: F ST calculations
Here we show analytic calculations that can be used to obtain expressions for F ST when mutation rates are low. The effect of α on F ST for fixed generation times is shown in Fig. 4.
Assume two populations A and B have been isolated until time τ in the past as measured from the present. Assume also that the same coalescent process is operating in populations A and B. Let T w denote the time until coalescence for two lines when drawn from the same population, and T b when drawn from different populations. Let λ A denote the coalescence rate for two lines in population A, and λ AB for the common ancestral population AB. For the Beta(2 − α, α)-coalescent, λ A = 1, for the point-mass process λ A = ψ 2 . One now obtains Slatkin [40] obtained the approximation, where μ is the per generation mutation rate, Thus, using (3) gives The result (5) seems to make sense, since lim τ →0 F However, deciding the timeunit of τ now becomes important, since the timescale of a Beta(2 − α, α)coalescent is proportional to N α−1 , 1 < α < 2 [39], where N is the population size. One can obtain a more accurate expression of the timescale given knowledge about the mean of the potential offspring distribution (see [39]). However, since the mean is unknown in most cases, we apply the approximation N α−1 . Assuming n ≥ 2 sequences from each population, the 'observed' FST (F ST ) was computed asF ST = 1 − n n−1 where H w is the average pairwise differences within populations, H w = 1 2 (H w,1 + H w,2 ), and H b is the average of n 2 pairwise differences between populations.
• -sim_num_mut outputs simulated genealogies in Newick string, of which the number of mutations on internal branches are labelled. • -seg generates haplotype data set.
• -fst computes F ST of the generated haplotype data set.
One can use this example to generate the data for Fig. 4 by setting the -S flag to -S 1 1.