- Methodology article
- Open Access
Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences
BMC Bioinformatics volume 15, Article number: 3 (2014)
Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies for DNA sequence data. A plethora of coalescent simulators are developed, but selecting the most appropriate program remains challenging.
We extensively compared performances of five widely used coalescent simulators – Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal, to provide a practical guide considering three crucial factors, 1) speed, 2) scalability and 3) recombination hotspot position and intensity accuracy. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program msHOT has compensated for the deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, could simulate more complex demographic scenarios, but runs comparatively slow. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms to approximate standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they are more advantageous over other programs for a spectrum of evolutionary models. To validate recombination hotspots, LDhat 2.2 rhomap package, sequenceLDhot and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data.
While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the most optimal performance in detecting and validating cross-over hotspots.
Coalescent simulation is a very useful tool in population genetics with a rich variety of applications, particularly to evaluate and compare performances of various statistical methods in rare variant analysis [1–3], to estimate parameters for different population histories [4, 5] and to infer phylogenetic trees . In simulating DNA sequence data for studying human complex diseases, recombination hotspots, defined as genomic intervals with local recombination rates increased relative to that of the surrounding DNA region, need to be taken into account given their ubiquity in the human genome . Further, population geneticists are interested in examining haplotype block patterns delineated by recombination hotspots along chromosomes in fine-mapping locations of disease loci. Over the past decade, a plethora of coalescent simulators have been developed, e.g. SelSim , CoaSim , FastCoal , Mlcoalsim , and RECOAL , to name a few. Of them, five most representative and widely used coalescent simulators, Hudson’s ms , msHOT , Markovian Coalescent Simulator (MaCS) , Simcoal2 , and fastsimcoal .
Coalescent process, based on the neutral Wright-Fisher Model, was first introduced by Kingman [18, 19] (Figure 1). Simulating the entire ancestral recombination graph (ARG) [13, 20] encompassing all coalescent and recombination events in the past for a sample of DNA sequences imposes a considerable computational burden. In 2002, Hudson  developed ms, a Monte Carlo program based on standard coalescent that assumes an infinite-sites model for mutational events, which allows for recombination and gene conversion events, as well as symmetric migrations of subpopulations. While the standard coalescent starts at the present time and simulates backward in time, Wiuf and Hein (1999) are the first to introduce a sequential interpretation of the coalescent with recombination , such that genealogies could be simulated by moving along a DNA sequence. Based on the Wiuf and Hein algorithm, McVean and Cardin in 2005 developed a sequential Markov coalescent (SMC) method . The SMC starts with a coalescent tree at the left-hand end of the sequence and progressively modifies the tree with recombination events as it moves to the right, and has been shown to produce patterns of polymorphisms and linkage disequilibrium (LD) extremely similar to those generated under a classical ARG model [10, 15, 23]. SMC’ , an improved algorithm based on the SMC method, was developed by Marjoram and Wall in 2006 , such that a lineage separated by a recombination event is allowed to also attach to its old path (Figure 2). While retaining the speed and memory efficiency of SMC, SMC’ appears significantly more accurate when benchmarked by the exact coalescent process. MaCS, also a variant of the Wiuf and Hein algorithm, generates simulated data that are virtually identical to data simulated under the standard coalescent, but in much less time and using much less memory. MaCS is considered as a “generalized” SMC, which could provide a better approximation to results of standard coalescent than SMC, by tuning a user-specified “history” parameter h denoting a sequence length . Taken together, both SMC’ and MaCS give closer approximations to standard coalescent than SMC.
Although a set of DNA sequences with a pre-determined number of recombination hotspots at user-defined positions could be generated by msHOT, MaCS, Simcoal2, and fastsimcoal, there appears to be a lack of rigorous validation of hotspot detection software tools in the current literature. Popular computer programs for discovering recombination hotspots include LDMAP , LDhot [7, 25], LDhat 2.2 rhomap package , Hotspotter , and sequenceLDhot . The widely used Haploview program [29, 30] is also an appealing tool to visually localize recombination hotspots given the delineated LD block structure. Based on popularity, LDhat 2.2 rhomap, sequenceLDhot, and Haploview programs were chosen for detecting recombination hotspots in both real data and simulated data generated by the coalescent process. Because sequenceLDhot gave the best performance in inferring locations of cross-over hotspots, recombination hotspots simulated by msHOT, MaCS, and fastsimcoal were subject to validation by sequenceLDhot.
Coalescent process was initially derived as an approximation of the neutral Wright-Fisher model. This approximation works well when sample sizes are small relative to the population size. Mutations are assumed to be Poisson distributed along each branch given the mutation rate and branch length. Normally, an infinite-sites model  is assumed, which means no recurrent mutations occur. Each recombination event breaks the sequence into several segments, and each segment is modeled by a genealogy tree. Simulation of recombination hotspots is realized by changing the rates where these recombination events occur. The process that includes both mutation and recombination events is illustrated by ARGs (Figure 2). The SMC is an approximating algorithm for simulating a series of trees that differ from each other by a single recombination event, starting from the left end and moving to the right end of the DNA sequence.
Consider a population of constant size consisting of N diploid individuals, which means there are 2N copies for a given gene. Generations are assumed to be non-overlapping and denoted by T = 1, 2, …. Each individual in the next generation receives two copies of the gene (one from each parent) and for each respective parental copy, the gene is selected randomly and with replacement from the two copies of the gene present among the parents. At time T = 1, without loss of generality, assume i (e.g., i =2) of these gene copies are of type A. Let Y T denote the random variable for the number of type A gene at the time of T, then Y T |Y T-1 = y T-1 ~ Bin (2N, y T-1 /2N), such that
where y T denotes the realized value of the random variable Y T .
Based on this model, the trajectory of the coalescent process (Figure 1) tracing from the current generation backwards in time to the generation where A is coalesced could be modeled. The current generation could then be simulated through this trajectory given a random seed.
Figure 2 illustrates a simplest ARG . Take the third 4-letter sequence “TCCT” as an example. The common ancestral sequence evolves into two branches. For each branch, mutations have taken place on alleles at loci 1 and 3, respectively, giving rise to “TCGT” and “ACCT”. Next, a recombination event arises between these two sequences on (2, 3), and “TCCT” is produced. In the standard coalescent, a full ARG delineating all past coalescent and recombination events is constructed, with simulated samples corresponding to the edges of the graph.
SMC approach and its two variants (SMC’ and MaCS)
In contrast to standard coalescent which starts at the present time and is simulated backwards in time, the SMC approach  builds a coalescent tree by moving along the DNA sequence starting from the left end, and progressively updates the tree by incorporating recombination event(s) until the right end of the tree is reached. The region flanked by two neighboring recombination events is modeled by a local tree. Recombination is assumed to follow an exponential distribution along the sequence. A recombination event occurs randomly along the current tree, and the detached recombining lineage is free to coalesce with the other remaining lineages, leading to a new tree with a potentially different topology and most recent common ancestor (MRCA) (Figure 3A). SMC’ , a variant of SMC, allows the detached lineage cut by recombination event to coalesce with the exact same branch where it is detached from (Figure 3B), and results in a closer approximation to exact standard coalescent. MaCS algorithm , also a modified SMC algorithm, further allows for coalescence with trees one recombination away on the left hand side, thus achieving a higher accuracy than SMC in approximating the standard coalescent. Fastsimcoal , a completely rewritten continuous-time implementation of Simcoal2 , implements a fast SMC’ that allows for multiple recombination events between markers at fixed recombination distances on the sequence.
Detection of recombination hotspots
The LDhat 2.2 rhomap package , sequenceLDhot , and Haploview program [29, 30] were applied respectively to detect recombination hotspots in the simulated sequences. LDhat is a popular software for estimation of recombination rates, which was developed based on Bayesian reversible-jump Markov Chain Monte Carlo (MCMC) algorithm, and rhomap is a new method incorporated into LDhat 2.2 that specializes in fitting cross-over hotspot model. Another widely used computer program in detecting recombination hotspots— sequenceLDhot, uses an approximate marginal likelihood method of  to estimate a likelihood ratio (LR) statistic to unveil a cross-over hotspot. Further, Haploview was implemented to obtain a visualization of the LD block structure that could reflect varying recombination rates along a contiguous DNA sequence.
Real data set
The 216-kb human leukocyte antigen (HLA) class II region is a well-studied region where recombination hotspots have been identified with sperm typing technology [34–36]. The original data set analyzed in the current study (http://www.le.ac.uk/ge/ajj/HLA/Genotype.html) contains genotype data for 50 unrelated UK Caucasians for 296 markers [i.e., 264 single nucleotide polymorphisms (SNPs) and 22 1–11-bp insertion/deletion polymorphisms] . A subset of 263 SNPs without missing data were selected for recombination hotspot detection. Multi-locus haplotypes are required as an input for hotspot detection programs such as sequenceLDhot, and they provide crucial phase information that is important for understanding haplotype structure [37, 38]. Therefore, haplotypes across the 263 SNPs were reconstructed by PHASE v2.1 program, a haplotype inference method based on the (i) coalescent theory using a variant of canonical Gibbs sampling [33, 39], (ii) an LD decay model , and (iii) the partition-ligation algorithm . In total, 100 haplotypes for the 216-kb region were statistically inferred.
Simulation data sets
In order to assess the running efficiency with or without recombination hotspots, we simulated a set of DNA sequences (i.e., haplotypes) according to 0-, 2-, and 5-hotspot models using these five simulators respectively. All three hotspot models were simulated based on sequences with lengths 1- and 5-Mb, respectively. For each hotspot model and each sequence length, we simulated sample sizes (i.e., number of sequences) of 100, 500, 1,000, and 10,000, respectively. We implemented a symmetric two-island model with a total effective population size of 10,000. The recombination rate and mutation rate were both assumed to be 1.0 × 10-8 per site. Fifty replicates were done and the running times were recorded. All simulations were run on the platform — Linux OS, 2.0-GHz CPU, 1TB-RAM.
To validate the recombination hotspot position and intensity accuracy, we simulated the data sets according to 2- and 5-hotspot models along a 200-kb long DNA sequence. The recombination hotspots’ intensities were set to be 100 times higher compared to the background recombination rates.
The features of the five simulators (i.e., ms, msHOT, MaCS, Simcoal2, and fastsimcoal) are shown in Table 1. Except for ms, all the other four simulators are able to simulate recombination hotspots with arbitrarily chosen recombination rates. Simcoal2 and fastsimcoal cannot simulate gene conversion events, whereas ms, msHOT, and MaCS could. MaCS, Simcoal2 and fastsimcoal allow stochastic uncertainty in sampling of segregating sites to be incorporated as a factor. For simulating rare mutation events, the desired allele frequencies could be preset for MaCS, Simcoal2 and fastsimcoal, but not for ms and msHOT. One of the advantageous features is that these coalescent simulators are able to simulate according to presumed migration patterns (e.g., between subpopulations residing in islands A and B), although the population structures vary to some degree — ms, msHOT and MaCS require these subpopulations to be symmetric across all islands (i.e., islands A and B must have equal sample sizes, but A→B and B→A migration rates could differ), while both Simcoal2 and fastsimcoal allow for arbitrarily chosen population structures (i.e., these programs allow for asymmetry). This feature makes the coalescent simulation a good choice for simulating admixture populations. Both Simcoal2 and fastsimcoal could simulate a mixture of different data types including DNA sequences, SNPs, microsatellites, which make them good tools for studying a variety of chromosome structures. In particular, the coalescent simulation strategy of fastsimcoal2 adopts a so-called “Approximate Bayesian Computation (ABC)”  to estimate the demographic parameters and the mutation and recombination rates, which has a unique feature that allows for sampling simulation parameters from predefined prior distributions, thus facilitating simulations for estimating the parameters. Because Simcoal2 implements a discrete time generation-by-generation simulation, it allows for multiple coalescent events per generation. However, the generation-by-generation simulation is very time-consuming, which significantly lowers its computational speed (see below).
For each design of simulation, we generated 50 replicates according to our simulation settings (see simulation data sets in Methods section) and the average execution times are presented in Tables 2 and 3. For simulations of a 1-Mb sequence (Table 2), all the simulations could be performed within 6 minutes. When sample size was increased to 10,000, both ms and msHOT gave a segmentation fault error on our platform (Linux OS, 2.0-GHz CPU, 1TB-RAM) when 0-hotspot model was simulated. For simulations of a 5-Mb sequence (Table 3), both ms and msHOT gave segmentation fault errors when the sample sizes exceeded 500. This indicates the limitations of ms and msHOT in simulating long sequences and for large sample sizes. It is interesting that when recombination hotspots are simulated, msHOT is able to simulate long DNA sequences for large sample sizes. This is because msHOT is exactly the same as ms when simulating a 0-hotspot model. When simulating 2- and 5-hotspot models, msHOT and Simcoal2 performed much slower than MaCS and fastsimcoal because of algorithmic differences. Further, when the number of hotspots got larger, msHOT performed slower than Simcoal2. To simulate 1,000 DNA sequences (i.e., haplotypes), it could take three and a half hours for msHOT, and one and a half hours for Simcoal2, respectively. In contrast, MaCS and fastsimcoal were much faster – simulating 1,000 of 5-Mb sequences under the 5-hotspot model took only 90 seconds for MaCS and 40 seconds for fastsimcoal, respectively. For all simulation settings, fastsimcoal always performed a little faster than MaCS (Tables 2 and 3). It is worthy to mention that increasing the number of hotspots did not affect the execution times for MaCS, Simcoal2, and fastsimcoal, but did so for msHOT.
Validation of recombination hotspots
For real data based on the HLA class II region, LDhat 2.2 rhomap, sequenceLDhot and Haploview were applied to detect the six recombination hotspots (i.e., DNA1, DNA2, DNA3, DMB1, DMB2, and TAP2) identified by sperm typing . This 216-kb region has a comparable length to that of our simulated data (200-kb). Results of these three hotspot detection programs are shown in Figure 4. The Haploview plot displayed a rugged “block-like” pattern of haplotype structure (top panel), which did not reveal precise locations of these recombination hotspots. Since LDhat 2.2 rhomap adopted the MCMC approach, the estimation of recombination rates along the DNA sequence was performed in five runs in order to determine the consistency across different runs. For each time, we ran rhomap for a total of 10,000,000 iterations including a burn-in of 1,000,000 iterations and taking a sample every 5,000 iterations, As suggested by the LDhat 2.2 user manual, both block penalty and hotspot penalty were set to be 0’s. As shown in Figure 4 (middle panel), moderate variations across these five runs were observed (results shown by different colors), and the positions of the six recombination hotspots were not indubitably discernable. The LR graph for the estimation results by sequenceLDhot provided the most accurate results, correctly identifying five of the six recombination hotspots, but not the leftmost hotspot (i.e., DNA1) which is the weakest hotspot spaced very close to the adjacent DNA2 hotspot (Figure 4; bottom panel). For simulated data, we first simulated a total of 100 DNA sequences (200-kb in length) for 368 SNPs based on a 1-hotspot model by MaCS. LDhat 2.2 rhomap was run five times (results shown by different colors) to detect this recombination hotspot for a single simulation data set. Then, sequenceLDhot was run to detect the recombination hotspot. Among the five runs of LDhat 2.2 rhomap, only one run appeared to identify the correct hotspot peak (100-kb position), and the results across these five runs were not consistent (Figure 5; top panel). By contrast, sequenceLDhot precisely detected the recombination hotspot at the correct position (Figure 5; bottom panel). We also simulated a total of 100 DNA sequences (200-kb in length) for 459 SNPs based on a 2-hotspot model by fastsimcoal. Haploview results gave rise to a rumpled LD pattern that could be barely perceived as three “fuzzy” LD blocks (i.e., left, middle and right), but the precise locations of the two recombination hotspots were far from clear (Figure 6; top panel). However, sequenceLDhot correctly detected the two simulated recombination hotspots at expected 70- and 140-kb positions (Figure 6; bottom panel) unambiguously. Taken together, our results based on both real and simulated data indicate that for hotspot detection, LDhat 2.2 rhomap performed much less accurate than sequenceLDhot, and the visual discernment of recombination hotspots based on the Haploview plot was much less obvious as compared with results of sequenceLDhot. Therefore, sequenceLDhot was chosen as the software tool to validate the recombination hotspots from simulated data by various coalescent simulators, which was shown to be much more robust and accurate in detecting recombination hotspots than the other two computer programs. Twenty replicates for each simulation setting (see Simulation data sets subsection in Methods section) were simulated for each coalescent simulator. Two types of metrics were applied in assessing the accuracy of the simulated recombination hotspots. The first metric is the proportion of simulated hotspots that could not be detected by sequenceLDhot based on a hotspot intensity threshold. The other metric is the extent of shifting of the detected hotspot position from the expected position. We define a recombination hotspot shifting > 1-kb (to either left or right) as a significant shifting. For each recombination, hotspot, sequenceLDhot provided an estimated value of the LR statistic. If the LR exceeded a predefined threshold (denoted as c), the hotspot would be treated as a significant one. The higher the LR, the more certain the corresponding hotspot. We chose the default parameter settings such that the width (w) of the hotspot was 2,000-bp, and the spacing (l) was 1,000-bp. The value of c of LR for claiming a significant hotspot was set to be 10. Of the five programs, only msHOT, MaCS, and fastsimcoal were selected for comparisons because ms could not handle a user-specified hotspot model (Tables 1, 2, 3) and Simcoal2 was not so scalable (Table 4).
When sequence data were simulated according to the 2-hotspot model, sequenceLDhot detected 39 of the total 40 hotspots from data simulated by msHOT, and of the detected ones, two shifted significantly away from their expected positions. The mean shifting of all detected hotspots was 26-kb to the left. It had the highest mean LR (45.83) and the lowest standard deviation (18.35) (Table 4). Data simulated by MaCS had the lowest mean LR (28.73) and the highest standard deviation (23.36), with 38 of the total 40 hotspots detected and 4 of the detected ones significantly shifted. The mean shifting of recombination hotspots simulated by MaCS was 86.65-kb. Hotspots simulated by fastsimcoal had the smallest shifting — a mean of 10.8-kb, but had the least number of significant ones found by sequenceLDhot — 36 of 40. Two of the detected ones were found significantly shifted. Thus, our simulation results based on 2-hotspot model showed that, MaCS had the poorest performance, with the largest mean shifting and number of significant shiftings, whereas msHOT and fastsimocal had comparable performances. For simulations according to the 5-hotspot model, fastsimcoal-simulated data had the best performance, with 94 out of 100 hotspots detected by sequenceLDhot, as compared to 91 out of 100 for msHOT-simulated data, and 93 out of 100 for MaCS-simulated data. Besides, it had the highest mean LR (48.78) and the lowest standard deviation (22.06). Its mean shifting was also the lowest (i.e., 35-kb to the left), while the number of significantly shifted hotspots was comparable to that of MaCS and smaller than that of msHOT. The mean shifting for msHOT was substantially higher than those of MaCS and fastsimcoal, which was 981.23-bp to the left. In our simulations based on the 5-hotspot model, fastsimcoal had the most stable performance and generated the most accurate recombination hotspots whereas msHOT had comparatively the worst performance of the three coalescent simulators. Figure 7 showed positions of hotspot peaks detected by sequenceLDhot from data simulated by msHOT (Figure 7), MaCS (Figure 7B), and fastsimcoal (Figure 7C) according to the 5-hotspot model. For each model, for the sake of clarity, we randomly selected five of the 20 replicates, and then superimposed them on the same graph. Each color represents a replicate. All these five lines were supposedly to be matched perfectly with 5 hotspot peaks at positions 3-, 6-, 9-, 12-, and 15-kb, respectively, if all the simulations were accurate. However, the positions of hotspot peaks detected from data simulated by msHOT were not consistent across the five replicates (Figure 7A). A significant shifting was observed for one of the five replicates (red line), and the LRs of the five hotspot peaks across the five replicates exhibited dramatic variations (Table 4 and Figure 7A). If an LR of 6 was applied as the threshold to define hotspot peaks, 6 (16.13%) were “noisy” (i.e., incorrect) peaks out of the 31 total detected peaks (i.e., LR > 6) (Figure 7A) from data simulated by msHOT. Hotspot peaks detected from MaCS-simulated data appeared more consistent (Figure 7B) — No significant shiftings were detected and 5 (16.67%) “noisy” peaks out of 30 total detected peaks were found. For fastsimcoal-simulated data sets, no “noisy” peaks were found (i.e., all 25 detected peaks were correct) and the positions of these hotspot peaks were the most consistent (Figure 7C). A similar pattern was shown when the LR graph was generated for all 20 data sets simulated by each of msHOT, MaCS, and fastsimcoal (data not shown). Therefore, although msHOT could be applied to simulate recombination hotspot peaks, their positions appear to be less precise compared to those of MaCS and fastsimcoal, at least based on a 5-hotspot model.
Advances in next-generation sequencing technologies have resulted in a dramatic increase in generating whole genome sequence data. There is an urgent need to develop novel methods for analyzing such huge amounts of data. Along with it, computer simulation of genome-wide data is also crucial. Coalescent model has been the most attractive model in population genetics, and is widely recognized as the cornerstone in statistical analysis of DNA sequences . The quintessential feature of coalescent is to start with the current sample of DNA sequences and then trace backward in time to identify past events since their MRCA [44, 45]. The standard coalescent provides an accurate characterization of genealogies of haploid individuals of constant size, which can incorporate recombination [13, 45, 46]. However, the original standard coalescent does have several restrictive features based on the neutral theory that limits its application to real-world DNA sequences, and has been extended to handle selection [47–49], gene conversion [14, 50, 51], and migration [52–55]. As indicated by , the generating models based on coalescent theory should resemble real data as much as possible. Therefore, even the exact standard coalescent might not be the “best” model that could generate simulated data “most” similar to present-day DNA sequences. Nevertheless, the coalescent model based on the Wright-Fisher model is a theoretically convenient and reasonable approximation to real-world scenarios. By incorporating prior information based on coalescent theory, PHASE v2.1 [33, 39, 40] significantly improved phasing accuracies for both real and simulated data sets. Therefore, standard coalescent remains a widely applied tool in modeling real-world DNA sequences. However, standard coalescent implemented based on full ARGs incurs a high computational cost for a relatively long DNA sequence (e.g., > 5-Mb), making it difficult to simulate DNA sequences at the genome scale for a large sample size (e.g., > 500). To overcome this obstacle, SMC, an approximation to standard coalescent, has been developed which scales linearly with the length of the DNA sequence being simulated from the left to the right, and has the remarkable advantage of being much faster and more extensible than standard coalescent algorithm . Based on variants of SMC, both MaCS and fastsimcoal could generate LD patterns of DNA sequences very close to those generated under a classical ARG model but much more swiftly .
After our extensive simulations based on five widely used programs, for simulating up to a few hundred of samples (sequences) with sequence lengths spanning several Mbs, Hudson’s ms is a great choice for its flexibility in handling historical events and robust modeling. For simulating sequences up to tens of or a few hundred of Mbs or a large number of samples, ms is no longer adaptable. In our simulations, ms could not handle 10,000 samples for a 1-Mb sequence or 500 samples for a 5-Mb sequence. The basic algorithm of msHOT  is an extended version of ms (which generates ARGs for a sample of chromosomes based on coalescent theory) by adding both recombination hotspot and gene conversion hotspot models to the implementation of standard coalescent by ms. In simulating the simplest scenario of 0-hotspot model assuming also no gene conversions, the implementation of msHOT appears to be the same as that of ms (because there is no necessity to invoke complex cross-over and gene conversion hotspot models). However, when there is at least one recombination hotspot in the simulated DNA sequences, the implementation of msHOT algorithm must differ from that of ms to account for the presence of recombination hotspot(s) in coalescent simulation. As stated in , the modification of msHOT allows the user to insert any pre-specified non-overlapping cross-over hotspots and non-overlapping gene conversion hotspots into the genetic region by specifying the locations and intensities for each. Specifically, incorporating R recombination hotspots requires the user to specify a left endpoint (a h ), right endpoint (b h ), and intensity (I h ) for each hotspot h, where h = 1, …, R. Inside a given hotspot h, the probability of a recombination occurring between two adjacent base pairs in a single transmission from parent to offspring is λ h r bp . Outside recombination hotspot(s), this probability is the recombination probability per base pair — r bp . That is the reason when for simulating scenarios for 2- and 5-hotspot models, msHOT performance appears to be much slower (to account for the extra complexity introduced by the user-defined recombination hotspots) than for the 0-hotspot model. In addition, in the absence of a recombination hotspot, just like ms, msHOT could not handle a sample size of 10,000 sequences for simulating 1-Mb DNA sequence data (Table 2) or sample sizes of 500, 1,000, and 10,000 for simulating 5-Mb DNA sequence data (Table 2). However, in the presence of at least one recombination hotspot, by taking a very different implementation compared to ms by including a more complex hotspot model, msHOT could handle such large sample sizes. By contrast, MaCS, which is based on a modified SMC algorithm, could achieve coalescent simulations for many more samples with much longer length, while accurately approximating the results simulated by standard coalescent (i.e., ms) and maintaining its flexibility. Theoretical interpretations for empirical observations of Table 3 are as follows. When simulating 2- and 5-hotspot models for relatively long DNA sequences (i.e., 5-Mb), msHOT (built on an extended algorithm based on ms) and Simcoal2 [built on a discrete generation-by-generation approach (rather than a continuous time approximation)] are understandably much slower than MaCS and fastsimcoal due to their critical algorithmic differences — MaCS has taken a faster modified SMC approach and fastsimcoal has also taken a computationally more efficient SMC’ approach. As indicated in Background section, the SMC method of , and the SMC’ method of  are both approximations to the standard coalescent algorithm, which have the advantage of being much faster. Specifically, MaCS  is a generalized SMC which is equivalent to SMC when the “history” parameter h = 1 bp, but becomes a closer approximation to ms than SMC when h increases (such that more information of adjacent genealogies are stored). Generally speaking, MaCS produces simulated data that are virtually identical to data simulated under the standard coalescent, but in much less time and using much less memory . Similar to MaCS, fastsimcoal is based on a continuous-time SMC’ by applying ABC, which is much faster than msHOT or discrete generation coalescent approach of Simcoal2 that also gives excellent approximations to standard coalescent with a much quicker speed .
Based on cross-over hotspot validation results (Table 4 and Figure 7), recombination hotspots simulated by MaCS did not appear to be as accurate as those of msHOT and fastsimcoal for a 2-hotspot model. For a 5-hotspot model, MaCS outperformed msHOT, but fastsimcoal was the most accurate simulator. When there is a demand for simulating DNA sequences under a variety of population genetic models [especially in the presence of cross-over hotspot(s)], fastsimcoal appears to be the best choice. Different data types [DNA, SNP, simple tandem repeat (STR)] and sequences of different structures could be simulated by fastsimcoal, in addition to its advantages in terms of efficiency, accuracy, and capability of generating any user-defined patterns of recombination hotspots. From a practical standpoint, when a set of recombination hotspots need to be simulated, msHOT, MaCS, and fastsimcoal are all applicable, but msHOT performed much slower than MaCS and fastsimcoal, and had the lowest accuracy based on validation results of sequenceLDhot.
While Hudson’s ms remains an excellent choice for simulating relatively short DNA sequences (< several Mbs) under general scenarios, MaCS and fastsimcoal are much more scalable and flexible in simulating many different demographic histories and diverse DNA sequence structures (e.g., SNPs and STRs). Based on both running time and hotspot validation comparisons, fastsimcoal is shown to be the fastest and most reliable and consistent coalescent simulator, especially when the number of hotspots is large. MaCS is a runner-up with a lower speed and a slightly less accuracy. Based on our extensive simulation evaluation and comparison results, cautions should be taken in applications of these widely used coalescent simulators, such that sequence data simulated by a given software should be checked and validated — e.g., the positions and intensities of recombination hotspots, to guard against any discrepancy between the intended objective and the actual simulation results. Further, for detecting and validating recombination hotspots, among the three widely used computer programs, sequenceLDhot appears to be the best choice— fast, robust and accurate. In real-world DNA sequence data, a variety of factors such as GC content, local LD block structure, DNA elements that act as enhancers or inhibitors of recombination [56–58], could affect the intensities and locations of recombination hotspots along a given chromosome. For example, recombination hotspots correlate positively with GC content [59, 60]. Further, certain DNA motifs are enriched in cross-over hotspots, among which CCTCCCT and CCCCACCCC are the most prominent . Thereafter, in real-world scenarios, these factors should be taken into consideration in identifying genuine cross-over hotspots.
Approximate Bayesian Computation
Ancestral recombination graph
Human leukocyte antige
Markov Chain Monte Carlo
Most recent common ancestor
Sequential Markov coalescent
Single nucleotide polymorphism
Simple tandem repeat.
Brisbin A, Jenkins GD, Ellsworth KA, Wang L, Fridley BL: Localization of association signal from risk and protective variants in sequencing studies. Front Genet. 2012, 3: 173-
Kinnamon DD, Hershberger RE, Martin ER: Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. Plos One. 2012, 7 (2): e30238-10.1371/journal.pone.0030238.
Morris AP, Zeggini E: An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010, 34 (2): 188-193. 10.1002/gepi.20450.
Weiss G, von Haeseler A: Inference of population history using a likelihood approach. Genetics. 1998, 149 (3): 1539-1546.
Burgess R, Yang Z: Estimation of hominoid ancestral population sizes under bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 2008, 25 (9): 1979-1994. 10.1093/molbev/msn148.
Liu L, Yu LL, Kubatko L, Pearl DK, Edwards SV: Coalescent methods for estimating phylogenetic trees. Mol Phylogenet Evol. 2009, 53 (1): 320-328. 10.1016/j.ympev.2009.05.033.
McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P: The fine-scale structure of recombination rate variation in the human genome. Science. 2004, 304 (5670): 581-584. 10.1126/science.1092500.
Spencer CCA, Coop G: SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics. 2004, 20 (18): 3673-3675. 10.1093/bioinformatics/bth417.
Mailund T, Schierup MH, Pedersen CNS, Mechlenborg PJM, Madsen JN, Schauser L: CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinforma. 2005, 6: 252-10.1186/1471-2105-6-252.
Marjoram P, Wall JD: Fast “coalescent” simulation. BMC Genet. 2006, 7: 16-10.1186/1471-2156-7-16.
Ramos-Onsins SE, Mitchell-Olds T: Mlcoalsim: multilocus coalescent simulations. Evol Bioinform. 2007, 3: 41-44.
Kang CJ, Marjoram P: Exact coalescent simulation of new haplotype data from existing reference haplotypes. Bioinformatics. 2012, 28 (6): 838-844. 10.1093/bioinformatics/bts033.
Hudson RR: Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics. 2002, 18 (2): 337-338. 10.1093/bioinformatics/18.2.337.
Hellenthal G, Stephens M: MsHOT: modifying Hudson’s ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics. 2007, 23 (4): 520-521. 10.1093/bioinformatics/btl622.
Chen GK, Marjoram P, Wall JD: Fast and flexible simulation of DNA sequence data. Genome Res. 2009, 19 (1): 136-142.
Laval G, Excoffier L: SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004, 15: 2485-2487.
Excoffier L, Foll M: fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 2011, 27 (9): 1332-1334. 10.1093/bioinformatics/btr124.
Kingman JFC: The coalescent. Stoch Process Appl. 1982, 13 (3): 14-
Kingman JFC: On the genealogy of large populations. J Appl Probab. 1982, 19: 17-
Griffiths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comput Biol. 1996, 3 (4): 479-502. 10.1089/cmb.1996.3.479.
Wiuf C, Hein J: Recombination as a point process along sequences. Theor Popul Biol. 1999, 55 (3): 248-259. 10.1006/tpbi.1998.1403.
McVean GAT, Cardin NJ: Approximating the coalescent with recombination. Philos T Roy Soc B. 2005, 360 (1459): 1387-1393. 10.1098/rstb.2005.1673.
Eriksson A, Mahjani B, Mehlig B: Sequential Markov coalescent algorithms for population models with demographic structure. Theor Popul Biol. 2009, 76 (2): 84-91. 10.1016/j.tpb.2009.05.002.
Maniatis N, Collins A, Xu CF, McCarthy LC, Hewett DR, Tapper W, Ennis S, Ke X, Morton NE: The first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci U S A. 2002, 99 (4): 2228-2233. 10.1073/pnas.042680999.
Winckler W, Myers SR, Richter DJ, Onofrio RC, McDonald GJ, Bontrop RE, McVean GAT, Gabriel SB, Reich D, Donnelly P, et al: Comparison of fine-scale recombination rates in humans and chimpanzees. Science. 2005, 308 (5718): 107-111. 10.1126/science.1105322.
Auton A, McVean G: Recombination rate estimation in the presence of hotspots. Genome Res. 2007, 17 (8): 1219-1227. 10.1101/gr.6386707.
Li N, Stephens M: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003, 165 (4): 2213-2233.
Fearnhead P: SequenceLDhot: detecting recombination hotspots. Bioinformatics. 2006, 22 (24): 3061-3066. 10.1093/bioinformatics/btl540.
Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21 (2): 263-265. 10.1093/bioinformatics/bth457.
Barrett JC: Haploview: visualization and analysis of SNP genotype data. Cold Spring Harb Protoc. 2009, 2009 (10): pdb ip71-
Kimura M: The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969, 61 (4): 893-903.
Song YS, Hein J: Constructing minimal ancestral recombination graphs. J Comput Biol. 2005, 12 (2): 147-169. 10.1089/cmb.2005.12.147.
Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003, 73 (5): 1162-1169. 10.1086/379378.
Jeffreys AJ, Ritchie A, Neumann R: High resolution analysis of haplotype diversity and meiotic crossover in the human TAP2 recombination hotspot. Hum Mol Genet. 2000, 9 (5): 725-733. 10.1093/hmg/9.5.725.
Jeffreys AJ, Kauppi L, Neumann R: Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 2001, 29 (2): 217-222. 10.1038/ng1001-217.
Jeffreys AJ, Neumann R: Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet. 2002, 31 (3): 267-271. 10.1038/ng910.
Niu T: Algorithms for inferring haplotypes. Genet Epidemiol. 2004, 27 (4): 334-347. 10.1002/gepi.20024.
Zhang Y, Niu T: Haplotype Structure. Handbook on Analyzing Human Genetic Data: Computational Approaches and Software. Edited by: Lin S, Zhao H. 2010, Springer-Verlag, 25-80.
Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001, 68 (4): 978-989. 10.1086/319501.
Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet. 2005, 76 (3): 449-462. 10.1086/428594.
Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet. 2002, 70 (1): 157-169. 10.1086/338446.
Beaumont MA, Zhang WY, Balding DJ: Approximate Bayesian computation in population genetics. Genetics. 2002, 162 (4): 2025-2035.
Jeffreys AJ, Kauppi L, Neumann R: Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 2001, 29 (2): 217-222. 10.1038/ng1001-217.
Fu YX, Li WH: Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor Popul Biol. 1999, 56 (1): 1-10. 10.1006/tpbi.1999.1421.
Rosenberg NA, Nordborg M: Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet. 2002, 3 (5): 380-390. 10.1038/nrg795.
Hudson RR: Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. 1983, 23 (2): 183-201. 10.1016/0040-5809(83)90013-8.
Krone SM, Neuhauser C: Ancestral processes with selection. Theor Popul Biol. 1997, 51 (3): 210-237. 10.1006/tpbi.1997.1299.
Neuhauser C, Krone SM: The genealogy of samples in models with selection. Genetics. 1997, 145 (2): 519-534.
Pokalyuk C, Pfaffelhuber P: The ancestral selection graph under strong directional selection. Theor Popul Biol. 2013, 87: 25-33.
Wiuf C, Hein J: The coalescent with gene conversion. Genetics. 2000, 155 (1): 451-462.
Wiuf C: A coalescence approach to gene conversion. Theor Popul Biol. 2000, 57 (4): 357-367. 10.1006/tpbi.2000.1462.
Beerli P, Felsenstein J: Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics. 1999, 152 (2): 763-773.
Arenas M, Posada D: Recodon: coalescent simulation of coding DNA sequences with recombination, migration and demography. BMC Bioinforma. 2007, 8: 458-10.1186/1471-2105-8-458.
Notohara M: An application of the central limit theorem to coalescence times in the structured coalescent model with strong migration. J Math Biol. 2010, 61 (5): 695-714. 10.1007/s00285-009-0318-z.
Steinrucken M, Paul JS, Song YS: A sequentially Markov conditional sampling distribution for structured populations with migration and recombination. Theor Popul Biol. 2013, 87: 51-61.
Zhang J, Li F, Li J, Zhang MQ, Zhang X: Evidence and characteristics of putative human alpha recombination hotspots. Hum Mol Genet. 2004, 13 (22): 2823-2828. 10.1093/hmg/ddh310.
Arnheim N, Calabrese P, Tiemann-Boege I: Mammalian meiotic recombination hot spots. Annu Rev Genet. 2007, 41: 369-399. 10.1146/annurev.genet.41.110306.130301.
Rana NA, Ebenezer ND, Webster AR, Linares AR, Whitehouse DB, Povey S, Hardcastle AJ: Recombination hotspots and block structure of linkage disequilibrium in the human genome exemplified by detailed analysis of PGM1 on 1p31. Hum Mol Genet. 2004, 13 (24): 3089-3102. 10.1093/hmg/ddh337.
Fullerton SM, Bernardo Carvalho A, Clark AG: Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol. 2001, 18 (6): 1139-1142. 10.1093/oxfordjournals.molbev.a003886.
Clark AG, Wang X, Matise T: Contrasting methods of quantifying fine structure of human recombination. Annu Rev Genomics Hum Genet. 2010, 11: 45-64. 10.1146/annurev-genom-082908-150031.
Zheng J, Khil PP, Camerini-Otero RD, Przytycka TM: Detecting sequence polymorphisms associated with meiotic recombination hotspots in the human genome. Genome Biol. 2010, 11 (10): R103-10.1186/gb-2010-11-10-r103.
We are very grateful to the Editor and two anonymous reviewers for their constructive comments which have improved manuscript. Dr. Deng was supported in part by R01AR050496, R21AG027110, R01AG026564, R21AA015973, R01AR057049, and R03TW008221 from NIH and Specialized Center of Research Grant P50 AR055081 funded jointly by the NIAMS and the Office of Research on Women’s Health. Dr. Niu was supported in part by a start-up fund of the Center for Bioinformatics and Genomics, Tulane University.
The authors declare that they have no competing interests.
TN conceived the idea for the project, TN and HWD contributed to study design and conception, TY and TN participated in data simulation, analysis and interpretation, TY and TN drafted the manuscript and HWD revised it critically for intellectual content. All authors read and approved the final version of the manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Yang, T., Deng, HW. & Niu, T. Critical assessment of coalescent simulators in modeling recombination hotspots in genomic sequences. BMC Bioinformatics 15, 3 (2014). https://doi.org/10.1186/1471-2105-15-3
- Population genetics
- Linkage disequilibrium
- Single nucleotide polymorphism