Rapid forward-in-time simulation at the chromosome and genome level
© Aberer and Stamatakis; licensee BioMed Central Ltd. 2013
Received: 7 April 2013
Accepted: 3 July 2013
Published: 9 July 2013
Skip to main content
© Aberer and Stamatakis; licensee BioMed Central Ltd. 2013
Received: 7 April 2013
Accepted: 3 July 2013
Published: 9 July 2013
In population genetics, simulation is a fundamental tool for analyzing how basic evolutionary forces such as natural selection, recombination, and mutation shape the genetic landscape of a population. Forward simulation represents the most powerful, but, at the same time, most compute-intensive approach for simulating the genetic material of a population.
We introduce AnA-FiTS, a highly optimized forward simulation software, that is up to two orders of magnitude faster than current state-of-the-art software. In addition, we present a novel algorithm that further improves runtimes by up to an additional order of magnitude, for simulations where a fraction of the mutations is neutral (e.g., only 10% of mutations have an effect on fitness). Apart from simulated sequences, our tool also generates a graph structure that depicts the complete observable history of neutral mutations.
The substantial performance improvements allow for conducting forward simulations at the chromosome and genome level. The graph structure generated by our algorithm can give rise to novel approaches for visualizing and analyzing the output of forward simulations.
The field of population genetics strives to determine, how elementary evolutionary forces (i.e., natural selection, recombination, or random genetic drift) shape the genetic landscape within population of a species. As for related areas in evolutionary biology, rapid advances in next-generation sequencing technology (e.g., ) have transformed the field with the completion of the first phase of the 1000 genome project  representing a recent highlight. Thus, population genetics is rapidly transforming into a data-driven science , while many methods and tools are not up to the challenge yet  with respect to scalability and efficiency.
Because of the complex processes in population genetics, the inference machinery for analyzing the properties of a population is often limited and needs to be complemented by simulations. Some of the most common use cases of simulation software in population genetics are (i) the verification of novel methods/models [5, 6] and (ii) the generation of datasets for assessing how using a different model affects the properties of of a population. A promising, but computationally particularly expensive approach is to use simulations for inferring specific properties (e.g., demography ) of a given data sample, as implemented, for instance by using approximate Bayesian computation techniques . For an overview of available software and applications, see .
Forward-in-Time simulation represents a powerful approach to simulating the evolutionary processes that act on genomic regions (see review by ). Essentially, sequences are represented in silico and the basic evolutionary operations/events (e.g., mutation, selection, and recombination) are explicitly applied to each simulated individual, on a generation-by-generation basis. The inferred, simulated sequences represent the exact outcome of the underlying/assumed evolutionary process. However, this high accuracy comes at a high computational cost. Execution times are quadratic with respect to the number of individuals, because the number of simulated generations needs to be increased with growing population size. To obtain accurate results, the size of the population that needs to be simulated must be chosen realistically. However, realistic population sizes range between a computationally already challenging 10,400 individuals in humans to an entirely prohibitive number of 1,150,000 individuals in some species of Drosophila melanogaster or 25,000,000 in Escherichia coli.
These excessive computational requirements explain the popularity of approximate coalescent (backward) simulations [12, 13] that are substantially faster. The coalescent-based approach is faster because only genomic samples that survived until the present are simulated backward in time. A substantial drawback of coalescent simulations is that, natural selection can only be incorporated into the model by means of a single mutation [14, 15].
To date, only few attempts have been made to improve the performance of forward simulators using optimization techniques from computer science and via low-level technical improvements. Chadeau-Hyam et al. developed a strategy to re-scale simulation parameters such as to decrease the effective population size that needs to be simulated . However, this shortcut induces a decrease in accuracy (see ). Forwsim (later extended by ) implements a forward-algorithm that regularly simulates a small user-defined set of mutations under selection, forward in time. The neutral part of each haplotype is simulated separately and executed with a delay of 8 generations. Thereby, one can circumvent simulating haplotypes, if no ancestral material has survived until the present. Distinguishing between neutral and non-neutral mutations for accelerating forward simulations represents a natural implementation choice, specifically since it is assumed that the majority of mutations is effectively neutral .
Finally, SFS_CODE has become a state-of-the-art forward simulator, because of an efficient implementation in C coupled with a plethora of simulation features and parameters.
Here, we describe design as well as optimization techniques that allow for substantially more efficient forward-in-time simulations. We introduce a novel algorithm, for the a posteriori simulation of neutral mutations. In addition to runtime gains, our algorithm generates a data structure that depicts the entire observable history of each individual sequence, a feature that –to the best of our knowledge– is unique among simulation software. We make available a corresponding software package, the Ancestry-Aware Forward-in-Time Simulator (AnA-FiTS, available under GNU GPL at our project webpage: http://www.exelixis-lab.org/aberer/anafits/index.html), a C++ implementation that can handle challenging post-genomic datasets for population genetics analyses.
In the following, we describe our algorithm for ancestry-based simulation of neutral mutations. Subsequently, we discuss implementation and optimization issues.
For forward simulation of a single generation of a population with effective population size N e , a scaled mutation rate of θ=4N e μ (with μ being the per-base mutation rate), the scaled recombination rate ρ=4N e r (with r being the per-base recombination rate), the basic simulation steps are: (i) sample diploid individuals by their fitness, (ii) determine the number of recombination events (Poisson random number drawn from Poi(ρ)) and create recombinants, (iii) determine the number of non-neutral mutations (Poisson random number drawn from Poi(θ)) and mutate sequences, (iv) recalculate the fitness values for each individual.
Since neutral mutations, by definition, do not affect the survival probability of an individual (resp. the sampling probability), we do not need to simulate neutral mutations forward in time. This observation was first explored in forwsim, where the simulation of the neutral part of a sequence was delayed for a small number of generations. Thereby, the actual calculations required for this part of the simulation can be discarded, if the haplotype under consideration did not survive (because of drift or selection) until the present. The authors demonstrated that, the simulation of neutral sequences can be accelerated by a factor of two to five by optimally choosing this so-called look-ahead parameter, that is, the number of generations, by which the simulation of neutral mutations is delayed.
During l forward-simulation generations (typically l:=10N e for diploid organisms), AnA-FiTS annotates an ancestry , where a i is a surjective function that maps the individual haplotypes in generation i to the haplotypes of the preceding generation i−1. In the following, we outline this procedure for a single locus/chromosome. The extension to multiple loci is described in the implementation section. The variable a i is defined as a i :h i,j →h i−1,l , where h i,j is the haplotype with index j in generation i, that is mapped to its parental haplotype with index l in generation i−1. Haplotype indices are ordered, such that adjacent pairs that start with an even index represent the two homologous versions of each locus for diploid organisms (i.e., (h i,0,h i,1) or (h i,2,h i,3) represent an individual).
As mentioned before, the algorithm also keeps track of all recombination events (potentially multiple recombinations per sequence and per generation) that occurred during the forward phase. For each event, we store the absolute sequence position, the haplotype index, and the generation number. For a sequence of length L, the information in and the corresponding set of recombination events can be used to determine (via back-tracking) the regions of ancestral haplotypes that survived until the present. Only mutations that occurred within these regions are observable in the present. In the first step, our algorithm determines haplotypes in ancestral generations (referred to as survivors) that contributed material to the present generation (denoted surviving regions). For these regions, we simulate neutral mutations and identify those recombination events that contributed to mutations in the present haplotype instances.
The algorithm starts at the present generation and initializes surviving regions for each haplotype with a maximum interval of [ 0,L], where L is the sequence length. Regions for ancestral haplotypes are initialized by the tuple (∞,0). Note that, the surviving regions of a haplotype instance can be fragmented/scattered into many segments by recombinations. To economize on runtime and memory, we only keep track of the start position of the very first segment r S and the end position of the very last segment r E . By using this approximate surviving ’super-’region R=(r S ,r E ), we overestimate the size of the actual surviving region. We correct for this in a later stage of the algorithm, when sequences are extracted from the graph structure. Finally, to determine the surviving haplotype region in generation i−1, we iterate over all survivors of generation i starting with haplotypes contained in the present generation. For each surviving haplotype with region (r S ,r E ), we propagate the surviving regions as described in the following algorithm to the members of the preceding generation. Let h i−1,k with region be the parent haplotype and h i−1,k+1 with region be the homologous haplotype in the corresponding parent individual:
For simplicity, the algorithm described above assumes that the recombinant emerged from at most one recombination event. For multiple recombination events per recombinant, the algorithm has to be adapted, such that surviving regions of parental haplotypes are extended until the next recombination breakpoint. This has also been implemented in the AnA-FiTS software.
During the procedure described above, we also determine the set of survivors from the preceding generation. In the case that a recombination event does not split the surviving region (r S ,r E ), we can simply ignore it, if the following property does not hold: assume, haplotype h i,l is mapped to h i−1,k with surviving region (r S ,r E ) and h i,l is a recombinant with a breakpoint b<r S . This means that, h i,l does not inherit mutations from h i−1,k , but instead from the homologous haplotype h i−1,k+1 it recombined with. If we can not ignore the recombination, we create a node instance for the graph that will be constructed in the next step. The survivors of each generation are stored for further processing in subsequent steps.
In the algorithm step described above, we determined survivors for each generation i, the set of relevant recombinations (stored as so-called R-nodes), and the set of relevant neutral mutations (so-called M-nodes). For all nodes, we store the generation of origin and the corresponding haplotype index h i,l . The goal of the procedure we describe here, is to construct a graph (the Backward Event Graph BEG, see Figure 1 for an example). The BEG is a directed acyclic graph with two types of nodes: (i) M-nodes (for mutations) with out-degree 1 and (ii) R-nodes (for recombinations) with out-degree 2. Thus, each node in the graph represents the state of an ancestral haplotype that reflects the history of all mutation or recombination events experienced by its ancestors. A directed edge connecting node n 1 to node n 2 indicates that n 2 emerged from n 1 after mutation or recombination.
The algorithm described above constructs the BEG and an array of haplotype states n i (references to nodes in the BEG), that survived until the present. In the final step, we extract the exact sequence for each node n i from the BEG by recursively traversing the graph starting at n i , while keeping track of the region of interest which is segmented by each recombination event which will be encountered. Since we execute the backtracking procedure for each node n i individually, (in contrast to the algorithm description) the exact borders of all segments of the surviving region for this node are known at any given point in time. Thus, we can ignore all mutation events that did not occur within the surviving region, but were inserted into the graph because of overestimation or because this mutation forms part of the surviving region of another haplotype.
In practice, the above extraction procedure is inefficient. As a consequence, the sequence reconstruction step dominates execution times. Consider two haplotype states n 1 and n 2 that have survived, where n 2 is an M-node with ancestor n 1. For both nodes, we have to traverse all ancestral nodes of n 1 and n 2 to obtain the two corresponding sequences s 1 and s 2. To accelerate this process, it suffices to create sequence s 1 via backtracing starting at n 1 and to generate s 2 as a copy of s 1 and then simply add the mutation events induced by node n 2 in the end. In the following, we explain, how the computational cost of sequence extraction can be reduced.
Prior to the actual sequence extraction step, the graph is traversed twice, starting with each haplotype n i that survived until the present. In the first traversal, we approximate the surviving region for each node as described in the algorithm. Thus, if we decide to explicitly determine and store the sequence of a node, we already know the maximum size of the region of interest beforehand.
In a second traversal that starts with each haplotype in the present, we determine for which nodes intermediate sequences should be created. In other words, the surviving mutations are represented explicitly in this node. During these traversals, we actually stop a traversal, if the traversed node has already been visited by another traversal. For each node, we also store the number v of times the node has been visited. We refer to nodes with v≥2 as coalescent nodes, since ancestral material (resp. the surviving regions) of two lineages coalesces in these nodes. Finally, we determine the distribution of v and the 5% nodes with highest v are represented explicitly, while all remaining nodes will be traversed several times during sequence extraction.
Then, the sequences can be extracted via a final full traversal. The 5% cut-off (a tuning parameter of the algorithm) was determined empirically. It yields good results with respect to balancing runtime versus memory requirements. Note that, if every coalescent node (v≥2) is represented explicitly, memory requirements would become prohibitive for whole-chromosome simulations with 10,000 individuals. In general, the 5% cut-off proves to be sufficient to attain a substantial decrease of the traversal cost for sequence extraction.
The ancestry-based algorithm for simulating neutral mutations as discussed in the previous section has specifically been developed to improve runtimes of forward-simulations. In the following, we discuss low-level implementation issues and optimization techniques of AnA-FiTS.
For the majority of possible AnA-FiTS invocations with respect to user parameter combinations, storing the ancestry requires the largest amount of memory. Depending on the effective population size N e , we need to store ancestry information for 10·N e 2 haplotypes. If this exceeds the amount of available memory, the user can set a command line flag such that the memory utilization of the ancestry does not exceed a user-specified soft limit (see below). AnA-FiTS then tries to split up the forward simulation into sections (a number of generations that is as large as the user parameter permits) and updates the graph after each section leading to increased runtimes. In some instances the graph representation itself can become at least as memory intensive as the ancestry. Furthermore, for large graphs, the explicit bit vector representation of intermediate sequences may further increase memory requirements. Therefore, currently no hard upper memory utilization limit can be imposed on AnA-FiTS runs.
The ancestry only works correctly for a single locus. For multiple unlinked loci (e.g., chromosomes), we map each individual to its two parent individuals instead of mapping it to individual haplotypes. For this, we use a per-chromosome bit vector to keep track of which of the two possible homologous haplotypes was inherited from the respective parent. Apart from their high memory-efficiency, bit vectors can also be efficiently initialized with random bits using a random number generator (i.e., each bit represents a choice as to which haplotype was inherited by an individual).
To avoid unnecessary and inefficient frequent memory allocations, the entire ancestry (an array of arrays) and the array of survivors are allocated as single, monolithic blocks. For the ancestry, we dynamically determine the minimal number of bytes that is required for storing a generation as a function of the population size of the preceding generation (e.g., if the previous generation contained 250 individuals, a 8-bit unsigned char is sufficient). Since, under most simulation scenarios, N e does not exceed 65,536 individuals, it is mostly sufficient to use the unsigned short integer (=2 bytes) datatype for saving memory.
Note that, beside all efforts to keep the memory consumption of AnA-FiTS as low as possible, the entire ancestry (or –as described above– part of it) is kept in main memory during simulation. This necessarily means that AnA-FiTS exhibits higher memory requirements than comparable programs.
High quality (pseudo-)random number generators are essential for simulation tools. To date, the Mersenne twister algorithm  is considered as state-of-the-art in pseudo-random number generation (PRNG). A SIMD-based version of the Mersenne twister (SFMT)  employs Streaming Single instruction, multiple data Extensions (SSE) instructions to generate random numbers at almost twice the speed as the original Mersenne twister. AnA-FiTS comes with a distribution of RandomLib , a C++ implementation of SFMT. For efficiency, RandomLib initializes an array with random bytes and then transforms the bytes of this array (in steps of 32 or 64 bits) into the required primitive data type (i.e., double, integer, boolean) for the requested range.
For forward simulations, we can exploit two application-specific properties to more efficiently use PRNGs: (i) usually for the ancestry, integer values of less than 4 bytes are required, and (ii) uniform random numbers are frequently drawn from the same interval (e.g., in case of the ancestry from [0,N), where N is the effective population size of the previous generation). When a large amount of random numbers with these two properties (small integer numbers, constant interval) is required, we can directly initialize a target array from the internal array of the PRNG and transform the minimal number of random bytes for the respective integer type into uniform random numbers for the required range using SSE instructions. For instance, if simulations require sampling 10 integers ∈[ 0,1000), we can copy 20 bytes (=10unsigned short) from the PRNG’s internal array into a location of the ancestry memory block and use vector instructions to transform the 20 bytes into 10 integers of the specified range. Without this optimization, RandomLib would produce 40 bytes of randomness and transform 4 bytes into one integer separately.
For forward simulators that work with a given number and pre-defined locations of polymorphic sites (such as ), it is straight-forward to represent haplotypes as bit arrays. If we want to simulate a sequence with a huge number of base-pairs (e.g., the human chromosome with 10 Mbp), bit arrays are not the data structure of choice any more because they become too sparse. Thus, in other forward simulator implementations, a haplotype is stored as a sequence of mutations with respect to an initial reference sequence. For instance, SFS_CODE uses self-balancing binary trees to represent the polymorphisms of a haplotype. Sorted binary trees allow for rapid insertion of novel mutations and fast extraction of sub-sequences which are required for creating recombinants.
An alternative data structure is used in forwsim that maintains sorted arrays of sequence positions. Since for AnA-FiTS, we want to store additional information about each mutation (e.g., generation of origin or base), we use a data structure that is based on references/pointers to mutation object instances. When sorted arrays are used, sub-sequence replication for creating recombinants can be efficiently implemented using the memcpy() system call. This is because finding coordinates within the sequence via a binary search has the same time complexity of O(l o g(n)) as searches in a sorted tree.
Additionally, changing the array size induces a smaller memory allocation overhead than allocating additional nodes in sorted trees. The main drawback of sorted arrays is the insertion cost of . In practice however, obtaining the correct relative position via a binary search is usually more expensive than executing an insertion as long as insertions are implemented via the memcpy() system call. Note that, the memcpy() system call is also vectorized via SSE intrinsics. Finally, the number of insertion operations is relatively small in AnA-FiTS, since we only simulate non-neutral mutations forward in time.
We experimented with various sorted tree implementations, for instance, the C++ STL std::set implementation or Judy arrays by , which are based on cache-optimized 256-ary trees. For AnA-FiTS, we found that the tree data structures did not outperform our sorted array implementation.
In forward simulations, the number of distinct haplotype instances containing the sorted array of mutation references and further meta-information and the number of mutation instances changes dynamically at a rapid pace. Thus, dynamic memory allocations (repeated calls to malloc () and free()) that are invoked in the inner parts of nested loops quickly dominate runtimes. As mentioned in the previous section, AnA-FiTS uses block allocation, instead of dynamic allocation whenever possible. Apart from being faster this also prevents heap fragmentation.
One approach to handle the rapid turnover of haplotype and mutation instances more efficiently is to use free lists in conjunction with reference counting. The underlying idea is to keep track for each allocated instance, how often it is referenced in the current generation (in case of haplotypes) or in present haplotypes (in case of mutations). Instead of deallocating unreferenced instances, instances are appended to a list, the free list, from which new instances can then be allocated again, when required. However, this allocation strategy still uses more than 10% of total runtime.
The main challenge in implementing the BEG algorithm is to design an efficient method for extracting sequences from the graph.
For this task, we can use the following property: during the sequence extraction phase, all neutral mutations that occurred during simulation are known. Thus, bit arrays are the most adequate data structure for storing neutral sequences. After sorting mutations by their location, each mutation is assigned its corresponding index in the bit array. As a result, the cost of adding a mutation that is encountered during sequence extraction to a given sequence is reduced to merely setting a bit. Handling recombinations is computationally more expensive, since we have to determine the bit array indices of the start and end points for the surviving regions of the parent haplotype via a binary search. Carrying over the relevant mutations from parent haplotypes is implemented as a bit-wise or-operation.
This implementation is particularly efficient because the sequence output of the simulation can directly be written as bit array to a binary file. Thus, output files do not require a lot of disk space and programs that directly evaluate the output of a large simulation can efficiently parse and post-process this output data.
In this section, we evaluate the runtime performance of AnA-FiTS. In general, conducting a fair comparison between forward simulation tools represents a difficult task, since each simulator uses different assumptions (e.g., finite sites versus infinite sites or pre-defined mutation sites versus random mutation sites). For a meaningful technical analysis, we separately evaluate the two key aspects of AnA-FiTS: efficient implementation of non-neutral forward simulation and ancestry-based simulation of neutral mutations. For realistic use cases, both features independently contribute a certain amount of run time improvement.
For all runs discussed throughout this section, we set mutation and recombination frequencies to μ per base=rper base=2.5·10−8 which roughly correspond to the empirical rates for humans [27, 28]. For fixed μ and r, the runtime depends on the locus length L and the effective population size N e . All runs were executed on a 4×8 core machine (AMD Opteron 6174) with a total of 256 GB RAM. The execution times provided here are the result of 4 independent runs with different random number seeds. We discard the slowest runtime in order to correct for the impact of seed values that generate exceptionally expensive simulations and average over the remaining three.
Moreover, we validate AnA-Fits by comparing the output of our software to datasets generated with established simulation codes. Finally, we describe runtime gains for the simulation of partially neutral sequences, specifically in a whole-genome setting for human sequences.
Here, we evaluate the efficiency of the pure forward simulation part of our implementation and compare it to SFS_CODE. AnA-FiTS can simulate each mutation forward in time by choosing a constant small selection coefficient s≠0 for each mutation that is to be simulated. If s is small enough, it is effectively neutral. This way the BEG algorithm is not applied. Simulating close to neutral mutations (s:=0), forward in time, represents the most expensive case, since for large values of s the mutation quickly undergoes selection and for small negative values it is removed from the gene pool by selection. In both cases the mutation quickly vanishes which in turn reduces runtimes.
Figure 6 depicts the relative runtimes of both programs. Overall, AnA-FiTS is between 2.9 and 96.2 times faster than SFS_CODE. In all except 2 runs, AnA-FiTS is at least one order of magnitude faster than SFS_CODE.
The runtime improvements do not show a clear dependency on one of the two input dimensions (N e , sequence length). We assume that this is because, for certain steps, SFS_CODE switches between alternative implementations depending on the number of recombination/mutation events that are expected per generation. Further parameters that may influence the runtime differences between the codes are: (i) frequency of the cleanup steps in either code (fixed to 100 in AnA-FiTS since in AnA-FiTS no significant performance impact could be detected) and (ii) number of linked loci the sequence was split into (SFS_CODE only).
As already mentioned, the BEG algorithm is inspired by the postponed simulation of neutral mutations as implemented in forwsim. Thus, in order to evaluate the stand-alone efficiency of the BEG algorithm, we compare the runtime for simulating entirely neutral sequences between AnA-FiTS and forwsim. Note that, forwsim simplifies the Fisher-Wright model in two ways: (i) while it assumes finite sites, at most one mutation per site is allowed and (ii) haplotypes undergo at most one recombination event per generation. We omit a comparison between forwsim and AnA-FiTS that includes forward simulation of non-neutral mutations. The reason for this is that, forwsim assumes a fitness model that differs from the model implemented in AnA-FiTS and SFS_CODE: for each non-neutral site, the user has to specify selection coefficients for the two homozygous cases and the heterozygous case. Thus, forwsim needs to iterate over all non-neutral mutations of each individual and recompute the fitness for each individual as opposed to AnA-FiTS and SFS_CODE, where one simply multiplies the per-haplotype coefficients. In the final analysis, the non-neutral mutation model implemented in forwsim, is more powerful (balancing selection can be accommodated), but slower and has a hard limit with respect to the number of non-neutral mutations that can be simulated. In AnA-FiTS, the fitness of an individual is the product of the fitness effects f (where f=1−s with selection coefficient s) of all mutations on either haplotype.
We performed a validation analogous to the validation of SFS_CODE. Since under neutrality, summary statistics of datasets that are simulated using AnA-FiTS can easily be compared with the equivalent summary statistics generated by the popular coalescent software ms, we mainly validate datasets that have been simulated under neutrality. We also compare simulations with AnA-FiTS under selection to datasets produced by SFS_CODE.
To ensure that population genetic simulation software produces correct results, it is common to compare summary statistics for datasets produced by the software that needs to be validated to the respective summary statistics of widely-used and well-tested software packages. To date, ms is among the most cited population genetic simulation tools. Slight deviations in summary statistics may be expected, since for instance ms implements an infinite-sites model of mutation, whereas AnA-FiTS implements a full finite-sites model. We used AnA-FiTS to simulate chromosome-sized sequences. Thus, the effect of non-single nucleotide polymorphisms or multiple mutation events that give rise to the same genotype should not influence the respective summary statistics.
In addition to this standard validation setup, for AnA-FiTS we also have to show that (i) the BEG-algorithm yields correct results (which could be shown by a formal mathematical proof alternatively) and that (ii) the algorithm is correctly implemented. Thus, irrespective of the respective simulation scenarios we show, that all three simulation modes produce the same summary statistics as ms: (i) all mutations are simulated forward in time (referred to as AF-FOR), (ii) a part of the mutations is simulated forward in time (s:=0), while the rest is obtained via an ancestry-based simulation (referred to as AF-MIX), (iii) all mutations are simulated via ancestry-based simulation (referred to as AF-ANC),
We test the correctness of our algorithm and implementation for 5 specific scenarios employing the number of haplotypes, the number of segregating sites, the nucleotide diversity, and the site frequency spectrum as summary statistics. The scenarios are (for a present population size of N p ): (i) mutation only (no recombination), (ii) mutation and recombination, (iii) mutation, recombination, a doubling in population size taking place 0.2·N p generations ago, (iv) mutation, recombination, a with a factor-2-bottleneck taking place 0.25·N p generations ago, (v) mutation, recombination, with an exponential population growth setting in 0.5·N p generations ago.
To validate our software for simulations with selection, we compared the same summary statistics (see previous section) as obtained by AnA-FiTS to summary statistics for datasets simulated with SFS_CODE. To keep the computational effort within reasonable limits, we reduced the number of simulations to 15,000, the sequence length to 105 bp, the effective population size to N e =250, and set the mutation and recombination rate per sequence and per base pair to m:=r:=10−6 (sampling size retained at 50 sequences).
For SFS_CODE simulations, we simulated the 100 Kbp sequence as 20 linked loci of length 5 Kbp. As shown in the Figure 9, the distributions of summary statistics of AnA-FiTS show a high level of agreement to the respective distributions obtained with SFS_CODE.
Performance gains with ancestry-based simulation
N e = 1,000;L = 5·106
N e = 5,000;L = 106
Given these runtime improvements, coupled with the continuous advances in hardware, large-scale simulations of genomic regions at the chromosome level are becoming feasible. AnA-FiTS also allows for simulating multiple chromosomes (resp. large unlinked loci, one separate graph is generated per locus). If we reduce the fraction of non-neutral mutations to 10%, the entire (female) human genome (simulated as 23 unlinked loci with a total 3.037 ·109 bp for 10·N e generations) can be simulated for 100 individuals within 55.2 seconds, resulting in a total of 159,651 segregating sites. For 200 individuals (same parameters), our implementation requires 441.7 seconds, yielding a dataset with 323,673 segregating sites.
We discussed the design and optimization of a highly efficient simulator for the forward-in-time simulation of population genetic datasets. Our software is up to two orders of magnitude faster than competing state-of-the art software. Moreover, we described the BEG algorithm, that may further accelerate simulation for the common scenario, when a fraction of mutations does not have an impact on the fitness of individuals. Our algorithm outperforms a similar exploratory approach by up to one order of magnitude, while it is at the same time capable of retaining the entire observable history for each haplotype. We demonstrated that our software allows for forward-in-time simulation at the whole-chromosome and even whole-genome level within acceptable execution times.
Our novel graph structure (the BEG) may become useful for population geneticists. Graph metrics can be computed on this structure, and the surviving material of any ancestrial haplotype can be reconstructed for any given point in time. Alternatively, local trees could be extracted from the BEG for each sequence segment that did not undergo recombination.
We believe that the availability of AnA-FiTS will lead to an increased usage of forward simulation in population genetics, given its importance for research on natural selection, and in the context of approximate Bayesian computation. Thus, our tool can handle the analytical challenges of the post-genomic era.
Pseudo random number Generator.
We wish to thank P. Pavlidis at HITS for his valuable input during the early phases of this project. We would also like to thank J.D. Jensen and G. Ewing for useful comments on this manuscript. We thank two anonymous reviewers for their comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.