SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations

Background Social behavior has long been known to influence patterns of genetic diversity, but the effect of social processes on population genetics remains poorly quantified – partly due to limited community-level genetic sampling (which is increasingly being remedied), and partly to a lack of fast simulation software to jointly model genetic evolution and complex social behavior, such as marriage rules. Results To fill this gap, we have developed SMARTPOP – a fast, forward-in-time genetic simulator – to facilitate large-scale statistical inference on interactions between social factors, such as mating systems, and population genetic diversity. By simultaneously modeling genetic inheritance and dynamic social processes at the level of the individual, SMARTPOP can simulate a wide range of genetic systems (autosomal, X-linked, Y chromosomal and mitochondrial DNA) under a range of mating systems and demographic models. Specifically designed to enable resource-intensive statistical inference tasks, such as Approximate Bayesian Computation, SMARTPOP has been coded in C++ and is heavily optimized for speed and reduced memory usage. Conclusion SMARTPOP rapidly simulates population genetic data under a wide range of demographic scenarios and social behaviors, thus allowing quantitative analyses to address complex socio-ecological questions.


Theoretical Expectations
As the output of SMARTPOP is non-deterministic, alternative checks based on mathematical results from theoretical population genetics have been developed to confirm that the system is behaving correctly.
Results from a large number of simulations were compared against values expected under coalescent theory [1]. For example, the mean and variance of the time to the most recent common ancestor (TMRCA) assuming a constant population size [2,3,4] is: The time to the most recent female common ancestor was simulated for mitochondrial DNA (mtDNA) in a constant sized population with random mating to approximate the Canning's model (i.e., the theory for which the equations above were derived [5]). Figure 1 shows that the mean and variance of 1,000 simulations do not vary from theoretical expectations (Student's t test: P mean = 0.95, P variance = 0.70). This test procedure was repeated for both male and female lineages (i.e., mtDNA and Y chromosome) for a range of population sizes.
Second, the simulation model distributes the number of children per woman as a Poisson random variable. We confirmed that simulations produce the correct distribution (i.e., a mean and variance of 2 for constant sized populations).
Third, mating systems were tested by comparing the observed and expected number of mates per individual. Under monogamy, each individual must have no more than one mate. Under polygamy, the mean number of mates must be close to one with some non-zero variance.

Comparisons with Coalescent Simulators
Coalescent simulators, such as MS [6] and SIMCOAL [7], are used widely in the community to produce simulated population genetics datasets. As such programs reconstruct genetic lineages backward-in-time, they necessarily have strong assumptions (e.g., random mating). To validate our forward-in-time simulator, we compared data simulated by SIMCOAL and SMARTPOP under random mating for defined sets of parameters (e.g., mutation rate and sequence length). To ensure direct comparability, SMARTPOP simulations were first allowed to reach equilibrium by running them for a large number of generations beyond the expected TMRCA.
The two models di↵er in a second key feature: the backward-in-time process is controlled by the e↵ective population size, while the forward-in-time process is controlled by the census population size. To account for this di↵erence, each SMARTPOP simulation was run under a random census population size, the corresponding e↵ective population size was inferred from the resulting genetic data, and a paired SIMCOAL simulation was run with this value. The mean and variance of several genetic diversity estimators were then compared for both datasets. The two methods produce highly concordant results (Figures 2 and 3).

Metamorphic Testing
As software has increased in complexity, a new test procedure (metamorphic testing [8]) has been developed to address the problem of validating complex software systems. Within the last few years, metamorphic testing has begun to be applied to bioinformatics software [9,10]. The approach leverages scaling properties of the simulation model ("metamorphic relations"), for which a defined change in the output can be predicted for a defined change in the input.
The primary challenge is the identification of metamorphic relations appropriate to the problem. Theoretical population genetics suggests several scaling relations. The following cases have been tested in SMARTPOP: • If the mutation rate is multiplied by a factor x, then the diversity estimators S, ✓ w and ✓ ⇡ scale linearly with x. • If the e↵ective population size is multiplied by a factor x, then the diversity estimators S, ✓ w and ✓ ⇡ scale linearly with x. Because the coalescent comparisons described earlier were performed manually, only a relatively small set of parameters could be tested. Metamorphic testing allows the validation process to be scaled up to a large number of test parameters.
Mean values for 1,000 simulations were tested using a random set of starting parameters (e.g., population size and mutation rate) with x drawn from a random discrete (integer) uniform distribution, Unif (1,5). In all cases, di↵erences between the means of x ⇥ E(parameter) and E(x ⇥ parameter) were less than 10%, thus confirming that the metamorphic relations hold for the simulation software.

Summary Comparison
To speed up analyses, several summary statistics are calculated directly within SMARTPOP. To validate these estimators, a series of checks were implemented.
Because most related programs were designed to handle small sample sizes, the population-level dataset simulated by SMARTPOP was sampled randomly. DNA sequences for these simulated individuals were imported into COMPUTE [11] and ARLEQUIN v. 3.5 [12], and the same set of summary statistics returned by SMARTPOP was calculated. The values obtained by SMARTPOP, COMPUTE and ARLEQUIN were then compared across 1,000 simulated datasets (Table 1). Di↵erences in values were negligible -integer summaries were identical; non-integer summaries exhibited extremely low variance due to rounding error. All exceptions (✓ w , ✓ ⇡ and Tajima's D) result from the implementation of slightly di↵erent equations.
Homozygosity Theta ⌘ c: Not comparable as Tajima's D is a function of ✓w and ✓⇡, both of which di↵er in COMPUTE.

Model Implementation
Forward-in-time simulators produce individuals and their DNA sequences using an explicit set of demographic, social and genetic models. While we use models that have wide acceptance in the field, their exact implementation has a direct impact on the simulations. The following sections describe these models in more detail, but much more extensive information is available on the project website (http://smartpop.sourceforge.net).

Demographic Models
Population size can either be constant or change through time, as defined by the user. Population size is controlled internally via the number of o↵spring. Let N t be the size of the parent generation. The number of o↵spring is then calculated using the following demographic function with size change variables a, b and c defined by the user: This is a general population size change equation that allows linear, exponential and logistic growth and decline. Once the total size of the next generation is defined, each female (or male in the case of polyandry) is assigned a random number of o↵spring drawn from a Poisson distribution conditioned on the desired population size. At an individual level, the number of o↵spring for each female (or male) is a Poisson random variable constrained by the fact that exactly N t+1 o↵spring are born in the population as a whole.
Social Models SMARTPOP currently allows several mainstream mating systems to be run. •

Defining Starting Conditions
Unlike backward-in-time methods, such as the coalescent, forward-in-time simulations are highly dependent on their starting point. This problem has been raised by other studies [13], but there is little consensus on how to define the initial population. Most programs start from a 'null' population comprising individuals that are genetically identical [14]. In such cases, it is typically advised to run the simulations "long enough" (i.e., some long, but undefined period of time) for the system to reach an equilibrium state. This long 'pre-run' stage is often discarded as a burn-in phase, but can require substantial runtime, especially for large populations.
Other programs allow simulations to start from a real population genetic dataset [15], but this requires preexisting data and is also meaningful only for inferences about the future of a population, not its past.
SMARTPOP provides multiple methods to define a simulation's starting point depending on the user's needs and research questions. By default, a 'null' population of identical individuals is used. This traditional approach is acceptable if users can tolerate long runtimes, and importantly, the assumption of starting from a genetic equilibrium is appropriate for their study system. However, these two assumptions are now critically limiting for many population genetic inference settings.
To speed up simulations, SMARTPOP o↵ers an optional bu↵ering feature. This enacts accelerated evolution using a high mutation rate, which stops after a user-defined diversity threshold is reached. This period of accelerated evolution is then discarded as a burn-in, and the genetic dataset returned by SMARTPOP starts from this point in the run. Bu↵ering is performed independently for each simulation to ensure di↵erent random starting points.  Table 2 presents the time in generations taken by each set of simulations to reach equilibrium (defined here as |✓ ⇡ (t) ✓ ⇡ (1)| < 1). The final column lists the CPU time in seconds to run 100 simulations to equilibrium using the bu↵ering phase. If the bu↵ering threshold is set close to the mean pairwise distance at equilibrium (e.g., ✓ = 35), the simulation evolves to the equilibrium state faster than if no bu↵ering were used (red). However, if the threshold is far from the equilibrium point (e.g., ✓ = 100), the simulation can take longer to reach equilibrium. In terms of runtime, simulating this example system with bu↵ering of ✓ = 35 is twice as fast as starting from the null 'all individuals identical' set. To put this in perspective, optimal bu↵ering could save 1.5 hours of runtime over a standard run of 1,000,000 simulations.  This discussion raises the issue of equilibrium and its appropriateness for biological modeling. All populations are dynamic -they move, split, merge, grow and contract. Processes that are strongly time localized can have genetic e↵ects over a much longer timeframe (see Figure 1B and 1C of the main article). The modularity of SMARTPOP enables such dynamic studies by saving and reloading simulations with di↵erent parameters. This allows users to define any starting point that is the outcome of some prior evolutionary process. For example, it is possible to simulate a population of 100 settlers that recently migrated from a larger population of size 200. One way to do this would be to simulate a population (n = 200) until it reaches equilibrium (i.e., a long time), save the simulated populations, and then reload them but this time sampling only 100 individuals. The following command lines show this example: ./smartpop -p 200 -t 20 -nstep 50 -sample 50 -sizeMt 3200 -save file1 ./smartpop -load file1 -p 100 -t 20 -nstep 50 -sample 50 -o fileresult -mtdiv However, this process is time consuming, especially if it requires a large population to reach equilibrium. Bu↵ering provides an alternative approach. Accelerated evolution can be used to reach a much higher diversity than the equilibrium state, thus mimicking a small population that recently separated from a large one (such as might occur during a settlement event). Figure 5 shows three sets of simulations for a monogamous population of size 100 with the same parameters as the example above, but with di↵erent starting points: the null 'all individuals identical' set (black), bu↵ering with ✓ = 75 (red) and down-sampling as described above (blue). The null 'all individuals identical' method cannot be used to model a settlement event, and is shown here solely to emphasize that all simulations eventually reach the same equilibrium point. Note, however, that bu↵ering creates a diversity dynamic that is concordant with the sampling method, but bu↵ering is much faster (1.03 vs 2.36 s).
These simple examples illustrate the speed gain that bu↵ering can provide for di↵erent scenarios. As the simulated population size increases, this gain becomes even more pronounced and bu↵ering may become necessary to keep runtimes to an acceptable level.