AdmixSim 2: a forward-time simulator for modeling complex population admixture

Background Computer simulations have been widely applied in population genetics and evolutionary studies. A great deal of effort has been made over the past two decades in developing simulation tools. However, there are not many simulation tools suitable for studying population admixture. Results We here developed a forward-time simulator, AdmixSim 2, an individual-based tool that can flexibly and efficiently simulate population genomics data under complex evolutionary scenarios. Unlike its previous version, AdmixSim 2 is based on the extended Wright-Fisher model, and it implements many common evolutionary parameters to involve gene flow, natural selection, recombination, and mutation, which allow users to freely design and simulate any complex scenario involving population admixture. AdmixSim 2 can be used to simulate data of dioecious or monoecious populations, autosomes, or sex chromosomes. To our best knowledge, there are no similar tools available for the purpose of simulation of complex population admixture. Using empirical or previously simulated genomic data as input, AdmixSim 2 provides phased haplotype data for the convenience of further admixture-related analyses such as local ancestry inference, association studies, and other applications. We here evaluate the performance of AdmixSim 2 based on simulated data and validated functions via comparative analysis of simulated data and empirical data of African American, Mexican, and Uyghur populations. Conclusions AdmixSim 2 is a flexible simulation tool expected to facilitate the study of complex population admixture in various situations. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04415-x.

Backward-time simulation tools are based on the idea of tracing back to the most recent common ancestor of currently surviving individuals and then attributing genetic information to each individual on the coalescent tree. These tools are very efficient and memory-saving because only existing individuals are considered. However, they are not powerful in complex demographic scenarios. Coalescent-based methods are not suitable for cases involving questions of interest that require complete ancestral information since only the ancestral information of currently surviving individuals is saved. Unlike coalescent-based methods, forward-time approaches start from some initial reference populations and then implement specific evolutionary events generation by generation, such as recombination, de novo mutation, natural selection, and migration. Many efficient forward-time tools have been developed recently. The National Cancer Institute's Genetic Simulation Resources (GSR) website (https:// surve illan ce. cancer. gov/ genet ic-simul ation-resou rces/ packa ges/) listed 178 simulators. These tools were initially designed for various modeling emphases and objectives, so that they have been widely applied to different genetics studies, such as the human epidemiology (simuPOP [7,8]), the ecology of wild species (Nemo [9]), and the genome-wide association studies (genomeSIM [10], genomeSIMLA [11]). Among forward-time simulators, some possess functions to simulate a process of population admixture directly, such as admix'em [12], SELAM [13], SLiM [14][15][16], simuPOP, fwdpp [17], and forqs [18]. They are flexible and computationally efficient in fields of their respective designing objectives and also workable on other demographic scenarios. In particular, admix'em focuses simulations on two-way admixture. SELAM is especially useful in simulating admixture resulted from a founding population with all subpopulations being co-existing throughout the simulation. SLiM and forqs are both powerful in simulating the evolution of quantitative traits when assuming various demographic histories. simuPOP and fdwpp are especially suitable for users who are good at Python or Cpp programming. In addition, SLiM 3.3 [16], GeneEvolve [19], simuPOP, and AdmixSim [20] also support real genomic data as input, this feature is helpful in many applications, especially when mutation-drift equilibrium is to be approached which can be time-consuming. Taken together, there is room for developing tools specifically for modeling complex population admixture. Here, we introduce a highly flexible and user-friendly forward-time simulation tool called Admix-Sim 2. It allows simulating real sequence data in a series of complex admixture processes with various evolutionary driving forces including de novo mutation, drift, recombination, and natural selection. Notably, AdmixSim 2 can also be applied in admixture scenarios for sex chromosomes and non-human species. The source code of AdmixSim 2 is freely available at https:// www. picb. ac. cn/ PGG/ resou rce. php or https:// github. com/ Shuhua-Group/ Admix Sim2.

Implementation
The data structure of AdmixSim 2 can be mainly divided into five layers: segment, chromosome, individual, single-generation population, and multiple-generation population. AdmixSim 2 begins with a set of individuals from ancestral populations, corresponding sex settings, chromosome information, and a user-defined demographic model. Each Individual is a diploid and constituted by two chromosomes. In the founding generation, each chromosome of ancestral individuals is assigned with a unique integer-type label ranging from 0 to 2N − 1 N = k i=1 N i , N i is the number of individuals of the ith ancestral population, i = 1, 2, …, k). Chromosomes of a certain population take consecutive labels. Following the demographic model settings, for each population, the probability of being sampled to be a parent of newly generated individuals is proportional to the corresponding admixture proportion. More specifically, when there are three ancestral populations, i.e., pop1, pop2, and pop3, their admixture proportions are 0.2, 0.1, and 0.7 respectively. To make new individual data, we first generate a random number between 0 and 1. One individual of pop1 is selected if the random number is less than 0.2, while one individual of pop2 is selected if the random number is larger than 0.2 and less than 0.3, otherwise, one individual of pop3 is selected. Within each population, the probability of being sampled for each individual is proportional to the corresponding fitness. Each chromosome of newly generated individuals consists of a list of segments. Each segment is identified by the start physical position, the end physical position, and the label it originates from. The new generated individual inherits corresponding segments from parents with additional segments resulted from recombination. Asides from the ancestral segments, each chromosome has two additional lists recording the recombination breakpoints and de novo mutation points, respectively, and one map recording the physical positions under selection and corresponding states of the allele. The generation of new recombination and mutation points can be modeled as the Poisson process and assigned along the chromosome based on the user-defined recombination rate and mutation rate, respectively. Each population in AdmixSim 2 is constituted by individuals spreading several generations specified in the demographic model.
Next, we provide a brief description of the workflow of AdmixSim 2. Detailed information can be found in the user's manual. AdmixSim 2 follows the extended Wright-Fisher (WF) model, which includes diploid individuals and non-overlapping, discrete generations. Unlike the strict WF model, the population size over generations can fluctuate and can be defined by users. Figure 1 illustrates the general flow chart of AdmixSim 2. Briefly, four input files are required for ancestral haplotype data, individual information, single nucleotide variation (SNV) information, and model description. The model description file defines the overall simulation processes by combining a series of modules of a standard format for every single process. Even though it is possible to use models in combination for extensive admixture scenarios, modifications of a single model are also applicable and flexible. With the input of these four files, AdmixSim 2 implements the whole simulation process defined in the model description file across generations. At the end of the simulation, haplotype data of individuals of specified populations are provided, and users can flexibly sample a certain number of individual data for output. Output data of any generation are available for all of the simulated populations. The output individual information file, haplotype data file, and updated SNV information file are in a similar format as the input file, which provides convenience and flexibility for subsequent simulations across different demographic models. Moreover, the haplotype data as provided by AdmixSim 2 are helpful for follow-up analysis and revealing the properties of real genomic data, which are not available in many other simulation tools. In addition to the aforementioned output files, AdmixSim 2 also generates three additional outputs, frequency of alleles under selection, ancestral track information, and a log file.
Admixture, de novo mutation, and natural selection are the evolutionary events that we mainly considered in AdmixSim 2. Corresponding parameters can be set in input files or using the command-line interface. Initial settings for sex in the individual information file define monoecious and dioecious individuals for simulation. In the case of monoecious individuals, we logically sample two different individuals as parents of each offspring in the next generation. The probability that samples an individual out is proportional to the individual's fitness. The fitness of an individual is calculated as one plus the sum of selection coefficients of all selection conditions satisfied. The negative value is rescaled to zero. Each parent contributes one gamete to the offspring with mutations and recombination. The recombination rate and mutation rate can be set as uniform or nonuniform (locus-specific) in either the command Colors indicate different ancestral populations, squares represent males, circles represent females. Red spots represent mutations and red crosses represent recombinations. AdmixSim2 requires four input files separately recording ancestral haplotype data, individual information, SNV information, and demographic model. During the simulation, each individual in the current generation undergoes mutation and is then sampled to be a parent of offspring in the next generation. The probability of being included in this sample is proportional to the individual's fitness. Each parent contributes one gamete to the offspring after recombination. At the end of the simulation, there are six output files. The first three take nearly the same format as the corresponding input, and they can be used for subsequent simulations with a new demographic model line or the SNV information file. For selection, AdmixSim 2 can simulate the case of a single locus, multi-locus (continuous or discrete), population-specific, and sexspecific selection. Here, the sex-specific selection is incorporated by assigning different selection coefficients for two sexes. Sex-specific population sizes and sex-specific admixture proportions are supported as well. The selection coefficients are allowed to change over generations. Users can also simplify the simulation by using a single parameter setting to avoid complex evolutionary processes with recombination, de novo mutation, and natural selection, respectively.
We implemented some features for the efficiency of run time and memory usage. First of all, we record the recombination segments rather than the whole haplotype data, which is much more memory-efficient as stated in forqs. The time consuming is insensitive to the number of simulated SNVs and remains relatively small even with a large recombination rate and mutation rate. Second, we discard the population information of the previous generation after simulating the current generation unless it is specified to be output. Generally speaking, there are data of two generations existing simultaneously in the simulation process. Finally, we record the haplotype data of ancestral populations into memory after simulating all the generations instead of at the beginning. The ancestral haplotype data are used to extract the segmental sequence based on the segment information carried by each output haplotype. We compared the basic features and abilities of AdmixSim 2 with QuantiNemo 2 [21], SLiM 3.3, simuPOP, GeneEvolve, SELAM, forqs, and admix'em ( Table 1). The results showed that AdmixSim 2 is an irreplaceable tool for simulating population admixture with real genomic data.

Admixture, recombination, and mutation
To validate the basic functions of our simulator, we used admixture models of three typical admixed populations (African Americans, Mexicans, and Uyghurs) which are of 2, 3, and 4 distinct ancestral originations, respectively (Additional file 4: Table S1). The simulation model of each population is shown in Additional file 1: Figure S1 [22,23]. To simulate African Americans, we chose populations European and African from the 1000 Genomes Project (KGP) as reference. Totally 6,196,135 SNVs on chromosome 1 were used for simulation. We performed principal component analysis (PCA) with flashpca [24] (Fig. 2A) and supervised admixture analysis with ADMIXTURE [25] (Fig. 2C) of both simulated and empirical data. The results of PCA and admixture analysis were largely concordant between simulated data and empirical data, indicating the simulated data by AdmixSim 2 are a satisfactory alternative on a global level. We further calculated the segment length proportion of each ancestral population to validate the function of sampling patterns and recombination. The variation of these proportions was largely stochastic without significant deviation from the expected value as set in the admixture model (Fig. 2B). To validate the function of adding mutations, we set the mutation rate as 10 -8 per generation per site and compared the distribution of mutation counts of each output haplotype to the Poisson distribution (Fig. 2D). The results show that the two distributions were extremely close (p = 0.69, chi-square goodness of fit test). These results further demonstrated the simulated data by AdmixSim 2 are a satisfactory alternative on the level of local genomic regions. For the other two admixed populations, Mexicans and Uyghurs, the aforementioned features of simulated data were confirmed (Additional file 2: Figure S2 and Additional file 3: Figure S3).

Recombination and length distribution of segments of distinct ancestry in the analysis of X chromosome and autosomal data
The main difference is the recombination pattern in simulating X chromosome and autosome data. While simulating X chromosome data, we assume that recombination only occurs in females. Thus, the number of recombination events of the X chromosome is expected to be two-thirds of that of autosome theoretically. We simulated an admixed population with a population size of 5000 using pseudo data of 3 Morgan. The simulation was based on hybrid isolation (HI) model with the initial ancestral contribution of 4:1 for two ancestral populations (hereafter refers to Anc1 and Anc2, respectively) and ended at generation 50. The recombination rate was set as 10 -8 Morgan per base pair. There are differences between the numbers of recombination breakpoints of autosomes and those of X chromosome under the identical simulation model and parameter settings (Fig. 3A). The mean numbers of breakpoints of autosome and X chromosome were 149.66 and 100, respectively, thus satisfied the two-thirds ratio as expected. Next, we examined the consistency between length distribution of ancestral segments in simulation and that in the theoretical deduction based on the method proposed in MultiWaver   [22]. The statistical significance was determined by the Kolmogorov-Smirnov test. As expected, the two distributions did not show statistical significance, which again demonstrated a fundamental capacity of AdmixSim 2 to simulate X chromosome data (Fig. 3B).

Natural selection
To verify the frequency variation tendency of alleles under selection, we simulated a twolocus segment under a selection amenable to the additive model with an initial freqncy of 5% and a selection coefficient of 0.1. Under the additive model, the fitness is 1, 1 + s, and 1 + 2 s for individuals carrying 0, 1, and 2 selected haplotypes, respectively, where s denotes the selection coefficient. During each simulation of 500 repeats, we recorded the selection frequency at each generation. Using x t to denote allele frequency at generation t, we have the theoretical frequency at generation t + 1 as Given the condition of x 0 = 0.05 and s = 0.1 , the mean value of allele frequencies across 500 repeats was compared to the expected value at each generation (Fig. 4A). The results of simulation and theoretical expectation are consistent regarding the frequency of allele under selection (p = 0.999, Kolmogorov-Smirnov test).
To test the number of generations of an allele being fixed under selections with different initial frequencies and coefficients, we simulated 1 centiMorgan chromosome and used the HI model with an initial contribution of 1:1 from two ancestral populations. Each ancestral population had 100 individuals in the founding generation and the population size of the admixed population was 5,000 throughout the simulation. Recombination rate (Morgan per base pair) and mutation rate (per generation per site) were both set as 10 -8 . We considered four distinct initial frequencies of the segment under selection, 5%, 10%, 15%, and 20%. The selection coefficient varied in the change of 0.01-0.1, taking 0.01 as a step (Additional file 5: Table S2). For each combination of initial frequency and selection coefficient setting, we performed 100 replications, and then calculated the average of fixation time with standard error (Fig. 4B). As expected, the fixation time decreased with the increase of selection coefficients. Besides, there was a negative correlation between the initial frequencies and the time in generations taken to reach a fixation state. These results revealed a The histogram represents the simulated segment length distribution and the red curve is the theoretical exponential distribution. The theoretical distribution fits quite well with the simulated one. Moreover, X chromosomes possess a smaller effective recombination rate compared to autosomes powerful performance of AdmixSim 2 in terms of simulating natural selection during admixture.

Time and memory cost
Further, we evaluated the time cost of AdmixSim 2 using a HI model with equal contribution from the two ancestral populations. Six distinct factors were considered in the evaluation by varying values of one and holding the rest constant, including chromosome length, recombination rate, mutation rate, admixed population size in simulation, end generation, and a number of loci under selection (Additional file 6: Table S3). The time cost exhibited an approximately linear increase with the value of the condition rising and was relatively low (Fig. 5). Later, we compare the runtime and memory cost of AdmixSim 2 and SLiM 3.3 [16] using the population Yoruba in Ibadan, Nigeria (YRI) from the KGP dataset (Fig. 6). Here, for simplicity, we simulated the case that

Discussion
Since AdmixSim 2 was designed to focus on simulating the admixture process, the pre-admixture data can be generated from different strategies. In human genetics, users can use the genetic data from public data sets like the KGP, the Estonian Biocentre Human Genome Diversity Panel (EGDP), the Simons Genome Diversity Project (SGDP), and the International HapMap Project. Since AdmixSim 2 can simulate the scenario that the ancestral populations evolve same generations as the admixed population, it is practical to use the populations from these public datasets as ancestral populations. Besides, populations from these datasets could be used as proxy populations in local ancestry inference. With the advance of genotyping and technologies, the ability to generate large amounts of sequence data in a relatively short amount of time is helping to enable a wide range of genetic analysis applications. Therefore, the pre-admixture genetic data can be more easily obtained in the future owing to advanced sequencing technologies. More generally, users can use coalescent approaches like msprime [26] [28] based on the KGP data. Last but not least, the output haplotype data of AdmixSim 2 can also be used as the input of further simulations. The first three output files (haplotype data, individual information, and updated SNV information) of AdmixSim 2 take the same format as the corresponding input, and they can be used for subsequent simulations with a new demographic model. AdmixSim 2 record the recombination segments rather than the whole haplotype data, which brings a proportion of efficiency for the simulation. The way of segment representation in AdmixSim 2 is similar to that implemented in forqs, but there is some slight difference. Each haplotype chunk in forqs is represented by two numbers (position, id): the position where it begins, and the identifier of the founding haplotype from which it is derived. For AdmixSim 2, each segment is identified by the start physical position, the end physical position, and the ancestral label it originates from. Despite this difference is not seemly substantial, we believe specify an ending physical position is more convenient to calculate the selection fitness and extract segmental sequence data without indexing the next segment. Besides, we noticed that forqs is mainly used to simulate the scenarios that recombination and/or natural selection on polygenic quantitative traits within a relatively small number of generations, while the emphasis of AdmixSim 2 is on simulating complex population admixture events more flexibly. Previous studies have demonstrated that a segment-based simulator is not as efficient as array-based simulators (like SLiM) when involving demographic events taking place over thousands of generations [18]. Nevertheless, we here mainly focus on the studies of modern humans, and the history usually much less than 1000 generations for typical admixed populations of modern humans. The previous version, AdmixSim, has been used in many evolutionary and population genetic studies, including those exploring the genetic history of Xinjiang's Uyghurs [23], validating statistical tool CAMer [29], MultiWaver series software [22,30], and Libgdrift [31]. We summarized the similarities and differences between AdmixSim and AdmixSim 2 ( Table 2). We expect AdmixSim 2 to be widely used in population admixture studies, such as modeling complicated demographic history, validating methods for local ancestry inference, and dating population admixture. In addition, autosomal and sex chromosomal data are both supported. This is particularly useful in studying sex-biased admixture.
Nevertheless, there is still room for improvement in AdmixSim 2. As in many other programs, we here also assume non-overlapping and discrete generations, which may have some limitations in practical applications. It is possible to consider overlapping and continuous generations in future versions. In addition, de novo mutation is not considered for determining individual fitness during simulation of natural selection, and the potential effect on admixture inference is to be further evaluated. Furthermore, the tree sequence recording (TSR) algorithm [32] implemented in SLiM 3.3 provides the necessary information to detect coalescence events and construct the genealogical tree at each chromosome position, although applying the TSR algorithm has a large impact on the performance of models, in terms of both runtime and memory usage (page 40 of SLiM 3.3 manual). The TSR algorithm has also been applied into many other simulators like msprime and inference methods like tsinfer [33]. Nonetheless, there is no single simulation tool is ideal for all cases and some other simulation tools are suitable for a few special scenarios. For example, in the framework of selection, SLiM 3.3, SELAM, and admix'em consider the cases including phenotypic selection, epistatic selection, or balancing selection. Although AdmixSim 2 does not cover the comprehensive scenarios of mate choice as some other simulators, it implements sex-biased admixture (different admixture proportions for two sexes) and sex-specific selection (different selection coefficients for two sexes).