SimBA: simulation algorithm to fit extant-population distributions
- Laxmi Parida^{1}Email author and
- Niina Haiminen^{1}
https://doi.org/10.1186/s12859-015-0525-0
© Parida and Haiminen; licensee BioMed Central. 2015
Received: 7 November 2014
Accepted: 4 March 2015
Published: 14 March 2015
Abstract
Background
Simulation of populations with specified characteristics such as allele frequencies, linkage disequilibrium etc., is an integral component of many studies, including in-silico breeding optimization. Since the accuracy and sensitivity of population simulation is critical to the quality of the output of the applications that use them, accurate algorithms are required to provide a strong foundation to the methods in these studies.
Results
In this paper we present SimBA (Simulation using Best-fit Algorithm) a non-generative approach, based on a combination of stochastic techniques and discrete methods. We optimize a hill climbing algorithm and extend the framework to include multiple subpopulation structures. Additionally, we show that SimBA is very sensitive to the input specifications, i.e., very similar but distinct input characteristics result in distinct outputs with high fidelity to the specified distributions. This property of the simulation is not explicitly modeled or studied by previous methods.
Conclusions
We show that SimBA outperforms the existing population simulation methods, both in terms of accuracy as well as time-efficiency. Not only does it construct populations that meet the input specifications more stringently than other published methods, SimBA is also easy to use. It does not require explicit parameter adaptations or calibrations. Also, it can work with input specified as distributions, without an exemplar matrix or population as required by some methods. SimBA is available at http://researcher.ibm.com/project/5669.
Keywords
Background
In many studies, it is important to work with an artificial population to evaluate the efficacy of different methods or simply generate a founder population for an in silico breeding regimen. The reader is directed to [1] for a review of computer simulations for population and evolutionary genetics. The populations are usually specified by a set of characteristics such as minimum allele frequency (MAF) distribution, linkage disequilibrium (LD) distribution and population fixation indices (F _{ ST } or G _{ ST }) (see [2] for detailed descriptions). For instance, in the context of optimizing marker assisted strategies in breeding, the founder population (collection of germplasm lines) could be simulated. Similarly, population of elite lines of small biparental families as in small grain cereals could be simulated for early yield trials, i.e. the stage at which the largest number of individuals with phenotype data are available. Applications of simulation in plant and animal breeding are discussed by [3]. A generative model to simulate the population, i.e., evolving a population over time, is usually rather difficult since the different parameters (such as selection, ancestor population sizes, mutation and recombination rates etc), as well as the “breeding” regimens, are not well understood and almost impossible to estimate effectively. Examples of such methods include those by [4,5]. The non-generative models, on the other hand, do not evolve the population, and the methods often start with a sample population having the desired characteristics and perturb it, either by a regimen of recombinations between the samples or local perturbations. Examples of such methods include [6-8].
EASYPOP [4] and simuPOP [5] are examples of simulators based on an underlying generative model. EASYPOP provides a variety of mating systems and migration and mutation models, while simuPOP can be used to simulate arbitrary non-random mating models. In contrast to these, HapSim [6], SIMLD [7], and epiSIM [8] software are based on non-generative models and simulate the given LD distributions through perturbations. HapSim perturbs an existing haplotype matrix, provided as input, as follows. It models a haplotype as a multivariate random variable with known marginal distributions and pairwise correlation coefficients. First it computes the covariance matrix from an existing sample, and then draws simulated haplotypes from a multivariate normal distribution with that covariance matrix. SIMLD starts with an initial population with the highest possible LD, decaying it over generations through recombinations to fit a desired profile. EpiSIM introduces a notion of Average of Adjacent LD Levels for generating LD patterns, and employs a Markov Chain process to simulate a chromosome. The reader is directed to [9] for a fairly comprehensive list of genetic simulator software systems. The problem we address in this paper is precisely defined as:
Problem 1.
The task is to generate a stratified population (i.e., with d subpopulations) of n diploids (or 2n haploids) with m SNPs that satisfy the following specified characteristics: MAF distribution p, LD distribution r ^{2} and F _{ ST } or G _{ ST } (stratification) values.
Here we discuss SimBA which is a non-generative approach that has both stochastic and combinatorial components. We model the task as an optimization problem. Based on an algorithm presented in an earlier work [10], we adapt the solution to include multiple simultaneous steps, and introduce the population stratification component. The use of discrete methods enables SimBA to be optimized even in run-time, i.e., the algorithm is linear in the size of the output, thus extremely time-efficient. Additionally, the use of discrete problem-modeling lends unprecedented sensitivity to the algorithms. Subtle changes in the input distributions are observed as corresponding, accurate changes in the output distributions of allele frequency, linkage, and stratification values. Such sensitivity is hard to obtain in perturbation-based non-generative modeling as well as in generative modeling, since the biological processes that give rise to the resulting distributions are generally not well understood. We demonstrate this sensitivity by comparing with results from similar systems from literature. Also, SimBA does not require extensive population-specific parameter tuning or exemplar populations as starting points, unlike most non-generative methods.
Methods
Background
Thus \(0 \leq p_{s_{k}} = \frac {1 \pm \sqrt {1 - 2H_{s_{k}}}}{2} \leq 1\). Note that \(0 \leq H_{s_{k}} \leq 1/2\).
Constructing stratified populations with given MAF p, LD r ^{2} & F _{ ST }’s
- 1.
n, the total population size and d demes with the subpopulation (s _{ k }) sizes N _{1}, N _{2}, …, N _{ d }.
- 2.
MAF \(\overline p\) and LD r ^{2} distributions for the total population.
- 3.
Wright’s Fixation Index F _{ ST } for d−1 demes as \(F_{s_{1}T},F_{s_{2}T}, \ldots,F_{s_{d-1}T}\). \(F_{s_{d}T}\) is dependent on the specified d−1 values, hence not specified as input.
OUTPUT: Matrix M where each row is a haplotype and each column is a (bi-allelic) marker.
- 1.Construct the d demes separately as follows.
- (a)For deme k=1,2,..,(d−1), compute \(p_{s_{k}}\):$$\begin{array}{@{}rcl@{}} p_{s_{k}} &= &\min \left(\frac{1 - \sqrt{1 - 2H_{s_{k}}}}{2}, \frac{1 + \sqrt{1 - 2H_{s_{k}}}}{2} \right) \\ &= &\frac{1 - \sqrt{1 - 2H_{s_{k}}}}{2}, \end{array} $$(4)where \(H_{s_{k}}\) is \(\bar {p}(1-\bar {p})(1 - F_{s_{k}T})\). Then$$p_{s_{d}} = \frac {n\bar{p} - \sum_{k=1}^{d-1}N_{k}p_{s_{k}}} {N_{d}}. $$
- (b)
DEME(\(p_{s_{k}},r^{2}\)): Construct each deme, k=1,2,..,d, with \(p_{s_{k}}\) and LD r ^{2}.
- (a)
- 2.
Take the union of the d demes.
Constructing Deme(p,r ^{2})
- (1)
one at a time and
- (2)
without any backtracking.
Note that LD is defined as a pairwise constraint, thus a marker has m−1 LD characteristics (constraints) with respect to the other m−1 markers. Thus with m markers there are a total of m(m−1)/2 pairwise LD constraints. However, to keep the problem tractable, in practice only up to k<<m constraints are considered for each marker. The LD constraints are captured as follows. The columns of matrix M are constructed in some order j _{1},j _{2},..,j _{ m }. Thus while working on j _{ s }, columns j _{ l }, l<s are not altered, but provide constraints. It is critical to traverse the sparse constraint space in a strategic way. Let l be the distance between a pair of markers a<b, i.e., l=b−a. In our implementation, the l-distance is picked based on the r ^{2} between the pair a and b, relative to the r ^{2} values for the other distances from b.
Thus the problem of fitting whittles down to following subproblem, which is used iteratively to span the constraint space and construct the population.
Problem 2.
( k -Constrained Marker Problem ( k -CMP)) Given markers (columns) j _{1}, j _{2},., j _{ k }, and r _{1}, r _{2}, …, r _{ k }, and p _{ k+1}, the task is to generate column j _{ k+1} with MAF p _{ k+1} such that the pairwise LD with column j _{ l } is r _{ l } (as in Equation 1), l=1,2,..,k.
Outline of our approach to solving Problem k-CMP
Next, we move the 1’s in column j _{ k+1}, such that it simultaneously satisfies k conditions, to get a best-fit of \({D}^{\text {obs}}_{l}(j_{l},j_{k+1})\) to \(\overline {D}(j_{l},j_{k+1})\). To achieve this, we compare column j _{ k+1} with columns j _{ l }, l=1,2,..,k, that have already been assigned. Thus, first, for each pair of markers j _{ l },j _{ k+1}, compute the target deviation, \(D_{l}^{\text {target}}\), based on input. Then, move the 1’s in column j _{ k+1} of the output matrix, to get a best-fit to the targets \(D_{l}^{\text {target}}\), \(D_{2}^{\text {target}}, \ldots, D_{k}^{\text {target}}\) simultaneously.
Problem k-CMP: Hill Climbing
An algebraic approach to obtain an exact solution to k-CMP has been discussed earlier [10]. Thus an appropriate value of k for a specific class of population(s), can be estimated by using the algebraic method. Here we optimize the hill climbing algorithm presented in [10] by, taking several best simultaneous steps, instead of a single step.
For l=1,2,…,k, let \(G_{l} = {nD}_{l}^{\text {target}}\). Given the input matrix, number of constraints k, column j _{ k+1} with MAF \(p_{j_{k+1}}\) and target deviations G _{1}, G _{2},…,G _{ k }, the hill climbing algorithm is carried out in three steps. For a given k, a cost graph \({\mathcal {G}}_{k}\) is built as a pre-processing step. Then, for a given input, the cost graph is instantiated with the input and target signs to obtain \({\mathcal {G}}^{0}_{k}\). Then in Step 2, s moves are made to obtain \({\mathcal {G}}^{s}_{k}\). Finally the output (column j _{ k+1}) is simply read off \({\mathcal {G}}^{s}_{k}\).
Pre-processing: Construct the cost graph \({\mathcal {G}}_{k}\)
- Nodes.
The cost graph has 2^{ k } vertices, corresponding to the distinct binary patterns. For example, when k=2, the four distinct patterns (and vertices) Z are 00, 01, 10, 11.
- Edges.A directed edge is introduced between every pair of vertices. We first give the rationale for the cost function, which tracks the LD deviations exactly, for each pair of columns j _{ l } and j _{ k+1}, l=1,2,..,k. The choice of the cost function is critical in evaluating the “climb” in hill climbing process. There are four scenarios as below and the rationale for the cost is discussed in [10]:$$\begin{array}{@{}rcl@{}} c(0,1) &= &-1 \hspace*{15pt} \text{(Scenario I)}\\ c(1,0) &= &+1 \hspace*{15pt} \text{(Scenario II)}\\ c(0,0) = c(1,1) &= &0. \hspace*{15pt} \text{(Scenarios III \& IV)} \end{array} $$Using the cost function as defined, the directed edge Z _{1}→Z _{2} is labeled with the k-tuple as follows. Figure 2 gives an example for k=2.$$\begin{array}{*{20}l} {wt}_{Z_{1}Z_{2}} &=&\left(c(Z_{1}[1],Z_{2}[1]), \ldots,\right.\\ &&\left.c(Z_{1}[k],Z_{2}[k])\right) \text{and} \end{array} $$(6)$$\begin{array}{*{20}l} {wt}_{Z_{1}Z_{2}}^{\Sigma} &= &\sum_{l=1}^{k} {wt}_{Z_{1}Z_{2}}[l]. \end{array} $$(7)
Step 1: Instantiate \({\mathcal {G}}_{k}\) with input data and target to get signed cost graph \({\mathcal {G}}_{k}^{0}\)
Lemma 1.
For any k, a signed cost graph is acyclic.
Proof.
Clearly, if k=1, then the signed graph has no cycles. For k>1, since the nodes in the signed graph must have distinct k-patterns, once a transition is made from 0 to 1, for −ve sign (1 to 0 for + sign respectively), the directed path can not return back to 0 (1 respectively). Hence, there can be no cycles.
Step 2: Move at step s, to get to \({\mathcal {G}}_{k}^{s}\) from \({\mathcal {G}}_{k}^{s-1}\)
See Figure 3(b)-(c) for an example of moves. f(Z) is the number of Z0 patterns in the data (shown in red) and t(Z) is the number of Z1 patterns in the data (shown in green).
Lemma 2.
At each step s, \({\mathcal {G}}_{k}^{s}\) encodes \(p_{j_{k+1}}\).
Proof.
- 1.
Let n be the total number of samples, n _{ i } be the number of samples with i at column j, i=0,1. Then the following holds for each \({\mathcal {G}}_{k}^{s}\): \( f(Z), t(Z) \geq 0, \sum _{Z} (f(Z) + t(Z)) = n, \sum _{Z} f(Z) = n_{0}, \sum _{Z} t(Z) = n_{1}. \)
- 2.
For each node Z in \({\mathcal {G}}_{k}^{s}\), the sum f(Z)+t(Z) is invariant across all s.
Thus it follows that \({\mathcal {G}}_{k}^{s}\) encodes \(p_{j_{k+1}}\) at each step s.
Corollary 1.
\(p_{j_{k+1}}\), the MAF at marker j _{ k+1}, matches the input exactly.
In the toy example in Figure 3, n=22, n _{0}=14 and n _{1}=8.
Time complexity
The algorithm has two characteristics that make it efficient in time. Firstly, the algorithm does not backtrack, i.e., change the previous j _{1}, j _{2},.., j _{ k } columns, while considering column j _{ k+1}. Also note that \(|D_{l}^{\text {target}}| \leq p_{j_{k+1}}n\), because the possible range of the deviation from target, \(|D_{l}^{\text {target}}|\), for the pair of columns j _{ l },j _{ k+1} is \([0,min(p_{j_{l}}n,p_{j_{k+1}}n)]\). Therefore the number of 1’s that are moved (at most once) is bounded by \(p_{j_{k+1}}n\). Secondly, at each column it uses a greedy heuristic of only making a move that results in an improvement in the overall cost. In practice, we observe that these work very effectively in terms of achieving the target accuracies of the solution. The pre-processing step is independent of any data, hence it is computed off-line. Further, the nodes are sorted, using a hash table, by the maximum positive weight of the outgoing edges incident on this node.
Instantiating (Time \({\mathcal O}(n)\)).
Column j _{ k+1} is scanned once to instantiate cost graph \({\mathcal {G}}_{k}\) with the given k conditions to obtain \({\mathcal {G}}^{0}_{k}\). A node Z in \({\mathcal {G}}^{s}_{k}\) is defined to be active if both t(Z)>0 and f(Z)>0. Then:
Observation 1.
- (a)
The number of active nodes in the initial graph is linear: \({\mathcal {G}}^{0}_{k} \leq n\).
- (b)
If node Z is not active in the initial graph \({\mathcal {G}}^{0}_{k}\), then Z is not active in any of the subsequent graphs \({\mathcal {G}}^{s}_{k}\), for all s>0. Similarly, if node Z is active in \({\mathcal {G}}^{0}_{k}\), then Z is active in \({\mathcal {G}}^{s}_{k}\), for all s>0.
In the observation, (a) holds since a node is only active if at least one individual is assigned to it, and there are n individuals. Additionally, (b) holds since moves are only made between active nodes, corresponding to the existing individuals’ 0/1 assignments at columns j _{1}..j _{ k }; a node can not become active since we are not changing the values at columns j _{1}…j _{ k } in the process of generating column j _{ k+1}. By the above observation, it takes linear time to instantiate the problem. Additionally, a hash table is used to store the elements that can be accessed in constant time.
Lemma 3.
Each problem instance, handling one marker, is optimized in time \({\mathcal O}(n)\).
Proof.
Since the signed cost graph has no cycles, by Lemma 1, no move is ever undone in the hill climbing regimen. Each cell in column j _{ k+1} is touched no more than twice (once as from and once as a to). At each move, the candidate cell is obtained in \({\mathcal {O}}(1)\) time, since a hash table is used to store the elements. Thus each marker is handled in time \({\mathcal {O}}(n)\).
Results and discussion
We implemented the single-step hill climbing method described in [10], at each step choosing the move from the space of all possible moves that most improves the target fit. Although an individual can potentially be touched several times in the process of generating one column, in our experiments we have found each column to be generated in fewer than n moves. We also implemented the subpopulation construction described in Methods. Our hill climbing and subpopulation algorithms are collectively called SimBA.
Stratified population construction
We evaluated the methods on real human MAF and LD data provided by the International HapMap Project [13], in the LD Data collection: African ancestry in Southwest USA “ASW”, Han Chinese in Beijing, China “CHB”, Luhya in Webuye, Kenya “LWK”, Japanese in Tokyo, Japan “JPT” populations. For one set of experiments we used the phasing data, evaluating the ability of SimBA to reconstruct a population when a population with those constraints is known to exist. For another set of experiments, we used averaged linkage disequilibrium data, evaluating the ability of SimBA to deconstruct distributions into constraints, and then reconstruct a population with those constraints.
In the experiment shown in Figure 4(b), we evaluated the ability of the algorithm to deconstruct the distributions and then carry out the best-fit (i.e., it is not clear that an exact solution matrix exists, due to numerous possibly conflicting r ^{2} constraints as input). We binned the ASW population averaged linkage disequilibrium data (physical distance, r ^{2} pairs) into m=100 equal size bins, s.t. each bin corresponds to a range of 2 kb (data only included LD for marker pairs at most 200 kb apart). The r ^{2} values in each bin were distributed as targets across the m columns in order, s.t. largest values were assigned to columns with the largest possible r ^{2} limits (given the columns’ allele frequencies).
Again, SimBA produced a good fit to the average LD distribution. The MAF distribution fits exactly, by design (see Corollary 1) and F _{ st } fits very close to the target values.
Comparison with existing methods
We compared SimBA with existing methods for generating populations with specific LD and MAF distributions: epiSIM [8], HapSim [6], and SIMLD [7]. epiSIM only considers the LD between adjacent markers, therefore its input consists of r ^{2} specifying the first distance only. HapSim is a perturbation approach, requiring as input an existing matrix having the desired distributions. SIMLD is a non-generative forward-simulator, starting from a fixed matrix having maximum possible LD and decaying it over generations. To test the accuracy of SimBA against the existing methods, we ran all methods simulating the JPT/CHB data, and we also checked the sensitivity by simulating the ASW and LWK populations. Let matrix H denote the first m=100 biallelic markers in chr 22 of the respective HapMap population, from which the MAF and r ^{2} constraints were extracted. Each method was run as follows.
epiSIM was given as input the minimum and maximum MAF, the mean r ^{2} for adjacent markers, and MAF and r ^{2} for the first two markers. HapSim was given as input the matrix H. SIMLD requires several population-specific parameters, which the authors have specified for the JPT/CHB population. Since the setting of their parameter values is not well defined and requires knowledge of population history, we used these same parameters for the ASW and LWK populations. SIMLD was given the MAF per marker and average r ^{2} per distance. SimBA was given as input the MAF per marker and r ^{2} per each pair of markers.
The JPT/CHB results across ten independent runs in Figure 5(b-e) show that only SimBA and HapSim constructed populations having LD close to the target values. SimBA is the most accurate, as measured by the sum of squared distances from target r ^{2} values. The run time of SimBA is better than most other methods, only SIMLD is faster. SIMLD was run with hard parameters specified for the JPT/CHB data by the authors, and these pre-tuned parameters produce a fairly accurate fit. epiSIM only matches the first distance since other distances are beyond the scope of the algorithm. HapSim results are only close to the target for the first two distances. The sensitivity comparison in Figure 5(a) shows that SimBA is far more accurate in distinguishing between the two similar populations than any other method. The other methods also produce two slightly different populations, but they are farther from the target values. HapSim does better on these populations than on the JPT/CHB data, while still being farther from the targets than SimBA.
Comparison of population simulation methods
epiSim | HapSim | SIMLD | SimBA | |
---|---|---|---|---|
Matches MAF distribution | - | X | X | X |
Matches r ^{2} distribution | - | X | X | X |
Number of control parameters | 0 | 0 | 5 | 1 |
Population stratification | - | - | - | X |
Simulation time | 33 | 12 | <1 | 5 |
Conclusions
SimBA is a novel non-generative method for simulating populations with various specified distributions such as MAF, LD etc. We show that SimBA outperforms the other methods both in terms of accuracy and efficiency. It runs in time linear with the size of the output. Furthermore, similar but distinct input characteristics result in distinct outputs with high fidelity to the specified distributions. SimBA does not require extensive population-specific parameter tuning or exemplar populations as starting points, unlike most non-generative methods. SimBA executable and manual are available at http://researcher.ibm.com/project/5669.
Declarations
Acknowledgements
The authors thank Claude Lebreton for useful discussions, setting the context for the studied research questions.
Authors’ Affiliations
References
- Hoban S, Bertorelle G, Gaggiotti OE. Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet. 2012; 13:110–22.PubMedGoogle Scholar
- Balding D, Bishop M, Cannings C. Handbook of statistical genetics, third edn. Chichester, England: Wiley J. and Sons Ltd.; 2007.View ArticleGoogle Scholar
- Daetwyler HD, Calus MPL, Pong-Wong R, de los Campos G, Hickey JM. Genomic prediction in animals and plants: simulation of data, validation, reporting, and benchmarking. Genetics. 2013; 193(2):347–65.View ArticlePubMedPubMed CentralGoogle Scholar
- Balloux F. EASYPOP (Version 1.7): a computer program for population genetics simulations. J Hered. 2001; 92:3.View ArticleGoogle Scholar
- Peng B, Amos CI. Forward-time simulations of non-random mating populations using simuPOP. Bioinformatics. 2008; 24:11.View ArticleGoogle Scholar
- Montana G. HapSim: a simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics. 2005; 21:23.View ArticleGoogle Scholar
- Yuan X, Zhang J, Wang Y. Simulating linkage disequilibrium structures in a human population for SNP association studies. Biochem Genet. 2011; 49:5–6.View ArticleGoogle Scholar
- Shang J, Zhang J, Lei X, Zhao W, Dong Y. EpiSIM: simulation of multiple epistasis, linkage disequilibrium patterns and haplotype blocks for genome-wide interaction analysis. Genes Genomes. 2013; 35:3.Google Scholar
- Peng B, Chen H-S, Mechanic LE, Racine B, Clarke J, Clarke L, et al.Genetic Simulation Resources: a website for the registration and discovery of genetic data simulators. Bioinformatics. 2013; 29:8.View ArticleGoogle Scholar
- Haiminen N, Lebreton C, Parida L.Best-fit in linear time for non-generative population simulation. LNCS. 2014; 8701:247–262.Google Scholar
- Kearsey MJ, Pooni HS. The genetical analysis of quantitative traits. London, England: Chapman & Hall; 1996.View ArticleGoogle Scholar
- Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sunderland MA, USA: Sinauer Associates; 1998.Google Scholar
- The International HapMap Consortium. The international HapMap project. Nature. 2003; 426:789–96.View ArticleGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.