Sequence generation
GENOME is used as the default tool to simulate a population of sequences based on the coalescent model. Alternatively, as other sequence simulators can have their own unique features, SeqSIMLA also accepts a population of sequences generated by other programs. GENOME either accepts different recombination rates among chromosomal blocks or assumes a fixed rate across the genome. There is no recombination within each of the chromosomal blocks. To simulate block structures similar to real populations, we downloaded the recombination hotspots across the genome from the HapMap project [11], with the highest recombination rate in each hotspot region used as the recombination rate for the center of the hotspot. Crossovers during meiosis are simulated based on the recombination rates for the centers of hotspots. Alternatively, the user can assume that the recombination rates are uniform across the chromosomes, which is the default setting in GENOME.
Disease models
We do not have restrictions on the number of disease loci to be simulated. A logistic function as follows is used to calculate the penetrance:
where X = (G
1
,G
2
,…,G
n
) is a vector of genotype coding for n disease variants, B= (β1, β2, …, β
n
) is a vector of the conditional log of odds ratios for the associated genotypes, and α determines the disease prevalence K. The parameter α is ln (f0/(1 − f0)), which is the log odds of the penetrance for X with no mutant alleles. The odds ratio eβi represents the increased odds for the disease for an additional mutant allele at variant i[19]. For the prevalence model (Model 1), the disease prevalence K is specified by the user. We iteratively search for α in the range between −20 and 20 and calculate disease prevalence K
i
based on α
i
in iteration i. The value α
i
is selected for α if |K
i
− K| < ϵ, where ϵ is small (e.g., 0.001). Alternatively, the user can specify f0 directly, and uses the population attributable risk (PAR) to determine the GRRs for the disease loci (the PAR model or Model 2). The logistic function can be represented by the function of f0 and GRR :
where f
0
is the baseline penetrance specified by the user, GRR
i
is the GRR for the genotype at marker i, PAR
i
is the population attributable risk, and R
i
is the risk allele frequency for marker i. The sum of PAR
i
for the disease loci is equal to the overall PAR specified by the user. The parameter k is coded as the number of mutant allele counts (0, 1, 2) for an additive model, the presence/absence of an mutant allele (2/0) for a dominant model, and the presence/absence of a homozygous mutant genotype (2/0) for a recessive model. The model can assume that rarer variants have higher GRR values, given all causal variants contribute equally to the total PAR. SeqSIMLA can also randomly generate a PAR for each of the disease loci, while keeping the overall PAR fixed. Alternatively, the user can specify a fixed GRR across all disease loci.
The user can simulate dominant, recessive, or additive models for the disease loci under Models 1 and 2. The disease model is determined by the genotype coding in X for Model 1 and by the parameter k for Model 2. For Model 1, the user can specify whether a variant has a risk or protective effect using the parameters in B. For Model 2, the GRR for variant i with a protective effect is the inverse of GRR
i
. The user can also specify the proportion of risk variants in all variants with effects.
Quantitative trait
We also do not have restrictions on the number of quantitative trait loci (QTL). The user needs to specify the total phenotypic variance V
P
and the proportion of variance explained by each of the QTL. Assuming that the proportion of variance explained by QTL j is f
j
and the allele frequencies for QTL j are p
j
and q
j
, the genotypic value a
j
can be calculated for additive, dominant, and recessive models as follows [20]:
Additive model:
Dominant model:
Recessive model:
where .
Assume QTL j has two alleles A
1
and A
2
, where A
1
is the minor allele responsible for the larger value in the trait. For a set of M QTL, the phenotypic value Y is a random variable defined as:
where μ is the general mean specified by the user, G
j
follows a normal distribution with mean μ
j
and variance V
Pj
, P follows a normal distribution with mean 0 and variance V
poly
specified by the user, and E follows a normal distribution with mean 0 and variance . P and E model the polygenic and environmental components, respectively. The mean μ
j
for G
j
is defined as:
Data types
SeqSIMLA can simulate two data types - three-generation family data with 12 members and unrelated cases and controls. The structure for each family is shown in Figure 1. We assume random mating in a population of haplotypes generated by GENOME to simulate family data. For the disease models, a family is ascertained if there is a user-specified number of affected siblings (e.g., 1-3) in the third generation. To generate case-control data, we simulate cases by randomly selecting unrelated affected individuals and simulate controls by randomly selecting unrelated unaffected individuals in the third generations of unrelated families. For the quantitative trait model, the user can decide whether the families will be ascertained based on affection status in family members, which is the same procedure as in the disease models, or randomly from the population.
Efficiency improvements
SeqSIMLA determines that an individual is affected by comparing the probability calculated from the penetrance function given the person’s genotypes to a random number. This process can be inefficient for a rare disease with low penetrance. We implemented a similar strategy as in Edwards et al. [21] to efficiently simulate unrelated cases. We first simulated a small set of cases (e.g. 100 cases) using the penetrance function in Model 1 or 2. The conditional probability P(X|Affected) can be calculated based on the set of cases, where X is a multilocus genotype observed at the disease variants in the set of cases. Then a multilocus genotype at the disease variants for a case is simulated based on the conditional distribution, and two sequence haplotypes, which are consistent with the multilocus genotype, are randomly selected from the population of sequences. The advantage of this method is that the run time is not affected by the disease prevalence. However, the conditional probability is subject to sampling error as it is estimated in a small set of samples.
As generating each of the families, unrelated cases, and unrelated controls is an independent process, the procedures can be parallelized with threads on a multi-core computer. We used Java Thread to parallelize the code. Each thread generates about the same amount of families, cases, and controls. We used simulations to evaluate the performance of SeqSIMLA running with threads.