SeqSIMLA: a sequence and phenotype simulation tool for complex disease studies
© Chung and Shih; licensee BioMed Central Ltd. 2013
Received: 29 March 2013
Accepted: 14 June 2013
Published: 20 June 2013
Association studies based on next-generation sequencing (NGS) technology have become popular, and statistical association tests for NGS data have been developed rapidly. A flexible tool for simulating sequence data in either unrelated case-control or family samples with different disease and quantitative trait models would be useful for evaluating the statistical power for planning a study design and for comparing power among statistical methods based on NGS data.
We developed a simulation tool, SeqSIMLA, which can simulate sequence data with user-specified disease and quantitative trait models. We implemented two disease models, in which the user can flexibly specify the number of disease loci, effect sizes or population attributable risk, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of quantitative trait loci (QTL), proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project so that genomic structures similar to the real data can be simulated.
SeqSIMLA can efficiently simulate sequence data with disease or quantitative trait models specified by the user. SeqSIMLA will be very useful for evaluating statistical properties for new study designs and new statistical methods using NGS. SeqSIMLA can be downloaded for free at http://seqsimla.sourceforge.net.
Computer programs that can simulate genotypes with phenotypes based on user-specified disease or quantitative trait models are essential in genetic studies. They can be used to evaluate statistical power when planning a study design based on the proposed sample size, the assumed genotypic relative risks (GRR), and allele frequencies. They are also useful for evaluating type I error rates for new statistical association tests and power comparisons between the new tests and other existing tests. Therefore, many simulation programs have been developed, mostly aiming to generate genome-wide association study (GWAS) data with dichotomous or quantitative traits [1-5].
Next-generation sequencing (NGS) has become a popular technique for identifying novel rare variants associated with complex diseases . Statistical association tests that can account for rare variants have also been developed rapidly [7-10]. These tests aim to identify multiple rare causal variants in a group of variants selected by biological functions, such as exons, genes, and pathways. A common approach is to pool all the variants in the group to increase the statistical power for associations. To evaluate the statistical power for new tests, a simulation tool that can simulate multiple rare casual variants based on sequence data is necessary. However, simulation programs developed for GWAS may not be appropriate for evaluating statistical properties for NGS studies, because they were designed to simulate common variants based on GWAS panels (e.g., Illumina and Affymetrix) or HapMap project data . Thus, computer software that can simulate sequence data based on realistic models with phenotypes becomes important.
To our knowledge, SimRare is the only existing public software designed specifically to simulate sequence data with phenotypes . SimRare has three modules, including a sequence generation module, a module for phenotype generation based on genotypes, and a module for evaluating association methods. The forward-time simulation algorithm [5, 13] is used in SimRare to generate variant data. SimRare focuses on generating unrelated samples and on evaluating association methods developed for unrelated samples. As more and more family-based association studies using NGS are conducted [14-17], software that can generate sequence data in families will be very useful for evaluating the properties of family-based NGS analysis.
We developed the Sequence and phenotype Simulator, SeqSIMLA, which can simulate sequence data in unrelated case-control or family samples with user-specified disease or quantitative trait models. SeqSIMLA uses GENOME  as the default sequence generator, which efficiently generates data using the coalescent model. SeqSIMLA also accepts a population of sequences generated by other sequence generators. SeqSIMLA can simulate multiple causal variants in regions on different chromosomes, where the recombination rates between regions are based on the rates estimated from the Hap Map project  or a user-specified fixed rate. We compared the features between SeqSIMLA and SimRare and used simulations to demonstrate that SeqSIMLA can generate data in a reasonable time frame.
GENOME is used as the default tool to simulate a population of sequences based on the coalescent model. Alternatively, as other sequence simulators can have their own unique features, SeqSIMLA also accepts a population of sequences generated by other programs. GENOME either accepts different recombination rates among chromosomal blocks or assumes a fixed rate across the genome. There is no recombination within each of the chromosomal blocks. To simulate block structures similar to real populations, we downloaded the recombination hotspots across the genome from the HapMap project , with the highest recombination rate in each hotspot region used as the recombination rate for the center of the hotspot. Crossovers during meiosis are simulated based on the recombination rates for the centers of hotspots. Alternatively, the user can assume that the recombination rates are uniform across the chromosomes, which is the default setting in GENOME.
where f 0 is the baseline penetrance specified by the user, GRR i is the GRR for the genotype at marker i, PAR i is the population attributable risk, and R i is the risk allele frequency for marker i. The sum of PAR i for the disease loci is equal to the overall PAR specified by the user. The parameter k is coded as the number of mutant allele counts (0, 1, 2) for an additive model, the presence/absence of an mutant allele (2/0) for a dominant model, and the presence/absence of a homozygous mutant genotype (2/0) for a recessive model. The model can assume that rarer variants have higher GRR values, given all causal variants contribute equally to the total PAR. SeqSIMLA can also randomly generate a PAR for each of the disease loci, while keeping the overall PAR fixed. Alternatively, the user can specify a fixed GRR across all disease loci.
The user can simulate dominant, recessive, or additive models for the disease loci under Models 1 and 2. The disease model is determined by the genotype coding in X for Model 1 and by the parameter k for Model 2. For Model 1, the user can specify whether a variant has a risk or protective effect using the parameters in B. For Model 2, the GRR for variant i with a protective effect is the inverse of GRR i . The user can also specify the proportion of risk variants in all variants with effects.
We also do not have restrictions on the number of quantitative trait loci (QTL). The user needs to specify the total phenotypic variance V P and the proportion of variance explained by each of the QTL. Assuming that the proportion of variance explained by QTL j is f j and the allele frequencies for QTL j are p j and q j , the genotypic value a j can be calculated for additive, dominant, and recessive models as follows :
SeqSIMLA determines that an individual is affected by comparing the probability calculated from the penetrance function given the person’s genotypes to a random number. This process can be inefficient for a rare disease with low penetrance. We implemented a similar strategy as in Edwards et al.  to efficiently simulate unrelated cases. We first simulated a small set of cases (e.g. 100 cases) using the penetrance function in Model 1 or 2. The conditional probability P(X|Affected) can be calculated based on the set of cases, where X is a multilocus genotype observed at the disease variants in the set of cases. Then a multilocus genotype at the disease variants for a case is simulated based on the conditional distribution, and two sequence haplotypes, which are consistent with the multilocus genotype, are randomly selected from the population of sequences. The advantage of this method is that the run time is not affected by the disease prevalence. However, the conditional probability is subject to sampling error as it is estimated in a small set of samples.
As generating each of the families, unrelated cases, and unrelated controls is an independent process, the procedures can be parallelized with threads on a multi-core computer. We used Java Thread to parallelize the code. Each thread generates about the same amount of families, cases, and controls. We used simulations to evaluate the performance of SeqSIMLA running with threads.
Results and discussion
SeqSIMLA is implemented in Java, which is portable on different operating systems, including Linux and Windows. The parameters required for SeqSIMLA can be specified in the command line. Alternatively, the user can specify the parameters using a control file so that they can be saved and reused. SeqSIMLA writes the variant data in standard PLINK file format (map and ped files) , which has been widely adopted in genetic analysis. Map distance between two variants is calculated by Haldane’s mapping function . An additional phenotype file with quantitative trait values, which is also in the same format as the PLINK phenotype file, is generated if the user chooses to simulate a quantitative trait.
Run time of SeqSIMLA for generating family and case-control data
Power and type I error rates for linkage and association tests
Two penetrance functions, Models 1 and 2, are used to determine disease status in SeqSIMLA. Model 1, which is based on the logistic function and has been used extensively in many simulation studies [19, 30, 31], allows the user to determine the conditional odds ratio for each of the disease variants and the disease prevalence. Therefore, the user can simulate disease models based on estimated odds ratios of candidate variants from previous association studies and estimated disease prevalence from a prevalence study. Model 2, which is based on the population attributable risk, has the advantage of controlling the overall PAR for a group of disease variants. The model can assume that rarer disease variants have higher GRRs, given that all of the variants have the same PAR . The model can also assume disease variants contribute unequally to the overall PAR. Therefore, the model is suitable to simulate a large number of rare disease variants with different odds ratios, while keeping the overall PAR in a specified value.
Both SeqSIMLA and SimRare are able to generate sequence data for independent samples, but with some different underlying settings. SimRare uses the forward-time simulation program srv implemented in the SimuPOP environment to generate sequence data. The srv program provides multi-locus selection models with random fitness effects, which are ideal for simulating multiple rare variants under selection. The default sequence generator, GENOME, in SeqSIMLA is a backward-time simulator. Similar to the limitation in other backward-time simulators, selection is not modeled in GENOME . However, the backward-time simulators are generally faster than the forward-time simulators. For simulating disease status, both SeqSIMLA and SimRare allow the user to specify the odds ratios or population attributable risk for disease variants, the proportion of protective variants, the mode of inheritance, and the disease prevalence. For simulating quantitative trait values, SeqSIMLA allows the user to specify the total phenotypic variance, the proportion of variance explained by each of the causal variants, and the mean of the trait values, while SimRare allows the user to specify the deviations from the mean.
Comparisons of features between SeqSIMLA and SIMRARE
GENOME (default) or a sequence pool generated by a simulator
srv in SimuPOP
Multiple genes on multiple chromosomes
A gene or multiple independent genes
Simulated data type
Families/unrelated cases and controls
Unrelated cases and controls
Number of disease/trait models
We implemented two disease models in SeqSIMLA, in which the user can flexibly specify the number of disease loci, effect sizes or PAR, disease prevalence, and risk or protective loci. We also implemented a quantitative trait model, in which the user can specify the number of QTL, proportions of variance explained by the QTL, and genetic models. We compiled recombination rates from the HapMap project, so that genomic structures similar to real data can be simulated. Future development of SeqSIMLA includes more flexibility in simulating family structures, such as twins or multi-generation families. SeqSIMLA can be used as a complementary tool to SimRare. If the user would like to perform a power study based on case-control design for new and existing statistical methods, SimRare is ideal. If the user would like to perform a power study for family-based design, or to simulate causal variants in multiple genes with different LD patterns among genes, SeqSIMLA is more suitable. In summary, as statistical methods for rare variant association analysis are developing rapidly, SeqSIMLA will be useful for evaluating statistical properties for the new methods based on case-control or family designs. SeqSIMLA will also be useful for power studies when planning association studies based on NGS.
Availability and requirements
Project name: SeqSIMLA
Project home page: http://seqsimla.sourceforge.net
Operating system(s): Unix, Linux, Windows
Programming language: Java
Other requirements: Java JDK 7
License: GNU GPL
Any restrictions to use by non-academics: None
We thank Ms. Hsin-Ni Tsai for testing and commenting on the software development. This study was supported by grants from the National Health Research Institutes (PH-101-PP-47) and National Science Council (NSC 101-2218-E-400-001) in Taiwan.
- Schmidt M, Hauser ER, Martin ER, Schmidt S: Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol. 2005, 4: Article15Google Scholar
- Edwards TL, Bush WS, Turner SD, Dudek SM, Torstenson ES, Schmidt M, Martin E, Ritchie MD: Generating linkage disequilibrium patterns in data simulations using genomeSIMLA. Lecture notes in computer science. 2008, 4973 (2008): 24-35.View ArticleGoogle Scholar
- Zhang F, Liu J, Chen J, Deng HW: HAPSIMU: a genetic simulation platform for population-based association studies. BMC Bioinforma. 2008, 9: 331-10.1186/1471-2105-9-331.View ArticleGoogle Scholar
- Gunther T, Gawenda I, Schmid KJ: Phenosim--a software to simulate phenotypes for testing in genome-wide association studies. BMC Bioinforma. 2011, 12: 265-10.1186/1471-2105-12-265.View ArticleGoogle Scholar
- Peng B, Kimmel M: SimuPOP: a forward-time population genetics simulation environment. Bioinformatics. 2005, 21 (18): 3686-3687. 10.1093/bioinformatics/bti584.View ArticlePubMedGoogle Scholar
- Kilpinen H, Barrett JC: How next-generation sequencing is transforming complex disease genetics. Trends in genetics : TIG. 2013, 29 (1): 23-30. 10.1016/j.tig.2012.10.001.View ArticlePubMedGoogle Scholar
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83 (3): 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning SR: A GroupWise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5 (2): e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Ionita-Laza I, Buxbaum JD, Laird NM, Lange C: A new testing strategy to identify rare variants with either risk or protective effect on disease. PLoS Genet. 2011, 7 (2): e1001289-10.1371/journal.pgen.1001289.PubMed CentralView ArticlePubMedGoogle Scholar
- Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR: Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010, 86 (6): 832-838. 10.1016/j.ajhg.2010.04.005.PubMed CentralView ArticlePubMedGoogle Scholar
- Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, et al: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-861. 10.1038/nature06258.View ArticlePubMedGoogle Scholar
- Li B, Wang G, Leal SM: SimRare: a program to generate and analyze sequence-based data for association studies of quantitative and qualitative traits. Bioinformatics. 2012, 28 (20): 2703-2704. 10.1093/bioinformatics/bts499.PubMed CentralView ArticlePubMedGoogle Scholar
- Peng B, Liu X: Simulating sequences of the human genome with rare variants. Hum Hered. 2010, 70 (4): 287-291. 10.1159/000323316.View ArticlePubMedGoogle Scholar
- Erlich HA, Valdes AM, McDevitt S, Simen BB, Blake LA, McGowan KR, Todd JA, Rich SS, Noble J: Next generation sequencing reveals the association of DRB3*02:02 with type I diabetes. Diabetes. 2013Google Scholar
- Park G, Gim J, Kim A, Han KH, Kim HS, Oh SH, Park T, Park WY, Choi B: Multiphasic analysis of whole exome sequencing data identifies a novel mutation of ACTG1 in a nonsyndromic hearing loss family. BMC Genomics. 2013, 14 (1): 191-10.1186/1471-2164-14-191.PubMed CentralView ArticlePubMedGoogle Scholar
- Tanaka D, Nagashima K, Sasaki M, Funakoshi S, Kondo Y, Yasuda K, Koizumi A, Inagaki N: Exome sequencing identifies a new candidate mutation for susceptibility to diabetes in a family with highly aggregated type 2 diabetes. Mol Genet Metab. 2013, 109 (1): 112-117. 10.1016/j.ymgme.2013.02.010.View ArticlePubMedGoogle Scholar
- Kong XF, Bousfiha A, Rouissi A, Itan Y, Abhyankar A, Bryant V, Okada S, Ailal F, Bustamante J, Casanova JL, et al: A novel homozygous p.R1105X Mutation of the AP4E1 gene in twins with hereditary spastic paraplegia and mycobacterial disease. PLoS One. 2013, 8 (3): e58286-10.1371/journal.pone.0058286.PubMed CentralView ArticlePubMedGoogle Scholar
- Liang L, Zollner S, Abecasis GR: GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics. 2007, 23 (12): 1565-1567. 10.1093/bioinformatics/btm138.View ArticlePubMedGoogle Scholar
- Kinnamon DD, Hershberger RE, Martin ER: Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS One. 2012, 7 (2): e30238-10.1371/journal.pone.0030238.PubMed CentralView ArticlePubMedGoogle Scholar
- FD S, Mackay TF: Quantitative genetics. 1996, San Francisco: Benjamin CummingsGoogle Scholar
- Edwards TL, Song Z, Li C: Enriching targeted sequencing experiments for rare disease alleles. Bioinformatics. 2011, 27 (15): 2112-2118. 10.1093/bioinformatics/btr324.PubMed CentralView ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.PubMed CentralView ArticlePubMedGoogle Scholar
- Haldane JBS: The combination of linkage values, and the calculation of distances between the loci of linked factors. J Genet. 1919, 8: 299-309.View ArticleGoogle Scholar
- Schaffner SF, Foo C, Gabriel S, Reich D, Daly MJ, Altshuler D: Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005, 15 (11): 1576-1583. 10.1101/gr.3709305.PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH: Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004, 305 (5685): 869-872. 10.1126/science.1099870.View ArticlePubMedGoogle Scholar
- Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW, Levy D, Lifton RP: Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet. 2008, 40 (5): 592-599. 10.1038/ng.118.PubMed CentralView ArticlePubMedGoogle Scholar
- Wellcome Trust Case Control C: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.View ArticleGoogle Scholar
- Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30 (1): 97-101. 10.1038/ng786.View ArticlePubMedGoogle Scholar
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011, 89 (1): 82-93. 10.1016/j.ajhg.2011.05.029.PubMed CentralView ArticlePubMedGoogle Scholar
- De G, Yip WK, Ionita-Laza I, Laird N: Rare variant analysis for family-based design. PLoS One. 2013, 8 (1): e48495-10.1371/journal.pone.0048495.PubMed CentralView ArticlePubMedGoogle Scholar
- Basu S, Pan W: Comparison of statistical tests for disease association with rare variants. Genet Epidemiol. 2011, 35 (7): 606-619. 10.1002/gepi.20609.PubMed CentralView ArticlePubMedGoogle Scholar
- Hoban S, Bertorelle G, Gaggiotti OE: Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet. 2011, 13 (2): 110-122.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.