HAPSIMU: a genetic simulation platform for population-based association studies
© Zhang et al; licensee BioMed Central Ltd. 2008
- Received: 19 March 2008
- Accepted: 05 August 2008
- Published: 05 August 2008
Population structure is an important cause of inconsistent results in population-based association studies (PBAS) of human diseases. Various statistical methods have been proposed to reduce the negative impact of population structure on PBAS. Because structural information is lacking for real populations, it is difficult to evaluate the impact of population structure on PBAS in such populations.
We developed a genetic simulation platform, HAPSIMU, based on real haplotype data from the HapMap ENCODE project. This platform can simulate heterogeneous populations with various known and controllable structures under the continuous migration model or the discrete model. Moreover, both qualitative and quantitative traits can be simulated using an additive genetic model with various genetic parameters designated by users.
HAPSIMU provides a common genetic simulation platform to evaluate the impact of population structure on PBAS, and compare the relative performance of various population structure identification and PBAS methods.
- Quantitative Trait Locus
- Population Structure
- Heterogeneous Population
- Real Population
Population-based association studies (PBAS) are powerful for disease gene mapping, and are widely applied to the identification of genetic determinants of human diseases [1, 2]. However, how to effectively evaluate and reduce the negative impact of population structure on PBAS remains an open issue [1, 3].
Population structure, a common feature in real populations [4, 5], is an important cause of inconsistent results in PBAS [1, 6]. Various statistical methods have been proposed to reduce the negative impact of population structure on PBAS [7–10]. Because they rest on different hypotheses and algorithms, these PBAS methods may perform differently in different situations. Therefore, a comparison of the relative performance of various PBAS methods in heterogeneous populations may provide a practical guideline for empirical researchers to choose the study methods best suited to their respective situations, and to interpret their results appropriately.
Because structural information is lacking for real populations, it is difficult or impossible to accurately evaluate the impact of population structure on PBAS in real populations. Simulation, which can generate heterogeneous populations with known structures, is therefore an alternative choice for such studies. Currently, several genetic simulation programs are available [11, 12]. Most of these programs can simulate only genotype data, not phenotype data. Furthermore, very few of them can generate heterogeneous populations with various known and controllable structures. It is therefore difficult to apply them to evaluate the impact of population structure on PBAS. To address these problems, we developed a genetic simulation platform, HAPSIMU, based on real haplotype data from the HapMap ENCODE project [see Additional file 1].
An additive genetic model is implemented in HAPSIMU to simulate qualitative and quantitative traits. For a qualitative trait, the relationship among population prevalence (K), genotype relative risk (GRR) (r), frequency of the causal allele (p) and penetrance (fi) of the genotype carrying i copies of the causal allele at a causal locus in simulated heterogeneous populations can be expressed as:
f0 = K/(1-2p+2pr),
f1 = rf0,
f2 = 2rf0-f0,
where Vj denotes the phenotypic variation explained by the QTL j, and pj denotes the frequency of the disease susceptible allele at the QTL j.
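The qualitative-trait relations above (f0, f1, f2) can be checked numerically. The following is a minimal sketch, not HAPSIMU's own code: the function names `penetrances` and `prevalence` are hypothetical, and the check assumes Hardy-Weinberg genotype frequencies at the causal locus, under which the three penetrances should average back to the prevalence K.

```python
def penetrances(K, p, r):
    """Return (f0, f1, f2) from prevalence K, causal allele frequency p
    and genotype relative risk r, using the additive relations in the text."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must lie in [0, 1]")
    f0 = K / (1.0 - 2.0 * p + 2.0 * p * r)
    f1 = r * f0
    f2 = 2.0 * r * f0 - f0
    # A penetrance is a probability, so reject parameter sets that leave [0, 1].
    if min(f0, f1, f2) < 0.0 or max(f0, f1, f2) > 1.0:
        raise ValueError("parameters give a penetrance outside [0, 1]")
    return f0, f1, f2


def prevalence(p, f):
    """Population prevalence implied by penetrances f = (f0, f1, f2),
    assuming Hardy-Weinberg genotype frequencies (an assumption of this sketch)."""
    q = 1.0 - p
    return q * q * f[0] + 2.0 * p * q * f[1] + p * p * f[2]


if __name__ == "__main__":
    f = penetrances(K=0.05, p=0.2, r=1.5)
    print(f, prevalence(0.2, f))
```

The round trip through `prevalence` is a quick way to confirm that a chosen combination of K, p and r is internally consistent before using it to assign case or control status.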
HAPSIMU can simulate heterogeneous populations with various known population structures under the continuous migration model or the discrete model. In the continuous migration model, population structure is controlled by the admixture proportion of YRI in the simulated heterogeneous populations. In the discrete model, users can preset the frequency difference of the disease susceptible allele(s) between the simulated CEPH and YRI subpopulations, the proportions of CEPH and YRI in cases and controls (for a qualitative trait) or the variance explained by population stratification (for a quantitative trait). Additionally, missing genotypes can be simulated in HAPSIMU at a rate designated by users.
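To make the role of the YRI admixture proportion and the missing-genotype rate concrete, here is a deliberately simplified sketch. It is not HAPSIMU's algorithm: it draws each of an individual's two haplotypes whole from either a CEPH or a YRI pool and ignores recombination between ancestries. The function name, the 0/1 allele coding and the use of `None` for a missing genotype are all assumptions made for illustration.

```python
import random


def simulate_admixed(ceph_haps, yri_haps, yri_proportion, n_individuals,
                     missing_rate=0.0, rng=None):
    """Toy admixture simulator.

    ceph_haps, yri_haps: lists of haplotypes, each a list of 0/1 alleles.
    yri_proportion: probability that a haplotype is drawn from the YRI pool.
    missing_rate: per-genotype probability of being masked as None.
    Returns one genotype list per individual (allele counts 0, 1, 2 or None).
    """
    rng = rng or random.Random()
    individuals = []
    for _ in range(n_individuals):
        haps = []
        for _ in range(2):
            # Choose the ancestral pool for this haplotype.
            pool = yri_haps if rng.random() < yri_proportion else ceph_haps
            haps.append(rng.choice(pool))
        # Genotype = number of copies of allele 1 at each locus.
        genotype = [a + b for a, b in zip(haps[0], haps[1])]
        # Mask genotypes independently at the requested missing rate.
        genotype = [None if rng.random() < missing_rate else g
                    for g in genotype]
        individuals.append(genotype)
    return individuals


if __name__ == "__main__":
    ceph = [[0, 0, 1, 0], [0, 1, 1, 0]]
    yri = [[1, 1, 0, 1], [1, 0, 0, 1]]
    for g in simulate_admixed(ceph, yri, 0.2, 3, 0.05, random.Random(1)):
        print(g)
```

Because the admixture proportion is set by the caller, the true structure of the simulated sample is known, which is the property that makes simulated data useful for benchmarking structure-correction methods.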
HAPSIMU can output the simulated data in various selectable file formats required by five prevailing PBAS software packages: Admixmap [15], Plink [16], STRUCTURE & STRAT [9, 10], GC [7] and EIGENSOFT [8]. Currently, HAPSIMU 1.0 is designed to run on Windows operating systems. Future versions of HAPSIMU are intended to run on Linux operating systems and to include more practical functions, for instance, simulating heterogeneous populations from genotype data provided by researchers from their own studies.
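As an illustration of one of these target formats, the sketch below formats records in the standard PLINK text layout (a PED line with six leading columns followed by two alleles per marker, and a four-column MAP line). The exact files HAPSIMU writes are not described here, so the helper names `ped_line` and `map_line`, the allele letters and the genotype coding (copies of the second allele, `None` for missing) are assumptions.

```python
def ped_line(fid, iid, sex, affected, genotypes, alleles=("A", "G")):
    """Format one individual as a PLINK PED line.

    Columns: family ID, individual ID, paternal ID, maternal ID,
    sex (1 = male, 2 = female), phenotype (1 = unaffected, 2 = affected),
    then two alleles per marker. A missing genotype is written as "0 0".
    genotypes: copies of alleles[1] at each marker (0, 1, 2) or None.
    """
    fields = [fid, iid, "0", "0", str(sex), "2" if affected else "1"]
    for g in genotypes:
        if g is None:
            fields += ["0", "0"]
        else:
            fields += [alleles[1]] * g + [alleles[0]] * (2 - g)
    return " ".join(fields)


def map_line(chrom, snp_id, genetic_distance_morgans, bp_position):
    """Format one marker as a PLINK MAP line:
    chromosome, marker ID, genetic distance in Morgans, base-pair position."""
    return f"{chrom} {snp_id} {genetic_distance_morgans} {bp_position}"


if __name__ == "__main__":
    print(ped_line("F1", "I1", 1, True, [0, 1, 2, None]))
    print(map_line(1, "rs123", 0.0, 1000))
```

Parental IDs are written as "0" because simulated individuals in a population-based design are unrelated founders.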
The simulated genotype and phenotype data of heterogeneous populations can be used to compare the relative performance of various PBAS methods in heterogeneous populations. The comparison results can provide a practical guideline for researchers to select proper study methods and to draw appropriate inferences from PBAS results.
The simulated admixed populations can also be applied to performance comparison studies of various population structure identification and admixture mapping methods [10, 15, 17]. For instance, Sankararaman et al. recently developed a new method to identify population structure [17]. They simulated a set of admixed populations using the genotype data of chromosome 1 from the HapMap project, and demonstrated the high accuracy of their new approach in population structure inference. HAPSIMU differs from their simulation algorithm in two significant ways. In Sankararaman et al.'s study, genotype data were simulated with the same recombination fraction (10^-8) for all base pairs, while HAPSIMU can simulate genotype data based on the real genetic map distances reported by the HapMap ENCODE project. Additionally, we selected 12,867 highly informative marker loci from 10 ENCODE regions to conduct simulations, which may further increase the effectiveness and robustness of our simulation approach for population structure.
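The difference between a uniform per-base-pair recombination fraction and map-based recombination can be sketched as follows. Kosambi's map function appears in the reference list, but the text here does not say exactly how HAPSIMU applies it, so its use below is an assumption, as are the function names and the simple one-pass crossover model.

```python
import math
import random


def kosambi_recombination_fraction(distance_cm):
    """Kosambi map function: r = 0.5 * tanh(2d), with d in Morgans."""
    d = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * d)


def constant_rate_fraction(distance_bp, rate_per_bp=1e-8):
    """Uniform model: the same recombination fraction for every base pair,
    capped at 0.5 (free recombination)."""
    return min(0.5, distance_bp * rate_per_bp)


def recombine(hap_a, hap_b, fractions, rng):
    """Build a mosaic haplotype from two parental haplotypes.

    fractions[i] is the recombination fraction between marker i and
    marker i + 1, so len(fractions) must equal len(hap_a) - 1.
    """
    current, other = hap_a, hap_b
    child = [current[0]]
    for i, r in enumerate(fractions, start=1):
        if rng.random() < r:
            # Crossover in this interval: switch the template haplotype.
            current, other = other, current
        child.append(current[i])
    return child


if __name__ == "__main__":
    # Hypothetical genetic map distances (cM) between four adjacent markers.
    gaps_cm = [0.01, 0.5, 0.02]
    fractions = [kosambi_recombination_fraction(g) for g in gaps_cm]
    print(recombine([0, 0, 0, 0], [1, 1, 1, 1], fractions, random.Random(7)))
```

With map-based fractions, intervals spanning recombination hotspots break up haplotypes more often than a uniform per-base-pair rate would, so linkage disequilibrium in the simulated data follows the observed map more closely.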
In summary, HAPSIMU provides a common genetic simulation platform for PBAS. The simulated heterogeneous populations can be used to assess the impact of population structure on PBAS, and compare the performance of various population structure identification and PBAS methods.
Project name: HAPSIMU
Project home page: http://l.web.umkc.edu/liujian/
Operating system(s): Microsoft Windows
Programming language: C++
License: Free for non-commercial usage
The study was partially supported by Xi'an Jiaotong University. The investigators of this work also benefited from grants from the Ministry of Education of China, NIH (R01 AR050496, R21 AG 027110, R01 AG026564 and P50 AR055081), National Science Foundation of China, Huo Ying Dong Education Foundation and Hunan Province.
- Marchini J, Cardon LR, Phillips MS, Donnelly P: The effects of human population structure on large genetic association studies. Nat Genet 2004, 36(5):512–517.
- Risch NJ: Searching for genetic determinants in the new millennium. Nature 2000, 405(6788):847–856. doi:10.1038/35015718
- Lander ES, Schork NJ: Genetic dissection of complex traits. Science 1994, 265(5181):2037–2048. doi:10.1126/science.8091226
- Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D: Assessing the impact of population stratification on genetic association studies. Nat Genet 2004, 36(4):388–393.
- Guthery SL, Salisbury BA, Pungliya MS, Stephens JC, Bamshad M: The structure of common genetic variation in United States populations. Am J Hum Genet 2007, 81(6):1221–1231. doi:10.1086/522239
- Deng HW: Population admixture may appear to mask, change or reverse genetic effects of genes underlying complex traits. Genetics 2001, 159(3):1319–1323.
- Devlin B, Roeder K: Genomic control for association studies. Biometrics 1999, 55(4):997–1004. doi:10.1111/j.0006-341X.1999.00997.x
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006, 38(8):904–909.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Association mapping in structured populations. Am J Hum Genet 2000, 67(1):170–181.
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155(2):945–959.
- Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput 2006, 499–510.
- Li C, Li M: GWAsimulator: a rapid whole-genome simulation program. Bioinformatics 2008, 24(1):140–142. doi:10.1093/bioinformatics/btm549
- Kosambi DD: The estimation of map distances from recombination values. Annals of Eugenics 1944, 12: 172–175.
- Long JC: The genetic structure of admixed populations. Genetics 1991, 127(2):417–428.
- McKeigue PM, Carpenter JR, Parra EJ, Shriver MD: Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to African-American populations. Ann Hum Genet 2000, 64(Pt 2):171–186. doi:10.1046/j.1469-1809.2000.6420171.x
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007, 81(3):559–575.
- Sankararaman S, Kimmel G, Halperin E, Jordan MI: On the inference of ancestries in admixed populations. Genome Res 2008, 18(4):668–675. doi:10.1101/gr.072751.107
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.