phenosim - A software to simulate phenotypes for testing in genome-wide association studies
© Günther et al; licensee BioMed Central Ltd. 2011
Received: 15 March 2011
Accepted: 29 June 2011
Published: 29 June 2011
There is a great interest in understanding the genetic architecture of complex traits in natural populations. Genome-wide association studies (GWAS) are becoming routine in human, animal and plant genetics to understand the connection between naturally occurring genotypic and phenotypic variation. Coalescent simulations are commonly used in population genetics to simulate genotypes under different parameters and demographic models.
Here, we present phenosim, a software to add a phenotype to genotypes generated in time-efficient coalescent simulations. Both qualitative and quantitative phenotypes can be generated and it is possible to partition phenotypic variation between additive effects and epistatic interactions between causal variants. The output formats of phenosim are directly usable as input for different GWAS tools. The applicability of phenosim is shown by simulating a genome-wide association study in Arabidopsis thaliana.
By using the coalescent approach to generate genotypes and phenosim to add phenotypes, the data sets can be used to assess the influence of various factors such as demography, genetic architecture or selection on the statistical power of association methods to detect causal genetic variants under a wide variety of population genetic scenarios. phenosim is freely available from the authors' website http://evoplant.uni-hohenheim.de
In recent years, genome-wide association studies (GWAS) have become widely used to uncover the genetic basis of complex traits by comparing patterns of genetic and phenotypic variation [1–3]. The power of such studies depends on various factors, including the genetic architecture of the trait, the demographic history of the population, and variation in mutation and recombination rates. In addition, the trait under investigation may be adaptive or, in the case of a disease trait, may evolve under purifying selection; either would result in a non-neutral pattern of genetic diversity in the genomic neighborhood of the causal mutation.
Coalescent simulations are widely used to simulate genotypes under complex demographies, with recent extensions to include recombination hotspots and selection, or to simulate whole genomes. Simulations are often used to test population genetic hypotheses by comparing simulated and observed data. However, such simulations produce only genotypes and not phenotypes, which are also required to test methods for detecting significant associations between genetic and phenotypic variation. Although some tools provide an option to map phenotypes onto simulated genotypes, they only allow the simulation of qualitative phenotypes or require time-consuming forward-in-time simulations to create genotypes from complex demographic scenarios [10–13].
Here, we present phenosim, a tool written in Python that was designed to add a phenotype to genotypes simulated by coalescent-based simulation tools. Simulated phenotypes may be either qualitative or quantitative traits with different effect sizes and may show epistatic interactions. Hence, both the simulation of case/control studies and the search for quantitative trait nucleotides (QTNs) of a complex trait with a user-defined architecture are possible. By combining simulated genotypes and phenotypes, researchers can assess the influence of different factors on the power of new methods for association mapping, compare different methods, or estimate an optimal sample size and number of markers for a given study design.
For qualitative traits, one- and two-locus models are supported. The user defines the model by setting the penetrance (the probability of being affected) for each genotype. In the two-locus model, this is done with a penetrance table covering all possible allelic combinations at the two loci, so the user may define arbitrary interactions between the alleles of the two loci. The case/control status of each simulated individual is then assigned according to the model. In many cases, disease states are caused by risk alleles segregating at low frequencies in the overall population. Because the genealogy of such low-frequency variants may differ from that of high-frequency variants, the linkage pattern around them may differ as well; the user can therefore restrict causal mutations to a certain frequency range to obtain realistic risk loci. However, as this may result in a low number of cases in the final sample, users need to simulate larger populations and can optionally specify a minimum number of cases to be sampled from the population. This mirrors the sampling procedure of many case/control studies.
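The penetrance-based assignment described above can be sketched in a few lines of Python. This is a minimal illustration and not phenosim's own code: the function names, the representation of genotypes as derived-allele counts (0, 1 or 2), and the layout of the penetrance table are all assumptions made for the example.

```python
import random


def assign_status(genotypes, penetrance, rng=None):
    """Assign case (1) or control (0) status under a one-locus model.

    genotypes  -- derived-allele counts (0, 1 or 2) at the causal locus,
                  one entry per individual (assumed representation)
    penetrance -- dict mapping each genotype to its probability of
                  being affected
    """
    rng = rng or random.Random()
    # rng.random() lies in [0, 1), so penetrance 0.0 never yields a case
    # and penetrance 1.0 always does.
    return [1 if rng.random() < penetrance[g] else 0 for g in genotypes]


def assign_status_two_locus(genotype_pairs, table, rng=None):
    """Assign case/control status under a two-locus model.

    genotype_pairs -- (count at locus 1, count at locus 2) per individual
    table          -- 3 x 3 nested list; table[a][b] is the penetrance of
                      genotype a at locus 1 combined with b at locus 2
    """
    rng = rng or random.Random()
    return [1 if rng.random() < table[a][b] else 0 for a, b in genotype_pairs]


if __name__ == "__main__":
    # Hypothetical recessive risk allele with incomplete penetrance.
    one_locus = {0: 0.01, 1: 0.01, 2: 0.60}
    print(assign_status([0, 1, 2, 2, 0], one_locus, random.Random(42)))
```

Because every cell of the two-locus table is set independently, any interaction pattern between the loci can be expressed, for example a table in which only individuals carrying derived alleles at both loci have an elevated risk.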
where $Q_{ij} := 1$ if the $j$-th QTN is homozygous derived, $Q_{ij} := 0$ if the QTN is heterozygous and $Q_{ij} := -1$ if the QTN is homozygous ancestral. Dominant effects at each QTN and additive effects between loci are also supported for diploids. In this case, equation (1) is used with $a_{ij} := 0$ for homozygous ancestral QTNs and $a_{ij} := 1$ for heterozygous and homozygous derived individuals.
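The two genotype codings described above can be written as small helper functions. The sketch below covers only the coding step; it does not reproduce equation (1). Storing genotypes as derived-allele counts and the function names are assumptions made for illustration.

```python
def additive_code(derived_count):
    """Additive coding: -1 (homozygous ancestral), 0 (heterozygous),
    1 (homozygous derived)."""
    if derived_count not in (0, 1, 2):
        raise ValueError("derived_count must be 0, 1 or 2")
    return derived_count - 1


def dominant_code(derived_count):
    """Dominant coding: 0 (homozygous ancestral), 1 (heterozygous or
    homozygous derived)."""
    if derived_count not in (0, 1, 2):
        raise ValueError("derived_count must be 0, 1 or 2")
    return 1 if derived_count >= 1 else 0


if __name__ == "__main__":
    # One individual's derived-allele counts at three hypothetical QTNs.
    individual = [0, 1, 2]
    print([additive_code(g) for g in individual])
    print([dominant_code(g) for g in individual])
```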
If exactly two QTNs are selected, a positive, additive epistatic effect $\pi_E$ between these QTNs can be simulated. This epistasis is modeled as a fictitious third QTN whose allelic state $a_{iE}$ is 1 if the individual carries at least one derived allele at both basal QTNs. Users with some Python scripting experience can simulate other types of epistasis by modifying the phenosim code. To simulate a causal haplotype or allelic heterogeneity among two causal variants within a single gene, both QTNs may also be placed on a common haploblock defined by the four-gamete test.
To our knowledge, quantiNemo is the only software that currently supports the simulation of interactions between QTNs. However, quantiNemo relies on time-consuming forward simulations, whereas phenosim allows epistasis between QTNs to be included within a time-efficient coalescent framework.
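As a rough illustration of the two rules in this paragraph, the following sketch computes the allelic state of the epistatic pseudo-QTN and applies a pairwise four-gamete test. It is not taken from phenosim; the function names and data layout (derived-allele counts per individual, 0/1 alleles per haplotype) are assumptions.

```python
def epistatic_state(count_qtn1, count_qtn2):
    """Allelic state of the epistatic pseudo-QTN: 1 if the individual
    carries at least one derived allele at both QTNs, otherwise 0."""
    return 1 if count_qtn1 >= 1 and count_qtn2 >= 1 else 0


def passes_four_gamete_test(site_a, site_b):
    """Pairwise four-gamete test for two biallelic sites.

    site_a, site_b -- alleles (0 = ancestral, 1 = derived) of the same
                      haplotypes at the two sites, in the same order.
    Returns True if fewer than four distinct two-site gametes occur,
    i.e. the pair shows no evidence of historical recombination.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must cover the same haplotypes")
    return len(set(zip(site_a, site_b))) < 4


if __name__ == "__main__":
    print(epistatic_state(1, 2))                                   # both QTNs derived
    print(passes_four_gamete_test([0, 0, 1, 1], [0, 1, 1, 1]))     # three gametes
    print(passes_four_gamete_test([0, 0, 1, 1], [0, 1, 0, 1]))     # all four gametes
```

A haploblock in this sense would be a run of adjacent sites in which every pair passes the test; how blocks are delimited in practice is left open here.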
After phenotypes have been generated, a predefined number of markers and/or individuals can be subsampled from the total simulated population. The causal marker(s) can optionally be removed from the sample, since the causal mutation itself is frequently not genotyped in a genome-wide study. Finally, genotypes and phenotypes are written to output files in formats that can be used directly as input for commonly used association programs such as Blossoc/QBlossoc [16, 18], EMMA/EMMAX [19, 20], PLINK, QTDT/MERLIN [22, 23] and TASSEL 3.0. A snapshot of phenosim is available as Additional File 1, whereas the most current version is maintained at http://evoplant.uni-hohenheim.de
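The subsampling step might look as follows. This is a hedged sketch rather than phenosim's implementation: the function name, its parameters, and the genotype layout (one list of marker states per individual) are assumptions, and writing the tool-specific output formats is not shown.

```python
import random


def subsample(genotypes, phenotypes, causal, n_individuals, n_markers,
              drop_causal=True, rng=None):
    """Draw a random subset of individuals and markers.

    genotypes  -- list of rows, one per individual, each a list of
                  marker states (assumed layout)
    phenotypes -- one phenotype value per individual
    causal     -- set of column indices of the causal markers
    Returns (sub_genotypes, sub_phenotypes, kept_marker_indices).
    """
    rng = rng or random.Random()
    n_total = len(genotypes[0])
    # Optionally exclude causal markers before sampling, mimicking a study
    # in which the causal mutation itself is not genotyped.
    candidates = [m for m in range(n_total)
                  if not (drop_causal and m in causal)]
    markers = sorted(rng.sample(candidates, n_markers))
    individuals = sorted(rng.sample(range(len(genotypes)), n_individuals))
    sub_genotypes = [[genotypes[i][m] for m in markers] for i in individuals]
    sub_phenotypes = [phenotypes[i] for i in individuals]
    return sub_genotypes, sub_phenotypes, markers


if __name__ == "__main__":
    geno = [[0, 1, 2, 0], [1, 1, 0, 2], [2, 0, 1, 1]]
    pheno = [0.5, 1.5, -0.2]
    print(subsample(geno, pheno, {1}, n_individuals=2, n_markers=3,
                    rng=random.Random(1)))
```

Sampling individuals and markers by index keeps genotype rows and phenotype values aligned, which matters when the two are later written to separate files.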
Results and Discussion
On average, a single simulation took 4 min with GENOME and 2 min with phenosim on a single core of an Intel Xeon X5650 (2.66 GHz) processor. To compare this running time with other software tools, we simulated two QTNs and 249,998 neutral loci in a population of 500 diploid individuals using quantiNemo. In six minutes, quantiNemo generated ~120 generations. As the expected coalescence time for a sample is ~$4N_e$ generations, this is far from enough to produce a realistic pattern of variation comparable to what GENOME achieves in the same time. Although forward simulators such as quantiNemo allow more complex demographic, selection and trait scenarios, the combination of coalescent simulators and phenosim is much better suited to generating many simulations with large sample sizes.
Demographic effects, genetic architecture, selection, and different mutation and recombination rates affect the ability to detect the genetic basis of complex traits in natural populations. Such population genetic parameters can now be estimated from genome-wide marker sets prior to further analyses. Since GWAS are widely used in plant and animal genetics, there is great interest in assessing the power of a particular study or method. Using coalescent simulations in conjunction with phenosim, one can efficiently investigate the statistical power and other characteristics of GWAS methods. Additionally, as different causal markers may contribute different effects to a trait, the sample size and number of markers required to detect a certain pattern can be estimated.
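The size of this gap can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that the effective population size equals the simulated population size of 500 and that the forward simulation's running time scales linearly with the number of generations; neither assumption is stated in the text, so the result is only an order-of-magnitude guide.

```python
population_size = 500     # diploid individuals in the quantiNemo run
generations_done = 120    # approximate generations completed
minutes_used = 6          # running time of that run

# Expected coalescence time of ~4 N_e generations,
# assuming N_e equals the simulated population size.
expected_generations = 4 * population_size

# Share of the required generations reached in six minutes.
fraction_done = generations_done / expected_generations

# Running time needed under the linear-scaling assumption.
minutes_needed = expected_generations / generations_done * minutes_used

print(expected_generations)
print(round(fraction_done, 2))
print(round(minutes_needed))
```

Under these assumptions the forward simulation covers about 6% of the roughly 2,000 generations needed and would require on the order of 100 minutes to complete them, compared with six minutes for GENOME and phenosim combined.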
Availability and requirements
Project name: phenosim
Project home page: http://evoplant.uni-hohenheim.de
Operating system(s): Platform independent
Programming language: Python
Other requirements: Python 2.X
License: no license required
Any restrictions to use by non-academics: none
This work is supported by the BMBF under the German plant genomics program GABI (GABI-GENOBAR; 0315066F) and the Bioenergy 2021 Initiative (BioÖl, 0315429C); TG is funded by a Volkswagen Foundation Evolutionary Biology scholarship (I/84 225). We thank two anonymous reviewers for their comments and Sariel Hübner for testing the scripts and comments on the software features.
1. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA: Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America 2009, 106(23):9362–7. 10.1073/pnas.0903103106
2. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, Meng D, Platt A, Tarone AM, Hu TT, Jiang R, Muliyati NW, Zhang X, Amer MA, Baxter I, Brachi B, Chory J, Dean C, Debieu M, de Meaux J, Ecker JR, Faure N, Kniskern JM, Jones JDG, Michael T, Nemri A, Roux F, Salt DE, Tang C, Todesco M, Traw MB, Weigel D, Marjoram P, Borevitz JO, Bergelson J, Nordborg M: Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 2010, 465(7298):627–31. 10.1038/nature08800
3. Stranger BE, Stahl EA, Raj T: Progress and Promise of Genome-wide Association Studies for Human Complex Trait Genetics. Genetics 2010, 187(2):367–383.
4. Wang WYS, Barratt BJ, Clayton DG, Todd JA: Genome-wide association studies: theoretical and practical concerns. Nature Reviews Genetics 2005, 6(2):109–18. 10.1038/nrg1522
5. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338. 10.1093/bioinformatics/18.2.337
6. Hellenthal G, Stephens M: msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics 2007, 23(4):520–1. 10.1093/bioinformatics/btl622
7. Ewing G, Hermisson J: MSMS: A Coalescent simulation program including recombination, demographic structure, and selection at a single locus. Bioinformatics 2010, 26(16):2064–2065. 10.1093/bioinformatics/btq322
8. Liang L, Zöllner S, Abecasis GR: GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics 2007, 23(12):1565–7. 10.1093/bioinformatics/btm138
9. Mailund T, Schierup MH, Pedersen CNS, Mechlenborg PJM, Madsen JN, Schauser L: CoaSim: A flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics 2005, 6: 252. 10.1186/1471-2105-6-252
10. Chadeau-Hyam M, Hoggart CJ, O'Reilly PF, Whittaker JC, Iorio MD, Balding DJ: Fregene: simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics 2008, 9: 364. 10.1186/1471-2105-9-364
11. Lambert BW, Terwilliger JD, Weiss KM: ForSim: a tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics 2008, 24(16):1821–2. 10.1093/bioinformatics/btn317
12. Neuenschwander S, Hospital F, Guillaume F, Goudet J: quantiNemo: an individual-based program to simulate quantitative traits with explicit genetic architecture in a dynamic metapopulation. Bioinformatics 2008, 24(13):1552–3. 10.1093/bioinformatics/btn219
13. Peng B, Amos CI: Forward-time simulation of realistic samples for genome-wide association studies. BMC Bioinformatics 2010, 11: 442. 10.1186/1471-2105-11-442
14. van Rossum G: Python Reference manual. Amsterdam: CWI (Centre for Mathematics and Computer Science); 1995.
15. Nordborg M, Tavaré S: Linkage disequilibrium: what history has to tell us. Trends in Genetics 2002, 18(2):83–90. 10.1016/S0168-9525(02)02557-X
16. Besenbacher S, Mailund T, Schierup MH: Local phylogeny mapping of quantitative traits: higher accuracy and better ranking than single-marker association in genomewide scans. Genetics 2009, 181(2):747–53.
17. Hudson RR, Kaplan NL: Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 1985, 111: 147–64.
18. Mailund T, Besenbacher S, Schierup MH: Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics 2006, 7: 454. 10.1186/1471-2105-7-454
19. Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E: Efficient control of population structure in model organism association mapping. Genetics 2008, 178(3):1709–23. 10.1534/genetics.107.080101
20. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E: Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 2010, 42(4):348–354. 10.1038/ng.548
21. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M: PLINK: A tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 2007, 81(3):559–575. 10.1086/519795
22. Abecasis GR, Cardon LR, Cookson WO: A general test of association for quantitative traits in nuclear families. American Journal of Human Genetics 2000, 66: 279–92. 10.1086/302698
23. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 2002, 30: 97–101. 10.1038/ng786
24. Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES: TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 2007, 23(19):2633–5. 10.1093/bioinformatics/btm308
25. Kim S, Plagnol V, Hu TT, Toomajian C, Clark RM, Ossowski S, Ecker JR, Weigel D, Nordborg M: Recombination and linkage disequilibrium in Arabidopsis thaliana. Nature Genetics 2007, 39(9):1151–5. 10.1038/ng2115
26. Li Y, Huang Y, Bergelson J, Nordborg M, Borevitz JO: Association mapping of local climate-sensitive quantitative trait loci in Arabidopsis thaliana. Proceedings of the National Academy of Sciences 2010, 107(49):21199–21204. 10.1073/pnas.1007431107
27. Platt A, Vilhjálmsson BJ, Nordborg M: Conditions under which genome-wide association studies will be positively misleading. Genetics 2010, 186(3):1045–1052. 10.1534/genetics.110.121665
28. Nordborg M: Coalescent theory. In Handbook of Statistical Genetics. Edited by Balding DJ, Bishop MJ, Cannings C. New York: John Wiley and Sons; 2001:179–212.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.