Generating samples for association studies based on HapMap data
 Jing Li^{1}Email author and
 Yixuan Chen^{1}
DOI: 10.1186/14712105944
© Li and Chen; licensee BioMed Central Ltd. 2008
Received: 22 March 2007
Accepted: 24 January 2008
Published: 24 January 2008
Abstract
Background
With the completion of the HapMap project, a variety of computational algorithms and tools have been proposed for haplotype inference, tag SNP selection and genomewide association studies. Simulated data are commonly used in evaluating these new developed approaches. In addition to simulations based on population models, empirical data generated by perturbing real data, has also been used because it may inherit specific properties from real data. However, there is no tool that is publicly available to generate large scale simulated variation data by taking into account knowledge from the HapMap project.
Results
A computer program (gs) was developed to quickly generate a large number of samples based on real data that are useful for a variety of purposes, including evaluating methods for haplotype inference, tag SNP selection and association studies. Two approaches have been implemented to generate dense SNP haplotype/genotype data that share similar local linkage disequilibrium (LD) patterns as those in human populations. The first approach takes haplotype pairs from samples as inputs, and the second approach takes patterns of haplotype block structures as inputs. Both quantitative and qualitative traits have been incorporated in the program. Phenotypes are generated based on a disease model, or based on the effect of a quantitative trait nucleotide, both of which can be specified by users. In addition to singlelocus disease models, twolocus disease models have also been implemented that can incorporate any degree of epistasis. Users are allowed to specify all nine parameters in a 3 × 3 penetrance table. For several commonly used twolocus disease models, the program can automatically calculate penetrances based on the population prevalence and marginal effects of a disease that users can conveniently specify.
Conclusion
The program gs can effectively generate large scale genetic and phenotypic variation data that can be used for evaluating new developed approaches. It is freely available from the authors' web site at http://www.eecs.case.edu/~jxl175/gs.html.
Background
With the completion of the HapMap project [1], largescale, highdensity singlenucleotide polymorphism (SNP) markers and information on haplotype structure and frequencies become available. A variety of statistical approaches have been proposed for association studies using haplotypes [2, 3] and more are expected for whole genome association studies. The utilities of such approaches are frequently very difficult to obtain through analytical analysis. Evaluations on those methods commonly rely on experiments based on simulations or empirical data. For example, one can generate a large number of samples based on population models such as the coalescent theory, and the program ms [4] is routinely used in the community. Thus statistical properties of new approaches can be investigated in a controlled manner as functions of population parameters such as recombination rates, mutation rates, population structure and migration rates. But simulations based on population genetics models may not be able to capture the true property of LD in human populations, due to their simplified assumptions. Conclusions based on such simulated data may be misleading or inaccurate in reality. Some researchers [2, 3] have used their inhouse tools to generate empirical data based on real data, in hopes that their empirical data could inherit major properties of human populations such as LD patterns. But each group may use its own models and tools, and comparisons on results from different groups are usually impossible in general. A public available program that can generate a large number of independent samples based on real human data can be of great use for evaluating new proposed approaches. We have implemented a program in C++, called gs, that can efficiently generate such samples based on real data. Two heuristic approaches have been implemented. One is based on phased haplotype pairs and the other is based on haplotype block structures. The program can also directly take data from the HapMap project as its inputs. Our experiments show that genotype data generated by both approaches observe similar local haplotype structures and LD patterns as those in the input data, and also keep a proper level of variety. Therefore, genotype data can be used in testing algorithms for tag SNP selection and haplotype inference.
Simulation data is always indispensable for evaluation purposes. The major goal of this tool is to allow users to generate large scale simulation data for association studies, including both fine mapping and genome wide association. Both quantitative and qualitative traits have been incorporated. To generate disease phenotypes, users can either specify a onelocus disease model or a twolocus disease model with or without epistasis. Many studies and growing evidences have revealed the importance of epistasis in the etiology of complex traits. More and more efforts have been made to detect epistatic interactions [5–7]. For example, Marchini et al. studied the feasibility and power of a full twolocus model in the context of genome wide association studies and compared its performance with a single locus based approach and a twostage design [5]. Evans et al. compared the performance of four different search strategies (i.e., a singlelocus search, an exhaustive twolocus search, and two, twostage procedures) to detect interacted loci using twolocus models [6]. Luliana et al. compared two natural twostage approaches: the conditional approach and the simultaneous approach [7]. But for all the above studies, simulations were based on genotypes only at the trait loci. Although it is hardly possible to simulate thousands of individuals for hundreds of thousands SNP markers for thousands times for genome wide association studies due to storage and time constraints, our program can embed the disease genotypes into genome regions that mimic genomic content in human populations. Associations with markers can be tested in a more realistic scenario.
Implementation
The program has implemented two methods to generate haplotypes/genotypes, i.e., the extension method and the block method. Both methods can be used to generate qualitative and quantitative phenotype data. We first introduce the two methods in the context of generating casecontrol data, and then briefly discuss how quantitative traits can be generated. The extension to twolocus models will then be discussed.
Extension Method
The actual genotype g will then be selected based on the conditional distribution. The genotype frequencies in the above formula can be obtained from allele frequencies under the assumption of HardyWeinberg equilibrium. The probability of being affected given a particular genotype (penetrance) is given by users as a parameter. To generate the haplotype pairs h_{1} and h_{2} for this affected individual, the program randomly selects two haplotypes h_{3} and h_{4} from the inputs with the genotype at the disease locus t as required (i.e., (h_{3} [t], h_{4} [t]) = g, where h_{ i }[t] denotes the allele at the t^{ th }locus on haplotype h_{ i }). In their original paper [2], haplotype h_{1} will be given the same alleles as h_{3} from locus t  l to t + l, where l is a parameter that can be specified by users. To extend h_{1} to the right for one more locus, it randomly selects another haplotype h_{5} that has the same alleles as h_{1} from locus t  l + 1 to locus t + l, and let h_{1} [t + l + 1] = h_{5} [t + l + 1]. By iterating the above process, one can extend h_{1} to the right and then to the left. We found that LD patterns from samples generated this way greatly depend on the parameter l (data not shown). Even for one particular data set, the extend of LD may vary substantially in different segments. A single l can not accommodate all the cases. We have extended the above method by introducing two parameters, l_{ min }and l_{ max }. The overlapped length for both the initial assignment and the extension of h_{1} will be stochastically determined by two values l_{ l }and l_{ r }(l_{ min }≤ l_{ l }, l_{ r }≤ l_{ max }), one for each direction. The values of l_{ l }and l_{ r }depend upon the strength of local LD. More specifically, l_{ r }is initialized as l_{ min }. The value of l_{ r }is increased by 1 if the LD measure D' between locus t + l_{ r }and locus t + l_{ r }+ 1 is greater than a uniformly distributed random number between 0 and 1. The process will terminate when the value of D' is smaller than such a random number or when l_{ r }= l_{ max }. The value of l_{ l }can be determined similarly to the left. The haplotype h_{1} will be given the same alleles as h_{3} from locus t  l_{ l }to t + l_{ r }initially. At each extension step, the procedure is the same as in the original paper [2], but with differences in determining the length of the overlapped region. For example, to extend to the right for one more locus, suppose the current locus (the right most one) is t_{1}. The leftmost locus t_{2} of the overlapped region is stochastically determined based on pairwise LD with the constraint that l_{ min }≤ t_{1}  t_{2} ≤ l_{ max }. A haplotype that shares the same segment with h_{1} from t_{2} to t_{1} will be randomly selected and its allele at t_{1} + 1 will be copied to h_{1}. A detailed description of the above procedure can be found in the manual of the program. The haplotype h_{2} can be obtained similarly. The required number of cases can be generated by repeating the process. One can generate normal individuals using the same approach based on the genotype distribution conditional on the fact that the individuals are normal. By using two parameters, the method takes both longrange LD (up to l_{ max }) and shortrange LD into considerations.
Block Method
Quantitative Traits
where z follows the standard normal distribution, a is half of the difference between two homozygous genotypic values (assume the mutant allele increases the phenotypic value), and k is a parameter representing the dominance effect, it is known [10] that V_{ A }and V_{ D }can be written asV_{ A }= 2p(1  p)a^{2}(1  k(2p  1))^{2}, and V_{ D }= (2p(1  p)ak)^{2},
where p is the frequency of the mutant allele. Thus a and k can be obtained based on the above equations. The phenotypic value of an individual can be calculated by substituting a and k into Equation 2.
TwoLocus Models
Odds table for the jointly dominantdominant model. α is the baseline odd and (1 + θ) is the genotype odd ratio.
bb  Bb  BB  

aa  α  α  α 
Aa  α  α(1 + θ)  α(1 + θ) 
AA  α  α(1 + θ)  α(1 + θ) 
Penetrance table for the jointly dominantdominant model, where α and θ are defined in Table 1.
bb  Bb  BB  

aa  $\frac{\alpha}{1+\alpha}$  $\frac{\alpha}{1+\alpha}$  $\frac{\alpha}{1+\alpha}$ 
Aa  $\frac{\alpha}{1+\alpha}$  $\frac{\alpha (1+\theta )}{1+\alpha (1+\theta )}$  $\frac{\alpha (1+\theta )}{1+\alpha (1+\theta )}$ 
AA  $\frac{\alpha}{1+\alpha}$  $\frac{\alpha (1+\theta )}{1+\alpha (1+\theta )}$  $\frac{\alpha (1+\theta )}{1+\alpha (1+\theta )}$ 
The frequencies on genotype combinations (Pr(g_{ i })) can be obtained from allele frequencies under the HardyWeinberg Equilibrium assumption. The details of all the nine builtin models can be found in the manual of the program. Once the three by three penetrance table is ready, the gs program can calculate the conditional probability of each genotype combination given affected status using a similar formula as Equation 1. Actual genotypes at the two loci will be selected based on the conditional distribution. The haplotypes will then be selected independently for these two loci, using either the extension method or the block method. The above twolocus model does not explicitly consider linkage disequilibrium between them (i.e., the two loci are assumed to be in linkage equilibrium). We will consider models with two disease loci that are in LD, as well as haplotypebased disease models in a subsequent version.
Formats
The inputs to the program can be obtained from HapMap project or from users' own research projects. Phased haplotype pairs from HapMap website can be directly incorporated into the software using the extension method. To use the block method, one can generate block structures from phased haplotype pairs using the program Haploview [9]. Both approaches allow us to generate large data sets that mimic the true local LD from human populations. A variety of parameters can be specified by users, including some that control the output formats. There are basically three different output formats including a widely used format in the community, linkage format. Users can choose to output phased haplotype data or unphased genotype data. The disease causing SNP can be kept or removed in the final outputs. More details about file formats can be found in the manual of the program. The program does not directly model population structures. But one can create a data set that is a mixture of different populations with different allele frequencies and effects by combining samples generated based on inputs from different populations. To generate random numbers, the Mersenne Twister algorithm [12] has been adopted.
Results and Discussion
Datasets
We have tested both generating methods using two different data sets. The first dataset consists of all 10 ENCODE regions from the HapMap project, which is a portion of the 44 regions from the ENCODE project. These ENCODE regions were selected either manually, or randomly based on gene density and level of nonexonic conservation. Details about selection criteria can be found at ENCODE website [13]. The 10 HapMapENCODE regions were resequenced in 48 unrelated individuals and genotype data were obtained from about 20 thousand SNPs of all 270 HapMap samples. The haplotype pairs of each individual can be downloaded directly from the HapMap phase I data release. For each region, we take the parental haplotypes from 30 trios in a population with northern and western European ancestry. We mainly present our results using one region (ENr112 on chromosome 2p16.3) in the paper and results on other regions can be found on our website as supplementary materials. The second dataset is from a genome wide association study of a neurological disease [14]. This dataset is one of the first sets of publicly available genome wide SNP data, which can be downloaded from Coriell Institute for Medical Research [15]. In its first stage, the study genotyped 550k SNPs across genome using the Illumina Infinium II SNP chip, for 276 patients with sporadic amyotrophic lateral sclerosis (ALS) disease and 271 normal individuals. We have to preprocess the data before running our program, by either inferring haplotypes or predicting haplotype block structures. We chose the most recent version of a widely used program fastPhase [16] for haplotype inference. Haploview was selected for haplotype block structure prediction.
Because the phenotypic models are standard, we mainly assess the performance of our program in terms of the similarity of haplotype structures between simulated data and input data. The haplotype structure is measured mainly by two quantities, i.e., the number of haplotype blocks and the average number of block length within each region, as well as visual examination of local LD patterns. For each region, its haplotype structure is first inferred using Gabriel's method [17] implemented in Haploview based on its default values, and the number of haplotype blocks and the average number of block length is obtained. For each method and for each fixed set of parameters, 100 replicates will be generated and these simulated data will be reloaded to Haploview to obtain their haplotype structures. The average number of blocks and average length of blocks of the 100 replicates will be compared to values obtained from the input data.
Parameters
Allele frequencies, LD and block structure
Comparison of block structures of simulated data and original data.
Dataset  Original  Extension  Block  

# blocks  avg len  # blocks  avg len  overlap  # blocks  avg len  overlap  
ALS Chr1 500 SNPs  93  3.67  91.95  3.61  91.98%  92.91  3.54  94.34% 
ENr112.2p16.3  42  19.14  51.02  16.86  98.34%  67.30  12.21  97.17% 
ENr131.2q37.1  54  16.85  54.95  17.02  96.88%  79.59  11.21  95.26% 
ENr113.4q26  23  44.78  26.55  39.96  98.11%  91.51  10.78  93.87% 
ENm010.7p15.2  28  15.36  27.47  16.05  96.98%  31.60  13.68  96.99% 
ENm013.7q21.13  22  33.05  17.59  41.58  96.27%  57.58  11.88  91.08% 
ENm014.7q31.33  23  37.70  19.15  45.33  96.79%  70.3  11.69  92.04% 
ENr321.8q24.11  24  21.50  26.12  19.88  97.08%  32.01  15.87  96.38% 
ENr232.9q34.11  24  17.29  31.08  13.85  96.85%  36.02  11.56  96.94% 
ENr123.12q12  31  21.81  26.57  27.02  97.27%  48.40  13.92  96.15% 
ENr213.18q12.1  19  29.74  22.32  26.40  97.68%  36.83  15.41  96.71% 
Results of the block method on the same physical region with different densities and different missing data imputing methods.
Dataset  Original  Block  

# SNPs  # blocks  avg len  # blocks  avg len  overlap  
Phase I  1157  42  19.14  67.3  12.21  97.2% 
Phase I (nr)  441  38  6.29  53.2  4.85  96.2% 
ALS  165  25  4.92  23.2  5.2  94.4% 
Phase I (nr, nLD)  441  38  6.29  30.8  7.25  71.4% 
Efficiency
The gs program is efficient and runs fast. It can generate 100 replicates of 200 individuals with around 500 markers within minutes on a desktop with a 3.4 GHz processor and 2 GB memory. We have also tested the program in the context of genome wide association studies using the ALS data, which is based on Illumina 550K SNP chips [14]. For the extension method, the gs program needs haplotypes as its inputs. The program fastPhase [16] was used to obtain haplotype information from genotypes. We randomly selected one chromosome (chromosome 6) with 36,381 SNPs, and it took fastPhase about 96 hours to obtain the haplotype pairs of 271 (normal) samples from the ALS dataset. This limits us to perform the tests using only one chromosome for the extension method in this study. On average (from five replicates), it took about 140 minutes in generating 1000 cases and 1000 controls from the haplotype pairs of the 271 samples for the 36K SNPs. Because our algorithm is linear in the number of SNPs, the estimated time to generate 550K SNPs in a sequential manner is roughly 33 hours. If a cluster with multiple processors were used, the computation for different chromosomes can be run independently, and the total time is bounded by the running time of the chromosome with the largest number of SNPs. For the block method, gs needs haplotype block structures as inputs. However, Haploview cannot handle datasets with such sizes at a chromosome level for structure prediction. We could not test the block method at the genome level. Based on our experiences on small datasets, the block method is slight faster than the extension method. Therefore, for simulations at a genome level, the block method may require less or similar time as the extension method does.
Discussion
We develop a program to generate genotypes/haplotypes by perturbing real data based on two approaches. Qualitative or quantitative phenotypes can be generated based on genetic models. The goal of generating simulated data by perturbation is to create a large number of replicates that share similar properties with real data. For the extension method, the randomness is mainly from the sampling procedure at each step when an extension occurs. Experiments show it is robust across a wide range of parameter values and SNP densities. For the block method, noise can be introduced when imputing rare haplotypes or imputing SNPs not within original blocks, in addition to deviations due to random sampling. Test results show it is more suitable for data with small blocks (length ≤ 10, e.g., Illumina 550K SNP chips). One should also notice one practical limitation while simulating data using perturbations. When simulated data only inherit properties from the set of input samples, it may never be able to represent the whole population if the inputs are biased. For example, the HapMap project only consists of a small number of samples in each of its ethnic groups. Rare SNPs may not be typed and rare haplotypes may not be observed. Data generated based on HapMap samples cannot reveal the true distribution of rare SNPs. This problem will be alleviated with the availability of more real data in the future.
The standard error of the average block length of each ENCODE region is usually great (data not shown), which reflects the fact that block lengthes vary dramatically. Therefore, the number of blocks and the average length of blocks can not give a whole picture of the block structure. Visual examination reveals that local LD structures from simulated data generated by both methods show high concordance with those from original data. However, none of the two methods can have a good control on long range LD. Another limitation of the program is that it requires inputs to be either haplotypes or haplotype structures. Both of them have to be inferred from genotype data, and the inference usually takes much longer time than the simulation itself. We will investigate new perturbation approaches directly based on genotype data. The method in generating genotypes at the disease loci assumes that the alleles are in HardyWeinberg equilibrium (HWE) in the population where samples are drawn. It is known that a population can deviate from the HWE due to many reasons [18]. For example, if there are differences in the survival rates of individuals with different genotypes, deviations from HWE might occur. Some other reasons that cause deviation from the HWE include nonrandom mating, preferential selection of samples, etc. The samples generated by our program are not suitable to study diseases that might deviate from HWE for those reasons.
Conclusion
We have developed a software tool (gs) that can efficiently generate a large number of samples with genomic and phenotypic variations based on HapMap data or any real data. Experiments show that the two approaches can produce data that share similar local LD patterns as those in the input data. Both singlelocus and twolocus disease models have been incorporated in the implementation. The data generated by the program can be used for a variety of purposes, including the evaluation of algorithms for haplotype inference, tag SNP selection and association studies. It can be used to evaluate algorithms for gene fine mapping as well as algorithms for genome wide association studies.
Availability and requirements
Project name: Generating Samples based on HapMap data
Project home page: http://www.eecs.case.edu/~jxl175/gs.html
Operating system(s): Windows and Linux/Unix.
Language: C++.
Any restrictions to use by nonacademics: none.
List of Abbreviations
 SNP:

singlenucleotide polymorphism
 LD:

linkage disequilibrium
 DAF:

Disease allele frequency
Declarations
Acknowledgements
This research was supported by the NIH/NLM grant LM008991, the startup fund from Case Western Research University to JL, and in part by a U.S. Public Health Service Resource grant (RR03655) from the National Center for Research Resources. The authors thank three anonymous reviewers for their constructive suggestions, which greatly improve the paper.
Authors’ Affiliations
References
 The International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437: 1299–320. 10.1038/nature04226PubMed CentralView ArticleGoogle Scholar
 Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP: Linkage disequilibrium mapping via cladistic analysis of singlenucleotide polymorphism haplotypes. Am J Hum Genet 2004, 75: 35–43. 10.1086/422174PubMed CentralView ArticlePubMedGoogle Scholar
 Li J, Jiang T: Haplotypebased linkage disequilibrium mapping via direct data mining. Bioinformatics 2005, 21: 4384–93. 10.1093/bioinformatics/bti732View ArticlePubMedGoogle Scholar
 Hudson RR: Generating samples under a WrightFisher neutral model. Bioinformatics 2002, 18: 337–8. 10.1093/bioinformatics/18.2.337View ArticlePubMedGoogle Scholar
 Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005, 37: 413–7. 10.1038/ng1537View ArticlePubMedGoogle Scholar
 Evans DM, Marchini J, Morris AP, Cardon LR: Twostage twolocus models in genomewide association. PLoS Genet 2006, 2(9):e157. 10.1371/journal.pgen.0020157PubMed CentralView ArticlePubMedGoogle Scholar
 Lonita L, Man M: Optimal twostage strategy for detecting interacting genes in complex diseases. BMC Genetics 2006, 7: 39. 10.1186/14712156739PubMed CentralView ArticleGoogle Scholar
 Terwilliger JD, Ott J: Handbook of Human Genetic Linkage. Baltimore, Johns Hopkins; 1994.Google Scholar
 Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005, 21: 263–5. 10.1093/bioinformatics/bth457View ArticlePubMedGoogle Scholar
 Lynch M, Walsh B: Genetics and analysis of quantitative traits. Sinauer Associates, MA, USA; 1998.Google Scholar
 Li W, Reich J: A Complete Enumeration and Classification of TwoLocus Disease Models. Hum Hered 2000, 50: 334–349. 10.1159/000022939View ArticlePubMedGoogle Scholar
 Matsumoto M, Nishimura T: A 623dimensionally equidistributed uniform pseudorandom number generator. ACM Trans on Modeling and Computer Simulation 1998, 8: 3–30. 10.1145/272991.272995View ArticleGoogle Scholar
 ENCODE Project[http://www.genome.gov/10506161]
 Schymick JC, Scholz SW, Fung HC, Britton A, Arepalli S, Gibbs JR, Lombardi F, Matarin M, Kasperaviciute D, Hernandez DG, Crews C, Bruijn L, Rothstein J, Mora G, Restagno G, Chio A, Singleton A, Hardy J, Traynor BJ: Genomewide genotyping in amyotrophic lateral sclerosis and neurologically normal controls. Lancet Neurology 2007, 6(4):322–328. 10.1016/S14744422(07)700376View ArticlePubMedGoogle Scholar
 Coriell Institute for Medical Research[http://ccr.coriell.org]
 Scheet P, Stephens M: A fast and flexible statistical model for largescale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 2006, 78: 629–644. 10.1086/502802PubMed CentralView ArticlePubMedGoogle Scholar
 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, LiuCordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science 2002, 296(5576):2225–9. 10.1126/science.1069424View ArticlePubMedGoogle Scholar
 Sham P: Statistics in human genetics. New York, NY: Oxford University Press; 1998.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.