Generating samples for association studies based on HapMap data
© Li and Chen. 2008
Received: 22 March 2007
Accepted: 24 January 2008
Published: 24 January 2008
Skip to main content
© Li and Chen. 2008
Received: 22 March 2007
Accepted: 24 January 2008
Published: 24 January 2008
With the completion of the HapMap project, a variety of computational algorithms and tools have been proposed for haplotype inference, tag SNP selection and genome-wide association studies. Simulated data are commonly used in evaluating these new developed approaches. In addition to simulations based on population models, empirical data generated by perturbing real data, has also been used because it may inherit specific properties from real data. However, there is no tool that is publicly available to generate large scale simulated variation data by taking into account knowledge from the HapMap project.
A computer program (gs) was developed to quickly generate a large number of samples based on real data that are useful for a variety of purposes, including evaluating methods for haplotype inference, tag SNP selection and association studies. Two approaches have been implemented to generate dense SNP haplotype/genotype data that share similar local linkage disequilibrium (LD) patterns as those in human populations. The first approach takes haplotype pairs from samples as inputs, and the second approach takes patterns of haplotype block structures as inputs. Both quantitative and qualitative traits have been incorporated in the program. Phenotypes are generated based on a disease model, or based on the effect of a quantitative trait nucleotide, both of which can be specified by users. In addition to single-locus disease models, two-locus disease models have also been implemented that can incorporate any degree of epistasis. Users are allowed to specify all nine parameters in a 3 × 3 penetrance table. For several commonly used two-locus disease models, the program can automatically calculate penetrances based on the population prevalence and marginal effects of a disease that users can conveniently specify.
The program gs can effectively generate large scale genetic and phenotypic variation data that can be used for evaluating new developed approaches. It is freely available from the authors' web site at http://www.eecs.case.edu/~jxl175/gs.html.
With the completion of the HapMap project , large-scale, high-density single-nucleotide polymorphism (SNP) markers and information on haplotype structure and frequencies become available. A variety of statistical approaches have been proposed for association studies using haplotypes [2, 3] and more are expected for whole genome association studies. The utilities of such approaches are frequently very difficult to obtain through analytical analysis. Evaluations on those methods commonly rely on experiments based on simulations or empirical data. For example, one can generate a large number of samples based on population models such as the coalescent theory, and the program ms  is routinely used in the community. Thus statistical properties of new approaches can be investigated in a controlled manner as functions of population parameters such as recombination rates, mutation rates, population structure and migration rates. But simulations based on population genetics models may not be able to capture the true property of LD in human populations, due to their simplified assumptions. Conclusions based on such simulated data may be misleading or inaccurate in reality. Some researchers [2, 3] have used their in-house tools to generate empirical data based on real data, in hopes that their empirical data could inherit major properties of human populations such as LD patterns. But each group may use its own models and tools, and comparisons on results from different groups are usually impossible in general. A public available program that can generate a large number of independent samples based on real human data can be of great use for evaluating new proposed approaches. We have implemented a program in C++, called gs, that can efficiently generate such samples based on real data. Two heuristic approaches have been implemented. One is based on phased haplotype pairs and the other is based on haplotype block structures. The program can also directly take data from the HapMap project as its inputs. Our experiments show that genotype data generated by both approaches observe similar local haplotype structures and LD patterns as those in the input data, and also keep a proper level of variety. Therefore, genotype data can be used in testing algorithms for tag SNP selection and haplotype inference.
Simulation data is always indispensable for evaluation purposes. The major goal of this tool is to allow users to generate large scale simulation data for association studies, including both fine mapping and genome wide association. Both quantitative and qualitative traits have been incorporated. To generate disease phenotypes, users can either specify a one-locus disease model or a two-locus disease model with or without epistasis. Many studies and growing evidences have revealed the importance of epistasis in the etiology of complex traits. More and more efforts have been made to detect epistatic interactions [5–7]. For example, Marchini et al. studied the feasibility and power of a full two-locus model in the context of genome wide association studies and compared its performance with a single locus based approach and a two-stage design . Evans et al. compared the performance of four different search strategies (i.e., a single-locus search, an exhaustive two-locus search, and two, two-stage procedures) to detect interacted loci using two-locus models . Luliana et al. compared two natural two-stage approaches: the conditional approach and the simultaneous approach . But for all the above studies, simulations were based on genotypes only at the trait loci. Although it is hardly possible to simulate thousands of individuals for hundreds of thousands SNP markers for thousands times for genome wide association studies due to storage and time constraints, our program can embed the disease genotypes into genome regions that mimic genomic content in human populations. Associations with markers can be tested in a more realistic scenario.
The program has implemented two methods to generate haplotypes/genotypes, i.e., the extension method and the block method. Both methods can be used to generate qualitative and quantitative phenotype data. We first introduce the two methods in the context of generating case-control data, and then briefly discuss how quantitative traits can be generated. The extension to two-locus models will then be discussed.
The actual genotype g will then be selected based on the conditional distribution. The genotype frequencies in the above formula can be obtained from allele frequencies under the assumption of Hardy-Weinberg equilibrium. The probability of being affected given a particular genotype (penetrance) is given by users as a parameter. To generate the haplotype pairs h 1 and h 2 for this affected individual, the program randomly selects two haplotypes h 3 and h 4 from the inputs with the genotype at the disease locus t as required (i.e., (h 3 [t], h 4 [t]) = g, where h i [t] denotes the allele at the t th locus on haplotype h i ). In their original paper , haplotype h 1 will be given the same alleles as h 3 from locus t - l to t + l, where l is a parameter that can be specified by users. To extend h 1 to the right for one more locus, it randomly selects another haplotype h 5 that has the same alleles as h 1 from locus t - l + 1 to locus t + l, and let h 1 [t + l + 1] = h 5 [t + l + 1]. By iterating the above process, one can extend h 1 to the right and then to the left. We found that LD patterns from samples generated this way greatly depend on the parameter l (data not shown). Even for one particular data set, the extend of LD may vary substantially in different segments. A single l can not accommodate all the cases. We have extended the above method by introducing two parameters, l min and l max . The overlapped length for both the initial assignment and the extension of h 1 will be stochastically determined by two values l l and l r (l min ≤ l l , l r ≤ l max ), one for each direction. The values of l l and l r depend upon the strength of local LD. More specifically, l r is initialized as l min . The value of l r is increased by 1 if the LD measure D' between locus t + l r and locus t + l r + 1 is greater than a uniformly distributed random number between 0 and 1. The process will terminate when the value of D' is smaller than such a random number or when l r = l max . The value of l l can be determined similarly to the left. The haplotype h 1 will be given the same alleles as h 3 from locus t - l l to t + l r initially. At each extension step, the procedure is the same as in the original paper , but with differences in determining the length of the overlapped region. For example, to extend to the right for one more locus, suppose the current locus (the right most one) is t 1. The leftmost locus t 2 of the overlapped region is stochastically determined based on pairwise LD with the constraint that l min ≤ t 1 - t 2 ≤ l max . A haplotype that shares the same segment with h 1 from t 2 to t 1 will be randomly selected and its allele at t 1 + 1 will be copied to h 1. A detailed description of the above procedure can be found in the manual of the program. The haplotype h 2 can be obtained similarly. The required number of cases can be generated by repeating the process. One can generate normal individuals using the same approach based on the genotype distribution conditional on the fact that the individuals are normal. By using two parameters, the method takes both long-range LD (up to l max ) and short-range LD into considerations.
where z follows the standard normal distribution, a is half of the difference between two homozygous genotypic values (assume the mutant allele increases the phenotypic value), and k is a parameter representing the dominance effect, it is known  that V A and V D can be written asV A = 2p(1 - p)a 2(1 - k(2p - 1))2, and V D = (2p(1 - p)ak)2,
where p is the frequency of the mutant allele. Thus a and k can be obtained based on the above equations. The phenotypic value of an individual can be calculated by substituting a and k into Equation 2.
Odds table for the jointly dominant-dominant model. α is the baseline odd and (1 + θ) is the genotype odd ratio.
α(1 + θ)
α(1 + θ)
α(1 + θ)
α(1 + θ)
Penetrance table for the jointly dominant-dominant model, where α and θ are defined in Table 1.
The frequencies on genotype combinations (Pr(g i )) can be obtained from allele frequencies under the Hardy-Weinberg Equilibrium assumption. The details of all the nine built-in models can be found in the manual of the program. Once the three by three penetrance table is ready, the gs program can calculate the conditional probability of each genotype combination given affected status using a similar formula as Equation 1. Actual genotypes at the two loci will be selected based on the conditional distribution. The haplotypes will then be selected independently for these two loci, using either the extension method or the block method. The above two-locus model does not explicitly consider linkage disequilibrium between them (i.e., the two loci are assumed to be in linkage equilibrium). We will consider models with two disease loci that are in LD, as well as haplotype-based disease models in a subsequent version.
The inputs to the program can be obtained from HapMap project or from users' own research projects. Phased haplotype pairs from HapMap website can be directly incorporated into the software using the extension method. To use the block method, one can generate block structures from phased haplotype pairs using the program Haploview . Both approaches allow us to generate large data sets that mimic the true local LD from human populations. A variety of parameters can be specified by users, including some that control the output formats. There are basically three different output formats including a widely used format in the community, linkage format. Users can choose to output phased haplotype data or unphased genotype data. The disease causing SNP can be kept or removed in the final outputs. More details about file formats can be found in the manual of the program. The program does not directly model population structures. But one can create a data set that is a mixture of different populations with different allele frequencies and effects by combining samples generated based on inputs from different populations. To generate random numbers, the Mersenne Twister algorithm  has been adopted.
We have tested both generating methods using two different data sets. The first dataset consists of all 10 ENCODE regions from the HapMap project, which is a portion of the 44 regions from the ENCODE project. These ENCODE regions were selected either manually, or randomly based on gene density and level of non-exonic conservation. Details about selection criteria can be found at ENCODE website . The 10 HapMap-ENCODE regions were resequenced in 48 unrelated individuals and genotype data were obtained from about 20 thousand SNPs of all 270 HapMap samples. The haplotype pairs of each individual can be downloaded directly from the HapMap phase I data release. For each region, we take the parental haplotypes from 30 trios in a population with northern and western European ancestry. We mainly present our results using one region (ENr112 on chromosome 2p16.3) in the paper and results on other regions can be found on our website as supplementary materials. The second dataset is from a genome wide association study of a neurological disease . This dataset is one of the first sets of publicly available genome wide SNP data, which can be downloaded from Coriell Institute for Medical Research . In its first stage, the study genotyped 550k SNPs across genome using the Illumina Infinium II SNP chip, for 276 patients with sporadic amyotrophic lateral sclerosis (ALS) disease and 271 normal individuals. We have to preprocess the data before running our program, by either inferring haplotypes or predicting haplotype block structures. We chose the most recent version of a widely used program fastPhase  for haplotype inference. Haploview was selected for haplotype block structure prediction.
Because the phenotypic models are standard, we mainly assess the performance of our program in terms of the similarity of haplotype structures between simulated data and input data. The haplotype structure is measured mainly by two quantities, i.e., the number of haplotype blocks and the average number of block length within each region, as well as visual examination of local LD patterns. For each region, its haplotype structure is first inferred using Gabriel's method  implemented in Haploview based on its default values, and the number of haplotype blocks and the average number of block length is obtained. For each method and for each fixed set of parameters, 100 replicates will be generated and these simulated data will be reloaded to Haploview to obtain their haplotype structures. The average number of blocks and average length of blocks of the 100 replicates will be compared to values obtained from the input data.
Comparison of block structures of simulated data and original data.
ALS Chr1 500 SNPs
Results of the block method on the same physical region with different densities and different missing data imputing methods.
Phase I (nr)
Phase I (nr, nLD)
The gs program is efficient and runs fast. It can generate 100 replicates of 200 individuals with around 500 markers within minutes on a desktop with a 3.4 GHz processor and 2 GB memory. We have also tested the program in the context of genome wide association studies using the ALS data, which is based on Illumina 550K SNP chips . For the extension method, the gs program needs haplotypes as its inputs. The program fastPhase  was used to obtain haplotype information from genotypes. We randomly selected one chromosome (chromosome 6) with 36,381 SNPs, and it took fastPhase about 96 hours to obtain the haplotype pairs of 271 (normal) samples from the ALS dataset. This limits us to perform the tests using only one chromosome for the extension method in this study. On average (from five replicates), it took about 140 minutes in generating 1000 cases and 1000 controls from the haplotype pairs of the 271 samples for the 36K SNPs. Because our algorithm is linear in the number of SNPs, the estimated time to generate 550K SNPs in a sequential manner is roughly 33 hours. If a cluster with multiple processors were used, the computation for different chromosomes can be run independently, and the total time is bounded by the running time of the chromosome with the largest number of SNPs. For the block method, gs needs haplotype block structures as inputs. However, Haploview cannot handle datasets with such sizes at a chromosome level for structure prediction. We could not test the block method at the genome level. Based on our experiences on small datasets, the block method is slight faster than the extension method. Therefore, for simulations at a genome level, the block method may require less or similar time as the extension method does.
We develop a program to generate genotypes/haplotypes by perturbing real data based on two approaches. Qualitative or quantitative phenotypes can be generated based on genetic models. The goal of generating simulated data by perturbation is to create a large number of replicates that share similar properties with real data. For the extension method, the randomness is mainly from the sampling procedure at each step when an extension occurs. Experiments show it is robust across a wide range of parameter values and SNP densities. For the block method, noise can be introduced when imputing rare haplotypes or imputing SNPs not within original blocks, in addition to deviations due to random sampling. Test results show it is more suitable for data with small blocks (length ≤ 10, e.g., Illumina 550K SNP chips). One should also notice one practical limitation while simulating data using perturbations. When simulated data only inherit properties from the set of input samples, it may never be able to represent the whole population if the inputs are biased. For example, the HapMap project only consists of a small number of samples in each of its ethnic groups. Rare SNPs may not be typed and rare haplotypes may not be observed. Data generated based on HapMap samples cannot reveal the true distribution of rare SNPs. This problem will be alleviated with the availability of more real data in the future.
The standard error of the average block length of each ENCODE region is usually great (data not shown), which reflects the fact that block lengthes vary dramatically. Therefore, the number of blocks and the average length of blocks can not give a whole picture of the block structure. Visual examination reveals that local LD structures from simulated data generated by both methods show high concordance with those from original data. However, none of the two methods can have a good control on long range LD. Another limitation of the program is that it requires inputs to be either haplotypes or haplotype structures. Both of them have to be inferred from genotype data, and the inference usually takes much longer time than the simulation itself. We will investigate new perturbation approaches directly based on genotype data. The method in generating genotypes at the disease loci assumes that the alleles are in Hardy-Weinberg equilibrium (HWE) in the population where samples are drawn. It is known that a population can deviate from the HWE due to many reasons . For example, if there are differences in the survival rates of individuals with different genotypes, deviations from HWE might occur. Some other reasons that cause deviation from the HWE include non-random mating, preferential selection of samples, etc. The samples generated by our program are not suitable to study diseases that might deviate from HWE for those reasons.
We have developed a software tool (gs) that can efficiently generate a large number of samples with genomic and phenotypic variations based on HapMap data or any real data. Experiments show that the two approaches can produce data that share similar local LD patterns as those in the input data. Both single-locus and two-locus disease models have been incorporated in the implementation. The data generated by the program can be used for a variety of purposes, including the evaluation of algorithms for haplotype inference, tag SNP selection and association studies. It can be used to evaluate algorithms for gene fine mapping as well as algorithms for genome wide association studies.
Project name: Generating Samples based on HapMap data
Project home page: http://www.eecs.case.edu/~jxl175/gs.html
Operating system(s): Windows and Linux/Unix.
Any restrictions to use by non-academics: none.
Disease allele frequency
This research was supported by the NIH/NLM grant LM008991, the start-up fund from Case Western Research University to JL, and in part by a U.S. Public Health Service Resource grant (RR03655) from the National Center for Research Resources. The authors thank three anonymous reviewers for their constructive suggestions, which greatly improve the paper.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.