Global haplotype partitioning for maximal associated SNP pairs
- Ali Katanforoush^{1},
- Mehdi Sadeghi^{2, 3},
- Hamid Pezeshk^{4} and
- Elahe Elahi^{5}
https://doi.org/10.1186/1471-2105-10-269
© Katanforoush et al; licensee BioMed Central Ltd. 2009
Received: 22 January 2009
Accepted: 27 August 2009
Published: 27 August 2009
Abstract
Background
Global partitioning based on pairwise associations of SNPs has not previously been used to define haplotype blocks within genomes. Here, we define an association index based on LD between SNP pairs. We use Fisher's exact test to assess the statistical significance of the LD estimator. By this test, each SNP pair is characterized as associated, independent, or not-statistically-significant. We set limits on the maximum acceptable proportion of independent pairs within all blocks and search for the partitioning with the maximal proportion of associated SNP pairs. Essentially, this model reduces to a constrained optimization problem, the solution of which is obtained by iterating a dynamic programming algorithm.
Results
In comparison with other methods, our algorithm reports blocks of larger average size. Nevertheless, the haplotype diversity within the blocks is captured by a small number of tagSNPs. Resampling HapMap haplotypes under a block-based model of recombination showed that our algorithm is robust in reproducing the same partitioning for recombinant samples. Our algorithm performed better than previously reported models in a case-control association study aimed at mapping a single locus trait, based on simulation results that were evaluated by a block-based statistical test. Compared to other methods of haplotype block partitioning, our method performed best at detecting recombination hotspots.
Conclusion
Our proposed method divides chromosomes into regions within which allelic associations of SNP pairs are maximized. This approach presents a native design for dimension reduction in genome-wide association studies. Our results show that the pairwise allelic association of SNPs can describe various features of genomic variation, in particular recombination hotspots.
Background
Analysis of Single Nucleotide Polymorphisms (SNPs) in the DNA of unrelated individuals revealed a block-like structure of haplotype variation along the human genome. Using the first available genome-wide data of SNPs on chromosome 21, Patil et al. [1] showed that in particular regions on the chromosome, the observed diversity of SNP haplotypes is lower than expected. Almost at the same time, a similar structure in haplotypes within a region of 103 SNPs on chromosome region 5q31 was reported by Daly et al. [2]. In the latter study, a block structure of haplotypes was revealed using a Hidden Markov Model for estimating recombination rates. This approach, unlike models based on haplotype diversity, incorporated a quantity measuring Linkage Disequilibrium (LD) between pairs of SNPs.
It is well known that effects such as population bottlenecks, geographic isolation, and natural selection can increase the extent of linkage disequilibrium in genomes [3]. However, in established large populations under random mating, haplotype blocks reflect mutation events in the ancestors of that population and recombination events along the chromosomes. An increased frequency of recombination events is likely to create genomic regions with haplotype blocks of small size. Based on HapMap data, Myers et al. [4] identified short genomic regions within the human genome in which the recombination rates are orders of magnitude higher than background levels. Such regions are called "recombination hotspots". Identification of hotspots of recombination and estimation of rates of crossover therein are important issues [5–7].
However, existing approaches that address these issues are generally complicated and computationally intensive. Haplotype blocks can rapidly provide rough estimations about hotspots of recombination. Knowledge of haplotype blocks has other applications as well. The block structure of chromosomes can be used in statistical approaches aimed at achieving better understanding of genetic features that differentiate ethnic groups [8–10]. The phasing problem in genotype analysis is typically solved by computational methods; one can reduce the computational cost by breaking the input genotype data into smaller units along boundaries of haplotype blocks. As a consequence, block partitioning is performed prior to phasing and other analysis in genomic sequence-based endeavors, including those incorporated in Haploview [11]. Similarly, Zhao et al. [12] proposed the use of block partition inside the PL-EM algorithm for the haplotype inference problem [13], as a substitute for the common PL technique.
One of the most important applications of SNP haplotype data is the identification of disease-causing genes. State-of-the-art sequencing technologies that provide large volumes of SNP data, along with efficient statistical analyses, have made the use of haplotype data for gene identification a realistic goal. These statistical analyses usually involve time-consuming iterative tasks on large data sets. Clearly, the reduction of data volume by making use of haplotype blocks allows for more rapid and efficient analyses. B. Browning and S. Browning [14] presented a method for disease association studies based on haplotype blocks that relies on pairwise association between SNPs. Other approaches in disease association studies require samples of limited haplotype diversity [15–17].
In case-control association studies designed to identify disease-causing genes, one of two strongly associated SNPs can be used as a "proxy" for testing association of the trait with the other. Grouping such associated "proxy" SNPs together compresses the information to be used for case-control studies. Many strategies have been proposed to conduct case-control studies in an economical manner [18–20]. The key idea shared by all these approaches is that essentially all important information derived from association tests between each individual SNP and the phenotype of interest can be derived by analysis of a subset of SNPs, called tagSNPs. The most widely used method for tagSNP selection was introduced by Carlson et al. [21]. Applying a threshold on r^{2}, in each iteration of Carlson's method, the SNP included in the largest number of associated pairs is selected as a tagSNP.
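The greedy selection just described can be illustrated with a minimal sketch. It assumes a precomputed symmetric matrix of pairwise r^{2} values; the function name `greedy_tag_snps`, the default threshold of 0.8, and the tie-breaking rule (lowest index first) are illustrative assumptions rather than details taken from [21].

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedy tagSNP selection on a symmetric matrix of pairwise r^2 values.

    Hypothetical helper: the threshold value and tie-breaking are assumptions.
    """
    untagged = set(range(len(r2)))
    tags = []

    def partners(i):
        # Untagged SNPs associated with SNP i at the chosen r^2 threshold.
        return {j for j in untagged if j != i and r2[i][j] >= threshold}

    while untagged:
        # Pick the SNP involved in the largest number of associated pairs.
        best = max(sorted(untagged), key=lambda i: len(partners(i)))
        tags.append(best)
        # The tag and every SNP it captures leave the pool.
        untagged -= partners(best) | {best}
    return tags


r2 = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.0],
    [0.1, 0.85, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
tags = greedy_tag_snps(r2)
```

In this toy matrix, SNP 1 is associated with both SNP 0 and SNP 2, so it is chosen first; SNP 3 has no partners and must tag itself.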
Table 1. Summary of haplotype block partitioning methods compared in this study.
abbr. | Method | Partitioning structure | Block definition | Block constraint | Software | Ref. |
---|---|---|---|---|---|---|
HOT | Hotspot | Local* | recombination hotspots | none | "precomputed results available by HapMap" | [4] |
MB | Minimum block number | Global | minimum number of blocks | haplotype diversity | HapBlock v.3 | [23] |
HB | HapBlock | Global | minimizing total number of tagSNPs | haplotype diversity | HapBlock v.3 | [23] |
MDL | MDBlock | Global | minimum description length | haplotype diversity | MDBlock v.1 | [43] |
GAM | Four gamete test | Local | evidence of recombination | fourth gamete | Haploview v.4 | [44] |
GAB | Gabriel's method | Local | evidence of recombination | strongly associated SNPs | Haploview v.4 | [26] |
GPG | GPMAP** based on Gabriel's index | Global | maximizing associated SNPs | independent SNPs | Haploview+GPMAP | present paper |
GPF | GPMAP based on Fisher's exact Test | Global | maximizing associated SNPs | independent SNPs | Haploview+GPMAP | present paper |
In this work, we discuss a haplotype block partitioning method that is based on pairwise association of SNPs. There are several statistics for determining allelic correlation between SNPs. Two well known statistics are D and r^{2} which, respectively, represent the sample covariance and the squared sample correlation coefficient of two bi-allelic markers. Theoretical distributions of these statistics have been well studied. For instance, it is known that nr^{2} is asymptotically chi-squared [25], where n is the sample size. These two statistics depend on the marginal frequencies of the two SNPs. D', a third measure of linkage disequilibrium introduced by Lewontin [25], is preferred by some researchers. D' is a statistic normalized with respect to the marginal frequencies and varies linearly between 0 and 1 as a function of the fourth gamete frequency.
In fact, although the point estimate of D' is independent of sample size, its distribution under the hypothesis of independence is a function of sample size. To measure the significance of D', Gabriel et al. [26] suggested the use of an interval estimate. Here, we suggest that p-values derived from Fisher's exact test be used to assess the significance of D'. Since this test is used in disease association studies to assess phenotype-SNP association, it seems reasonable to use the same test for SNP-SNP association.
Haplotype block methods produce either "local partitioning" or "global partitioning". In local partitioning, haplotype blocks are defined independently of the configuration of other haplotype blocks in the genome. Usually non-contiguous blocks are produced that look like a series of "islands" within the genome. In contrast, in global partitioning, the aim is to split the genome into haplotype blocks; the entire genome is "tiled", meaning that all regions of the genome are within a block. Here, no single definition is applicable to all the blocks. To the best of our knowledge, little effort has been made to incorporate a pairwise measure of SNPs into a global block partitioning method.
In an earlier effort to discuss how well LD patterns are consistent with block boundaries, Wall and Pritchard evaluated three measures, namely "coverage", "absence of holes", and "non-overlapping blocks" [27]. "Coverage" refers to the portion of each chromosome that is covered by haplotype blocks. Two haplotype blocks overlap if their boundaries cross each other. A "hole" in a haplotype block occurs where a SNP is not in strong LD with any of the other SNPs in the block. These features have been assessed only for the haplotype blocks produced by Gabriel's method. Obviously, all global partitioning approaches produce non-overlapping blocks which together completely cover the genome.
The block partitioning approach we present in this work consists of two steps. First, an association index is derived that characterizes each pair of SNPs as "associated", "independent", or "not-statistically-significant". Then, haplotype blocks are determined such that the number of associated SNP pairs within each block is maximized, while a limit is set on the fraction of acceptable independent SNP pairs. An iterative search algorithm is used to find the solution of the constrained optimization. The method produces global partitioning of chromosomes. Our method results in complete coverage, no overlapping, and "absence of holes" in blocks.
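For concreteness, the three statistics discussed above can be computed from the four haplotype counts of a SNP pair. The following is a minimal sketch using the standard textbook definitions; the function name `ld_stats` and the convention of returning zeros for monomorphic markers are assumptions made for illustration.

```python
def ld_stats(n00, n01, n10, n11):
    """Return (D, r^2, D') from the counts of the four two-SNP haplotypes.

    n_ab counts haplotypes carrying allele a at the first SNP and allele b
    at the second SNP. Hypothetical helper using textbook definitions.
    """
    n = n00 + n01 + n10 + n11
    p1 = (n10 + n11) / n          # frequency of allele 1 at the first SNP
    q1 = (n01 + n11) / n          # frequency of allele 1 at the second SNP
    d = n11 / n - p1 * q1         # sample covariance of the two 0/1 variables
    denom = p1 * (1 - p1) * q1 * (1 - q1)
    if denom == 0:                # a monomorphic marker carries no LD signal
        return d, 0.0, 0.0
    r2 = d * d / denom
    # D' rescales D by the largest magnitude attainable for these margins.
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    return d, r2, d / d_max


d, r2, d_prime = ld_stats(4, 1, 1, 4)
```

With only two haplotypes present, for instance `ld_stats(5, 0, 0, 5)`, both r^{2} and D' reach 1; D' also reaches 1 in absolute value whenever at least one of the four haplotypes is absent, which is the link to the fourth gamete.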
Having applied the method, we comprehensively compare its performance with some previously reported methods for haplotype block designation. Robustness of each method is assessed by evaluating the consistency of block boundaries on permuted samples. We also assess the potentials of block structures created by each method to serve as a reference block structure for genome-wide disease association studies. In this regard, they are assessed under conditions of use of different marker density. Finally, we evaluate whether our haplotype block partitioning method can be used to detect recombination hotspots in the genome.
Methods
Association test for SNP pairs
In this section, we discuss the use of Fisher's exact test for quantifying the concept of "strongly associated" SNPs. Fisher's exact test is a test of independence, mostly used on 2 × 2 contingency tables [28]. Unlike tests based on r^{2}, which are assessed approximately by the chi-square distribution, Fisher's exact test provides an exact p-value for the null hypothesis of independence given the observed samples. There are many concrete applications of Fisher's exact test and its related statistics in molecular biology and genetics [29, 30].
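Assume that n samples of L bi-allelic sites are available. We assign a 0/1 random variable X_{i} to each SNP i. For a pair of SNPs i and j, the probability of the observed 2 × 2 contingency table under independence, with the marginal counts held fixed, is the hypergeometric probability
F_{ex} = (n_{1·}! n_{0·}! n_{·1}! n_{·0}!) / (n! n_{00}! n_{01}! n_{10}! n_{11}!),
where n_{ab} is the number of outcomes ab for X_{i}X_{j}, n_{1·} = n_{10} + n_{11}, n_{·1} = n_{01} + n_{11}, n_{0·} = n - n_{1·}, and n_{·0} = n - n_{·1}. In this context, it can easily be shown that, for a given n, F_{ex} depends only on n_{11}, n_{1·} and n_{·1}; i.e. F_{ex} = F_{ex}(n_{11}; n_{1·}, n_{·1}).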
where n_{max} = argmax_{i} F_{ex}(i; n_{1·}, n_{·1}) corresponds to the most balanced contingency table, in which the sign of D changes.
We speed up the computation of the p-values of association tests by establishing a table of precomputed p-values for every triple (n_{1·}, n_{·1}, n_{11}), for each n_{1·} = 1,..., ⌊n/2⌋, n_{·1} = 1,..., n_{1·}, and n_{11} = 1,..., n_{·1}. By table look-up and taking the symmetry into account, p-values of association tests of all SNP pairs in the genomic region of interest are obtained.
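A minimal sketch of this computation is given below. It assumes that the one-tailed p-value sums the table probabilities in the tail lying on the same side of n_{max} as the observed n_{11}; this tail rule, the function names, and the use of memoization in place of an explicit precomputed table are assumptions made for illustration.

```python
from functools import lru_cache
from math import comb


def table_prob(n11, n1_, n_1, n):
    """Hypergeometric probability of a 2 x 2 table with fixed margins."""
    return comb(n1_, n11) * comb(n - n1_, n_1 - n11) / comb(n, n_1)


@lru_cache(maxsize=None)          # memoization stands in for the look-up table
def one_tailed_p(n11, n1_, n_1, n):
    """One-tailed p-value for the observed n11 (assumed tail rule, see text)."""
    lo = max(0, n1_ + n_1 - n)    # smallest n11 compatible with the margins
    hi = min(n1_, n_1)            # largest n11 compatible with the margins
    probs = {i: table_prob(i, n1_, n_1, n) for i in range(lo, hi + 1)}
    n_max = max(probs, key=probs.get)          # most probable table
    if n11 >= n_max:
        return sum(p for i, p in probs.items() if i >= n11)
    return sum(p for i, p in probs.items() if i <= n11)


p_value = one_tailed_p(5, 5, 5, 10)
```

Because the cache is keyed on the triple of counts (for a fixed n), SNP pairs that share the same counts reuse one computation, which mirrors the purpose of the precomputed table.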
Characterizing SNP pairs based on association test
Obviously, statistical significance increases with sample size: as the sample size grows, interval estimates become narrower and p-values become smaller. Taking this into consideration, we classify all SNP pairs into three classes: "associated", "independent", and "not-statistically-significant". For predetermined thresholds D'_{0} and p_{0}, we define pairs with |D'| < D'_{0} as independent and all other pairs as associated, except that a pair with |D'| > D'_{0} and p-val_{one-tailed} > p_{0} is counted as not-statistically-significant. D'_{0} is the smallest value of |D'| at which a SNP pair can be assumed to be associated, and p_{0} is the significance level of the test of independence. The proper value of D'_{0} essentially depends on the genetic features of the population. However, choosing a stringent cut-off for p-val_{one-tailed} makes the choice of D'_{0} less critical. Our default settings are D'_{0} = 0.8 and p_{0} = 0.01.
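The three-way rule can be written as a short function. This is a minimal sketch: `classify_pair` and the parameter name `d0` (the cut-off on |D'|) are hypothetical names, and the defaults are the values 0.8 and 0.01 stated above.

```python
def classify_pair(d_prime, p_one_tailed, d0=0.8, p0=0.01):
    """Classify a SNP pair from its D' estimate and one-tailed p-value."""
    if abs(d_prime) < d0:
        return "independent"
    if p_one_tailed > p0:
        # |D'| is high, but the test of independence is not rejected.
        return "not-statistically-significant"
    return "associated"


labels = [
    classify_pair(0.5, 0.001),
    classify_pair(0.9, 0.2),
    classify_pair(-1.0, 0.001),
]
```

Note that the sign of D' is ignored, so a pair in strong negative LD is treated the same way as a pair in strong positive LD.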
Maximizing associated SNP pairs subject to limited independent pairs
Given haplotype samples, all pairs of SNPs in the whole region are classified as associated, independent, or not-statistically-significant. It is usually convenient to avoid extra computation by setting a maximum physical distance above which no linkage is assumed between SNPs. For instance, markers more than 500 kb apart are assumed independent by some researchers. Toward an objective definition for haplotype blocks, we assume that in a population free of genetic drift, selection, and migration, a haplotype block eventually determines two boundaries on the genome within which every SNP pair is in "association". A similar idea has recently been considered by Pattaro et al. [33], though in their approach, a likelihood model for the LD pattern in haplotype blocks is introduced in which two distinct distributions model independent and associated SNPs separately.
Many SNP pairs may be identified as independent pairs in a block, basically because not all existing variations are available and the limited samples from the population haplotypes may not be adequate to estimate the real situation. Therefore, we model the problem as finding a block partitioning such that the largest possible number of associated SNP pairs are included in blocks while the number of independent SNP pairs within blocks is kept limited.
where A[a, b] and B[a, b] are the numbers of independent and associated SNP pairs in the genome segment between SNP a and SNP b, respectively. The maximization is taken over all partitioning sets such that 0 = s_{0} < s_{1} < ⋯ < s_{k} = L are (unknown) indices of SNPs at left edges of (unknown) k blocks. N_{ind} is the number of independent SNP pairs in the entire genomic region and α is an arbitrary constant between zero and one that denotes the largest tolerable fraction of independent pairs in blocks.
where S(i; d) = B[i - d + 1, i] - λA[i - d + 1, i] is the score of the genomic interval ending at SNP i and consisting of d SNPs, and S^{opt}(i) is the score of the optimum block partitioning for the i leading SNPs. Optionally, the maximum number of SNPs within a block can be set to w. To obtain a proper value for the Lagrange multiplier λ, we apply a binary search procedure in which the reduced problem is solved iteratively for different values of λ until the desired constraint on the number of independent SNP pairs within blocks is satisfied. In general, increasing λ decreases the total number of independent SNP pairs included in blocks. In our experience, when α = 0.01, the Lagrange multiplier is obtained within about 10 iterations.
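The procedure can be sketched as follows. The sketch assumes the recurrence S^{opt}(i) = max over 1 ≤ d ≤ min(i, w) of S^{opt}(i - d) + S(i; d), with S^{opt}(0) = 0, and it treats the constraint as "independent pairs inside blocks ≤ αN_{ind}"; both are inferred from the definitions above rather than copied from the original formulas. All function names, the search bound `lam_max`, and the fixed iteration count are hypothetical.

```python
def interval_counts(flag):
    """cnt[a][b] = number of flagged pairs (i, j) with a <= i < j <= b (0-based)."""
    L = len(flag)
    cnt = [[0] * L for _ in range(L)]
    for span in range(1, L):
        for a in range(L - span):
            b = a + span
            # Inclusion-exclusion over the two shorter sub-intervals.
            cnt[a][b] = cnt[a + 1][b] + cnt[a][b - 1] - cnt[a + 1][b - 1] + flag[a][b]
    return cnt


def partition(B, A, lam, w=None):
    """Dynamic programme maximising the sum over blocks of B - lam * A."""
    L = len(B)
    w = w or L
    best = [0.0] * (L + 1)   # best[i]: optimum score for the first i SNPs
    last = [0] * (L + 1)     # last[i]: size of the final block in that optimum
    for i in range(1, L + 1):
        best[i], last[i] = max(
            (best[i - d] + B[i - d][i - 1] - lam * A[i - d][i - 1], d)
            for d in range(1, min(i, w) + 1)
        )
    blocks, used, i = [], 0, L
    while i > 0:             # trace back the block boundaries
        d = last[i]
        blocks.append((i - d, i - 1))
        used += A[i - d][i - 1]
        i -= d
    return blocks[::-1], used


def gpmap_sketch(assoc, indep, alpha=0.01, iters=20, lam_max=1000.0, w=None):
    """Binary search on the multiplier until the constraint is met."""
    L = len(assoc)
    B, A = interval_counts(assoc), interval_counts(indep)
    limit = alpha * A[0][L - 1]             # alpha * N_ind
    result = [(i, i) for i in range(L)]     # singleton blocks are always feasible
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        lam = (lo + hi) / 2
        blocks, used = partition(B, A, lam, w)
        if used <= limit:
            result, hi = blocks, lam        # feasible: try a smaller penalty
        else:
            lo = lam
    return result


groups = [0, 0, 0, 1, 1, 1]
assoc = [[int(i != j and groups[i] == groups[j]) for j in range(6)] for i in range(6)]
indep = [[int(groups[i] != groups[j]) for j in range(6)] for i in range(6)]
blocks = gpmap_sketch(assoc, indep)
```

In the toy input, SNPs 0 to 2 and SNPs 3 to 5 are mutually associated while all cross pairs are independent, so the sketch returns the two blocks (0, 2) and (3, 5). The search relies on the monotonic effect of λ noted above.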
An alternative algorithm based on Gabriel's index
In fact, our approach can be applied to improve any method that introduces haplotype blocks based on some pairwise index for SNPs. For instance, Gabriel's method [26] also introduces a three state index for SNP pairs based on the confidence interval of D'. None of the previous methods in this category incorporates any global optimization on block partitioning.
We have developed another haplotype block partitioning method using the above optimization scheme, substituting Gabriel's index for the SNP pair characterization. Both varieties of our method, one based on the association index derived from Fisher's exact test and the other based on Gabriel's index, have been incorporated in our extension to the widely used open-source software Haploview ver. 4. The software runs on Java and is publicly available via http://bioinf.cs.ipm.ac.ir/gpmap.
To deal with unphased genotype data, our method is assisted by the "two loci genotypes" phasing approach as implemented by [11] in Haploview. This approach is a simplified EM algorithm to infer the frequencies of the four possible haplotypes at two loci. Detailed formulas for this preprocessing can be found in [12].
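The idea behind two-locus phasing can be illustrated with a generic textbook EM iteration. This is a minimal sketch and not the Haploview implementation: it assumes genotypes coded as the number of copies of allele 1 at each locus, and the function name, the starting frequencies, and the fixed number of iterations are hypothetical choices. Only double heterozygotes are ambiguous, since they can be resolved as 00/11 or as 01/10.

```python
def two_locus_em(genotypes, iters=50):
    """Estimate the four two-locus haplotype frequencies from unphased data.

    genotypes: list of (g1, g2), each in {0, 1, 2} copies of allele 1.
    """
    counts = {"00": 0.0, "01": 0.0, "10": 0.0, "11": 0.0}
    n_dh = 0                                  # number of double heterozygotes
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_dh += 1
            continue
        # At least one locus is homozygous, so the phase is unambiguous.
        a = [0, 0] if g1 == 0 else [1, 1] if g1 == 2 else [0, 1]
        b = [0, 0] if g2 == 0 else [1, 1] if g2 == 2 else [0, 1]
        for x, y in zip(a, b):
            counts[f"{x}{y}"] += 1
    total = 2 * len(genotypes)
    freq = {h: 0.25 for h in counts}
    for _ in range(iters):
        # E-step: probability that a double heterozygote is 00/11, not 01/10.
        cis = freq["00"] * freq["11"]
        trans = freq["01"] * freq["10"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        freq = {
            "00": (counts["00"] + w * n_dh) / total,
            "11": (counts["11"] + w * n_dh) / total,
            "01": (counts["01"] + (1 - w) * n_dh) / total,
            "10": (counts["10"] + (1 - w) * n_dh) / total,
        }
    return freq


freq = two_locus_em([(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)])
```

In this example the homozygous individuals show only haplotypes 00 and 11, so the single double heterozygote is resolved almost entirely as 00/11.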
Method Comparison
We compare our proposed algorithm with other methods of haplotype block partitioning based on some descriptive aspects of haplotype blocks, performance on a block-based case-control study, and detecting recombination hotspots. In addition to the new methods introduced in the present paper, we choose six other available haplotype block partitioning algorithms. Table 1 summarizes the main features of these methods in our trial.
In Table 1, HOT is an exception. It has been developed as a method for inferring recombination "hotspots" throughout the human genome. However, a region between two consecutive hotspots can also be considered a haplotype block because the recombination level in the region is relatively low. In addition, comparison between other methods and HOT may suggest clues on extending application of block partitioning methods to recombination hotspot detection.
There is another method, MB, which has not been explicitly introduced in the literature. It is a special option in the HapBlock software in which for each block, exactly one tagSNP is assumed. Therefore, this algorithm finds the least possible number of haplotype blocks covering the genome while satisfying the haplotype diversity criterion.
General aspects of haplotype block partitioning
HapMap ENCODE samples used for haplotype block partitioning.
Region name | Chromosome band | Genomic interval (NCBI B36) | Genotyped SNPs* |
---|---|---|---|
ENr112 | 2p16.3 | Chr2:51512208..52012208 | 2,601 |
ENr131 | 2q37.1 | Chr2:234156563..234656627 | 2,214 |
ENr113 | 4q26 | Chr4:118466103..118966103 | 2,538 |
ENm010 | 7p15.2 | Chr7:26924045..27424045 | 1,830 |
ENm013 | 7q21.13 | Chr7:89621624..90121624 | 1,770 |
ENm014 | 7q31.33 | Chr7:126368183..126865324 | 3,343 |
ENr321 | 8q24.11 | Chr8:118882220..119382220 | 2,128 |
ENr232 | 9q34.11 | Chr9:130725122..131225122 | 1,909 |
ENr123 | 12q12 | Chr12:38626477..39126476 | 2,189 |
ENr213 | 18q12.1 | Chr18:23719231..24219231 | 1,990 |
There are about 2000 SNPs assayed in each ENCODE region in the CEU panel. However, we restricted the SNPs to those ascertained in all three HapMap panels, CEU, YRI, and JPT+CHB. Moreover, for each region, we selected 400 SNPs of high heterozygosity. To do so, we divided the region of interest into 20 equal-length subintervals and, for each one, picked the 20 most heterozygous SNPs from the SNPs shared by all panels. Therefore, a nearly uniform distribution of the most "informative" SNPs was obtained. This preliminary reduction was necessary for some of the block partitioning methods, as they cannot produce a result for a very large number of SNPs in a reasonable time. We apply all methods of Table 1 for block partitioning to these data. In our study, we first examine the resulting haplotype blocks for haplotype diversity and htSNP coverage.
To measure haplotype diversity, we apply a clustering approach that is a simple generalization of the commonly used definition of "common haplotypes" introduced by Patil et al. [1]. For each block, we group haplotypes into the same cluster such that every two members differ in at most four percent of SNPs. This small, yet nonzero, tolerance mitigates the effects of random noise and/or wrongly assayed SNPs on the estimation of haplotype diversity in long blocks. Clusters with six or more haplotypes are considered non-occasional clusters and indicate significant polymorphisms in the population. By common haplotype coverage, we mean the fraction of the whole sample that belongs to any non-occasional cluster.
The consistency of haplotype blocks with the pattern of LD is also of interest. Intuitively, a hole in a haplotype block is where an SNP has no significant association with other SNPs of the same block [27]. In a similar way, we refer to SNPs that are in strong association with SNPs of other blocks as islands. Precisely, we count an SNP as a hole if its intra-block average |D'| is less than 0.8 and as an island if its inter-block average |D'| is greater than 0.8.
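The binning step can be sketched as follows, assuming each SNP is given as a (position, heterozygosity) pair that has already been filtered to SNPs shared by all panels. The function name and argument names are hypothetical.

```python
def select_informative_snps(snps, start, end, n_bins=20, per_bin=20):
    """Pick the most heterozygous SNPs from each equal-length subinterval.

    snps: iterable of (position, heterozygosity); start/end: region bounds.
    """
    width = (end - start) / n_bins
    bins = [[] for _ in range(n_bins)]
    for pos, het in snps:
        if start <= pos <= end:
            # The region's right edge is assigned to the last subinterval.
            k = min(int((pos - start) / width), n_bins - 1)
            bins[k].append((pos, het))
    chosen = []
    for b in bins:
        # Highest heterozygosity first; ties keep their input order.
        chosen.extend(sorted(b, key=lambda s: s[1], reverse=True)[:per_bin])
    return sorted(chosen)


toy = [(pos, (pos % 10) / 20) for pos in range(100)]
picked = select_informative_snps(toy, 0, 100, n_bins=2, per_bin=3)
```

With the defaults of 20 subintervals and 20 SNPs per subinterval, at most 400 SNPs are returned, spread nearly uniformly over the region.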
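A minimal sketch of the coverage measure is shown below. The text does not specify how clusters are built, so the sketch assumes a simple greedy assignment in which a haplotype joins the first cluster whose members are all within the tolerance; the function name and this greedy rule are assumptions, and a different clustering order could give slightly different clusters.

```python
def common_haplotype_coverage(haplotypes, tol=0.04, min_size=6):
    """Fraction of haplotypes that fall in clusters of at least min_size."""

    def close(h1, h2):
        # Two haplotypes are close if they differ in at most tol of the SNPs.
        diff = sum(a != b for a, b in zip(h1, h2))
        return diff <= tol * len(h1)

    clusters = []
    for h in haplotypes:
        for c in clusters:
            if all(close(h, m) for m in c):   # every pair in a cluster is close
                c.append(h)
                break
        else:
            clusters.append([h])
    covered = sum(len(c) for c in clusters if len(c) >= min_size)
    return covered / len(haplotypes)


coverage = common_haplotype_coverage(["0000"] * 7 + ["1111"] * 3)
```

Here the seven identical haplotypes form a non-occasional cluster, while the three others form a cluster that is too small to count.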
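The two counts can be sketched as follows, given a symmetric matrix of |D'| values and a list of blocks as (first SNP, last SNP) index pairs. The sketch assumes that the inter-block average is taken over all SNPs outside the block; the text does not state which outside SNPs are included, so this choice and the function name are assumptions.

```python
def holes_and_islands(abs_dprime, blocks, cut=0.8):
    """Return the indices of SNPs counted as holes and as islands."""
    L = len(abs_dprime)
    holes, islands = [], []
    for start, end in blocks:
        inside = range(start, end + 1)
        outside = [j for j in range(L) if j < start or j > end]
        for i in inside:
            intra = [abs_dprime[i][j] for j in inside if j != i]
            inter = [abs_dprime[i][j] for j in outside]
            # Hole: weak average association with the SNP's own block.
            if intra and sum(intra) / len(intra) < cut:
                holes.append(i)
            # Island: strong average association with SNPs of other blocks.
            if inter and sum(inter) / len(inter) > cut:
                islands.append(i)
    return holes, islands


dp = [
    [1.0, 0.2, 0.1, 0.1],
    [0.2, 1.0, 0.95, 0.95],
    [0.1, 0.95, 1.0, 1.0],
    [0.1, 0.95, 1.0, 1.0],
]
holes, islands = holes_and_islands(dp, [(0, 1), (2, 3)])
```

In this toy matrix, SNP 1 is weakly associated with its own block and strongly associated with the neighbouring block, so it is counted both as a hole and as an island.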
where L is the number of SNPs in the whole region, k is the number of blocks of P_{1} ∪ P_{2}, and l is the number of SNPs in each block of the union partitioning. This measure shows the fraction of SNP pairs which are commonly included by both P_{1} and P_{2}.
Robustness of block partitioning methods
The htSNP coverage
To obtain the tagging SNPs that are required to describe all haplotype variations within a block, we employed the htSNPer software [36]. The htSNPer software incorporates an efficient branch and bound algorithm to find the exact solution of the minimum htSNP selection problem. We used default settings of htSNPer, i.e. htSNPs were defined to cover 80% "common haplotypes" and the threshold for common haplotype frequency was set to 0.05. However, we changed parameters of its block partitioning subroutine such that it does not function. This setting allows the minimum set of haplotype-tagging SNPs to be obtained for each haplotype block of each method. Recall that resulting htSNPs are mainly affected by the shape of underlying block partitioning.
Given a set of htSNPs, we find the largest vicinity on the chromosome within which only one htSNP is enough to capture all haplotype variations. Then, we repeat the procedure on remaining regions by two, three, and more htSNPs until the entire data is covered. We refer to the length of total genomic segments covered in the k-th step of this procedure as "k-htSNPs coverage". We compare methods of haplotype block partitioning with regard to this measure, too.
Application of haplotype blocks in disease association studies
The performance of block-like models of genomes in recognizing trait-associated loci can be assessed through a case-control design. A plan of experiment can include the following steps: 1) Using available haplotype data in HapMap, the block structure of chromosomes in a certain population is determined. 2) Considering the genotyping cost and practical limitations, an efficient number of SNPs is selected as markers to assay genotypes of case and control samples at these loci, as phase I genotyping. 3) An association test is performed on each block to obtain a scan of probably trait-associated blocks over the map. 4) More SNPs in the probable blocks of the previous step are genotyped until the desired fine map is achieved. Compared to the frequently used sliding window method, this approach has two advantages. First, it has a lower genotyping cost. Secondly, it avoids the choice of a window, for which there is no general agreement on how to select an optimum.
To evaluate the performance of various models for block partitioning in case-control studies, we apply the first three steps of the block-based design described above to simulated data. We consider two additive single locus disease models with GRR1 = 3 and GRR1 = 5. Since we have obtained haplotype blocks on samples taken from the HapMap CEU panel, it is necessary to make sure that simulated samples have the same genetic structure as the base data. The software gs [37] enables us to generate genotype samples for case-control studies, using real genotype data under the desired disease models. Applying the extension model of the software, we generate 500 sets of samples, each consisting of 50 case and 50 control genotypes. We repeat the simulation for low and moderate disease allele frequencies, independently. Before the next experiments, we remove the causative SNP from each sample.
Clustering haplotypes in each block, we perform both association and significance tests by applying the Pearson chi-square statistic. The clustering algorithm is the same as the one used to define "common haplotypes". In other words, after clustering each haplotype category consists of haplotypes no two of which differ from each other in more than four percent of their SNP genotypes.
We follow two policies to select those markers needed for the phase I genotyping. In the first policy, we choose the first SNP out of every k consecutive SNPs. Following Carlson's approach [21] for tagSNP selection, in the second policy, we prioritize SNPs in each block based on their orders in Carlson's algorithm. We select SNPs from each block based on their order until the required number of SNPs is taken from the whole region.
Our objective is to compare the power of different algorithms under the condition that false discovery rates of all algorithms are the same. We set this common rate to 10%, as we have observed that lower levels result in unacceptably weak power in all methods (not shown). We use half of the 500 sets of simulated case-control samples to find the proper p-value threshold corresponding to the 10% false discovery rate for each method. In detail, chi-square values are obtained for the blocks of each method. From the result, we can estimate the distribution of the chi-square statistic for each method. To obtain the desired p-value threshold, we find the p-value corresponding to the first decile of chi-square values of blocks which do not include the trait locus. Once the p-value threshold is obtained for the respective algorithm, we perform the association test on the remaining 250 case-control sets and assess the statistical power. For a better comparison, we also examine the single site association test, in addition to the block-based association test.
Performance on detecting recombination hotspots
Our third assessment is on the application of haplotype block partitioning algorithms for detecting recombination hotspots. Since there is no consensus on recombination hotspots within real haplotype data, we apply the msHOT software [38] to simulate haplotype samples. This software is an extension of Hudson's algorithm [39] and generates samples under the coalescent model with recombination. We generate 1000 sample sets, each including 40 haplotypes of 300 SNPs. Other samples with 100 haplotypes in each set are also generated. The simulation conditions include hotspot region lengths of at most 2 kb, a maximum of six hotspots per region of 300 kb, and a recombination rate within hotspot regions that is 50 to 400 times higher than the background rate. The frequency of hotspots in the simulations was based on available knowledge of such features in the human genome [4]. The positions of the hotspots in the different simulations are recorded. All the block partitioning algorithms being considered are then applied to these sample sets.
To assess the performance and accuracy of haplotype block partitioning methods on detecting recombination hotspots, we counted the times that haplotype block boundaries and hotspot regions coincided with each other. Block boundaries that occur outside hotspot regions are regarded as false positives, while hotspot regions not positioned at block boundaries are regarded as false negatives. In the latter case, we consider 2 kb flanking intervals around block boundaries as a standard extent of the hotspot region. We refer to the sum of the false positive and the false negative rates as total error rate in hotspot detection. We define the ratio of hotspot regions coinciding with block boundaries to the number of all hotspot regions as the power.
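The counting described above can be sketched as follows. The sketch assumes that each block boundary is represented by a single base-pair position and that a boundary coincides with a hotspot when it lies within 2 kb of the hotspot region. It returns raw counts and the power only, because the denominators used to turn the counts into rates are not stated in the text; the function name and return layout are hypothetical.

```python
def score_hotspot_detection(boundaries, hotspots, flank=2000):
    """Count false positives, false negatives, and power.

    boundaries: block boundary positions in bp.
    hotspots: list of (start, end) hotspot regions in bp.
    """

    def hits(b, h):
        # A boundary hits a hotspot if it lies within `flank` bp of the region.
        return h[0] - flank <= b <= h[1] + flank

    # Boundaries that match no hotspot region.
    false_pos = sum(not any(hits(b, h) for h in hotspots) for b in boundaries)
    # Hotspot regions matched by at least one boundary.
    detected = sum(any(hits(b, h) for b in boundaries) for h in hotspots)
    false_neg = len(hotspots) - detected
    power = detected / len(hotspots) if hotspots else float("nan")
    return false_pos, false_neg, power


result = score_hotspot_detection(
    [10000, 50000, 120000], [(11000, 12000), (200000, 201000)]
)
```

In the example, one of three boundaries falls near the first hotspot, the second hotspot is missed, and the power is therefore 0.5.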
Results and discussion
Blocks of HapMap ENCODE regions
Features of different haplotype block partitioning methods on haplotype samples in ENCODE regions.
Feature | HOT | MB | HB | MDL | GAM | GAB | GPG | GPF |
---|---|---|---|---|---|---|---|---|
Average block length (kbp) | 68.9 | 46.8 | 36.2 | 17.1 | 13.3 | 23.3 | 35.7 | 39.7 |
Average block length (SNP) | 52 | 36 | 27 | 13 | 10 | 18 | 27 | 30 |
Total run time (sec.) | N/A | 22 | 743 | 3295 | 112 | 143 | 5 | 5 |
Common haplotype coverage | 0.67 | 0.89 | 0.91 | 0.96 | 0.96 | 0.93 | 0.88 | 0.87 |
Hole freq. | 0.50 | 0.23 | 0.14 | 0.06 | 0.04 | 0.04 | 0.15 | 0.18 |
Island freq. | 0.08 | 0.10 | 0.09 | 0.17 | 0.19 | 0.11 | 0.06 | 0.07 |
Robustness | N/A | 92.0 | 69.2 | 99.4 | 99.7 | 100 | 100 | 97.6 |
The consistency between block partitioning and LD pattern on ENCODE regions can be shown by frequencies of "hole" and "island". In general, it is expected that the partitioning with wider blocks may include more holes in blocks and smaller blocks may miss more islands. Considering these measures, methods GPG, GPF, and HB seem to be more reliable when most information of the LD pattern is to be maintained. In contrast, GAB, GAM, and MDL produce firmer blocks.
Block similarity among several methods of haplotype block partitioning.
Method | HOT | MB | HB | MDL | GAM | GAB | GPG | GPF |
---|---|---|---|---|---|---|---|---|
HOT | 1.00 | 0.32 | 0.31 | 0.16 | 0.13 | 0.26 | 0.39 | 0.40 |
MB | 0.32 | 1.00 | 0.65 | 0.36 | 0.30 | 0.58 | 0.60 | 0.56 |
HB | 0.31 | 0.65 | 1.00 | 0.41 | 0.35 | 0.60 | 0.56 | 0.54 |
MDL | 0.16 | 0.36 | 0.41 | 1.00 | 0.54 | 0.42 | 0.30 | 0.29 |
GAM | 0.13 | 0.30 | 0.35 | 0.54 | 1.00 | 0.41 | 0.25 | 0.24 |
GAB | 0.26 | 0.58 | 0.60 | 0.42 | 0.41 | 1.00 | 0.57 | 0.55 |
GPG | 0.39 | 0.60 | 0.56 | 0.30 | 0.25 | 0.57 | 1.00 | 0.88 |
GPF | 0.40 | 0.56 | 0.54 | 0.29 | 0.24 | 0.55 | 0.88 | 1.00 |
Number of htSNPs in each ENCODE region.
region name | chromosome band | HOT | MB | HB | MDL | GAM | GAB | GPG | GPF |
---|---|---|---|---|---|---|---|---|---|
ENr112 | 2p16.3 | 37 | 35 | 33 | 51 | 96 | 77 | 32 | 33 |
ENr131 | 2q37.1 | 32 | 48 | 40 | 60 | 101 | 79 | 47 | 42 |
ENr113 | 4q26 | 34 | 37 | 30 | 46 | 67 | 47 | 31 | 31 |
ENm010 | 7p15.2 | 36 | 37 | 34 | 49 | 87 | 78 | 37 | 36 |
ENm013 | 7q21.13 | 17 | 16 | 15 | 34 | 69 | 29 | 23 | 25 |
ENm014 | 7q31.33 | 38 | 27 | 25 | 48 | 67 | 47 | 27 | 27 |
ENr321 | 8q24.11 | 27 | 35 | 26 | 49 | 63 | 49 | 31 | 30 |
ENr232 | 9q34.11 | 51 | 47 | 42 | 63 | 70 | 75 | 49 | 52 |
ENr123 | 12q12 | 14 | 33 | 29 | 48 | 79 | 59 | 37 | 38 |
ENr213 | 18q12.1 | 31 | 36 | 30 | 52 | 69 | 46 | 28 | 33 |
Effect of different block structures on performance of disease association study
Type I error in the disease association study.
Disease model parameters | SS* | MB | HB | MDL | GAB | GPG | GPF |
---|---|---|---|---|---|---|---|
DAF = 5% – 15%, GRR1 = 3 | 0.26 | 0.20 | 0.17 | 0.16 | 0.17 | 0.13 | 0.13 |
DAF = 5% – 15%, GRR1 = 5 | 0.32 | 0.24 | 0.21 | 0.20 | 0.22 | 0.16 | 0.17 |
DAF = 20% – 30%, GRR1 = 3 | 0.29 | 0.21 | 0.18 | 0.17 | 0.19 | 0.14 | 0.15 |
DAF = 20% – 30%, GRR1 = 5 | 0.34 | 0.25 | 0.22 | 0.22 | 0.23 | 0.17 | 0.19 |
When marker positions are selected from SNPs ordered by "informativeness", the power of the methods decreases less as marker density falls. With prioritized selection, the power remains high even when only one fifth of the original SNPs are used as markers (Figure 6). MDL is relatively efficient when markers are selected uniformly. When markers are prioritized, as one might expect, HB performs better. Nevertheless, the two versions of our model, GPG and GPF, are more efficient even when marker density is low.
Performance on recombination hotspot detection
Total error rate on detection of recombination hotspots*.
Sample size | Measure | MB | HB | MDL | GAM | GAB | GPG | GPF |
---|---|---|---|---|---|---|---|---|
N = 40 | false positive rate | 2.6 | 15.4 | 3.8 | 5.3 | 3.0 | 1.9 | 1.6 |
N = 40 | false negative rate | 2.5 | 0.9 | 1.2 | 0.7 | 3.1 | 0.9 | 1.0 |
N = 40 | total error rate | 5.1 | 16.3 | 5.0 | 6.0 | 6.1 | 2.8 | 2.6 |
N = 100 | false positive rate | 3.8 | 12.2 | 2.6 | 4.9 | 3.4 | 1.1 | 1.0 |
N = 100 | false negative rate | 2.2 | 1.1 | 0.9 | 0.9 | 2.2 | 0.8 | 1.1 |
N = 100 | total error rate | 6.0 | 13.3 | 3.5 | 5.8 | 5.6 | 1.9 | 2.0 |
Conclusion
Here, we present a method for global haplotype partitioning based on pairwise analysis of SNPs. In this approach, haplotype blocks are defined such that the number of associated pairs within blocks is maximal and blocks include only a small number of independent SNP pairs. The normalized coefficient of linkage disequilibrium, D', is used as a scan statistic to identify independent SNP pairs, while Fisher's exact test and its corresponding p-value determine the significance of dependency between SNP pairs. Furthermore, Gabriel's index is applied in the determination of association classes.
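The pairwise classification summarized above can be illustrated with a minimal Python sketch that computes |D'| and a two-sided Fisher's exact p-value from the four haplotype counts of a SNP pair. The function names, the significance level `alpha`, the cutoff `d_independent`, and the decision rule in `classify_pair` are assumptions chosen for this example; they are not necessarily the thresholds or the exact rule used in this study.

```python
from math import comb


def d_prime(n11, n12, n21, n22):
    """Return |D'| for a SNP pair given its four haplotype counts."""
    n = n11 + n12 + n21 + n22
    p_a = (n11 + n12) / n          # frequency of the first allele at SNP 1
    p_b = (n11 + n21) / n          # frequency of the first allele at SNP 2
    d = n11 / n - p_a * p_b        # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact p-value for a 2 x 2 table of counts."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(n11)
    # Sum all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def classify_pair(n11, n12, n21, n22, alpha=0.05, d_independent=0.2):
    """Hypothetical three-way rule; alpha and d_independent are assumed."""
    if fisher_exact_two_sided(n11, n12, n21, n22) < alpha:
        return "associated"
    if d_prime(n11, n12, n21, n22) < d_independent:
        return "independent"
    return "not significant"
```

For instance, `classify_pair(50, 0, 0, 50)` labels a pair in complete LD as associated, whereas a pair with rare alleles such as `classify_pair(18, 1, 1, 0)` has a high |D'| but too little evidence for the test to be significant.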
Since the early observation of haplotype block structure in human genomes, several groups have developed block models assuming constraints on haplotype diversity. However, it has been suggested that such assumptions should be used carefully in applications [41, 42]. Our results from ENCODE data show that methods based on pairwise analysis of SNPs, without initial assumptions on haplotype diversity, find blocks in which haplotype diversity is consistent with the standard thresholds used in classical methods. We assessed the similarity of haplotype block structures by counting SNP pairs in the overlapping regions of blocks from different partitionings. We did not find any general concordance among the block boundaries of different methods, a conclusion also reached by a previous study [24]. Nevertheless, each method produces blocks that are at least 50% similar to the blocks of at least one other method.
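One way to turn the pair-counting idea into a number between 0 and 1 is sketched below in Python. Blocks are assumed to be given as inclusive `(start, end)` SNP index ranges, and the similarity is taken as the number of SNP pairs that fall within a block in both partitionings divided by the number that fall within a block in either one. This Jaccard-style normalization and the function names are assumptions for illustration; the exact normalization used for the similarity table may differ.

```python
def n_pairs(m):
    """Number of unordered SNP pairs among m SNPs."""
    return m * (m - 1) // 2


def within_block_pairs(blocks):
    """Total SNP pairs lying inside the blocks of one partitioning."""
    return sum(n_pairs(end - start + 1) for start, end in blocks)


def shared_pairs(blocks_a, blocks_b):
    """SNP pairs lying inside a block in both partitionings."""
    total = 0
    for a_start, a_end in blocks_a:
        for b_start, b_end in blocks_b:
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap > 1:
                total += n_pairs(overlap)
    return total


def block_similarity(blocks_a, blocks_b):
    """Assumed Jaccard-style similarity of two block partitionings."""
    shared = shared_pairs(blocks_a, blocks_b)
    union = (within_block_pairs(blocks_a)
             + within_block_pairs(blocks_b) - shared)
    # Two partitionings made only of singleton blocks share no pairs at all;
    # they are treated here as identical.
    return shared / union if union else 1.0
```

Under this definition the measure is symmetric and equals 1 for identical partitionings, which matches the shape of the similarity table, and merging two adjacent five-SNP blocks into one ten-SNP block gives a similarity of 20/45 with the unmerged partitioning.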
The consistency of block boundaries within each single method was also investigated by permutation resampling. To do so, we recorded the number of times a given method reproduced the same boundaries when applied to simulated recombinant samples. Gabriel's rule for determining the association index of SNP pairs proved highly robust. Our algorithm was also relatively robust.
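The counting step can be sketched as follows in Python. The sketch assumes that each run of a partitioning method is summarized by its set of boundary positions and that a resample counts as a success only when the whole set matches the reference; the function name and this all-or-nothing criterion are illustrative assumptions, and a per-boundary criterion would be an equally plausible reading.

```python
def robustness(reference_boundaries, resampled_boundaries):
    """Percentage of resamples that reproduce the reference boundary set.

    reference_boundaries: iterable of boundary positions on the original data.
    resampled_boundaries: list of iterables, one per resampled data set.
    """
    if not resampled_boundaries:
        raise ValueError("at least one resample is required")
    reference = frozenset(reference_boundaries)
    hits = sum(1 for b in resampled_boundaries if frozenset(b) == reference)
    return 100.0 * hits / len(resampled_boundaries)
```

For example, if three of four resamples return exactly the reference boundaries, the function reports 75.0.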
In our method, the number of htSNPs is not subjected to minimization. However, the number and also the coverage of htSNPs within the resulting blocks compare well with the optimal values obtained by diversity-based approaches.
In a case-control study, a block-based approach for mapping a single locus trait was applied to the blocks produced by the various methods. The results show that any block-based association test is considerably more efficient than the conventional single-site association test. In particular, our newly developed block partitioning method achieved the best accuracy in the case-control study, even at low marker density.
Biological considerations suggest that block boundaries produced by block partitioning methods should exhibit some concordance with recombination hotspots. In this regard, we assessed the performance of the methods on simulated data. Global block partitioning methods performed best in terms of both accuracy and power. In fact, our method may be considered an efficient and simple tool for gaining insight into recombination hotspots.
In conclusion, our assessments show that our proposed global partitioning method, the method of minimum description length, and Gabriel's method are all promising for case-control association studies and for detection of recombination hotspots. Furthermore, we have shown that allelic association of SNP pairs can partially describe aspects of genomic variation in human populations.
Declarations
Acknowledgements
AK, MS and HP would like to thank the Department of Research Affairs of University of Tehran. The authors wish to thank Changiz Eslahchi (SBU, Tehran) and Mehrdad Pedram (TUMS, Tehran), respectively, for comments on the method and for editing the text. AK wishes to thank Sayed Amir Marashi (IMPRS-CBSC, Berlin) for his help in preparing this paper.
This research was partially supported by grants from IPM (No. CS 1385-1-02).
References
- Patil N, Berno A, Hinds D, Barrett W, Doshi J, Hacker C, Kautzer C, Lee D, Marjoribanks C, McDonough D, Nguyen B, Norris M, Sheehan J, Shen N, Stern D, Stokowski R, Thomas D, Trulson M, Vyas K, Frazer K, Fodor S, Cox D: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001, 294(5547):1719–1723. 10.1126/science.1065573
- Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nature Genetics 2001, 29(2):229–32. 10.1038/ng1001-229
- McVean G: Linkage disequilibrium, recombination and selection. In The Handbook of Statistical Genetics. 3rd edition. Edited by: Balding DJ, Bishop M, Cannings C. Chichester, UK: Wiley; 2007:909–940.
- Myers S, Bottolo L, Freeman C, McVean G, Donnelly P: Genetics: A fine-scale map of recombination rates and hotspots across the human genome. Science 2005, 310(5746):321–324. 10.1126/science.1117196
- Greenspan G, Geiger D: High density linkage disequilibrium mapping using models of haplotype block variation. Bioinformatics 2004, 20(SUPPL 1):
- Fearnhead P: SequenceLDhot: Detecting recombination hotspots. Bioinformatics 2006, 22(24):3061–3066. 10.1093/bioinformatics/btl540
- Ding Z, Mailund T, Song YS: Efficient whole-genome association mapping using local phylogenies for unphased genotype data. Bioinformatics 2008, 24(19):2215–2221. 10.1093/bioinformatics/btn406
- Smith A, Thomas D, Munro H, Abecasis G: Sequence features in regions of weak and strong linkage disequilibrium. Genome Research 2005, 15(11):1519–1534. 10.1101/gr.4421405
- Hinds D, Stuve L, Nilsen G, Halperin E, Eskin E, Ballinger D, Frazer K, Cox D: Whole-genome patterns of common DNA variation in three human populations. Science 2005, 307(5712):1072–1079. 10.1126/science.1105436
- Gu S, Pakstis AJ, Li H, Speed WC, Kidd JR, Kidd KK: Significant variation in haplotype block structure but conservation in tagSNP patterns among global populations. Eur J Hum Genet 2007, 15(3):302–312. 10.1038/sj.ejhg.5201751
- Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005, 21(2):263–265. 10.1093/bioinformatics/bth457
- Zhao Y, Xu Y, Wang Z, Zhang H, Chen G: A better block partition and ligation strategy for individual haplotyping. Bioinformatics 2008, btn519.
- Qin Z, Niu T, Liu J: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics 2002, 71(5):1242–1247. 10.1086/344207
- Browning BL, Browning SR: Efficient multilocus association testing for whole genome association studies using localized haplotype clustering. Genetic Epidemiology 2007, 31(5):365. 10.1002/gepi.20216
- Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP: Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes. The American Journal of Human Genetics 2004, 75: 35–43. 10.1086/422174
- Bardel C, Darlu P, Genin E: Clustering of haplotypes based on phylogeny: how good a strategy for association testing? Eur J Hum Genet 2005, 14(2):202–206. 10.1038/sj.ejhg.5201501
- Huang B, Amos C, Lin D: Detecting haplotype effects in genomewide association studies. Genetic Epidemiology 2007, 31(8):803–812. 10.1002/gepi.20242
- Li C, Li M, Lange E, Watanabe R: Prioritized subset analysis: Improving power in genome-wide association studies. Human Heredity 2008, 65(3):129–141. 10.1159/000109730
- He J, Zelikovsky A: Informative SNP selection methods based on SNP prediction. IEEE Transactions on Nanobioscience 2007, 6: 60–67. 10.1109/TNB.2007.891901
- Zhou N, Wang L: Effective selection of informative SNPs and classification on the HapMap genotype data. BMC Bioinformatics 2007, 8: 484. 10.1186/1471-2105-8-484
- Carlson C, Eberle M, Rieder M, Yi Q, Kruglyak L, Nickerson D: Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium. American Journal of Human Genetics 2004, 74: 106–120. 10.1086/381000
- Ding K, Zhou K, Zhang J, Knight J, Zhang X, Shen Y: The effect of haplotype-block definitions on inference of haplotype-block structure and htSNPs selection. Molecular Biology and Evolution 2005, 22: 148–159. 10.1093/molbev/msh266
- Zhang K, Qin Z, Chen T, Liu J, Waterman M, Sun F: HapBlock: Haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 2005, 21: 131–134. 10.1093/bioinformatics/bth482
- Indap A, Marth G, Struble C, Tonellato P, Olivier M: Analysis of concordance of different haplotype block partitioning algorithms. BMC Bioinformatics 2005, 6: 303. 10.1186/1471-2105-6-303
- Lewontin RC, Kojima K: The Evolutionary Dynamics of Complex Polymorphisms. Evolution 1960, 14(4):458–472. 10.2307/2405995
- Gabriel S, Schaffner S, Nguyen H, Moore J, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero S, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander E, Daly M, Altshuler D: The structure of haplotype blocks in the human genome. Science 2002, 296(5576):2225–2229. 10.1126/science.1069424
- Wall J, Pritchard J: Assessing the performance of the haplotype block model of linkage disequilibrium. American Journal of Human Genetics 2003, 73(3):502–515. 10.1086/378099
- Yates F: Test of Significance for 2 × 2 Contingency Tables. Journal of the Royal Statistical Society 1984, 147(3):426–463.
- Zapata C, Alvarez G: On Fisher's Exact Test for Detecting Gametic Disequilibrium Between DNA Polymorphisms. Annals of Human Genetics 1997, 61(01):69–75. 10.1017/S0003480096005969
- Lydersen S, Laake P: Power comparison of two-sided exact tests for association in 2 × 2 contingency tables using standard, mid p, and randomized test versions. Statistics in Medicine 2003, 22(24):3859–3871. 10.1002/sim.1671
- Hasselblad V, Lokhnygina Y: Tests for 2 × 2 tables in clinical trials. Journal of Modern Applied Statistical Methods 2007, 6(2):456–468.
- Hwang J, Yang M: An optimality theory for mid p-values in 2 × 2 contingency tables. Statistica Sinica 2001, 11(3):807–826.
- Pattaro C, Ruczinski I, Fallin D, Parmigiani G: Haplotype block partitioning as a tool for dimensionality reduction in SNP association studies. BMC Genomics 2008, 9: 405. 10.1186/1471-2164-9-405
- The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851–861. 10.1038/nature06258
- Feingold E, Good P, Guyer M, Kamholz S, Liefer L, Wetterstrand K, Collins F, Gingeras T, Kampa D, Sekinger E, Cheng J, Hirsch H, Ghosh S, Zhu Z, Patel S, Piccolboni A, Yang A, Tammana H, Bekiranov S, Kapranov P, Harrison R, Church G, Struhl K, Ren B, Kim T, Barrera L, Qu C, Calcar SV, Luna R, Glass C, Rosenfeld M, Guigo R, Antonarakis S, Birney E, Brent M, Pachter L, Reymond A, Dermitzakis E, Dewey C, Keefe D, Denoeud F, Lagarde J, Ashurst J, Hubbard T, Wesselink J, Castelo R, Eyras E, Myers R, Sidow A, Batzoglou S, Trinklein N, Hartman S, Aldred S, Anton E, Schroeder D, Marticke S, Nguyen L, Schmutz J, Grimwood J, Dickson M, Cooper G, Stone E, Asimenos G, Brudno M, Dutta A, Kamani N, Taylor C, Kim H, Robins G, Stamatoyannopoulos G, Stamatoyannopoulos J, Dorschner M, Sabo P, Hawrylycz M, Humbert R, Wallace J, Yu M, Navas P, McArthur M, Noble W, Dunham I, Koch C, Andrews R, Celland G, Wilcox S, Fowler J, James K, Groth P, Dovey O, Ellis P, Wraight V, Mungall A, Dhami P, Fiegler H, Langford C, Carter N, Vetrie D, Snyder M, Euskirchen G: The ENCODE (ENCyclopedia of DNA Elements) Project. Science 2004, 306(5696):636–640. 10.1126/science.1105136
- Ding K, Zhang J, Zhou K, Shen Y, Zhang X: htSNPer1.0: software for haplotype block partition and htSNPs selection. BMC Bioinformatics 2005, 6: 38. 10.1186/1471-2105-6-38
- Li J, Chen Y: Generating samples for association studies based on HapMap data. BMC Bioinformatics 2008, 9: 44. 10.1186/1471-2105-9-44
- Hellenthal G, Stephens M: msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics (Oxford, England) 2007, 23(4):520–1. 10.1093/bioinformatics/btl622
- Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (Oxford, England) 2002, 18(2):337–8. 10.1093/bioinformatics/18.2.337
- Coulonges C, Delaneau O, Girard M, Do H, Adkins R, Spadoni JL, Zagury JF: Computation of haplotypes on SNPs subsets: advantage of the "global method". BMC genetics 2006, 7: 50. 10.1186/1471-2156-7-50
- Nothnagel M, Wollstein A, Krawczak M: Comparative assessment of the association information captured by SNP tagging. Human Heredity 2007, 64: 27–34. 10.1159/000101420
- Climer S, Jager G, Templeton AR, Zhang W: How Frugal is Mother Nature with Haplotypes? Bioinformatics 2008, btn572.
- Anderson EC, Novembre J: Finding Haplotype Block Boundaries by Using the Minimum-Description-Length Principle. The American Journal of Human Genetics 2003, 73(2):336–354. 10.1086/377106
- Wang N, Akey JM, Zhang K, Chakraborty R, Jin L: Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. American Journal of Human Genetics 2002, 71(5):1227–34. 10.1086/344398
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.