Analysis of concordance of different haplotype block partitioning algorithms

Background Different classes of haplotype block algorithms exist and the ideal dataset to assess their performance would be to comprehensively re-sequence a large genomic region in a large population. Such data sets are expensive to collect. Alternatively, we performed coalescent simulations to generate haplotypes with a high marker density and compared block partitioning results from diversity based, LD based, and information theoretic algorithms under different values of SNP density and allele frequency. Results We simulated 1000 haplotypes using the standard coalescent for three world populations – European, African American, and East Asian – and applied three classes of block partitioning algorithms – diversity based, LD based, and information theoretic. We assessed algorithm differences in number, size, and coverage of blocks inferred under different conditions of SNP density, allele frequency, and sample size. Each algorithm inferred blocks differing in number, size, and coverage under different density and allele frequency conditions. Different partitions had few if any matching block boundaries. However they still overlapped and a high percentage of total chromosomal region was common to all methods. This percentage was generally higher with a higher density of SNPs and when rarer markers were included. Conclusion A gold standard definition of a haplotype block is difficult to achieve, but collecting haplotypes covered with a high density of SNPs, partitioning them with a variety of block algorithms, and identifying regions common to all methods may be the best way to identify genomic regions that harbor SNP variants that cause disease.


Background
Single Nucleotide Polymorphisms (SNPs) are single base pair differences between individuals in a population. The recent completion of the Human Genome Project has helped facilitate the discovery of millions of SNPs and their use in genetic association studies for human disease [1]. Association studies work on the premise that SNP genotypes are correlated with a disease phenotype. Indi-vidual SNPs are genotyped and the frequency of alleles are compared between groups of affected and un-affected individuals. SNPs that are tested for association either must be the causative allele or be in linkage disequilibrium (LD) with the causative allele. LD is the non-random association of alleles between adjacent loci [2]. SNPs that are in LD with causative allele serve as a proxy and the association with the disease phenotype is maintained.
Numerous studies have shown that the human genome contains regions of high LD with low haplotype diversity [3][4][5][6]. These regions are called haplotype blocks. The existence of haplotype blocks reduces the number of SNPs required in association studies by identifying and typing only the subset of tag SNPs which uniquely identify common haplotypes present in a block. The frequencies of these haplotypes can be compared in groups of affected and unaffected individuals [7].
Haplotype blocks are defined computationally by various algorithms and can be classified into three categories: diversity based [3,8], LD-based [6], and information-theoretic [9]. Patil et al. [3] used a diversity based greedy algorithm to partition Chromosome 21 into haplotype blocks in a sample of 20 re-sequenced chromosomes. Their algorithm considers all blocks of consecutive SNPs of one SNP or larger, and defines a haplotype block boundary where at least 80% of observed haplotypes within a block are represented at least one or more times in their sample of chromosomes. Overlapping block boundaries were eliminated by choosing the block with the maximum ratio of SNPs in the block to the number of SNPs required to discriminate all haplotypes represented in the block. The process was repeated until the entire length of the chromosome was partitioned into haplotype blocks. Zhang et al. [8] subsequently provided a dynamic programming implementation for this approach in their software Hap-Block [10].
Gabriel et al. [6] used a LD-based algorithm to define haplotype blocks in a worldwide sample of chromosomes from Africa, Asia, and Europe. The authors computed confidence bounds of the value of D', a standard measurement of LD [11], and defined pairs of SNPs to be in strong LD (little evidence of recombination) if the one-sided 95% D' confidence bound is between 0.7 and 0.98. The authors defined a haplotype block if least 95% of pairwise SNP comparisons in a region show little evidence of recombination based upon their D' confidence bounds. The program Haploview [12] implements this method of Gabriel et al. Anderson and Novembre [9] use the Minimum Description Length (MDL) principle for defining haplotype blocks which incorporates LD decay between blocks and haplotype diversity within blocks [9]. The MDL principle is an application of information theory to statistical modeling which searches for patterns in data [13]. The description length of a data set is a function of the length with which data can be encoded in binary digits, or bits [9]. The best set of block boundaries defined by Anderson and Novembre's method is the set of block boundaries that has the shortest description length for a set of SNP genotypes that span a genomic region. The authors use a dynamic programming algorithm they call the iterative dynamic programming algorithm (IDP) and a faster, but approximate, dynamic programming algorithm called iterative approximate dynamic programming algorithm (IADP) to find the minimum description length for a set of haplotypes. Their method is implemented in the program MDBlocks [9].
Previous studies on the empirical performance of block partitioning methods have focused on data sets with differing minor allele frequency cutoffs. The studies of Daly et al. [5], Patil et al. [3], and Gabriel et al. [6] used minor allele frequency cutoffs of 5%, 10%, and 20%, respectfully. Schulze et al. [14] assessed the effects of varying the minor allele frequency cutoff on the number of blocks and tag SNPs inferred by the LD based method of Gabriel et al. [6] and diversity based method of Zhang et al. [8]. As rarer SNPs were removed and the allele frequency cutoff raised, the number of blocks inferred decreased for both methods, showing that the block structure is highly influenced by the allele frequency of SNPs used in their analysis.
Ke et al. [15] studied the impact of SNP density on block boundaries from three different partitioning algorithms: the previously discussed LD approach of Gabriel et al., the four-gamete test [16], and a D' threshold approach of Phillips et al [17]. The author's study genotyped over 5000 SNPs in a 10 Mb region of chromosome 20 in four different populations: CEPH families, U.K. Caucasians, African Americans, and East Asians. Block boundaries of the algorithms were assessed with differing marker densities starting at 2 kb and going to 10 kb. Their results show that longer blocks at sparser densities are broken into smaller blocks as more SNPs are added in. Other studies describing the LD block structure of the human genome also used varying marker densities. The study by Phillips et al. [17] on chromosome 19 used an average marker density of one SNP per 17.65 kb with a median value of 5.5 kb. Gabriel et al. [6] used an average density of one SNP every 2 kb. Daly et al. [5] used a density of one marker approximately every 5 kb. Patil et al. [3] used a higher density of SNPs with one SNP every 1.3 kb. This study was also the only one that completely re-sequenced the entire chromosome for all 20 samples.
The ideal data set to fully assess the performance of block partitioning algorithms would be a comprehensively resequenced large genomic region in a large number of independent chromosomes. Unfortunately, such data are not available at this time. Only a limited number of samples have been re-sequenced extensively. In addition to the study by Patil et al., as of June 2005 the SeattleSNPs [18] data set has re-sequenced 234 human genes in 24 African-American and 23 European CEPH samples spanning a total of 4868 kb of sequence. The ENCODE project [19] intends to re-sequence five 500 kb genomic regions in the 48 individuals of the HapMap Consortium data set [20]. Therefore, to fully assess the performance of block partitioning algorithms we generated three populations consisting of 1000 haplotypes using the coalescent, a stochastic technique that simulates the genetic history of a sample of chromosomes [11]. Haplotypes representing a 200 kb chromosomal region for three world populations -European, African American, and East Asian -were simulated using an implementation of the coalescent that uses a population-specific demographic history. The population specific profiles we used were previously published in Marth et al. [21], where the authors derive a closed mathematical formula for computing the allele frequency spectrum for a specified demographic profile. The demographic profiles for each of the populations were derived by computing allele frequency spectra predicted by Marth's equation for numerous demographic scenarios and testing the fit between it and the observed spectra from the SNP Consortium data set [1] for each respective population.
In the study presented here, we partitioned our coalescent-derived haplotypes into blocks using the three algorithms described above (diversity based, LD based, information theoretic). We assessed algorithm differences in number, size, and coverage of blocks under different values of marker density, allele frequency, and sample size on the performance of block partitioning algorithms. Our results show a great divergence in haplotype blocks predicted by each method, and supports the notion that it may be advisable to use multiple algorithms in parallel to comprehensively account for all haplotype blocks in the human genome.

Data simulation and block partitioning
One thousand haplotypes representing a 200 kb region were generated via the standard coalescent with population specific demographic profiles for three world populations: European, African American, and East Asian. All datasets were analyzed with the three block partitioning algorithms described in the Methods section.
In addition to the complete dataset of 1000 haplotypes, 1000 bootstrap sub-sample replicates of 24 or 96 haplotypes were sampled and filtered for different SNP density (all markers, one marker approximately every 1 kb, one marker approximately every 5 kb) and minor allele frequency (MAF) cutoff values (0.1%, 5%, and 10%). Each bootstrap replicate was partitioned using three methods (HapBlock, Gabriel's method, and MDBlocks). Computer memory constraints prevented MDBlocks from partitioning all 1000 chromosomes using all SNPs for each coalescent-derived population. For the same reason we were only able to analyze 200 bootstrap subsamples of 24 or 96 chromosomes with MDBlocks. More details on coalescent simulations and bootstrap sampling is given in the Methods section of the paper.

European population partitions using all chromosomes
All 1000 European chromosomes were analyzed with HapBlock and Gabriel's method. There were 1349 polymorphic sites with an average SNP density of one SNP per European population partitions  Figure 1 displays the resulting block partitions using all SNPs from the two methods, with the HapBlock partition denoted as HB and Gabriel's method denoted as GA. Table 2 displays descriptive statistics for the HapBlock and Gabriel's method population partitions. No matching block boundaries existed between HapBlock and Gabriel's method. HapBlock inferred a larger number of blocks of smaller physical length than Gabriel's method, but 74% of the sequence was common to blocks inferred by both methods. Both algorithms gave similiar values of coverage, which is defined as sum of the physical haplotype block lengths in base pairs divided by total length of region [22], with values of 85.6% for HapBlock and 86.1% for Gabriel's method, respectively.
When analyzing all chromosomes using only SNPs with a MAF of 10% or greater, the total number of markers was reduced to 367 with an average of one SNP every 540 bp. Table 2 also shows descriptive statistics using only SNPs with a MAF of 10% or higher. The number of inferred blocks for HapBlock dropped dramatically from 180 to 48. For Gabriel's method the change was not as large, with 33 blocks inferred. MDBlocks inferred 18 blocks which had the largest physical size. HapBlock, Gabriel's method, and MDBlocks covered 80.7%, 79.8%, and 84.4% of the 200 kb region in blocks. HapBlock again inferred a greater number of blocks of smaller size when compared to the other two methods. Of each possible pair of partitions, only Gabriel's method and MDBlocks contained one set of matching boundaries. Still, a large fraction of sequence, 57%, was common to all three partitions. Table 1 shows percentage of total sequence common to all population block partitions with this population and condition, as well as other populations examined in this study. Figure 1 shows the population partitions for all three methods using only SNPs with at least a MAF 10%, and block regions common for all three algorithms. Next, we compared the population partitions of HapBlock and Gabriel's method using all markers vs. all markers with a MAF of at least 10%. For HapBlock, 46% of the blocks inferred with SNPs with the higher MAF were broken up with the addition of rarer markers however, 70.8% of the chromosome is common to both partitions. For Gabriel's method 73% of the sequence is common to partitions resulting from the two differing allele frequency conditions. Only 3% of Gabriel's method blocks were broken into smaller markers with the additon of rarer SNPs.

African American population partitions using all chromosomes
A total of of 1653 polymorphic sites with an average density of one SNP every 119 bp defined the 1000 haplotypes in our sample. Using only SNPs with a frequency of at least 10% resulted in a total of 382 markers with an average spacing of one SNP every 521 bp.  Table 1). HapBlock and Gabriel's method shared two matching boundaries, and HapBlock and MDBlocks shared one matching boundary. Figure 2 displays all three block partitions and shared block regions between each partition. Comparing the HapBlock and Gabriel partitions with the full marker set to the corresponding partition of the same method with rarer SNPs filtered out shows that there were common regions identified in both. For HapBlock 67% of the 200 kb region was common to blocks for both conditions. For African American population partitions  Gabriel's method 74% of the sequence is included in both partitions.

East Asian population partitions using all chromosomes
A total of 1649 SNPs with an average spacing of one SNP every 120 bp defined the 1000 Asian haplotypes.  Figure 3 account for 46% of the chromosomal region.

Population partitions at other conditions
Descriptive statistics for population partitions of each method at other density and MAF conditions are shown in additional file 7.

≥10%
To assess variation in block structure on more realistic sample sizes (i.e. sample sizes that are being obtained by re-sequencing) we bootstrap subsampled 96 or 24 chromosomes from our original set 1000 times. Figure 4 shows the block partitions resulting from HapBlock for the first 50 individual bootstrap subsamples of size 96 using all SNPs with a MAF of at least 10%. It was clearly evident that the block structure varied between the bootstrap subsamples and the population partition. To find SNPs that were consistently inferred together in blocks above a threshold frequency across all bootstrap subsamples we defined consensus block partitions for HapBlock for threshold values from 100 to 50 percent. (For more details on consensus blocks see Methods.) As the threshold for defining a consensus block is lowered, the physical length of a block increases monotonically and blocks defined at higher thresholds are combined. Table 3 shows the percentage of chromosomal region common to both the population partition and consensus block partitions of HapBlock using only SNPs with a MAF of at least 10%.
Gabriel's method and MDBlocks partitions also showed within population variation in block structure. (See additional files 1 and 2.) Table 3 also contains the percentage of total sequence common between the population partitions of Gabriel's method and MDBlocks, and each consensus block definition. Similar to the HapBlock results as the threshold for defining a consensus block is lowered, the amount block regions common to both partitions increased. Of the three methods, MDBlocks consensus blocks had the greatest amount of total sequence in common with the population partition. Table 4 shows the percentage total sequence common to all three consensus block definitions at each threshold value. Figure 5 displays consensus blocks from each algorithm defined at a 80% threshold, and block regions common to all three East Asian population partitions   Figure 3).
Similar patterns for African American and East Asian bootstraps were found. Variation in block structure between bootstrap samples existed. The same pattern of an inverse relationship between the number of blocks and their average size as SNP density increased, remained. As the threshold for a consensus block is lowered, the percentage of sequence common between the population block partition increased monotonically. Also as the threshold is lowered, there is a greater percentage of total sequence common to all consensus blocks defined from each method. (Data not shown).

Discussion
We generated three populations of haplotypes via coalescent simulations to assess the performance of three block partitioning algorithms under different marker density and allele frequency conditions. Each of the block algorithms employed in this study partitions a genomic region into haplotype blocks using vastly different approaches. In addition to the three algorithms described here, there are other definitions for haplotype blocks not examined [16,17]. Despite all these algorithms, there is no widely accepted definition of how to best define haplotype blocks [23].
The descriptive statistics of each population block partition using all 1000 chromosomes clearly show that results are different in number, size, and coverage of inferred blocks, particularly with a higher density of markers. Hap-Block generally inferred the largest number of blocks of smallest size and MDBlocks inferred the fewest number of blocks of largest size. While there are few exact matching block boundaries between different partitions, there is a large amount of common block regions between them.
Increasing the density of markers had a more dramatic effect on the percent coverage for Gabriel's method than the other two methods due to fact that LD patterns are sensitive to marker density and can change with the addition of more markers [15]. The amount of coverage, in turn, influences the percentage of total sequence common to all partitions since there is a greater chance of overlap between them.
To assess within population variation in block structure we bootstrap subsampled haplotypes of sizes 24 or 96 chromosomes. The descriptive statistics of the bootstrap partitions indicate that the number of inferred blocks increases as a higher density of markers is used. Also, the average number of base pairs per blocks decreases with a higher density of markers, hence there is an inverse relationship between the number of blocks inferred and their physical size. Removing rarer SNPs does not necessarily decrease the number of blocks inferred for each of the methods when conditioning on a density value. This result is in contrast to the results of Schulze et al. [14], who found that removing rarer SNPs decreased the number of blocks inferred by the HapBlock and Gabriel's method. This maybe a result of the stochastic nature of the coalescent. To get a clearer picture of the effect of allele frequency on the performance of block partitioning algorithms, it may require the simulation of many more genealogies.
Our consensus block definitions attempt to identify SNPs consistently inferred together in blocks across all bootstrap replicates. The amount of common block regions between the population partition and consensus block definitions from bootstrap samples depends heavily on the threshold to define a consensus block, as well as the percent coverage of the bootstrap and population partitions. If a significant proportion of the chromosome is inferred in blocks in both the population and consensus definitions, there is a greater chance of finding common block regions. However as discussed earlier, this attribute is influenced by SNP density and allele frequency of markers.  Comparing the consensus blocks for one method to the population partition of the same method addresses the block structure variation within a particular algorithm. To find common block regions in bootstrap subsamples from differing algorithms, we found the overlapping boundaries between consensus block regions from each algorithm. For certain density, MAF conditions, and consensus block thresholds, there was very low or non-existent overlap. These numbers can be severely reduced if a particular method fails to infer a large number of blocks covering a significant portion of sequence, as was the case for Gabriel's method at the sparsest marker density of 5 kb. When using all SNPs with a 10% MAF for European haplotypes, the percentage overlap between all consensus blocks ranges from 9-57% depending on the consensus threshold. Finding block regions common to all three methods is an encouraging sign because each algorithm takes a different approach to the block partitioning problem. If the haplotype block paradigm is an accurate description of underlying LD patterns of the human genome, different algorithms should find common block regions since the three methods base their algorithms on various attributes of the paradigm.
Rather than searching for exact matching boundaries using Schwartz concordance test statistic as a measure of block concordance [24], we chose to compute the percentage of common block regions between two different block partitions as our metric of concordance. While using the block concordance test statistic is a valid approach, the method cannot assess the significance of block boundaries which may differ by few SNPs, but still have a significant degree of overlap between block regions. There was only one matching boundary between each possible of pair of partitions using all 1000 European haplotypes using all SNPs with a MAF of 10%. However, 57% of the 200 kb region was common to blocks defined from all three methods.
In our analysis we focused on the number, size, and coverage of haplotype blocks inferred by three different algorithms. We do not discuss tag SNPs identification because we view it as a separate problem. However, it should be pointed out that the dynamic programming approach of HapBlock is closely tied to tag SNPs because it defines blocks which minimize the number of SNPs needed to distinguish common haplotypes within a block. Recently, a method formulated by Halldorsson et al. [25] selects tag SNPs which does not require a haplotype block definition. Also, the tag SNP algorithm LDselect [26] chooses tag SNPs independent of chosen haplotype block boundaries.
Another point to address in our study design is that the simulated haplotypes used were derived from a single realization of a coalescent simulation, hence our study does not address genetic sampling [27]. Since we bootstrap subsampled 96 or 24 individuals from a population of 1000, we fix the genetic history of our data set and focus on the statistical sampling on the performance of block partitioning algorithms used in this study. We also chose not to vary recombination rate or incorporate recombination hotspots in our simulations since we only analyzed a 200 kb region. Due to these limitations, we did not compare populations to each other. Rather, we examined the trends seen in each population and used coalescent simulations with three different population histories to ensure that the results from the three block partitioning algorithms were not due to the coalescent parameters chosen. The recent study of Ding et al. [28] address the affects of population genetic parameters, such as the mutation and recombination rate, on the diversity and LD based algorithms discussed here for multiple realizations of coalescent genealogies.

Conclusion
In summary, our results show that for the population partitions using all 1000 chromosomes, there is a varied range of number, size, and coverage of blocks between the different methods. The percentage total sequence common to all three partitioning algorithms ranges from 3-61% depending on the population and is generally higher using a high density of SNPs with a wide range of MAF. Bootstrap sampling of haplotypes from the population shows there is within population variation in block structure for all three methods. Our consensus block definition attempts to define blocks based on sets of SNPs consistently found together in blocks across all bootstrap replicates. Using a higher density of markers there is an increased percentage of total sequence in common with consensus blocks and population partitions. The percentage of common block regions between consensus blocks defined from all three methods is influenced by the percent coverage of individual partitions, which itself is influenced by the density and allele frequency of markers that comprise the haplotypes to be partitioned.

Coalescent simulations
Haplotypes representing a 200 kb chromosomal region for three world populations, European, African American, and East Asian were generated using an implementation of the standard coalescent with uniform recombination, which uses a population-specific demographic history. The demographic profiles for each of the three populations considered in this study were as determined by Marth et al. from The SNP Consortium genotype data [21].

Haplotype block partitioning algorithms
Three categories of block partitioning algorithms were used in the study: diversity based, LD based, and information theoretic. The software programs that implement each method are described below. All three programs were run on a Sun Blade 1000 with dual 750 MHz Ultra Sparc III processors and 4.5 GB of RAM.

MDBlocks
MDBlocks vl.0 uses the Minimum Description Length (MDL) principle for defining blocks [9]. It considers the set of all possible block boundaries and finds the one with the minimum description length using two versions of a dynamic programming algorithm. The first is called the iterative dynamic programming algorithm (IDP) The second is a faster, but approximate method called the iterative approximate dynamic programming algorithm (IADP). Due to the number and size of haplotypes analyzed, we used the IADP option. MDBlocks ran out of computer memory when attempting to partition all 1000 haplotypes using all SNPs for each population when using the IADP algorithm. MDBlocks is available for download here: http:// ib.berkeley.edu/labs/slatkin/eriq/software/mdb_web/.

Bootstrap subsampling
To assess variation in the number and size of blocks inferred by the three partitioning algorithms used in the study under differing values of sample size, SNP density, and MAF cutoffs, 1000 bootstrap subsamples of sizes 24 or 96 were drawn with replacement from the population. A true bootstrap sample is one that is the same size as the original sample (1000). Since we are making smaller samples of 96 or 24 chromosomes, it is more properly called a bootstrap subsample. Initially each bootstrap subsample contained the full set of SNPs, and was progressively filtered for each possible pair of SNP density and MAF cutoff values.
Hence, the same set of individuals that make up a particular bootstrap subsample can be compared at differing density and MAF conditions. If a subsample contained a monomorphic site, it was removed prior to the initial filtering of density and allele frequency conditions. Since monomorphic sites do not contain any information, in an information theoretic sense, and information theory forms the basis for MDBlocks, the program would crash. Removing monomorphic SNPs in our bootstrapping routine solved this problem (Eric C. Anderson, personal communication).

Consensus block definition
To identify SNPs that are consistently inferred together in the same block across all bootstrap subsamples, we introduce the idea of a consensus block. Let the collection P = p 1 , ..., p 1000 be the collection of bootstrap partitions resulting from a particular method. Let S be the set of SNPs that comprise the haplotypes. For each SNP i and SNP i+1 , we calculate how often they are assigned to the same block across all bootstrap samples. We call this the neighbor probability. We define a consensus block as collection of consecutive SNPs whose neighbor probability is greater than or equal to some threshold percentage t, for t = 100 90 80 70 60 50. Consensus blocks were defined for each of the density and MAF conditions for bootstrap subsamples of sizes 24 and 96. As described in the previous section, if a bootstrap subsample initially contained a monomorphic site, it was removed. However, this leads to the situation that not all bootstrap replicates may contain the same SNPs. To calculate consensus blocks, then we take the union of all markers used across all bootstrap replicates and then proceed to calculate neighbor probability. If a particular SNP was not used in a particular bootstrap, and is a member of a the union set of SNPs, its block assignment was treated as missing data and imputed in the following way. If the adjacent markers to the left and to the right of the missing marker were assigned to the same block number, then the missing SNP in question was assigned to the same block.

Data storage
All data regarding coalescent-derived haplotypes (SNP positions, allele frequencies, etc), block partitions (number of blocks inferred, block boundaries, etc), and consensus block definitions were stored in tables in a MySQL v3.23 database.

Visualization of block partitions
Block partitions were visualized in the UCSC Genome Browser [29].

Block partition intersection
Finding the common regions between different block partitions was achieved by executing the appropriate MySQL query on tables holding information for block partition boundaries. For a subset of density and MAF conditions (all SNPs, all SNPs with a 10% MAF) correctness of the database query was verified by using sequence intersection feature of the UCSC Table Browser [30].