htSNPer1.0: software for haplotype block partition and htSNPs selection

Ding, Keyue; Zhang, Jing; Zhou, Kaixin; Shen, Yan; Zhang, Xuegong

doi:10.1186/1471-2105-6-38

Software
Open access
Published: 01 March 2005

htSNPer1.0: software for haplotype block partition and htSNPs selection

Keyue Ding^1,3,
Jing Zhang^2,3,
Kaixin Zhou¹,
Yan Shen^1,3 &
…
Xuegong Zhang²

BMC Bioinformatics volume 6, Article number: 38 (2005) Cite this article

7242 Accesses
30 Citations
Metrics details

Abstract

Background

There is recently great interest in haplotype block structure and haplotype tagging SNPs (htSNPs) in the human genome for its implication on htSNPs-based association mapping strategy for complex disease. Different definitions have been used to characterize the haplotype block structure in the human genome, and several different performance criteria and algorithms have been suggested on htSNPs selection.

Results

A heuristic algorithm, generalized branch-and-bound algorithm, is applied to the searching of minimal set of haplotype tagging SNPs (htSNPs) according to different htSNPs performance criteria. We develop a software htSNPer1.0 to implement the algorithm, and integrate three htSNPs performance criteria and four haplotype block definitions for haplotype block partitioning. It is a software with powerful Graphical User Interface (GUI), which can be used to characterize the haplotype block structure and select htSNPs in the candidate gene or interested genomic regions. It can find the global optimization with only a fraction of the computing time consumed by exhaustive searching algorithm.

Conclusion

htSNPer1.0 allows molecular geneticists to perform haplotype block analysis and htSNPs selection using different definitions and performance criteria. The software is a powerful tool for those focusing on association mapping based on strategy of haplotype block and htSNPs.

Background

Several recent genome-wide and experimental studies suggested that the genome consists of chromosome regions of strong inter-marker linkage disequilibrium (LD) (i.e., haplotype blocks) and has discrete boundaries defined by recombination hotspots [1–4]. There are a few common haplotypes of limited haplotype diversity within each haplotype block, which can be characterized by only a small number of haplotype tagging SNPs (htSNPs). Haplotype blocks and htSNPs have great implication for association-based mapping of disease genes, by significantly reducing the genotyping effort with only a modest loss of power [5]. A new genomic map (i.e., haplotype map) for characterizing the haplotype structure in human genome is now underway to speed up the searching for genes involved in complex diseases.

A range of operational definitions has been used to identify haplotype block structures [1, 2, 6, 7], which can be roughly cataloged into three groups [19]. First, there are methods based on diversity in the sequence, such as those of Patil et al. [1] and Zhang et al. [7], which define blocks with low sequence diversity by some diversity measure. The second group is based on LD methods, such as that of Gabriel et al. [2], which defines blocks with generally high pairwise LD within blocks and low pairwise LD between blocks. Finally, there are methods that look for direct evidence of recombination, such as that of [6], using the four-gamete test developed by Hudson and Kaplan [8] and defining blocks as apparently recombination-free regions.

There is still no consensus on the performance criteria of htSNPs selection. Broadly, these criteria are categorized into two groups. One comprises the diversity criteria which evaluate the information captured from the original haplotype diversity [1, 9], such as the α-percent coverage, requiring that the total frequencies of all haplotypes completely distinguished by the htSNPs set should be no less than α. The other group consists of the association-based criteria, concerned most directly with the issue of prediction – the ability of the reduced set H to detect unknown SNPs in the set A of all SNPs within the genome region of interest [10].

To provide molecular geneticists more convenience in analyzing haplotype block structures and in selecting htSNPs, we develop a computational tool, htSNPer1.0, with a graphical user interface (GUI) implementing the above algorithms for block structure partition and htSNPs selection.

Implementation

htSNPer1.0 is a computer program with a GUI for characterizing the haplotype block structure and selecting htSNPs. The core algorithm is implemented in C++ language, and the graphic interface is coded in Java. The software is platform-independent.

Here, we will be concerned with haplotype block partition and htSNPs selection of unphased autosomal SNPs genotype data. For the block definitions that can directly handle unphased genotype data such as Gabriel et al. [2] and those based on pairwise LD [11], the unphased data are first partitioned into blocks over which there is sufficient restriction of haplotype diversity. Then, haplotypes are estimated approximately within each block (by EM algorithm). Finally, based on these estimated haplotypes, htSNPs are selected according to certain htSNPs performance criterion [10]. For those block definitions that can only handle phased haplotype data [1, 6], haplotypes are estimated first (by EM algorithm) from unphased genotype data. Then block partition and htSNPs selection are both based on these estimated haplotypes.

Haplotype estimation – EM algorithm

We apply the EM algorithm used by SNPHAP to estimate haplotypes from genotype data [18]. When the data consist of a large number of SNPs, the number of possible haplotype instances may become extremely large. In order to avoid this problem, the program starts from the first two SNPs and extends the solution by sequentially adding the rest SNPs. As each new SNP is added, the number of possible haplotypes is expanded considering all possible larger haplotypes. After EM algorithm estimating the posterior probabilities, the program deletes genotype assignments with posterior probability lower than 0.001. Then the posterior probabilities of the rest genotype assignments are recomputed.

We use the EM algorithm in SNPHAP because it is simple and fast, and can be easily integrated in our C++ code. There are other algorithms like HAPLOTYPER [15], PHASE [16] and PLEM [17] that are better studied and more widely used. However, one distinctive feature of htSNPer1.0 is to estimate haplotypes within each haplotype block. Within the blocks there is very limited haplotype diversity, so in such cases the algorithm in SNPHAP performs reasonably well. If one likes to do the haplotype phasing before block partition, he/she can use HAPLOTYPER [15], PHASE [16] or PLEM [17] to get more accurate estimation, and then input the estimated haplotypes to htSNPer1.0 to do the block partition and htSNP selection.

Definitions for haplotype blocks

htSNPer has integrated four haplotype block definitions: chromosome coverage [1], average pairwise LD |D'| [11], estimated pairwise LD confidence limits [2] with minor modifications by Wall and Prichard [14], and no historical recombination [6].

1.
Chromosome coverage [1]. A block is defined as a region in which the sum frequencies of common haplotypes (whose frequency is over a threshold, e.g. 0.05) is no less than a threshold. For this definition of blocks we apply a dynamic programming for haplotype partitioning [7]. We define a Boolean function block (i, j) = 1 if the consecutive SNPs from SNP_ito SNP_jcan be defined as a block according to the above definition, and block (i, j) = 0 otherwise. Let f (i, j) be the size of the minimal htSNP set found by GBB algorithm (see below) for α-percent coverage within the block from SNP_ito SNP_j. Given a block partition (1, i₁), (i₁ + 1, i₂),..., (i_n-1+ 1, i_n), the total number of htSNPs for these n blocks is f (1, i₁) + f (i₁ + 1, i₂) +...+ f (i_n-1+ 1, i_n). The optimal block partition is defined to be the one that minimizes the total number of htSNPs.

Denote S_jto be the total number of htSNPs for the optimal block partition of the first j SNPs, and set S₀ = 0. According to dynamic programming theory, we have . Through this recursion the dynamic programming partitions the haplotypes for the optimal block partition.

2.
Average pairwise LD |D'| [11]. Within a block the average pairwise |D'| is no less than a threshold.
3.
Estimated pairwise LD confidence limits [2] with minor modifications by Wall and Prichard [14]. For details see Additional file 1.

4.
No historical recombination [6]. A block is defined as a region without any historical recombination, which is examined by Four Gamete Test.

The above definitions of 2, 3 and 4 do not guarantee a unique solution for partition. In htSNPer1.0, blocks are searched from the start of the input data and expanded as long as possible by sequentially adding the next SNPs.

htSNPs selection criteria

htSNPer1.0 can find the minimal htSNP set of global optimum. Different definitions of optimum can be derived according to different htSNP performance criteria [10]. A generalized definition of "optimum" can be described as the minimal set of htSNPs that satisfies a given htSNP performance criterion. For example, weighted-average haplotype r² is regarded as one of the most informative association-based htSNP performance measure by Weale et al. [10], which is defined as following:

Weighted-average haplotype

where

and we denote

as the the frequency of haplotypes with allele 1 at SNP i, as the frequency of haplotypes in the g th htSNP-defined group (haplotypes within each group are identical at htSNP loci), and as the frequency of haplotypes both in the g th htSNP-defined group and with allele 1 at SNP i.

If the htSNP performance criterion is defined as the weighted-average haplotype r² of the selected set of htSNPs should be at least 90% of the maximum possible value (which is the weighted-average haplotype r² when all SNPs are selected as htSNPs), then the "optimum" according to this criterion can be described as the minimal set of htSNPs whose weighted-average haplotype r² is at least 90% of the maximum possible value.

We have integrated the three htSNP performance criteria into our htSNPer software: α-percent coverage [1], explained proportion of Clayton's haplotype diversity [9], and weighted-average haplotype r² [10].

α-percent coverage: the total frequencies of all haplotypes that are not completely distinguished by the htSNP set is less than 1 - α.

Explained proportion of Clayton's haplotype diversity:

, where f_i, f_{haplo = g}and f_i,gare defined in the same way as above.

Weighted-average haplotype r² : see above.

htSNPer1.0 takes advantages of a novel heuristic algorithm – Generalized Branch-and-Bound (GBB) algorithm, which is applicable for all kinds of htSNPs performance criteria, to search the minimal htSNPs set with both efficiency and global optimum, comparing to the exhaustive searching [7] which guarantees global optimum but runs very slowly, and to the greedy algorithm [1, 13] which is faster but doesn't guarantee global optimum.

The GBB algorithm

Consider a block B containing N haplotypes and each haplotype has M bi-allelic SNPs markers. Each SNP marker can divide N haplotypes into two groups: one consists of all the haplotypes with its major allele, and the other with its minor allele. GBB algorithm is based on the following branching rule and Generalized Prune-rule, using the depth-first searching strategy (Figure 1).

1)
Each node {T, R} in the searching tree consists of two parts: the test-set T and the discard-set R where T is the set of SNPs that have been selected, and R is the set of SNPs that should not be selected for the future. If the set of all SNPs is denoted as S, then the set of SNPs that can be used at the node is S\(T ∪ R). The search tree starts from the root node for which T = Φ and R = Φ.
2)
A child node is generated by adding a SNP to T according to the branching rule. The node is pruned if it meets the Generalized Prune-rule.

Importance calculation

Given a certain node {T, R}, SNPs in T divide all the haplotypes into t non-overlapping groups called equivalence classes. Any haplotypes that belong to the same group are identical at all SNP sites in T. A biallelic SNP divides all the haplotypes into two groups: G_majorand G_minor. To evaluate the competence of the SNP, the importance of a SNP is defined by

Branching rule

Given a node {T, R}, sort the SNPs in S\(T ∪ R) according to the importance calculation non-increasingly: I(SNP₁ |T) ≥ I(SNP₂ |T) ≥ … ≥ I(SNP_|S|-|T|-|R||T), create the children {T ∪ SNP₁, R}, {T ∪ SNP₂, R ∪ SNP₁}, {T ∪ SNP₃, R ∪ SNP₁ ∪ SNP₂}, ..., {T ∪ SNP_|S|-|T|-|R|, SNP_h}, and explore the children in this order.

Generalized prune-rule

Check whether the SNP subset T meets the htSNP performance criterion. If it does, prune the node when |T| ≥ U, or update U when |T| <U ; otherwise, prune the node when |T| ≥ U or |S| - |T| - |R| < 1 where U is the size of the best solution found so far.

The Importance Calculation in Branching rule is originally devised for the α-percent coverage criterion [1, 7]. But it is also applicable for other criteria, although it may not be the best one. Actually, one can devise specific Branching-rule and Prune-rule according to specific htSNPs selection criterion in the GBB framework to achieve super efficiency and global optimization. The GBB framework and algorithm we proposed are applicable to all htSNP criteria, and are at least more efficient than enumeration.

Results and discussion

htSNPer1.0 takes advantages of a novel heuristic algorithm, Generalized Branch-and-Bound algorithm. It is applicable for all kinds of htSNPs performance criteria. The algorithm is of high computational efficiency and it can reach the global optimum.

htSNPer1.0 has integrated three htSNPs performance criteria and four haplotype block definitions. Besides genotype data, htSNPer1.0 can also handle haplotype data directly. It takes a simple flat-file as input. A dialogue box is used to set up parameters and for htSNPs selection algorithm (GBB algorithm and greedy algorithm). In the tabbed-output panel, htSNPer1.0 demonstrates the results both in the form of graphics and plain-texts. A graphical representation of haplotype block partition and htSNPs selection is provided in the graphic panel (Figure 2). In this example, there are 51 SNPs and 50 haplotypes in its input. The LD-based definition was used for haplotype partition, weighted-average haplotype r² for htSNPs performance criteria, and branch-and-bound algorithm in htSNPs selection. In the text-output panel, there is more information about the analysis and results, such as the methods/criterion used on haplotype block definition and htSNPs selection. Users can also select different haplotype definitions and htSNPs performance criteria to compare the results from the result tree in the left panel.

Application example

Study sample

In order to compare the time used and the htSNPs numbers chosen with different softwares, we used the human chromosome 21 haplotype [1] as the test data. This dataset consists of 20 haplotype samples, and 24,047 common SNPs (minor allele frequency no less than 0.10). About 21% of the chromosome 21 data are missing data.

Results based on various haplotype block definitions and htSNP Selection Criteria

Results of the three different htSNPs performance criteria by htSNPer1.0:

Alpha-percent coverage: 3,953 blocks and 5,082 htSNPs.

Haplotype Diversity: 3,055 blocks and 4,619 htSNPs.

Weighted-average haplotype r² : with this criterion and GBB algorithm, htSNPer can not run on our computer because of the large amount of required memory. Using greedy algorithm instead of GBB (see Additional file 1), we get 3,098 blocks and 6,962 htSNPs.

Using the same data set, the same htSNPs performance criterion of α-percent coverage but the different block searching algorithm, Patil et al. [1] reported 4,135 blocks and 4,563 htSNPs. Zhang et al. [7] reported 2,515 blocks and 3,582 htSNPs. About 21% of the chromosome 21 data are missing data and different programs use different strategy to handle missing data. All these contribute to the differences between the results of the above programs.

We also ran the different programs to compare the computational efficiency. The algorithms used for comparison were the GBB algorithm in htSNPer1.0, the greedy algorithm by Zhang et al. [13] and the enumeration algorithm in Zhang et al. [7]. For comparison, we used the diversity-based haplotype block definition and α-percent coverage criterion in htSNPs selection (α = 0.8). Running on our computer with 2.4 GHz AMD Athlon processor (1GB memory), with the block definition of Patil et al. [1], α-percent coverage htSNPs performance criterion and a dynamic programming for haplotype partition, htSNPer1.0 requires 3 hours 23 minutes to identify 3,953 blocks and 5,082 htSNPs. For comparison, the greedy algorithm in Zhang et al. [13] requires 20.74 seconds, identifying 3,766 blocks and 8,733 htSNPs, and the enumeration method in Zhang et al. [7] was too slow to apply on our computer on this human chromosome 21 dataset. The difference in efficiency may be partially due to the different block partition searching strategies applied by these three programs. Zhang et al. [13] uses greedy algorithm for block partition, while htSNPer1.0 and Zhang et al. [7] use dynamic programming algorithm.

Conclusion

In conclusion, htSNPer1.0 is a java-based program with Graphic User Interface. It allows molecular geneticists to perform haplotype block analysis and htSNPs selection using different definitions, different performance criteria, as well as different algorithms. The software is a powerful tool for those focusing on association mapping based on haplotype block and htSNPs strategy.

Availability and requirements

htSNPer1.0 is a graphic user interface and a platform-independent software. The software is available at http://www.chgb.org.cn/htSNPer/htSNPer.html. The source code is available on request to the authors.

htSNPer1.0 takes a plain text file as input, either unphased autosomal SNPs genotype data or phased haplotype data. It requires the Java Running Environment (Jre1.4 or later version) to run the program properly. Detailed tutorials, htSNPer1.0 help system and examples are distributed within htSNPer1.0 software. Please inform the corresponding author if you are a non-academic user.

References

Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BTN, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SPA, Cox DR: Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science 2001, 294: 1719–1723. 10.1126/science.1065573
Article CAS PubMed Google Scholar
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science 2002, 296: 2225–2229. 10.1126/science.1069424
Article CAS PubMed Google Scholar
Jeffreys AJ, Kauppi L, Neumann R: Intensely punctuate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 2001, 29: 217–222. 10.1038/ng1001-217
Article CAS PubMed Google Scholar
May CA, Shone AC, Kalaydjieva L, Sajantila A, Je_reys AJ: Crossover clustering and rapid decay of linkage disequilibrium in the Xp/Yp pseudoautosomal gene SHOX. Nat Genet 2001, 31: 272–275. 10.1038/ng918
Article Google Scholar
Zhang K, Calabrese P, Nordborg M, Sun FZ: Haplotype block structure and its applications to association studies. Am J Hum Genet 2002, 71: 1386–1394. 10.1086/344780
Article PubMed Central CAS PubMed Google Scholar
Wang N, Akey JM, Zhang K, Chakraborty R, Jin L: Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am J Hum Genet 2002, 71: 1227–1334. 10.1086/344398
Article PubMed Central CAS PubMed Google Scholar
Zhang K, Deng M, Chen T, Waterman MS, Sun FZ: A dynamic programming algorithm for haplotype block partition. Proc Natl Acad Sci USA 2002, 99: 7335–7339. 10.1073/pnas.102186799
Article PubMed Central CAS PubMed Google Scholar
Hudson R, Kaplan N: Statistical properties of the number of recombination events in the history of a sample of sequences. Genetics 1985, 111: 147–164.
PubMed Central CAS PubMed Google Scholar
Clayton D: Choosing a set of haplotype tagging SNPs from a larger set of diallelic loci.[http://www-gene.cimr.cam.ac.uk/clayton/software/stata/htSNP/htsnp.pdf]
Weale ME, Depondt C, Macdonald SJ, Smith A, Lai PS, Shorvon SD, Wood NW, Goldstein DB: Selection and Evaluation of Tagging SNPs in the Neuronal-Sodium-Channel Gene SCN1A: Implications for Linkage-Disequilibrium Gene Mapping. Am J Hum Genet 2003, 73: 551–565. 10.1086/378098
Article PubMed Central CAS PubMed Google Scholar
Reich DE, Cargill M, Bolk S, Ireland J, C SP, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES: Linkage disequilibrium in the human genome. Nature 2001, 411: 199–204. 10.1038/35075590
Article CAS PubMed Google Scholar
Anderson EC, Novembre J: Finding haplotype block boundaries by using the Minimum-Description-Length principle. Am J Hum Genet 2003, 73: 336–354. 10.1086/377106
Article PubMed Central CAS PubMed Google Scholar
Zhang K, Jin L: HaploBlockFinder: haplotype block analyses. Bioinformatics 2003, 19: 1300–1301. 10.1093/bioinformatics/btg142
Article CAS PubMed Google Scholar
Wall JD, Pritchard JK: Assessing the Performance of the Haplotype Block Model of Linkage Disequilibrium. Am J Hum Genet 2003, 73: 502–515. 10.1086/378099
Article PubMed Central CAS PubMed Google Scholar
Niu T, Qin ZS, Xu X, Liu JS: Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms. Am J Hum Genet 2002, 70: 157–169. 10.1086/338446
Article PubMed Central CAS PubMed Google Scholar
Stephens M, Smith NJ, Donnelly P: A New Statistical Method for Haplotype Reconstruction from Population Data. Am J Hum Genet 2001, 68: 978–989. 10.1086/319501
Article PubMed Central CAS PubMed Google Scholar
Qin ZS, Niu T, Liu J: Partition-Ligation Expectation-Maximization Algorithm for Haplotype Inference with Single-Nucleotide Polymorphisms. Am J Hum Genet 2002, 71: 1242–1247. 10.1086/344207
Article PubMed Central CAS PubMed Google Scholar
SNPHAP[http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt]
Schwartz R, Halldorsson BV, Bafna V, Clark AG, Istrail S: Robustness of Inference of Haplotype Block Structure. J Comp Biol 2003, 10: 13–19. 10.1089/106652703763255642
Article CAS Google Scholar

Download references

Acknowledgements

We thank Dr. Michael Q. Zhang for valuable discussion. This work is supported in part by China National Key Program on Basic Research (Grant G1998051003 to Y.S and 2004CB518605 to X.Z.), China National High-Tech R&D Program (Grant 863-102-10-03-05 to Y.S), and NSFC (Grants 39625007 and 39993420 to Y.S, 60275007 and 60234020 to X.Z).

Author information

Authors and Affiliations

National Laboratory of Medical Molecular Biology; Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences (CAMS) and Peking Union Medical College (PUMC), Beijing, 100005, China
Keyue Ding, Kaixin Zhou & Yan Shen
MOE Key Laboratory of Bioinformatics/Department of Automation, Tsinghua University, Beijing, 100084, China
Jing Zhang & Xuegong Zhang
Chinese National Human Genome Center, Beijing, 100176, China
Keyue Ding, Jing Zhang & Yan Shen

Authors

Keyue Ding
View author publications
You can also search for this author in PubMed Google Scholar
Jing Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Kaixin Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Yan Shen
View author publications
You can also search for this author in PubMed Google Scholar
Xuegong Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Yan Shen or Xuegong Zhang.

Additional information

Authors' contributions

K. Ding and J. Zhang provided software design, development and testing of the software, and J. Zhang wrote the source codes. K. Zhou participated in its design. Y. Shen and X. Zhang provided biological direction and validation of the tool. Y. Shen conceived of this project, and participated in its coordination. All authors have read and approved the final manuscript.

Keyue Ding, Jing Zhang contributed equally to this work.

Electronic supplementary material

Additional File 1: It provides more details of the methodology that are not covered in the main text. (PDF 44 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ding, K., Zhang, J., Zhou, K. et al. htSNPer1.0: software for haplotype block partition and htSNPs selection. BMC Bioinformatics 6, 38 (2005). https://doi.org/10.1186/1471-2105-6-38

Download citation

Received: 25 June 2004
Accepted: 01 March 2005
Published: 01 March 2005
DOI: https://doi.org/10.1186/1471-2105-6-38

htSNPer1.0: software for haplotype block partition and htSNPs selection