htSNPer1.0 is a computer program with a GUI for characterizing haplotype block structure and selecting htSNPs. The core algorithm is implemented in C++, and the graphical interface is written in Java. The software is platform-independent.
Here we are concerned with haplotype block partition and htSNP selection for unphased autosomal SNP genotype data. For block definitions that can directly handle unphased genotype data, such as that of Gabriel et al. [2] and those based on pairwise LD [11], the unphased data are first partitioned into blocks within which haplotype diversity is sufficiently restricted. Haplotypes are then estimated within each block by the EM algorithm. Finally, htSNPs are selected from these estimated haplotypes according to a chosen htSNP performance criterion [10]. For block definitions that can only handle phased haplotype data [1, 6], haplotypes are first estimated from the unphased genotype data by the EM algorithm, and both block partition and htSNP selection are then based on these estimated haplotypes.
Haplotype estimation – EM algorithm
We apply the EM algorithm used by SNPHAP to estimate haplotypes from genotype data [18]. When the data contain a large number of SNPs, the number of possible haplotypes can become extremely large. To avoid this problem, the program starts from the first two SNPs and extends the solution by sequentially adding the remaining SNPs. As each new SNP is added, the set of possible haplotypes is expanded to include all possible larger haplotypes. After the EM algorithm has estimated the posterior probabilities, the program deletes genotype assignments with posterior probability lower than 0.001 and recomputes the posterior probabilities of the remaining assignments.
We use the EM algorithm in SNPHAP because it is simple and fast, and can be easily integrated into our C++ code. Other algorithms, such as HAPLOTYPER [15], PHASE [16] and PLEM [17], are better studied and more widely used. However, one distinctive feature of htSNPer1.0 is that it estimates haplotypes within each haplotype block. Within blocks, haplotype diversity is very limited, and in such cases the SNPHAP algorithm performs reasonably well. If haplotype phasing before block partition is preferred, one can use HAPLOTYPER [15], PHASE [16] or PLEM [17] to obtain more accurate estimates, and then input the estimated haplotypes to htSNPer1.0 for block partition and htSNP selection.
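The estimation step can be illustrated with a minimal Python sketch. It is not SNPHAP's code: it enumerates every phase assignment of every genotype at once, so it is only practical for a handful of SNPs, whereas the program described above adds SNPs one at a time. The genotype coding (0, 1 or 2 copies of allele 1), the Hardy-Weinberg assumption in the E-step, the fixed iteration count and all function names are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with one unphased genotype.

    genotype[k] is the number of copies of allele 1 at SNP k (0, 1 or 2).
    """
    het = [k for k, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites are fixed
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = base[:], base[:]
        for k, b in zip(het, bits):
            h1[k], h2[k] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def posteriors(pairs, freq):
    """Posterior probability of each phase assignment (Hardy-Weinberg assumed)."""
    weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
    total = sum(weights)
    return [w / total for w in weights]


def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM with a fixed number of iterations."""
    candidates = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in candidates for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:  # E-step: expected haplotype counts
            for (a, b), p in zip(pairs, posteriors(pairs, freq)):
                counts[a] += p
                counts[b] += p
        # M-step: each individual carries two haplotypes
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


def trim_assignments(pairs, freq, threshold=0.001):
    """Drop assignments with posterior below threshold, then renormalize."""
    kept = [(pair, p) for pair, p in zip(pairs, posteriors(pairs, freq))
            if p >= threshold]
    total = sum(p for _, p in kept)
    return [(pair, p / total) for pair, p in kept]
```

With three 0/0 homozygotes, three 2/2 homozygotes and one double heterozygote, the estimates concentrate on haplotypes 00 and 11, and trimming leaves a single phase assignment for the double heterozygote.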
Definitions for haplotype blocks
htSNPer integrates four haplotype block definitions: chromosome coverage [1], average pairwise LD D' [11], estimated pairwise LD confidence limits [2] with minor modifications by Wall and Pritchard [14], and no historical recombination [6].

1. Chromosome coverage [1]. A block is defined as a region in which the summed frequency of common haplotypes (those whose frequency exceeds a threshold, e.g. 0.05) is no less than a threshold. For this definition of blocks we apply a dynamic programming algorithm for haplotype partitioning [7]. We define a Boolean function block(i, j) = 1 if the consecutive SNPs from SNP_{i} to SNP_{j} can be defined as a block according to the above definition, and block(i, j) = 0 otherwise. Let f(i, j) be the size of the minimal htSNP set found by the GBB algorithm (see below) for α-percent coverage within the block from SNP_{i} to SNP_{j}. Given a block partition (1, i_{1}), (i_{1} + 1, i_{2}), ..., (i_{n-1} + 1, i_{n}), the total number of htSNPs for these n blocks is f(1, i_{1}) + f(i_{1} + 1, i_{2}) + ... + f(i_{n-1} + 1, i_{n}). The optimal block partition is defined to be the one that minimizes the total number of htSNPs.
Let S_{j} denote the total number of htSNPs for the optimal block partition of the first j SNPs, and set S_{0} = 0. According to dynamic programming theory, we have
$$ S_{j} = \min \{ S_{i-1} + f(i, j) : 1 \le i \le j,\ \mathrm{block}(i, j) = 1 \}. $$
Through this recursion the dynamic programming algorithm finds the optimal block partition.

2. Average pairwise LD D' [11]. Within a block, the average pairwise D' is no less than a threshold.

3. Estimated pairwise LD confidence limits [2] with minor modifications by Wall and Pritchard [14]. For details see Additional file 1.

4. No historical recombination [6]. A block is defined as a region without any historical recombination, as assessed by the four-gamete test.
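The partition for definition 1 can be sketched as a short dynamic program over S_{j}. The recursion coded here, S_{j} as the minimum of S_{i-1} + f(i, j) over all i with block(i, j) = 1, is the standard form implied by the definitions of S_{j}, block(i, j) and f(i, j). The function names, the 1-based inclusive indexing and the toy `is_block` and `f` in the usage note are assumptions for illustration. In htSNPer1.0, f(i, j) would come from the GBB search.

```python
def optimal_partition(n_snps, is_block, f):
    """Minimize the total htSNP count over partitions of SNPs 1..n_snps.

    is_block(i, j) -> bool and f(i, j) -> int use 1-based, inclusive indices.
    Returns (total htSNP count, list of (start, end) blocks).
    """
    inf = float("inf")
    best = [0] + [inf] * n_snps   # best[j] plays the role of S_j; best[0] = S_0 = 0
    start = [0] * (n_snps + 1)    # start[j]: first SNP of the last block ending at j
    for j in range(1, n_snps + 1):
        for i in range(1, j + 1):
            if is_block(i, j) and best[i - 1] + f(i, j) < best[j]:
                best[j] = best[i - 1] + f(i, j)
                start[j] = i
    if best[n_snps] == inf:
        raise ValueError("no valid block partition exists")
    blocks, j = [], n_snps
    while j > 0:                  # walk the back-pointers from the last SNP
        blocks.append((start[j], j))
        j = start[j] - 1
    return best[n_snps], blocks[::-1]
```

For example, with five SNPs, blocks limited to at most three SNPs and a toy cost of one htSNP per block, `optimal_partition(5, lambda i, j: j - i + 1 <= 3, lambda i, j: 1)` needs two blocks.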
Definitions 2, 3 and 4 above do not guarantee a unique partition. In htSNPer1.0, blocks are searched from the start of the input data, and each block is extended as far as possible by sequentially adding the next SNP.
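For definitions 2 to 4, the left-to-right search can be sketched as follows, using the four-gamete test of definition 4 as the block predicate. This is a minimal illustration rather than htSNPer1.0's implementation: haplotypes are assumed to be tuples of 0/1 alleles, indices are 0-based and inclusive, and the function names are hypothetical.

```python
def four_gamete_ok(haplotypes, a, b):
    """True if SNPs a and b show fewer than all four gametes 00, 01, 10, 11."""
    return len({(h[a], h[b]) for h in haplotypes}) < 4


def no_recombination(haplotypes, i, j):
    """True if every SNP pair in columns i..j passes the four-gamete test."""
    return all(four_gamete_ok(haplotypes, a, b)
               for a in range(i, j + 1)
               for b in range(a + 1, j + 1))


def greedy_blocks(n_snps, is_block):
    """Scan from the first SNP, extending each block while is_block holds."""
    blocks, i = [], 0
    while i < n_snps:
        j = i
        while j + 1 < n_snps and is_block(i, j + 1):
            j += 1
        blocks.append((i, j))
        i = j + 1
    return blocks
```

Any other predicate with the same `(i, j)` signature, for instance a threshold on average pairwise D', could be passed to `greedy_blocks` in place of the four-gamete test.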
htSNPs selection criteria
htSNPer1.0 can find a minimal htSNP set that is globally optimal. Different definitions of optimum can be derived from different htSNP performance criteria [10]. In general, the "optimum" can be described as the minimal set of htSNPs that satisfies a given htSNP performance criterion. For example, weighted-average haplotype r^{2} is regarded by Weale et al. [10] as one of the most informative association-based htSNP performance measures, and is defined as follows:
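Weighted-average haplotype r^{2} is a weighted average, taken over the SNPs i, of the per-SNP haplotype r^{2}
$$ r_{i}^{2} = \frac{\sum_{g} f_{i,g}^{2} / f_{haplo = g} - f_{i}^{2}}{f_{i} (1 - f_{i})} $$
where f_{i} denotes the frequency of haplotypes with allele 1 at SNP i, f_{haplo = g} the frequency of haplotypes in the g-th htSNP-defined group (haplotypes within each group are identical at the htSNP loci), and f_{i,g} the frequency of haplotypes that are both in the g-th htSNP-defined group and carry allele 1 at SNP i.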
If the htSNP performance criterion requires that the weighted-average haplotype r^{2} of the selected htSNP set be at least 90% of the maximum possible value (the weighted-average haplotype r^{2} obtained when all SNPs are selected as htSNPs), then the "optimum" under this criterion is the minimal set of htSNPs whose weighted-average haplotype r^{2} is at least 90% of that maximum.
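A minimal sketch of this criterion follows. It computes the per-SNP haplotype r^{2} as the proportion of variance in the allele at SNP i that is explained by the htSNP-defined groups, expressed through f_{i}, f_{haplo = g} and f_{i,g}. The weights are left as a caller-supplied argument, which is an assumption of this sketch, as are the function names and the representation of haplotypes as tuples of 0/1 alleles mapped to frequencies. Every SNP is assumed to be polymorphic.

```python
from collections import defaultdict


def haplotype_r2(freqs, tags, i):
    """Per-SNP haplotype r^2 of SNP i given the htSNP indices in `tags`.

    freqs maps haplotype tuples of 0/1 alleles to frequencies summing to 1.
    """
    f_i = sum(f for h, f in freqs.items() if h[i] == 1)
    f_g = defaultdict(float)    # frequency of each htSNP-defined group
    f_ig = defaultdict(float)   # group frequency restricted to allele 1 at SNP i
    for h, f in freqs.items():
        g = tuple(h[t] for t in tags)
        f_g[g] += f
        if h[i] == 1:
            f_ig[g] += f
    explained = sum(f_ig[g] ** 2 / f_g[g] for g in f_g) - f_i ** 2
    return explained / (f_i * (1 - f_i))


def weighted_average_r2(freqs, tags, weights):
    """Weighted average of the per-SNP r^2; weights[i] belongs to SNP i."""
    total = sum(weights)
    return sum(w * haplotype_r2(freqs, tags, i)
               for i, w in enumerate(weights)) / total


def meets_r2_criterion(freqs, tags, weights, fraction=0.9):
    """True if `tags` reaches `fraction` of the value with all SNPs as htSNPs."""
    all_snps = list(range(len(weights)))
    maximum = weighted_average_r2(freqs, all_snps, weights)
    return weighted_average_r2(freqs, tags, weights) >= fraction * maximum
```

In a toy block where SNP 0 and SNP 1 always carry the same allele, tagging SNP 0 alone gives SNP 1 an r^{2} of 1 but leaves a third, weakly correlated SNP poorly predicted, so the 90% criterion is met only after that SNP is added to the htSNP set.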
We have integrated three htSNP performance criteria into our htSNPer software: α-percent coverage [1], explained proportion of Clayton's haplotype diversity [9], and weighted-average haplotype r^{2} [10].
α-percent coverage: the total frequency of all haplotypes that are not completely distinguished by the htSNP set is less than 1 − α.
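Explained proportion of Clayton's haplotype diversity:
$$ \frac{\sum_{i} \left( \sum_{g} f_{i,g}^{2} / f_{haplo = g} - f_{i}^{2} \right)}{\sum_{i} f_{i} (1 - f_{i})} $$
where f_{i}, f_{haplo = g} and f_{i,g} are defined in the same way as above.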
Weighted-average haplotype r^{2}: see above.
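The first two criteria can be sketched in a few lines. The coverage check follows the wording above literally (strictly less than 1 − α). The diversity function uses the common form of the explained proportion, one minus the ratio of within-group to total diversity summed over SNPs, which is an assumption about the exact expression intended here. Function names and the haplotype representation (tuples of 0/1 alleles mapped to frequencies) are hypothetical.

```python
from collections import defaultdict


def undistinguished_frequency(freqs, tags):
    """Total frequency of haplotypes sharing their htSNP pattern with another."""
    groups = defaultdict(list)
    for hap in freqs:
        groups[tuple(hap[t] for t in tags)].append(hap)
    return sum(freqs[h]
               for members in groups.values() if len(members) > 1
               for h in members)


def meets_coverage(freqs, tags, alpha):
    """alpha-percent coverage: undistinguished frequency below 1 - alpha."""
    return undistinguished_frequency(freqs, tags) < 1 - alpha


def explained_diversity(freqs, tags):
    """Proportion of haplotype diversity explained by the htSNPs in `tags`."""
    n_snps = len(next(iter(freqs)))
    explained = total = 0.0
    for i in range(n_snps):
        f_i = sum(f for h, f in freqs.items() if h[i] == 1)
        f_g, f_ig = defaultdict(float), defaultdict(float)
        for h, f in freqs.items():
            g = tuple(h[t] for t in tags)
            f_g[g] += f
            f_ig[g] += f * h[i]   # adds f only when SNP i carries allele 1
        explained += sum(f_ig[g] ** 2 / f_g[g] for g in f_g) - f_i ** 2
        total += f_i * (1 - f_i)
    return explained / total
```

Both functions take the same `(freqs, tags)` arguments, so either can be wrapped as the performance criterion of a subset search.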
htSNPer1.0 uses a novel heuristic algorithm, the Generalized Branch-and-Bound (GBB) algorithm, which is applicable to all kinds of htSNP performance criteria, to search for the minimal htSNP set both efficiently and with global optimality. By comparison, exhaustive search [7] guarantees the global optimum but runs very slowly, and the greedy algorithm [1, 13] is faster but does not guarantee the global optimum.
The GBB algorithm
Consider a block B containing N haplotypes, each with M biallelic SNP markers. Each SNP marker divides the N haplotypes into two groups: one consisting of all haplotypes carrying its major allele, and the other of those carrying its minor allele. The GBB algorithm is based on the following branching rule and generalized prune-rule, using a depth-first search strategy (Figure 1).

1) Each node {T, R} in the search tree consists of two parts, the test-set T and the discard-set R, where T is the set of SNPs that have been selected and R is the set of SNPs that must not be selected further down the tree. If the set of all SNPs is denoted by S, then the set of SNPs available at the node is S\(T ∪ R). The search tree starts from the root node, for which T = Φ and R = Φ.

2) A child node is generated by adding a SNP to T according to the branching rule. The node is pruned if it meets the generalized prune-rule.
Importance calculation
Given a node {T, R}, the SNPs in T divide all the haplotypes into t non-overlapping groups called equivalence classes. Haplotypes that belong to the same group are identical at all SNP sites in T. A biallelic SNP divides all the haplotypes into two groups, G_{major} and G_{minor}. To evaluate the competence of the SNP, the importance of a SNP is defined by
Branching rule
Given a node {T, R}, sort the SNPs in S\(T ∪ R) by importance in non-increasing order, I(SNP_{1} | T) ≥ I(SNP_{2} | T) ≥ … ≥ I(SNP_{|S − T − R|} | T); create the children {T ∪ SNP_{1}, R}, {T ∪ SNP_{2}, R ∪ SNP_{1}}, {T ∪ SNP_{3}, R ∪ SNP_{1} ∪ SNP_{2}}, ..., {T ∪ SNP_{|S − T − R|}, R ∪ SNP_{1} ∪ … ∪ SNP_{|S − T − R| − 1}}; and explore the children in this order.
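The branching rule can be sketched as a small function that returns the children of a node in exploration order. The importance score is passed in as a callable, since only its ordering matters here. The function name and the use of frozensets for T and R are assumptions for illustration.

```python
def children(T, R, S, importance):
    """Children of node {T, R}, most important free SNP first.

    T and R are frozensets of SNP indices, S is the list of all SNP indices,
    and importance(snp, T) returns the score of a free SNP given T.
    """
    free = [s for s in S if s not in T and s not in R]
    free.sort(key=lambda s: importance(s, T), reverse=True)
    # The k-th child selects free[k] and discards every higher-ranked SNP,
    # so no SNP subset is generated twice.
    return [(T | {snp}, R | set(free[:k])) for k, snp in enumerate(free)]
```

Because each child discards the SNPs ranked above its own, the subtrees below the children are disjoint, and together they cover every superset of T that avoids R.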
Generalized prune-rule
Check whether the SNP subset T meets the htSNP performance criterion. If it does, prune the node when |T| ≥ U, or update U when |T| < U; otherwise, prune the node when |T| ≥ U or |S − T − R| < 1, where U is the size of the best solution found so far.
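Putting the branching rule and the prune-rule together gives the following depth-first sketch. It is an illustration under stated assumptions, not the htSNPer1.0 implementation: the performance criterion is any callable on T, and the `importance` function is a stand-in chosen for this example (the number of haplotype pairs still grouped together by T that the SNP would separate), not necessarily the importance measure used by the program.

```python
from collections import defaultdict


def equivalence_classes(haps, T):
    """Group haplotypes that are identical at all SNPs in T."""
    groups = defaultdict(list)
    for h in haps:
        groups[tuple(h[t] for t in sorted(T))].append(h)
    return list(groups.values())


def importance(snp, T, haps):
    """Stand-in score (an assumption): pairs within classes split by `snp`."""
    score = 0
    for members in equivalence_classes(haps, T):
        ones = sum(h[snp] for h in members)
        score += ones * (len(members) - ones)
    return score


def gbb(haps, meets_criterion):
    """Depth-first search for a smallest SNP set satisfying meets_criterion."""
    n_snps = len(haps[0])
    best = {"size": n_snps + 1, "set": None}   # "size" plays the role of U

    def visit(T, R):
        if meets_criterion(T):
            if len(T) < best["size"]:          # update U
                best["size"], best["set"] = len(T), T
            return                             # any deeper node is larger
        free = [s for s in range(n_snps) if s not in T and s not in R]
        if len(T) >= best["size"] or not free:
            return                             # prune
        free.sort(key=lambda s: importance(s, T, haps), reverse=True)
        for k, snp in enumerate(free):         # branching rule
            visit(T | {snp}, R | frozenset(free[:k]))

    visit(frozenset(), frozenset())
    return best["set"]
```

The search returns `None` when no SNP subset satisfies the criterion. A different importance function only changes the order in which nodes are visited, and hence how early U becomes small enough to prune.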
The importance calculation in the branching rule was originally devised for the α-percent coverage criterion [1, 7]. It is also applicable to other criteria, although it may not be the best choice for them. Indeed, within the GBB framework one can devise a branching rule and a prune-rule tailored to a specific htSNP selection criterion to achieve higher efficiency while retaining global optimality. The GBB framework and algorithm we propose are applicable to all htSNP criteria and are, at the least, more efficient than exhaustive enumeration.