BMC Bioinformatics Methodology article A model-based approach to selection of tag SNPs

Background Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides a machinery for the prediction of tagged SNPs and thereby to assess the performances of tag sets through their ability to predict larger SNP sets. Results Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. Conclusion Our study provides strong evidence that the tag sets selected by our best method, based on Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. Besides, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. A software that implements our approach is available.


Background
Genetic association studies at the population level are one of the most promising ways to discover the genetic basis of subtle human phenotypes such as complex diseases or drug responses [1][2][3].The aim of these studies is to map genetic factors underlying such phenotypes by comparing genetic information and phenotypes of individuals sampled from a population.As whole-genome sequencing for each individual remains currently impossible, the genetic information is typically assessed through a set of genetic markers that carry information on their neighborhoods due to Linkage Disequilibrium (LD).Single Nucleotide Polymorphisms (SNPs), the most common type of polymorphisms in the human genome, are markers of great interest in this context.In fact, they are so common that the information they carry seems highly redundant in high LD regions of the human genome.Consequently, it makes sense to select a small fraction of the SNPs, the tag SNPs, for mapping purposes.This can significantly reduce genotyping effort without much loss of power [4].One of the main goals of the international HapMap project is to acquire the knowledge of the LD structure needed for the choice of efficient tag SNPs [5].
The stake in the choice of the tag SNPs has driven the development of numerous methods for their selection during the past few years.These methods were reviewed by Halldórsson et al. [6].In their view, methods differ mostly in two major aspects: the quality or correlation measure used for the definition of tag SNPs and the algorithm used for the minimization of the final number of tag SNPs.For instance, we can seek a subset of SNPs such that every SNP that does not belong to the tag set (i.e. a tagged SNP) has a r 2 measure of pairwise LD with a tag SNP greater than a given threshold [7,8].Some studies suggest that the human genome is composed of haplotype blocks with high LD and relatively limited haplotype diversity, separated by short regions of low LD [9][10][11].The concept of block has immediately received a great deal of attention in the context of tag SNP selection because a block may contain a large number of SNPs, but a few SNPs are enough to uniquely identify the haplotypes in a block.A straightforward block-based strategy consists of two separate steps: (1) identify haplotype blocks and (2) select the tag SNPs [9].In one of the most popular tag SNP selection approaches introduced by Zhang et al. [12], both the selection of block boundaries and the choice of tag SNPs are optimized jointly to capture the majority of haplotype diversity within blocks.The idea is implemented in the HapBlock software [13].Here, rather than a "hard" definition of blocks, we describe LD by "soft" parameter values in an appropriate probabilistic model.Our choice of a block-free perspective is motivated by numerous observations on block partitioning documented in the literature.The most important reason is that although some regions of the human genome seems to conform quite well to a description in terms of blocks, other regions do not [14].It also appears that the definition of a block is not straightforward and very different partitioning are obtained depending on the adopted definition, leading to the selection of very different number of tag SNPs [15].In addi-tion, block partitioning has been reported to be affected by many factors such as the SNP density, the number of observed sequences and the choice of a particular set of markers [16][17][18].A good probabilistic model of the haplotype sequences can better capture LD patterns than haplotype blocks or simple measures of pairwise LD do.We also find two other benefits in adopting a model-based point of view.First, it allows us to tackle the tag SNP selection problem as a data compression problem using the widely accepted Shannon's measure of information content.Second, a probabilistic model provides the machinery to predict tagged SNPs from tag ones.Effectiveness of tag sets can then be evaluated through their information content and their ability to predict other SNPs, both measurements requiring a good probabilistic model.This direct measure of performance makes it possible to compare various methods, including model-based and other tagging methods.Similar ideas can be found in some previous studies.In particular, the formulation of the haplotype tag SNP selection as a data compression problem and the idea of evaluating tag sets through their prediction performances are advocated in Halldórsson et al. [19] and Shannon entropy based criteria to measure LD and to select tag SNPs are described in [18] and [20], respectively.However, overall, little attention has been paid to selection and use of good probabilistic models for tag SNP selection.The coalescent model [21] and its generalizations are the most appealing approaches to relating genetic variation to recombination rate and other genetic and demographic factors.However, the inference based on coalescence is computationally challenging.Rather, we consider approximations that are computationally tractable.We start with a simple Markov model that has fast algorithms for exact solutions and can serve as a baseline reference.Next we consider several hidden Markov models (HMMs).In particular, we study the model introduced by Li and Stephens in the context of recombination rate estimation [22].We show how entropy under different models of haplotype sequences can be maximized in practice.As another reference, we propose a greedy method that maximizes entropy without the linear structure embedded in Markov models and HMMs.
Model-based SNP tagging hinges on the issue of model comparison.Criterion such as AIC [23] and BIC [24] have been proposed in the literature for model selection.In this article, to deal with various complicated models, we adopt the principle of Minimum Description Length (MDL), which unifies the likelihood theory and Shannon's theory [25].According to the MDL principle, the best model gives the shortest description code-length of the data.Analytical forms of description code-lengths are only available for relatively simple models [26].The two-stage coding scheme has been adopted to delineate haplotype blocks using MDL after formulating the task as the choice of a model whose parameter space size increases smoothly with the number of blocks [27][28][29].Here, we compare a relatively small number of models with very different parameter spaces and thus we prefer to evaluate the description code-length of the models by using a cross-validation version of the predictive coding scheme for bit-counting that automatically penalizes over-fitting.
We compare the performances of the models and of the resulting tag sets on haplotype data from the international HapMap and ENCODE projects.The performances of the tag sets selected by our model-based method are also compared with tag sets selected by the HapBlock software [13], with tag sets selected by the method proposed by Carlson et al. [8], with SNPs chosen at random, and with evenly spaced SNPs.Finally, we assess the loss of predictive power of the tag sets due to the typing of genotypes instead of haplotypes.

Model-based tag SNP selection
Let X = {X j , j = 1, 2, ... , n} denote the random variable that corresponds to a haplotype sequence.When selecting a subset of tag SNPs indexed by J ⊂ {1, 2,..., n}, we want to minimize the loss of information when we experimentally assess the subset J instead of the whole set of SNPs.In other words, we would like to find the optimal compression X J of X. Shannon has shown that the information content, or randomness, of X is well measured by its entropy defined as H(X) = -X log ‫(ސ‬X) [30].For any given subset J, the information content of X can be decomposed into two parts using the chain rule of entropy: H(X) = H(X J ) + H( |X J ), where the entropy H(X J ) is the information carried by the subset J and the conditional entropy is the information loss due to the residual randomness of the SNPs that do not belong to J.In this framework, we seek a set of tag SNPs J that maximizes H(X J ) subject to the constraint on the number of markers in J.The same subset J minimizes H( |X J ) and maximizes the mutual information between X J and X.
The method of entropy maximization requires the specification of a probabilistic model for X.Better models provide more accurate quantification of the information content contained in sets of SNPs and thus are also expected to allow the selection of better tag SNPs.In the Methods section we describe the models along with tractable tag SNP selection procedures for each of them.

Model comparison using description code-lengths
Each model provides a coding system that can be used to encode haplotype sequences.When averaged over many new sequences, the length of this code gives an assessment of the quality of the model: the shorter the code-length, the better the model (MDL principle).Code-lengths are closely related to the information content of the data (see [31] for a didactic presentation).Both quantities are usually expressed in bits and any achievable code-length is an upper bound to the real information content of the data.Code-length equals to the information content only in the idealistic case where the "true" model is used for encoding.In Table In all cases, Li and Stephens models with homogeneous recombination rate (LS-HOM) and with heterogeneous recombination rate (LS-HET) stand out by their short description lengths.The shortest description length is always achieved by LS-HET that shortens the code-lengths by about 40% compared to the GR-2 model.The amplitudes of the differences between code-lengths associated to LS-HET and LS-HOM vary greatly depending on the genomic region.It is less than 2% in the ENm013 and ENrl23 regions and more than 8% in the ENrl3l and ENr232 regions, being around 5% for the chromosomescale data set (Chromosome 7).These differences may reflect variations in the pattern of recombination.In addition to demonstrating a strong superiority of the Li and Stephens HMMs, code-lengths also reveals some interesting features of the data.We note in particular that the information content, or randomness, as assessed with the best model (LS-HET) is systematically higher for YRI hap- lotypes than for CEU haplotypes.For instance, the difference is 44% on the Chromosome 7 data set.We also observe that the information content measured in bits is always much smaller than the number of SNPs: a prerequisite for the existence of small tag sets capturing most of the information.

Informativeness versus number of tag SNPs
Next we check how informativeness increases as we select more tag SNPs.We emphasize that the calculation of informativeness does not necessarily depend on the tagging method and should in fact be measured according to the best available model (i.e.LS-HET) when comparing tag sets.Figure 1 shows the information content versus number of tag SNPs in the ENr112 region from the CEU population.We do not present the results for other regions as they look qualitatively similar.
In Figure 1 (a), the information content is computed by the same model as used to select the tag SNPs.As expected the information content increases as more SNPs are selected.Let us look at the two extreme cases.In the Li and Stephens models, the information curve increases at a much slower pace after 75 SNPs are selected.On the other end, in the simple Markov model, the information content keeps growing rapidly as more SNPs are selected.
With the results of model comparison in mind, we know that Li and Stephens models better describe the true information pattern.This supports the hypothesis of high redundancy in SNP information.Those extra bits seen in the simple Markov model reflect its relatively low efficiency in encoding haplotypes.
In Figure 1 (b), the information is computed with the LS-HET model.As expected, tag sets selected with LS-HET are the most informative.The informativeness of a tag set can be thought as a measure of its ability to describe SNP data.
If we compare models according to this measure, the results are consistent with those obtained using cross-validated code-lengths although differences between tag sets are smaller than differences between models.Averaging information content over all ENCODE regions from both CEU and YRI, we find that 100 SNPs selected with LS-HET capture the same amount of information as 108 SNPs selected with LS-HOM or 113 SNPs selected with GR-2, next come GR-1 (120 SNPs) and HMM-4D (150 SNPs).
We also study tag sets made of evenly spaced SNPs and randomly picked SNPs (Figure 1 (b)).They appear to be less informative.The poor performance of these tag SNPs is particularly evident for small tag sets.The information content of the 20-SNPs tag set selected by the heterogeneous Li and Stephens model is about 16 bits whereas it is only about 8 bits for random or evenly spaced SNPs.Pop. (a) SNPs (b) Markov HMM-2 (c) HMM-4D (d) GR-1 (e) GR-2 (f) LS-HOM (g) LS-HET (h)   Chr.When a model is used to select tag SNPs, the information increases at the rate of about one bit per SNP at the very beginning because any model starts by choosing approximately independent SNPs with high minor allele frequencies.The rates slow down as more SNPs are selected: the SNPs that do not belong to the tag set tend to have strong dependence with tag SNPs or lower minor alleles frequencies.

Prediction of tagged SNPs
A good tag SNP set should allow us to predict the allelic status of the tagged SNPs with high accuracy.Explicitly or implicitly, a prediction procedure is usually associated with a probabilistic model.Here, we take the most probable allele, given a model and the haplotype known at the tag SNPs, as our prediction.These probabilities are easy to compute in the HMM framework.Predictions based on greedy models rely on the observed frequencies in 1000 simulations of the tagged SNPs given the haplotype at the tag SNPs. Figure 2 shows the fraction of correctly predicted SNPs in the ENr112 region from the CEU population.In Figure 2 (a), predictions are based on the same model as used to select the tag SNPs.The relative stand-ings of these curves are similar to those in Figure 1, although their curvatures are different.Once again, the Li and Stephens models stand out.The correct prediction rates are as high as 99.0% and 99.1% with 75 tag SNPs when LS-HOM and LS-HET are used for both SNP tag and SNP prediction, respectively.This rate is 97.3% for the GR-2 model and next it drops to 95.7% and 93.6% for the HMM-4D and GR-1 models.
Figure 2 (b) presents results that help distinguish between the contribution of the tag set and the contribution of the prediction method to the rate of correct prediction: we take LS-HET as our prediction method no matter how the tag set was selected.Thereby the prediction accuracy gives a direct measure of the predictive value of a tag set.Overall, differences among tag sets are much smaller than those reported in Figure 2 (a).With 75 tag SNPs, the rate of correct prediction is 99.1% for the tag sets selected by LS-HOM, GR-2 or HMM-4D.GR-1 tag set comes next with a 99.0% prediction accuracy.
The comparisons between tag sets on the other data sets show similar patterns for the relative standings of the rate Information content versus the number of tag SNPs in the ENr112 region from CEU population of correct prediction.However, it is worth mentioning that rates correct prediction achieved on CEU data sets are higher than those obtained on YRI data sets with the same number of tag SNPs.For instance, on average over all ENCODE regions, 75 LS-HET tag SNPs allow a prediction accuracy of 99.0% on CEU ENCODE regions but only 98.2% on YRI ENCODE regions where about 140 SNPs are needed to achieved an average of 99.0% of correct predictions.As expected, the fraction of SNPs that need to be retained in the tag set to obtain a desired level of correct prediction depend also strongly on the density of SNPs.
On the chromosome 7 data set, where the average spacing between SNPs is about 6 times higher than in ENCODE regions, about 11000 SNPs are needed to obtain a 99.0% correct prediction rate on the CEU data set (17000 SNPs on the YRI data set).

Comparison with other methods
We compare our best tag SNP selection method based on the LS-HET model with two methods previously described in the literature: the block-based dynamic algorithm implemented in the HapBlock software [13] and the Carlson et al. method [8]. Figure 3 presents results obtained on the different regions with different sets of user-defined parameters.
Figure 3 (a) reports the comparison with the HapBlock software in terms of rate of false prediction.We see that HapBlock tag sets are better than our tag sets in only 5 cases out of 111 (4.5%, p-value ≈ le -25 by the non-parametric sign-test).On average, the ratio between the false prediction rate of the HapBlock tag sets and that of our tag sets is 1.36 for the ten ENCODE regions and 1.27 for Chromosome 7. In addition to the better predictive power, the information content as measured by LS-HET was always higher for our tag sets than for the tag sets selected using either the HapBlock or the Carlson et al. method (data not shown).

Genotyping versus haplotyping
The methods described above assume known haplotype information.However, most current typing platforms directly measure genotypes but not haplotypes.This means that the genotype at each typed SNP is experimentally known but not the phasing between alleles at different loci.Optimizing the tag set in the genotyping context may have to take into account the possibility that some SNPs are more informative than others for inferring haplotypes at the tag SNP loci.Such an optimization is not in the scope of this paper and would be computationally challenging if we do not want to sacrifice the necessary sophistication of a good haplotype model.Instead, we propose to assess the loss of predictive power due to phase uncertainty at the tag SNP loci by comparing the ability of our tag set to predict tagged SNP genotypes from either haplotypes or genotypes at the tag SNP loci.
The task of genotype modelling does not fundamentally differ from haplotype modelling as a genotype is simply a pair of haplotype whose phase is unknown (assuming Hardy-Weinberg equilibrium).Therefore, the best genotype model is just a modified version of the best haplotype model.Naturally we use the LS-HET model in this context: the LS-HET model generalizes rather easily to genotype sequences (see [32]); prediction is performed after computing the posterior probability of each genotype at each SNP with the forward-backward algorithm for HMMs. Figure 4 shows the ability of our LS-HET tag sets to predict the genotypes when haplotypes at the tag SNPs loci are known (phase known) and when only genotypes are known (phase unknown).It appears that little predictive power is lost due to phase uncertainty as soon as the tag sets are reasonably large.For instance, on ENCODE regions from the CEU population, the genotype is predicted correctly at 98.0% of the positions with 75 SNPs when haplotypes are given and 97.9% when only genotypes are known (Figure 4 (a)).On chromosome 7 from CEU population, the rate of correct prediction decreases from 98.0% to 97.8% at 11000 tag SNPs (Figure 4 (b)).

Discussion
It has been proposed to use Shannon entropy to assess LD [18] and to select tag SNPs [20] without relying on explicit models of haplotype sequences.The model-free approach relies on direct estimation of the entropy from the empirical haplotype frequencies and is not scalable to large set of SNPs.When SNP sets are large enough any haplotype is observed at most only once which causes the empirical estimate of Shannon entropy to plateau to log Q, with Q being the number of sequences sampled.In comparison, explicit modelling of the haplotypes makes it possible to estimate Shannon entropy whatever the size of the SNP set.In this context, appropriate models can bypass the concept of blocks and still account for simultaneous correlations between multiple markers.
The Li and Stephens models are becoming widely used in contexts including recombination rate estimation [22,33,34] and haplotype phase reconstruction [32].This study is the first that reports their use in the context of tag SNP selection and it confirms their strengths.They outperform the other models in several aspects: description code-length of data, informativeness of tag SNPs, and prediction of tagged SNPs.Furthermore, we show that the heterogeneous version of the model, accounting for fine scale variations in the recombination rate, is actually better than the homogeneous version.This was not a priori obvious because the heterogeneous model has many more parameters than the homogeneous model and their inference could be unreliable.It should be noted that these parameters are closely related to the recombination rates [22] which are known to be hard to estimate [35].However, we do not see a large difference between the heterogeneous and the homogeneous Li and Stephens model for the practical purpose of tag SNP selection.This may suggest that inference of the fine scale variations in the recombination rate, and in particular of the location and Assessment of the loss in prediction accuracy due to tag set phase uncertainty using the LS-HET model Figure 4 Assessment of the loss in prediction accuracy due to tag set phase uncertainty using the LS-HET model.Black lines stand for genotype predictions from tag set genotypes and gray lines for predictions from tag set haplotypes.Both curves are shown for the CEU and YRI populations.the intensity of the recombination hot-spots [33,36], is not very important in the context of tag SNP selection.The relatively good performances of our models of contextsize 1 or 2 might encourage more work on related models such as graphical models for SNP data [37].
In this study we consider the problem of haplotype tag SNP selection assuming that haplotypes are available.In reality, the data are usually just genotype.Genotype trios give much more information on the haplotypes but they necessitate more genotyping and are impractical in most large case-control studies because of difficulties to find suitable trios.As in most of the works on haplotype tag SNP selection, the problem can be circumvented by using haplotypes inferred from genotype data.Examples of methods for inferring haplotypes from genotype data, or phase reconstruction, include those proposed by Excoffier and Slatkin [38], Niu et al. [39] and Stephens and Scheet [32].However, these methods are not error-free and this may cause a problem at two levels.First, the choice of the tag SNPs relies on a set of haplotypes and could be sensitive to phase errors in these haplotypes.We do not believe this to be a very important concern as (1) a high density of SNP will typically be assessed and thus will permit accurate haplotype reconstruction and (2) trio genotyping is possible if haplotype reconstruction actually appear to be a problem.Second, the set of tag SNPs may not be optimal for phase reconstruction.Our results rule out this concern by showing that the impact of phase uncertainty on the ability to predict tagged SNPs from a good tag set is marginal.Even a very hypothetical tag set that could simultaneously remove any phase uncertainty and preserve the maximal amount of information on the tagged SNPs would not perform more than a few percent better than our tag sets.In addition, one may argue that if the tag set were genotyped in many individuals, these genotypes may be used to further reduce the phase uncertainty at the tag SNP loci.
Our results show that tagged SNPs can be predicted with a very high accuracy using a good tag set and a good prediction method.For instance, 1000 SNPs mapping a 500 Kbp long ENCODE regions from CEU population can be predicted with an average error rate of 1% using 75 tag SNPs.Rough computations suggest that if the 3 Gbp of whole genome was mapped at the same SNP density (6,000,000 SNPs), 500,000 tag SNPs would be enough to predict the other SNPs with a 1% average error rate.Due to the more complex haplotype structure of African populations [11,40], achieving the same rate for YRI population could require about twice as many tag SNPs.It would be interesting to investigate whether or not one should use these predictions in association studies.The answer may not be evident.On one hand, working with a smaller set of SNPs reduces the dimension of the analysis and thus it could be advantageous to restrict the analysis to the tag SNPs.On the other hand, tag SNP selection relies on knowledge about the haplotype structure that is latter lost if predictions are not used.Moreover, when the association study rely on single marker analyses where each SNP is tested separately, then it is probably worth to first predict tagged SNP to increase the probability of finding a single marker heavily linked to the causal SNP.Throughout this study we select tag sets by maximizing entropy given a model.The use of this quantity is motivated by strong arguments from information theory.If we compare how entropy and global prediction accuracy measure the power of tag sets, results suggest that entropy is a more sensitive quantity than prediction accuracy.For instance, the comparison between the average rate of correct prediction achieved with a tag set selected using the best model (LS-HET) and relatively simple models (GR-2, and even GR-1 or HMM-4D) show quite similar performances for each tag sets whereas the superiority of the Li and Stephens model is clearly established both in terms of code-lengths and predictive ability.We also see that the pace at which entropy increases with the number of tag SNPs does not slow down as quickly as the rate of correct prediction.We note that both measures are adopted in the popular machine learning tool of classification and regression tree (CART) [41].Namely, entropy is used to generate more partition nodes while prediction (classification) accuracy is used to prune decision trees.An important issue in association studies is the choice of the number of tag SNPs.It may be a good idea to adopt the strategy used in the CART methodology for tag SNP selection.That is, after having pre-selected a tag set (possibly larger than needed) on the basis of entropy, we could determine the final number of tag SNPs by shrinking the original set on the basis of the prediction accuracy.In this context, it is worth mentioning that tag SNPs selected from a given collection may better predict SNPs from this collection than other SNPs [18].Therefore, it could be sound in future works to set aside a fraction of the SNPs during the tag SNP selection process that could latter be used in order to choose the final number of tag SNPs on the basis of the prediction accuracy.
Finally, we were able to provide evidence of the advantage of our best tag SNP selection method over both the methods implemented in the HapBlock software and the method proposed by Carlson et al. [8].We did that by comparing the performances of tag sets on their abilities to predict the tagged SNPs (as well as on the basis of their information content).These comparisons are meaningful because they rely on a good probabilistic model.In the future, this scheme that allows direct comparison of the intrinsic performances of tagging sets should be useful to further clarify the issues associated to the choice of tag SNPs.

Conclusion
Our study provides strong evidence that the tag sets selected by our best method, based on Li and Stephens model, outperform those chosen by several existing methods.The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs.Besides, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs.This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes.

Availability and requirements
A program that implements tag SNP selection, entropy computation, and tagged SNP prediction based on Li and Stephens HMMs is freely available under the terms of the GNU Public Licence at http://www-rcf.usc.edu/~lilei/tagsnp.html and is also attached to this publication [see Additional file 1].The programs implementing the other tag SNP selection methods described in this paper will be made available upon request to the authors.Requirements of the software: source code is in C++ language and was compiled on i586 Linux platforms; numerical maximization relies on routines of the GNU Scientific Library http://www.gnu.org/software/gsl/ that needs to be installed to compile the program.

Markov model
We start off with a simple Markov model for haplotype sequences.This model includes two parameters for each pair of adjacent loci: (1,1).We take the maximum likelihood estimate of the parameters given the observed sequences (estimates are smoothed by adding a pseudo-count 0.1 to all counts).
In a Markov model, we found that the subsets of markers with the highest entropy can be determined by a dynamic programming algorithm.Let J*(k, i) denote the optimal subset of k markers that includes the i th marker and k -1 other markers chosen among {1, 2,...,i -1}.The Markov property allows a recursive computation of J*(k, i) and its associated entropy using the following algorithm.
The time complexity of the algorithm to find J*(m) is proportional to n 2 × m and its memory requirement is proportional to n × m.

Hidden Markov models
Despite the existence of a relatively fast algorithm that finds optimal tag sets with respect to maximum entropy, the above Markov model suffers from being too simple to sufficiently describe the complicated LD structure.HMMs are interesting alternatives to Markov models.A HMM consists of two layers.In the context of haplotype modelling, the hidden layer models the decay of LD along the genome due to recombination through a Markov chain S with transition parameters where u and v take values in a hidden state space .In the observation layer, the sequence X is "emitted" given the underlying haplotype "backbone" S according to the emission parameters b The maximum likelihood estimates of the parameters can be obtained by the classical Expectation-Maximization (EM) algorithm [42][43][44].Some HMMs for haplotype sequences have already been introduced for purposes other than marker selection.We will compare results obtained with two HMMs: the model of Daly et al. [10] and the models of Li and Stephens [22].For the sake of completeness, we also include a simple HMM embedding the Markov model described in the previous section.

Efficient bottom-up selection of tag SNPs in HMM
We adopt a simple bottom-up strategy to select subsets of markers.Of course, this cannot ensure an optimal solution.Starting with an empty set of markers, we add markers one by one in such a way that the gain of information content is maximized at each step.Let J denote the current set of markers.We want to add the marker i* that maximizes H(X J∪{i} ).This marker i* also maximizes H In the HMM framework, exact computation of the conditional entropy H(X i | X J ) is not tractable but we will now describe an efficient way to approximate it.Using a sample (x (1) ,..., x (K) ) of K sequences simulated from the HMM, we can approximate The central limit theorem states that the standard deviation of the error decreases asymptotically at a rate of order 1/ .The term ), where ‫(ސ‬S i = u | X J = ) can be obtained for all i by a single pass of the forward-backward algorithm [44] (time complexity proportional to n).In principle, ‫(ސ‬S i = u | X J = X J ) should be recomputed for all i after each addition of a SNP to J. The whole bottomup selection algorithm would then be of complexity n 2 .However, it is clear that adding a SNP i* somewhere does not change the informativeness of SNPs sufficiently far away.Consequently, it is unnecessary to update ) for all i.The distance from which the terms do not need to be recomputed depends on a complex balance between factors such as the level of LD around i*, the set J, the particular haplotype x (k) and the level of approximation we accept.It is remarkable that the terms ‫(ސ‬S i = u | X J = ) can be computed in the order they appear on each side of i* (this implies to keep forward and backward terms from the previous iteration of the bottom-up algorithm).Thereby, at each step of the algorithm, we can automatically restrict the computation to the relevant portion of each sequence x (k) .This decreases considerably the time complexity of the bottom-up algorithm for large sequences.In our applications, we take K = 500 and we update ‫(ސ‬S i = u | X J = ) only for those i that are close enough from i* such that max u |‫(ސ‬S i = u|X j' = ) -‫(ސ‬S i = u|X J = )| > 0.001, where J' = {J ∪ i*}.

From unconstrained HMMs to Li and Stephens' models
The first HMM we consider is a two-state heterogeneous HMM, which is the most general with two hidden states Namely, all the transitions associated with a change of backbone are modeled as having the same probability.This model includes more haplotype backbones and is more meaningful than the unconstrained two-state model.However, the choice of the number of hidden states is mainly arbitrary, even though four may be adequate to the particular region studied by Daly et al. [10].Besides, at each position the four haplotype backbones are modeled as having the same marginal probability.Li and Stephens [22] introduced an attractive generalization of Daly's model that bypasses the choice of the number of hidden states.Given Q previously observed sequences x = (x (1) , x (2) ,..., x (Q) ), their HMM models an additional sequence as a mosaic of segments sampled from these Q sequences.In this model, the hidden variable S i corresponds to the template sequence at locus i.Although the number of hidden states is as large as Q, a single parameter α is introduced to account for the recombination intensity: where d i denotes the physical distance between loci i and i + 1. Emission probabilities model possible differences (point mutations) at each position i between the new sequence and its local template sequence u using a single parameter β = ‫(ސ‬X i ≠ |S i = u) (a slightly different parametrization is found in the original paper).
A version of this HMM accounting for recombination hotspots and other recombination rates heterogeneities has also been proposed [22].Instead of a constant α, different α i 's are allowed along the genome.Following Li and Stephens, we impose a prior distribution for the α i 's by setting a i = γλ i , where γ does not depend on i and log 10 (λ i )~(0,0.5 2 ).This choice puts the α i 's values in the neighborhood of a baseline level γ, allowing the recombination intensity to vary up and down by one order of magnitude.Li and Stephens [22] were mainly interested in estimating α or the α i 's and in how the estimates relate to the recombination parameter ρ of the coalescent model.This motivation lead them to propose two computationally intensive estimation procedures referred as PAC-A and PAC-B in their paper.Here the aim is to fit


a model for the occurrence of an additional sequence and thus it makes more sense to maximize , where θ = (α,β) and x (-q) denotes all the Q observed sequences but q.This maximization is related to the pseudo-likelihood methods introduced by Besag [45].We developed an EM-like algorithm to obtain the value of θ that maximizes this quantity.The k th iteration of our algorithm consists in getting θ (k) = (α (k) , β (k) ) that maximizes as where Q' = Q -1 and {x i ≠ x u } = 1 when x i ≠ x u and {x i ≠ x u } = 0 elsewhere.The terms (S i = u|x (q) and ) are obtained with the forward-backward algorithm for HMMs and α (k) is found by numerical maximization.Similarly, in the heterogeneous version of the model, the EM-like algorithm is used to maximize , where ∏ i π(λ i ) is the density of the log-normal prior for the λ i 's.
At the k th iteration of the algorithm, (γ (k) , ,..., ) are choose to maximize and the updating formula for β is the same as in the homogeneous case.For our chromosome scale data-set, we allow the baseline intensity α or γ to change every 100 SNPs (approximately 500 Kbp).

Tag SNP selection through greedy entropy maximization
As an exploratory effort, we drop the linear structure embedded in Markov models and HMMs while sticking to the principle of maximum entropy.According to the bottom-up selection strategy, when adding one marker i* to a subset J, this marker should maximizes H(X i | X J , where stands for empirically estimated probabilities.As a result, a model whose density factorizes according to a Directed Acyclic Graph (DAG) is built along the way for those SNPs in the tag set.We smooth the estimates of the conditional probabilities by adding 0.1 to all counts.Later on, we refer to these models as the "greedy" models with context-size 1 and context-size 2.

Other tag SNP selection methods used for comparison
For comparison, we use two SNP tag procedures described in the literature.Both methods select subsets whose sizes depend simultaneously on the thresholds chosen by the user and on the particular set of sequences.
The first method is the block-based dynamic programming algorithm implemented in the HapBlock software [12,13].The user needs to choose among three criteria to define blocks and five criteria to select the tag SNPs within the blocks.In keeping with [9] and [12], we use the common haplotype criterion to define potential haplotype blocks.Two parameters α HB and β HB play a role in this definition: any region where common haplotypes represent at least a fraction α HB of the sequences can be considered as a block; a common haplotype is defined as a haplotype sequence accounting for more than a fraction β HB of the observed sequences.In keeping with [4,12,15], we also use the diversity criterion to choose the tag SNPs: within each block, the smallest subset of SNPs that can distinguish more than a proportion α HB of the sequences is selected (note that here we use the same threshold as in the block definition).In parallel, we tested another tag SNP selection criterion based on the empirical entropy of the block: the smallest subset of SNPs that can account for a fraction α HB of the original entropy is selected [13].
The second method comes from a recent work by Carlson et al. [8].The method is based on the r 2 pairwise measure of LD between two loci defined by r 2 = (p AB -p A p B ) 2 / (p A p a p B p b ), where p A = 1 -p a and p B = -p b stand for the alleles frequencies at each locus and p AB denotes the joint allele frequency.A greedy algorithm searches for a tag set such that any SNP not in the subset with a minor allele frequency higher than 0.1 has an empirical r 2 measure with a tag SNP higher than a chosen threshold.
It has been suggested that any set of SNPs approximately evenly spaced along the sequence is a good tag set [46].To check this statement we use a simple procedure to build-P θ , ( ) ( ) ( ) , min ( , ) , p p p up sets of increasing size whose SNPs are approximately evenly spaced.We start with a single SNP which is the closest to the middle point of the sequence.One SNP at a time, we increase the size of the tag set by (1) finding the longest interval that does not contain an already selected SNP, but does contain at least one unselected SNP; (2) adding the SNP that is the closest to the middle point of this interval.

Data sets and cross-validation scheme
We assess the performances of the tag SNP selection procedures on several data sets that differ by the number and density of markers as well as by the population sampled.
The data come from the genotype trios available from the international HapMap and ENCODE projects (final phase I data release, June 2005).A genotype trio is made of the genotype of one child and those of its two parents.Trio data make it possible to infer haplotypes by simple Mendelian genetic rules at most genotyped positions (assuming no recombination in the last generation).We use the hap2 program [47] to infer haplotypes at unsolved positions and missing data.This program combines trio information with population information obtained from the entire sample.The phased data sets are available for download [see Additional file 2].
Our genome-wide data consists of two sets of 120 haplotypes: one is sampled from a Yoruba population in Nigeria (YRI data sets), the other comes from a Utah population of European ancestry (CEU data sets).In each population, 30 trios were genotyped at approximately 1,100,000 SNPs roughly evenly spaced along the genome.The 120 independent haplotypes are obtained after discarding children's haplotypes.The average spacing between adjacent SNPs in this genome-wide data set is around 3 kbp.This density is probably not enough to capture the detailed pattern of polymorphism.
The ENCODE project assesses a much higher density of SNPs in the same individuals for ten 500 kbp regions (the ENCODE regions).These ten regions were chosen to represent a wide variety of LD levels.In the final phase I release, the density is about one SNP per 500 bp in most ENCODE regions.

(
a) : Population. (b) : Number of actually polymorphic SNPs in the population considered. (c-h) : HMM-2 for the unconstrained two-state HMM; HMM-4D for the Daly et al.HMM; LS-HOM and LS-HET for the homogeneous and the heterogeneous Li and Stephens models; GR-1 and GR-2 for the "greedy" models with context sizes 1 and 2.

Figure 1
Information content versus the number of tag SNPs in the ENr112 region from CEU population.(a) The information content is computed according to the model used to select the tag SNPs.Top-down: Markov model (dotted black line), unconstrained and Daly et al.HMMs (dot-dashed gray and black lines, respectively), models with context sizes 1 and 2 (dashed gray and black lines, resp.), homogeneous and heterogeneous Li and Stephens HMMs (solid gray and black lines, resp.).(b) The information content is always computed according the heterogeneous Li and Stephens model, regardless of the tag SNP selection method.The symbols for the tag SNP selection methods are the same as in (a).Additional symbols show the information content of evenly spaced tag SNPs (black circles), and randomly chosen tag SNPs (gray circles).bits) number of tagging SNPs information content (bits) number of tag SNPs (b) information content as measured (a) information content as measured by the model used for tagging by the heterogeneous Li & Stephens model

Figure 3 ( 2
Figure 3 (b) reports the comparison with Carlson et al. method.The tag sets selected by the Carlson et al. method are better than our tag sets in only 14 cases out of 88 (16%, p-value ≈ le -10 by the non-parametric sign-test).The average ratios between the false prediction rate of the Carlson et al. tag sets and that of our tag sets is 1.13 for the 10 ENCODE regions and 1.26 for Chromosome 7.

Figure 3
Comparison with other tag SNP selection algorithms.Performances of tag sets selected with (a) the HapBlock program, or (b) the Carlson method are compared with the performances tag sets of the same sizes selected with the LS-HET model.The horizontal axis reports the false prediction rate associated with the tag sets selected by the HapBlock program or the Carlson method.The vertical axis reports the ratio of this false prediction rate over the false prediction rate associated with the tag set of the same size selected with the LS-HET model.All the predictions are performed by the LS-HET model and the false prediction rates are averaged over the six cross-validation experiments.Each point corresponds to the combination of one data set and one set of user-defined parameters.Large symbols emphasize results obtained on Chromosome 7; circles and rectangles stand for results on CEU and YRI data sets, respectively.(a) For the HapBlock software, (α HB , β HB ) are: (0.8, 0.099), (0.9, 0.099), (0.9, 0.049) and (0.95, 0.049) with the diversity criterion for tag SNP selection (open symbols); and (0.8, 0.099) and (0.9, 0.099) with the entropy criterion for tag SNP selection (close symbols).(b) The r 2 thresholds for the Carlson method are: 0.35, 0.5, 0.65 and 0.80.false prediction rate ratio of false prediction rates (a) HapBlock program (b) Carlson method

| | = 2 .
For each SNP site, there are | | × (| | -1) transition parameters and | | emission parameters in an unconstrained HMM.With a practical sample size, we can only deal with unconstrained HMMs of a few hidden states.In order to use HMMs with large number of hidden states, we need to reduce the number of free parameters by imposing constraints on transition and emission probabilities.Daly et al. [10] introduced a constrained HMM with four hidden states | | = 4 to model SNP sequences.As unconstrained models, this model has | | emission parameters per SNP site but transition probabilities are highly constrained: instead of | | × (| | -1) transition parameters per SNP site, a single parameter is used.

Table 1 : Model comparison using code-length, or average negative cross-log-likelihood. Code lengths are given in bits. For Chromosome 7 data sets, SNP loci were sub-sampled in ENCODE regions to maintain an uniform coverage of the chromosome.
SNPs already selected.Namely, we compute min j∈J H(X i | ).Without a specific model such as a Markov or a hidden Markov model, it is impossible to compute H(X i | X J ).Instead we replace H(X i | X J ) by the entropy of X i given the one or two SNPs that are the most informative relative to i among the