Selecting additional tag SNPs for tolerating missing data in genotyping
- Yao-Ting Huang^{1},
- Kui Zhang^{3},
- Ting Chen^{4} and
- Kun-Mao Chao^{1, 2}
https://doi.org/10.1186/1471-2105-6-263
© Huang et al; licensee BioMed Central Ltd. 2005
Received: 26 May 2005
Accepted: 01 November 2005
Published: 01 November 2005
Abstract
Background
Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a block-like structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data.
Results
We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NP-hard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still cost-effective.
Conclusion
Genotyping robust tag SNPs is more practical than genotyping only the minimum set of tag SNPs when missing data cannot be avoided. Our theoretical analysis and experimental results show that our algorithms are efficient and that the solutions they find are close to the optimal solution.
Background
In reality, a SNP may not be genotyped and considered to be missing data (i.e., we fail to obtain the allele configuration of the SNP) if it does not pass the threshold of data quality [13, 16, 19, 20]. These missing data may cause ambiguity when using the minimum set of tag SNPs to distinguish an unknown haplotype sample. Figure 1 illustrates the influence of missing data when identifying haplotype samples. In this figure, a haplotype block (see Figure 1 (A)) defined by 12 SNPs and 4 haplotype patterns is presented (from the public haplotype data of human Chromosome 21 [13]). We follow the same assumption as previous studies that all SNPs are diallelic (i.e., taking on only two values) [1, 13]. Suppose we select SNPs S_{1} and S_{12} as tag SNPs. The haplotype sample h_{1} is identified as haplotype pattern P_{3} unambiguously (see Figure 1 (B)). Consider haplotype samples h_{2} and h_{3} with one missing tag SNP (see Figure 1 (C)). h_{2} can be identified as haplotype patterns P_{2} or P_{3}, and h_{3} can be identified as P_{1} or P_{3}. As a result, these missing tag SNPs result in ambiguity when distinguishing unknown haplotype samples.
Although we cannot avoid the occurrence of missing data, the remaining SNPs within the haplotype block may provide abundant information to resolve the ambiguity. For example, if we re-genotype an additional SNP S_{5} for h_{2} (see Figure 1 (D)), h_{2} is identified as haplotype pattern P_{3} unambiguously. On the other hand, if SNP S_{8} is re-genotyped (see Figure 1 (E)), h_{3} is also identified unambiguously. These additional SNPs are referred to as "auxiliary tag SNPs," which can be found from the remaining SNPs in the block and are able to resolve the ambiguity caused by missing data.
This paper focuses on the problem of finding robust tag SNPs that tolerate missing data. Throughout this paper, we denote by m the maximum number of missing SNPs to be tolerated, which corresponds to different missing rates in different genotyping experiments. We wish to find a minimum set of robust tag SNPs which can distinguish each pair of haplotypes even when up to m SNPs are missing. We assume that the haplotype phases and block partition are available as the input. Numerous methods have been developed to infer haplotypes from genotype data [12, 14, 15]. Several algorithms have also been proposed to find the block partition [4, 13, 17]. The problem of finding minimum robust tag SNPs is shown to be NP-hard (see Theorem 1). To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming (LP) relaxation algorithm. The proposed algorithms have been implemented and tested on a variety of simulated and empirical data. We also analyze the efficiency and solutions of these algorithms. An algorithm for finding auxiliary tag SNPs is described assuming robust tag SNPs have been computed in advance.
Results
We have implemented the first and second greedy algorithms in JAVA [see Additional files 1 and 2]. The LP-relaxation algorithm has been implemented in Perl [see Additional file 3], where the LP problem is solved via a program called "lp_solve" [11]. The LP-relaxation algorithm is a randomized method. Thus, this program is run 10 times to explore different solutions, and the best solution among them is chosen as the output.
Results on simulated data
We then generate 100 data sets containing long haplotypes. Each data set is composed of 10 haplotypes with 40 SNPs. Figure 3 (b) illustrates the experimental results on these long data sets (corresponding to SNP missing rates from 0% to 37%). The optimal solutions for m > 2 cannot be found by the OPT program within a reasonable period of time (after one week of computation) and are not shown in this figure. This is because the candidate solutions in long data sets are too numerous to enumerate. On the other hand, both the greedy and the iterative LP-relaxation algorithms run in polynomial time and always output a solution efficiently. In this experiment, both greedy algorithms slightly outperform the iterative LP-relaxation algorithm. In addition, the number of missing SNPs allowed is larger than in the short data sets. For example, to tolerate 10 missing SNPs (i.e., m = 10), all programs output fewer than 28 SNPs. The remaining SNPs in each data set are still sufficient to tolerate more missing SNPs.
Figure 4 (b) illustrates the experimental results on long data sets generated by Hudson's program (corresponding to SNP missing rates from 0% to 29%). The optimal solutions for m > 2 again cannot be found by the OPT program within a reasonable period of time. In this experiment, the first greedy and iterative LP-relaxation algorithms perform similarly, and they slightly outperform the second greedy algorithm as m becomes large.
Results on real data
We also test these programs on two real data sets: (1) public haplotype data of human Chromosome 21 released by Patil et al. [13]; and (2) a 500 KB region on human Chromosome 5q31, studied by Daly et al. [4], which may contain a genetic variant related to Crohn's disease. Patil's data include 20 haplotypes of 24,047 SNPs spanning about 32.4 MB, which are partitioned into 4,135 haplotype blocks. By genotyping 103 SNPs with minor allele frequency of at least 5%, Daly et al. partition the 500 KB region into 11 haplotype blocks. The haplotype blocks in these real data sets contain different numbers of SNPs and haplotypes (e.g., from several SNPs to hundreds of SNPs). When m increases, some short blocks may not contain enough SNPs for tolerating missing data (e.g., m > the number of SNPs in a block). As a consequence, S_{a} here stands for the average number of robust tag SNPs over those blocks that still contain solutions.
Table 1. The detailed results of the first greedy algorithm on Daly's 11 blocks. An entry of f marks a block that has no solution for that value of m.
Block ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | S _{ a } |
---|---|---|---|---|---|---|---|---|---|---|---|---|
m = 0 | 1 | 1 | 2 | 3 | 3 | 2 | 3 | 2 | 2 | 2 | 2 | 23/11 = 2.09 |
m = 1 | 2 | 2 | f | 5 | f | 3 | 5 | 4 | f | 3 | 3 | 27/8 = 3.375 |
m = 2 | 3 | 3 | f | 8 | f | f | f | f | f | 5 | f | 19/4 = 4.75 |
m = 3 | 4 | 4 | f | f | f | f | f | f | f | f | f | 8/2 = 4 |
m = 4 | 5 | 5 | f | f | f | f | f | f | f | f | f | 10/2 = 5 |
m = 5 | 6 | f | f | f | f | f | f | f | f | f | f | 6/1 = 6 |
m = 6 | 7 | f | f | f | f | f | f | f | f | f | f | 7/1 = 7 |
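The S_{a} column can be recomputed from the rows above. The following is a minimal sketch, assuming (as the surrounding text suggests) that an entry of f marks a block with no solution for that m; the names `table` and `average_over_solved_blocks` are illustrative and do not come from the authors' software.

```python
# Rows of the table for Daly's 11 blocks. None stands for an "f" entry,
# which is assumed here to mean that the block has no solution for that m.
F = None
table = {
    0: [1, 1, 2, 3, 3, 2, 3, 2, 2, 2, 2],
    1: [2, 2, F, 5, F, 3, 5, 4, F, 3, 3],
    2: [3, 3, F, 8, F, F, F, F, F, 5, F],
    3: [4, 4, F, F, F, F, F, F, F, F, F],
    4: [5, 5, F, F, F, F, F, F, F, F, F],
    5: [6, F, F, F, F, F, F, F, F, F, F],
    6: [7, F, F, F, F, F, F, F, F, F, F],
}


def average_over_solved_blocks(row):
    """Return (sum, number of solved blocks, average) over the non-f entries."""
    solved = [v for v in row if v is not None]
    return sum(solved), len(solved), sum(solved) / len(solved)


for m, row in table.items():
    total, blocks, s_a = average_over_solved_blocks(row)
    print(f"m = {m}: {total}/{blocks} = {s_a:.3f}")
```

Because the average is taken only over blocks that still have a solution, S_{a} is not monotone in m: it drops from 4.75 at m = 2 to 4 at m = 3 once the block needing 8 SNPs no longer has a solution.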
Discussion
In terms of efficiency, the first and second greedy algorithms are faster than the LP-relaxation algorithm. The greedy algorithms usually return a solution in seconds, whereas the LP-relaxation algorithm requires about half a minute. This is because the running time of the LP-relaxation algorithm is dominated by the time needed to solve the LP problem, and because the algorithm is run 10 times to explore 10 different solutions. The OPT program, which searches for the optimal solution, is much slower than the others, and usually cannot find the optimal solution within a reasonable period of time when the block is large. From our empirical study, the OPT program finds the optimal solution in reasonable time if the block contains fewer than 20 SNPs (e.g., the short random data sets). For large data sets with more than 40 SNPs, however, the OPT program is significantly outperformed by the approximation algorithms (e.g., it fails to output a solution within one week of computation).
Table 2. The total number of tag SNPs found by each algorithm. The percentage of tag SNPs with respect to total SNPs is shown in parentheses.
Data set | Random data (short) | Random data (long) | Hudson's data (short) | Hudson's data (long) | Patil's data | Daly's data |
---|---|---|---|---|---|---|
Total blocks | 100 | 100 | 100 | 100 | 4135 | 11 |
Total SNPs | 2000 | 4000 | 2000 | 4000 | 24047 | 103 |
1^{st} Greedy | 400 (20%) | 400 (10%) | 509 (25.5%) | 472 (11.8%) | 4610 (19.2%) | 23 (22.3%) |
2^{nd} Greedy | 400 (20%) | 400 (10%) | 509 (25.5%) | 472 (11.8%) | 4610 (19.2%) | 23 (22.3%) |
LP-relaxation | 400 (20%) | 400 (10%) | 509 (25.5%) | 471 (11.8%) | 4657 (19.4%) | 23 (22.3%) |
OPT | 400 (20%) | 400 (10%) | 492 (24.6%) | 443 (11.1%) | 4595 (19.1%) | 23 (22.3%) |
We then analyze the genotyping cost that can be saved by using tag SNPs. In Table 2, the percentage of tag SNPs in each data set is shown in parentheses. The experimental results indicate that genotyping only the tag SNPs costs significantly less than genotyping all SNPs in a block. For example, in Patil's data, we only need to genotype about 19% of the SNPs (the tag SNPs), which saves about 81% of the genotyping cost.
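The arithmetic behind these percentages is simple enough to spell out. The sketch below recomputes the tag SNP fraction and the corresponding saving from the counts reported for the first greedy algorithm; it assumes that genotyping cost is proportional to the number of SNPs genotyped, and the names `first_greedy_counts` and `tag_fraction` are illustrative only.

```python
# (tag SNPs, total SNPs) for the first greedy algorithm on the two real
# data sets, copied from the table of total tag SNPs.
first_greedy_counts = {
    "Patil": (4610, 24047),
    "Daly": (23, 103),
}


def tag_fraction(tags, total):
    """Fraction of SNPs that must be genotyped when only tag SNPs are used."""
    return tags / total


for name, (tags, total) in first_greedy_counts.items():
    frac = tag_fraction(tags, total)
    # Assumption: cost is proportional to the number of genotyped SNPs.
    print(f"{name}: genotype {frac:.1%} of SNPs, save {1 - frac:.1%}")
```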
Table 3. The tradeoffs between additional tag SNPs required and maximum missing rates allowed. These results come from the first greedy algorithm applied to the random and Hudson's data sets with 40 SNPs.
Data set | Measure | m = 0 | m = 1 | m = 2 | m = 3 | m = 4 | m = 5 |
---|---|---|---|---|---|---|---|
Random data (40 SNPs) | average number of robust tag SNPs | 4 | 6 | 8.51 | 10.47 | 12.89 | 14.92 |
Random data (40 SNPs) | corresponding SNP missing rate | 0 | 16.7% | 23.5% | 28.6% | 31.0% | 33.5% |
Random data (40 SNPs) | average number of extra tag SNPs | 0 | 2 | 4.51 | 6.47 | 8.89 | 10.92 |
Hudson's data (40 SNPs) | average number of robust tag SNPs | 4.72 | 7.71 | 11.28 | 14.67 | 18.23 | 21.67 |
Hudson's data (40 SNPs) | corresponding SNP missing rate | 0 | 13.0% | 17.7% | 20.4% | 21.9% | 23.1% |
Hudson's data (40 SNPs) | average number of extra tag SNPs | 0 | 2.99 | 6.56 | 9.95 | 13.51 | 16.95 |
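The derived rows of this table can be checked against the first row of each data set. The sketch below makes two assumptions that are not stated explicitly in the text: the number of extra tag SNPs is the difference from the m = 0 value, and the missing rate is approximately m divided by the average number of robust tag SNPs. The second relation reproduces the reported rates only to within about 0.1 percentage point, so it should be read as an approximation.

```python
# Average number of robust tag SNPs for m = 0..5, copied from the table.
robust = {
    "Random": [4, 6, 8.51, 10.47, 12.89, 14.92],
    "Hudson": [4.72, 7.71, 11.28, 14.67, 18.23, 21.67],
}


def extra_tag_snps(avg_robust):
    """Extra tag SNPs relative to the m = 0 solution (assumed definition)."""
    base = avg_robust[0]
    return [round(v - base, 2) for v in avg_robust]


def approx_missing_rate(avg_robust):
    """Assumed relation: m missing SNPs out of the robust tag SNPs genotyped."""
    return [m / v for m, v in enumerate(avg_robust)]


for name, values in robust.items():
    rates = ", ".join(f"{r:.1%}" for r in approx_missing_rate(values))
    print(name, extra_tag_snps(values), rates)
```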
1. It can be easily done by discarding the rare haplotypes and retaining the common haplotypes as the input of our algorithms. This approach requires no modification to our algorithms, but the retained common haplotypes are all weighted equally.
2. Our algorithms try to find a set of SNPs such that each pair of haplotypes is distinguished by at least (m + 1) SNPs. The simplest way to weight the haplotypes is to choose a different threshold for each pair of haplotypes according to their population frequencies. Haplotype pairs with higher frequency are then assigned more tag SNPs than those with lower frequency.
Conclusion
In this paper, we show that there exists a set of robust tag SNPs which is able to tolerate missing data. Our study indicates that genotyping robust tag SNPs is more practical than genotyping minimum tag SNPs for association studies when missing data cannot be avoided. We describe two greedy algorithms and one LP-relaxation algorithm for finding robust tag SNPs. Our experimental results and theoretical analysis show that these algorithms are efficient and that the solutions they find are close to the optimal solution. In terms of genotyping cost, we observe that the cost saved by using robust tag SNPs is significant, and that genotyping additional tag SNPs to tolerate missing data is still cost-effective. One future direction is to assign weights to different types of SNPs (e.g., SNPs in coding or non-coding regions), and to design algorithms for the selection of weighted tag SNPs.
Software availability
Project name: efficient algorithms for utilizing SNP information.
Project home page: http://www.csie.ntu.edu.tw/~kmchao/tools/Robust_Tag_SNP
Operating system: the implemented greedy algorithms are platform independent, and the implemented LP-relaxation algorithm runs on the Windows operating system.
Programming language: the greedy algorithms are implemented in JAVA, and the LP-relaxation algorithm is implemented in Perl.
Methods
To distinguish a haplotype pattern unambiguously, each pair of patterns must be distinguished by at least one SNP in C'. For example (see Figure 6 (A)), we say patterns P_{1} and P_{2} can be distinguished by SNP S_{2} since M_{ h }[2,1] ≠ M_{ h }[2,2]. A formal definition of this problem is given below.
Problem: Minimum Robust Tag SNPs (MRTS)
Input: An N × K matrix M_{h} and an integer m.
Output: The minimum subset of SNPs C' ⊆ C which satisfies:
- (1) for each pair of patterns P_{i} and P_{j}, there is a SNP S_{k} ∈ C' such that M_{h}[k, i] ≠ M_{h}[k, j];
- (2) when at most m SNPs are discarded from C' arbitrarily, (1) still holds.
Lemma 1. C' ⊆ C is a set of robust tag SNPs which allows at most m missing SNPs iff each node in P has at least (m + 1) edges connecting to nodes in C'.
Proof. Let C' be the set of robust tag SNPs which allows at most m missing SNPs. Suppose patterns P_{ i }and P_{ j }are distinguished by only m SNPs in C' (i.e., (i, j) has only m edges connecting to nodes in C'). However, if these m SNPs are all missing, no other SNPs in C' are able to distinguish patterns P_{ i }and P_{ j }, which is a contradiction. Thus, each pair of patterns must be distinguished by at least (m + 1) SNPs, which implies that each node in P must have at least (m + 1) edges connecting to nodes in C'. The proof of the other direction is similar.
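Lemma 1 turns the robustness requirement into a simple counting test. The following is a minimal sketch of that test, together with a brute-force check of the original definition for comparison. It assumes the block is given as a list of rows `M[k][i]` (SNP k, pattern i, both 0-based); the function names and the small matrix are hypothetical and are not taken from the figures.

```python
from itertools import combinations


def distinguishing_counts(M, selected):
    """For every pattern pair (i, j), count the selected SNPs k with M[k][i] != M[k][j]."""
    n_patterns = len(M[0])
    return {
        (i, j): sum(1 for k in selected if M[k][i] != M[k][j])
        for i, j in combinations(range(n_patterns), 2)
    }


def is_robust(M, selected, m):
    """Lemma 1 condition: each pair is distinguished by at least m + 1 selected SNPs."""
    return all(c >= m + 1 for c in distinguishing_counts(M, selected).values())


def is_robust_by_definition(M, selected, m):
    """Brute force: discard every set of at most m selected SNPs and re-check the m = 0 case."""
    selected = list(selected)
    for r in range(min(m, len(selected)) + 1):
        for dropped in combinations(selected, r):
            kept = [k for k in selected if k not in dropped]
            if not is_robust(M, kept, 0):
                return False
    return True


# Hypothetical block with 4 SNPs (rows) and 3 haplotype patterns (columns).
M = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
print(is_robust(M, [0, 1, 2], 1))                # every pair has 2 distinguishing SNPs
print(is_robust_by_definition(M, [0, 1, 2], 1))  # agrees with the counting test
```

The counting test needs one pass over the pattern pairs, whereas the brute-force check enumerates all subsets of at most m discarded SNPs, which is why the lemma is the more convenient form for the algorithms that follow.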
In the following, we give a lower bound regarding the minimum number of robust tag SNPs required, which is used to skip some solution space by the OPT program.
Lemma 2. Given K haplotype patterns, the minimum number of robust tag SNPs required is at least log_{2} K.
Proof. Recall that the value of a SNP is binary. The number of distinct haplotypes which can be distinguished by N SNPs is at most 2^{N}. As a result, for a given data set containing K haplotype patterns, the minimum number of SNPs required is at least log_{2} K.
The following theorem shows the NP-hardness of the MRTS problem, which implies that no polynomial-time algorithm can find the optimal solution of MRTS unless P = NP.
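Since the number of SNPs is an integer, the bound of Lemma 2 can be rounded up. A small sketch, with the hypothetical helper name `min_tag_snps_lower_bound`, computes the smallest N with 2^{N} ≥ K using integer arithmetic only:

```python
def min_tag_snps_lower_bound(num_patterns):
    """Smallest N with 2**N >= num_patterns, i.e. ceil(log2(K)) for K >= 1."""
    if num_patterns < 1:
        raise ValueError("need at least one haplotype pattern")
    # (K - 1).bit_length() equals ceil(log2(K)) and avoids floating-point error.
    return (num_patterns - 1).bit_length()


for k in (2, 4, 5, 10, 20):
    print(k, min_tag_snps_lower_bound(k))
```

For 10 patterns the bound is 4, which agrees with the average of 4 tag SNPs reported at m = 0 for the random data sets of 10 haplotypes.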
Theorem 1. The MRTS problem is NP-hard.
Proof. When m = 0, MRTS is the same as the original problem of finding minimum tag SNPs, which is known as the minimum test set problem [6, 17]. Since the minimum test set problem is NP-hard and can be reduced to a special case of MRTS, MRTS is NP-hard.
The first greedy algorithm
This greedy algorithm works by covering the grids from the first row to the (m + 1)-th row, and greedily selects a SNP which covers most uncovered grids in the i-th row at each iteration. In other words, while working on the i-th row, a SNP is selected if its reformulated set S' maximizes |S' ∩ R_{ i }|, where R_{ i }is the set of uncovered grids at the i-th row.
Figure 7 illustrates an example for this algorithm to tolerate one missing tag SNP (i.e., m = 1). The SNPs S_{1}, S_{4}, S_{2}, and S_{3} are selected in order. When all grids in this table are covered, each pair of patterns is distinguished by (m + 1) SNPs in the corresponding column. Thus, the SNPs in this table are the robust tag SNPs which can tolerate up to m missing SNPs. The pseudo code of this greedy algorithm is given below.
1   R_{i} ← P, ∀ i ∈ [1, m + 1]
2   C' ← φ
3   for i = 1 to m + 1 do
4       while R_{i} ≠ φ do
5           select and remove a SNP S from C that maximizes |S' ∩ R_{i}|
6           C' ← C' ∪ {S}
7           j ← i
8           while S' ≠ φ and j ≤ m + 1 do
9               S_{tmp} ← S' ∩ R_{j}    // S_{tmp} is a temporary variable holding the result of S' ∩ R_{j}
10              R_{j} ← R_{j} - S_{tmp}
11              S' ← S' - S_{tmp}
12              j ← j + 1
13          endwhile
14      endwhile
15  endfor
16  return C'
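The pseudocode above can be sketched in Python as follows. This is an illustrative translation, not the authors' JAVA implementation: it assumes the block is given as rows `M[k][i]` (SNP k, pattern i, 0-based), takes P to be the set of all pattern pairs and S' to be the pairs a SNP distinguishes, breaks ties by lowest SNP index, and returns None when some pair cannot be covered (m + 1) times. These conventions and the example matrix are assumptions.

```python
from itertools import combinations


def first_greedy(M, m):
    """Row-by-row greedy selection of robust tag SNPs (sketch)."""
    n_snps, n_patterns = len(M), len(M[0])
    pairs = set(combinations(range(n_patterns), 2))              # P
    covers = {k: {(i, j) for (i, j) in pairs if M[k][i] != M[k][j]}
              for k in range(n_snps)}                            # S' for each SNP
    R = [set(pairs) for _ in range(m + 1)]                       # Line 1
    chosen = []                                                  # Line 2 (C')
    remaining = set(range(n_snps))                               # C
    for i in range(m + 1):                                       # Line 3
        while R[i]:                                              # Line 4
            # Line 5: unused SNP covering the most uncovered pairs of row i.
            best = max(sorted(remaining),
                       key=lambda k: len(covers[k] & R[i]), default=None)
            if best is None or not covers[best] & R[i]:
                return None                                      # no feasible solution
            remaining.remove(best)
            chosen.append(best)                                  # Line 6
            s = set(covers[best])
            j = i                                                # Line 7
            while s and j <= m:                                  # Line 8 (rows are 0-based)
                tmp = s & R[j]                                   # Line 9
                R[j] -= tmp                                      # Line 10
                s -= tmp                                         # Line 11
                j += 1                                           # Line 12
    return chosen


# Hypothetical block with 4 SNPs (rows) and 3 haplotype patterns (columns).
M = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
print(first_greedy(M, 0))
print(first_greedy(M, 1))
print(first_greedy(M, 2))  # None: patterns 1 and 2 differ at only two SNPs
```

Each selected SNP covers a given pair in at most one row, so a fully covered table guarantees (m + 1) distinct distinguishing SNPs per pair.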
The time complexity of this algorithm is analyzed as follows. At Line 4, the number of iterations of the intermediate loop is bounded by |R_{ i }| ≤ |P|. Within the loop body (Lines 5–13), Line 5 takes O(|C||P|) because we need to check all SNPs in C and examine the uncovered grids of R_{ i }. The inner loop (Lines 8–13) takes only O(|S'|). Thus, the entire program runs in O(m|C||P|^{2}).
i.e.,
By (1) and (3), we get
The second greedy algorithm
This section describes the second greedy algorithm which returns a solution of better approximation than that of the first greedy algorithm. Let R_{ i }be the set of uncovered grids at the i-th row. Unlike the row-by-row manner of the first greedy algorithm, this algorithm greedily selects a SNP that covers most uncovered grids in the table (i.e., its reformulated set S' maximizing |S' ∩ (R_{1} ∪ ... ∪ R_{m+1})|). Let T be the collection of R_{ i }(i.e., T is the set of all uncovered grids in the table). If the grids in the i-th row are all covered (i.e., R_{ i }= φ), R_{ i }is removed from T. This algorithm runs until T = φ (i.e., all grids in the table are covered).
1   R_{i} ← P, ∀ i ∈ [1, m + 1]
2   T ← {R_{1}, R_{2}, ..., R_{m+1}}
3   C' ← φ
4   while T ≠ φ do
5       select and remove a SNP S from C that maximizes |S' ∩ (R_{1} ∪ ... ∪ R_{m+1})|
6       C' ← C' ∪ {S}
7       for each R_{i} ∈ T and S' ≠ φ do
8           S_{tmp} ← S' ∩ R_{i}    // S_{tmp} is a temporary variable holding the result of S' ∩ R_{i}
9           R_{i} ← R_{i} - S_{tmp}
10          S' ← S' - S_{tmp}
11          if R_{i} = φ then T ← T - {R_{i}}
12      endfor
13  endwhile
14  return C'
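A Python sketch of this second greedy procedure is given below. As with any translation of pseudocode, several details are assumptions: the block is a list of rows `M[k][i]`, P is the set of all pattern pairs, ties are broken by lowest SNP index, T is represented implicitly by the non-empty rows, and None is returned when no remaining SNP covers an uncovered grid.

```python
from itertools import combinations


def second_greedy(M, m):
    """Table-wide greedy selection of robust tag SNPs (sketch)."""
    n_snps, n_patterns = len(M), len(M[0])
    pairs = set(combinations(range(n_patterns), 2))              # P
    covers = {k: {(i, j) for (i, j) in pairs if M[k][i] != M[k][j]}
              for k in range(n_snps)}                            # S' for each SNP
    R = [set(pairs) for _ in range(m + 1)]                       # Lines 1-2
    chosen = []                                                  # Line 3 (C')
    remaining = set(range(n_snps))                               # C
    while any(R):                                                # Line 4: T is not empty
        uncovered = set().union(*R)                              # R_1 ∪ ... ∪ R_{m+1}
        # Line 5: unused SNP covering the most uncovered pairs in the table.
        best = max(sorted(remaining),
                   key=lambda k: len(covers[k] & uncovered), default=None)
        if best is None or not covers[best] & uncovered:
            return None                                          # no feasible solution
        remaining.remove(best)
        chosen.append(best)                                      # Line 6
        s = set(covers[best])
        for row in R:                                            # Line 7
            if not s:
                break
            tmp = s & row                                        # Line 8
            row -= tmp                                           # Line 9 (in place)
            s -= tmp                                             # Line 10
    return chosen


# Hypothetical block with 4 SNPs (rows) and 3 haplotype patterns (columns).
M = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
print(second_greedy(M, 1))
```

On this small block both greedy strategies select the same SNPs; they can differ on larger blocks because the second one scores a SNP against all rows at once.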
The time complexity of this algorithm is analyzed as follows. At Line 4, the number of iterations of the loop is bounded by O(|T|) = O(m|P|). Within the loop, Line 5 takes O(|C||P|) time because we need to check each SNP in C and examine whether it can cover any uncovered grid in each column. The inner loop (Lines 7–12) is bounded by O(|S'|) ≤ O(|P|). Thus, the running time of this program is O(m|C||P|^{2}).
By (4), we have
The iterative LP-relaxation algorithm
In practice, a probabilistic approach is sometimes more useful since the randomization can explore different solutions. In this section, we reformulate the MRTS problem as an Integer Programming (IP) problem. Based on the IP problem, we propose an iterative Linear Programming (LP)-relaxation algorithm. The iterative LP-relaxation algorithm is described below.
Step 1. Given a haplotype block containing N SNPs and K haplotype patterns. Let {x_{1},x_{2}, ...,x_{ N }} be the set of integer variables for the N SNPs, where x_{ k }= 1 if the SNP S_{ k }is selected and x_{ k }= 0 otherwise. Define D(P_{ i }, P_{ j }) as the set of SNPs which are able to distinguish P_{ i }and P_{ j }patterns. By Lemma 1, to allow at most m missing SNPs, each pair of patterns must be distinguished by at least (m + 1) SNPs. Therefore, for each set D(P_{ i }, P_{ j }), at least (m + 1) SNPs have to be selected to distinguish P_{ i }and P_{ j }patterns. As a consequence, the MRTS problem can be formulated as the following IP problem:
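Written out from the definitions in Step 1, the IP problem presumably takes the following form. This is a reconstruction from the surrounding text rather than a quotation, and the label (5) is chosen to match the later references to the inequalities in (5).

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{k=1}^{N} x_k \\
\text{subject to}\quad & \sum_{k \in D(P_i, P_j)} x_k \;\ge\; m + 1
                         \qquad \text{for all } 1 \le i < j \le K, \\
                       & x_k \in \{0, 1\}
                         \qquad \text{for all } 1 \le k \le N.
\end{aligned}
\tag{5}
```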
Step 2. Since solving the IP problem is NP-hard [6], we relax the integer constraint of x_{ k }, and the IP problem becomes an LP problem defined as follows:
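Relaxing only the integrality of x_{k} in the program described in Step 1 gives an LP of the following form. It is again a reconstruction from the text, stated in full so that it reads on its own; the label (6) follows the reference made in Step 3.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{k=1}^{N} x_k \\
\text{subject to}\quad & \sum_{k \in D(P_i, P_j)} x_k \;\ge\; m + 1
                         \qquad \text{for all } 1 \le i < j \le K, \\
                       & 0 \le x_k \le 1
                         \qquad \text{for all } 1 \le k \le N.
\end{aligned}
\tag{6}
```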
The above LP problem can be solved in polynomial time by efficient algorithms such as the interior point method [5].
Step 3. Let {y_{1}, y_{2}, ..., y_{ N }} be the set of linear solutions obtained from (6), where 0 ≤ y_{ k }≤ 1. We construct the corresponding integer solutions {x_{1}, x_{2}, ..., x_{ N }} by the following randomized rounding method:
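From the description of this step (x_{k} is assigned 1 or 0 using the value of y_{k} as the likelihood), the rounding rule can be written as below. The formula is a reconstruction consistent with that description, not a quotation.

```latex
\Pr[x_k = 1] = y_k, \qquad \Pr[x_k = 0] = 1 - y_k, \qquad k = 1, \dots, N.
```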
Note that the constructed integer solutions do not necessarily satisfy all inequalities in (5). The randomized rounding method simply assigns x_{k} to 1 or 0 using the value of y_{k} as the likelihood, regardless of the inequalities in (5).
Step 4. We check whether the integer solutions constructed in Step 3 satisfy all inequalities in (5) or not.
Case 1. If some inequalities in (5) are still unsatisfied, we repeat Steps 1, 2, and 3 only for those unsatisfied inequalities until all of them are satisfied.
Case 2. If all inequalities in (5) are satisfied, we construct a final solution by setting x_{ k }= 1 if x_{ k }is assigned to 1 in any one of the iterations and setting x_{ k }= 0 otherwise.
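Steps 3 and 4 can be sketched without an LP solver if a fractional solution is supplied from outside. The sketch below covers only the rounding, the constraint check, and the merging of rounds; it does not re-solve the LP for the unsatisfied inequalities, which the full algorithm requires. The helper names and the dictionary `D` (mapping a pattern pair to the SNP indices in D(P_{i}, P_{j})) are hypothetical.

```python
import random


def randomized_round(y, rng):
    """Step 3: set x_k = 1 with probability y_k, independently for each SNP."""
    return [1 if rng.random() < y_k else 0 for y_k in y]


def unsatisfied_pairs(x, D, m):
    """Step 4: pairs whose constraint sum_{k in D[pair]} x_k >= m + 1 fails."""
    return [pair for pair, snps in D.items()
            if sum(x[k] for k in snps) < m + 1]


def merge_rounds(rounds):
    """Case 2: x_k = 1 in the final solution if any iteration set it to 1."""
    return [1 if any(col) else 0 for col in zip(*rounds)]


# Hypothetical block: SNP indices that distinguish each pair of 3 patterns.
D = {(0, 1): [1, 2, 3], (0, 2): [0, 2, 3], (1, 2): [0, 1]}

rng = random.Random(0)                       # fixed seed for repeatability
x = randomized_round([1.0, 1.0, 0.5, 0.5], rng)
print(x, unsatisfied_pairs(x, D, 1))
```

Keeping the check separate from the rounding makes it easy to pass only the unsatisfied pairs to the next iteration, as Case 1 requires.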
We now evaluate the solution returned by the iterative LP-relaxation algorithm. The selection of each SNP is considered as a Bernoulli random variable x_{ k }taking values 1 (or 0) with probability y_{ k }(or 1 - y_{ k }). Let X_{i,j}be the sum of random variables in one inequality of (5), i.e.,
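In symbols, and using the set D(P_{i}, P_{j}) of SNPs that distinguish patterns P_{i} and P_{j}, the sum in question is (a reconstruction from the definitions above):

```latex
X_{i,j} \;=\; \sum_{k \in D(P_i, P_j)} x_k .
```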
By (6), the expected value of X_{i,j}(after randomized rounding) is
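Since each x_{k} is a Bernoulli variable with mean y_{k}, and the fractional solution satisfies the relaxed constraints of the LP, linearity of expectation gives the following. The expression is reconstructed from these two facts, with the label (7) chosen to match the reference that follows.

```latex
E[X_{i,j}]
  \;=\; \sum_{k \in D(P_i, P_j)} E[x_k]
  \;=\; \sum_{k \in D(P_i, P_j)} y_k
  \;\ge\; m + 1 .
\tag{7}
```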
By (7), we know E[X_{i,j}] ≥ m + 1. Since the right-hand side of (8) decreases when E[X_{i,j}] > m, we can replace E[X_{i,j}] with (m + 1) to obtain an upper bound, i.e.,
Theorem 4. The iterative LP-relaxation algorithm gives a solution of O(m ln K) approximation.
OPT(LP) ≤ OPT(IP).
Let the set of solutions returned in t iterations be {Z_{1}, Z_{2},...,Z_{ t }}.
Note that we repeat this algorithm only for those unsatisfied inequalities. Thus, E[Z_{1}] ≥ E[Z_{2}] ≥ ... ≥ E[Z_{ t }]. Let x_{ p }denote the final solution obtained in Step 4. The expected final solution is
With a high probability, the iterative LP-relaxation algorithm stops after O(m ln K) iterations and finds a solution of O(m ln K) approximation. □
An algorithm for finding auxiliary tag SNPs
This section describes an algorithm for finding auxiliary tag SNPs assuming robust tag SNPs have been computed in advance. Given a haplotype block M_{ h }containing N SNPs and K haplotypes, we define C_{ tag }⊆ C as the set of tag SNPs genotyped from a haplotype sample with some missing data. This haplotype sample may fail to be distinguished because of the ambiguity caused by missing data. We wish to find the minimum number of auxiliary tag SNPs from the remaining SNPs in the block to resolve the ambiguity. A formal definition of this problem is given below.
Problem: Minimum Auxiliary Tag SNPs (MATS)
Input: An N × K matrix M_{ h }, and a set of SNPs C_{ tag }genotyped from a sample with missing data.
Output: The minimum subset of SNPs C_{ aux }⊆ C - C_{ tag }such that each pair of ambiguous patterns can be distinguished by SNPs in C_{ aux }.
The following theorem shows the NP-hardness of the MATS problem.
Theorem 5. The MATS problem is NP-hard.
Proof. Consider that all SNPs in C_{ tag }are missing. This special case of the MATS problem becomes finding the minimum tag SNPs from C - C_{ tag }, which is already known to be NP-hard [17]. Therefore, MATS is also NP-hard. □
Step 1. The patterns that match the haplotype sample are stored into a set A. For example (see Figure 11), if we genotype SNPs S_{1}, S_{2}, and S_{3} for the sample h_{2} and the SNP S_{1} is missing, patterns P_{1} and P_{3} both match h_{2}. Thus, A = {P_{1}, P_{3}}.
Step 2. If |A| = 1, the sample is identified unambiguously and we are done (e.g., h_{1} in Figure 11). If |A| > 1 (e.g., h_{2}), for each pair of ambiguous patterns in A (e.g., P_{1} and P_{3}), traverse the corresponding column in T_{ r }, find the next unused SNP (e.g., S_{4}), and add the SNP to C_{ aux }. As a result, the SNPs in C_{ aux }can distinguish each pair of ambiguous patterns, which are the auxiliary tag SNPs for the haplotype sample.
The worst case of this algorithm occurs when all SNPs in C_{tag} are missing, so that we need to traverse each column in T_{r}. Thus, the running time of this algorithm is O(|T_{r}|) = O(m|P|).
Declarations
Acknowledgements
We thank the referees for their valuable comments that resulted in numerous improvements in the presentation. Yao-Ting Huang and Kun-Mao Chao were supported in part by NSC grants 93-2213-E-002-029 and 94-2213-E-002-091 from the National Science Council, Taiwan. Ting Chen was supported in part by NIH CEGS: Implications of Haplotype Structure in the Human Genome, Grant No. P50 HG002790.
References
1. Bafna V, Halldórsson BV, Schwartz R, Clark AG, Istrail S: Haplotypes and informative SNP selection algorithms: don't block out information. Proc RECOMB'03 2003, 19–27.
2. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004, 74: 106–120.
3. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to algorithms. The MIT Press; 2001.
4. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet 2001, 29(2):229–232.
5. Forsgren A, Gill PE, Wright MH: Interior methods for nonlinear optimization. SIAM Rev 2002, 44: 525–597.
6. Garey MR, Johnson DS: Computers and intractability. Freeman, New York; 1979.
7. Halldórsson BV, Bafna V, Lippert R, Schwartz R, Vega FM, Clark AG, Istrail S: Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Research 2004, 1633–1640.
8. Halperin E, Eskin E: Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics 2004.
9. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Whole-genome patterns of common DNA variation in three human populations. Science 2005, 307: 1072–1079.
10. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18: 337–338.
11. LP Solve [http://www.cs.sunysb.edu/~algorith/implement/lpsolve/implement.shtml]
12. Niu T, Qin Z, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–159.
13. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001, 294: 1719–1723.
14. Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 2003, 73: 1162–1169.
15. Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics 2003, 19(14):1773–1780.
16. Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc Nat Acad Sci 2003, 100(12):7225–7230.
17. Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype partitioning. Proc Nat Acad Sci 2002, 99(11):7335–7339.
18. Zhang K, Sun F, Waterman MS, Chen T: Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. Am J Hum Genet 2003, 73: 63–73.
19. Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F: Haplotype block partition and tag SNP selection using genotype data and their applications to association studies. Genome Research 2004, 14: 908–916.
20. Zhao JH, Lissarrague S, Essioux L, Sham PC: GENECOUNTING: haplotype analysis with missing genotypes. Bioinformatics 2002, 18: 1694–1695.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.