Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads
 Jorge Duitama^{1},
 Justin Kennedy^{1},
 Sanjiv Dinakar^{2},
 Yözen Hernández^{3},
 Yufeng Wu^{1} and
 Ion I Măndoiu^{1}
https://doi.org/10.1186/1471-2105-12-S1-S53
© Duitama et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping.
Results
In this paper we introduce a new multilocus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as Hapmap or the 1000 genomes project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integrating LD information yields genotype calling accuracy from low-coverage sequencing data that is comparable to that of microarray platforms. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/.
Conclusions
Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, rendering low-coverage sequencing a viable alternative to microarrays for conducting large-scale genome-wide association studies.
Background
Recent advances in massively parallel sequencing have dramatically increased throughput compared to the classic Sanger technology, with several commercially available platforms including 454, Illumina, ABI SOLiD, and Helicos delivering billions of bases per day. This has enabled sequencing of several individual genomes [1–8], ushering in the era of personal genomics. Thousands of other individual genomes are currently being sequenced as part of large-scale projects such as the international 1000 genomes project [9], and whole-genome sequencing is likely to become routine as sequencing costs continue to decrease. However, analysis of whole-genome sequencing data remains challenging [10] and experimental design optimization has only recently started to receive attention [11].
In this paper we focus on one of the most fundamental genomic analyses, namely determining the genotypes at known loci of genome variation such as single nucleotide polymorphisms (SNPs). Diploid organisms including humans inherit two (possibly identical) variants or alleles at autosomal loci, and most medical applications of personal genomics require accurate identification of both variants, the combination of which is referred to as the genotype. Of particular interest are loci that are heterozygous, i.e., loci for which the two chromosomes carry different alleles. However, identifying heterozygous loci from low-coverage whole-genome sequencing data poses a significant challenge. Sequencing data is obtained using the so-called “shotgun” approach, whereby millions of short DNA fragments called reads are generated from randomly selected locations on the two chromosomes. If, for example, there are only two reads generated from a heterozygous locus, there is a 50% chance that one allele would be missed. To compensate for sequencing errors, existing methods for detecting heterozygous loci have even higher minimum allele coverage requirements, e.g., in [3, 8], calling an allele requires the presence of at least two reads supporting it. Consequently, due to the relatively low sequencing depth used in these two studies (about 7.5×), the reported sensitivity of detecting heterozygous SNPs was only 75%.
A simple way to improve genotype calling accuracy is to increase sequencing depth, as the probability of “missing” an allele decreases with the number of reads. After taking into account the effect of sequencing errors it has been estimated that, in the absence of additional information, achieving 99% sensitivity at detecting heterozygous SNPs would require an average sequencing depth of over 21× [12]. Our main contribution is to demonstrate that high-accuracy SNP genotypes can be inferred from shotgun sequencing data of much lower depth by exploiting the correlation between alleles at nearby SNP sites, commonly referred to as linkage disequilibrium (LD).
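The dependence of heterozygote detection on depth can be illustrated with a back-of-the-envelope calculation. The sketch below rests on idealized assumptions that are not part of the analysis in [12]: reads are error-free, each read comes from either chromosome with probability 1/2, and the number of reads covering a locus is Poisson distributed with mean equal to the average depth. The function names are hypothetical.

```python
import math

def p_allele_missed(k):
    """P(at least one allele is unseen) given k error-free reads at a
    heterozygous locus, each read drawn from either chromosome with
    probability 1/2. With fewer than two reads an allele is always missed."""
    return 1.0 if k < 2 else 2.0 ** (1 - k)

def het_sensitivity(depth, min_reads=1):
    """P(each allele is supported by at least min_reads reads) when the
    total coverage is Poisson(depth). Splitting a Poisson count at random
    gives two independent Poisson(depth / 2) counts, one per allele."""
    lam = depth / 2.0
    p_below = sum(math.exp(-lam) * lam ** j / math.factorial(j)
                  for j in range(min_reads))
    return (1.0 - p_below) ** 2

if __name__ == "__main__":
    print(p_allele_missed(2))             # two reads: one allele missed half the time
    print(het_sensitivity(7.5, min_reads=2))
    print(het_sensitivity(21.0, min_reads=2))
```

Under these simplifications, requiring two supporting reads per allele at 7.5× depth gives a sensitivity close to 0.79, in the same range as the 75% figure cited above; the model ignores sequencing errors, mapping failures, and uneven coverage, so it should be read only as a rough guide.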
LD patterns over millions of common SNPs have been mapped for several populations as part of the Hapmap project [13]. The strong LD observed in human populations has already been exploited by methods for imputation of genotypes at untyped SNP loci based on nearby SNP genotypes [14–19] (see [20] for a recent review) and, more recently, for improving genotype calling accuracy from microarray hybridization signals [21]. Another striking demonstration of the power of LD has been the inference of Watson’s APOE status [22] despite the removal of sequencing reads covering this region from the published dataset [8]. In this work we introduce a novel hierarchical factorial Hidden Markov Model (HMM) that allows integrated analysis of LD information extracted from reference population panels such as Hapmap and short-read sequencing data generated by current technologies. Although the ensuing multilocus genotype inference is computationally hard, we develop a scalable heuristic similar to the posterior decoding algorithm for HMMs. A software package implementing this algorithm has been released under the GNU General Public License and is available at http://dna.engr.uconn.edu/software/GeneSeq/. We also present experimental results on publicly available 454, Illumina, and ABI SOLiD whole-genome sequencing datasets showing that integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods. For example, at 6× average mapped read coverage, our algorithm calls heterozygous SNP genotypes with about 96% accuracy, and accuracy can be further increased to 98–99% by leaving uncalled a small percentage of SNP genotypes with low posterior probabilities. This accuracy is comparable to that achieved by microarray-based genotyping platforms.
Coupled with continued decreases in sequencing costs, the reduced sequencing depth required when using LD information renders low-coverage sequencing a potentially more cost-effective alternative to microarrays for the next generation of genome-wide association studies (GWAS). For example, the ABI SOLiD 4hq is expected to deliver 300 Gb of sequencing data per run, or the equivalent of 16 individual genomes at 6× coverage, with a cost of only $600 per genome [23]. Undoubtedly, cost will be an important factor in future GWAS, which are expected to use much larger sample sizes than past studies in order to enable the study of gene-gene and gene-environment interactions [24].
Methods
In this section we begin by describing a simplified statistical model that assumes independence between loci, then extend it to include dependences between alleles at different SNPs due to LD. We next formalize the multilocus genotype calling problem in the context of the extended model and show that computing the most likely multilocus genotype is computationally hard. Finally, we present a posterior decoding heuristic which independently selects the most likely genotype at each locus conditional on the entire set of reads.
Notations
We use uppercase italic letters (e.g., X) to denote random variables and lowercase italic letters (e.g., x) to denote generic values taken by them. Vectors of random variables and of generic values are denoted by boldface uppercase letters (e.g., X) and boldface lowercase letters (e.g., x), respectively. When there is no ambiguity on the underlying probabilistic event we use P(x) to denote P(X = x), with similar shorthands used for joint and conditional probabilities of multiple events. For simplicity we consider only biallelic SNPs on autosomes. For every SNP locus, we denote the two possible alleles by 0 and 1, and the three genotypes by 0, 1, and 2, with 0 and 2 denoting the homozygous 0 and homozygous 1 genotypes, and 1 denoting the heterozygous genotype.
Single SNP genotype calling
In this section we describe a genotype inference model that assumes the SNPs to be unlinked as in [8], but further incorporates allele uncertainty quantified by sequencing quality scores, read mapping uncertainty, and population genotype frequencies estimated from a reference panel.
Let r be a read mapped onto the genome. If r covers SNP locus i, we denote by r(i) the allele observed in the read at this locus. Since our focus is on genotyping SNPs represented in a reference panel, we further assume that panel SNPs at which the individual under study has novel allele variants (observed in [8] at only 0.02% of the markers) have been identified in a preliminary analysis, e.g., by using the binomial probability test of [8]. Based on this assumption, all reads with alleles not represented in the panel population are ignored, and for the remaining reads r we have r(i) ∈ {0,1}. The probability that allele r(i) is affected by a sequencing error is denoted by ε_{r(i)}. In our experiments we set ε_{r(i)} = 10^{−q_{r(i)}/10}, where q_{r(i)} denotes the Phred quality score of r(i) [25].
Here P(G_{i} = g) denotes the population frequency of genotype g, estimated from the reference panel. If read mapping uncertainty is available in the form of probabilities m(r) that read r is mapped at the correct position, such information can be accounted for in genotype calling by replacing the above conditional probabilities with genotype weights obtained from (1)–(3) by raising the terms corresponding to read r to the power m(r). Although the resulting weights can no longer be interpreted as conditional genotype probabilities, they naturally allow interpreting the presence of a read r with mapping confidence m(r) < 1 as the equivalent of observing an m(r) fraction of an identical read mapped with confidence 1.
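The single-SNP computation can be sketched in a few lines. The per-read likelihood used below is an assumption of this sketch rather than a restatement of equations (1)–(3): a read shows the allele of a homozygous genotype with probability 1 − ε and the other allele with probability ε, and shows either allele with probability 1/2 when the genotype is heterozygous. The function name and the tuple layout of the reads are hypothetical.

```python
def single_snp_posterior(reads, genotype_freq):
    """Posterior genotype weights at one SNP, normalized to sum to 1.

    reads         -- list of (allele, phred_quality, mapping_confidence),
                     with allele in {0, 1} and mapping_confidence in (0, 1]
    genotype_freq -- [P(G=0), P(G=1), P(G=2)] from the reference panel
    """
    weights = list(genotype_freq)
    for allele, quality, map_conf in reads:
        eps = 10 ** (-quality / 10)          # Phred score to error probability
        per_genotype = [
            1 - eps if allele == 0 else eps,  # homozygous 0 (assumed form)
            0.5,                              # heterozygous (assumed form)
            1 - eps if allele == 1 else eps,  # homozygous 1 (assumed form)
        ]
        for g in range(3):
            # Raising the read's term to the power m(r) treats a read with
            # mapping confidence m(r) as a fraction m(r) of a confident read.
            weights[g] *= per_genotype[g] ** map_conf
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    reads = [(0, 20, 1.0), (1, 20, 1.0)]      # one read for each allele
    print(single_snp_posterior(reads, [0.25, 0.5, 0.25]))
```

With mapping confidence below 1 the normalized values are genotype weights rather than true conditional probabilities, as noted above.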
A statistical model for multilocus genotype inference
At the core of the model are two left-to-right HMMs M and M′ (dotted boxes in Fig. 1), each emitting haplotypes with frequencies corresponding to those in the populations of origin for the sequenced individual's parents. Under M and M′, each haplotype is viewed as a mosaic formed as a result of historical recombination among a set of K founder haplotypes, where K is a population-specific model parameter. Formally, for every SNP locus i ∈ {1, … , n}, we let H_{i} (H′_{i}) be a random variable representing the allele observed at this locus on the maternal (paternal) chromosome of the individual under study, and F_{i} (F′_{i}) be a random variable denoting the founder haplotype from which H_{i} (respectively H′_{i}) originates. As in previous works [15, 17, 28–30], we assume that F_{i} form the states of a first-order HMM with emissions H_{i}, and estimate probabilities P(f_{1}), P(f_{i+1} | f_{i}), and P(h_{i} | f_{i}) using the classical Baum-Welch algorithm [31] based on haplotypes inferred from a panel representing the population of origin of the individual’s mother. Probabilities P(f′_{1}), P(f′_{i+1} | f′_{i}), and P(h′_{i} | f′_{i}) are estimated in the same way based on haplotypes inferred from a panel representing the population of origin of the individual’s father.
This implies that P(r_{i} | g_{i}) are given by equations (1)–(3), and in the following we will assume that probabilities P(r_{i} | g_{i}) are precomputed in O(m) time, where m is bounded above by the total number of reads. We can now formulate the following:
Multilocus Genotyping Problem (MGP)
Given: Trained HMM models M, M′ and a set of shotgun reads r = (r_{1}, … , r_{n})
Find: Multilocus genotype g* ∈ {0,1,2}^{n} with maximum posterior probability, i.e.,
g* = argmax_{g} P(g | r, M, M′)    (6)
Computational complexity
In this section we show that MGP is NP-hard. Let Maximum Multilocus Genotype Probability Problem (MMGPP) denote the optimization version of MGP that requires finding max_{g} P(g | r, M, M′).
Theorem 1. For any ∊ > 0, MMGPP cannot be approximated within unless P=NP, and it cannot be approximated within unless ZPP=NP. Furthermore, this holds even if M′ = M.

The number of SNPs n is set to one plus the length of the strings emitted by M_{0}.

At the first SNP, for two founder states and we have ; all other founder states have zero initial probability.

For every SNP locus i > 1 we add a new founder as well as a set of founders corresponding to the states at “column” i − 1 of M_{0}.

All founder , emit 0 with probability 1. Furthermore, for every i = 2, … , n.

Founder emits 1 with probability 1, and has transitions to founders , according to the initial probabilities of M_{0}.

All other emission and transition probabilities are identical to those for the corresponding states of M_{0}.
Finally, we set r = {r_{0}, r_{1}} where r_{0} is a read that supports allele 0 at the first SNP and r_{1} is a read that supports allele 1 at the first SNP. Error probabilities for both alleles are set to zero.
Note that P(g | r, M, M′) ≠ 0 only for multilocus genotypes with g_{1} = 1 and g_{i} ∈ {0,1} for i = 2, … , n.
The last equality comes from the fact that g can only be observed when the maternal haplotype is 0^{n} and the paternal haplotype is g or vice versa, and each of these configurations has a probability of P(g_{2}, … , g_{n} | M_{0})/4. The inapproximability result follows from [32] since, by (7), P(g | r, M, M′) is a constant fraction of P(g_{2}, … , g_{n} | M_{0}).
Since an algorithm similar to the forward algorithm for HMMs can be used to compute in polynomial time the marginal probability of a given genotype, Theorem 1 implies the following:
Corollary 2. MGP is NP-hard.
Posterior decoding algorithm
We next present an MGP heuristic similar to the posterior decoding algorithm for HMMs. Specifically, the algorithm selects for each SNP locus i the genotype ĝ_{ i } with maximum posterior probability given the read data r. Note that, unlike the single SNP genotype calling method, where we condition only on the set r_{ i } of reads overlapping locus i, in the posterior decoding algorithm we take into account the entire set of reads:
Posterior decoding algorithm
Step 1. For each i = 1, … , n, ĝ_{i} ← argmax_{g_{i}} P(g_{i} | r)
Step 2. Return ĝ = (ĝ_{1}, … , ĝ_{ n })
Thus all probabilities P(g_{i}, r) can be computed in O(nK^{2}) time once the forward and backward probabilities are available.
Forward and backward probabilities can thus be computed in O(nK^{3}) time by using recurrences (8), (11), and (12) for the former and (13), (14), and (15) for the latter, resulting in an overall runtime of O(m + nK^{3}), where m is the number of reads, n is the number of SNPs, and K is a user-selected parameter denoting the number of founders in the HMM models of haplotype diversity in the parental populations (we used K = 7 in our experiments).
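The posterior decoding step can be sketched as a forward-backward pass over pairs of founder states. The code below is an illustrative sketch, not the released implementation and not a transcription of recurrences (8)–(15). It assumes M′ = M (identical models for both parents), takes the read likelihoods P(r_{i} | g_{i}) as precomputed input, and uses hypothetical names throughout (posterior_decode, pi, trans, emit, read_lik). Each update of the K × K table is written as two K × K matrix products, which costs O(K^{3}) per locus.

```python
def matmul(A, B):
    """Plain-Python product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """Element-wise product of two matrices of equal shape."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def normalize(A):
    """Rescale to sum 1; per-locus scaling does not change the posteriors."""
    total = sum(map(sum, A))
    return [[a / total for a in row] for row in A]

def posterior_decode(pi, trans, emit, read_lik):
    """Per-locus genotype posteriors given all reads (assumes M' = M).

    pi[f]             -- initial founder probabilities
    trans[i][a][f]    -- P(F_{i+1} = f | F_i = a), for i = 0 .. n-2
    emit[i][f][h]     -- P(H_i = h | F_i = f), h in {0, 1}
    read_lik[i][g]    -- precomputed P(r_i | G_i = g), g in {0, 1, 2}
    """
    n, K = len(emit), len(pi)
    # w[i][g][f][f2] = P(G_i = g | f, f2) * P(r_i | g), with g = h + h2
    w = [[[[0.0] * K for _ in range(K)] for _ in range(3)] for _ in range(n)]
    for i in range(n):
        for f in range(K):
            for f2 in range(K):
                for h in (0, 1):
                    for h2 in (0, 1):
                        w[i][h + h2][f][f2] += (emit[i][f][h] * emit[i][f2][h2]
                                                * read_lik[i][h + h2])
    # E[i][f][f2] = P(r_i | f, f2): reads at locus i given the founder pair
    E = [[[sum(w[i][g][f][f2] for g in range(3)) for f2 in range(K)]
          for f in range(K)] for i in range(n)]
    # fwd[i]: founder pair at locus i together with reads at loci before i
    fwd = [None] * n
    fwd[0] = [[pi[f] * pi[f2] for f2 in range(K)] for f in range(K)]
    for i in range(1, n):
        a = normalize(hadamard(fwd[i - 1], E[i - 1]))
        T = trans[i - 1]
        fwd[i] = matmul(matmul(transpose(T), a), T)
    # bwd[i]: reads at loci after i given the founder pair at locus i
    bwd = [None] * n
    bwd[n - 1] = [[1.0] * K for _ in range(K)]
    for i in range(n - 2, -1, -1):
        T = trans[i]
        bwd[i] = normalize(matmul(matmul(T, hadamard(E[i + 1], bwd[i + 1])),
                                  transpose(T)))
    posteriors = []
    for i in range(n):
        p = [sum(w[i][g][f][f2] * fwd[i][f][f2] * bwd[i][f][f2]
                 for f in range(K) for f2 in range(K)) for g in range(3)]
        total = sum(p)
        posteriors.append([x / total for x in p])
    calls = [max(range(3), key=lambda g: p[g]) for p in posteriors]
    return calls, posteriors
```

The returned posteriors can also be thresholded to leave low-confidence genotypes uncalled. The sketch assumes that at every locus at least one genotype has non-zero read likelihood; otherwise the normalization would divide by zero.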
Results and discussion
Datasets
1. Watson 454: A set of 74.4 million reads downloaded from the NCBI SRA database (submission number: SRA000065). The reads, with an average length of ~265 bp, were generated using the Roche 454 FLX platform as part of James Watson’s personal genome project. This is a subset of the 106.5 million 454 reads analyzed in [8]. Unless noted otherwise, the haplotype panel used to train identical HMM models for the maternal and paternal populations was obtained by phasing CEU trio genotypes from Hapmap r23a [13] using the ENT algorithm of [33] and retaining parent haplotypes from each trio. As in [8], genotype calling accuracy was assessed using the SNP genotypes determined using duplicate hybridization experiments with Affymetrix 500k microarrays (only concordant genotypes were retained in the test set).
2. NA18507 Illumina: A set of 525 million paired-end reads downloaded from the NCBI SRA database (submission number: SRA000271). These 36 bp reads, which were generated using the Illumina Genome Analyzer from a Hapmap Yoruban individual identified as NA18507, are a subset of the dataset analyzed in [1]. For the analysis of this dataset the HMM models for maternal and paternal populations were trained using YRI haplotypes from Hapmap r22, excluding the haplotypes of the YRI trio that contains NA18507. As gold standard we used the genotypes published as part of Hapmap r22 for individual NA18507.
3. NA18507 SOLiD: A set of 900 million single ABI SOLiD reads generated from Hapmap individual NA18507 was kindly provided by the authors of [4]. Reads varied in length between 20 and 44 bp, and were already mapped to the reference genome. Corresponding raw reads are available for download from the NCBI SRA database (submission number: SRA000272). HMM models and gold standard genotypes were determined in the same way as for the NA18507 Illumina dataset.
Read mapping
Summary statistics for the three datasets used in evaluation
Dataset            Test SNPs   Raw Reads   Raw Sequence   Mapped Reads   Avg. Mapped SNP Coverage
Watson 454         443K        74.2M       19.7Gb         49.8M (67%)    5.85×
NA18507 Illumina   2.85M       525M        18.9Gb         397M (78%)     6.10×
NA18507 SOLiD      2.85M       2.45G       75Gb           900M (37%)     9.85×
Genotyping accuracy
To evaluate the effects of read coverage on genotype calling, for each dataset of m mapped reads we created four subsets of sizes m/16, m/8, m/4 and m/2 by picking reads at random. For each subset we called genotypes using the HMM-based posterior decoding algorithm, the binomial test of [8] (with a threshold of 0.01), and the single SNP posterior probability described under Methods. We also included in the comparison genotype calls obtained by SOAPsnp [36] and MAQ [35], two widely used LD-oblivious Bayesian methods implemented in the SAMtools package [37]. Unfortunately we could not compare our method with similar tools developed as part of the 1000 genomes project [38, 39], which became publicly available only when this article was in press. We measured the accuracy of each genotype calling method by computing the percentage of SNP genotype calls that match the gold standard available for each dataset. As in previous papers [1, 3, 4, 8], we separately report accuracy for homozygous and heterozygous SNPs.
Since our algorithm computes a posterior probability for each SNP genotype, further increases in calling accuracy can be obtained at the expense of leaving uncalled a small percentage of SNP genotypes with low posterior probability. Such “no-calls” are commonly used in microarray-based genotyping for SNPs for which hybridization signals are ambiguous. Fig. 6(b) shows the tradeoffs achievable between the concordance and call rate when running the HMM posterior decoding algorithm on the full set of Watson 454 reads. Over all SNPs, concordance with the duplicate Affymetrix genotypes reaches 99.4% at a no-call rate of only 6%.
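The accuracy measures used here, including the tradeoff between concordance and call rate, can be computed along the following lines. This is a minimal sketch with hypothetical names; it assumes that each call comes with the posterior probability of the called genotype and that SNPs are classified as homozygous or heterozygous according to the gold standard.

```python
def concordance(calls, confidences, truth, min_confidence=0.0):
    """Concordance with a gold standard, split by homozygous/heterozygous.

    calls       -- called genotype (0, 1, or 2) for each SNP
    confidences -- posterior probability of each called genotype
    truth       -- gold standard genotype for each SNP
    Calls with confidence below min_confidence are treated as no-calls.
    """
    matches = {"hom": 0, "het": 0}
    called = {"hom": 0, "het": 0}
    for call, conf, gold in zip(calls, confidences, truth):
        if conf < min_confidence:
            continue                      # leave this SNP uncalled
        kind = "het" if gold == 1 else "hom"
        called[kind] += 1
        matches[kind] += int(call == gold)

    def rate(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    total_called = called["hom"] + called["het"]
    return {
        "call_rate": rate(total_called, len(truth)),
        "hom_concordance": rate(matches["hom"], called["hom"]),
        "het_concordance": rate(matches["het"], called["het"]),
        "overall_concordance": rate(matches["hom"] + matches["het"], total_called),
    }

if __name__ == "__main__":
    calls = [0, 1, 2, 1]
    confidences = [0.99, 0.60, 0.95, 0.90]
    truth = [0, 2, 2, 1]
    for threshold in (0.0, 0.8):
        print(threshold, concordance(calls, confidences, truth, threshold))
```

Sweeping min_confidence from 0 to 1 traces out a concordance versus call-rate curve of the kind described above.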
Conclusions
In this paper we introduced a statistical model for multilocus genotyping that integrates shotgun sequencing data with LD information extracted from a reference panel. Although finding the multilocus genotype with maximum posterior probability under the integrated model is NP-hard, experimental results suggest that a simple posterior decoding algorithm produces highly accurate genotype calls even from low-coverage sequencing data. Compared to current LD-oblivious genotype calling methods, our method allows researchers to achieve a desired accuracy target with reduced sequencing costs. For example, genotype calling accuracy achieved at 5–6× average coverage by a previously proposed binomial test is matched by the HMM-based posterior decoding algorithm using less than 1/4 of the reads. While a full comparison of sequencing and microarray based genotyping in the context of GWAS is beyond the scope of this paper, experimental results on three publicly available datasets generated using the 454, Illumina, and ABI SOLiD sequencing platforms suggest that at a mapped coverage depth of 5–6× our algorithm achieves an accuracy that is comparable to that of microarray platforms. Concordance rates reported for microarrays often exceed 99.9% (see, e.g., [41]), and are even higher for methods that integrate hybridization signals with LD information [21]. However, due to cost constraints, microarrays typically assay only a fraction of the SNPs represented in reference panels. For example, the next generation of Illumina microarrays is expected to assay only 5 million of the estimated 35 million SNPs generated by the 1000 genomes project [42]. Genotypes for the untyped SNPs would have to be inferred based solely on LD information, and even the best imputation methods have error rates of 5–6% [20], or 2–3% when leaving 10% of SNPs uncalled.
Since the majority of SNPs must be imputed, this results in an overall accuracy below that achieved by the HMM posterior algorithm on the Watson 454 dataset.
In ongoing work we are exploring efficient algorithms for LD-based haplotype reconstruction from paired shotgun sequencing reads. We also plan to empirically compare our method with similar tools developed as part of the 1000 genomes project [38, 39].
Authors' contributions
IIM and YW conceived the study. JD, JK, SD, and YH implemented the methods and conducted the experiments. JD and IIM drafted the manuscript. All authors participated in the development of the methods, data analysis, and manuscript revision. All authors have read and approved the final manuscript.
Declarations
Acknowledgments
JD, JK, and IIM were supported in part by NSF awards IIS-0546457, IIS-0916948, and DBI-0543365. YW was supported in part by NSF award IIS-0953563. SD and YH were supported in part by NSF award CCF-0755373.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
Authors’ Affiliations
References
1. Bentley D, et al.: Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry. Nature 2008, 456: 53–59. 10.1038/nature07517
2. Drmanac R, et al.: Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science 2009, 327(78):78–81.
3. Levy S, et al.: The Diploid Genome Sequence of an Individual Human. PLoS Biology 2007, 5(10):e254+. 10.1371/journal.pbio.0050254
4. McKernan K, et al.: Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research 2009, 19: 1527–1541. 10.1101/gr.091868.109
5. Pushkarev D, Neff N, Quake S: Single-molecule sequencing of an individual human genome. Nature Biotechnology 2009, 27(9):847–850. 10.1038/nbt.1561
6. Schuster S, et al.: Complete Khoisan and Bantu genomes from southern Africa. Nature 2010, 463(18):943–947. 10.1038/nature08795
7. Wang J, et al.: The diploid genome sequence of an Asian individual. Nature 2008, 456: 60–65. 10.1038/nature07484
8. Wheeler D, et al.: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452: 872–876. 10.1038/nature06884
9. The 1000 Genomes Project Consortium: The 1000 Genomes Project Consortium. [http://www.1000genomes.org/]
10. Snyder M, Du J, Gerstein M: Personal genome sequencing: current approaches and challenges. Genes & Development 2010, 24: 423–431. 10.1101/gad.1864110
11. Bashir A, Bansal V, Bafna V: Designing deep sequencing experiments: detecting structural variation and estimating transcript abundance. BMC Genomics 2010, 11: 385. 10.1186/1471-2164-11-385
12. Wendl M, Wilson R: Aspects of coverage in medical DNA sequencing. BMC Bioinformatics 2008, 9: 239. 10.1186/1471-2105-9-239
13. The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449: 851–861. 10.1038/nature06258
14. Howie BN, Donnelly P, Marchini J: A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet 2009, 5(6):e1000529. 10.1371/journal.pgen.1000529
15. Kennedy J, Măndoiu I, Paşaniuc B: Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. Journal of Computational Biology 2008, 15(9):1155–1171. 10.1089/cmb.2007.0133
16. Li Y, Abecasis GR: Mach 1.0: Rapid Haplotype Reconstruction and Missing Genotype Inference. American Journal of Human Genetics 2006, 79: 2290.
17. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics 2007, 39: 906–913. 10.1038/ng2088
18. Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. American Journal of Human Genetics 2005, 76: 449–462. 10.1086/428594
19. Wen X, Nicolae DL: Association studies for untyped markers with TUNA. Bioinformatics 2008, 24: 435–437. 10.1093/bioinformatics/btm603
20. Marchini J, Howie B: Genotype imputation for genome-wide association studies. Nature Reviews Genetics 2010, 11(7):499–511. 10.1038/nrg2796
21. Browning B, Yu Z: Simultaneous Genotype Calling and Haplotype Phasing Improves Genotype Accuracy and Reduces False-Positive Associations for Genome-wide Association Studies. The American Journal of Human Genetics 2009, 85(18):847–861. 10.1016/j.ajhg.2009.11.004
22. Nyholt DR, Yu CE, Visscher PM: On Jim Watson’s APOE status: genetic information is hard to hide. European Journal of Human Genetics 2008, 17(2):147–149. 10.1038/ejhg.2008.198
23. Applied Biosystems: SOLiD 4 System product description. [https://products.appliedbiosystems.com/]
24. Burton PR, Hansell AL, Fortier I, Manolio TA, Khoury MJ, Little J, Elliott P: Size matters: just how big is BIG?: Quantifying realistic sample size requirements for human genome epidemiology. Int. J. Epidemiol. 2009, 38: 263–273. 10.1093/ije/dyn147
25. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 1998, 8(3):186–194.
26. Ghahramani Z, Jordan M: Factorial Hidden Markov Models. Mach. Learn. 1997, 29(2–3):245–273. 10.1023/A:1007425814087
27. Fine S, Singer Y, Tishby N: The Hierarchical Hidden Markov Model: Analysis and Applications. Mach. Learn. 1998, 32: 41–62. 10.1023/A:1007469218079
28. Kimmel G, Shamir R: A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology 2005, 12: 1243–1260. 10.1089/cmb.2005.12.1243
29. Rastas P, Koivisto M, Mannila H, Ukkonen E: Phasing genotypes using a Hidden Markov model. In Bioinformatics Algorithms: Techniques and Applications, preliminary version Proc. WABI 2005. Wiley; 2008:355–373.
30. Schwartz R: Algorithms for Association Study Design Using a Generalized Model of Haplotype Conservation. Proc. CSB 2004, 90–97.
31. Baum L, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics 1970, 41: 164–171. 10.1214/aoms/1177697196
32. Lyngsø R, Pedersen C: The consensus string problem and the complexity of comparing hidden Markov models. Journal of Computer Systems Science 2002, 65(3):545–569. 10.1016/S0022-0000(02)00009-0
33. Gusev A, Mandoiu I, Pasaniuc B: Highly Scalable Genotype Phasing by Entropy Minimization. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2008, 5(2):252–261. 10.1109/TCBB.2007.70223
34. Kurtz S, et al.: Versatile and open software for comparing large genomes. Genome Biology 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12
35. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 2008, 18: 1851–1858. 10.1101/gr.078212.108
36. Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J: SNP detection for massively parallel whole-genome resequencing. Genome Research 2009, 19: 1124–1132. 10.1101/gr.088013.108
37. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, et al.: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079. 10.1093/bioinformatics/btp352
38. Li Y, Abecasis G: Thunder (beta version). 2010. [http://genome.sph.umich.edu/wiki/Thunder]
39. Le SQQ, Durbin R: SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Research 2010.
40. Kennedy J, Mandoiu I, Pasaniuc B: GEDI: Scalable Algorithms for Genotype Error Detection and Imputation. Tech. Rep. 0911.1765, Cornell University arXiv eprint; 2009. [http://arxiv.org/abs/0911.1765]
41. Hong H, Su Z, Ge W, Shi L, Perkins R, Fang H, Xu J, Chen J, Han T, Kaput J, Fuscoe J, Tong W: Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples. BMC Bioinformatics 2008, 9(Suppl 9):S17. 10.1186/1471-2105-9-S9-S17
42. Illumina: Empowering GWAS for a new era of discovery. [http://www.illumina.com/documents/products/technotes/technote_empower_gwas.pdf]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.