Volume 13 Supplement 6
Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012)
Highresolution genetic mapping with pooled sequencing
 Matthew D Edwards^{1} and
 David K Gifford^{1, 2, 3}Email author
DOI: 10.1186/1471210513S6S8
© Edwards and Gifford; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
Abstract
Background
Modern genetics has been transformed by highthroughput sequencing. New experimental designs in model organisms involve analyzing many individuals, pooled and sequenced in groups for increased efficiency. However, the uncertainty from pooling and the challenge of noisy sequencing data demand advanced computational methods.
Results
We present MULTI POOL, a computational method for genetic mapping in model organism crosses that are analyzed by pooled genotyping. Unlike other methods for the analysis of pooled sequence data, we simultaneously consider information from all linked chromosomal markers when estimating the location of a causal variant. Our use of informative sequencing reads is formulated as a discrete dynamic Bayesian network, which we extend with a continuous approximation that allows for rapid inference without a dependence on the pool size. MULTI POOL generalizes to include biological replicates and caseonly or casecontrol designs for binary and quantitative traits.
Conclusions
Our increased information sharing and principled inclusion of relevant error sources improve resolution and accuracy when compared to existing methods, localizing associations to single genes in several cases. MULTI POOL is freely available at http://cgs.csail.mit.edu/multipool/.
Background
Advances in highthroughput DNA sequencing have created new avenues of attack for classical genetics problems. A robust method for determining the genetic elements that underlie a phenotype is to gather and group individuals of different phenotypes, interrogate the genome sequences of each group, and identify elements that are present in different proportions between the groups. We describe MULTI POOL, a multilocus method for analyzing highthroughput DNA sequencing reads obtained from large pools of phenotypicallyextreme individuals.
Targeted experiments
Bulk segregant analysis with highthroughput sequencing has been applied in yeast to study drug resistance in [2], high temperature growth in [3], and viability on alternate carbon sources in [4]. Related pooled sequencing experiments used fly [5] and Arabidopsis [6] model systems. In human, analogous pooled sequencing studies currently require target capture methods and a preselected set of candidate loci [7].
Pools may be selected from a single phenotypic extreme, opposite extremes, or one extreme and a control sample. Pools may also be obtained by grouping based on binary traits rather than quantitative phenotype extremes. Early studies used microarrays for pooled genotyping [8–10], but recent developments in highthroughput sequencing allow its use as a direct genotyping platform. While genotyping or sequencing individuals is an alternate choice, the appeal of pooled analysis is the dramatic reduction in cost while maintaining the statistical power of large sample sizes. See [11–13] for a discussion on pooled association studies and experiment design considerations.
Challenges
Pooled genetic mapping studies using highthroughput sequencing present a number of unique difficulties. The core statistical quantity of interest, the allele frequency in each pool, is observed only indirectly. The strainspecific read counts that are used to estimate the allele frequencies are corrupted by sampling noise at most reasonable sequencing depths, read mapping errors [14], reference genome inaccuracies, and biological bias during sample preparation. In addition, the allele frequency measurements are nonuniformly spaced along the genome, depending on the polymorphism structure between the strains of interest. As an illustration, we refer to the bottom two plots in Figure 1 which show simulated 50X average sequencing coverage using polymorphisms from two yeast strains. Linkage implicates a wide region along the shown chromosome, and the allele frequencies estimated from read counts are noisy and not necessarily highest at the exact location of the causal allele.
However, the unbiased nature of genotyping via highthroughput sequencing results in nearly saturated marker coverage where almost all polymorphisms are queried. This avoids the laborious process of marker discovery and assay design required by earlier genotyping technologies. The dense marker coverage also allows for a high degree of information sharing, which motivates the methods underlying MULTI POOL.
Previous statistical methods
Previous statistical approaches to analyzing pooled genotyping data have focused on alternate regimes where genetic markers are relatively sparse and measurements are relatively accurate. Often, only single loci are tested for association, necessarily ignoring data from nearby markers. Additionally, singlelocus methods encounter difficulties with missing data, such as regions that are difficult to sequence or map or have very few polymorphisms.
Earlier work applied hidden Markov models (HMMs) to fine mapping within small regions with fewer number of markers [15, 16], and was extended to pooled genotype measurements in similar scenarios [17]. However, these methods relied on computationally intensive sampling methods and were applied to datasets with only a few dozen markers. Conceptually similar methods have been explored for human studies, focusing on utilizing haplotype structure in the analysis of pooled experiments [18]. In more recent pooled sequencing experiments, a slidingwindow method was applied on pvalues from local tests in [2], while a local weighted method motivated by a probabilistic model was given in [3]. However, these models do not explicitly model the location of the causal locus while considering all relevant marker data.
Approach
MULTI POOL is designed for experimental crosses and dense noisy genotyping, as obtained by sequencing, and handles datasets with tens or hundreds of thousands of markers. We develop a statistical model that can combine information across many nearby markers while accounting for the nonuniform noise levels introduced by varying sequencing depth and marker spacing. The specific advances we present with MULTI POOL include:

A modelbased framework that allows for information sharing across genomic loci and incorporation of experimentspecific noise sources. These methods improve on previous approaches that rely on heuristic techniques to select sliding window sizes, which may sacrifice resolution.

Statistical tests using an informationsharing dynamic Bayesian network (DBN) that report robust location estimates and confidence intervals. The multilocus methods allow for principled inference even in regions without strainspecific markers and reduce experimental noise when many markers are available.

Extensions of our method to any number of replicates and multiple experimental designs, within the same principled statistical framework.
Methods
We develop inference methods for the pool allele frequency at a particular genome position, given the pooled read samples. First, we propose generative models which describe the experimental process. Next, these models are used to construct likelihoodbased statistics to assess the significance of associations in multiple experimental designs.
Obtaining allele frequency measurements
All sequencing reads from a particular pooling experiment are aligned to one strain's reference genome using the short read aligner bwa[19]. To increase specificity, only uniquelymapping reads are considered. In practice, any short read aligner that can produce or export its output to the standard SAM format is compatible with this workflow. Next, a wholegenome pileup is generated using samtools[20]. A genome pileup lists the particular base calls at each genomic position, using the set of mapped sequencing reads. The genome pileup produces reference and nonreference allele counts at each base. Using singlestrain sequencing data, lists of polymorphic bases can be determined and extracted from the pileup of the pooled experiments. The result is a list of allelespecific read counts at many polymorphic sites across the genome. The coverage of the marker sites will vary according to local sequencing depth and mappability [14], and the density will vary according to the local polymorphism level. A similar approach was applied to generate allele counts in [2].
Multilocus model
MULTI POOL uses a probabilistic model that considers one chromosome at a time and explicitly models the effect that recombination and pool size have on neighboring allele frequencies. The model is a dynamic Bayesian network that describes the changing allele frequencies in the pool along a chromosome. The chromosome of interest is segmented into discrete blocks of equal size. A hidden state corresponding to each block reflects the pool allele frequency in the pool at that locus, varying along the genome as recombination causes random fluctuations. Each locus may emit sequencing reads according to its local pool allele frequency (hidden state). These reads may originate from multiple markers falling within the same region or a single marker. When there are no polymorphisms or mappable reads available in a region, the locus has no emissions and therefore the observed data do not directly constrain the hidden state at that locus. Finally, a particular locus may include the causal gene and therefore be directly associated with the phenotype. We assume there is only one causal locus in the analyzed region. For the genetic mapping problem, the causal locus is unknown and the key inference task is identifying its location and degree of association with the phenotype.
Model specification
Emission probabilities
Technical pooling variance that increases the local measurement noise, such as allelespecific PCR amplification bias, could be assumed to act in locusindependent manner and be modeled with increased variance in this expression.
Transition probabilities
This formulation shows that the latent allele frequencies form a firstorder autoregressive Gaussian process with mean $\frac{r}{1\left(12r\right)}=\frac{1}{2}$ and variance $\frac{\left(1r\right)r/N}{1{\left(12r\right)}^{2}}=\frac{1}{4N}$, which can be verified with a singlelocus analysis.
Initial probabilities
Inference: discrete model
Inference of the hidden state values can proceed outwards from the causal locus, using the conditional independence structure of the model. We describe the algorithms in terms of standard HMM techniques, but note that a more general treatment in terms of message passing is also possible.
The first term in the sum operates on an HMM with rightwards arrows in its graph, while the second term operates on an HMM with leftwards arrows (see Figure 2). However, the latent states form a reversible Markov chain, allowing us to reverse the arrows in the left graphical model fragment. After this transformation, the likelihood computations for all choices of the causal node x_{ c }use the same graphical structure over the latent states $\stackrel{\u20d7}{x}$ when conditioned on the causal node x_{ c }: two chains with all rightwards arrows, separated by the conditioned node x_{ c }. Using this fact, we can compute the desired likelihoods with intermediate computations from a single graphical model.
Running the forwardbackward algorithm requires considering all transitions in each chromosome block, leading to a runtime quadratic in the size of the pool: O(N^{ 2 }L). This dominates the cost for the final step of computing $\text{Pr}\left(\stackrel{\u20d7}{y}p\right)$ for all causal locus locations and a fixed p, which is O(NL). The quadratic dependence on the pool size renders the exact modeling of large pools prohibitive, motivating the continuous approximation in the next section.
Inference: continuous approximation
Since the probability distributions during inference are represented with a constant number of parameters instead of a full vector (as in the discrete case), inference is more efficient. Specifically, computing the required quantities $\text{Pr}\left({x}_{c}=j\stackrel{\u20d7}{y}\right)$ for all c requires O(L) time. This removes the dependence on the size of the pool that was present in the discrete method, allowing MultiPool to perform accurate inference in very large pools.
Statistical tests
We perform the maximization over p' numerically and calculate the likelihood ratio for all positions of the causal locus by reweighting the posterior probabilities.
Multiple experiments
The numerator is the product of two singleexperiment maximizations, while the denominator is the coupled model likelihood that was presented for replicate analysis.
Using these results, MultiPool reports log_{10} likelihood ratios (LOD scores in the genetics community), maximumlikelihood estimates (MLE) of the causal locus location, and approximate credible intervals for the location of the causal locus. Assuming a uniform prior over causal locus locations, $\mathrm{Pr}({x}_{c}\overrightarrow{y}\propto \mathrm{Pr}(\overrightarrow{y}{x}_{c})$ for a particular set of observations $\stackrel{\u20d7}{y}$. In each case we fix p at its MLE, but could alternately integrate it out. Therefore, we can compute multilocus statistics that include information from the entire dataset in experiments where multiple pools are available.
Results
Simulation results
Experimental results
Singlelocus comparisons
In cases where the associated region is localized to a single gene, we compare the LOD scores from MultiPool to a likelihood ratio computed using allele frequencies calculated by summing allele read counts in sliding windows. The data likelihoods under the causal and noncausal models are calculated according to the model in [12], with the genotyping noise calculated from the local informative read depth. We use 50bp genome segments in the dynamic Bayesian network (DBN) model and set the recombination rate in the model to the empirical average in yeast [24].
Large pool results
Localization of known associated genes in large drugselected pools
Dataset  Target  DBN dist.  1kb window dist.  10kb window dist. 

4NQO viable rep. 1  RAD5  5305  18355  14605 
4NQO viable rep. 2  RAD5  745  6195  3145 
Combined  RAD5  805  755  3145 
4NQO viable rep. 1  MKT1  3223  15127  1673 
4NQO viable rep. 2  MKT1  5223  15127  5423 
Combined  MKT1  4323  15127  5423 
Localization of known associated genes in large heatselected pools
Dataset  Target  DBN dist.  1kb window dist.  10kb window dist. 

Heat tol. rep. 1A  IRA1  10589  16739  14739 
Heat tol. rep. 1B  IRA1  10889  20689  6389 
Heat tol. rep. 2  IRA1  8889  2589  17289 
Heat tol. rep. 1A  IRA2  311  3240  511 
Heat tol. rep. 1B  IRA2  961  17670  1661 
Heat tol. rep. 2  IRA2  340  4190  2390 
Conclusion
We presented MultiPool, a computational method to map genetic elements from pooled sequencing studies. Taking advantage of recent increases in throughput, these experimental designs use sequencing to provide unbiased and laborefficient genotyping. As throughput continues to increase, similar studies will be extended to larger and more complex genomes. By including all relevant data in a unified framework, MultiPool improves the analysis of these experiments with increased accuracy and the principled estimation of association intervals. The statistical framework is most beneficial for the case where there are many noisy markers, as observed in genotyping via sequencing. In these cases, combining information across the genome is critical in reducing noise and increasing statistical power. More generally, the methods developed and applied in this work support the application of selection and pooled genotyping for experimental organisms. When experimental procedures can create medium or large allele frequency differences, the responsible genes can be mapped with great precision. These methods do not require the step of explicit polymorphism discovery or genotyping array design, yielding large time and cost savings.
Future work could replace our uniform prior over possible causal locus locations with an informative prior that uses conservation data, functional information, or other relevant data types (as in [25]). Other extensions include a more subtle handling of read mapping ambiguities and SNP calling uncertainty. One possibility is to use expected (average) counts under an erroraware probabilistic model instead of hard assignments, which should scale gracefully as certainty lowers. This could reduce MultiPool's reliance on a particular aligner and SNP calling strategy.
Abbreviations
 DBN:

dynamic Bayesian network
 HMM:

hidden Markov model
 LOD:

base 10 logarithm of odds
 MLE:

maximum likelihood estimate
 QTL:

quantitative trait locus
 RMSE:

root mean square error.
Declarations
Acknowledgements
We thank Ian Ehrenreich for sharing sequencing data and Shaun Mahony for helpful comments on the manuscript. M.D.E. was supported by an NSF Graduate Research Fellowship under grant no. 0645960.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012).
Authors’ Affiliations
References
 Michelmore RW, Paran I, Kesseli RV: Identification of markers linked to diseaseresistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc Natl Acad Sci USA. 1991, 88 (21): 98289832. 10.1073/pnas.88.21.9828.PubMed CentralView ArticlePubMedGoogle Scholar
 Ehrenreich IM, Torabi N, Jia Y, Kent J, Martis S, Shapiro JA, Gresham D, Caudy AA, Kruglyak L: Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature. 2010, 464 (7291): 10391042. 10.1038/nature08923.PubMed CentralView ArticlePubMedGoogle Scholar
 Parts L, Cubillos FA, Warringer J, Jain K, Salinas F, Bumpstead SJ, Molin M, Zia A, Simpson JT, Quail MA, Moses A, Louis EJ, Durbin R, Liti G: Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res. 2011, 21 (7): 11311138. 10.1101/gr.116731.110.PubMed CentralView ArticlePubMedGoogle Scholar
 Wenger JW, Schwartz K, Sherlock G: Bulk segregant analysis by highthroughput sequencing reveals a novel xylose utilization gene from Saccharomyces cerevisiae. PLoS Genet. 2010, 6 (5): e100094210.1371/journal.pgen.1000942.PubMed CentralView ArticlePubMedGoogle Scholar
 Andolfatto P, Davison D, Erezyilmaz D, Hu TT, Mast J, SunayamaMorita T, Stern DL: Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 2011, 21 (4): 610617. 10.1101/gr.115402.110.PubMed CentralView ArticlePubMedGoogle Scholar
 Schneeberger K, Ossowski S, Lanz C, Juul T, Petersen AH, Nielsen KL, Jørgensen J, Weigel D, Andersen SU: SHOREmap: simultaneous mapping and mutation identification by deep sequencing. Nat Methods. 2009, 6 (8): 550551. 10.1038/nmeth0809550.View ArticlePubMedGoogle Scholar
 Calvo SE, Tucker EJ, Compton AG, Kirby DM, Crawford G, Burtt NP, Rivas M, Guiducci C, Bruno DL, Goldberger OA, Redman MC, Wiltshire E, Wilson CJ, Altshuler D, Gabriel SB, Daly MJ, Thorburn DR, Mootha VK: Highthroughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat Genet. 2010, 42 (10): 851858. 10.1038/ng.659.PubMed CentralView ArticlePubMedGoogle Scholar
 Borevitz JO, Liang D, Plouffe D, Chang H, Zhu T, Weigel D, Berry CC, Winzeler E, Chory J: Largescale identification of singlefeature polymorphisms in complex genomes. Genome Res. 2003, 13 (3): 513523. 10.1101/gr.541303.PubMed CentralView ArticlePubMedGoogle Scholar
 Hazen SP, Borevitz JO, Harmon FG, PrunedaPaz JL, Schultz TF, Yanovsky MJ, Liljegren SJ, Ecker JR, Kay SA: Rapid array mapping of circadian clock and developmental mutations in Arabidopsis. Plant Physiol. 2005, 138 (2): 990997. 10.1104/pp.105.061408.PubMed CentralView ArticlePubMedGoogle Scholar
 Brauer MJ, Christianson CM, Pai DA, Dunham MJ: Mapping novel traits by arrayassisted bulk segregant analysis in Saccharomyces cerevisiae. Genetics. 2006, 173 (3): 18131816. 10.1534/genetics.106.057927.PubMed CentralView ArticlePubMedGoogle Scholar
 Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a tool for largescale association studies. Nat Rev Genet. 2002, 3 (11): 862871.View ArticlePubMedGoogle Scholar
 Jawaid A, Bader JS, Purcell S, Cherny SS, Sham P: Optimal selection strategies for QTL mapping using pooled DNA samples. Eur J Hum Genet. 2002, 10 (2): 125132. 10.1038/sj.ejhg.5200771.View ArticlePubMedGoogle Scholar
 Macgregor S, Zhao ZZ, Henders A, Nicholas MG, Montgomery GW, Visscher PM: Highly costefficient genomewide association studies using DNA pools and dense SNP arrays. Nucleic Acids Res. 2008, 36 (6): e3510.1093/nar/gkm1060.PubMed CentralView ArticlePubMedGoogle Scholar
 Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK: Effect of readmapping biases on detecting allelespecific expression from RNAsequencing data. Bioinformatics. 2009, 25 (24): 32073212. 10.1093/bioinformatics/btp579.PubMed CentralView ArticlePubMedGoogle Scholar
 McPeek MS, Strahs A: Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to finescale genetic mapping. Am J Hum Genet. 1999, 65 (3): 858875. 10.1086/302537.PubMed CentralView ArticlePubMedGoogle Scholar
 Morris AP, Whittaker JC, Balding DJ: Bayesian finescale mapping of disease loci, by hidden Markov models. Am J Hum Genet. 2000, 67: 155169. 10.1086/302956.PubMed CentralView ArticlePubMedGoogle Scholar
 Johnson T: Bayesian method for gene detection and mapping, using a case and control design and DNA pooling. Biostatistics. 2007, 8 (3): 546565.View ArticlePubMedGoogle Scholar
 Homer N, Tembe WD, Szelinger S, Redman M, Stephan DA, Pearson JV, Nelson SF, Craig D: Multimarker analysis and imputation of multiple platform poolingbased genomewide association studies. Bioinformatics. 2008, 24 (17): 18961902. 10.1093/bioinformatics/btn333.PubMed CentralView ArticlePubMedGoogle Scholar
 Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform. Bioinformatics. 2010, 26 (5): 589595. 10.1093/bioinformatics/btp698.PubMed CentralView ArticlePubMedGoogle Scholar
 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 20782079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
 Bishop CM: Pattern Recognition and Machine Learning. 2007, SpringerGoogle Scholar
 Murphy K: Filtering, Smoothing and the Junction Tree Algorithm. University of California, Berkeley Technical Report. 1999,http://www.cs.ubc.ca/~murphyk/Papers/smooth.ps.gz,Google Scholar
 Ghahramani , Hinton GE: Parameter estimation for linear dynamical systems. University of Toronto Technical Report. 1996, 6 (CRGTR962): 16.http://www.cs.toronto.edu/~hinton/absps/tr962.pdf,Google Scholar
 Mancera E, Bourgon R, Brozzi A, Huber W, Steinmetz LM: Highresolution mapping of meiotic crossovers and noncrossovers in yeast. Nature. 2008, 454 (7203): 479485. 10.1038/nature07135.PubMed CentralView ArticlePubMedGoogle Scholar
 Lee S, Dudley AM, Drubin D, Silver PA, Krogan NJ, Pe'er D, Koller D: Learning a prior on regulatory potential from eQTL data. PLoS Genet. 2009, 5: e100035810.1371/journal.pgen.1000358.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.