A variational Bayes algorithm for fast and accurate multiple locus genomewide association analysis
 Benjamin A Logsdon^{1},
 Gabriel E Hoffman^{1} and
 Jason G Mezey^{1, 2}Email author
DOI: 10.1186/147121051158
© Logsdon et al; licensee BioMed Central Ltd. 2010
Received: 10 August 2009
Accepted: 27 January 2010
Published: 27 January 2010
Abstract
Background
The success achieved by genomewide association (GWA) studies in the identification of candidate loci for complex diseases has been accompanied by an inability to explain the bulk of heritability. Here, we describe the algorithm VBay, a variational Bayes algorithm for multiple locus GWA analysis, which is designed to identify weaker associations that may contribute to this missing heritability.
Results
VBay provides a novel solution to the computational scaling constraints of most multiple locus methods and can complete a simultaneous analysis of a million genetic markers in a few hours, when using a desktop. Using a range of simulated genetic and GWA experimental scenarios, we demonstrate that VBay is highly accurate, and reliably identifies associations that are too weak to be discovered by singlemarker testing approaches. VBay can also outperform a multiple locus analysis method based on the lasso, which has similar scaling properties for large numbers of genetic markers. For demonstration purposes, we also use VBay to confirm associations with gene expression in cell lines derived from the Phase II individuals of HapMap.
Conclusions
VBay is a versatile, fast, and accurate multiple locus GWA analysis tool for the practitioner interested in identifying weaker associations without high false positive rates.
Background
Genomewide association (GWA) studies have identified genetic loci associated with complex diseases and other aspects of human physiology [1, 2]. All replicable associations identified to date have been discovered using GWA analysis techniques that analyze one genetic marker at a time [3]. While successful, it is well appreciated that singlemarker analysis strategies may not be the most powerful approaches for GWA analysis [4]. Multiple locus inference is an alternative to singlemarker GWA analysis that can have greater power to identify weaker associations, which can arise due to small allelic effects, low minor allele frequencies (MAF), and weak correlations with genotyped markers [4]. By correctly accounting for the effects of multiple loci, such approaches can reduce the estimate of the error variance, which in turn increases the power to detect weaker associations for a fixed sample size. Since loci with weaker associations may contribute to a portion of the socalled 'missing' or 'dark' heritability [5–7], multiple locus analyses have the potential to provide a more complete picture of heritable variation.
Methods for multiple locus GWA analysis must address a number of problems, including 'overfitting' where too many associations are included in the genetic model, as well as difficulties associated with model inference when the number of genetic markers is far larger than the sample size [8]. Two general approaches have been suggested to address these challenges: hierarchical models and partitioning/classification. Hierarchical modeling approaches [9–14] employ an underlying regression framework to model multiple markerphenotype associations and use the hierarchical model structure to implement penalized likelihood [10], shrinkage estimation [15], or related approaches to control overfitting. These methods have appealing statistical properties for GWA analysis when both the sample size and the number of true associations expected are far less than the number of markers analyzed, which is generally considered a reasonable assumption in GWA studies [8]. Alternatively, partitioning methods do not (necessarily) assume a specific form of the markerphenotype relationships but rather assume that markers fall into nonoverlapping classes, which specify phenotype association or no phenotype association [13, 16]. Control of model overfitting in high dimensional GWA marker space can then be achieved by appropriate priors on marker representation in these classes [13].
Despite the appealing theoretical properties of multiple locus methods that make use of hierarchical models or partitioning, these methods have not seen wide acceptance for GWA analysis. There are at least two reasons for this. First, an ideal multiple locus analysis involves simultaneous assessment of all markers in a study and, given the scale of typical GWA experiments, most techniques are not computationally practical options [9, 10, 16–18]. Second, there are concerns about the accuracy and performance of multiple locus GWA analysis. This is largely an empirical question that needs to be addressed with simulations and analysis of real data.
Here we introduce the algorithm VBay, a (V)ariational method for (Bay)esian hierarchical regression, that can address some of the computational limitations shared by many multiple locus methods [9, 10, 16–18]. The variational Bayes algorithm of VBay is part of a broad class of approximate inference methods, which have been successfully applied to develop scalable algorithms for complex statistical problems, in the fields of machine learning and computational statistics [19–22]. The specific type of variational method implemented in VBay is a meanfield approximation, where a high dimensional joint distribution of many variables (in this case genetic marker effects) is approximated by a product of many lower dimensional distributions [23]. This method is extremely versatile and can be easily extended to a range of models proposed for multiple locus analysis [4, 11, 14, 24].
The specific model implemented in VBay is a hierarchical linear model, which includes marker class partitioning control of model overfitting. This is particularly well suited for maintaining a low falsepositive rate when identifying weaker associations [13]. VBay implements a simultaneous analysis of all markers in a GWA study and, since the computational time complexity per iteration of VBay is linear with respect to sample size and marker number, the algorithm has fast convergence. For example, simultaneous analysis of a million markers, genotyped in more than a thousand individuals, can be completed using a standard desktop (with large memory capacity) in a matter of hours.
We take advantage of the computational speed of VBay to perform a simulation study of performance, for GWA data ranging from a hundred thousand to more than a million markers. In the Results we focus on the simulation results for single population simulations, but we also implement a version of the algorithm to accommodate known population structure and missing genotype data. We demonstrate that in practice, VBay consistently and reliably identifies both strong marker associations, as well as those too weak to be identified by singlemarker analysis. We also demonstrate that VBay can outperform a recently proposed multiple locus methods that uses the least absolute shrinkage and selection operator (lasso) penalty [14], a theoretically well founded and widely accepted method for high dimensional model selection. VBay therefore provides a powerful complement to singlemarker analysis for discovering weaker associations that may be responsible for a portion of missing heritability.
Results and Discussion
The VBay Algorithm
The VBay algorithm consists of two components: a hierarchical regression model with marker class partitioning and a variational algorithm for approximate Bayesian inference. The underlying hierarchical model of VBay is a Bayesian mixture prior regression [25] that has been previously applied to association and mapping problems [13]. The regression portion of this hierarchical model is a standard regression used to model genetic markerphenotype associations, and allows for natural incorporation of population structure and other covariates. The model partitioning incorporates global features of genetic marker associations, which are assumed to be distributed among positive, negative, and zero effect classes. The zero effect class is used to provide a parametric representation of the assumption that most markers in GWA studies will not be linked to causative alleles and therefore do not have true associations with phenotype [13].
Approximate Bayesian inference with VBay is accomplished by an algorithm adapted from variational Bayes methods [26]. As with other variational Bayes methods, the goal of VBay is to approximate the joint posterior density of the hierarchical regression model with a factorized form and then to minimize the KullbackLiebler (KL) divergence between the factorized form and the full posterior distribution [27]. This is accomplished by taking the expectation of the log joint posterior density, with respect to each parameter's density from the factorized form, and iterating until convergence [23]. The overall performance of VBay will depend on how well the factorized form approximates an informative mode of the posterior distribution of the hierarchical model. We have chosen a factorization with respect to each regression and hierarchical parameter, which appears to perform extremely well for identifying weak associations when analyzing simulated GWA data that include large numbers of genetic markers.
Computational speed
The computational efficiency of VBay derives from two properties: it is a deterministic algorithm and the objective function has a factorized form. Since VBay is deterministic it does not need the long runs of Markov chains required by exact Bayesian MCMC algorithms [28]. For GWA analysis, these latter stochastic algorithms can be very slow to converge, particularly when marker numbers are large and when there are complex marker correlations produced by linkage disequilibrium [8]. The factorized form of VBay means that the minimization is performed with respect to each parameter independently, where each iterative update satisfies consistency conditions for maximizing the lower bound, given the state of the other parameters. Unlike univariate update algorithms, which may not necessarily have efficient updates with respect to the likelihood gradient function [4], the consistency conditions produced by the factorized form ensure that the univariate updates produce a computationally efficient approach to a KLdivergence minimum.
More precisely, VBay has linear time complexity scaling with respect to both marker number and sample size per iteration (Additional file 1, Methods). VBay therefore has better computational scaling properties than most currently proposed multiple locus algorithms for full likelihood or exact MCMC Bayesian analysis, when simultaneously considering all markers in a GWA study [9, 10, 16–18]. While the total time to convergence will depend on the true underlying genetic model, total computational times appear to be very tractable. As an example, using a dualquad core Xeon 2.8 Ghz, with 16 Gb of memory, VBay converges in less than four hours for data sets in the range of 1 million markers, for a sample size of 200, and has average convergence around ten hours for sample sizes of 1000.
Significance thresholds
Performance of VBay compared to singlemarker analysis
Components and range of values used to simulate GWA data.
Component  Values 

sample  200 or 1000 
markers  0.1 to 1.0 million 
missing  0% or 2% 
loci  4, 8, or 32 
effects  gamma(2,1) or fixed 
heritability  0.5 or 0.9 
populations  one or four 
Comparison of VBay and singlemarker GWA analysis of simulated data for 1 million markers.
sample  loci  (min/max)^{ a } 
 VBay min 

 singlemarker min 


200  4  0.24 (0.0032/0.75)  0.83  0.026  98.9  0.55  0.16  87.4 
200  32  0.028 (6.7e5/0.28)  0.053  0.033  26.9  0.072  0.050  35.3 
1000  4  0.23 (0.0050/0.65)  1.00  0.0050  100  0.78  0.045  98.7 
1000  32  0.028 (8.3e5/0.30)  0.61  0.0037  95.6  0.32  0.0099  78.2 
VBay performance is a direct function of the individual heritabilities, and not the total heritability of the phenotype. The individual heritability is defined by both the minor allele frequency and the effect size (see Methods). Therefore loci with large effects may still have low individual heritabilities if the minor allele frequencies of the true loci are low (or vice versa). For example, for our simulations where the total heritability was controlled to be 0.5, and the individual heritabilities were shifted to be smaller overall, VBay performance was far closer to singlemarker analysis. When we increased the individual heritabilities associated with associations in these simulations, while holding the total heritability at 0.5, VBay can outperform singlemarker analysis. For all simulations, when an individual heritability falls below a certain threshold, neither approach could detect the association. There exists a limit to how weak an association can be and still be detected by VBay, given the sample size of the GWA study. Even in the worst case scenarios simulated, with many loci with small individual heritabilities and a small sample size, the performance of VBay was not significantly different from singlemarker analysis across simulations. This result suggests that even if the number of loci were increased (i.e. the average individual heritability was decreased), the performance of VBay would at worst be the same as singlemarker analysis.
Comparison to the Lasso
The VBay algorithm was compared to the lasso, one of the only other currently proposed multiple locus methods that make use of a hierarchical regression model and have similar scaling properties to VBay [14]. For comparison to VBay, we use a form that implements a lasso type penalty [30], based on the algorithm presented in Wu et al. [14], modified to allow continuous phenotypes.
Power comparison for VBay, the lasso, and singlemarker GWA analysis from simulated data with 100,000 markers.
sample  loci  VBay  the lasso  singlemarker 

200  4  90.0%  87.5%  47.5% 
200  32  14.1%  4.69%  7.19% 
1000  4  97.5%  77.5%  60.0% 
1000  32  80.6%  65.0%  33.1% 
Genomewide association analysis of HapMap gene expression
To investigate the empirical properties of VBay, we performed a GWA analysis on gene expression levels measured in eternal lymphoblastoid cell lines, generated from the 210 unrelated individuals of Phase II of the International HapMap project [31]. Individuals in this sample were genotyped for upwards of 3.1 million SNPs and were derived from four populations: Caucasian with European origin (CEU), Chinese from Beijing (CHB), unrelated Japanese from Tokyo (JPT), and Yoruba individuals from Ibadan, Nigeria (YRI) [32]. In the original GWA analysis of these data, Stranger et al. used a singlemarker testing approach, considering each population independently, and limiting the analysis to SNPs in the cisregions of each gene to control the level of multiple test correction [31].
Using a version of VBay that accounts for population structure and missing genotype data, we analyzed the pooled data from these populations. We did not limit the analysis to cisregions, although we did limit our analyses to SNPs with MAF > .10, leaving 1.03 million markers genomewide. To minimize computational cost, we also limited our analysis to the 100 expression probes Stranger et al. found to have the most significant associations, and an additional 20 probes with the largest residual variance, after correcting for population structure. For comparison, we also applied a singlemarker analysis to these pooled data, for the 120 expression probes, incorporating a covariate to account for population structure.
Conclusions
VBay addresses computational efficiency and performance concerns associated with many multiple locus GWA algorithms. While VBay currently utilizes a hierarchical partitioning model, the same approach could be used to implement scalable algorithms for a wide range of models. For example, different shrinkage or penalization models such as the lasso [11, 14], ridge regression [24], or a normal exponential gamma distribution penalty [4] are easily implemented by removing the partitioning and substituting the appropriate prior distribution. Further, the variational Bayes method used for computation does not require specific closed form integrals arising from hyperparameter distributions, which characterize many of the proposed algorithms for full penalizedlikelihood or Bayesian GWA analysis [4, 11, 24]. There is therefore the potential for developing an entire class of scalable multiple locus algorithms for GWA analysis that could be tuned for different genetic and experimental conditions within the VBay framework.
Methods
VBay Algorithm
where y_{ i }is the phenotype of the i^{ th }individual, μ is the sample mean, x_{ ij }is the genotype of the j^{ th }marker of the i^{ th }individual, β_{ j }is the effect of the j^{ th }marker, and e_{ i }~ N . While we limit the current presentation of the model to continuous traits with normal error, more complex error structures and extensions to discrete traits is straightforward. Because (1) is a linear model, it can be easily expanded to test for dominance or epistasis using a standard mapping approach. In addition, confounding factors such as population structure can be accounted for by the addition of covariates. The effects of these additional covariates can be modeled within the hierarchical regression framework or can be treated simply as nuisance parameters and given uninformative priors. We used an uninformative prior for the error parameter, , and a constant (improper) prior for the mean parameter μ.
In our analyses we chose an uninformative Dirichlet prior by setting the parameters θ_{ β }, ϕ_{ β }, ψ_{ β }all to one. The hyperparameters and reflect the partitioning aspect of the model. Within the positive and negative partitions, the population variance parameters ( and ) have priors. This choice of prior for the regression coefficients in the positive and negative effect classes increases the robustness to outliers. Assuming the number of markers in the GWA data set, m, is greater than the sample size, n, we truncate the Dirichlet distribution such that , where the truncation puts a lower bound on the harshness of shrinkage [8]. We found this truncation very important when considering data sets with large numbers of markers. Without truncation, the evidence in the data is too weak to enforce harsh enough shrinkage for desirable model selection.
and then minimizing the KLdivergence between the factorized and full form. Equation (5) is a natural factorization for the VBay hierarchical model since most of the priors are conjugate. The posterior factorized distributions all have closed form expressions and each parameter is completely characterized by an expected sufficient statistic [27] (Additional file 1, Methods). The algorithm is therefore equivalent to updating these expected sufficient statistics.
with C some normalizing constant, and E_{θ}indicating expectation of the log of equation (4) with respect to every other parameter's factorized distribution, except q (θ). This defines a system of equations which can be iterated through until convergence [23, 27]. With the factorized form, it is a simple matter to demonstrate the time complexity of VBay is (nm) per iteration (Additional file 1, Methods).
VBay Convergence
The factorization of equation (4) is used to define a function ℒ(θ) which lower bounds the log posterior probability of the data (i.e. the probability of the observed data after integrating out all parameters in the model). The lower bound ℒ(θ) is defined as the expectation of the log of equation (4) with respect to every factorized distribution plus the entropy of each factorized distribution. In the full form, the convergence of VBay to a local maximum of the lower bound ℒ(θ) is guaranteed because of the convexity of ℒ(θ) with respect to each parameter's approximate posterior distribution [33]. In the described implementation we used an approximation for some higher order expectation terms that we found increased computational efficiency (Additional file 1, Methods).
Given that global convergence to a single stationary point is not guaranteed [26], the standard practice is to use multiple parameter initializations. We found that with random initializations of expectations of β_{ j }, VBay finds local modes that correspond to overfit (underdetermined) models, while with initializations of only a few nonzero expectations of β_{ j }'s, VBay tends to update these values close to zero before converging. We therefore use the approach of setting all expectations of β_{ j }parameters equal to zero as a starting point for all runs of VBay, an approach that has precedent in simultaneous marker analysis [4]. This also corresponds to appropriate starting estimates given our prior assumption that not too many markers are associated with a phenotype.
We have found that the order in which the parameters are updated can affect local convergence, particularly when there is missing genetic data. In general, the different association models we found using different orderings were not widely different from one another, often differing in whether they included one or two specific associations. For cases where we found ordering did make a difference, we ran VBay with multiple random orderings and used the conservative criteria of considering only associations found to be significant in at least 80% of the cases to be true positives for all simulations and data analyses compared to singlemarker analysis. The cutoff of 80% corresponds directly to a false discovery rate of 0%. We also considered a less stringent cutoff and an observed false discovery rate of 5% in the comparison to the lasso.
VBay Software
An implementation of VBay is available at http://mezeylab.cb.bscb.cornell.edu/Software.aspx. The software has basic control parameters available to the user and only requires tab delimited genotype and phenotype files as input. The algorithm itself consists of the following steps: 1) randomize marker ordering, 2) initialize the expected sufficient statistics and expectations of parameters, 3) update the expected sufficient statistics for a particular parameter, given the expectations of all the other parameters, 4) update the expectations of a particular parameter given the expectations of all the other parameters, 5) repeat steps 3 and 4 for all the parameters in the model, 6) check convergence based on the current estimate of the lower bound, ℒ(θ). Further functional details are presented in Additional file 1, Tables S3S9. The main output from the algorithm is the log_{10} of pvbay = p_{j+}+ p_{j}statistic for each marker, which can be used to assess significance of a marker association.
The Lasso
where ℓ(βY) is the loglikelihood for the relevant generalized linear model. By penalizing the magnitude of each β_{ j }coefficient, MAP estimates shrink the coefficient values compared to the estimates under the unpenalized model. This shrinkage causes most coefficients to be exactly zero, so that only very few markers are selected to be nonzero for a single value of λ. This penalty produces a convex loglikelihood surface with a single maximum even for underdetermined systems (i.e. when there are more markers than samples). Therefore, the lasso can jointly consider all markers in a single model and simultaneously account for variance in the response caused by multiple markers. The lasso model is fit for multiple values of λ and a single subset of coefficients is selected to be nonzero by 10fold crossvalidation. Confidence scores are obtained for each selected marker by comparing an unpenalized model with all selected markers to a model that omits each marker in turn. An Ftest is performed for each marker, but note that these confidence scores cannot be interpreted as typical pvalues since they are obtained from a two step procedure. Algorithmic details for fitting the LASSO model for the linearGaussian case are provided by [35, 36].
Simulation Study
GWA data were simulated under the set of conditions listed in Table 1. The genomic marker data were generated using MaCS [29], a scalable approximate coalescent simulator, using the default approximation tree width. For the comparison to singlemarker analysis, three basic types of genotype data sets were simulated. For the first and second type, 0.5 Gb of DNA was simulated from a single diploid population with N_{ e }= 10000, the population scaled mutation rate 4N_{ e }μ = θ = 0.001, and the genomewide population scaled recombination rate 4N_{ e }κ = ρ = .00045, values taken from Voight et al. [37]. Samples of 200 and 1000 were sampled screening the minor allele frequency (MAF) to be 0.10, leaving more than onemillion markers for analysis. For the third type, 200 diploid samples of 0.5 Gb were simulated from a simple four population migration model. The approximation , as observed in the overall Phase I HapMap analysis [38], was used to determine the population per generation migration rate for a simple symmetric island migration model, with populations of equal size. After screening MAF to be > 0.10, this left over 660 thousand markers for analysis. The final data included the addition of 2% missing data.
where f_{ j }is the MAF of locus j, β_{ j }is the additive effect of the locus j, and is the total phenotypic variance of the trait.
GWA analysis of the simulated data were performed using both VBay and a linear regression singlemarker analysis. When population structure was incorporated, the linear model (1) becomes a fixed effect ANOVA model, for both VBay and the singlemarker analysis. The population means in VBay were treated as having normal priors centered on zero with a very large variance (τ = 1000), and were updated in a similar fashion as the other parameters in the VBay algorithm. The VBay algorithm was run until the tolerance for the likelihood portion of the lower bound ℒ(θ) was < 10^{9}. For the simulations with missing data, the minor allele frequency across loci (f_{ j }∀j) was estimated given the observed genotype data, and then the missing data points were sampled from a Bin(n = 2, f_{ j }), i.e. assuming HardyWeinberg equilibrium, for both VBay and singlemarker analysis. We did random resampling of missing data to test the robustness of the output of VBay and the singlemarker analysis (Additional file 1, Methods).
The false positive and true positive rates were calculated for each set of replicate simulations. Care was taken to account for the effect of linkage disequilibrium on the test statistics, for both VBay and singlemarker analysis. A simple window was computed around each marker to determine when the r^{2} decayed to 0.4. The cutoff of 0.4 was used to be as generous to singlemarker analysis as possible. Any marker in this window was considered a true positive. In the case where multiple recombination events occurred recently between different ancestral lineages, multiple blocks of markers in linkage disequilibrium were generated, that were separated by markers in low linkage disequilibrium. In these cases, a conservative rule for evaluating a true positive was implemented. If a marker had a pvbay> 0.99, or log_{10} pvalue for the singlemarker analysis in excess of the Bonferroni correction, and the r^{2} between the significant genetic marker and the true location was greater than 0.4, then the marker was considered a true positive.
For the comparison between VBay, the lasso, and single marker analyses, onehundred thousand markers and samples sizes of 200 or 1000 for a single population were simulated (the reduced number of markers for these simulations was used to conserve CPU cycles). The genetic architectures were simulated as with the larger scale simulations, but with only 4 or 32 loci being sampled randomly from the onehundred thousand markers, and effects sampled from a Γ(2, 1) distributions for 10 replicated data sets. Eight random reorderings of the markers were used with the VBay analysis, and the false discovery rate for VBay was controlled based on the consensus of associations found across reorderings with pvbay> 0.99 (e.g. a false discovery rate of 5% corresponded to an association being found in at least 3 out of the 8 reorderings). The false discovery rate for the lasso (using Fstatistics) and singlemarker analysis were controlled based on the pvalues computed for each method respectively.
Data Analysis
We performed a GWA analysis for gene expression levels measured in the eternal lymphoblastoid cell lines that were generated for the 210 unrelated individuals of Phase II of the International HapMap project [31]. This sample included 60 individuals sampled from Utah of European descent (CEU), 45 individuals sampled from Han Chinese population (CHB), 45 individuals sampled from Japanese population (JPT), and 60 individuals sampled from the Yoruban population in Africa (YRI). Expression data for these lines were available for 47,000 probes for (~17,000 genes) assayed with the Illumina bead array. For our analyses, we screened for MAF > 0.10 in all populations which left 1.03 * 10^{6} SNPs on chromosomes 1 to 22. The X and Y chromosomes were not analyzed by Stranger et al. and we ignored these chromosomes in our analyses as well. Stranger et al. [31] reported 879 gene expression probes with highly significant ciseQTL associations, found by testing within populations, where every SNP in a 2 Mb window around each gene was analyzed. We performed a GWA analysis, with both VBay and a singlemarker regression, for their top 100 most significant expression probes. We combined genotypic data across populations, where we accounted for the effect of population structure in each case by including appropriate covariates. We also tested the top 20 probes, not in their association list that had the largest residual variance after correcting for population structure. Only 120 expression probes were analyzed to conserve CPU cycles; all 879 could easily be analyzed in a future study. The total missing data for this SNP set was 1.78%. We accounted for missing data using the same approach as with our simulated data analysis.
Availability and Requirements
Both binaries and source code for the VBay software are available at the following URL: http://mezeylab.cb.bscb.cornell.edu/Software.aspx. The source code is released under the GNU General Public License http://www.gnu.org/licenses/. The binary was compiled for 32bit architecture on Ubuntu 8.04 http://www.ubuntu.com/ using the compiler gcc http://gcc.gnu.org/ and the GNU scientific library http://www.gnu.org/software/gsl/. To recompile from source both gcc and GSL are required. Documentation describing how to use VBay as well as example data sets are also available at http://mezeylab.cb.bscb.cornell.edu/Software.aspx.
Abbreviations
 VBay:

Variational Bayes algorithm for genomewide association analysis
 GWA(S):

GenomeWide Association (Study)
 MAF:

Minor Allele Frequency
 eQTL:

expressionQuantitative Trait Loci
 ANOVA:

Analysis of Variance
 EM:

ExpectationMaximization algorithm
 SNP:

Single Nucleotide Polymorphism
 ROC curve:

Receiver Operating Characteristic curve
 Lasso:

Least Absolute Shrinkage and Selection Operator.
Declarations
Acknowledgements
We thank Cornell University for funding for JGM, the Cornell Center for Vertebrate Genomics and the Cornell Provost Fund for support of BAL. We would like to thank Kirk Lohmueller, Adam Siepel, Brian White, Keyan Zhao, and Nadia Singh for insightful comments on the manuscript. We would also like to thank two anonymous reviewers for suggestions which strengthened the overall quality of the manuscript. This research was conducted using the resources of the Cornell University Center for Advanced Computing.
Authors’ Affiliations
References
 Donnelly P: Progress and challenges in genomewide association studies in humans. Nature 2008, 465(7223):728–731. 10.1038/nature07631View ArticleGoogle Scholar
 Hindorff L, Junkins H, Mehta J, Manolio T: A Catalog of Published GenomeWide Association Studies.[http://www.genome.gov/gwastudies] Accessed 2009
 McCarthy M, Abecasis G, Cardon L, Goldstein D, Little J, Ioannidis J, Hirschhorn J: Genomewide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics 2008, 9(5):356–369. 10.1038/nrg2344View ArticlePubMedGoogle Scholar
 Hoggart C, Whittaker J, De lorio M, Balding D: Simultaneous analysis of all SNPs in genomewide and resequencing association studies. PLoS Genetics 2008, 4(7):e1000130. 10.1371/journal.pgen.1000130View ArticlePubMedPubMed CentralGoogle Scholar
 Iyengar S, Elston R: The genetic basis of complex traits: rare variants or "common gene, common disease"? Methods in molecular biology (Clifton, NJ) 2007, 376: 71. full_textView ArticleGoogle Scholar
 Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M: Mapping complex disease traits with global gene expression. Nature Reviews Genetics 2009, 10(3):184–194. 10.1038/nrg2537View ArticlePubMedPubMed CentralGoogle Scholar
 Maher B: Personal genomes: The case of the missing heritability. Nature 2008, 456(7218):18. 10.1038/456018aView ArticlePubMedGoogle Scholar
 Zhang M, Zhang D, Wells M: Variable selection for large p small n regression models with incomplete data: mapping QTL with epistasis. BMC Bioinformatics 2008., 9(251):
 Yi N, Banerjee S: Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics 2009, 181(3):1101–1113. 10.1534/genetics.108.099556View ArticlePubMedPubMed CentralGoogle Scholar
 Yi N, Shriner D: Advances in Bayesian multiple quantitative trait loci mapping in experimental crosses. Heredity 2008, 100(3):240–252. 10.1038/sj.hdy.6801074View ArticlePubMedGoogle Scholar
 Yi N, Xu S: Bayesian Lasso for quantitative trait loci mapping. Genetics 2008, 179(2):1045–1055. 10.1534/genetics.107.085589View ArticlePubMedPubMed CentralGoogle Scholar
 Liu J, Liu Y, Liu X, Deng H: Bayesian mapping of quantitative trait loci for multiple complex traits with the use of variance components. Am J Hum Genet 2007, 81(2):304–320. 10.1086/519495View ArticlePubMedPubMed CentralGoogle Scholar
 Zhang M, Montooth K, Wells M, Clark A, Zhang D: Mapping multiple quantitative trait loci by Bayesian classification. Genetics 2005, 169(4):2305–2318. 10.1534/genetics.104.034181View ArticlePubMedPubMed CentralGoogle Scholar
 Wu T, Chen Y, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25(6):714. 10.1093/bioinformatics/btp041View ArticlePubMedPubMed CentralGoogle Scholar
 Xu S: Estimating polygenic effects using markers of the entire genome. Genetics 2003, 163(2):789–801.PubMedPubMed CentralGoogle Scholar
 Zhang Y, Liu J: Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics 2007, 39(9):1167–1173. 10.1038/ng2110View ArticlePubMedGoogle Scholar
 Cordell H, Clayton D: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. The American Journal of Human Genetics 2002, 70: 124–141. 10.1086/338007View ArticlePubMedGoogle Scholar
 Evans D, Marchini J, Morris A, Cardon L: Twostage twolocus models in genomewide association. PLoS Genet 2006, 2(9):e157. 10.1371/journal.pgen.0020157View ArticlePubMedPubMed CentralGoogle Scholar
 Girolami M: A variational method for learning sparse and overcomplete representations. Neural Computation 2001, 13(11):2517–2532. 10.1162/089976601753196003View ArticlePubMedGoogle Scholar
 Hermosillo G, Chefd'Hotel C, Faugeras O: Variational methods for multimodal image matching. International Journal of Computer Vision 2002, 50(3):329–343. 10.1023/A:1020830525823View ArticleGoogle Scholar
 Jaakkola T, Jordan M: Bayesian parameter estimation via variational methods. Statistics and Computing 2000, 10: 25–37. 10.1023/A:1008932416310View ArticleGoogle Scholar
 Blei D, Jordan M: Variational inference for Dirichlet process mixtures. Bayesian Analysis 2006, 1: 121–144. 10.1214/06BA104View ArticleGoogle Scholar
 Bishop CM: Pattern recognition and machine learning. New York: Springer Science; 2006.Google Scholar
 Malo N, Libiger O, Schork N: Accommodating linkage disequilibrium in geneticassociation analyses via ridge regression. The American Journal of Human Genetics 2008, 82(2):375–385. 10.1016/j.ajhg.2007.10.012View ArticlePubMedGoogle Scholar
 George E, McCulloch R: Variable selection via Gibbs sampling. Journal of the American Statistical Association 1993, 88(423):881–889. 10.2307/2290777View ArticleGoogle Scholar
 Wainwright M, Jordan M: Graphical models, exponential families, and variational methods. In New Directions in Statistical Signal Processing. Volume 2005. MIT Press; 2003:138.Google Scholar
 Beal M: Variational algorithms for approximate Bayesian inference. PhD thesis. University of London; 2003.Google Scholar
 Gelman A, Carlin J, Stern H, Rubin D: Bayesian data analysis. Boca Raton, Florida: Chapman and Hall; 2004.Google Scholar
 Chen G, Marjoram P, Wall J: Fast and flexible simulation of DNA sequence data. Genome Res 2009, 19: 136–142. 10.1101/gr.083634.108View ArticlePubMedPubMed CentralGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 58: 267–288.Google Scholar
 Stranger B, Forrest M, Dunning M, Ingle C, Beazley C, Thorne N, Redon R, Bird C, de Grassi A, Lee C, TylerSmith C, Carter N, Scherer S, Tavare S, Deloukas P, Hurles M, Dermitzakis E: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848–853. 10.1126/science.1136678View ArticlePubMedPubMed CentralGoogle Scholar
 International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851–861. 10.1038/nature06258View ArticleGoogle Scholar
 Boyd S, Vandenberghe L: Convex opimization. New York: Cambridge University Press New York; 2004.View ArticleGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 267–288.Google Scholar
 Wu T, Lange K: Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2008, 2: 224–244. 10.1214/07AOAS147View ArticleGoogle Scholar
 Friedman J, Hastie T, Hofling H, Tibshirani R: Pathwise coordinate optimization. Annals of Applied Statistics 2007, 1(2):302–332. 10.1214/07AOAS131View ArticleGoogle Scholar
 Voight B, Adams A, Frisse L, Qian Y, Hudson R, Di Rienzo A: Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proceedings of the National Academy of Sciences 2005, 102(51):18508–18513. 10.1073/pnas.0507325102View ArticleGoogle Scholar
 Altshuler D, Brooks L, Chakravarti A, Collins F, Daly M, Donnelly P: A haplotype map of the human genome. Nature 2005, 437(7063):1299–1320. 10.1038/nature04226View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.