V-Bay Algorithm
The V-Bay algorithm consists of two components: a hierarchical regression model with marker class partitioning and a variational Bayes computational algorithm. The hierarchical regression is adapted directly from Zhang et al. [13] with minor alterations. The first level of the hierarchical regression model for a sample of n individuals with m markers is a standard multiple regression model:

y_i = μ + Σ_{j=1}^{m} x_ij β_j + e_i,   (1)

where y_i is the phenotype of the ith individual, μ is the sample mean, x_ij is the genotype of the jth marker of the ith individual, β_j is the effect of the jth marker, and e_i ~ N(0, σ²_e). While we limit the current presentation of the model to continuous traits with normal error, more complex error structures and extensions to discrete traits are straightforward. Because (1) is a linear model, it can easily be expanded to test for dominance or epistasis using a standard mapping approach. In addition, confounding factors such as population structure can be accounted for by the addition of covariates. The effects of these additional covariates can be modeled within the hierarchical regression framework or can be treated simply as nuisance parameters and given uninformative priors. We used an uninformative prior p(σ²_e) ∝ 1/σ²_e for the error variance, σ²_e, and a constant (improper) prior for the mean parameter μ.
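As a minimal illustration (our own sketch, not the V-Bay software; all variable names are hypothetical), the first-level model (1) can be simulated directly with the {0, 1, 2} genotype encoding used in the Simulation Study below:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 200, 1000                       # individuals, markers
    maf = rng.uniform(0.1, 0.5, size=m)    # minor allele frequencies
    X = rng.binomial(2, maf, size=(n, m))  # genotypes under Hardy-Weinberg
    beta = np.zeros(m)                     # most markers have zero effect
    beta[rng.choice(m, size=4, replace=False)] = rng.gamma(2.0, 1.0, size=4)
    mu, sigma_e = 0.0, 1.0                 # noise fixed for simplicity,
    y = mu + X @ beta + rng.normal(0.0, sigma_e, size=n)  # equation (1)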
The second level of the hierarchical model consists of a partitioning of markers into positive, negative, and zero effect classes and the prior control over the distributions of these classes. The partitioning is accomplished by modeling each of the regression coefficients using mixture prior distributions:

β_j ~ p_β+ N+(0, σ²_β+) + p_β− N−(0, σ²_β−) + p_β0 I_0(β_j),   (2)

where I_0(β_j) is an indicator function for β_j having a value of zero, and N+ and N− are positive and negative truncated normal distributions [13]. The priors on the population distribution of positive and negative effect probability hyperparameters (p_β+ and p_β−) are:

(p_β+, p_β−, p_β0) ~ Dirichlet(θ_β, ϕ_β, ψ_β).   (3)
In our analyses we chose an uninformative Dirichlet prior by setting the parameters θ_β, ϕ_β, and ψ_β all to one. The hyperparameters p_β+ and p_β− reflect the partitioning aspect of the model. Within the positive and negative partitions, the population variance parameters (σ²_β+ and σ²_β−) have inverse-gamma priors. This choice of prior for the regression coefficients in the positive and negative effect classes increases the robustness to outliers. Assuming the number of markers in the GWA data set, m, is greater than the sample size, n, we truncate the Dirichlet distribution such that the probability of the zero effect class, p_β0, cannot fall below a fixed bound, where the truncation puts a lower bound on the harshness of shrinkage [8]. We found this truncation very important when considering data sets with large numbers of markers. Without truncation, the evidence in the data is too weak to enforce harsh enough shrinkage for desirable model selection.
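To make the partitioning concrete, the following sketch (our construction; centering the truncated normals at zero follows our reading of [13], and the variance values are placeholders) draws class probabilities from the Dirichlet prior (3) and marker effects from the mixture prior (2):

    import numpy as np
    from scipy.stats import truncnorm

    rng = np.random.default_rng(1)
    p_plus, p_minus, p_zero = rng.dirichlet([1.0, 1.0, 1.0])  # theta, phi, psi = 1
    sigma_plus, sigma_minus = 1.0, 1.0   # placeholder class standard deviations

    def draw_beta(size):
        classes = rng.choice(3, size=size, p=[p_plus, p_minus, p_zero])
        beta = np.zeros(size)
        pos, neg = classes == 0, classes == 1
        # N+: normal truncated to (0, inf); N-: normal truncated to (-inf, 0)
        beta[pos] = truncnorm.rvs(0, np.inf, scale=sigma_plus,
                                  size=pos.sum(), random_state=rng)
        beta[neg] = truncnorm.rvs(-np.inf, 0, scale=sigma_minus,
                                  size=neg.sum(), random_state=rng)
        return beta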
The variational Bayes component of V-Bay is constructed by approximating the joint posterior density of the hierarchical model:

p(Θ | y) ∝ p(y | μ, β, σ²_e) p(β | p_β+, p_β−, p_β0, σ²_β+, σ²_β−) p(p_β+, p_β−, p_β0) p(σ²_β+) p(σ²_β−) p(μ) p(σ²_e),   (4)

where Θ = (μ, β, p_β+, p_β−, p_β0, σ²_β+, σ²_β−, σ²_e), in terms of a factorized form:

q(Θ) = q(μ) q(σ²_e) q(p_β+, p_β−, p_β0) q(σ²_β+) q(σ²_β−) ∏_{j=1}^{m} q(β_j),   (5)

and then minimizing the KL-divergence between the factorized and full forms. Equation (5) is a natural factorization for the V-Bay hierarchical model since most of the priors are conjugate. The posterior factorized distributions all have closed-form expressions and each parameter is completely characterized by an expected sufficient statistic [27] (Additional file 1, Methods). The algorithm is therefore equivalent to updating these expected sufficient statistics.
Minimizing the KL-divergence between each marginal distribution (e.g. q(β_j)) and the full joint distribution is performed by considering the expectation of the full log joint distribution with respect to each parameter. For a generic parameter θ, the expectation step is equivalent to setting:

q(θ) = C exp{E_−θ[log p(Θ | y)]},   (6)

with C some normalizing constant, and E_−θ indicating expectation of the log of equation (4) with respect to every other parameter's factorized distribution, except q(θ). This defines a system of equations which can be iterated until convergence [23, 27]. With the factorized form, it is a simple matter to demonstrate that the time complexity of V-Bay is O(nm) per iteration (Additional file 1, Methods).
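As a concrete instance of equation (6), consider the simplest case (our derivation, not taken from Additional file 1): with a constant prior on μ, the expectation of the log of equation (4) with respect to all other factors is quadratic in μ, so q(μ) is normal with mean equal to the average residual and precision n E[1/σ²_e]. The expected sufficient statistic for q(μ) is therefore:

    import numpy as np

    def update_mu(y, X, E_beta):
        # Mean of the residuals after removing the current expected marker
        # effects; this is the mean of the normal factor q(mu).
        return float(np.mean(y - X @ E_beta))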
V-Bay Convergence
The factorization of equation (4) is used to define a function ℒ(θ) which lower bounds the log posterior probability of the data (i.e. the probability of the observed data after integrating out all parameters in the model). The lower bound ℒ(θ) is defined as the expectation of the log of equation (4) with respect to every factorized distribution plus the entropy of each factorized distribution. In the full form, the convergence of V-Bay to a local maximum of the lower bound ℒ(θ) is guaranteed because of the convexity of ℒ(θ) with respect to each parameter's approximate posterior distribution [33]. In the described implementation we used an approximation for some higher order expectation terms that we found increased computational efficiency (Additional file 1, Methods).
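In symbols, writing Θ for the full set of model parameters and q_k for the factors in equation (5), the definition above is the standard variational lower bound:

ℒ(θ) = E_q[log p(y, Θ)] + Σ_k H[q_k],

where p(y, Θ) is the unnormalized form of equation (4), E_q is the expectation with respect to every factorized distribution, and H[q_k] is the entropy of the kth factor.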
Given that global convergence to a single stationary point is not guaranteed [26], the standard practice is to use multiple parameter initializations. We found that with random initializations of the expectations of the β_j, V-Bay finds local modes that correspond to over-fit (under-determined) models, while with initializations of only a few non-zero expectations of the β_j, V-Bay tends to update these values close to zero before converging. We therefore set all expectations of the β_j parameters equal to zero as a starting point for all runs of V-Bay, an approach that has precedent in simultaneous marker analysis [4]. This also corresponds to appropriate starting estimates given our prior assumption that few markers are associated with a phenotype.
We have found that the order in which the parameters are updated can affect local convergence, particularly when there is missing genetic data. In general, the association models found using different orderings were not widely different from one another, often differing only in whether they included one or two specific associations. For cases where ordering did make a difference, we ran V-Bay with multiple random orderings and used the conservative criterion of considering only associations found to be significant in at least 80% of the orderings to be true positives, for all simulations and data analyses compared to single-marker analysis. The cutoff of 80% corresponds directly to a false discovery rate of 0%. We also considered a less stringent cutoff, with an observed false discovery rate of 5%, in the comparison to the lasso.
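The consensus criterion amounts to the following filter (a sketch with hypothetical names, not part of the V-Bay software):

    from collections import Counter

    def consensus_hits(runs, threshold=0.8):
        """runs: list of sets of marker indices found significant per ordering."""
        counts = Counter(marker for hits in runs for marker in hits)
        return {m for m, c in counts.items() if c / len(runs) >= threshold}

For example, with five orderings and threshold=0.8, a marker must be significant in at least four of the five runs to be retained.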
V-Bay Software
An implementation of V-Bay is available at http://mezeylab.cb.bscb.cornell.edu/Software.aspx. The software has basic control parameters available to the user and only requires tab-delimited genotype and phenotype files as input. The algorithm itself consists of the following steps:

1) randomize the marker ordering;
2) initialize the expected sufficient statistics and expectations of parameters;
3) update the expected sufficient statistics for a particular parameter, given the expectations of all the other parameters;
4) update the expectations of a particular parameter, given the expectations of all the other parameters;
5) repeat steps 3 and 4 for all the parameters in the model;
6) check convergence based on the current estimate of the lower bound, ℒ(θ).

Further functional details are presented in Additional file 1, Tables S3-S9. The main output from the algorithm is the −log10 of the p_vbay = p_j+ + p_j− statistic for each marker, which can be used to assess the significance of a marker association.
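Structurally, steps 1-6 correspond to the following skeleton (placeholder functions only; the actual update expressions are those in Additional file 1, Tables S3-S9):

    import random

    def vbay_run(markers, init, update_stats, update_expectations,
                 lower_bound, tol=1e-9):
        random.shuffle(markers)              # step 1: randomize marker ordering
        state = init(markers)                # step 2: initialize statistics
        old = -float("inf")
        while True:
            for p in state.parameters():     # step 5: sweep every parameter
                update_stats(state, p)       # step 3: sufficient statistics
                update_expectations(state, p)  # step 4: expectations
            new = lower_bound(state)         # step 6: convergence check
            if abs(new - old) < tol:
                return state
            old = new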
The Lasso
Originally proposed by Tibshirani [34], recently applied to GWA data by Wu et al. [14], and modified by Hoggart et al. [4], the lasso is a form of hierarchical regression that imposes a double exponential prior on the coefficients of each marker. Although expressed in a Bayesian context, maximum a posteriori (MAP) estimates are obtained by maximizing the following penalized log-likelihood:

ℓ(β|Y) − λ Σ_{j=1}^{m} |β_j|,   (7)

where ℓ(β|Y) is the log-likelihood for the relevant generalized linear model. By penalizing the magnitude of each β_j coefficient, MAP estimation shrinks the coefficient values compared to the estimates under the unpenalized model. This shrinkage causes most coefficients to be exactly zero, so that only very few markers are selected to be nonzero for a single value of λ. The penalty produces a convex log-likelihood surface with a single maximum even for underdetermined systems (i.e. when there are more markers than samples). Therefore, the lasso can jointly consider all markers in a single model and simultaneously account for variance in the response caused by multiple markers. The lasso model is fit for multiple values of λ, and a single subset of coefficients is selected to be nonzero by 10-fold cross-validation. Confidence scores are obtained for each selected marker by comparing an unpenalized model containing all selected markers to a model that omits each marker in turn. An F-test is performed for each marker, but note that these confidence scores cannot be interpreted as typical p-values since they are obtained from a two-step procedure. Algorithmic details for fitting the lasso model in the linear-Gaussian case are provided by [35, 36].
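The two-step procedure can be sketched as follows (our code, using scikit-learn's LassoCV and a nested-model F-test; an illustration only, not the implementation of [4] or [14]):

    import numpy as np
    from sklearn.linear_model import LassoCV
    from scipy import stats

    def lasso_with_scores(X, y):
        fit = LassoCV(cv=10).fit(X, y)         # lambda chosen by 10-fold CV
        selected = np.flatnonzero(fit.coef_)   # markers with nonzero effects
        # Unpenalized full model with all selected markers plus an intercept
        Xs = np.column_stack([np.ones(len(y)), X[:, selected]])
        rss_full = np.sum((y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]) ** 2)
        df_full = len(y) - Xs.shape[1]
        scores = {}
        for k, j in enumerate(selected):
            Xr = np.delete(Xs, k + 1, axis=1)  # omit marker j in turn
            rss_red = np.sum((y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]) ** 2)
            F = (rss_red - rss_full) / (rss_full / df_full)
            scores[j] = stats.f.sf(F, 1, df_full)  # confidence score, not a p-value
        return selected, scores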
Simulation Study
GWA data were simulated under the set of conditions listed in Table 1. The genomic marker data were generated using MaCS [29], a scalable approximate coalescent simulator, with the default approximation tree width. For the comparison to single-marker analysis, three basic types of genotype data sets were simulated. For the first and second types, 0.5 Gb of DNA was simulated from a single diploid population with N_e = 10000, a population scaled mutation rate 4N_e μ = θ = 0.001, and a genome-wide population scaled recombination rate 4N_e κ = ρ = 0.00045, values taken from Voight et al. [37]. Samples of 200 and 1000 individuals were drawn, screening for minor allele frequency (MAF) > 0.10, which left more than one million markers for analysis. For the third type, 200 diploid samples of 0.5 Gb were simulated from a simple four-population migration model. The F_ST approximation observed in the overall Phase I HapMap analysis [38] was used to determine the population per generation migration rate for a simple symmetric island migration model with populations of equal size. After screening for MAF > 0.10, this left over 660 thousand markers for analysis. The final data included the addition of 2% missing data.
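The MAF screening step is straightforward; a sketch (ours) for a {0, 1, 2} coded genotype matrix:

    import numpy as np

    def maf_filter(X, threshold=0.10):
        """X: (n, m) genotype matrix; keep markers with MAF > threshold."""
        freq = X.mean(axis=0) / 2.0          # allele frequency per marker
        maf = np.minimum(freq, 1.0 - freq)   # fold to the minor allele
        keep = maf > threshold
        return X[:, keep], keep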
Given the simulated genotypic data, phenotypic data were produced with a simple additive linear model, as shown in equation (1). The genotypes were represented in the linear model with a consistent dummy variable encoding of {0, 1, 2} across loci. The additive effects were drawn independently from a Γ(2, 1) distribution or from a model with fixed effects. The locations of the loci were randomly sampled throughout the genome. For each genomic data set, 4, 8, or 32 loci with phenotype associations were simulated. The total heritability of the phenotype was fixed at either 0.5 or 0.9. The MAF is computed for each sampled locus in the genetic model, since each locus is chosen from the SNPs generated by MaCS. By combining the MAF with the effects sampled for each locus in the genetic model, it is possible to determine the proportion of observed variation contributed by each locus. This individual heritability for each locus is defined as follows:

h²_j = 2 f_j (1 − f_j) β²_j / σ²_y,   (8)

where f_j is the MAF of locus j, β_j is the additive effect of locus j, and σ²_y is the total phenotypic variance of the trait.
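A one-line implementation of equation (8) (our sketch, assuming Hardy-Weinberg proportions for the {0, 1, 2} coding):

    def locus_heritability(maf, beta, var_y):
        # 2 f (1 - f) beta^2 is the phenotypic variance contributed by the locus
        return 2.0 * maf * (1.0 - maf) * beta ** 2 / var_y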
GWA analyses of the simulated data were performed using both V-Bay and a linear regression single-marker analysis. When population structure was incorporated, the linear model (1) becomes a fixed effect ANOVA model, for both V-Bay and the single-marker analysis. The population means in V-Bay were treated as having normal priors centered on zero with a very large variance (τ = 1000) and were updated in a similar fashion to the other parameters in the V-Bay algorithm. The V-Bay algorithm was run until the tolerance for the likelihood portion of the lower bound ℒ(θ) was < 10⁻⁹. For the simulations with missing data, the minor allele frequency across loci (f_j ∀ j) was estimated from the observed genotype data, and the missing data points were then sampled from a Bin(n = 2, f_j) distribution, i.e. assuming Hardy-Weinberg equilibrium, for both V-Bay and the single-marker analysis. We performed random re-sampling of the missing data to test the robustness of the output of V-Bay and the single-marker analysis (Additional file 1, Methods).
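The missing-data treatment can be sketched as follows (our code; np.nan marks missing genotypes):

    import numpy as np

    def impute_missing(X, rng):
        """X: (n, m) genotype matrix with np.nan marking missing entries."""
        X = X.copy()
        f = np.nanmean(X, axis=0) / 2.0   # estimated allele frequency per marker
        miss_i, miss_j = np.where(np.isnan(X))
        X[miss_i, miss_j] = rng.binomial(2, f[miss_j])  # Hardy-Weinberg draw
        return X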
The false positive and true positive rates were calculated for each set of replicate simulations. Care was taken to account for the effect of linkage disequilibrium on the test statistics, for both V-Bay and single-marker analysis. A window was computed around each marker to determine where the r² with that marker decayed to 0.4, a cutoff chosen to be as generous to single-marker analysis as possible; any marker in this window was considered a true positive. In cases where multiple recombination events occurred recently between different ancestral lineages, multiple blocks of markers in linkage disequilibrium were generated that were separated by markers in low linkage disequilibrium. In these cases, a conservative rule for evaluating a true positive was implemented: if a marker had p_vbay > 0.99, or a −log10 p-value for the single-marker analysis in excess of the Bonferroni correction, and the r² between the significant genetic marker and the true location was greater than 0.4, then the marker was considered a true positive.
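The LD-based scoring can be sketched as follows (our code; for simplicity it scans all markers for r² above the cutoff rather than computing a contiguous window around the focal marker):

    import numpy as np

    def ld_markers(X, j, cutoff=0.4):
        """X: (n, m) genotype matrix; markers whose r^2 with marker j >= cutoff."""
        r = np.array([np.corrcoef(X[:, j], X[:, k])[0, 1]
                      for k in range(X.shape[1])])
        return np.flatnonzero(r ** 2 >= cutoff)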
For the comparison between V-Bay, the lasso, and single-marker analysis, one hundred thousand markers and sample sizes of 200 or 1000 for a single population were simulated (the reduced number of markers for these simulations was used to conserve CPU cycles). The genetic architectures were simulated as with the larger scale simulations, but with only 4 or 32 loci being sampled randomly from the one hundred thousand markers, and effects sampled from a Γ(2, 1) distribution, for 10 replicated data sets. Eight random reorderings of the markers were used with the V-Bay analysis, and the false discovery rate for V-Bay was controlled based on the consensus of associations found across reorderings with p_vbay > 0.99 (e.g. a false discovery rate of 5% corresponded to an association being found in at least 3 of the 8 reorderings). The false discovery rates for the lasso (using F-statistics) and the single-marker analysis were controlled based on the p-values computed for each method respectively.
Data Analysis
We performed a GWA analysis for gene expression levels measured in the immortalized lymphoblastoid cell lines that were generated for the 210 unrelated individuals of Phase II of the International HapMap project [31]. This sample included 60 individuals of European descent sampled from Utah (CEU), 45 individuals sampled from the Han Chinese population (CHB), 45 individuals sampled from the Japanese population (JPT), and 60 individuals sampled from the Yoruban population in Africa (YRI). Expression data for these lines were available for ~47,000 probes (~17,000 genes) assayed with the Illumina bead array. For our analyses, we screened for MAF > 0.10 in all populations, which left 1.03 × 10⁶ SNPs on chromosomes 1 to 22. The X and Y chromosomes were not analyzed by Stranger et al., and we ignored these chromosomes in our analyses as well. Stranger et al. [31] reported 879 gene expression probes with highly significant cis-eQTL associations, found by testing within populations, where every SNP in a 2 Mb window around each gene was analyzed. We performed a GWA analysis, with both V-Bay and single-marker regression, for their top 100 most significant expression probes. We combined genotypic data across populations and accounted for the effect of population structure in each case by including appropriate covariates. We also tested the top 20 probes not in their association list that had the largest residual variance after correcting for population structure. Only 120 expression probes were analyzed, to conserve CPU cycles; all 879 could easily be analyzed in a future study. The total missing data for this SNP set was 1.78%. We accounted for missing data using the same approach as with our simulated data analysis.