Variable selection for large p small n regression models with incomplete data: Mapping QTL with epistases

Background: Identifying quantitative trait loci (QTL) for both additive and epistatic effects raises the statistical issue of selecting variables from a large number of candidates using a small number of observations. Missing trait and/or marker values prevent one from directly applying classical model selection criteria such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Results: We propose a two-step Bayesian variable selection method which deals with the sparse parameter space and the small sample size issues. The regression coefficient priors are flexible enough to incorporate the characteristics of "large p small n" data. Specifically, sparseness and possible asymmetry of the significant coefficients are handled by a Gibbs sampling algorithm that stochastically searches low-dimensional subspaces for significant variables. The superior performance of the approach is demonstrated via a simulation study, and we also apply it to real QTL mapping datasets. Conclusion: The two-step procedure coupled with Bayesian classification offers flexibility in modeling "large p small n" data, especially for a sparse and asymmetric parameter space. This approach can be extended to other settings characterized by high dimension and low sample size.


Background
With the advent of high-throughput biotechnologies to genotype dense molecular markers throughout the genome, statistical methodologies are crucial for understanding the genetic architecture of complex traits and for locating the genes underlying important traits. Since the pioneering statistical work by Lander and Botstein [1], much effort has been devoted to improving the efficiency and accuracy of QTL mapping. Traditional approaches to QTL mapping test loci on a dense grid across the chromosomes via likelihood ratios of linear regression models (see the reviews by Doerge et al. [2] and Broman and Speed [3]), and Wang et al. [4] also proposed a Bayesian shrinkage estimation of QTL parameters allowing varying shrinkage factors across different effects.
Epistases (that is, interactions between genes) are ubiquitous in biological systems [5] and may even play a more important role than additive effects, as has been shown in human populations [6,7] and other organisms [8][9][10][11][12]. However, even a moderate number of markers implies a large number of pairwise combinations, thus creating statistical issues in QTL mapping. Due to the small sample sizes and the lack of efficient statistical tools, the number of identified genes is limited although the existence of epistasis has been recognized for nearly a hundred years [13]. To detect epistatic effects, Kao and Zeng [14] proposed modeling epistasis via orthogonal contrast scales using Cockerham's model; Yi and Xu [15] developed a Bayesian method to detect epistasis using a reversible jump Markov chain Monte Carlo (MCMC) algorithm; Yi et al. [16][17][18] then proposed a Bayesian model selection approach to detect genome-wide epistasis (with the software described in [19]); Bogdan et al. [20] modified the Bayesian information criterion (mBIC) to permit the identification of additive effects as well as pairwise interactions; and Cui and Wu [21] also proposed a statistical framework to detect genetic interactions derived from different genomes in self-pollinated plants. Recently, Żak et al. [22] developed a rank-based model selection and Shi et al. [23] developed a LASSO-type penalized likelihood method to locate interacting QTL, while Bogdan et al. [24] extended mBIC for strongly correlated markers and multiple interval mapping.
Consider Y_i as the trait value of strain i = 1, …, n, and let X_ij be the genotypic value of marker j = 1, …, p_β within the i-th strain. Here we focus on populations with binary markers X_ij (coded as -0.5 and 0.5), such as doubled-haploid, backcross or recombinant inbred lines. With available markers (either observed or imputed) densely located on chromosomes, we assume the putative QTL co-transmit with some of the markers. Let {X} denote the set including all pairwise epistases of interest, and define Z_ij = X_ik X_il for the j-th candidate epistasis (k, l) ∈ {X}, j = 1, …, p_γ. We investigate the additive effects of putative QTL and the epistatic interactions between them through the following multiple regression model,

Y_i = μ + Σ_{j=1}^{p_β} β_j X_ij + Σ_{j=1}^{p_γ} γ_j Z_ij + ε_i,    (1)

where μ is the overall mean, β_j is the additive effect of marker j, γ_j represents the j-th epistatic effect, and ε_i is the random error.
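As a concrete illustration of this setup, the following minimal Python sketch (ours, not part of the original analysis; the function name `epistasis_design` is a placeholder) builds the epistasis design matrix Z from a binary marker matrix X coded as -0.5/0.5:

```python
import numpy as np

def epistasis_design(X):
    """Given an n x p_beta marker matrix X (entries -0.5 or 0.5),
    return the n x p_gamma matrix Z of all pairwise products
    Z_ij = X_ik * X_il, plus the list of marker index pairs (k, l)."""
    n, p_beta = X.shape
    pairs = [(k, l) for k in range(p_beta) for l in range(k + 1, p_beta)]
    Z = np.column_stack([X[:, k] * X[:, l] for k, l in pairs])
    return Z, pairs

# Example with the simulation's dimensions: 221 lines, 56 markers.
rng = np.random.default_rng(0)
X = rng.choice([-0.5, 0.5], size=(221, 56))
Z, pairs = epistasis_design(X)
```

With 56 markers this yields 56 × 55/2 = 1540 candidate epistases, matching the 1596 = 56 + 1540 candidate variables cited in the simulation study below.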
QTL mapping with this multiple regression model can be viewed as a model selection procedure [3,[25][26][27]. However, several characteristics of the data complicate the application of classical statistical methodologies. First, a large amount of missing molecular marker data, due to failure in genotyping or selective genotyping, is common in practice. When markers are sparse, the missing genotype information between markers must be inferred. Second, molecular markers in the same linkage group may be highly correlated. Third, the total number of molecular markers and putative epistases, i.e., p = p_β + p_γ, is usually much larger than the sample size n. Because of these issues, efficiency and accuracy are often sacrificed for ease of statistical development. The characteristics of "large p small n" data with missing values require further attention via extensions of traditional model selection approaches. We extend the Bayesian classification approach in Zhang et al. [28] to map QTL with epistases. Spike and slab priors have been used by, for example, Mitchell and Beauchamp [29], George and McCulloch [30], and Ishwaran and Rao [31] to develop Bayesian variable selection approaches. Spike and slab priors consist of two components, one modeling zero coefficients and the other modeling nonzero ones.
Furthermore, the mixing weight plays a crucial role in condensing the searchable parameter space and enforcing a stochastic search within low-dimensional spaces. When only a limited number of covariates are being investigated, a uniform distribution on [0, 1] or even a fixed value (e.g., 0.5) is usually chosen for the mixing weight. However, when n << p, it is unrealistic to expect half of the variables to be selected because the final model may still be unidentifiable. Instead, we expect that, for a successful variable selection, the prior distributions of the mixing weights depend on both n and p.
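A back-of-the-envelope calculation makes the point. Using the dimensions of the simulation study below (p = 1596 candidates, n = 221 observations) and a prior mixing weight scaled as c√n/p (our notation, anticipating the Methods; the multiplier c = 2 is the value used later in the paper), a fixed weight of 0.5 would put far more variables in the model than there are observations:

```python
import math

p, n = 1596, 221           # candidate variables vs. observations
c = 2                      # multiplier used in the two-step method

# A fixed mixing weight of 0.5 expects half the candidates in the model:
naive_expected = 0.5 * p   # far more selected variables than observations

# Scaling the prior weight as c*sqrt(n)/p instead expects O(sqrt(n)) variables:
scaled_w = c * math.sqrt(n) / p
scaled_expected = scaled_w * p   # = c*sqrt(n), roughly 30 variables
```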
We investigate the predictability of a model developed for a dataset of sample size n, and tackle the aforementioned issues. We then construct a two-step Bayesian variable selection approach for model (1) in the case that n ≪ (p_β + p_γ). In the first step, we employ a restrictive prior for each of the coefficients in model (1) to enforce stochastic filtering of the large number of candidate variables. This prior also allows flexibility for possibly different numbers and/or scales of positive and negative coefficients (see [32] for more details on its advantage over symmetric priors). A Gibbs sampling algorithm is developed to compute the posterior distributions and to implement the stochastic search. Only a limited number of variables pass this filter and proceed to the second step, which repeats the first step with far fewer candidate variables. The second step is necessary for model (1) when n ≪ (p_β + p_γ), as the priors in the first step could be too restrictive. The performance of our approach is evaluated via a simulation study and application to real datasets.

Simulation
Simulation studies were performed to evaluate the performance of our method in the case of p ≫ n. We simulated 56 markers across 3 chromosomes, which carry 10, 20, and 26 markers and are 56.7 cM, 133.5 cM and 171.6 cM long, respectively. We specify = 0.5415, and the locations of 28 markers are chosen based on the Drosophila data [28], which include 221 inbred introgression lines between two closely related species. The other 28 markers are chosen such that neighboring markers are at least 5 cM apart. Table 1 shows the detailed information on the non-zero effects specified in the simulation, including two additive effects and three epistatic effects. To assess whether our method is able to identify different types of epistatic effects, we include all three possible types of interaction in the simulation: (1) neither of the two markers has an additive effect (that is, 2-133.8 and 3-56.7); (2) one of them has an additive effect (that is, 1-24.7 and 2-47.8); (3) both have additive effects (that is, 2-47.8 and 3-141.5). All epistatic effects were set at the same size so that effect size does not confound detectability. Due to the intensive computation involved in Gibbs sampling, a total of 100 complete data sets were simulated. Each of the 100 data sets was analyzed using two models, one with both additive and epistatic effects and the other with additive effects only. When mapping QTL with epistases, we have a total of 1596 variables (56 additive-effect loci and 1540 epistases) versus 221 observations in the model.
For the model without epistases, both markers are detected in most of the 100 simulated datasets even when the false discovery rate (FDR) is controlled as low as 0 (by requiring the Bayes factor to exceed 3.2); see Table 2.
When modeling the epistases, all (additive and interaction) effects are still detected in more than 90% of the data sets at all levels of the Bayes factor (BF), though the FDRs are higher. For the data sets in which an effect is not identified, an immediate neighbor of the corresponding marker locus is usually detected instead. As expected, epistases are more difficult to detect than additive effects. Among all epistases, the one between two markers that both have additive effects is the easiest to detect. The true parameter values are included in their 95% credible intervals, with the associated posterior probabilities very close to one (results not shown).

Application
We apply the developed method to the simulans backcross II (BS2) data and the mauritiana backcross II (BM2) data [33,34], where the quantitative trait is a morphological index (PC1) of the posterior lobe. The genotypes of males were determined at each marker locus, and genetic map positions were estimated from gametes produced by the F1 females in this study. Further information about the data can be found in Liu et al. [33] and Zeng et al. [34]. (Table caption: out of 100 simulated data sets, the total numbers of data sets that correctly identify the true additive and interaction effects (in brackets, their neighboring ones when the true ones are missed) are counted when thresholding the Bayes factor (BF) at different levels; also listed are the mean and standard error (SE) of the estimated effect sizes.)
Applying multiple interval mapping (MIM) [25,35] to the BS2 data, Zeng et al. [34] detected a total of 16 additive effects and no epistatic effect. Pooling all four data sets, Zeng et al. [34] detected three extra additive effects and six epistatic effects. These epistatic effects appeared to be relatively unimportant for PC1 in the interspecific backcross populations, an observation difficult to interpret biologically. Of the 19 additive effects, 18 estimates have the same sign [34]. Zeng et al. [34] explained this interesting phenomenon as unusually strong directional selection, although Tanksley [36] suggested that transgressive segregation usually follows a mixture of plus and minus alleles in each species, as demonstrated by most previous analyses of quantitative traits.
We focused our analysis on the BS2 and BM2 data with standardized phenotypic values. Of the 19 putative QTL reported by Zeng et al. [34], only nine are at least 1 cM away from the 45 marker loci. We therefore analyzed both datasets with these 54 additive effects (nine putative QTL and 45 markers) and all possible pairwise interactions (that is, 1431 putative epistases). When controlling BF ≥ 1, the analysis of the BS2 data reported a total of 25 additive effects (see Table 3), including all nine putative QTL, but no epistatic effect. The analysis of the BM2 data instead reported a total of 20 additive effects (see Table 4), including three of the nine putative QTL, and 18 epistatic effects (see Table 5). On the basis of the simulation study, we may expect less than 0.67% FDR for those 17 effects [34]. (Table caption: the position of each significant additive effect is specified by an index of the corresponding chromosome and its location on this chromosome in cM; the estimated sizes of additive effects and the standard deviations of the Markov chains are shown in the columns labeled coefficient and S.D., respectively.)
Note that almost every additive effect detected only by Zeng et al. [34] has a neighbor (within 10 cM) in our lists, except 2-135 and 3-94 for the BM2 dataset, and almost every additive effect unique to our lists has a neighbor (within 10 cM) detected by Zeng et al. [34]. Per the discussion on the precision of QTL location by Bogdan and Doerge [37] and Bogdan et al. [24], these effects of close neighbors may be due to identical QTL. Our analysis reported R² = 0.934 and R² = 0.902 for the BS2 and BM2 data, respectively.

Conclusion
This article extends the Bayesian framework in Zhang et al. [28] to identify both additive and epistatic effects of QTL based on model (1). The advantages of this approach mainly lie in the flexible priors for the regression coefficients, which account for characteristics of "large p small n" data, in the predictability of a model constructed with size n data, and in the two-step strategy for dimension reduction. A Gibbs sampler is developed to draw Markov chain samples from the posterior distributions, which can be considered a stochastic search for an optimal model. Unlike model selection based on information criteria, which requires calculation of the effective sample size for incomplete data, missing values can be naturally imputed within the Gibbs sampling scheme. The corresponding algorithm has been implemented in Matlab and is available as QTL-Bayes http://www.stat.purdue.edu/~zhangdb/QTLBayes/.
Bayesian variable selection can be viewed as a penalized likelihood approach, which has been studied recently [38,39]. With "large p small n" data, it is not clear how to set the penalty properly such that it neither overpenalizes nor underpenalizes the likelihood. An overpenalized likelihood will lose some significant variables of particular interest, while an underpenalized likelihood will introduce false positives. The predictability of size n data sheds light on the choice of this penalty. Since a size n data set allows us to understand the variation of the trait explained by only p_n = O(√n) QTL with accuracy O(n^{-1/2}), selecting too many variables into the model will ruin this practice of QTL mapping. As shown by Bogdan and Doerge [37], severely biased estimates can result from a large genome and/or a large number of markers in QTL mapping. We propose a Bayesian framework to resolve this bias problem. We have illustrated our approach by application to the BS2 and BM2 data [33,34], both of which have 45 markers observed across three chromosomes. The disadvantage of this approach is the heavy computation involved, as a computation-intensive Markov chain Monte Carlo algorithm is utilized. For example, the analysis of a dataset with more than 200 markers from 1,000 subjects takes almost 24 hours using one Intel® Xeon™ CPU at 2.80 GHz.
Coding binary markers as -0.5 and 0.5 has been commonly utilized in QTL mapping as it does not introduce correlation between additive effects and interactive effects, and this lack of correlation benefits the identification of additive effects. On the other hand, coding binary markers as 0 and 1 introduces such correlation and is thus not preferred for QTL mapping with epistases [40,41]. Although developed for QTL mapping, this approach is completely general and can be applied to other settings with "large p small n" data, such as associating genomic features with clinical outcomes or phenotypes of biological interest. Unlike QTL mapping data, whose missing structure is known from the linkage information, imaging and microarray genomic data may require more information to impute missing values because the missing mechanism is unknown. Even though missing values are usually imputed with a nearest-neighbor approach [42], Gibbs samplers allow natural multiple imputation under the assumption of missing at random (MAR, see Little and Rubin [43]).
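The claim about marker coding can be checked empirically. The sketch below (ours, assuming two independent, balanced binary markers) estimates the correlation between a marker's main-effect column and its interaction column under both codings:

```python
import numpy as np

def main_x_interaction_corr(lo, hi, n=200000, seed=1):
    """Correlation between one marker's main-effect column and the
    pairwise interaction column, for two independent balanced binary
    markers coded with the values (lo, hi)."""
    rng = np.random.default_rng(seed)
    b = rng.integers(0, 2, size=(n, 2))
    X = np.where(b == 1, hi, lo)
    return np.corrcoef(X[:, 0], X[:, 0] * X[:, 1])[0, 1]

r_half = main_x_interaction_corr(-0.5, 0.5)   # essentially zero
r_01 = main_x_interaction_corr(0.0, 1.0)      # clearly positive
```

Under the -0.5/0.5 coding the main-effect column has mean zero, so its covariance with the interaction column vanishes; under the 0/1 coding the correlation is roughly 0.58 for balanced, independent markers.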

Predictability and Sample Size
Suppose, for a sample of size n, we select up to p_n (assuming p_n < n) significant variables into the following regression model,

Y_n = X_n θ_n + ε_n,

where Y_n is an n-dimensional column vector of responses, X_n is an n × p_n design matrix of the selected variables, θ_n is the vector of coefficients, and ε_n is the vector of random errors. If p_n = o(n), the mean squared prediction error asymptotically achieves the minimum variance, and thus the prediction is asymptotically efficient.
This illustration implies that, with a sample of size n and p_n = O(√n) predictors, the mean squared prediction error can reach the minimum prediction error at rate O(n^{-1/2}). Even if all p_n significant variables could be perfectly selected out of the p candidates, we would still need p_n = o(n) in order to have a chance to correctly understand the variation of the dependent variable explained by the selected predictors. Therefore, we always assume that there are at most p_n = O(√n) significant variables among a total of p candidates in the case of p ≫ n. Indeed, the study of consistency in a triangular array setting for regression problems was conducted by Huber [44-46]. In examining the underlying theory of 'model-selection' and 'variable-selection' procedures that choose p_n explanatory variables from an initial set of variables, Greenshtein and Ritov [46] proved that one may expect consistency for choices of p_n with an order between o(√n) and o(n/log(n)). Our choice of p_n = O(√n) satisfies the Greenshtein and Ritov [46] conditions for consistency.
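A small Monte Carlo sketch (ours, under a pure-noise linear model, not from the paper) illustrates why the number of selected predictors p_n must stay small relative to n: the expected squared prediction error of least squares grows roughly like σ²(1 + p_n/n), so adding predictors degrades prediction even when the fit to the training data improves:

```python
import numpy as np

def avg_test_mse(n, p_n, sigma=1.0, reps=200, seed=0):
    """Average squared prediction error at a new point when an OLS model
    with p_n (noise-only) predictors is fit on n observations."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p_n))
        y = sigma * rng.standard_normal(n)      # true coefficients are zero
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        x_new = rng.standard_normal(p_n)
        y_new = sigma * rng.standard_normal()
        errs.append((y_new - x_new @ beta) ** 2)
    return float(np.mean(errs))

mse_small = avg_test_mse(200, 5)     # p_n << n: near the minimum sigma^2
mse_big = avg_test_mse(200, 100)     # p_n = n/2: clearly inflated error
```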

Bayesian Variable Selection
Here we propose a two-step Bayesian variable selection approach to map QTL with epistases through model (1).
Within the following Bayesian framework, we first select c√n out of the p_β additive effects and c√n out of the p_γ epistatic effects (e.g., we use c = 2), respectively, using a restrictive prior for each coefficient. We then apply the same Bayesian framework to stochastically select among the filtered variables, using a non-restrictive prior for each coefficient. Gibbs sampling algorithms are developed to stochastically search low-dimensional subspaces, as implied by the predictability of a size n data set.
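The first-step filtering can be sketched as follows (a simplification in Python; the function name and the tuple representation of the posterior sign probabilities are ours):

```python
import math

def filter_top_effects(sign_probs, n, c=2):
    """Step-1 filtering: keep the top c*sqrt(n) effects, ranked by
    max(P(coef > 0 | data), P(coef < 0 | data)).
    sign_probs: list of (w_plus, w_minus) tuples, one per candidate effect.
    Returns the indices of the retained effects, best first."""
    keep = int(round(c * math.sqrt(n)))
    ranked = sorted(range(len(sign_probs)),
                    key=lambda j: max(sign_probs[j]), reverse=True)
    return ranked[:keep]

# Toy usage: five candidates, keep the top 1*sqrt(4) = 2.
probs = [(0.1, 0.2), (0.9, 0.0), (0.3, 0.3), (0.0, 0.95), (0.5, 0.1)]
kept = filter_top_effects(probs, n=4, c=1)
```

The same routine would be applied separately to the additive effects and to the epistatic effects, after which the second Bayesian run operates only on the retained candidates.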

Prior Specification
For a two-state marker system, both additive effects β_j, j = 1, …, p_β, and epistatic effects γ_j, j = 1, …, p_γ, are the primary focus of QTL mapping. As is often the case when p = p_β + p_γ ≫ n, many of these coefficients are zero, either because the variation of the trait can be explained by only a few QTL or because the limited sample size precludes selecting too many variables (otherwise the constructed model is not reliable, as shown in the previous section). It is also possible that the number and/or scale of the positive coefficients differ from those of the negative ones. To account for these properties, a three-component mixture prior is specified for each coefficient β_j or γ_j. More specifically,

β_j ~ w_β+ N+(θ_β+, φ_β+) + w_β- N-(θ_β-, φ_β-) + (1 - w_β+ - w_β-) δ(β_j),
γ_j ~ w_γ+ N+(θ_γ+, φ_γ+) + w_γ- N-(θ_γ-, φ_γ-) + (1 - w_γ+ - w_γ-) δ(γ_j),

where δ(·) is a Dirac function with mass one at zero, and N+(μ, σ²) and N-(μ, σ²) denote the normal distribution N(μ, σ²) truncated to the positive and negative half-lines, respectively. Therefore, w_β+ (or w_β-) is the probability for any single marker to have a positive (or negative) additive effect on the trait, and w_γ+ (or w_γ-) is the probability for any pair of markers in {X} to have a positive (or negative) interactive effect on the trait.
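A sketch of drawing coefficients from this three-component prior (ours; the truncated normals are sampled by simple rejection, which is adequate for illustration though not for extreme truncation, and the hyperparameter values below are arbitrary placeholders):

```python
import numpy as np

def sample_mixture_prior(size, w_plus, w_minus,
                         theta_p, phi_p, theta_m, phi_m, seed=0):
    """Draw from w+ * N+(theta_p, phi_p) + w- * N-(theta_m, phi_m)
    + (1 - w+ - w-) * delta_0, where N+/N- are normals truncated to
    the positive/negative half-line."""
    rng = np.random.default_rng(seed)

    def trunc_normal(mu, var, positive):
        # Rejection sampling: redraw until the sign constraint holds.
        while True:
            x = rng.normal(mu, np.sqrt(var))
            if (x > 0) == positive:
                return x

    comp = rng.choice(3, size=size, p=[w_plus, w_minus, 1 - w_plus - w_minus])
    out = np.zeros(size)
    for i, c in enumerate(comp):
        if c == 0:
            out[i] = trunc_normal(theta_p, phi_p, True)
        elif c == 1:
            out[i] = trunc_normal(theta_m, phi_m, False)
        # c == 2: point mass at zero, leave out[i] = 0
    return out

draws = sample_mixture_prior(10000, 0.02, 0.01, 1.0, 0.5, -1.0, 0.5)
```

With these illustrative weights, roughly 97% of the draws sit exactly at zero, mirroring the sparseness the prior encodes, while the nonzero draws can have different frequencies and scales on the two sides.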
As suggested by the predictability of a size n data set, we expect to select at most p_n = O(√n) out of the p variables for the final model. Therefore, we specify the priors for (w_β+, w_β-) and (w_γ+, w_γ-) such that, a priori, w_β+ + w_β- and w_γ+ + w_γ- are centered at c√n/p_β and c√n/p_γ, respectively; that is, we expect at most c√n significant additive effects and c√n significant epistatic effects. Gaffney [48] and Yi et al. [17], among others, employed similar ideas to rescale the priors based on the number of possible effects. Apparently, when n ≪ (p_β + p_γ), either c√n/p_β or c√n/p_γ is very small, which implies a restrictive prior on each corresponding coefficient. Therefore, we usually select c√n additive effects and c√n epistatic effects during the first run of the Bayesian analysis. We then apply the same Bayesian analysis to these pre-selected variables; in this second run, both w_β+ + w_β- and w_γ+ + w_γ- a priori follow non-restrictive distributions. With data (Y_n, X_n) and the prior specification above, the likelihood function is the joint distribution of the data (Y_n, X_n), the parameters (μ, β, γ), and all hyperparameters.
The distribution of X_n can be specified based on the available linkage map information [2]. The conditional distribution of β = (β_1, …, β_{p_β}) is a product of the prior distributions of the individual β_j. Similarly, the conditional distribution of γ = (γ_1, …, γ_{p_γ}) is a product of the prior distributions of the individual γ_j. The priors of the hyperparameters θ_β+, θ_γ+, φ_β+, φ_γ+, θ_β-, θ_γ-, φ_β- and φ_γ- are specified to be as noninformative as possible.

Sampling the variance parameters
With conditionally conjugate priors, the posterior distributions of all variance parameters are still inverse gamma. Specifically, each variance parameter is drawn from an inverse gamma distribution whose shape and scale parameters are updated with the corresponding sample size and sum of squares.
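For a single error variance with an InverseGamma(a0, b0) prior and normal residuals, the standard conjugate update can be sketched as follows (our simplification of one Gibbs step; the hyperparameter defaults are placeholders, not the paper's values):

```python
import numpy as np

def update_variance(resid, a0=0.01, b0=0.01, rng=None):
    """One Gibbs step for an error variance with an InverseGamma(a0, b0)
    prior: given the current residuals, draw from
    InverseGamma(a0 + m/2, b0 + sum(resid^2)/2),
    sampled as the reciprocal of a Gamma draw."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = a0 + len(resid) / 2.0
    b = b0 + np.sum(resid ** 2) / 2.0
    return 1.0 / rng.gamma(shape=a, scale=1.0 / b)
```

With many residuals of variance about 4, for example, successive draws concentrate near 4, as the data term dominates the weak prior.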

Bayesian Inference
For each variable in model (1), one pair of parameters is used to select the corresponding variable: for the j-th additive effect, these are the posterior probabilities w_βj+ = P(β_j > 0 | Y_n, X_n) and w_βj- = P(β_j < 0 | Y_n, X_n). With the full conditional posterior distribution of β_j and the notation in the APPENDIX, the two parameters w_βj+ and w_βj- can be estimated from the Markov chains drawn by the above Gibbs sampler. If and only if both w_βj+ and w_βj- are less than 0.5, the median of the posterior distribution of β_j is zero. Similarly, the posterior probabilities w_γj+ = P(γ_j > 0 | Y_n, X_n) and w_γj- = P(γ_j < 0 | Y_n, X_n) can be estimated. The restrictive priors of the first step condense the searchable parameter space into an identifiable Bayesian model and enforce a stochastic search for an optimal low-dimensional parameter subspace. We then rank the j-th additive effect based on max{w_βj+, w_βj-}, and the j-th epistatic effect based on max{w_γj+, w_γj-}. The top c√n out of the p_β additive effects and the top c√n out of the p_γ epistatic effects are selected, respectively. At the second step, we select variables out of those c√n additive effects and c√n epistatic effects, under the same Bayesian framework for model (1). Obviously, each coefficient has a non-restrictive prior at the second step, which avoids possible over-penalization due to the restrictive priors of the first step.
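Given the Markov chain samples of a coefficient (with exact zeros contributed by the point mass), the two sign probabilities and the zero-median rule can be computed directly (a minimal sketch in our notation):

```python
import numpy as np

def sign_probabilities(chain):
    """Estimate P(beta_j > 0 | data) and P(beta_j < 0 | data) from the
    Markov chain samples of beta_j; exact zeros come from the point mass."""
    chain = np.asarray(chain)
    return (chain > 0).mean(), (chain < 0).mean()

def posterior_median_is_zero(chain):
    """The posterior median of beta_j is zero iff both sign
    probabilities are below 0.5."""
    w_plus, w_minus = sign_probabilities(chain)
    return w_plus < 0.5 and w_minus < 0.5
```

Ranking effects by max of the two sign probabilities, as in the text, then reduces to sorting by `max(sign_probabilities(chain))` across coefficients.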
Following Jeffreys [49,50], we test the hypothesis H_0: β_j = 0 vs. H_1: β_j ≠ 0 on the basis of the Bayes factor, defined as

B_10(β_j) = [P(β_j ≠ 0 | Y_n, X_n) / P(β_j = 0 | Y_n, X_n)] × [π(β_j = 0) / π(β_j ≠ 0)] = P(β_j ≠ 0 | Y_n, X_n) / P(β_j = 0 | Y_n, X_n),

where π(β_j = 0) and π(β_j ≠ 0) are the a priori probabilities, and the last equality follows from the fact that π(β_j = 0) = π(β_j ≠ 0) at the second step of our Bayesian classification. As suggested by Jeffreys [50], a B_10(β_j) between 1 and 10^{1/2} ≈ 3.2 provides evidence against H_0 that is "not worth more than a bare mention"; a B_10(β_j) from 3.2 to 10 provides "substantial" evidence against H_0; and a B_10(β_j) from 10 to 100 provides "strong" evidence against H_0.
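The Bayes factor computation and Jeffreys' grading can be sketched as follows (ours; the "decisive" grade for B_10 above 100 follows Jeffreys' standard scale, though the text above stops at "strong"):

```python
import math

def bayes_factor(post_nonzero, prior_nonzero=0.5):
    """B_10 = posterior odds of H1 over H0 divided by prior odds.
    At the second step the prior puts pi(beta_j = 0) = pi(beta_j != 0),
    so with prior_nonzero = 0.5 the Bayes factor reduces to the
    posterior odds."""
    post_odds = post_nonzero / (1.0 - post_nonzero)
    prior_odds = prior_nonzero / (1.0 - prior_nonzero)
    return post_odds / prior_odds

def jeffreys_grade(b10):
    """Jeffreys' scale of evidence against H0."""
    if b10 < 10 ** 0.5:      # below ~3.2
        return "bare mention"
    if b10 < 10:
        return "substantial"
    if b10 < 100:
        return "strong"
    return "decisive"
```

For example, a posterior probability of 0.8 that the coefficient is nonzero gives B_10 = 4, falling in the "substantial" band used for the tables in the Application section.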