- Methodology article
- Open access
- Published:

# Variable selection for large *p* small *n* regression models with incomplete data: Mapping QTL with epistases

*BMC Bioinformatics*
**volume 9**, Article number: 251 (2008)

## Abstract

### Background

Identifying quantitative trait loci (QTL) for both additive and epistatic effects raises the statistical issue of selecting variables from a large number of candidates using a small number of observations. Missing trait and/or marker values prevent one from directly applying the classical model selection criteria such as Akaike's information criterion (AIC) and Bayesian information criterion (BIC).

### Results

We propose a two-step Bayesian variable selection method which deals with the sparse parameter space and the small sample size issues. The regression coefficient priors are flexible enough to incorporate the characteristic of "large *p* small *n*" data. Specifically, sparseness and possible asymmetry of the significant coefficients are dealt with by developing a Gibbs sampling algorithm to stochastically search through low-dimensional subspaces for significant variables. The superior performance of the approach is demonstrated via simulation study. We also applied it to real QTL mapping datasets.

### Conclusion

The two-step procedure coupled with Bayesian classification offers flexibility in modeling "large p small n" data, especially for the sparse and asymmetric parameter space. This approach can be extended to other settings characterized by high dimension and low sample size.

## Background

With the advent of high-throughput biotechnologies to genotype dense molecular markers throughout the genome, statistical methodologies are crucial in understanding the genetic architecture of complex traits, and in locating genes underlying important traits. Since the pioneering statistical work by Lander and Botstein [1], much effort has been devoted to improving the efficiency and accuracy of QTL mapping. Traditional approaches to QTL mapping test each of dense grid loci on chromosomes via the likelihood ratios of linear regression models (see the reviews by Doerge et al. [2] and Broman and Speed [3]), and Wang et al. [4] also proposed a Bayesian shrinkage estimation of QTL parameters allowing varying shrinkage factors across different effects.

Epistases (that is, interactions between genes) are ubiquitous in biological systems [5] and may even play a more important role than additive effects, as have been shown in human population [6, 7] and other organisms [8–12]. However, even a moderate number of markers implies a large number of pairwise combinations, thus creating statistical issues in QTL mapping. Due to the small sample sizes and the lack of efficient statistical tools, the number of identified genes is limited although the existence of epistasis has been recognized for nearly a hundred years [13]. To detect epistatic effects, Kao and Zeng [14] proposed modeling epistasis via orthogonal contrast scales using Cockerham's model; Yi and Xu [15] developed a Bayesian method to detect epistasis using reversible jump Markov chain Monte Carlo (MCMC) algorithm; Yi et al. [16–18] then proposed a Bayesian model selection approach to detect genome-wide epistasis (with the software described in [19]); Bogdan et al. [20] modified Bayesian information criterion (mBIC) to permit the identification of additive effects as well as pairwise interactions; and Cui and Wu [21] also proposed a statistical framework to detect genetic interactions derived from different genomes in self-pollinated plants. Recently, *Żak* et al. [22] developed a rank-based model selection and Shi et al. [23] developed a LASSO-type penalized likelihood method to locate interacting QTL while Bogdan et. al [24] extended mBIC for strongly correlated markers and multiple interval mapping.

Consider *Y*_{
i
}as the trait value of strain *i* = 1, ⋯, *n*, and let *X*_{
ij
}be the genotypic value of marker *j* = 1, ⋯, *p*_{
β
}within the *i*-th strain. Here we focus on the populations with binary markers *X*_{
ij
}(coded as -0.5 and 0.5), such as doubled-haploid, backcross or recombinant inbred lines. With available markers (either observed or imputed) densely located on chromosomes, we assume the putative QTL co-transmit with some of the markers. Let \mathcal{I}{*X*} denote the set including all pairwise epistases of interest, and define *Z*_{
ij
}= *X*_{
ik
}*X*_{
il
}for the *j*-th candidate epistasis (*k*, *l*) ∈ \mathcal{I}{*X*}, *j* = 1, ⋯, *p*_{
γ
}. We investigate the additive effects of putative QTL and the epistatic interactions between them through the following multiple regression model,

where *μ* is the overall mean, *β*_{
j
}is the additive effect of marker *j*, *γ*_{
j
}represents the *j*-th epistatic effect, and *ε*_{
i
}is the random error.

QTL mapping with this multiple regression model can be viewed as a model selection procedure [3, 25–27]. However, several characteristics of the data complicate the application of classical statistical methodologies. First, a large amount of missing molecular markers, due to failure in genotyping or selective genotyping, is common in practice. When markers are sparse, the missing genotype information between markers must be inferred. Second, the molecular markers in the same linkage group may be highly correlated. Third, the total number of molecular markers and putative epistases, i.e., *p* = *p*_{
β
}+ *p*_{
γ
}, is usually much larger than the sample size *n*. Because of these issues, the efficiency and accuracy are usually compromised for easy development of statistical approaches. Characteristics of the "large *p* small *n*" data with missing values require further attention via extensions of traditional model selection approaches. We extend the Bayesian classification approach in Zhang et al. [28] to map QTL with epistases. Spike and slab priors have been used by, for example, Mitchell and Beauchamp [29], George and McCulloch [30], and Ishwaran and Rao [31] to develop Bayesian variable selection approaches. The spike and slab priors consist of two components, with one modeling zero coefficients and the other modeling non-zero ones.

Furthermore, the mixing weight plays a crucial role in condensing the searchable parameter space and enforcing a stochastic search within low-dimensional spaces. When only a limited number of covariates are being investigated, a uniform distribution on [0, 1] or even a fixed value (e.g., 0.5) is usually chosen for the mixing weight. However, when *n* ≪ *p*, it is unrealistic to expect half of the variables to be selected because the final model may still be unidentifiable. Instead, we expect that, for a successful variable selection, the prior distributions of the mixing weights depend on both *n* and *p*.

We investigate the predictability of a model developed for a dataset of sample size *n*, and tackle the aforementioned issues. We then construct a two-step Bayesian variable selection approach for model (1) in the case that *n* ≪ (*p*_{
β
}+ *p*_{
γ
}). In the first step, we employ a restrictive prior for each of the coefficients in model (1) in order to enforce stochastic filtering of the large number of candidate variables. This prior also allows flexibility for the possible different numbers and/or scales of positive and negative coefficients (see [32] for more details on its advantage over symmetric priors). A Gibbs sampling algorithm is developed to compute the posterior distributions and to implement the stochastic search. Only a limited number of variables are filtered to go through the second step, which repeats the first step but with much fewer candidate variables. The second step is necessary to model (1) when *n* ≪ (*p*_{
β
}+ *p*_{
γ
}), as the priors in the first step could potentially be too restrictive. The performance of our approach is evaluated via a simulation study and application to real datasets.

## Results and Discussion

### Simulation

Simulation studies were performed to evaluate the performance of our method in the case of *p* ≫ *n*. We simulated 56 markers across 3 chromosomes, with each having 10, 20, and 26 markers, and being 56.7 cM, 133.5 cM and 171.6 cM long respectively. We specify {\sigma}_{\epsilon}^{2} = 0.5415, and the locations of 28 markers are chosen based on the Drosophila data [28], which include 221 inbred introgression lines between two closely related species. The other 28 markers are chosen such that the neighboring markers are at least 5 cM away. Table 1 shows the detailed information of the non-zero effects specified in the simulation, including two additive effects and three epistatic effects. To assess whether our method is able to identify different types of epistatic effects, we include all three possible interactions in the simulation: (1) neither of the two markers has additive effects (that is, 2–133.8 and 3–56.7); (2) one of them has additive effects (that is, 1–24.7 and 2–47.8); (3) both have additive effects (that is, 2–47.8 and 3–141.5). All epistatic effects were set at the same size to avoid its effects on detectability. Due to the intensive computation involved in Gibbs sampling, a total of 100 complete data sets were simulated. Each of the 100 data sets was analyzed using two models, one model with both additive and epistatic effects while the other with additive effects only. When mapping QTL with epistases, we have a total number of 1596 variables (56 additive-effect loci and 1540 epistases) versus 221 observations in the model.

For the model without epistases, both markers can be detected in most of the 100 simulated datasets even when the false discovery rate (FDR) is controlled as low as 0 (via setting the Bayes factor higher than 3.2), see Table 2. When modeling the epistases, all (additive and interaction) effects are still detected in more than 90% of the data sets for all levels of Bayes factor (BF) though the FDRs are higher. For those data sets with any effect not identified, the immediate neighbors of the corresponding marker locus are mostly detected instead. As expected, it is more difficult to detect epistases than to detect additive effects. The epistasis of markers both having additive effects is the easiest to be detected among all epistases. The true parameter values are included in their 95% credible intervals with the associated posterior probabilities being very close to one (results not shown).

## Application

We apply the developed method to the *simulans* backcross II (BS2) data and the *mauritiana* backcross II (BM2) data [33, 34]. An F_{1} population was first produced by females from an inbred line of *D. simulans* and males from an inbred line of *D. mauritiana*. Then the F_{1} females were backcrossed to the parental line of *D. simulans*, which was fixed for different alleles at 45 marker loci, to produce a *simulans* backcross (BS) population. A *mauritiana* backcross (BM) population was also produced by backcrossing the F_{1} females to the other parental line. Based on the two different times of crossing, a total of four data sets were obtained, namely, BS1 (*n* = 186), BS2 (*n* = 288), BM1 (*n* = 192), and BM2 (*n* = 299). The phenotypic value of an individual is a morphometric descriptor of the posterior lobe, obtained by averaging both sides of the first principal component (PC1) of the Fourier coefficients of the posterior lobe. The genotypes of males were determined at each marker locus, and genetic map positions were estimated from gametes produced by the F_{1} females in this study. Further information about the data is referred to Liu et al. [33] and Zeng et al. [34].

Employing multiple interval mapping (MIM) [25, 35] to the BS2 data, Zeng et al. [34] detected a total of 16 additive effects and no epistatic effect. Pooling all four data sets, Zeng et al. [34] detected three extra additive effects and six epistatic effects. These epistatic effects appeared to be relatively unimportant for PC1 in the interspecific backcross populations, which carried an observation difficult to interpret biologically. Of the 19 additive effects, 18 additive effect estimates have the same sign [34]. Zeng et al. [34] explained this interesting phenomena as an unusually strong directional selection, although Tanksley [36] suggested that transgressive segregation usually followed a mixture of plus and minus alleles in each species as demonstrated by most previous analyses of quantitative traits.

We focused our analysis on the BS2 and BM2 data with the standardized phenotypic values. Of the 19 putative QTL reported by Zeng et al. [34], only nine are at least 1 cM away from the 45 marker loci. Therefore, we analyzed both datasets with these 54 additive effects (nine putative QTL and 45 markers) and all possible pairwise interactions (that is, 1431 putative epistases). When controlling BF ≥ 1, the analysis of the BS2 data reported a total of 25 additive effects (see Table 3), including all nine putative QTL, but no epistatic effect. The analysis of the BM2 data instead reported a total of 20 additive effects (see Table 4), including three of the nine putative QTL, and 18 epistatic effects (see Table 5). On the basis of the simulation study, we may expect less than 0.67% FDR for those 17 and 16 additive effects reported with BF ≥ 100 in analyzing the BS2 and BM2 data respectively. Similarly, three epistatic effects reported in analyzing the BM2 data have BF ≥ 100, less than 12% of which may be false discoveries.

Interestingly, the 25 additive effects detected from the BS2 data include all those detected by Zeng et al. [34] except the 2–135, 3–5 and 3–83 (we consider the markers within 1 cM to be same), but the 20 additive effects detected from the BM2 data only include nine of those detected by Zeng et al. [34]. On the other hand, nine additive effects (i.e., 2–28.53, 2–145.85, 3-0, 3–43.2, 3–49.99, 3–101.29, 3–126.62, 3–134.6, 3–147.69) from the BS2 data are not reported by Zeng et al. [34], and eleven additive effects from the BM2 data (i.e., 1-0, 2–6.98, 2–67.96, 2–145.85, 3–14.33, 3–28.74, 3–43.2, 3–49.99, 3–126.62, 3–147.69, 3–161.43) are not reported by Zeng et al. [34]. Note that almost each additive effect uniquely detected by Zeng et al. [34] has a neighboring one (within 10 cM) in our lists except 2–135 and 3–94 for the BM2 dataset, and almost each additive effect unique in our lists has a neighboring one (within 10 cM) detected by Zeng et al. [34]. Per the discussion on the precision of QTL location by Bogdan and Doerge [37] and Bogdan et al. [24], these effects of close neighbors may be due to identical QTL. Our analysis reported *R*^{2} = 0.934 and *R*^{2} = 0.902 for the BS2 and BM2 data respectively.

## Conclusion

This article extends the Bayesian framework in Zhang et al. [28] to identify both additive and epistatic effects of QTL based on model (1). The advantage of this approach mainly lies in the flexible priors for the regression coefficients by accounting for some characteristics of "large *p* small *n*" data, the predictability of a model constructed with size *n* data, and the two step strategy for dimension reduction. A Gibbs sampler is developed to draw Markov chain samples from the posterior distributions, which can be considered as a stochastic search for an optimal model. Unlike information criteria based model selections which require calculation of the effective sample size for incomplete data, missing values can be naturally imputed within the Gibbs sampling scheme. The corresponding algorithm has been implemented in Matlab and is available as QTLBayes http://www.stat.purdue.edu/~zhangdb/QTLBayes/.

Bayesian variable selections can be viewed as penalized likelihood approaches, which have been studied recently [38, 39]. With "large *p* small *n*" data, it is not clear how to set up the penalty properly such that it will neither overpenalize nor underpenalize the likelihood. An overpenalized likelihood will lose some significant variables of particular interest, while an underpenalized likelihood will introduce false positives. The predictability of size *n* data sheds light on the choice of this penalty. Since a size *n* data set will allow us to understand the variation of the trait explained by only *p*_{
n
}= *O*(\sqrt{n}) QTL with accuracy *O*(*n*^{-1/2}), selecting too many variables into the model will ruin this practice of QTL mapping. As shown by Bogdan and Doerge [37], severely biased estimates can be resulted from large genome and/or marker number in QTL mapping. We propose a Bayesian framework to resolve the bias problem. We have illustrated our approach by application to the BS2 and BM2 data [33, 34], both of which have 45 markers observed across three chromosomes. The disadvantage of this approach is the heavy computation involved as the computation-intensive Markov chain Monte Carlo algorithm is utilized. For example, the analysis of a dataset with more than 200 markers from 1,000 subjects take almost 24 hours using one Intel^{®} Xeon™ CPU at 2.80 GHz.

Coding binary markers with -0.5 and 0.5 has been commonly utilized in QTL mapping as it does not introduce correlation between additive effects and interactive effects, and such uncorrelation benefits the identification of additive effects. On the other hand, coding binary markers with 0 and 1 introduces correlation and thus is not preferred for QTL mapping with epistases [40, 41]. Although developed for QTL mapping, this approach is completely general and can be applied to other settings with "large *p* small *n*" data, such as associating genomic features to clinical outcomes or phenotypes of biological interest. Unlike QTL mapping data with known missing structure from the linkage information, genomic data with imaging and microarray may require more information to impute missing values because of the unknown missing mechanism. Even though the missing values are usually imputed with a nearest-neighbor approach [42], Gibbs samplers allow natural multiple imputation under the assumption of missing at random (MAR, see Little and Rubin, [43]).

## Methods

### Predictability and Sample Size

Suppose, for a sample of size *n*, we select up to *p*_{
n
}(assuming *p*_{
n
}<*n*) significant variables into the following regression model,

where *Y*_{
n
}is an *n*-dimensional column vector; *X*_{
n
}is an *n × p*_{
n
}design matrix such that {X}_{n}^{T}{X}_{n}=n\times {I}_{{p}_{n}}. The best linear unbiased estimator (BLUE) of *β* is

Let \tilde{x}=({\tilde{x}}_{1},{\tilde{x}}_{2},\cdots ,{\tilde{x}}_{{p}_{n}}) include *p*_{
n
}predictors for \tilde{y} such that {\mathrm{max}\phantom{\rule{0.1em}{0ex}}}_{1\le j\le {p}_{n}}\left|{\tilde{x}}_{j}\right|=O(1). Since trace\{Var({\widehat{\beta}}_{n})\}=\frac{{p}_{n}}{n}{\sigma}_{\epsilon}^{2},\tilde{x}\beta can be consistently estimated by \tilde{x}{\widehat{\beta}}_{n}. When using \tilde{x}{\widehat{\beta}}_{n} to predict \tilde{y}, the mean squared prediction error is

If *p*_{
n
}= *o*(*n*), the mean squared prediction error asymptotically achieves the minimum variance, and thus the prediction is asymptotically efficient.

This illustration implies that, with a sample of size *n* and *p*_{
n
}= *O*(\sqrt{n}) predictors, the mean squared prediction error can reach the minimum prediction error at rate *O*(*n*^{-1/2}). Suppose that all *p*_{
n
}significant variables could be perfectly selected out of *p* candidates, we still need *p*_{
n
}= *o*(*n*) in order to have a chance to correctly understand the variation of the dependent variable explained by the selected predictors. Therefore, we always assume that there are at most *p*_{
n
}= *O*(\sqrt{n}) significant variables among a total of *p* candidates in the case of *p* ≫ *n*. Indeed, the study of consistency in a triangular array setting for regression problems was conducted by Huber [44–46]. In examining the underlying theory of 'model-selection' and 'variable-selection' procedures that choose *p*_{
n
}explanatory variables from an initial set of variables, Greenshtein and Ritov [46] proved that one may expect consistency for the choice of *p*_{
n
}with an order between *o*(\sqrt{n/\mathrm{log}\phantom{\rule{0.1em}{0ex}}(n)}) and *o*(*n/* log(*n*)). Our choice of *p*_{
n
}= *O*(\sqrt{n}) satisfies the Greenshtein and Ritov [46] conditions for consistency.

### Bayesian Variable Selection

Here we propose a two-step Bayesian variable selection approach to map QTL with epistases through model (1). With the following Bayesian framework, we first select *c*\sqrt{n} out of *p*_{
β
}additive effects and *c*\sqrt{n} out of *p*_{
γ
}epistatic effects (e.g., we use *c* = 2), respectively, using a restrictive prior for each coefficient. We then apply the same Bayesian framework to stochastically select the filtered variables, using a non-restrictive prior for each coefficient. Gibbs sampling algorithms are developed to stochastically search low-dimensional subspaces, as implied by the predictability of a size *n* data set.

#### Prior Specification

For a two-state marker system, both additive effects *β*_{
j
}, *j* = 1, ⋯, *p*_{
β
}, and epistatic effects *γ*_{
j
}, *j* = 1, ⋯, *p*_{
γ
}, are the primary focus of QTL mapping. As is often the case *p* = (*p*_{
β
}+*p*_{
γ
}) ≫ *n*, many of these coefficients are zero, either because the variation of the trait can be explained by only a few QTL or because the limited sample size precludes selecting too many variables (otherwise the constructed model is not reliable as shown in the previous section). It is also possible that the number and/or scale of the positive coefficients may be different from those of the negative ones. To account for these properties, a three-component mixture prior is specified for each coefficient *β*_{
j
}or *γ*_{
j
}. More specifically,

where *δ* (·) is a Dirac function with mass one at zero; *N*_{+}(*μ*, *σ*^{2}) and *N*_{-}(*μ*, *σ*^{2}) positively and negatively truncate the normal distribution, i.e., *N*(*μ*, *σ*^{2}), respectively. Therefore, *w*_{β+}(or *w*_{β-}) is the probability for any single marker, and *w*_{γ+}(or *w*_{γ-}) is the probability for any pair of markers in \mathcal{I}{*X*}, to have positive (or negative) interactive effect on the trait.

The hyperparameters, {\sigma}_{\beta +}^{2},{\sigma}_{\beta -}^{2},{\sigma}_{\gamma +}^{2} and {\sigma}_{\gamma -}^{2}, are assumed to have priors as inverse gamma distributions, that is, *IG*(*θ*_{β+}, *φ*_{β+}), *IG*(*θ*_{β-}, *φ*_{β-}), *IG*(*θ*_{γ+}, *φ*_{γ+}), and *IG*(*θ*_{γ-}, *φ*_{γ-}), respectively (e.g., setting *θ*_{β+}= *θ*_{β-}= *θ*_{γ+}= *θ*_{γ-}= 0.1 and *φ*_{β+}= *φ*_{β-}= *φ*_{γ+}= *φ*_{γ-}= 10). As a result, the prior on *β* (and *γ*) is essentially a mixture of a point mass at zero and some truncated *t*-distributions, which shrinks the smaller effects towards zero and allows sufficient flexibility for non-zero effects. Furthermore, *t*-type prior distributions yield Bayes rules with desirable decision-theoretic frequentist properties [47]. The hyperparameters, *μ*_{β+}, *μ*_{γ+}, *μ*_{β-}and *μ*_{γ-}, are assumed to have diffuse priors, and the prior distribution for {\sigma}_{\epsilon}^{2} is proportional to 1/{\sigma}_{\epsilon}^{2}.

As suggested by the predictability of a size *n* data set, we expect to select at most *p*_{
n
}= *O*(\sqrt{n}) out of the *p* variables for the final model. Therefore, we specify the priors for (*w*_{β+}, *w*_{β-}) and (*w*_{γ+},*w*_{γ-}) as

that is, expecting at most *c*\sqrt{n} significant additive effects and epistatic effects, respectively. Gaffney [48] and Yi et al. [17], among others, employed similar ideas to rescale the priors based on the number of possible effects. Apparently, when *n* ≪ (*p*_{
β
}+ *p*_{
γ
}), either *c*\sqrt{n}/*p*_{
β
}or *c*\sqrt{n}/*p*_{
γ
}is very small, which implies a restrictive prior on each corresponding coefficient. Therefore, we usually select *c*\sqrt{n} additive effects and *c*\sqrt{n} epistatic effects during the first run of Bayesian analysis. We then apply the same Bayesian analysis to these pre-selected variables. The second run of Bayesian analysis has both *w*_{β+}+ *w*_{β-}and *w*_{γ+}+ *w*_{γ-}, *a priori*, uniformly distributed on [0, 1].

#### Likelihood

Let *Y*_{
n
}be the column vector including the trait values of all strains under investigation, let *X*_{
i
}be the vector of all marker values of the *i*-th strain and {X}_{n}={({X}_{1}^{T},\cdots ,{X}_{n}^{T})}^{T}, and let *Z*_{
i
}be the vector of all epistatic candidate values of the *i*-th strain. Denote the marginal distribution of *A* as [*A*], and the conditional distribution of *A* given *B* as [*A|B*]. With data (**Y**_{
n
}, **X**_{
n
}) and the prior specification in Section 3.1, we have the likelihood function, that is, the joint distribution function of the data (*Y*_{
n
}, *X*_{
n
}), the parameters (*μ*, ** β**,

**), {\sigma}_{\epsilon}^{2}, and all hyperparameters**

*γ*The distribution of *X*_{
n
}can be specified based on the available linkage map information [2]. The conditional distribution of [\beta |{w}_{\beta +},{w}_{\beta -},{\mu}_{\beta +},{\sigma}_{\beta +}^{2},{\mu}_{\beta -},{\sigma}_{\beta -}^{2}] is a product of the prior distribution for each *β*_{
j
}. Similarly, the conditional distribution of [\gamma |{w}_{\gamma +},{w}_{\gamma -},{\mu}_{\gamma +},{\sigma}_{\gamma +}^{2},{\mu}_{\gamma -},{\sigma}_{\gamma -}^{2}] is a product of the prior distribution for each *γ*_{
j
}. The priors of the hyperparameters, *θ*_{β+}, *θ*_{γ+}, *φ*_{β+}, *φ*_{γ+}, *θ*_{β-}, *θ*_{γ-}, *φ*_{β-}and *φ*_{γ-}, are specified to be as noninformative as possible.

#### Gibbs Sampling

Since the specified priors are conditionally conjugate, Bayesian variable selection can be implemented with a Gibbs sampling algorithm. We initialize the algorithm by imputing missing genotypic values based on the observed genotypes and linkage information. The initial value of *μ* is set as the mean of the observed trait values. Then, with individuals having fully observed trait values, each component of ** β** and

**is initially estimated using recursive univariate regression. Other parameters, {w}_{\beta +},{w}_{\beta -},{\mu}_{\beta +},{\sigma}_{\beta +}^{2},{\mu}_{\beta -} and {\sigma}_{\beta -}^{2}, are simply initialized based on the initial value of**

*γ***, and similarly, the initial values for {w}_{\gamma +},{w}_{\gamma -},{\mu}_{\gamma +},{\sigma}_{\gamma +}^{2},{\mu}_{\gamma -}, and {\sigma}_{\gamma -}^{2} can be specified using the information from**

*β***. For example, we can initialize {\sigma}_{\beta +}^{2}={\sigma}_{\beta -}^{2} with an estimate from the initial value**

*γ**β*, and then use max{#{

*j*:

*β*

_{ j }> 2

*σ*

_{β+}}, 1}/

*p*

_{ β }to initialize

*w*

_{β+}.

Let *X*_{i,-j}be *X*_{
i
}excluding the *j*-th component, and define *β*_{-j}and *γ*_{-j}similarly. Based on the likelihood function in (4), the Gibbs sampler can be developed by recursively drawing the missing genotypic values, the missing trait values, and the model parameters from their full conditional posterior distributions as follows.

**Sample missing values:** Sample each missing genotypic value *X*_{
ij
}from its full conditional posterior distribution,

and then sample each missing trait value *Y*_{
i
}from its full conditional posterior distribution [*Y*_{
i
}*|X*_{
i
}, *μ*, *β*, *γ*, {\sigma}_{\epsilon}^{2}].

**Sample** ** μ**: Sample

*μ*from its full conditional posterior distribution,

**Sample** *β***and** ** γ**: Sample each

*β*

_{ j }and

*γ*

_{ j }from their full conditional posterior distributions,

where {\tilde{w}}_{\beta j+},{\tilde{w}}_{\beta j-},{\tilde{\mu}}_{\beta j+},{\tilde{\sigma}}_{\beta j+}^{2},{\tilde{\mu}}_{\beta j-}, and {\tilde{\sigma}}_{\beta j-}^{2} are specified in the APPENDIX. In addition, {\tilde{w}}_{\gamma j+},{\tilde{w}}_{\gamma j-},{\tilde{\mu}}_{\gamma j+},{\tilde{\sigma}}_{\gamma j+}^{2},{\tilde{\mu}}_{\gamma j-}, and {\tilde{\sigma}}_{\gamma j-}^{2} can be obtained similarly.

**Sample** *w*_{β+}, *w*_{γ+}, *w*_{β-}, **and** *w*_{γ-}: These parameters can be sampled from the conditional posterior distributions,

where {\tilde{p}}_{\beta +} = #{*β*_{
j
}> 0 : 1 ≤ *j* ≤ *p*_{
β
}} and {\tilde{p}}_{\beta -} = #{*β*_{
j
}< 0 : 1 ≤ *j* ≤ *p*_{
β
}}; {\tilde{p}}_{\gamma +} = #{*γ*_{
j
}> 0 : 1 ≤ *j* ≤ *p*_{
γ
}} and {\tilde{p}}_{\gamma -} = #{*γ*_{
j
}< 0 : 1 ≤ *j* ≤ *p*_{
γ
}}.

**Sample**{\sigma}_{\epsilon}^{2},{\sigma}_{\beta +}^{2},{\sigma}_{\gamma +}^{2},{\sigma}_{\beta -}^{2}**, and**{\sigma}_{\gamma -}^{2}: With conditionally conjugate priors, the posterior for all variance parameters are still inverse gamma distributions. Specifically,

#### Bayesian Inference

For each variable in model (1), one pair of parameters is used to select the corresponding variable. They are, for the *j*-th additive effect, the posterior probabilities *w*_{β j+}= *P* (*β*_{
j
}> 0|**Y**_{
n
}, **X**_{
n
}) and *w*_{β j-}= *P* (*β*_{
j
}< 0|**Y**_{
n
}, **X**_{
n
}). With the full conditional posterior distribution of *β*_{
j
}and all the notations in the APPENDIX, we have

Therefore, the two parameters *w*_{β j+}and *w*_{β j-}can be estimated with the Markov chains of {\tilde{w}}_{\beta j+} and {\tilde{w}}_{\beta j-} drawn from the above Gibbs sampler. If and only if both *w*_{β j+}and *w*_{β j-}are less than 0.5, the median of the posterior distribution of *β*_{
j
}is zero. Similarly, the posterior probabilities *w*_{γ j+}= *P* (*γ*_{
j
}> 0|*Y*_{
n
}, *X*_{
n
}) and *w*_{γ j-}= *P* (*γ*_{
j
}< 0|*Y*_{
n
}, *X*_{
n
}) can be estimated with the Markov chains of {\tilde{w}}_{\gamma j+} and {\tilde{w}}_{\gamma j-} drawn from the above Gibbs sampler.

We propose to select variables twice under the above Bayesian framework for model (1). At the first step, we use a restrictive prior for each coefficient to ensure an identifiable Bayesian model and enforce to stochastically search for an optimal low-dimensional parameter subspace. We then rank the *j*-th additive effect based on max{*w*_{β j+}, *w*_{β j-}}, and rank the *j*-th epistatic effect based on max{*w*_{γ j+}, *w*_{γ j-}}. The top *c*\sqrt{n} out of *p*_{
β
}additive effects, and the top *c*\sqrt{n} out of *p*_{
γ
}epistatic effects are selected, respectively. At the second step, we select variables out of those selected *c*\sqrt{n} additive effects and *c*\sqrt{n} epistatic effects, under the above Bayesian framework for model (1). Obviously, we have a non-restrictive prior for each coefficient at the second step, and therefore avoid possible over-penalization due to restrictive priors.

Following Jeffreys [49, 50], we test the hypothesis *H*_{0} : *β*_{
j
}= 0 vs. *H*_{1}: *β*_{
j
}≠ 0 on the basis of the Bayes factor, which was defined as

where *π* (*β*_{
j
}= 0) and *π* (*β*_{
j
}≠ 0) are the *a priori* probabilities, and the last equality follows the fact that *π* (*β*_{
j
}= 0) = *π* (*β*_{
j
}≠ 0) at the second step of our Bayesian Classification. As suggested by Jeffreys [50], a *B*_{10} (*β*_{
j
}) with value between 1 and \sqrt{10} ≈ 3.2 provides "not worth more than a bare mention" evidence against *H*_{0}; a *B*_{10} (*β*_{
j
}) with value from \sqrt{10} to 10 provides "substantial" evidence against *H*_{0}; a *B*_{10} (*β*_{
j
}) with value from 10 to 100 provides "strong" evidence against *H*_{0}; and a *B*_{10} (*β*_{
j
}) with value larger than 100 provides "decisive" evidence against *H*_{0}. Similarly, we can test the hypothesis *H*_{0}: *γ*_{
j
}= 0 vs. *H*_{1}: *j* ≠ 0 using the following Bayes factor

## Appendix

**Fully Conditional Posterior Distribution of**
*β*
_{
j
}

For each *j* = 1, ⋯, *p*_{
β
}, the fully conditional posterior distribution of *β*_{
j
}is

where the updated parameter values are

## References

Lander ES, Botstein D: Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps.

*Genetics*1989, 121: 185–199.Doerge RW, Zeng ZB, Weir BS: Statistical issues in the search for genes affecting quantitative traits in experimental populations.

*Statistical Science*1997, 12: 195–219. 10.1214/ss/1030037909Broman KW, Speed TP: A model selection approach for the identification of quantitative trait loci in experimental crosses.

*Journal of the Royal Statistical Society Series B*2002, 64: 641–656. 10.1111/1467-9868.00354Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S: Bayesian shrinkage estimation of quantitative trait loci parameters.

*Genetics*2005, 170: 465–480. 10.1534/genetics.104.039354Carlborg d, Haley CS: Epistasis: too often neglected in complex trait studies?

*Natuer Review Genetics*2004, 5: 618–625. 10.1038/nrg1407Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human disease.

*Human Heredity*2003, 56: 73–82. 10.1159/000073735Williams SM, Addy JH, Phillips JAI, Dai M, Kpodonu J, Afful J, Jackson H, Joseph K, Eason F, Murray MM, Epperson P, Aduonum A, Wong LJ, Jose PA, Felder RA: Combinations of variation in multiple genes are associated with hypertension.

*Hypertension*2000, 36: 2–6.Leamy LJ, Routman EJ, Cheverud JM: An epistatic genetic basis for fluctuating asymmetry of mandible size in mice.

*Evolution*2002, 56: 642–653.Wagner A: Robustness against mutations in genetic networks of yeast.

*Nature Genetics*2000, 24: 355–361. 10.1038/74174Sanjuán R, Cuevas JM, Moya A, Elena SF: Epistasis and the adaptability of an RNA virus.

*Genetics*2005, 170: 1001–1008. 10.1534/genetics.105.040741Eshed Y, Zamir D: Less-than-additive epistatic interactions of quantitative trait loci in tomato.

*Genetics*1996, 143: 1807–1817.Xu S, Jia Z: Genomewide analysis of epistatic effects for quantative traits in Barley.

*Genetics*2007, 175: 1955–1963. 10.1534/genetics.106.066571Bateson W:

*Mendel's Principles of Heredity*. Cambridge: Cambridge University Press; 1909.Kao CH, Zeng ZB: Modeling epistasis of quantitative trait loci using Cockerham's model.

*Genetics*2002, 160: 1243–1261.Yi N, Xu S: Mapping quantitative trait loci with epistatic effects.

*Genetical Research*2002, 79: 185–198. 10.1017/S0016672301005511Yi N, Yandell BS, Churchill GA, Allison DB, Eisen EJ, Pomp D: Bayesian model selection for genome-wide epistatic quantitative trait loci analysis.

*Genetics*2005, 170: s1333–1344. 10.1534/genetics.104.040386Yi N, Banerjee S, Pomp D, Yandell BS: Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits.

*Genetics*2007, 176: 1855–1864. 10.1534/genetics.107.071142Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS: An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects.

*Genetics*2007, 176: 1865–1877. 10.1534/genetics.107.071365Yandell BS, Mehta T, Banerjee S, Shriner D, Venkataraman R, Moon JY, Neely WW, Wu H, von Smith R, Yi N: R/qtlbim: QTL with Bayesian interval mapping in experimental crosses.

*Bioinformatics*2007, 23: 641–643. 10.1093/bioinformatics/btm011Bogdan M, Ghosh JK, Doerge RW: Modifying the Schwartz Bayesian information criterion to locate multiple interacting quantitative trait loci.

*Genetics*2004, 167: 989–999. 10.1534/genetics.103.021683Cui YH, Wu R: Mapping genome-genome epistasis: a high-dimensional model.

*Bioinformatics*2005, 21: 2447–2455. 10.1093/bioinformatics/bti342Żak M, Baierl A, Bogdan M, Futschik A: Locating multiple interacting quantitative trait loci using rank-based model selection.

*Genetics*2007, 176: 1845–1854. 10.1534/genetics.106.068031Shi W, Lee KE, Wahba G: Detecing disease-causing genes by LASSO-Patternsearch algorithm.

*BMC Proceedings*2007, 1(Suppl 1):S60.Bogdan M, Frommlet F, Biecek P, Cheng R, Ghosh JK, Doerge RW: Extending the modified Bayesian information criterion (mBIC) to dense markers and multiple interval mapping.

*Biometrics*, in press.Kao CH, Zeng ZB, Teasdale RD: Multiple interval mapping for quantitative trait loci.

*Genetics*1999, 152: 1203–1216.Zeng ZB, Kao CH, Basten CJ: Estimating the genetic architecture of quantitative traits.

*Genetical Research*1999, 74: 279–289. 10.1017/S0016672399004255Ball RD: Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion.

*Genetics*2001, 159: 1351–1364.Zhang M, Montooth KL, Wells MT, Clark AG, Zhang D: Mapping multiple quantitative trait loci by Bayesian classification.

*Genetics*2005, 169: 2305–2318. 10.1534/genetics.104.034181Mitchell TJ, Beauchamp JJ: Bayesian variable selection in linear regression (with discussion).

*Journal of the American Statistical Association*1988, 83: 1023–1036. 10.2307/2290129George EI, McCulloch RE: Variable selection via Gibbs sampling.

*Journal of the American Statistical Association*1993, 88: 881–889. 10.2307/2290777Ishwaran H, Rao JS: Spike and slab variable selection: frequentist and Bayesian strategies.

*The Annals of Statistics*2005, 33: 730–773. 10.1214/009053604000001147Zhang M, Zhang D, Wells MT: Generalized Shrinkage Estimators Adpative to Sparsity and Asymmetry of High Dimensional Parameter Spaces.

*Technical Reports, Department of Statistics, Purdue University*2008, 08–01.Liu J, Mercer JM, Stam LF, Gibson G, Zeng ZB, Laurie CC: Genetic analysis of a morphological shape difference in the male genitalia of

*Drosophila simulans*and*D. mauritiana*.*Genetics*1996, 142: 1129–1145.Zeng ZB, Liu J, Stam LF, Kao CH, Mercer JM, Laurie CC: Genetic architecture of a morphological shape difference between two drosophila species.

*Genetics*2000, 154: 299–310.Kao CH, Zeng ZB: General formula for obtaining the MLEs and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm.

*Biometrics*1997, 53: 653–665. 10.2307/2533965Tanksley SD: Mapping polygenes.

*Annual Review Genetics*1993, 27: 205–233. 10.1146/annurev.ge.27.120193.001225Bogdan M, Doerge RW: Biased estimators of quantitative trait locus heritability and location in interval mapping.

*Heredity*2005, 95: 476–484. 10.1038/sj.hdy.6800747Tibshirani RJ: Regression shrinkage and selection via the lasso.

*Journal of the Royal Statistical Society Series B*1996, 58: 267–288.Fan J, Peng H: Nonconcave penalized likelihood with a diverging number of parameters.

*The Annals of Statistics*2004, 32: 928–961. 10.1214/009053604000000256Álvarez-Castro JM, Carlborg O: A unified model for functional and statistical epistasis and its application in quantitative trait loci analysis.

*Genetics*2007, 176: 1151–1167. 10.1534/genetics.106.067348Zeng ZB, Wang T, Zou W: Modeling quantitative trait loci and interpretation of models.

*Genetics*2005, 169: 1711–1725. 10.1534/genetics.104.035857Hastie H, Tibshirani R, Sherlock G, Eisen M, Brown P, Botstein D: Imputing missing data for gene expression arrays.

*PhD thesis*. Stanford University, Statistics Department; 1999.Little RJA, Rubin DB:

*Statistical Analysis with Missing Data*. New York: John Wiley; 2002.Huber P: Robust regression: asymptotics, conjectures, and Monte Carlo.

*The Annals of Statistics*1973, 1: 799–821. 10.1214/aos/1176342503Portnoy S: Asymptotic behavior of M-estimators of

*p*regression parameters when*p*^{2}*/n*is large, I. Consistency.*Annals of Statistics*1984, 12: 1298–1309. 10.1214/aos/1176346793Greenshtein E, Ritov Y: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization.

*Bernoulli*2004, 10: 971–988. 10.3150/bj/1106314846Fourdrinier D, Strawderman WE, Wells MT: On the construction of Bayes minimax estimators.

*The Annals of Statistics*1998, 26: 660–671. 10.1214/aos/1028144853Gaffney PJ: An efficient reversible jump Markov chain Monte Carlo approach to detect multiple loci and their effects in inbred crosses.

*PhD thesis*. Department of Statistics, University of Wisconsin, Madison, WI; 2001.Jeffreys H: Some tests of significance, treated by the theory of probability.

*Proceedings of the Cambridge Philosophy Society*1935, 31: 201–222.Jeffreys H:

*Theory of Probability*. Oxford: Clarendon Press; 1961.

## Acknowledgements

We thank Rebecca Doerge and three anonymous referees for suggestions and comments that significantly improved this article. This research was supported by NSF grant 0612031 and NIH-NIGMS 1R01GM083606-01 to MTW, Purdue Research Foundation grants to MZ and DZ, and Purdue Alumni Association Faculty Incentive Grant to MZ.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Authors' contributions

MZ and DZ both conceived the study and developed the method. DZ wrote the MATLAB^{®} code and did the simulation study. MZ analyzed the real data. MTW participated in conceptual development, writing, reviewing and editing the manuscript. All authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Zhang, M., Zhang, D. & Wells, M.T. Variable selection for large *p* small *n* regression models with incomplete data: Mapping QTL with epistases.
*BMC Bioinformatics* **9**, 251 (2008). https://doi.org/10.1186/1471-2105-9-251

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1471-2105-9-251