Empirical Bayes analysis of single nucleotide polymorphisms
 Holger Schwender^{1}Email author and
 Katja Ickstadt^{1}
DOI: 10.1186/147121059144
© Schwender and Ickstadt; licensee BioMed Central Ltd. 2008
Received: 18 September 2007
Accepted: 06 March 2008
Published: 06 March 2008
Abstract
Background
An important goal of wholegenome studies concerned with single nucleotide polymorphisms (SNPs) is the identification of SNPs associated with a covariate of interest such as the casecontrol status or the type of cancer. Since these studies often comprise the genotypes of hundreds of thousands of SNPs, methods are required that can cope with the corresponding multiple testing problem. For the analysis of gene expression data, approaches such as the empirical Bayes analysis of microarrays have been developed particularly for the detection of genes associated with the response. However, the empirical Bayes analysis of microarrays has only been suggested for binary responses when considering expression values, i.e. continuous predictors.
Results
In this paper, we propose a modification of this empirical Bayes analysis that can be used to analyze highdimensional categorical SNP data. This approach along with a generalized version of the original empirical Bayes method are available in the R package siggenes version 1.10.0 and later that can be downloaded from http://www.bioconductor.org.
Conclusion
As applications to two subsets of the HapMap data show, the empirical Bayes analysis of microarrays cannot only be used to analyze continuous gene expression data, but also be applied to categorical SNP data, where the response is not restricted to be binary. In association studies in which typically several ten to a few hundred SNPs are considered, our approach can furthermore be employed to test interactions of SNPs. Moreover, the posterior probabilities resulting from the empirical Bayes analysis of (prespecified) interactions/genotypes can also be used to quantify the importance of these interactions.
Background
Wholegenome experiments comprise data of hundreds of thousands of single nucleotide polymorphisms (SNPs), where a SNP is the most common type of genetic variations that occurs when at a single base pair position different base alternatives exist in a population. SNPs are typically biallelic. Therefore, SNPs can be interpreted as categorical variables having three realizations: the homozygous reference genotype (if both chromosomes show the more frequent variant), the heterozygous genotype (if one chromosome shows the more frequent, and the other the less frequent variant), and the homozygous variant genotype (if both bases explaining the SNP are of the less frequent variant).
Since SNPs can alter the risk for developing a disease, an important goal in studies concerned with SNPs is the identification of the SNPs that show a distribution of the genotypes that differs substantially between different groups (e.g., cancer vs. noncancer). Detecting such SNPs requires methods that can cope with this vast multiple testing problem in which hundreds of thousands of hypotheses are tested simultaneously. Naturally, the value of a statistic appropriate for the considered testing situation and the corresponding pvalue are computed for each variable, where in the case of SNPs Pearson's χ^{2}statistic is an appropriate test score. These raw pvalues are then adjusted for multiple comparisons such that a Type I error rate is strongly controlled at a prespecified level of significance α.
The classical example for a Type I error rate is the familywise error rate
FWER = Prob(V ≥ 1),
where V is the number of false positives, i.e. the number of rejected null hypotheses that are actually true – or in biological terms, the number of SNPs found by the procedure to differ between groups that actually do not differ between the groups. This error rate is strongly controlled at a level α so that Prob(V ≥ 1) ≤ α by approaches such as the Bonferroni correction or the procedures of Westfall and Young [1]. An overview on such methods is given in [2]. In [3], procedures for controlling this and other error rates are compared in an application to gene expression data.
proposed by Benjamini and Hochberg [4], has hence become popular which in turn is a reasonable choice in the analysis of highdimensional SNP data.
Apart from adjusting pvalues, there also exist other approaches for adjusting for multiple comparisons such as the significance analysis of microarrays (SAM [5]) and the empirical Bayes analysis of microarrays (EBAM [6]) that have been developed particularly for the analysis of gene expression data.
In the original versions of both SAM and EBAM, a moderated tstatistic is computed. In SAM, the observed values of this test score are then plotted against the values of the statistic expected under the null hypothesis of no difference between the two groups, and a gene is called differentially expressed if the point representing this gene in this QuantileQuantile plot is far away from the diagonal. In EBAM, the density f of the observed values z of the moderated tstatistic is modeled by a mixture of the density f_{1} of the differentially expressed genes and the density f_{0} of the not differentially expressed genes, i.e. by
f(z) = π_{0}f_{0}(z) + π_{1}f_{1}(z),
for being differentially expressed is larger than or equal to 0.9.
In [7], a generalized version of the SAM algorithm is presented, whereas in [8, 9] SAM is adapted for categorical data such as SNP data.
In the following section, we first present a generalized EBAM algorithm. Then, we propose an adaption of EBAM enabling the analysis of categorical data. As computing the values of the test statistic for all SNPs individually would be very timeconsuming, we further suggest an approach based on matrix algebra that allows to compute all values simultaneously. Afterwards, EBAM for categorical data is applied, on the one hand, to two subsets of the highdimensional SNP data from the HapMap project [10], and on the other hand, to simulated data that mimic data from a typical association study in which several ten SNPs are considered. In the latter application, it is also shown how EBAM can be applied to identify SNP interactions associated with the response, and how it can be used to specify the importance of prespecified SNP interactions.
Methods
Generalized EBAM algorithm
where z_{ i }is the observed value of the test statistic Z_{ i }for variable i = 1 ⋯ m, π_{0} is the prior probability that a gene is not differentially expressed – or more generally, that a variable is not associated with the response – and ${\text{E}}_{{\text{H}}_{0}}$ (#{Z_{ i }∈ Γ}) is the number of values expected under the null hypothesis to fall into Γ [11].
Algorithm 1 (Generalized EBAM Procedure)
 1.
For each variable i = 1, ..., m, compute the value z_{ i }of a statistic appropriate for testing if the values of this variable are associated with the response.
 2.
If the null density f_{0}, is known, use a density estimation procedure to obtain $\widehat{f}$ and compute $\widehat{\varphi}={f}_{0}/\widehat{f}$. Otherwise, estimate the ratio ϕ = f_{0}/f directly by
 (a)
determining the m permuted zvalues z_{ ib }for each permutation b = 1, ..., B of the n values of the response,
 (b)
binning the m observed and mB permuted zvalues into an appropriate number of intervals,
 (c)
fitting a logistic regression model with repeated observations through these intervals using an appropriate regression function.
 3.
Estimate π_{0} by the procedure of Storey and Tibshirani [12].
 4.
For each variable i, compute the posterior probability ${\widehat{p}}_{1}({z}_{i})=1{\widehat{\pi}}_{0}\widehat{\varphi}({z}_{i})$
 5.
Order the observed zvalues to obtain z_{(1)} ≤ ... ≤ z_{(m)}, and set ${i}_{0}={\displaystyle {\sum}_{i=1}^{m}I({z}_{(i)}<0)+1}$
 6.
For a prespecified probability Δ or a set of appropriate values for Δ,
 (a)set ${i}_{1}={\mathrm{max}\phantom{\rule{0.1em}{0ex}}}_{i\ge {i}_{0}}\left\{i:{\widehat{p}}_{1}({z}_{(i)})<\Delta \right\}+1$, and compute the upper cutoff c_{ U }by${c}_{U}=\{\begin{array}{ll}{z}_{({i}_{1})},\hfill & \text{if}{i}_{1}\le m\hfill \\ \infty \hfill & \text{otherwise}\hfill \end{array},$
 (b)set ${i}_{2}={\mathrm{min}\phantom{\rule{0.1em}{0ex}}}_{i<{i}_{0}}\left\{i:{\widehat{p}}_{1}({z}_{(i)})<\Delta \right\}1$, and compute the lower cutoff c_{ L }by${c}_{L}=\{\begin{array}{ll}{z}_{({i}_{2})},\hfill & \text{if}{i}_{0}1\text{and}{i}_{2}\ge 1\hfill \\ \infty \hfill & \text{otherwise}\hfill \end{array},$
 (c)
call all variables i with ${z}_{i}\notin {\Gamma}_{\Delta}^{C}$ significant, where ${\Gamma}_{\Delta}^{C}=({c}_{L},{c}_{U})$ denotes the complement of the rejection region Γ_{Δ},
 (d)estimate the FDR of Γ_{Δ} by$\stackrel{\_}{\text{FDR}}({\Gamma}_{\Delta})={\widehat{\pi}}_{0}\frac{\alpha m}{\mathrm{max}\phantom{\rule{0.1em}{0ex}}\left\{\#\left\{{z}_{i}\in {\Gamma}_{\Delta}\right\},1\right\}},$
for each gene i = 1, ..., m, where d_{ i }is the difference of the groupwise mean expression values and s_{ i }is the corresponding standard deviation such that d_{ i }/s_{ i }is the ordinary tstatistic. The fudge factor a_{0} is computed by the quantile of the m standard deviations that leads to the largest number of genes called differentially expressed in a standardized EBAM analysis (see [6] for details on this standardized analysis). Since the null distribution of (1) is unknown, the response is permuted repeatedly to generate mB permuted zvalues. Efron et al. [6] then bin the m observed and mB permuted zvalues into 139 intervals. Treating the observed scores as successes and the permuted values as failures, a logistic regression model is fitted through the binned data points using a natural cubic spline with five degrees of freedom as regression function. For details on this logistic regression, see Remark (D) in [6].
Algorithm 1 also comprises the approach used by Efron and Tibshirani [13] to test twogroup gene expression data with Wilcoxon rank statistics.
The main difference between Algorithm 1 and the original version of EBAM is that Efron et al. [6] call all genes differentially expressed that have a posterior probability larger than or equal to Δ = 0.9, whereas we only call a variable i with $\widehat{p}$_{1}(z_{ i }) ≥ Δ significant if there is no other variable with a more extreme zvalue (a larger zvalue if z_{ i }> 0, or a smaller zvalue if z_{ i }< 0) that has a posterior probability less than Δ. This approach that is comparable to the proceeding in SAM, therefore, ensures that all variables with a zvalue exceeding some threshold are called significant, whereas in the original version of EBAM it might happen that a variable is not called significant, even though it has a more extreme zvalue than some of the identified variables.
Another difference is that Efron et al. [6] consider one fixed posterior probability, namely Δ = 0.9, for calling genes differentially expressed, whereas we allow both to prespecify one probability Δ and to consider a set of reasonable values for Δ. The latter again is similar to the SAM procedure in which the number of genes called differentially expressed and the estimated FDR is determined for several values of the SAM threshold, and then the value is chosen that provides the best balance between the number of identified genes and the estimated FDR. This approach can be helpful when the detection of interesting variables is just an intermediate aim, and the actual goal of the analysis is, e.g., the construction of a classification rule. In such a case, prespecifying the value of Δ might work poorly, as this might lead to either a too small number of identified variables, or a too high FDR. For an example of this proceeding in the context of the empirical Bayes analysis, see the application of EBAM for categorical data to the HapMap data set.
EBAM for categorical data
We now assume that our data consist of m categorical variables each exhibiting C levels denoted by 1, ..., C, and n observations each belonging to one of R groups denoted by 1, ..., R. If these variables are SNPs, C = 3.
where n_{ rc }and $\tilde{n}$_{ rc }are the observed number of observations and the number of observations expected under the null hypothesis in group r = 1, ..., R, respectively, showing level c = 1, ..., C.
Since the small denominator problem [5, 6, 14], which is the reason for adding the fudge factor a_{0} to the denominator of the ordinary tstatistic in (1), does not show up in this case, it is not necessary to add a fudge factor to the denominator of (2). Therefore, Algorithm 1 can be applied to SNPs – or to any other type of (genetic) categorical data – by employing Pearson's χ^{2}statistic as test score.
In EBAM, it is assumed that all variables follow the same null distribution. In the permutation based approach of Algorithm 1, this, e.g., means that not only the B permuted zvalues corresponding to a particular variable, but all mB permutations of all m variables are considered in the estimation of the null distribution of this variable. Normally, this is an advantage in the analysis of highdimensional data [6, 15]. In the analysis of categorical data, this, however, might lead to a loss of a large number of variables, as only variables showing the same number of levels can be considered together in an EBAM analysis.
Approximation to χ^{2}distribution
Since the null distribution of (2) can be approximated by a χ^{2}distribution with (R  1)(C  1) degrees of freedom, only the density f of the observed test statistics needs to be estimated. This can be done by applying a (nonparametric) kernel density estimator to the observed zvalues [16]. However, the standard kernels are typically symmetric such that negative values of z will have a positive estimated density, even though f(z) = 0 for z < 0. A solution to this problem is to use asymmetric kernels that only give nonnegative values of z a positive density [17, 18]. Another solution, which we will use, is a semiparametric method proposed by Efron and Tibshirani [19].
In the first step of this procedure, a histogram of the observed zvalues is generated. To obtain a reasonable number of bins for the histogram, we employ the onelevel bin width estimator of Wand [20]. Although other bin width estimators such as the approaches of Scott [21] or of Freedman and Diaconis [22] lead to different bin widths, the densities resulting from the method of Efron and Tibshirani [19] are virtually identical. The approach of Sturges [23], however, which is, e.g., the default method for estimating the number of bins in the R function hist, typically leads to a much too small number of intervals when considering large numbers of observations [24], and is therefore an inappropriate procedure in our application.
Having estimated f, $\widehat{\varphi}={f}_{0}/\widehat{f}$ is determined, and the remaining steps 3 to 6 of Algorithm 1 are processed.
Permutation based estimation of the null density
If the assumptions for the approximation to the χ^{2}distribution are not met [27], the null density f_{0} also has to be estimated. In this case, we calculate the ratio $\widehat{\varphi}$ directly by permuting the group labels B times, computing the mB permuted zvalues, dividing these scores and the m observed zvalues into intervals, and fitting a logistic regression model through the binned data points. Similar to the application of the procedure of Efron and Tibshirani [19] (see previous section), the estimation of ϕ does not depend on the number of intervals used in the binning as long as this number is not too small or too large. We therefore follow Efron et al. [6], and split the observed and permuted zvalues into 139 intervals. Since the rejection region is onesided when considering Pearson's χ^{2}statistic as test score, a natural cubic spline with three degrees of freedom is used as regression function.
Implementation
Wholegenome studies comprise the genotypes of hundreds of thousands of SNPs for each of which the value of Pearson's χ^{2}statistic (2) has to be computed. Since calculating these values onebyone is very timeconsuming, we employ matrix algebra for determining all the scores simultaneously.
Assume that we have given an m × n matrix X in which each row corresponds to a categorical variable exhibiting the levels 1, ..., C, and a vector y comprising the group labels 1, ..., R of the n observations represented by the columns of X.
i = 1, ..., m, j = 1, ..., n. Furthermore, an n × R matrix Y with entries y_{ jr }= I(y_{ j }= r) is built in which each column represents one of the R group labels. Then, we set
N^{(c)}= X^{(c)}Y
If the permutation based version of EBAM for categorical data is used, then not "just" m, but m(B + 1) zvalues have to be computed. Again, matrix algebra can help to speed up computation by considering all B permutations at once, or – if the number of variables or permutations is too large – subsets of the B permutations.
where ${\tilde{n}}_{r}^{(c)}$ is the r th column of $\tilde{N}$, ⊗ is the symbol for the Kronecker product, and * and the fraction line denote elementwise matrix calculation.
Processing time
Comparison of computation times (in seconds) on an AMD Athlon XP 3000+ machine with one GB of RAM for both the matrix algebra based calculation and the individual determination of the values of Pearson's χ^{2}statistic for different numbers of variables and observations. Each of the m variables can take C = 3 levels, and each of the n observations belongs to one of R = 2 classes.
Matrix Algebra Based  Individual  

m  n = 200  n = 1, 000  n = 200  n = 1, 000 
50  < 0.01  0.01  0.13  0.16 
100  < 0.01  0.02  0.26  0.32 
1,000  0.05  0.40  2.64  3.35 
10,000  0.63  2.39  26.74  34.42 
100,000  6.16    274.96   
Note that the main reason for this immense reduction in computation time is not that the matrix calculation approach is algorithmically less complex than an individual computation, but that the implementation of this approach makes essential use of the way how vectorization and matrix multiplication are implemented in R [28].
Results
To exemplify that EBAM can be used to analyze highdimensional categorical data, it is first applied to two subsets of the genotype data from the International Hapmap Project [10]. Afterwards, it is shown how EBAM can be employed to identify SNP interactions associated with the response in association studies, and to quantify the importance of genotypes. R code for reproducing the results of all analyses performed in this section is available in Additional file 1.
Application to HapMap data
In the International HapMap Project, millions of SNPs have been genotyped for each of 270 people from the four populations Japanese from Tokyo (abbreviated by JPT), Han Chinese from Beijing (CHB), Yoruba in Ibadan, Nigeria (YRI), and CEPH (Utah residents with ancestry from northern and western Europe, abbreviated by CEU).
About 500,000 of these SNPs have been measured using the Affymetrix GeneChip Mapping 500 K Array Set that consists of two chips. In this paper, we focus on the BRLMM (Bayesian Robust Linear Models with Mahalanobis distance) genotypes [29] of the 262,264 SNPs from one of these chips, namely the Nsp array (see [30] for these genotypes).
JPT vs. CHB
Since we are mainly interested in casecontrol studies, or more generally in binary responses, EBAM is applied to the 45 JPT and the 45 CHB to detect the SNPs that show a distribution that differs substantially between these two population. Another reason is that both the JPT are unrelated, and the CHB are unrelated, whereas the other two populations consist each of 30 trios each of which is composed of genotype data from a mother, a father and their child.
Since in EBAM it is assumed that all variables follow the same null distribution, only SNPs showing the same number of genotypes are considered in the same EBAM analysis. Moreover, the current implementation of EBAM in the R package siggenes cannot handle missing values such that either missing genotypes have to be imputed, or SNPs with missing genotypes have to be removed prior to the EBAM analysis. Therefore, 54,400 SNPs showing one or more missing genotypes and 75,481 SNPs for which not all three genotypes are observed at the 90 persons are excluded from the analysis leading to a data set composed of the genotypes of 132,383 SNPs.
Using an AMD Athlon XP 3000+ machine with one GB of RAM on which Windows XP is installed, an application of EBAM to this data set takes 11.62 seconds if the null density f_{0} is approximated by the χ^{2}density with two degrees of freedom, whereas it takes about 182 seconds if f_{0} is estimated using 100 permutations.
In the upper left panel of Figure 1, a histogram and the estimated density $\widehat{f}$ of the observed test scores is displayed. For many of the SNPs the assumptions for an approximation to the χ^{2}distribution might not be met [27], as some of the expected numbers in the corresponding contingency table are smaller than 5. We therefore prefer not to use the approximation to the χ^{2}distribution, but the permutation based approach of EBAM for categorical data.
Employing the threshold Δ = 0.9 as suggested by Efron et al. [6], i.e. calling all SNPs significant that have a posterior probability of being significant larger than or equal to 0.9, leads to the identification of 193 SNPs with an estimated FDR of 0.08.
Estimated FDRs and numbers of identified SNPs for several values of the threshold Δ.
0.89  0.90  0.91  0.92  0.93  0.94  0.95  

Number  224  193  147  109  66  42  22 
FDR  0.090  0.080  0.070  0.063  0.056  0.048  0.039 
Multiclass case
EBAM for categorical variables is not restricted to binary responses. It, e.g., can also be used to identify the SNPs showing a distribution that differs strongly between the four HapMap populations.
For this analysis, the most obvious dependencies are removed by excluding the child from each of the 60 trios such that 45 JPT, 45 CHB, 60 YRI, and 60 CEU are considered. Again, all SNPs for which at least one of the 210 values are missing (104,872 SNPs), or for which not all three genotypes are observed (14,273 SNPs), are excluded from the analysis resulting in a data set composed of the genotypes of 143,119 SNPs. In the lower right panel of Figure 1, the estimated density of the zvalues of these SNPs and the estimated null density are displayed. This figure reveals that a huge number of these SNPs exhibit a distribution that differs substantially in at least one of the populations. In fact, 131,336 SNPs show a posterior probability $\widehat{p}$_{1}(z) larger than or equal to 0.9, whereas 33,101 SNPs even have a posterior probability of 1.
Numbers of significant SNPs found in pairwise EBAM analyses of the four HapMap populations.
JPT  CHB  YRI  CEU  

JPT    148  66,410  92,732 
CHB  148    66,196  92,492 
YRI  66,410  66,196    92,969 
CEU  92,732  92,492  92,969   
Identification of interactions
When considering complex diseases, e.g., sporadic breast cancer, it is assumed that not individual SNPs, but interactions of SNPs have a high impact on the risk of developing the disease [34, 35]. In such a case, it would therefore be of interest to also test interactions of SNPs. However, in wholegenome studies in which the number m of SNPs is in the tens or even hundreds of thousands, it would take – depending on the order of the interactions – hours, days or even weeks to compute the test scores for all $\left(\begin{array}{c}m\\ p\end{array}\right)$pway interactions comprised by the m variables. For strategies on testing twoway interactions comprised by data from a simulated wholegenome study on a cluster of computers and their computation times, see [36]. Here, we focus our interest on the EBAM analysis of interactions of SNPs from association studies such as the GENICA study [9, 37] in which typically several ten SNPs are examined.
For the simulation of such a study, data for 50 SNPs and 1,000 observations are generated by randomly drawing the genotypes 1 (for the homozygous reference), 2 (heterozygous), and 3 (homozygous variant) for each SNP S_{ i }, i = 1 ,..., 50, where the minor allele frequency of the SNP is chosen uniformly at random from the interval [0.25, 0.4]. Afterwards, the casecontrol status y is randomly drawn from a Bernoulli distribution with mean Prob(Y = 1), where
logit(Prob(Y = 1)) = 0.5 + I(S_{6} ≠ 1, S_{7} = 1),
such that the probability of being a case is 62.25% if SNP S_{6} is not of the homozygous reference genotype and SNP S_{7} is of this genotype.
This analysis is repeated several times using different simulated data sets each generated randomly with the above settings. In each of the applications of EBAM to the individual SNPs, either one of S_{6} and S_{7}, or both are identified to be significant. Rarely, also other SNPs show a posterior probability larger than 0.9. In all of the analyses of the twoway interactions, the interaction of S_{6} and S_{7} is detected to be the most important one.
Measuring the importance of genotypes
EBAM cannot only be used to detect interesting variables or interactions. The posterior probabilities estimated by EBAM can also be employed to quantify the importance of features found by other approaches such as logicFS [38].
Logic regression [32] – which is employed as base learner in logicFS – is an adaptive regression and classification procedure that searches for Boolean combinations of binary variables associated with the response. Since this method has shown a good performance in comparison to other discrimination [9, 39] and regression [40, 41] approaches, a bagging [42] version of logic regression is used in logicFS to identify interactions of SNPs that are potentially interesting, i.e. associated with the response. While some of the found genotypes/interactions, that are of a similar form as the one intended to be influential for the disease risk in the previous section, have a high impact on the disease risk, others are only found at random by logicFS. It is therefore necessary to quantify the importance of the detected genotypes.
Since logic regression and thus logicFS can only handle binary predictors, each SNP has to be split into (at least) two binary dummy variables. We follow [32, 38] and code each SNP S_{ i }, i = 1, ..., m, by
S_{i 1}: "S_{ i }is not of the homozygous reference genotype."
S_{i 2}: "S_{ i }is of the homozygous variant genotype."
where ^{ C }denotes the complement of a binary variable with outcome true or false, and ⋀ represents the ANDoperator.
Contrary to the previous section in which each of the $\left(\begin{array}{c}m\\ p\end{array}\right)$ distributions of the values of the 3^{ p }levels comprised by the respective combination of p of the m SNPs is tested whether it differs between groups of persons, EBAM is here applied to conjunctions, i.e. ANDcombinations, of binary variables with outcome true or false which are in turn binary variables such that genotypes of different orders, i.e. combinations of genotypes of different numbers of SNPs, can be considered together in the same EBAM analysis.
It is therefore more appropriate to test the found genotypes on an independent data set. Thus, a new (test) data set is randomly generated as described in the previous section. Afterwards, the values of the 84 detected genotypes for the observations from the new data set are computed, and EBAM is applied to these values.
The same 15 genotypes as in the application to the original data set show a posterior probability of 1, where ${S}_{61}\wedge {S}_{71}^{C}$ is found to be the genotype with the largest zvalue. The other three genotypes also called significant using Δ = 0.9 either contain ${S}_{61}\wedge {S}_{71}^{C}$ or S_{61}. All the other genotypes not intended to have an impact on the disease risk, but called significant in the application to the data set on which they were found show a posterior probability less than 0.9, and thus are not called significant anymore in the application to the test data set.
Again, this analysis is repeated several times with different training and test data sets leading to similar results in each of the applications.
Conclusion and Discussion
Using the Bayesian framework to adjust for multiple comparisons is an attractive alternative to adjusting pvalues – in particular if the data are highdimensional. Thus, Efron et al. [6] have suggested an empirical Bayes analysis of microarrays (EBAM) for testing each gene if its mean expression value differs between two groups with a moderated tstatistic.
In this paper, we have proposed an algorithm that generalizes this procedure. This algorithm comprises the original EBAM analysis of Efron et al. [6] as well as the EBAM analysis based on Wilcoxon rank sums [13], and allows for other types of EBAM analyses in other testing situations. For this, it is only necessary to choose an appropriate test statistic, and, if the null density is known, a method for estimating the density of the observed test scores. The EBAM approach for categorical data proposed in this paper is one example for such an analysis. Another example would be to use an Fstatistic for performing an EBAM analysis of continuous data (e.g., gene expression data) when the response shows more than two levels. In this case, the zvalues of the genes would be given by the values of the Fstatistic, and the density of the observed zvalues might be estimated by the procedure of Efron and Tibshirani [19] if an Fdistribution with appropriate degrees of freedom is assumed to be the null distribution.
The generalized EBAM algorithm along with functions for using (moderated) tstatistics (one and twoclass, paired and unpaired, assuming equal or unequal group variances), (moderated) Fstatistics and Wilcoxon rank sums is implemented in the R package siggenes version 1.10.0 and later that can be downloaded from the webpage [43] of the BioConductor project [44] (see also the section Availability and requirements).
siggenes version 1.11.7 and later also contains a function for the EBAM analysis of categorical data proposed in this paper. Note that siggenes 1.10.× already comprises a preversion of this function. The main difference between these versions is the estimation of the density f of the observed test scores: While in siggenes 1.10.× the default version of the R function ns is used to generate the basis matrix for the natural cubic spline that is employed in the estimation of f, the inner knots of this spline are centered around the mode (and not the median) in siggenes 1.11.7 and later which leads to a better estimate of f as Figure 2 shows.
To exemplify how EBAM for categorical data can be applied to SNP data from wholegenome studies, it has been used to analyze two subsets of the HapMap data. In the first application aiming to identify SNPs showing a distribution that differs substantially between JPT and CHB, 193 of the 132,383 considered SNPs show a posterior probability larger than or equal to 0.9, and are therefore called significant by EBAM, where the estimated FDR of this set of SNPs is 0.08.
The number of identified SNPs and the corresponding FDR resulting from this EBAM analysis are identical to the results of the application of SAM to this HapMap data set [9] when the same permutations of the group labels are used in both methods. This is due to the fact that both EBAM and SAM employ the same approach to estimate the FDR. Moreover, the same set of SNPs is identified by both methods, since the same nonnegative test statistic is used in both applications. Virtually the same applies to the usage of the qvalues [11, 12] as implemented, e.g., in John Storey's R package qvalue. For example, each of the 193 SNPs found by EBAM exhibit a qvalue less than or equal to 0.08.
In the second application to the HapMap data set in which all four populations are considered, most of the 143,119 SNPs show a distribution that differs substantially in at least one of the four groups. This huge number of differences does not seem to be that surprising, as the four HapMap populations come from three different continents. Pairwise EBAM analyses of the four populations show that CEU is the population that differs the most from the other populations. Again, a SAM analysis would lead to the same estimated FDR as the EBAM analysis if the same number of SNPs is identified, where this set of significant variables will contain the same SNPs in both analyses.
An advantage of EBAM over other approaches is that it not only estimates the FDR for a set of detected variables, but also naturally provides a variablespecific estimate for the probability that a variable is associated with the response.
The two applications to the HapMap data, however, also reveal two restrictions of the EBAM procedure. Since in EBAM it is assumed that all variables follow the same null distribution, a large number of SNPs have to be removed prior to both analyses, as these SNPs either exhibit missing values or only show (one or) two of the three genotypes. A solution to the former problem would be to replace the missing genotypes using imputation methods such as KNNcatImpute [45] or – when considering Affymetrix SNP chips – to employ genotype calling algorithms such as RLMM [46] or CRLMM [47] that allow to obtain genotypes for all SNPs.
An idea to solve the second problem is to perform two EBAM analyses – one for the SNPs showing only two genotypes, and one for the SNPs with data available for all three genotypes. Having computed the posterior probabilities for the two sets of SNPs separately and called all SNPs significant that exhibit a posterior probability of being significant larger than or equal to Δ in any of the analyses, a combined FDR needs to be estimated for both analysis, since we are interested in one estimate for the FDR of all detected SNPs. How such a combined estimate of the FDR can be obtained is an open question that will be part of future research.
EBAM cannot only be used to test individual categorical variables such as SNPs, but can also be applied to interactions of these variables.
However, two problems occur when considering interactions. The first problem is that $\left(\begin{array}{c}m\\ p\end{array}\right)$pway interactions have to be tested. Although the functions implemented in siggenes allow to split the variables into subsets, an EBAM analysis of interactions in highdimensional data is not feasible in a reasonable amount of time. It is thus restricted to data from association studies in which several ten to a few hundred SNPs are considered.
The second problem is the empty cell problem: The number of observations available in a study is limited such that when considering pway interactions of SNPs some of the 3^{ p }cells of the pdimensional contingency tables of some of the interactions will be empty leading to features with different numbers of categories and thus with different null distributions. Hence, EBAM cannot be applied to all of these features at once. In the analysis of the twoway interactions from the simulated data set, e.g., one interaction exhibits values only for seven of the nine genotypes comprised by two SNPs. This interaction therefore has to be removed from the EBAM analysis.
The abovementioned idea of performing separate EBAM analyses for variables with different numbers of levels and computing a combined FDR might not be ideal in the case of interactions, as many different numbers of level could exist. In such a situation, a better solution is not to consider the pway interactions as variables with 3^{ p }categories, but to test each of the 3^{ p }genotypes comprised by p SNPs that are observed at at least a particular number of persons. Furthermore, it might make sense to include the complements of the genotypes, as, e.g., "Not the homozygous reference genotype" corresponds to a recessive effect of a SNP. This, however, would increase the multiple testing problem by a factor of up to 6^{ p }such that a filtering prior to the EBAM analysis might be advisable/necessary.
Boulesteix et al. [48] propose a multiple testing procedure for the identification of the combination of genotypes in a prespecified subset of (interacting) SNPs that shows the largest association with the response. Another solution to this multiple testing problem that does not require a prespecification of a subset of SNPs has been described in this paper: Firstly, a search algorithm such as logicFS is used to identify potentially interesting genotypes, where these genotypes can be composed of the genotypes from any of the SNPs considered in the study. Afterwards, the detected genotypes are tested on an independent data set using EBAM, where the posterior probability of being significant resulting from this EBAM analysis can be interpreted as an importance measure for the genotypes. For this analysis, it is not necessary that all genotypes are composed of the genotypes of the same number of SNPs, as they are coded as binary variables. Quantifying the importance of (combinations of) binary variables is implemented in the R packages logicFS version 1.7.6 and later [49].
Availability and requirements
Project name: siggenes – Multiple testing using SAM and Efron's empirical Bayes approach
Project home page: http://bioconductor.org/packages/2.1/bioc/html/siggenes.html (for siggenes 1.12.0)
Operating system(s): Platform independent
Programming language: R
Licence: Free for noncommercial use
Any restrictions to use by nonacademics: See the licence in the siggenes package
Abbreviations
 CEPH – Utah residents with ancestry from northern and western Europe (CEU). Han Chinese from Beijing (CHB). Empirical Bayes Analysis of Microarrays (EBAM). False Discovery Rate (FDR). Japanese from Tokyo (JPT). Significance Analysis of Microarrays (SAM). Single Nucleotide Polymorphism (SNP). Yoruba in Ibadan:

Nigeria (YRI).
Declarations
Acknowledgements
Financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of Complexity in Multivariate Data Structures") is gratefully acknowledged. The authors also would like to thank the reviewers for their helpful comments.
Authors’ Affiliations
References
 Westfall PH, Young SS: Resamplingbased multiple testing: examples and methods for pvalue adjustments. New York, NY: Wiley; 1993.Google Scholar
 Shaffer JP: Multiple hypothesis testing. Ann Rev Psych 1995, 46: 561–584.View ArticleGoogle Scholar
 Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci 2003, 18: 71–103.View ArticleGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc B 1995, 57: 289–300.Google Scholar
 Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98: 5116–5124.PubMed CentralView ArticlePubMedGoogle Scholar
 Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Amer Statist Assoc 2001, 96: 1151–1160.View ArticleGoogle Scholar
 Schwender H, Krause A, Ickstadt K: Identifying interesting genes with siggenes. RNews 2006, 6(5):45–50.Google Scholar
 Schwender H: Modifying microarray analysis methods for categorical data – SAM and PAM for SNPs. In Classification – The Ubiquitous Challenge. Weihs C, Gaul W edition. Springer, Heidelberg; 2005:370–377.View ArticleGoogle Scholar
 Schwender H: Statistical analysis of genotype and gene expression data. PhD thesis. University of Dortmund, Department of Statistics; 2007.Google Scholar
 The International HapMap Consortium: The International HapMap Project. Nature 2003, 426: 789–796.View ArticleGoogle Scholar
 Storey JD: A direct approach to false discovery rates. J Roy Statist Soc B 2002, 64: 479–498.View ArticleGoogle Scholar
 Storey JD, Tibshirani R: Statistical significance of genomewide studies. Proc Natl Acad Sci USA 2003, 100: 9440–9445.PubMed CentralView ArticlePubMedGoogle Scholar
 Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002, 23: 70–86.View ArticlePubMedGoogle Scholar
 Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article 3.Google Scholar
 Storey JD, Tibshirani R: SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In The Analysis of Gene Expression Data: Methods and Software. Edited by: Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. Springer, New York; 2004:272–290.Google Scholar
 Silverman BW: Density estimation for statistics and data analysis. London: Chapman and Hall; 1986.View ArticleGoogle Scholar
 Chen SX: Probability density functions estimation using gamma kernels. Ann Inst Statist Math 2000, 52: 471–480.View ArticleGoogle Scholar
 Scaillet O: Density estimation using inverse and reciprocal inverse Gaussian kernels. J Nonparam Statist 2004, 16: 217–226.View ArticleGoogle Scholar
 Efron B, Tibshirani R: Using specially designed exponential families for density estimation. Ann Statist 1996, 24: 2431–2461.View ArticleGoogle Scholar
 Wand MP: Databased choice of histogram bin width. Amer Stat 1997, 51: 59–64.Google Scholar
 Scott DW: On optimal and databased histograms. Biometrika 1979, 66: 605–610.View ArticleGoogle Scholar
 Freedman D, Diaconis P: On the histogram as a density estimator: L_{2}theory. Z Wahr Verw Geb 1981, 57: 453–476.View ArticleGoogle Scholar
 Sturges H: The choice of a classinterval. J Amer Statist Assoc 1926, 21: 65–66.View ArticleGoogle Scholar
 Scott DW: Multivariate density estimation: theory, practice, and visualization. New York: Wiley; 1992.View ArticleGoogle Scholar
 Bickel DR: Robust estimators of the mode and skewness of continuous data. Computat Statist Data Anal 2002, 39: 153–163.View ArticleGoogle Scholar
 Hedges SB, Shah R: Comparison of mode estimation methods and application in molecular clock analysis. BMC Bioinformatics 2003, 4: 31.PubMed CentralView ArticlePubMedGoogle Scholar
 Cochran WG: Some methods for strengthening the common χ^{2}tests. Biometrics 1954, 10: 417–451.View ArticleGoogle Scholar
 R Development Core Team:R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2007. ISBN 3–900051–07–0 [http://www.Rproject.org]Google Scholar
 Affymetrix: BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500 k array set. Tech rep, Affymetrix, Santa Clara, CA; 2006.Google Scholar
 Affymetrix – Mapping 500 k genotype calls on 270HapMap samples[http://www.affymetrix.com/support/technical/sample_data/500k_hapmap_genotype_data.affx]
 Schwender H, Zucknick M, Ickstadt K, Bolt HM: A pilot study on the application of statistical classification procedure to molecular epidemiological data. Tox Letter 2004, 151: 291–299.View ArticleGoogle Scholar
 Ruczinski I, Kooperberg C, LeBlanc M: Logic regression. J Comput Graph Stat 2003, 12: 475–511.View ArticleGoogle Scholar
 The single nucleotids polymorphism database (dbSNP)[http://www.ncbi.nlm.nih.gov/projects/SNP]
 Garte S: Metabolic susceptibility genes as cancer risk factors: time for a reassessment? Cancer Epidemiol Biomarkers Prev 2001, 10(12):1233–1237.PubMedGoogle Scholar
 Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 2002, 70: 461–471.PubMed CentralView ArticlePubMedGoogle Scholar
 Marchini J, Donnely P, Cardon RC: Genomewide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005, 37: 413–416.View ArticlePubMedGoogle Scholar
 Justenhoven C, Hamann U, Pesch B, Harth V, Rabstein S, Baisch C, Vollmert C, Illig T, Ko Y, Brüning T, Brauch H: ERCC2 genotypes and a corresponding haplotype are linked with breast cancer risk in a German population. Cancer Epidemiol Biomarker Prev 2004, 13(12):2059–2064.Google Scholar
 Schwender H, Ickstadt K: Identification of SNP interactions using logic regression. Biostat 2008, 9(1):187–198.View ArticleGoogle Scholar
 Ruczinski I, Kooperberg C, LeBlanc M: Exploring interactions in highdimensional genomic data: an overview of logic regression, with applications. J Mult Anal 2004, 90: 178–195.View ArticleGoogle Scholar
 Kooperberg C, Ruczinski I, LeBlanc M, Hsu L: Sequence analysis using logic regression. Genet Epidemiol 2001, 21(Suppl 1):S626S631.PubMedGoogle Scholar
 Witte JS, Fijal BA: Introduction: analysis of sequence data and population structure. Genet Epidemiol 2001, 21: 600–601.Google Scholar
 Breiman L: Bagging predictors. Mach Learn 1996, 26: 123–140.Google Scholar
 BioConductor project[http://www.bioconductor.org]
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. [http://genomebiology.com/2004/5/10/R80]PubMed CentralView ArticlePubMedGoogle Scholar
 Schwender H, Ickstadt K: Imputing missing genotypes with k nearest neighbors. Tech rep., Collaborative Research Center 475, Department of Statistics, University of Dortmund; 2008.Google Scholar
 Rabbee N, Speed TP: A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics 2006, 22: 7–12.View ArticlePubMedGoogle Scholar
 Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls for highdensity oligonucleotide SNP array data. Biostat 2007, 8(2):485–499.View ArticleGoogle Scholar
 Boulesteix AL, Strobl C, Weidinger S, Wichmann HE, Wagenpfeil S: Multiple testing for SNPSNP interactions. Stat Appl Genet Mol Biol 2007., 6(37):
 logicFS version 1.8.0[http://bioconductor.org/packages/2.1/bioc/html/logicFS.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.