 Research article
 Open access
 Published:
An improved procedure for gene selection from microarray experiments using false discovery rate criterion
BMC Bioinformatics volume 7, Article number: 15 (2006)
Abstract
Background
A large number of genes usually show differential expressions in a microarray experiment with two types of tissues, and the pvalues of a proper statistical test are often used to quantify the significance of these differences. The genes with small pvalues are then picked as the genes responsible for the differences in the tissue RNA expressions. One key question is what should be the threshold to consider the pvalues small. There is always a trade off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential expression. Moreover, the power of detection can be increased by knowing the number of nondifferential expression genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and use it for the FDR procedure construction.
Results
A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset.
Conclusion
For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.
Background
The development of microarray technologies has created unparalleled opportunities to study the mechanism of disease, monitor disease expression and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not yet been perfected, assessing uncertainty in the decision process relies on statistical modelling and theory. One key function of any statistical procedure is to control the rate of erroneous decisions or, in the microarray case, rate of false discovery of responsible genes.
The above concern can be illustrated as a multiple comparisons problem. Suppose we are interested in testing g parameters (μ_{1},..., μ_{ g }) = μ. For each individual parameter μ_{ j }, the null hypothesis is H_{0j}: μ_{ j }= 0 and the alternative hypothesis is H_{1j}: μ_{ j }≠ 0. This μ_{ j }can be thought as the difference in mean expressions of the j th gene under two different conditions in a microarray experiment. A conventional method to test each hypothesis is to take a sample and then calculate its pvalue from a proper statistical test. If the calculated pvalue is less than a threshold determined by a testing significance level, then H_{0j}is rejected. However, when many hypotheses are simultaneously performed, a multiple comparisons procedure (MCP) has to be used to control the error rate [1].
The traditional MCP controls the probability of making any error in multiple selections, i.e., controls the familywise error rate (FWER). It has been shown, however, that this type of procedure is extremely conservative when the number of hypotheses is large. Alternatively, Benjamini and Hochberg [2] proposed the measure of false discovery rate (FDR) for which the expected proportion of false discoveries is controlled. This procedure is based on the idea that one can tolerate more false discoveries if the number of tests is large. For example, 5 false discoveries out of 10 selections is probably too many while 5 false discoveries out of 100 selections should be acceptable. This is particularly useful in microarray analysis since a very large number of genes usually show differential expressions. Therefore, controlling the FDR can greatly increase the power of discovery.
Benjamini and Hochberg [3] proved that Simes' procedure [4] can be used to control expected FDR at α (0 <α < 1) when tests of true null hypotheses are either independent or positively dependent. More specifically, let p_{(1)} ≤ p_{(2)} ≤ ..., ≤ p_{(g)}be the ordered pvalues and J + 1 be the smallest integer satisfying
{p}_{(J+1)}>(J+1)\times \frac{\alpha}{g}.\left(1\right)
If J ≥ 1, then rejecting H_{(0j)}, j = 1,..., J ensures the expected FDR at α. We call this method the BH procedure.
Suppose the number of true nulls is g_{0}, or the proportion is π_{0} = g_{0}/g. Benjamini and Hochberg [3] proved that the expected FDR of BH procedure is less than or equal to \frac{{g}_{0}}{g}α. When g_{0} is small compared to g, using the original α in (1) loses of power because it can be replaced by the larger value α/π_{0} and still control the FDP at α. Here, the power of a selection procedure is defined as the proportion of the alternative hypotheses that are correctly identified. Obviously, if we knew π_{0} in advance, we could replace α in (1) by α/π_{0} to increase power. Since π_{0} is unknown, the key question here is how to estimate the number of true null hypotheses g_{0}.
In an earlier published paper, Benjamini and Hochberg [5] proposed an adaptive procedure (aBH) which, by simulation, showed a higher power than BH procedure. The idea of the aBH procedure is to estimate g_{0} based on the fact that the pvalue under the alternative hypothesis is stochastically smaller than that under the null, which is uniformly distributed on (0, 1). On a quantile plot of pvalues (p_{(j)}versus j), the slope passing through (j, p_{(j)}) and (g + 1,1) increases just as j increases if the P_{(j)}s are corresponding to the subset of true alternative hypotheses. The first time the slope decreases indicates a change point. Using this stopping rule in conjunction with the Lowest SLope (LSL) estimator, (1  p_{(j)})/(g + 1  j), Benjamini and Hochberg proposed their aBH method to estimate g_{0} as follows:
Alg. 1 aBH method in g_{0}estimation

1.
If there is no hypothesis rejected by the BH procedure, then stop and declare no discovery.

2.
Calculate m_{(j)}= (1  p_{(j)})/(g + 1  j), ∀j = 1,..., g.

3.
Let J* be the largest value such that m_{(J*)}<m_{(J*  1)}. Define b as the smallest integer larger than 1/m_{(J*)}.

4.
The number of true null hypotheses is estimated as {\widehat{g}}_{0} = min(b, g).
Recently, Storey and Tibshirani [6] observed and demonstrated in their Figure 1 that the density histogram of pvalues based on p_{1},..., p_{ g }from a microarray experiment looked flat in the region of (0.5, 1). Based on the fact that the null pvalues of genes are uniformly distributed, most of genes in the region of (0.5, 1) should be from the null. Therefore, they used a smoothing method to estimate g_{0} based on the flat region of the observed pvalues. However, the implicit requirements of their method are: 1) g should be large, and 2) the proportion of true null hypotheses should also be large, such as 0.5, so that g_{0} can be estimated accurately. The method developed in this paper tries to bypass these two requirements so that a broad range of multiple testing problems can be applied. It is conceivable that when the number of relevant genes are known for certain target disease, a chip with a small number of genes will be widely used for diagnosis. The FDR methods, including both the BH and the aBH versions, have now been widely accepted for microarray analyses. They have been implemented in the R program [7], which can be incorporated into the Bioconductor package [8]. Several microarray analysis computer programs, such as SAM (Significance Analysis of Microarrays) [9] and GeneSpring [10] also use the FDR criterion to identify differentially expressed genes. Yang et al. [11] have used the FDR criterion to determine the sample size in microarray experiments.
Methods
Without loss of generality, we describe the method in a onesample testing problem. Its extension to commonly used pairedt test, twosample t test, or nonparametric rank tests in microarray analysis is very straightforward. Let Y_{ i }= (Y_{i 1},..., Y_{ ig })', i = 1,..., n be gvariate vector samples from populations with means μ= (μ_{1},..., μ_{ g })'. Define Y^{(j)}= (Y_{1j},..., y_{ nj }), ,j = 1,..., g as a row vector for the j th component. The null hypothesis of the j th component is H_{0j}: μ_{ j }= 0 and the alternative hypothesis is H_{1j}: μ_{ j }≠ 0. Let the pvalue for rejecting the null H_{0j}be p_{ j }derived from a properly chosen test statistic.
Once the pvalues are derived, we first summarize the information content of pvalues using a differentiable realvalued decreasing function h, i.e., h(p_{1},..., p_{ g }) is decreasing to all its arguments. We call h the combining function and its value, η = h(p_{1},..., p_{ g }), the global statistic. Next, define η^{(0)} to be the observed global statistic of η derived directly from the observed data. Note that a small value of p_{ j }indicates H_{0j}is less likely to be true. Since h is a decreasing function of its argument p_{ j }, a large value of η^{(0)} indicates a subset of the null hypotheses is likely to be true alternatives. To determine whether η^{(0)} is large enough to make such a claim, we need to know the distribution of η when all the nulls are true. The distribution of η depends on the distribution of its arguments, the correlation between its arguments, and the combining function h. In a microarray experiment, there seems no way to model or estimate so many correlations (see, for example, Chapter 6 of Pesarin [12]). Hence, a reasonable approach which can tackle the correlation within each experimental unit is to determine the critical value by a permutation test. More specifically, we calculate all B = 2^{n} values of η based on (ω_{1}Y_{1},..., ω_{ n }Y_{ n }), where ω_{ i }= ± 1 (The multivariate twosample permutation method can be found in Pesarin [12]). Hence, we have a set of reference values \{{\eta}_{1}^{(0)},\dots ,{\eta}_{B}^{(0)}\} from the B permutations, and can now define the pvalue of the global η^{(0)}statistic as
{p}^{(0)}=\frac{{\displaystyle {\sum}_{b=1}^{B}I\left({\eta}_{b}^{(0)}\ge {\eta}^{(0)}\right)}}{B}\left(2\right)
where I(x) is the indicator function that takes value 1 if the statement x is true and 0 if it is not. To determine whether the global null hypothesis of all μ_{(j)}, j = 1,..., g, is zero is to evaluate if the global pvalue, p^{(0)}, is less than or equal to a significance level. However, this level is part of the estimation process. It will be determined by the data (see later Alg. 2).
When the global null hypothesis is rejected, it indicates that not all null hypotheses H_{0j}s are true. Immediate question is, which subset of hypotheses is the true alternative. To determine whether to reject the hypothesis H_{(0j)}is a multiple testing problem. We can, however, determine the size of true alternatives using the following iterated process. We regard the gene with the smallest pvalue as the major contributor for the rejection and claim that the null hypothesis is not true with this gene. When this gene is removed from the data set, we continue the same process with the rest g  1 genes and the p^{(0)} computed by (2) is now denoted by p^{(1)}. The whole process is then repeated to produce a sequence of pseudoglobal pvalues, p^{(s)}, s = 1,..., g  1. We call the later step global pvalue pseudo because it is only based on the subset of the original data. The detail of how to generate pseudoglobal pvalues is given in the Algorithm section. First, we will use these pseudoglobal pvalues to estimate the number of true null genes g_{0}.
We observe that, intuitively, p^{(0)} ≤ p^{(1)} ≤ ... ≤ p^{(g1)}, and this monotone increasing property will be proved in (J.1) of the Justification section. In addition, we will prove in (J.2) of the Justification section that if r pseudoglobal pvalues are less than a given value β(0 <β < 1), the estimator of the number of null genes is
{\widehat{g}}_{0}=gr+\beta /{(1\beta )}^{2}\left(3\right)
and this estimator ensures that E [{\widehat{g}}_{0}] ≥ g_{0}. (The conservativeness of this estimate under other conditions using other techniques has also been mentioned by Storey and Tibshirani [13] and Efron et al. [14].) Since the inequality E [{\widehat{g}}_{0}] ≥ g_{0} holds for any value of β, the best choice of β is the one which makes {\widehat{g}}_{0} closest to g_{0}. By defining Δ = β/(1  β)^{2} and ρ(β) = r  Δ, we observe that ρ(β) is an increasing function and then a decreasing function of β for the following reasons. When β is small, Δ increases slowly. However, r is an increasing function in β and r is always greater than 1. Therefore, ρ(β) is an increasing function for small value of β. On the other hand, when β → 1, Δ reaches ∞. Since r is finite, ρ(β) becomes a decreasing function in this range. Since (3) is equivalent to {\widehat{g}}_{0} = g  ρ(β), the optimal value of β should be determined as
\widehat{\beta}=\underset{\beta}{\text{argmax}}\rho (\beta )
and the method of estimating g_{0} is equivalent to finding the optimal β value.
The estimation of g_{0} can thus be summarized as
Alg. 2 globalp method in g_{0}estimation

1.
List r_{β} as the integer such that p^{(s1)}≤ β ∀ s = 1,..., r_{β}, and {p}^{({r}_{\beta})} > β for a large number of β s for 0 <β < 1.

2.
Find β such that ρ(β) = r_{ β } β/(1  β)^{2} is maximized (see Figure 5). Let this β be β*.

3.
Let r_{β*}be the integer such that p^{(s1)}≤ β* ∀ s = 1,..., r_{β*}, and {p}^{({r}_{\beta *})} > β*

4.
Let {\widehat{g}}_{0} = min[g  r_{β*}+ β* /(1  β*)^{2}, g].
We called this globalp method and denoted it by \wp.
Alg. 3 Differentially expressed gene selection with given FDR
With g_{0} estimated, to control the FDR at α level for identifying differentially expressed genes in microarray data analysis with the conservative estimate {\widehat{g}}_{0}, we simply test each subhypothesis based on (1) with the new α ≡ α/{\widehat{g}}_{0}, i.e., rejects all H_{(j)}, j = 1,..., J, if p_{(j)}≤ j × \frac{\alpha}{{\widehat{g}}_{0}} and p_{(J+1)}> (J + 1) × \frac{\alpha}{{\widehat{g}}_{0}}. We also need to address the choices for the combining function h. In this paper, we consider only two commonly used ones. 1) Fisher's sum of logarithm
\eta =h\left({p}_{1}\dots ,{p}_{g}\right)=2{\displaystyle \sum _{j=1}^{g}\text{log}\left({p}_{j}\right)}
and 2) Liptak's sum of inverse standard Gaussian distribution functions
\eta =h\left({p}_{1},\dots ,{p}_{g}\right)={\displaystyle \sum _{j=1}^{g}{\Phi}^{1}\left(1{p}_{j}\right)}.
More discussion on the choices of combining functions can be found in Birnbaum [15] and Folks [16]. The computer program for the proposed method, globalp, has been implemented in R script and is publicly available [17].
Results
Simulation
Two simulation studies were conducted. First, when the number of hypotheses are small and second, when the total number of hypotheses is extremely large. For small number of hypotheses, we used the following configurations: numbers of hypotheses are 16, 32, 64, and 128 with sample size n = 10 and the proportions of g_{0}/g being 0, 0.25, 0.5 and 0.75. The true expressions under the alternative hypotheses are assumed to be variance 1 Gaussian random variables with nonzero mean values μ_{ j }= d_{ j }. In one set of experiments, all d_{ j }s are set as 0.2 or 0.4; in the other set the d_{ j }s are divided into 4 equal size blocks with values 0.2, 0.4, 0.6 or 0.8 in each block. The number of permutations is B = 1,000 and the number of simulations is 20,000. The FDR is set at α = 0.05. Four procedures are used for testing: BH, aBH, our globalp test one with Fisher's combining function, and one with Liptak's combining function. For the purpose of comparison, we also plug in the true value of g_{0} into the aBH method. This is the ideal situation to reach the highest power.
Table 1 lists the simulated expected FDR when all null hypotheses are true. Figures 1 and 2 show the estimated FDR and power under various gene expression compositions.
Based on this simulation results, all the four methods have their expected FDR below the nominal FDR α = 0.05. Both combining functions in our proposed method have a higher power than the aBH procedure in most situations, while there is little difference between them. The improvement over the aBH methods is substantial when the proportion of the true null hypotheses is small.
Since most microarray data consist of tens of thousands of genes simultaneously and many of their expressions are correlated, our second simulation tried to reflect these facts. Storey and Tibshirani [13] proposed a "clumpy" dependence model in which genes are partitioned into blocks. The gene expressions are assumed independent between blocks but dependent within the same block. Following the clumpy model, 10 test and 10 control samples were generated from normal random variables with mean 0 and standard deviation 1 where each sample contained 10,000 genes. For test samples, genes with expression difference were added by 3 to represent the expression differences. The number of differentially expressed genes g_{0} were 100, 2,000, 5,000, 8,000 which corresponded to the proportion of nulls π_{0} at 0.99, 0.8, 0.5, 0.2. To simulate intrablock dependency, we generated one vector of normal random variables with mean 0 and standard deviation 0.2 in each block of size 50 and add it to every gene in that block. This process creates correlations between genes. Since the true expression differences for nonnull genes are moderate large in this simulation, we expect that any good method should accurately estimate π_{0}.
In a recent article, Broberg [18] did a comparative review of various newly developed methods for estimating π_{0}. To make a comparison, we presented the means and standard errors of estimates of π_{0} for all the methods describe in Broberg's paper in addition to our proposed method (\wp) over 1000 simulations. The details of π_{0}estimation methods (BUM, SPLOSH, QVALUE, Boostrap LSE, SEP, LSL, mgf, PRE) can be found in Broberg's paper and the programs for these methods can be found in the addon packages of the freely available R software [19]. The simulation results are shown in Table 2.
As expected, all of them performed very well for large π_{0}. If both the mean (bias) and standard error (stability) over the whole range of π_{0} are considered, LSL and Globalp stand out and LSL is better in stability. However, as we will see from the next study, LSL seems to be overconservative in real data analysis. The means of both combining functions of our proposed method performed well but Liptak's function has a higher standard errors. Therefore, Fisher's function will be used for the real data analysis. During the simulation, we also noticed that current implementation of SPLOSH, SEP, mgf, and PRE methods in R programs failed to work when the number of hypotheses was small.
Real data analysis
We use a publically available experimental data set from the Stanford microarray database [20] first to compare the differences between BH, aBH and our method with Fisher's combining function. The purpose of this experiment was to identify genes that have different expressions between prostate tumor tissue and matched normal tissue. It consists of a total of 82 microarrays: 41 arrays were from primary prostate tumors and the other half were from matched normal prostate tissues. Each array contains 32,152 different human genes. We chose this data set because of the large number of replicates. With this amount of replicates, we had a better idea of the ground truth.
To compare performances of various methods, we split the whole data set into two groups: a test data set and a confirmation data set. Eleven pairs of tumor and normal arrays were chosen for the test data set and the remaining arrays were used for the confirmation. The number of eleven test pairs were taken from a systematic sampling to avoid bias. Patients 1, 5, 9,..., 41 in the original order from the database were chosen as the test set. Expression information is missing if it is labelled as failed, contaminated, or flagged. Genes were removed from our study if more than half of their expressions were missing in the original data set. A total of 24,865 genes was used to identify differences in expressions using BH, aBH and our proposed procedures at 0.01 expected FDR level. Since the experiment was designed with paired tumornormal arrays, the paired ttest was used to derive the pvalue of each gene in the test data set. The BH procedure identified 1,254 genes, the aBH procedure identified 1,523 genes, and our proposed method identified 2,119 genes. Our test is apparently more powerful if it can maintain the required FDR 0.01. Since we did not have the biological information, another approach was to estimate FDR based on our rejection set. Specifically, suppose the rejection set contains the pvalues that are less than ξ and the estimated proportion of null hypotheses is {\widehat{\pi}}_{0}. An intuitive estimate for FDR is [21, 22]
\widehat{\text{FDR}}(\xi )=\frac{{\widehat{\pi}}_{0}\xi}{\widehat{Pr}[P\le \xi ]},\left(4\right)
where \widehat{Pr}[P\le \xi ]={\displaystyle {\sum}_{j=1}^{g}I({p}_{i}}\le \xi )/g. Based on the confirm data set, our proposed method reported an estimator of π_{0}, {\widehat{\pi}}_{0} = 0.5326. Using equation (4), the estimated FDRs for BH, aBH and our method were 0.0054, 0.0067, 0.0099, respectively.
If we further reject more subhypotheses beyond the number provided by our method, the FDR will exceed the assigned level. For example, if we reject the next 500 null subhypotheses that have the smallest pvalues among those not previously rejected, the estimated FDR is 0.0137 which is larger than the preassigned 0.01 level.
Moreover, we calculated the standard difference (paired tstatistic) of expressions using the confirmation data set. The confirmation data set contains 30 pairs of arrays so that the statistic is a good estimate for the unknown standard difference. Figure 3 shows histograms of absolute standard differences based on genes identified by the BH procedure, by the aBH procedure, by our method, and the extra 500 genes beyond our method. From the histograms, several hundred differentially expressed genes that were not identified by BH or aBH in the test arrays were identified by our method and most of them have standard expression differences greater than 2. However, the next group of 500 genes beyond our method may contain too many false discoveries to make the FDR acceptable.
Next, we use this data set to compare different estimates of proportion of null genes, π_{0}. The whole data set was partitioned into four subdata sets. We labelled the arrays from 1 to 41 based on the original order in their database. Subdata set 1 contained 11 arrays with labels 1, 5, 9,..., 41; subdata set 2 contains 10 arrays with labels 2,6,..., 38; subdata set 3 contains 10 arrays with labels 3,7,..., 39; subdata set 4 contained the remaining arrays. Although we did not know the true value of π_{0} in real data analysis, we used the four subdata sets to compare the variation of different estimation methods. In addition, we used the whole data set to check if the estimates are reliable. For globalp method, we also plot the function of ρ(β) in Figure 5 using the whole data set. The maximum of ρ(β) occurs at β = 0.91.
The estimates {\widehat{\pi}}_{0} were listed in Table 3. All {\widehat{\pi}}_{0} estimates in the subsets were larger than the ones in the whole data set. This is expected because genes with small expressional differences are more likely to be classified as null genes in a small sample. From the π_{0} estimates from the whole data set, we may say the LSL gave a large π_{0} value, PRE gave a small value and all the others gave similar estimates. We used LSL, PRE and Globalp to draw the pvalue density histogram using the whole data set in Figure 4. The upper panel is the overall view and the lower panel is the closer view. The green dashdot line is the height using LSL estimate; the red dash line PRE estimate; the blue dot line globalp estimate. The LSL method, which is also shown in Broberg's study, is too conservative producing the largest value. The PRE method underestimated π_{0}. If we used {\widehat{\pi}}_{0} by PRE method to improve FDR procedure, it is likely to have a higher FDR than the nominal level. We think our π_{0} estimate is reliable because it is close but higher than the height of observed pvalues in the region of (0.6, 1) if we assume genes with expressional differences have rare chance to have large observed pvalues.
Conclusion
Benjamini and Hochberg's FDR procedure [2] opens an useful approach for multiple comparisons in microarray analysis. Due to the complexity in microarray experiments, it seems unlikely that one method can dominate all the other methods in all the cases. In this paper, we have demonstrated a new method that is intuitive appearing because:

1.
It uses permutation test which can take care the complex correlation structures in gene expressions.

2.
Its global test based on sequentially eliminated significant genes should provide a good stopping rule because all the remaining genes are always considered together.
With the support of simulation and real data studies, the new method should be a viable alternative to find the differentially expressed genes in microarray experiments.
Algorithm
We illustrate the algorithm for calculating pseudoglobal pvalues based on marginal pvalues, p_{1},..., {p}_{{g}_{0}}. Please note that the procedure described here is the same regardless of g_{0}. We consider h(p_{1}, ..., {p}_{{g}_{0}}) in the summation form:
\eta =h({p}_{1},\dots ,{p}_{{g}_{0}})={\displaystyle \sum _{i=1}^{{g}_{0}}\hslash}({p}_{i}),\left(5\right)
where ħ is a differentiable decreasing function in which \hslash \text{'}(p)=\frac{d\hslash (t)}{dt}\begin{array}{c}t=p\end{array} exists and ħ(p) → ∞ as p → 0. Note that equation (5) totally meets the requirement that h should be a differentiable realvalued decreasing function. The realization of ħ can be, for instance, ħ(p) = 2log(p) if we use Fisher's sum of logarithm or ħ(p) = Φ^{1} (1  p) if we use Liptak's sum of inverse standard Gaussian distribution function. Recall that p_{ j }is the pvalue obtained from j th row vector (Y_{1j},..., Y_{ nj }) and the ordered marginal pvalues are p_{(1)} ≤ p_{(2)}≤ .... ≤ {p}_{({g}_{0})}. The pvalues for individual genes are denoted by subscripts while the pseudoglobal pvalues in (2) in are denoted by superscripts. Parentheses are used to reveal the order property.
In onesample testing problem, we permute data B times, where each time we assign w_{ i }= ± 1 with equal probability to (w_{1}Y_{1},..., w_{ n }Y_{ n }). For each permuted data, we use p_{b(j)}to be the pvalue from the b th permuted data for the gene that produced p_{(j)}, j = 1,..., g_{0}, b = 1,..., B.
We summarize p_{(i)}and p_{b(i)}using the following table.
\begin{array}{cccccc}{p}_{(1)}& {p}_{1(1)}& \dots & {p}_{b(1)}& \dots & {p}_{B(1)}\\ {p}_{(2)}& {p}_{1(2)}& \dots & {p}_{b(2)}& \dots & {p}_{B(2)}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {p}_{({g}_{0})}& {p}_{1({g}_{0})}& \dots & {p}_{b({g}_{0})}& \dots & {p}_{B({g}_{0})}\end{array}\left(6\right)
Note that the first column is pvalues of genes from the original data while the remaining columns are from the permuted data.
To simplify the notation, let ħ(p_{(i)}) = ħ_{(i)}, ħ(p_{b(i)}) = ħ_{b(i)}and η is just the sum of ħ_{(·)}. Therefore, we can transform the table (6) into the following table with the addition of the last row where its value η is the sum of the ħ_{(·)} in its corresponding column.
\begin{array}{cccccc}{\hslash}_{(1)}& {\hslash}_{1(1)}& \dots & {\hslash}_{b(1)}& \dots & {\hslash}_{B(1)}\\ {\hslash}_{(2)}& {\hslash}_{1(2)}& \dots & {\hslash}_{b(2)}& \dots & {\hslash}_{B(2)}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {\hslash}_{({g}_{0})}& {\hslash}_{1({g}_{0})}& \dots & {\hslash}_{b({g}_{0})}& \dots & {\hslash}_{B({g}_{0})}\\ \Downarrow & \Downarrow & \dots & \Downarrow & \dots & \Downarrow \\ {\eta}^{(0)}& {\eta}_{1}^{(0)}& \dots & {\eta}_{b}^{(0)}& \dots & {\eta}_{B}^{(0)}\end{array}\left(7\right)
Once η^{(0)} and η^{(1)},..., η^{(B)}are available, the global pvalue, by definition, is
{p}^{(0)}=\frac{{\displaystyle {\sum}_{b=1}^{B}I({\eta}_{b}^{(0)}}\ge {\eta}^{(0)})}{B}.
The same process is directly used to calculate the pseudoglobal pvalues after removing the genes which have the smallest pvalues. Actually, we can use the available data illustrated in table (8). By induction, suppose p_{(1)},..., p_{(s)}have been removed, the ħ values of the raw data and their reference sets are
\begin{array}{cccccc}{\hslash}_{(s+1)}& {\hslash}_{1(s+1)}& \dots & {\hslash}_{b(s+1)}& \dots & {\hslash}_{B(s+1)}\\ {\hslash}_{(s+2)}& {\hslash}_{1(s+2)}& \dots & {\hslash}_{b(s+2)}& \dots & {\hslash}_{B(s+2)}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {\hslash}_{({g}_{0})}& {\hslash}_{1({g}_{0})}& \dots & {\hslash}_{b({g}_{0})}& \dots & {\hslash}_{B({g}_{0})}\\ \Downarrow & \Downarrow & \dots & \Downarrow & \dots & \Downarrow \\ {\eta}^{(s)}& {\eta}_{1}^{(s)}& \dots & {\eta}_{b}^{(s)}& \dots & {\eta}_{B}^{(s)}\end{array}\left(8\right)
and the pseudoglobal pvalue after removing the s genes is
{p}^{(s)}=\frac{{\displaystyle {\sum}_{b=1}^{B}I({\eta}_{b}^{(s)}}\ge {\eta}^{(s)})}{B}.
Justification
Our justification is based on extreme cases to make the proofs tractable. However, this requirement is not very critical. Our simulation and real data analysis show that even with moderate g_{0} value, the pseudoglobal pvalue, p^{(s)}, starts to increase when our procedure has removed most of the significant genes. Besides, it jumps up very fast to reach the point that p^{(s)}will be larger than threshold β and stop removing genes. Let g be the total number of subhypotheses considered. Given g, let d= (d_{1},..., d_{ g }) denote the vector containing the true values of μ_{ j }, j = 1,..., g for each subhypothesis; and R(g, d) be the number of Y^{(j)}removed using procedure \wp. Without loss of generality, we assume d_{(1)} ≤ ... ≤ {d}_{(g{g}_{0})} < 0 and d_{(j)}= 0 for j = g  g_{0} + 1,..., g. We have
{E}_{\wp}[R(g,d)]\le g{g}_{0}+{E}_{\wp}[R({g}_{0},0)],\left(9\right)
and the equality holds when {d}_{(g{g}_{0})}\to \infty. The equality affirms that the subsample Y^{(j)}with extreme small d_{ j }< 0 is identified and removed from the component of η. Actually, when the gene expression difference is far away from zero, it will be identified by any reasonable multiple testing procedure. Hence, in this extreme case, we only focus on {E}_{\wp}[R({g}_{0},0)] which is based on the remaining g_{0} samples that have null distributions. Our goal is reduced to finding an upper bond of {E}_{\wp}[R({g}_{0},0)], so that the expected value of {\widehat{g}}_{0} can be bounded (below) by g_{0}.
J.1 Monotone in Pseudoglobal pvalues
The key feature of our algorithm is that the pseudoglobal pvalues are monotone increasing. We can be more precise by showing that
Pr\left[{p}^{(s)}\le {p}^{(s+1)}{p}_{(s)}^{}\right]\to 1\text{as}\frac{s}{{g}_{0}}\to 0.\left(10\right)
Romano (Section 2)[23] has proved that the distribution of {\eta}_{b}^{(s)} can be approximated by Gaussian distribution using the Central Limit theorem. Therefore, if we let the mean and standard deviation of reference samples {\eta}_{b}^{(s)} (b = 1,..., B) be μ^{(s)}and σ^{(s)}, respectively, and use Z to denote the standard Gaussian random variable with distribution function Φ, then the pseudoglobal pvalue p^{(s)}can be expressed as
\begin{array}{c}{p}^{(s)}=\frac{{\displaystyle {\sum}_{b=1}^{B}I({\eta}_{b}^{(s)}\ge {\eta}^{(s)}}}{B}\\ \dot{=}Pr\left[Z>\frac{{\eta}^{(s)}{\mu}^{(s)}}{{\sigma}^{(s)}}\right]\\ =Pr\left[Z>{C}^{(s)}\right],\left(11\right)\end{array}
where C^{(s)}= Φ^{1}(1  p^{(s)}) is p^{(s)}th upper percentile of the standard Gaussian distribution. Hence, the relation between η^{(s)}and C^{(s)}is
η^{(s)}= μ^{(s)}+ σ^{(s)}C ^{(s)}.
Similarly, for the μ^{(s+1)}and σ^{(s+1)}as the mean and standard deviation of reference samples {\eta}_{b}^{(s+1)} (b = 1,..., B), we repeat the same approximation method to express p^{(s+1)}as
\begin{array}{ll}{p}^{(s+1)}\hfill & \left(12\right)\hfill \\ =\hfill & \frac{{\displaystyle {\sum}_{b=1}^{B}I({\eta}_{b}^{(s+1)}}\ge {\eta}^{(s+1)})}{B}\hfill \\ \dot{=}\hfill & Pr\left[Z>\frac{{\eta}^{(s+1)}{\mu}^{(s+1)}}{{\sigma}^{(s+1)}}\right]\hfill \\ =\hfill & Pr\left[Z>\frac{({\eta}^{(s+1)}+{\hslash}_{(s+1)})({\mu}^{(s+1)}+{\hslash}_{(s+1)})}{{\sigma}^{(s+1)}}\right]\hfill \\ =\hfill & Pr\left[Z>\frac{{\eta}^{(s)}({\mu}^{(s+1)}+{\hslash}_{(s+1)})}{{\sigma}^{(s+1)}}\right]\hfill \\ =\hfill & Pr\left[Z>\frac{{\mu}^{(s)}+{\sigma}^{(s)}{C}^{(s)}({\mu}^{(s+1)}+{\hslash}_{(s+1)})}{{\sigma}^{(s+1)}}\right]\hfill \\ =\hfill & Pr[Z>\frac{{\sigma}^{(s)}}{{\sigma}^{(s+1)}}{C}^{(s)}+\hfill \\ \frac{{\mu}^{(s)}{\mu}^{(s+1)}{\hslash}_{(s+1)}}{{\sigma}^{(s+1)}}].\hfill & \left(13\right)\hfill \end{array}
Comparing equations (11) and (13), we observe that to prove p^{(s)}≤ p^{(s+1)}with probability 1 asymptotically when p^{(s)}is given is equivalent to prove that
\frac{{\sigma}^{(s)}}{{\sigma}^{(s+1)}}{C}^{(s)}+\frac{{\mu}^{(s)}{\mu}^{(s+1)}{\hslash}_{(s+1)}}{{\sigma}^{(s+1)}}<{C}^{(s)}\left(14\right)
with probability 1 asymptotically when C^{(s)}is given.
As g_{0} → ∞, the sample deviations σ^{(s)}and σ^{(s+1)}are almost identical so that \frac{{\sigma}^{(s)}}{{\sigma}^{(s+1)}}C^{(s)}→ C^{(s)}. Next, using table (8) again, we can express μ^{(s)} μ^{(s+1)}as
\frac{{\displaystyle {\sum}_{b=1}^{B}\hslash ({p}_{b(s+1)})}}{B}.
The distribution of the reference set p_{1(s+1)},..., p_{B(s+1)}converges to the uniform (0, 1) distribution since the distribution of the permuted test statistics, which are corresponding to these pvalues, converges to a Normal distribution using Theorem 2.1 of Romano's work. Therefore, if we define uniform (0, 1) random variable as U, \frac{{\displaystyle {\sum}_{b=1}^{B}\hslash ({p}_{b(s+1)})}}{B} converges in probability to E[ħ(U)] by the Law of Large Number. Hence, to prove equation (14) we only need to prove that
Pr[ħ(p_{(s+1)}) ≥ E[ħ(U)]] → 1. (15)
Assume, without loss of generality, that p_{1},..., {p}_{{g}_{0}} correspond to the true null. To prove equation (15), we first observe that, since p_{1},...,{p}_{{g}_{0}} are independent uniform random variables, P_{(s+1)}is the (s + 1)/g_{0}th quantile of the uniform (0, 1) distribution. By defining {\zeta}_{s+1,{g}_{0}}=(s+1)/{g}_{0} and using the distribution of sample quantile [24], we have
\sqrt{{g}_{0}}({p}_{(s+1)}{\varsigma}_{s+1,{g}_{0}})\to N(0,{\varsigma}_{s+1,{g}_{0}}(1{\varsigma}_{s}+1,{g}_{0})).
Since the first derivative of ħ exists, we can use the delta method to have
\sqrt{{g}_{0}}(\hslash ({p}_{(s+1)})\hslash ({\varsigma}_{s+1,{g}_{0}}))\to N(0,\sigma {*}^{2}),
where {\sigma}^{*}=\sqrt{{\varsigma}_{s+1,{g}_{0}}(1{\varsigma}_{s+1,{g}_{0}}){[\hslash \text{'}({\varsigma}_{s+1,{g}_{0}})]}^{2}} represents the asymptotical standard error of ħ(p_{(s+1)}).
Finally, equation (15) can be proven as,
\begin{array}{l}Pr[\hslash ({p}_{(s+1)})<E[\hslash (U)]]\\ =Pr[\frac{\hslash ({p}_{(s+1)})\hslash ({\varsigma}_{s+1,{g}_{0}})}{{\sigma}^{\ast}}\\ <\frac{E[\hslash (U)]\hslash ({\varsigma}_{s+1,{g}_{0}})}{{\sigma}^{\ast}}]\\ \dot{=}\Phi \left(\frac{E[\hslash (U)]\hslash ({\varsigma}_{s+1,{g}_{0}})}{{\sigma}^{\ast}}\right)\end{array}
which converges to 0 because E[ħ(U)] is finite and \hslash ({\varsigma}_{s+1,{g}_{0}})\to \infty \text{as}{\varsigma}_{s+1,{g}_{0}}=(s+1)/{g}_{0}\to 0.
J.2 Determining the threshold β by (3)
First, we recall that, as p^{(0)} is the global pvalue based on all null data, it is uniformly distributed. That is,
Pr[p^{(0)} ≤ β] = β.
We have proved that the pseudoglobal pvalues, p^{(s)}, are monotone increasing when s/g_{0} → 0. Therefore, there exists m such that p^{(0)} ≤ p^{(1)} ≤ ... ≤ p^{(m)}. Furthermore, for all s ≤ m,
Pr[p^{(s)}≤ βp^{(j)}≤ β, j = 0, ..., s  1] ≤ β.
The purpose here is to derive equation (3) for any β within (0,1) using the above inequality. We start with the number of genes removed by our procedure. Observe that
\begin{array}{l}{E}_{\wp}[R({g}_{0},0)]\\ \begin{array}{ll}=\hfill & {\displaystyle \sum _{k=0}^{{g}_{0}1}kPr[R({g}_{0},0)=k]}\hfill \\ =\hfill & {\displaystyle \sum _{k=1}^{{g}_{0}1}kPr}[{p}^{(0)}\le \alpha ,{p}^{(1)}\le \beta ,\dots ,{p}^{(k1)}\le \beta ,\hfill \\ {p}^{(k)}>\beta ].\hfill \end{array}\end{array}
In addition,
\begin{array}{l}Pr[{p}^{(0)}\le \beta ,{p}^{(1)}\le \beta ,\dots ,{p}^{(k1)}\le \beta ,{p}^{(k)}>\beta ]\\ \begin{array}{ll}=\hfill & Pr\left[{p}^{(0)}\le \beta \right]\hfill \\ \left\{{\displaystyle \prod _{s=1}^{k1}Pr\left[{p}^{(s)}\le \beta {p}^{(l)}\le \beta ,l=0,\dots ,s1\right]}\right\}\hfill \\ Pr\left[{p}^{(k)}>\beta {p}^{(i)}\le \beta ,i=0,\dots ,k1\right]\hfill \\ \le \hfill & \beta {\beta}^{\text{min(}k1,m)}1\hfill \\ =\hfill & {\beta}^{{k}_{m}}\hfill \end{array}\end{array}
where k_{ m }= min(k, m + 1). Hence, we can find an upper bound of the expected number of genes removed, {E}_{\wp}[R({g}_{0},0)], which is
\begin{array}{l}{E}_{\wp}[R({g}_{0},0)]\\ \begin{array}{ll}\le \hfill & {\displaystyle \sum _{k=1}^{{g}_{0}1}k{\beta}^{{k}_{m}}}\hfill \\ =\hfill & \frac{\beta}{{(1\beta )}^{2}}(1{\beta}^{m+1})(m+1){\beta}^{m+2}/(1\beta )\hfill \\ +({g}_{0}m2){\beta}^{m+1}\hfill \\ \approx \hfill & \frac{\beta}{{(1\beta )}^{2}}\text{as}m\text{islarge}\text{.}\hfill & \left(\text{16}\right)\hfill \end{array}\end{array}
Let Δ = Δ(β) = β/(1  β)^{2}. From equations (9) and (16), we have
{E}_{\wp}[R(g,d)\Delta ]\le g{g}_{0}
so that the number of true null estimate is
= g  R(g, d) + Δ
and this estimate ensures that
{E}_{\wp}[{\widehat{g}}_{0}]\ge {g}_{0}.
References
Westfall PH, Young SS: Resamplingbased multiple testing. New York: John Wiley & Sons; 1993.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A Practical and powerful approach to multiple testing. JRSSB 1995, 57: 289–300.
Benjamini Y, Hochberg Y: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 2001, 1165–1188.
Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986., 73:
Benjamini Y, Hochberg Y: On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Education and Behavioral Statistics 2000, 25: 60–83.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100: 9440–9445. 10.1073/pnas.1530509100
Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19: 368–375. 10.1093/bioinformatics/btf877
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, lacus S, Irizarry R, Li FLC, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics.2004. [http://genomebiology.com/2004/5/10/R80]
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498
GeneSpring 7.1. Silicon Genetics[http://www.silicongenetics.com]
Yang MCK, Yang JJ, Mclndoe RA, She JX: Microarray experimental design: power and samples size considerations. Physiol Genomics 2003, 16: 24–28. 10.1152/physiolgenomics.00037.2003
Pesarin F: Multivariate permutation tests with applications in Biostatistics. West Sussex, England: John Wiley & Sons; 2001.
Storey JD, Tibshirani R: Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical report, Department of Statistics, Stanford University 2001.
Efron B, Storey JD, Tibshirani R: Microarrays, Empirical Bayes Methods, and False Discovery Rates. Technical Report 217, Department of Statistics, Stanford University 2001.
Birnbaum A: Combining independent tests of significance. JASA 1954, 49: 559–574.
Folks JL: Combination of independent tests. In Handbook of Statistics. Volume 4. Edited by: Krishnaiah PR, Sen PK. New York: Elsevier Science Publishers; 1984:113–121. 10.1016/S01697161(84)040086
Global p website[http://www.stat.ufl.edu/~jyang/global_p/global.p.R]
Broberg P: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 2005., 6(199):
R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2005. ISBN 3–900051–07–0 [http://www.Rproject.org]
Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. PNAS 2004, 101: 811–816. 10.1073/pnas.0304146101
Storey JD: A direct approach to false discovery rates. JRSSB 2002, 64: 479–498. 10.1111/14679868.00346
Genovese C, Wasserman L: A stochastic process approach to false discovery control. The Annals of Statistics 2004, 32: 1035–1061. 10.1214/009053604000000283
Romano JH: On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association 1990., 85:
Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
Author information
Authors and Affiliations
Corresponding authors
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yang, J.J., Yang, M.C. An improved procedure for gene selection from microarray experiments using false discovery rate criterion. BMC Bioinformatics 7, 15 (2006). https://doi.org/10.1186/14712105715
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14712105715