- Research article
- Open Access
An improved procedure for gene selection from microarray experiments using false discovery rate criterion
BMC Bioinformatics volume 7, Article number: 15 (2006)
A large number of genes usually show differential expressions in a microarray experiment with two types of tissues, and the p-values of a proper statistical test are often used to quantify the significance of these differences. The genes with small p-values are then picked as the genes responsible for the differences in the tissue RNA expressions. One key question is what should be the threshold to consider the p-values small. There is always a trade off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential expression. Moreover, the power of detection can be increased by knowing the number of non-differential expression genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and use it for the FDR procedure construction.
A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset.
For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.
The development of microarray technologies has created unparalleled opportunities to study the mechanism of disease, monitor disease expression and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not yet been perfected, assessing uncertainty in the decision process relies on statistical modelling and theory. One key function of any statistical procedure is to control the rate of erroneous decisions or, in the microarray case, rate of false discovery of responsible genes.
The above concern can be illustrated as a multiple comparisons problem. Suppose we are interested in testing g parameters (μ1,..., μ g ) = μ. For each individual parameter μ j , the null hypothesis is H0j: μ j = 0 and the alternative hypothesis is H1j: μ j ≠ 0. This μ j can be thought as the difference in mean expressions of the j th gene under two different conditions in a microarray experiment. A conventional method to test each hypothesis is to take a sample and then calculate its p-value from a proper statistical test. If the calculated p-value is less than a threshold determined by a testing significance level, then H0jis rejected. However, when many hypotheses are simultaneously performed, a multiple comparisons procedure (MCP) has to be used to control the error rate .
The traditional MCP controls the probability of making any error in multiple selections, i.e., controls the familywise error rate (FWER). It has been shown, however, that this type of procedure is extremely conservative when the number of hypotheses is large. Alternatively, Benjamini and Hochberg  proposed the measure of false discovery rate (FDR) for which the expected proportion of false discoveries is controlled. This procedure is based on the idea that one can tolerate more false discoveries if the number of tests is large. For example, 5 false discoveries out of 10 selections is probably too many while 5 false discoveries out of 100 selections should be acceptable. This is particularly useful in microarray analysis since a very large number of genes usually show differential expressions. Therefore, controlling the FDR can greatly increase the power of discovery.
Benjamini and Hochberg  proved that Simes' procedure  can be used to control expected FDR at α (0 <α < 1) when tests of true null hypotheses are either independent or positively dependent. More specifically, let p(1) ≤ p(2) ≤ ..., ≤ p(g)be the ordered p-values and J + 1 be the smallest integer satisfying
If J ≥ 1, then rejecting H(0j), j = 1,..., J ensures the expected FDR at α. We call this method the BH procedure.
Suppose the number of true nulls is g0, or the proportion is π0 = g0/g. Benjamini and Hochberg  proved that the expected FDR of BH procedure is less than or equal to α. When g0 is small compared to g, using the original α in (1) loses of power because it can be replaced by the larger value α/π0 and still control the FDP at α. Here, the power of a selection procedure is defined as the proportion of the alternative hypotheses that are correctly identified. Obviously, if we knew π0 in advance, we could replace α in (1) by α/π0 to increase power. Since π0 is unknown, the key question here is how to estimate the number of true null hypotheses g0.
In an earlier published paper, Benjamini and Hochberg  proposed an adaptive procedure (aBH) which, by simulation, showed a higher power than BH procedure. The idea of the aBH procedure is to estimate g0 based on the fact that the p-value under the alternative hypothesis is stochastically smaller than that under the null, which is uniformly distributed on (0, 1). On a quantile plot of p-values (p(j)versus j), the slope passing through (j, p(j)) and (g + 1,1) increases just as j increases if the P(j)s are corresponding to the subset of true alternative hypotheses. The first time the slope decreases indicates a change point. Using this stopping rule in conjunction with the Lowest SLope (LSL) estimator, (1 - p(j))/(g + 1 - j), Benjamini and Hochberg proposed their aBH method to estimate g0 as follows:
Alg. 1 aBH method in g0estimation
If there is no hypothesis rejected by the BH procedure, then stop and declare no discovery.
Calculate m(j)= (1 - p(j))/(g + 1 - j), ∀j = 1,..., g.
Let J* be the largest value such that m(J*)<m(J* - 1). Define b as the smallest integer larger than 1/m(J*).
The number of true null hypotheses is estimated as = min(b, g).
Recently, Storey and Tibshirani  observed and demonstrated in their Figure 1 that the density histogram of p-values based on p1,..., p g from a microarray experiment looked flat in the region of (0.5, 1). Based on the fact that the null p-values of genes are uniformly distributed, most of genes in the region of (0.5, 1) should be from the null. Therefore, they used a smoothing method to estimate g0 based on the flat region of the observed p-values. However, the implicit requirements of their method are: 1) g should be large, and 2) the proportion of true null hypotheses should also be large, such as 0.5, so that g0 can be estimated accurately. The method developed in this paper tries to bypass these two requirements so that a broad range of multiple testing problems can be applied. It is conceivable that when the number of relevant genes are known for certain target disease, a chip with a small number of genes will be widely used for diagnosis. The FDR methods, including both the BH and the aBH versions, have now been widely accepted for microarray analyses. They have been implemented in the R program , which can be incorporated into the Bioconductor package . Several microarray analysis computer programs, such as SAM (Significance Analysis of Microarrays)  and GeneSpring  also use the FDR criterion to identify differentially expressed genes. Yang et al.  have used the FDR criterion to determine the sample size in microarray experiments.
Without loss of generality, we describe the method in a one-sample testing problem. Its extension to commonly used paired-t test, two-sample t test, or nonparametric rank tests in microarray analysis is very straightforward. Let Y i = (Yi 1,..., Y ig )', i = 1,..., n be g-variate vector samples from populations with means μ= (μ1,..., μ g )'. Define Y(j)= (Y1j,..., y nj ), ,j = 1,..., g as a row vector for the j th component. The null hypothesis of the j th component is H0j: μ j = 0 and the alternative hypothesis is H1j: μ j ≠ 0. Let the p-value for rejecting the null H0jbe p j derived from a properly chosen test statistic.
Once the p-values are derived, we first summarize the information content of p-values using a differentiable real-valued decreasing function h, i.e., h(p1,..., p g ) is decreasing to all its arguments. We call h the combining function and its value, η = h(p1,..., p g ), the global statistic. Next, define η(0) to be the observed global statistic of η derived directly from the observed data. Note that a small value of p j indicates H0jis less likely to be true. Since h is a decreasing function of its argument p j , a large value of η(0) indicates a subset of the null hypotheses is likely to be true alternatives. To determine whether η(0) is large enough to make such a claim, we need to know the distribution of η when all the nulls are true. The distribution of η depends on the distribution of its arguments, the correlation between its arguments, and the combining function h. In a microarray experiment, there seems no way to model or estimate so many correlations (see, for example, Chapter 6 of Pesarin ). Hence, a reasonable approach which can tackle the correlation within each experimental unit is to determine the critical value by a permutation test. More specifically, we calculate all B = 2n values of η based on (ω1Y1,..., ω n Y n ), where ω i = ± 1 (The multivariate two-sample permutation method can be found in Pesarin ). Hence, we have a set of reference values from the B permutations, and can now define the p-value of the global η(0)-statistic as
where I(x) is the indicator function that takes value 1 if the statement x is true and 0 if it is not. To determine whether the global null hypothesis of all μ(j), j = 1,..., g, is zero is to evaluate if the global p-value, p(0), is less than or equal to a significance level. However, this level is part of the estimation process. It will be determined by the data (see later Alg. 2).
When the global null hypothesis is rejected, it indicates that not all null hypotheses H0js are true. Immediate question is, which subset of hypotheses is the true alternative. To determine whether to reject the hypothesis H(0j)is a multiple testing problem. We can, however, determine the size of true alternatives using the following iterated process. We regard the gene with the smallest p-value as the major contributor for the rejection and claim that the null hypothesis is not true with this gene. When this gene is removed from the data set, we continue the same process with the rest g - 1 genes and the p(0) computed by (2) is now denoted by p(1). The whole process is then repeated to produce a sequence of pseudo-global p-values, p(s), s = 1,..., g - 1. We call the later step global p-value pseudo because it is only based on the subset of the original data. The detail of how to generate pseudo-global p-values is given in the Algorithm section. First, we will use these pseudo-global p-values to estimate the number of true null genes g0.
We observe that, intuitively, p(0) ≤ p(1) ≤ ... ≤ p(g-1), and this monotone increasing property will be proved in (J.1) of the Justification section. In addition, we will prove in (J.2) of the Justification section that if r pseudo-global p-values are less than a given value β(0 <β < 1), the estimator of the number of null genes is
and this estimator ensures that E  ≥ g0. (The conservativeness of this estimate under other conditions using other techniques has also been mentioned by Storey and Tibshirani  and Efron et al. .) Since the inequality E  ≥ g0 holds for any value of β, the best choice of β is the one which makes closest to g0. By defining Δ = β/(1 - β)2 and ρ(β) = r - Δ, we observe that ρ(β) is an increasing function and then a decreasing function of β for the following reasons. When β is small, Δ increases slowly. However, r is an increasing function in β and r is always greater than 1. Therefore, ρ(β) is an increasing function for small value of β. On the other hand, when β → 1, Δ reaches ∞. Since r is finite, ρ(β) becomes a decreasing function in this range. Since (3) is equivalent to = g - ρ(β), the optimal value of β should be determined as
and the method of estimating g0 is equivalent to finding the optimal β value.
The estimation of g0 can thus be summarized as
Alg. 2 global-p method in g0estimation
List rβ as the integer such that p(s-1)≤ β ∀ s = 1,..., rβ, and > β for a large number of β s for 0 <β < 1.
Find β such that ρ(β) = r β - β/(1 - β)2 is maximized (see Figure 5). Let this β be β*.
Let rβ*be the integer such that p(s-1)≤ β* ∀ s = 1,..., rβ*, and > β*
Let = min[g - rβ*+ β* /(1 - β*)2, g].
We called this global-p method and denoted it by .
Alg. 3 Differentially expressed gene selection with given FDR
With g0 estimated, to control the FDR at α level for identifying differentially expressed genes in microarray data analysis with the conservative estimate , we simply test each sub-hypothesis based on (1) with the new α ≡ α/, i.e., rejects all H(j), j = 1,..., J, if p(j)≤ j × and p(J+1)> (J + 1) × . We also need to address the choices for the combining function h. In this paper, we consider only two commonly used ones. 1) Fisher's sum of logarithm
and 2) Liptak's sum of inverse standard Gaussian distribution functions
More discussion on the choices of combining functions can be found in Birnbaum  and Folks . The computer program for the proposed method, global-p, has been implemented in R script and is publicly available .
Two simulation studies were conducted. First, when the number of hypotheses are small and second, when the total number of hypotheses is extremely large. For small number of hypotheses, we used the following configurations: numbers of hypotheses are 16, 32, 64, and 128 with sample size n = 10 and the proportions of g0/g being 0, 0.25, 0.5 and 0.75. The true expressions under the alternative hypotheses are assumed to be variance 1 Gaussian random variables with non-zero mean values μ j = d j . In one set of experiments, all d j s are set as 0.2 or 0.4; in the other set the d j s are divided into 4 equal size blocks with values 0.2, 0.4, 0.6 or 0.8 in each block. The number of permutations is B = 1,000 and the number of simulations is 20,000. The FDR is set at α = 0.05. Four procedures are used for testing: BH, aBH, our global-p test one with Fisher's combining function, and one with Liptak's combining function. For the purpose of comparison, we also plug in the true value of g0 into the aBH method. This is the ideal situation to reach the highest power.
Based on this simulation results, all the four methods have their expected FDR below the nominal FDR α = 0.05. Both combining functions in our proposed method have a higher power than the aBH procedure in most situations, while there is little difference between them. The improvement over the aBH methods is substantial when the proportion of the true null hypotheses is small.
Since most microarray data consist of tens of thousands of genes simultaneously and many of their expressions are correlated, our second simulation tried to reflect these facts. Storey and Tibshirani  proposed a "clumpy" dependence model in which genes are partitioned into blocks. The gene expressions are assumed independent between blocks but dependent within the same block. Following the clumpy model, 10 test and 10 control samples were generated from normal random variables with mean 0 and standard deviation 1 where each sample contained 10,000 genes. For test samples, genes with expression difference were added by 3 to represent the expression differences. The number of differentially expressed genes g0 were 100, 2,000, 5,000, 8,000 which corresponded to the proportion of nulls π0 at 0.99, 0.8, 0.5, 0.2. To simulate intra-block dependency, we generated one vector of normal random variables with mean 0 and standard deviation 0.2 in each block of size 50 and add it to every gene in that block. This process creates correlations between genes. Since the true expression differences for non-null genes are moderate large in this simulation, we expect that any good method should accurately estimate π0.
In a recent article, Broberg  did a comparative review of various newly developed methods for estimating π0. To make a comparison, we presented the means and standard errors of estimates of π0 for all the methods describe in Broberg's paper in addition to our proposed method () over 1000 simulations. The details of π0estimation methods (BUM, SPLOSH, QVALUE, Boostrap LSE, SEP, LSL, mgf, PRE) can be found in Broberg's paper and the programs for these methods can be found in the add-on packages of the freely available R software . The simulation results are shown in Table 2.
As expected, all of them performed very well for large π0. If both the mean (bias) and standard error (stability) over the whole range of π0 are considered, LSL and Global-p stand out and LSL is better in stability. However, as we will see from the next study, LSL seems to be over-conservative in real data analysis. The means of both combining functions of our proposed method performed well but Liptak's function has a higher standard errors. Therefore, Fisher's function will be used for the real data analysis. During the simulation, we also noticed that current implementation of SPLOSH, SEP, mgf, and PRE methods in R programs failed to work when the number of hypotheses was small.
Real data analysis
We use a publically available experimental data set from the Stanford microarray database  first to compare the differences between BH, aBH and our method with Fisher's combining function. The purpose of this experiment was to identify genes that have different expressions between prostate tumor tissue and matched normal tissue. It consists of a total of 82 microarrays: 41 arrays were from primary prostate tumors and the other half were from matched normal prostate tissues. Each array contains 32,152 different human genes. We chose this data set because of the large number of replicates. With this amount of replicates, we had a better idea of the ground truth.
To compare performances of various methods, we split the whole data set into two groups: a test data set and a confirmation data set. Eleven pairs of tumor and normal arrays were chosen for the test data set and the remaining arrays were used for the confirmation. The number of eleven test pairs were taken from a systematic sampling to avoid bias. Patients 1, 5, 9,..., 41 in the original order from the database were chosen as the test set. Expression information is missing if it is labelled as failed, contaminated, or flagged. Genes were removed from our study if more than half of their expressions were missing in the original data set. A total of 24,865 genes was used to identify differences in expressions using BH, aBH and our proposed procedures at 0.01 expected FDR level. Since the experiment was designed with paired tumor-normal arrays, the paired t-test was used to derive the p-value of each gene in the test data set. The BH procedure identified 1,254 genes, the aBH procedure identified 1,523 genes, and our proposed method identified 2,119 genes. Our test is apparently more powerful if it can maintain the required FDR 0.01. Since we did not have the biological information, another approach was to estimate FDR based on our rejection set. Specifically, suppose the rejection set contains the p-values that are less than ξ and the estimated proportion of null hypotheses is . An intuitive estimate for FDR is [21, 22]
where . Based on the confirm data set, our proposed method reported an estimator of π0, = 0.5326. Using equation (4), the estimated FDRs for BH, aBH and our method were 0.0054, 0.0067, 0.0099, respectively.
If we further reject more sub-hypotheses beyond the number provided by our method, the FDR will exceed the assigned level. For example, if we reject the next 500 null sub-hypotheses that have the smallest p-values among those not previously rejected, the estimated FDR is 0.0137 which is larger than the pre-assigned 0.01 level.
Moreover, we calculated the standard difference (paired t-statistic) of expressions using the confirmation data set. The confirmation data set contains 30 pairs of arrays so that the statistic is a good estimate for the unknown standard difference. Figure 3 shows histograms of absolute standard differences based on genes identified by the BH procedure, by the aBH procedure, by our method, and the extra 500 genes beyond our method. From the histograms, several hundred differentially expressed genes that were not identified by BH or aBH in the test arrays were identified by our method and most of them have standard expression differences greater than 2. However, the next group of 500 genes beyond our method may contain too many false discoveries to make the FDR acceptable.
Next, we use this data set to compare different estimates of proportion of null genes, π0. The whole data set was partitioned into four sub-data sets. We labelled the arrays from 1 to 41 based on the original order in their database. Sub-data set 1 contained 11 arrays with labels 1, 5, 9,..., 41; sub-data set 2 contains 10 arrays with labels 2,6,..., 38; sub-data set 3 contains 10 arrays with labels 3,7,..., 39; sub-data set 4 contained the remaining arrays. Although we did not know the true value of π0 in real data analysis, we used the four sub-data sets to compare the variation of different estimation methods. In addition, we used the whole data set to check if the estimates are reliable. For global-p method, we also plot the function of ρ(β) in Figure 5 using the whole data set. The maximum of ρ(β) occurs at β = 0.91.
The estimates were listed in Table 3. All estimates in the subsets were larger than the ones in the whole data set. This is expected because genes with small expressional differences are more likely to be classified as null genes in a small sample. From the π0 estimates from the whole data set, we may say the LSL gave a large π0 value, PRE gave a small value and all the others gave similar estimates. We used LSL, PRE and Global-p to draw the p-value density histogram using the whole data set in Figure 4. The upper panel is the overall view and the lower panel is the closer view. The green dash-dot line is the height using LSL estimate; the red dash line PRE estimate; the blue dot line global-p estimate. The LSL method, which is also shown in Broberg's study, is too conservative producing the largest value. The PRE method underestimated π0. If we used by PRE method to improve FDR procedure, it is likely to have a higher FDR than the nominal level. We think our π0 estimate is reliable because it is close but higher than the height of observed p-values in the region of (0.6, 1) if we assume genes with expressional differences have rare chance to have large observed p-values.
Benjamini and Hochberg's FDR procedure  opens an useful approach for multiple comparisons in microarray analysis. Due to the complexity in microarray experiments, it seems unlikely that one method can dominate all the other methods in all the cases. In this paper, we have demonstrated a new method that is intuitive appearing because:
It uses permutation test which can take care the complex correlation structures in gene expressions.
Its global test based on sequentially eliminated significant genes should provide a good stopping rule because all the remaining genes are always considered together.
With the support of simulation and real data studies, the new method should be a viable alternative to find the differentially expressed genes in microarray experiments.
We illustrate the algorithm for calculating pseudo-global p-values based on marginal p-values, p1,..., . Please note that the procedure described here is the same regardless of g0. We consider h(p1, ..., ) in the summation form:
where ħ is a differentiable decreasing function in which exists and ħ(p) → ∞ as p → 0. Note that equation (5) totally meets the requirement that h should be a differentiable real-valued decreasing function. The realization of ħ can be, for instance, ħ(p) = -2log(p) if we use Fisher's sum of logarithm or ħ(p) = Φ-1 (1 - p) if we use Liptak's sum of inverse standard Gaussian distribution function. Recall that p j is the p-value obtained from j th row vector (Y1j,..., Y nj ) and the ordered marginal p-values are p(1) ≤ p(2)≤ .... ≤ . The p-values for individual genes are denoted by subscripts while the pseudo-global p-values in (2) in are denoted by superscripts. Parentheses are used to reveal the order property.
In one-sample testing problem, we permute data B times, where each time we assign w i = ± 1 with equal probability to (w1Y1,..., w n Y n ). For each permuted data, we use pb(j)to be the p-value from the b th permuted data for the gene that produced p(j), j = 1,..., g0, b = 1,..., B.
We summarize p(i)and pb(i)using the following table.
Note that the first column is p-values of genes from the original data while the remaining columns are from the permuted data.
To simplify the notation, let ħ(p(i)) = ħ(i), ħ(pb(i)) = ħb(i)and η is just the sum of ħ(·). Therefore, we can transform the table (6) into the following table with the addition of the last row where its value η is the sum of the ħ(·) in its corresponding column.
Once η(0) and η(1),..., η(B)are available, the global p-value, by definition, is
The same process is directly used to calculate the pseudo-global p-values after removing the genes which have the smallest p-values. Actually, we can use the available data illustrated in table (8). By induction, suppose p(1),..., p(s)have been removed, the ħ values of the raw data and their reference sets are
and the pseudo-global p-value after removing the s genes is
Our justification is based on extreme cases to make the proofs tractable. However, this requirement is not very critical. Our simulation and real data analysis show that even with moderate g0 value, the pseudo-global p-value, p(s), starts to increase when our procedure has removed most of the significant genes. Besides, it jumps up very fast to reach the point that p(s)will be larger than threshold β and stop removing genes. Let g be the total number of sub-hypotheses considered. Given g, let d= (d1,..., d g ) denote the vector containing the true values of μ j , j = 1,..., g for each sub-hypothesis; and R(g, d) be the number of Y(j)removed using procedure . Without loss of generality, we assume d(1) ≤ ... ≤ < 0 and d(j)= 0 for j = g - g0 + 1,..., g. We have
and the equality holds when . The equality affirms that the sub-sample Y(j)with extreme small d j < 0 is identified and removed from the component of η. Actually, when the gene expression difference is far away from zero, it will be identified by any reasonable multiple testing procedure. Hence, in this extreme case, we only focus on which is based on the remaining g0 samples that have null distributions. Our goal is reduced to finding an upper bond of , so that the expected value of can be bounded (below) by g0.
J.1 Monotone in Pseudo-global p-values
The key feature of our algorithm is that the pseudo-global p-values are monotone increasing. We can be more precise by showing that
Romano (Section 2) has proved that the distribution of can be approximated by Gaussian distribution using the Central Limit theorem. Therefore, if we let the mean and standard deviation of reference samples (b = 1,..., B) be μ(s)and σ(s), respectively, and use Z to denote the standard Gaussian random variable with distribution function Φ, then the pseudo-global p-value p(s)can be expressed as
where C(s)= Φ-1(1 - p(s)) is p(s)th upper percentile of the standard Gaussian distribution. Hence, the relation between η(s)and C(s)is
η(s)= μ(s)+ σ(s)C (s).
Similarly, for the μ(s+1)and σ(s+1)as the mean and standard deviation of reference samples (b = 1,..., B), we repeat the same approximation method to express p(s+1)as
Comparing equations (11) and (13), we observe that to prove p(s)≤ p(s+1)with probability 1 asymptotically when p(s)is given is equivalent to prove that
with probability 1 asymptotically when C(s)is given.
As g0 → ∞, the sample deviations σ(s)and σ(s+1)are almost identical so that C(s)→ C(s). Next, using table (8) again, we can express μ(s)- μ(s+1)as
The distribution of the reference set p1(s+1),..., pB(s+1)converges to the uniform (0, 1) distribution since the distribution of the permuted test statistics, which are corresponding to these p-values, converges to a Normal distribution using Theorem 2.1 of Romano's work. Therefore, if we define uniform (0, 1) random variable as U, converges in probability to E[ħ(U)] by the Law of Large Number. Hence, to prove equation (14) we only need to prove that
Pr[ħ(p(s+1)) ≥ E[ħ(U)]] → 1. (15)
Assume, without loss of generality, that p1,..., correspond to the true null. To prove equation (15), we first observe that, since p1,..., are independent uniform random variables, P(s+1)is the (s + 1)/g0-th quantile of the uniform (0, 1) distribution. By defining and using the distribution of sample quantile , we have
Since the first derivative of ħ exists, we can use the delta method to have
where represents the asymptotical standard error of ħ(p(s+1)).
Finally, equation (15) can be proven as,
which converges to 0 because E[ħ(U)] is finite and .
J.2 Determining the threshold β by (3)
First, we recall that, as p(0) is the global p-value based on all null data, it is uniformly distributed. That is,
Pr[p(0) ≤ β] = β.
We have proved that the pseudo-global p-values, p(s), are monotone increasing when s/g0 → 0. Therefore, there exists m such that p(0) ≤ p(1) ≤ ... ≤ p(m). Furthermore, for all s ≤ m,
Pr[p(s)≤ β|p(j)≤ β, j = 0, ..., s - 1] ≤ β.
The purpose here is to derive equation (3) for any β within (0,1) using the above inequality. We start with the number of genes removed by our procedure. Observe that
where k m = min(k, m + 1). Hence, we can find an upper bound of the expected number of genes removed, , which is
Let Δ = Δ(β) = β/(1 - β)2. From equations (9) and (16), we have
so that the number of true null estimate is
= g - R(g, d) + Δ
and this estimate ensures that
Westfall PH, Young SS: Resampling-based multiple testing. New York: John Wiley & Sons; 1993.
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A Practical and powerful approach to multiple testing. JRSSB 1995, 57: 289–300.
Benjamini Y, Hochberg Y: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 2001, 1165–1188.
Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986., 73:
Benjamini Y, Hochberg Y: On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Education and Behavioral Statistics 2000, 25: 60–83.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100: 9440–9445. 10.1073/pnas.1530509100
Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19: 368–375. 10.1093/bioinformatics/btf877
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, lacus S, Irizarry R, Li FLC, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics.2004. [http://genomebiology.com/2004/5/10/R80]
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498
GeneSpring 7.1. Silicon Genetics[http://www.silicongenetics.com]
Yang MCK, Yang JJ, Mclndoe RA, She JX: Microarray experimental design: power and samples size considerations. Physiol Genomics 2003, 16: 24–28. 10.1152/physiolgenomics.00037.2003
Pesarin F: Multivariate permutation tests with applications in Biostatistics. West Sussex, England: John Wiley & Sons; 2001.
Storey JD, Tibshirani R: Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical report, Department of Statistics, Stanford University 2001.
Efron B, Storey JD, Tibshirani R: Microarrays, Empirical Bayes Methods, and False Discovery Rates. Technical Report 217, Department of Statistics, Stanford University 2001.
Birnbaum A: Combining independent tests of significance. JASA 1954, 49: 559–574.
Folks JL: Combination of independent tests. In Handbook of Statistics. Volume 4. Edited by: Krishnaiah PR, Sen PK. New York: Elsevier Science Publishers; 1984:113–121. 10.1016/S0169-7161(84)04008-6
Global- p website[http://www.stat.ufl.edu/~jyang/global_p/global.p.R]
Broberg P: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 2005., 6(199):
R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2005. ISBN 3–900051–07–0 [http://www.R-project.org]
Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. PNAS 2004, 101: 811–816. 10.1073/pnas.0304146101
Storey JD: A direct approach to false discovery rates. JRSSB 2002, 64: 479–498. 10.1111/1467-9868.00346
Genovese C, Wasserman L: A stochastic process approach to false discovery control. The Annals of Statistics 2004, 32: 1035–1061. 10.1214/009053604000000283
Romano JH: On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association 1990., 85:
Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
About this article
Cite this article
Yang, J.J., Yang, M.C. An improved procedure for gene selection from microarray experiments using false discovery rate criterion. BMC Bioinformatics 7, 15 (2006). https://doi.org/10.1186/1471-2105-7-15
- False Discovery Rate
- Real Data Analysis
- Null Gene
- True Null Hypothesis
- True Alternative