An improved procedure for gene selection from microarray experiments using false discovery rate criterion

Yang, James J; Yang, Mark CK

doi:10.1186/1471-2105-7-15

Research article
Open access
Published: 11 January 2006

An improved procedure for gene selection from microarray experiments using false discovery rate criterion

James J Yang¹ &
Mark CK Yang²

BMC Bioinformatics volume 7, Article number: 15 (2006) Cite this article

4473 Accesses
17 Citations
Metrics details

Abstract

Background

A large number of genes usually show differential expressions in a microarray experiment with two types of tissues, and the p-values of a proper statistical test are often used to quantify the significance of these differences. The genes with small p-values are then picked as the genes responsible for the differences in the tissue RNA expressions. One key question is what should be the threshold to consider the p-values small. There is always a trade off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential expression. Moreover, the power of detection can be increased by knowing the number of non-differential expression genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and use it for the FDR procedure construction.

Results

A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset.

Conclusion

For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.

Background

The development of microarray technologies has created unparalleled opportunities to study the mechanism of disease, monitor disease expression and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not yet been perfected, assessing uncertainty in the decision process relies on statistical modelling and theory. One key function of any statistical procedure is to control the rate of erroneous decisions or, in the microarray case, rate of false discovery of responsible genes.

The above concern can be illustrated as a multiple comparisons problem. Suppose we are interested in testing g parameters (μ₁,..., μ_g) = μ. For each individual parameter μ_j, the null hypothesis is H_0j: μ_j= 0 and the alternative hypothesis is H_1j: μ_j≠ 0. This μ_jcan be thought as the difference in mean expressions of the j th gene under two different conditions in a microarray experiment. A conventional method to test each hypothesis is to take a sample and then calculate its p-value from a proper statistical test. If the calculated p-value is less than a threshold determined by a testing significance level, then H_0jis rejected. However, when many hypotheses are simultaneously performed, a multiple comparisons procedure (MCP) has to be used to control the error rate [1].

The traditional MCP controls the probability of making any error in multiple selections, i.e., controls the familywise error rate (FWER). It has been shown, however, that this type of procedure is extremely conservative when the number of hypotheses is large. Alternatively, Benjamini and Hochberg [2] proposed the measure of false discovery rate (FDR) for which the expected proportion of false discoveries is controlled. This procedure is based on the idea that one can tolerate more false discoveries if the number of tests is large. For example, 5 false discoveries out of 10 selections is probably too many while 5 false discoveries out of 100 selections should be acceptable. This is particularly useful in microarray analysis since a very large number of genes usually show differential expressions. Therefore, controlling the FDR can greatly increase the power of discovery.

Benjamini and Hochberg [3] proved that Simes' procedure [4] can be used to control expected FDR at α (0 <α < 1) when tests of true null hypotheses are either independent or positively dependent. More specifically, let p₍₁₎ ≤ p₍₂₎ ≤ ..., ≤ p_(g)be the ordered p-values and J + 1 be the smallest integer satisfying

$p_{(J + 1)} > (J + 1) \times \frac{α}{g} . (1)$

If J ≥ 1, then rejecting H_(0j), j = 1,..., J ensures the expected FDR at α. We call this method the BH procedure.

Suppose the number of true nulls is g₀, or the proportion is π₀ = g₀/g. Benjamini and Hochberg [3] proved that the expected FDR of BH procedure is less than or equal to $\frac{g_{0}}{g}$ α. When g₀ is small compared to g, using the original α in (1) loses of power because it can be replaced by the larger value α/π₀ and still control the FDP at α. Here, the power of a selection procedure is defined as the proportion of the alternative hypotheses that are correctly identified. Obviously, if we knew π₀ in advance, we could replace α in (1) by α/π₀ to increase power. Since π₀ is unknown, the key question here is how to estimate the number of true null hypotheses g₀.

In an earlier published paper, Benjamini and Hochberg [5] proposed an adaptive procedure (aBH) which, by simulation, showed a higher power than BH procedure. The idea of the aBH procedure is to estimate g₀ based on the fact that the p-value under the alternative hypothesis is stochastically smaller than that under the null, which is uniformly distributed on (0, 1). On a quantile plot of p-values (p_(j)versus j), the slope passing through (j, p_(j)) and (g + 1,1) increases just as j increases if the P_(j)s are corresponding to the subset of true alternative hypotheses. The first time the slope decreases indicates a change point. Using this stopping rule in conjunction with the Lowest SLope (LSL) estimator, (1 - p_(j))/(g + 1 - j), Benjamini and Hochberg proposed their aBH method to estimate g₀ as follows:

Alg. 1 aBH method in g₀estimation

1.
If there is no hypothesis rejected by the BH procedure, then stop and declare no discovery.
2.
Calculate m_(j)= (1 - p_(j))/(g + 1 - j), ∀j = 1,..., g.
3.
Let J* be the largest value such that m_(J*)<m_{(J* - 1)}. Define b as the smallest integer larger than 1/m_(J*).
4.
The number of true null hypotheses is estimated as ${\hat{g}}_{0}$ = min(b, g).

Recently, Storey and Tibshirani [6] observed and demonstrated in their Figure 1 that the density histogram of p-values based on p₁,..., p_gfrom a microarray experiment looked flat in the region of (0.5, 1). Based on the fact that the null p-values of genes are uniformly distributed, most of genes in the region of (0.5, 1) should be from the null. Therefore, they used a smoothing method to estimate g₀ based on the flat region of the observed p-values. However, the implicit requirements of their method are: 1) g should be large, and 2) the proportion of true null hypotheses should also be large, such as 0.5, so that g₀ can be estimated accurately. The method developed in this paper tries to bypass these two requirements so that a broad range of multiple testing problems can be applied. It is conceivable that when the number of relevant genes are known for certain target disease, a chip with a small number of genes will be widely used for diagnosis. The FDR methods, including both the BH and the aBH versions, have now been widely accepted for microarray analyses. They have been implemented in the R program [7], which can be incorporated into the Bioconductor package [8]. Several microarray analysis computer programs, such as SAM (Significance Analysis of Microarrays) [9] and GeneSpring [10] also use the FDR criterion to identify differentially expressed genes. Yang et al. [11] have used the FDR criterion to determine the sample size in microarray experiments.

Methods

Without loss of generality, we describe the method in a one-sample testing problem. Its extension to commonly used paired-t test, two-sample t test, or nonparametric rank tests in microarray analysis is very straightforward. Let Y_i= (Y_{i 1},..., Y_ig)', i = 1,..., n be g-variate vector samples from populations with means μ= (μ₁,..., μ_g)'. Define Y^(j)= (Y_1j,..., y_nj), ,j = 1,..., g as a row vector for the j th component. The null hypothesis of the j th component is H_0j: μ_j= 0 and the alternative hypothesis is H_1j: μ_j≠ 0. Let the p-value for rejecting the null H_0jbe p_jderived from a properly chosen test statistic.

Once the p-values are derived, we first summarize the information content of p-values using a differentiable real-valued decreasing function h, i.e., h(p₁,..., p_g) is decreasing to all its arguments. We call h the combining function and its value, η = h(p₁,..., p_g), the global statistic. Next, define η⁽⁰⁾ to be the observed global statistic of η derived directly from the observed data. Note that a small value of p_jindicates H_0jis less likely to be true. Since h is a decreasing function of its argument p_j, a large value of η⁽⁰⁾ indicates a subset of the null hypotheses is likely to be true alternatives. To determine whether η⁽⁰⁾ is large enough to make such a claim, we need to know the distribution of η when all the nulls are true. The distribution of η depends on the distribution of its arguments, the correlation between its arguments, and the combining function h. In a microarray experiment, there seems no way to model or estimate so many correlations (see, for example, Chapter 6 of Pesarin [12]). Hence, a reasonable approach which can tackle the correlation within each experimental unit is to determine the critical value by a permutation test. More specifically, we calculate all B = 2ⁿ values of η based on (ω₁Y₁,..., ω_nY_n), where ω_i= ± 1 (The multivariate two-sample permutation method can be found in Pesarin [12]). Hence, we have a set of reference values ${η_{1}^{(0)}, \dots, η_{B}^{(0)}}$ from the B permutations, and can now define the p-value of the global η⁽⁰⁾-statistic as

$p^{(0)} = \frac{\sum_{b = 1}^{B} I (η_{b}^{(0)} \geq η^{(0)})}{B} (2)$

where I(x) is the indicator function that takes value 1 if the statement x is true and 0 if it is not. To determine whether the global null hypothesis of all μ_(j), j = 1,..., g, is zero is to evaluate if the global p-value, p⁽⁰⁾, is less than or equal to a significance level. However, this level is part of the estimation process. It will be determined by the data (see later Alg. 2).

When the global null hypothesis is rejected, it indicates that not all null hypotheses H_0js are true. Immediate question is, which subset of hypotheses is the true alternative. To determine whether to reject the hypothesis H_(0j)is a multiple testing problem. We can, however, determine the size of true alternatives using the following iterated process. We regard the gene with the smallest p-value as the major contributor for the rejection and claim that the null hypothesis is not true with this gene. When this gene is removed from the data set, we continue the same process with the rest g - 1 genes and the p⁽⁰⁾ computed by (2) is now denoted by p⁽¹⁾. The whole process is then repeated to produce a sequence of pseudo-global p-values, p^(s), s = 1,..., g - 1. We call the later step global p-value pseudo because it is only based on the subset of the original data. The detail of how to generate pseudo-global p-values is given in the Algorithm section. First, we will use these pseudo-global p-values to estimate the number of true null genes g₀.

We observe that, intuitively, p⁽⁰⁾ ≤ p⁽¹⁾ ≤ ... ≤ p^(g-1), and this monotone increasing property will be proved in (J.1) of the Justification section. In addition, we will prove in (J.2) of the Justification section that if r pseudo-global p-values are less than a given value β(0 <β < 1), the estimator of the number of null genes is

${\hat{g}}_{0} = g - r + β / {(1 - β)}^{2} (3)$

and this estimator ensures that E [ ${\hat{g}}_{0}$ ] ≥ g₀. (The conservativeness of this estimate under other conditions using other techniques has also been mentioned by Storey and Tibshirani [13] and Efron et al. [14].) Since the inequality E [ ${\hat{g}}_{0}$ ] ≥ g₀ holds for any value of β, the best choice of β is the one which makes ${\hat{g}}_{0}$ closest to g₀. By defining Δ = β/(1 - β)² and ρ(β) = r - Δ, we observe that ρ(β) is an increasing function and then a decreasing function of β for the following reasons. When β is small, Δ increases slowly. However, r is an increasing function in β and r is always greater than 1. Therefore, ρ(β) is an increasing function for small value of β. On the other hand, when β → 1, Δ reaches ∞. Since r is finite, ρ(β) becomes a decreasing function in this range. Since (3) is equivalent to ${\hat{g}}_{0}$ = g - ρ(β), the optimal value of β should be determined as

$\hat{β} = \underset{β}{arg max} ρ (β)$

and the method of estimating g₀ is equivalent to finding the optimal β value.

The estimation of g₀ can thus be summarized as

Alg. 2 global-p method in g₀estimation

1.
List r_β as the integer such that p^(s-1)≤ β ∀ s = 1,..., r_β, and $p^{(r_{β})}$ > β for a large number of β s for 0 <β < 1.
2.
Find β such that ρ(β) = r_β- β/(1 - β)² is maximized (see Figure 5). Let this β be β*.

3.
Let r_β*be the integer such that p^(s-1)≤ β* ∀ s = 1,..., r_β*, and $p^{(r_{β *})}$ > β*
4.
Let ${\hat{g}}_{0}$ = min[g - r_β*+ β* /(1 - β*)², g].

We called this global-p method and denoted it by $℘$ .

Alg. 3 Differentially expressed gene selection with given FDR

With g₀ estimated, to control the FDR at α level for identifying differentially expressed genes in microarray data analysis with the conservative estimate ${\hat{g}}_{0}$ , we simply test each sub-hypothesis based on (1) with the new α ≡ α/ ${\hat{g}}_{0}$ , i.e., rejects all H_(j), j = 1,..., J, if p_(j)≤ j × $\frac{α}{{\hat{g}}_{0}}$ and p_(J+1)> (J + 1) × $\frac{α}{{\hat{g}}_{0}}$ . We also need to address the choices for the combining function h. In this paper, we consider only two commonly used ones. 1) Fisher's sum of logarithm

$η = h (p_{1} \dots, p_{g}) = - 2 \sum_{j = 1}^{g} log (p_{j})$

and 2) Liptak's sum of inverse standard Gaussian distribution functions

$η = h (p_{1}, \dots, p_{g}) = \sum_{j = 1}^{g} Φ^{- 1} (1 - p_{j}) .$

More discussion on the choices of combining functions can be found in Birnbaum [15] and Folks [16]. The computer program for the proposed method, global-p, has been implemented in R script and is publicly available [17].

Results

Simulation

Two simulation studies were conducted. First, when the number of hypotheses are small and second, when the total number of hypotheses is extremely large. For small number of hypotheses, we used the following configurations: numbers of hypotheses are 16, 32, 64, and 128 with sample size n = 10 and the proportions of g₀/g being 0, 0.25, 0.5 and 0.75. The true expressions under the alternative hypotheses are assumed to be variance 1 Gaussian random variables with non-zero mean values μ_j= d_j. In one set of experiments, all d_js are set as 0.2 or 0.4; in the other set the d_js are divided into 4 equal size blocks with values 0.2, 0.4, 0.6 or 0.8 in each block. The number of permutations is B = 1,000 and the number of simulations is 20,000. The FDR is set at α = 0.05. Four procedures are used for testing: BH, aBH, our global-p test one with Fisher's combining function, and one with Liptak's combining function. For the purpose of comparison, we also plug in the true value of g₀ into the aBH method. This is the ideal situation to reach the highest power.

Table 1 lists the simulated expected FDR when all null hypotheses are true. Figures 1 and 2 show the estimated FDR and power under various gene expression compositions.

Table 1 False discovery rate when all null hypotheses are true.

Full size table

Based on this simulation results, all the four methods have their expected FDR below the nominal FDR α = 0.05. Both combining functions in our proposed method have a higher power than the aBH procedure in most situations, while there is little difference between them. The improvement over the aBH methods is substantial when the proportion of the true null hypotheses is small.

Since most microarray data consist of tens of thousands of genes simultaneously and many of their expressions are correlated, our second simulation tried to reflect these facts. Storey and Tibshirani [13] proposed a "clumpy" dependence model in which genes are partitioned into blocks. The gene expressions are assumed independent between blocks but dependent within the same block. Following the clumpy model, 10 test and 10 control samples were generated from normal random variables with mean 0 and standard deviation 1 where each sample contained 10,000 genes. For test samples, genes with expression difference were added by 3 to represent the expression differences. The number of differentially expressed genes g₀ were 100, 2,000, 5,000, 8,000 which corresponded to the proportion of nulls π₀ at 0.99, 0.8, 0.5, 0.2. To simulate intra-block dependency, we generated one vector of normal random variables with mean 0 and standard deviation 0.2 in each block of size 50 and add it to every gene in that block. This process creates correlations between genes. Since the true expression differences for non-null genes are moderate large in this simulation, we expect that any good method should accurately estimate π₀.

In a recent article, Broberg [18] did a comparative review of various newly developed methods for estimating π₀. To make a comparison, we presented the means and standard errors of estimates of π₀ for all the methods describe in Broberg's paper in addition to our proposed method ( $℘$ ) over 1000 simulations. The details of π₀estimation methods (BUM, SPLOSH, QVALUE, Boostrap LSE, SEP, LSL, mgf, PRE) can be found in Broberg's paper and the programs for these methods can be found in the add-on packages of the freely available R software [19]. The simulation results are shown in Table 2.

Table 2 The means (first row) and standard errors (second row) of estimates ${\hat{π}}_{0}$ based on various methods when π₀ = 0.2,0.5,0.8,0.99.

Full size table

As expected, all of them performed very well for large π₀. If both the mean (bias) and standard error (stability) over the whole range of π₀ are considered, LSL and Global-p stand out and LSL is better in stability. However, as we will see from the next study, LSL seems to be over-conservative in real data analysis. The means of both combining functions of our proposed method performed well but Liptak's function has a higher standard errors. Therefore, Fisher's function will be used for the real data analysis. During the simulation, we also noticed that current implementation of SPLOSH, SEP, mgf, and PRE methods in R programs failed to work when the number of hypotheses was small.

Real data analysis

We use a publically available experimental data set from the Stanford microarray database [20] first to compare the differences between BH, aBH and our method with Fisher's combining function. The purpose of this experiment was to identify genes that have different expressions between prostate tumor tissue and matched normal tissue. It consists of a total of 82 microarrays: 41 arrays were from primary prostate tumors and the other half were from matched normal prostate tissues. Each array contains 32,152 different human genes. We chose this data set because of the large number of replicates. With this amount of replicates, we had a better idea of the ground truth.

To compare performances of various methods, we split the whole data set into two groups: a test data set and a confirmation data set. Eleven pairs of tumor and normal arrays were chosen for the test data set and the remaining arrays were used for the confirmation. The number of eleven test pairs were taken from a systematic sampling to avoid bias. Patients 1, 5, 9,..., 41 in the original order from the database were chosen as the test set. Expression information is missing if it is labelled as failed, contaminated, or flagged. Genes were removed from our study if more than half of their expressions were missing in the original data set. A total of 24,865 genes was used to identify differences in expressions using BH, aBH and our proposed procedures at 0.01 expected FDR level. Since the experiment was designed with paired tumor-normal arrays, the paired t-test was used to derive the p-value of each gene in the test data set. The BH procedure identified 1,254 genes, the aBH procedure identified 1,523 genes, and our proposed method identified 2,119 genes. Our test is apparently more powerful if it can maintain the required FDR 0.01. Since we did not have the biological information, another approach was to estimate FDR based on our rejection set. Specifically, suppose the rejection set contains the p-values that are less than ξ and the estimated proportion of null hypotheses is ${\hat{π}}_{0}$ . An intuitive estimate for FDR is [21, 22]

$\hat{FDR} (ξ) = \frac{{\hat{π}}_{0} ξ}{\hat{P r} [P \leq ξ]}, (4)$

where $\hat{P r} [P \leq ξ] = \sum_{j = 1}^{g} I (p_{i} \leq ξ) / g$ . Based on the confirm data set, our proposed method reported an estimator of π₀, ${\hat{π}}_{0}$ = 0.5326. Using equation (4), the estimated FDRs for BH, aBH and our method were 0.0054, 0.0067, 0.0099, respectively.

If we further reject more sub-hypotheses beyond the number provided by our method, the FDR will exceed the assigned level. For example, if we reject the next 500 null sub-hypotheses that have the smallest p-values among those not previously rejected, the estimated FDR is 0.0137 which is larger than the pre-assigned 0.01 level.

Moreover, we calculated the standard difference (paired t-statistic) of expressions using the confirmation data set. The confirmation data set contains 30 pairs of arrays so that the statistic is a good estimate for the unknown standard difference. Figure 3 shows histograms of absolute standard differences based on genes identified by the BH procedure, by the aBH procedure, by our method, and the extra 500 genes beyond our method. From the histograms, several hundred differentially expressed genes that were not identified by BH or aBH in the test arrays were identified by our method and most of them have standard expression differences greater than 2. However, the next group of 500 genes beyond our method may contain too many false discoveries to make the FDR acceptable.

Next, we use this data set to compare different estimates of proportion of null genes, π₀. The whole data set was partitioned into four sub-data sets. We labelled the arrays from 1 to 41 based on the original order in their database. Sub-data set 1 contained 11 arrays with labels 1, 5, 9,..., 41; sub-data set 2 contains 10 arrays with labels 2,6,..., 38; sub-data set 3 contains 10 arrays with labels 3,7,..., 39; sub-data set 4 contained the remaining arrays. Although we did not know the true value of π₀ in real data analysis, we used the four sub-data sets to compare the variation of different estimation methods. In addition, we used the whole data set to check if the estimates are reliable. For global-p method, we also plot the function of ρ(β) in Figure 5 using the whole data set. The maximum of ρ(β) occurs at β = 0.91.

The estimates ${\hat{π}}_{0}$ were listed in Table 3. All ${\hat{π}}_{0}$ estimates in the subsets were larger than the ones in the whole data set. This is expected because genes with small expressional differences are more likely to be classified as null genes in a small sample. From the π₀ estimates from the whole data set, we may say the LSL gave a large π₀ value, PRE gave a small value and all the others gave similar estimates. We used LSL, PRE and Global-p to draw the p-value density histogram using the whole data set in Figure 4. The upper panel is the overall view and the lower panel is the closer view. The green dash-dot line is the height using LSL estimate; the red dash line PRE estimate; the blue dot line global-p estimate. The LSL method, which is also shown in Broberg's study, is too conservative producing the largest value. The PRE method underestimated π₀. If we used ${\hat{π}}_{0}$ by PRE method to improve FDR procedure, it is likely to have a higher FDR than the nominal level. We think our π₀ estimate is reliable because it is close but higher than the height of observed p-values in the region of (0.6, 1) if we assume genes with expressional differences have rare chance to have large observed p-values.

Table 3 Estimates of ${\hat{π}}_{0}$ based on the prostate tumor data set from the Stanford microarray database. The first four rows are estimated based on the four partitions of the whole data set while the last row the is based on the data set. The means and standard errors (5th and 6th rows) are calculated based on the estimates of the four sub-data sets.

Full size table

Conclusion

Benjamini and Hochberg's FDR procedure [2] opens an useful approach for multiple comparisons in microarray analysis. Due to the complexity in microarray experiments, it seems unlikely that one method can dominate all the other methods in all the cases. In this paper, we have demonstrated a new method that is intuitive appearing because:

1.
It uses permutation test which can take care the complex correlation structures in gene expressions.
2.
Its global test based on sequentially eliminated significant genes should provide a good stopping rule because all the remaining genes are always considered together.

With the support of simulation and real data studies, the new method should be a viable alternative to find the differentially expressed genes in microarray experiments.

Algorithm

We illustrate the algorithm for calculating pseudo-global p-values based on marginal p-values, p₁,..., $p_{g_{0}}$ . Please note that the procedure described here is the same regardless of g₀. We consider h(p₁, ..., $p_{g_{0}}$ ) in the summation form:

$η = h (p_{1}, \dots, p_{g_{0}}) = \sum_{i = 1}^{g_{0}} ℏ (p_{i}), (5)$

where ħ is a differentiable decreasing function in which $ℏ' (p) = \frac{d ℏ (t)}{d t} | \begin{matrix} t = p \end{matrix}$ exists and ħ(p) → ∞ as p → 0. Note that equation (5) totally meets the requirement that h should be a differentiable real-valued decreasing function. The realization of ħ can be, for instance, ħ(p) = -2log(p) if we use Fisher's sum of logarithm or ħ(p) = Φ^-1 (1 - p) if we use Liptak's sum of inverse standard Gaussian distribution function. Recall that p_jis the p-value obtained from j th row vector (Y_1j,..., Y_nj) and the ordered marginal p-values are p₍₁₎ ≤ p₍₂₎≤ .... ≤ $p_{(g_{0})}$ . The p-values for individual genes are denoted by subscripts while the pseudo-global p-values in (2) in are denoted by superscripts. Parentheses are used to reveal the order property.

In one-sample testing problem, we permute data B times, where each time we assign w_i= ± 1 with equal probability to (w₁Y₁,..., w_nY_n). For each permuted data, we use p_b(j)to be the p-value from the b th permuted data for the gene that produced p_(j), j = 1,..., g₀, b = 1,..., B.

We summarize p_(i)and p_b(i)using the following table.

$\begin{matrix} p_{(1)} & p_{1 (1)} & \dots & p_{b (1)} & \dots & p_{B (1)} \\ p_{(2)} & p_{1 (2)} & \dots & p_{b (2)} & \dots & p_{B (2)} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ p_{(g_{0})} & p_{1 (g_{0})} & \dots & p_{b (g_{0})} & \dots & p_{B (g_{0})} \end{matrix} (6)$

Note that the first column is p-values of genes from the original data while the remaining columns are from the permuted data.

To simplify the notation, let ħ(p_(i)) = ħ_(i), ħ(p_b(i)) = ħ_b(i)and η is just the sum of ħ_(·). Therefore, we can transform the table (6) into the following table with the addition of the last row where its value η is the sum of the ħ_(·) in its corresponding column.

$\begin{matrix} ℏ_{(1)} & ℏ_{1 (1)} & \dots & ℏ_{b (1)} & \dots & ℏ_{B (1)} \\ ℏ_{(2)} & ℏ_{1 (2)} & \dots & ℏ_{b (2)} & \dots & ℏ_{B (2)} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ ℏ_{(g_{0})} & ℏ_{1 (g_{0})} & \dots & ℏ_{b (g_{0})} & \dots & ℏ_{B (g_{0})} \\ ⇓ & ⇓ & \dots & ⇓ & \dots & ⇓ \\ η^{(0)} & η_{1}^{(0)} & \dots & η_{b}^{(0)} & \dots & η_{B}^{(0)} \end{matrix} (7)$

Once η⁽⁰⁾ and η⁽¹⁾,..., η^(B)are available, the global p-value, by definition, is

$p^{(0)} = \frac{\sum_{b = 1}^{B} I (η_{b}^{(0)} \geq η^{(0)})}{B} .$

The same process is directly used to calculate the pseudo-global p-values after removing the genes which have the smallest p-values. Actually, we can use the available data illustrated in table (8). By induction, suppose p₍₁₎,..., p_(s)have been removed, the ħ values of the raw data and their reference sets are

$\begin{matrix} ℏ_{(s + 1)} & ℏ_{1 (s + 1)} & \dots & ℏ_{b (s + 1)} & \dots & ℏ_{B (s + 1)} \\ ℏ_{(s + 2)} & ℏ_{1 (s + 2)} & \dots & ℏ_{b (s + 2)} & \dots & ℏ_{B (s + 2)} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ ℏ_{(g_{0})} & ℏ_{1 (g_{0})} & \dots & ℏ_{b (g_{0})} & \dots & ℏ_{B (g_{0})} \\ ⇓ & ⇓ & \dots & ⇓ & \dots & ⇓ \\ η^{(s)} & η_{1}^{(s)} & \dots & η_{b}^{(s)} & \dots & η_{B}^{(s)} \end{matrix} (8)$

and the pseudo-global p-value after removing the s genes is

$p^{(s)} = \frac{\sum_{b = 1}^{B} I (η_{b}^{(s)} \geq η^{(s)})}{B} .$

Justification

Our justification is based on extreme cases to make the proofs tractable. However, this requirement is not very critical. Our simulation and real data analysis show that even with moderate g₀ value, the pseudo-global p-value, p^(s), starts to increase when our procedure has removed most of the significant genes. Besides, it jumps up very fast to reach the point that p^(s)will be larger than threshold β and stop removing genes. Let g be the total number of sub-hypotheses considered. Given g, let d= (d₁,..., d_g) denote the vector containing the true values of μ_j, j = 1,..., g for each sub-hypothesis; and R(g, d) be the number of Y^(j)removed using procedure $℘$ . Without loss of generality, we assume d₍₁₎ ≤ ... ≤ $d_{(g - g_{0})}$ < 0 and d_(j)= 0 for j = g - g₀ + 1,..., g. We have

$E_{℘} [R (g, d)] \leq g - g_{0} + E_{℘} [R (g_{0}, 0)], (9)$

and the equality holds when $d_{(g - g_{0})} \to - \infty$ . The equality affirms that the sub-sample Y^(j)with extreme small d_j< 0 is identified and removed from the component of η. Actually, when the gene expression difference is far away from zero, it will be identified by any reasonable multiple testing procedure. Hence, in this extreme case, we only focus on $E_{℘} [R (g_{0}, 0)]$ which is based on the remaining g₀ samples that have null distributions. Our goal is reduced to finding an upper bond of $E_{℘} [R (g_{0}, 0)]$ , so that the expected value of ${\hat{g}}_{0}$ can be bounded (below) by g₀.

J.1 Monotone in Pseudo-global p-values

The key feature of our algorithm is that the pseudo-global p-values are monotone increasing. We can be more precise by showing that

$P r [p^{(s)} \leq p^{(s + 1)} | p_{(s)}^{}] \to 1 as \frac{s}{g_{0}} \to 0. (10)$

Romano (Section 2)[23] has proved that the distribution of $η_{b}^{(s)}$ can be approximated by Gaussian distribution using the Central Limit theorem. Therefore, if we let the mean and standard deviation of reference samples $η_{b}^{(s)}$ (b = 1,..., B) be μ^(s)and σ^(s), respectively, and use Z to denote the standard Gaussian random variable with distribution function Φ, then the pseudo-global p-value p^(s)can be expressed as

$\begin{matrix} p^{(s)} = \frac{\sum_{b = 1}^{B} I (η_{b}^{(s)} \geq η^{(s)}}{B} \\ \dot{=} P r [Z > \frac{η^{(s)} - μ^{(s)}}{σ^{(s)}}] \\ = P r [Z > C^{(s)}], (11) \end{matrix}$

where C^(s)= Φ^-1(1 - p^(s)) is p^(s)th upper percentile of the standard Gaussian distribution. Hence, the relation between η^(s)and C^(s)is

η^(s)= μ^(s)+ σ^(s)C ^(s).

Similarly, for the μ^(s+1)and σ^(s+1)as the mean and standard deviation of reference samples $η_{b}^{(s + 1)}$ (b = 1,..., B), we repeat the same approximation method to express p^(s+1)as

$\begin{array}{l} p^{(s + 1)} & (12) \\ = & \frac{\sum_{b = 1}^{B} I (η_{b}^{(s + 1)} \geq η^{(s + 1)})}{B} \\ \dot{=} & P r [Z > \frac{η^{(s + 1)} - μ^{(s + 1)}}{σ^{(s + 1)}}] \\ = & P r [Z > \frac{(η^{(s + 1)} + ℏ_{(s + 1)}) - (μ^{(s + 1)} + ℏ_{(s + 1)})}{σ^{(s + 1)}}] \\ = & P r [Z > \frac{η^{(s)} - (μ^{(s + 1)} + ℏ_{(s + 1)})}{σ^{(s + 1)}}] \\ = & P r [Z > \frac{μ^{(s)} + σ^{(s)} C^{(s)} - (μ^{(s + 1)} + ℏ_{(s + 1)})}{σ^{(s + 1)}}] \\ = & P r [Z > \frac{σ^{(s)}}{σ^{(s + 1)}} C^{(s)} + \\ \frac{μ^{(s)} - μ^{(s + 1)} - ℏ_{(s + 1)}}{σ^{(s + 1)}}] . & (13) \end{array}$

Comparing equations (11) and (13), we observe that to prove p^(s)≤ p^(s+1)with probability 1 asymptotically when p^(s)is given is equivalent to prove that

$\frac{σ^{(s)}}{σ^{(s + 1)}} C^{(s)} + \frac{μ^{(s)} - μ^{(s + 1)} - ℏ_{(s + 1)}}{σ^{(s + 1)}} < C^{(s)} (14)$

with probability 1 asymptotically when C^(s)is given.

As g₀ → ∞, the sample deviations σ^(s)and σ^(s+1)are almost identical so that $\frac{σ^{(s)}}{σ^{(s + 1)}}$ C^(s)→ C^(s). Next, using table (8) again, we can express μ^(s)- μ^(s+1)as

$\frac{\sum_{b = 1}^{B} ℏ (p_{b (s + 1)})}{B} .$

The distribution of the reference set p_1(s+1),..., p_B(s+1)converges to the uniform (0, 1) distribution since the distribution of the permuted test statistics, which are corresponding to these p-values, converges to a Normal distribution using Theorem 2.1 of Romano's work. Therefore, if we define uniform (0, 1) random variable as U, $\frac{\sum_{b = 1}^{B} ℏ (p_{b (s + 1)})}{B}$ converges in probability to E[ħ(U)] by the Law of Large Number. Hence, to prove equation (14) we only need to prove that

Pr[ħ(p_(s+1)) ≥ E[ħ(U)]] → 1. (15)

Assume, without loss of generality, that p₁,..., $p_{g_{0}}$ correspond to the true null. To prove equation (15), we first observe that, since p₁,..., $p_{g_{0}}$ are independent uniform random variables, P_(s+1)is the (s + 1)/g₀-th quantile of the uniform (0, 1) distribution. By defining $ζ_{s + 1, g_{0}} = (s + 1) / g_{0}$ and using the distribution of sample quantile [24], we have

$\sqrt{g_{0}} (p_{(s + 1)} - ς_{s + 1, g_{0}}) \to N (0, ς_{s + 1, g_{0}} (1 - ς_{s} + 1, g_{0})) .$

Since the first derivative of ħ exists, we can use the delta method to have

$\sqrt{g_{0}} (ℏ (p_{(s + 1)}) - ℏ (ς_{s + 1, g_{0}})) \to N (0, σ *^{2}),$

where $σ^{*} = \sqrt{ς_{s + 1, g_{0}} (1 - ς_{s + 1, g_{0}}) {[ℏ' (ς_{s + 1, g_{0}})]}^{2}}$ represents the asymptotical standard error of ħ(p_(s+1)).

Finally, equation (15) can be proven as,

$\begin{array}{l} P r [ℏ (p_{(s + 1)}) < E [ℏ (U)]] \\ = P r [\frac{ℏ (p_{(s + 1)}) - ℏ (ς_{s + 1, g_{0}})}{σ^{*}} \\ < \frac{E [ℏ (U)] - ℏ (ς_{s + 1, g_{0}})}{σ^{*}}] \\ \dot{=} Φ (\frac{E [ℏ (U)] - ℏ (ς_{s + 1, g_{0}})}{σ^{*}}) \end{array}$

which converges to 0 because E[ħ(U)] is finite and $ℏ (ς_{s + 1, g_{0}}) \to \infty as ς_{s + 1, g_{0}} = (s + 1) / g_{0} \to 0$ .

J.2 Determining the threshold β by (3)

First, we recall that, as p⁽⁰⁾ is the global p-value based on all null data, it is uniformly distributed. That is,

Pr[p⁽⁰⁾ ≤ β] = β.

We have proved that the pseudo-global p-values, p^(s), are monotone increasing when s/g₀ → 0. Therefore, there exists m such that p⁽⁰⁾ ≤ p⁽¹⁾ ≤ ... ≤ p^(m). Furthermore, for all s ≤ m,

Pr[p^(s)≤ β|p^(j)≤ β, j = 0, ..., s - 1] ≤ β.

The purpose here is to derive equation (3) for any β within (0,1) using the above inequality. We start with the number of genes removed by our procedure. Observe that

$\begin{array}{l} E_{℘} [R (g_{0}, 0)] \\ \begin{array}{l} = & \sum_{k = 0}^{g_{0} - 1} k P r [R (g_{0}, 0) = k] \\ = & \sum_{k = 1}^{g_{0} - 1} k P r [p^{(0)} \leq α, p^{(1)} \leq β, \dots, p^{(k - 1)} \leq β, \\ p^{(k)} > β] . \end{array} \end{array}$

In addition,

$\begin{array}{l} P r [p^{(0)} \leq β, p^{(1)} \leq β, \dots, p^{(k - 1)} \leq β, p^{(k)} > β] \\ \begin{array}{l} = & P r [p^{(0)} \leq β] \\ {\prod_{s = 1}^{k - 1} P r [p^{(s)} \leq β | p^{(l)} \leq β, l = 0, \dots, s - 1]} \\ P r [p^{(k)} > β | p^{(i)} \leq β, i = 0, \dots, k - 1] \\ \leq & β β^{min(k - 1, m)} 1 \\ = & β^{k_{m}} \end{array} \end{array}$

where k_m= min(k, m + 1). Hence, we can find an upper bound of the expected number of genes removed, $E_{℘} [R (g_{0}, 0)]$ , which is

$\begin{array}{l} E_{℘} [R (g_{0}, 0)] \\ \begin{array}{l} \leq & \sum_{k = 1}^{g_{0} - 1} k β^{k_{m}} \\ = & \frac{β}{{(1 - β)}^{2}} (1 - β^{m + 1}) - (m + 1) β^{m + 2} / (1 - β) \\ + (g_{0} - m - 2) β^{m + 1} \\ \approx & \frac{β}{{(1 - β)}^{2}} as m is large . & (16) \end{array} \end{array}$

Let Δ = Δ(β) = β/(1 - β)². From equations (9) and (16), we have

$E_{℘} [R (g, d) - Δ] \leq g - g_{0}$

so that the number of true null estimate is

{\hat{g}}_{0}

(1)

= g - R(g, d) + Δ

and this estimate ensures that

$E_{℘} [{\hat{g}}_{0}] \geq g_{0} .$

References

Westfall PH, Young SS: Resampling-based multiple testing. New York: John Wiley & Sons; 1993.
Google Scholar
Benjamini Y, Hochberg Y: Controlling the false discovery rate: A Practical and powerful approach to multiple testing. JRSSB 1995, 57: 289–300.
Google Scholar
Benjamini Y, Hochberg Y: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 2001, 1165–1188.
Google Scholar
Simes RJ: An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986., 73:
Google Scholar
Benjamini Y, Hochberg Y: On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Education and Behavioral Statistics 2000, 25: 60–83.
Article Google Scholar
Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100: 9440–9445. 10.1073/pnas.1530509100
Article PubMed Central CAS PubMed Google Scholar
Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19: 368–375. 10.1093/bioinformatics/btf877
Article CAS PubMed Google Scholar
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, lacus S, Irizarry R, Li FLC, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics.2004. [http://genomebiology.com/2004/5/10/R80]
Google Scholar
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001, 98: 5116–5121. 10.1073/pnas.091062498
Article PubMed Central CAS PubMed Google Scholar
GeneSpring 7.1. Silicon Genetics[http://www.silicongenetics.com]
Yang MCK, Yang JJ, Mclndoe RA, She JX: Microarray experimental design: power and samples size considerations. Physiol Genomics 2003, 16: 24–28. 10.1152/physiolgenomics.00037.2003
Article CAS PubMed Google Scholar
Pesarin F: Multivariate permutation tests with applications in Biostatistics. West Sussex, England: John Wiley & Sons; 2001.
Google Scholar
Storey JD, Tibshirani R: Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical report, Department of Statistics, Stanford University 2001.
Google Scholar
Efron B, Storey JD, Tibshirani R: Microarrays, Empirical Bayes Methods, and False Discovery Rates. Technical Report 217, Department of Statistics, Stanford University 2001.
Google Scholar
Birnbaum A: Combining independent tests of significance. JASA 1954, 49: 559–574.
Google Scholar
Folks JL: Combination of independent tests. In Handbook of Statistics. Volume 4. Edited by: Krishnaiah PR, Sen PK. New York: Elsevier Science Publishers; 1984:113–121. 10.1016/S0169-7161(84)04008-6
Google Scholar
Global- p website[http://www.stat.ufl.edu/~jyang/global_p/global.p.R]
Broberg P: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics 2005., 6(199):
R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria 2005. ISBN 3–900051–07–0 [http://www.R-project.org]
Google Scholar
Lapointe J, Li C, Higgins JP, van de Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, Ekman P, DeMarzo AM, Tibshirani R, Botstein D, Brown PO, Brooks JD, Pollack JR: Gene expression profiling identifies clinically relevant subtypes of prostate cancer. PNAS 2004, 101: 811–816. 10.1073/pnas.0304146101
Article PubMed Central CAS PubMed Google Scholar
Storey JD: A direct approach to false discovery rates. JRSSB 2002, 64: 479–498. 10.1111/1467-9868.00346
Article Google Scholar
Genovese C, Wasserman L: A stochastic process approach to false discovery control. The Annals of Statistics 2004, 32: 1035–1061. 10.1214/009053604000000283
Article Google Scholar
Romano JH: On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association 1990., 85:
Google Scholar
Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.
Book Google Scholar

Download references

Author information

Authors and Affiliations

Biostatistics and Research Epidemiology, Henry Ford Health Sciences Center, Detroit, Michigan, USA
James J Yang
Department of Statistics, University of Florida, Gainesville, Florida, USA
Mark CK Yang

Authors

James J Yang
View author publications
You can also search for this author in PubMed Google Scholar
Mark CK Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to James J Yang or Mark CK Yang.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Yang, J.J., Yang, M.C. An improved procedure for gene selection from microarray experiments using false discovery rate criterion. BMC Bioinformatics 7, 15 (2006). https://doi.org/10.1186/1471-2105-7-15

Download citation

Received: 11 July 2005
Accepted: 11 January 2006
Published: 11 January 2006
DOI: https://doi.org/10.1186/1471-2105-7-15

An improved procedure for gene selection from microarray experiments using false discovery rate criterion