An improved procedure for gene selection from microarray experiments using false discovery rate criterion

Background A large number of genes usually show differential expressions in a microarray experiment with two types of tissues, and the p-values of a proper statistical test are often used to quantify the significance of these differences. The genes with small p-values are then picked as the genes responsible for the differences in the tissue RNA expressions. One key question is what should be the threshold to consider the p-values small. There is always a trade off between this threshold and the rate of false claims. Recent statistical literature shows that the false discovery rate (FDR) criterion is a powerful and reasonable criterion to pick those genes with differential expression. Moreover, the power of detection can be increased by knowing the number of non-differential expression genes. While this number is unknown in practice, there are methods to estimate it from data. The purpose of this paper is to present a new method of estimating this number and use it for the FDR procedure construction. Results A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset. Conclusion For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.


Results:
A combination of test functions is used to estimate the number of differentially expressed genes. Simulation study shows that the proposed method has a higher power to detect these genes than other existing methods, while still keeping the FDR under control. The improvement can be substantial if the proportion of true differentially expressed genes is large. This procedure has also been tested with good results using a real dataset.
Conclusion: For a given expected FDR, the method proposed in this paper has better power to pick genes that show differentiation in their expression than two other well known methods.

Background
The development of microarray technologies has created unparalleled opportunities to study the mechanism of disease, monitor disease expression and evaluate effective therapies. Because tens of thousands of genes are investigated simultaneously with a technology that has not yet been perfected, assessing uncertainty in the decision process relies on statistical modelling and theory. One key function of any statistical procedure is to control the rate of erroneous decisions or, in the microarray case, rate of false discovery of responsible genes.
The above concern can be illustrated as a multiple comparisons problem. Suppose we are interested in testing g parameters (µ 1 ,..., µ g ) = µ. For each individual parameter µ j , the null hypothesis is H 0j : µ j = 0 and the alternative hypothesis is H 1j : µ j ≠ 0. This µ j can be thought as the difference in mean expressions of the jth gene under two dif-ferent conditions in a microarray experiment. A conventional method to test each hypothesis is to take a sample and then calculate its p-value from a proper statistical test. If the calculated p-value is less than a threshold determined by a testing significance level, then H 0j is rejected. However, when many hypotheses are simultane-ously performed, a multiple comparisons procedure (MCP) has to be used to control the error rate [1].
The traditional MCP controls the probability of making any error in multiple selections, i.e., controls the familywise error rate (FWER). It has been shown, however, that A simulation study to estimate FDR (in y-axis) for various numbers of hypotheses Tested (x-axis), proportion of true null hypotheses (left labels), and various configurations of true alternative hypotheses (top labels) this type of procedure is extremely conservative when the number of hypotheses is large. Alternatively, Benjamini and Hochberg [2] proposed the measure of false discovery rate (FDR) for which the expected proportion of false discoveries is controlled. This procedure is based on the idea that one can tolerate more false discoveries if the number of tests is large. For example, 5 false discoveries out of 10 selections is probably too many while 5 false discoveries out of 100 selections should be acceptable. This is particularly useful in microarray analysis since a very large number of genes usually show differential expressions. Therefore, controlling the FDR can greatly increase the power of discovery.
Benjamini and Hochberg [3] proved that Simes' procedure [4] can be used to control expected FDR at α (0 <α < 1) when tests of true null hypotheses are either independent or positively dependent. More specifically, let p (1) ≤ p (2) ≤ ..., ≤ p (g) be the ordered p-values and J + 1 be the smallest integer satisfying If J ≥ 1, then rejecting H (0j) , j = 1,..., J ensures the expected FDR at α. We call this method the BH procedure.
Suppose the number of true nulls is g 0 , or the proportion is π 0 = g 0 /g. Benjamini and Hochberg [3] proved that the expected FDR of BH procedure is less than or equal to α. When g 0 is small compared to g, using the original α in (1) loses of power because it can be replaced by the larger value α/π 0 and still control the FDP at α. Here, the power of a selection procedure is defined as the proportion of the alternative hypotheses that are correctly identified. Obviously, if we knew π 0 in advance, we could replace α in (1) by α/π 0 to increase power. Since π 0 is unknown, the key question here is how to estimate the number of true null hypotheses g 0 .
In an earlier published paper, Benjamini and Hochberg [5] proposed an adaptive procedure (aBH) which, by simulation, showed a higher power than BH procedure. The idea of the aBH procedure is to estimate g 0 based on the fact that the p-value under the alternative hypothesis is stochastically smaller than that under the null, which is uniformly distributed on (0, 1). On a quantile plot of pvalues (p (j) versus j), the slope passing through (j, p (j) ) and (g + 1,1) increases just as j increases if the P (j) s are corresponding to the subset of true alternative hypotheses. The first time the slope decreases indicates a change point.
Using this stopping rule in conjunction with the Lowest SLope (LSL) estimator, (1 -p (j) )/(g + 1 -j), Benjamini and Hochberg proposed their aBH method to estimate g 0 as follows: Alg. 1 aBH method in g 0 estimation 1. If there is no hypothesis rejected by the BH procedure, then stop and declare no discovery.
3. Let J* be the largest value such that m (J*) <m (J* -1) . Define b as the smallest integer larger than 1/m (J*) .
4. The number of true null hypotheses is estimated as Recently, Storey and Tibshirani [6] observed and demonstrated in their Figure 1 that the density histogram of pvalues based on p 1 ,..., p g from a microarray experiment looked flat in the region of (0.5, 1). Based on the fact that the null p-values of genes are uniformly distributed, most of genes in the region of (0.5, 1) should be from the null. Therefore, they used a smoothing method to estimate g 0 based on the flat region of the observed p-values. However, the implicit requirements of their method are: 1) g should be large, and 2) the proportion of true null hypotheses should also be large, such as 0.5, so that g 0 can be estimated accurately. The method developed in this paper tries to bypass these two requirements so that a broad range of multiple testing problems can be applied. It is conceivable that when the number of relevant genes are known for certain target disease, a chip with a small number of genes will be widely used for diagnosis. The FDR methods, including both the BH and the aBH versions, have now been widely accepted for microarray analyses. They have been implemented in the R program [7], which can be incorporated into the Bioconductor package [8]. Several microarray analysis computer programs, such as SAM (Significance Analysis of Microarrays) [9] and GeneSpring [10] also use the FDR criterion to identify differentially expressed genes. Yang et al. [11] have used the FDR criterion to determine the sample size in microarray experiments.

Methods
Without loss of generality, we describe the method in a one-sample testing problem. Its extension to commonly used paired-t test, two-sample t test, or nonparametric rank tests in microarray analysis is very straightforward. value for rejecting the null H 0j be p j derived from a properly chosen test statistic.
Once the p-values are derived, we first summarize the information content of p-values using a differentiable real-valued decreasing function h, i.e., h(p 1 ,..., p g ) is decreasing to all its arguments. We call h the combining function and its value, η = h(p 1 ,..., p g ), the global statistic.
Next, define η (0) to be the observed global statistic of η derived directly from the observed data. Note that a small value of p j indicates H 0j is less likely to be true. Since h is a decreasing function of its argument p j , a large value of η (0) indicates a subset of the null hypotheses is likely to be true alternatives. To determine whether η (0) is large enough to make such a claim, we need to know the distribution of η when all the nulls are true. The distribution of η depends on the distribution of its arguments, the correlation between its arguments, and the combining function h. In a microarray experiment, there seems no way to model or estimate so many correlations (see, for example, Chapter 6 of Pesarin [12]). Hence, a reasonable approach which can tackle the correlation within each experimental unit is to determine the critical value by a permutation test. More specifically, we calculate all B = 2 n values of η based on (ω 1 Y 1 ,..., ω n Y n ), where ω i = ± 1 (The multivariate two-sample permutation method can be found in Pesarin [12]).
Hence, we have a set of reference values from the B permutations, and can now define the p-value of the global η (0) -statistic as where I(x) is the indicator function that takes value 1 if the statement x is true and 0 if it is not. To determine whether the global null hypothesis of all µ (j) , j = 1,..., g, is zero is to evaluate if the global p-value, p (0) , is less than or equal to a significance level. However, this level is part of the estimation process. It will be determined by the data (see later Alg. 2).
When the global null hypothesis is rejected, it indicates that not all null hypotheses H 0j s are true. Immediate question is, which subset of hypotheses is the true alternative.
To determine whether to reject the hypothesis H (0j) is a multiple testing problem. We can, however, determine the size of true alternatives using the following iterated process. We regard the gene with the smallest p-value as the major contributor for the rejection and claim that the null hypothesis is not true with this gene. When this gene is removed from the data set, we continue the same process with the rest g -1 genes and the p (0) computed by (2) is now denoted by p (1) . The whole process is then repeated to produce a sequence of pseudo-global p-values, p (s) , s = 1,..., g -1. We call the later step global p-value pseudo because it is only based on the subset of the original data. The detail of how to generate pseudo-global p-values is given in the Algorithm section. First, we will use these pseudo-global p-values to estimate the number of true null genes g 0 .
We observe that, intuitively, 1) , and this monotone increasing property will be proved in (J.1) of the Justification section. In addition , we will prove in (J.2) of the Justification section that if r pseudo-global pvalues are less than a given value β(0 <β < 1), the estimator of the number of null genes is and this estimator ensures that E [ ] ≥ g 0 . (The conservativeness of this estimate under other conditions using other techniques has also been mentioned by Storey and Tibshirani [13] and Efron et al. [14].) Since the inequality E [ ] ≥ g 0 holds for any value of β, the best choice of β is the one which makes closest to g 0 . By defining ∆ = β/ (1 -β) 2 and ρ(β) = r -∆, we observe that ρ(β) is an increasing function and then a decreasing function of β for the following reasons. When β is small, ∆ increases slowly. However, r is an increasing function in β and r is always greater than 1. Therefore, ρ(β) is an increasing function for small value of β. On the other hand, when β → 1, ∆ reaches ∞. Since r is finite, ρ(β) becomes a decreasing function in this range. Since (3) is equivalent to = gρ(β), the optimal value of β should be determined as and the method of estimating g 0 is equivalent to finding the optimal β value.
The estimation of g 0 can thus be summarized as

Alg. 3 Differentially expressed gene selection with given FDR
With g 0 estimated, to control the FDR at α level for identifying differentially expressed genes in microarray data analysis with the conservative estimate , we simply test each sub-hypothesis based on (1) with the new α ≡ α/ , i.e., rejects all H (j) , j = 1,..., J, if p (j) ≤ j × and p (J+1) > (J + 1) × . We also need to address the choices for the combining function h. In this paper, we consider only two commonly used ones. 1) Fisher's sum of logarithm and 2) Liptak's sum of inverse standard Gaussian distribution functions More discussion on the choices of combining functions can be found in Birnbaum [15] and Folks [16]. The computer program for the proposed method, global-p, has been implemented in R script and is publicly available [17].

Simulation
Two simulation studies were conducted. First, when the number of hypotheses are small and second, when the total number of hypotheses is extremely large. For small number of hypotheses, we used the following configura- For the purpose of comparison, we also plug in the true value of g 0 into the aBH method. This is the ideal situation to reach the highest power. The plot of function ρ(β) in a real data analysis to determine the arg max ρ(β) for the tumor data The plot of function ρ(β) in a real data analysis to determine the arg max ρ(β) for the tumor data.  Both combining functions in our proposed method have a higher power than the aBH procedure in most situations, while there is little difference between them. The improvement over the aBH methods is substantial when the proportion of the true null hypotheses is small.
Since most microarray data consist of tens of thousands of genes simultaneously and many of their expressions are correlated, our second simulation tried to reflect these facts. Storey and Tibshirani [13] proposed a "clumpy" dependence model in which genes are partitioned into blocks. The gene expressions are assumed independent between blocks but dependent within the same block.
The same simulation study with estimated average powers, using the same legends as Figure 1

Figure 2
The same simulation study with estimated average powers, using the same legends as Figure 1. The powers of the new method are clearly higher than the existing two, especially when the null proportion is small. Following the clumpy model, 10 test and 10 control samples were generated from normal random variables with mean 0 and standard deviation 1 where each sample contained 10,000 genes. For test samples, genes with expression difference were added by 3 to represent the expression differences. The number of differentially expressed genes g 0 were 100, 2,000, 5,000, 8,000 which corresponded to the proportion of nulls π 0 at 0.99, 0.8, 0.5, 0.2. To simulate intra-block dependency, we generated one vector of normal random variables with mean 0 and standard deviation 0.2 in each block of size 50 and add it to every gene in that block. This process creates correlations between genes. Since the true expression differences for non-null genes are moderate large in this simulation, we expect that any good method should accurately estimate π 0 .
In a recent article, Broberg [18] did a comparative review of various newly developed methods for estimating π 0 . To make a comparison, we presented the means and standard errors of estimates of π 0 for all the methods describe in Broberg's paper in addition to our proposed method ( ) over 1000 simulations. The details of π 0 estimation methods (BUM, SPLOSH, QVALUE, Boostrap LSE, SEP, LSL, mgf, PRE) can be found in Broberg's paper and the programs for these methods can be found in the add-on packages of the freely available R software [19]. The simulation results are shown in Table 2.
As expected, all of them performed very well for large π 0 .
If both the mean (bias) and standard error (stability) over the whole range of π 0 are considered, LSL and Global-p stand out and LSL is better in stability. However, as we will see from the next study, LSL seems to be over-conservative in real data analysis. The means of both combining functions of our proposed method performed well but Liptak's function has a higher standard errors. Therefore, Fisher's function will be used for the real data analysis. During the simulation, we also noticed that current implementation of SPLOSH, SEP, mgf, and PRE methods in R programs failed to work when the number of hypotheses was small.

Real data analysis
We use a publically available experimental data set from the Stanford microarray database [20] first to compare the differences between BH, aBH and our method with Fisher's combining function. The purpose of this experiment was to identify genes that have different expressions between prostate tumor tissue and matched normal tissue. It consists of a total of 82 microarrays: 41 arrays were from primary prostate tumors and the other half were from matched normal prostate tissues. Each array contains 32,152 different human genes. We chose this data set because of the large number of replicates. With this amount of replicates, we had a better idea of the ground truth.
To compare performances of various methods, we split the whole data set into two groups: a test data set and a confirmation data set. Eleven pairs of tumor and normal arrays were chosen for the test data set and the remaining arrays were used for the confirmation. The number of eleven test pairs were taken from a systematic sampling to avoid bias. Patients 1, 5, 9,..., 41 in the original order from the database were chosen as the test set. Expression information is missing if it is labelled as failed, contaminated, or flagged. Genes were removed from our study if more than half of their expressions were missing in the original data set. A total of 24,865 genes was used to identify differences in expressions using BH, aBH and our proposed procedures at 0.01 expected FDR level. Since the experiment was designed with paired tumor-normal arrays, the paired t-test was used to derive the p-value of each gene in the test data set. The BH procedure identified 1,254 genes, the aBH procedure identified 1,523 genes, and our proposed method identified 2,119 genes. Our test is apparently more powerful if it can maintain the required FDR 0.01. Since we did not have the biological information, another approach was to estimate FDR based on our rejection set. Specifically, suppose the rejection set contains the p-values that are less than ξ and the estimated proportion of null hypotheses is . An intuitive estimate for FDR is [21,22] where . Based on the confirm data set, our proposed method reported an estimator of π 0 , = 0.5326. Using equation (4), the estimated FDRs for BH, aBH and our method were 0.0054, 0.0067, 0.0099, respectively.
If we further reject more sub-hypotheses beyond the number provided by our method, the FDR will exceed the assigned level. For example, if we reject the next 500 null sub-hypotheses that have the smallest p-values among those not previously rejected, the estimated FDR is 0.0137 which is larger than the pre-assigned 0.01 level.
Moreover, we calculated the standard difference (paired tstatistic) of expressions using the confirmation data set. The confirmation data set contains 30 pairs of arrays so that the statistic is a good estimate for the unknown standard difference. Figure 3 shows histograms of absolute standard differences based on genes identified by the BH procedure, by the aBH procedure, by our method, and the extra 500 genes beyond our method. From the histograms, several hundred differentially expressed genes that were not identified by BH or aBH in the test arrays were identified by our method and most of them have standard expression differences greater than 2. However, the next group of 500 genes beyond our method may contain too many false discoveries to make the FDR acceptable.
Next, we use this data set to compare different estimates of proportion of null genes, π 0 . The whole data set was partitioned into four sub-data sets. We labelled the arrays from 1 to 41 based on the original order in their database. Sub-data set 1 contained 11 arrays with labels 1, 5, 9,..., 41; sub-data set 2 contains 10 arrays with labels 2,6,..., 38; sub-data set 3 contains 10 arrays with labels 3,7,..., 39; sub-data set 4 contained the remaining arrays. Although we did not know the true value of π 0 in real data analysis, we used the four sub-data sets to compare the variation of different estimation methods. In addition, we used the whole data set to check if the estimates are reliable. The estimates were listed in Table 3. All estimates in the subsets were larger than the ones in the whole data set. This is expected because genes with small expressional differences are more likely to be classified as null genes in a small sample. From the π 0 estimates from the whole data set, we may say the LSL gave a large π 0 value, PRE gave a small value and all the others gave similar estimates. We used LSL, PRE and Global-p to draw the p-value density histogram using the whole data set in Figure 4. The upper panel is the overall view and the lower panel is the closer view. The green dash-dot line is the height using LSL estimate; the red dash line PRE estimate; the blue dot line global-p estimate. The LSL method, which is also shown in Broberg's study, is too conservative producing the largest value. The PRE method underestimated π 0 . If we used by PRE method to improve FDR procedure, it is likely to have a higher FDR than the nominal level. We think our π 0 estimate is reliable because it is close but higher than the height of observed p-values in the region of (0.6, 1) if we assume genes with expressional differences have rare chance to have large observed p-values.

Conclusion
Benjamini and Hochberg's FDR procedure [2] opens an useful approach for multiple comparisons in microarray analysis. Due to the complexity in microarray experiments, it seems unlikely that one method can dominate all the other methods in all the cases. In this paper, we have demonstrated a new method that is intuitive appearing because: π 0π 0π 0 The histograms of the absolute standard difference estimates (x-axis) in the confirmation data set for the selected genes from the three procedures based on the test data set for the prostate tumor data Figure 3 The histograms of the absolute standard difference estimates (x-axis) in the confirmation data set for the selected genes from the three procedures based on the test data set for the prostate tumor data. The top is the selected gene histogram by the BH procedure; the second is that by the aBH procdure; and the third is the one by our method. The similarity of the three histograms means that the additional 596 genes picked by our method should maintain the same FDR (0.01) as the other two. The bottom panel contains the next 500 genes beyond our proposed method. The left shift of this histogram means that the additional genes may contain too many false discoveries. With the support of simulation and real data studies, the new method should be a viable alternative to find the differentially expressed genes in microarray experiments.

Algorithm
We illustrate the algorithm for calculating pseudo-global p-values based on marginal p-values, p 1 ,..., . Please note that the procedure described here is the same regardless of g 0 . We consider h(p 1 , ..., ) in the summation form: where is a differentiable decreasing function in which exists and (p) → ∞ as p → 0. Note that equation (5)  In one-sample testing problem, we permute data B times, where each time we assign w i = ± 1 with equal probability to (w 1 Y 1 ,..., w n Y n ). For each permuted data, we use p b(j) to be the p-value from the bth permuted data for the gene that produced p (j) , j = 1,..., g 0 , b = 1,..., B.
We summarize p (i) and p b(i) using the following table.
Note that the first column is p-values of genes from the original data while the remaining columns are from the permuted data.
To simplify the notation, let (p (i) ) = (i) , (p b(i) ) = b(i) and η is just the sum of (·) . Therefore, we can transform the table (6) into the following table with the addition of the last row where its value η is the sum of the (·) in its corresponding column.
Once η (0) and η (1) ,..., η (B) are available, the global p-value, by definition, is The same process is directly used to calculate the pseudoglobal p-values after removing the genes which have the smallest p-values. Actually, we can use the available data illustrated in table (8). By induction, suppose p (1) ,..., p (s) ally, when the gene expression difference is far away from zero, it will be identified by any reasonable multiple testing procedure. Hence, in this extreme case, we only focus on which is based on the remaining g 0 samples that have null distributions. Our goal is reduced to finding an upper bond of , so that the expected value of can be bounded (below) by g 0 .

J.1 Monotone in Pseudo-global p-values
The key feature of our algorithm is that the pseudo-global  where C (s) = Φ -1 (1 -p (s) ) is p (s) th upper percentile of the standard Gaussian distribution. Hence, the relation between η (s) and C (s) is Similarly, for the µ (s+1) and σ (s+1) as the mean and standard deviation of reference samples (b = 1,..., B), we repeat the same approximation method to express p (s+1) as Comparing equations (11) and (13), we observe that to prove p (s) ≤ p (s+1) with probability 1 asymptotically when p (s) is given is equivalent to prove that with probability 1 asymptotically when C (s) is given.
The purpose here is to derive equation (3) for any β within (0,1) using the above inequality. We start with the number of genes removed by our procedure. Observe that In addition, where k m = min(k, m + 1). Hence, we can find an upper bound of the expected number of genes removed, , which is Let ∆ = ∆(β) = β/(1 -β) 2 . From equations (9) and (16)