High-throughput methods in molecular biology have challenged existing data analysis methods and stimulated the development of new ones. A key example is the gene expression microarray and its use as a screening tool for detecting genes that are differentially expressed (DE) between different biological states. The need to identify a possibly very small number of regulated genes among the tens of thousands of sequences found on modern microarray chips, based on tens to hundreds of biological samples, has led to a plethora of different methods. The emerging consensus in the field [1] suggests that a) despite ongoing research on p-value adjustments [2], false discovery rates (FDR, [3]) are more practical for dealing with the multiplicity problem, and b) classical test statistics require modification to limit the influence of unrealistically small variance estimates. Nonetheless, many competing methods for detecting DE exist, and even attempts at validation on data sets with known mRNA composition [4] cannot offer definitive guidelines.

In this context, the introduction of the so-called optimal discovery procedure (ODP, [5]) constitutes a major conceptual achievement. Building on the Neyman-Pearson lemma for testing an individual hypothesis, the author shows that an extension of the likelihood ratio test statistic to multiple parallel hypotheses (or genes) is the optimal procedure for deciding whether any specific gene is in fact DE: for any fixed number of false positive results, the ODP will identify the maximum number of true positives. The ODP therefore establishes a theoretical optimum for detecting DE against which any other method can be measured.

Unfortunately, the optimality of the ODP is a strictly theoretical result that requires, for all genes, a full parametric specification of the densities under the null and alternative hypotheses. In practice, even assuming normality, the gene-wise means and variances are unknown, and they become nuisance parameters in the hypothesis testing. Consequently, the authors of [6] have suggested an estimated version, EODP, which can be implemented in practice. It is, however, not clear how EODP performs compared to the theoretical optimum, or to other existing methods, except under the most benign circumstances (no correlation and equal variances between genes).

The main questions of this paper are therefore a) whether the optimality of the ODP is retained by EODP, and b) whether we can improve on EODP's performance in practice. Previously, we have introduced a multidimensional extension of the FDR procedure (fdr2d) that combines standard error information with the classical t-statistic. We demonstrated that fdr2d performs as well as or better than the usual modified t-statistics, without requiring extra modeling or model assumptions [7]. In this paper, we show that fdr2d also outperforms EODP on simulated and real data sets. We also demonstrate how a synthesis of the EODP and fdr2d procedures can further improve the power to detect DE.

### The two-sample problem

We demonstrate the application of EODP and fdr2d in the common situation where we want to detect genes that are DE between two biological states. We assume *n*_{1} and *n*_{2} arrays for each group, each containing probes for *m* genes. For gene *i*, we observe a vector of expression values **x**_{i} of length *n*_{1} + *n*_{2}, which consists of the observations **x**_{i1} in the first group and **x**_{i2} in the second group. We define the groupwise means and standard deviations as usual, and refer to the pooled standard deviation as *s*_{i}, with

*s*_{i}^{2} = [(*n*_{1} - 1)*s*_{i1}^{2} + (*n*_{2} - 1)*s*_{i2}^{2}] / (*n*_{1} + *n*_{2} - 2).

Furthermore, we assume that we are dealing with a random mixture of DE and nonDE genes, with a proportion *π*_{0} of genes being nonDE.
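
For concreteness, this two-sample mixture setup, together with the groupwise summaries and the pooled standard deviation, can be sketched as follows (a minimal Python sketch; the sample sizes, effect size, and choice of *π*_{0} are arbitrary illustrative values, not part of the procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, m = 5, 5, 1000      # arrays per group and number of genes (arbitrary)
pi0 = 0.9                   # proportion of nonDE genes in the mixture

# Random mixture: nonDE genes share one mean, DE genes are shifted in group 2
is_de = rng.random(m) >= pi0
x1 = rng.normal(0.0, 1.0, size=(m, n1))
x2 = rng.normal(np.where(is_de, 1.0, 0.0)[:, None], 1.0, size=(m, n2))

# Groupwise summaries and the pooled standard deviation s_i for each gene i
xbar1, xbar2 = x1.mean(axis=1), x2.mean(axis=1)
s1, s2 = x1.std(axis=1, ddof=1), x2.std(axis=1, ddof=1)
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
```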

### ODP statistics

The theoretical ODP statistic assumes that for all *i* = 1, ..., *m* genes, the density functions of the expression values under the null hypothesis of no DE, *f*_{i}, and under the alternative hypothesis of DE, *g*_{i}, are fully known in advance. For the random mixture of DE and nonDE genes outlined above, the ODP statistic for the observed expression values **x**_{i} of the *i*-th gene can then be written as

*S*(**x**_{i}) = Σ_{j: DE} *g*_{j}(**x**_{i}) / Σ_{j: nonDE} *f*_{j}(**x**_{i}),

where the sum in the numerator runs over the truly DE genes and the sum in the denominator over the truly nonDE genes.

The procedure then rejects the null hypothesis for all genes *i* with *S*_{i} ≡ *S*(**x**_{i}) ≥ *λ*, i.e. all genes with large *S*_{i} are declared to be DE. Using the Neyman-Pearson lemma, it can be shown that this procedure is optimal in the sense that for any pre-specified false positive rate (which will determine *λ*), the ODP will have the maximum true positive rate. This optimality property can also be expressed in terms of FDR [5].
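
The rejection rule can be made concrete with a toy example in which all densities are fully known (a hypothetical sketch only; real expression data never come with known *f*_{i} and *g*_{i}, and the normal densities and parameter values below are arbitrary):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    # joint density of independent N(mu, sigma2) values in the vector x
    x = np.asarray(x, dtype=float)
    return float(np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2))
                         / np.sqrt(2 * np.pi * sigma2)))

def odp_statistic(x, genes):
    # genes: list of (is_de, (mu_null, var_null), (mu_alt, var_alt)) triples;
    # the numerator sums alternative densities over the truly DE genes,
    # the denominator sums null densities over the truly nonDE genes
    num = sum(normal_pdf(x, *alt) for de, null, alt in genes if de)
    den = sum(normal_pdf(x, *null) for de, null, alt in genes if not de)
    return num / den
```

Data resembling a DE gene's alternative distribution yields a large *S* and is rejected once *S* ≥ *λ*.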

Requiring full specification of all null and alternative distributions, however, is impractical. In any realistic application, only an estimated ODP statistic

*Ŝ*(**x**_{i}) = Σ_{j=1}^{m} *ĝ*_{j}(**x**_{i}) / Σ_{j=1}^{m} *f̂*_{j}(**x**_{i}) (1)

is feasible, where the densities *f̂*_{j} and *ĝ*_{j} are estimated from the data. In [6], the authors propose to assume that all genes follow a normal distribution (possibly after a suitable transformation); under this assumption, only means and variances have to be estimated from the data. In our two-sample situation, this amounts to

*f̂*_{j}(**x**_{i}) = *φ*(**x**_{i}|*μ̂*_{j}, *σ̂*_{j}^{2}) and *ĝ*_{j}(**x**_{i}) = *φ*(**x**_{i1}|*μ̂*_{j1}, *s*_{j}^{2}) *φ*(**x**_{i2}|*μ̂*_{j2}, *s*_{j}^{2}),

where *φ*(·|*μ*, *σ*^{2}) is the joint density of independent normal observations with mean *μ* and variance *σ*^{2}.

Conceptually, under the null hypothesis, we have the usual estimates *μ̂*_{i} and *σ̂*_{i}^{2} from the combined data, and under the alternative hypothesis, the corresponding group-wise means *μ̂*_{i1} and *μ̂*_{i2} with the pooled sample variance *s*_{i}^{2}. For the practical implementation, we follow [6] and pre-normalize all genes to have zero mean.
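
A direct, unoptimized sketch of this plug-in construction (assuming normality and the estimates above; the O(*m*^{2}) double loop and the log-scale computation are implementation choices for clarity and stability, not prescribed by [6]):

```python
import numpy as np

def normal_logpdf(x, mu, sigma2):
    # log joint density of independent N(mu, sigma2) values in the vector x
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                        - (x - mu) ** 2 / (2 * sigma2)))

def log_estimated_odp(x1, x2):
    """log S_hat for every gene: plug-in normal densities with a common
    mean and variance under the null, and groupwise means with the pooled
    sample variance under the alternative."""
    m, n1 = x1.shape
    n2 = x2.shape[1]
    x = np.hstack([x1, x2])
    x = x - x.mean(axis=1, keepdims=True)         # pre-normalize to zero mean
    y1, y2 = x[:, :n1], x[:, n1:]
    var0 = x.var(axis=1)                          # null variance (MLE)
    mu1, mu2 = y1.mean(axis=1), y2.mean(axis=1)   # groupwise means
    varp = ((n1 - 1) * y1.var(axis=1, ddof=1) +
            (n2 - 1) * y2.var(axis=1, ddof=1)) / (n1 + n2 - 2)

    def logsumexp(v):
        c = v.max()
        return c + np.log(np.exp(v - c).sum())

    s = np.empty(m)
    for i in range(m):
        lg = np.array([normal_logpdf(y1[i], mu1[j], varp[j]) +
                       normal_logpdf(y2[i], mu2[j], varp[j])
                       for j in range(m)])
        lf = np.array([normal_logpdf(x[i], 0.0, var0[j]) for j in range(m)])
        s[i] = logsumexp(lg) - logsumexp(lf)
    return s
```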

The second step in applying the ODP to data is the calibration of the procedure. There is no distribution theory for the statistic, so it is not clear how to choose the threshold *λ* to achieve a desired FDR level. The authors of [6] suggest a permutation algorithm that recomputes the estimated ODP statistic *Ŝ* under random permutations of the group labels; they use the resulting null distribution of *Ŝ* to compute the q-value for each gene, which represents its global FDR (e.g. [8]). We follow this approach in our implementation, but use the local false discovery rate (fdr, see [9] and below), with essentially identical results to theirs.

### Multidimensional local false discovery rate

FDR approaches focus on the distribution of the specific statistic *Z* used to test the gene-wise null hypotheses, in contrast to ODP, which is based on the distribution of the data. Given a mixture of DE and nonDE genes as described above, the density *f* of *Z* can be written as

*f*(*z*) = *π*_{0}*f*_{0}(*z*) + (1 - *π*_{0})*f*_{1}(*z*), (2)

where *f*_{0} and *f*_{1} are the densities of the test statistic *Z* for nonDE and DE genes, respectively, and *π*_{0} the proportion of truly nonDE genes. The local fdr for any observed value *z* of the test statistic is then

fdr(*z*) = *π*_{0}*f*_{0}(*z*)/*f*(*z*), (3)

and can be interpreted as the expected rate of false positives among genes with test statistic *z*, see [9]. Practically, the density *f* can be estimated from the histogram of the test statistics computed from the real data, and *f*_{0} is estimated similarly from the test statistics computed from permuted data.
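
A crude histogram-based estimate along these lines (the bin count and the conservative choice *π*_{0} = 1 are simplifying assumptions, not the tuned procedure of [9]):

```python
import numpy as np

def fdr1d(z_obs, z_null, bins=20, pi0=1.0):
    """Histogram-based local fdr: fdr(z) ~ pi0 * f0(z) / f(z).
    z_obs: observed statistics; z_null: statistics from permuted data."""
    lo = min(z_obs.min(), z_null.min())
    hi = max(z_obs.max(), z_null.max())
    edges = np.linspace(lo, hi, bins + 1)
    f, _ = np.histogram(z_obs, bins=edges, density=True)    # mixture density
    f0, _ = np.histogram(z_null, bins=edges, density=True)  # null density
    idx = np.clip(np.digitize(z_obs, edges) - 1, 0, bins - 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = pi0 * f0[idx] / f[idx]
    return np.minimum(ratio, 1.0)
```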

Formulated as a decision procedure like the ODP, we specify a test statistic *Z* and a desired threshold *α* for the local fdr; we then compute for each gene the value of the test statistic *z*_{i} = *Z*(**x**_{i}) and the decision criterion fdr_{i} = fdr(*z*_{i}), and declare genes with fdr_{i} < *α* to be DE.

As the more usual global FDR of a set of test statistics is just the average of their local fdr [9], little seems to be gained by using the local fdr. Note, however, that Equations 2 and 3 still hold if we replace the univariate test statistic *Z* by a vector **Z** of test statistics. We have recently shown that for the two-sample problem, using a bivariate test statistic and the associated two-dimensional fdr is more powerful than conventional FDR for univariate test statistics [7]. Specifically, the test statistic **Z** = (*Z*_{1}, *Z*_{2}) with

*Z*_{1} = *t* and *Z*_{2} = log *se*, (4)

where *t* is the usual t-statistic and *se* the standard error of the mean, yields smaller fdr not only compared to the conventional t-statistic on its own, but also compared to a number of popular modified t-statistics [10–12].
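
Computing the bivariate statistic in (4) per gene can be sketched as follows (assuming *se* is the pooled standard error of the difference in group means, i.e. the denominator of the t-statistic):

```python
import numpy as np

def bivariate_stat(x1, x2):
    """Z1 = t statistic, Z2 = log(se), per gene."""
    n1, n2 = x1.shape[1], x2.shape[1]
    sp2 = ((n1 - 1) * x1.var(axis=1, ddof=1) +
           (n2 - 1) * x2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))   # pooled standard error
    t = (x1.mean(axis=1) - x2.mean(axis=1)) / se
    return t, np.log(se)
```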

In the following, we will use the abbreviations fdr1d and fdr2d for local fdr computed from univariate and bivariate test statistics, respectively. Note that in practice, the fdr2d is estimated in a similar manner to the fdr1d, using two-dimensional instead of one-dimensional histograms, together with a somewhat more sophisticated binomial smoothing procedure; see [7] for details.
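
The two-dimensional analogue of the histogram estimate, without the smoothing step, might look as follows (a skeleton only; the actual implementation in [7] applies a more sophisticated smoothing to the density ratio):

```python
import numpy as np

def fdr2d(z_obs, z_null, bins=10, pi0=1.0):
    """Two-dimensional local fdr from 2D histograms.
    z_obs: (m, 2) observed statistics (Z1, Z2); z_null: (k, 2) permuted."""
    edges = [np.linspace(min(z_obs[:, d].min(), z_null[:, d].min()),
                         max(z_obs[:, d].max(), z_null[:, d].max()),
                         bins + 1) for d in (0, 1)]
    f, _, _ = np.histogram2d(z_obs[:, 0], z_obs[:, 1],
                             bins=edges, density=True)
    f0, _, _ = np.histogram2d(z_null[:, 0], z_null[:, 1],
                              bins=edges, density=True)
    i = np.clip(np.digitize(z_obs[:, 0], edges[0]) - 1, 0, bins - 1)
    j = np.clip(np.digitize(z_obs[:, 1], edges[1]) - 1, 0, bins - 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = pi0 * f0[i, j] / f[i, j]
    return np.minimum(np.nan_to_num(ratio, nan=1.0), 1.0)
```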

### Procedures to be evaluated

The central aim of this paper is to compare the operating characteristics of four different procedures for detecting DE on a number of real and simulated data sets:

1. t1d uses the standard t-statistic with conventional fdr1d and serves as a reference.

2. S1d uses the logarithm of *Ŝ* in (1) with fdr1d; this procedure is equivalent to the estimated version of the ODP described in [6] and its implementation in the EDGE software.

3. t2d uses the test statistic in (4) for calculating fdr2d; this is the same procedure as described in [7].

4. S2d is a novel procedure that combines the logarithm of *Ŝ* and the standard error for computing fdr2d, see below.