Detecting differential expression in microarray data: comparison of optimal procedures
© Perelman et al. 2007
Received: 22 August 2006
Accepted: 26 January 2007
Published: 26 January 2007
Many procedures for finding differentially expressed genes in microarray data are based on classical or modified t-statistics. Due to multiple testing considerations, the false discovery rate (FDR) is the key tool for assessing the significance of these test statistics. Two recent papers have generalized two aspects: Storey et al. (2005) have introduced a likelihood ratio test statistic for two-sample situations that has desirable theoretical properties (optimal discovery procedure, ODP), but uses standard FDR assessment; Ploner et al. (2006) have introduced a multivariate local FDR that allows incorporation of standard error information, but uses the standard t-statistic (fdr2d). The relationship between these methods and their relative performance in two-sample comparisons are currently unknown.
Using simulated and real datasets, we compare the ODP and fdr2d procedures. We also introduce a new procedure called S2d that combines the ODP test statistic with the extended FDR assessment of fdr2d.
For both simulated and real datasets, fdr2d performs better than ODP. As expected, both methods perform better than a standard t-statistic with standard local FDR. The new procedure S2d performs as well as fdr2d on simulated data, but performs better on the real data sets.
The ODP can be improved by including the standard error information as in fdr2d. This means that the optimality enjoyed in theory by ODP does not hold for the estimated version that has to be used in practice. The new procedure S2d has a slight advantage over fdr2d, which has to be balanced against a significantly higher computational effort and a less intuitive test statistic.
High-throughput methods in molecular biology have challenged existing data analysis methods and stimulated the development of new methods. A key example is the gene expression microarray and its use as a screening tool for detecting genes that are differentially expressed (DE) between different biological states. The need to identify a possibly very small number of regulated genes among the tens of thousands of sequences found on modern microarray chips, based on tens to hundreds of biological samples, has led to a plethora of different methods. The emerging consensus in the field suggests that a) despite ongoing research on p-value adjustments, false discovery rates (FDR) are more practical for dealing with the multiplicity problem, and b) classical test statistics require modification to limit the influence of unrealistically small variance estimates. Nonetheless, many competing methods for detecting DE exist, and even attempts at validation on data sets with known mRNA composition cannot offer definitive guidelines.
In this context, the introduction of the so-called optimal discovery procedure (ODP) constitutes a major conceptual achievement. Building on the Neyman-Pearson lemma for testing an individual hypothesis, the author shows that an extension of the likelihood ratio test statistic for multiple parallel hypotheses (or genes) is the optimal procedure for deciding whether any specific gene is in fact DE: for any fixed number of false positive results, ODP will identify the maximum number of true positives. The ODP therefore establishes a theoretical optimum for detecting DE against which any other method can be measured.
Unfortunately, the optimality of ODP is a strictly theoretical result that requires, for all genes, a full parametric specification of the densities under null and alternative hypothesis. In practice, even assuming normality, the gene-wise means and variances are unknown, and they become nuisance parameters in the hypothesis testing. Consequently, the authors of  have suggested an estimated version EODP, which can be implemented in practice. It is, however, not clear how EODP performs compared to the theoretical optimum, or other existing methods, except under the most benign circumstances (no correlation and equal variances between genes).
The main questions of this paper are therefore a) whether the optimality of ODP is retained by EODP, and b) whether we can improve on EODP's performance in practice. Previously, we have introduced a multidimensional extension of the FDR procedure (fdr2d) that combines standard error information with the classical t-statistic. We demonstrated that fdr2d performs as well as or better than the usual modified t-statistics, without requiring extra modeling or model assumptions. In this paper, we show that fdr2d also outperforms EODP on simulated and real data sets. We also demonstrate how a synthesis of the EODP and fdr2d procedures can further improve the power to detect DE.
We demonstrate the application of EODP and fdr2d in the common situation where we want to detect genes that are DE between two biological states. We assume $n_1$ and $n_2$ arrays in the first and second group, respectively, each containing probes for $m$ genes. For gene $i$, we observe a vector of expression values $x_i$ of length $n_1 + n_2$, which consists of the observations $x_{i1}$ in the first group and $x_{i2}$ in the second group. We define the groupwise means and standard deviations as usual, and refer to the pooled standard deviation as
$$ s_i = \sqrt{\frac{(n_1 - 1)\, s_{i1}^2 + (n_2 - 1)\, s_{i2}^2}{n_1 + n_2 - 2}}, $$
where $s_{i1}$ and $s_{i2}$ are the groupwise standard deviations of gene $i$.
Furthermore, we assume that we are dealing with a random mixture of DE and nonDE genes, with a proportion $\pi_0$ of genes being nonDE.
The theoretical ODP statistic assumes that for all genes $i = 1, \ldots, m$, the density functions of the expression values under the null hypothesis of no DE, $f_i$, and under the alternative hypothesis of DE, $g_i$, are fully known in advance. For the random mixture of DE and nonDE genes outlined above, the ODP statistic for the observed expression values $x_i$ of the $i$-th gene can then be written as
$$ S(x_i) = \frac{\sum_{j=1}^{m} g_j(x_i)}{\sum_{j=1}^{m} f_j(x_i)}. $$
The procedure then rejects the null hypothesis for all genes $i$ with $S_i \equiv S(x_i) \geq \lambda$, i.e. all genes with large $S_i$ are declared to be DE. Using the Neyman-Pearson Lemma, it can be shown that this procedure is optimal in the sense that for any pre-specified false positive rate (which will determine $\lambda$), the ODP will have the maximum true positive rate. This optimality property can also be expressed in terms of FDR.
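Requiring full specification of all null and alternative distributions, however, is impractical. In any realistic application, only an estimated ODP statistic
$$ \hat{S}(x_i) = \frac{\sum_{j=1}^{m} \hat{g}_j(x_i)}{\sum_{j=1}^{m} \hat{f}_j(x_i)} $$
is feasible, where the densities $\hat{f}_j$ and $\hat{g}_j$ are estimated from the data. In , the authors propose to assume that all genes follow a normal distribution (possibly after suitable transformation); under this assumption, only means and variances have to be estimated from the data. In our two-sample situation, this amounts to
$$ \hat{S}(x_i) = \frac{\sum_{j=1}^{m} \phi(x_{i1} \mid \hat{\mu}_{j1}, \hat{\sigma}_j^2)\, \phi(x_{i2} \mid \hat{\mu}_{j2}, \hat{\sigma}_j^2)}{\sum_{j=1}^{m} \phi(x_i \mid \hat{\mu}_{j0}, \hat{\sigma}_{j0}^2)}, \qquad (1) $$
where $\phi(\cdot \mid \mu, \sigma^2)$ is the joint density of independent normal observations with mean $\mu$ and variance $\sigma^2$.
Conceptually, under the null hypothesis, we have the usual estimates $\hat{\mu}_{j0}$ and $\hat{\sigma}_{j0}^2$ from the combined data, and under the alternative hypothesis, the corresponding group-wise means $\hat{\mu}_{j1}$ and $\hat{\mu}_{j2}$ with the pooled sample variance $\hat{\sigma}_j^2$. For the practical implementation, we follow  and pre-normalize all genes to have zero mean.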
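As an illustration only, the normal-theory version of the estimated ODP statistic can be sketched in a few lines of plain Python. The sketch assumes that the numerator sums, over all genes j, the product of the two group-wise normal densities evaluated at the data of gene i, and that the denominator sums the corresponding null densities; the function names (`log_odp_statistics`, `estimate_parameters`) are hypothetical and this is not the EDGE implementation. Sums are accumulated on the log scale to avoid numerical underflow.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def log_normal_density(values, mu, var):
    """Log joint density of independent N(mu, var) observations."""
    ss = sum((v - mu) ** 2 for v in values)
    return -0.5 * len(values) * math.log(2 * math.pi * var) - ss / (2 * var)

def estimate_parameters(x1, x2):
    """Null mean/variance from the combined data, group means and pooled variance."""
    both = x1 + x2
    mu0 = mean(both)
    var0 = sum((v - mu0) ** 2 for v in both) / (len(both) - 1)
    mu1, mu2 = mean(x1), mean(x2)
    ss = sum((v - mu1) ** 2 for v in x1) + sum((v - mu2) ** 2 for v in x2)
    return mu0, var0, mu1, mu2, ss / (len(both) - 2)

def log_sum_exp(terms):
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_odp_statistics(group1, group2):
    """Return log S-hat per gene; group1[j] and group2[j] hold the values of gene j."""
    # Pre-normalize every gene to zero overall mean, as described in the text.
    centred = []
    for x1, x2 in zip(group1, group2):
        m = mean(x1 + x2)
        centred.append(([v - m for v in x1], [v - m for v in x2]))
    params = [estimate_parameters(x1, x2) for x1, x2 in centred]
    result = []
    for x1, x2 in centred:
        # Each gene's data are scored against the estimated densities of all genes.
        alt = [log_normal_density(x1, mu1, s2) + log_normal_density(x2, mu2, s2)
               for (_, _, mu1, mu2, s2) in params]
        null = [log_normal_density(x1 + x2, mu0, v0)
                for (mu0, v0, _, _, _) in params]
        result.append(log_sum_exp(alt) - log_sum_exp(null))
    return result
```

The double loop over genes makes the cost quadratic in the number of genes, which is one reason the statistic is expensive to compute for full-size arrays.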
The second step in applying the ODP to data is the calibration of the procedure. There is no distribution theory for the statistic $\hat{S}$, so it is not clear how to choose the threshold $\lambda$ to achieve a desired FDR level. The authors of  suggest a conventional algorithm that computes the estimated ODP statistic under random permutations of the group labels; they use the resulting null distribution of $\hat{S}$ to compute the q-value for each gene, which represents its global FDR (e.g. ). We follow this approach for our implementation, but use the local false discovery rate (fdr, see  and below), with results essentially identical to theirs.
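The permutation step can be sketched generically, independent of the statistic being calibrated. The helper below is a hypothetical illustration: it assumes that the same random relabelling of arrays is applied to all genes within one permutation, and that the statistic is supplied as a function taking the two groups of per-gene values and returning one value per gene. The simple `mean_difference` statistic is only a stand-in for the estimated ODP statistic.

```python
import random

def permutation_null(rows, n1, statistic, n_perm=100, seed=0):
    """Pool statistics computed under random permutations of the group labels.

    rows[i] holds the n1 + n2 expression values of gene i (group 1 first).
    statistic(group1, group2) returns a list with one value per gene.
    """
    rng = random.Random(seed)
    n = len(rows[0])
    null_values = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # one relabelling of the arrays, shared by all genes
        g1 = [[row[k] for k in order[:n1]] for row in rows]
        g2 = [[row[k] for k in order[n1:]] for row in rows]
        null_values.extend(statistic(g1, g2))
    return null_values

def mean_difference(group1, group2):
    """Stand-in statistic: difference of group means for every gene."""
    return [sum(b) / len(b) - sum(a) / len(a) for a, b in zip(group1, group2)]
```

The pooled values returned by `permutation_null` play the role of the null distribution from which q-values or local fdr values are then derived.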
FDR approaches focus on the distribution of the specific statistic $Z$ used to test the gene-wise null hypotheses, in contrast to ODP, which is based on the distribution of the data. Given a mixture of DE and nonDE genes as described above, the density $f$ of $Z$ can be written as
$$ f(z) = \pi_0 f_0(z) + (1 - \pi_0) f_1(z), \qquad (2) $$
where $f_0$ and $f_1$ are the densities of the test statistic $Z$ for nonDE and DE genes, respectively, and $\pi_0$ is the proportion of truly nonDE genes. The local fdr for any observed value $z$ of the test statistic is then
$$ \mathrm{fdr}(z) = \pi_0 \frac{f_0(z)}{f(z)}, \qquad (3) $$
and can be interpreted as the expected rate of false positives among genes with test statistic $z$, see . Practically, the density $f$ can be estimated from the histogram of the test statistics computed from the real data, and $f_0$ is estimated similarly from the test statistics computed from permuted data.
Formulated as a decision procedure like ODP, we specify a test statistic $Z$ and a desired threshold $\alpha$ for the local fdr; we then compute for each gene the value of the test statistic $z_i = Z(x_i)$ and the decision criterion $\mathrm{fdr}_i = \mathrm{fdr}(z_i)$, and declare genes with $\mathrm{fdr}_i < \alpha$ to be DE.
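A minimal sketch of this histogram-based estimate and the resulting decision rule is given below. It deliberately omits the smoothing step used in practice, uses equal-width bins so that the ratio of bin fractions equals the ratio of density estimates, and truncates the estimate at 1; all function names are hypothetical.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] containing value, or None if outside."""
    if value < lo or value > hi:
        return None
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)

def bin_fractions(values, lo, hi, n_bins):
    """Fraction of values falling into each bin."""
    counts = [0] * n_bins
    for v in values:
        b = bin_index(v, lo, hi, n_bins)
        if b is not None:
            counts[b] += 1
    return [c / len(values) for c in counts]

def local_fdr_by_bin(observed, permuted, pi0, lo, hi, n_bins):
    """Unsmoothed estimate of pi0 * f0(z) / f(z) per bin; None where f is empty."""
    f = bin_fractions(observed, lo, hi, n_bins)
    f0 = bin_fractions(permuted, lo, hi, n_bins)
    return [min(1.0, pi0 * f0[b] / f[b]) if f[b] > 0 else None
            for b in range(n_bins)]

def declare_de(observed, fdr_bins, alpha, lo, hi, n_bins):
    """Declare a gene DE when the local fdr of its bin is below alpha."""
    calls = []
    for z in observed:
        b = bin_index(z, lo, hi, n_bins)
        fdr = fdr_bins[b] if b is not None else None
        calls.append(fdr is not None and fdr < alpha)
    return calls
```

With sparse bins the raw ratio is very noisy, which is why a smoothing step is needed before such estimates are used on real data.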
As the more usual global FDR of a set of test statistics is just the average of their local fdr, little seems to be gained by using the local fdr. Note, however, that Equations 2 and 3 still hold if we replace the univariate test statistic $Z$ by a vector $Z$ of test statistics. We have recently shown that for the two-sample problem, using a bivariate test statistic and the associated two-dimensional fdr is more powerful than conventional FDR for univariate test statistics. Specifically, we use the test statistic $Z = (Z_1, Z_2)$ with
$$ Z_1 = t \quad \text{and} \quad Z_2 = \log se, \qquad (4) $$
where $t$ is the usual t-statistic and $se$ is the standard error of the difference in group means, which for gene $i$ with pooled standard deviation $s_i$ is
$$ se = s_i \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. $$
In the following, we will use the abbreviations fdr1d and fdr2d for local fdr computed based on univariate and bivariate test statistics, respectively. Note that in practice, the fdr2d is estimated in a similar manner as the fdr1d, using two-dimensional histograms instead of one-dimensional histograms, together with a somewhat more sophisticated binomial smoothing procedure, see  for details.
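The bivariate statistic and an unsmoothed grid-cell version of the two-dimensional fdr can be sketched as follows. This is an illustration under stated assumptions, not the OCplus implementation: the sign convention of t (first group minus second group), the equal-width grid, and the function names are all choices made here, and the smoothing step is left out.

```python
import math
from collections import Counter

def t_and_log_se(x1, x2):
    """Return (t, log se) for one gene, using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    pooled_sd = math.sqrt(ss / (n1 + n2 - 2))
    se = pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (m1 - m2) / se, math.log(se)  # sign convention is an assumption

def cell_of(z, bounds, n_bins):
    """Grid cell of a bivariate statistic z, or None if outside the bounds."""
    cell = []
    for value, (lo, hi) in zip(z, bounds):
        if value < lo or value > hi:
            return None
        cell.append(min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1))
    return tuple(cell)

def fdr2d_grid(observed, permuted, pi0, bounds, n_bins):
    """Unsmoothed pi0 * f0 / f for every grid cell containing observed statistics."""
    obs = Counter(c for c in (cell_of(z, bounds, n_bins) for z in observed)
                  if c is not None)
    perm = Counter(c for c in (cell_of(z, bounds, n_bins) for z in permuted)
                   if c is not None)
    return {cell: min(1.0, pi0 * (perm[cell] / len(permuted)) / (n / len(observed)))
            for cell, n in obs.items()}
```

Here `bounds` is a pair of (lower, upper) limits, one for each coordinate of the statistic.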
The central aim of this paper is to compare the operating characteristics of four different procedures for detecting DE on a number of real and simulated data sets:
1. t1d uses the standard t-statistic with conventional fdr1d and serves as a reference.
2. S1d uses the logarithm of $\hat{S}$ in (1) with fdr1d; this procedure is equivalent to the estimated version of ODP described in  and its implementation in the EDGE software.
3. t2d uses the test statistic in (4) for calculating fdr2d; this is the same procedure as described in .
4. S2d is a novel procedure that combines the logarithm of $\hat{S}$ and the standard error for computing fdr2d, see below.
We first evaluate the S2d procedure, based on the bivariate test statistic
$$ Z_1 = \log \hat{S} \quad \text{and} \quad Z_2 = \log se, $$
with $\hat{S}$ defined as in (1) and $se$ as in (4). The only practical concern is that the smoothing procedure described in  may have problems with this statistic. Indeed, the reason for taking the logarithms of the test statistics is to facilitate smoothing, by avoiding crowding at the boundary values.
see . Figures 1(c) and 1(d) show S1d (black) overlaid with the averaged S2d (red) for both data sets, with excellent agreement. This indicates that the smoothing required for computing S2d has been successful. This is consistent with the relationship between the t-statistics and $\log \hat{S}$ for the data at hand (not shown, but see e.g. Figure 1 in ), which is essentially linear for genes with t-statistic $|t| > 1$, suggesting that the same general smoothing procedure is applicable.
We perform simulations with 10,000 genes per array, a proportion of truly nonDE genes $\pi_0 = 0.8$, and two independent groups with $n = 7$ arrays per group. We combine three different levels of variance heterogeneity between genes with two different settings for the balance between up- and down-regulation, for a total of six different simulation scenarios:
1. Variances can be 'similar' (effectively the same across genes), 'balanced' (moderate differences in variance between genes), or 'variable' (large differences in variance between genes).
2. In the 'symmetric' case, roughly half of the DE genes are up-regulated and half are down-regulated; in the 'asymmetric' case, only about 20% of the DE genes are down-regulated, and the rest are up-regulated.
We have included the asymmetric scenario, because this is where ODP is expected to perform better than standard methods in a theoretical setting . All expression values are assumed to follow a normal distribution; see Methods for further details of the simulation procedure.
For each scenario, we generate 100 data sets, for a total of $10^6$ genes. For each procedure, the fdr values are computed by keeping track of the DE status of each gene, grouping the genes in intervals (1d) or grid cells (2d) based on their test statistic, and computing the percentage of false positives in each interval or cell.
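For the one-dimensional case, this bookkeeping amounts to a few lines of code. The sketch below assumes equal-width intervals on a fixed range and a hypothetical function name; the two-dimensional case works the same way with grid cells instead of intervals.

```python
def true_fdr_by_interval(statistics, is_de, lo, hi, n_bins):
    """Proportion of nonDE genes in each interval of the test statistic.

    statistics[i] is the test statistic of simulated gene i and is_de[i] its
    known status. Returns None for intervals that contain no genes.
    """
    total = [0] * n_bins
    false_pos = [0] * n_bins
    for z, de in zip(statistics, is_de):
        if z < lo or z > hi:
            continue  # genes outside the range are ignored in this sketch
        b = min(int((z - lo) / (hi - lo) * n_bins), n_bins - 1)
        total[b] += 1
        if not de:
            false_pos[b] += 1
    return [false_pos[b] / total[b] if total[b] else None for b in range(n_bins)]
```

Because the DE status is known, no permutation, smoothing, or estimate of the nonDE proportion enters this calculation.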
There is little or no difference in relative performance between the procedures under the symmetric and asymmetric scenarios in Figure 2. It is also clear that the differences in performance are most pronounced when the variances are similar, less so when the variances are balanced, and minor when the variances are highly variable. The ranking of the different procedures is consistent through all six scenarios: as expected, t1d has the worst performance; equally as expected, S1d does clearly better than t1d. Novel findings of this paper are that a) t2d does still better than S1d, and b) S2d improves over t2d, although only marginally.
We evaluate the performance of the different procedures on two real data sets:
The BRCA data contain 3,170 genes and were collected from 15 patients with hereditary breast cancer, who had mutations either of the BRCA1 ($n = 7$) or the BRCA2 gene ($n = 8$).
The Lymphoma data contain 7,399 genes and were collected from 240 patients with diffuse large B-cell lymphoma, comprising $n_1 = 102$ survivors and $n_2 = 138$ non-survivors.
Here, the local fdr estimates are computed based on the mixture model (2). The estimate of $f$ is computed by smoothing the histogram of the observed statistics, and $f_0$ is estimated similarly from the permuted statistics, which are obtained by permuting the group labels to generate the null distribution. Technically, we also need an estimate of the proportion of nonDE genes. For the purpose of comparing the different procedures, it does not matter which estimate we choose, but it is important that the same value is used for all procedures, see Methods.
For each procedure, we rank the genes by their estimated fdr, and compute their estimated global FDR among the top-ranked genes as the cumulative mean of their local fdrs. The global FDR is then plotted as a function of the percentage of genes declared DE. For comparison purposes, we also include the FDR as computed by the EDGE software.
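The ranking and cumulative averaging can be sketched as follows; the function name is hypothetical, and the input is assumed to be one estimated local fdr value per gene.

```python
def global_fdr_curve(local_fdr):
    """Global FDR among the top-ranked genes as the cumulative mean of local fdr.

    Returns a list of (fraction of genes declared DE, estimated global FDR)
    pairs, one for each possible cut-off in the fdr-ranked gene list.
    """
    ranked = sorted(local_fdr)  # smallest local fdr first
    curve = []
    running_sum = 0.0
    for k, value in enumerate(ranked, start=1):
        running_sum += value
        curve.append((k / len(ranked), running_sum / k))
    return curve
```

Plotting the second coordinate against the first gives curves of the kind used here to compare procedures: a lower curve means fewer expected false positives for the same number of genes declared DE.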
The main motivation for using the FDR has been that it offers a way of dealing with multiplicity that is less restrictive and more powerful than traditional p-value adjustments. The challenge is how to explicitly exploit the multiplicity by pooling information across genes in order to make the FDR even more powerful.
In the case of t1d, the test statistic is computed gene-by-gene and does not use information shared with other genes. Moderated t-statistics [10–12], which borrow strength across genes for estimating standard errors, are more powerful than simple t-statistics. The ODP appears to be the ultimate in combining information, where to some extent all genes contribute to the statistic for each other gene. The fdr2d approach on the other hand augments the grouping of genes based on individual test statistics by sub-grouping them based on their variability. In all cases we find that when there are few instances of genes with similar variability, the performance of the different methods tends to converge towards the simple t1d (Figures 2(e) and 2(f)).
From a practical point of view, the smoothing procedure underlying our implementation of fdr2d seems to work as well for the statistic $\log \hat{S}$ in S2d as for the t-statistics in t2d, and arguably even better: when comparing Figures 1(c) and 1(d) in this paper with Figures 4(a) and 4(b) in , we find in the former less of a tendency to underestimate the fdr for genes with small effect sizes, as discussed in the previous paper.
At first glance, the empirical ODP statistic seems to rely on the assumption that the expression values for all genes are normally distributed. From a practical point of view, however, the empirical ODP procedure works even if the normal assumption does not hold, because it relies on the permutation algorithm. In this sense, the normal densities in (1) only represent a scoring function that exponentially downweights contributions from genes with different mean structure and/or large variability. However, the performance of the empirical ODP will depend on how precisely the normal assumption holds for the data at hand. Some loss of the optimality property in the real data applications is probably due to non-normality. But even in the simulations, the empirical ODP is not better than t2d. This can only mean that the presence of a large number of nuisance parameters degrades the performance of ODP.
The estimation of the nuisance parameters required to apply the ODP in practice makes the procedure described in  no longer optimal. We have shown in this paper that the combination of a conventional t-statistic with the standard error of the mean as described in  can outperform the empirical ODP. Further improvements can be made by combining the ODP test statistic with standard error information, but the gains are comparatively small.
The ODP procedure exploits similarities in the distribution for a collection of genes, for example similarity in variance. When variances between genes are dissimilar, there is little gain by the ODP compared to the standard t-statistic. One advantage of the ODP over the modified t-statistics is that the adaptation is done automatically, without calculating a model-based or heuristic fudge factor for the denominator.
The computational demand of calculating the ODP statistic is a serious practical disadvantage: each density term f(x) or g(x) requires computation across the whole dataset, so a single ODP statistic already involves substantial computations. Doing this for the whole collection of genes and for repeated permutations of the group labels is an order of magnitude more laborious than the computation required for the standard statistics.
Our model for simulating microarray data is based on the model described in . We assume that the expression values for all $m$ genes are normally distributed (possibly after suitable transformation), and that their variances vary randomly between genes, following the scaled inverse of a $\chi^2$-distribution. Values are simulated for two groups of $n_1$ and $n_2$ arrays. Each gene $i = 1, \ldots, m$ is selected randomly to be nonDE with probability $\pi_0$, and to be DE otherwise. For genes that are picked as nonDE, the mean value in both groups is set to zero; for genes that are selected as DE, the mean in the first group is set to zero, and the mean in the second group is drawn randomly from a normal distribution whose variance is proportional to the gene-specific variance $\sigma_i^2$.
In detail we proceed as follows for our simulations:
1. Initialize the design with $m = 10{,}000$ genes, proportion of nonDE genes $\pi_0 = 0.8$, and two groups with $n_1 = n_2 = 7$.
2. For each gene $i = 1, \ldots, m$, draw a gene-specific variance from
$$ \sigma_i^2 \sim \frac{d_0 s_0^2}{\chi^2_{d_0}}, $$
where $\chi^2_{d_0}$ is a $\chi^2$-distribution with $d_0$ degrees of freedom, and $d_0$ and $s_0$ are tuning parameters as described below.
3. For each gene $i = 1, \ldots, m$, determine randomly whether it is nonDE (with probability $\pi_0$) or DE (with probability $1 - \pi_0$).
(a) In case of nonDE, set $\mu_1 = \mu_2 = 0$.
(b) In case of DE, set $\mu_1 = 0$ and draw $\mu_2$ randomly from
$$ \mu_2 \sim N(0, v_0 \sigma_i^2), $$
where $v_0$ is another tuning parameter.
i. In case of an asymmetric scenario, set the sign of $\mu_2$ to positive with probability 0.8, and to negative otherwise.
4. Simulate $n_1$ and $n_2$ values in the first and second group, respectively, following normal distributions
$$ X_{i1} \sim N(\mu_1, \sigma_i^2), $$
$$ X_{i2} \sim N(\mu_2, \sigma_i^2). $$
Following , we set the constants to $s_0^2 = 4$ and $v_0 = 2$ in our simulations. The amount of variability of the gene-wise variances is controlled via the parameter $d_0$: the three scenarios described in the Results section correspond to $d_0 = 1000$ (similar variances across genes), $d_0 = 12$ (balanced, with moderate differences between genes), and $d_0 = 2$ (variable, with large variability in variances).
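The simulation steps can be sketched in plain Python as follows. Several details are assumptions made for this illustration: the gene-specific variance is taken as d0 * s0^2 divided by a chi-square variate with d0 degrees of freedom, the constant set to 4 is read as s0 squared, the mean of a DE gene in the second group is drawn with variance v0 times the gene-specific variance, and the function and argument names are hypothetical.

```python
import math
import random

def simulate_dataset(m=10000, pi0=0.8, n1=7, n2=7, d0=12, s0_sq=4.0, v0=2.0,
                     asymmetric=False, seed=0):
    """Simulate one two-group data set; returns (data, is_de).

    data[i] is a pair (x1, x2) of expression values for gene i in the two
    groups, and is_de[i] records whether gene i was simulated as DE.
    """
    rng = random.Random(seed)
    data, is_de = [], []
    for _ in range(m):
        # A chi-square variate with d0 degrees of freedom is Gamma(d0 / 2, scale 2).
        chi2 = rng.gammavariate(d0 / 2.0, 2.0)
        var = d0 * s0_sq / chi2          # assumed scaled inverse chi-square form
        sd = math.sqrt(var)
        de = rng.random() >= pi0         # nonDE with probability pi0
        mu2 = 0.0
        if de:
            mu2 = rng.gauss(0.0, math.sqrt(v0 * var))
            if asymmetric:
                # Positive sign with probability 0.8, negative otherwise.
                mu2 = abs(mu2) if rng.random() < 0.8 else -abs(mu2)
        x1 = [rng.gauss(0.0, sd) for _ in range(n1)]
        x2 = [rng.gauss(mu2, sd) for _ in range(n2)]
        data.append((x1, x2))
        is_de.append(de)
    return data, is_de
```

For example, `simulate_dataset(d0=1000)` would correspond to the 'similar' variance scenario and `simulate_dataset(d0=2, asymmetric=True)` to the 'variable', asymmetric one.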
For each scenario, we then generate 100 data sets, for a total of $10^6$ genes. For each procedure, the true local fdr of the genes is estimated from the known DE status of each simulated gene, simply as the proportion of false positives in each histogram interval or grid cell. This means specifically that no permutation, smoothing, or estimation of $\pi_0$ is required.
The permutation and smoothing approach used for estimating the fdr values for real data has been described in detail in  and . The estimates for the proportion of nonDE genes are based on a mixture model for the observed distribution of t-statistics, consisting of one central and several non-central t-distributions; we have shown previously that the weight of the central t-distribution can be a less biased estimate of $\pi_0$ in the presence of genes with small effects than the usual estimate based on the distribution of p-values. The same estimates have been used previously in .
The BRCA data set was collected from patients with hereditary breast cancer who had mutations either of the BRCA1 ($n = 7$) or the BRCA2 gene ($n = 8$). Expression was originally reported for 3,226 genes, but following , we removed 56 extremely variable genes and analysed only the remaining 3,170 genes. For all four procedures, we used $\hat{\pi}_0 = 0.61$, and we evaluated 500 permutations of the group labels.
The Lymphoma data set was collected from 240 patients with diffuse large B-cell lymphoma, $n_1 = 102$ of whom survived the study period, and $n_2 = 138$ of whom did not. We used all 7,399 genes reported in the original article. For all four procedures, we used $\hat{\pi}_0 = 0.59$, and we evaluated 500 permutations of the group labels.
All expression values were logged prior to analysis.
Methods t1d and t2d are implemented in the R package OCplus, which is freely available at the Bioconductor website . R code implementing S1d and S2d is available from the authors on request. EDGE, the official implementation of EODP described in , is available at .
This work was partially supported by a research grant from the Swedish Cancer Foundation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.