Determination of the differentially expressed genes in microarray experiments using local FDR
 J Aubert^{1},
 A BarHen^{1},
 JJ Daudin^{1}Email author and
 S Robin^{1}
https://doi.org/10.1186/147121055125
© Aubert et al; licensee BioMed Central Ltd. 2004
Received: 27 May 2004
Accepted: 06 September 2004
Published: 06 September 2004
The Erratum to this article has been published in BMC Bioinformatics 2005 6:42
Abstract
Background
Thousands of genes in a genomewide data set are tested against some null hypothesis, for detecting differentially expressed genes in microarray experiments. The expected proportion of false positive genes in a set of genes, called the False Discovery Rate (FDR), has been proposed to measure the statistical significance of this set. Various procedures exist for controlling the FDR. However the threshold (generally 5%) is arbitrary and a specific measure associated with each gene would be worthwhile.
Results
Using process intensity estimation methods, we define and give estimates of the local FDR, which may be considered as the probability for a gene to be a false positive. After a global assessment rule controlling the false positive error, the local FDR is a valuable guideline for deciding wether a gene is differentially expressed. The interest of the method is illustrated on three well known data sets. A R routine for computing local FDR estimates from pvalues is available at http://www.inapg.fr/ens_rech/mathinfo/recherche/mathematique/outil.html.
Conclusions
The local FDR associated with each gene measures the probability that it is a false positive. It gives the opportunity to compute the FDR of any given group of clones (of the same gene) or genes pertaining to the same regulation network or the same chromosomic region.
Background
Microarrays are part of a new class of biotechnologies that allow the monitoring of the expression level of thousands of genes simultaneously. Among the applications of microarrays, an important task is the identification of differentially expressed genes, i.e genes whose expressions are associated with the status of the patient (treatment/control for example).
The biological question of the identification of differentially expressed genes can be restated as a one (for paired data) or twosample (for unpaired data) hypothesis testing procedure: is the gene differentially expressed between the two situations? However, when thousands of genes in a microarray data set are evaluated simultaneously by fold changes or significance tests approach, multiple testing problems immediately arise and lead to many false positive genes. In this 'onebyone gene' approach the probability of detecting false positives rises sharply.
The False Discovery Rate (FDR), is defined as the expected fraction of false rejections among those hypotheses rejected. In their seminal paper Benjamini & Hochberg [1] provided a distribution free procedure (BH) for choosing a threshold on pvalues that guarantees that the FDR is less than a target level α. The same paper demonstrated that the BH procedure is more powerful than the Bonferroni method that controls the familywise error rate.
The FDR gives an idea of the expected number of false positive hypotheses that a practitioner can expect if the experiment is done an infinite number of time. As usual with expectation, it gives very little information about the number of false discovery hypotheses in a given experiment.
Motivation
The value of 1, 5 or 10% for the FDR, which determines the threshold t, is arbitrary. Storey and Tibshirani [2] stressed the importance of assessing to each feature its own measure of significance. They proposed to use the qvalue,
where P_{ i }is the pvalue of the ordered gene i, R_{ i }is the total number of rejected genes whose pvalues are less than the threshold t = P_{ i }and is an estimate of the total number of non differentially expressed genes, m_{0}.
The qvalue is appealing because it gives a measure of significance that can be attached to each gene, but it must be stressed that it is not an estimate of the probability for the gene to be a false positive. The qvalue is generally lower than the latter because it is computed using all the genes that are more significant than gene i. Obviously a gene whose pvalue is near to the threshold t does not have the same probability to be differentially expressed than a gene whose pvalue is close to zero. Therefore the qvalue gives a too optimistic view of the probability for the gene to be a false positive.
Therefore it is interesting to obtain an estimate of the FDR attached to each gene, called local FDR, from an inferential point of view and without any assumption about the distribution of the pvalues under H_{1}.
Results
Let
H_{0}(i) = {gene i is not differentially expressed}.
Let the local FDR be the probability that a given gene is not differentially expressed. More specifically, FDR(i) is the probability that a gene, whose pvalue is P_{ i }, is not differentially expressed, taking into account the whole set of tests. A raw local FDR estimate is defined in a first step. In a second step the local FDR estimate is defined as a smoothed value based on the raw values.
Let P_{1} < … <P_{ m }denote the ordered pvalues for testing H_{0}(i). The raw local FDR estimate for gene i is:
where
where λ is a tuning parameter and W(λ) = #{i, P_{ i }> λ}, see Storey [3].
Assume that the pvalues for the nondifferentially expressed genes are independent. The raw local FDR estimate has the following properties:
The smoothed local FDR(i) is
where f_{ i }is a smoothing function of the (j, λ) for j = 1, m, computed at position P_{ i }.
As an interesting application of the local FDR, it is possible to compute the FDR associated with a class of genes or clones by summing up the local FDR estimate of each clone or gene: one may consider for example clones corresponding to the same gene, genes known involved in a given regulatory network, or gene from the same chromosomic region, and associate a FDR with the whole class. These genes do not need to have consecutive pvalues. The following sections demonstrate how the local FDR can be useful using the data of well known experiments.
Local FDR on Golub data set
Golub [4] were interested in identifying genes that are differentially expressed in patients with two types of leukemias (ALL, AML). Gene expression levels were measured using Affymetrix highdensity chips containing 6817 human genes. The learning set comprises 27 ALL cases and 11 AML cases.
Data are available in the R multtest package. We used the preprocessing proposed by the authors and the pvalues based on random permutations of the ALL/AML labels on Welch tstatistics for each gene, Dudoit [5], on the 3051 remaining genes. m_{0} is estimated with bootstrap method as suggested by Storey and Tibshirani and implemented in the library GeneTS of software R.
We can see that there is an abrupt change of the smoothed local FDR around gene number 500 which corresponds to a threshold t = 0.15 for the pvalue. This may be an indication about the threshold. The Figure 1(c) presents a zoom of the Figure 1(b) for the first 600 pvalues. We can see in Figure 1(c) that if we select the 438 (14%) top genes, we obtain a qvalue equal to 0.0078 while the 438^{ th }gene has a local FDR equal to 0.027. It must be noticed that there is a big difference between the two measures of FDR because the numerous regulated genes with very small pvalues have a great influence on the qvalue, which is not the case of the local FDR (see Figure 1(c)).
The pvalues have been obtained using random permutations. Therefore the pvalues are discrete with several genes possessing the same pvalue. Therefore the values of (i, λ) may be equal to 0 because the difference between two successive pvalues is 0. The discrete structure of the pvalues implies a departure from the theoretical continuous uniform distribution. This explains why the moving average smoothing creates discrete jumps which appear in Figure 1(c).
If the distribution of the statistics under H_{0} is correct, the pvalues are distributed as a uniform distribution over [0, 1]. The empirical distribution of the high observed pvalues (say above 0.5) is far from the uniform distribution. There are several nonexclusive possibilities to explain this: more than 50% of the genes are differentially expressed, the gene results for nondifferentially expressed are correlated or there is a technical problem in the random permutations of the Welch tstatistics.
Local FDR on Breast Cancer data set
pvalue, qvalue and local FDR estimates for three genes in Hedenfalk data.
gene  pvalue  rank  qvalue  raw local FDR  smoothed local FDR 

MSH2  0.00005  8  0.013  0.013  0.010 
PDCD5  0.00048  47  0.022  0.013  0.033 
CTGF  0.0036  159  0.049  0.176  0.098 
One can see that the smooth local FDR estimate is generally greater than the qvalue and gives a better idea of the probability that a gene is a false positive. For example, at the level of 5%, CTGF will be considered as differentially expressed on the basis of the qvalue while it will be considered as non differentially expressed using the local FDR.
Figure 2(a) presents the (i) for ordered genes and 2(b) presents the smooth curves obtained using lowess with a span of 0.2 and moving average methods. The two smoothing methods give similar results.
Local FDR on ApoAi data
Figure 3(a) presents the (i) for ordered clones and Figure 3(b) presents the smooth curves obtained using lowess with a span of 0.2 and moving average methods. The two smoothing methods give different results at the two ends of the [0, 1] interval. The moving average method which uses a special adaptative algorithm for the ends gives a better smoothing. This is particularly important for the clones with a small pvalue for which it is crucial to obtain good estimates of the probability of being false positives. The lowess smoothing does not work well for the 50 first clones. In this particular case the default smoothing parameter f = 0.2 is not well suited and should be lower. However if it is chosen too low, the smoothing will not fit well the rest of the curve.
Discussion
The curve of the smoothed local FDR is an efficient tool to summarize the information about the number and the statistical significance of differentially expressed genes, and may also be used to give an indication about the validity of the statistical assumptions. Moreover it is a valuable tool to choose the threshold for separating the differentially expressed genes from the nondifferentially expressed one: one can choose a value of t maximizing the second derivative. Alternatively one can use a cost function and choose the threshold that minimizes the mean cost for a given cost function: using cost of the experiment, cost of false positive gene validation and the profit of discovering a differentially expressed gene, it is direct to compute the optimal strategy for choosing the threshold.
Note that a decision rule based on the local FDR would lead to a different set of selected genes than the usual one obtained by controlling the FDR. Consider the set of tests for which the local FDR is below 0.05, say. This set is not identical to the set identified by the standard criterion that FDR < 0.05. The local FDR is higher than the qvalue. Therefore the first set is strictly included in the second one. The local FDR rule is therefore more conservative than the usual FDR one.
Conclusions
The pvalue gives the probability that a non differentially expressed gene would be as or more extreme than the gene under concern. The qvalue indicates the estimated proportion of genes as or more extreme than the gene under concern that are a false positive. The local FDR gives the estimated proportion of genes around the gene under concern which are false positive. The latter may be used as the probability that the gene under concern is a false positive, taking into account the multiplicity of the test. One of the major interest of the local FDR is that it gives the opportunity to compute the FDR of any given group of clones (of the same gene) or genes pertaining to the same regulatory network or the same chromosome.
Methods
Model
Basically, the various procedures proposed in the literature aim to test the null hypothesis
H_{0}(i) = {gene i is not differentially expressed}.
Let consider a particular experiment. We observed the differential expression of the genes and compute the associated ordered pvalues P_{ i }. In the following we will use the classical property: the pvalues corresponding to non differentially expressed genes are uniformly distributed over [0, 1]. Furthermore, we will assume, as often, that these pvalues are independent. However, the independence of the pvalues of differentially expressed genes is not required. Consider a multiple testing situation in which m tests are being performed. Let m_{0} be the number of non differentially expressed genes. Let I(t) be the set of the genes having a pvalue lower than t: I(t) = {i : P_{ i }≤ t} and R(t) = #I(t), its cardinal. Let
V(t) = #[I(t) ∩ (i ∈ H_{0})]
and
S(t) = #[I(t) ∩ (i ∈ H_{1})].
Classification of m genes using threshold
H_{0} accepted  H_{0} rejected  Total  

H_{0} true  U(t)  V(t)  m _{0} 
H_{0} false  T(t)  S(t)  m _{1} 
Total  W(t)  R(t)  m 
The Family Wise Error Rate (FWER) is defined to be
FWER = P [V(t) ≥ 1].
A classical way to control FWER is given by the Bonferroni inequality. This quantity corresponds to the most direct extension from a test hypothesis procedure but can be very restrictive in a multiple testing procedure.
The status of the gene associated with the P_{ i }is an unobserved value. It is the same framework as point process (see for example [8]). In fact we observe R(t) = V(t) + S(t) the sum of two counting processes. The first one V(t) is a counting process associated with non differentially expressed gene. Since the pvalues under H_{0} are uniformly distributed, V(t) has a binomial distribution with parameter m_{0} and t. The intensity of V(t) is constant and proportional to m_{0}. S(t) is the counting process associated with gene under H_{1} and very few can be said about its distribution. One may expect the intensity of S(t) to be decreasing with t. The false discovery rate is defined as:
It corresponds to the expected proportion of rejections that are incorrect.
The BH procedure works as follows. Let P_{1} < … <P_{ m }denote the ordered pvalues. Calculate k = max_{ i }{P_{ i }≤ αi/m}. The procedure rejects all null hypotheses for which P_{ i }≤ P_{ k }. If the tests are independent, this procedure ensures that
Let FDR(t) be the FDR when rejecting all null hypotheses with P_{ i }≤ t. Because the pvalues of nondifferentially expressed genes are uniformly distributed over [0, 1], a natural estimate of FDR(t) is
Therefore the problem is to estimate m_{0}. Storey [3], proposed to estimate m_{0} with
FDR is defined as the expectation of the ratio of two counting processes V(t) and R(t): FDR(t) = E[V(t)/max(R(t), 1)]. The expectation of V(t) is m_{0}t and R(t) is observed. Therefore, Storey [3] propose to use the following estimate:
The ratio of the expectations differs from the expectation ratio but Storey [3] proved that E( (t, λ)) ≥ FDR(t) using a convexity argument.
Definition and Estimation of the Local FDR
As stated before, V(t) and R(t) are counting (i.e. cumulative) processes. It would be very interesting to estimate the ratio of the local intensities of the two processes at point t. The intensity of process V(t) is equal to m_{0} and thus is known, provided that we know m_{0}. The intensity of process R(t) is unknown, but R(t) is observed. Therefore, using point process methods it is possible to estimate its intensity at each point t.
We first define the cumulative processes from t_{1} to t_{2}:
Let 0 ≤ t_{1} <t_{2}, I(t_{1}, t_{2}) = {i : t_{1} <P_{ i }≤ t_{2}},
R(t_{1}, t_{2}) = #I(t_{1}, t_{2}),
V(t_{1}, t_{2}) = #[I(t_{1}, t_{2}) ∩ (i ∈ H_{0})]
and
S(t_{1}, t_{2}) = #[I(t_{1}, t_{2}) ∩ (i ∈ H_{1})].
FDR (t_{1}, t_{2}) is defined as the expected ratio of V (t_{1}, t_{2}) and R(t_{1}, t_{2}):
It is a generalization of the usual FDR: if t_{1} = 0 and t_{2} = t then FDR(t_{1}, t_{2}) = FDR(t). So, the natural estimate of FDR(t_{1}, t_{2}) is:
The substitution of 0 by t_{1} does not change the proof, so using the same convexity argument as Storey [3], we obtain the following property:
E( (t_{1}, t_{2}, λ)) ≥ FDR (t_{1}, t_{2}).
The local FDR is the FDR(t_{1}, t_{2}) for small intervals [t_{1}, t_{2}]. If we want to estimate the local FDR around the pvalue of the gene i, the question can be restated as how to estimate the ratio of the intensities of two processes around a given point P_{ i }.
The intensity of process R(t) has to be estimated at each value of t. It is possible to consider small windows of size h, or alternatively, to consider windows of different sizes corresponding to a fixed count for R(t). We have chosen the latter solution, for windows of variable size seem more appealing in the particular context.
Let FDR(i) be the local FDR around P_{ i }. To estimate FDR(i) we need to define a neighborhood around P_{ i }. Let V_{ i }= V(P_{i1}, P_{ i }). Remarking that R(P_{i1}, P_{ i }) = 1, we have FDR(i) = E(V_{ i }). Furthermore
E(V_{ i }) = P(V_{ i }= 1)
since V_{ i }is a binary variable. Thus FDR(i) provides an unbiased estimation of P(V_{ i }= 1), the probability for gene i to be a false positive.
The raw local FDR estimate for gene i is:
Assume that H_{0}(i) and H_{0}(i  1) are true and E( ) = m_{0}. Therefore this estimate is unbiased with mean 1.
Using definition (1), it is direct to obtain:
which equals the qvalue of gene j. The qvalue may thus be viewed as the mean of the raw local FDR of the genes with pvalues lower than P_{ j }.
Under the hypothesis H_{0}, it is known that the differences between successive ordered values of independent realizations of the uniform([0,1]) distribution have a Beta distribution with parameters 1 and m_{0} (see Johnson [9] Chap. 26). Therefore the variance of the raw local FDR estimate for nondifferentially expressed genes when m_{0} is known is equal to /[(m_{0} + 1)^{2} (m_{0} + 2)] ≈ 1, for m_{0} large enough.
The variance of estimates (1) under H_{1} is generally much smaller than under H_{0} (see Figures 1(a), 2(a) and 3(a) for an illustration). However, one may see on these Figures that (i, λ) is a very variable estimator.
This fact is well known in point process literature, [8]. Moreover, the interval ]P_{i1}, P_{ i }[ is not symmetric. If we consider the neighborhood interval around P_{ i }defined by t_{1} = (P_{i1}+ P_{ i })/2, t_{2} = (P_{i+1}+ P_{ i })/2 then we obtain another estimate of the local FDR:
Note that (2) is a moving average of order 2 of (1). It is well known that estimates provided by moving average (or kernel estimators) are more stable, see [8].
This smoothing is generally not enough to obtain usable results and we can consider any kind of smoothing. We propose to estimate FDR(i) by
where f_{ i }is a smoothing function of the (j, λ) for j = 1, m, computed at position P_{ i }.
The smoothing method must be suited to the properties of the raw FDR:

its variance is low for low pvalues corresponding to highly differentially expressed genes

its variance is very high for pvalues corresponding to non differentially expressed genes
Therefore the window of smoothing should be short for low pvalues and large for pvalues corresponding high pvalues. The lowess smoothing method has a fixed number of neighbor points. Therefore its window size depends of the density of points around the pvalue under concern. The density of points is higher for low pvalues which in turn implies a shorter window size, which is a good property. However the adaptation of the window size is not sufficient in some cases such as in the ApoAI example. Moreover the smoothed FDR should be an increasing function of the pvalues, a property which is not satisfied by the lowess smoothing. Therefore we prefer to use an ad hoc moving average smoothing using the following algorithm for computing (i, λ): let 0 <t_{1} <t_{2} <t_{3} be three predefinite thresholds and m_{1} <m_{2} <m_{3} <m_{4} four predefinite integers.

if max_{j≤i} (j, λ) <t_{1} use a moving average of order min(2i  1, m_{1})

if t_{1} < max_{j≤i} (j, λ) <t_{2} use a moving average of order min(2i  1, m_{2})

if t_{2} < max_{j≤i} (j, λ) <t_{3} use a moving average of order min(2i  1, m_{3}).

if max_{j≤i} (j, λ) >t_{3} use a moving average of order min(2i  1, m_{4}).
We have obtained good empirical results on many data sets with t_{1} = 0.01, t_{2} = 0.05, t_{3} = 0.2, m_{1} = 3, m_{2} = 5, m_{3} = 15 and with the constraint that (i, λ) is not decreasing. This adaptative moving average method is quite empirical. This topic deserve some more work to build a well assessed smoothing method. This is one of our ongoing research project.
Notes
Declarations
Authors’ Affiliations
References
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB 1995, 57,1: 289–300.Google Scholar
 Storey JD, Tibshirani R: Statistical significance for genomewide studies. PNAS 2003, 100,16: 9440–9445. 10.1073/pnas.1530509100View ArticleGoogle Scholar
 Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. JRSSB 2004, 66: 187–205.View ArticleGoogle Scholar
 Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov M, Coller JP, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer:class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531View ArticlePubMedGoogle Scholar
 Dudoit S, Shaffer CBJ: Multiple hypothesis testing in microarray experiments. Statistical Science 2003, 18,1: 71–103. 10.1214/ss/1056397487View ArticleGoogle Scholar
 Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP: N Engl J Med 2001, 344: 539–548. 10.1056/NEJM200102223440801View ArticlePubMedGoogle Scholar
 Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. Statistica Sinica 2002, 12: 1.Google Scholar
 Cressie N: Statistics for Spatial Data New York: Wiley 1993.Google Scholar
 Johnson NL, Kotz S, Balakrishnan N: Continuous Univariate Distributions New York: Wiley 1995.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an openaccess article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.