The current article focuses on the two-component model, but other points of view exist. In reference [25] the two-component model is recast as a conceptually attractive one-group model that allows a continuum of effects.
Denote the pdf of the p-values by f, the proportion of unchanged genes by p0 and the pdf of the p-values for the changed genes by f1. Then the pdf of the p-values may be written as

f(x) = p0 × 1 + (1 - p0)f1(x) (8)
using the fact that p-values for the unchanged genes follow a uniform distribution over the interval [0,1]. This model is unidentifiable without further assumptions, e.g. that p-values in a vicinity of 1 only represent unchanged genes. From the non-negativity of pdf's, clearly
f(x) ≥ p0 (9)
This leads to the estimate based on the minimum of the estimated pdf [1],

p̂0 = min_(0≤x≤1) f̂(x), (10)

see also Figure 6. In most cases the minimum in (10) will occur for some x close to or at 1, so in these cases (10) will agree well with an estimate of f(1). If one has reason to believe that p0 is close to 1, it may pay off to replace the minimum in (10) by the 25th percentile, or simply to set the estimate equal to 1, in order to make overestimation more likely.
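As an illustration, here is a minimal sketch of (10) in R, assuming a vector of p-values p; the simulated data and the kernel density estimator are illustrative choices, not part of the original method.

# Simulated p-values: 20% changed genes (beta-like), 80% unchanged (uniform)
set.seed(1)
p <- c(rbeta(200, 0.5, 4), runif(800))
# Kernel estimate of the pdf f(x) on [0,1]
d <- density(p, from = 0, to = 1)
p0.min <- min(d$y)          # the estimate (10)
p0.at1 <- d$y[length(d$y)]  # the alternative f-hat(1), usually close to (10)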
LSL
Let R(α) = #{i : p_i ≤ α}, the number of rejections given the cut-off α. In [36] the approximation

N - R(α) ≈ E[N - R(α)] ≈ N0(1 - α)

for small α, with N0 = N p0 the number of true null hypotheses, appears. Consequently, {N - R(p(i))}/(1 - p(i)) = (N - i)/(1 - p(i)) will approximate N0, which led the pioneering authors to consider plotting 1 - p(i) against N - i, thus giving them an estimate of N0. In [26] the Lowest SLope estimator (LSL) of N0, based on the slopes S_i = (1 - p(i))/(N - i + 1), is presented. Starting from i = 1, the procedure stops at the first index i0 for which S_(i0) < S_(i0-1), and outputs an estimate of N0 based on the reciprocal of the stopping slope S_(i0) (see the code sketch at the end of this subsection).
In [12] the two estimates above are presented, derived and compared, together with a method called the Mean of Differences method (MD). MD and LSL are motivated by assuming independence and approximating the gaps d(i) = p(i) - p(i-1) (define p(0) ≡ 0 and p(N+1) ≡ 1) with a Beta(1, N0) distribution, which has expectation 1/(N0 + 1). This expectation may be estimated by the inverse of a mean of the gaps, and MD proceeds downward, choosing i0 equal to the first j satisfying a stopping criterion spelled out in [12]. Of these three methods LSL and MD give very similar results, and both outperform their predecessor [12].
LSL is available as the function fdr.estimate.eta0 in package GeneTS [18] with the option method = "adaptive".
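A hedged sketch of the LSL slopes and stopping rule, assuming the vector p from above; the exact closed form of the final estimate is an assumption here (a commonly stated form of the adaptive procedure of [26]) and should be checked against fdr.estimate.eta0.

p.ord <- sort(p)                          # ordered p-values p(1) <= ... <= p(N)
N <- length(p.ord)
S <- (1 - p.ord) / (N - seq_len(N) + 1)   # slopes S_i
i0 <- which(diff(S) < 0)[1] + 1           # first index with S_i0 < S_(i0-1)
N0.hat <- min(N, ceiling(1 / S[i0] + 1))  # assumed form of the LSL estimate
p0.lsl <- N0.hat / N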
The smoother
A method here referred to as the smoother appeared in [9]. This method, like all those presented, is based on a comparison of the empirical p-value distribution with the uniform distribution. There will likely be fewer p-values close to 1 in the empirical distribution than under the null distribution, which is uniform. The ratio of the observed proportion of p-values greater than some threshold η to the proportion expected under the uniform distribution, 1 - η, therefore measures the thinning of the observed p-values relative to the null distribution. Thus, with F_e denoting the empirical distribution function, the ratio {1 - F_e(η)}/{1 - η} will often be a good estimate of p0 for an astutely chosen threshold η. A spline is fitted to the function p0(η) = {1 - F_e(η)}/{1 - η}, and the resulting function is evaluated at η = 1, yielding the estimate

p̂0 = p̃0(1) (11)
(the tilde '~' above p0 denoting the spline-smoothed version of p0(η)), which goes into the calculation of the FDR in (2), see Figure 15. Note the relationship with LSL: p0(p(i)) = (N - i)/{(N - i + 1) S_i N}. It can be shown that for fixed η this method offers a conservative point estimate of the FDR, i.e. one that in expectation does not fall below the true FDR.
The q-value is estimated by combining (2), (7) and (11), and an implementation is provided as the function qvalue in package qvalue available on CRAN [18].
In [30] the authors go to great lengths to prove that for fixed η, as above, the conservativeness remains under various forms of dependency, such as clumpy dependence.
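A minimal sketch of the smoother, assuming the vector p from above; the grid and the degrees of freedom of the spline are illustrative choices (the qvalue package makes its own).

eta <- seq(0.05, 0.95, by = 0.05)
p0.eta <- sapply(eta, function(e) mean(p > e) / (1 - e))  # {1 - Fe(eta)}/(1 - eta)
fit <- smooth.spline(eta, p0.eta, df = 3)                 # spline through p0(eta)
p0.smooth <- min(1, predict(fit, x = 1)$y)                # evaluate at eta = 1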
The Bootstrap LSE
Another approach, pioneered by Storey in [3], is to use a bootstrap least squares estimate (LSE), which chooses the value of η in p0(η) that minimises the variation of the estimate over random samples of the original p-values. The standard bootstrap reference [37] provides more theoretical background. Generate B new samples of p-values p*b (b = 1, ..., B) by sampling with replacement from the observed ones, calculate a measure of the mean squared error (MSE) of the corresponding estimates p*b0(η) over a lattice of values of η, and choose the value minimising the MSE. More formally, the optimal η is obtained through

η̂ = argmin_η (1/B) Σ_b {p*b0(η) - min_η' p̂0(η')}²,

where the minimum over η' of the point estimates serves as a stand-in for the unknown p0.
The version of the bootstrap used in this article uses more samples B than the version available in qvalue (B = 500 instead of B = 100), and seems to perform better (data not shown).
Implementations are available in the function qvalue [18] and in p0.mom in package SAGx [18, 38].
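A hedged sketch of the bootstrap LSE, assuming p, eta and p0.eta from the smoother sketch above; the plug-in min over eta for the unknown p0 is an assumption in line with [3].

B <- 500
p0.star <- replicate(B, {
  pb <- sample(p, replace = TRUE)                     # bootstrap sample p*b
  sapply(eta, function(e) mean(pb > e) / (1 - e))     # p*b0(eta) on the grid
})                                                    # matrix: length(eta) x B
mse <- rowMeans((p0.star - min(p0.eta))^2)            # MSE against the plug-in
p0.boot <- p0.eta[which.min(mse)]                     # estimate at the optimal eta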
SPLOSH
In [11] a spline function estimates the log-transformed pdf log f(x) using a complex spline-based algorithm called the spacings LOESS histogram (SPLOSH). To obtain a stable estimate of the FDR near zero, l'Hospital's rule from mathematical analysis is used to approximate the ratio in (5), yielding in the limit as x → 0

p0 x/F(x) → p0/f(0),

where the numerator has been estimated as in (4). An R package with the same name is available [39].
The FDR estimate (6) is used with F obtained from the non-parametric estimate of the pdf.
Note that we can now calculate the posterior probability that a gene is a DEG, given its p-value x, as p1(x) = 1 - p0/f(x), compare (3).
The method is available in R function splosh [40].
BUM
In [10] the authors assume a beta-uniform mixture (BUM) distribution, i.e. in (1) they replace f1 by a beta distribution, giving

f(x) = λ + (1 - λ)ax^(a-1), (12)

where, in addition to λ, which corresponds to p0, the shape parameter a has to be estimated. Thanks to the simple form of the distribution it is possible to estimate the parameters through the maximum likelihood principle, i.e. by choosing the values that maximise the likelihood of the p-values actually observed. However, due to problems in identifying p0 with λ, the authors instead use the upper bound

p̂0 = λ̂ + (1 - λ̂)â,

which corresponds to f(1).
The FDR estimate (5) is used with F the cdf corresponding to (12).
The authors provide R code for the application of their method [39].
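A hedged maximum-likelihood sketch for the BUM model (12), assuming the vector p from above; the logistic reparametrisation keeping λ and a in (0,1) is an implementation choice here, not taken from [10].

# Negative log-likelihood of (12), with lambda = plogis(theta1), a = plogis(theta2)
nll <- function(theta) {
  lambda <- plogis(theta[1]); a <- plogis(theta[2])
  -sum(log(lambda + (1 - lambda) * a * p^(a - 1)))
}
fit <- optim(c(0, 0), nll)
lambda.hat <- plogis(fit$par[1]); a.hat <- plogis(fit$par[2])
p0.bum <- lambda.hat + (1 - lambda.hat) * a.hat  # the upper bound f(1)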
A more intricate Hierarchical Bayes model based on the beta-uniform concept allowing for different parameter values in different intervals appears in [41]. The R function localFDR provides an implementation of the method [42].
Poisson regression
In [28] it is suggested to estimate any empirical distribution by dividing the real axis into intervals and regarding the number of hits in each interval as the result of an inhomogeneous Poisson process, much like counting the number of cars arriving at a crossing during different time intervals. This method was used in [27] to model the distribution of a transformed test statistic, and it also appears in the function locfdr, which estimates a local FDR as a function of a test statistic. In our case, of course, the support of the distribution is the unit interval [0,1]. The expected number of hits in each subinterval of [0,1] can then be modelled as a polynomial in the midpoints of the subintervals by a technique called Poisson regression (PRE). The approach taken here is to choose a polynomial of low degree so that the plateau representing the uniform distribution is well captured. In doing so, the ability to capture the distribution at low p-values is sacrificed.
A more mathematical description now follows. The PRE method assumes that the counts S_k follow a Poisson distribution whose intensity is determined by the midpoint t_k of the interval I_k, see [28]. To be specific, in the current application it is assumed that the expected frequency of observations in an interval is given by

E[S_k] = μo_k exp{h(t_k)},

where h is a polynomial of low degree and μo_k are the smoothed observed frequencies in each interval I_k. In statistical jargon this is a Poisson regression model with μo_k as offset. The model assumes independence between the counts in different intervals; although this does not hold true, the model succeeds in capturing the essential features of the distributions. Standard functions in e.g. R can fit this model. Normalising the fitted curve by the total number of p-values gives an estimate of the pdf. Finally, the pdf f(x) is smoothed with a spline to obtain a more stable result, and the estimate (10) is used. An implementation of PRE is provided through the R function p0.mom in package SAGx [18, 38].
SEP
The Successive Elimination Procedure (SEP) excludes and includes p-values p_i successively, such that the distribution of the included ones behaves increasingly like a uniform [13]. Finally, an index set J_final will map to a set of p-values that represent the true null hypotheses. This yields the point estimate

p̂0 = N_(J_final)/N,

with N_J = #J, the cardinality of the set J. The identification of J_final proceeds by an intricate optimisation algorithm whose objective function consists of two terms: a Kolmogorov-Smirnov score

sup_x |F_J(x) - x|

for the empirical cdf F_J (based on J), measuring the distance to a uniform, and a penalty term guarding against overfitting, weighted by some tuning parameter λ (its exact form is given in [13]). A sketch of the Kolmogorov-Smirnov part follows below.
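The Kolmogorov-Smirnov score can be sketched as follows for a candidate index set J (a hypothetical subset of seq_along(p)); the penalty term and the search over index sets follow [13] and are not reproduced here.

# Sup-distance between the empirical cdf of p[J] and the uniform cdf
ks.score <- function(p, J) as.numeric(ks.test(p[J], "punif")$statistic)
ks.score(p, J = seq_along(p))  # example: score for the full index set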
A local FDR is obtained from smoothed histogram estimates based on equidistant bins, essentially the ratio

p̂0 h0(x)/h(x),

where the function h0 refers to the J_final set and h to the total set of p-values.
The function twilight in package twilight provides an implementation of SEP [19].
Moment generating function approach
The next approach is based on the moment generating function (mgf), a transform of a random distribution yielding a function M(s) characteristic of the distribution, cf. the Fourier or Laplace transforms, e.g. [43]. Knowing the transform means knowing the distribution. The mgf is defined as the expectation (the true mean) of the antilog transform of s times a random variable X, i.e. the expectation of e^(sX), or in mathematical notation

M(s) = ∫ e^(sx) f(x) dx.
To calculate the mgf of the p-values, we use the fact that the pdf is a mixture of pdf's (8). This yields a weighted sum of two transformed distributions,

M(s) = p0 ∫ e^(sx) dx + (1 - p0) ∫ e^(sx) f1(x) dx,

where we have used the fact that the mgf of a uniform distribution over [0,1] equals g(s) = (e^s - 1)/s. Denoting the second transform by M1(s) we finally have

M(s) = p0 g(s) + (1 - p0)M1(s). (13)
Now, the idea is to estimate these mgf's and to solve for p0. In the above equation M(s) can be estimated from an observed vector of p-values and g(s) can be calculated exactly, while p0 and M1(s) cannot be estimated independently. The estimable transform is, given the observed p-values p = p1, ..., pn, estimated by

M̂(s) = (1/n) Σ_i e^(s p_i).

Then, one can solve (13) for p0:

p0 = {M(s) - M1(s)}/{g(s) - M1(s)}. (14)
Let us do so for s_n > s_(n-1), equate the two ratios defined by the right hand side of (14), and solve for M1(s_n). Writing r for the ratio {M(s_(n-1)) - M1(s_(n-1))}/{g(s_(n-1)) - M1(s_(n-1))} evaluated at s_(n-1), this gives the recursion

M1(s_n) = {M(s_n) - r g(s_n)}/(1 - r).
If we can find a suitable start for this recursion we should be in a position to approximate the increasing function M1(s) for s = s1 < s2 < ... < s_m in (0, 1]. Now, note that M(s) ≥ 1 for any mgf and s ≥ 0, with near equality for small values of s. It therefore makes sense to start the recursion with some M1(s1) in I = [1, M(s1)]. In general it will hold true that 1 ≤ M1(s) < M(s) < g(s), since f1 puts weight on the lower range of the p-values at the expense of the higher range, the uniform puts equal weight everywhere, and f, being a mixture, lies somewhere in between. We can calculate g, M and M1 for an increasing series of values in (0, 1], e.g. for s = (0.01, 0.0101, 0.0102, ..., 1). The output from one data set appears in Figure 16. Since all the ratios (14) should be equal, a good choice of M1(s1) will be one that minimises their variation. Standard one-dimensional optimisation will find the value in I that minimises the coefficient of variation (CV, the standard deviation divided by the mean)

CV = sd{p0(s1), ..., p0(s_m)}/mean{p0(s1), ..., p0(s_m)},

where s = (s1, ..., s_m) and p0(s_i) denotes the ratio (14) evaluated at s_i. In contrast to the variance, the CV puts both small and large values of the ratios on an equal footing and enables comparison.
Finally, these essentially equal ratios provide an estimate of p0.
A heuristic convexity argument suggests that mgf over-estimates p0, see the Additional file. Furthermore, the bias seems to decrease as p0 grows.
An implementation of mgf appears as function p0.mom in package SAGx [18, 38].
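A hedged sketch of the mgf approach, assuming the vector p from above; the grid, the recursion and the CV objective follow the text, while numerical details (such as guards against division by values near zero) are left out.

s <- seq(0.01, 1, length.out = 100)             # grid in (0, 1]
g <- (exp(s) - 1) / s                           # mgf of the uniform
M <- sapply(s, function(u) mean(exp(u * p)))    # empirical mgf M-hat
ratios <- function(m1.start) {                  # run the recursion from M1(s1)
  M1 <- numeric(length(s)); M1[1] <- m1.start
  for (k in 2:length(s)) {
    r <- (M[k - 1] - M1[k - 1]) / (g[k - 1] - M1[k - 1])  # ratio (14) at s[k-1]
    M1[k] <- (M[k] - r * g[k]) / (1 - r)                  # the recursion
  }
  (M - M1) / (g - M1)                           # all ratios (14)
}
cv <- function(m1.start) { x <- ratios(m1.start); sd(x) / mean(x) }
opt <- optimize(cv, interval = c(1, M[1]))      # best start in I = [1, M(s1)]
p0.mgf <- mean(ratios(opt$minimum))             # the essentially equal ratios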
Local FDR and FDR
The concept of a local false discovery rate originates from [1]. Let the (true) local FDR at t be defined as the probability that gene i is unchanged conditional on its p-value equalling t, or in formulas: LFDR(t) = Pr(gene i unchanged | p_i = t) = p0/f(t). The Averaging Theorem of [4] states that integrating the local FDR over the rejection region R, such as R = [0, 0.01], yields the FDR: FDR(R) = E[LFDR(y) | y ∈ R]. In [29] it is noted that the estimated q-value equals the mean of a local FDR,

q̂(p(i)) = (1/i) Σ_(j≤i) LFDR(j),

where the local FDR at the ith ordered p-value p(i) equals

LFDR(i) = N p̂0 {p(i) - p(i-1)}, (15)

where N denotes the total number of genes. This rephrases the theorem in terms of estimators. The local FDR is meant as an approximation of the probability that an individual gene i is a DEG. As remarked in [29], the q-value does not estimate the probability that a gene is a false positive. Indeed, the theorem shows that it is the mean of that probability over all genes at least as extreme as i. Thus the q-value will tend to give a lower value than LFDR(i).
Under a wide range of models where f(x) is non-increasing, e.g. the BUM model, the expected local LFDR(i) will be non-decreasing in i, and hence the differences above should tend to increase, see the Additional file. Hence there is a need for enforcing monotonicity, as in (7). One tool for enforcing monotonicity is the Pooling of Adjacent Violators Algorithm (PAVA) [44]. This algorithm has an intuitive appeal, is less ad hoc than the local spline approach presented in [29], and gives the non-parametric maximum likelihood estimate under the assumption of monotonicity. As an example of how it works, consider the series (1, 2, 4, 3, 5), which PAVA turns into the non-decreasing series (1, 2, 3.5, 3.5, 5) by pooling the violators of monotonicity, (4, 3), and replacing them by their mean. Though not equivalent to the q-value from (2) and (7), the results from applying PAVA to the terms in (15) agreed rather well with the values obtained from the function qvalue. In the Results section this approach, combined with the PRE estimate of p0, is referred to as pava FDR. We could have used mgf for calculating the FDR, but it was excluded due to the better overall performance of PRE.
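PAVA is available in base R as isoreg (isotonic least squares), which reproduces the worked example above; the subsequent lines are a hedged sketch of the pava FDR idea, assuming p and a p0 estimate p0.hat (e.g. p0.pre from the PRE sketch), and are not the exact pava.fdr implementation.

isoreg(c(1, 2, 4, 3, 5))$yf                         # 1.0 2.0 3.5 3.5 5.0
p.ord <- sort(p)
lfdr <- length(p) * p0.hat * diff(c(0, p.ord))      # the terms (15)
lfdr.mono <- pmin(1, isoreg(lfdr)$yf)               # enforce monotonicity, cap at 1
q.like <- cumsum(lfdr.mono) / seq_along(lfdr.mono)  # running means, as in the text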
The bootstrap LSE gives results very similar to the smoother and was therefore excluded from the comparison of FDR estimates.
The R function pava.fdr in package SAGx provides an implementation of pava FDR, and returns a list of estimates of FDR, LFDR and p0 [18, 38].
Reference [25] presents the theory behind the estimation of local false discovery rates provided by the R package locfdr [18, 25]. The method proceeds by transforming the test statistic t into Z = Φ^(-1)(m(t)), where Φ is the cdf corresponding to N(0,1) and m is a transform that, for the uninteresting/null class, renders the distribution of Z a N(0,1). Typically, m could equal the cdf of the t-distribution. Assuming the model (1), the method estimates the mixture density f^t using Poisson regression, and fits either a theoretical null sub-density (from now on suppressing the superscript t, and denoting the pdf corresponding to Φ by φ) f+0(z) = p0 φ(z) around z = 0, or an empirical null distribution. The procedure then estimates the local false discovery rate fdr(z) = f+0(z)/f(z).
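As a closing illustration, a minimal sketch of this z-transform, assuming t-statistics tt with df degrees of freedom (both names hypothetical) and the locfdr package.

z <- qnorm(pt(tt, df = df))  # m = t cdf, then Phi^(-1); null statistics -> N(0,1)
library(locfdr)
fdr.z <- locfdr(z)           # estimates fdr(z) = f+0(z)/f(z)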