If a gene is known to be differentially expressed at a certain level on average, then that level would predict future measurements of gene expression better than would making such predictions on the assumption that there is on average no differential expression. Likewise, if a gene is known to be equivalently expressed, then using an expression level of 0 or an expression ratio of 1 would predict future measurements better than making such predictions on the assumption that there is some differential expression. Thus, a method of selecting genes as differentially expressed may be judged by estimating its ability to predict future measurements of gene expression. This estimation may be carried out by a process of cross validation: the microarrays are divided between a training set used to determine which genes the method considers differentially expressed and a test set used to estimate how well such results would agree with future measurements.
The strategy of assessing gene selection algorithms by estimated prediction error may be specified more precisely in mathematical notation. Let $x_{i,j}$ denote the logarithm of the measured expression intensity or ratio of intensities of the $i$th of $m$ genes in the $j$th of $n$ biological replicates of the control or reference group; each value of $x_{i,j}$ may represent an average over technically replicated microarrays; $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$; $\mathbf{x} = (x_1, x_2, \ldots, x_m)^T$. Likewise, $x'_{i,j}$ denotes the logarithm of the measured expression intensity or ratio of intensities of the $i$th gene in the $j$th of $n'$ biological replicates of the treatment or perturbation group; $x'_i = (x'_{i,1}, x'_{i,2}, \ldots, x'_{i,n'})$; $\mathbf{x}' = (x'_1, x'_2, \ldots, x'_m)^T$. The observations $x_{i,j}$ and $x'_{i,j}$ are realizations of the random variables $X_i$ and $X'_i$, respectively. The $i$th gene is called equivalently expressed if $E(X'_i - X_i) = 0$ or differentially expressed if $E(X'_i - X_i) \neq 0$. In hypothesis testing parlance, the null hypothesis associated with the $i$th gene is $H_i: E(X'_i - X_i) = 0$.
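To make the notation concrete, the following R sketch simulates a small paired data set in this notation; it is our illustration rather than part of the original analysis, and all object names (m, n, delta, x, xprime, deltahat) are ours.

```r
## Simulated paired log-expression data: m genes, n biological replicates
## per group; the first 50 genes are truly differentially expressed.
set.seed(1)
m <- 1000                                        # number of genes
n <- 4                                           # replicates per group
delta  <- c(rnorm(50, mean = 2), rep(0, m - 50)) # E(X'_i - X_i) per gene
x      <- matrix(rnorm(m * n), nrow = m)                # control group
xprime <- matrix(rnorm(m * n, mean = delta), nrow = m)  # treatment group
deltahat <- rowMeans(xprime) - rowMeans(x)       # estimated log fold changes
```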
Gene selection algorithms
A gene selection algorithm $\alpha$ returns $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$, an estimate of the posterior probability that the $i$th gene is equivalently expressed; it follows that $1 - \pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$ is the algorithm's probability that the gene is differentially expressed across the perturbation and reference groups. Many algorithms [12–21] give $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$ directly as a local false discovery rate estimate [22, 23], whereas traditional false discovery rate estimates and other non-Bayesian algorithms in effect assign $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$ a value of either 0 or 1, depending on whether or not a gene is considered differentially expressed at a given threshold. For example, the practice of considering a gene differentially expressed if $\exp(|\hat{\delta}_i|)$, its estimated fold change, is at least $\varphi$ may be expressed as

$$\pi_{\text{fold change}}(H_i \mid \mathbf{x}', \mathbf{x}) = \begin{cases} 0 & \text{if } \exp(|\hat{\delta}_i|) \geq \varphi \\ 1 & \text{otherwise,} \end{cases}$$

with $\varphi > 0$, $\hat{\delta}_i = \bar{x}'_i - \bar{x}_i$, $\bar{x}_i = n^{-1} \sum_{j=1}^{n} x_{i,j}$, and $\bar{x}'_i = (n')^{-1} \sum_{j=1}^{n'} x'_{i,j}$. The discontinuity can be removed by introducing smooth functions on an ad hoc basis; here we use the fold-change-shrinkage function of equation (2), a smooth function of $\hat{\delta}_i$, as an example of such a smooth function. The trivial algorithms

$$\pi_{\text{all nulls true}}(H_i \mid \mathbf{x}', \mathbf{x}) \equiv 1 \quad \text{and} \quad \pi_{\text{all nulls false}}(H_i \mid \mathbf{x}', \mathbf{x}) \equiv 0,$$

which completely ignore the data, will serve as informative points of reference.
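As a minimal sketch (ours, with an assumed threshold phi = 2), the hard-threshold rule and the trivial reference algorithms can be written as follows; recall that $\pi_{\alpha}$ is the probability of equivalent expression:

```r
## Hard-threshold fold-change rule: pi = 0 (differentially expressed) when
## the estimated fold change exp(|deltahat|) reaches the threshold phi.
pi_fold_change <- function(deltahat, phi = 2) {
  ifelse(exp(abs(deltahat)) >= phi, 0, 1)
}
pi_all_nulls_true  <- function(deltahat) rep(1, length(deltahat))  # never DE
pi_all_nulls_false <- function(deltahat) rep(0, length(deltahat))  # always DE
```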
Some of the empirical Bayes algorithms implemented in two R packages [24] are considered here [25–27]. From calculations based on a moderated (regularized) t-statistic performed by the R package limma [25], one may readily obtain $p_i(\tilde{t})$, a one-sided p-value of the $i$th null hypothesis; $\mathbf{p}(\tilde{t}) = (p_1(\tilde{t}), p_2(\tilde{t}), \ldots, p_m(\tilde{t}))$. Given the moderated t-statistics and $\pi(H_0)$, the proportion of genes expected to be equivalently expressed, limma also computes $\log \omega_i(\pi(H_0))$, the estimated logarithm of the posterior odds that gene $i$ is differentially expressed rather than equivalently expressed, from which the local false discovery rate may be readily obtained as $(1 + \omega_i(\pi(H_0)))^{-1}$. Since, for use with the log-odds, the author of the algorithm does not recommend computing $\pi(H_0)$ with limma's convest function (Gordon Smyth, personal communication, 27 Oct. 2007), we instead iterated the log-odds function until convergence by adapting a method [28] originally proposed for another empirical Bayes algorithm [29]:
1. Let $\pi_1(H_0) = 90\%$ and initialize $k$ to 1.
2. Increment $k$ by 1.
3. Let $\pi_k(H_0) = m^{-1} \sum_{i=1}^{m} \left(1 + \omega_i(\pi_{k-1}(H_0))\right)^{-1}$.
4. Repeat Steps 2-3 until the absolute value of the proportion difference is sufficiently small, i.e., $|\pi_k(H_0) - \pi_{k-1}(H_0)| < 1/1000$, or until the sign of the proportion difference changes, i.e., $(\pi_k(H_0) - \pi_{k-1}(H_0))(\pi_{k-1}(H_0) - \pi_{k-2}(H_0)) < 0$. The number of iterations performed until such convergence is denoted by $K$.
5. Let $\pi(H_0) = \pi_K(H_0)$.
Based on that value of $\pi(H_0)$, the estimated probability of equivalent expression is derived by solving for it in the definition of the odds of differential expression (i.e., the ratio of the probability of differential expression to the probability of equivalent expression), yielding

$$\pi_{\text{limma}}(H_i \mid \mathbf{x}', \mathbf{x}) = \frac{1}{1 + \omega_i(\pi(H_0))}.$$
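A hedged R sketch of this iteration, using the simulated xprime and x from the earlier sketch: limma's eBayes takes the assumed proportion of differentially expressed genes (that is, $1 - \pi(H_0)$) through its proportion argument and returns lods, the natural-log posterior odds of differential expression. For brevity the sketch uses only the small-difference stopping rule of Step 4.

```r
## Fixed-point iteration for pi(H0) based on limma's log-odds (B-statistic).
library(limma)
d   <- xprime - x                            # paired differences, m x n
fit <- lmFit(d, design = matrix(1, ncol(d), 1))
pi0 <- 0.90                                  # Step 1: pi_1(H0) = 90%
repeat {                                     # Steps 2-4 (simplified)
  eb   <- eBayes(fit, proportion = 1 - pi0)  # log-odds given current pi(H0)
  pnew <- mean(1 / (1 + exp(eb$lods)))       # average local FDR over genes
  if (abs(pnew - pi0) < 1/1000) break
  pi0 <- pnew
}
pi_limma <- 1 / (1 + exp(eb$lods))           # P(equivalent expression), per gene
```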
Also using standard distributions of test statistics under the null hypothesis, the R package locfdr [26] maps $\mathbf{p}$, a vector of single-tailed p-values for all genes, to estimates of a local false discovery rate (FDR), $\pi_{\text{locfdr}}(H_i, \mathbf{p} \mid \mathbf{x}', \mathbf{x})$. The use of moderated t-statistics is incorporated by

$$\pi_{\text{moderated } t}(H_i \mid \mathbf{x}', \mathbf{x}) = \pi_{\text{locfdr}}(H_i, \mathbf{p}(\tilde{t}) \mid \mathbf{x}', \mathbf{x}).$$

More commonly, $\mathbf{p}(t)$, a vector of standard (1- or 2-sample) t-test p-values, which also assume the normality of $X'_i - X_i$, or $\mathbf{p}(w)$, a vector of (signed-rank or rank-sum) Wilcoxon test p-values, which do not assume normality, yield the local false discovery rate estimates

$$\pi_t(H_i \mid \mathbf{x}', \mathbf{x}) = \pi_{\text{locfdr}}(H_i, \mathbf{p}(t) \mid \mathbf{x}', \mathbf{x}), \qquad \pi_w(H_i \mid \mathbf{x}', \mathbf{x}) = \pi_{\text{locfdr}}(H_i, \mathbf{p}(w) \mid \mathbf{x}', \mathbf{x}).$$

Alternatively, the locfdr package can employ an empirical maximum-likelihood estimate of the null distribution [27] for computation of the local-false-discovery-rate estimate $\pi_{\text{emp.null}}(H_i, \mathbf{p} \mid \mathbf{x}', \mathbf{x})$, applied here to the same vectors of p-values.
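The following sketch (ours) illustrates both locfdr variants on one-sided one-sample t-test p-values computed from the paired differences d above; the nulltype argument selects the theoretical null (0) or the maximum-likelihood empirical null (1).

```r
## Local FDR estimates from probit-transformed one-sided p-values.
library(locfdr)
p1 <- apply(d, 1, function(di)
  pt(mean(di) / (sd(di) / sqrt(length(di))),
     df = length(di) - 1, lower.tail = FALSE))  # one-sided t-test p-values
z  <- qnorm(1 - p1)                  # z-scores, standard normal under the null
fdr_theoretical <- locfdr(z, nulltype = 0)$fdr  # theoretical N(0,1) null
fdr_empirical   <- locfdr(z, nulltype = 1)$fdr  # empirical ML null [27]
```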
Whereas the empirical Bayes algorithms provide approximations to a posterior probability of a hierarchical Bayesian class of models, we included comparisons to the posterior probability $\pi_{\text{Bayes factor}}(H_i \mid \mathbf{x}', \mathbf{x})$ under a non-hierarchical set of models. The data densities under the non-hierarchical models are based on the same assumptions as those of standard linear regression: unconstrained data means under the alternative hypothesis (differential expression) and, for each gene, normal IID noise with equal variance within each group in the unpaired case. Let $\bar{H}_i$ represent the hypothesis of differential expression (in contrast to $H_i$, which was defined as the hypothesis of equivalent expression). The posterior odds under these models are

$$\frac{P(H_i \mid \mathbf{x}', \mathbf{x})}{P(\bar{H}_i \mid \mathbf{x}', \mathbf{x})} = \frac{P(H_i)}{P(\bar{H}_i)} \times \frac{P(d\mathbf{x}', d\mathbf{x} \mid H_i)}{P(d\mathbf{x}', d\mathbf{x} \mid \bar{H}_i)}, \quad (8)$$

where $P(d\mathbf{x}', d\mathbf{x} \mid h)$ is the prior predictive density or integrated likelihood under hypothesis $h$. The left-hand side of equation (8) is the posterior odds of equivalent expression to differential expression; on the right-hand side, the first factor is the prior odds of equivalent expression to differential expression, and the second factor is known as the Bayes factor. Since we take $P(H_i) = P(\bar{H}_i) = 1/2$, our posterior odds is equal to the Bayes factor; thus, putting equal prior mass on each hypothesis does not share the conservatism of the above empirical Bayes algorithms. Additional file 1 gives the analytical derivation of the resulting posterior probability, which may be expressed in terms of some additional notation. Define

$$\mathrm{SSR}_{H_i} = \sum_{j=1}^{n} (x'_{i,j} - x_{i,j})^2, \qquad \mathrm{SSR}_{\bar{H}_i} = \sum_{j=1}^{n} \left(x'_{i,j} - x_{i,j} - \hat{\delta}_i\right)^2 \quad (9)$$

if $n = n'$ and $x'_{i,j}$ is paired with $x_{i,j}$, or

$$\mathrm{SSR}_{H_i} = \sum_{j=1}^{n} (x_{i,j} - \hat{\mu}_i)^2 + \sum_{j=1}^{n'} (x'_{i,j} - \hat{\mu}_i)^2, \qquad \mathrm{SSR}_{\bar{H}_i} = \sum_{j=1}^{n} (x_{i,j} - \bar{x}_i)^2 + \sum_{j=1}^{n'} (x'_{i,j} - \bar{x}'_i)^2 \quad (10)$$

if $X'_i$ and $X_i$ are independent, where $\hat{\mu}_i$ denotes the mean of all $n + n'$ observations of gene $i$. Then the posterior probability is given by equation (11), derived in Additional file 1.
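For concreteness, a small R sketch (ours) of the SSR quantities of equations (9) and (10) for a single gene:

```r
## Residual sums of squares under equivalent (null) and differential
## (alternative) expression; xp_i and x_i are one gene's replicate vectors.
ssr_paired <- function(xp_i, x_i) {
  di <- xp_i - x_i
  c(null = sum(di^2), alt = sum((di - mean(di))^2))      # equation (9)
}
ssr_indep <- function(xp_i, x_i) {
  mu <- mean(c(xp_i, x_i))                                # grand mean
  c(null = sum((x_i - mu)^2) + sum((xp_i - mu)^2),        # equation (10)
    alt  = sum((x_i - mean(x_i))^2) + sum((xp_i - mean(xp_i))^2))
}
```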
We also applied two "information criteria" used in model selection to estimate the posterior probability; the information criteria were applied to the same linear regression framework used in the above Bayes factor computation. In model selection terminology, each criterion selects either model $H_i$ or model $\bar{H}_i$ (that is, equivalent expression or differential expression, respectively) for the $i$th gene, but we instead averaged the estimates corresponding to the two models for each gene as follows. We first applied the Bayesian Information Criterion (BIC) [30]. Up to a factor of $-1/2$ and a constant term, the BIC approximates the logarithm of the prior predictive probability density given a statistical model and a sufficiently diffuse proper prior distribution under the given model, without requiring specification of such a prior. With a prior mass on each model considered, the BIC leads to an approximation of a posterior probability that is less conservative than that of the above Bayes factor.
The general formula for the BIC under a model with normal errors is

$$\mathrm{BIC}_h = N \ln\left(\frac{\mathrm{SSR}_h}{N}\right) + k \ln N,$$

where $N$ is the number of data points and $k$ is the number of parameters in the model. For paired data, $N = n$; under $H_i$ the only parameter is the data variance, giving $k = 1$, while under $\bar{H}_i$ the model includes both the data mean and the data variance, giving $k = 2$. Therefore the BIC for each hypothesis is

$$\mathrm{BIC}_{H_i} = n \ln\left(\frac{\mathrm{SSR}_{H_i}}{n}\right) + \ln n, \qquad \mathrm{BIC}_{\bar{H}_i} = n \ln\left(\frac{\mathrm{SSR}_{\bar{H}_i}}{n}\right) + 2 \ln n,$$

with $\mathrm{SSR}_h$ as defined in (9).
For independent data, $N = n + n'$; under $H_i$ the model includes a single mean log-expression level and the data variance, giving $k = 2$, while under $\bar{H}_i$ the model includes two distinct mean log-expression levels (one for the treatment group and one for the control group) and the data variance, giving $k = 3$. Therefore the BIC for each hypothesis is

$$\mathrm{BIC}_{H_i} = (n + n') \ln\left(\frac{\mathrm{SSR}_{H_i}}{n + n'}\right) + 2 \ln(n + n'), \qquad \mathrm{BIC}_{\bar{H}_i} = (n + n') \ln\left(\frac{\mathrm{SSR}_{\bar{H}_i}}{n + n'}\right) + 3 \ln(n + n'),$$

with $\mathrm{SSR}_h$ as defined in (10). Since we again use $P(H_i) = P(\bar{H}_i)$, the BIC approximation of the posterior odds ($\omega_{i,\mathrm{BIC}}$) is equal to its approximation of Bayes factors corresponding to a wide class of priors on the model parameters. Transformed from the logarithmic scale to the probability scale [31], the result is an equation of the same form as (11),

$$\pi_{\mathrm{BIC}}(H_i \mid \mathbf{x}', \mathbf{x}) = \frac{1}{1 + \omega_{i,\mathrm{BIC}}}, \qquad \omega_{i,\mathrm{BIC}} = \exp\left(\frac{\mathrm{BIC}_{H_i} - \mathrm{BIC}_{\bar{H}_i}}{2}\right). \quad (18)$$
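A sketch (ours) of the resulting BIC-based probability for the paired case, applied to each row of the simulated difference matrix d:

```r
## BIC approximation of P(equivalent expression), paired design (eq. (18)).
pi_bic_paired <- function(di) {
  n <- length(di)
  ssr_null <- sum(di^2)                     # k = 1: variance only
  ssr_alt  <- sum((di - mean(di))^2)        # k = 2: mean and variance
  bic_null <- n * log(ssr_null / n) + 1 * log(n)
  bic_alt  <- n * log(ssr_alt  / n) + 2 * log(n)
  omega <- exp((bic_null - bic_alt) / 2)    # approximate posterior odds of DE
  1 / (1 + omega)
}
pi_bic <- apply(d, 1, pi_bic_paired)
```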
The second information criterion we assessed was the Akaike Information Criterion corrected for small samples (AICc). While $-\mathrm{AICc}/2$ plus a constant term is in general only an approximately unbiased estimator of the expected Kullback-Leibler distance between the model/hypothesis and the unknown true data-generating distribution [32], it is exactly unbiased for linear regression models with normal errors [33], a class that includes the present non-hierarchical models. Under the name of Akaike weights, it and other AIC-like criteria have been used to generate predictions that take model uncertainty into account in a manner exactly analogous to Bayesian model averaging [32], giving rise to an equation of the same form as (18).
The general formula for the AICc under a model with normal errors is

$$\mathrm{AICc}_h = N \ln\left(\frac{\mathrm{SSR}_h}{N}\right) + 2k + \frac{2k(k+1)}{N - k - 1}.$$

The particular values of $N$ and $k$ for paired and independent data under $\bar{H}_i$ and $H_i$ are the same as those given above for the BIC. For paired data, the AICc values of the hypotheses or models are obtained by setting $N = n$ in the general formula, with $\mathrm{SSR}_h$ as defined in (9); for independent data, the AICc values are obtained by setting $N = n + n'$, with $\mathrm{SSR}_h$ as defined in (10). Transforming from the logarithmic scale yields the effective probability

$$\pi_{\mathrm{AICc}}(H_i \mid \mathbf{x}', \mathbf{x}) = \frac{1}{1 + \omega_{i,\mathrm{AICc}}},$$

where

$$\omega_{i,\mathrm{AICc}} = \exp\left(\frac{\mathrm{AICc}_{H_i} - \mathrm{AICc}_{\bar{H}_i}}{2}\right)$$

is the ratio of Akaike weights.
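A companion sketch (ours) for the paired-case AICc probability; note that the correction term requires $n > k + 1$, i.e., at least four paired replicates under the alternative model:

```r
## AICc-based effective probability of equivalent expression, paired design.
pi_aicc_paired <- function(di) {
  n <- length(di)
  aicc <- function(ssr, k)            # normal-errors AICc; requires n > k + 1
    n * log(ssr / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)
  omega <- exp((aicc(sum(di^2), 1) - aicc(sum((di - mean(di))^2), 2)) / 2)
  1 / (1 + omega)                     # omega = ratio of Akaike weights
}
pi_aicc <- apply(d, 1, pi_aicc_paired)
```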
These algorithms were chosen as representatives of various classes of possible approaches. Whereas the fold-change-dependent algorithms represent algorithms that take no account of the data variance, the information criterion algorithms and the non-hierarchical Bayesian algorithm represent algorithms that do take data variance into account but do not share information across genes. The local FDR algorithms based on classical p-values share information across genes for the purpose of determining false discovery rates, thus accounting for multiple comparisons, but do not share information for estimating data variance. Algorithms employing the moderated t-statistic share information across genes to account for multiple comparisons and also to estimate data variance.
Methods of assessing gene selection algorithms
Each of the next subsections describes a different method of quantifying the performance of gene selection algorithms. The first, cross validation, has the advantage that it estimates squared prediction error (defined below) without bias and without assuming any parametric model. The second, the computation of posterior predictive loss, takes advantage of the knowledge that gene expression is approximately lognormal and that relatively few genes will have substantial differential expression, the vast majority being equivalently expressed for all practical purposes. The two methods will differ in their results; if nearly all genes have only negligible differential expression, the latter is deemed more reliable except when biological replication is extensive, since the former achieves low bias at the cost of a high variance of performance estimates.
Cross validation
Algorithm $\alpha$'s best prediction of future values of $X'_i - X_i$ is the posterior expected degree of expression,

$$\hat{d}_{i,\alpha} = \left(1 - \pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})\right) \hat{\delta}_i. \quad (26)$$

The term $\hat{\delta}_i$ is the best estimator of the degree of expression conditional on definite knowledge that gene $i$ is differentially expressed; it is multiplied by $(1 - \pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x}))$, the posterior probability of differential expression. (The other product in the posterior expectation corresponds to equivalent expression, and is therefore identically zero.) The posterior expected degree of expression has been compared to a method of correcting estimates for gene selection bias [34]. For a new observation of gene $i$, the squared prediction error is

$$\left( (x'_{i,\text{new}} - x_{i,\text{new}}) - \hat{d}_{i,\alpha} \right)^2.$$
The squared prediction error does not directly target the question of which genes are differentially expressed; instead, it addresses the question of what the value of the next observation will be. However, good performance of one algorithm relative to another on either of these questions implies good performance on the other, as can be seen by considering that in general the mean squared prediction error is the sum of an algorithm's squared predictive bias and the data variance. The squared predictive bias term summarizes the ability of an algorithm to correctly distinguish differentially expressed genes from equivalently expressed genes. It is more flexible than the 0/1 loss in that it penalizes algorithms not just for being wrong, but for how wrong they are. The data variance sets the scale for "wrongness": for one algorithm to appear significantly worse than another, its squared predictive bias must dominate the data variance term.
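Explicitly, treating the prediction as fixed and writing $D_i = X'_{i,\text{new}} - X_{i,\text{new}}$, the decomposition invoked here is

$$E\left[\left(D_i - \hat{d}_{i,\alpha}\right)^2\right] = \left(E(D_i) - \hat{d}_{i,\alpha}\right)^2 + \mathrm{var}(D_i),$$

in which the first term is the squared predictive bias and the second is the data variance.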
Under the "all nulls false" reference algorithm, the best prediction of future values of
- X
i
for all genes is the maximum likelihood estimator
. Other algorithms make gains over this reference by correctly assigning equivalently expressed genes, thereby avoiding the contribution of the variance of the MLE to the squared prediction error. Under the "all nulls true" reference algorithm, the best prediction of future values of
- X
i
for all genes is 0. Other algorithms make gains over this reference by correctly assigning differentially expressed genes, thereby avoiding the contribution of the squared bias (that is, [E (
- X
i
)]2) to the squared prediction error.
The squared prediction error criterion therefore quantifies the relative costs of false positives and false negatives in terms of the bias-variance trade-off. To estimate the squared prediction error, we used leave-one-out cross validation,

$$\frac{1}{n} \sum_{j=1}^{n} \left( (x'_{i,j} - x_{i,j}) - \left(1 - \pi_{\alpha}(H_i \mid \mathbf{x}'_{(-j)}, \mathbf{x}_{(-j)})\right) \hat{\delta}_{i,(-j)} \right)^2 \quad (28)$$

if $n = n'$ and $x'_{i,j}$ is paired with $x_{i,j}$, or

$$\frac{1}{n n'} \sum_{j=1}^{n} \sum_{j'=1}^{n'} \left( (x'_{i,j'} - x_{i,j}) - \left(1 - \pi_{\alpha}(H_i \mid \mathbf{x}'_{(-j')}, \mathbf{x}_{(-j)})\right) \hat{\delta}_{i,(-j,-j')} \right)^2$$

if $X'_i$ and $X_i$ are independent, where $(-j)$ means that the $j$th replicate is omitted: $\mathbf{x}_{(-j)} = (x_{1,(-j)}, x_{2,(-j)}, \ldots, x_{m,(-j)})^T$ with $x_{i,(-j)} = (x_{i,1}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{i,n})$, and $\hat{\delta}_{i,(-j)}$ (or $\hat{\delta}_{i,(-j,-j')}$) is the estimate of $E(X'_i - X_i)$ computed with the indicated replicates omitted. We denote the resulting leave-one-out estimate of gene $i$'s squared prediction error under algorithm $\alpha$ by $\widehat{\mathrm{SPE}}_i(\alpha)$.
For example, suppose that $x'_{i,j}$ is paired with $x_{i,j}$ and the data for gene $i$ were $x'_i = (0, 1, 1)$ and $x_i = (2, 0, -2)$. For $j$ from 1 to 3, $\hat{\delta}_{i,(-j)} = (2, 0.5, -0.5)$, and using the fold change shrinkage calculation of equation (2), $1 - \pi_{\alpha}(H_i \mid \mathbf{x}'_{(-j)}, \mathbf{x}_{(-j)}) = (0.95, 0.78, 0.39)$. (Note that the FDR estimation algorithms require all the other genes' data to calculate $\pi_{\alpha}(H_i \mid \mathbf{x}'_{(-j)}, \mathbf{x}_{(-j)})$.) The individual terms in the sum in equation (28) are $(-2 - 0.95 \times 2)^2$, $(1 - 0.78 \times 0.5)^2$, and $(3 - 0.39 \times (-0.5))^2$, and their mean is 8.6. If the given data were independent instead of paired, the calculation would involve each of the 9 subsets obtained by leaving out one perturbation data point and one control data point.
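The arithmetic of this example can be checked with a few lines of R (ours); the fold-change-shrinkage probabilities are taken as given from the text, since equation (2) itself is not reproduced here:

```r
## Worked leave-one-out example for a single paired gene.
xprime_i <- c(0, 1, 1)
x_i      <- c(2, 0, -2)
d_i      <- xprime_i - x_i                              # (-2, 1, 3)
deltahat_loo <- sapply(1:3, function(j) mean(d_i[-j]))  # (2, 0.5, -0.5)
p_de <- c(0.95, 0.78, 0.39)          # 1 - pi_alpha, as stated in the text
mean((d_i - p_de * deltahat_loo)^2)  # 8.597..., i.e., 8.6 as reported
```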
We considered measuring error relative to always predicting that $X'_i - X_i = 0$ on a gene-wise basis using the ratio

$$\rho_i(\alpha) = \frac{\widehat{\mathrm{SPE}}_i(\alpha)}{\widehat{\mathrm{SPE}}_i(\text{all nulls true})} \quad (32)$$

with two measures of central tendency,

$$\frac{1}{m} \sum_{i=1}^{m} \rho_i(\alpha) \quad \text{and} \quad \mathrm{HSM}\left(\rho_1(\alpha), \ldots, \rho_m(\alpha)\right). \quad (33)$$

(The half-sample mode (HSM) [35] is a fast, robust estimator of the mode that is suitable for asymmetric distributions. It is implemented as the hsm function in the modeest package of R.) We also considered an absolute error criterion,

$$\frac{1}{m} \sum_{i=1}^{m} \widehat{\mathrm{SPE}}_i(\alpha); \quad (34)$$

this measure is reported relative to a base model such as the "all nulls true" model or the "all nulls false" model because we expect only the relative performances of the estimators to be meaningful. We found that the relative error mean essentially reproduced the absolute error relative to the "all nulls true" model, and the relative error mode often evaluated estimators as not practically different from the "all nulls true" benchmark. Therefore, we show only the results for the absolute error measure.
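Assuming gene-wise error vectors spe_alpha and spe_all_nulls_true (hypothetical names for the leave-one-out errors of an algorithm and of the "all nulls true" reference), the two summaries might be computed as:

```r
## Central-tendency summaries of the gene-wise relative errors (eq. (33)).
library(modeest)
rho <- spe_alpha / spe_all_nulls_true  # gene-wise ratios of equation (32)
mean(rho)                              # relative error mean
hsm(rho)                               # relative error mode (half-sample mode)
```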
The use of cross-validation to estimate classification error, appropriate for the problem of categorizing samples or microarrays when the training and test sets have known classifications, differs from the use of cross-validation to estimate squared prediction error, appropriate for the distinct problem of determining which genes are differentially expressed when no gene's status is known for use in the training and test sets. Jeffery et al. [36] used a cross-validation approach to estimate the predictive error of a variety of gene selection algorithms, but with microarray classification error rather than equations (32)-(34) as the performance criterion. Such classification error depends not only on the gene selection algorithm, but also on the particular classifier for which that algorithm selects features. Since our interest lies strictly in identifying differentially expressed genes, our methods instead quantify performance in terms of predicting new measurements. We have also addressed the problem using estimation error in place of prediction error [37].
Posterior predictive expected squared error
The local FDR shrinkage algorithm can be used to define an estimator's posterior predictive expected squared error. In general, the posterior predictive expected squared error is

$$E_{\text{posterior}}\left[\left((X'_{i,\text{new}} - X_{i,\text{new}}) - \hat{d}_{i,\alpha}\right)^2\right] = \left(\hat{d}_{i,\alpha} - E_{\text{posterior}}(X'_{i,\text{new}} - X_{i,\text{new}})\right)^2 + \mathrm{var}_{\text{posterior}}(X'_{i,\text{new}} - X_{i,\text{new}}),$$

where $X'_{i,\text{new}}$ and $X_{i,\text{new}}$ are random variables for new observations, $\hat{d}_{i,\alpha}$ is algorithm $\alpha$'s point prediction for $X'_{i,\text{new}} - X_{i,\text{new}}$, and $E_{\text{posterior}}$ and $\mathrm{var}_{\text{posterior}}$ are the expectation and variance with respect to the posterior distribution. The effective posterior distribution that leads to estimators of the form (26) is the mixture

$$\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x}) \, P_{H_i} + \left(1 - \pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})\right) P_{\bar{H}_i},$$

where $P_h$ is the posterior predictive distribution of $X'_{i,\text{new}} - X_{i,\text{new}}$ under hypothesis $h$; since the predictive means under $H_i$ and $\bar{H}_i$ are 0 and $\hat{\delta}_i$, respectively, the mixture has variance

$$\mathrm{var}_{\text{posterior}} = \pi_{\alpha} \, \mathrm{var}_{H_i} + (1 - \pi_{\alpha}) \, \mathrm{var}_{\bar{H}_i} + \pi_{\alpha} (1 - \pi_{\alpha}) \, \hat{\delta}_i^2.$$
We use the local FDR estimator with t-statistics and the theoretical null distribution as our gold-standard model for the computation of $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$; this model will be accurate under the reasonable assumption that few genes are differentially expressed at appreciable levels.
To fully express the posterior predictive loss, we must define the posterior predictive distribution for $X'_{\text{new}} - X_{\text{new}}$ under both the null and alternative hypotheses for both paired and non-paired data. Conditional on each hypothesis, we use improper prior distributions for convenience. Strictly speaking, this is inconsistent with our choice of $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$, an empirical Bayes approximation to a posterior probability; under a full Bayesian analysis, posterior probabilities of hypotheses can only be computed under proper priors for the parameters conditional on each hypothesis, as in the Bayes factor algorithm of equation (11). Our choice of $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$ enables sharing information across genes to give a sensible empirical Bayes posterior probability for the hypotheses but otherwise relies on the same assumptions as our conditional prior distributions.
For paired data under the null hypothesis, $X'_{\text{new}} - X_{\text{new}}$ has a normal sampling distribution with zero mean and sampling variance estimated from the data. Under the usual improper prior for the sampling variance (that is, $\pi_{\text{prior}}(\sigma^2) \propto \sigma^{-2}$), the posterior distribution for the sampling variance is a scaled-inverse-$\chi^2$ distribution with degrees of freedom $n$ and scale $s_0^2 = n^{-1} \sum_{j=1}^{n} (x'_{i,j} - x_{i,j})^2$. The posterior predictive density is the expectation of the sampling density with respect to the posterior distribution of the sampling variance,

$$P\left(d(x'_{\text{new}} - x_{\text{new}}) \mid \mathbf{x}', \mathbf{x}, H_i\right) = \int N\left(x'_{\text{new}} - x_{\text{new}} \mid 0, \sigma^2\right) \pi_{\text{posterior}}(\sigma^2) \, d\sigma^2 = t_n\left(x'_{\text{new}} - x_{\text{new}} \mid 0, s_0^2\right),$$

where $N(\cdot \mid \cdot, \cdot)$ is the normal density parameterized in terms of mean and variance, and $t_v(\cdot \mid c, s^2)$ is a shifted, scaled version of the t distribution with $v$ degrees of freedom, center $c$, and scale factor $s$. (That is, if $Y$ is distributed as $t_v(\cdot \mid c, s^2)$, then $(Y - c)/s$ has the usual $t_v$ distribution.)
For paired data under the alternative hypothesis, $X'_{\text{new}} - X_{\text{new}}$ has a normal sampling distribution with both mean and sampling variance estimated from the data. It can be shown that under the usual improper joint prior for the mean $\mu$ ($\mu = E(X' - X)$) and the sampling variance (that is, $\pi_{\text{prior}}(\mu, \sigma^2) \propto \sigma^{-2}$), the posterior predictive distribution for $X'_{\text{new}} - X_{\text{new}}$ is

$$t_{n-1}\left(\cdot \,\Big|\, \hat{\delta}_i, \; s^2\left(1 + \frac{1}{n}\right)\right),$$

where $s^2 = (n-1)^{-1} \sum_{j=1}^{n} \left(x'_{i,j} - x_{i,j} - \hat{\delta}_i\right)^2$, i.e., $s^2$ is the usual unbiased variance estimator of the paired differences.
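Putting the paired-case pieces together, the following sketch (ours) evaluates the posterior predictive expected squared error of a point prediction pred for one gene, using the mixture mean and variance derived above; it requires $n > 3$ so that both t distributions have finite variance:

```r
## Posterior predictive expected squared error, paired design, one gene.
## di: paired differences; pi0: pi_alpha(H_i | x', x); pred: point prediction.
pp_sq_err_paired <- function(di, pi0, pred) {
  n  <- length(di)
  m1 <- mean(di)                      # predictive center under the alternative
  v0 <- sum(di^2) / (n - 2)           # var of t_n(0, s0^2), s0^2 = sum(di^2)/n
  v1 <- var(di) * (1 + 1/n) * (n - 1) / (n - 3)  # var of t_{n-1}(m1, s^2(1+1/n))
  mu <- (1 - pi0) * m1                # mixture mean
  vr <- pi0 * v0 + (1 - pi0) * v1 + pi0 * (1 - pi0) * m1^2  # mixture variance
  (pred - mu)^2 + vr
}
```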
For non-paired data under the null hypothesis, if the treatment and control data are modeled as having distinct sampling variances (consistent with the assumptions used to specify $\pi_{\alpha}(H_i \mid \mathbf{x}', \mathbf{x})$), then the posterior predictive density is

$$P\left(d(x'_{\text{new}} - x_{\text{new}}) \mid \mathbf{x}', \mathbf{x}, H_i\right) = \int\!\!\int N\left(x'_{\text{new}} - x_{\text{new}} \mid 0, (\sigma')^2 + \sigma^2\right) \pi_{\text{posterior}}\left(\sigma^2, (\sigma')^2\right) d\sigma^2 \, d(\sigma')^2,$$

where $(\sigma')^2$ and $\sigma^2$ are the sampling variances of the treatment and control data, respectively. This integral is intractable because $\pi_{\text{posterior}}(\sigma^2, (\sigma')^2)$ has a non-standard form (see Additional file 1). We estimated it by drawing samples from $\pi_{\text{posterior}}(\sigma^2, (\sigma')^2)$ using Markov chain Monte Carlo (MCMC) [38] and then calculating the MCMC average

$$P\left(d(x'_{\text{new}} - x_{\text{new}}) \mid \mathbf{x}', \mathbf{x}, H_i\right) \approx \frac{1}{K} \sum_{k=1}^{K} N\left(x'_{\text{new}} - x_{\text{new}} \mid 0, (\sigma')^2_k + \sigma^2_k\right),$$

where the subscript $k$ indicates the $k$th MCMC draw of parameter values (after suitable burn-in) and $K$ is the total number of draws. In the present case, the MCMC algorithm we use is an inherently multi-chain procedure; we used 10 chains, with a burn-in of 20 samples per chain followed by 100 samples per chain, for a total of $K = 1000$ samples. For each gene in a randomly chosen subset of genes from the complete data set, a contour plot of the posterior density was superimposed on a scatter plot of the MCMC draws of parameter values. The scatter plots visually conformed to the contours of the posterior densities, verifying that the MCMC draws provided a good approximation to the posterior distributions.
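Given posterior draws of the two variances (the sampler itself follows Additional file 1 and is not reproduced here; sig2 and sig2p are assumed vectors of K draws), the MCMC average is a one-liner:

```r
## MCMC estimate of the unpaired null predictive density at dnew.
pp_density_null <- function(dnew, sig2, sig2p) {
  mean(dnorm(dnew, mean = 0, sd = sqrt(sig2 + sig2p)))
}
```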
For non-paired data under the alternative hypothesis, $X'_{\text{new}}$ and $X_{\text{new}}$ each have a normal sampling distribution with both mean and sampling variance estimated from the data. It can be shown that under the usual improper joint prior for the individual means and sampling variances, the posterior predictive distributions for $X'_{\text{new}}$ and $X_{\text{new}}$ are

$$X'_{\text{new}} \sim t_{n'-1}\left(\cdot \,\Big|\, \bar{x}'_i, \; (s')^2\left(1 + \frac{1}{n'}\right)\right), \qquad X_{\text{new}} \sim t_{n-1}\left(\cdot \,\Big|\, \bar{x}_i, \; s^2\left(1 + \frac{1}{n}\right)\right),$$

and therefore $X'_{\text{new}} - X_{\text{new}}$ is distributed as the difference of these two independent shifted, scaled t random variables, with posterior predictive mean $\bar{x}'_i - \bar{x}_i$ and variance

$$\mathrm{var}_{\text{posterior}}\left(X'_{\text{new}} - X_{\text{new}}\right) = \frac{n'-1}{n'-3}(s')^2\left(1 + \frac{1}{n'}\right) + \frac{n-1}{n-3}s^2\left(1 + \frac{1}{n}\right),$$

where $(s')^2$ and $s^2$ are the unbiased variance estimators of the treatment and control replicates.
To summarize gene-wise posterior predictive expected squared error over all genes in a data set, we considered quantities analogous to the relative errors and absolute errors of equations (32)-(34), with gene-wise posterior predictive expected squared errors replacing cross-validation-derived prediction errors. Again, we found that the relative error mean essentially replicated the results of the absolute error relative to the "all nulls true" benchmark; relative error mode evaluated the performance of all estimators as identical to the "all nulls true" benchmark. Therefore, we show only the results for the absolute error measure for posterior predictive expected squared error.