Empirical Bayes estimation of posterior probabilities of enrichment: A comparative study of five estimators of the local false discovery rate

Yang, Zhenyu; Li, Zuojing; Bickel, David R

doi:10.1186/1471-2105-14-87

Methodology article
Open access
Published: 06 March 2013

BMC Bioinformatics volume 14, Article number: 87 (2013)

Abstract

Background

In investigating differentially expressed genes or other selected features, researchers conduct hypothesis tests to determine which biological categories, such as those of the Gene Ontology (GO), are enriched for the selected features. Multiple comparison procedures (MCPs) are commonly used to prevent excessive false positive rates. Traditional MCPs, e.g., the Bonferroni method, go to the opposite extreme: they strictly control a family-wise error rate, which results in excessive false negative rates. Researchers generally prefer the more balanced approach of instead controlling the false discovery rate (FDR). However, the q-values that methods of FDR control assign to biological categories tend to be too low to reliably estimate the probability that a biological category is not enriched for the preselected features. Thus, we study the application of other estimators of that probability, which is called the local FDR (LFDR).

Results

We considered five LFDR estimators for detecting enriched GO terms: a binomial-based estimator (BBE), a maximum likelihood estimator (MLE), a normalized MLE (NMLE), a histogram-based estimator assuming a theoretical null hypothesis (HBE), and a histogram-based estimator assuming an empirical null hypothesis (HBE-EN). Since NMLE depends not only on the data but also on the specified value of Π₀, the proportion of non-enriched GO terms, it is only advantageous when either Π₀ is already known with sufficient accuracy or there are data for only 1 GO term. By contrast, the other estimators work without specifying Π₀ but require data for at least 2 GO terms. Our simulation studies yielded the following summaries of the relative performance of each of those four estimators. HBE and HBE-EN produced larger biases for 2, 4, 8, 32, and 100 GO terms than BBE and MLE. BBE has the lowest bias if Π₀ is 1 and if the number of GO terms is between 2 and 32. The bias of MLE is no worse than that of BBE for 100 GO terms even when the ideal number of components in its underlying mixture model is unknown, but has high bias when the number of GO terms is small compared to the number of estimated parameters. For unknown values of Π₀, BBE has the lowest bias for a small number of GO terms (2-32 GO terms), and MLE has the lowest bias for a medium number of GO terms (100 GO terms).

Conclusions

For enrichment detection, we recommend estimating the LFDR by MLE given at least a medium number of GO terms, by BBE given a small number of GO terms, and by NMLE given either only 1 GO term or precise knowledge of Π₀.

Background

The development of microarray techniques and high-throughput genomic, proteomic, and bioinformatics scanning approaches (such as microarray gene expression profiling, mass spectrometry, and ChIP-on-chip) has enabled researchers to simultaneously study tens of thousands of biological features (e.g., genes, proteins, and single-nucleotide polymorphisms [SNPs]), and to identify a set of features for further investigation. However, there remains the challenge of interpreting these features biologically. For a given set of features, the determination of whether some biological information terms are enriched (i.e., differentially represented), compared to the reference feature set, is termed the feature enrichment problem. The biological information term may be, for instance, a Gene Ontology (GO) term [1, 2] or a pathway in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [3].

This problem has been addressed using a number of high-throughput enrichment tools, including DAVID, MAPPFinder, Onto-Express and GoMiner [4-7]. Huang et al. [8] reviewed 68 distinct feature enrichment analysis tools. These authors further classified feature enrichment analysis tools into 3 categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA), and modular enrichment analysis (MEA). In this article, we propose empirical Bayes solutions to the SEA problem using genes as archetypal features. Without loss of generality, we consider whether some specific biological categories are enriched for differentially expressed genes with respect to the reference genes.

Indeed, like other enrichment-detection methods, our methods apply much more broadly. They can assess enrichment given any sub-list of features selected for future study, not just a list of genes considered differentially expressed. An anonymous referee pointed out these examples of such lists of candidate features that arise in the context of whole genome sequencing:

genes with SNPs
genes with copy number variations
genes with loss of heterozygosity

These examples and those of our first paragraph do not exhaust contemporary applications, and the feature enrichment problem may occur in unforeseen domains of study. Thus, our illustrative use of differential gene expression as a running example should not be interpreted as a limitation.

Existing enrichment tools mainly address the feature enrichment problem using a p-value obtained from an exact or approximate statistical test (e.g., Fisher’s exact test, the hypergeometric test, binomial test, or the χ² test). For each GO term or other biological category, the null hypothesis tested and its alternative hypothesis are as follows:

\begin{array}{l} H_{0} : \text{the GO term is not enriched for the preselected genes} \\ H_{1} : \text{the GO term is enriched for the preselected genes} \end{array}

(1)

Here and in the remainder of the paper, we use GO terms as concrete examples of biological categories without excluding applications of the methods to categories from other relevant databases. The general process begins as follows:

For each GO term, construct Table 1 based on the preselected genes (e.g., differentially expressed (DE) genes) and reference genes (e.g., all genes measured in a microarray experiment).
Compute the p-value for each GO term using a statistical test that can detect enrichment for the preselected genes.
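The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the counts follow the notation introduced with Table 1 (x1 DE genes and x2 EE genes in the GO term, n DE genes, N reference genes), and the 2-sided p-value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def hypergeom_pmf(t, s, n, N):
    """P(T = t | S = s) under H0: t DE genes among the s genes of the GO term."""
    return comb(n, t) * comb(N - n, s - t) / comb(N, s)

def fisher_two_sided(x1, x2, n, N):
    """Two-sided Fisher's exact p-value for one GO term (hypothetical helper)."""
    s = x1 + x2                               # genes annotated to the GO term
    lo, hi = max(0, s + n - N), min(s, n)     # support of T given S = s
    p_obs = hypergeom_pmf(x1, s, n, N)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(p for p in (hypergeom_pmf(j, s, n, N) for j in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Example with made-up counts: 8 of 200 DE genes and 2 of 9,800 EE genes in the term.
p_value = fisher_two_sided(x1=8, x2=2, n=200, N=10000)
```

Exact integer binomial coefficients are used so that no external library is required; for large gene sets a statistics package would normally be preferred.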

Table 1 The number of differentially expressed (DE) and equivalently expressed (EE) genes in a GO category

Multiple comparison procedures (MCPs) are then applied to the resulting p-values to prevent excessive false positive rates. The false discovery rate (FDR) [9] is frequently used to control the expected proportion of incorrectly rejected null hypotheses in gene enrichment studies [10-12] because it has lower false negative rates than Bonferroni correction and other methods of controlling the family-wise error rate. Methods of FDR control assign q-values [13] to biological categories, but q-values are too low to reliably estimate the probability that the biological category is not enriched for the preselected features. Thus, we study application of better estimators of that probability, which is technically known as the local FDR (LFDR). Hong et al. [14] used an LFDR estimator to solve a GSEA problem and pointed out that this was less biased than the q-value for estimating the LFDR, the posterior probability that the null hypothesis is true.

Efron [15, 16] devised reliable LFDR estimators for a range of applications in microarray gene expression analysis and other problems of large-scale inference. However, whereas microarray gene expression analysis takes into account tens of thousands of genes, the feature enrichment problem typically concerns a much smaller number of GO terms. While these methods are appropriate for microarray-scale inference, they are less reliable for enrichment-scale inference [17-19]. Thus, we will specifically adapt LFDR estimators that are appropriate for smaller-scale inference to address the SEA problem. Again, we will focus on genes and GO terms for the sake of concreteness. Nevertheless, the estimators used can be applied to other features and to other biological terms (e.g., metabolic pathways).

The sections of this paper are arranged as follows. We first introduce some preliminary concepts in the feature enrichment problem. Next, two previous LFDR estimators and three new LFDR estimators are described. Following this, we compare the LFDR estimators by means of a simulation study and an application to breast cancer data. Finally, we draw conclusions and make recommendations on the basis of our results.

Preliminary concepts

The feature enrichment problem described in the Background section is stated here more formally for the application of LFDR methods in the next section.

Likelihood functions

In Table 1, x₁ and x₂ are the observed numbers of DE genes and EE genes in a given GO category, respectively. Whereas n is the total number of DE genes, N is the total number of reference genes. Thus, N − n is the total number of EE genes. The columns give the numbers of DE genes and EE genes, and the rows give the numbers of genes in the GO category and outside the GO category.

Let X₁ and X₂, respectively, denote the random numbers of DE and EE genes in a GO category. The observed values x₁ and x₂ are modeled as realizations of X₁ and X₂, which follow binomial distributions, namely, X₁ ∼ Binomial(n, π₁) and X₂ ∼ Binomial(N − n, π₂), where π₁ and π₂ are the probabilities that a DE gene and an EE gene, respectively, belong to the GO category. Under the assumption that X₁ and X₂ are independent, the unconditional likelihood is

\begin{align} L (π_{1}, π_{2}; x_{1}, x_{2}, n, N) & = Pr (X_{1} = x_{1}, X_{2} = x_{2}; π_{1}, π_{2}, n, N) \\ & = \binom{n}{x_{1}} \binom{N - n}{x_{2}} π_{1}^{x_{1}} {(1 - π_{1})}^{n - x_{1}} π_{2}^{x_{2}} {(1 - π_{2})}^{N - n - x_{2}}, \end{align}

(2)

where 0 ≤ x₁ ≤ n, 0 ≤ x₂ ≤ N − n, and 0 ≤ π_i ≤ 1, i=1,2.

If we define

λ = ln [π_{2} / (1 - π_{2})],

(3)

and

θ = ln [π_{1} / (1 - π_{1})] - λ

(4)

then θ is the parameter of interest, representing the log odds ratio of the GO term, and λ is a nuisance parameter. Under the new parametrization, the unconditional likelihood function (2) is

L (θ, λ; x_{1}, x_{2}, n, N) = \frac{\binom{n}{x_{1}} \binom{N - n}{x_{2}} e^{x_{1} (θ + λ)} e^{x_{2} λ}}{{(1 + e^{θ + λ})}^{n} {(1 + e^{λ})}^{N - n}},

(5)

where 0 ≤ x₁ ≤ n and 0 ≤ x₂ ≤ N − n.
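As a quick numerical check of the reparametrization, the following sketch evaluates the likelihood in both forms, equation (2) in terms of (π₁, π₂) and equation (5) in terms of (θ, λ), and confirms that they agree. The function names and the example counts are illustrative assumptions, not part of the original analysis.

```python
from math import comb, exp, log

def lik_pi(pi1, pi2, x1, x2, n, N):
    """Unconditional likelihood in the (pi1, pi2) parametrization, equation (2)."""
    return (comb(n, x1) * comb(N - n, x2)
            * pi1 ** x1 * (1 - pi1) ** (n - x1)
            * pi2 ** x2 * (1 - pi2) ** (N - n - x2))

def to_theta_lambda(pi1, pi2):
    """Equations (3) and (4): nuisance log odds and log odds ratio."""
    lam = log(pi2 / (1 - pi2))
    theta = log(pi1 / (1 - pi1)) - lam
    return theta, lam

def lik_theta(theta, lam, x1, x2, n, N):
    """Unconditional likelihood in the (theta, lambda) parametrization, equation (5)."""
    numerator = comb(n, x1) * comb(N - n, x2) * exp(x1 * (theta + lam)) * exp(x2 * lam)
    denominator = (1 + exp(theta + lam)) ** n * (1 + exp(lam)) ** (N - n)
    return numerator / denominator

# Small made-up table: n = 5 DE genes, N = 20 reference genes.
theta, lam = to_theta_lambda(0.3, 0.1)
a = lik_pi(0.3, 0.1, x1=2, x2=3, n=5, N=20)
b = lik_theta(theta, lam, x1=2, x2=3, n=5, N=20)
```

The two values a and b should coincide up to floating-point rounding, since equation (5) is only a change of variables.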

In equation (5), we take the interest parameter θ and also the nuisance parameter λ into consideration. Consider statistics T and S, functions of X₁ and X₂, such that T(X₁,X₂) = X₁ and S(X₁,X₂) = X₁+X₂. Thus, T represents the number of DE genes in a GO category, and S represents the total number of genes in a GO category. Let t and s be the observed values of T and S. The probability mass function of T(X₁,X₂) = t conditional on S(X₁,X₂) = X₁+X₂ = s, say Pr(T = t|S = s;θ,λ,N,n), does not depend on the nuisance parameter λ [19]. See also Example 8.47 of Severini [20]. Thus, we derive the conditional probability mass function

\begin{align} f_{θ} (t | s) & = Pr (T = t | S = s; θ, n, N) \\ & = \frac{\binom{n}{t} \binom{N - n}{s - t} e^{tθ}}{\sum_{j = \max (0, s + n - N)}^{\min (s, n)} \binom{n}{j} \binom{N - n}{s - j} e^{jθ}} \end{align}

(6)

understood as a function of t.
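For concreteness, equation (6) can be evaluated as follows. This is a minimal sketch rather than the implementation used for the results; the function name is hypothetical, and the terms are combined on the log scale only to avoid overflow when θ or the counts are large.

```python
from math import comb, exp, log

def cond_pmf(t, s, n, N, theta):
    """Conditional pmf f_theta(t | s) of equation (6).

    t: DE genes in the GO term, s: all genes in the GO term,
    n: DE genes overall, N: reference genes, theta: log odds ratio.
    """
    lo, hi = max(0, s + n - N), min(s, n)     # support of T given S = s
    if not lo <= t <= hi:
        return 0.0
    # log of each term of the sum in the denominator of equation (6)
    logw = [log(comb(n, j)) + log(comb(N - n, s - j)) + j * theta
            for j in range(lo, hi + 1)]
    top = max(logw)
    return exp(logw[t - lo] - top) / sum(exp(w - top) for w in logw)

# With theta = 0 the pmf reduces to the hypergeometric distribution.
null_prob = cond_pmf(t=1, s=4, n=3, N=10, theta=0.0)
```

Note that λ does not appear among the arguments: conditioning on S = s removes the nuisance parameter.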

By eliminating the nuisance parameter λ, we can reduce the original data x₁ and x₂ by considering the statistic T = t. However, the use of the conditional probability mass function requires some justification because of concerns about losing information during the conditioning process. Unfortunately, in the presence of the nuisance parameter, the statistic S(X₁,X₂) = X₁+X₂ is not an ancillary statistic for the parameter of interest. In other words, the probability mass function of the conditioning variable S(X₁,X₂) may contain some information about the parameter θ [20]. However, following the explanation of Barndorff-Nielsen and Cox ([21], §2.5), the expected value of the statistic S(X₁,X₂) equals the nuisance parameter. Hence, from the observation of S(X₁,X₂) alone, the distribution of S(X₁,X₂) contains little information about θ [21]. S(X₁,X₂) satisfies the other 3 conditions of an ancillary statistic defined by Barndorff-Nielsen and Cox [21]: the parameters θ and λ are variation independent; (T(X₁,X₂),S(X₁,X₂)) is the minimal sufficient statistic; and the distribution of T(X₁,X₂), given S(X₁,X₂) = s, depends on the parameter of interest θ but not on the nuisance parameter λ. Therefore, the probability mass function of S(X₁,X₂) contains little information about the value of θ.

Hypotheses and LFDRs

Considering GO term i, we denote the T, S, t, s, and θ used in equation (6) as T_i, S_i, t_i, s_i, and θ_i. From Table 1, hypothesis comparison (1) of GO term i is equivalent to

H_{0} : θ_{i} = 0 \text{ versus } H_{1} : θ_{i} \neq 0 .

(7)

Let S = 〈S₁,S₂,⋯,S_m〉 and s = 〈s₁,s₂,⋯,s_m〉. Let BF_i denote the Bayes factor of GO term i:

{BF}_{i} = \frac{Pr (T_{i} = t_{i} | S = s, θ_{i} \neq 0)}{Pr (T_{i} = t_{i} | S = s, θ_{i} = 0)} .

(8)

It is called the Bayes factor because it yields posterior odds when multiplied by prior odds. More precisely, the posterior odds of the alternative hypothesis corresponding to GO term i is

ω_{i} = \frac{Pr (θ_{i} \neq 0 | t_{i})}{Pr (θ_{i} = 0 | t_{i})} = {BF}_{i} \times \frac{(1 - Π_{0})}{Π_{0}},

(9)

where Π₀ is the prior conditional probability that a GO term is not enriched for the preselected genes given s, i.e., Π₀ = Pr(θ_i = 0|S = s). Thus, (1 − Π₀)/Π₀ is the prior odds of the alternative hypothesis of enrichment. According to Bayes’ theorem, the LFDR of GO term i is

{LFDR}_{i} = Pr (θ_{i} = 0 | t_{i}) = \frac{1}{1 + ω_{i}},

(10)

where ω_i is defined in equation (9).
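Equations (9) and (10) amount to a two-step calculation: multiply the Bayes factor by the prior odds, then convert the posterior odds to a probability. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def lfdr_from_bf(bf, pi0):
    """LFDR of equation (10) from a Bayes factor and the prior probability pi0.

    bf:  Bayes factor in favor of enrichment, equation (8)
    pi0: prior probability that the GO term is not enriched (0 < pi0 <= 1)
    """
    prior_odds = (1.0 - pi0) / pi0        # prior odds of enrichment
    omega = bf * prior_odds               # posterior odds, equation (9)
    return 1.0 / (1.0 + omega)            # equation (10)

# A Bayes factor of 9 exactly offsets prior odds of 1/9 (pi0 = 0.9),
# leaving even posterior odds.
example = lfdr_from_bf(bf=9.0, pi0=0.9)
```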

Methods

This section is divided into two parts:

1.
Previous LFDR estimators. While not unique to this paper, these methods are included for comparison.
2.
New LFDR estimators. Our main methodological innovations are the uses of a conditional probability mass function and of normalized maximum likelihood for LFDR estimation.

The other original contributions of this paper are the estimator comparisons of the next section. The comparisons are made by simulation and by a case study.

Previous LFDR estimators

Binomial-based LFDR estimator

The version of the FDR that generalizes the LFDR is the nonlocal FDR, which is defined as the ratio of the expected number of false discoveries to the expected total number of discoveries [17]. In our running example, a discovery of enrichment is a rejection of the null hypothesis of non-enrichment at some significance level α, and a false discovery of enrichment is a discovery of enrichment corresponding to a case of no actual enrichment. (This FDR has been called the “Bayesian FDR” [22] to distinguish it from the FDR of Benjamini and Hochberg [9]).

Let α denote any significance level chosen to be between 0 and 1. For all GO terms of interest, the nonlocal FDR may be estimated by

\begin{align} \hat{FDR} (α) & = \min \left(\frac{mα}{\sum_{j = 1}^{m} 1_{\{p_{j} \leq α\}}}, 1\right), \end{align}

(11)

where m is the number of GO terms, p_j is the p-value of GO term j, and $1_{\{p_{j} \leq α\}}$ is the indicator such that $1_{\{p_{j} \leq α\}} = 1$ if p_j ≤ α is true and $1_{\{p_{j} \leq α\}} = 0$ otherwise. Thus, $\sum_{j = 1}^{m} 1_{\{p_{j} \leq α\}}$ represents the number of discoveries of enriched GO terms, and mα estimates the number of such discoveries that are false.
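Equation (11) translates directly into code. In the sketch below the function name is hypothetical, and returning 1 when no p-value falls below α is an assumption made here to avoid division by zero.

```python
def fdr_hat(pvalues, alpha):
    """Estimate of the nonlocal FDR at significance level alpha, equation (11)."""
    m = len(pvalues)
    discoveries = sum(1 for p in pvalues if p <= alpha)
    if discoveries == 0:
        return 1.0                      # assumption: cap at 1 when nothing is rejected
    return min(m * alpha / discoveries, 1.0)

# Five made-up p-values, two of which fall below alpha = 0.05.
estimate = fdr_hat([0.001, 0.01, 0.2, 0.5, 0.9], alpha=0.05)
```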

Let r_i be the rank of the p-value of GO term i, e.g., r_i = 1 if the p-value of GO term i is the smallest among all p-values of m GO terms. Based on a modification of equation (11), the binomial-based estimator (BBE) of the LFDR of GO term i is

{\hat{LFDR}}_{i} = \begin{cases} \min \left(\frac{m p_{2 r_{i}}}{2 r_{i}}, 1\right), & r_{i} \leq \frac{m}{2}, \\ 1, & r_{i} > \frac{m}{2} . \end{cases}

(12)

It is conservative in the sense that it tends to overestimate LFDR [17].
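A minimal sketch of equation (12) follows. It assumes that p with subscript 2r_i denotes the (2r_i)-th smallest p-value, and that tied p-values are ranked in order of appearance; both are assumptions of this illustration, as is the function name.

```python
def bbe_lfdr(pvalues):
    """Binomial-based LFDR estimates, equation (12), one per GO term."""
    m = len(pvalues)
    sorted_p = sorted(pvalues)
    # ranks[i] = rank of pvalues[i], with 1 for the smallest p-value
    order = sorted(range(m), key=lambda i: pvalues[i])
    ranks = [0] * m
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    estimates = []
    for i in range(m):
        r = ranks[i]
        if r <= m / 2:
            # sorted_p[2r - 1] is the (2r)-th smallest p-value (0-based indexing)
            estimates.append(min(m * sorted_p[2 * r - 1] / (2 * r), 1.0))
        else:
            estimates.append(1.0)
    return estimates

# Four made-up p-values: only the two smallest can receive an estimate below 1.
lfdr = bbe_lfdr([0.5, 0.001, 0.9, 0.01])
```

In this example the smallest p-value (0.001) is paired with the second smallest (0.01), giving 4 × 0.01 / 2 = 0.02.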

Histogram-based LFDR estimator

Efron [15, 16] devised reliable histogram-based LFDR estimators for a range of applications in microarray gene expression analysis and other problems of large-scale inference. Let z_i = Φ^{− 1}(p_i) be the z-transformed statistic of GO term i, where Φ is the standard normal cumulative distribution function (cdf) and p_i is the 2-sided p-value of GO term i. For each GO term, the density is a mixture of the form

f (z_{i}) = Π_{0} f_{0} (z_{i}) + (1 - Π_{0}) f_{1} (z_{i}),

(13)

where f₀ is the density function of z for the non-enriched GO terms, f₁ is that for the enriched GO terms, and Π₀ is the probability that a GO term is non-enriched. The histogram-based LFDR of GO term i is estimated by equation (14):

{\hat{LFDR}}_{i} = \frac{\hat{f_{0}} (z_{i})}{\hat{f} (z_{i})},

(14)

where $\hat{f}$ is the estimator of f that is estimated by a nonparametric Poisson regression method [15, 16]. We call ${\hat{LFDR}}_{i}$ the histogram-based estimator (HBE) if the density function f₀ is assumed to be standard normal, N(0,1), and the histogram-based estimator with empirical null (HBE-EN) if the density function f₀ is estimated based on the truncated maximum likelihood technique of [16]. Dalmasso et al. [23] compared the precursor of HBE-EN [15] to other LFDR estimators.

New LFDR estimators

Type II maximum likelihood estimator

Bickel [17] follows Good [24] in calling the maximization of likelihood over a hyperparameter Type II maximum likelihood to distinguish it from the usual Type I maximum likelihood, which pertains only to models that lack random parameters. Type II maximum likelihood has been applied to parametric mixture models (PMMs) for the analysis of microarray data [25, 26], proteomics data [18], and genetic association data [27]. In this section, we adapt the approach to the feature enrichment problem by using the conditional probability mass function defined above. The particular models we use in this framework correspond to new methods of enrichment analysis.

Let $G (s) = \{g_{θ} (∙ | s); θ \geq 0\}$ be a parametric family of probability mass functions with

\begin{align} g_{θ} (∙ | s) & = \frac{1}{2} \times [f_{θ} (∙ | s) + f_{- θ} (∙ | s)], \end{align}

(15)

where f_θ(∙|s) is defined in equation (6). We define the k-component PMM as

\begin{align} g (∙ | s; θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}) & = \sum_{j = 0}^{k - 1} Π_{j} g_{θ_{j}} (∙ | s), \end{align}

(16)

where θ₀ = 0 and θ_j≠θ_J for all $j, J \in \{0, \dots, k - 1\}$ such that j≠J.

Let T = 〈T₁,T₂,⋯,T_m〉 and t = 〈t₁,t₂,⋯,t_m〉 be vectors of the T_i's and t_i's used in equation (8). Assuming T_i is independent of T_j and S_j for all $i, j \in \{1, \dots, m\}$ such that i≠j, the joint probability mass function is

\begin{align} g (t | s; θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}) & = \prod_{i = 1}^{m} g (t_{i} | s; θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}) \\ & = \prod_{i = 1}^{m} g (t_{i} | s_{i}; θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}), \end{align}

(17)

where s_i is the observed value of S_i for GO term i, and s = 〈s₁,s₂,⋯,s_m〉.

Moreover, we assume that for a given number of genes in GO term i, T_i(i = 1,…,m) satisfies the k-component PMM shown in equation (16). In other words, we assume that the possible log odds ratios of GO term i are the θ₀,θ₁,θ₂,…,θ_{k − 1} of equation (16) if the alternative hypothesis H₁ in hypothesis comparison (7) is true.

Therefore, the log-likelihood function under the k-component PMM for all GO terms is

\begin{align} \log L (θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}) & = \log g (t | s; θ_{0}, \dots, θ_{k - 1}, Π_{0}, \dots, Π_{k - 1}) \\ & = \sum_{i = 1}^{m} \left[\log \sum_{j = 0}^{k - 1} Π_{j} g_{θ_{j}} (t_{i} | s_{i})\right] . \end{align}

(18)

The LFDR of GO term i is estimated by

{\hat{LFDR}}_{i}^{(k)} = \frac{{\hat{Π}}_{0} g_{θ_{0}} (t_{i} | s_{i})}{g (t_{i} | s_{i}; θ_{0}, {\hat{θ}}_{1}, \dots, {\hat{θ}}_{k - 1}, {\hat{Π}}_{0}, \dots, {\hat{Π}}_{k - 1})},

(19)

where ${\hat{θ}}_{1}, \dots, {\hat{θ}}_{k - 1}$ and ${\hat{Π}}_{0}, \dots, {\hat{Π}}_{k - 1}$ are maximum likelihood estimates of θ₁,…,θ_{k − 1} and Π₀,…,Π_{k − 1} in equation (18). We call ${\hat{LFDR}}_{i}^{(k)}$ the k-component maximum likelihood estimator (MLEk). Our LFDRenrich and LFDRhat software suites of R functions that implement MLE2 and MLE3 are now available at http://www.statomics.com. Moreover, ${\hat{θ}}_{i}$ (i = 1,…,k − 1; k = 2,3) and ${\hat{Π}}_{j}$ (j = 0,…,k − 1; k = 2,3), also in LFDRenrich and LFDRhat, are maximum likelihood estimators of θ_i (i = 1,…,k − 1; k = 2,3) and Π_j (j = 0,…,k − 1; k = 2,3) under given constraints.
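To make equations (15)-(19) concrete for k = 2, the sketch below maximizes the log-likelihood of equation (18) by a plain grid search over θ₁ and Π₀ and then applies equation (19). It is an illustration only: the grid search is a stand-in chosen for brevity and is not the optimizer of the LFDRenrich or LFDRhat software, and all function names and the toy data are hypothetical.

```python
from math import comb, exp, log

def cond_pmf(t, s, n, N, theta):
    """Equation (6), evaluated with log-weights for numerical stability."""
    lo, hi = max(0, s + n - N), min(s, n)
    logw = [log(comb(n, j)) + log(comb(N - n, s - j)) + j * theta
            for j in range(lo, hi + 1)]
    top = max(logw)
    return exp(logw[t - lo] - top) / sum(exp(w - top) for w in logw)

def g(t, s, n, N, theta):
    """Symmetrized pmf of equation (15)."""
    return 0.5 * (cond_pmf(t, s, n, N, theta) + cond_pmf(t, s, n, N, -theta))

def mle2_lfdr(ts, ss, n, N, theta_grid, pi0_grid):
    """Grid-search sketch of MLE2: returns (LFDR estimates, theta1_hat, pi0_hat)."""
    null = [g(t, s, n, N, 0.0) for t, s in zip(ts, ss)]      # theta_0 = 0 component
    best = None
    for theta1 in theta_grid:
        alt = [g(t, s, n, N, theta1) for t, s in zip(ts, ss)]
        for pi0 in pi0_grid:
            mix = [pi0 * a + (1 - pi0) * b for a, b in zip(null, alt)]
            if min(mix) <= 0.0:
                continue                                      # skip log(0)
            loglik = sum(log(v) for v in mix)                 # equation (18), k = 2
            if best is None or loglik > best[0]:
                best = (loglik, theta1, pi0, alt)
    _, theta1_hat, pi0_hat, alt = best
    lfdr = [pi0_hat * a / (pi0_hat * a + (1 - pi0_hat) * b)   # equation (19)
            for a, b in zip(null, alt)]
    return lfdr, theta1_hat, pi0_hat

# Toy data: five GO terms of 10 genes each; the last one holds 8 of the 20 DE genes.
theta_grid = [0.1 * j for j in range(1, 101)]   # theta_1 constrained to (0, 10]
pi0_grid = [j / 20 for j in range(21)]          # Pi_0 in [0, 1]
lfdr, theta1_hat, pi0_hat = mle2_lfdr([1, 1, 1, 1, 8], [10] * 5, n=20, N=200,
                                      theta_grid=theta_grid, pi0_grid=pi0_grid)
```

In this toy example the strongly over-represented term should receive a much smaller LFDR estimate than the four terms whose counts are close to their null expectation.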

LFDR estimator based on the normalized maximum likelihood

Combining equations (9)-(10), we obtain

{LFDR}_{i} = {(1 + {BF}_{i} \times \frac{(1 - Π_{0})}{Π_{0}})}^{- 1} .

(20)

Therefore, given a guessed value of Π₀, we may use an estimator of the Bayes factor to estimate the LFDR of a GO term.

We now develop such an estimator of the Bayes factor. For GO category i, let $E_{i}$ stand for the set of all probability mass functions defined on {0,1,…,s_i}, the set of all possible values of T_i. Based on hypothesis comparison (7), the set of log odds ratios, denoted as $ℋ$ , is {0} under the null hypothesis and is $R ∖ \{0\} = \{θ \in R : θ \neq 0\}$ , the set of all real values except 0, under the alternative hypothesis. With the assumption that random variable T_i is independent of random variable S_j for any i≠j, the regret of a predictive mass function $\bar{f} \in E_{i}$ is a measure of how well it predicts the observed value t_i∈{0,1,…,s_i}. The regret is defined as

reg (\bar{f}, t_{i} | s_{i}; ℋ) = log \frac{f_{{\hat{θ}}_{i} (t_{i} | s_{i})} (t_{i} | s_{i})}{\bar{f} (t_{i} | s_{i})},

(21)

where ${\hat{θ}}_{i} (t_{i} | s_{i})$ is a Type I MLE with respect to $ℋ$ under the observed value t_i given s_i [28, 29].

For all members of $E_{i}$ , the optimal predictive conditional probability mass function of GO category i and the hypothesis that $θ_{i} \in ℋ$ is denoted by $f_{i}^{†} (∙ | s_{i}; ℋ)$ . It minimizes the maximal regret in sample space {0,1,…,s_i} in the sense that it satisfies

f_{i}^{†} (∙ | s_{i}; ℋ) = arg min_{\bar{f} \in E_{i}} max_{t \in \{0, 1, \dots, s_{i}\}} reg (\bar{f}, t | s_{i}; ℋ) .

(22)

It is well known [28] that the predictive probability mass function that satisfies equation (22) is

\begin{align} f_{i}^{†} (t_{i} | s_{i}; ℋ) & = \frac{max_{θ \in ℋ} f_{θ} (t_{i} | s_{i})}{K_{i}^{†} (ℋ)}, \end{align}

(23)

where f_θ(t_i|s_i) is the conditional probability mass function defined in equation (6), and $K_{i}^{†} (ℋ)$ is a constant defined as

\begin{align} K_{i}^{†} (ℋ) & = \sum_{y = \max (0, s_{i} + n - N)}^{\min (s_{i}, n)} \max_{θ \in ℋ} f_{θ} (y | s_{i}) \\ & = \sum_{y = \max (0, s_{i} + n - N)}^{\min (s_{i}, n)} \frac{\binom{n}{y} \binom{N - n}{s_{i} - y} e^{y {\hat{θ}}_{i} (y)}}{\sum_{j = \max (0, s_{i} + n - N)}^{\min (s_{i}, n)} \binom{n}{j} \binom{N - n}{s_{i} - j} e^{j {\hat{θ}}_{i} (y)}}, \end{align}

(24)

where

{\hat{θ}}_{i} (y) = arg max_{θ \in ℋ} f_{θ} (y | s_{i}) .

(25)

We call $f_{i}^{†} (t_{i} | s_{i}; ℋ)$ the normalized maximum likelihood (NML) associated with the hypothesis that $θ_{i} \in ℋ$ .

Thus, BF_i is approximated by

{\hat{BF}}_{i}^{†} = \frac{f_{i}^{†} (t_{i} | s_{i}; R ∖ \{0\})}{f_{i}^{†} (t_{i} | s_{i}; 0)},

(26)

which we call the NML ratio. (More generally, the logarithm of an NML ratio is interpreted as a measure of the evidential support for the alternative hypothesis over the null hypothesis [29, 30]). Therefore, by combining equations (8) and (9), if we guess the prior probability Π₀, the LFDR estimate of GO category i in the hypothesis comparison (7) is

{\hat{LFDR}}_{i}^{†} = {[1 + \frac{1 - Π_{0}}{Π_{0}} \times {\hat{BF}}_{i}^{†}]}^{- 1},

(27)

where ${\hat{BF}}_{i}^{†}$ is defined in equation (26). We call this LFDR estimator the NML estimator (NMLE).
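The NML ratio and the NMLE can be sketched as follows. Several points are assumptions of this illustration rather than statements about the authors' software: the maximum over θ ≠ 0 is treated as a supremum (so it equals 1 at the boundary of the support, where θ tends to ±∞, and equals the unrestricted maximum elsewhere by continuity); the interior maximizer is found by bisection on the likelihood equation E_θ[T] = y; and all function names are hypothetical. Under the null hypothesis ℋ = {0}, the normalizing constant is 1, so the NML reduces to f₀(t_i|s_i).

```python
from math import comb, exp, log

def log_weights(s, n, N):
    """Lower support bound and log binomial-coefficient products of equation (6)."""
    lo, hi = max(0, s + n - N), min(s, n)
    return lo, [log(comb(n, j)) + log(comb(N - n, s - j)) for j in range(lo, hi + 1)]

def pmf_vector(theta, lo, logw):
    """f_theta(. | s) over the whole support, computed on the log scale."""
    a = [w + (lo + k) * theta for k, w in enumerate(logw)]
    top = max(a)
    e = [exp(v - top) for v in a]
    total = sum(e)
    return [v / total for v in e]

def max_pmf(y, lo, logw, bound=60.0, iters=200):
    """Supremum over theta of f_theta(y | s), the numerator of equation (23)."""
    hi = lo + len(logw) - 1
    if y == lo or y == hi:
        return 1.0                       # approached as theta -> -inf or +inf
    a, b = -bound, bound
    for _ in range(iters):               # bisection on E_theta[T] = y
        mid = (a + b) / 2
        p = pmf_vector(mid, lo, logw)
        mean = sum((lo + k) * pk for k, pk in enumerate(p))
        if mean < y:
            a = mid
        else:
            b = mid
    return pmf_vector((a + b) / 2, lo, logw)[y - lo]

def nml_ratio(t, s, n, N):
    """NML ratio of equation (26)."""
    lo, logw = log_weights(s, n, N)
    k_alt = sum(max_pmf(y, lo, logw) for y in range(lo, lo + len(logw)))  # eq. (24)
    f_alt = max_pmf(t, lo, logw) / k_alt                                  # eq. (23)
    f_null = pmf_vector(0.0, lo, logw)[t - lo]
    return f_alt / f_null

def nmle_lfdr(t, s, n, N, pi0):
    """NMLE of equation (27) for a guessed prior probability pi0."""
    return 1.0 / (1.0 + (1.0 - pi0) / pi0 * nml_ratio(t, s, n, N))

# Made-up example: 8 of the 10 genes in the GO term are DE, with pi0 guessed as 0.9.
estimate = nmle_lfdr(t=8, s=10, n=20, N=200, pi0=0.9)
```

Because the NMLE uses the data of a single GO term only, the result depends directly on the guessed value of Π₀.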

To assess the reliability of the NML ratio ${\hat{BF}}_{i}^{†}$ for a particular data set, it will be compared to an empirical Bayes estimate of the Bayes factor that, unlike NML, simultaneously takes all GO terms into account. Solving equation (20) for the Bayes factor and substituting the estimates of equation (19) suggests

{\hat{BF}}_{i} = \frac{1 - {\hat{LFDR}}_{i}^{(k)}}{{\hat{LFDR}}_{i}^{(k)}} \times \frac{{\hat{Π}}_{0}}{1 - {\hat{Π}}_{0}}

(28)

as the empirical Bayes estimator of BF_i.

Results and discussion

In this section, we compare the LFDR estimators using simulated data and breast cancer data.

For each GO category, the p-value used in BBE to estimate LFDR is computed based on the 2-sided Fisher’s exact test. In computing MLEk (k = 2,3), θ_i (i = 1,…,k − 1) in equation (18) was constrained to lie between 0 and 10, whereas Π_i (i = 0,…,k − 1) in equation (18) was allowed to take any value between 0 and 1 such that $\sum_{i = 0}^{k - 1} Π_{i} = 1$ .

Simulation studies

The aim of the following simulation studies is to compare the LFDR estimation biases of BBE, MLE2, MLE3, HBE, and HBE-EN. NMLE is not taken into account because its performance depends not only on the data, but also on the specified prior probability Π₀.

The simulation setting involves 10,000 genes in a microarray with 200 genes identified as DE and 100 GO terms. We conducted a separate simulation study using each of these values of Π₀: 50%, 60%, 70%, 80%, 90%, and 94%.

Since the PMM behind MLE is optimal when the number of enriched GO terms is equal to the number of non-enriched GO terms, we assessed the sensitivity of MLE to that symmetry assumption by using strongly asymmetric log odds ratios and by using symmetric ones. For each GO term, two configurations were used in this simulation to choose log odds ratios: the asymmetric configuration shown in equation (29) and the symmetric configuration shown in equation (30). We used these values of the log₂ odds ratio of the ith GO term:

ϕ_{i}^{asymmetric} = \begin{cases} \frac{5 i}{100 (1 - Π_{0})}, & 1 \leq i \leq 100 (1 - Π_{0}), \\ 0, & 100 (1 - Π_{0}) < i \leq 100; \end{cases}

(29)

ϕ_{i}^{symmetric} = \begin{cases} \frac{5 \times 2 i}{100 (1 - Π_{0})}, & 1 \leq i \leq 50 (1 - Π_{0}), \\ 5 - \frac{5 \times 2 i}{100 (1 - Π_{0})}, & 50 (1 - Π_{0}) < i \leq 100 (1 - Π_{0}), \\ 0, & 100 (1 - Π_{0}) < i \leq 100 . \end{cases}

(30)
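The two configurations of equations (29) and (30) can be generated as in the sketch below. The function names are hypothetical, and writing the formulas in terms of a general number of GO terms m (instead of the fixed value 100) is an assumption made for illustration; with m = 100 the expressions coincide with equations (29) and (30).

```python
def phi_asymmetric(m, pi0):
    """Log2 odds ratios of equation (29) for m GO terms."""
    k = round(m * (1 - pi0))             # number of enriched GO terms
    return [5 * i / k if i <= k else 0.0 for i in range(1, m + 1)]

def phi_symmetric(m, pi0):
    """Log2 odds ratios of equation (30) for m GO terms."""
    k = round(m * (1 - pi0))
    values = []
    for i in range(1, m + 1):
        if i <= k / 2:
            values.append(5 * 2 * i / k)         # positive half
        elif i <= k:
            values.append(5 - 5 * 2 * i / k)     # negative half
        else:
            values.append(0.0)                   # non-enriched GO terms
    return values

# With Pi_0 = 90%, 10 of the 100 GO terms are enriched.
asym = phi_asymmetric(100, 0.9)
sym = phi_symmetric(100, 0.9)
```

For Π₀ = 90%, the asymmetric values run from 0.5 to 5 in steps of 0.5, whereas the symmetric values are 1, 2, …, 5 followed by −1, −2, …, −5.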

Considering the log odds ratios of all GO terms in each simulation study, we generated Table 1 for GO term i and for each of the 20 simulated data sets as follows:

x₁ is generated from a binomial distribution with parameter π₁ used in equation (2); π₁ is a real value randomly picked from 0 to 1.
x₂ is obtained from a binomial distribution with parameter $π_{2} = {[\frac{(1 - π_{1}) \times 2^{ϕ_{i}}}{π_{1}} + 1]}^{- 1}$ , obtained by solving
$ϕ_{i} = \log_{2} [π_{1} / (1 - π_{1})] - \log_{2} [π_{2} / (1 - π_{2})] .$
(31)

Thus, according to equation (4), we obtain ϕ_i = θ_i log₂ e for GO term i. Each of those artificial data sets represents what might have been a real data set such as that of the next subsection.
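The data-generating steps above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the binomial draws are implemented as sums of Bernoulli trials to stay within the Python standard library, and the default sizes (200 DE genes among 10,000 genes) are taken from the simulation setting described earlier.

```python
import random
from math import log, log2

def pi2_from_pi1(pi1, phi):
    """Solve equation (31) for pi2 given pi1 and the log2 odds ratio phi."""
    return 1.0 / ((1 - pi1) * 2 ** phi / pi1 + 1)

def phi_from_pis(pi1, pi2):
    """Equation (31): log2 odds ratio implied by pi1 and pi2."""
    return log2(pi1 / (1 - pi1)) - log2(pi2 / (1 - pi2))

def theta_from_phi(phi):
    """Natural-log odds ratio theta from phi = theta * log2(e)."""
    return phi * log(2)

def simulate_table(phi, n=200, N=10000, rng=random):
    """Draw (x1, x2) for one GO term with log2 odds ratio phi."""
    pi1 = rng.random()
    while pi1 == 0.0:                    # avoid division by zero in pi2_from_pi1
        pi1 = rng.random()
    pi2 = pi2_from_pi1(pi1, phi)
    x1 = sum(rng.random() < pi1 for _ in range(n))        # Binomial(n, pi1)
    x2 = sum(rng.random() < pi2 for _ in range(N - n))    # Binomial(N - n, pi2)
    return x1, x2

# One simulated GO term with a log2 odds ratio of 1, using a seeded generator.
x1, x2 = simulate_table(1.0, rng=random.Random(0))
```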

The p-value of each GO term used in BBE, HBE, and HBE-EN is obtained from the 2-sided Fisher’s exact test. The k-component PMM (k = 2 or k = 3) used in MLE is shown in equation (16) with $Π_{j} = (1 - Π_{0}) / k [j = 1, \dots, k]$ and $g_{θ_{i}} (t_{i} | s_{i})$ defined in equation (15). For every log odds ratio sequence, we estimated the LFDR for each GO term and each data set using BBE, MLE2, MLE3, HBE, and HBE-EN. We compared the performances of the 5 estimators by means of computing the absolute value of the estimated LFDR bias. The true LFDR is computed by equation (10), where

f_{0} (t_{i}) = \frac{\binom{n}{t_{i}} \binom{N - n}{s_{i} - t_{i}}}{\sum_{j = \max (0, s_{i} + n - N)}^{\min (s_{i}, n)} \binom{n}{j} \binom{N - n}{s_{i} - j}}

and f₁(t_i) is computed by

\frac{1}{J} \sum_{j = 1}^{J} f_{θ_{j}} (t_{i} | s_{i}),

where f_θ(t|s) is defined in equation (6).

Figure 1 shows the performance comparisons of the 5 LFDR estimators (i.e., BBE, MLE2, MLE3, HBE, and HBE-EN) for simulation data obtained from asymmetric and symmetric log odds ratios. The absolute LFDR biases estimated by BBE, MLE2, MLE3, and HBE-EN are similar. The absolute bias of LFDR estimated by HBE on symmetric log odds ratios is a little higher than that on asymmetric log odds ratios when the proportion of non-enriched GO terms is greater than 80%. Therefore, the estimated LFDR biases of the estimators are not strongly affected by whether the log odds ratios are symmetric or asymmetric.

To assess the performance of the 5 estimators for smaller numbers of GO terms, we added simulation studies using 2, 4, 8, and 32 as the total number of GO terms. The proportion of non-enriched GO terms (Π₀) and log odds ratios of simulation studies are shown in Table 2. The simulation studies were otherwise the same as those for 100 GO terms. Figure 2 shows the performance of LFDR estimators by means of computing the absolute estimated LFDR bias for 2, 4, 8, and 32 GO terms with log odds ratios based on formulas shown in Table 2.

Table 2 The proportion of non-enriched GO terms and the log2 odds ratios of GO terms used in the simulation studies

Considering every (m,Π₀) pair of the simulation studies with symmetric log odds ratios for the case of 100 GO terms, we recorded the LFDR estimator with the lowest absolute estimated LFDR bias among the 5 LFDR estimators (BBE, MLE2, MLE3, HBE, and HBE-EN). Moreover, we determined the maximum absolute LFDR bias over the proportion of non-enriched GO terms (Π₀) in order to evaluate the worst-case bias of each estimator at each value of m. Figure 3 shows the results.

Breast cancer data analysis

The single-channel microarray data set used here to illustrate our new methods is from an experiment applying an estrogen treatment to cells of a human breast cancer cell line [31]. The Affymetrix human genome U-95Av2 genechip data are from four samples from an estrogen receptor positive breast cancer cell line. Two of the samples were exposed to estrogen and then harvested after 10 hours. The remaining two samples were left untreated and then harvested after 10 hours. For simplicity of terminology, we call probes in the microarray experiment “genes.” The relevant data consist of measurements of gene expression across the reference class of 12,625 genes. The purpose of the study was to determine which genes are affected by the estrogen treatment. (For further information concerning the data, see Gentleman et al. [32].)

We applied the R function expresso in the affy package [33] of Bioconductor [34] to convert the raw probe intensities from the CEL data files to logarithms of gene expression levels without background correction. In doing so, we applied the “quantiles,” “pmonly,” and “medianpolish” [35] preprocessing settings.

We selected as genes of interest those that were differentially expressed between the treatment group and the control group according to the following criterion. Using the LFDR as the probability that a gene is EE, we considered genes with LFDR estimates below 0.2 as DE. In other words, we selected as DE genes those that were differentially expressed with estimated posterior probability of at least 80%. Considering the four samples of each gene in the microarray, we used the unpaired t-test with equal variances to compute the p-value. The LFDR of every gene was estimated using the theoretical null hypothesis method of Efron [15, 16]; the empirical null hypothesis method can lead to excessive bias due to deviations from normality [36]. When we compared gene expression data for the presence and absence of estrogen after 10 hours of exposure, we obtained 74 DE genes.

Defining unrelated pairs of GO terms as those that do not share any common ancestor, we selected for analysis all unrelated GO molecular function terms with at least 1 DE gene, thereby obtaining a total of 82 GO terms of interest. Figure 4 compares the BBE to the MLEs based on the 2-component (MLE2) and 3-component (MLE3) PMM. Figure 5 displays the probability mass of GO:0005524 under the null and alternative hypotheses of hypothesis comparison (7). Figure 6 compares MLE-based estimates of the Bayes factor given by equation (28) to the NML ratios given by equation (26).

For two GO terms, opposite conclusions would be drawn about their enrichment, depending on which estimator is used. As seen in Figure 4, the estimated LFDRs of GO:0051082 and GO:0005524 using MLE2 were 100%. However, the LFDRs estimated by MLE3 were essentially 0.

Using the MLE formula shown in equation (19) and the k-component PMM shown in equation (16), we conclude that the difference between the LFDRs of GO term i estimated by MLE2 and by MLE3 arose mainly from the sensitivity of the Bayes factor to the number of PMM components. Comparing the probability masses of GO:0005524 based on the 2- and 3-component PMMs shown in Figure 5, we found that the probability mass of GO:0005524 under the null hypothesis is larger than that under the alternative hypothesis based on the 2-component PMM (left plot in Figure 5). In contrast, the probability mass under the null hypothesis is smaller than that under the alternative hypothesis based on the 3-component PMM (right plot in Figure 5). Thus, the LFDR estimated by MLE is strongly dependent on the number of PMM components.
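The link between the Bayes factor and the LFDR can be made concrete with the standard two-group identity LFDR = Π₀ / (Π₀ + (1 − Π₀) B), where B is the ratio of the probability mass under the alternative hypothesis to that under the null hypothesis. The sketch below assumes this generic form, which may differ in notation from equation (19); the function name and the numerical inputs are hypothetical and are not values from the breast cancer data.

```python
def lfdr_from_bayes_factor(bayes_factor, pi0):
    """Posterior probability of the null given a Bayes factor (alt/null) and pi0."""
    if not 0.0 < pi0 <= 1.0:
        raise ValueError("pi0 must lie in (0, 1]")
    if bayes_factor < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    return pi0 / (pi0 + (1.0 - pi0) * bayes_factor)


# With pi0 held fixed, moving the Bayes factor across 1 flips the conclusion.
pi0 = 0.5
lfdr_null_favoured = lfdr_from_bayes_factor(0.01, pi0)  # mass larger under the null
lfdr_alt_favoured = lfdr_from_bayes_factor(99.0, pi0)   # mass larger under the alternative
```

A Bayes factor well below 1 pushes the LFDR toward 1 and a Bayes factor well above 1 pushes it toward 0, which is the pattern of opposite conclusions seen between MLE2 and MLE3.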

While a real data set can in that way indicate the impact of selecting an appropriate method, that impact does not in itself say which method has lowest bias. For that, we rely on the simulation study of the previous subsection.

Conclusions

As seen in Figure 1 and Figure 2, HBE and HBE-EN have relatively high biases for a small number and a medium number of GO terms, respectively. The performance comparison displayed in the left-hand side of Figure 3 indicates that BBE has the lowest minimum estimated LFDR bias for a small number of GO terms (i.e., 2-32 GO terms) when the proportion of non-enriched GO terms is 1. Although the minimum bias of BBE is not the lowest for some values of Π₀ under a small number of GO terms, it is very close to the lowest bias in the plots shown in Figure 2. The right-hand side of Figure 3 indicates that MLE3 has the lowest maximum absolute estimated LFDR bias for 100 GO terms. MLE exhibits bias similar to that of BBE when the number of GO terms is much larger than k, except when the proportion of non-enriched GO terms is high (close to 1). Moreover, MLE3 has lower bias than MLE2 as an LFDR estimator. Due to its conservatism and freedom from the PMM, we recommend using BBE for a small number of GO terms of interest (2-32 GO terms) and MLE for a medium number of GO terms of interest (100 GO terms).

Finally, we recommend that NMLE be used when there is only 1 GO term of interest since none of the other estimators is able to estimate LFDR in such a case except by conservatively giving 1 as the estimate. Otherwise, unless Π₀ is known with sufficient accuracy, NMLE is not recommended since it depends not only on the data but also on a guess of the value of Π₀, which in the absence of strong prior information, is often set to the default value of 50%. A closely related approach is to use the logarithm of the NML ratio as a measure of statistical support for the enrichment hypothesis [30] without guessing Π₀. By using 10 and 100 as thresholds of the approximate Bayes factors from equations (26) and (28) to determine whether a GO term is enriched, we reached similar conclusions with both NML and MLE (Figure 6). Thus, in our data set, the NML ratio tends to estimate the Bayes factor almost as accurately as methods that simultaneously use information across GO terms. While we do not expect the same for all data sets, we note that similar results have been found for an application of a modified NML [29] to a proteomics data set [30].
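The recommendations of this section can be summarized as a small lookup, sketched here in Python. The function name is hypothetical, and the handling of term counts outside those studied (2-32 and 100) is an assumption: the sketch returns None rather than extrapolating.

```python
def recommend_estimator(n_go_terms):
    """LFDR estimator suggested by the conclusions, or None if not covered."""
    if n_go_terms < 1:
        raise ValueError("at least one GO term is required")
    if n_go_terms == 1:
        return "NMLE"  # the other estimators can only return 1 in this case
    if 2 <= n_go_terms <= 32:
        return "BBE"   # small number of GO terms
    if n_go_terms == 100:
        return "MLE"   # medium number of GO terms
    return None        # no explicit recommendation for this count


suggestions = {n: recommend_estimator(n) for n in (1, 8, 50, 100)}
```

The lookup does not encode the caveat about Π₀: for more than one GO term, NMLE remains an option only when Π₀ is known with sufficient accuracy.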

References

1. Altshuler D, Daly MJ, Lander ES: Genetic mapping in human disease. Science 2008, 322: 881-888. 10.1126/science.1156409
2. Rhee SY, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nat Rev Genet 2008, 9(7): 509-515. 10.1038/nrg2363
3. Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28: 27-30. 10.1093/nar/28.1.27
4. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki R: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4: P3. 10.1186/gb-2003-4-5-p3
5. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003, 4: R7. 10.1186/gb-2003-4-1-r7
6. Khatri P, Draghici S, Ostermeier G, Krawetz S: Profiling gene expression using onto-express. Genomics 2002, 79: 266-270. 10.1006/geno.2002.6698
7. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4: R28. 10.1186/gb-2003-4-4-r28
8. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009, 37: 1-13. 10.1093/nar/gkn923
9. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995, 57: 289-300.
10. Min JL, Barrett A, Watts T, Pettersson FH, Lockstone HE, Lindgren CM, Taylor JM, Allen M, Zondervan KT, McCarthy MI: Variability of gene expression profiles in human blood and lymphoblastoid cell lines. BMC Genomics 2010, 11: 96. 10.1186/1471-2164-11-96
11. Reyal F, van Vliet MH, Armstrong NJ, Horlings HM, de Visser KE, Kok M, Teschendorff AE, Mook S, van’t Veer L, Caldas C, Salmon RJ, Vijver MJVD, Wessels LFA: A comprehensive analysis of prognostic signatures reveals the high predictive capacity of the proliferation, immune response and RNA splicing modules in breast cancer. Breast Cancer Res 2008, 10: R93. 10.1186/bcr2192
12. Wang R, Bencic D, Lazorchak J, Villeneuve D, Ankley GT: Transcriptional regulatory dynamics of the hypothalamic-pituitary-gonadal axis and its peripheral pathways as impacted by the 3-beta HSD inhibitor trilostane in zebrafish (Danio rerio). Ecotoxicol Environ Saf 2011, 74: 1461-1470. 10.1016/j.ecoenv.2011.05.001
13. Storey JD: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003, 31: 2013-2035. 10.1214/aos/1074290335
14. Hong WJ, Tibshirani R, Chu G: Local false discovery rate facilitates comparison of different microarray experiments. Nucleic Acids Res 2009, 37: 7483-7497. 10.1093/nar/gkp813
15. Efron B: Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J Am Stat Assoc 2004, 99: 96-104. 10.1198/016214504000000089
16. Efron B: Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge: Cambridge University Press; 2010.
17. Bickel DR: Simple estimators of false discovery rates given as few as one or two p-values without strong parametric assumptions. Stat Appl Genet Mol Biol, in press.
18. Bickel DR: Small-scale inference: empirical Bayes and confidence methods for as few as a single comparison. Tech Rep, Ottawa Inst Syst Biol; 2011. arXiv:1104.0341.
19. Padilla M, Bickel DR: Empirical Bayes methods corrected for small numbers of tests. Stat Appl Genet Mol Biol 2012, 11(5): art. 4.
20. Severini T: Likelihood Methods in Statistics. Oxford: Oxford University Press; 2000.
21. Barndorff-Nielsen OE, Cox DR: Inference and Asymptotics. London: CRC Press; 1994.
22. Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002, 23: 70-86. 10.1002/gepi.1124
23. Dalmasso C, Bar-Hen A, Broët P: A constrained polynomial regression procedure for estimating the local false discovery rate. BMC Bioinformatics 2007, 8: 229. 10.1186/1471-2105-8-229
24. Good IJ: How to estimate probabilities. IMA J Appl Math 1966, 2: 364-383. 10.1093/imamat/2.4.364
25. Pawitan Y, Murthy K, Michiels S, Ploner A: Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 2005, 21: 3865-3872. 10.1093/bioinformatics/bti626
26. Muralidharan O: An empirical Bayes mixture method for effect size and false discovery rate estimation. Ann Appl Stat 2010, 4: 422-438.
27. Yang Y, Aghababazadeh FA, Bickel DR: Parametric estimation of the local false discovery rate for identifying genetic associations. IEEE/ACM Trans Comput Biol Bioinformatics 2012. Online ahead of print at http://dx.doi.org/10.1109/TCBB.2012.140
28. Grünwald PD: The Minimum Description Length Principle. London: MIT Press; 2007.
29. Bickel DR: A predictive approach to measuring the strength of statistical evidence for single and multiple comparisons. Can J Stat 2011, 39: 610-631. 10.1002/cjs.10109
30. Bickel DR: Minimax-optimal strength of statistical evidence for a composite alternative hypothesis. Int Stat Rev 2013, in press. 2011 version available at arXiv:1101.0305
31. Scholtens D, Miron A, Merchant FM, Miller A, Miron PL, Iglehart JD, Gentleman R: Analyzing factorial designed microarray experiments. J Multivariate Anal 2004, 90: 19-43. 10.1016/j.jmva.2004.02.004
32. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S (Eds): Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer; 2005.
33. Gautier L, Cope L, Bolstad BM, Irizarry RA: Affy—analysis of Affymetrix Gene Chip data at the probe level. Bioinformatics 2004, 20(3): 307-315. 10.1093/bioinformatics/btg405
34. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. 10.1186/gb-2004-5-10-r80
35. Tukey JW: Exploratory Data Analysis. Reading: Addison-Wesley; 1977.
36. Bickel DR: Estimating the null distribution to adjust observed confidence levels for genome-scale screening. Biometrics 2011, 67: 363-370. 10.1111/j.1541-0420.2010.01491.x
37. Jeffreys H: Theory of Probability. London: Oxford University Press; 1948.
38. Bickel DR: The strength of statistical evidence for composite hypotheses: inference to the best explanation. Statistica Sinica 2012, 22: 1147-1198.

Acknowledgements

We thank the two anonymous reviewers for comments that led to improvements in the manuscript, particularly those that led to clearer communication or to additional simulation studies. We also thank both Editage and Donna Reeder for the detailed copyediting. We are grateful to Corey Yanofsky and Ye Yang for the useful discussions. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada, by the Canada Foundation for Innovation, by the Ministry of Research and Innovation of Ontario, and by the Faculty of Medicine of the University of Ottawa.

Author information

Authors and Affiliations

Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology, and Immunology, Department of Mathematics and Statistics, University of Ottawa, 451 Smyth Road, Ottawa, Ontario, K1H 8M5, Canada
Zhenyu Yang & David R Bickel
School of Foundation, Shenyang Pharmaceutical University, No. 103 Wenhua Road, Shenyang, Liaoning, 110016, China
Zuojing Li

Corresponding author

Correspondence to David R Bickel.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ZY implemented the NMLE function, co-designed and executed the simulation study, carried out the comparison among 3 LFDR estimators (BBE, MLE, and NMLE) using the breast cancer data, and drafted the manuscript. ZL implemented the 3-component MLE functions. DRB suggested and guided the project, contributed to writing the paper, co-designed the simulation study, and provided the BBE function. All authors have read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yang, Z., Li, Z. & Bickel, D.R. Empirical Bayes estimation of posterior probabilities of enrichment: A comparative study of five estimators of the local false discovery rate. BMC Bioinformatics 14, 87 (2013). https://doi.org/10.1186/1471-2105-14-87

Received: 20 December 2011
Accepted: 11 February 2013
Published: 06 March 2013
DOI: https://doi.org/10.1186/1471-2105-14-87
