Control of false discoveries in grouped hypothesis testing for eQTL data
BMC Bioinformatics volume 25, Article number: 147 (2024)
Abstract
Background
Expression quantitative trait locus (eQTL) analysis aims to detect the genetic variants that influence the expression of one or more genes. Gene-level eQTL testing forms a natural grouped-hypothesis testing strategy with clear biological importance. Methods to control family-wise error rate or false discovery rate for group testing have been proposed earlier, but may not be powerful or easily applicable to eQTL data, for which certain structured alternatives may be defensible and may enable the researcher to avoid overly conservative approaches.
Results
In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with SNPs annotated to the gene corresponding to individual hypotheses. The heterogeneity of effect sizes in different groups is accommodated by the introduction of a random effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for alternative hypotheses for the eQTL data and controls the FDR by adaptive thresholding. As a convenient alternative approach, we also propose Z-REG-FDR, an approximate version of REG-FDR that uses only Z-statistics of association between genotype and expression for each gene-SNP pair. The performance of Z-REG-FDR is evaluated using both simulated and real data. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR, but with much improved computational speed.
Conclusion
Our results demonstrate that the Z-REG-FDR method performs favorably compared to other methods in terms of statistical power and control of FDR. It can be of great practical use for grouped hypothesis testing in eQTL analysis or similar problems in statistical genomics due to its fast computation and ability to be fit using only summary data.
Background
Expression quantitative trait locus (eQTL) analysis aims to detect genetic loci that are associated with the expression of one or more genes [1]. For each gene, expression can be considered as a quantitative trait potentially associated with the genotypes at different sites in the genome, typically single nucleotide polymorphisms (SNPs) [2]. Although there is a substantial literature on both eQTL mapping [3,4,5] and grouped hypothesis testing [6,7,8], consideration of the natural gene-level grouping of the SNPs, e.g., SNPs local to a gene for a cis-eQTL problem, is comparatively unexplored or requires permutation methods or approximations [9, 10]. Analysis of gene-level eQTLs and meaningful consideration of causal SNPs is an important biological problem [11]. Testing whether there is any eQTL (local SNP) for an entire gene, while controlling the false discovery rate (FDR) across the set of all genes, is of interest for various reasons, and has been imperfectly addressed in the "eGene" concept employed by the GTEx Consortium [12].
Local (cis) eQTL testing includes tests of individual SNPs near a gene, which leads to summaries at the gene level [12]. The natural hierarchical organization would suggest standard methods for group-level testing [6, 13]. However, local eQTL testing offers additional structure to be exploited: (i) the number of cis-eQTLs is typically large, so that explicit consideration of the proportion and "strength" of alternatives is possible; (ii) conditional analyses of discovered eQTLs suggest that, to a first approximation, most local eQTLs can be considered unique within the region [14]; (iii) correlation of test statistics is driven by regional SNP correlation.
In the following sections, we discuss the structure of eQTL data and how its grouped nature can be effectively modeled using a random effects model. We consider the case of cis-eQTLs, i.e. local to the gene [14, 15], where the variant affecting the gene expression is in the immediate neighborhood of the gene. Our proposed method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), uses an empirical Bayes framework to model the eQTL data and controls the FDR by adaptive thresholding. We also propose an alternative approach, Z-REG-FDR, an approximate version of REG-FDR that uses only the summary measures given by the Z-statistics of association between genotype and expression for each gene-SNP pair. We demonstrate using simulations and real data analysis that this approximate version performs well compared to other possible approaches while having a much faster computation time.
Methods
Structure of the eQTL data and the hypotheses
eQTL data can typically be expressed in the form of an expression matrix, consisting of N genes, along with a genotype matrix which has genotypes (m SNPs) for the same n sample units. We denote the expression matrix as \(Y_{N \times n}\) and the genotype submatrix corresponding to the ith gene as \(X^{(i)}_{m_i \times n}\), \(i=1,2,\ldots ,N\), where \(m_i\) is the number of SNPs local to the ith gene. Linear modeling of eQTLs typically includes additional covariates, such as expression cofactors [12, 16]. The t-statistics for the partial correlations between Y and \(X^{(i)}_{m_i \times n}\), after both are adjusted for covariates, are equivalent to the Wald statistics for \(X^{(i)}_{m_i \times n}\) when conducting the full linear regression in which Y is modeled as a function of \(X^{(i)}_{m_i \times n}\) and the additional covariates [17, 18]. We assume that the sample size n is large enough that the residual degrees of freedom for the t-statistic are sufficient to use a standard normal approximation. Thus we can directly work with z-statistics for Y and \(X^{(i)}_{m_i \times n}\), where we consider each of these matrices to have been covariate-residualized.
Let \(H_{0ij}\) denote the gene-SNP level null hypothesis that there is no eQTL at the jth SNP local to the ith gene, \(j=1,2,\ldots ,m_i, i=1,2,\ldots ,N\). Therefore there are \(\sum _{i=1}^N{m_i}\) gene-SNP level tests. These tests can be grouped into N groups corresponding to the N genes with \(m_i\) tests in the ith group. Define \(H_{0i}\) to be the gene-level null hypothesis for the ith gene that there is no eQTL for the ith gene. Therefore the gene-level null hypothesis \(H_{0i}\) can be written as
$$H_{0i}=\bigcap _{j=1}^{m_i}H_{0ij},$$
i.e. the gene-level null requires that all of the corresponding \(m_i\) gene-SNP level hypotheses be null.
An empirical Bayes model
We adopt an empirical Bayes approach for controlling the gene-level FDR. Empirical Bayes approaches have been used in many genetic applications, and indeed these applications have been a prime motivator for the methods [19, 20]. The advantages of using an empirical Bayes approach based on the local false discovery rate (lfdr), instead of p-value-based FDR-controlling approaches, have been discussed in [21] and [22]. The lfdr corresponding to the gene-level null hypothesis \(H_{0i}\) is
$$\lambda _{i}=\lambda _i(Y_i,X^{(i)})=P\left( H_{0i}\mid Y_i,X^{(i)}\right) .$$
Here \(Y_i\) denotes the ith row of the matrix Y. If we obtain the lfdr \(\lambda _{i}\) for each of the gene-level hypotheses, we can control the FDR at target level \(\alpha \) for gene-level testing, using the following adaptive thresholding procedure, which has been used extensively in the literature [7, 23,24,25].

1.
Order the genes by their lfdr values, obtaining indices \(i_1,i_2,\ldots ,i_N\) such that \(\lambda _{i_1} \le \lambda _{i_2} \le \cdots \le \lambda _{i_N}\).

2.
Reject hypotheses \(H_{0i_1},\ldots ,H_{0i_L}\) where L is the largest integer such that
$$\begin{aligned} \frac{1}{L}\sum _{l=1}^L{\lambda _{i_l}} \le \alpha . \end{aligned}$$
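The two steps above can be sketched in a few lines of NumPy (a minimal illustration; the function name and interface are ours, not from the paper's software):

```python
import numpy as np

def adaptive_threshold(lfdr, alpha=0.05):
    """Reject the hypotheses with the smallest lfdr values, taking the largest
    L such that the mean of the L smallest lfdr values is at most alpha.
    Returns a boolean rejection mask aligned with the input."""
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(lfdr)                      # step 1: sort lfdr ascending
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    passed = running_mean <= alpha                # step 2: running mean <= alpha
    L = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(lfdr.size, dtype=bool)
    reject[order[:L]] = True
    return reject

# Example: three near-certain signals, two null-like genes.
lfdrs = [0.01, 0.02, 0.03, 0.6, 0.9]
print(adaptive_threshold(lfdrs, alpha=0.05))   # -> [ True  True  True False False]
```

Note that the running mean of the rejected lfdr values is exactly the estimated FDR of the rejection set, which is why the stopping rule controls the FDR at level \(\alpha \).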
It was shown in [24], and subsequently in [7], that the adaptive thresholding procedure is valid in the sense that it controls the FDR at target level \(\alpha \) for an 'oracle' procedure in which the true parameters of the model are assumed to be known. It is asymptotically valid for a 'data-driven' procedure when the parameters are consistently estimated from the data. [25] proved its validity under further relaxed conditions. The proof makes use of the following result (Averaging Theorem, [19]).
Let \(\text {lfdr}(z)=P(H_0\mid z)\) denote the lfdr for observed data z. Then, for a rejection region \({\mathcal {R}}\), the FDR will be given by
$$\text {FDR}({\mathcal {R}})=E\left[ \text {lfdr}(z)\mid z\in {\mathcal {R}}\right] .$$
The adaptive thresholding procedure can be used to control the FDR for testing the gene-level hypotheses \(H_{0i}\), and a similar procedure can be used to test the gene-SNP level hypotheses \(H_{0ij}\). However, obtaining the gene-level lfdr's is a nontrivial problem. In the next section, we propose a model which enables us to calculate them.
The random effects model and testing procedure for group-level FDR control (REG-FDR)
Here we propose a model to obtain the gene-level lfdr values, which can subsequently be used to test the gene-level hypotheses while controlling the FDR using the adaptive thresholding method. The model is based on the following assumptions.

A1.
For any gene i, under the gene-level alternative hypothesis \(H_{0i}^c\), there exists a single causal SNP that influences its expression.

A2.
Each of the \(m_i\) SNPs is equally likely to be the causal SNP.
First, we note that Assumption (A1) is at best a simplification, but very large eQTL studies have supported the view that most genes with eQTLs have a primary local eQTL [26], with other loci having much smaller effect sizes. We therefore treat A1 as a ‘workable condition’ [27,28,29].
Assumption (A2) could easily be relaxed, and one might use a distributional assumption over the SNPs as a modest modification of our method below (see the Discussion section). We note that it is trivial to enforce Assumption (A2) by, for example, randomizing the SNP identities within gene i prior to analysis.
Under these assumptions, the gene-level lfdr for the ith gene has the following form:
$$\lambda _i(Y_i,X^{(i)})=\frac{\pi _0 f_0(Y_i)}{\pi _0 f_0(Y_i)+(1-\pi _0)\frac{1}{m_i}\sum _{j=1}^{m_i}\int f_1(Y_i\mid X_j^{(i)},\beta _{ij})\,g(\beta _{ij})\,d\beta _{ij}},$$
where \(\pi _0=P(H_{0i})\) is the prior probability of \(H_{0i}\), \(f_0(Y_i)\) is the density of \(Y_i\) under the null, \(f_1(Y_i\mid X_j^{(i)},\beta _{ij})\) is the conditional density under the alternative given that the jth SNP is causal, and g is the density of the random effect \(\beta _{ij}\) defined below. Here \(\beta _{ij}\) is the correlation between the expression of the ith gene and the causal SNP j. Note that the marginal density \(p(X^{(i)})\) cancels from numerator and denominator. Importantly, this cancellation allows us to bypass modeling the dependence structure of the SNPs, which otherwise might be difficult to estimate accurately.
We assume that \(f_0(.)\) is the density of the \(N_n(0,I_n)\) distribution (noting that expression data can be normalized), and that \(f_1(.\mid X_j^{(i)},\beta _{ij})\) is the density of the \(N_n(\beta _{ij} X_j^{(i)}, (1-\beta _{ij}^2)I_n)\) distribution, where \(\beta _{ij}\) is the correlation between \(Y_i\) and \(X_j^{(i)}\). This choice of \(f_1\) ensures that the unconditional variance of \(Y_i\) is free of \(\beta _{ij}\). To account for variability across genes, we assume \(\beta _{ij}\) to be a random effect such that \(\sqrt{n-3}\,\tanh ^{-1}(\beta _{ij})\) follows a \(N(0,\sigma ^2)\) distribution. As \(\beta _{ij}\) is a correlation coefficient, the Fisher transformation is used to ensure that the variance does not depend on the mean. Moreover, \(\sigma \) will be estimated from the data, and so the apparent dependence on n is not important to the procedure.
Our procedure treats the genotype values as fixed, and assuming the expression of genes to be independent given genotypes, we can estimate \(\pi _0\) and \(\sigma \) using a maximum likelihood approach, followed by plug-in estimates to obtain \(\lambda _i(Y_i,X^{(i)})\) from Eq. 5. The assumption that the expressions of different genes are independent is violated in general, but our approach can be viewed as employing a composite likelihood [30], and is thus consistent for \(\pi _0\) and \(\sigma \) even when independence is violated [31]. An EM algorithm is used (see Additional file 1: Section 1) for the maximum likelihood estimation. The procedure enables us to use the adaptive thresholding procedure to provide proper gene-level control of the FDR.
The Z-REG-FDR model
One computational challenge presented by the REG-FDR model is that the density \(f_1(Y_i\mid X_j^{(i)})\) does not have a closed form expression. While it can be expressed as the integral
$$f_1(Y_i\mid X_j^{(i)})=\int f_1(Y_i\mid X_j^{(i)},\beta _{ij})\,g(\beta _{ij})\,d\beta _{ij},$$
where g is the density of the random effect \(\beta _{ij}\), numeric maximum likelihood estimation is computationally burdensome. We propose an alternative model, termed Z-REG-FDR, which avoids this problem. In this approach, we consider the Fisher transformed and scaled z-statistics as our data. Thus, for each gene i, we have a vector of z-statistics
\(z^{(i)} = (z_{1}^{(i)},z_{2}^{(i)},\ldots ,z_{m_i}^{(i)}) , \ i=1,2,\ldots ,N,\)
where \(z_{j}^{(i)}=\sqrt{n-3}\,\tanh ^{-1}(r_{j}^{(i)})\) and \(r_{j}^{(i)}\) is the sample correlation of \(Y_i\) and \(X_j^{(i)}\).
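The construction of the gene-level z-statistic vector can be illustrated with a short sketch (assuming covariate-residualized inputs; the function name and toy data are illustrative, not from the paper's software):

```python
import numpy as np

def gene_z_statistics(Y_i, X_i):
    """Given an expression vector Y_i (length n) and a local genotype matrix
    X_i (m_i SNPs x n samples), return the Fisher-transformed, scaled
    z-statistics z_j = sqrt(n - 3) * atanh(r_j)."""
    n = Y_i.size
    y = (Y_i - Y_i.mean()) / Y_i.std()
    X = (X_i - X_i.mean(axis=1, keepdims=True)) / X_i.std(axis=1, keepdims=True)
    r = X @ y / n                              # sample correlations r_j
    return np.sqrt(n - 3) * np.arctanh(r)

# Toy gene: 5 local SNPs, 200 samples, SNP 0 causal.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
Y = 0.5 * X[0] + rng.normal(size=200)
z = gene_z_statistics(Y, X)
print(np.abs(z).argmax())                      # the causal SNP should dominate
```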
The Fisher transformation and scaling ensure that \(z^{(i)}\) is approximately normal and that the variance of each component is approximately 1 under both null and alternative. Under the null, the mean of \(z^{(i)}\) is zero. We treat the vectors \(z^{(i)}\) as if they were independent across different genes, again relying on approximate conditional independence (given genotypes) and a composite likelihood interpretation.
The Z-REG-FDR procedure is based on an assumption additional to (A1) and (A2) above. If the kth SNP is causal, we assume (Assumption (A3)) that the distribution of \((z_{1}^{(i)},\ldots ,z_{k-1}^{(i)},z_{k+1}^{(i)},\ldots ,z_{m_i}^{(i)})\) given \(z_{k}^{(i)}\) under the alternative is the same as that under the null. In particular, we note that this assumption holds if the components of \(z^{(i)}\) have a Markov dependence structure with the same serial correlation under null and alternative, which is true in the special case that the successive marker correlations are zero. In general, this assumption can be violated, but as shown in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section, the resultant procedure appears to work well in many circumstances as an approximate maximum likelihood method even when Assumption (A3) is not satisfied.
Under the above assumptions, we can write the joint distribution of the random vector \(z^{(i)}=(z_{1}^{(i)},z_{2}^{(i)},\ldots ,z_{m_i}^{(i)})\) as
$$p_0(z_{k}^{(i)})\,f_{0k}\left( z_{1}^{(i)},\ldots ,z_{k-1}^{(i)},z_{k+1}^{(i)},\ldots ,z_{m_i}^{(i)}\mid z_{k}^{(i)}\right) $$
under the null, and
$$p_1(z_{k}^{(i)})\,f_{0k}\left( z_{1}^{(i)},\ldots ,z_{k-1}^{(i)},z_{k+1}^{(i)},\ldots ,z_{m_i}^{(i)}\mid z_{k}^{(i)}\right) $$
under the alternative in which the kth SNP is causal. We assume \(p_0(.)\) to be N(0, 1) and \(p_1(.)\) to be \(N(\mu ,1)\), where \(\mu \) is assumed to be random with a \(N(0,\sigma ^2)\) distribution, so that marginally \(p_1\) is the \(N(0,1+\sigma ^2)\) density. We do not assume anything about the form of \(f_{0k}\) except that it does not involve the parameters \(\pi _0\) and \(\sigma \). Under these assumptions, the gene-level lfdr for this model reduces to
$$\lambda _i(z^{(i)})=\frac{\pi _0}{\pi _0+(1-\pi _0)\frac{1}{m_i}\sum _{k=1}^{m_i}\frac{p_1(z_{k}^{(i)})}{p_0(z_{k}^{(i)})}}.$$
This follows from the cancellation of \(f_{0k}(z_{1}^{(i)},\ldots ,z_{k-1}^{(i)},z_{k+1}^{(i)},\ldots ,z_{m_i}^{(i)}\mid z_{k}^{(i)})\) in the numerator and denominator. While estimating \(\pi _0\) and \(\sigma \), a similar cancellation helps us bypass maximizing the full (approximate) likelihood
$$\prod _{i=1}^{N}\left[ \pi _0\,f_{0}^{(i)}(z^{(i)})+(1-\pi _0)\frac{1}{m_i}\sum _{k=1}^{m_i}p_1(z_{k}^{(i)})\,f_{0k}\left( z_{1}^{(i)},\ldots ,z_{k-1}^{(i)},z_{k+1}^{(i)},\ldots ,z_{m_i}^{(i)}\mid z_{k}^{(i)}\right) \right] ,$$
where \(f_{0}^{(i)}\) denotes the joint null density of \(z^{(i)}\), which factors as \(p_0(z_{k}^{(i)})\,f_{0k}(\cdot \mid z_{k}^{(i)})\) for every k. Instead, we maximize
$$\prod _{i=1}^{N}\left[ \pi _0+(1-\pi _0)\frac{1}{m_i}\sum _{k=1}^{m_i}\frac{p_1(z_{k}^{(i)})}{p_0(z_{k}^{(i)})}\right] .$$
This is equivalent to maximum likelihood estimation under the assumption that \(f_{0k}\) does not involve the parameters \(\pi _0\) and \(\sigma \). Note that we need to estimate only the parameters \(\pi _0\) and \(\sigma \) to obtain the gene-level lfdr using Eq. 9.
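The fitting procedure just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: we assume an equal number of local SNPs per gene, use the marginal \(N(0,1+\sigma ^2)\) form of \(p_1\), and let a generic SciPy optimizer stand in for a dedicated routine; all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def likelihood_ratios(Z, sigma):
    """Per-gene average of p1(z)/p0(z), where p0 is N(0,1) and p1 is the
    marginal N(0, 1 + sigma^2) density. Z has shape (genes, SNPs per gene)."""
    s2 = 1.0 + sigma ** 2
    # log of phi(z; 0, s2) / phi(z; 0, 1), computed in closed form
    log_ratio = -0.5 * np.log(s2) + 0.5 * Z ** 2 * (1.0 - 1.0 / s2)
    return np.exp(log_ratio).mean(axis=1)

def neg_pseudo_loglik(params, Z):
    pi0, sigma = params
    return -np.sum(np.log(pi0 + (1.0 - pi0) * likelihood_ratios(Z, sigma)))

def fit_z_reg_fdr(Z):
    """Maximize the pseudo-log-likelihood over (pi0, sigma)."""
    res = minimize(neg_pseudo_loglik, x0=[0.8, 2.0], args=(Z,),
                   bounds=[(1e-4, 1 - 1e-4), (1e-2, 20.0)])
    return res.x

def gene_lfdr(Z, pi0, sigma):
    """Gene-level lfdr from the reduced form (Eq. 9)."""
    lr = likelihood_ratios(Z, sigma)
    return pi0 / (pi0 + (1.0 - pi0) * lr)

# Toy data: 2000 genes, 10 local SNPs each, 30% of genes have one causal SNP.
rng = np.random.default_rng(1)
pi0_true, sigma_true = 0.7, 3.0
Z = rng.normal(size=(2000, 10))
has_eqtl = rng.random(2000) > pi0_true
causal = rng.integers(10, size=2000)
Z[has_eqtl, causal[has_eqtl]] += rng.normal(scale=sigma_true, size=has_eqtl.sum())
pi0_hat, sigma_hat = fit_z_reg_fdr(Z)
```

The fitted \((\hat{\pi }_0,\hat{\sigma })\) can then be plugged into `gene_lfdr` and the adaptive thresholding step to call significant genes.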
When the required assumptions are not satisfied, this method still has value as an approximate maximum likelihood approach. For instance, when the \(X_j^{(i)}\)'s are related by an AR(1) structure, it can be shown that the correlation between the z-statistics depends on the effect size, i.e. the correlation between \(Y_i\) and the causal SNP, thereby violating Assumption (A3). Additional file 1: Lemma 1 and Additional file 1: Figures 1 and 2 show the extent to which the conditional distribution \(f_{0k}\) might depend on the effect size for any correlation structure among normally distributed SNPs. However, our results in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section demonstrate that this does not have a significant adverse effect on the performance of the estimation and the control of false discoveries.
Results
Simulations: performance of Z-REG-FDR when all assumptions are satisfied
First, we conducted a simulation study to explore the performance of Z-REG-FDR under the ideal situation where all assumptions are satisfied. Table 1 shows the results for simulated datasets (1000 simulations of datasets with 10,000 genes and 200 samples) where the z's are directly simulated from an autoregressive structure, and therefore Assumption (A3) is also satisfied. The estimates are accurate to within about 15% when the true \(\sigma \) is at least 2.0. The control of the FDR is also satisfactory for \(\sigma > 2\). However, the performance is not as good for small \(\sigma \), because it is difficult to separate the null and alternative cases when the effect sizes are small; this is true even when all the assumptions are satisfied. This is a property of the two-group mixture model in the empirical Bayes setup, and not a limitation due to the approximate nature of Z-REG-FDR.
Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation
We wished to study the accuracy of the estimation under the approximations employed and for a relatively small sample size, in order to ensure that the approach can work in this challenging situation. Accordingly, we simulated data using the covariate-adjusted genotype matrix of a real dataset from the GTEx project (V3) [12]. The genotype matrix corresponding to the tissue 'heart', which had 83 samples, was selected for analysis. For computational purposes, 10,000 genes were chosen randomly from 28,991 genes. Use of genotype matrices from real data ensures that we are not enforcing Assumption (A3) while simulating, and our choice of \(f_{0k}\) for the simulation is obtained from the data. We simulate the \(Y_i\)'s (1,000 simulations) using the following scheme.

1.
For each gene, decide whether it has an eQTL using a Bernoulli draw, so that the gene is null with probability \(\pi _0\).

2.
If the gene has an eQTL, pick a causal SNP using a discrete uniform distribution over the \(m_i\) SNPs. Let it be the kth SNP.

3.
If the gene has an eQTL, simulate each element of \(Y_i\) from \(N(\beta _{ik} X_k^{(i)}, 1-\beta _{ik}^2)\), with \(\sqrt{n-3}\,\tanh ^{-1}(\beta _{ik})\) simulated from a \(N(0,\sigma ^2)\) distribution. If the gene doesn't have an eQTL, simulate each element of \(Y_i\) from N(0, 1).
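The three steps above can be sketched in a few lines (illustrative names; genotype matrices are assumed standardized, and the parameter values are toy choices):

```python
import numpy as np

def simulate_expression(X_list, pi0, sigma, rng):
    """Simulate one expression vector per gene following the scheme above:
    Bernoulli null indicator, uniform causal SNP, Fisher-scale random effect.
    X_list holds standardized genotype matrices (m_i SNPs x n samples)."""
    Y, truth = [], []
    for X in X_list:
        m, n = X.shape
        if rng.random() < pi0:                    # step 1: gene is null
            Y.append(rng.normal(size=n))
            truth.append(False)
        else:
            k = rng.integers(m)                   # step 2: uniform causal SNP
            # step 3: effect size on the Fisher-transformed scale
            beta = np.tanh(rng.normal(scale=sigma) / np.sqrt(n - 3))
            Y.append(beta * X[k] + rng.normal(scale=np.sqrt(1 - beta**2), size=n))
            truth.append(True)
    return Y, np.array(truth)

rng = np.random.default_rng(2)
X_list = [rng.normal(size=(8, 100)) for _ in range(50)]
Y, truth = simulate_expression(X_list, pi0=0.6, sigma=3.0, rng=rng)
```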
Table 2 shows the results for these data, indicating that the estimates are still accurate and control of FDR is satisfactory unless \(\sigma \) is very small. Large eQTL studies have observed large effect sizes for cis-eQTL analysis [15, 32], which implies that \(\sigma \) is not expected to be very small. Thus our numerical results indicate that the Z-REG-FDR method has valid applications for eQTL data.
Figure 1 shows the plot of REG-FDR estimates against the Z-REG-FDR estimates for 500 simulated datasets using the simulation scheme described above. It is clear from the plot that the two methods agree with each other (with correlations 0.906 and 0.952 for \(\pi _0\) and \(\sigma \), respectively), with estimates largely falling near the identity line. These results suggest that the approximate maximum likelihood method in Z-REG-FDR is quite effective in controlling the FDR, with much improved computation speed: a few minutes on a single computer for a dataset with 10,000 genes and 100–200 samples, as opposed to more than a day for REG-FDR. A comparison of the estimated lfdr and estimated FDR of the two methods is shown in Fig. 2. It is evident that the slight overestimation of \(\pi _0\) and the slight underestimation of \(\sigma \) by Z-REG-FDR work in opposite directions, leading to lfdr values similar to those of REG-FDR. The correlations between the estimated FDR based on the true parameter values and those based on REG-FDR or Z-REG-FDR are also very high (see Additional file 1: Figure 3).
Behavior of the expected pseudo-log-likelihood of Z-REG-FDR
It is a standard result that the expected log-likelihood is maximized at the true value of the parameter under standard regularity conditions [33]. Since REG-FDR is the true maximum likelihood method for the proposed model, it is expected to satisfy this property. If Assumption (A3) is not satisfied, then Z-REG-FDR is an approximate maximum likelihood method, and as such, its pseudo-log-likelihood need not be maximized at the true value of the parameter. We explored several realistic combinations of the true parameters and observed that the pseudo-log-likelihood of Z-REG-FDR is maximized very near the true parameter value. It is difficult to compute the expected pseudo-log-likelihood analytically, and so Monte Carlo integration was used for this task. Figure 3 shows the expected pseudo-log-likelihood surface of Z-REG-FDR for \(\pi _0=0.2\) and \(\sigma =3\). A contour plot also confirms that the surface peaks near the true values of the parameters.
Simulations: comparison of Z-REG-FDR with other methods
It is possible to use other methodologies to control the FDR in the grouped hypothesis testing problem for eQTL data. A conservative approach is to obtain the Bonferroni-adjusted p-values for each gene, where the p-value for each gene-SNP pair is computed based on the usual t-test or z-test, and then use an FDR-controlling approach [e.g., 34,35,36] to assess the conservative p-values. [29] used a permutation-based ("eGene") approach in their analysis of the GTEx data. The method uses the smallest gene-SNP p-value for a gene as the test statistic and computes its distribution by permuting the expression values. Such a distribution can be used to obtain p-values for each gene, which can subsequently be used to control the FDR by methods such as Storey's q-value method [35].
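A minimal sketch of this min-p permutation strategy follows, using the maximum absolute correlation in place of the smallest p-value (equivalent for a fixed sample size, since the per-SNP p-value is monotone in |r|). The function name and toy data are illustrative, not the GTEx pipeline:

```python
import numpy as np

def egene_pvalue(Y_i, X_i, n_perm=1000, rng=None):
    """Permutation p-value for the gene-level null: the observed maximum
    |correlation| over local SNPs is compared with its null distribution
    obtained by permuting the expression values."""
    rng = rng or np.random.default_rng()
    n = Y_i.size
    y = (Y_i - Y_i.mean()) / Y_i.std()
    X = (X_i - X_i.mean(axis=1, keepdims=True)) / X_i.std(axis=1, keepdims=True)
    obs = np.max(np.abs(X @ y / n))               # max |r| over local SNPs
    null = np.empty(n_perm)
    for b in range(n_perm):                       # permute expression only
        null[b] = np.max(np.abs(X @ rng.permutation(y) / n))
    return (1 + np.sum(null >= obs)) / (1 + n_perm)

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 150))
Y = 0.6 * X[2] + rng.normal(size=150)             # gene with one causal SNP
p = egene_pvalue(Y, X, rng=rng)
print(p)
```

Note that the smallest attainable p-value is \(1/(B+1)\) for B permutations, which is the resolution limitation discussed later in the paper.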
The Bonferroni method is typically conservative and hence less powerful. The permutation method, while correctly controlling false positives, can suffer from lack of power to detect genes having an eQTL, since it uses an extreme value statistic (not based on likelihood). Our model, on the other hand, utilizes more information through its use of approximate likelihood. We carried out a simulation study to compare the power of the methods. The simulations were performed using the simulation scheme described in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section, and statistical power was obtained using an FDR threshold of 0.05. The results are shown in Fig. 4. As expected, the Bonferroni method turned out to have very low power and is not shown in Fig. 4. The permutation approach with Storey's q-value method [35] was conservative and less powerful in comparison with Z-REG-FDR. To address the possible concern that Z-REG-FDR can be slightly anti-conservative, making the comparison with the permutation method unfair, we also included an adjusted version of the Z-REG-FDR method, in which a slightly lower FDR threshold was chosen based on the simulations such that the estimated FDR was exactly 0.05. This adjusted version had slightly less power than unadjusted Z-REG-FDR, but was more powerful than the permutation method.
Analysis of real data
Finally, we also applied Z-REG-FDR to a real dataset obtained from GTEx (V6) [12]. Besides Z-REG-FDR, we also used the permutation method and the Simes method [37], which is expected to be more powerful than the Bonferroni method although it may not control the FDR for all types of correlation structures. We applied each method to the GTEx data for 44 tissues, separately for each tissue.
For each tissue, the normalized gene expression data and SNP genotype data were separately residualized after adjusting for covariates provided by GTEx. We fit a linear regression model with each individual's gene expression or SNP genotype as the response variable and the covariates as the explanatory variables. We then extracted the model residuals to obtain "covariate-corrected" gene expression and SNP genotypes.
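This residualization step is an ordinary least-squares projection and can be sketched as follows (illustrative names and toy data; this is not the GTEx pipeline code):

```python
import numpy as np

def residualize(M, C):
    """Regress each row of M (features x samples) on the covariate matrix
    C (samples x q) and return the residuals, i.e. covariate-corrected data."""
    C1 = np.column_stack([np.ones(C.shape[0]), C])   # add an intercept
    # least-squares coefficients for every row of M at once
    beta, *_ = np.linalg.lstsq(C1, M.T, rcond=None)
    return M - (C1 @ beta).T

rng = np.random.default_rng(4)
C = rng.normal(size=(100, 3))                        # 3 covariates, 100 samples
E = rng.normal(size=(20, 100)) + 2.0 * C[:, 0]       # expression confounded by C
E_res = residualize(E, C)
# the residuals are numerically orthogonal to each covariate
print(np.abs(E_res @ C[:, 0]).max())
```

The same function applies to the genotype matrix, yielding the covariate-residualized \(Y\) and \(X^{(i)}\) assumed in the Methods section.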
Figure 5 shows a comparison of the number of significant genes found by Z-REG-FDR and the permutation method employed by [12]. A complete list of the sample sizes and the number of significant genes discovered for the 44 tissues is provided in Additional file 1: Table 2. The methods agree with each other to some extent in terms of the number of discoveries. The Z-REG-FDR method yields a higher number of discoveries than the permutation method and the Simes method in most cases. The parameter estimates for each tissue using Z-REG-FDR are shown in Fig. 6.
Discussion
We have introduced a principled procedure to perform gene-level FDR testing, most appropriate and useful in the eQTL setting. A major advantage of Z-REG-FDR is its computational efficiency. While other methods such as the permutation method or our REG-FDR method can take days on a single PC to complete the analysis of a real eQTL dataset, Z-REG-FDR can do the same in a few minutes. For instance, it takes approximately two minutes to fit the model and find significant genes by Z-REG-FDR for a dataset with 4.5 million SNPs grouped as local SNPs for 10,000 genes. REG-FDR takes about a day, and the permutation method (for 10,000 permutations) takes about 6 hours to analyze the same data. Since there are thousands of simultaneous tests, even 10,000 permutations may not be enough to provide sufficient p-value resolution. While the Bonferroni method is very fast, it has little power to detect the genes having true eQTLs.
Z-REG-FDR has additional advantages. One important feature of the method is that it does not require access to the full data. In fact, the symmetry of the distributions involved in the Z-REG-FDR pseudo-likelihood ensures that only the gene-SNP level p-values (or equivalently the absolute z-values) are needed to fit the model. Z-REG-FDR does not model the correlation structure of the SNPs, and therefore does not require access to those data. This can be very useful since, in many genetic applications, data are available only in the form of summary measures.
Z-REG-FDR can be slightly anti-conservative, depending on the true values of the parameters. Various simulations show that if \(\sigma \) is large, which appears often to be the case for eQTL data, the control of FDR is satisfactory. The fact that Assumption (A3) is not satisfied does not significantly affect the FDR control. Therefore the assumption can be thought of as a means to reduce computational burden, rather than a necessary condition for the practical workability of the model.
Assumptions (A1) and (A2) also have the potential to be relaxed, although we consider that to be beyond the scope of this paper. For example, the method can be extended by relaxing Assumption (A2) and incorporating a nonuniform prior for the causal location. If a well-grounded prior exists, it can be incorporated into our method in a straightforward manner using weighted versions of our statistics. We have included an example in Additional file 1 to provide empirical evidence that the method remains valid even for more than one causal SNP under certain conditions.
Our use of the lfdr statistics, while valid, does not exploit gene-level local correlation structures [38,39,40,41] that might provide additional power. Implementation of such methods would require sensitive estimation of gene-level correlations, and is a possible direction for future effort.
With the continuous increase in the size of genomic data sets, and with the possibility of further extensions of our approach, we strongly believe that the approximate likelihood approach of the ZREGFDR method can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics.
Availability of data and materials
Supplementary material is available in the file Supplementary.pdf. Software in the form of R code and documentation is available at https://doi.org/10.5281/zenodo.8331734.
References
Rockman MV, Kruglyak L. Genetics of global gene expression. Nat Rev Genet. 2006;7(11):862–72.
Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, et al. The genotype-tissue expression (GTEx) project. Nat Genet. 2013;45(6):580–5.
Palowitch J, Shabalin A, Zhou YH, Nobel AB, Wright FA. Estimation of cis-eQTL effect sizes using a log of linear model. Biometrics. 2018;74(2):616–25.
Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philos Trans R Soc B Biol Sci. 2013;368(1620):20120362.
Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. Mapping complex disease traits with global gene expression. Nat Rev Genet. 2009;10(3):184–94.
Hu JX, Zhao H, Zhou HH. False discovery rate control with groups. J Am Stat Assoc. 2010;105(491):1215–27.
Cai TT, Sun W. Simultaneous testing of grouped hypotheses: finding needles in multiple haystacks. J Am Stat Assoc. 2009;104(488):1467–81.
Zhao H, Zhang J. Weighted p-value procedures for controlling FDR of grouped hypotheses. J Stat Plan Inference. 2014;151:90–106.
Huang QQ, Ritchie SC, Brozynska M, Inouye M. Power, false discovery rate and winner's curse in eQTL studies. Nucleic Acids Res. 2018;46(22):e133.
Sul JH, Raj T, De Jong S, De Bakker PIW, Raychaudhuri S, Ophoff RA, Stranger BE, Eskin E, Han B. Accurate and fast multipletesting correction in eQTL studies. Am J Hum Genet 2015;96(6):857–868.
Westra HJ. From genome to function by studying eQTLs. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease. 2014;1842(10):1896–902.
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204.
Peterson CB, Bogomolov M, Benjamini Y, Sabatti C. Many phenotypes without many false discoveries: error controlling strategies for multitrait association studies. Genet Epidemiol. 2016;40(1):45–56.
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.
Wright FA, Sullivan PF, Brooks AI, Zou F, Sun W, Xia K, Madar V, Jansen R, Chung W, Zhou YH, et al. Heritability and genomics of gene expression in peripheral blood. Nat Genet. 2014;46(5):430–7.
Stegle O, Parts L, Piipari M, Winn J, Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protoc. 2012;7(3):500–7.
Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28(10):1353–8.
Zhou HJ, Li L, Li Y, Li W, Li JJ. PCA outperforms popular hidden variable inference methods for molecular QTL mapping. Genome Biol. 2022;23(1):1–17.
Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol. 2002;23(1):70–86.
Ferkingstad E, Frigessi A, Rue H, Thorleifsson G, Kong A. Unsupervised empirical Bayesian multiple testing with external covariates. Ann Appl Stat. 2008;2(2):714–35.
Efron B, Storey JD, Tibshirani R. Microarrays, empirical Bayes methods, and false discovery rates. Technical report, Stanford University; 2001.
Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003;22(24):3899–914.
Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5(2):155–76.
Sun W, Cai TT. Oracle and adaptive compound decision rules for false discovery rate control. J Am Stat Assoc. 2007;102(479):901–12.
Li G, Shabalin AA, Rusyn I, Wright FA, Nobel AB. An empirical Bayes approach for multiple tissue eQTL analysis. Biostatistics. 2018;19(3):391–406.
Jansen R, Hottenga JJ, Nivard MG, Abdellaoui A, Laport B, de Geus EJ, Wright FA, Penninx BWJH, Boomsma DI. Conditional eQTL analysis reveals allelic heterogeneity of gene expression. Hum Mol Genet. 2017;26(8):1444–51.
Kendziorski CM, Chen M, Yuan M, Lan H, Attie AD. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics. 2006;62(1):19–27.
Gelfond JAL, Ibrahim JG, Zou F. Proximity model for expression quantitative trait loci (eQTL) detection. Biometrics. 2007;63(4):1108–16.
Ardlie KG, Deluca DS, Segrè AV, Sullivan TJ, Young TR, Gelfand ET, Trowbridge CA, Maller JB, Tukiainen T, Lek M, et al. The genotypetissue expression (gtex) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.
Varin C, Reid N, Firth D. An overview of composite likelihood methods. Stat Sin. 2011;21(1):5–42.
Ximing X, Reid N. On the robustness of maximum composite likelihood estimate. J Stat Plan Inference. 2011;141(9):3047–54.
Joehanes R, Zhang X, Huan T, Yao C, Ying S, Nguyen QT, Demirkale CY, Feolo ML, Sharopova NR, Sturcke A, et al. Integrated genomewide analysis of expression quantitative trait loci aids interpretation of genomic association studies. Genome Biol. 2017;18(1):1–24.
Cox DR, Hinkley DV. Theoretical statistics. CRC Press;1979.
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995;57(1):289–300.
Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B (Stat Methodol). 2002;64(3):479–98.
Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinf. 2008;9(1):303.
John Simes R. An improved bonferroni procedure for multiple tests of significance. Biometrika. 1986;73(3):751–4.
Wei Z, Li H. A Markov random field model for networkbased analysis of genomic data. Bioinformatics. 2007;23(12):1537–44.
Sun W, Tony Cai T. Largescale multiple testing under dependence. J R Stat Soc Ser B (Stat Methodol). 2009;71(2):393–424.
Wei Z, Sun W, Wang K, Hakonarson H. Multiple testing in genomewide association studies via hidden Markov models. Bioinformatics. 2009;25(21):2802–8.
Xiao J, Zhu W, Guo J. Largescale multiple testing in genomewide association studies via regionspecific hidden Markov models. BMC Bioinf. 2013;14:1–12.
Acknowledgements
Not applicable.
Funding
Supported in part by R01ES033243 and R01ES029911.
Author information
Authors and Affiliations
Contributions
PR constructed the models and performed the statistical analyses of simulated and real data. YZ conducted preprocessing and covariate adjustment for the real data. FAW and AN supervised the modeling and analysis. All authors have read and approved the final version of this manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Supplementary Materials.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Rudra, P., Zhou, YH., Nobel, A. et al. Control of false discoveries in grouped hypothesis testing for eQTL data. BMC Bioinformatics 25, 147 (2024). https://doi.org/10.1186/s12859-024-05736-3
DOI: https://doi.org/10.1186/s12859-024-05736-3