 Methodology Article
 Open access
 Published:
HTeQTL: integrative expression quantitative trait loci analysis in a large number of human tissues
BMC Bioinformatics volume 19, Article number: 95 (2018)
Abstract
Background
Expression quantitative trait loci (eQTL) analysis identifies genetic markers associated with the expression of a gene. Most existing eQTL analyses and methods investigate association in a single, readily available tissue, such as blood. Joint analysis of eQTL in multiple tissues has the potential to improve, and expand the scope of, singletissue analyses. Largescale collaborative efforts such as the GenotypeTissue Expression (GTEx) program are currently generating high quality data in a large number of tissues. However, computational constraints limit genomewide multitissue eQTL analysis.
Results
We develop an integrative method under a hierarchical Bayesian framework for eQTL analysis in a large number of tissues. The model fitting procedure is highly scalable, and the computing time is a polynomial function of the number of tissues. Multitissue eQTLs are identified through a local false discovery rate approach, which rigorously controls the false discovery rate. Using simulation and GTEx real data studies, we show that the proposed method has superior performance to existing methods in terms of computing time and the power of eQTL discovery.
Conclusions
We provide a scalable method for eQTL analysis in a large number of tissues. The method enables the identification of eQTL with different configurations and facilitates the characterization of tissue specificity.
Background
Expression quantitative trait loci (eQTL) analyses identify single nucleotide polymorphisms (SNPs) that are associated with the expression level of a gene. A geneSNP pair such that the expression of the gene is associated with the value of the SNP is referred to as an eQTL. One may view eQTL analyses as GenomeWide Association Studies (GWAS) with multiple molecular phenotypes. Identification of eQTLs is a key step in investigating genetic regulatory pathways. To date, numerous eQTLs have been discovered to be associated with human traits such as height and complex diseases such as Alzheimer’s disease and diabetes [1, 2].
With few exceptions, existing eQTL studies have focused on a single tissue; in human studies this tissue is usually blood. An important next step in exploring the genomic regulation of expression is to simultaneously study eQTLs in multiple tissues. Multitissue eQTL analysis can strengthen the conclusions of single tissue analyses by borrowing strength across tissues, and can help provide insight into the genomic basis of differences between tissues, as well as the genetic mechanisms of tissuespecific diseases.
Recently, the NIH Common Fund’s GenotypeTissue Expression (GTEx) project has undertaken a largescale effort to collect and analyze eQTL data in multiple tissues on a growing set of human subjects, and there has been a concomitant development of methods for the analysis of such data. For example, Peterson et al. [3] and Bogomolov et al. [4] developed new error control procedures to control false discovery rates at different levels of resolution (e.g., at the SNP level or the gene level) for eQTL analysis. The methods have been used to identify genes whose expression is regulated by SNPs (eGenes), or SNPs that affect the expression levels of multiple genes (eSNPs). However, the methods only concern how to reduce the number of hypotheses in a hierarchical structure, but cannot effectively borrow strength across tissues to enhance eQTL discoveries. Lewin et al. [5], Sul et al. [6] and Han et al. [7] developed regressionbased methods via Bayesian multivariate regression and randomeffects models. The models accommodate data from multiple tissues simultaneously, and integrate information across tissues for eQTL detection. However, a potential drawback is that they only focus on one gene or geneSNP pair at a time, and fail to leverage information across different geneSNP pairs. Flutre et al. [8] and Li et al. [9] developed hierarchical Bayesian models to model summary statistics across multiple tissues. The models capture the marginal distribution of each geneSNP pair with interpretable parameters, and explicitly characterize heterogenous eQTL configurations in multiple tissues. However, the model fitting is computationally expensive and cannot scale to a large number of tissues. Recently, Urbut et al. [10] proposed an ad hoc approach based on shrinkage to improve the scalability of the Bayesian models. However, the procedure is subject to overfitting and the model parameters are hard to interpret. Initial analyses and conclusions of the GTEx project are described in [11]. As part of this work, the “Bayesian Model Averaging” method [8] and the MTeQTL (“MT” stands for multitissue) method [9] were applied to 9 human tissues with sample size greater than 80, focusing on local (cis) pairs for which the SNP is within one megabase (Mb) of the transcription start site (TSS) of the gene. The analysis found that most eQTLs discovered were common across the 9 tissues included in the study, though the effect size may vary from tissue to tissue. In addition, there are a small, but potentially interesting, set of eQTLs that are present only in a subset of tissues, the most common cases being eQTLs that are present in only one tissue, or present in all but one tissue.
As GTEx and related projects proceed, data are being collected from an increasing number of subjects, and an increasing number of tissues. In the current GTEx database (v6p), more than 20 tissues have a sample size greater than 150. Existing eQTL analysis methods that can effectively borrow strength across tissues are limited in their ability to perform simultaneous local eQTL analyses in a large number of tissues. Methods like [8] and [9] incorporate and rely on a binary configuration vector, with dimension equal to the number of available tissues, that describes, for each geneSNP pair, the presence or absence of association in each tissue. The total number of possible configurations grows exponentially in the number of tissues, making computation, numerical accuracy, and memory management problematic when dealing with large numbers of tissues.
In this paper, we develop an efficient computational tool, called HTeQTL (“HT” stands for hightissue), for joint eQTL analysis. The method builds on the hierarchical Bayesian model developed in [9], but the estimation procedure is significantly improved to address scaling issue associated with a large number of tissues. Rather than fitting a full model, HTeQTL fits models for all pairs of tissues in a parallel fashion, and then synthesizes the resulting pairwise models into a higher order model for all tissues. To do this, we exploit the marginal compatibility of the hierarchical Bayesian model, which is not an obvious property and was proven in [9]. An important innovation is that we employ a multiProbit model and thresholding to deal with the exponentially growing configuration space. The resulting model and fitting procedure can be efficiently applied to the simultaneous eQTL analysis of 2025 tissues. Empirical Bayesian methods for controlling false discovery rates in multiple hypothesis testing are developed. We design testing procedures to detect different families of eQTL configurations. We show that the eQTL detection power of HTeQTL is similar to that of MTeQTL, and that both outperform the tissuebytissue approach, in a simulation study with a moderate number of tissues. We also compare HTeQTL with the MetaTissue method in the analysis of the GTEx v6p data. This analysis shows that the methods have largely concordant results, but that HTeQTL gains additional power by borrowing strength across tissues.
Methods
In this section we describe the HTeQTL method, beginning with a review of the hierarchical Bayesian model and the MTeQTL method in [9], and then describing our proposal on how to fit the Bayesian model in hightissue settings. Next, we describe a local false discovery rate based method for performing flexible eQTL inference. Finally, we discuss a marginal test and a marginal transformation to check and improve the goodness of fit of the model.
Review: Bayesian hierarchical model and MTeQTL procedure
Consider a study with n subjects and K tissues. From each subject we have genotype data and measurements of gene expression in a subset of tissues. In many cases, covariate correction will be performed prior to analysis of eQTLs. For k=1,…,K, let n_{ k }≤n denote the number of subjects contributing expression data from tissue k. Let λ=(i,j) be the index of a geneSNP pair consisting of gene i and SNP j, and let Λ be the set of all local (cis) geneSNP pairs. For λ=(i,j)∈Λ and k=1,…,K, let r_{ λ }(k) denote the (covariate corrected) sample correlation between the expression level of gene i and the number of copies of the minor allele of SNP j in tissue k, and ρ_{ λ }(k) be the corresponding population correlation. Define r_{ λ }=(r_{ λ }(1),…,r_{ λ }(K)) to be the vector of sample correlations across tissues, and define the vector ρ_{ λ } of population correlations in the same fashion.
Let Z_{ λ }=h(r_{ λ })·d^{1/2}, where h(·) is the entrywise Fisher transformation with the effect of variance stabilization, · is the Hadamard product, and d is a Kvector whose kth component is the number of samples in tissue k minus the number of covariates removed from tissue k minus 3. With proper preprocessing of the gene expression data, the vector Z_{ λ } is approximately multivariate normal [12] with mean μ_{ λ }=h(ρ_{ λ })·d^{1/2} and marginal variance one. In particular, if ρ_{ λ }(k)=0 then the kth component of Z_{ λ } has a standard normal distribution, and can therefore be used as a zstatistic for testing ρ_{ λ }(k)=0 vs ρ_{ λ }(k)≠0. Thus we refer to Z_{ λ } as a zstatistic vector.
The MTeQTL model introduced in [9] is a Bayesian hierarchical model for the random vector Z_{ λ }. The model can be expressed in the form of a mixture as
The mixture in (1) is taken over the set {0,1}^{K} of length K binary vectors. Each vector γ∈{0,1}^{K} represents a particular configuration of eQTLs across the K available tissues: γ_{ k }=1 if the geneSNP pair indexed by λ is an eQTL in tissue k, and γ_{ k }=0 otherwise. We define Hamming class m (m=0,⋯,K) as the set of all binary Kvectors having m ones, which correspond to all configurations in which there is an eQTL in m tissues and no eQTL in K−m tissues. The parameter p(γ) is the prior probability of the configuration γ. We collect all the priors in a length 2^{K} vector p. The Kvector μ characterizes the average true effect size of eQTLs in each tissue. The K×K correlation matrix Δ captures the behavior of Z_{ λ } when no eQTLs are present (γ=0): its diagonal entries are 1 due to the variance stabilization caused by the Fisher transformation, and its offdiagonal entries reflect correlations arising from subject overlap between tissues. The K×K matrix Σ captures the covariance structure of nonzero eQTL effect sizes in different tissues. Let θ={p,μ,Δ,Σ} denote the set of unknown model parameters.
Under the model (1) the distribution of Z_{ λ } is a normal mixture with each component corresponding to a specific eQTL configuration. In particular, if γ=0 (λ is not an eQTL in any tissue) then \(\mathbf {Z}_{\lambda }\sim \mathcal {N}_{K}(\mathbf {0},\boldsymbol {\Delta })\); if γ=1 (λ is an eQTL in all tissues) then \(\mathbf {Z}_{\lambda }\sim \mathcal {N}_{K}(\boldsymbol {\mu },{\boldsymbol {\Delta }}+\boldsymbol {\Sigma })\). The true configuration vector for each geneSNP pair λ can be viewed as a latent variable. The main goal of a statistical analysis is to obtain the posterior distribution of each latent variable, and to use it to make inferences about eQTL configurations in multiple tissues.
In order to make inference about configuration vectors, we first estimate the model parameters θ={p,μ,Δ,Σ}. In practice it is common to set the average effect size vector μ to 0, as minor alleles are equally likely to be associated with high or low expression, and we assume in what follows that μ=0. The remaining parameters can be estimated within a maximum pseudolikelihood framework, where the pseudolikelihood is defined as the product of the likelihoods of all considered geneSNP pairs. We note that factorizing the likelihood in this way ignores dependence between adjacent and nearby SNPs arising from linkage disequilibrium. However, our interest is not in the joint behavior of the vectors Z_{ λ } but in their marginal behavior, which is reflected in the mixture (1). In particular, the parameters in Model (1) determine, and are determined by, the marginal distribution of the vectors Z_{ λ }, and do not depend on joint distribution of the vectors Z_{ λ }.
A modified EM algorithm was devised in [9] to estimate the parameters from the pseudolikelihood (see Section A of the Additional file 1). While the method scales linearly with sample size and the number of geneSNP pairs, its computational time increases exponentially with the number of tissues K (see Fig. 1). For genomewide studies, it is infeasible to apply the method to data with more than a few tissues. Moreover, the number of configurations grows exponentially with the number of tissues as well, making inference about configurations difficult as well. Below we introduce a scalable procedure, the HTeQTL method, to address multitissue eQTL analysis in about 20 tissues.
The HTeQTL method
The original MTeQTL model has the desirable property of being marginally compatible. Let the dimension of the MTeQTL model be the number of available tissues. Marginal compatibility means that: 1) the marginalization of a Kdimensional model to a subset of L tissues has the same general form as the Kdimensional model; and 2) the corresponding parameters for the Ldimensional model are obtained in the obvious way by restricting the parameters of the Kdimensional model to the subset of L tissues.
Because of marginal compatibility, it is straightforward to obtain a submodel from a high dimensional model without refitting the MTeQTL parameters. The HTeQTL method, which is discussed below, estimates the high dimensional model from the collection of its one and twodimensional submodels. Thus we address the computationally intractable problem of estimating a high dimensional model by considering a manageable number of subproblems that can be solved efficiently, and in parallel.
In the MTeQTL model, the covariance matrices Δ and Σ reflect interactions between pairs of tissues, while the probability mass function p(·) captures higher order relationships between tissues. The HTeQTL model is built from estimates of all one and twodimensional submodels, which can be computed in parallel. In particular, we make use of a MultiProbit model to approximate the Kth order probability mass function p(·) from the probability mass functions of twodimensional models. In what follows we denote the estimated parameters of the twodimensional model for tissue pair (i,j) by
A description of the twotissue model fitting procedure can be found in Section A of the Additional file 1.
Assemble Δ: For each tissue pair (i,j) where 1≤i<j≤K, the corresponding offdiagonal value of Δ is denoted by δ_{ ij }. An asymptotically consistent estimate of δ_{ ij } is the offdiagonal value of Δ^{ij}, which is the null covariance matrix for the twodimensional model for tissue pair (i,j). Making this substitution for each i<j and placing ones along the diagonal yields the proposed estimate of Δ (i.e., \(\hat {\boldsymbol {\Delta }}\)). In practice, since each Δ^{ij} is typically estimated from a large number of geneSNP pairs, \(\hat {{\boldsymbol {\Delta }}}\) is very close to Δ with negligible variability. If \(\hat {{\boldsymbol {\Delta }}}\) is not positive definite (which did not occur in our numerical studies), we set the negative eigenvalues of \(\hat {{\boldsymbol {\Delta }}}\) by 0, and rescale it to be a correlation matrix.
Assemble Σ: To estimate the covariance matrix Σ={σ_{ ij }}, we decompose it into the diagonal values, which are tissuespecific variances, and the corresponding correlation matrix. For each diagonal entry σ_{ kk } (k=1,⋯,K), there are K−1 estimates, namely \(\sigma ^{1k}_{22},\cdots,\sigma ^{(k1)k}_{22},\sigma ^{k(k+1)}_{11},\cdots,\sigma ^{kK}_{11}\). In practice, the distribution of zstatistics is usually heavytailed, inflating the pairwise estimates of the variance. As a remedy, we propose to use the minimum of the K−1 estimates as the estimate of σ_{ kk } to compensate the inflation effect. The induced correlation matrix from Σ is estimated in the same way as Δ. In particular, we start with a matrix having ones along the diagonal and offdiagonal entries \(\sigma _{12}^{ij} / \sqrt {\sigma _{11}^{ij} \sigma _{22}^{ij}}\). We then obtain the closest positive semidefinite matrix by setting negative eigenvalues to zero, and rescale the resulting matrix to be a correlation matrix. Combining the correlation matrix with the diagonal variance terms, we obtain the estimate \(\hat {\boldsymbol {\Sigma }}\).
The MultiProbit Model for p: Existing multitissue eQTL studies [9, 11] support several broad conclusions about eQTL configurations across tissues. Researchers found that most geneSNP pairs are not an eQTL in any tissue (Hamming class 0) or were an eQTL in all tissues (Hamming class K). With larger sample sizes and a larger number of tissues (thus providing increased power to detect crosstissue sharing), we expect these two Hamming classes to predominate.
In general, the probability mass functions obtained from twodimensional models will not determine a unique probability mass function on the full Kdimensional model. Here we make use of a multiProbit model through which we equate the values of the estimated probability mass function with integrals of a multivariate normal probability density. In particular, for each tissue pair (i,j), we select thresholds \(\tau ^{ij}_{1}, \tau ^{ij}_{2} \in \mathbb{R}\) and a correlation ω^{ij}∈(0,1) so that if (W_{ i },W_{ j }) are bivariate normal with mean zero, variance one, and correlation ω^{ij} then
for each u,v∈{0,1}. Here \(\mathbb {I}(A)\) is the indicator function of A, and p^{ij}(·) is the estimated probability mass function for the pair (i,j).
Beginning with a symmetric matrix having diagonal values 1 and offdiagonal values equal to ω^{ij}, we define a correlation matrix Ω following the procedure used to define \(\hat {\boldsymbol {\Delta }}\). Let ϕ_{ K }(·) be the probability density function of the corresponding Kvariate normal distribution \(\mathcal {N}_{K}(\textbf {0}, \boldsymbol {\Omega })\). For each tissue j, we define an aggregate threshold τ^{j} to be the minimum of \(\tau ^{ij}_{1}\) (i<j) and \(\tau ^{ji}_{2}\) (j<i). Here we use the minimum because pairwise models may occasionally overestimate the null prior probability p^{ij}(0,0). Subsequently, for each configuration γ∈{0,1}^{K}, we define the probability
where I_{ k } is equal to (−∞,τ^{k}] if γ_{ k }=0, and (τ^{k},∞), if γ_{ k }=1. Consequently, we obtain the estimate of probability mass function p for the Kdimensional model.
Threshold p(·): In practice, many of the 2^{K} possible configurations will have estimated probabilities close to zero. In order to further reduce the number of configurations, we set the threshold for the prior probabilities to be 10^{−5}, and truncate those values below the threshold to be zero. The remaining probabilities are rescaled to have total mass one. As a result, the total number of configurations with nonzero probabilities is dramatically reduced to a manageable level for subsequent inferences.
Inferences
The first, and often primary, goal of eQTL analysis in multiple tissues is to detect which geneSNP pairs are an eQTL in some tissue. Subsequent testing may seek to identify geneSNP pairs that are an eQTL in a specific tissue, and pairs that are an eQTL in some, but not all, tissues. As the model (1) is fit with large number of geneSNP pairs, we ignore the estimation error associated with the model parameters and treat the estimated values as fixed and true for the purposes of subsequent inference.
The mixture model (1) may be expressed in an equivalent, hierarchical form, in which for each geneSNP pair λ, there is a latent random vector Γ_{ λ }∈Γ indicating whether or not that pair is an eQTL in each of the K tissues. The prior distribution of Γ_{ λ } is characterized by the probabilistic mass function p(·). In the hierarchical model, given that Γ_{ λ }=γ, the random zstatistic vector Z_{ λ } has distribution \( \mathcal {N}_{K}(\mathbf {0},\boldsymbol {\Delta }+\boldsymbol {\Sigma }\cdot \mathbf {\gamma }\mathbf {\gamma }') \). The posterior distribution of Γ_{ λ } given the observed vector z_{ λ } can be used to test eQTL configurations for the geneSNP pair λ.
Detection of eQTLs with specified configurations can be formulated as a multiple testing problem, and addressed through the use of local false discovery rates derived from the posterior distribution of geneSNP pairs. Suppose that we are interested in identifying geneSNP pairs with eQTL configurations in a set S⊆{0,1}^{K}. This can be cast as a multiple testing problem
where λ∈Λ. Rejecting the null hypothesis for a geneSNP pair λ indicates that λ is likely to have an eQTL configuration in S. There are several families S of particular interest, corresponding to different configurations of interest:

Testing for the presence of an eQTL in any tissue: S={γ:γ≠0}

Testing for presence of a tissuespecific eQTL, i.e., an eQTL in some, but not all, tissues: S={γ:γ≠0,γ≠1}

Testing for presence of an eQTL in tissue k only: S={γ:γ_{ k }=1}

Testing for presence of a common eQTL, i.e., an eQTL in all tissues: S={1}.
To carry out multiple testing under the hierarchical Bayesian model, we make use of the local false discovery rate (lfdr) for the set S, which is defined as the posterior probability that the configuration Γ lies in S^{c} given the observed zstatistics vector z. The local false discovery rate was introduced by [13] in the context of an empirical Bayes analysis of differential expression in microarrays. Other applications can be found in [14–16]. Formally, the lfdr for S⊆{0,1}^{K} is defined by
where f_{ γ }(z) is the pdf of \(\mathcal {N}_{K}\left (\mathbf {0},\boldsymbol {\Delta }+\boldsymbol {\Sigma }\cdot \mathbf {\gamma }\mathbf {\gamma }'\right)\). Thus η_{ S }(z_{ λ }) is the probability of the null hypothesis given the zstatistic vector for the geneSNP pair λ. Small values of the lfdr provide evidence for the alternative hypothesis H_{1,γ}. In order to control the overall false discovery rate (FDR) for the multiple testing problem across all geneSNP pairs λ∈Λ we employ an adaptive thresholding procedure for local false discovery rates [9, 13, 14, 17]. For a given set of configurations S, and a given false discovery rate threshold α∈(0,1), the procedure operates as follows.

Calculate the lfdr η_{ S }(z_{ λ }) for each λ∈Λ.

Sort the lfdrs from smallest to largest as η_{ s }(λ_{(1)})≤⋯η_{ s }(λ_{(N)}).

Let N be the largest integer such that
$${1 \over N} \sum_{i=1}^{N} \eta_{s}\left(\lambda_{(i)}\right) < \alpha. $$ 
Reject hypotheses \(\mathrm {H}_{0,\lambda _{(i)}}\) for i=1,…,N.
It is shown in [9] that the adaptive procedure controls the FDR at level α under very mild conditions. Consequently, we obtain a set of discoveries with FDR below the nominal level α.
Results
In the first part of this section, we conduct a simulation study with 9 tissues. We compare HTeQTL with the MTeQTL [9], MetaTissue [6] and tissuebytissue (TBT) [18–21] methods on different eQTL detection problems. The MetaTissue approach leverages the fixed effects and random effects method to address effect size heterogeneity and detect eQTLs across multiple tissues. The TBT approach first evaluates the significance of geneSNP association in each tissue separately, and then aggregates the information across tissues. We also compare HTeQTL and MTeQTL in terms of the model fitting times and parameter estimation accuracy. Then we apply the two scalable methods, HTeQTL and MetaTissue, to the GTEx v6p data with 20 tissues.
Simulation
In the simulation study, we first generate zstatistics directly from Model (1) with K=9 tissues. We fix the model parameters {p,μ,Δ,Σ} to be the ones estimated from MTeQTL method on the GTEx pilot data. In particular, the parameters are available from the supplementary material of [9]. For each geneSNP pair, we first randomly generate a lengthK binary configuration vector γ based on the prior probability mass function p. Given γ, the marginal distribution of the zstatistics is \(\mathcal {N}\left (\boldsymbol {\mu }\cdot {\mathbf {\gamma }},{\boldsymbol {\Delta }}+\boldsymbol {\Gamma }\cdot {\mathbf {\gamma }}{\mathbf {\gamma }}'\right)\). Then we simulate a lengthK effect size vector from the multivariate Gaussian distribution. We repeat the procedure 10^{5} times to obtain the true eQTL configurations and corresponding zstatistic vectors in 10^{5} geneSNP pairs. The true eQTL configurations under the simulation are used to evaluate the efficacy of different methods.
We first compare the computational costs of the MTeQTL model fitting and the HTeQTL model fitting (without parallelization). We consider a sequence of nested models with dimensions from 2 to 9. The model fitting times on the simulated data are shown in Fig. 1. We demonstrate that the model fitting time for the MTeQTL grows exponentially in the number of tissues, while it grows much slower for the HTeQTL. Namely, the HTeQTL scales better than the MTeQTL. This is because the HTeQTL model fitting only involves the fitting of all the 2tissue MTeQTL models and a small overhead induced by assembling the pairwise parameters. When the total number of geneSNP pairs and the number of tissues are large, the advantage of HTeQTL is significant. Based on the timing results for MTeQTL on the 9tissue GTEx pilot data in [9], we project its fitting time to be more than 30 CPU years on 20 tissues. As we describe later, fitting the HTeQTL model on the 20tissue GTEx v6p data only takes less than 3 CPU hours. We remark that the straightforward parallelization of the 2tissue MTeQTL model fittings will further reduce the computational cost for HTeQTL.
Now we compare the parameter estimation from MTeQTL and HTeQTL. We particularly focus on the 9tissue model. The HTeQTL parameters are obtained by fitting all 2tissue models and assembling the pairwise parameters as described above. The MTeQTL parameters are obtained directly by fitting a 9tissue MTeQTL model. Regarding the estimation of the correlation matrix Δ, the quartiles of the entrywise relative errors are (0.86, 2.42–4.36%) and (0.81, 2.00–2.72%) for HTeQTL and MTeQTL, respectively. Regarding the estimation of the covariance matrix Σ, the quartiles of the entrywise relative errors are (1.13, 2.41–3.25%) and (0.36, 0.68–1.08%) for HTeQTL and MTeQTL, respectively. Namely, although HTeQTL had larger relative errors than MTeQTL, both methods estimated the covariance matrices very accurately. For the probability mass vector p, we calculated the KullbackLiebler divergence of different estimates from the truth, defined as \(D_{KL}\left (\mathbf {p}\\widehat {\mathbf {p}}\right)=\sum _{i=1}^{2^{K}}p_{i}\log {\left (p_{i}/\widehat {p_{i}}\right)}\). The MTeQTL estimate has a very small divergence of 0.025 while the HTeQTL estimate has a slightly larger divergence of 0.141. Overall, the HTeQTL estimates are slightly less accurate than the MTeQTL estimates, which is expected because the HTeQTL method has fewer degrees of freedom than the MTeQTL method. When there are abundant data relative to the number of parameters, the more complicated MTeQTL model will result in more accurate estimation. Nevertheless, we emphasize that the HTeQTL estimates are sufficiently accurate for the eQTL detection purposes (see Fig. 2).
Next, we compare the eQTL detection power of different methods. We particularly focus on the detection of four types of eQTLs: (a) eQTLs in at least one tissue (Any eQTL); (b) eQTLs in all tissues (Common eQTL); (c) eQTLs in at least one tissue but not all tissues (TissueSpecific eQTL); (d) eQTLs in a single tissue (SingleTissue eQTL). In addition to the MTeQTL and HTeQTL methods, we also consider the MetaTissue and TBT approaches. In order to detect Any eQTL, we exploit the random effects model in MetaTissue and a minP procedure in TBT, where the minimum p value across tissues is used as the test statistics for each geneSNP pair. To detect Common eQTL, we use the fixed effects model in MetaTissue and a maxP procedure in TBT, where the maximum p values across tissues are used. To detect TissueSpecific eQTL, we devise a diffP procedure for TBT, where the test statistics for each geneSNP pair is the difference between the maximum and the minimum p values across tissues. A large value indicates the discrepancy between the two extreme p values is large, and thus provides a strong evidence for the geneSNP pair to be a tissuespecific eQTL. Similarly, for MetaTissue, we exploit the difference of p values from the fixed effects model and the random effects model as the test statistics. Finally, for SingleTissue eQTL detection, MetaTissue reduces to the TBT method. We just use the p values in the primary tissue and ignore those in other tissues. For the MTeQTL and HTeQTL methods, we adapt the lfdr test statistics in (2) to different testing problems accordingly.
We evaluate the performance of different methods using the Receiver Operating Characteristic (ROC) curves for different eQTL detection problems. The results are shown in Fig. 2. In particular, in panel (a), a geneSNP pair identified by a method is deemed as a true positive if it truly has an eQTL in any tissue; otherwise, it is a false positive. Similar for the other panels. The Area under a Curve (AUC) is also calculated for each curve. The oracle curves correspond to the lfdr approach based on the true model with the true parameters. In all eQTL detection problems, the MTeQTL and HTeQTL methods have comparable performance, very similar to the oracle results. While we expect the MTeQTL to perform similarly to the oracle procedure, it is surprising that the HTeQTL, only using information in tissue pairs, also provides comparable (although slightly worse) results to the oracle procedure. Both MTeQTL and HTeQTL clearly outperform the MetaTissue and TBT approaches in all detection problems.
To sum up, the HTeQTL method achieves high parameter estimation accuracy and eQTL detection power at a low computational cost. For a large number of tissues, it provides a preferable alternative to the MTeQTL method.
GTEx v6p data
The GTEx v6p data constitute the most recent freeze for official GTEx Consortium publications, and can be accessed from the GTEx portal at http://www.gtexportal.org/home/. We apply the HTeQTL method to 20 tissues (selected in consultation with the GTEx Analysis Working Group), including 2 brain tissues, 2 adipose tissues, and a heterogeneous set of 16 other tissues. We consider all 70,724,981 cis geneSNP pairs where the SNP is within 1Mb of the TSS of the gene.
To obtain model parameters using HTeQTL, we first fit \({20 \choose 2}=190\) 2tissue models, and then assemble all the pairwise parameters following the procedure in the method section. The probability mass vector p estimated from the MultiProbit model is summarized in Fig. 3. We particularly focus on 377 configurations with prior probabilities greater than 10^{−5}. The prior probabilities are added up for configurations in the same Hamming class, providing a general characterization of the multitissue eQTL distribution. The parabolic shape estimated from the data is concordant with previous results from the pilot study [11]. The global null configuration (the binary 0 vector) has the largest probability of 0.936, and the common eQTL configuration (the binary 1 vector) has the second largest probability of 0.0396. Configurations in Hamming class 1 (eQTL in only one tissue) and 19 (eQTL in all but one tissues) have relatively large probabilities. All other configurations have much lower probabilities. We remark that as the number of possible configurations increases exponentially with the number of tissues, the prior probability of each configuration is likely to decrease.
Recall that Σ captures the covariance of effect sizes in different tissues when eQTLs are present. We treat the correlation matrix induced from Σ as the distance metric between tissues, and use the single linkage to conduct hierarchical clustering for the 20 tissues. The dendrogram is shown in Fig. 4. We demonstrate that similar tissues, such as the two adipose tissues and the breast tissue, or the two brain tissues, are grouped together. The whole blood is apparently different from all the other tissues. These findings are concordant with those in the pilot analysis [11].
We also carry out testing of eQTL configurations (at a fixed the FDR level of 5%) for the presence of an eQTL in any tissue, in all tissues, in at least one but not all tissues, and in each individual tissue. The number of discoveries are shown in Table 1. As a comparison, we also apply the MetaTissue method [6] to the same data set. In particular, we focus on the Any eQTL detection problem, using p values from the random effects model in MetaTissue. We apply the Benjamini and Yekutieli approach [22] to control the FDR at the level of 5%. As a result, we obtain over 6.36 million cis pairs from the MetaTissue method. About 3.60 million of these pairs are shared with the HTeQTL method. We further investigate the unique discoveries of each method. As shown in the left panel of Fig. 5, the unique discoveries made by HTeQTL have very small p values from the MetaTissue method, indicating those are likely to be “near” discoveries for the MetaTissue method as well. In the right panel of Fig. 5, however, the excessive unique discoveries made by MetaTissue have highly enriched large lfdr values. The distribution of the lfdr values for the unique MetaTissue discoveries is striking. It indicates that the majority of the unique MetaTissue discoveries are not close to being significant according to the HTeQTL model. This may be partially due to the inadequacy of the pvaluebased FDR control method for highly dependent tests in MetaTissue. We further investigated those geneSNP pairs with large lfdr values, and found that many have large effect sizes with opposite signs in different tissues. These geneSNP pairs cannot be well characterized by Model (1), because the estimated correlations between tissues are large and positive. As a result, they have large lfdrs from HTeQTL. Further research is needed to determine whether those geneSNP pairs are true eQTLs of interest or not.
Discussion
In this paper, we develop a new method, HTeQTL, for joint analysis of eQTL in a large number of tissues. The method builds upon the empirical Bayesian framework, MTeQTL, proposed in [9], and extends it to 20 or more tissues. Like the earlier model, the HTeQTL model provides a flexible platform for modeling and testing different configurations of eQTLs, while effectively leveraging information across tissues and across geneSNP pairs. The model fitting procedure only involves the estimation of all 2tissue models, and the obtained pairwise parameters are then assembled to get the full model parameters. Even in lowdimensional settings, the HTeQTL method expedites the parameter estimation procedure of the MTeQTL model with little cost in accuracy. The detection of eQTLs with different configurations is addressed by adaptively thresholding the corresponding local false discovery rates, which efficiently borrow strength across tissues and control the nominal FDR. Finally, the numerical studies demonstrate the efficacy of the proposed method. In the GTEx v6p data analysis, we apply HTeQTL to 20 tissues. The estimated prior probabilities of eQTL configurations show that most eQTLs are common across all tissues or present in a single tissue. The estimated effect sizes provide additional insights into the tissue similarity and clustering. We identify a large number of common and tissuespecific eQTLs. A large proportion of the discoveries are replicated by the MetaTissue approach. The additional unique discoveries made by our method are “near” discoveries for the MetaTissue method, as illustrated by the highly skewed pvalue distributions (see Fig. 5). It indicates that HTeQTL is able to push the detection boundary in a favorable direction (i.e., more statistical power) while preserving error control.
The HTeQTL method is a necessary first step in the extension of the multitissue eQTL model, and a basis for extensions to 30 or more tissues. There are several future research directions. One the one hand, the proposed method relies on the marginal compatibility of a multivariate Gaussian distribution. In practice, if the joint distribution of the zstatistics deviates from the Gaussian distribution, it may affect the model fitting. One may investigate multivariate transformations to make the zstatistics jointly Gaussian. Another direction is to estimate very high dimensional distributions on the space of configurations. One may explore a hierarchical structure in tissues, where each hierarchy only consists of a moderate number of tissues (or tissue groups). Then the proposed method can be applied to each hierarchy separately and combined afterwards. One could also explore computationally efficient and accurate approximations of the cumulative probabilities of a highdimensional multivariate Gaussian distribution.
Conclusions
We present a scalable method for multitissue eQTL analysis. The method can effectively borrow strength across tissues to improve the power of eQTL detection in a single tissue. It also has superior power to detect eQTL of different configurations. The model parameters capture important biological insights into tissue similarity and specificity. In particular, from the GTEx analysis we observe that most cis eQTLs are present in either all tissues or a single tissue. The eQTLs identified by the proposed method provide a valuable resource for subsequent analysis, and may facilitate the discovery of genetic regulatory pathways underlying complex diseases.
Abbreviations
 EM:

ExpectationMaximization
 eQTL:

Expression Quantitative Trait Loci
 FDR:

False discovery rate
 GTEx:

GenotypeTissue Expression project
 GWAS:

GenomeWide Association Studies
 HTeQTL:

HighTissue Expression Quantitative Trait Loci analysis
 lfdr:

Local false discovery rate
 Mb:

Megabase
 MTeQTL:

MultiTissue Expression Quantitative Trait Loci analysis method
 ROC:

Receiver Operating Characteristic
 SNP:

Single nucleotide polymorphism
 TBT:

TissuebyTissue analysis
 TSS:

Transcription start site
References
Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. Mapping complex disease traits with global gene expression. Nat Rev Genet. 2009; 10(3):184–94.
Mackay TF, Stone EA, Ayroles JF. The genetics of quantitative traits: challenges and prospects. Nat Rev Genet. 2009; 10(8):565–77.
Peterson CB, Bogomolov M, Benjamini Y, Sabatti C. Treeqtl: hierarchical error control for eqtl findings. Bioinformatics. 2016; 32(16):2556–8.
Bogomolov M, Peterson CB, Benjamini Y, Sabatti C. Testing hypotheses on a tree: new error rates and controlling strategies. arXiv preprint arXiv:1705.07529. 2017.
Lewin A, Saadi H, Peters JE, MorenoMoral A, Lee JC, Smith KG, Petretto E, Bottolo L, Richardson S. Mthess: an efficient bayesian approach for simultaneous association detection in omics datasets, with application to eqtl mapping in multiple tissues. Bioinformatics. 2015; 32(4):523–32.
Sul JH, Han B, Ye C, Choi T, Eskin E. Effectively identifying eQTLs from multiple tissues by combining mixed model and metaanalytic approaches. PLoS Genet. 2013; 9(6):1003491.
Han B, Duong D, Sul JH, de Bakker PI, Eskin E, Raychaudhuri S. A general framework for metaanalyzing dependent studies with overlapping subjects in association mapping. Hum Mol Genet. 2016; 25(9):1857–66.
Flutre T, Wen X, Pritchard J, Stephens M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 2013; 9(5):1003486.
Li G, Shabalin AA, Rusyn I, Wright FA, Nobel AB. An empirical bayes approach for multiple tissue eQTL analysis. Biostatistics. 2017. kxx048. https://doi.org/10.1093/biostatistics/kxx048.
Urbut SM, Wang G, Stephens M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. bioRxiv. 2017:096552.
The GTEx Consortium. The genotypetissue expression (gtex) pilot analysis: Multitissue gene regulation in humans. Science. 2015; 348(6235):648–60.
Mudholkar GS, Chaubey YP. On the distribution of Fisher’s transformation of the correlation coefficient. Commun Stat Simul Comput. 1976; 5(4):163–72.
Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001; 96(456):1151–60.
Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5(2):155–76.
Efron B. Size, power and false discovery rates. Ann Stat. 2007; 35(4):1351–77.
Efron B. Microarrays, empirical Bayes and the twogroups model. Stat Sci. 2008; 23(1):1–22.
Sun W, Cai TT. Oracle and adaptive compound decision rules for false discovery rate control. J Am Stat Assoc. 2007; 102(479):901–12.
Heinzen EL, Ge D, Cronin KD, Maia JM, Shianna KV, Gabriel WN, WelshBohmer KA, Hulette CM, Denny TN, Goldstein DB. Tissuespecific genetic control of splicing: implications for the study of complex traits. PLoS Biol. 2008; 6(12):1000001.
Dimas AS, Deutsch S, Stranger BE, et al. Common regulatory variation impacts gene expression in a cell type–dependent manner. Science. 2009; 325(5945):1246–50.
Ding J, Gudjonsson JE, Liang L, et al. Gene expression in skin and lymphoblastoid cells: refined statistical method reveals extensive overlap in ciseQTL signals. Am J Hum Genet. 2010; 87(6):779–89.
Fu J, Wolfs MG, Deelen P, et al. Unraveling the regulatory mechanisms underlying tissuedependent genetic variation of gene expression. PLoS Genet. 2012; 8(1):1002431.
Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001; 29:1165–88.
Acknowledgements
The authors would like to thank members of the GTEx Analysis Working Group for helpful comments and discussions.
Funding
This work was supported by the National Institutes of Health [1R01HG00898001 to GL, R01MH101819 and R01MH090936 to GL, ABN, FAW, R01HG009125 to ABN]; the National Science Foundation [DMS1613072 to ABN]; and the National Institute of Environmental Health Sciences [P42ES005948 to FAW, P30ES025128 to DJ].
Availability of data and materials
The GTEx v6p data used in this paper are available from the GTEx portal (although application may be required) at http://www.gtexportal.org/home/. The Matlab code for implementing the method, including a numerical example, is publicly available at https://github.com/reagan0323/MTeQTL/tree/master/HTeQTL.
Author information
Authors and Affiliations
Contributions
GL, ABN and FAW conceptualized the project and developed the novel methodology and analysis. DJ helped conduct the MetaTissue method. GL, ABN and FAW contributed to the analysis and interpretation of results and editing the final manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary material of the HTeQTL model fitting procedure. (PDF 144 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, G., Jima, D., Wright, F.A. et al. HTeQTL: integrative expression quantitative trait loci analysis in a large number of human tissues. BMC Bioinformatics 19, 95 (2018). https://doi.org/10.1186/s1285901820883
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901820883