 Research Article
 Open Access
 Published:
Sparse regression models for unraveling group and individual associations in eQTL mapping
BMC Bioinformatics volume 17, Article number: 136 (2016)
Abstract
Background
As a promising tool for dissecting the genetic basis of common diseases, expression quantitative trait loci (eQTL) study has attracted increasing research interest. Traditional eQTL methods focus on testing the associations between individual singlenucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to biological pathways.
Results
To alleviate this limitation, in this paper, we propose geQTL, a sparse regression method that can detect both groupwise and individual associations between SNPs and expression traits. geQTL can also correct the effects of potential confounders. Our method employs computationally efficient technique, thus it is able to fulfill large scale studies. Moreover, our method can automatically infer the proper number of groupwise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. The results show that geQTL can effectively detect both individual and groupwise signals and outperforms the stateofthearts by a large margin.
Conclusions
This paper well illustrates that decoupling individual and groupwise associations for association mapping is able to improve eQTL mapping accuracy, and inferring individual and groupwise associations.
Background
Expression quantitative trait loci (eQTL) mapping aims at identifying single nucleotide polymorphisms (SNPs) that influence the expression level of genes. It has been widely applied to analyze the genetic basis of gene expression and molecular mechanisms underlying complex traits [1, 2]. In a typical eQTL study, the association between each expression trait and each SNP is assessed separately [3–5]. This approach does not consider the interactions among SNPs and among genes. However, multiple SNPs may interact with each other and jointly influence the phenotypes [6]. This assumption will inevitably miss complex cases where multiple genetic variants jointly affect the coexpressions of multiple genes. It has been observed in biological experiments that the joint effect of multiple SNPs to a phenotype may be nonadditive [6], and genes from the same biological pathway are usually coregulated [7] by the same genetic basis. The biological process contains both individual effects and joint effects between SNPs and genes [8]. A straightforward approach to detect associations between sets of SNPs and a gene expression level can be done using the standard gene set enrichment analysis [9]. Wu et al. [10] further proposed the variance component models for SNP set testing. Braun et al. employed aggregationbased approaches to cluster SNPs [11]. In [12], Listgarten et al. further considered the potential confounding factors.
However, there are two limitations for these approaches. First, these methods typically only consider SNPs from predefined pathways or gene ontology categories, which are far from being complete. Second, these methods can only detect the mapping of SNP set and a single gene expression level. To better elucidate the genetic basis of gene expression, it is a crucial challenge to understand how multiple modestlyassociated SNPs interact to influence the a group of genes [6]. In this paper, we refer to this kind of eQTL mapping to find associations between group of SNPs and group of gene expression levels as the groupwise eQTL mapping. An example is shown in Fig. 1. Note that an ideal model should allow overlaps between SNP sets and between gene sets, that is, a SNP or gene may participate in multiple individual and groupwise associations [6]. In literature, groupwise eQTL mapping has attracted increasing research interest recently. For example, Xu et al. [13] proposed a twographguided multitask Lasso approach to infer groupwise eQTL mapping. However, it required the grouping information of both SNPs and genes available as prior knowledge, which may not be practical for many applications. Besides, it is not able to correct the effects of confounding factors.
In this paper, we propose a novel method, geQTL, to automatically detect individual and groupwise associations in eQTL studies. It uses a twolayer feature selection strategy and adopts efficient optimization techniques, which make it suitable for large scale studies. Moreover, geQTL can automatically infer the optimal number of groupwise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method.
Methods
Preliminaries
Important notations used in this paper are listed in Table 1. In this paper, for each sample, the data of SNPs and genes are denoted by column vectors. Let x=[x _{1},x _{2},…,x _{ K }]^{T} denote the K SNPs. Here, x _{ i }∈{0,1,2} denotes a random variable corresponding to the ith SNP (For example, 0, 1, 2 may encode the homozygous major allele, heterozygous allele, and homozygous minor allele, respectively.). Let z=[z _{1},z _{2},…,z _{ N }]^{T} denote the N genes in the study. z _{ j } denotes a continuous random variable corresponding to the jth gene expression. Let \(\mathbf {X}=\{{\mathbf {x}}_{h}1\leq h\leq H\} \in {\mathbb {R}}^{K \times H}\) be the SNP matrix. We use \(\mathbf {Z}=\{{\mathbf {z}}_{h}1\leq h \leq H\}\in {\mathbb {R}}^{N\times H}\) to denote the matrix of gene expression levels. H denotes the number of samples in consideration.
The traditional linear regression model for association mapping between x and z is
where z is a linear function of x with coefficient matrix W, μ is an N×1 translation factor vector. And ε is the additive noise of Gaussian distribution with zeromean and variance γ I, where γ is a scalar. That is, \(\mathbf {\epsilon } \sim \mathcal {N}(\mathbf {0},\gamma \mathbf {I})\).
In association studies, sparsity is a reasonable assumption because only a small fraction of genetic variants are expected to be associated with a set of gene expression traits. This can be modeled as a feature selection problem. For example, the standard Lasso [5] can be used in association mapping, which applies ℓ _{1} penalty on W for sparsity.
If both X and Z are standardized, the objective function of Lasso is formulated as
where ·_{ F } denotes the Frobenius norm, ·_{1} is the ℓ _{1}norm. η is the empirical parameter for the ℓ _{1} penalty. W is the parameter (also called weight) matrix parameterizing the space of linear functions mapping from X to Z.
Confounding factors, such as unobserved covariates, experimental artifacts and unknown environmental perturbations, may mask real signals and lead to spurious findings. LORS [14] uses a lowrank matrix \(\mathbf {L}\in {\mathbb {R}}^{N\times H}\) to account for the variations caused by hidden factors. The objective function of LORS is
where ·_{∗} is the nuclear norm [14]. ρ is the regularization parameter to control the rank of L. L is a lowrank matrix assuming that there are only a small number of hidden factors influencing the gene expression levels.
When we fix {W, we can optimize {L} by using singular value decomposition (SVD) according to the following lemma.
Lemma 1.
([15]) Suppose that matrix O has rank r. The solution to the optimization problem
is given by \(\mathbf {\widehat {S}}=\mathbf {H}_{\lambda }(\mathbf {O})\), where \(\mathbf {H}_{\lambda }(\mathbf {O})=\mathbf {U}\mathbf {D}_{\lambda }\mathbf {V}^{\mathrm {T}}\) with D _{ λ }=diag[(d _{1}−λ)_{+},…,(d _{ r }−λ)_{+}], U D V ^{T} is the Singular Value Decomposition (SVD) of O, D=diag[d _{1},…,d _{ r }], and \((d_{i}\lambda)_{+}=\max ((d_{i}\lambda),0), (1\leq i \leq r)\).
Thus, for fixed W, the formula for updating L is
Both Lasso and LORS do not consider the existence of groupwise associations. Below, we will introduce the proposed model to infer both groupwise and individual associations for eQTL mapping.
geQTL
In geQTL, individual associations between SNPs and genes are modeled by following the Lassobased strategy. Groupwise associations are inferred using a twolayer feature selection method. Since multiple SNPs may have joint effect on a group of genes, and such effect may be accomplished through complex biological processes, we introduce latent variables to bridge sets of SNPs and sets of genes. Specifically, we assume that there exist latent factors regulating the gene expression level, which serve as bridges between the SNPs and the genes. The latent variables are denoted by y=[y _{1},y _{2},…,y _{ M }]^{T}. Here, M (M≪m i n(K,N)) is the total number of latent variables representing groupwise associations. The relationship between x and y can be represented as
where
\(\mathbf {A} \in \mathbb {R}^{M\times K}\) denotes the matrix of coefficients between x and y. \({\sigma _{1}^{2}}\mathbf {I}_{M}\) denotes the variances of the additive noise. I _{ M } is an identity matrix. Here we drop the intercept terms because the input data X and Z are normalized to zero mean and unit variance as preprocessing.
Similarly, the relationship between y and z can be represented as
where
\(\mathbf {B} \in \mathbb {R}^{N\times M}\) denotes the matrix of coefficients between y and z, \(\mathbf {C} \in \mathbb {R}^{N\times K}\) denotes the matrix of coefficients between x and z to encode the individual associations.
Note that Eq. (7) decouples the associations between SNPs and genes into two parts: one for individual associations represented as C x, and another for groupwise associations represented as B y. Next, we infer the groupwise associations by a twolayer feature selection strategy. We first remove the individual associations and denote
Thus \(\tilde {\mathbf {Z}}\) contains only groupwise effects. Next let
Thus Y represents a lowrank transformation of the original SNP matrix. Each row of Y represents a group of SNPs. From Eq. (7), we have the following multipleinputmultipleoutput (MIMO) linear system
where E is a Gaussian whitenoise term. In Eq. (9) and (10), A and B should be sparse since a single gene is often influenced by a small number of SNPs and vice versa [12].
Therefore, the overall objective function is
where α,β,γ,ρ are the regularization parameters, and the loss function is
Here, we choose different penalties for A,B,C because the sparsities of different matrices are typically of different scales.
Optimization
The optimization for L can be achieved by following a similar approach as in [14]. To optimize A,B,C, many tools can be used to optimize the ℓ _{1} penalized objective function, e.g., the OrthantWise Limitedmemory QuasiNewton (OWLQN) algorithm [16]. Due to space limitation, we omit the details. In the next, we devise optimization techniques that can dramatically improve the computational efficiency of geQTL.
Boosting the computational efficiency
Given a large number of SNPs and gene expression traits, scalability of the algorithm is a crucial issue. We propose two improved models, geQTL ^{+} and geQTLridge, which optimize the search for significant individual associations, which is the main computational bottleneck of the algorithm.
geQTL ^{+}
In a typical eQTL study, we usually have M≪m i n(K,N). Thus, the bottleneck of the algorithm is to optimize C. Our strategy is to confine the space of C. The intuition is that we only permit a small fraction of elements in C to be nonzero. It has been shown that if Z and X are standardized with zero mean and unit sum of squares, then r=a b s(Z X ^{T}) is equal to the geneSNP correlations (r _{ gs }=c o r(z _{ g },x _{ s })) [17]. Since for many test statistics, e.g., t, F, R ^{2}, and LR, for the simple linear regression problem can be expressed as functions of the sample correlation r _{ gs }, e.g., R ^{2}=r ^{2}, and \(t=\frac {r\sqrt {n2}}{1r^{2}}\), we can find a threshold according to the required pvalue, such that test statistics exceeding the threshold are significant at the required significance level. The test statistics for every geneSNP pair in r are compared with the threshold, and only those elements whose r are greater than the threshold are optimized. We denote \(\mathbf {R} \in \mathbb {R}^{N\times K}\) as the indicator matrix indicating which elements in C can be nonzero (i.e., r _{ gs }>t h r e s h o l d).
geQTLridge
When N and K are extremely large, optimizing C may still be timeconsuming, since it may take many iterations to converge with the ℓ _{1} constraint. Next, we introduce geQTLridge, which further improves the time efficiency with slight decrease in accuracy. The key idea is to use ridge regression for individual associations so that we can get a closed form solution for C. The objective function is shown in the following.
Theorem 1.
The solution of C in Eq. (13) is
where
and P _{ i } is defined as in formula (19).
The proof of the Theorem 1 is in the following section.
Proof of Theorem 1
Proof.
Recall that any ridge regression problem
where a is a row vector and Q has linearly independent rows, has the following solution
Note that
where D=Z−L−B A X, c _{ i }=(C)_{ i,:} and d _{ i }=(D)_{ i,:}.
We have
Taking into account that (c _{ i })_{ j } can be nonzero only if (R)_{ i,j } is 1, we introduce P _{ i }, where P _{ i } has K rows and \(l_{i} = \sum _{j=1}^{K} (\mathbf {R})_{i,j}\) columns. And
Then \(\mathbf {c}_{i} = \mathbf {c}_{i} \mathbf {P}_{i} \mathbf {P}_{i}^{\mathrm {T}}\), \(\mathbf {d}_{i}  \mathbf {c}_{i} \mathbf {X}_{2}^{2} +\gamma \mathbf {c}_{i}_{2}^{2} = \mathbf {d}_{i}  (\mathbf {c}_{i} \mathbf {P}_{i}) \left (\mathbf {P}_{i}^{\mathrm {T}} \mathbf {X}\right)_{2}^{2} +\gamma \mathbf {c}_{i} \mathbf {P}_{i}_{2}^{2} \), and
is solved by
Therefore,
is solved by \(\mathbf {C} = \left (\mathbf {c}_{1}^{\mathrm {T}}, \ldots, \mathbf {c}_{N}^{\mathrm {T}}\right)^{\mathrm {T}},\) which leads to the update formula given in Eq. (14).
Determining the number of hidden variables
In Eq. (12), we use B A+C to formulate the overall associations between SNPs and expression traits.
Two groupwise associations will not share the same group of SNPs (or genes), since otherwise these two groupwise associations can be combined into one. Therefore, every groupwise association should be unique and irreplaceable. Hence, following two conditions should be satisfied

A has linearly independent rows. Since M≪K, this condition is equivalent to that A has full rank;

B has linearly independent columns. Since M≪N, this condition is equivalent to that B has full rank.
When these two conditions are met, we have
The last equality holds because both A and B have full rank.
We have the following observation. The singular value decomposition (SVD) of B A has the form
where U and V are unitary (orthogonal in our case) matrices, and Σ is a rectangular diagonal matrix with nonnegative real numbers on the diagonal, which corresponds to singular values of B A. Since U and V are unitary and hence have full rank, we have
We compute B A by minimizing Eq. (12), which gives
Combine (22), (23), and (24), we find
Due to the existence of noise, we should allow small singular values to be considered as zero. Therefore, we can draw a plot with singular values of \((\mathbf {Z} \mathbf {L} \mathbf {C} \mathbf {X}) \mathbf {X}^{\mathrm {T}}\left (\mathbf {X} \mathbf {X}^{\mathrm {T}}\right)^{1}\) in descending order and set M to be k, if the first k singular values are large and significantly greater than the (k+1)th singular value.
Based on the discussion above, in order to find optimal M, we can first use Lasso to infer the initial value of C. Then, using Eq. 25, we can infer the optimal M at this stage. After that, we can optimize new C, and calculate new optimal M. We can repeat this procedure until M became stable or reach maximal number of iterations.
Results
In this section, we perform extensive experimental study using both simulated and real eQTL datasets to evaluate the performance of our methods. For comparison, we select several stateoftheart eQTL methods, including twograph guided multitask lasso (MTLasso2G) [13], FaSTLMM [12], SETeQTL [18], LORS [14], Matrix eQTL [17] and Lasso [5]. Note that we did not compare with our previous work, GDL, in [19] because it needs to incorporate many prior knowledge, that is not relevant to this work. For all the methods, the tuning parameters are learned using cross validation. The discussion of setting proper number of groupwise associations M is included in the Additional file 1. The shrinkage of the coefficients is also presented in the Additional file 1.
Simulated data
We use a similar setup for simulation study to that in [14]. First, 100 SNPs are randomly selected from the yeast eQTL dataset [20]. This gives birth to the matrix X. 100 gene expression profiles are generated by Z _{ j∗}=β _{ j∗} X+Ξ _{ j∗}+E _{ j∗} (1≤j≤N), where \(\mathbf {E}_{j*}\sim \mathcal {N}(0,\phi I)\) (ϕ=0.1) is used to simulate the Gaussian noise. To simulate the effects of confounding factors, we use Ξ _{ j∗}, drawn from \(\mathcal {N}(\mathbf {0},\tau \Lambda)\). In this paper, we set τ=0.1. Λ is given by F F ^{T}. Here, \(\mathbf {F}\in \mathbb {R}^{H\times J}\) and \(\mathbf {F}_{ij}\sim \mathcal {N}(0,1)\). J denotes hidden factor number. In this paper, we set J to 10.
In the left most of Fig. 2, we illustrate β. Here, we set the association strength to 1. Totally, there exist four groupwise associations with different scales. The diagonal line represents the individual signals in cisregulation.
In Fig. 2, we report the associations inferred by geQTL. Recall that groupwise associations can be inferred from matrix A and B, and individual associations can be inferred from matrix C. It is obvious that geQTL can detect both groupwise and individual signals.
We use \(SNR=\sqrt {\frac {Var(\mathbf {\beta }\mathbf {X})}{Var(\Xi + \mathbf {E})}}\) to denote the signaltonoise ratio [14] in the eQTL datasets. Here, we fix J=10,τ=0.1. The SNR’s are controlled by using different ϕ’s. Using 50 simulated datasets with different SNR’s, we compare the proposed methods with the selected methods. Because FaSTLMM requires the input of genomic locations information (e.g., chromosome, base pair, etc), we will compare it on the real data set. The results are averaged over 50 different simulated datasets. B A+C is used to represent the association matrix in our method. Figure 3 shows the ROC curve of TPRFPR (true positive rate  false positive rate) for performance comparison. Typically, we care more about the TPR when the FPR is small because it is important to evaluate the performance of model when controlling the maximum tolerated FPR. Thus, in Fig. 3, the ROC of interest for eQTL are generally shown in the range [0,0.1]. The corresponding areas under the TPRFPR curve are shown in Fig. 4.
It can be seen that geQTL and geQTL ^{+} outperform all alternative methods by a large margin since they considers both individual and groupwise associations. We also observe that geQTLridge is not as good as geQTL and geQTL ^{+}. This is because geQTLridge does not provide a sparse solution for individual associations. MTLasso2G is comparable to LORS. LORS can correct the effects of the confounders, however, it is not able to detect groupwise mappings. We also observe that by decoupling individual and groupwise associations, the proposed models (geQTL, geQTL ^{+}, and geQTLridge) are more robust to noise than other methods.
Yeast eQTL data
We also validated geQTL using the bench mark dataset–yeast (Saccharomyces cerevisiae) eQTL dataset. The dataset contains 112 yeast segregants generated from a cross of two inbred strains [20]. Originally, It contains 6229 gene epxressions and 2956 SNPs. SNPs with >10 % missing values in the remaining SNPs are imputed using the function fill.geno in R/qtl [21]. The neighboring SNPs with the same genotype profiles are combined, resulting in 1027 genotype profiles. Remove gene expression traits with missing values, we get 4474 expression profiles.
cis and trans analysis
We follow the standard cisenrichment analysis that is used in [22, 23] for evaluation. Moreover, we use the transenrichment with a similar strategy [24]. Genes regulated by transcription factors (obtained from http://www.yeastract.com/download.php) are treated as transacting signals.
In Table 2, we report the pairwise comparison using cis and trans enrichment analysis. We do not list geQTL separately from geQTL+, since geQTL+ is a faster version of geQTL. In this table, the methods are sorted (from top to bottom in the left column and from left to right in the top row) in decreasing order of performance. A pvalue shows how significant a method on the left column outperforms a method in the top row in terms of cis and trans enrichments. We observe that geQTL ^{+} has significantly better cisenrichment scores than the other models. For transenrichment, geQTL ^{+} is the best, and MTLasso2G comes in second, outperforming FaSTLMM, SETeQTL, LORS, Matrix eQTL and Lasso. LORS outperforms Matrix eQTL and Lasso for both cis and transenrichment. This is because LORS considers confounding factors while Matrix eQTL and Lasso does not. In total, these methods each detected about 6000 associations according to nonzero W values. We estimate FDR using 50 permutations as proposed in [14]. With FDR ≤ 0.01, geQTL ^{+} obtains about 4500 significant associations. The plots of all identified significant associations for different methods are given in Fig. 5. Obviously, we can see that C+B×A and C of geQTL ^{+} report weaker transregulatory bands while stronger cisregulatory signals than other competitors.
Gene ontology enrichment analysis on detected groupwise associations for yeast
We further evaluate the quality of detected groups of genes by measuring the correlations between the detected groups of genes and the GO (Gene Ontology) categories [25]. Specifically, the GO enrichment test is calculated by DAVID [26]. In this paper, gene sets reported by our algorithm with calibrated pvalues less than 0.01 are considered as significantly enriched.
Since SETeQTL is the only previous approach capable of detecting groupwise association mapping, we compare the groups of genes detected by geQTL and those by SETeQTL. For SETeQTL, 90 out of 150 gene sets are significantly enriched. By contrast, 28 out of 30 gene sets reported by geQTL are significantly enriched. This well illustrates that the effectiveness of geQTL to infer groupwise associations. The number of SNPs in each group reported by geQTL and their genomic locations are shown in Fig. 6. We can clearly observe that SNPs in the same group are often physically close to each other. This is reasonable because SNPs nearby usually jointly affect the expression level of a set of genes that achieves a specific cell function [8].
Reproducibility of eQTLs between studies
To further evaluate the identified associations, we investigate the consistency of calls between two independent studies [27]. We examine the reproducibility based on the following two criteria [14, 28, 29]:

Reproducibility of detected SNPgene associations: Let L _{1} and L _{2} be the sets of SNPgene associations detected in the two yeast datasets, respectively. We can rank the associations according to the weights (or qvalues for FaSTLMM). Let \({L_{1}^{T}}\) and \({L_{2}^{T}}\) be the top T most significant associations from the two datasets. The reproducibility is defined as \(\frac {{L_{1}^{T}}\bigcap {L_{2}^{T}}}{T}\).

Reproducibility of trans regulatory hotspots: For each SNP, we count the number of associated genes from the detected SNPgene associations. We use this number as the regulatory degree of each SNP. For FaSTLMM, SNPGene pairs with a qvalue < 0.001 are considered significant associations. We also tried different cutoffs for FaSTLMM (from 0.01 to 0.001), the results are similar. SNPs with large regulatory degrees are often referred to as hotspots. We sort SNPs in descending order of their regulatory degrees. We denote the sorted SNPs lists as S _{1} and S _{2} for the two yeast datasets. Let \({S_{1}^{T}}\) and \({S_{2}^{T}}\) be the top T SNPs in the sorted SNP lists. The trans calling consistency of reported hotspots is denoted by \(\frac {{S_{1}^{T}}\bigcap {S_{2}^{T}}}{T}\).
In Fig. 7 a, we show the consistency of the top 4500 associations between different studies. In Fig. 7 b, we list the reproducibility of trans regulatory hotspots reported by different approaches. Overall, geQTL ^{+} yielded results with greater consistency all other methods. This well illustrates the superiority of geQTL ^{+}.
Conclusions
In literature, much efforts have been done on eQTL mapping. Traditional eQTL mapping approaches can not detect the groupwise associations between sets of SNPs and sets of genes. To achieve that, we propose a fast approach, geQTL, to detect both individual and groupwise associations for eQTL mapping. geQTL can also correct the effects of potential confounders. We also introduce efficient algorithms to scale up the computation so that the algorithms are able to tackle large scale data sets. Additionally, the proposed model provides an effective strategy to automatically infer the proper number of groupwise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. Inferring individual and groupwise associations also helps us better explain the genetic basis of gene expression. Due to scalability issue, our model simply assume random errors between different genes are independent and have the same variance. That is the reason why current model only identified a small number of groupwise associations. Our future work will further incorporate the relationships between genes by integrating gene coexpression network or proteinproteininteraction network.
Data availability
The software is publicly available at http://www.cs.unc.edu/~weicheng/bioinformatics_code.zip.
References
 1
Bochner BR. New technologies to assess genotype henotype relationships. Nat Rev Genet. 2003; 4:309–14.
 2
Michaelson J, Loguercio S, Beyer A. Detection and interpretation of expression quantitative trait loci (eQTL). Methods. 2009; 48(3):265–76.
 3
Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT. Mapping determinants of human gene expression by regional and genomewide association. Nature. 2005; 437:1365–69. doi:10.1038/nature04244.
 4
Zhu J, Zhang B, Smith EN, Drees B, Brem RB, Kruglyak L, Bumgarner RE, Schadt EE. Integrating largescale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet. 2008; 40(7):854–61. doi:10.1038/ng.167.
 5
Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996; 58(1):267–88.
 6
Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011; 470(7333):187–97.
 7
Pujana MA, Han JDJ, Starita LM, Stevens KN, Muneesh Tewari ea. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007:1338–1349.
 8
Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB. Detection of gene x gene interactions in genomewide association studies of human population data. Hum Hered. 2007:67–84.
 9
Holden M, Deng S, Wojnowski L, Kulle B. GSEASNP: applying gene set enrichment analysis to SNP data from genomewide association studies. Bioinformatics. 2008; 24(23):2784–785.
 10
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89(1):82–93.
 11
Braun R, Buetow K. Pathways of distinction analysis: a new technique for multiSNP analysis of GWAS data. PLoS Genet. 2011; 7(6):1002101.
 12
Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 2013; 29(12):1526–1533.
 13
Chen X, Shi X, Xu X, Wang Z, Mills R, Lee C, Xu J. A twograph guided multitask lasso approach for eQTL mapping In: Lawrence ND, Girolami MA, editors. Proceedings of the fifteenth international conference on artificial intelligence and statistics (AISTATS12); vol. 22: 2012. p. 208–17. http://jmlr.csail.mit.edu/proceedings/papers/v22/chen12b/chen12b.pdf.
 14
Yang C, Wang L, Zhang S, Zhao H. Accounting for nongenetic factors by lowrank representation and sparse regression for eQTL mapping. Bioinformatics. 2013; 29(8):1026–34. doi:10.1093/bioinformatics/btt075. Epub 2013 Feb 17.
 15
Mazumder R, Hastie T, Tibshirani R. Spectral regularization algorithms for learning large incomplete matrices. JMLR. 2010; 11:2287–322.
 16
Andrew G, Gao J. Scalable training of L1regularized loglinear models. In: International conference on machine learning: 2007. http://research.microsoft.com/apps/pubs/default.aspx?id=78900.
 17
Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012; 28(10):1353–1358.
 18
Cheng W, Zhang X, Wu Y, Yin X, Li J, Heckerman D, Wang W. Inferring novel associations between SNP sets and gene sets in eQTL study using sparse graphical model. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACMBCB’12. New York, NY, USA: ACM: 2012. p. 466–73. http://doi.acm.org/10.1145/2382936.2382996.
 19
Cheng W, Zhang X, Guo Z, Shi Y, Wang W. Graphregularized dual lasso for robust eQTL mapping. Bioinformatics. 2014; 30(12):i139–i148. doi:10.1093/bioinformatics/btu293.
 20
Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature. 2005; 436:701–03. doi:10.1038/nature03865.
 21
Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 2003; 19(7):889–90.
 22
Listgarten J, Kadie C, Schadt EE, Heckerman D. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci U.S.A. 2010; 107(38):16465–16470.
 23
McClurg P, Janes J, Wu C, Delano DL, Walker JR, Batalov S, Takahashi JS, Shimomura K, Kohsaka A, Bass J, Wiltshire T, Su AI. Genomewide association analysis in diverse inbred mice: power and population structure. Genetics. 2007; 176(1):675–83.
 24
Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L. Transacting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet. 2003; 35(1):57–64.
 25
The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9.
 26
Huang d. a. W., Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1):44–57.
 27
Smith EN, Kruglyak L. Geneenvironment interaction in yeast gene expression. PLoS Biol. 2008; 6(4):e83. doi:10.1371/journal.pbio.0060083.
 28
Fusi N, Stegle O, Lawrence ND. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput Biol. 2012; 8(1):1002330.
 29
Joo JW, Sul JH, Han B, Ye C, Eskin E. Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies. Genome Biol. 2014; 15(4):61.
Acknowledgments
This work is partially supported by NSF IIS1313606, NIH U01HG00848801, NIH R01GM11583301, NSF CAREER, NSF IIS1162374, and NSF IIS1218036.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Conceived and designed the experiments: WC. Performed the experiments: WC. Analyzed the data: WC YS. Contributed reagents/materials/analysis tools: WC. Wrote the paper: WC ZX WW YS. All authors have read and approved the final version of the paper.
Additional file
Additional file 1
Sparse regression models for unraveling group and individual associations in eQTL mapping. (PDF 1,004 KB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Cheng, W., Shi, Y., Zhang, X. et al. Sparse regression models for unraveling group and individual associations in eQTL mapping. BMC Bioinformatics 17, 136 (2016). https://doi.org/10.1186/s1285901609869
Received:
Accepted:
Published:
Keywords
 eQTL mapping
 Groupwise association
 Computation efficiency