Fast and robust group-wise eQTL mapping using sparse graphical models
- Wei Cheng^{1},
- Yu Shi^{2},
- Xiang Zhang^{3} and
- Wei Wang^{4} (corresponding author)
https://doi.org/10.1186/s12859-014-0421-z
© Cheng et al.; licensee BioMed Central. 2015
Received: 3 June 2014
Accepted: 11 December 2014
Published: 16 January 2015
Abstract
Background
Genome-wide expression quantitative trait loci (eQTL) studies have emerged as a powerful tool to understand the genetic basis of gene expression and complex traits. The traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to hidden biological pathways.
Results
We introduce a new approach to identify novel group-wise associations between sets of SNPs and sets of genes. Such associations are captured by hidden variables connecting SNPs and genes. Our model is a linear-Gaussian model and uses two types of hidden variables. One captures the set associations between SNPs and genes, and the other captures confounders. We develop an efficient optimization procedure which makes this approach suitable for large scale studies. Extensive experimental evaluations on both simulated and real datasets demonstrate that the proposed methods can effectively capture both individual and group-wise signals that cannot be identified by the state-of-the-art eQTL mapping methods.
Conclusions
Considering group-wise associations significantly improves the accuracy of eQTL mapping, and the successful multi-layer regression model opens a new approach to understand how multiple SNPs interact with each other to jointly affect the expression level of a group of genes.
Background
Expression quantitative trait loci (eQTL) mapping is the process of identifying single nucleotide polymorphisms (SNPs) that play important roles in the expression of genes. It has been widely used to dissect the genetic basis of complex traits [1,2]. Traditionally, associations between individual expression traits and SNPs are assessed separately [3,4].
Since genes in the same biological pathway are often co-regulated and may share a common genetic basis [5,6], it is crucial to understand how multiple modestly-associated SNPs interact to influence the phenotypes [7]. To address this issue, several approaches have been proposed to study the joint effect of multiple SNPs by testing the association between a set of SNPs and a gene expression trait. A straightforward approach is to follow the gene set enrichment analysis (GSEA) [8]. In [9], the authors propose variance component models for SNP set testing. Aggregation-based approaches such as collapsing SNPs are investigated in [10]. In [11], the authors take confounding factors into consideration.
Despite their success, these methods have two common limitations. First, they only study the association between a set of SNPs and a single expression trait, and thus overlook the joint effect of a set of SNPs on the activities of a set of genes, which may act and interact with each other to achieve a certain biological function. Second, the SNP sets used in these methods are usually taken from known pathways. However, the existing knowledge of biological pathways is far from complete, so these methods cannot identify unknown associations between SNP sets and gene sets.
To address these limitations, in [12], a method is developed to identify cliques in a bipartite graph derived from the eQTL data. Cliques are used to model the hidden correlations between SNP sets and gene sets. However, this method needs the progeny strain information, which is used as a bridge for modeling the eQTL association graphs. In [13], the authors proposed a method to infer associations between sets of SNPs and sets of genes. However, this method does not consider the associations between individual SNPs and genes. A two-graph-guided multi-task Lasso approach was developed in [14]. This method needs to calculate the gene co-expression network and the SNP correlation network first. Errors and noise in these two networks may introduce bias in the final results. A graph regularized dual lasso approach considering the factor of group-wise association was developed in [15]. This method, however, needs extra SNP-SNP interaction network and PPI network data to penalize the regression model, and it cannot infer novel group-wise associations. Note that none of these methods considers confounding factors.
To better elucidate the genetic basis of gene expression and understand the underlying biological pathways, it is highly desirable to develop methods that can automatically infer associations between a group of SNPs and a group of genes. We refer to the process of identifying such associations as group-wise eQTL mapping. In contrast, we refer to the process of identifying associations between individual SNPs and genes as individual eQTL mapping. In this paper, we introduce a fast and robust approach to identify novel associations between sets of SNPs and sets of genes. Our model is a multi-layer linear-Gaussian model and uses two different types of hidden variables: one capturing group-wise associations and the other capturing confounding factors [11,16-20]. We apply an ℓ _{1}-norm on the parameters [3,21], which yields a sparse network with a large number of association weights being zero [22]. We develop an efficient optimization procedure that makes this approach suitable for large-scale studies^{a}. Extensive experimental evaluations using both simulated and real datasets demonstrate that the proposed methods can effectively capture both group-wise and individual associations and significantly outperform the state-of-the-art eQTL mapping methods.
Methods
Preliminaries
Summary of notations
Symbol | Description |
---|---|
K | Number of SNPs |
N | Number of genes |
D | Number of samples |
x | The random variables of K SNPs |
z | The random variables of N genes |
s | The latent variables to model confounding factors |
y | The latent variables to model group-wise association |
\(\mathbf {X}\in {\mathbb {R}}^{K \times D}\) | The SNP matrix data |
M | Number of latent variables y |
H | Number of latent variables s |
\(\mathbf {Z} \in {\mathbb {R}}^{N \times D}\) | The gene expression matrix data |
\(\textbf {A} \in \mathbb {R}^{M\times K}\) | The coefficient matrix between x and y |
\(\textbf {B} \in \mathbb {R}^{N\times M}\) | The coefficient matrix between y and z |
\(\textbf {C} \in \mathbb {R}^{N\times K}\) | The coefficient matrix between x and z |
\(\textbf {W} \in \mathbb {R}^{N\times H}\) | The coefficient matrix of confounding factors |
\(\boldsymbol {\mu }_{\textbf {A}}\in \mathbb {R}^{M\times 1}\), \(\boldsymbol {\mu }_{\textbf {B}}\in \mathbb {R}^{N\times 1}\) | The translation factor vectors |
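The basic linear model relating SNPs to gene expression can be written as \(\mathbf {z} = \boldsymbol {\beta }\mathbf {x} + \boldsymbol {\mu } + \boldsymbol {\epsilon }\), where z is a linear function of x with coefficient matrix β, μ is an N×1 translation factor vector, and ε is additive Gaussian noise with zero mean and covariance ψ I, where ψ is a scalar. That is, ε∼N(0,ψ I).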
The question now is how to define an appropriate objective function to decompose β that (1) can effectively detect both individual and group-wise eQTL associations, and (2) is efficient to compute so that it is suitable for large-scale studies. In what follows, we first propose a group-wise eQTL detection method and then extend it to capture both individual and group-wise associations. Finally, we discuss how to boost the computational efficiency.
Graphical model for group-wise eQTL mapping
Incorporating individual effect
Objective function
where \(\textbf {A} \in \mathbb {R}^{M\times K}\) is the coefficient matrix between x and y, \(\textbf {B} \in \mathbb {R}^{N\times M}\) is the coefficient matrix between y and z, \(\textbf {C} \in \mathbb {R}^{N\times K}\) is the coefficient matrix between x and z to capture the individual associations, \(\textbf {W} \in \mathbb {R}^{N\times H}\) is the coefficient matrix of confounding factors. \(\boldsymbol {\mu }_{\textbf {A}}\in \mathbb {R}^{M\times 1}\) and \(\boldsymbol {\mu }_{\textbf {B}}\in \mathbb {R}^{N\times 1}\) are the translation factor vectors, \({\sigma _{1}^{2}}\mathbf {I}_{M}\) and \({\sigma _{2}^{2}}\mathbf {I}_{N}\) are the variances of the two conditional probabilities respectively (σ _{1} and σ _{2} are constant scalars and I _{ M } and I _{ N } are identity matrices).
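The roles of these parameters can be illustrated by sampling from a two-layer linear-Gaussian model. The following is a minimal sketch, assuming the two conditionals take the form y|x ∼ N(Ax + μ_A, σ_1² I_M) and z|x,y,s ∼ N(By + Cx + Ws + μ_B, σ_2² I_N) with s ∼ N(0, I_H). The display equations are not reproduced in this text, so that form, the function name `sample_model` and the toy dimensions are assumptions made for illustration only.

```python
import numpy as np

def sample_model(X, A, B, C, W, mu_A, mu_B, sigma1, sigma2, rng):
    """Draw gene expression Z (N x D) given genotypes X (K x D).

    Assumed generative form (an illustration, not the paper's code):
      y = A x + mu_A + noise(sigma1),   s ~ N(0, I_H),
      z = B y + C x + W s + mu_B + noise(sigma2).
    """
    M, D = A.shape[0], X.shape[1]
    N, H = W.shape
    Y = A @ X + mu_A + sigma1 * rng.standard_normal((M, D))
    S = rng.standard_normal((H, D))                      # confounders
    Z = B @ Y + C @ X + W @ S + mu_B + sigma2 * rng.standard_normal((N, D))
    return Z, Y, S

rng = np.random.default_rng(0)
K, N, D, M, H = 20, 15, 50, 3, 2                         # arbitrary toy sizes
X = rng.integers(0, 3, size=(K, D)).astype(float)        # genotypes coded 0/1/2
A = rng.standard_normal((M, K)) * (rng.random((M, K)) < 0.2)   # sparse SNP -> y
B = rng.standard_normal((N, M)) * (rng.random((N, M)) < 0.3)   # sparse y -> gene
C = rng.standard_normal((N, K)) * (rng.random((N, K)) < 0.05)  # sparse SNP -> gene
W = rng.standard_normal((N, H))
mu_A, mu_B = np.zeros((M, 1)), np.zeros((N, 1))

Z, Y, S = sample_model(X, A, B, C, W, mu_A, mu_B, 0.1, 0.1, rng)
overall = B @ A + C        # overall SNP-gene association weights (N x K)
```

Under these assumptions, setting the noise and the confounder weights to zero reduces the model to Z = (BA + C)X, which is why B×A + C can be read as the overall association matrix.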
Since the expression level of a gene is usually affected by a small fraction of SNPs, we impose sparsity on A, B and C. We assume that the entries of these matrices follow Laplace distributions:
A _{ i,j }∼Laplace(0,1/λ),
B _{ i,j }∼Laplace(0,1/γ), and
C _{ i,j }∼Laplace(0,1/α).
λ, γ and α will be used as parameters in the objective function. The probability density function of Laplace(μ,b) distribution is \(f(x|\mu,b)=\frac {1}{2b}\exp \left (-\frac {|x-\mu |}{b}\right)\).
where ||·||_{1} is the ℓ _{1}-norm. λ, γ and α are the inverse scale parameters of the prior Laplace distributions of A, B and C, respectively. They serve as the regularization parameters and can be determined by cross-validation or holdout validation.
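The link between the Laplace priors and the ℓ _{1} penalty can be made explicit by taking the negative logarithm of the prior density. The short derivation below is a sketch for A only, using the Laplace(0,1/λ) density given above with independent entries; the same steps apply to B and C with γ and α.

```latex
\begin{aligned}
-\log p(\mathbf{A})
  &= -\sum_{i,j}\log\left[\frac{\lambda}{2}\exp\left(-\lambda\,|A_{i,j}|\right)\right] \\
  &= \lambda\sum_{i,j}|A_{i,j}| \;-\; MK\log\frac{\lambda}{2} \\
  &= \lambda\,\|\mathbf{A}\|_{1} + \text{const}.
\end{aligned}
```

The constant does not depend on A, so maximizing the posterior with respect to A is equivalent to adding λ||A||_{1} to the negative log-likelihood.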
Optimization
In addition to the loss function and the penalized parameters, the OWL-QN algorithm also requires the gradient of the loss function, which is given, without detailed derivation, in Additional file 1.
Computational speedup
In this section, we discuss how to speed up the optimization process for the proposed model. In the previous section, we have shown that A, B, C, W, σ _{1}, and σ _{2} are the parameters to be solved. Here, we first derive an updating scheme for σ _{2} when the other parameters are fixed, following a technique similar to that discussed in [25]. For the other parameters, we develop an efficient method for calculating the inverse of the covariance matrix, which is the main bottleneck of the optimization process.
Updating σ _{ 2 }
where U is an N×(N−q) eigenvector matrix corresponding to the nonzero eigenvalues and V is an N×q eigenvector matrix corresponding to the zero eigenvalues. A reasonable solution should have no zero eigenvalues in Σ; otherwise the loss function would be infinite. Therefore, q=0.
This is a 1-dimensional optimization problem that can be solved very efficiently.
Efficiently inverting the covariance matrix
We devise an acceleration strategy that calculates Σ ^{−1} using formula (14) in the following theorem. The complexity of computing the inverse reduces to \(\mathcal {O}(M^{3}+H^{3})\).
Theorem 1.
The proof is provided in the Additional file 1.
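The kind of saving described above can be illustrated with the Woodbury matrix identity. The sketch below assumes the covariance has the low-rank-plus-diagonal form Σ = σ_2² I_N + σ_1² B Bᵀ + W Wᵀ, which is what marginalizing y and s out of a linear-Gaussian model of this shape would give. That form and the function name `woodbury_inverse` are assumptions, and formula (14) of Theorem 1 is not reproduced here: the sketch stacks both low-rank factors and solves a single (M+H)×(M+H) system, which may differ from the factorization used in the theorem.

```python
import numpy as np

def woodbury_inverse(B, W, sigma1, sigma2):
    """Invert Sigma = sigma2^2 I + sigma1^2 B B^T + W W^T via Woodbury.

    Only an (M+H) x (M+H) linear system is solved, not an N x N one.
    """
    N = B.shape[0]
    U = np.hstack([sigma1 * B, W])             # N x (M+H); Sigma = s I + U U^T
    s = sigma2 ** 2
    small = s * np.eye(U.shape[1]) + U.T @ U   # (M+H) x (M+H)
    # (s I + U U^T)^{-1} = (I - U (s I + U^T U)^{-1} U^T) / s
    return (np.eye(N) - U @ np.linalg.solve(small, U.T)) / s

rng = np.random.default_rng(1)
N, M, H = 200, 5, 3                            # toy sizes with N >> M + H
B = rng.standard_normal((N, M))
W = rng.standard_normal((N, H))
Sigma_inv = woodbury_inverse(B, W, sigma1=0.7, sigma2=1.3)
```

In practice the N×N inverse would rarely be formed explicitly; the same identity can be applied directly to matrix-vector products, which keeps the cost linear in N.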
Results and discussion
We apply our method to both simulation datasets and yeast eQTL datasets [26] to evaluate its performance. For simplicity, we refer to the proposed model that only considers group-wise associations as Model 1, and the model that considers both individual and group-wise associations as Model 2. For comparison, we select several recent eQTL methods, including LORS [27], MTLasso2G [14], FaST-LMM [11], SET-eQTL [13] and Lasso [3]. The tuning parameters in the selected methods are learned using cross-validation. All experiments are performed on a PC with a 2.20 GHz Intel i7 eight-core CPU and 8 GB memory.
Simulation study
Next, we generate 50 simulated datasets with different signal-to-noise ratios (defined as \(SNR=\sqrt {\frac {Var(\mathbf {\beta }\mathbf {X})}{Var(\Xi + \mathbf {E})}}\)) in the eQTL datasets [27] to compare the performance of the selected methods. Here, we fix H=10 and ρ=0.1, and use different values of η to control the SNR. For each setting, we report the averaged result from the 50 datasets. For the proposed methods, we use B×A+C as the overall associations. Since FaST-LMM needs extra information (e.g., the genetic similarities between individuals) and uses the PLINK format, we do not list it here and instead compare it on the real dataset.
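The SNR definition above can be computed directly from the simulated matrices. This is a minimal sketch, assuming that Ξ and E are the N×D confounding and noise matrices and that Var(·) denotes the variance over all entries of a matrix; both readings, and the function name `signal_to_noise_ratio`, are assumptions rather than details stated in the text.

```python
import numpy as np

def signal_to_noise_ratio(beta, X, Xi, E):
    """SNR = sqrt(Var(beta X) / Var(Xi + E)), variances over all entries.

    beta: N x K association matrix, X: K x D genotypes,
    Xi: N x D confounding term, E: N x D noise term.
    """
    signal = beta @ X
    noise = Xi + E
    return float(np.sqrt(signal.var() / noise.var()))

# Tiny hand-checkable example: signal entries are 0 or 2 (variance 1),
# noise entries are -0.5 or 0.5 (variance 0.25), so the ratio is sqrt(4).
beta = np.eye(2)
X = np.array([[0.0, 2.0], [0.0, 2.0]])
Xi = np.zeros((2, 2))
E = np.array([[-0.5, 0.5], [-0.5, 0.5]])
snr = signal_to_noise_ratio(beta, X, Xi, E)
```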
Shrinkage of C and B×A
Computational efficiency evaluation
Yeast eQTL study
We apply the proposed methods to a yeast (Saccharomyces cerevisiae) eQTL dataset of 112 yeast segregants generated from a cross of two inbred strains [26]. The dataset originally includes expression profiles of 6229 gene expression traits and genotype profiles of 2956 SNP markers. After removing SNPs with more than 10% missing values and merging consecutive SNPs with high linkage disequilibrium, we obtain 1017 SNPs with distinct genotypes [29]. In total, 4474 expression profiles are selected after removing the ones with missing values. It takes about 5 hours for Model 1 and 3 hours for Model 2 to run to completion. The regularization parameters are set by grid search in {0.1, 1, 10, 50, 100, 500, 1000, 2000}. Specifically, grid search trains the model with each combination of the three regularization parameters in the grid and evaluates its performance (measured by the out-of-sample value of the loss function) using two-fold cross-validation. Finally, the grid search outputs the setting that achieves the smallest loss in the validation procedure.
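The grid search described above can be sketched as follows. The grid values and the two-fold split come from the text; the callables `fit` and `loss`, the random fold assignment and the function name `grid_search` are hypothetical placeholders, so this is an outline of the procedure rather than the authors' implementation.

```python
import itertools
import numpy as np

GRID = [0.1, 1, 10, 50, 100, 500, 1000, 2000]

def grid_search(fit, loss, X, Z, grid=GRID, seed=0):
    """Return the (lam, gamma, alpha) triple with the smallest two-fold CV loss.

    `fit(X, Z, lam, gamma, alpha)` and `loss(model, X, Z)` are hypothetical
    callables supplied by the caller; samples are the columns of X and Z.
    """
    D = X.shape[1]
    perm = np.random.default_rng(seed).permutation(D)
    folds = [perm[: D // 2], perm[D // 2:]]
    best, best_loss = None, np.inf
    for params in itertools.product(grid, repeat=3):   # 8^3 = 512 settings
        total = 0.0
        for k in range(2):
            test, train = folds[k], folds[1 - k]
            model = fit(X[:, train], Z[:, train], *params)
            total += loss(model, X[:, test], Z[:, test])   # out-of-sample loss
        if total < best_loss:
            best, best_loss = params, total
    return best, best_loss
```

With eight candidate values for each of the three parameters, the search trains 512 × 2 models, which is one reason the per-iteration cost of the optimizer matters.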
cis- and trans-enrichment analysis
In total, the proposed two methods detect about 6000 associations with non-zero weight values (B×A for Model 1 and C+B×A for Model 2). We estimate their FDR values by following the method proposed in [27]. With FDR ≤ 0.01, both models obtain about 4500 associations. The visualization of significant associations detected by different methods is provided in Additional file 1.
We apply cis- and trans-enrichment analysis on the discovered associations. In particular, we follow the standard cis-enrichment analysis [30,31] to compare the performance of two competing models. The intuition behind cis-enrichment analysis is that more cis-acting SNPs are expected than trans-acting SNPs. A two-step procedure is used in the cis-enrichment analysis [30]: (1) for each model, we apply a one-tailed Mann-Whitney test on each SNP to test the null hypothesis that the model ranks its cis hypotheses (we use <500 bp for yeast) no better than its trans hypotheses, (2) for each pair of models compared, we perform a two-tailed paired Wilcoxon sign-rank test on the p-values obtained from the previous step. The null hypothesis is that the median difference of the p-values in the Mann-Whitney test for each SNP is zero. The trans-enrichment is implemented using a similar strategy as in [32], in which genes regulated by transcription factors^{c} are used as trans-acting signals.
Pairwise comparison of different models using cis-enrichment and trans-enrichment in yeast
cis-enrichment | FaST-LMM | C of Model 2 | MTLasso2G | B×A of Model 1 | LORS | Lasso |
---|---|---|---|---|---|---|
C+B×A of Model 2 | 0.4351 | <0.0001 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
FaST-LMM | - | 0.2351 | <0.0001 | <0.0001 | <0.0001 | <0.0001 |
C of Model 2 | - | - | 0.0221 | <0.0001 | <0.0001 | <0.0001 |
MTLasso2G | - | - | - | <0.0001 | <0.0001 | <0.0001 |
B×A of Model 1 | - | - | - | - | <0.0001 | <0.0001 |
LORS | - | - | - | - | - | 0.0052 |

trans-enrichment | B×A of Model 2 | FaST-LMM | MTLasso2G | LORS | B×A of Model 1 | Lasso |
---|---|---|---|---|---|---|
C+B×A of Model 2 | 0.4245 | 0.3123 | 0.0034 | 0.0029 | 0.0027 | 0.0023 |
B×A of Model 2 | - | 0.3213 | 0.0132 | 0.0031 | 0.0028 | 0.0026 |
FaST-LMM | - | - | 0.0148 | 0.0033 | 0.0031 | 0.0029 |
MTLasso2G | - | - | - | 0.0038 | 0.0037 | 0.0032 |
LORS | - | - | - | - | 0.0974 | 0.0151 |
B×A of Model 1 | - | - | - | - | - | 0.0564 |
Reproducibility of trans regulatory hotspots between studies
We also evaluate the consistency of calling eQTL hotspots between two independent glucose yeast datasets [33]. The glucose environment from Smith et al. [33] shares a common set of segregants. It includes 5493 probes measured in 109 segregants. Since our algorithm aims at finding group-wise associations, we focus on the consistency of regulatory hotspots.
We examine the reproducibility of trans regulatory hotspots based on the following criteria [18,19,27]. For each SNP, we count the number of associated genes from the detected SNP-gene associations and use this number as the regulatory degree of the SNP. For Model 2, LORS, and Lasso, all SNP-gene pairs with non-zero association weights are defined as associations. Note that Model 2 uses B×A+C as the overall associations. For FaST-LMM, SNP-gene pairs with a q-value < 0.001 are defined as associations. We also tried different cutoffs for FaST-LMM (from 0.01 to 0.001), and the results are similar. SNPs with large regulatory degrees are often referred to as hotspots. We sort SNPs by the extent of trans regulation (regulatory degree) in descending order. We denote the sorted SNP lists as S _{1} and S _{2} for the two yeast datasets. Let \({S_{1}^{T}}\) and \({S_{2}^{T}}\) be the top T SNPs in the sorted SNP lists. The trans calling consistency of detected hotspots is defined as \(\frac {|{S_{1}^{T}}\bigcap {S_{2}^{T}}|}{T}\).
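The consistency measure defined above is straightforward to compute from two association matrices. The sketch below assumes each matrix is arranged as genes × SNPs (N×K) over a shared set of SNPs, breaks ties in regulatory degree by SNP index, and does not separate cis from trans associations; these choices and the function names are assumptions made for illustration.

```python
import numpy as np

def regulatory_degrees(weights):
    """Number of genes with a non-zero association weight for each SNP.

    `weights` is an N x K (genes x SNPs) association matrix.
    """
    return np.count_nonzero(weights, axis=0)

def hotspot_consistency(weights1, weights2, T):
    """|S1^T ∩ S2^T| / T for the top-T SNPs ranked by regulatory degree."""
    top1 = np.argsort(-regulatory_degrees(weights1), kind="stable")[:T]
    top2 = np.argsort(-regulatory_degrees(weights2), kind="stable")[:T]
    return len(set(top1.tolist()) & set(top2.tolist())) / T

# Toy example with 3 genes and 4 SNPs.
W1 = np.array([[1, 0, 1, 1],
               [1, 0, 1, 0],
               [1, 0, 0, 0]])      # regulatory degrees: 3, 0, 2, 1
W2 = np.array([[0, 1, 1, 1],
               [0, 1, 1, 0],
               [0, 1, 0, 0]])      # regulatory degrees: 0, 3, 2, 1
consistency = hotspot_consistency(W1, W2, T=2)   # top sets {0, 2} and {1, 2}
```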
Gene ontology enrichment analysis
As discussed in Methods, hidden variables y in the middle layer may model the joint effect of SNPs that influence a group of genes. To better understand the learned model, we look for correlations between a set of genes associated with a hidden variable and GO categories (Biological Process Ontology) [34]. In particular, for each gene set G, we identify the GO category whose set of genes is most correlated with G. We measure the correlation by a p-value determined by Fisher’s exact test. Since multiple gene sets G need to be examined, the raw p-values need to be calibrated because of the multiple testing problem [35]. To compute the calibrated p-values for each gene set G, we perform a randomization test, wherein we apply the same test to randomly created gene sets that have the same number of genes as G. Specifically, the enrichment test is performed using DAVID [29]. Gene sets with calibrated p-values less than 0.01 are considered significantly enriched.
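The raw enrichment p-value for one gene set and one GO category can be illustrated with a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. This is a minimal sketch of that test only: the function name and arguments are hypothetical, the scoring that DAVID applies internally may differ, and the randomization-based calibration is not shown.

```python
from math import comb

def enrichment_p_value(n_total, n_category, n_set, n_overlap):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing at least `n_overlap` category genes in a set of
    `n_set` genes drawn without replacement from `n_total` genes, of which
    `n_category` belong to the GO category (hypergeometric upper tail).
    """
    upper = min(n_category, n_set)
    denom = comb(n_total, n_set)
    tail = sum(
        comb(n_category, k) * comb(n_total - n_category, n_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom

# Toy example: 10 genes in total, 4 in the category, a gene set of size 3
# that contains 2 category genes: (6*6 + 4*1) / 120 = 1/3.
p = enrichment_p_value(n_total=10, n_category=4, n_set=3, n_overlap=2)
```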
Summary of all detected groups of genes from Model 2 on yeast data
^{ a } Group ID | ^{ b } SNPs set size | ^{ c } gene set size | ^{ d } GO category |
---|---|---|---|
1 | 63 | 294 | oxidation-reduction process ^{∗} |
2 | 78 | 153 | thiamine biosynthetic process ^{∗} |
3 | 94 | 871 | rRNA processing ^{∗∗∗} |
4 | 64 | 204 | nucleosome assembly ^{∗∗} |
5 | 70 | 288 | ATP synthesis coupled proton transport ^{∗∗∗} |
6 | 43 | 151 | branched chain family amino acid biosynthetic... ^{∗∗} |
7 | 76 | 479 | mitochondrial translation ^{∗∗∗} |
8 | 47 | 349 | transmembrane transport ^{∗∗} |
9 | 64 | 253 | cytoplasmic translation ^{∗∗∗} |
10 | 72 | 415 | response to stress ^{∗∗} |
11 | 64 | 225 | mitochondrial translation ^{∗} |
12 | 62 | 301 | oxidation-reduction process ^{∗∗} |
13 | 83 | 661 | oxidation-reduction process ^{∗} |
14 | 69 | 326 | cytoplasmic translation ^{∗} |
15 | 71 | 216 | oxidation-reduction process ^{∗} |
16 | 66 | 364 | methionine metabolic process ^{∗} |
17 | 74 | 243 | cellular amino acid biosynthetic process ^{∗∗∗} |
18 | 63 | 224 | transmembrane transport ^{∗∗} |
19 | 23 | 50 | ’de novo’ pyrimidine base biosynthetic process ^{∗} |
20 | 66 | 205 | cellular amino acid biosynthetic process ^{∗∗∗} |
21 | 81 | 372 | oxidation-reduction process ^{∗∗} |
22 | 33 | 126 | oxidation-reduction process ^{∗∗∗} |
23 | 81 | 288 | pheromone-dependent signal transduction... ^{∗∗} |
24 | 53 | 190 | pheromone-dependent signal transduction... ^{∗∗} |
25 | 91 | 572 | oxidation-reduction process ^{∗∗∗} |
26 | 66 | 46 | cellular cell wall organization ^{∗} |
27 | 111 | 1091 | translation ^{∗∗∗} |
28 | 89 | 362 | cellular amino acid biosynthetic process ^{∗∗} |
29 | 62 | 217 | transmembrane transport ^{∗∗} |
30 | 71 | 151 | cellular aldehyde metabolic process ^{∗∗} |
Summary of the top 15 detected hotspots by LORS
chr | start | end | size | GO category | adjusted p-value |
---|---|---|---|---|---|
XII | 659357 | 662627 | 36 | sterol biosynthetic process | 7.18E-05 |
XII | 1056097 | 1056097 | 31 | telomere maintenance via recombination | 4.72E-08 |
XV | 154177 | 154309 | 29 | amino acid catabolic process to alcohol via Ehrlich pathway | 0.052947053 |
III | 201166 | 201167 | 23 | regulation of mating-type specific transcription, DNA-dependent | 0.001998002 |
XV | 143597 | 150651 | 23 | response to stress | 0.672327672 |
III | 81832 | 92391 | 22 | pheromone-dependent signal transduction involved in conjugation with cellular fusion | 1.76E-03 |
VIII | 111682 | 111690 | 22 | cell adhesion | 0.002947528 |
IX | 139462 | 139512 | 21 | cellular response to nitrogen starvation | 0.00106592 |
XV | 170945 | 180961 | 20 | cell adhesion | 0.053946054 |
III | 105042 | 105042 | 19 | branched chain family amino acid biosynthetic process | 5.51357E-08 |
XIII | 46070 | 46084 | 19 | cell adhesion | 0.050949051 |
XV | 563943 | 563943 | 19 | transport | 0.003996004 |
I | 41483 | 42639 | 18 | cellular response to nitrogen starvation | 0.016983017 |
III | 175799 | 177850 | 18 | pheromone-dependent signal transduction involved in conjugation with cellular fusion | 7.47E-03 |
I | 36900 | 37068 | 17 | signal transduction | 0.547452547 |
Conclusion
A crucial challenge in eQTL study is to understand how multiple SNPs interact with each other to jointly affect the expression level of genes. In this paper, we propose a sparse graphical model to identify novel group-wise eQTL associations. The proposed model can also take into account potential confounding factors and individual associations. ℓ _{1}-regularization is applied to learn the sparse structure of the graphical model. We also introduce computational techniques to make this approach suitable for large scale studies. Extensive experimental evaluations using both simulated and real datasets demonstrate that the proposed methods can effectively capture both individual and group-wise signals and significantly outperform the state-of-the-art eQTL mapping methods.
Endnotes
^{a} The software is implemented in both C++ and MATLAB, and is publicly available at http://www.cs.unc.edu/~weicheng/Group-Wise-EQTL.zip.
^{b} For example, 0, 1, 2 may encode the homozygous major allele, heterozygous allele, and homozygous minor allele, respectively.
^{c} http://www.yeastract.com/download.php.
Declarations
Acknowledgements
This work is supported by National Institutes of Health (grants R01HG006703 and P50 GM076468-08); NSF IIS-1313606; NSF IIS-1162374 and IIS-1218036.
References
- Bochner BR. New technologies to assess genotype-phenotype relationships. Nat Rev Genet. 2003; 4:309–314.
- Michaelson J, Loguercio S, Beyer A. Detection and interpretation of expression quantitative trait loci (eQTL). Methods. 2009; 48(3):265–276.
- Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996; 58(1):267–288.
- Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT. Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005; 437:1365–1369.
- Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB. Detection of gene x gene interactions in genome-wide association studies of human population data. Human Heredity. 2007; 63:67–84.
- Pujana MA, Han J-DJ, Starita LM, Stevens KN, Tewari M, et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007; 39:1338–1349.
- Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011; 470(7333):187–197.
- Holden M, Deng S, Wojnowski L, Kulle B. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008; 24(23):2784–2785.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89(1):82–93.
- Braun R, Buetow K. Pathways of distinction analysis: a new technique for multi-SNP analysis of GWAS data. PLoS Genet. 2011; 7(6):e1002101.
- Listgarten J, Lippert C, Kang EY, Xiang J, Kadie CM, Heckerman D. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 2013; 29(12):1526–1533.
- Huang Y, Wuchty S, Ferdig MT, Przytycka TM. Graph theoretical approach to study eQTL: a case study of Plasmodium falciparum. ISMB. 2009; 25:15–20.
- Cheng W, Zhang X, Wu Y, Yin X, Li J, Heckerman D, Wang W. Inferring novel associations between SNP sets and gene sets in eQTL study using sparse graphical model. ACM-BCB. 2012; 29:466–473.
- Chen X, Shi X, Xu X, Wang Z, Mills R, Lee C, Xu J. A two-graph guided multi-task lasso approach for eQTL mapping. In: Lawrence ND, Girolami MA, editors. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS) ’12. vol. 22: 2012. p. 208–217.
- Cheng W, Zhang X, Guo Z, Shi Y, Wang W. Graph regularized dual lasso for robust eQTL mapping. Bioinformatics. 2014; 30:i139–i148.
- Gao C, Brown CD, Engelhardt BE. A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects. ArXiv e-prints. 2013.
- Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3(9):1724–1735.
- Joo JW, Sul JH, Han B, Ye C, Eskin E. Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies. Genome Biol. 2014; 15(4):61.
- Fusi N, Stegle O, Lawrence ND. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput Biol. 2012; 8(1):e1002330.
- Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc. 2008; 103:1438–1456.
- Lee S-I, Dudley AM, Drubin D, Silver PA, Krogan NJ, Pe’er D, Koller D. Learning a prior on regulatory potential from eQTL data. PLoS Genet. 2009; 5:e1000358.
- Ng A. Feature selection, l1 vs. l2 regularization, and rotational invariance. In: Proceedings of the International Conference on Machine Learning (ICML): 2004.
- Andrew G, Gao J. Scalable training of l1-regularized log-linear models. In: Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML): 2007.
- Nocedal J, Wright SJ. Numerical optimization. New York: Springer-Verlag; 1999.
- Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008; 178(3):1709–1723.
- Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature. 2005; 436:701–703.
- Yang C, Wang L, Zhang S, Zhao H. Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping. Bioinformatics. 2013; 29:1026–1034.
- Lee S, Xing EP. Leveraging input and output structures for joint mapping of epistatic and marginal eQTLs. Bioinformatics. 2012; 28(12):137–146.
- Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1):44–57.
- Listgarten J, Kadie C, Schadt EE, Heckerman D. Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci USA. 2010; 107(38):16465–16470.
- McClurg P, Janes J, Wu C, Delano DL, Walker JR, Batalov S, Takahashi JS, Shimomura K, Kohsaka A, Bass J, Wiltshire T, Su AI. Genomewide association analysis in diverse inbred mice: power and population structure. Genetics. 2007; 176(1):675–683.
- Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet. 2003; 35(1):57–64.
- Smith EN, Kruglyak L. Gene-environment interaction in yeast gene expression. PLoS Biol. 2008; 6:e83.
- The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–29.
- Westfall PH, Young SS. Resampling-based multiple testing; 1993.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.