Meta-eQTL: a tool set for flexible eQTL meta-analysis

Background Increasing number of eQTL (Expression Quantitative Trait Loci) datasets facilitate genetics and systems biology research. Meta-analysis tools are in need to jointly analyze datasets of same or similar issue types to improve statistical power especially in trans-eQTL mapping. Meta-analysis framework is also necessary for ChrX eQTL discovery. Results We developed a novel tool, meta-eqtl, for fast eQTL meta-analysis of arbitrary sample size and arbitrary number of datasets. Further, this tool accommodates versatile modeling, eg. non-parametric model and mixed effect models. In addition, meta-eqtl readily handles calculation of chrX eQTLs. Conclusions We demonstrated and validated meta-eqtl as fast and comprehensive tool to meta-analyze multiple datasets and ChrX eQTL discovery. Meta-eqtl is a set of command line utilities written in R, with some computationally intensive parts written in C. The software runs on Linux platforms and is designed to intelligently adapt to high performance computing (HPC) cluster. We applied the novel tool to liver and adipose tissue data, and revealed eSNPs underlying diabetes GWAS loci.


Background
Expression quantitative trait loci (eQTLs) are genomic loci that regulate expression levels of mRNAs, and eQTLs play important roles in genetics and systems biology studies. To date, multiple eQTL datasets (where both transcriptome and DNA genotype are profiled on the same individuals) exist for a given tissue type, e.g. liver and lung [1,2]. It is necessary to jointly analyze these sets to further improve statistical power especially for trans-eQTL discovery. Even for the same tissue type, the eQTL datasets (transcriptome and genotype data) could be heterogeneous due to platform and lab differences, and meta-analysis (but not pooled analysis) would be the method of choice. Meta-analysis is also desirable the analysis of chromosome X eQTLs in dataset consisting of both males and females. The interpretation of genotype effects on gene expression varies between genders. For example, an allele count of 1 in a female indicates a heterozygote genotype (one reference and one alternative allele), while a count of 1 in a male means only alternative allele exists and may cause more profound effects. The variance of the genetic effect may also differ between genders. In such scenario, directly pooling males and females in chromosome X eQTL discovery is invalid, while meta-eQTL tackles this issue elegantly by deriving eQTLs per gender and then combining the test statistics.
The typical strategy of meta-analysis has two steps: (1) calculate and record raw test statistics (e.g. β and pvalue) of every transcript-SNP pair per individual dataset, and (2) combine the statistics using meta-analysis approach. However, this strategy is not practical in eQTL setting, where each dataset requires evaluation of >10 11 tests. Storing the raw statistics of every test is prohibitive due to massive disk and I/O demand. The common practice is only recording the top hits (e.g. pvalue < 1e-4) per dataset and meta-analysis. This strategy will miss the eQTLs that have consistent small-to-moderate effect in multiple datasets [3]. Herein, we propose the solution of parallel and synchronized eQTL computation of multiple datasets, and conducting meta-analysis on the fly. By these means, the above steps (1) and (2) are performed in memory, and only the meta-analysis results which pass a user-defined significance level are outputted to disk. Moreover, meta-eqtl offers versatile features: implementation of peak finding algorithm, various statistical models (eg. non-parametric and mixed effect model), consistent handling of missing data, easy deployment on high performance computing (HPC) clusters, etc.

Implementation
Meta-eqtl is a set of command line utilities written in R, with some computationally intensive parts written in C. Optimized linear algebra code (which is included in the R package) is used to fit linear models in absence of missing values. When missing values are present, in either the gene expression or SNP data, C code is called to compute the pairwise minimal sufficient statistics. The data format is based on plain text, tab-delimited files, which make the data easy to inspect and manipulate with standard UNIX utilities. Within meta-eqtl, several modules are dedicated to specific functionality, and can be called individually by user.
Linear regressions meta-analysis is implemented in the R script lm-meta. The computation occurs in multiple threads, where the number of threads corresponds to the number of datasets in the meta-analysis. The multiple threads proceed concordantly, therefore, the same set of gene expression-SNP tests are evaluated in each individual dataset at the same moment. When statistics are obtained from multiple threads, the meta-analytical test statistic is computed as: where the weights are assigned either based on sample size or the standard error of β in each dataset. The software output comprehensive statistics of the fitted model, including effective sample size, regression coefficients, standard error of regression coefficients, transcriptome variance explained (ie, r 2 ), T statistics (T) and pvalues (p). The T and p were presented for both meta-analysis and each individual cohort. A separate utility, lm-fdr, compares the output from observed and permuted data and quantify FDR. In brief, the meta-analysis results enter the downstream peak finding and empirical FDR calibration by permuting the sample IDs in the gene expression files. To our experience, this empirically estimated FDR is more robust than Benjamini-Hochberg procedure, such as used in MatrixEQTL [4], which is heavily biased when gene expression follows a non-Gaussian distribution. The tool set also contains the eqtl-sex-peaks utility, specially designed for meta-analysis of regression results by gender. kruskal is provided as a non-parametric Kruskal-Wallis test for eQTL detection. Since eQTL computation involves big data sets, gene expression and SNP data are accessed sequentially and concordantly by each thread, and results are reported on the fly, as they are computed. This allows for the analysis of files of arbitrary sample size and arbitrary number of datasets with constant memory usage. Also, this framework enables a natural deployment on HPC and Hadoop clusters as it can trivially distribute the analysis into multiple computing nodes.

Results and discussions
To our knowledge, meta-eqtl is the first software to perform meta-analysis on arbitrary number of eQTL datasets. We thus compared our results with those obtained with METAL [5,6], a tool which performs meta-analysis on pre-stored test statistics. On a data of four individual sets (sample size of 1000, 1000, 500 and 500, respectively), we tested 10,000 SNPs, and the two software gave the identical results to the available numerical precision.
We also benchmarked the performance on a large data of three cohorts (N = 450, 400 and 350) with 44,000 transcripts profiled and 1000 genome imputed genotype (~8 million SNPs). Meta-eqtl distributed the computation on a cluster of 800 computing cores (each core allocated 824 Mb to 1013 Mb of memory), and was able to complete within three days. The top eQTLs (10% FDR) statistics were identical to those computed by the R package "meta". On a single cohort, we further conducted head-to-head comparison to the MatrixEQTL software [4], which to our knowledge is the fastest software available to date for large scale eQTL analysis, and found that meta-eqtl was about 2-3 times slower, reflecting the expense of missing data handling and the more flexible pipeline. Nevertheless, meta-eqtl is still one of the fastest eQTL tools available. We also leverage another large-scale published eQTL study data [7], where custom 44 K RNA microarray were run on 651 liver, 848 adipose fat and 701 subcutaneous fat samples of 1,008 patients. 950 samples from the same patients were successfully genotyped on the Illumina 650Y BeadChip array, and further imputed on the 1000 Genome reference for 14 million SNPs using the MACH [8] pipeline. Applying meta-eQTL, we derived ChrX eQTL for each tissue ( Table 1). The meta-analysis of males and females provides increased power in detecting genetic regulation of gene expression, while still correctly keeping separate the analysis of the two sets. In Figure 1, we illustrate e.g. how the X chromosome gene DUSP9 shows some evidence of cis-regulation in the liver of both females (top panel) and males (middle panel), with the meta-analytical results pointing to a sharper and more conclusive signal (bottom panel). Further, we employed the ChrX eQTLs to inform type 1 and type 2 diabetes (T1D and T2D) GWAS SNPs (where liver and adipose are disease relevant issues) documented in the NHGRI catalog [9]. Three chrX SNPs associated with T1D or T2D were also eQTLs in at least one of these three tissues (Table 2). Genes close to these SNPs were proposed as underlying the disease etiology in the original GWAS reports, herein, we identify additional plausible candidates (Table 3). For examples, rs2664170 is associated with T1D and has profound influence on gene expression levels of IKBKG in all tissues. IKBKG (inhibitor of nuclear factor kappa-B kinase subunit gamma) is the regulatory subunit of the inhibitor of IκB kinase (IKK) complex, which activates NF-κB resulting in activation of genes involved in inflammation, immunity, cell survival, and other pathways. Given the inflammatory basis of T1D, IKBKG is a highly relevant genetic risk factor. The direction of eQTL is consistent among the three tissues; that is the disease risk allele (rs2664170-G) is associated with lower level of IKBKG, leading to higher IκB kinase activity and elevated inflammation and in turn increase T1D risks.

Conclusions
In summary, we describe a novel package, meta-eqtl. To our knowledge, it is the only tool to allow fast metaanalysis of eQTLs for today's large genotype and gene expression data with reasonable memory requirement and fast speed. It can also be used as a flexible and fast tool for eQTL discover on a single dataset, where it features flexible model specification (e.g. non-parametric and mixed effect models), missing data handling and implements significance peaks extraction. Meta-eqtl features computation speed comparable to the fastest alternative available to date, and is further well suited to distribute parallel jobs onto a HPC system. Another major advantage is the ability to handle chrX eQTLs. In recent year, increasing number of eQTL studies and dataset become available [3,7,11], joint analyses of same/similar tissue sets are of great interest. Meta-eqtl enables meta-analysis of arbitrary number of eQTL dataset and will greatly facilitate this research field.  *SNPs rs2664170, rs5945326 and rs12010175 were identified as eQTL, and these SNPs are also reported in association with diabetes by large GWA studies (summarized in Table 2).