Mapping eQTL by leveraging multiple tissues and DNA methylation

Background DNA methylation is an important tissue-specific epigenetic event that influences transcriptional regulation of gene expression. Differentially methylated CpG sites may act as mediators between genetic variation and gene expression, and this relationship can be exploited while mapping multi-tissue expression quantitative trait loci (eQTL). Current multi-tissue eQTL mapping techniques are limited to only exploiting gene expression patterns across multiple tissues either in a joint tissue or tissue-by-tissue frameworks. We present a new statistical approach that enables us to model the effect of germ-line variation on tissue-specific gene expression in the presence of effects due to DNA methylation. Results Our method efficiently models genetic and epigenetic variation to identify genomic regions of interest containing combinations of mRNA transcripts, CpG sites, and SNPs by jointly testing for genotypic effect and higher order interaction effects between genotype, methylation and tissues. We demonstrate using Monte Carlo simulations that our approach, in the presence of both genetic and DNA methylation effects, gives an improved performance (in terms of statistical power) to detect eQTLs over the current eQTL mapping approaches. When applied to an array-based dataset from 150 neuropathologically normal adult human brains, our method identifies eQTLs that were undetected using standard tissue-by-tissue or joint tissue eQTL mapping techniques. As an example, our method identifies eQTLs by leveraging methylated CpG sites in a LIM homeobox member gene (LHX9), which may have a role in the neural development. Conclusions Our score test-based approach does not need parameter estimation under the alternative hypothesis. As a result, our model parameters are estimated only once for each mRNA - CpG pair. Our model specifically studies the effects of non-coding regions of DNA (in this case, CpG sites) on mapping eQTLs. However, we can easily model micro-RNAs instead of CpG sites to study the effects of post-transcriptional events in mapping eQTL. Our model’s flexible framework also allows us to investigate other genomic events such as alternative gene splicing by extending our model to include gene isoform-specific data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1856-9) contains supplementary material, which is available to authorized users.


Background
It has been long established that regulatory regions in higher eukaryotes activate gene transcription in a tissue-specific manner [1,2]. These regulatory regions, which affect the binding affinities of transcription factors, are susceptible to both genetic variation and epigenetic (mQTLs) [9][10][11]. Since an increased DNA methylation at any of the distinct CpG sites located in the promoter regions necessitate chromatin remodeling and subsequent decrease in gene expression, any DNA sequence variation within the CpG-rich regions that disrupts the methylation process may have an opposite effect on gene expression.
Even though, mechanisms which regulate DNA methylation are unclear, it is clear that there is some association between genetic variation and quantitative changes in methylation levels [12]. For example, Catechol-Omethyltransferase (COMT) gene, which is implicated in schizophrenia has a SNP, Val 158 Met (rs4680) that is associated with differential COMT expression across regions of the brain during the course of the illness [13]. More specifically, the substitution of a methionine (Met) for a valine (Val) at position 158 results in reduced activity of the COMT enzyme due to reduced protein stability. Methylation of CpG islands associated with the aforementioned variant affect the region-specific expression of COMT [13]. Identifying and studying the mechanisms through which genetic variation, DNA methylation and gene expression interact may provide us yet another clue to understanding regions within the genome that are associated with complex disease phenotypes (Fig. 1a).
We have previously proposed a score test-based approach to map multi-tissue eQTLs where we model tissue-specificity as a random effect and investigated an overall shift in the gene expression combined with tissue-specific effects due to genetic variants [14]. Current approaches to delineate the role played by both genetic and epigenetic variation in gene expression are limited to identifying statistically significant pairs of mRNA -SNPs and CpG -SNPs by performing independent eQTL and mQTL analyses, respectively, within a tissue-bytissue (TBT) framework [4,11,15]. These pairs are then expanded to combinations of mRNA transcript, CpG site and a SNP wherever the SNP was significantly correlated with either mRNA or CpG site of the mRNA -CpG pair. First, any such TBT analyses have been shown to fall short in fully exploiting patterns across the tissues thus impacting eQTL or mQTL discovery [14,16,17]. Second, independent eQTL and mQTL analyses do not reveal any underlying effects of genetic variation on tissue-specific gene expression due to DNA methylation. Consequently, we propose to map eQTLs by leveraging DNA methylation and testing for any higher order interactions among methylation, genotype and tissues. We extend this framework to include methylation-specific effects and model GENE   the combined effect of genetic and epigenetic variation on gene expression (Fig. 1b). Our main objective is to improve eQTL discovery by accounting for epigenetic effects such as DNA methylation. We show using Monte Carlo simulations that our joint score test is more powerful in mapping eQTLs by controlling for methylation than any TBT approach that uses methylation as a covariate (TBTm-eQTL). We also show that the new joint score test is better at identifying eQTLs in the presence of DNA methylation than our previously proposed multi-tissue eQTL and TBT methods. Finally, we show that in cases where the interaction effects of DNA methylation are absent, our approach is slightly less powerful but remains competitive. We demonstrate the effectiveness of our method by applying it to a publicly available expression, methylation and SNP array datasets from normal adult human brains [4] and show that by jointly analyzing multiple brain regions (or tissues), we identify eQTLs that may otherwise be not identified by multi-tissue eQTL methods.

Evaluating our new score test using Monte Carlo simulations
We evaluate our approach through extensive simulation studies. Briefly, each Monte Carlo simulated dataset was comprised of data from a single locus and a single gene, whose expression is measured across 5 tissues in 500 observations. For a given mRNA -SNP pair, the genotypes at each SNP in all the individuals were simulated as Binomial(2,0.3), i.e. a minor allele frequency 0.30 and assuming Hardy-Weinberg equilibrium. Methylation data for all tissues was generated from a multivariate normal distribution with a positive definite variance-covariance matrix. Since all the tissue-specific effects are modeled as random effects, a test of whether there are any tissuespecific effects is equivalent to testing whether the variances of the random effects (γ and δ) are zero. Thus, our model involves testing four scalar parameters (β, φ, γ and δ). Simulations under the null hypothesis confirm that our method has the correct type 1 error (see Additional file 1). Since we model the effects of both epigenetic and genetic variation, we evaluated any power loss in identifying mRNA -SNP associations in the absence of any epigenetic effect. This was accomplished by comparing our method's performance with TBT-eQTL approach by keeping all the parameters associated with methylation in Eq. 5 at zero (i.e. λ = φ = δ = θ = 0). We also compared our method with a previously proposed multi-tissue eQTL method, implemented in our software JAGUAR [18], which is made available at Comprehensive R Archive Network (CRAN) repository. Briefly, JAGUAR implements an approach that jointly models the overall shift in the gene expression due to genotype together with tissue-specific interaction with genotype in order to efficiently identify multi-tissue eQTL. From Fig. 2a, we see that JAGUAR outperforms both TBT-eQTL and our new joint score test. This loss of power, though not substantial, may be attributed to testing for an inexistent methylation effect. However, in the presence of a methylation effect our method outperforms both TBT-eQTL and JAGUAR as evidenced by Fig. 2b. When the number of tissues is increased from 5 to 10, the same pattern in statistical power was observed (see Additional file 1 section for figures).
We also compared our joint score test to a TBT-eQTL approach that included methylation as a baseline covariate [15], henceforth referred to as TBTm-eQTL analysis, using the following linear regression model - where Y is a nt-dimensional matrix of expression levels in t tissues and n individuals, α is a fixed effect representing the tissue-specific intercepts, G is a nt-dimensional matrix of genotypes, β is a fixed effect of genotype across all tissues, M is an nt-dimensional matrix of methylation information and φ is genotype × methylation interaction effect (fixed effect). Minimum p value from the TBTm-eQTL analysis across all the tissues is computed for power calculations. Table 1 shows that our method significantly outperforms TBTm-eQTL approach showing a clear statistical advantage in using our joint score test over the TBTm-eQTL approach. See Additional file 1 methods for more information on the description of various null hypotheses being tested.

Region-specific DNA methylation impacts eQTL mapping in adult human brains
In order to demonstrate the effectiveness of our method, we applied it to Gibbs et al. [4] dataset comprising of 150 individual data obtained from four regions of human brain. We performed data analyses that focused on only cis candidate regions i.e., the proximity of an eQTL to the transcription start site of a gene did not exceed 100 kilobase up-and down-stream of the transcription start site of a gene/mRNA transcript (cis-SNP). CpG islands that were less than 1.5 kilobase up-and down-stream of the transcription start site of the same mRNA transcript were paired with the mRNA transcripts. Each mRNA transcript was tested for an association with every cis-SNP in the presence of a (methylated or unmethylated) CpG site located in the promoter region.
Our joint score test method performed a total of 471,272 tests (totaling 11,076 mRNA transcripts, 14,244 CpG sites and 144,393 cis-SNPs). Each such mRNA -CpG pair is tested for an association with a cis-SNP. It is important to note that our method does not test any direct association between an mRNA transcript and its corresponding In the absence of methylation data, statistical power from the joint analysis of genotype and tissue-specific interaction using JAGUAR is marginally better than our joint score test. A tissue-by-tissue (TBT-eQTL) method is also used for comparison. The x-axis denotes the proportion of variance explained by the G × T effect and the y-axis denotes the statistical power. These data were generated from 1,000 simulations run on 500 individuals and five tissues with genotypes generated at a common variant allele frequency (MAF = 0.3). b In the presence of DNA methylation effect, our method out performs JAGUAR and tissue-by-tissue analyses. The top two rows in the figure indicate PVE G×T and PVE G×M×T , respectively, on the x-axis. Statistical power is denoted on the y-axis. These data were generated from 1,000 simulations run on 500 individuals and five tissues with genotypes generated at a common variant allele frequency (MAF = 0.3) CpG site. Any resulting combinations of mRNA transcript, CpG site and a SNP would describe the relationship between the mRNA and SNP in the presence of the corresponding promoter CpG site, i.e. identify an eQTL. Our method identified a total of 5967 eQTLs that are statistically significant at 5% false discovery rate (FDR). In order to account for the number of traits being tested, the p values obtained from applying our joint score test were adjusted for multiple testing using an optimized FDR approach to obtain per-SNP q values (FDR adjusted p values) [19]. We observed that majority of these significant results are driven by a combination of additive genetic effect (93%) and G × T effect (81%) while the G × M and G × M ×T effects were barely observed. This may be due to a lack of any distinct tissue-specificity in the methylation data, which we observed while preprocessing Gibbs et al. data (see "Methods" section). However, we expect that the aforementioned effects may be well pronounced across diverse tissue types such as the ones made available by the Genotype-Tissue Expression (GTEx) initiative [20]. We performed two region-by-region or TBT approaches on the same set of mRNA transcripts, CpG sites and SNPs as above, one with DNA methylation as a covariate (TBTm-eQTL) and the other with no methylation (TBT-eQTL) and compared the results with our approach. We estimated q values from each set of p values (originated from each region-by-region analysis) and minimum q value for a given mRNA -SNP pair across all the brain regions was computed, which indicates the presence of a statistically significant pair in at least one brain region. The number of significant associations in at least one brain region were then assessed at 5% FDR (q value ≤ 0.05 4 where 4 is the number of brain regions). TBT-eQTL approach identified a total of 5009 mRNA-cis-SNP pairs or cis-eQTLs significant in at least one region of the brain at 5% FDR. Roughly 79% of these TBT-eQTLs overlap with eQTLs identified using our method. On the other hand, TBTm-eQTL approach identified 5625 eQTLs with a 73% overlap with eQTLs identified using our method.
In order to assess the role of brain region-specificity on gene expression and the advantages in jointly modeling all the brain regions on mapping eQTLs, we compared our joint score test approach with a previously proposed multi-tissue eQTL mapping method [14] implemented by our software JAGUAR. JAGUAR identifies 7934 eQTLs (96% of them overlap with the TBT-eQTLs, 80% of them overlap with TBTm-eQTLs, and 94% of them overlap with the joint tests's eQTLs) at 5% FDR. All the eQTLs that overlap between JAGUAR and our new joint score test are mostly driven by the additive genetic effect and G × T effect and not higher order methylation interaction effects such as G × M and G × M × T. This absence of any pronounced region-specific DNA methylation effect explains the lower number of eQTLs identified by our joint test method. However, as we have shown using simulation data, in the presence of any region-specific interaction effects involving methylation, our joint score test is far more informative than the results from JAGUAR. Few of the eQTLs identified by our method were not detected by JAGUAR. This could be because a majority of these eQTLs were driven by G × M × T interaction effect, which is not tested by JAGUAR. For example, let us consider a splice variant of LIM Homeobox protein coding gene (LHX9; Ensemble ID -ENSG00000143355), located on chromosome 1, which has 2 annotated cis-SNPs (SNP IDs: rs10922303 and rs2047541) possibly in LD with each other) and two promoter CpG sites (CpG IDs: cg07214572 and cg08008403) in our preprocessed datasets. Out of these 4 (number of mRNA -CpG pairs × the number of SNPs) combinations of mRNA transcript, CpG sites and SNPs and a possible 2 eQTLs, our method identified all of them to be statistically significant. None of them were found to be statistically significant by any TBT-based or the multi-tissue eQTL approaches (Fig. 3a). This is a good example of mapping eQTLs by leveraging effects due to DNA methylation since there is tissuespecific interaction effect clearly observed in Fig. 3b not captured by either JAGUAR or TBT methods. Of note, LHX9 is ubiquitously expressed in brain and are known to help in determining neuronal differentiation in humans [21]. On the other hand, we also see many instances of eQTLs that were observed to be statistically significant using JAGUAR but not our joint score test method due to the lack of any distinct tissue-specific DNA methylation effects. For example, JAGUAR method identified gene glutathione S-transferase mu 4 (GSTM4; Ensembl ID -ENSG00000168765), a gene that belongs to a superclass of glutathione S-transferases, which play a major role in the development of brain tumors [22], to have a statistically significant association with a promoter eQTL (SNP ID: rs524998), as illustrated by Fig. 4. However, we found that GSTM4 gene has two promoter CpG sites (CpG IDs: cg11903880 and cg15955341). Since there is no tissuespecific methylation effect, our joint test method was less powerful in detecting this eQTL . As seen in this figure, the lack of any tissue-specific methylation effects may have resulted in not being identified as a potential eQTL by our joint score test method.
To assess the biological relevance of the genes with eQTLs identified by TBT or multi-tissue methods including our new joint score test, we performed a KEGG pathway term enrichment analysis [23] for each set of results separately (see Additional file 1). KEGG pathways were considered overrepresented if a set of at least three genes from different linked regions is observed to be overrepresented with an adjusted significance level of q value <0.05, calculated from a hypergeometric test. Our method identified 5 overrepresented pathways (Metabolic pathways, Ribosome, Fatty acid degradation, Purine and Pyramidine metabolism), JAGUAR identified 2 pathways while TBT-eQTL identified 1 overrepresented pathway. The overrepresented pathway, "Metabolic Pathways" (KEGGID: hsa01100) is the only common pathway between TBT-eQTL, JAGUAR and our method. On the basis of prior knowledge of function, the overrepresented pathways "Purine metabolism" (KEGGID: hsa00230) and "Pyramidine metabolism" (KEGGID: hsa00240) are plausible functional candidate pathways for schizophrenia [24]. These information can be used to guide genetic analyses by selecting these relevant pathways and genes associated with the pathways for schizophrenia.

Conclusion
Overall, our efforts are primarily directed to understanding two very specific aspects -1) the overall effect of a genetic variant on gene expression regulation by accounting for any changes in tissue DNA methylation levels, and 2) map eQTLs by leveraging tissue-specific methylation effects. Currently, there are no methods that jointly model the epigenetic and genetic control of tissue-specific gene expression. Many eQTL studies fail to account for the masking effect on a genetic variant due to DNA methylation, which may regulate gene expression across multiple tissues. Our method provides an efficient framework to integrate SNPs, DNA methylation and gene expression, and investigate how the different forms of variation interrelate.
The dataset examined here used genome-wide association (GWA) study SNP array platform to interrogate germline variation that includes an overwhelming number of common variants. Although GWA studies have been able to explain a small fraction of the genetic components of common human diseases, it is hypothesized that some of the missing heritability may be due to rare variation. Since standard common disease common variant approaches are severely underpowered to tease out any underlying variants that are moderate to extremely rare, there is an emphasis on large sample sizes and gene-based association tests in order to securely identify genetic risk factors that may otherwise be outside the range detectable by GWA studies [25]. One solution to the aforementioned issue would be to prioritize genetic variants in a non ad-hoc framework that preferentially weights genetic variants. Our method can provide a statistically disciplined weighting framework within which genetic variants can be either up-or down-weighted for any subsequent downstream analyses. Our method may also be useful in generating weights to any methods that use a reference data set in which both genome variation and gene expression levels have been measured to develop prediction models for gene expression [26].
The absence of strong tissue-specific methylation effects has an effect on mapping eQTLs using our joint test method. In the absence of any tissue-specific methylation effect, our method is less powerful while mapping eQTLs. One potential way to overcome such situations would be to run an omnibus test that identifies strongest evidence between JAGUAR and our joint test model. Specifically, we calculate the p value under each model, and then compute the minimum of the two p values and compare the observed minimum p value to its null distribution. Deriving the analytical null distribution of the minimum p value is not trivial considering the complex correlation structure between the statistics and due to the presence of higher order interaction effects (see Additional file 1 section). This approach is purely speculative and was not tested by us.
Since we are modeling the effects of non-coding regions (via CpG sites) on gene expression using our model, we can easily use micro-RNA (miRNA) data instead of CpG site methylation data and model post-transcriptional regulation of tissue-specific gene expression. miRNA expression, also a tissue-specific phenomenon, have been known to post-transcriptionally silence expression of mRNA transcripts. The presence of genetic variants such as SNPs may have an effect on the biogenesis and function of miRNA molecules leading to a downstream effect on gene expression [27]. This tissue-specific interaction between miRNA and SNP can be modeled in a similar fashion, analogous to modeling the interaction effects of tissuespecific DNA methylation and SNPs. The flexibility of our model also enables us to incorporate new information such as gene isoform data and accommodate the analysis of next-generation sequencing data (such as RNA-seq) by modeling gene transcripts in an analogous fashion to tissues in our current model formulation. This type of analysis would aggregate expression over all the splice variants of a gene across multiple tissues and inform us of tissue-specific alternative splice variant of a gene. These results become relevant to studying genetic effects on alternative splicing and its key role in important cellular networks.

Our model
For a given mRNA transcript, tissue-specific gene expression is modeled as a function of genotype and methylation -Y = Jα +Gβ +Mλ+MGφ +Au+Bv+Cw+Dx+ξ (2) where Y is an nt-dimensional vector of expression levels in t tissues and n individuals, α is a vector of tissuespecific intercepts, G is nt-dimensional vector of genotypes, β is a fixed effect of genotype across tissue, M is nt-dimensional vector of methylation levels, λ is an overall methylation-specific fixed effect, MG is nt-dimensional vector of the product of methylation and genotype, φ is the regression coefficient for genotype and methylation interaction (fixed effect), u ∼ N 0, τ AA T is a vector of subject-specific random effect, v ∼ N 0, γ BB T is a vector of tissue-specific random effects, w ∼ N 0, δCC T is a vector of random effects that describes the interaction effect between genotype, methylation and tissue, x ∼ N 0, θDD T is a vector of random effects describing tissue-specific methylation effects and ξ ∼ N (0, I nt ). The matrices J, A, B, C, and D are design matrices with B being a function of genotype, C is a function of both genotype and methylation data and finally, D is a function of just the methylation data. J is nt × t dimensional matrix denoting the design matrix for the tissue-specific intercepts. A is nt × n design matrix for the subject-specific intercepts. B is a nt×t design matrix of stacked genotypes.
C is a nt × t design matrix of stacked (product of ) tissuespecific methylation and genotype data. D is nt × t design matrices of stacked tissue-specific methylation data. The parameters of interest are γ , δ, β and φ; α, λ, τ , θ and are nuisance parameters. Alternatively, we can represent the distribution of Y conditional on methylation and genotype as - From our model, the log-likelihood function of the parameters conditional on the genotype and methylation data is given by - where represents the vector of all the variance components involved in and c is a constant. We test the null hypothesis that H 0 : β = φ = γ = δ = 0, i.e. the variant does not affect gene expression across any of the tissues.
To do so, we compute the efficient scores for γ , δ, β and φ by projecting off components correlated with the nuisance parameters. The reduced model under the null is - The efficient scores evaluated under the null are given by - whereŶ are the residuals from the model,Ḡ is an ntdimensional vector of mean-centered genotypes, MG is an nt-dimensional vector of mean-centered product of genotypes and methylation, andˆ =ˆ I +τ ZZ T +θ DD T . Our joint score test will test for the effect of genotype on 1) an overall shift in the gene expression, 2) tissue-specific interaction (G × T), 3) overall methylation (G × M), and 4) tissue-specific methylation (G × M × T). More on the individual components of our score test can be found in the Additional file 1 section. We propose a weighted sum of the above components (under the null) to arrive at our joint score test statistic, U ζ . Since U β and U φ are linear in Y while U γ and U δ are quadratic, we propose the following rule to combine them - where a β , a φ , a γ and a δ are scalar constants chosen to minimize the variance of U ζ . Under the null, U ζ is distributed as a mixture of chi-square random variables. We use Satterthwaite method [28] to approximate the p values from a scaled χ 2 distribution by matching the first two moments as Var(U ζ ) .

Simulations
For a positive integer t that represents number of tissues, if 1 denotes a column vector of t ones and I denotes the corresponding t × t diagonal matrix, following the t-variate normal law denoted by N t [μ, ] with mean μ ∈ R t and variance ∈ R t×t , expression levels of a target gene j at a single locus by using the following vectorized form of the linear mixed model - where y ij is a t × 1 vector of gene expression data, α t is the tissue-specific intercept (α t ∈ R t ), β j describes the main additive genotypic effect (β j ∈ R 1 ), λ j describes the overall effect due to methylation (λ j ∈ R 1 ), φ describes the interaction effect between the overall methylation and genotype (φ j ∈ R 1 ), g i is the value of a bi-allelic genotype such that g ∈ (0, 1, 2) represents the number of copies of the minor allele. The random effect b j ∈ R t represents tissue-specific effect of the genotype, c j ∈ R t represents tissue-specific interaction effect between methylation and genotype, d j ∈ R t represents tissue-specific methylation effect, and a i ∈ R 1 is a subject-specific random intercept. We assume that all the random effects are independent and that a i ∼ N 1 (0, τ ), b j ∼ N t (0, γ I), c j ∼ N t (0, δI) and d j ∼ N t (0, θI). Methylation data was generated independently from a multivariate normal distribution with mean zero and positive definite variance-covariance matrix. We use 1000 data replicates to evaluate the type I error and for power calculations. Simulations were performed by varying the following parameters-β (additive genetic effect), φ (G × M effect), the proportion of variation explained by the G × T effect PVE γ ≡ γ θ+τ + +γ +δ and the proportion of variation explained by the G × M × T effect PVE δ ≡ δ θ+τ + +γ +δ . A linear mixed effects model was fit using the package lme4 [29,30] in the statistical environment R (R Core Team). The significance of an association between a mRNA -SNP pair in a tissueby-tissue (TBT-eQTL) analysis is assessed by the p value obtained using lm function in R by fitting the following linear regression model.
For each mRNAcis-SNP pair, TBT-eQTL analysis was performed using the following linear regression model - where Y is either gene expression data and G represents genotypes encoded as the number of copies of minor allele. The test statistic is the minimum p value over the total number of tissues from linear regressions performed separately in each tissue for each mRNA -SNP pair. Statistical significance was determined at a nominal p value of 0.05 for all power simulations (in case of TBT-eQTL analysis, it is 0.05 k where k is the number of tissues).

Gene expression data
Gene expression on four brain regions are publicly available as rank-invariant [31] normalized gene expression data ("series matrix file"). All the negative values in the gene expression dataset are changed to a 1 and the entire dataset was then log2 transformed. Before generating the PCA plots, samples with African and Asian ancestry (n = 2) were removed from the analysis in order to keep the study a homogenous mixture of European-Caucasians. All the gene expression probes on sex chromosomes X and Y were removed from the analysis. Each gene expression probe was then adjusted for known variation contributed by batch effects and biological covariates such as tissue bank, gender, hybridization batch and numeric covariates such as post-mortem interval (PMI) and age as well as unknown variation using surrogate variable analysis (SVA) model [32].
Gene Expression ∼ Biological Covariates + Known Batch Effects It has previously been shown that the number of cis-eQTL detected significantly improved when multiple PCs were removed from the expression data [33].

Methylation data
Methylation data, obtained as a "series matrix file" consisted of Beta-values, which represent the ratio of methylated probe intensity and the overall intensity (sum of methylated and unmethylated probe intensities) [34]. We followed the previously mentioned method to preprocess methylation data using the SVA model. The biological covariates here include tissue bank, gender, hybridization batch and numeric covariates such as post-mortem interval (PMI) and age.

Genotype data
The genotype data was obtained from dbGAP database (phs000249.v1.p1) following requisite author permissions. The genotype data was recoded into a SNP matrix of values 0, 1 and 2 representing minor allele counts. Samples with African and Asian ancestry were removed from the analysis in order to keep the data relatively homogeneous with patients of European-Caucasian ancestry. These SNPs were filtered on the missing-ness of the individual data and the SNP data (excluded SNPs with missing values), followed by MAF (included SNPs with MAF ≥ 0.05)and Hardy-Weinberg equilibrium (HWE; p-values ≤ 0.001) in the same order using PLINK [35] software. The resulting dataset has 400,097 SNPs after preprocessing.