- Research
- Open access
- Published:
Estimating the effect size of a hidden causal factor between SNPs and a continuous trait: a mediation model approach
BMC Bioinformatics volume 23, Article number: 420 (2022)
Abstract
Background
Observational studies and Mendelian randomization experiments have been used to identify many causal factors for complex traits in humans. Given a set of causal factors, it is important to understand the extent to which these causal factors explain some, all, or none of the genetic heritability, as measured by single-nucleotide polymorphisms (SNPs) that are associated with the trait. Using the mediation model framework with SNPs as the exposure, a trait of interest as the outcome, and the known causal factors as the mediators, we hypothesize that any unexplained association between the SNPs and the outcome trait is mediated by an additional unobserved, hidden causal factor.
Results
We propose a method to infer the effect size of this hidden mediating causal factor on the outcome trait by utilizing the estimated associations between a continuous outcome trait, the known causal factors, and the SNPs. The proposed method consists of three steps and, in the end, implements Markov chain Monte Carlo to obtain a posterior distribution for the effect size of the hidden mediator. We evaluate our proposed method via extensive simulations and show that when model assumptions hold, our method estimates the effect size of the hidden mediator well and controls type I error rate if the hidden mediator does not exist. In addition, we apply the method to the UK Biobank data and estimate parameters for a potential hidden mediator for waist-hip ratio beyond body mass index (BMI), and find that the hidden mediator has a large effect size relatively to the effect size of the known mediator BMI.
Conclusions
We develop a framework to infer the effect of potential, hidden mediators influencing complex traits. This framework can be used to place boundaries on unexplained risk factors contributing to complex traits.
Introduction
One of the goals in studying the associations between heritable traits and disease outcomes is to identify which of these factors are truly causal for the outcome. Expansion of causal inference studies like Mendelian randomization studies [1] in recent years have provided one piece of causal evidence to risk factors identified from observational epidemiological studies [2]. However, generating causal evidence between exposures and disease outcome leaves unaddressed the question of whether the genetic heritability of the trait can be fully explained by the set of known causal factors or if there exist additional ones that are unidentified but further explain disease risk or trait heritability. For example, it is known that high body mass index (BMI) and potentially low-density lipoprotein cholesterol (LDL-C) increase the risk of developing type 2 diabetes (T2D) [3, 4], but can the genetic heritability of T2D be fully explained by BMI and LDL-C? If the heritability of a disease can be fully explained by the set of existing known causal factors, then research on the disease can focus on studying the biological mechanisms of those causal factors. On the other hand, if there is genetic heritability that remains unexplained, other causal factors for the disease may exist and are currently hidden from observation. Studying the characteristics of the hidden causal factors may provide insights into the novel biological associations and mechanisms that remain undiscovered.
The classic mediation framework decomposes the associations between the exposures and the outcome into direct effect and indirect effect through a mediator (i.e., mediated effects) [5, 6]. The question described earlier can be considered within a mediation framework with the trait of interest as the outcome, identified single nucleotide polymorphisms (SNPs) associated with the trait as the exposure, and the known causal factors of the trait as the mediators. Under this framework, any remaining direct effects between the identified SNPs and the outcome trait are viewed as the residual associations that are not explained by causal factors or mediators included in the model. Thus, the residual associations could be due to one or more hidden mediators. As the first step towards learning about the residual associations in the following work, we consider the simplest case that the remaining direct effects between the identified SNPs and the outcome trait are due to a single hidden mediator. Under this case, we can further decompose the unexplained genetic heritability into two parts: (i) the SNP effects on the hidden mediator, and (ii) the effect size of the hidden mediator on the outcome trait. In this study, we aim to determine whether the hidden mediator exists, and if so, estimate the effect size of the hidden mediator on the outcome trait. We believe that the results of our work can be helpful in at least two ways. First, the compatibility of the data with the presence of a hidden mediator suggests additional work to understand complex trait heritability. Secondly, given the presence of a hidden mediator, enumerating the distribution of effects of that mediator across loci could be important. For example, in T2D, a locus which is entirely mediated by BMI / obesity (e.g., FTO) but not by other risk factors is interesting, and could point to known disease etiology [7]. In contrast, a locus which is not explained at all (or incompletely) by known mediators could help focus investigation on a locus where novel biological insight could be obtained.
In this work, we assume that the values of the SNPs, the known mediators, and the outcome trait are accurately measured and there are no unmeasured confounding variables in the model. Furthermore, we assume that the effect sizes of the SNPs on the standardized known and hidden mediators have the same distribution. This is reasonable due to the following: because there is little or no prior information about the hidden mediator, the unexplained genetic heritability can theoretically be decomposed by infinitely many combinations of the two parts. Thus, to limit the number of possible combinations of the two parts and infer a reasonable range of the hidden mediator’s effect size on the outcome, we propose to restrict the effect sizes of the SNPs on the hidden mediator to be similar to the effect sizes of the SNPs on the known mediators. Under this assumption, one part of the unexplained genetic heritability, that is, the SNP effects on the hidden mediator, can be learned from the SNP effects on the known mediators. Furthermore, the effect size of the hidden mediator on the outcome trait can be inferred by dividing the direct effects between the SNPs and the outcome trait by the inferred SNP effects on the hidden mediator.
The rest of the report is organized as follows: In the "Methods" Section, we provide a broad overview of our approach and describe each step of our method in detail. We also describe the settings for the simulation study. In the "Results" Section, we present the simulation results and the application on investigating the trait of waist-hip ratio. We conclude inthe "Discussion" Section with thoughts about limitations and possible extensions to the approach.
Methods
The mediation model and notations
We illustrate our method using a model with two known mediators, although our method can extend to the case with more than two mediators or only one known mediator. We denote a continuous trait of interest as Y, the vector of SNPs associated with Y as G, and the two known mediating causal factors of Y as M1 and M2 (Fig. 1A). M1 and M2 are both standardized to have unit variance. Furthermore, the SNP effects from G to M1 are represented by a vector a1 and the SNP effects from G to M2 are represented by a vector a2, the direct effects between G and Y are represented by a vector c, the effect size of M1 on Y is denoted by a scalar b1, and the effect size of M2 on Y is denoted by a scalar b2.
If there is an unexplained genetic heritability between G and Y (i.e., c \(\ne\) 0), then we assume a hidden mediator exists and is denoted by MH (Fig. 1B). The SNP effects from G to MH are represented by a vector aH. The goal of the proposed work is to infer the effect size of the hidden mediator MH on Y, denoted by bH. To account for the scenario that some of the SNPs in G are not associated with MH, we use πH to denote the proportion of the SNPs in G that are associated with MH.
If we were to observe MH, the direct effect vector c can be decomposed as c = aHbH. However, since MH is not observed, we can only estimate a1, a2, b1, b2, and c as shown in Fig. 1A but not aH and bH as shown in Fig. 1B. We propose to use the estimates of a1, a2, and c, denoted as a∗1, a∗2, and c∗ to infer aH and, subsequently, infer bH. For simplicity, we denote the joint vector of a1 and a2 and the joint vector of a∗1 and a∗2 as a and a∗, respectively.
Overview of the proposed method and the rationales
To infer the effect size of the hidden mediator, bH, we utilize the fact that we can decompose the direct effect c as aHbH. To do so, we assume that the SNP effects on the hidden mediator, aH, share some similarities with the SNP effects on the known mediators, a1 and a2. We first consider the simplest case, in which we assume that all the SNP effects (on both the known and unknown mediators) come from the same distribution. Under this assumption, the true mean and standard deviation of a∗H can be consistently estimated by the SNP effects a∗1 and a∗2 if the sample size is large and a large number of SNPs are included in the model. We estimate the SNP effects a∗1 and a∗2 by fitting two linear regression models with the known mediators M1 and M2 as the dependent variables and G as the independent variable. Recognizing that assuming similar SNP effect sizes and variances on different mediators may be a strong assumption, we present a more general setting in which the SNP effects could vary according to a three-level structure (Additional file 1).
One challenge in the decomposition of the direct effects of SNPs on Y is that we might not expect every SNP associated with the outcome trait will be associated with the known and hidden mediators. Therefore, we model the SNP effects on the known mediators (a), using a mixture model with a point mass at zero and a true effect size distribution that centers at a non-zero value. Because we estimate the SNPs effects on the known mediators (a∗) using linear regression models that come with estimation uncertainty, the distribution of a∗ will be a mixture of a distribution centered at zero and the true effect distribution with a non-zero mean and a variance that is larger than the true dispersion. Similarly, the SNP effects on the hidden mediator, aH, can also be modeled using a mixture model with a point mass at zero and a true effect distribution that centers at a non-zero value. For the same reason as in the case of a∗, the estimated c∗ will be the mixture of a point mass at zero and a true effect distribution not centered at zero and with a larger dispersion. Therefore, we utilize Gaussian mixture models (GMMs) to model the distributions of a∗ and c∗.
The proposed multi-step method
Our method consists of three major steps as shown in Fig. 2. In Step 1, we estimate the individual SNP effects on the known mediators to obtain a∗ and estimate the direct effects between the SNPs and the outcome trait to obtain c∗ by fitting a series of linear regression models. In Step 2, we fit GMMs on a∗ and c∗ using an Expectation–Maximization (EM) algorithm to separate the SNP effects on the known and hidden mediators from the zero-mean noises and estimate the GMM parameters from both distributions. In Step 3, we incorporate the estimated GMM parameters from Step 2 to a GMM Markov Chain Monte Carlo (MCMC) procedure to generate a posterior distribution for bH. Details of each step are presented below. All steps are implemented in R (version 3.6.1) [8].
Step 1: Mediation regressions
In Step 1, we estimate the SNP effects on the known mediators, a1 and a2, and direct effects between the SNPs by fitting linear regressions for each known mediator separately where the mediator (e.g., M1) is the dependent variable and the elements of G are the independent variables. To estimate the direct effects between the SNPs and the outcome trait, c, we fit a linear regression with Y being the dependent variable and G, M1, M2, and other covariates being the independent variables. The resulting estimated effects are a∗ and c∗. We denote a∗ as the joint vector of a∗1 and a∗2.
Step 2: EM
In Step 2, we separate the true effects in c∗ from the zero-mean noise component by fitting a GMM. In addition, because it is possible that not all the SNPs are associated with the set of known mediators, we also fit a GMM on a∗ to capture the actual effects of the SNPs on the known mediators. Specifically, we use the EM algorithm to fit the GMMs on a∗ and c∗ via the normalmixEM function in the R package mixtools (version 1.2.0) [9,10,11] with the initial value of mixing proportions, lambda, set to 0.5, which represents that initially, the SNPs have equal probabilities of being associated with the trait or not (i.e., the hidden mediator in the case of c∗ or the corresponding known mediators in the case of a∗).
The EM algorithm works as follows. Let Xi for i = 1…n be random variables generated from a GMM consist of two normal distributions (Eq. 1), and let Zi = {1,2} for i = 1…n be binary latent variables that each indicates which of the two normal distribution the corresponding Xi comes from.
Let \({\theta }^{(r)}= \{{\pi }_{1}^{(r)}, {\mu }_{1}^{(r)}, {\sigma }_{1}^{2, (r)}, {{\pi }_{2}^{(r)}, \mu }_{2}^{(r)}, {\sigma }_{2}^{2, (r)}\}\) be the set of GMM parameters computed at iteration r.
At iteration r, during the E step, compute \(E\left({Z}_{i}\left(k\right)|{x}_{i},{\uptheta }^{\left(r-1\right)}\right)=P\left({Z}_{i}=k|{x}_{i},{\uptheta }^{\left(r-1\right)}\right)\), where k = {1,2} as in Eq. 2. During the M step, compute elements of \({\theta }^{(r)}\) as in Eqs. 3–5.
The fitted GMM should consist of a near zero-mean normal distribution, which are due to the zero-mean noise and a non-zero-mean normal distribution, which are due to the true effects. The means and the standard deviations of a∗’s and c∗’s true effect component can be estimated by the means and the standard deviations of the non-zero-mean distributions from the fitted GMMs. To avoid occasional convergence issues and extreme estimates, we run the EM algorithm multiple times and take the median value of the estimated true effect distribution means from all the runs. Based on our experience, generating 15 runs of the EM GMM is typically sufficient to obtain a well-fitted model.
Step 3: GMM
In Step 3, we specify a GMM as shown in Eq. 6 and perform a MCMC procedure to generate a posterior distribution for bH using the R package rjags (version 4.10) [12, 13].
The data input into the GMM MCMC procedure are the elements of c∗. We denote the elements of c∗ as cj for j = 1…n, where n is the number of SNPs in the model. \({\upmu }_{a}\), \({\upsigma }_{1}^{2}\), and \({\sigma }_{2}^{2}\) are constants in the model. The value of \({\mu }_{a}\) is set to the estimated mean of the true effect Gaussian distribution of the fitted GMM on a∗ in Step 2; the value of \({\sigma }_{1}^{2}\) is set to the estimated variance of the zero-mean Gaussian distribution in the fitted GMM in step 2; the value of \({\sigma }_{2}^{2}\) is set to the estimated variance of the non-zero-mean Gaussian distribution in the fitted GMM in step 2. The parameters estimated by this MCMC procedure are bH and the binary indicator variables zj for j = 1…n, which indicate which Gaussian distribution in the GMM the corresponding cj belongs to. We specify the same categorical distribution prior for zj, where the weight parameters of the two categories are set to the corresponding estimated weights for the two Gaussian distributions in the fitted GMM in Step 2. Lastly, we specify a normal prior with mean zero and variance 100 for bH so its distribution is almost flat at small values close to 0, which is the potential region of the hidden mediator’s effect size.
When bH is equal to zero, the distribution for c∗ reduces to a single Gaussian distribution instead of a GMM, and the resulting GMM from EM will likely assign a very small weight to one of the Gaussian distributions in the GMM. If the Gaussian distribution that involves bH in the MCMC procedure happens to be the one that receives a very small weight, the posterior distribution of bH will span a very wide region around the true value of bH. This is because during each MCMC iteration, due to the small weight, very few or none of the elements in c∗ will be assigned to the Gaussian distribution that involves bH so that the MCMC procedure is uncertain about the estimation of bH. This situation can also occur when the true value of bH is very small such that the two Gaussian distributions in the GMM are not separable. Under these scenarios, the resulting interval estimators will be extremely wide and will not be useful in terms of giving a precise estimate of bH. Thus, if one observes an extremely wide posterior distribution of bH, we propose to flip the binary labeling of the fitted GMM from Step 2 and perform Step 3 again. Based on the results of the simulation study presented below, we observe that this adjustment generally results in more meaningful estimates of bH.
A simulation study
We conduct a simulation study to evaluate the proposed method. We consider a base case and eight additional settings (Table 1). The base case is an ideal setting for our method. For each of the eight settings, we vary different aspects of the base case and evaluate the behavior of our method. In Setting 1, we vary the proportion of the SNPs that are associated with the hidden mediator, which we denote as πH; in Setting 2, we allow two of the known mediators to have negative effects on the outcome trait; in Setting 3, we decrease the sample size (number of individuals); in Setting 4, we consider cases where there are 1 and 10 known mediators of the outcome trait; in Setting 5, we consider four cases where some of the known mediators affect other known mediators; in Setting 6, we simulate two cases under the three-level SNP effect structure; in Setting 7, we simulate the case where bH is equal to zero (i.e., negative control); in Setting 8, we vary the number of SNPs included in the model. For the base case and Settings 1–7, we consider a scenario where there are 70 SNPs associated with the outcome trait and another scenario where there are 500 SNPs associated with the outcome trait. We select the 500 SNP scenario because it is similar to the number of SNPs used in the data application example (522 SNPs) described in the next section. We also examine the 70 SNP scenario and repeated all the simulations done with 500 SNPs to access how our method behaves when there are a significantly smaller number of SNPs in the model. For both scenarios, we let the hidden mediator’s effect size on the outcome trait, bH, be 0.02, 0.25, and 0.5 and perform 1000 independent simulations for each of the three values of bH. Also, for Settings 1–6, we simulate data for 49 additional bH between 0.02 and 0.5 with a step size of 0.01 to show a behavior trend of our posterior distribution as bH increases. As previously mentioned, when the true value of bH is very small or equal to zero, the MCMC posterior distribution of bH may be too wide occasionally to make any meaningful inference about its true value. Therefore, we apply the labeling-switching adjustment if the width of the 90% quantile interval derived from the initial posterior distribution for bH is wider than 5. The detailed setting of the base case is presented in the next paragraph with its simulation results being presented in "Results" Section. The detailed settings and the simulation results of the other eight settings are presented in the Additional file 1. All simulations were performed in R (version 3.6.1) [8].
The data for the base case are simulated as follows. The sample size is 100,000. The SNPs associated with the outcome trait, G, are simulated independently with minor allele frequencies generated from a uniform distribution between 0.1 and 0.5, and we assume the SNP effects are additive. The SNP effects on the mediators a1, a2, a3, a4, a5 and aH are generated from mixture distributions of zero point-masses and normal distributions with the mean being 0.2 and the standard deviation being 0.08 as shown in Eqs. 7–12. The frequencies that the five known mediators are associated with the exposure SNPs are (0.5, 0.6, 0.8, 0.2, 0.5), i.e., 50% of the SNPs associated with the outcome trait are associated with M1; 60% of the SNPs associated with the outcome trait are associated with M2, and so forth. There are five known mediators of the outcome trait. The effect sizes of the five known mediators (M1, M2, M3, M4, M5) on the outcome trait are (0.4, 0.2, 0.3, 0.2, 0.4). The known mediators are generated as Eqs. 13–17. The proportion of the SNPs that are associated with the hidden mediator, πH, is 0.8. The hidden mediator is generated as Eq. 18. We also include two covariates, C1 and C2, in the outcome model. C1 is generated from a normal distribution with a mean of 7 and a standard deviation of 0.5; C2 is generated from a normal distribution with a mean of 4 and a standard deviation of 0.4. Their corresponding effect sizes on the outcome trait are 0.8 and -0.3. The outcome trait is generated based on Eq. 19. Note that we only add a relatively small error term (\(\varepsilon\)) to the outcome trait because we assume that almost all the leftover genetic heritability of the outcome trait can be explained by the hidden mediator. The mediators in the base case do not affect each other (i.e., no correlation among M1, M2, M3, M4, and M5).
Data application
We apply the proposed method to UK Biobank data. We consider Waist-to-hip Ratio (WHR) as the outcome, the significant SNPs from the latest GWAS meta-analysis of WHR as the exposure, and body mass index (BMI) as the known causal mediator between WHR and the SNPs [14]. The goal of the analysis is to determine whether there exists a second, hidden mediator on WHR and if so, estimate the effect size of this hidden mediator. In addition to waist circumference, hip circumference and BMI, we also include sex, age, and the first ten genetic principal components in the UK Biobank phenotype data as covariates in the mediation model. Only individuals with European ancestry are considered for the current analysis, which includes those described as “British”, “Irish”, “White” or “Any other white background”. Individuals with missing phenotype data (i.e., “NA”) in any of the data fields are removed.
Because the GWAS meta-analysis used for identifying the exposure SNPs involves UK Biobank data, to minimize over-estimated SNP effect sizes in our analyses, we use a stringent p-value threshold of 5 × 10−9 when choosing the exposure SNPs. The resulting SNPs are clumped using the R package TwoSampleMR (version 0.5.6) with a clumping window of 250 kb and a cutoff for correlation due to linkage disequilibrium (LD r2 = 0.01) based on the 1000 Genomes Continental European groups reference [15, 16]. After LD clumping, a total of 535 independent SNPs associated with WHR are identified and extracted from the imputed UK Biobank genotype using plink (version 2.0) [17] (Additional file 2: Table S1). Furthermore, ten SNPs with low imputation quality (INFO score < 0.9) are dropped from further analyses (labeled with “INFO” in Additional file 2: Table S1), and a hard-call threshold of 0.4 is used when converting the imputed alleles probabilities to the number of allele copies. After joining the genotype data with the phenotype data, we further remove two SNPs with more than 10,000 missing rows (labeled with “NA” in Additional file 2: Table S1) and a multi-allelic SNP (labeled with “M” in Additional file 2: Table S1). Individuals with missing data (i.e., “NA”) in any of the data SNP fields are removed. In the end, 218,277 individuals and 522 SNPs are included in the application of our method. WHR is calculated as the ratio of waist circumference to hip circumference, and WHR and BMI are standardized to have means of zero and standard deviations of one.
The cleaned data are analyzed using the proposed multi-step method according to the flowchart in Fig. 2. In Step 1, an initial regression was performed to access the effect direction of the SNPs on the outcome trait WHR using WHR as the dependent variable and all the SNPs, sex, age, and the first 10 genetic principal components as the independent variables. Based on the direction (positive or negative) of the estimated effect, the coding of each SNP is flipped such that all SNPs have a positive effect on WHR. In the subsequent mediator (BMI) regression models (to obtain a∗) and the outcome (WHR) regression model (to obtain c∗), sex, age and the first ten principal components are adjusted for as covariates. For both a∗ and c∗, SNPs with effects that are greater than the 3rd quartile + 3 × interquartile range (IQR) and values that are smaller than the 1st quartile—3 × IQR are removed to avoid the downstream GMM methods to be driven by these outliers. A total of 12 SNPs were removed (label with “O” in Additional file 2: Table S1). In Step 3, the MCMC chain length is set to 30,000 with a burn-in length of 5,000. We also estimate BMI’s effect on WHR by fitting a regression model with the dependent variable being WHR and the independent variables being BMI, sex, age and the first ten principal components. As a sensitivity analysis, we repeated the proposed multi-step method multiple times with some SNPs dropped. Specifically, we randomly divided 522 SNPs into 20 groups (i.e., approximately 26 SNPs per group), and the analysis was repeated 20 times with each of the 20 groups dropped.
Results
Simulation results
For the base case, simulation results on the median, mean, 90% highest density interval (HDI) and 90% quantile interval (QI) of bH’s posterior distributions are summarized in Fig. 3. Rather than the usual 95% intervals, 90% intervals are reported to avoid the potential instability at the tails of the MCMC posterior distributions. We also report the root mean square error, the average bias, the number of outliers of the mean and the median point estimators, the proportion of the times that the HDI and QI contain the true value of bH, and empirical power / type I error of the HDI and the QI interval estimators for each of the eight simulation settings (Additional files 3 and 4: Tables S2 and S3). The outliers are defined as the values more extreme than the third quartile + 1.5 * (the third quartile – the first quartile) or the first quartile – 1.5 * (the third quartile – the first quartile). For the base case and Settings 1–6, the empirical power is calculated as the number of simulations with the interval not containing zero divided by the total number of simulations. For Setting 7, the empirical type I error is calculated as the number of simulations with the interval not containing zero divided by the number of total simulations.
Under the base case, the performances of the posterior median and mean of bH are similar; both are close to the true value of bH, and slightly downward biased only when the true value of bH is 0.5. Both the median and the mean have a smaller variation for the cases with 500 SNPs compared to 70 SNPs (Fig. 3A –H). In the case with 70 SNPs, where both bH = 0.25 and bH = 0.5, for one out of the one thousand simulation runs the median and the mean were far away from the corresponding true value (at values close to zero) (Fig. 3A). But we did not observe this in the 500 SNPs case. The 90% HDI and the 90% QI behaved similarly as well; the intervals were wider for the cases with 70 SNPs than 500 SNPs only a few times, and the HDI and the QI were extremely wide. For bH = 0.02, bH = 0.25 and bH = 0.5 98.1%, 98.3%, and 97.3% of the 90% HDI and the 90% QI contains the true value of bH, indicating both intervals were conservative. In addition, when the true value of bH is small, both interval estimators were slightly wider. Although the behavior of the posterior distribution of bH did not change dramatically with the number of SNPs in the model, having more SNPs in the model can lead to slightly better estimations of bH in terms of both the point estimators and the interval estimators.
Simulations were also conducted for eight additional settings. In Setting 1, where a low proportion of the SNPs are associated with the hidden mediator, the posterior median and meanindicated slightly larger downward biases, and the HDI and the QI were wider when there is a smaller number of SNPs in the model. In contrast, when all of the SNPs were associated with the hidden mediators, the posterior median and mean were biased upward and the HDI and QI were less likely to capture the true value of bH when the true value of bH is large (Additional file 1: Sect. 3.1). For Settings 2, 3 and 4, the simulation results showed that having known mediators with negative effects on the outcome trait and varying the sample size (number of individuals) and the number of known mediators did not have dramatic impacts on the posterior distribution of bH (Additional file 1: Sect. 3.2, 3.3, 3.4). We also observe that in Setting 5, the posterior median and mean can be biased and the HDI and the QI are less likely to include the true value of bH if the causal relationships among the known mediators are not appropriately adjusted for in the mediation regressions during Step 1 (Additional file 1: Sect. 3.5). Furthermore, if the assumption that the SNP effects on all the mediators come from the same distribution does not hold, depending on the degree, the posterior median and mean can vary greatly and the HDI and the QI can have a low chance to include the true value of bH as shown in Setting 6 (Additional file 1: Sect. 3.6). Next, according to the simulation result for Setting 7, the posterior median and meanwere close to the true value of bH, zero, and the HDI and the QI have Type I error rates thatwere close to 0.1 (Additional files 1 and 4: Sect. 3.7 and Table S3). Finally, the simulation result for Setting 8 showed that as the number of SNPs in the model decreases, the interval estimates become wider, and the point estimators become less precise, which is as expected. However, the point and interval estimators still have decent performance when therewere only 20 SNPs in the model, indicating that the performance of our method is not greatly affected as long as there are asufficient number of SNPs in the model. (Additional file 1 and 4: Sect. 3.8, Table S3). Detailed results of the additional settings are presented in the Additional files.
Application on waist-hip ratio
The regression estimated SNP effects on BMI (a∗) and direct effects between the SNPs and WHR c∗ are shown in Fig. 4A and C, respectively. As shown in the histograms, therewere some extreme values or outliers in both a∗ and c∗. The histograms with outliers removed are shown in Fig. 4B and D. Two modes can be observed from the distribution of a∗ with the mode on the left being approximately at zero suggesting a∗ follow a mixture model. It is less clear whether the distribution of c∗ has two modes. EM estimated means of a∗ and c∗ are indicated by the vertical red lines shown in Fig. 4B and D. The fitted GMM on a∗ estimates a mixture weight of 0.882 for the distribution on the left, and the fitted GMM on c∗ estimates a mixture weight of 0.836 to the distribution on the left. From the posterior distribution of bH, the point estimate for bH using the posterior median is 1.5556 and 1.5562 using the posterior mean; the HDI is (1.4658, 1.6440); the QI is (1.4683, 1.6469). The whole procedure of the multi-step method on waist-hip-ratio (218,277 individuals, 522 SNPs, 1 known mediator, 5,000 iterations for the MCMC burn-in, and 30,000 iterations for the MCMC chain length) took 6.33 min on a MacBook Pro laptop with a 2.8 GHz Intel Core i7 processor and 16 GB memory. The runtime can be variable based on how long the MCMC chain is set to. In the sensitivity analysis that repeats the multi-step method 20 times with approximately 1/20 of the SNPs dropped each time, the median of the posterior median of bH from 20 repeats is 1.5486, the first quartile is 1.4483, the third quartile is 1.6355, the smallest value is 1.0511, and the largest value is 1.9575. For the posterior mean of bH, the median of the 20 repeats is 1.5494, the first quartile is 1.4492, the third quartile is 1.6368, the smallest value is 1.0511, and the largest value is 1.9584. The posterior median and mean from three repeats are relatively extreme (posterior median = 1.0511, 1.0643, 1.9574) indicating that some SNPs are more influential than other SNPs. Finally, the effect of BMI on WHR conditional on sex, age and the first ten principal components is estimated to be 0.3834 (95% CI: 0.3807, 0.3861).
Discussion
In this work, we propose to infer the effect size of a potential hidden mediator on a trait of interest based on the observed associations between the trait of interest, its known causal factors, and the associated SNPs that have been identified previously. Utilizing the mediation framework, we propose a multi-step method to estimate the effect size of the hidden factor by treating the trait of interest, its associated SNPs, and the known causal factors as the outcome, exposure, and known mediators in the mediation model, respectively. Assuming the direct effects between the outcome trait and the identified SNPs that are unexplained by the known mediators can be explained by a hidden mediator, we obtain the effect of this hidden mediator on the outcome trait by decomposing the direct effects between the outcome trait and its associated SNPs into the SNP effects on the hidden mediator and the hidden mediator’s effect size on the outcome trait. In Step 1 of our proposed method, we estimate the SNP effects on the known mediators and the direct effects between the SNPs and the outcome trait via a series of linear regressions. In Step 2, we fit two GMMs on the estimated SNP effects on the known mediators and the direct effects between the SNPs and the outcome trait using the EM algorithm to separate the actual SNP effects from the zero-mean noises (i.e., the estimated effects of those SNPs not associated with the known and hidden mediators). Lastly in Step 3, based on the EM estimated GMM parameters, a GMM MCMC procedure is applied to generate a posterior distribution for the hidden mediator’s effect size, bH.
Through extensive simulation studies, we show that our method can produce a posterior distribution that captures bH well. We observe that if the model assumptions are correct, both using the posterior median and mean to estimate bH provides only small biases, and good coverage of the simulated true effect by the 90% HDI and QI. When there are more identified SNPs associated with the outcome trait, in general, both the point and the interval estimators perform better. Also, our posterior distribution estimates the hidden mediator’s effect size well if some of the known mediators have negative effects on the outcome trait while others have positive effects. Our method can also accommodate varying numbers of known mediators and the performance is not dramatically affected by decreasing sample size. In addition, our method estimates bH well even if the hidden mediator does not exist (i.e., bH = 0). However, we notice that when the hidden mediator is associated with all of the identified SNPs that are associated with the outcome trait, both the median and the mean are upward biased and both the HDI and the QI are less likely to include the true value of bH. This is expected because the assumed GMM distribution on the estimated direct effects between the SNPs and the outcome trait is wrong. On the other hand, if the hidden mediator is associated with too few identified SNPs, the point estimators have large biases and the interval estimators become wider. This is also expected as there are fewer SNPs that can provide information about bH. Also, if causal relationships exist among the known mediators, these effects need to be adjusted in the mediation regressions. Otherwise, the point estimators can have relatively large biases and the interval estimators are likely to include the true value if the causal relationships are large enough. Finally, our method assumes that the SNP effects on different mediators (including both the known and hidden mediators) are similar. The posterior distribution estimates bH poorly when the SNP effects on different mediators are very different, especially when the number of SNPs in the model is large. This is because little information can be borrowed from the observed SNP effects to infer the SNP effects on the hidden mediator, which can make it difficult to decompose the direct effects between the outcome trait and the associated SNPs to estimate bH.
Occasionally, a posterior distribution with little information about bH (i.e., is extremely wide) can be generated when the true value of bH is either zero or very close to zero such that the distribution of the estimated direct effects between the outcome trait and the associated SNPs follows a single Gaussian distribution around zero rather than a GMM. Under this scenario, by chance alone, the EM algorithm in Step 2 may assign a tiny weight to the distribution that involves bH in the MCMC model in Step 3 and assign a large weight to the other distribution in the GMM. As a result, little data can be used to infer bH in the MCMC procedure. To avoid this situation, we suggest one inspect the histogram of estimated direct effects and the EM fitted GMM from Step 2. If the histogram of estimated direct effects does not have two modes and is centered approximately at zero, and the EM fitted GMM assigns a tiny weight to the distribution that involves bH relative to the weight of the other distribution, then it is reasonable to flip the binary labeling of GMM and proceed to Step 3.
We applied our proposed method on UK Biobank data to estimate the effect size for a potential hidden mediator of waist-hip ratio. From the posterior distribution generated by our method, the posterior median estimates that a potential hidden mediator exists in the European population with an effect size of 1.56 (90% QI: 1.47, 1.64). This result suggests that the hidden mediator has a larger effect on waist-hip ratio comparing to BMI (0.38). Some caution here is warranted, as we used the same UK Biobank data for both identifying SNPs associated with waist-hip ratio and for estimating the SNP effects. Although we used an extra stringent p-value threshold (5 × 10−9) for filtering SNPs associated with waist-hip ratio to mitigate biases from winner’s curse, some degree of biases from the winner’s curse is unavoidable. A more optimal approach is to identify SNPs associated with waist-hip ratio and perform the estimation in two independent populations with similar ancestries such that the effects are similar in the two populations but the biases from the winner’s curse are minimal. There are a couple of possible biological or physiological explanation for the hidden mediator identified in the current analysis that could contribute to the remaining association. For example, the distribution of brown fat (the cell type responsible for non-shivering thermogenesis) varies across individuals and could influence body size and weight, and studies have reported the negative association between WHR and having active brown fat in males [18] and the negative association between WHR and neuregulin 4, an adipokine secreted by brown fat, in children [19]. However, there are no great quantitative measurements of that trait nor have genetic studies of this trait been performed. Another possibility is the propensity for physical activities, which can be measured by actigraphy and could certainly influence body size and weight [20, 21].
Our proposed method has some limitations. First, the performance of the posterior distribution for bH under our method largely depends on the how well we estimate the regression coefficients for the mediation regressions during Step 1, as the downstream steps treat the estimated regression coefficients as input data. Precise and accurate estimates of the coefficients require the mediation regression to be performed on data sets with large samples size, especially when many SNPs are included in the model. For large population genetics data, this may be less of a concern. As we learned from the simulation studies, the posterior distribution captures bH well when the sample size is 25,000, which for today’s conventional size of DNA biobanks is not unreasonable. However, if strong correlations among the known mediators are not properly adjusted in the mediation regressions, the regression coefficients will be biased, which can lead to substantial biases in the resulting posterior of bH. Thus, our method relies on one’s input of prior domain knowledge and the specification of reasonable regression models. Future work can be devoted to extending the methods to address the situation when the known mediators are correlated with each other and no prior knowledge on the causal directions among the known mediators is available. Second, our method relies on the strong assumption that the SNP effects on all the mediators between the SNPs and the outcome trait come from the same distribution, and departure from this assumption can lead to substantial variation for the posterior distribution for bH such that the inference based on the posterior distribution can be very inaccurate. However, we argue that it is somewhat reasonable to constrain the SNP effects on the hidden mediator to be similar to the SNP effects on the known mediators and there will be infinite number of ways to decompose the direct effect between the SNPs and the outcome trait without this assumption. Future work can focus on relaxing this assumption or inventing ways to incorporate information about the potential SNP effect sizes from another modality. Furthermore, we view our work that considers the simplest case with one hidden mediator as an initial step in learning about the residual associations between an outcome and the existing mediator. The current model cannot distinguish whether the estimated effect size is for one hidden mediator or the combined effect size of multiple hidden mediators. In reality, any leftover associations—and perhaps more likely—could be due to multiple hidden mediators each with a relatively small effect size. Even if there are multiple hidden mediators, our approach to determine whether the known mediators fully explain the heritability of the outcome trait is still applicable, and we can interpret the estimated effect size assuming a single hidden mediator as the combined effect of multiple potential hidden mediators. Future work can be devoted to developing a method capable of inferring the number of potential hidden mediators and decomposing the combined effect into individual effects from each of the hidden mediators. Furthermore, specifying priors for the SNP effect sizes based on the heritability model, as suggested by one of the reviewers, may be a way to incorporate the estimation uncertainties of the SNP effect sizes into the model, lead to a better estimation of the hidden mediator’s effect size. Lastly, future work can extend the continuous outcome trait to other outcome types such as binary variables potentially via the counterfactual framework. This will involve utilizing generalized linear models in the meditation regressions in Step 1. Such extension will make the method useful for many applications such that the disease trait is binary.
Conclusions
We developed a method for estimating the effect size of a potential hidden estimator between a trait of interest and its associated SNPs. In the first step, a series of regression models are used to estimate the SNP effects on the hidden mediators and the direct effects between the SNPs and the trait of interest. In the second step, GMM models are fitted to the estimated SNP effects on the hidden mediators and the estimated direct effects between the SNPs via the EM algorithm. In the final step, an MCMC procedure that utilizes parameters estimated in the second step is used for generating a posterior distribution for the hidden mediator’s effect size. Extensive simulations show that our method can generate accurate estimators for the hidden mediator’s effect size. Also, when the hidden mediator does not exist, our method has controlled type I error rates. By applying our method to UK Biobank data, we found a potential hidden mediator between waist-hip-ratio and its associated SNPs in the European population and estimated its effect size on waist-hip-ratio to be larger than a known mediator BMI’s effect size on waist-hip-ratio. Although, as an initial step toward finding the hidden mediators between a trait of interest and its associated SNPs, we hypothesize a simple model with only one hidden mediator left, we hope that our method can provide some insights into the characteristics of the remaining one or multiple hidden mediators and inspire further method developments in this direction.
Availability of data and materials
The UK Biobank data are provided under application ID 32133 and can be accessed by others at the UK Biobank website (https://www.ukbiobank.ac.uk). R Scripts with the simulation code and an R package that implements the described method are available at https://github.com/zhd007/HiddenMediator [22].
References
Thom CS, Ding Z, Levin MG, Damrauer SM, Lee KM, Lynch J, Chang KM, Tsao PS, Cho K, Wilson P, Assimes TL, Sun YV, O’Donnell CJ, Million Veteran Program VA, Vujkovic M, Voight BF. Genetic determinants of increased body mass index mediate the effect of smoking on increased risk for type 2 diabetes but not coronary artery disease. Hum Mol Genet. 2020;29(19):3327–37.
Maddatu J, Anderson-Baucum E, Evans-Molina C. Smoking and the risk of type 2 diabetes. Transl Res. 2017;184:101–7.
Herder C, Karakas M, Koenig W. Biomarkers for the prediction of type 2 diabetes and cardiovascular disease. Clin Pharmacol Ther. 2011;90(1):52–66.
Njajou OT, Kanaya AM, Holvoet P, Connelly S, Strotmeyer ES, Harris TB, Hsueh WC. Association between oxidized LDL, obesity and type 2 diabetes in a population-based cohort, the health, aging and body composition study. Diabetes Metab Res Rev. 2009;25(8):733–9.
MacKinnon DP, Fairchild AJ, Fritz MS. Mediation analysis. Annu Rev Psychol. 2007;58:593.
VanderWeele T, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiol Methods. 2014;2(1):95–115.
Huong PT, Nguyen CTT, Nhung VT. The association between FTO polymorphisms and type 2 diabetes in Asian populations: a meta-analysis. Meta Gene. 2021;30: 100958.
R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Peel DA, MacLahlan G. Finite mixture models. John & Sons. 2000.
Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80(2):267–78.
Tatiana B, Didier C, David RH, Derek Y. mixtools: an R package for analyzing finite mixture models. J Stat Softw. 2009;32(6):1–29.
Plummer M, Stukalov A, Denwood M. Package rjags: Bayesian graphical models using MCMC. R package version. 2016:4-6.
Plummer M (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd international workshop on distributed statistical computing (Vol. 124, No. 125.10, pp. 1–10).
Pulit SL, Stoneman C, Morris AP, Wood AR, Glastonbury CA, Tyrrell J, Lindgren CM. Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum Mol Genet. 2019;28(1):166–74.
Hemani G, Zheng J, Elsworth B, Wade KH, Haberland V, Baird D, Laurin C, Burgess S, Bowden J, Langdon R, Tan VY. The MR-Base platform supports systematic causal inference across the human phenome. elife. 2018;7.
Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Zhang Q, Ye H, Miao Q, Zhang Z, Wang Y, Zhu X, Li Y. Differences in the metabolic status of healthy adults with and without active brown adipose tissue. Wien Klin Wochenschr. 2013;125(21):687–95.
Wang R, Yang F, Qing L, Huang R, Liu Q, Li X. Decreased serum neuregulin 4 levels associated with non-alcoholic fatty liver disease in children with obesity. Clin Obes. 2019;9(1): e12289.
Doherty A, Smith-Byrne K, Ferreira T, Holmes MV, Holmes C, Pulit SL, Lindgren CM. GWAS identifies 14 loci for device-measured physical activity and sleep duration. Nat Commun. 2018;9(1):1–8.
Klimentidis YC, Raichlen DA, Bea J, Garcia DO, Wineinger NE, Mandarino LJ, Going SB. Genome-wide association study of habitual physical activity in over 377,000 UK Biobank participants identifies multiple variants including CADM2 and APOE. Int J Obes. 2018;42(6):1161–76.
HiddenMediator. https://github.com/zhd007/HiddenMediator. Accessed 4 June 2022.
Acknowledgements
Not applicable.
Funding
B.F.V. acknowledges support from the National Institutes of Health (DK101478, DK126194) and a Linda Pechenik Montague Investigator Award. M.D.R. acknowledges support from the National Institutes of Health (AI077505).
Author information
Authors and Affiliations
Contributions
Z.D. and B.F.V. conceived of the project and designed the experiments. M.D.R. provided data. Z.D. analyzed the data. Z.D., W-T.H., and B.F.V. developed the method and prepared a draft of the initial manuscript. All authors edited the manuscript. W-T.H and B.F.V. supervised the project. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
We confirm that all methods were carried out in accordance with relevant guidelines and regulations.
Consent for publication
Not applicable.
Competing interests
The authors have no conflicts to report.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. A pdf file containing supplementary text (Sections S1–S4) and figures (Figures S1–S20) including the discussion on a more generation assumption on the SNP effects (Section S1), additional simulation settings and results (Sections S2–S3), and a proof of the estimated SNP effect’s distribution (Section S4).
Additional file 2
. An Excel table containing the list of 535 independent SNPs associated with WHR.
Additional file 3
. An Excel table containing the simulation results for Settings 1–6 and 8
Additional file 4
. An Excel table containing the simulation results for Setting 7.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ding, Z., Ritchie, M.D., Voight, B.F. et al. Estimating the effect size of a hidden causal factor between SNPs and a continuous trait: a mediation model approach. BMC Bioinformatics 23, 420 (2022). https://doi.org/10.1186/s12859-022-04977-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-022-04977-4