Estimation of total mediation effect for high-dimensional omics mediators

Background Environmental exposures can regulate intermediate molecular phenotypes, such as gene expression, by different mechanisms and thereby lead to various health outcomes. It is of significant scientific interest to unravel the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposure and traits. Mediation analysis is an important tool for investigating such relationships. However, it has mainly focused on low-dimensional settings, and there is a lack of a good measure of the total mediation effect. Here, we extend an R-squared (R\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^2$$\end{document}2) effect size measure, originally proposed in the single-mediator setting, to the moderate- and high-dimensional mediator settings in the mixed model framework. Results Based on extensive simulations, we compare our measure and estimation procedure with several frequently used mediation measures, including product, proportion, and ratio measures. Our R\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^2$$\end{document}2-based second-moment measure has small bias and variance under the correctly specified model. To mitigate potential bias induced by non-mediators, we examine two variable selection procedures, i.e., iterative sure independence screening and false discovery rate control, to exclude the non-mediators. We establish the consistency of the proposed estimation procedures and introduce a resampling-based confidence interval. By applying the proposed estimation procedure, we found that 38% of the age-related variations in systolic blood pressure can be explained by gene expression profiles in the Framingham Heart Study of 1711 individuals. An R package “RsqMed” is available on CRAN. Conclusion R-squared (R\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^2$$\end{document}2) is an effective and efficient measure for total mediation effect especially under high-dimensional setting. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04322-1.

via a Venn diagram, in which the area where the three circles overlap is R 2 mediated . The R 2 mediated is interpreted as the variance of Y commonly explained by X and M , as in the commonality analysis (CA). Figure S1: Venn diagram demonstrating the Rsq measure in the single-mediator model: Area indicated with backslashes represents R 2 Y,M ; area with forward slashes represents R 2 Y,X ; and all the covered area represents R 2 Y,M X . The area in which the three circles overlap represents R 2 M ediated .

Characteristics of the measure R 2 mediated
The Rsq measure has many characteristics of a good effect size measure, yet it is considered to have heuristic value in certain situations, mainly because of one limitation: it can be negative in the single-mediator model, limiting its interpretation (1). The derivation of the Rsq measure originally came from CA (2), which is widely used in psychology and education research. CA partitions the regression R 2 into unique and common effects. The unique effects indicate that the variance is uniquely accounted for by the independent variable, while the common effects indicate that the variance is common to both the independent variable and the mediator(s). The negativity of the unique and common effects is well accepted in CA, where suppression is considered to be the main reason (3). In fact, it is believed that the suppression brings in extra information that clarifies or "purifies" the relationship between the dependent variable and the independent variable and leads to a practical increment in predictive validity (4).
In high-dimensional settings, suppression is very likely to occur. In light of the interpretation concern, we prove that the R 2 mediated is always positive under a consistent model (Proposition 1). We further show that the R 2 mediated is always positive under a general consistent model (Proposition 2), where suppression occurs. We find that the negativity is caused by suppression, but suppression does not always lead to the negativity of the R 2 mediated . In fact, under the high-dimensional setting, the scenario in which the negativity is less frequently occurred (Proposition 3). In addition, we did not encounter this concern in our real data analysis. In summary, the R 2 mediated is always positive under consistent and general consistent models and is often a trivial case under the high-dimensional setting in which the R 2 mediated becomes negative.
Additionally, we investigate the potential bias introduced by two types of non-mediators in Proposition 4 and 5. We also briefly show that our proposed R 2 mediated is invariant to certain transformations that may be useful for estimation under the high-dimensional setting in Proposition 6.
Preacher and Kelly (1) proposed several desirable properties for a good mediation effect measure. Although the proposed Rsq-based second-moment measure have many appealing properties, there are a few other concerns in the literature, leading to a few variants of the Rsq measure, such as those proposed in MacKinnon 2008 (5), Preacher and Kelly (2011) (1), de Heus (6), and Lachowicz et al. (7) under a single-mediator model. The υ measure proposed in Lachowicz (7) was later extended to multiple-mediator models in the structural equation modeling framework (8). They were aimed at additional potential advantages including a bounded range, a monotonic relationship with the product measure, and better dealing with spurious correlation, at the possible price of losing connection with the commonality analysis. The extension of these measures to the high-dimensional setting may not be as straightforward and necessarily as appealing as our proposed measure.
Without loss of generality, X and M are assumed to have variance of 1, and Y is centered at We have h = (r M 1 Y , . . . , r MpY ) T and V M M as a p × p matrix with cor(M i , M j ) as the (i, j) th component. By some algebra, it can be shown that such that where

Relationships among the mediation effect measures
The relationship between R 2 M ediated and the product measure cannot be clearly seen under the mixed-effect model. In the fixed-effect model, R 2 M ediated can be expressed in terms of a, b and r as in equation (S4).
From equation (S4), we can see that the variance of Y explained by M is a function of a,b and r. Under the circumstances where r is small, the product measure (b T a) and R 2 M ediated can have high correlation; however, R 2 M ediated is not guaranteed to monotonically increase with the product measure. This is considered as a potential drawback of using the Rsq measure by Lachowicz et al (7), who suggested a "corrected" Rsq measure under the single-mediator models, and later on to multiple-mediator models (8) with regard to this matter. We do not consider it as a potential drawback considering the product measure has its own limitation under multiple-mediator model, i.e., cancellation when the mediation effects from individual pathways are in different directions, as illustrated in our motivating Framingham Heart Study data example. Additionally, the correction may further increase the interpretation difficulty of this second-moment measure derived from the commonality analysis.
On the other hand, the proportion measure is also a function of a,b, and r, but it decreases as r increases, which is not true for R 2 M ediated . In fact, the proportion measure has a closer relationship with shared over simple effect (SOS), and SOS can be expressed as After some transformation, we can obtain the following equation: As the proportion measure is positive, the SOS and the proportion measure monotonically increase with each other. When SOS equals 1, the proportion mediated also equals 1. When no mediator exists, both SOS and the proportion mediated equal 0; however, when b T a = 0 but some individual pathways do not (a j b j = 0 for some j), the proportion mediated cannot capture the mediation effect (proportion= 0) while SOS can (SOS = 0). On the other hand, it is quite rare to happen when SOS equals 0, but proportion mediated does not (r has to be a certain function of a and b, r = . We note that R 2 M ediated and SOS are both zero when a 1 = a 2 = . . . = a p = 0; when almost all the variance of Y explained by the exposure is through the mediators, R 2 M ediated is close to R 2 Y,X and thus SOS ≈ 1; R 2

M ediated
and SOS approach 1 when b T a goes to infinity.

M ediated
Proposition 1: In the consistent model, where a j b j and r are in the same direction (a j b j r > 0) for j = 1, 2, . . . , p, R 2 M ediated ∈ (0, 1).
In addition, (r + b T a) 2 − r 2 /(1 + a T D −1 a) < (r + b T a) 2 + b T Db + τ for any a, b and r.
Next, we extend the proof to scenarios where b T a and r are in the same direction (b T ar > 0), and we define such models herein as general consistent models.

Proposition 2:
In the general consistent model, R 2 M ediated > 0. Proof: Under the general consistent model, (r + b T a) 2 > r 2 .
In addition, since D is a positive (semi)definite matrix, r 2 > r 2 /(1 + a T D −1 a).
M ediated is negative. Proof: When r is large and c ≈ 0, Since r 2 > 0 and 1 + a T D −1 a > 0, R 2 M ediated < 0. The first step to establish mediation according to Baron and Kelly (9) is that the independent variable must affect the dependent variable, that is, the total effect c should be different from 0. If the effect is not significant, the analysis for mediation analysis stops. Coinciding with this step, we show that the negativity of R 2 M ediated happens when this criterion does not hold.
More generally, it can be proven that R 2 M ediated < 0 when |r/c| > √ 1 + a T D −1 a using algebra.
Under the high-dimensional setting, a T D −1 a can be large and c is large enough to pass the first step of (9); therefore, the scenario where R 2 M ediated < 0 may not be very likely to happen.

Non-mediators and transformation
Variables that are not mediators may be falsely included in the mediation model. Here, we denote three types of non-mediators:  By some linear algebra, it can be shown that where D tt is D −1 with the first t columns and the first t rows.
In the special case where all M = Ø, Therefore, the inclusion of M (1) does not lead to bias in the point estimate of R 2 M ediated . When such variables are included in the mediation model, both R 2 Y,M and R 2 Y,XM increase the same amount. As a result, their effects cancel out in R 2 M ediated . Since the noise variables also have a j = 0, they do not bias the estimation either.
By some linear algebra, it can be shown that Y,M is biased away from its true value.The amount of bias does not diminish with the increase of sample size. In addition, Y,X stay the same. We conclude that In the special case where 0, but the true R 2 M ediated = 0, and it causes spurious and misleading conclusion. Moreover, under specific correlation structure, for example, the mediators are conditionally independent given X (D tq = D qt = 0), the bias of R 2 M ediated is always positive. Therefore, inclusion of M (2) is problematic.
Transforming correlated mediators in the original model to be uncorrelated may be helpful in dimension reduction, especially for high-dimensional data (10).

Proposition 6: R 2
M ediated is invariant to the following transformation of mediators: the mediators are transformed from M to P: P = uM, u is an orthogonal invertible p × p matrix (10).
In conclusion, such a transformation does not influence R 2 M ediated . Possible transformations include u such that uDu T = diag(D).

Consistency
Proposition 6: The estimation procedure coupled with iterative SIS-based variable selection guarantees the consistency of R 2 M ediated . Proof: Let the putative mediation variables in the initial assessment as It has been proven that correlation learning in the sure independence screening (SIS) method has the sure screening property, that is, P (A ⊂ A * 1 ) − → 1 as n − → ∞ under mild conditions (11). MCP is applied after the first dimension reduction step. Since it has the oracle Y,M X , and each component is estimated in the second half of the data based on the selected variables in the first step: whereĉ is the MLE of the total effect;σ 2 Y = n i=1 (Y 2 −Ȳ )/(n − 1);ĉ andσ 2 Y are consistent estimates of c and σ 2 . Moreover, we estimateφ 1 andφ 2 by a mixed-effect model using REML, which is consistent under mild conditions (12). According to the continuous mapping In conclusion, the consistency of R 2 M ediated is established for using SIS-MCP to perform variable selection and the mixed-effect model to estimate the variance components. The proof holds for any SIS with oracle property; iterative SIS served as an extension to the SISbased model selection method where the regularity conditions may fail. It has been proven that iterative SIS also has the sure screening property (13). Therefore, the result holds for our proposed method based on iterative SIS-MCP as well.

R 2
M ediated under a model with directed path among mediators When the mediators are conditionally dependent on the exposure due to the directed path among mediators, the proposed R 2 M ediated is still a valid effect size measure to capture the variation of the dependent variable explained by the independent variable through the mediators. For example, if there are two mediators M 1 and M 2 from independent X to outcome We can write the corresponding models as: Model (S6) can be further written as the following form by plugging in equation (S5): We denote a 2 = (a 2 + γa 1 ) and ξ 2 = (γξ 1 + ξ 2 ). Vanderweele and Vansteelandt showed that the product measure under this framework is a 1 b 1 + a 2 b 2 (14). For our R 2 M ediated , the residuals ξ 1 and ξ 2 are correlated and the corresponding correlation matrix D p×p is no longer diagonal. However, equations (S1), (S2), (S3) and (S4) still hold. The complicated dependence structure is reflected and captured by D p×p matrix. We have performed a simulation study to assess bias under the low-dimensional setting, as demonstrated in Supplementary Material Section 1.5.1.

M ediated
When variable selection is involved, the point estimate of R 2 M ediated can be calculated through the following steps in general: 1. Randomly split the data into two non-overlapping parts. Regress the dependent, independent, and putative mediator variables on the covariates to obtain the residuals in the first part and the second part; 2. Using the residuals as the 'adjusted' dependent variable to build an iterative SIS model for mediator selection in the first part; 3. Standardize the selected mediators and obtain the corresponding point estimate of M ediated in the second part following the main text Section 2.3.1.
To obtain the CI of the R 2 M ediated , we propose using the following bootstrap-based procedure: A. Draw B bootstrap samples of size n with replacement from the original sample; B. Calculate the point estimate of R 2 M ediated for each bootstrapped sample based on the above iterative SIS-mixed model procedure 1-3, denoted asR 2 1 ,R 2 2 , . . . ,R 2 B ; C. Extract the 2.5% and 97.5% quantiles ofR 2 1 ,R 2 2 , . . . ,R 2 B as the lower and upper bound of CI.
We have developed an R package 'RsqMed' for researchers and practitioners to estimate the R 2 M ediated and its CI based on the above described procedure. The package is publicly available at the GitHub website https://github.com/ytzhong/highD-mediation-analysis and CRAN. We would recommend implementing the iterative SIS model selection procedure with the BIC to select the tunning parameter. Variable selection methods based on crossvalidation are not supported in the current version of implementation; however, an extension is feasible but has to be done with caution (non-overlapping validation and training samples are required to avoid overfitting, which is not supported by the standard bootstrap procedure). Other resampling methods to correct post-selection bias are also possible (15).

Adjusting for covariates and partial R 2
In our estimating procedure, we suggest using the residuals as the 'new' dependent variable, independent variable, and mediators to take into account the potential confounders/covariates. Denoting the covariates as H, 'new' Y, X, and M as Y r , X r , and M r , we show that the new corresponding R 2 is partial R 2 or the coefficient of partial determination as follows.
It can be easily shown that using the 'adjusted' dependent variable does not change SSR, Therefore, the R 2 mediated can be interpreted as the additional amount of variance of the dependent variable that can be explained by independent variable and mediators given covariates.  Table S1 shows the results of bias and standard deviation assessment under the lowdimensional setting, i.e., (L1) to (L6). The estimators of all the mediation effect size measures had acceptable bias under the six scenarios, except that the R 2 M ediated estimators and their closely related SOS estimators were positively biased under scenario (L2). According to Proposition 5, the direction of bias is as expected. In addition, we showed that the bias was small when directed path among the mediators existed in (L6). Overall, R 2 M ediated estimated by the mixed-effect model had smaller bias and variance than that estimated based on the fixed-effect model, which also held for SOS. In contrast, using the fixed-effect model gave smaller bias than using the mixed-effect model for the product, ratio, and proportion measures under the low-dimensional setting.

Bias and variance under the high-dimensional setting
In the main text Table 1, we presented the bias and standard deviation of five different measures, i.e., product, proportion, ratio, R 2 M ediated , and SOS under the high-dimensional setting. However, they are not directly comparable because they are on different scales. Table S2 presents their true values. Note that the true proportion measure is negative due to different directions of the mediation effects and the p j=1 a j b j .

Bias and variance under the high-dimensional setting with model misspecification
We evaluated the robustness of the normal assumption of the random effects under the high-dimensional setting. Normally distributed random effect is commonly seen in the literature. It has been shown to be quite robust in real data analysis and highly correlated data in simulations (16; 17). There are multiple theoretical studies that consistently show the negligible impact of misspecifying random effect distributions for the linear mixed model (18; 19; 20). In the current context, we further evaluated whether misspecifying the distribution of the random effects would bias R 2 mediated in the scenarios (H6) to (H10) whose settings are exactly the same as (H1) to (H5) except that a j and b j follow a scaled t-distribution with the degree of freedom equal to one. Table S3 confirms that the bias under (H6) to (H10) was not larger than the bias under (H1) to (H5) with correctly-specified random effect distribution. In (H11), we simulated a j ∼ N (0, 0.2) and b j following a uniform distribu-

Simulation setting II
The MSE, bias, and standard deviation are presented in Table S4 and S5, corresponding to   Setting (V3) shared the same true Rsq with (V1) with 1% of the true signal. The bias of R 2 M ediated was -0.049 and standard deviation was 0.03, while the bias of ab(lasso) was 0.06 and standard deviation was 0.13. The amount bias of R 2 M ediated was close to (V1) with 1% of true signal (Table S5, bias=-0.049, SD=0.07) for Rsq (VS=iSIS-MCP). The bias of ab(lasso) was worse than that in (V1).
Setting (S4) shared the same true Rsq with (V1) with no true signal. The bias of R 2 M ediated was 0.0008 and standard deviation was 0.03, while the bias of ab(lasso) 0 and standard deviation was 0. The amount bias of R 2 M ediated was close to (V1) with 1% of true signal (Table S5, bias=-0.0008, SD=0.03) for Rsq (VS=iSIS-MCP). In the Lasso regression, none of the variables are selected, leading to the same estimation as p 0 = 1500.

Evaluation of finite-sample performance of consistency
Although we showed the consistency property of our proposed estimation procedure in theory, we are interested in its finite-sample performance. We evaluated the following simulation settings (C1 to C4), and we looked at three sets of sample size n = 750, 1500, and 3000 with the initial size of M 0 as p 0 = 1500; Additional parameters were the same as V1): r = 3,    Table S7 were worse than those in Table S6, but the bias in R 2 mediated was not necessarily worse.  Without loss of generality, we simulated one confounder H with a sample size of 1500, p 0 = 1500, and 200 replications. X, Y , and M were simulated following the models below:  Table S8, the residual method had a better performance in terms of higher true positive and lower false positive rate in variable selection. We note that the residual method is applicable to both continuous and non-continuous X and M due to its connection with the partial R 2 as elaborated in Section 1.4.1.  and C are shown in plots D, E, and F, respectively.

Pathway analysis for systolic blood pressure
For systolic BP, 182 genes whoseGEs were selected by the iterative SIS-MCP. We narrowed the gene list down to 102 genes by screening out GEs not associated with age (FDR-adjusted p-value larger than 0.1). We further performed a pathway enrichment analysis of the 102 genes that corresponded to the mediating GEs of systolic BP using the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 (24). One Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, Nucleotide excision repair pathway, consisting of three genes (ERCC6, ERCC8, and CUL4B ), were statistically significant with a p-value of 0.029. The nucleotide excision repair pathway is a well-known pathway associated with age-related vascular dysfunction, which in turn is associated with hypertension (25). Furthermore, when using FDR-adjusted p-value larger than 0.2 as the cutoff point, there were 123 genes involved. Four KEGG pathways were significantly enriched at the nominal significance level of 0.05. Table S9 presents these four pathways. The first two pathways are closely related by sharing three genes. Rheumatoid arthritis is a chronic inflammatory disease and could be preceded by immune abnormalities (26). It could be linked to hypertension by possible sharing of pathways (27). MAPK signaling pathway was evidenced in a rat study and others to mediate in the process of aging and hypertension, related to phenotypic switching of vascular smooth muscle cells (28; 29). It is reassuring to see the discovered pathways were supported by these previous studies.