- Research
- Open Access
- Published:

# HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data

*BMC Bioinformatics*
**volume 23**, Article number: 296 (2022)

## Abstract

Mediation analysis plays a major role in identifying significant mediators in the pathway between environmental exposures and health outcomes. With advanced data collection technology for large-scale studies, there has been growing research interest in developing methodology for high-dimensional mediation analysis. In this paper we present HIMA2, an extension of the HIMA method (Zhang in Bioinformatics 32:3150–3154, 2016). First, the proposed HIMA2 reduces the dimension of mediators to a manageable level based on the sure independence screening (SIS) method (Fan in J R Stat Soc Ser B 70:849–911, 2008). Second, a de-biased Lasso procedure is implemented for estimating regression parameters. Third, we use a multiple-testing procedure to accurately control the false discovery rate (FDR) when testing high-dimensional mediation hypotheses. We demonstrate its practical performance using Monte Carlo simulation studies and apply our method to identify DNA methylation markers which mediate the pathway from smoking to reduced lung function in the Coronary Artery Risk Development in Young Adults (CARDIA) Study.

## Introduction

Mediation analysis explores the underlying mechanism by which an independent variable (e.g., exposure or treatment) influences the dependent variable (e.g., health outcome) through a mediator variable [1]. Mediation analysis has been playing a major role in many areas, e.g., social studies, economics, and health sciences [2]. More recently, with the advancement of large-scale data collection techniques, there has been substantial interest in developing methodology for high-dimensional mediation analysis in omics and imaging studies. An incomplete list of publications include [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. For example, Derkach et al. [11] considered a latent variable model for high-dimensional mediation analysis. Huang et al. [12] presented a hypothesis test of the mediation effect in a causal mediation model with high-dimensional continuous mediators. Dai et al. [22] developed a multiple-testing procedure that accurately controls the false discovery rate (FDR) when testing high-dimensional mediation hypotheses.

Our motivating example comes from the DNA methylation (DNAm) research of the Coronary Artery Risk Development in Young Adults (CARDIA) Study [23]. In the DNAm process, methyl groups are added to DNA at binding sites referred to as cytosine-phosphate-guanine (CpG) islands, which inhibits the binding of transcription factors to DNA and results in changes (typically down regulation) to the expression of genes [24]. The platform Illumina MethylationEPIC Beadchip array is used to measure DNAm levels of roughly 850 K probes, which are ultra high-dimensional. Such high-dimensional DNAm markers may mediate pathways linking environmental exposures with health outcomes. Our objective is to explore the mediating role from high dimensional DNAm markers on the relationship between smoking and lung function in the CARDIA study.

In this paper, we propose an improved estimation and inference procedure for the high-dimensional mediation model, extending the work of Zhang et al. [3]. Our method includes three major steps: First, to tackle the ultra-high dimensionality of the DNAm markers, we screen out potentially a large number of mediators using a series of marginal mediation effect pathways (exposure \(\to \) mediator \(\to \) outcome). Second, we adopt the de-biased Lasso method [25] to estimate the high dimensional regression coefficients (mediator \(\to \) outcome). Third, we employ a joint significance test with a mixture of null distributions to accurately control the FDR for large-scale multiple tests [22].

The remainder of this paper is structured as follows. In "Methodology" Section, we propose a three-step inference procedure for mediation effects in the high-dimensional regression model. In "Simulation studies" Section, we evaluate the performance of our method via numerical simulations. In "Application" Section, an application to the CARDIA study is provided. Finally, some discussion and concluding remarks are presented in "Conclusion and remarks" Section.

## Methodology

Denote the exposure as \(X\), baseline covariates to be adjusted for as \({\varvec{Z}}={({Z}_{1},\dots ,{Z}_{q})}^{T}\), where the superscript \(T\) denotes the transpose of a vector or a matrix. We adopt the following counterfactual framework for the vector of potential mediators \({\varvec{M}}(x)={({M}_{1}(x),\dots ,{M}_{p}(x))}^{T}\) under exposure level \(x\), and counterfactual \(Y(x,{\varvec{m}})\) under exposure level \(x\) and mediators level \({\varvec{m}}\)**,** to perform the mediation analysis [26]:

where \(\gamma \) is the direct effect of exposure on the outcome; \({\varvec{\beta}}={({\beta }_{1},\dots ,{\beta }_{p})}^{T}\) is the regression parameter vector relating the mediators to the outcome; \(\boldsymbol{\alpha }={({\alpha }_{1},\dots ,{\alpha }_{p})}^{T}\) is the parameter vector relating the exposure to mediators; \({\varvec{\eta}}\) and \({{\varvec{\delta}}}_{{\varvec{j}}}\) are vectors of regression coefficients for the covariates; and \(\varepsilon \) and \({e}_{j}\) are error terms in Models (1’) and (2’), respectively. Note there are \(p\) submodels in Model (2’), one for each mediator. We allow the correlation between the error terms, i.e., \({\varvec{e}}={({e}_{1},\dots ,{e}_{p})}^{T}\sim N(0,{\Sigma }_{e})\), where \({\Sigma }_{e}\) is a positive definite covariance matrix.

A few causal assumptions that are needed for the identification of natural direct effect (NDE) and natural indirect effects (NIE) are listed below [41–42]:

A1. Stable unit treatment value assumption (SUTVA) for both the mediators and the outcome. This assumption means that there is no multiple versions of exposures and there is no interference between individuals, which implies that \({\varvec{M}}(x)\) and \(Y(x,{\varvec{m}})\) are well defined.

A2. Consistency for the mediators and the outcome. That is, there are no measurement errors in the mediators and thus the observed variables satisfy \({\varvec{M}}={\varvec{M}}(X)\) and \(Y=Y(X,{\varvec{M}})\).

A3. Sequential ignorability: This assumption contains 4 parts:

(A3.1) \(X\perp Y(x,{\varvec{m}})|{\varvec{Z}}\)**,** i.e., no unmeasured confounding between exposure and the potential outcome;

(A3.2) \({\varvec{M}}\perp Y(x,{\varvec{m}})|X,{\varvec{Z}}\)**,** i.e., no unmeasured confounding between mediators and the potential outcome;

(A3.3) \(X\perp {\varvec{M}}(x)|{\varvec{Z}}\)**,** i.e., no unmeasured confounding between exposure and the potential mediators;

(A3.4) \({\varvec{M}}\left( {x^{\prime}} \right) \bot Y\left( {x,{\varvec{m}}} \right)|{\varvec{Z}}\)**,** i.e., no exposure-induced confounding between mediators and the potential outcome. In other words, the potential mediators under any intervention level \(\varvec{m}\) are independent of potential outcomes under any intervention *x* and mediator level \(x^{\prime}\) given covariate \(\varvec{Z}\).

A4. No direct causal relationship between mediators. We do not allow one mediator to be the cause of another, but we do allow them to have shared common causes.

Under A1-A3, we have direct effect \(NDE=E\left[Y\left(1,{\varvec{M}}(0)\right)-Y\left(0,{\varvec{M}}(0)\right)\right]=\gamma \), indirect effect \(NIE=E\left[Y\left(1,{\varvec{M}}(1)\right)-Y\left(1,{\varvec{M}}(0)\right)\right]=\sum_{j=1}^{p}{\alpha }_{j}{\beta }_{j}\). Under the additional assumption A4, we can decompose the indirect effect into sum of indirect effects through each mediator \({M}_{j}\), \({NIE}_{j}={\alpha }_{j}{\beta }_{j}\). Also we obtain the structural equation model for the observed outcome as in previous literature [3] to assess the mediation effects of high-dimensional mediators:

Our goal is to estimate and test the mediation effects \({\alpha }_{j}{\beta }_{j}\) jointly for \(j=1,\dots ,p.\) An illustration of mediation analyses with single mediator and high dimensional mediators is given in Fig. 1.

As shown in Fig. 1, we do allow these mediators to share common unmeasured causes. These assumptions are in line with the underlying biologic procedures. Smoking could induce biochemical alterations to the DNAs, which lead to methylation changes. Such change in a certain CpG site is unlikely to *directly* cause the methylation alternation of other CpG sites. Rather, such dependency is most likely to be indirect, for example, by regulating gene expressions in related pathways that in turn modify other CpGs, or several CpGs are modified by common unmeasured causes (e.g., inflammatory response).

Details of our proposed approach are given below.

*Step 1*: (*Screening of Mediators*). For \(j=1,\dots ,p\), we consider a series of marginal models:

Along the lines of the sure independence screening (SIS) method [27], we select a subset \(\mathcal{D}=\{j : {M}_{j}\) is among the top \(d=\left[2n/\mathrm{log}(n)\right]\) largest effect\(\left|{\widehat{\alpha }}_{j}{\widehat{\beta }}_{j}\right|\), for\(j=1,\dots ,p\}\), where \({\widehat{\alpha }}_{j}\) and \({\widehat{\beta }}_{j}\) are ordinary least square (OLS) estimators based on marginal models (3) and (4), respectively.

All \({M}_{j}\)'s are scaled with mean zero and unit variance before performing this screening procedure. The key advantage of Step 1 is that the product term \({\widehat{\alpha }}_{j}{\widehat{\beta }}_{j}\) could roughly describe the mediated effect of the \(j\) th mediator. Therefore, the selected subset \(\mathcal{D}\) contains true mediators with a large probability.

*Step 2*: (*De-biased Lasso*). We consider the following submodel based on the selected set \(\mathcal{D}\),

where \({{\varvec{\beta}}}_{\mathcal{D}}\) and \({{\varvec{M}}}_{\mathcal{D}}\) denote sub-vectors of \({\varvec{\beta}}\) and \({\varvec{M}}\) with index belonging to \(\mathcal{D}\) respectively, and \({{\varvec{\beta}}}_{\mathcal{D}}\) is estimated using the de-biased Lasso method, with estimator \({\widehat{\beta }}_{j}\) and its standard error \({\widehat{\sigma }}_{{\beta }_{j}}\) obtained in [25]. The corresponding p-values are given as:

where \(\Phi \left(\cdot \right)\) is the cumulative distribution function of \(N\left(\mathrm{0,1}\right).\) De-biased Lasso in Step 2 is necessary as the ordinary least square will yield inefficient estimates (with reduced power), because the dimension of survived mediators after Step 1 is still relatively large.

*Step 3:* (*Joint Significance Test*). We consider the multiple testing problem for \(j \epsilon \mathcal{D}\) as follows:

with corresponding p-value

where \({P}_{{\beta }_{j}}\) is given in (6), \({P}_{{\alpha }_{j}}=2\left\{1-\Phi \left(\left|{\widehat{\alpha }}_{j}\right|/ {\widehat{\sigma }}_{{\alpha }_{j}}\right)\right\}\), \({\widehat{\alpha }}_{j}\) and \({\widehat{\sigma }}_{{\alpha }_{j}}\) are OLS estimators. Zhang et al. [3] considered the joint significant test (termed “JS-uniform”), which assumes that \({P}_{j}\) follows a uniform distribution. However, although \({P}_{{\alpha }_{j}}\) and \({P}_{{\beta }_{j}}\) are each uniformly distributed, their maximum is not. As a result, the significance rule using the uniform null distribution for \({P}_{j}\) results in a valid but overly conservative test [28]. In this paper, we will adopt the “JS-mixture" approach to accurately control the FDR [22] (Sect. 2.3).

The multiple testing problem (7) is equivalent to the union of the following three disjoint component null hypotheses,

That is, \({P}_{j}\) is a 3-component mixture distribution instead of the uniform distribution. Dai et al. [22] proposed the following estimated FDR for testing mediation:

where \({\widehat{\pi }}_{01}, {\widehat{\pi }}_{10}\) and \({\widehat{\pi }}_{00}\) are the estimates of proportions \({H}_{01,j}, {H}_{10,j}\) and \({H}_{00,j}\), respectively, and \(R\left(t\right)={V}_{00}\left(t\right)+{V}_{01}\left(t\right)+\) \({V}_{10}\left(t\right)+\) \({V}_{11}\left(t\right)\), where \({V}_{00}\left(t\right)=\#\left\{{P}_{j}\le t|{H}_{00}\right\}\), \({V}_{01}\left(t\right)=\#\left\{{P}_{j}\le t|{H}_{01}\right\}\), \({V}_{10}\left(t\right)=\#\left\{{P}_{j}\le t|{H}_{10}\right\}\), \({V}_{11}\left(t\right)=\#\left\{{P}_{j}\le t|{H}_{11}\right\}\) for \(t\in \left[\mathrm{0,1}\right].\)

We define the significant threshold for \({P}_{j}\) as \({\widehat{t}}_{b}=\mathrm{sup}\left\{t: \widehat{FDR}(t)\le b\right\},\) to control the FDR at level \(b\). Then \(\widehat{S}=\left\{j:{P}_{j}\le {\widehat{t}}_{b}, j\in \mathcal{D}\right\}\) gives the estimated index set of significant mediators.

We can obtain \({\widehat{\pi }}_{01}, {\widehat{\pi }}_{10}\), \({\widehat{\pi }}_{00}\) and \({\widehat{t}}_{b}\) using the R package HDMT [22].

Compared to the estimation and inference method in [3] (termed “HIMA”), our new method (termed “HIMA2”) has the following three advantages. First, HIMA only considers \(\beta \) (mediator \(\to \) outcome) for screening in Step 1, while HIMA2 considers the indirect effect of \(\alpha \beta \). Therefore, the mediation-based screening method in HIMA2 addresses the indirect effect more accurately than HIMA. Second, HIMA uses the minimax concave penalty (MCP; [29]) technique to estimate the effect \(\beta \), which can only provide p-values for selected mediators in Step 2. That is, \({P}_{{\beta }_{j}}\) is set to 1 for those not selected, which results in poor estimate of \({P}_{j}\) in Eq. (7). In contrast, de-biased Lasso in HIMA2 yields p-values for all \({\beta }_{j}\)’s in \(\mathcal{D}\), which gives more appropriate estimate of \({P}_{{\beta }_{j}}\). Third, HIMA adopts a naive joint significance rule assuming a uniform null distribution for the maximum p-value calculation in Step 3, which may result in a valid but overly conservative test with lower power.

## Simulation studies

In this section we assess our proposed method using simulation studies. For Model (1), we generate the exposure \(X\) from \(N(0, 2)\); covariates \(Z={\left({Z}_{1}, {Z}_{2}\right)}^{T}\), where \({Z}_{1}\) and \({Z}_{2}\) are independently generated from \(N(0, 2)\). We set \(\gamma =0.5\), \(\delta ={\left(0.\mathrm{3,0.3}\right)}^{T}\) and \(\eta ={\left(0.5, 0.5\right)}^{T}\); \({\beta }_{1}=0.20,\) \({\beta }_{2}=0.25,\) \({\beta }_{3}=0.15,\) \({\beta }_{4}=0.30,\) \({\beta }_{5}=0.35,\) \({\beta }_{6}=0.10,\) and \({\beta }_{j}=0\) for all other \(j\)’s; \({\alpha }_{1}=0.20,\) \({\alpha }_{2}=0.25,\) \({\alpha }_{3}=0.15,\) \({\alpha }_{4}=0.30,\) \({\alpha }_{5}=0.35,\) \({\alpha }_{7}=0.10,\) and \({\alpha }_{j}=0\) for all other \(j\)’s. Therefore, we have: (i) \({\alpha }_{j}{\beta }_{j}\ne 0\) for \(j=1,\dots ,5\); (ii) \({\alpha }_{j}=0\) but \({\beta }_{j}\ne 0\) for \(j=6\); (iii) \({\alpha }_{j}\ne 0\) but \({\beta }_{j}=0\) for \(j=7\); and (iv) \({\alpha }_{j}=0\) and \({\beta }_{j}=0\) for \(j>7\). The error terms \(e={\left({e}_{1}, \dots ,{e}_{p}\right)}^{T}\) are generated from \(N(0,{\Sigma }_{e})\), where \({\Sigma }_{e}={\left({\rho }^{|j-{j}^{^{\prime}}|}\right)}_{j,{j}^{^{\prime}}}\) and \(\varepsilon \) is generated from \(N(0, 1)\). All the simulations are based on 500 replications with 16 factorial settings: \(p=1000, 5000\), \(n=300,\) 600, and \(\rho =0, 0.25, 0.5, 0.75\).

We compare the performance of HIMA2 with HIMA in Table 1, which provides the estimated biases (Bias) given by the sample mean of the estimates minus the true value, and the mean-square error (MSE) of the estimates. Table 1 shows that both HIMA2 and HIMA are unbiased, however, HIMA2 has smaller MSEs than HIMA for significant mediators. MSEs for both HIMA and HIMA2 decrease as the sample size increases. Of note, the results for \(j>8\) (\({\alpha }_{j}=0\) and \({\beta }_{j}=0\)) are close to those of \(j=8\) and thus omitted.

We also present the estimated FDR and power of mediation effects testing in Tables 2 and 3, where the nominal level is 0.05. The results indicate that both HIMA2 and HIMA can achieve valid FDR control. Furthermore, HIMA2 is more powerful than HIMA in selecting significant mediators, though the differences become smaller when sample size increases. We also note that as the correlation among the mediators becomes larger, both methods suffer in terms of power.

Per suggestion from a reviewer, we compare our method to HDMA [30], which was developed along the lines of HIMA but adopts the de-biased Lasso method in Step 2. However, no multiple testing adjustment was used in HDMA for inference. As a result, HDMA suffers from poor FDR control albeit with higher power as shown in Tables 2 and 3.

Per suggestion from a reviewer, similar to our real data analysis, we also consider a setting with 2 significant mediators, i.e.: \({\beta }_{1}=0.15,\) \({\beta }_{2}=0.3,\) \({\beta }_{3}=0.1,\) \({\beta }_{4}=0,\) and \({\beta }_{j}=0\) for all other \(j\)’s; \({\alpha }_{1}=0.15,\) \({\alpha }_{2}=0.3,\) \({\alpha }_{3}=0,\) \({\alpha }_{4}=0.1,\) and \({\alpha }_{j}=0\) for all other \(j\)’s. As shown in the supplementary materials (Additional file 1: Tables S1, S2 and S3), we observe similar results to those in Tables 1, 2 and 3. We note that the results from HIMA and HIMA2 are more close to each other when the correlation is high (\(\rho =0.75)\).

Per suggestion from a reviewer, we use the standardized coefficient estimates in the SIS step, but the results are close to those without standardization (results available upon request).

Finally, we notice that in Tables 2 and Additional file 1: Table S2, the FDR of HIMA2 decreases with sample size. This also happens with HIMA, though to a less magnitude.

## Application

We apply our method to the Coronary Artery Risk Development in Young Adults (CARDIA) Study, an ongoing longitudinal cohort examining the development and determinants of clinical and subclinical cardiovascular disease and their risk factors [23]. A group of 5115 black and white men and women aged 18–30 years were enrolled in 1985–6 from 4 study centers: Birmingham, AL; Chicago, IL; Minneapolis, MN; and Oakland, CA. They were followed-up during 1987–1988 (Year 2), 1990–1991 (Year 5), 1992–1993 (Year 7), 1995–1996 (Year 10), 2000–2001 (Year 15), 2005–2006 (Year 20), 2010–2011 (Year 25), and 2015–2016 (Year 30).

We are interested in investigating how the DNA methylation (DNAm) markers mediate the relation between smoking and lung function. Due to budget limitation, 1200 individuals from the CARDIA participants at Year 15 were randomly selected for DNAm profiling using the Illumina MethylationEPIC Beadchip (*p* = ~ 850,000 sites). The R package Enmix [31] was used to perform quality control, background correction, dye bias correction, quantile normalization (by probe types), and extreme outliers removal. Eventually, the DNAm measurements were obtained for a total of 1042 blood samples, which are treated as mediators in this study. The FEV1 (forced expiratory volume in 1 s) measured at Year 20 is considered as the lung function outcome. The number of cigarette packs/year in Year 10 is the exposure variable. We are interested in building the mediation pathway in sequence: smoking at Year 10 \(\to \) High dimensional DNAm markers at Year 15 \(\to \) lung function at Year 20.

Our analysis adjusts for age, height, weight, study center, gender, and race in Models (1) and (2). Additionally, we estimated the proportions of CD4 + T lymphocytes, CD8 + T lymphocytes, B lymphocytes, natural killer cells, monocytes, and granulocytes using [32], which are also adjusted in the models. To account for experimental batch effects and other technical biases, we derive surrogate variables from intensity data for non-negative internal control probes using principal components (PCs) analysis [31]. The top eight PCs, explaining 95.06% of the variation across the non-negative internal control probes, are also adjusted as covariates in the model. All the covariates are measured at Year 10.

After screening in Step 1, the average of the absolute values of correlation among CpGs is 0.25 (max 0.93). In Table 4, we present the summary results on selected mediators. For FDR < 0.05, HIMA2 identifies 2 CpGs: cg26331243 and cg19862839 as mediators. CpG cg26331243 is located in the body region of gene *CCDC33*, which is differentially expressed for tobacco smoke exposure [33, 34]. *CCDC33* is also linked to susceptibility to lung function disorders, e.g., pneumococcal meningitis [35] and SARS-CoV-2 infection [36]. Therefore, it is plausible that cg26331243 plays a role in regulating the expression of *CCDC33*, which in turn mediates the pathway from smoking to lung function.

CpG cg19862839 is located in the body region of gene *TBX4*. Growing evidence has indicated that *TBX4* variants are associated with a wide spectrum of lung disorders [37, 38]. Patients with mutations in *TBX4* may also be more susceptible to cigarette smoking [39]. Therefore, we speculate that cg19862839 could participate in regulating the expression of *TBX4*, which also acts as a mediator between smoking and lung function.

In comparison, HIMA only identifies cg26331243 as a mediator with FDR < 0.05. Therefore, the proposed HIMA2 has better power to identify CpGs in high dimensional mediation analysis.

Finally, we note that cg05575921, which was identified in the normative aging study (NAS) [3], is not a significant mediator in the CARDIA study. In CARDIA, the estimate of \(\alpha \) (from smoking to DNAm) is highly significant for cg05575921. However, the estimate of \(\beta \) (from DNAm to FEV1) is not significant. This may be due to that participants in CARDIA were much younger (mean age 45 at Year 20, range 38–55) than NAS (mean age 74, range 55–100), when the lung function of CARDIA participants are more homogenous. Therefore, the association between DNAm to lung function at Year 20 may not be significant in CARDIA.

In the current analysis, there is a 5-year gap between the exposure and the mediator. A reviewer raised the concern on treatment-induced-mediator-outcome confounding. The life-course smoking trajectories for the majority of individuals were relatively stable before age 40–45, which corresponds to the Year 10–15 of our study cohort [40]. Although DNA methylation is modifiable by smoking, it is still a relatively stable biomarker over time [41]. Short-term exposure-induced covariates within a 5-year gap (if any) are unlikely to produce biologically functional changes in DNA methylation for us to detect as mediators.

## Conclusion and remarks

In this paper we proposed an improved method HIMA2 for high dimensional mediation analysis, which was shown to have better performance than HIMA [3] by numerical studies. We applied HIMA2 to the identification and testing of the DNA methylation mediating effects in the CARDIA study. Our method is relatively simple to implement, and can be widely used in high-dimensional mediation analyses.

Our method can be extended in several directions. First, we will consider how to address the correlation among DNA methylation markers to improve the inferential results, as shown in the Simulation Studies that both HIMA and HIMA2 lose power for high correlation. Second, it is of interest to incorporate the interaction terms between the exposure and the mediators in our model, i.e., the high dimensional moderated mediation analysis. Third, there has been an increasing interest and development in longitudinal studies of DNA methylation. We can also consider repeated measures of DNA methylation markers as mediators in our future research.

## Availability of data and materials

R package, source code, and simulation study are available at https://github.com/joyfulstones/HIMA2.

## References

Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research – conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173–82.

MacKinnon DP. Introduction to statistical mediation analysis. New York: Erlbaum; 2008.

Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics. 2016;32(20):3150–4.

Valeri L, Reese SL, Zhao S, Page CM, Nystad W, Coull BA, London SJ. Misclassified exposure in epigenetic mediation analyses. Does DNA methylation mediate effects of smoking on birthweight? Epigenomics. 2017;9(3):253–65.

Fang R, Yang H, Gao Y, Cao H, Goode EL, Cui Y. Gene-based mediation analysis in epigenetic studies. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa113.

Zhang J, Wei Z, Chen J. A distance-based approach for testing the mediation effect of the human microbiome. Bioinformatics. 2018;34(11):1875–83.

Sohn MB, Li H. Compositional mediation analysis for microbiome studies. Ann Appl Stat. 2019;13(1):661–81.

Chén OY, Crainiceanu C, Ogburn EL, Caffo BS, Wager TD, Lindquist MA. High-dimensional multivariate mediation with application to neuroimaging data. Biostatistics. 2017;19(2):121–36.

Zhao Y, Lindquist MA, Caffo BS. Sparse principal component based high-dimensional mediation analysis. Comput Stat Data Anal. 2020;142:106835.

Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing mediation effects in high-dimensional epigenetic studies. Front Genet. 2019. https://doi.org/10.3389/fgene.2019.01195.

Derkach A, Pfeiffer RM, Chen TH, Sampson JN. High dimensional mediation analysis with latent variables. Biometrics. 2019;75(3):745–56.

Huang YT, Pan WC. Hypothesis test of mediation effect in causal mediation mode with high-dimensional continuous mediators. Biometrics. 2016;72(2):402–13.

Zhang, Q. High dimensional mediation analysis with applications to causal gene identification. bioRxiv. Doi: https://doi.org/10.1101/497826 (2019)

Djordjilović V, Page CM, Gran JM, Nøst TH, Sandanger TM, Veierød MB, Thoresen M. Global test for high-dimensional mediation: testing groups of potential mediators. Stat Med. 2019;38:3346–60.

Zhang H, Chen J, Li Z, Liu L. Testing for mediation effect with application to human microbiome data. Stat Biosci. 2019. https://doi.org/10.1007/s12561-019-09253-3.

Zhang H, Chen J, Feng Y, Wang C, Li H, Liu L. Mediation effect selection in high-dimensional and compositional microbiome data. Stat Med. 2021;40(4):885–96.

Wang C, Hu J, Blaser MJ, Li H. Estimating and testing the microbial causal mediation effect with high-dimensional and compositional microbiome data. Bioinformatics. 2020;36:347–55.

Liu Z, Shen J, Barfield R, Schwartz J, Baccarelli AA, Lin X. Large-scale hypothesis testing for causal mediation effects with applications in genome-wide epigenetic studies. J Am Stat Assoc. 2021. https://doi.org/10.1080/01621459.2021.1914634.

Loh WW, Moerkerke B, Loeys T, Vansteelandt S. Non-linear mediation analysis with high-dimensional mediators whose causal structure is unknown. Biometrics. 2021. https://doi.org/10.1111/biom.13402.

Zhou RR, Wang L, Zhao SD. Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika. 2020;107(3):573–89.

Shi CA, Li L. Testing mediation effects using logic of Boolean matrices. J Am Stat Assoc. 2021. https://doi.org/10.1080/01621459.2021.1895177.

Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2021. https://doi.org/10.1080/01621459.2020.1765785.

Friedman GD, Cutter GR, Donahue RP, Hughes GH, Hulley SB, Jacobs DR Jr, et al. CARDIA: study design, recruitment, and some characteristics of the examined subjects. J Clin Epidemiol. 1998;41(11):1105–16.

Tate PH, Bird AP. Effects of DNA methylation on DNA-binding proteins and gene expression. Curr Opin Genet Dev. 1993;3(2):226–31 (

**PMID: 8504247**).Fang EX, Ning Y, Liu H. Testing and confidence intervals for high dimensional proportional hazards models. J R Stat Soc Series B (Statistical Methodology). 2016;79(5):1415–37.

Tsai PC, et al. Smoking induces coordinated DNA methylation and gene expression changes in adipose tissue with consequences for metabolic health. Clin Epigenet. 2018;10:126. https://doi.org/10.1186/s13148-018-0558-0.

Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B. 2008;70:849–911.

Huang YT. Joint significance tests for mediation effects of socioeconomic adversity on adiposity via epigenetics. Ann Appl Stat. 2018;12(3):1535–57.

Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942.

Gao Y, Yang H, Fang R, Zhang Y, Goode E, Cui Y. Testing mediation effects in high-dimensional epigenetic studies. Front Genet. 2019. https://doi.org/10.3389/fgene.2019.01195.

Xu Z, Niu L, Li L, Taylor JA. ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res. 2016;44(3):e20 (

**PMID: 26384415; PMCID: PMC4756845**).Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 2012;13:86 (

**PMCID: PMC3532182**).Beane J, Sebastiani P, Liu G, Brody JS, Lenburg ME, Spira A. Reversible and permanent effects of tobacco smoke exposure on airway epithelial gene expression. Genome Biol. 2007;8(9):R201 (

**PMID: 17894889; PMCID: PMC2375039**).Gower AC, Steiling K, Brothers JF 2nd, Lenburg ME, Spira A. Transcriptomic studies of the airway field of injury associated with smoking-related lung disease. Proc Am Thorac Soc. 2011;8(2):173–9.

Lees JA, Ferwerda B, Kremer PHC, et al. Joint sequencing of human and pathogen genomes reveals the genetics of pneumococcal meningitis. Nat Commun. 2019;10:2176.

Vastrad B, Vastrad C, Tengli A. Bioinformatics analyses of significant genes, related pathways, and candidate diagnostic biomarkers and molecular targets in SARS-CoV-2/COVID-19. Gene Rep. 2020;21:100956.

Haarman MG, Kerstjens-Frederikse WS, Berger RMF. TBX4 variants and pulmonary diseases: getting out of the “Box.” Curr Opin Pulm Med. 2020;26(3):277–84.

Xie T, Liang J, Liu N, et al. Transcription factor TBX4 regulates myofibroblast accumulation and lung fibrosis. J Clin Investig. 2016;126(8):3063–79.

Maurac A, Lardenois É, Eyries M, et al. T-box protein 4 mutation causing pulmonary arterial hypertension and lung disease. Eur Respir J. 2019;54:1900388.

Mathew AR, et al. Life-course smoking trajectories and risk for emphysema in middle age: the CARDIA lung study. Am J Respir Crit Care Med. 2019;199:237–40. https://doi.org/10.1164/rccm.201808-1568LE.

Tsai PC, et al. Smoking induces coordinated DNA methylation and gene expression changes in adipose tissue with consequences for metabolic health. Clin Epigene. 2018;10:26. https://doi.org/10.1186/s13148-018-0558-0.

## Acknowledgements

None.

## Funding

This research was partly supported by NIH/NIA R21 AG 063370, R21 AG068955, and NIH/NCATS UL1 TR002345. The Coronary Artery Risk Development in Young Adults Study (CARDIA) is supported by contracts HHSN268201800003I, HHSN268201800004I, HHSN268201800005I, HHSN268201800006I, and HHSN268201800007I from the National Heart, Lung, and Blood Institute (NHLBI). This manuscript has been reviewed by CARDIA for scientific content. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## Author information

### Authors and Affiliations

### Contributions

LL conceived and designed the study. CP and KX analyzed the data and wrote the manuscript. HZ, YZ, AQ, CZ and LF guided analyses, provided advice, and critically reviewed the manuscript. All authors read and approved the final manuscript.

### Corresponding author

## Ethics declarations

### Ethical approval and consent to participate

All CARDIA participants provided written informed consent, with institutional review board approval at each field center (the University of Alabama at Birmingham, Northwestern University, University of Minnesota, and Kaiser Permanente). All methods were performed in accordance with the relevant guidelines and regulations (for example- Declarations of Helsinki).

### Consent to publish

Not applicable.

### Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

**Additional file 1.**

**Table S1:** Bias (MSE) for mediation effect estimates. **Table S2:** FDR at significance level 0.05. **Table S3:** Power at significance level 0.05.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

## About this article

### Cite this article

Perera, C., Zhang, H., Zheng, Y. *et al.* HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data.
*BMC Bioinformatics* **23, **296 (2022). https://doi.org/10.1186/s12859-022-04748-1

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/s12859-022-04748-1

### Keywords

- Variable selection
- Joint significant test
- Epigenetics
- Causality