Nonlinear ridge regression improves robustness of cell-type-specific differential expression studies

Background: Epigenome-wide association studies (EWAS) and differential gene expression analyses are generally performed on tissue samples, which consist of multiple cell types. Cell-type-specific effects of a trait, such as disease, on the omics expression are of interest but difficult or costly to measure experimentally. By measuring omics data for the bulk tissue, cell type composition of a sample can be inferred statistically. Subsequently, cell-type-specific effects are estimated by linear regression that includes terms representing the interaction between the cell type proportions and the trait. This approach involves two issues, scaling and multicollinearity.
Results: First, although cell composition is analyzed in linear scale, differential methylation/expression is analyzed suitably in the logit/log scale. To simultaneously analyze two scales, we developed nonlinear regression. Second, we show that the interaction terms are highly collinear, which is obstructive to ordinary regression. To cope with the multicollinearity, we applied ridge regularization. In simulated and real data, the improvement was modest by nonlinear regression and substantial by ridge regularization.
Conclusion: Nonlinear ridge regression performed cell-type-specific association test on bulk omics data more robustly than previous methods. The omicwas package for R implements nonlinear ridge regression for cell-type-specific EWAS, differential gene expression and QTL analyses. The software is freely available from https://github.com/fumi-github/omicwas


Background
Epigenome-wide association studies (EWAS) and differential gene expression analyses elucidate the association of disease traits (or conditions) with the level of omics expression, namely DNA methylation and gene expression.
Thus far, tissue samples, which consist of heterogeneous cell types, have mainly been examined, because cell sorting is not feasible in most tissues and single-cell assay is still expensive. Nevertheless, the cell type composition of a sample can be quantified statistically by comparing omics measurement of the target sample with reference data obtained from sorted or single cells [1,2]. By utilizing the composition, the disease association specific to a cell type was statistically inferred for gene expression [3][4][5][6][7][8][9][10] and DNA methylation [11][12][13][14].
For the imputation of cell type composition, omics markers are usually analyzed in the original linear scale, which measures the proportion of mRNA molecules from a specific gene or the proportion of methylated cytosine molecules among all cytosines at a specific CpG site [15]. The proportion can differ between cell types, and the weighted average of cell-type-specific proportions becomes the proportion in a bulk tissue sample. Using the fact that the weight equals the cell type composition, the cell type composition of a sample is imputed. In contrast, gene expression analyses are performed in the log-transformed scale because the signal and noise are normally distributed after log-transformation [16]. In DNA methylation analysis, the logit-transformed scale, which is called the M-value, is statistically valid [17].
Consequently, the optimal scales for analyzing differential gene expression or methylation can differ from the optimal scale for analyzing cell type composition.
Aiming to perform cell-type-specific EWAS or differential gene expression analyses by using unsorted tissue samples, we study two issues that have been overlooked. Whereas previous studies were performed in linear scale, we develop a nonlinear regression, which simultaneously analyzes cell type composition in linear scale and differential expression/methylation in log/logit scale. The second issue is multicollinearity. Cell-type-specific effects of a trait, such as disease, on omics expression are usually estimated by linear regression that includes terms representing the interaction between the cell type proportions and the trait. We show that the interaction terms can mutually be highly correlated, which obstructs ordinary regression. To cope with the multicollinearity, we implement ridge regularization. Our methods and previous ones are compared in simulated and real data.

Multicollinearity of interaction terms
Typically, cell-type-specific effects of a trait on omics marker expression are analyzed by the linear regression in equation (2). The goal is to estimate $\beta_{h,k}$, the effect of trait k on the expression level in cell type h. This is estimated based on the relation between the bulk expression level $Y_i$ of a sample and the regressor $W_{h,i} X_{i,k}$, which is an interaction term defined as the product of the cell type proportion $W_{h,i}$ and the trait value $X_{i,k}$ of the sample. The variable $W_h$ for cell type composition cannot be mean-centered, and interaction terms involving uncentered variables cause multicollinearity [18].
We first survey the extent of multicollinearity in real data for cell-type-specific association.
In peripheral blood leukocyte data from a rheumatoid arthritis study (GSE42861), the proportion of cell types ranged from 0.59 for neutrophils to 0.01 for eosinophils (Table 1A). The proportion of neutrophils was negatively correlated with the proportion of other cell types (apart from monocytes) with correlation coefficient of -0.68 to -0.46, whereas the correlation was weaker for other pairs (Table 1B). Rheumatoid arthritis status was modestly correlated with proportions of cell types. The product of the disease status $X_k$, centered to have zero mean, and the proportion of a cell type becomes an interaction term. The correlation coefficients between the interaction terms are mostly >0.8, apart from eosinophils (Table 1C). The ratio of mean to SD of the proportion is high for all cell types apart from eosinophils (Table 1A). The interaction terms for high-ratio cell types are strongly correlated with $X_k$, which in turn causes strong correlation between the relevant interaction terms.
The situation was the same for the interaction with age in GTEx data. The granulocytes (which include neutrophils and eosinophils) were the most abundant (Table 2A). The proportion of granulocytes was negatively correlated with other cell types (apart from monocytes) with correlation coefficient of -0.89 to -0.41, and the correlation between other pairs was generally weaker (Table 2B). Age was modestly correlated with proportions of cell types. In this dataset, the ratio of mean to SD of the proportion was high in all cell types (Table 2A), which caused strong mutual correlation between interaction terms (Table 2C).
In the above empirical data, multicollinearity between interaction terms seemed to arise not due to the correlation between cell type proportions or $X_k$, but due to the high ratio of mean to SD in the cell type proportions.
Subsequently, this property was derived mathematically. As we derived in equation (17), the correlation between interaction terms $W_h X_k$ and $W_{h'} X_k$ approaches one when the ratios of mean to SD of the cell type proportions $W_h$ and $W_{h'}$ are high (Figure 1). The ratio was 1.6 to 5.3 (apart from eosinophils) in the rheumatoid arthritis dataset and ≥4.3 in the GTEx dataset. We looked up datasets of several ethnicities and found the ratio to be ≥1.5 in the majority of cell types (Additional file 1: Table S1). Thus, multicollinearity can be a common problem for cell-type-specific association analyses.
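The dependence of the correlation on the mean-to-SD ratio can be illustrated numerically. The following is a minimal sketch, not the analysis performed in this article: it assumes, purely for illustration, that the two cell type proportions are independent normal variables with SD 1 and that the trait is a centered ±1 indicator. The function names `pearson` and `interaction_correlation` are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def interaction_correlation(ratio, n=20000, seed=1):
    """Correlation between W_h * X and W_h' * X for a given mean/SD ratio.

    Illustrative assumptions: W_h and W_h' are independent normal variables
    with SD 1 and mean `ratio` (so mean/SD equals `ratio`), and X is a
    +/-1 trait with expectation zero, independent of both.
    """
    rng = random.Random(seed)
    x = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    w1 = [rng.gauss(ratio, 1.0) for _ in range(n)]
    w2 = [rng.gauss(ratio, 1.0) for _ in range(n)]
    return pearson([w * t for w, t in zip(w1, x)],
                   [w * t for w, t in zip(w2, x)])


# The correlation between interaction terms grows with the mean/SD ratio,
# even though W_h and W_h' themselves are simulated as uncorrelated.
for ratio in (0.5, 1.5, 5.0):
    print(ratio, round(interaction_correlation(ratio), 3))
```

Because the correlation coefficient is invariant to rescaling, fixing the SD at 1 does not restrict the illustration to any particular range of proportions.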

Evaluation in simulated data
By using simulated data, we evaluated previous methods and new approaches of the omicwas package. In order to simultaneously analyze two scales, the linear scale for heterogeneous cell mixing and the log/logit scale for trait effects, we applied nonlinear regression in omicwas (equations (4) and (5)). To cope with the multicollinearity of interaction terms, we applied ridge regularization (equations (9) and (10)). The simulation data was generated from real datasets of DNA methylation and gene expression. The original cell type composition was retained for all samples, and the case-control status was randomly assigned. In each sample, expression level in each cell type was randomly determined according to a scenario, and then averaged according to the sample's cell type composition.

Previous regression type methods
Under each statistical algorithm, the disease association in the target cell type was assessed by a Z-score, comparing cases vs controls.
In scenario A for DNA methylation, expression of all cell types had identical distribution, irrespective of the case/control status (Figure 2A). The type I error rate was controlled (≤0.05) in all algorithms. In scenario B, cases had higher expression level in one randomly selected cell type, and that cell type was tested (Figure 2B). Here, the most appropriate algorithm is the marginal test applied to the perturbed cell type, which indeed attained the highest power. For the most abundant neutrophils, the Z-score was in the high range of 9.9 to 14.9 for the marginal test. 3C). Extremely strong false signals of Z-score < -6 occurred in marginal and csSAM.monovariate. In scenario D, where the tested cell type has higher expression in cases, while one non-tested cell type has lower expression, we could observe the overlay of power gain of scenario B and type I error inflation of scenario C (Figure 3D).
Although we roughly grouped previous algorithms into derivatives of full or derivatives of marginal, some implement treatments beyond simple linear models. For example, the TCA algorithm tended to detect neutrophil signals similarly to the marginal test (Fig. 2B), yet had a smaller type I error rate (Fig. 2C).

Cell-type-specific association with rheumatoid arthritis and age
The cell-type-specific association of DNA methylation with rheumatoid arthritis was predicted using bulk peripheral blood leukocyte data and was evaluated in sorted monocytes (Figure 4A) and B cells (Figure 4B). Whereas the full model (and its derivatives) performed the best and the marginal model (and its derivatives) performed the worst in monocytes, the performance ranking was opposite in B cells. A robust algorithm would consistently achieve high performance relative to the best algorithm in each instance. Nonlinear ridge regression (omicwas.logit.ridge) was the most robust, performing at 65% to 93% relative to the best method.
The cell-type-specific association of gene expression with age was predicted using whole blood data and was evaluated in sorted CD4+ T cells (Figure 4C) and monocytes (Figure 4D). All algorithms performed poorly in CD4+ T cells, and the marginal model performed the best in monocytes.
Overall, nonlinear ridge regression (omicwas.log.ridge) was next to the marginal model, performing at 21% to 47% relative to the marginal model.
For dataset GSE42861 and for GTEx whole blood, the omicwas.logit.ridge and omicwas.log.ridge models of the omicwas package were computed in 8.1 and 0.7 hours, respectively, using 8 cores of a 2.5 GHz Xeon CPU Linux server.

Discussion
Aiming to elucidate cell-type-specific trait association in DNA methylation and gene expression, this article explored two aspects, multicollinearity and scale.
We observed multicollinearity in real data and derived mathematically how it emerges. To cope with the multicollinearity, we proposed ridge regression. To properly handle multiple scales simultaneously, we developed nonlinear regression. By testing in simulated and real data, we found proper scaling to modestly improve performance. In contrast, ridge regression achieved performance that was more robust than previous methods.
The statistical methods discussed in this article are applicable, in principle, to any tissue. For validation of the methods, we need datasets for bulk tissue as well as sorted cells, ideally of >100 samples. Currently, the publicly available data is limited to peripheral blood. By no means do we claim the rheumatoid arthritis EWAS datasets [19][20][21] or the datasets for age association of gene expression [22,23] to be representative. Nevertheless, we think verification in real data is important, which has not been performed previously in large sample size.
By the performance in simulated and real data, we can roughly divide algorithms into three groups: full (and its derivatives), marginal (and its derivatives) and ridge models. In marginal models, we test one cell type at a time. If we knew in advance that one particular cell type is associated with the trait, which would be a rare situation, testing that cell type in the marginal model is the simplest and correct approach. Indeed, under such a simulated scenario, the marginal test attained the highest power (Figs. 2B, 3B).
However, when the test target cell type is not associated, but instead another
We mathematically modeled and implemented the logit scale for DNA methylation and log scale for gene expression. It turns out that the improvement by formulating the nonlinear scale was negligible for DNA methylation (Fig. 2B) and modest for gene expression (Fig. 3B; omicwas.identity vs omicwas.log, and omicwas.identity.ridge vs omicwas.log.ridge). This implies that previous works, which were almost exclusively in linear scale, were not losing much power due to scaling.

Conclusions
For cell-type-specific differential expression analysis by using unsorted tissue samples, we recommend trying ridge regression as a first choice because it balances power and type I error. Although marginal tests can be powerful when the tested cell type actually is the only one associated with the trait, caution is needed due to its high type I error rate. For a signal detected by the marginal test, reanalysis in full model could be valuable. Ridge regression is preferable compared to the full model without ridge regularization because ridge estimator of the effect size has smaller MSE (equation (13)). Nonlinear regression, which models scales properly, is recommended more than the linear regression, yet the difference can be modest. We do not claim the ridge model to substitute previous models. Indeed, we think none of the current algorithms is superior to others in all aspects, indicating possibility for future improvement.

Linear regression
We begin by describing the linear regressions used in previous studies. Let the indexes be h for a cell type, i for a sample, j for an omics marker (CpG site or gene), k for a trait that has cell-type-specific effects on marker expression, and l for a trait that has a uniform effect across cell types. The parameters we estimate are the cell-type-specific trait effect $\beta_{h,j,k}$, tissue-uniform trait effect $\gamma_{j,l}$, and basal marker level $\alpha_{h,j}$ in each cell type.
For the remainder of the first five sections (up to "Multicollinearity of interaction terms"), we focus on one marker j, and omit the index for readability. For cell type h, the marker level of sample i is
$$\alpha_h + \sum_k \beta_{h,k} X_{i,k}. \qquad (1)$$
This is a representative value rather than a mean because we do not model a probability distribution for cell-type-specific expression. By averaging the value over cell types with weight $W_{h,i}$, and combining with the tissue-uniform trait effects, we obtain the mean marker level in bulk tissue of sample i,
$$\mu_i = \sum_h W_{h,i} \Big[ \alpha_h + \sum_k \beta_{h,k} X_{i,k} \Big] + \sum_l \gamma_l C_{i,l}. \qquad (2)$$
With regard to the statistical model, we assume the error of the marker level to be normally distributed with variance $\sigma^2$, independently among samples, as
$$Y_i = \mu_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).$$
The statistical significance of all parameters is tested under the full model of linear regression, or its derivatives [5,10,13]. Alternatively, the cell-type-specific effects of traits can be fitted and tested for one cell type h at a time by the marginal model, or its derivatives [7,8,9,11,14].

Nonlinear regression
Aiming to simultaneously analyze cell type composition in linear scale and differential expression/methylation in log/logit scale, we develop a nonlinear regression model. The differential analyses are performed after applying normalizing transformation. The normalizing function is the natural logarithm f = log for gene expression, and f = logit for methylation (see Background).
Conventional linear regression can be formulated by defining f as the identity function. We denote the inverse function of f by g; g = exp for gene expression, and g = logistic for methylation. Thus, f converts from the linear scale to the normalized scale, and g does the opposite.
The marker level in a specific cell type (formula (1)) is modeled in the normalized scale. The level is linearized by applying function g, then averaged over cell types with weight $W_{h,i}$, and normalized by applying function f.
Combined with the tissue-uniform trait effects, the mean normalized marker level in bulk tissue of sample i becomes
$$\mu_i = f\Big( \sum_h W_{h,i}\, g\big( \alpha_h + \sum_k \beta_{h,k} X_{i,k} \big) \Big) + \sum_l \gamma_l C_{i,l}. \qquad (4)$$
We assume the normalized marker level to have an error $\varepsilon_i$ that is normally distributed with variance $\sigma^2$, independently among samples, as $\varepsilon_i \sim N(0, \sigma^2)$.
We obtain the ordinary least squares (OLS) estimator of the parameters by minimizing the residual sum of squares (RSS), and then estimate the error variance as
$$\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p}, \qquad (7)$$
where n is the number of samples and p is the number of parameters [[24], section 6.3.1].
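The model of this section can be written compactly in code. The following is a minimal sketch under the notation described in the text, not the implementation in the omicwas package; the names `SCALES`, `bulk_mean` and `error_variance` are hypothetical and introduced only for illustration.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    return x


# (f, g) pairs: f maps the linear scale to the normalized scale, g is its inverse.
SCALES = {
    "log": (math.log, math.exp),       # gene expression
    "logit": (logit, logistic),        # DNA methylation
    "identity": (identity, identity),  # conventional linear regression
}


def bulk_mean(w, x, c, alpha, beta, gamma, scale="logit"):
    """Mean normalized marker level of one sample in bulk tissue.

    w[h]: cell type proportions; x[k]: traits with cell-type-specific effects;
    c[l]: traits with uniform effects; alpha[h], beta[h][k], gamma[l]: parameters.
    """
    f, g = SCALES[scale]
    mixed = 0.0
    for h, w_h in enumerate(w):
        # Cell-type-specific level in the normalized scale.
        level = alpha[h] + sum(b * xk for b, xk in zip(beta[h], x))
        # Convert to the linear scale, then weight by the proportion.
        mixed += w_h * g(level)
    # Back to the normalized scale, plus tissue-uniform trait effects.
    return f(mixed) + sum(gl * cl for gl, cl in zip(gamma, c))


def error_variance(y, mu, n_params):
    """Residual sum of squares divided by the residual degrees of freedom."""
    rss = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return rss / (len(y) - n_params)
```

With `scale="identity"` the function reduces to the linear model, which is why the conventional regression can be treated as a special case.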

Ridge regression
The parameters $\beta_{h,k}$ for cell-type-specific effect cannot be estimated accurately by ordinary linear regression because the regressors $W_{h,i} X_{i,k}$ in equation (2) are highly correlated between cell types (see below).
Multicollinearity also occurs in the nonlinear case in formula (4) because of local linearity. To cope with the multicollinearity, we apply ridge regression with a regularization parameter $\lambda \ge 0$, and obtain the ridge estimator of the parameters that minimizes the residual sum of squares plus the penalty term $\lambda \sum_{h,k} \beta_{h,k}^2$.
The matrices are the observed and expected Fisher matrices multiplied by $\sigma^2$ and adapted to ridge regression, respectively.
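The stabilizing effect of the ridge penalty on collinear regressors can be seen in a small linear example. The following is a minimal sketch for two centered regressors, using the closed-form solution of the penalized normal equations; it is not the nonlinear procedure of the omicwas package, and the function name `ridge_two_regressors` and the toy numbers are hypothetical.

```python
def ridge_two_regressors(z1, z2, y, lam):
    """Ridge estimates for y ~ b1*z1 + b2*z2 (no intercept; inputs centered).

    Solves (Z'Z + lam*I) b = Z'y with the closed-form 2x2 inverse.
    lam = 0 gives the ordinary least squares estimates.
    """
    s11 = sum(a * a for a in z1) + lam
    s22 = sum(b * b for b in z2) + lam
    s12 = sum(a * b for a, b in zip(z1, z2))
    t1 = sum(a * v for a, v in zip(z1, y))
    t2 = sum(b * v for b, v in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Toy data (made up for illustration): z1 and z2 are nearly identical,
# and y was generated from z1 alone plus a small disturbance.
z1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
z2 = [-2.1, -0.9, 0.1, 0.9, 2.0]
y = [-1.7, -1.2, 0.1, 0.7, 2.1]

ols = ridge_two_regressors(z1, z2, y, 0.0)
ridge = ridge_two_regressors(z1, z2, y, 1.0)
print("OLS  :", ols)    # estimates pulled apart by collinearity
print("ridge:", ridge)  # estimates shrunk toward each other
```

In this toy example both coefficients are penalized, whereas in the method described above only the cell-type-specific trait effects are penalized.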
Since our objective is to predict the cell-type-specific trait effects, we choose the regularization parameter $\lambda$ that can minimize the mean squared error (MSE) of $\beta_{h,k}$. Our methodology is based on [26]. To simplify the explanation, we assume the Jacobian matrices $(\partial\mu(\theta)/\partial\alpha)$, $(\partial\mu(\theta)/\partial\beta)$ and $(\partial\mu(\theta)/\partial\gamma)$ to be mutually orthogonal, where $\alpha$, $\beta$ and $\gamma$ are the vector forms of $\alpha_h$, $\beta_{h,k}$ and $\gamma_l$, respectively. Then, from formulae (11)
For each m in the summation of (13), the summand attains its minimum at an optimal value $\lambda_m$. To minimize MSE, we need to find some "average" of the optimal $\lambda_m$ over the range of m. Hoerl et al. [27] proposed to take the harmonic mean, which is proportional to $\sigma^2/\|\beta\|^2$. However, if an OLS estimator $\hat{\beta}(0)$ is plugged in, $\|\beta\|^2$ is biased upwards, and $\lambda$ is biased downwards. Indeed, with regard to the estimator of $1/\sqrt{\lambda_m}$, the terms with larger m have larger variance. Thus, we take a weighted average of the terms in $v_m^{\mathsf{T}} \hat{\beta}(0)$, and also subtract the upward bias. The weighting and subtraction were mentioned in [26], where the subtraction term was dismissed, under the assumption of large effect-size $\beta$. Since the effect-size could be small in our application, we keep the subtraction term.
The statistic  can be nonpositive, and is unbiased in the sense that .
Our choice of regularization parameter is where  1 2 is taken instead of positive infinity.

Implementation of omicwas package
For each omics marker, the parameters $\alpha$, $\beta$ and $\gamma$ (denoted in combination by $\theta$) are estimated and tested by nonlinear ridge regression in the following steps. As we assume the magnitude of trait effects $\beta$ and $\gamma$ to be much smaller than that of basal marker level $\alpha$, we first fit $\alpha$ alone for numerical stability.
Apply Wald test.
2. Calculate $\hat{\sigma}^2$ by formula (7). Use it as a substitute for $\sigma^2$. The residual degrees of freedom $n - p$ is the number of samples minus the number of parameters in $\theta$.
The formula (16) is the same as a Wald test, but the test differs, because the ridge estimators are not maximum-likelihood estimators. The algorithm was implemented as a package for the R statistical language. We used the NL2SOL algorithm of the PORT library [29] for minimization.
In analyses of quantitative trait locus (QTL), such as methylation QTL (mQTL) and expression QTL (eQTL), an association analysis that takes the genotypes of a single nucleotide polymorphism (SNP) as $X_{i,k}$ is repeated for many SNPs. In order to speed up the computation, we perform rounds of linear regression. First, the parameters $\hat{\alpha}(0)$ and $\hat{\gamma}(0)$ are fit by ordinary linear regression under $\beta = 0$, which does not depend on $X_{i,k}$. By taking the residuals, we practically dispense with $\hat{\alpha}(0)$ and $\hat{\gamma}(0)$ in the remaining steps.

Multicollinearity of interaction terms
The regressors for cell-type-specific trait effects in the full model (equation (2)) are the interaction terms $W_{h,i} X_{i,k}$. To assess multicollinearity, we mathematically derive the correlation coefficient between two interaction terms $W_{h,i} X_{i,k}$ and $W_{h',i} X_{i,k}$. In this section, we treat $W_{h,i}$, $W_{h',i}$ and $X_{i,k}$ as sampled instances of random variables $W_h$, $W_{h'}$ and $X_k$, respectively. For simplicity, we assume $W_h$ and $W_{h'}$ are independent of $X_k$. When the ratios of mean to SD of $W_h$ and $W_{h'}$ are high, the correlation of interaction terms approaches one, irrespective of Cor[$W_h$, $W_{h'}$].
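The limiting behavior can be made explicit with a short worked derivation. The following is a sketch under two stated assumptions, namely that $X_k$ is centered and that the pair $(W_h, W_{h'})$ is independent of $X_k$; it is meant to illustrate the argument and may differ in form from equation (17). The symbols $r_h$ and $\rho$ are introduced here only for this illustration.

```latex
% Assumptions: E[X_k] = 0, and (W_h, W_{h'}) is independent of X_k.
% Then E[W_h X_k] = E[W_h] E[X_k] = 0, so covariances reduce to second moments.
\begin{align*}
\mathrm{Cov}[W_h X_k,\, W_{h'} X_k] &= E[W_h W_{h'}]\, E[X_k^2], \\
\mathrm{Var}[W_h X_k] &= E[W_h^2]\, E[X_k^2], \\
\mathrm{Cor}[W_h X_k,\, W_{h'} X_k]
  &= \frac{E[W_h W_{h'}]}{\sqrt{E[W_h^2]\, E[W_{h'}^2]}}
   = \frac{\rho + r_h r_{h'}}{\sqrt{(1 + r_h^2)(1 + r_{h'}^2)}},
\end{align*}
% where r_h = E[W_h] / SD[W_h], r_{h'} = E[W_{h'}] / SD[W_{h'}],
% and rho = Cor[W_h, W_{h'}].
% As r_h and r_{h'} grow, the right-hand side tends to one for any rho in [-1, 1].
```

For example, with $\rho = 0$ and ratios of 1.6 and 5.3, the expression evaluates to about 0.83.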

EWAS of rheumatoid arthritis
EWAS datasets for rheumatoid arthritis were downloaded from the Gene Expression Omnibus (GEO). Using the RnBeads package (version 2.2.0) [30] of R, IDAT files of HumanMethylation450 array were preprocessed by removing low quality samples and markers, by normalizing methylation level, and by removing markers on sex chromosomes and outlier samples. The association of methylation level with disease status was tested with adjustment for sex, age, smoking status and experiment batch; the covariates were assumed to have uniform effects across cell types. After quality control, dataset GSE42861 included bulk peripheral blood leukocyte data for 336 cases and 322 controls [20]. GSE131989 included sorted CD14+ monocyte data for 63 cases and 31 controls [21]. By meta-analysis of GSE131989 and GSE87095 [19], we obtained sorted CD19+ B cell data for 108 cases and 95 controls. The cell type composition of bulk samples was imputed using the Houseman algorithm [31] in the GLINT software (version 1.0.4) [32].

Differential gene expression by age
Whole blood RNA-seq data of GTEx v7 was downloaded from the GTEx website [22]. Genes of low quality or on sex chromosomes were removed, expression level was normalized, outlier samples were removed, and 389 samples were retained. The association of read count with age was tested with adjustment for sex. From GEO dataset GSE56047 [23], we obtained sorted CD14+ monocyte data for 1202 samples and sorted CD4+ T cell data for 214 samples. The cell type composition of bulk samples was imputed using the DeconCell package (version 0.1.0) [9] of R.

Simulation of cell-type-specific disease association
Bulk tissue sample data for case-control comparison were simulated based on real data. We generated four scenarios. Each omics marker was simulated independently. The mean expression level was defined for each cell type, separately in cases and controls. The standard deviation (SD) was set to be the same for each combination. We tested disease association specific to one cell type, which we call the target cell type. In each scenario, the mean expression level was set as follows.
A. The mean was equal for all cell types both in cases and controls (null scenario).
B. The mean in cases was higher by 1 SD for the target cell type. Other combinations had the same mean value.
C. The mean in cases was lower by 1 SD for one non-target cell type. Other combinations had the same mean value.
D. The mean in cases was higher by 1 SD for the target cell type, and lower by 1 SD for one non-target cell type. Other combinations had the same middle mean value.
The target and non-target cell types were randomly chosen for each marker.
For each sample, the cell-type-specific expression level was randomly sampled from a normal distribution that was specified in the scenario. The cell-type-specific expression levels were converted to the linear scale, and then averaged across cell types, according to the predefined cell type composition. The result becomes the bulk expression level of the sample in linear scale.
We used the above-mentioned bulk tissue data, namely DNA methylation data for 658 peripheral blood leukocyte samples (GSE42861) and gene expression data for 389 whole blood samples (GTEx). We applied the same simulation procedure to each dataset. The cell type composition in the original data was retained for all samples. Half of the samples were randomly assigned as cases, and the other half were assigned as controls. Normalizing transformation (i.e., logit or log) was applied to the bulk expression data, and 500 omics markers were randomly selected. For each marker, we measured the average $\mu$ and the standard deviation $\sigma$ of the expression level. For control samples, the expression level in each cell type was sampled from $N(\mu, \sigma^2)$. For case samples, the expression level in each cell type was sampled from $N(\mu, \sigma^2)$, $N(\mu + \sigma, \sigma^2)$ or $N(\mu - \sigma, \sigma^2)$ according to the scenario.
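The simulation procedure for a single sample and a single marker can be sketched as follows. This is a minimal illustration of the steps described above, not the script used for the article; the function name `simulate_bulk` and its arguments are hypothetical, and the logistic function stands in for the conversion to linear scale in the DNA methylation case.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_bulk(composition, is_case, mu, sd, target, nontarget,
                  scenario, g=logistic, rng=random):
    """Bulk level (linear scale) of one sample for one marker.

    composition: cell type proportions of the sample (retained from real data).
    mu, sd: average and SD of the marker in the normalized scale.
    target, nontarget: indexes of the target and non-target cell types.
    Scenario A: no shift; B: target +1 SD in cases; C: non-target -1 SD in
    cases; D: both shifts.
    """
    bulk = 0.0
    for h, w_h in enumerate(composition):
        mean = mu
        if is_case and scenario in ("B", "D") and h == target:
            mean = mu + sd
        if is_case and scenario in ("C", "D") and h == nontarget:
            mean = mu - sd
        level = rng.gauss(mean, sd)  # cell-type-specific level, normalized scale
        bulk += w_h * g(level)       # convert to linear scale, then mix
    return bulk


# Example: one case sample with three cell types under scenario D.
rng = random.Random(0)
value = simulate_bulk([0.6, 0.3, 0.1], True, 0.0, 1.0,
                      target=0, nontarget=1, scenario="D", rng=rng)
print(value)
```

For gene expression, `g=math.exp` would be passed instead, so that the log-scale levels are mixed in the linear scale.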

Evaluation of statistical methods
Cell-type-specific effects of traits were statistically tested by using bulk tissue data as input. We applied the omicwas package with the normalizing function f = log, logit, identity without ridge regularization (omicwas.log, omicwas.logit, omicwas.identity) or under ridge regression (omicwas.log.ridge, omicwas.logit.ridge, omicwas.identity.ridge). The omicwas package was used also for conventional linear regression under the full and marginal models.
Among previous methods, we evaluated those that accept cell type composition as input and compute test statistics for cell-type-specific association. For DNA methylation data, we applied TOAST (version 1.2.0) [10], CellDMC (version 2.4.0) [13] and TCA (version 1.1.0) [14]. CellDMC first tests association for all combinations, and then filters out those not differentially methylated. We took all of the initial results as CellDMC.unfiltered; in CellDMC.filtered, Z-score was set to zero for those filtered out. For gene expression data, we applied TOAST and csSAM (version 1.4) [5]. For csSAM, we either fitted all cell types together or one cell type at a time, and denoted the results as csSAM.lm and csSAM.monovariate, respectively. The csSAM method is applicable to binomial traits but not to quantitative traits.
For simulated data, we adopted the nominal significance level P < 0.05 (two-sided).In scenario B, the power was defined as the frequency of Z-score > 1.96.
For the association with rheumatoid arthritis and age, "true" association was determined from the measurements in physically sorted blood cells, under the nominal significance level P < 0.05 (two-sided). The significant markers were "up-regulated" (in rheumatoid arthritis cases or elders) or "down-regulated." For a set of differentially expressed markers in a cell type (e.g., up-regulated in monocytes), the prediction performance of an algorithm was measured by the area under the curve (AUC) of receiver operating characteristic (ROC). Standard error of AUC was computed by the jackknife estimator by splitting the markers into 100 groups by chromosomal position. The relative performance of an algorithm was evaluated by its AUC - 0.5 divided by that for the best algorithm in each scenario. The prediction is evaluated separately for up-regulated and down-regulated markers. The AUC of ROC and its 95% confidence interval are plotted for each statistical algorithm.
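The performance measures described above can be sketched in a few lines. The following is a minimal illustration, not the evaluation code used for the article; the function names `auc` and `relative_performance` are hypothetical, and the pairwise comparison is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    scores: prediction scores of an algorithm (e.g., Z-scores; negate them
    when evaluating down-regulated markers).
    labels: True for markers that are truly differentially expressed.
    Ties between a positive and a negative marker count as one half.
    """
    pos = [s for s, flag in zip(scores, labels) if flag]
    neg = [s for s, flag in zip(scores, labels) if not flag]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_performance(auc_by_algorithm):
    """(AUC - 0.5) of each algorithm divided by that of the best algorithm."""
    best = max(auc_by_algorithm.values()) - 0.5
    return {name: (a - 0.5) / best for name, a in auc_by_algorithm.items()}
```

Subtracting 0.5 places a random predictor at zero, so the relative performance expresses the gain over chance as a fraction of the gain achieved by the best algorithm.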

TABLES
Previous regression type methods are based either on the full model of linear regression (equation (2)) or the marginal model (equation (3)). The full model fits and tests cell-type-specific effects for all cell types and CellDMC.filtered. The marginal model fits and tests cell-type-specific effect for one cell type at a time, and its derivatives include csSAM.monovariate and TCA.
cell type is associated, the marginal tests can pick up false signals due to the collinearity between regressor variables (Figs. 2C, 3C). The high power and high error rate of the marginal tests can lead to unstable performance; in real data, the marginal tests were the most powerful for detecting B cell specific association with rheumatoid arthritis (Fig. 4B) but were the least powerful for monocytes (Fig. 4A). The full model tests all cell types together, and its performance was the opposite of the marginal. By fitting all cell types simultaneously, the full model adjusts for the effects of other cell types. The full models did not detect false association coming indirectly from non-target cell types (Figs. 2C, 3C), yet their power was relatively low (Figs. 2B, 3B). The ridge tests (omicwas.identity.ridge, omicwas.logit.ridge and omicwas.log.ridge) were in the middle between full and marginal tests with regard to the power (Figs. 2B, 3B, 4). The false positives of ridge tests were modest compared to the marginal tests (Figs. 2C, 3C).
The input data is given in four matrices. The matrix $W_{h,i}$ represents cell type composition. The matrices $X_{i,k}$ and $C_{i,l}$ represent the values of the traits that have cell-type-specific and uniform effects, respectively. We assume the two matrices are centered: $\sum_i X_{i,k} = \sum_i C_{i,l} = 0$. The matrix $Y_{i,j}$ represents the omics marker expression level in tissue samples.

Figure 2
Figure 2 Detection of cell-type-specific association in simulated data for DNA methylation. (A), (B), (C) and (D) correspond to the respective scenarios. Results from different algorithms are aligned horizontally. Vertical axis indicates the Z-score for the disease effect (cases vs controls) specific to the target cell type. Points are colored according to the target cell type. The middle bar of the box plot indicates the median, and the lower and upper hinges correspond to the first and third quartiles. The whiskers extend to the value no further than 1.5 * inter-quartile range from the hinges. Neu, neutrophils; Mono, monocytes; Eos, eosinophils.

Figure 3
Figure 3 Detection of cell-type-specific association in simulated data for gene expression. (A), (B), (C) and (D) correspond to the respective scenarios. Results from different algorithms are aligned horizontally. Vertical axis indicates the Z-score for the disease effect (cases vs controls) specific to the target cell type. Points are colored according to the target cell type. The middle bar of the box plot indicates the median, and the lower and upper hinges correspond to the first and third quartiles. The whiskers extend to the value no further than 1.5 * inter-quartile range from the hinges. Gra, granulocytes; Mono, monocytes.

Figure 4
Figure 4 Performance of cell-type-specific association prediction. For rheumatoid arthritis association of DNA methylation in monocytes (A) and B cells (B); age association of gene expression in CD4+ T cells (C) and monocytes (D).
the second term penalizes $\beta_{h,k}$ for taking large absolute values. The
where $\mu$ is the vector form of $\mu_i$, $\theta$ is the vector form of the parameters $\alpha_h$, $\beta_{h,k}$ and $\gamma_l$ combined, $(\partial\mu/\partial\theta)$ is the Jacobian matrix, $(\partial^2\mu/\partial\theta^2)$ is the array of Hessian matrices for $\mu_i$ taken over samples, and T indicates matrix transposition. The product with the Hessian is taken by multiplying for each sample and then summing up over samples. The matrix after $\lambda$ has one only in the diagonal corresponding to $\beta_{h,k}$. The assigned value $\theta$ is the true parameter value. By taking the expectation, we obtain a rougher approximation [25] for the mean of the ridge estimator $\hat{\theta}(\lambda)$.
where U and V are orthogonal matrices, the columns of V are $v_1, \cdots, v_M$, and the diagonals of diagonal matrix D are sorted $d_1 \ge \cdots \ge d_M \ge 0$. The bias, variance and MSE of the ridge estimator are decomposed as
Table 2B Correlation between blood cell type proportion and age (Xk)