- Methodology article
- Open Access
A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies
BMC Bioinformatics volume 18, Article number: 105 (2017)
Intra-sample cellular heterogeneity presents numerous challenges to the identification of biomarkers in large Epigenome-Wide Association Studies (EWAS). While a number of reference-based deconvolution algorithms have emerged, their potential remains underexplored and a comparative evaluation of these algorithms beyond tissues such as blood is still lacking.
Here we present a novel framework for reference-based inference, which leverages cell-type specific DNAse Hypersensitive Site (DHS) information from the NIH Epigenomics Roadmap to construct an improved reference DNA methylation database. We show that this leads to a marginal but statistically significant improvement of cell-count estimates in whole blood as well as in mixtures involving epithelial cell-types. Using this framework we compare a widely used state-of-the-art reference-based algorithm (called constrained projection) to two non-constrained approaches including CIBERSORT and a method based on robust partial correlations. We conclude that the widely-used constrained projection technique may not always be optimal. Instead, we find that the method based on robust partial correlations is generally more robust across a range of different tissue types and for realistic noise levels. We call the combined algorithm which uses DHS data and robust partial correlations for inference, EpiDISH (Epigenetic Dissection of Intra-Sample Heterogeneity). Finally, we demonstrate the added value of EpiDISH in an EWAS of smoking.
Estimating cell-type fractions and subsequent inference in EWAS may benefit from the use of non-constrained reference-based cell-type deconvolution methods.
One of the great challenges facing the omics community is that posed by intra-sample heterogeneity (ISH) . Molecular profiles derived from complex tissues such as blood or breast represent averages over many different cell types. Since cell-type composition of complex tissues varies in response to phenotypes such as cancer or age, correcting for changes in tissue composition could be crucial if one wishes to identify potentially causal alterations in individual cell-types [2, 3]. This is particularly relevant for Epigenome-Wide Association Studies (EWAS) where the effect sizes of interest could be small compared to changes in tissue composition [3, 4]. Because of this, a number of different ISH deconvolution algorithms for genome-wide DNA methylation data have recently been proposed [5–9]. As with the analogous tools developed for mRNA expression data [10, 11], these algorithms can be classified as either “reference-free” [6, 8] or “reference-based” , depending on whether they use an a-priori database of cell-type specific DNA methylation (DNAm) reference profiles to perform the deconvolution. Although reference-free methods have the advantage that they don’t require such a DNAm database and are therefore applicable, in principle, to any tissue-type, these algorithms have only been tested in blood and don’t provide sample-specific absolute estimates of cellular proportions. Obtaining absolute estimates of underlying cell-type proportions in a tissue is an important task, as shifts in specific cell subtypes within the tissue could be used for diagnostic or prognostic purposes [12, 13], as well as providing useful mechanistic insight into systems medicine . A further problem with reference-free methods is that they often rely on the assumption that the top components of variation correlate with cell-type composition, an assumption which may not always hold, and which if violated could lead to loss of biological signal. Indeed, a clear example of this can be seen in a study which used a reference-free method called EWASher , concluding that only a handful of CpGs are differentially methylated between breast cancer and normal breast tissue, clearly contradicting all the available evidence that DNA methylation differences between breast cancer and normal tissue is widespread and largely independent of changes in tissue composition [15, 16]. Thus, reference-based approaches appear to be the safest option provided one can construct a reference DNAm database for the tissue of interest. As representative DNAm profiles for all human normal cell types accrue , reference-based methods are therefore more likely to become the framework of choice for tackling the ISH problem.
So far however, only one reference-based algorithm (the Houseman algorithm) has been proposed for DNAm data  (see also [18, 19]). Houseman’s algorithm performs inference using a quadratic programming technique known as linear constrained projection (CP), where non-negativity and normalization constraints on cellular proportions are imposed during inference. Interestingly, in the gene expression context, a recent comparative study concluded that CP is outperformed by a non-constrained reference-based approach which relies on the technique of Support Vector Regressions (SVR) . Another technique based on robust partial correlations (RPC)  has also not yet been explored in the context of DNA methylation data. Thus, there is an urgent need to conduct a comprehensive comparative evaluation of different reference-based deconvolution algorithms. Here, we perform such a comparison using experimental as well as computationally generated cellular mixtures, and including epithelial cell-types as well as leukocytes.
The importance of the reference database itself for the quality of the inference has also been noted before [10, 20]. Motivated by this, we here present a novel approach for the construction of a reference DNAm database, which integrates prior biological knowledge of cell-type specific sites with cell-type specific DNA methylation. Specifically, given that DNAse Hypersensitive Sites (DHS) are highly cell-type specific , we use such DHS data from the NIH Epigenomics Roadmap  and ENCODE [22, 23], to improve the quality of the reference database. Sample-specific cell-type proportions can then be estimated using either Houseman’s CP algorithm or a non-constrained approach such as SVR or RPC.
We show that incorporation of such prior biological knowledge can improve inference, and that Houseman’s algorithm is only optimal in the scenario of very noisy data. For realistic noise levels, we discover that non-constrained methods like RPC generally perform better. Hence, we propose an improved novel framework for reference-based inference of cell-type composition, called EpiDISH (Epigenetic Dissection of Intra-Sample-Heterogeneity), which uses (i) DHS data from the NIH Roadmap and ENCODE to construct a reference DNAm database, and (ii) RPC for estimation of cell-type proportions.
Reference-based algorithms for deconvolution of intra-sample heterogeneity
We considered a total of 4 different reference-based algorithms. Reference-based means that each of these algorithms models a DNAm profile of any given sample as a linear combination of a given set of reference DNAm profiles representing underlying cell-types present in the sample. Given a number C of underlying cell-types, each with a DNAm profile b c, and denoting by y the DNAm profile of a given sample, the underlying model is
The general idea is to estimate the weight coefficients in a least squares sense, but restricting to a subset of CpGs that are highly discriminative of the underlying cell subtypes. Assuming that the reference database contains the major cell-types present in the sample y, one may assume that ∑ C c = 1 w c = 1 (or more generally that ∑ C c = 1 w c ≤ 1). The 4 algorithms differ in how the normalization constraint is implemented:
For 3 algorithms, the constraint that weights have to be non-negative and add to 1, is implemented a posteriori, i.e. after inference of the weights themselves. Specifically, we follow the procedure implemented in  and set any negative weight estimates to zero, followed by a scaling to ensure that the non-zero positive weights add to 1. The 3 algorithms which enforce the constraints a posteriori are (i) multivariate linear regression or partial correlations (LR), (ii) robust multivariate linear regression or robust partial correlations (RLR/RPC) and (iii) Support Vector Regressions (SVR), an advanced form of robust penalized multivariate regression. In the case of SVR, we used the implementation called CIBERSORT . For LR and RLR/RPC we used the lm and rlm R-functions (www.r-project.org), to perform the multivariate regressions.
The 4th algorithm performs the inference of the weights in a least squares sense but imposes the positivity and normalization constraints as part of the inference process. This technique is known as linear constrained projection (CP) and weights can be inferred using quadratic programming (QP) [18, 19]. In implementing CP/QP there are in principle two options for the normalization constraint: one can implement a strict equality which requires the weights to add to 1, or one can implement the normalization as an inequality constraint, in which case the weights are only required to add to a number less or equal to 1. Here we implement the CP algorithm using the normalization as an inequality constraint. In effect, modulo the reference database, this algorithm is the reference-based Houseman algorithm . Differences between the two implementations of CP are relatively minor since in this work we evaluate methods in tissues where all the major cell subtypes are known and for which reference DNAm profiles exist.
Construction of integrated DHS reference DNA methylation databases
Below we give a brief summary of the datasets used in the construction of the reference databases (see also Table 1).
In the case of blood tissue we used the purified blood cell Illumina 450k data from Reinius et al. . Specifically, we used the purified cell data of Monocytes, Neutrophils, Eosinophils, CD4+ T-cells, CD8+ T-cells, Natural Killer (NK) cells and B-cells. There were 6 samples for each cell-type coming from 6 different individuals. We used a well-known empirical Bayes framework of moderated t-statistics  to derive differentially methylated CpGs (DMCs) between one of the 7 cell types and the rest using a false discovery rate (FDR) threshold of 0.05. Separately to this, we also identified all Illumina 450k probes that mapped to a DNase Hypersensitive Site (DHS) in any of the considered blood cell subtypes using data from the NIH Epigenomics Roadmap. DHS data was available for Monocytes, B-cells, T-cells and NK-cells. For each cell-type we then filtered DMCs to include only those mapping to a DHS, which we call DHS-DMCs. This resulted in 14105 B-cell, 7723 NK-cell, 12118 CD4+ T-cell, 38131 CD8+ T-cell, 11289 Monocyte, 2375 Neutrophil and 11515 Eosinophil DHS-DMCs. We then ranked these DHS-DMCs according to the mean difference in DNAm, thus favouring DHS-DMCs with large mean differences (i.e. delta beta-value > 0.8). For each cell-type we picked the top 50 DHS-DMCs. Across all 7 cell subtypes, there were 333 unique DHS-DMCs. DNAm reference centroids were then calculated as the average over the 6 samples of each purified blood cell subtype and for each of these 333 CpGs, resulting in a blood DNA methylation reference database of 333 DHS-DMCs and 7 blood cell subtypes.
Mixed epithelial cells
As a means of testing the statistical algorithms in an epithelial context, we sought to identify at least 3 human epithelial cell subtypes for which Illumina 450k DNAm data was available, generated as part of at least 2 independent studies, in order to use one study for reference database construction and another for validation (generation of in-silico mixtures). In addition, we also required DHS data from ENCODE or the NIH Epigenomics Roadmap for these cell-types. We identified DHS data from the NIH Roadmap for breast and pancreatic cells, whilst from ENCODE we obtained DHS data for human renal cortical epithelial (HRCE) cells. For human mammary epithelial cells (HMECs) and HRCE cells, Illumina 450k data was available from ENCODE, whereas for pancreas we used corresponding 450k data from Slieker et al. . In this case, because in some cases we only had 1 representative sample for each cell-type, we selected DMCs according to a difference in DNAm beta-value being larger than 0.9. This differential DNAm analysis was only performed for CpGs that mapped to a DHS in either HMECs, HRCEs or pancreas. Thus, this procedure is extremely stringent as we select DHS CpGs whose DNAm varies as much as possible between cell-types. This resulted in 272 DMCs between HMECs and HRCEs, 315 CpGs between HMECs and pancreas, and 54 DMCs between HRCEs and pancreas. In total, there were 575 unique DMCs, resulting in a DNAm reference database of 575 CpGs and 3 cell-types (breast, HRCEs and pancreas).
Mixed epithelial and non-epithelial cells
We identified DHS data from the NIH Roadmap for a fetal lung fibroblast cell-line (IMR90) and for B-cells, whilst from ENCODE we obtained DHS data for hepatocytes. For IMR90 and hepatocytes Illumina 450k data was available from ENCODE, whereas for B-cells we used the data from Reinius et al. . In this case too, because in some cases we only had 1 representative sample for each cell-type, we selected DMCs according to a difference in DNAm beta-value being larger than 0.9. This differential DNAm analysis was only performed for CpGs that mapped to a DHS in either hepatocytes, IMR90 or B-cells. Thus, this procedure is extremely stringent as we select DHS CpGs whose DNAm varies as much as possible between cell-types. This resulted in 181 DMCs between IMR90 and liver, 405 CpGs between IMR90 and B-cells, and 482 DMCs between liver and B-cells. In total, there were 912 unique DMCs, resulting in a DNAm reference database of 912 CpGs and 3 cell-types (IMR90, liver and B-cells).
Non-DHS based reference databases
For the two previously described scenarios, we also derived reference DNAm databases without restriction to DHSs, but ensuring that reference databases were of approximately the same size as the DHS-based ones. In the case of breast, pancreas and HRCEs, this resulted in a 558 DMC x 3 cell-type reference DNAm matrix. In the case of IMR90, liver and B-cells, the reference DNAm matrix was of dimension 975 DMCs and 3 cell-types. In the case of blood, the corresponding “non-DHS” reference database was defined over 339 DMCs.
In what follows we briefly describe the Illumina 450k datasets used to validate and compare the different algorithms (see also Table 1).
One dataset profiling whole blood for over 650 samples (encompassing both rheumathoid arthritis (RA) cases and controls) was available from Liu et al. . For this dataset, there were average flow-cytometric estimates for all major blood cell subtypes for RA cases and controls. Another dataset (Koestler et al.) profiled 6 whole blood (WB) and 12 experimentally reconstructed “whole blood” mixtures . In the former case, flow-cytometric estimates for the different blood cell subtypes were available for each of the 6 WB samples. In the latter case, the mixing proportions were determined by the experimentalist and therefore known without error. We also considered an additional Illumina 450k dataset from Zilbauer et al., which profiled 5 blood cell subtypes (Monocytes, Neutrophils, B-cells, CD4+ and CD8+ T-cells) with 6 replicates of each .
Other cell types
For validating and assessing the algorithms in the context of the mixed epithelial cell type scenario, we used as validation, Illumina 450k DNAm data for HMECs from Lowe et al. , and Illumina 450k data for HRCEs and adult male & female pancreas from the Stem-Cell-Matrix Compendium-2 (SMC2) . For the case of the mixed epithelial/non-epithelial cell types, we used Illumina 450k DNAm data of IMR90 (fetal lung fibroblast) from SCM2, for liver cells from Slieker et al. , and purified B-cells from Zilbauer et al. .
Validation and evaluation strategy based on in-silico mixtures
For evaluation and comparison of the statistical algorithms, we generated 100 different in-silico mixtures of the purified cell DNAm profiles, with weights chosen randomly from a uniform (0,1) distribution, subject to the constraints that weights add to 1. Performance of each algorithm was then assessed using the root mean square error (RMSE) between the estimated and true weights for each cell-type, as estimated over the 100 different mixtures. R2 values between estimated and true weights for each cell-type were also computed over the 100 different mixtures. Average RMSE and R2 values over cell-types were also calculated. Finally, this procedure was repeated for a total of 25 Monte Carlo runs, yielding a total of 25 average RMSE and R2 values. This overall strategy was used for validation/testing purposes in the Zilbauer dataset, as well as for testing the methods in the mixed epithelial and mixed epithelial/non-epithelial scenarios.
We note that in this work we always test methods on data which is independent from the data used to construct the reference database. This already implicitly assesses the performance of the algorithms under noise levels which one may encounter between two similar experiments performed by different labs and personnel. However, in order to assess the algorithms under increasing levels of noise, we also added Gaussian Noise of increasing variation to the mixtures. The addition of Gaussian Noise was done in the M-value basis, with data subsequently transformed back to the beta-valued basis for application of the algorithms. Specifically, we considered 7 increasing levels of standard deviation noise: SD = 0, 1, 2, 3, 4, 5, 6. For a DMC that differs between 2 cell-types by an amount of 0.8 (in a beta-valued basis), this corresponds roughly to a difference in the M-value basis of approx. 6. Thus, the case SD = 6, can be seen as an extreme case of noise. The case SD = 1 corresponds to a typical deviation in the beta-value basis of approximately β (1-β)/(1 + β), i.e. an approx. 5% change for a CpG which is say unmethylated (β = 0.05), and an approx. 16% change for a CpG which is partially methylated (β = 0.4). Thus the case SD = 1 is a fairly realistic scenario given known noise levels in Illumina 450k data.
Definition of a gold-standard list of smoking-associated DMCs and of a true negative list
Smoking and whole blood is the ideal scenario in which to compare methods for cell-type correction, since many whole blood EWAS have reported strong consistency of smoking-associated DMCs (reviewed in ). Hence, to define a gold-standard list of sDMCs and associated genes, we used the curated table of Gao et al. . Specifically, we defined a gold-standard list of sDMCs defined by Illumina 450k probes which have been associated with smoking in at least 3 independent studies. This resulted in a total of 62 gold-standard sDMCs, implicating a total of 15 unique genes. Using this gold-standard list to declare a list of true positives, we then assessed sensitivity of the methods in an independent whole blood EWAS of 152 samples . We note that this EWAS was not among the ones reviewed by Gao et al. and hence is truly independent.
Definition of a true negative CpG (i.e. one not associated with smoking) is much harder. However, to construct an approximate set of true negative CpGs, we used the intersection of 450k probes not associated with smoking in 3 independent EWAS studies. These EWAS studies were a (i) whole blood set of 464 samples from Tsaprouni et al.  (GEO: GSE50660), (ii) a set of 333 whole blood samples from healthy controls and (iii) corresponding 354 whole blood sample from rheumathoid arthritis cases, all from Liu et al. (GEO: GSE42861) . For all 3 sets, smoking status information was available. In the case of Tsaprouni et al., processed data was downloaded from GEO. In the case of Liu et al., we processed raw data with minfi . All 450k data was corrected for type-2 probe bias using BMIQ . P-values of association with smoking (treated as ordinal variable, 0 = non-smoker, 1 = ex-smoker, 2 = current-smoker) was determined by linear regression in each set. We then defined CpGs as not associated with smoking if their P-value > 0.25 in each of the 3 datasets, resulting in 89290 true negative (TN) CpGs.
The blood reference DNAm database for 333 DHS-DMCs and 7 blood cell subtypes is provided (Table S1 in Additional File 1). A user-friendly R-script implementing EpiDISH is available (Additional File 2). EpiDISH is also freely available as an R-package from github: https://github.com/sjczheng/EpiDISH
The EpiDISH algorithm and validation in blood using flow-cytometry
We collected DHS and Illumina 450k DNA methylation data from the NIH Epigenomics Roadmap  and ENCODE [22, 23], as well as from other Illumina 450k studies profiling individual cell-types [24, 29] and normal tissues [26, 35] (Table 1). Given a tissue of interest, and with prior knowledge of which cell subtypes might be present in the tissue, EpiDISH first constructs a DNA methylation reference database, by integrating DHSs from the corresponding cell subtypes with a supervised selection procedure, to identify cell subtype specific differentially methylated CpGs (DMCs) which localize to open chromatin (DHS-DMCs, Methods, Fig. 1a). Once the reference database is constructed, EpiDISH then infers sample-specific cellular proportions using robust partial correlations (Fig. 1b, Additional File 2).
We first considered the case of blood tissue, a complex tissue for which the main constituent cell types are well known and which is being used extensively in EWAS . We constructed a blood reference database using Illumina 450k DNAm profiles for a total of 7 purified blood cell subtypes (B-cells, NK-cells, CD4+ T-cells, CD8+ T-cells, Monocytes, Neutrophils and Eosinophils) obtained from , integrating it with blood cell subtype specific DNAse Hypersensitive Sites (DHS), as obtained from the NIH Epigenomics Roadmap , and further filtering the 450k probes for differential methylation between every pair of blood cell subtypes, resulting in an integrated blood reference DNAm database of 333 CpGs and 7 blood cell subtypes (Methods, Table 1, Table S1 in Additional File 1). As a sanity check, we verified that the original purified samples segregated according to cell subtype when clustered over these 333 CpGs (Figure S1 in Additional File 1). Blood cell subtype proportions obtained with EpiDISH on whole blood (WB) and peripheral blood mononuclear cells (PBMC) from the same study were also in line with known proportions, i.e. strongest enrichment for neutrophils in WB and with lymphocytes making the dominant component of PBMCs (Figure S2 in Additional file 1).
In order to validate the EpiDISH algorithm, we first applied it to an independent 450k data set of 689 whole blood samples from an EWAS in Rheumatoid Arthritis, for which FACS (flow-cytometric) estimates of 5 purified blood cell subtypes (B-cells, NK-cells, T-cells, Granulocytes and Monocytes), averaged separately over 354 cases and 335 controls, was available . We observed excellent agreement between EpiDISH and FACS estimates, for both cases and controls, with a root mean square error (RMSE) of only 4% (Fig. 2a). Agreement was even better for the actual difference in mean blood cell subtype proportions between cases and controls, with a RMSE of 1% (Fig. 2a). To further validate EpiDISH, we applied it to another set of 6 whole blood samples, for which independent flow-cytometry estimates of blood cell subtype fractions was available . In this set too, EpiDISH achieved a RMSE of approximately 3 to 4%, with a reasonably high average R2 value of 0.85 (Fig. 2b).
EpiDISH correctly infers blood cell subtype proportions in reconstructed whole blood samples
Flow cytometric estimates of blood cell subtype proportions are also subject to error. Hence, we further assessed EpiDISH in its predictions to correctly infer cell subtype proportions from 12 reconstructed whole blood samples where the exact mixing proportions are known . For all those cell subtypes whose proportions exhibited a reasonable dynamic range in the experimental mixtures, EpiDISH obtained R2 values above 0.95, with an average RMSE of 2.6%, confirming that EpiDISH can accurately quantify blood cell subtype proportions (Fig. 2c).
Using DHS information marginally improves the quality of the reference database
In order to demonstrate that statistical inference is improved by using relevant cell-type DHSs when constructing the reference DNAm database , we conducted a comparative analysis using two different references: one using DMCs that map to cell-type specific DHSs, and another where we only use DMCs (Methods). We performed the analysis for three different scenarios. In the first, we generated 100 random in-silico mixtures of 5 purified blood cell subtypes from Zilbauer et al. , and compared EpiDISH’s R2 value for the estimated cell-type proportions between the two different reference databases. Performing this analysis for 25 different Monte Carlo runs, revealed significantly higher R2 values for the database that used DHS information (Fig. 3a). We next repeated this analysis for another scenario where we generated mixtures from 3 epithelial cell-types (breast, human renal cortical and pancreas) for which DNAm profiles were available from at least 2 independent studies, and for which DHS information for each individual cell-type was also available from either the NIH Roadmap or ENCODE (Methods). DNAm data from 2 independent studies is necessary to separate out the process of reference construction (training) and evaluation (validation). Confirming the previous analysis, improved R2 values was observed for the DHS-based reference database (Fig. 3b). Finally, we considered a third scenario where we mixed together epithelial and non-epithelial cell-types (fetal lung fibroblast-IMR90, hepatocytes and B-cells). For each of these cell-types, DHS data and DNAm profiles generated by two independent studies were available (again, in order to avoid overfitting). Confirming the previous results, R2 values were distinctively improved upon using DHS information (Fig. 3c). Using RMSE as performance measure, the DHS-based reference was best in 2/3 studies (Figure S3 in Additional File 1). However, we note that in all cases improvements were only marginal.
EpiDISH compares favorably to other reference-based methods
Having validated EpiDISH, we next performed a detailed comparison to competing reference-based methods. Specifically, we compared EpiDISH to the linear constrained projection technique (CP) used by Houseman and others [5, 18, 19], to an algorithm called CIBERSORT (CBS), which uses Support Vector Regression and which has been shown to perform best on gene expression data , and finally to a non-robust variant of EpiDISH based on simple multivariate regression (LR). In order to objectively compare the 4 algorithms, we first considered the case of blood tissue, for which several datasets profiling purified blood cell subtypes were available. For each algorithm we used the same blood reference database of 333 DHS-DMCs and 7 blood cell subtypes, as constructed previously using the purified blood cell data from Reinius et al. . Using the known reconstructed whole blood mixtures (12 samples) from Koestler et al. , we compared the algorithms in terms of the RMSE and R2 values of the estimated mixing proportions. Interestingly, we observed that robust partial correlations (EpiDISH) outperformed both CP and CIBERSORT in terms of the RMSE, with similar performance as assessed using R2 (Fig. 4a). In an independent dataset of 5 purified blood cell subtypes from Zilbauer et al. , where we generated in-silico mixtures, CP underperformed while EpiDISH/CIBERSORT performed optimally (Fig. 4b). In order to further compare performance in the context of other cell-types, we applied all 4 algorithms to the 3 epithelial cell type and 3 epithelial/non-epithelial cell type mixture scenarios considered earlier (Methods). Once again, EpiDISH compared very favorably relative to the other methods, specially relative to CP which overall showed the weakest performance (Fig. 4c, Fig. S4 in Additional File 1). We note however that EpiDISH did not outperform CIBERSORT in one of these two studies (Figure S4 in Additional File 1).
Constrained Projection reveals added robustness under higher levels of noise
Since the reconstructed and in-silico mixtures from the previous analyses were generated from cell-type specific DNAm profiles that are independent from the cellular DNAm profiles used to build the reference databases, these analyses already implicitly assess the robustness of the algorithms to natural levels of noise, as encountered for instance between different labs or different experimental protocols. However, in order to improve our understanding of the noise performance characteristics of the different methods, we next investigated their relative performance under increasingly higher levels of noise (Methods, Fig. 5). Adding increasing levels of noise to the reconstructed mixtures of Koestler et al., we observed that while EpiDISH was optimal for low levels, that the relative performance of other algorithms, notably constrained projection (CP), improved as noise levels increased (Fig. 5a). We observed a similar pattern using in-silico mixtures of purified blood cell subtypes from Zilbauer et al., with EpiDISH optimal at low levels of noise, but CP emerging as the more optimal method at higher levels (Fig. 5b). In the context of in-silico mixtures of 3 epithelial cell subtypes, we once again observed a cross-over in RMSE performance between CP and EpiDISH, with EpiDISH performing better at lower levels of noise, but CP and CIBERSORT emerging as the more optimal methods under larger levels (Fig. 5c). In terms of R2, CP emerged as the best performing method in this set. CP also emerged as the better performing method (in terms of R2) for larger levels of noise in the context of in-silico mixtures of epithelial/non-epithelial cell subtypes (Fig. 5d). In summary, these data indicate that the relative performance of constrained (CP) vs non-constrained (EpiDISH, CIBERSORT) approaches for estimating cell proportions in heterogeneous mixtures is dependent on cell-type and the levels of noise in the data.
EpiDISH improves sensitivity in an EWAS of smoking in whole blood
In order to further validate EpiDISH and to illustrate its application to EWAS, we applied it to an EWAS of smoking in a cohort of 152 women, all aged 53, for which whole blood samples were taken and for which DNAm profiles using the Illumina 450k technology were generated . Smoking in whole blood tissue provides the ideal scenario in which to test EpiDISH, since a gold-standard list of 62 smoking-associated DMCs (sDMCs), as derived from an extensive survey of independent smoking EWASs, exists  (Methods, Table S2 in Additional file 1). Hence, we applied EpiDISH to our 152 samples to obtain sample-specific weights for the different blood cell subtypes. The resulting estimates were in line with what is expected for whole blood samples, with granulocytes/neutrophils making up ~50% of the samples and with lymphocytes/monocytes making up the rest (Fig. 6a). SVD analysis confirmed that the top component of variation correlated strongly with changes in blood cell-type composition (Fig. 6b). Not adjusting for variation in blood cell subtype composition, we observed 34 sDMCs at genome-wide significance (FDR < 0.05, Table S3 in Additional File 1, Fig. 6c). Adjusting for cell-type proportions, as estimated using EpiDISH, we observed a doubling of sDMCs (70 sDMCs at FDR < 0.05, Table S4 in Additional File 1, Fig. 6d). Importantly, among the additional sDMCs identified with EpiDISH, there were probes that mapped to 5 genes within our gold-standard set of 15 smoking-associated genes [30, 37–41] (Methods, Fig. 6e). For instance, an additional probe mapping to AHRR was observed after adjustment, and probes mapping to PTK2 and LRP5 were obtained only after adjustment (Fig. 6e). In contrast, only one probe in the unadjusted analysis was not found after adjustment (Table S3 in Additional File 1). To investigate this further we estimated the sensitivity for an unadjusted analysis, EpiDISH (with and without DHS-DMCs in the reference), CIBERSORT and CP, at two different FDR thresholds (FDR < 0.05 and FDR < 0.3) (Table 2). This confirmed that EpiDISH improves the sensitivity over an unadjusted analysis, although CP performed marginally better (Figure S5 in Additional File 1). To check that the improved sensitivity is not at the expense of a much lower specificity, we also defined a set of true negative CpGs, i.e. CpGs not associated with smoking (P > 0.25) as assessed in 3 independent EWAS cohorts (Methods). This resulted in a true negative set of 89290 CpGs, allowing the relative specificity of the methods to be assessed. Using the same FDR < 0.05 threshold, we observed that all methods achieved comparably high relative specificity values (Specificity ≈ 1), although CP exhibited a 3-fold higher type-1 error rate (false positive rate) than EpiDISH or CIBERSORT (Table 2). At a more relaxed threshold (FDR < 0.3), CP exhibited a 10 times higher type-1 error rate than EpiDISH or CIBERSORT (Table 2). Thus, overall, EpiDISH improves inference over an unadjusted analysis and compares favorably to CP.
The aim of this study was to compare different reference-based methods in order to establish whether there is an optimal approach. This is particularly pertinent given that only one reference-based approach (Houseman’s algorithm) has been applied to DNA methylation data. As our study has demonstrated, Houseman’s algorithm, which relies on the CP technique, is in fact not the most robust method. While we did not identify a single method which always outperformed all others, we found that non-constrained techniques such as partial correlations and support vector regression provided a more robust inference framework than CP, particularly for more realistic noise levels. This is consistent with the results obtained by Newman et al. on gene expression data . In our study, CP only emerged as the optimal choice if data was subjected to a large amount of noise, typically larger than what is encountered in real data. Hence, our study advances upon the state-of-the-art, and proposes the use of RPC or CIBERSORT as a more robust means for inferring cell-fractions in complex tissues.
Another important novel insight of our study is that inference can be improved, albeit only marginally, by incorporating cell-type specific DHS information when constructing the reference DNAm database. We demonstrated this not only in the context of blood tissue, but also for mixtures involving epithelial cell subtypes. To understand why the improvement is only marginal, we note that supervised selection of DMCs from a training set, for instance, by selecting DMCs that show big (i.e. 80% or over) differences in DNAm between cell-types, will almost always identify true DMCs. Supervised selection tends to favour DMCs with larger differences in average DNAm which serve better for the statistical deconvolution problem. Thus, while prior biological knowledge can help remove a few false positives, this leads to only a relatively minor improvement in the quality of the inference. It will be interesting to explore if further improvements are possible, for instance, using predicted cell-type specific DMCs , although we anticipate that such prior information will turn out to be more fruitful in the context of “semi-reference-free” approaches such as RUV .
We further showcased EpiDISH in an EWAS of smoking, where it was found to identify twice as many smoking-associated DMCs than an unadjusted analysis. Importantly, some of the additional sDMCs mapped to genes (e.g. AHRR, PTK2, MYO1G, LRP5) which are well-known to be associated with differential methylation in smokers , thus confirming that EpiDISH leads to an increase in sensitivity. Although similar improvements were possible with CIBERSORT and CP, CP suffered from a higher FDR. Thus, our detailed analysis on experimental and in-silico generated mixtures, which suggests improved modelling with a non-constrained technique such as RPCs, appears to also hold true on real data, although we caution that analyses in more smoking-EWAS will be needed to reach a robust conclusion.
Our analysis focused entirely on Illumina 450k data. While this may be seen as a limitation, in light of the recent arrival of the newer 850k version , the overwhelming majority of the 450k probes are present in the new beadarray version, rendering the extension to 850k data relatively trivial. Likewise, given the relatively good agreement between Illumina 450k and WGBS/RRBS data , and given that in both cases DNAm data can be treated in the beta-value basis, we would envisage that application of EpiDISH will carry over to such data, although studies demonstrating this will be needed.
In summary, we have presented a novel reference-based algorithm, EpiDISH, for in-silico deconvolution of DNA methylation data, which compares very favorably in relation to the current gold-standard. We recommend the use of EpiDISH for dissection of intra-sample heterogeneity in EWAS, and to this purpose we provide the community with an R-package (EpiDISH), freely available from https://github.com/sjczheng/EpiDISH.
DNAse Hypersensitive Site
Differentially methylated CpG
False discovery rate
Removing unwanted variation
Smoking-associated differentially methylated CpG.
Alizadeh AA, Aranda V, Bardelli A, Blanpain C, Bock C, Borowski C, Caldas C, Califano A, Doherty M, Elsner M, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med. 2015;21:846–53.
Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31:142–7.
Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 2014;15:R31.
Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12:529–41.
Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinf. 2012;13:86.
Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J: Epigenome-wide association studies without the need for cell-type composition. Nat Methods. 2014;11(3):309–11.
Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinf. 2015;16:95.
Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014;30:1431–9.
Zheng X, Zhao Q, Wu HJ, Li W, Wang H, Meyer CA, Qin QA, Xu H, Zang C, Jiang P, et al. MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes. Genome Biol. 2014;15:419.
Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–7.
Yoshihara K, Shahmoradgoli M, Martinez E, Vegesna R, Kim H, Torres-Garcia W, Trevino V, Shen H, Laird PW, Levine DA, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.
Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Gayther SA, Apostolidou S, Jones A, Lechner M, Beck S, Jacobs IJ, Widschwendter M. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4:e8274.
Langevin SM, Koestler DC, Christensen BC, Butler RA, Wiencke JK, Nelson HH, Houseman EA, Marsit CJ, Kelsey KT. Peripheral blood DNA methylation profiles are indicative of head and neck squamous cell carcinoma: an epigenome-wide association study. Epigenetics. 2012;7:291–9.
Vogt H, Hofmann B, Getz L: The new holism: P4 systems medicine and the medicalization of health and life itself. Med Health Care Philos. 2016;19(2):307–23.
Timp W, Bravo HC, McDonald OG, Goggins M, Umbricht C, Zeiger M, Feinberg AP, Irizarry RA: Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors. Genome Med. 2014;6(8):61.
Teschendorff AE, Gao Y, Jones A, Ruebner M, Beckmann MW, Wachter DL, Fasching PA, Widschwendter M. DNA methylation outliers in normal breast tissue identify field defects that are enriched in cancer. Nat Commun. 2016;7:10478.
Roadmap Epigenomics C, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.
Koestler DC, Christensen B, Karagas MR, Marsit CJ, Langevin SM, Kelsey KT, Wiencke JK, Houseman EA. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics. 2013;8:816–26.
Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 2014;15:R50.
Koestler DC, Jones MJ, Usset J, Christensen BC, Butler RA, Kobor MS, Wiencke JK, Kelsey KT. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC Bioinf. 2016;17:120.
Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489:75–82.
Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100.
Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlen SE, Greco D, Soderhall C, Scheynius A, Kere J. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One. 2012;7:e41361.
Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3:Article3.
Slieker RC, Bos SD, Goeman JJ, Bovee JV, Talens RP, van der Breggen R, Suchiman HE, Lameijer EW, Putter H, van den Akker EB, et al. Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450 k array. Epigenetics Chromatin. 2013;6:26.
Zilbauer M, Rayner TF, Clark C, Coffey AJ, Joyce CJ, Palta P, Palotie A, Lyons PA, Smith KG. Genome-wide methylation analyses of primary human leukocyte subsets identifies functionally important cell-type-specific hypomethylated regions. Blood. 2013;122:e52–60.
Lowe R, Overhoff MG, Ramagopalan SV, Garbe JC, Koh J, Stampfer MR, Beach DH, Rakyan VK, Bishop CL. The senescent methylome and its relationship with cancer, ageing and germline genetic variation in humans. Genome Biol. 2015;16:194.
Nazor KL, Altun G, Lynch C, Tran H, Harness JV, Slavin I, Garitaonandia I, Muller FJ, Wang YC, Boscolo FS, et al. Recurrent variations in DNA methylation in human pluripotent stem cells and their differentiated derivatives. Cell Stem Cell. 2012;10:620–34.
Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015;7:113.
Teschendorff AE, Yang Z, Wong A, Pipinikas CP, Jiao Y, Jones A, Anjum S, Hardy R, Salvesen HB, Thirlwell C, et al. Correlation of Smoking-Associated DNA Methylation Changes in Buccal Cells With DNA Methylation Changes in Epithelial Cancer. JAMA Oncol. 2015;1:476–85.
Tsaprouni LG, Yang TP, Bell J, Dick KJ, Kanoni S, Nisbet J, Vinuela A, Grundberg E, Nelson CP, Meduri E, et al. Cigarette smoking reduces DNA methylation levels at multiple genomic loci but the effect is partially reversible upon cessation. Epigenetics. 2014;9:1382–96.
Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.
Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics. 2013;29:189–96.
Lowe R, Rakyan VK. Marmal-aid--a database for Infinium HumanMethylation450. BMC Bioinf. 2013;14:359.
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010;28:1045–8.
Zeilinger S, Kuhnel B, Klopp N, Baurecht H, Kleinschmidt A, Gieger C, Weidinger S, Lattka E, Adamski J, Peters A, et al. Tobacco smoking leads to extensive genome-wide changes in DNA methylation. PLoS One. 2013;8:e63812.
Shenker NS, Ueland PM, Polidoro S, van Veldhoven K, Ricceri F, Brown R, Flanagan JM, Vineis P. DNA methylation as a long-term biomarker of exposure to tobacco smoke. Epidemiology. 2013;24:712–6.
Zhang Y, Yang R, Burwinkel B, Breitling LP, Brenner H. F2RL3 methylation as a biomarker of current and lifetime smoking exposures. Environ Health Perspect. 2014;122:131–7.
Breitling LP, Yang R, Korn B, Burwinkel B, Brenner H. Tobacco-smoking-related differential DNA methylation: 27 K discovery and replication. Am J Hum Genet. 2011;88:450–7.
Besingi W, Johansson A. Smoke-related DNA methylation changes in the etiology of human disease. Hum Mol Genet. 2014;23:2290–7.
Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat Biotechnol. 2015;33:364–76.
Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–52.
Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8:389–99.
Titus AJ, Houseman EA, Johnson KC, Christensen BC: methyLiftover: cross-platform DNA methylation data integration. Bioinformatics. 2016;32(16):2517–9.
This work was supported by a Royal Society Newton Advanced Fellowship (AET & SB), (NAF project number: 522438, NAF award number: 164914) The authors also wish to thank the Chinese Academy of Sciences, Shanghai Institute for Biological Sciences and the Max-Planck Society for financial support and the NSFC (National Science Foundation of China) under grant number 31571359. CEB was supported by a PhD fellowship from the EU-FP7 project EpiTrain (316758).
Availability of data and materials
All data analysed in this study is publicly available from repositories as indicated in Methods. EpiDISH is freely available as an R-package from github: https://github.com/sjczheng/EpiDISH.
AET performed most of the statistical analyses and wrote the manuscript. SCZ contributed to data analysis. CEB and SB provided data. Study was conceived by AET with input from CEB and SB. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent to publish
Ethics approval and consent to participate
Contains all Supplementary Figures and Supplementary Tables plus their captions: Figure S1. Clustering validation of reference blood database. Figure S2. Validation of EpiDISH. Figure S3. Improvement of inference using DHS data in 3 independent data sets. Figure S4. Comparison of reference-based methods. Figure S5. Comparative performance of reference-based methods in an EWAS of smoking. Table S1. The blood reference database. Table S2. Gold-standard list of 62 smoking-associated DMCs (sDMCs). Table S3. Smoking-associated DMCs with no adjustment for cell-type composition. Table S4. Smoking-associated DMCs as obtained using EpiDISH. (PDF 1241 kb)
An R-script implementing EpiDISH with either robust partial correlations (RPC), CIBERSORT (CBS) or Constrained Projection (CP). (R 5 kb)
About this article
Cite this article
Teschendorff, A.E., Breeze, C.E., Zheng, S.C. et al. A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinformatics 18, 105 (2017). https://doi.org/10.1186/s12859-017-1511-5
- Cellular heterogeneity
- DNA methylation