A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies
© The Author(s). 2017
Received: 27 May 2016
Accepted: 31 January 2017
Published: 13 February 2017
Intra-sample cellular heterogeneity presents numerous challenges to the identification of biomarkers in large Epigenome-Wide Association Studies (EWAS). While a number of reference-based deconvolution algorithms have emerged, their potential remains underexplored and a comparative evaluation of these algorithms beyond tissues such as blood is still lacking.
Here we present a novel framework for reference-based inference, which leverages cell-type specific DNAse Hypersensitive Site (DHS) information from the NIH Epigenomics Roadmap to construct an improved reference DNA methylation database. We show that this leads to a marginal but statistically significant improvement of cell-count estimates in whole blood as well as in mixtures involving epithelial cell-types. Using this framework we compare a widely used state-of-the-art reference-based algorithm (called constrained projection) to two non-constrained approaches including CIBERSORT and a method based on robust partial correlations. We conclude that the widely-used constrained projection technique may not always be optimal. Instead, we find that the method based on robust partial correlations is generally more robust across a range of different tissue types and for realistic noise levels. We call the combined algorithm which uses DHS data and robust partial correlations for inference, EpiDISH (Epigenetic Dissection of Intra-Sample Heterogeneity). Finally, we demonstrate the added value of EpiDISH in an EWAS of smoking.
Estimating cell-type fractions and subsequent inference in EWAS may benefit from the use of non-constrained reference-based cell-type deconvolution methods.
KeywordsCellular heterogeneity DNA methylation EWAS
One of the great challenges facing the omics community is that posed by intra-sample heterogeneity (ISH) . Molecular profiles derived from complex tissues such as blood or breast represent averages over many different cell types. Since cell-type composition of complex tissues varies in response to phenotypes such as cancer or age, correcting for changes in tissue composition could be crucial if one wishes to identify potentially causal alterations in individual cell-types [2, 3]. This is particularly relevant for Epigenome-Wide Association Studies (EWAS) where the effect sizes of interest could be small compared to changes in tissue composition [3, 4]. Because of this, a number of different ISH deconvolution algorithms for genome-wide DNA methylation data have recently been proposed [5–9]. As with the analogous tools developed for mRNA expression data [10, 11], these algorithms can be classified as either “reference-free” [6, 8] or “reference-based” , depending on whether they use an a-priori database of cell-type specific DNA methylation (DNAm) reference profiles to perform the deconvolution. Although reference-free methods have the advantage that they don’t require such a DNAm database and are therefore applicable, in principle, to any tissue-type, these algorithms have only been tested in blood and don’t provide sample-specific absolute estimates of cellular proportions. Obtaining absolute estimates of underlying cell-type proportions in a tissue is an important task, as shifts in specific cell subtypes within the tissue could be used for diagnostic or prognostic purposes [12, 13], as well as providing useful mechanistic insight into systems medicine . A further problem with reference-free methods is that they often rely on the assumption that the top components of variation correlate with cell-type composition, an assumption which may not always hold, and which if violated could lead to loss of biological signal. Indeed, a clear example of this can be seen in a study which used a reference-free method called EWASher , concluding that only a handful of CpGs are differentially methylated between breast cancer and normal breast tissue, clearly contradicting all the available evidence that DNA methylation differences between breast cancer and normal tissue is widespread and largely independent of changes in tissue composition [15, 16]. Thus, reference-based approaches appear to be the safest option provided one can construct a reference DNAm database for the tissue of interest. As representative DNAm profiles for all human normal cell types accrue , reference-based methods are therefore more likely to become the framework of choice for tackling the ISH problem.
So far however, only one reference-based algorithm (the Houseman algorithm) has been proposed for DNAm data  (see also [18, 19]). Houseman’s algorithm performs inference using a quadratic programming technique known as linear constrained projection (CP), where non-negativity and normalization constraints on cellular proportions are imposed during inference. Interestingly, in the gene expression context, a recent comparative study concluded that CP is outperformed by a non-constrained reference-based approach which relies on the technique of Support Vector Regressions (SVR) . Another technique based on robust partial correlations (RPC)  has also not yet been explored in the context of DNA methylation data. Thus, there is an urgent need to conduct a comprehensive comparative evaluation of different reference-based deconvolution algorithms. Here, we perform such a comparison using experimental as well as computationally generated cellular mixtures, and including epithelial cell-types as well as leukocytes.
The importance of the reference database itself for the quality of the inference has also been noted before [10, 20]. Motivated by this, we here present a novel approach for the construction of a reference DNAm database, which integrates prior biological knowledge of cell-type specific sites with cell-type specific DNA methylation. Specifically, given that DNAse Hypersensitive Sites (DHS) are highly cell-type specific , we use such DHS data from the NIH Epigenomics Roadmap  and ENCODE [22, 23], to improve the quality of the reference database. Sample-specific cell-type proportions can then be estimated using either Houseman’s CP algorithm or a non-constrained approach such as SVR or RPC.
We show that incorporation of such prior biological knowledge can improve inference, and that Houseman’s algorithm is only optimal in the scenario of very noisy data. For realistic noise levels, we discover that non-constrained methods like RPC generally perform better. Hence, we propose an improved novel framework for reference-based inference of cell-type composition, called EpiDISH (Epigenetic Dissection of Intra-Sample-Heterogeneity), which uses (i) DHS data from the NIH Roadmap and ENCODE to construct a reference DNAm database, and (ii) RPC for estimation of cell-type proportions.
Reference-based algorithms for deconvolution of intra-sample heterogeneity
For 3 algorithms, the constraint that weights have to be non-negative and add to 1, is implemented a posteriori, i.e. after inference of the weights themselves. Specifically, we follow the procedure implemented in  and set any negative weight estimates to zero, followed by a scaling to ensure that the non-zero positive weights add to 1. The 3 algorithms which enforce the constraints a posteriori are (i) multivariate linear regression or partial correlations (LR), (ii) robust multivariate linear regression or robust partial correlations (RLR/RPC) and (iii) Support Vector Regressions (SVR), an advanced form of robust penalized multivariate regression. In the case of SVR, we used the implementation called CIBERSORT . For LR and RLR/RPC we used the lm and rlm R-functions (www.r-project.org), to perform the multivariate regressions.
The 4th algorithm performs the inference of the weights in a least squares sense but imposes the positivity and normalization constraints as part of the inference process. This technique is known as linear constrained projection (CP) and weights can be inferred using quadratic programming (QP) [18, 19]. In implementing CP/QP there are in principle two options for the normalization constraint: one can implement a strict equality which requires the weights to add to 1, or one can implement the normalization as an inequality constraint, in which case the weights are only required to add to a number less or equal to 1. Here we implement the CP algorithm using the normalization as an inequality constraint. In effect, modulo the reference database, this algorithm is the reference-based Houseman algorithm . Differences between the two implementations of CP are relatively minor since in this work we evaluate methods in tissues where all the major cell subtypes are known and for which reference DNAm profiles exist.
Construction of integrated DHS reference DNA methylation databases
Main Illumina 450k DNAm datasets used. We list the main datasets used in this study, the cell-types/tissue profiled, whether the data was used for reference database construction (if yes, we specify which cell-types were used), whether the data was used for validation/evaluation purposes (if yes, we specify which cell-types were used) and the reference/citation
Use in Reference DNAm Database
Reinius et al.
WB, PBMC, NK, B, CD4T, CD8T, Monoc, Neutro, Eosino. (n = 6 of each)
NK, B, CD4T, CD8T, Monoc., Neutro., Eosino.
WB & PBMC
Liu et al.
WB (n = 335 controls, n = 354 rheumathoid arthritis cases)
Average Flow Cytometry estimates for cases and controls
Koestler et al.
WB (n = 18)
12 Reconstructed WB mixtures + 6 WB samples with Flow Cytometry estimates
Zilbauer et al.
PBMC, CD4T, CD8T, NK, B, Monoc, Neutro. (n = 6 of each)
In-silico mixtures of purified blood cell subtypes
HMEC, HRCE, IMR90, Liver
Slieker et al.
HRCE, Pancreas, IMR90
Lowe et al.
Teschendorff et al.
WB (n = 152)
Smoking associated DMCs
In the case of blood tissue we used the purified blood cell Illumina 450k data from Reinius et al. . Specifically, we used the purified cell data of Monocytes, Neutrophils, Eosinophils, CD4+ T-cells, CD8+ T-cells, Natural Killer (NK) cells and B-cells. There were 6 samples for each cell-type coming from 6 different individuals. We used a well-known empirical Bayes framework of moderated t-statistics  to derive differentially methylated CpGs (DMCs) between one of the 7 cell types and the rest using a false discovery rate (FDR) threshold of 0.05. Separately to this, we also identified all Illumina 450k probes that mapped to a DNase Hypersensitive Site (DHS) in any of the considered blood cell subtypes using data from the NIH Epigenomics Roadmap. DHS data was available for Monocytes, B-cells, T-cells and NK-cells. For each cell-type we then filtered DMCs to include only those mapping to a DHS, which we call DHS-DMCs. This resulted in 14105 B-cell, 7723 NK-cell, 12118 CD4+ T-cell, 38131 CD8+ T-cell, 11289 Monocyte, 2375 Neutrophil and 11515 Eosinophil DHS-DMCs. We then ranked these DHS-DMCs according to the mean difference in DNAm, thus favouring DHS-DMCs with large mean differences (i.e. delta beta-value > 0.8). For each cell-type we picked the top 50 DHS-DMCs. Across all 7 cell subtypes, there were 333 unique DHS-DMCs. DNAm reference centroids were then calculated as the average over the 6 samples of each purified blood cell subtype and for each of these 333 CpGs, resulting in a blood DNA methylation reference database of 333 DHS-DMCs and 7 blood cell subtypes.
Mixed epithelial cells
As a means of testing the statistical algorithms in an epithelial context, we sought to identify at least 3 human epithelial cell subtypes for which Illumina 450k DNAm data was available, generated as part of at least 2 independent studies, in order to use one study for reference database construction and another for validation (generation of in-silico mixtures). In addition, we also required DHS data from ENCODE or the NIH Epigenomics Roadmap for these cell-types. We identified DHS data from the NIH Roadmap for breast and pancreatic cells, whilst from ENCODE we obtained DHS data for human renal cortical epithelial (HRCE) cells. For human mammary epithelial cells (HMECs) and HRCE cells, Illumina 450k data was available from ENCODE, whereas for pancreas we used corresponding 450k data from Slieker et al. . In this case, because in some cases we only had 1 representative sample for each cell-type, we selected DMCs according to a difference in DNAm beta-value being larger than 0.9. This differential DNAm analysis was only performed for CpGs that mapped to a DHS in either HMECs, HRCEs or pancreas. Thus, this procedure is extremely stringent as we select DHS CpGs whose DNAm varies as much as possible between cell-types. This resulted in 272 DMCs between HMECs and HRCEs, 315 CpGs between HMECs and pancreas, and 54 DMCs between HRCEs and pancreas. In total, there were 575 unique DMCs, resulting in a DNAm reference database of 575 CpGs and 3 cell-types (breast, HRCEs and pancreas).
Mixed epithelial and non-epithelial cells
We identified DHS data from the NIH Roadmap for a fetal lung fibroblast cell-line (IMR90) and for B-cells, whilst from ENCODE we obtained DHS data for hepatocytes. For IMR90 and hepatocytes Illumina 450k data was available from ENCODE, whereas for B-cells we used the data from Reinius et al. . In this case too, because in some cases we only had 1 representative sample for each cell-type, we selected DMCs according to a difference in DNAm beta-value being larger than 0.9. This differential DNAm analysis was only performed for CpGs that mapped to a DHS in either hepatocytes, IMR90 or B-cells. Thus, this procedure is extremely stringent as we select DHS CpGs whose DNAm varies as much as possible between cell-types. This resulted in 181 DMCs between IMR90 and liver, 405 CpGs between IMR90 and B-cells, and 482 DMCs between liver and B-cells. In total, there were 912 unique DMCs, resulting in a DNAm reference database of 912 CpGs and 3 cell-types (IMR90, liver and B-cells).
Non-DHS based reference databases
For the two previously described scenarios, we also derived reference DNAm databases without restriction to DHSs, but ensuring that reference databases were of approximately the same size as the DHS-based ones. In the case of breast, pancreas and HRCEs, this resulted in a 558 DMC x 3 cell-type reference DNAm matrix. In the case of IMR90, liver and B-cells, the reference DNAm matrix was of dimension 975 DMCs and 3 cell-types. In the case of blood, the corresponding “non-DHS” reference database was defined over 339 DMCs.
In what follows we briefly describe the Illumina 450k datasets used to validate and compare the different algorithms (see also Table 1).
One dataset profiling whole blood for over 650 samples (encompassing both rheumathoid arthritis (RA) cases and controls) was available from Liu et al. . For this dataset, there were average flow-cytometric estimates for all major blood cell subtypes for RA cases and controls. Another dataset (Koestler et al.) profiled 6 whole blood (WB) and 12 experimentally reconstructed “whole blood” mixtures . In the former case, flow-cytometric estimates for the different blood cell subtypes were available for each of the 6 WB samples. In the latter case, the mixing proportions were determined by the experimentalist and therefore known without error. We also considered an additional Illumina 450k dataset from Zilbauer et al., which profiled 5 blood cell subtypes (Monocytes, Neutrophils, B-cells, CD4+ and CD8+ T-cells) with 6 replicates of each .
Other cell types
For validating and assessing the algorithms in the context of the mixed epithelial cell type scenario, we used as validation, Illumina 450k DNAm data for HMECs from Lowe et al. , and Illumina 450k data for HRCEs and adult male & female pancreas from the Stem-Cell-Matrix Compendium-2 (SMC2) . For the case of the mixed epithelial/non-epithelial cell types, we used Illumina 450k DNAm data of IMR90 (fetal lung fibroblast) from SCM2, for liver cells from Slieker et al. , and purified B-cells from Zilbauer et al. .
Validation and evaluation strategy based on in-silico mixtures
For evaluation and comparison of the statistical algorithms, we generated 100 different in-silico mixtures of the purified cell DNAm profiles, with weights chosen randomly from a uniform (0,1) distribution, subject to the constraints that weights add to 1. Performance of each algorithm was then assessed using the root mean square error (RMSE) between the estimated and true weights for each cell-type, as estimated over the 100 different mixtures. R2 values between estimated and true weights for each cell-type were also computed over the 100 different mixtures. Average RMSE and R2 values over cell-types were also calculated. Finally, this procedure was repeated for a total of 25 Monte Carlo runs, yielding a total of 25 average RMSE and R2 values. This overall strategy was used for validation/testing purposes in the Zilbauer dataset, as well as for testing the methods in the mixed epithelial and mixed epithelial/non-epithelial scenarios.
We note that in this work we always test methods on data which is independent from the data used to construct the reference database. This already implicitly assesses the performance of the algorithms under noise levels which one may encounter between two similar experiments performed by different labs and personnel. However, in order to assess the algorithms under increasing levels of noise, we also added Gaussian Noise of increasing variation to the mixtures. The addition of Gaussian Noise was done in the M-value basis, with data subsequently transformed back to the beta-valued basis for application of the algorithms. Specifically, we considered 7 increasing levels of standard deviation noise: SD = 0, 1, 2, 3, 4, 5, 6. For a DMC that differs between 2 cell-types by an amount of 0.8 (in a beta-valued basis), this corresponds roughly to a difference in the M-value basis of approx. 6. Thus, the case SD = 6, can be seen as an extreme case of noise. The case SD = 1 corresponds to a typical deviation in the beta-value basis of approximately β (1-β)/(1 + β), i.e. an approx. 5% change for a CpG which is say unmethylated (β = 0.05), and an approx. 16% change for a CpG which is partially methylated (β = 0.4). Thus the case SD = 1 is a fairly realistic scenario given known noise levels in Illumina 450k data.
Definition of a gold-standard list of smoking-associated DMCs and of a true negative list
Smoking and whole blood is the ideal scenario in which to compare methods for cell-type correction, since many whole blood EWAS have reported strong consistency of smoking-associated DMCs (reviewed in ). Hence, to define a gold-standard list of sDMCs and associated genes, we used the curated table of Gao et al. . Specifically, we defined a gold-standard list of sDMCs defined by Illumina 450k probes which have been associated with smoking in at least 3 independent studies. This resulted in a total of 62 gold-standard sDMCs, implicating a total of 15 unique genes. Using this gold-standard list to declare a list of true positives, we then assessed sensitivity of the methods in an independent whole blood EWAS of 152 samples . We note that this EWAS was not among the ones reviewed by Gao et al. and hence is truly independent.
Definition of a true negative CpG (i.e. one not associated with smoking) is much harder. However, to construct an approximate set of true negative CpGs, we used the intersection of 450k probes not associated with smoking in 3 independent EWAS studies. These EWAS studies were a (i) whole blood set of 464 samples from Tsaprouni et al.  (GEO: GSE50660), (ii) a set of 333 whole blood samples from healthy controls and (iii) corresponding 354 whole blood sample from rheumathoid arthritis cases, all from Liu et al. (GEO: GSE42861) . For all 3 sets, smoking status information was available. In the case of Tsaprouni et al., processed data was downloaded from GEO. In the case of Liu et al., we processed raw data with minfi . All 450k data was corrected for type-2 probe bias using BMIQ . P-values of association with smoking (treated as ordinal variable, 0 = non-smoker, 1 = ex-smoker, 2 = current-smoker) was determined by linear regression in each set. We then defined CpGs as not associated with smoking if their P-value > 0.25 in each of the 3 datasets, resulting in 89290 true negative (TN) CpGs.
The blood reference DNAm database for 333 DHS-DMCs and 7 blood cell subtypes is provided (Table S1 in Additional File 1). A user-friendly R-script implementing EpiDISH is available (Additional File 2). EpiDISH is also freely available as an R-package from github: https://github.com/sjczheng/EpiDISH
The EpiDISH algorithm and validation in blood using flow-cytometry
We first considered the case of blood tissue, a complex tissue for which the main constituent cell types are well known and which is being used extensively in EWAS . We constructed a blood reference database using Illumina 450k DNAm profiles for a total of 7 purified blood cell subtypes (B-cells, NK-cells, CD4+ T-cells, CD8+ T-cells, Monocytes, Neutrophils and Eosinophils) obtained from , integrating it with blood cell subtype specific DNAse Hypersensitive Sites (DHS), as obtained from the NIH Epigenomics Roadmap , and further filtering the 450k probes for differential methylation between every pair of blood cell subtypes, resulting in an integrated blood reference DNAm database of 333 CpGs and 7 blood cell subtypes (Methods, Table 1, Table S1 in Additional File 1). As a sanity check, we verified that the original purified samples segregated according to cell subtype when clustered over these 333 CpGs (Figure S1 in Additional File 1). Blood cell subtype proportions obtained with EpiDISH on whole blood (WB) and peripheral blood mononuclear cells (PBMC) from the same study were also in line with known proportions, i.e. strongest enrichment for neutrophils in WB and with lymphocytes making the dominant component of PBMCs (Figure S2 in Additional file 1).
EpiDISH correctly infers blood cell subtype proportions in reconstructed whole blood samples
Flow cytometric estimates of blood cell subtype proportions are also subject to error. Hence, we further assessed EpiDISH in its predictions to correctly infer cell subtype proportions from 12 reconstructed whole blood samples where the exact mixing proportions are known . For all those cell subtypes whose proportions exhibited a reasonable dynamic range in the experimental mixtures, EpiDISH obtained R2 values above 0.95, with an average RMSE of 2.6%, confirming that EpiDISH can accurately quantify blood cell subtype proportions (Fig. 2c).
Using DHS information marginally improves the quality of the reference database
EpiDISH compares favorably to other reference-based methods
Constrained Projection reveals added robustness under higher levels of noise
EpiDISH improves sensitivity in an EWAS of smoking in whole blood
Comparison of the relative sensitivity and specificity. Smoking-associated DMCs (sDMCs) were defined at FDR < 0.05 and FDR < 0.3 and for each of 5 methods in the 152 whole blood Illumina 450k samples from the MRC1946 birth cohort data
nTP = 62, nTN = 89290
FDR < 0.05
| TP ∩ sDMC |
FP = | TN ∩ sDMC |
FDR < 0.3
| TP ∩ sDMC |
FP = | TN ∩ sDMC |
The aim of this study was to compare different reference-based methods in order to establish whether there is an optimal approach. This is particularly pertinent given that only one reference-based approach (Houseman’s algorithm) has been applied to DNA methylation data. As our study has demonstrated, Houseman’s algorithm, which relies on the CP technique, is in fact not the most robust method. While we did not identify a single method which always outperformed all others, we found that non-constrained techniques such as partial correlations and support vector regression provided a more robust inference framework than CP, particularly for more realistic noise levels. This is consistent with the results obtained by Newman et al. on gene expression data . In our study, CP only emerged as the optimal choice if data was subjected to a large amount of noise, typically larger than what is encountered in real data. Hence, our study advances upon the state-of-the-art, and proposes the use of RPC or CIBERSORT as a more robust means for inferring cell-fractions in complex tissues.
Another important novel insight of our study is that inference can be improved, albeit only marginally, by incorporating cell-type specific DHS information when constructing the reference DNAm database. We demonstrated this not only in the context of blood tissue, but also for mixtures involving epithelial cell subtypes. To understand why the improvement is only marginal, we note that supervised selection of DMCs from a training set, for instance, by selecting DMCs that show big (i.e. 80% or over) differences in DNAm between cell-types, will almost always identify true DMCs. Supervised selection tends to favour DMCs with larger differences in average DNAm which serve better for the statistical deconvolution problem. Thus, while prior biological knowledge can help remove a few false positives, this leads to only a relatively minor improvement in the quality of the inference. It will be interesting to explore if further improvements are possible, for instance, using predicted cell-type specific DMCs , although we anticipate that such prior information will turn out to be more fruitful in the context of “semi-reference-free” approaches such as RUV .
We further showcased EpiDISH in an EWAS of smoking, where it was found to identify twice as many smoking-associated DMCs than an unadjusted analysis. Importantly, some of the additional sDMCs mapped to genes (e.g. AHRR, PTK2, MYO1G, LRP5) which are well-known to be associated with differential methylation in smokers , thus confirming that EpiDISH leads to an increase in sensitivity. Although similar improvements were possible with CIBERSORT and CP, CP suffered from a higher FDR. Thus, our detailed analysis on experimental and in-silico generated mixtures, which suggests improved modelling with a non-constrained technique such as RPCs, appears to also hold true on real data, although we caution that analyses in more smoking-EWAS will be needed to reach a robust conclusion.
Our analysis focused entirely on Illumina 450k data. While this may be seen as a limitation, in light of the recent arrival of the newer 850k version , the overwhelming majority of the 450k probes are present in the new beadarray version, rendering the extension to 850k data relatively trivial. Likewise, given the relatively good agreement between Illumina 450k and WGBS/RRBS data , and given that in both cases DNAm data can be treated in the beta-value basis, we would envisage that application of EpiDISH will carry over to such data, although studies demonstrating this will be needed.
In summary, we have presented a novel reference-based algorithm, EpiDISH, for in-silico deconvolution of DNA methylation data, which compares very favorably in relation to the current gold-standard. We recommend the use of EpiDISH for dissection of intra-sample heterogeneity in EWAS, and to this purpose we provide the community with an R-package (EpiDISH), freely available from https://github.com/sjczheng/EpiDISH.
DNAse Hypersensitive Site
Differentially methylated CpG
False discovery rate
Removing unwanted variation
Smoking-associated differentially methylated CpG.
This work was supported by a Royal Society Newton Advanced Fellowship (AET & SB), (NAF project number: 522438, NAF award number: 164914) The authors also wish to thank the Chinese Academy of Sciences, Shanghai Institute for Biological Sciences and the Max-Planck Society for financial support and the NSFC (National Science Foundation of China) under grant number 31571359. CEB was supported by a PhD fellowship from the EU-FP7 project EpiTrain (316758).
Availability of data and materials
All data analysed in this study is publicly available from repositories as indicated in Methods. EpiDISH is freely available as an R-package from github: https://github.com/sjczheng/EpiDISH.
AET performed most of the statistical analyses and wrote the manuscript. SCZ contributed to data analysis. CEB and SB provided data. Study was conceived by AET with input from CEB and SB. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent to publish
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Alizadeh AA, Aranda V, Bardelli A, Blanpain C, Bock C, Borowski C, Caldas C, Califano A, Doherty M, Elsner M, et al. Toward understanding and exploiting tumor heterogeneity. Nat Med. 2015;21:846–53.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31:142–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 2014;15:R31.View ArticlePubMedPubMed CentralGoogle Scholar
- Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12:529–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinf. 2012;13:86.View ArticleGoogle Scholar
- Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J: Epigenome-wide association studies without the need for cell-type composition. Nat Methods. 2014;11(3):309–11.
- Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinf. 2015;16:95.View ArticleGoogle Scholar
- Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014;30:1431–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Zheng X, Zhao Q, Wu HJ, Li W, Wang H, Meyer CA, Qin QA, Xu H, Zang C, Jiang P, et al. MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes. Genome Biol. 2014;15:419.View ArticlePubMedPubMed CentralGoogle Scholar
- Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Yoshihara K, Shahmoradgoli M, Martinez E, Vegesna R, Kim H, Torres-Garcia W, Trevino V, Shen H, Laird PW, Levine DA, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.View ArticlePubMedPubMed CentralGoogle Scholar
- Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Gayther SA, Apostolidou S, Jones A, Lechner M, Beck S, Jacobs IJ, Widschwendter M. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4:e8274.View ArticlePubMedPubMed CentralGoogle Scholar
- Langevin SM, Koestler DC, Christensen BC, Butler RA, Wiencke JK, Nelson HH, Houseman EA, Marsit CJ, Kelsey KT. Peripheral blood DNA methylation profiles are indicative of head and neck squamous cell carcinoma: an epigenome-wide association study. Epigenetics. 2012;7:291–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Vogt H, Hofmann B, Getz L: The new holism: P4 systems medicine and the medicalization of health and life itself. Med Health Care Philos. 2016;19(2):307–23.
- Timp W, Bravo HC, McDonald OG, Goggins M, Umbricht C, Zeiger M, Feinberg AP, Irizarry RA: Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors. Genome Med. 2014;6(8):61.
- Teschendorff AE, Gao Y, Jones A, Ruebner M, Beckmann MW, Wachter DL, Fasching PA, Widschwendter M. DNA methylation outliers in normal breast tissue identify field defects that are enriched in cancer. Nat Commun. 2016;7:10478.View ArticlePubMedPubMed CentralGoogle Scholar
- Roadmap Epigenomics C, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–30.View ArticleGoogle Scholar
- Koestler DC, Christensen B, Karagas MR, Marsit CJ, Langevin SM, Kelsey KT, Wiencke JK, Houseman EA. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics. 2013;8:816–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 2014;15:R50.View ArticlePubMedPubMed CentralGoogle Scholar
- Koestler DC, Jones MJ, Usset J, Christensen BC, Butler RA, Kobor MS, Wiencke JK, Kelsey KT. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC Bioinf. 2016;17:120.View ArticleGoogle Scholar
- Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489:75–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.View ArticleGoogle Scholar
- Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100.View ArticlePubMedPubMed CentralGoogle Scholar
- Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlen SE, Greco D, Soderhall C, Scheynius A, Kere J. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One. 2012;7:e41361.View ArticlePubMedPubMed CentralGoogle Scholar
- Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3:Article3.
- Slieker RC, Bos SD, Goeman JJ, Bovee JV, Talens RP, van der Breggen R, Suchiman HE, Lameijer EW, Putter H, van den Akker EB, et al. Identification and systematic annotation of tissue-specific differentially methylated regions using the Illumina 450 k array. Epigenetics Chromatin. 2013;6:26.View ArticlePubMedPubMed CentralGoogle Scholar
- Zilbauer M, Rayner TF, Clark C, Coffey AJ, Joyce CJ, Palta P, Palotie A, Lyons PA, Smith KG. Genome-wide methylation analyses of primary human leukocyte subsets identifies functionally important cell-type-specific hypomethylated regions. Blood. 2013;122:e52–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Lowe R, Overhoff MG, Ramagopalan SV, Garbe JC, Koh J, Stampfer MR, Beach DH, Rakyan VK, Bishop CL. The senescent methylome and its relationship with cancer, ageing and germline genetic variation in humans. Genome Biol. 2015;16:194.View ArticlePubMedPubMed CentralGoogle Scholar
- Nazor KL, Altun G, Lynch C, Tran H, Harness JV, Slavin I, Garitaonandia I, Muller FJ, Wang YC, Boscolo FS, et al. Recurrent variations in DNA methylation in human pluripotent stem cells and their differentiated derivatives. Cell Stem Cell. 2012;10:620–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015;7:113.View ArticlePubMedPubMed CentralGoogle Scholar
- Teschendorff AE, Yang Z, Wong A, Pipinikas CP, Jiao Y, Jones A, Anjum S, Hardy R, Salvesen HB, Thirlwell C, et al. Correlation of Smoking-Associated DNA Methylation Changes in Buccal Cells With DNA Methylation Changes in Epithelial Cancer. JAMA Oncol. 2015;1:476–85.View ArticlePubMedGoogle Scholar
- Tsaprouni LG, Yang TP, Bell J, Dick KJ, Kanoni S, Nisbet J, Vinuela A, Grundberg E, Nelson CP, Meduri E, et al. Cigarette smoking reduces DNA methylation levels at multiple genomic loci but the effect is partially reversible upon cessation. Epigenetics. 2014;9:1382–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics. 2013;29:189–96.View ArticlePubMedGoogle Scholar
- Lowe R, Rakyan VK. Marmal-aid--a database for Infinium HumanMethylation450. BMC Bioinf. 2013;14:359.View ArticleGoogle Scholar
- Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010;28:1045–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Zeilinger S, Kuhnel B, Klopp N, Baurecht H, Kleinschmidt A, Gieger C, Weidinger S, Lattka E, Adamski J, Peters A, et al. Tobacco smoking leads to extensive genome-wide changes in DNA methylation. PLoS One. 2013;8:e63812.View ArticlePubMedPubMed CentralGoogle Scholar
- Shenker NS, Ueland PM, Polidoro S, van Veldhoven K, Ricceri F, Brown R, Flanagan JM, Vineis P. DNA methylation as a long-term biomarker of exposure to tobacco smoke. Epidemiology. 2013;24:712–6.View ArticlePubMedGoogle Scholar
- Zhang Y, Yang R, Burwinkel B, Breitling LP, Brenner H. F2RL3 methylation as a biomarker of current and lifetime smoking exposures. Environ Health Perspect. 2014;122:131–7.PubMedGoogle Scholar
- Breitling LP, Yang R, Korn B, Burwinkel B, Brenner H. Tobacco-smoking-related differential DNA methylation: 27 K discovery and replication. Am J Hum Genet. 2011;88:450–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Besingi W, Johansson A. Smoke-related DNA methylation changes in the etiology of human disease. Hum Mol Genet. 2014;23:2290–7.View ArticlePubMedGoogle Scholar
- Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat Biotechnol. 2015;33:364–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8:389–99.View ArticlePubMedGoogle Scholar
- Titus AJ, Houseman EA, Johnson KC, Christensen BC: methyLiftover: cross-platform DNA methylation data integration. Bioinformatics. 2016;32(16):2517–9.