 Methodology article
 Open access
 Published:
Cellcomposition effects in the analysis of DNA methylation array data: a mathematical perspective
BMC Bioinformatics volume 16, Article number: 95 (2015)
Abstract
Background
The impact of cellcomposition effects in analysis of DNA methylation data is now widely appreciated. With the availability of a reference data set consisting of DNA methylation measurements on isolated cell types, it is possible to impute cell proportions and adjust for them, but there is increasing interest in methods that adjust for cell composition effects when reference sets are incomplete or unavailable.
Results
In this article we present a theoretical basis for one such method, showing that the total effect of a phenotype on DNA methylation can be decomposed into orthogonal components, one representing the effect of phenotype on proportions of major cell types, the other representing either subtle effects in composition or global effects at focused loci, and that it is possible to separate these two types of effects in a finite data set. We demonstrate this principle empirically on nine DNA methylation data sets, showing that the first few principal components generally contain a majority of the information on celltype present in the data, but that later principal components nevertheless contain information about a small number of loci that may represent more focused associations. We also present a new method for determining the number of linear terms to interpret as cellmixture effects and demonstrate robustness to the choice of this parameter.
Conclusions
Taken together, our work demonstrates that referencefree algorithms for cellmixture adjustment can produce biologically valid results, separating cellmediated epigenetic effects (i.e. apparent effects arising from differences in cell composition) from those that are not cell mediated, and that in general the interpretation of associations evident from DNA methylation should be carefully considered.
Background
In the last decade, numerous publications have reported associations between DNA methylation profiles in a single tissue (the majority of studies published to date interrogate peripheral blood leukocyte mixtures) and disease states or exposure phenotypes. For example, DNA methylation profiles measured in blood have been shown to correlate with ovarian cancer [1], bladder cancer [2], cardiovascular disease [3], obesity [4], and environmental exposures [57]. These associations have led to an interest in epigenomewide association studies (EWAS), which aim to investigate associations between DNA methylation and health or exposure phenotypes across the genome. DNA methylation and coordinated chromatin alterations are partially responsible for coordination of gene expression in individual cells [810]. Consequently, normal tissue development, individual cellular differentiation and cellular lineage determination are regulated by epigenetic mechanisms [9]. This necessarily means that DNA methylation shows substantial variation across tissue types [11] as well as individual cell types, demonstrated particularly clearly amongst the distinct types of leukocytes [8]. This biological directive leads to significant and potentially underappreciated problems in interpreting the results of EWAS studies, which have predominantly utilized peripheral blood samples (whole blood or buffy coat) as the source of DNA for these analyses. There is a profound and fundamental difference between “environmentally induced DNA methylation” and environmentally induced (or, more properly, “biologically selected”) changes in the underlying landscape of cellular subtypes within a given sample. These involve completely distinct events  the former an intranuclear enzymatic action perhaps mediated at the level of the cell and the latter a signaling event that likely involves the entire immune cascade. There is limited evidence for the existence of the former phenomenon, e.g. numerous reports that collectively demonstrate a doserelated effect on DNA methylation along the AHRR gene in distinctly different tissues [7,12,13]. However, the second mechanism may predominate in most applications of EWAS, as has been demonstrated in a number of recent papers [1417]. In cases where potentially both phenomena are occurring, the phenotypic effect on the distribution of cell types within a sampled tissue, i.e. cellcomposition effects, represents a confounder of associations driven by the intranuclear activity.
Several algorithms have been proposed to address this issue in a mathematically simple manner. Houseman et al. (2012) proposed a procedure for imputing a restricted set of cell type proportions directly from DNA methylation data; a limitation of this approach is that only cell types for which a reference data set exists can be evaluated, i.e. an assembly of DNA methylation measurements on isolated cell types [18]. Liu et al. (2013) demonstrated the use of these proportions as adjustment covariates in an EWAS [17]. Subsequently, we recently proposed a referencefree algorithm that potentially obviates this problem, as it can be used in studies where no reference data set exists [19]; in the paper we validated the method against the referencebased approach, and this algorithm has already been applied successfully, e.g. to discover a DNA methylation biomarker for Wilms tumor [20]. Several similar algorithms have also recently been published, including a method specific to brain tissue [21] as well as the EWASher method [22], which is similar in spirit to our algorithm but reverses the role of dependent and independent variables. The referencefree algorithm employs a singular value decomposition (SVD) to separate cellcomposition effect from a direct effect (generally interpreted as the intranuclear activity). Thus, the algorithm entails very strong linearity assumptions, i.e. that the linear structures underlying variability in a data set will necessarily correspond to mixing of DNA methylation via cell composition. This is true also of EWASher, as well as the surrogate variable analysis (SVA) and independent SVA (ISVA) algorithms [23,24] that are similar to our proposed method (see below) and have previously been used to account both for technical artifacts and cell mixture effects. The goal of the present article is to justify such linearity assumptions from a rigorous biological and mathematical framework, demonstrating that the linear space of epigenetic effects can be partitioned into two subspaces, one representing cellmediated effects (i.e. those arising from differences in cell composition) and the other representing the remaining, noncellmediated effects. Projection of total epigenetic effect onto the latter space thus recovers the noncellmediated associations and our present work thus justifies the application of our previously published algorithm, as well as similar algorithms that rely on similar linear decompositions.
In general, referencebased approaches will be superior to referencefree approaches because the deconvolution required for estimating cell type proportions is essentially supervised by basis vectors that have direct biological interpretations. In the referencebased procedure, cell proportions are obtained by projecting wholetissue DNA methylation data onto linear spaces spanned by celltypespecific methylation profiles for a specific set of cyotosinephosphateguanine dinucleotides (CpGs) that distinguish the cell types, socalled differentially methylated regions (DMRs). However, we envision that epigenetic approaches will be applied in the future to tissues of diverse cell origin that contain unique combinations of cell type DMRs. In many of these cases, prior knowledge of major constituent cell types (their DMRs and their DNA methylation profiles) may be lacking. In such cases, the referencefree approach is readily applicable. However, even the complexity of a given sample may be unknown; even in blood, activation of distinct types of cells may play an important role in some diseases (e.g. [25]), in which case results of referencefree approaches could in principle depend upon the assumed level of complexity. Thus, critical to the algorithm is its sensitivity of results to choice of a parameter that represents the complexity, the dimension k of the latent linear association driven by cell composition, widely (though perhaps incorrectly) interpreted as the number of cell types. As we demonstrate below, choice of this parameter somewhat impacts the results of the algorithm. Consequently, as a second goal of this article, we also examine in detail the choice of dimension, its potential biological significance, and its impact on data analysis. In particular, we argue that, mathematically, major cellcomposition effects can be distinguished from other effects by a choice of two orthogonal vector spaces, and consequently the SVD of a certain matrix provides information about cellular type. By associating the largest singular values of this decomposition with the space corresponding to major cell types, it becomes possible to distinguish cellcomposition effects from other effects. The choice of the number k of singular values to associate with cellcomposition effects then drives the analytical results. Over a range of values for k, we apply this analysis to nine distinct data sets, showing that in most cases, results will remain stable for a wide range of choices of k. We also show that the original method proposed for selecting k may not always reliably find the stable range and propose a simple alternative procedure that is slightly more reliable. Finally, we discuss the implications of these results to biological interpretation of EWAS.
Results and discussion
Cellcomposition effects represent a potential mediator of associations observed between a phenotype (disease state or environmental exposure) and DNA methylation measured in a heterogeneous tissue, as well as a confound of “direct” associations (presumed to represent intranuclear activity or “direct” action of the exposure, producing DNA methylation without disturbing the cellular distribution). Under reasonable regression assumptions (no effect modification by cell composition and independence of cell composition with the errors in measurement of DNA methylation), several techniques are currently available for analyzing DNA methylation data while accounting for cellular heterogeneity. All of them assume essentially the following linear model for m CpG loci measured on n subjects:
where Y is an m × n matrix of average beta values, X is an n × d design matrix of phenotype variables and potential confounders (for a total of d covariates including an intercept), B is the m × d matrix of regression coefficients representing direct effects, MΩ ^{T} represents a linear mixture effect, with M an m × k′ matrix representing m CpGspecific methylation states for k′ cell types, Ω is an n × k′ matrix representing subjectspecific celltype distributions (each row representing the celltype proportions for a given subject), and E is an m × n matrix of errors with E(E) = 0 _{ m × n }. We discuss the meaning of “direct effect” below. Note that the value k′ is assumed to be known in advance, although we have earlier proposed estimating it by an application of random matrix theory originally described by Teschendorff et al. (2011) [24]. Note also that the entries of Y, of M, and of Ω are assumed to lie in the unit interval, and that the rows of Ω sum to one. Finally, note that (1) is true even when M, with a very large value of k′, exhaustively characterizes all possible types of cells in the target tissue, although with finite data set it may not be possible to estimate certain parameters in (1).
Linear characterization of cell mixture
Consider a particular cell type T, which may be as general as a leukocyte or a lymphocyte, or as specific as a CD4+ T lymphocyte or CD4+ regulatory T cell (T_{reg}). We assume T has methylation profile μ = (μ _{1}, …, μ _{ n })^{T} representing the mean methylation state for that type, where the implied expectation averages the methylation states of all subtypes with probability weights equal to the mean distribution of the subtypes in a target human population. As a concrete example, let us suppose T represents all CD4+ T lymphocytes. If now two subtly different subtypes (e.g. T_{reg} cells and helper T_{H} cells) are defined by differences in the epigenetic states of only the first r loci, e.g. between \( \left({\mu}_{10}^{*},\dots, {\mu}_{r0}^{*}\right) \) and \( \left({\mu}_{11}^{*},\dots, {\mu}_{r1}^{*}\right), \) and each of these types occurs in population average proportions ω _{0} and ω _{1} respectively (with ω _{0} + ω _{1} = 1), then the DNA methylation states of the two subtypes are respectively \( {\boldsymbol{\upmu}}_0={\left({\mu}_{10}^{*},\dots, {\mu}_{r0}^{*},{\mu}_{r+1},\dots, {\mu}_n\right)}^T \) and \( {\boldsymbol{\upmu}}_1={\left({\mu}_{11}^{*},\dots, {\mu}_{r1}^{*},{\mu}_{r+1},\dots, {\mu}_n\right)}^T, \) with μ = μ _{0} ω _{0} + μ _{1} ω _{1}. Note that this is true even for a very general type T such as all leukocytes, which may be conceived as a mixture of myeloid and lymphoid cell lineages. It will also be true of solid tissues; for example, even adipose tissue has been described to have multiple different types of adipocytes [26].
This concept is related to the idea of a recursive partitioning mixture model, which has been used previously in the analysis of DNA methylation data [27]. Commonly, normal constituent cell types of a complex tissue will have different functions, each represented by distinct epigenetic states, so that the cell types can be partitioned from “coarse” to “fine” by recursively partitioning each constituent type in the manner just described. However, in cases where a single alteration is observed across many types or even tissues (e.g. smokingrelated effects on the AHRR gene locus) it is possible to mathematically represent such an alteration as a shift in cell type, simply by taking r = 1 (or a small number) in the above partitioning analysis. For the moment we ignore the biological origin of such an alteration (widescale and dosedependent intranuclear action or signaling cascade and cell selection), noting that the mathematical representation is identical.
It follows from elementary linear algebra that n arbitrary cell types could be defined for an array with n loci (although likely many fewer would have biological meaning, i.e. correspond to true biological function). In other words, it is possible to define a linearly independent set of n celltype state profiles {μ _{1}, μ _{2}, …, μ _{ n }}. Theoretically it is also possible to define the set of n profiles with state vectors that are orthogonal to each other, although again it is extremely unlikely that such an orthogonal set of profiles will correspond to biologically and functionally meaningful cell types. For example, consider two gross subtypes μ _{0} = (0, μ _{2}, …, μ _{ n })^{T} and μ _{1} = (1, μ _{2}, …, μ _{ n })^{T}, differing in methylation state only at the first locus. Each of these states may be highly meaningful in a biological sense, but they are clearly not orthogonal for general values of 0 ≤ μ ≤ 1. To orthogonalize the set {μ _{0}, μ _{1}}, we must instead consider a decomposition such as {μ _{0}, ε _{1}}, where ε _{1} = (1, 0, …, 0)^{T} is a unit vector corresponding to (presumably) no biologically functional cell type. Thus, both profiles μ _{0} and μ _{1} may be obtained as linear combinations of μ _{0} and ε _{1}, but profile ε _{1} likely does not correspond to any biological function. Nevertheless, if we are concerned with a small relevant subset of k′ major subtypes \( \left\{{\boldsymbol{\upmu}}_1,{\boldsymbol{\upmu}}_2,\dots, {\boldsymbol{\upmu}}_{k^{\prime }}\right\}, \) it is possible to differentiate these types from more subtle forms of variation by decomposing the space spanned by {μ _{1}, μ _{2}, …, μ _{ n }} into a vector subspace \( \mathsf{M}= span\left\{{\boldsymbol{\upmu}}_1,{\boldsymbol{\upmu}}_2,\dots, {\boldsymbol{\upmu}}_{k^{\prime }}\right\} \) and its complementary vector space \( {\mathsf{M}}^{\perp } \), orthogonal to M. This is achieved by partitioning each major type into its constituent types in the manner described above, i.e. determining r loci that distinguish the subtypes from a parent type, and subsequently using a GramSchmidt orthonormalization procedure to orthogonalize the vector subspace corresponding to those loci. For example, consider type T and the subvector \( {\boldsymbol{\upmu}}^{*}={\left({\mu}_1^{*},\dots, {\mu}_r^{*}\right)}^T \) of mean states for the loci that distinguish the subtypes of T, and let \( \left\{{\boldsymbol{\upvarepsilon}}_1^{*},\dots, {\boldsymbol{\upvarepsilon}}_r^{*}\right\} \) be a corresponding set of r unit vectors. Starting with μ*, r  1 additional vectors can be chosen from \( span\left\{{\boldsymbol{\upvarepsilon}}_1^{*},\dots, {\boldsymbol{\upvarepsilon}}_r^{*}\right\} \) to form an orthogonal basis for \( span\left\{{\boldsymbol{\upmu}}^{*},{\boldsymbol{\upvarepsilon}}_1^{*},\dots, {\boldsymbol{\upvarepsilon}}_r^{*}\right\}, \), and these vectors define members of \( {\mathsf{M}}^{\perp } \) that together with \( \boldsymbol{\upmu} \in \mathsf{M} \) identify all subtypes of T. Thus, in general, matrix M in equation (1) may be augmented as \( \tilde{\mathbf{M}}=\left[{\mathbf{M}}^{(bio)},{\mathbf{M}}^{\left(\perp \right)}\right] \) to exhaustively characterize all types that can be defined within \( \tilde{\mathsf{M}}=\mathsf{M}\oplus {\mathsf{M}}^{\perp }, \) where the column vectors of M ^{(bio)} are orthogonal to the column vectors in M ^{(⊥)} and correspond to epigenetic states that are biologically relevant to major cell types. In particular, the column vectors of M ^{(⊥)} form a linear basis for the effects that are not mediated by cell type. We remark that \( \tilde{\mathsf{M}} \) is a vector space of dimension n, although only a convex subset of \( \tilde{\mathsf{M}} \) satisfies constraints on the values that mean methylation profiles can take, and a much smaller subset corresponds to biologically meaningful profiles.
Mediation by phenotypic cell composition effects
In addition to the linear model (1) defining mixtures of DNA methylation states, we assume that the matrix of celltype proportions Ω in (1) is a random variable potentially associated with X. Although a Dirichlet model would most appropriately model the rows of Ω, a reasonable and computationally efficient linear approximation is as follows:
where Γ is a d × k′ matrix of covariate effects upon cell proportion and Ξ is an n × k′ error matrix. Note that equation (1) explicitly omits interaction between X and Ω, which is likely adequate for most problems. With the additional assumption that E and Ξ are independent (and independent of X), and substituting (2) in (1), we have
The total effect of X upon Y is E(YX) = (B + MΓ ^{T})X ^{T}, the direct effect is BX ^{T}, and the mediated, or cellcomposition effect, is ΔX ^{T}, where Δ = MΓ ^{T}. Note that (3) can be written as Y = AX ^{T} + R,, where A = MΓ ^{T} + B and the error matrix R = MΞ + E includes a term that depends on the celltypespecific coefficient matrix M. Here, A is the total effect of phenotype matrix X. Following from the previous section, we replace M in (1) with M ^{(bio)}, assuming all cell types that mediate phenotypic effects are captured in M ^{(bio)}, and explore the relationship of (1) to the m × (n − k′) matrix M ^{(⊥)}. Note that two types of “direct” effects (understood from a molecular point of view) are possible. The first is a subtle alteration of a major subtype. Adopting the notation of the previous section and without loss of generality, assume that exactly one locus characterizes such an alteration (from \( {\mu}_0^{*} \) to \( {\mu}_1^{*} \)), μ is the average profile across altered and unaltered types, and that ε is a unit vector nonzero for every locus except the altered one. If ω _{0} is the mean proportion of unaltered cells (of the given type) and ω _{1} is the corresponding mean proportion of altered cells, ω _{0} + ω _{1} = 1, then \( {\overline{\mu}}^{*}={\omega}_0{\mu}_0^{*}+{\omega}_1{\mu}_1^{*} \) is the average methylation at the altered locus, \( \boldsymbol{\upmu} +\left({\mu}_0^{*}{\overline{\mu}}^{*}\right)\boldsymbol{\upvarepsilon} \) is the unaltered profile, \( \boldsymbol{\upmu} +\left({\mu}_1^{*}{\overline{\mu}}^{*}\right)\boldsymbol{\upvarepsilon} \) is the altered profile, and a mean shift in distribution of types, (Δω _{0}, Δω _{1}), Δω _{0} + Δω _{1} = 0, is characterized by a mean methylation difference of \( \Delta \boldsymbol{\upmu} =\Delta {\omega}_0\left[\boldsymbol{\upmu} +\left({\mu}_0^{*}{\overline{\mu}}^{*}\right)\boldsymbol{\upvarepsilon} \right]+\Delta {\omega}_1\left[\boldsymbol{\upmu} +\left({\mu}_1^{*}{\overline{\mu}}^{*}\right)\boldsymbol{\upvarepsilon} \right]=\left[\Delta {\omega}_0\left({\mu}_0^{*}{\overline{\mu}}^{*}\right)+\Delta {\omega}_1\left({\mu}_1^{*}{\overline{\mu}}^{*}\right)\right]\boldsymbol{\upvarepsilon} \).
In other words, the effect is captured entirely within the vector space \( {\mathsf{M}}^{\perp } \). A second type of “direct” effect is a wholesale change across every cell type (e.g. carcinogenic transformation of a normal cell). In this scenario, every cell type characterized by M has the same alteration, in which case (by an argument similar to that above) the effect is also captured entirely within the space \( {\mathsf{M}}^{\perp } \). In summary, a “direct” effect BX ^{T} will be any effect that lies within \( {\mathsf{M}}^{\perp } \), outside the space spanned by the profiles of the major cell types of interest. Consequently, we have B = M ^{(⊥)} Η ^{T}, expressed now as a mixture of subtler effects living in \( {\mathsf{M}}^{\perp } \) with projection coefficient H, a d × (n − k′) matrix. This in turn implies A = M ^{(bio)} Γ ^{T} + M ^{(⊥)} Η ^{T}. As in Houseman et al. (2014), we concatenate the total effects matrix A and the total error matrix R, decomposing as follows:
where Π = M ^{(⊥)}(M ^{(⊥)T} M ^{(⊥)})^{− 1} is the projection coefficient onto \( {\mathsf{M}}^{\perp } \). In large samples, the error E should be asymptotically orthogonal to \( {\mathsf{M}}^{\perp } \), therefore the second term in (4) is stochastically negligible. Note that if the singular values of the first term are nondegenerate, then the term has a unique SVD and, consequently \( {\left[\begin{array}{cc}\hfill {\boldsymbol{\Gamma}}^T\hfill & \hfill {\boldsymbol{\Xi}}^T\hfill \end{array}\right]}^T \) is orthogonal to \( {\left[\begin{array}{cc}\hfill {\mathbf{H}}^T\hfill & \hfill {\boldsymbol{\Pi}}^T\mathbf{E}\hfill \end{array}\right]}^T \). This motivates the use of the SVD of \( \left[\begin{array}{cc}\hfill \mathbf{A}\hfill & \hfill \mathbf{R}\hfill \end{array}\right] \) to recover M. Note that M could be recovered from R alone, a fact that underlies the application of independent surrogate variable analysis (ISVA, [24]) to adjust for cell composition effects. However, since A also contains information about M, it adds additional information to the decomposition. A alone cannot typically be used to recover M because d < k′ in most applications.
Linear expansion of total effect
The concatenated matrix \( \left[\begin{array}{cc}\hfill \mathbf{A}\hfill & \hfill \mathbf{R}\hfill \end{array}\right] \) has singular value decomposition \( \left[\begin{array}{cc}\hfill \mathbf{A}\hfill & \hfill \mathbf{R}\hfill \end{array}\right]=\mathbf{U}\;\boldsymbol{\Lambda}\;{\mathbf{V}}^T \), where U = (u _{1}, …, u _{ n + d }) is an orthogonal m × (n + d) matrix, Λ = diag(λ _{1}, …, λ _{ n + d }) is a diagonal (n + d) × (n + d) matrix, and V = (v _{1}, …, v _{ n + d }) is an orthogonal (n + d) × (n + d) matrix. Note that decomposition is algebraically equivalent to \( \left[\begin{array}{cc}\hfill \mathbf{A}\hfill & \hfill \mathbf{R}\hfill \end{array}\right]={\displaystyle {\sum}_{j=1}^{n+d}{\lambda}_j{\mathbf{u}}_j{\mathbf{v}}_j^T}, \) and therefore
Where V ^{*} is the matrix consisting of the first d rows of V, \( {\mathbf{Q}}_j={\mathbf{u}}_j{\mathbf{v}}_j^{*T} \), and \( {\mathbf{v}}_j^{*} \) is the j ^{th} column of V ^{*}. Thus the singular value decomposition effectively represents a linear expansion of A by n + d terms. Note that \(A \in {R^{m \times d}}\), a vector space of dimension > n + d, so the expansion need not be overdetermined. If follows from (4) that the orthogonal columns of U are partitioned into two sets, those that span M and those that span \( {\mathsf{M}}^{\perp } \). Thus, the terms Q _{ j } contribute unambiguously either to cellcomposition effects on the targeted cell types or else to direct effects lying outside the targeted types.
By convention, the singular values (diagonal elements of Λ) are ordered from greatest to least. Because we expect the variation in DNA methylation driven by differentiation of major cell types to dominate the variation among elements of \( \tilde{\mathsf{M}} \), it follows that the largest singular values should correspond to basis vectors of M, and therefore we interpret the initial terms of (5) as cellcomposition effects. However, it remains unclear how many initial terms k to select. Note that while we expect the value k to loosely correlate with the number of cell types k′, a direct correspondence may be difficult to establish in empirical data sets, as we demonstrate below. A reasonable approach is to vary k and examine the impact on estimates \( {\boldsymbol{\Delta}}_k={\displaystyle {\sum}_{j=1}^k{\lambda}_j{\mathbf{Q}}_j} \) and \( {\mathbf{B}}_k={\displaystyle {\sum}_{j=k+1}^{n+d}{\lambda}_j{\mathbf{Q}}_j} \) obtained by each choice of k.
Results of empirical evaluation of theoretical concepts
Above we have argued that the total effect A of phenotype on DNA methylation can be decomposed into two orthogonal terms, Δ _{ k }, the effect of a phenotype on the distribution of major cell types, and B _{ k }, which represents the effect of the phenotype on subtle variants of one cell type or global effects (across cell type) focused at single loci. The dimension parameter k can be chosen by a method we have proposed previously [19,24] or by a new method we propose below in Methods, which seeks the value for which Δ _{ k } is most “stable”, i.e. changes least across adjacent values of k. To evaluate the theory and demonstrate the phenomenon in finite data sets, we applied the proposed method to nine DNA methylation data sets obtained from Gene Expression Omnibus (GEO), each described in Table 1. Five data sets were generated from the HumanMethylation27 (27K) platform and four data sets from the HumanMethylation450 (450K) platform. Two are wellknown reference data sets for blood, three were generated from epidemiology studies assaying whole/peripheral blood, one consists only of invasive breast tumors, and three others consist of comparisons of normal and pathological tissues (gastric, liver, and arterial).
We hypothesized that for sufficiently large values of k << n, the number of significant (q < 0.05) Δ _{ k } loci will generally be larger than the number of significant B _{ k } loci. Figure 1 demonstrates that for the 450K rheumatoid arthritis data set, the number of significant (q < 0.05) Δ _{ k } and B _{ k } coefficients stabilized after about k ≥ 10. All methods of estimating dimension returned values that were in the stable region, although the values themselves were different. In the stable region, the number of significant B _{ k } coefficients was vastly smaller than the number of significant Δ _{ k } coefficients, although still numbering over 1000 (as is shown in Additional file 1: Figure S1). This pattern held true generally over other data sets (see Additional file 1: Figures S2 – S9). The one exception was the artery data set, which exhibited instability for k ≥ 25; however, this data set had small sample size relative to the other data sets, with only 44 residual degrees of freedom in the error terms (rows of the residual matrix R). Indeed, the procedure broke down after k ~ 40. However, reasonable stability was evident around k = 10, so we chose this value for subsequent analyses. We remark that for most analyses the fraction of significant Δ _{ k } coefficients was typically close to the number of significant A = Β _{0} coefficients (Additional file 1: Figure S10), representing the total effect of the phenotype, though this fraction was rarely identical. Although the number of significant B _{ k } coefficients was generally much smaller than the number of significant Δ _{ k } coefficients, it was almost never zero at selected values of k (see Additional file 1: Figure S1), with the one exception of age coefficients in the ovarian cancer case/control data set (age in this data set was omitted from subsequent geneset analysis because of the absence of loci with significant qvalues). As shown in Additional file 1: Figure S11, the two methods of selecting k proposed in this article resulted in about the same orderofmagnitude in estimating median rootmeansquaredifference (RMSD, the objective statistic used to measure stability at a value of k), and often much smaller orderofmagnitude than the random matrix theory method. Thus the proposed methods typically chose “flatter” regions of k.
To assess the biological significance of results, we selected several gene sets and investigated the enrichment of significant (q < 0.05) loci within each set. We hypothesized that for k selected by our proposed method, significant Δ _{ k } coefficients will overrepresent in gene sets that are reasonable for the target tissue, and in particular, significant Δ _{ k } coefficients in blood data sets will be enriched for CpGs that are known to differentiate major types of leukocytes, i.e. DMRs for blood. Additional file 1: Figure S12 shows the results of analysis of DMR sets for the seven nonreference data sets and two reference data sets. Strongly significant enrichment of DMRs within Δ _{ k } coefficients was evident for all comparisons except gastric tumor vs. normal. Enrichment of DMR gene sets for reference sets is unsurprising since these reference sets were used to determine the DMRs. Enrichment of DMR gene sets makes sense for data sets where blood was analyzed, since the DMR loci are precisely those that differentiate blood cell types. It also makes sense for breast tumors, as tumor infiltration by leukocytes has been well established [28], for liver tissues, where there is an inflammatory response in alcoholrelated disease and an antiviral immune response in infectionrelated disease [29], and for artery, where leukocyte infiltration has been observed in atherosclerosis [30,31].
Additional file 1: Figure S12 also shows analysis of polycomb group target (PcG) sets. Polycomb group target regions represent sites of binding and occupancy of polycomb group repressor complexes, which play a critical role in defining celltype specific expression patterns [32,33] and which may represent regions targeted for epigenetic variation resulting in altered cellular function [34]. Thus we hypothesized that those loci driving cellular composition effect are overrepresented by regions considered PcG targets. PcG gene sets were significantly enriched for all Δ _{ k } coefficients, though weakly in the age association for head and neck squamous cell carcinoma (HNSCC). PcG gene sets were weakly enriched for significant B _{ k } coefficients in the following comparisons: atherosclerotic lesion vs. normal aorta, atherosclerotic carotid tissue vs. normal aorta, and age (in HNSCC data set), as well as for some of the cell types in the reference data sets. Given that PcG targets are deeply involved with the development and maintenance of hematopoietic stem cells, as well as development of neoplastic tissues, the enrichment of PcG genes among most of the comparisons we investigated is not unreasonable. In particular, the profound overrepresentation of PcG targets in the significant Δ _{ k } coefficients from the breast tumor data is reassuring, as the original analysis of this data demonstrated such overrepresentation amongst loci exhibiting differing methylation between tumors and normal tissues in a variety of women’s cancers [35].
In Figure 2, a clustering heatmap summarizes the gene set results for Δ _{ k } and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, with clusters obtained by Euclidean metric and Ward’s method of clustering applied to logsignificance. In general, 27K and 450K data sets clustered together, as one would expect from the different numbers of loci on each array platform. However, the arthritis case/control data set clustered together with the other two blood case/control data sets and the 27K reference data coefficients within the 27K cluster. The nonblood 450K data sets clustered with the Bcell, CD8, and NK coefficients from the 450K reference data set, while the remaining 450K reference data coefficients clustered together in a separate cluster. In general, the 450K data set cluster showed enrichment in the cytokinecytokine receptor interaction and hematopoietic cell lineage pathways. The cluster containing only 450K reference coefficients showed additional enrichment for the three lymphocytespecificsignaling pathways (Bcell, NK, and Tcell), which were less strongly enriched for the coefficients in the other 450K cluster. Note that the vascular smooth muscle contraction pathway was strongly enriched for the arterial data set and for cirrhosis associated with viral infection. Interestingly, the most striking difference between the two cirrhosis coefficients was the lack of significance of HepatitisC pathway in the coefficient representing differences between virallyassociated cirrhosis and normal liver! However, this is explained by the enrichment of the HepatitisC pathway by significant B _{ k } coefficients (p = 0.0028 for cirrhosis related to viral infection vs. p = 0.80 for alcoholrelated cirrhosis), suggesting DNA methylation differences not mediated by cellcomposition effects. Additional file 1: Figure S13 shows a similar clustering heatmap for B _{ k } coefficients. Gene set enrichment was much weaker, occurring primarily for hematopoietic pathways and reference lymphocyte coefficients. Interestingly, the arthritis casecontrol coefficient showed significant enrichment of B _{ k } coefficients in these pathways. Odds ratios for enrichment of gene sets by significant loci are illustrated by clustering heatmaps in S14 (Δ _{ k }) and S15 (B _{ k }).
Additional file 1: Table S1 provides a list of genes mapped to the CpGs having significant (q < 0.05) B _{ k } coefficients, or the top 50 such CpGs. While it is not our purpose to present an exhaustive epigenetic analysis of each data set, we note some features of these lists to demonstrate the biological plausibility of the cellmixtureadjusted hits. For example, the two top hits for arthritis case vs. control were mapped to the NLRC5 gene, which has recently been implicated to play a key role in antigen presentation [36] and inflammasome activation [37], suggesting a potential link to inflammation present in rheumatoid arthritis. Genes among the top hits for breast cancer ER+ vs. ER are related to transcription (BCOR, CST11, HERC2, HEYL, IGF2BP3, MBP, NRIP2, RASSF5, SCML4, SMARCC2, SMPX, SSX1, TBX19, VHL, ZBTB16, ZNF124, ZNF560) and signal transduction (ARHGAP27, CHRNB4, KALRN, LRP12, PDCL, TNFSF18), which may be important pathways in breast tumorogenesis [38]. Many genes among the top hits for gastric tumor vs. normal gastric tissue are related to cancer and tumorogenesis (ASL, BCL2L2, CASP1, IFITM1, IL11RA, LCAT, LRIG1, NFE2L2, PABPC1, PFN1, PPAT, PSMD2, RANBP2, RARG, TERF2IP, THOC1, UHRF1). While extensive efforts would be required to confirm the phenotypic associations with these individual genes, they demonstrate the biological plausibility of significant B _{ k } coefficients.
We also evaluated the behavior of the referencefree method in various null scenarios. Additional file 1: Figure S16 summarizes the results. In all three scenarios (completely null by permutation, or nulllinear with nonnull nonlinear effects of two different magnitudes) there were no significant (q < 0.05) Δ _{ k } coefficients for k ≥ 10. For the completely null scenario, none of the B _{ k } coefficients were significant. For the two nonlinear scenarios, the proportion of significant B _{ k } coefficients was substantially higher than for the original analyses upon which these null scenarios were based, with the higher magnitude effects resulting in a higher proportion of significant B _{ k } coefficients. In summary, the referencefree method appeared to differentiate nonlinear effects from those that act together in a linear fashion.
In Methods we argue that a recursive partitioning principle holds for DNA methylation profiles, i.e. that the mean methylation profile for different cell types will partition in a manner consistent with cell lineage. To demonstrate this we applied hierarchical clustering to coefficients from the two reference data sets. Figure 3 shows the results of clustering \( {\boldsymbol{\Delta}}_k^T \) coefficients from the 27K reference data set at different values of k. The figure demonstrates that for k = 1, the coefficients in \( {\boldsymbol{\Delta}}_k^T \) did not respect the myeloid/lymphoid lineage distinction, but for 2 ≤ k ≤ 5, the split occurred exactly as one would expect. For k ≥ 10, Natural Killer (NK) and B cells were grouped more closely with granulocytes and monocytes than with other lymphocytes, although they were still separated from the myeloid lineage. This demonstrates the potential addition of noise when including additional terms into \( {\boldsymbol{\Delta}}_k^T \). Additional file 1: Figure S17 similarly shows that for the 450K data set, the myeloid/lymphoid distinction was respected for k ≥ 1, but that the relative groupings of myeloid types (especially eosinophils) changed depending on the value of k.
The referencefree method relies on applying a SVD to the matrix obtained by concatenating the total effects matrix A with the corresponding residual matrix R. We have argued that the resulting left singular vectors u _{ k } corresponding to the largest eigenvalues represent variation among constituent cell types. As an evaluation of the hypothesis that the leftsingular vectors u _{ k } of the SVD in (4) concentrate a majority of information about major cell types within the initial values of k, we computed the cross products of the leftsingular vectors u _{ k } across all five 27K data sets. We hypothesized that because the variation in the blood data sets are based on the same major cell types, for lower values of k, leftsingular vector crossproducts among blood will have higher absolute values than crossproducts between blood and nonblood tissues and among nonblood tissues. The clustering heatmap in Additional file 1: Figure S18 depicts the absolute values of the crossproducts in leftsingular vectors u _{ k } among the 27K data set analyses. Clustering was based on Euclidean metric and Ward’s linkage. Note that because the vectors must be orthogonal within a single data set, tight clusters will not have vectors from the same data set. As shown in the figure, the crossproduct absolute values did appear to cluster together more tightly for lower values of k than for higher values. In particular, the first vector from each data set appeared in the lowerright cluster, except for the gastric tissue data set for which the second vector was included in the cluster; this large cluster split into two smaller clusters that were composed entirely of blood or entirely of nonblood tissue. Another large cluster contained one vector (k ≤ 2) from each data set and, additionally, the third vector from the HNSCC data set, with each 2term subcluster containing only blood or only nonblood tissues. The remaining clusters were smaller and showed weaker clustering. This result is inconclusive in itself, since the first term from each data set appeared in one cluster and could represent simply known variation, e.g. differences across CpGs with different CpG Island status. However, we also clustered the intercept columns of each Δ _{ k } matrix across 27K data sets and across 450K data sets, and finally computed stratified correlation coefficients among intercept columns. Again, our hypothesis was that for Δ _{ k }, blood data sets would cluster together relative to other tissues and exhibit stronger correlation, while for k > 0, no such clustering or correlation would exist for B _{ k } coefficients. Additional file 1: Figure S19 shows clustering dendrograms for the intercept column of A and for Δ _{ k } at various values of k. Clusters from A demonstrated marked separation of blood data sets from breast tumor and gastric tissue data sets. Note that the intercept term in the 27K reference data set represented mean methylation in whole blood. Interestingly, the separation between tissue types was evident in Δ _{2}, with only two terms from the SVD, and completely recovered by the first 10 terms with Δ _{10}. The anticipated separation was completely broken by removing the first SVD term (resulting in B _{1}). In other words, the wellknown differences in methylation between tissue types that were evident from A were contained almost wholly in the first term of the SVD, i.e. a rank1 linear combination suggestive of cellcomposition effects. Additional file 1: Figure S20 shows the correlation between the intercept term in the 27K reference data set correlated strongly with the intercept terms in the other blood data sets (stratified Pearson r > 0.99) for moderate and high values of k, but never correlated with the intercepts from the other two tissues more strongly than r > 0.9. In contrast, the intercept terms of B _{1} did not correlate: stratified correlations with the intercept from the reference data set were 0.29 (HNSCC blood), 0.44 (ovarian blood), 0.06 (breast tumor), 0.06 (gastric tissue). Again, this result suggests that the initial terms of the SVD drive correlations within and across tissue types, suggestive of strong cellcomposition effects. Additional file 1: Figure S21 shows clustering dendrograms for 450K, similar to what Additional file 1: Figure S19 shows for 27K. The pattern was the same except that it was necessary to remove two terms of the SVD (resulting in B _{2}) to break the anticipated grouping between the reference blood data set and the arthritis data set. As in Additional file 1: Figure S20, Additional file 1: Figure S22A shows correlations with 450K reference blood across different values of k. The pattern was more striking than in Additional file 1: Figure S20, demonstrating strong correlation between blood data sets and weaker correlation with liver and artery, a pattern that was constant across all values of k. Additional file 1: Figure S22B shows a complementary pattern of weak correlation in B _{ k } intercepts across all data sets.
Discussion of empirical results
The novel linear expansion of DNA methylation array presented here illustrates that tissue differences and tissuespecific cell lineages are evident in only a few linear terms of the SVD. It also suggests that principal variation driven by linear combinations contains a majority of the biological information in DNA methylation data sets. This finding is in line with our prior work demonstrating that phenotype associated methylation profiles are driven by variation in the underlying composition of major types of cells within the sample [1416]. Our present work also suggests that this finding may apply more widely in studies that utilize target tissue samples other than peripheral blood. It is also consistent with other recently published work that demonstrates the ability to deconvolve DNA methylation data to obtain underlying information about cell mixtures [22,39].
Our analysis separates composition effects by major cell types from subtler effects within a cell type or focused effects across cell type. This was clearly apparent when examining gene set enrichment of the significant Δ _{ k } coefficients arising from the blood cell reference datasets, which showed the predicted highly significant overrepresentation within demonstrated blood cell type DMRs (which had in fact been chosen on the basis of these data sets). This overrepresentation was also seen in data from other studies using peripheral blood, including the ovarian cancer, rheumatoid arthritis, and HNSCC casecontrol analyses. Interestingly, other than the gastric tissue data set, the other data sets demonstrated significant overrepresentation of DMR loci among significant Δ _{ k } coefficients. This is potentially due to immune cell infiltration or inflammatory process, as we suggest above. In most of the datasets the set of significant CpGs showed enrichment for PcG targets, important for cell growth and differentiation, again potentially reasonable for the data sets we analyzed. Several KEGG gene set analyses also showed overrepresentation among significant Δ _{ k } coefficients of pathways such as cytokinecytokine interactions, hematologic processes, and NK cell signaling for nonblood data sets; again suggestive of immunecell infiltration or inflammation.
We note that the sets of significant Δ _{ k } coefficients corresponded in large part to the sets of significant A (unadjusted) coefficients. Although it may be possible that for moderate values of k the referencefree method merely assigns a substantial portion of any A effect to Δ _{ k } rather than to B _{ k }, the null scenarios we considered suggest instead that the method is able to correctly partition effects into linear and nonlinear categories, as was suggested by the simulations conducted in our original paper [19]. Therefore, the alternative explanation, that much of the phenotypic effect on DNA methylation is explained by latent linear variables suggestive of cell composition effects, is more plausible.
Although the principal variation in DNA methylation profiling appears to be explained through these compositional effects, there are loci that still show significant covariation with phenotype outside these predominant linear structures. We note that the referencefree algorithm has been recently applied for the specific purpose of discovering such loci, e.g. Wilms’ tumor biomarkers have been confirmed recently for significant cellmixture adjusted coefficients obtained using our referencefree algorithm [20]. According to theoretical considerations given earlier, this variation may be interpreted as evidence for effects involving consistent but focused changes across cell type or subtly distinct forms of major cell types arising or being selected for through pathological processes. Biologically, this is an important demonstration, as it would suggest that, beyond profound alterations in major cell type composition, subtle changes to the underlying distribution of cells within tissues can be an important part of the pathological process underlying a variety of conditions. This is well understood in oncology, and even in diseases such as hepatic cirrhosis where dysplastic changes are apparent. In some cases, these changes may not be histologically obvious, but can still have important implications in disease progression. For example, the expansion of stem cell compartments in tissues such as adipose or regenerative processes in bone marrow may arise in expanded numbers of cells with bivalent chromatin, marked perhaps by 5hydroxymethyl cytosine but without obvious histologic signs of tissue differences. Through these analytical methods, these potentially crucial and possibly prognostic or etiologically relevant alterations may be observable.
An issue that was somewhat underdeveloped in our original paper was the selection of the dimension parameter k. In particular, there has been some concern in response to our original paper that the choice of k may influence results. We had suggested the use of an established algorithm designed for an entirely different purpose, and found in this paper that it did not optimize the criterion we have found most useful for this type of analysis (see Additional file 1: Figure S11B). Although our proposed alternative by definition optimizes this criterion, both methods often resulted in values of k for which the solution changed very little in moving to adjacent values. This demonstrates some robustness of the referencefree method to reasonable choices of k, a finding that supports the use of the referencefree algorithm. However, as shown in Figure 1 and Additional file 1: Figures S2 through S8, choosing a value of k that is too low or too high (i.e. too close to the available residual degrees of freedom) will result in unstable partitioning of effect. Robustness to the choice of k also suggests that most data sets will show variation in a small number of cell types, beyond which all detectable variation is of a more subtle variety. Some biologists may be interested in using plots similar to those shown in Figure 1 and Additional file 1: Figures S2S9 to determine the (often unknown) number of major cell types k′ in a tissue by finding the minimal value of k for which the solution seems stable. However, we note that the dimension k evident in a real data set may not correspond to the number of cell types in a linear fashion; this was subtly evident in Figures 3, Additional file 1: Figure S16, and S18S21, which demonstrated that a small number of terms recovered the majority of variation in the data sets. Indeed, in some cases k may scale with the base2 logarithm of the number of cell types k′, as each dimension is able to differentiate one lineage “split”.
We examined the overrepresentation of blood cell DMRs and PcG targets, as these are wellestablished gene sets, and in the case of bloodbased analyses they are critical. For other tissues, cellspecific DMRs are less well established, but our findings linking pathology to these compositional effects points to the existence of such DMRs, and suggests an important area for further epigenetic research.
Importantly, this work also has implications for conducting an EWAS: adjusting for cell composition and/or looking for cellcomposition effects (where most of the information is in the initial terms of the SVD), is crucially important for interpreting the data. For example, the distribution of cell types in numerous organs (e.g. autocrine or exocrine organs) may be affected by the conditions of organismal development. The Barker hypothesis holds that developmental effects of environmental exposure alter susceptibility to later chronic disease [40]. In this construct, maternal exposures (including diet, exercise etc.) can affect the development of fetal offspring in a manner that alters tissue composition and function. That is, alteration of the mixed cellularity of important cell types in tissues, occurring inutero and persisting into adulthood, could be a determinant of later disease and this could be examined using these methods. Further, cell adjustment would also allow for assessment of the usually much smaller effects outside the initial terms of the SVD.
It must also be noted that adjusting for cell composition significantly complicates the process of EWAS replication and validation. Replication, of course, can be accomplished if there exists a replication population with array data; in this case the models can be reapplied. However, often the process of replication and validation is accomplished using an orthogonal approach such as pyrosequencing of selected targets, as has been recommended by some as a critical approach in EWAS quality control [41]. In this case, separation of the cellular composition effects and environmental effects would be extremely challenging without additional arrayscale data.
We acknowledge one limitation in this work, the assumption that linear associations necessarily represent cell mixture effects. While it is true that cell mixture effects must be linear, the converse may not be true, i.e. there may be processes outside cellcomposition effects that result in linear associations. We have not ruled out the wellknown potential for approximately linear technical artifacts; although we did not correct for them in this analysis due to incomplete access to the necessary lowlevel files, we emphatically recommend the use of accepted preprocessing techniques such as BMIQ or FunNorm for 450K [42] before applying referencefree methods in the analysis of single data sets. Even without adjustment for technical effects, we saw evidence in our empirical analyses of truly biological information present in the initial linear terms of the SVD. Given that most biological processes are nonlinear, it would be difficult to imagine a theoretical basis for linearity of truly epigenetic effects arising outside of cellcomposition. Of course it is possible that a strong nonlinear effect would result in strong firstorder linear terms that might be interpreted as cellcomposition effects, but such widespread systematic effects would likely represent a selection process that is ultimately cellmediated, i.e. occurring outside the cell nucleus. An example of this is the gastric tissue data set, where the profound difference in behavior between normal gastric tissue and tumor ultimately results from the selection of tumor cells over normal cells. We do not rule out the potential for widespread intranuclear processes occurring sporadically (as opposed to systematically) but would argue that such effects, if not resulting in the expansion of an altered type via selection process, would present as a more random process, such as those targeted by algorithms such as EVORA [43,44]. We note that similar methods such as SVA and ISVA [23,24] may be used for the purpose of cellmixture adjustment, but because they do not use the cellmixture information available in the regression coefficients, they may perform less adequately for this purpose, as we have shown previously [19]. Although we argue in this paper that the most likely interpretation of linear associations is that of cell composition, a more general point can be made: that the various interpretations assigned to phenotypic associations with DNA methylation require greater scrutiny than has often been offered in many EWAS reports.
Conclusions
In this article we have presented a mathematical basis for referencefree cellmixture adjustment, arguing that the total effect of a phenotype on DNA methylation can be decomposed into orthogonal components, one representing the effect of phenotype on proportions of major cell types, the other representing subtle differences and global effects at focused loci. Using nine different DNA methylation data sets arrayed on Illumina Infinium platforms, we have demonstrated empirically that a majority of the information regarding lineage and celltype appears to reside in the first few terms of an orthogonal linear expansion, thus corroborating the assertion that these initial principal components should be interpreted as cell composition effects. We also demonstrate that the remaining terms nevertheless may contain significant effects at a small number of loci, and these effects may represent either subtle alterations in cell type or focused changes common to all cell types. In addition, we demonstrate reasonable robustness to the choice of k, the number of terms to interpret as effects of major cell types, and present a method that is designed to find the value of k that results in minimal change in results across adjacent values.
Taken together, our work demonstrates that referencefree algorithms for cellmixture adjustment can produce biologically valid results, separating cellmediated epigenetic effects from those that are not mediated by major cell types, and thus represents a useful method for distinguishing the two types of effects in EWAS. In general, we argue that the biological interpretation of epigenetic associations evident from DNA methylation data requires closer examination than has often been offered in EWAS reports.
Methods
Novel methods for estimating dimension
We have previously proposed using a random matrix theory approach for estimating k, demonstrating that it performs well under simulation in small artificial data sets [19]. The method was originally proposed for determining the number of singular values to keep in isva, an approach that is similar but does not use A as part of the matrix to be decomposed [24]. In essence, the method looks for the smallest number of singular values of R (standardized by row) for which the residual matrix (analogous to B _{ k } above) is consistent with a randomly generated matrix. On the other hand, an empirically useful criterion for choosing k, based in part on biological considerations, is to find values of k, such that terms λ _{ j } Q _{ j } are small relative to Δ _{ k } for j close to k. The biological interpretation is that such values of k effectively exhaust the variation among major types of cells, and represent a “stable” solution. We propose to operationalize this as follows: for each candidate value of k, compute the medianrootmeansquareddifference between successive values of Δ _{ k }, \( RMS{D}_k= me{d}_h{\left\{{\left(d1\right)}^{=1}{\lambda}_j{\displaystyle {\sum}_{l=2}^d{Q}_{hl}^2}\right\}}^{1/2} \), where each sumofsquares is computed separately for each row over the nonintercept column entries l of Q _{ j } and the median is computed over the rows h; k is then chosen as the value that minimizes this median statistic.
We previously proposed a bootstrap method for generating the sampling distribution of A and B _{ k } (and therefore Δ _{ k }) [19]. Thus it is possible to construct tstatistics for each entry of Δ _{ k }, and an alternate but similar procedure could minimize the median of the rowsums of the squares of the elementwise differences of these statistics. In principle it would be better to use the formal Fstatistic across all nonintercept columns, but this vastly increases the computational burden of the procedure (since it involves a large number of matrixinversions).
Methods for empirical evaluation
To evaluate and demonstrate the concepts proposed above, we applied the proposed method to nine DNA methylation data sets obtained from GEO, each described in Table 1. Table 1 also indicates he regression model used in each analysis. For all but two femaleonly data sets, sex chromosome loci were omitted. For each of the 450K data sets, 166,314 CpGs were removed due to known problems with crossreactivity or polymorphisms [45] and loci with greater than 5 missing values were also omitted. For 27K data sets we considered a range of k from 1 to 25; for three of the 450K data sets we considered a range of k from 1 to 50, but due to small residual degrees of freedom, we considered only a range up to 30 for the Reinius 450K blood reference data set.
For each data set and for each value of k, we computed B _{ k }, Δ _{ k }, and their elementwise bootstrap standard errors (100 bootstrap samples for each analysis) using our previously published referencefree algorithm [19]. From estimates and standard errors we computed pvalues, and across each column of B _{ k } and of Δ _{ k } we computed the corresponding qvalues using the Bioconductor qvalue package (version 1.34.0). Additionally, we applied the two new proposed methods for estimating k, as well as two variants of the previously proposed random matrix theory approach, one that scales the rows of R, and one that does not.
To assess the biological significance of results, we selected several gene sets and investigated the enrichment of significant (q < 0.05) loci within each set. For each data set, we used the value of k selected by our proposed method, except for the artery data set where we used k = 10 due difficulty in estimating k, caused by small sample size. The first gene set consisted of known DMRs for leukocytes: for 27K, the 500 CpG sites published for 27K in our earlier paper [18], and for 450K, 417 CpGs not excluded from the top 600 CpGs reported by Jaffe & Irrizary [46] based on the Reinius data set [47]. The second was a set of CpGs mapped to polycomb target genes, compiled from four separate published articles [4851] and used extensively in our previous work. The remaining gene sets were selected KEGG pathways, obtained via Bioconductor annotation packages for the 27K and 450K platforms. Please see Table 2 for a summary of pathways investigated. To test overrepresentation while circumventing known problems with the application of such geneset analysis to DNA methylation data, we used the exact MantelHaenzel test to stratify by genomic context. For 27K we stratified by CpG Island status, and for 450K we stratified by strata defined by Infinium chemistry type, relation to CpG Island, and gene region. Since a single CpG may be mapped to different region designations due to splice variants or gene adjacency, we used the following rule to establish precedence: promoter (“TSS1500” or “TSS200”) were combined as “TSS” and had highest precedence, and the remaining chain of precedence was as follows; TSS > 1stExon > Body > 5’UTR > 3’UTR > null. In cases where sparsity prevented stratification (a few analyses with PcG gene set) we used the unstratified Fisher test.
We also evaluated the behavior of the referencefree method in null scenarios. Although our original paper conducted simulations using artificial data sets with only 1000 CpGs, we devised simple but more realistic approach based on real 27K data sets. From the HNSCC and ovarian cancer case/control data sets, we obtained three null data sets as follows. First, to simulate a completely null result, we permuted the phenotype information (case status and age) with respect to the array data. Second, to simulate a null linear result with some nonnull, nonlinear effects, we selected CpGs with q < 0.0001 for the case coefficient in a limma analysis, unadjusted for celltype, conducted on an Mvalue (logitbeta) scale (458 CpGs for the HNSCC data set and 817 for the ovarian cancer data set), we then multiplied the corresponding coefficients on Mvalues by random scalars drawn from a normal distribution with standard deviation 2, and finally applied the corresponding effects to Mvalues whose case effect had been removed (on a beta scale so that any linear age effect would be maintained), converting the result back to beta scale. Note that this approach would produce a data set with weak linear age effects, no linear case effects, and nonlinear case effects at a small fraction of CpGs. The third approach was identical to the second except that a threshold of q < 0.001 was used (resulting in 643 CpGs for the HNSCC data set and 1163 for the ovarian cancer data set). Each of these three approaches was applied five separate times, with subsequent analysis by the referencefree method across values of k ranging from 1 to 25, and tabulation of the resulting number of significant (q < 0.05) coefficients. Our hypothesis was that the first set of analyses would produce very few significant qvalues for either Δ _{ k } or B _{ k }, and that the latter two sets would produce some significant qvalues for the B _{ k } coefficients but few to none for the Δ _{ k } coefficients.
In order to illustrate the principle of recursive partitioning above, we conducted an additional analysis on the two reference data sets. For each value of k, we applied hierarchical clustering to \( {\boldsymbol{\Delta}}_k^T \) (i.e. clustered the columns of Δ _{ k }), using Euclidean metric and Ward’s method of clustering. We have demonstrated that Ward’s method performs similarly to the recursively partitioned mixture model [27]. We hypothesized that even with relatively small values of k, phylogenetic relationships between cell types will be evident in the resulting clusterings.
Finally, as an evaluation of the hypothesis that the leftsingular vectors u _{ k } of the SVD in (4) concentrate a majority of information about major cell types within the initial values of k, we computed the cross products of the leftsingular vectors u _{ k } across all five 27K data sets and across all four 450K data sets. We also clustered the intercept columns of each Δ _{ k } across 27K data sets and across 450K data sets, and computed correlation coefficients stratified over CpG Island status (27K) or more general genomic context (450K) as in the aforementioned MantelHaenzel test. We performed a similar analysis for the B _{ k } coefficients.
The core engine of the methods discussed in this article and published in our previous work are available in the CRAN/R package RefFreeEWAS. The novel elements proposed in this article are available in the Additional file 1.
Availability of supporting data
As described in Table 1, data used in this article were obtained from Gene Expression Omnibus, accession numbers GSE39981, GSE35069, GSE30229, GSE19711, GSE42861, GSE32393, GSE30601, GSE60753, and GSE46394.
Abbreviations
 450K:

Human methylation450 platform
 CpG:

Cytosinephosphateguanine dinucleotide
 DMR:

Differentially methylated region
 EWAS:

Epigenomewide association study
 GEO:

Gene expression omnibus
 HBV:

Chronic hepatitis B viral infection
 HCV:

Chronic hepatitis C viral infection
 HNSCC:

Head and neck squamous cell carcinoma
 ISVA:

Independent surrogate variable analysis
 KEGG:

Kyoto encyclopedia of genes and genomes
 NK:

Natural killer cell
 PcG:

Polycomb group target
 SVA:

Surrogate variable analysis
References
Teschendorff AE, Menon U, GentryMaharaj A, Ramus SJ, Gayther SA, Apostolidou S, et al. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4(12):e8274.
Marsit CJ, Koestler DC, Christensen BC, Karagas MR, Houseman EA, Kelsey KT. DNA methylation array analysis identifies profiles of bloodderived DNA methylation associated with bladder cancer. J Clin Oncol. 2011;29(9):1133–9.
Kim M, Long TI, Arakawa K, Wang R, Yu MC, Laird PW. DNA methylation as a biomarker for cardiovascular disease risk. PLoS One. 2010;5(3):e9692.
Dick KJ, Nelson CP, Tsaprouni L, Sandling JK, Aissi D, Wahl S, et al. DNA methylation and bodymass index: a genomewide analysis. Lancet. 2014;383(9933):1990–8.
Kile ML, Houseman EA, Baccarelli AA, Quamruzzaman Q, Rahman M, Mostofa G, et al. Effect of prenatal arsenic exposure on DNA methylation and leukocyte subpopulations in cord blood. Epigenetics. 2014;9(5):774–82.
Koestler DC, AvissarWhiting M, Houseman EA, Karagas MR, Marsit CJ. Differential DNA methylation in umbilical cord blood of infants exposed to low levels of arsenic in utero. Environ Health Perspect. 2013;121(8):971–7.
Joubert BR, Haberg SE, Nilsen RM, Wang X, Vollset SE, Murphy SK, et al. 450K epigenomewide scan identifies differential DNA methylation in newborns related to maternal smoking during pregnancy. Environ Health Perspect. 2012;120(10):1425–31.
Ji H, Ehrlich LI, Seita J, Murakami P, Doi A, Lindau P, et al. Comprehensive methylome map of lineage commitment from haematopoietic progenitors. Nature. 2010;467(7313):338. 342.
Khavari DA, Sen GL, Rinn JL. DNA methylation and epigenetic control of cellular differentiation. Cell Cycle. 2010;9(19):3880–3.
Natoli G. Maintaining cell identity through global control of genomic organization. Immunity. 2010;33(1):12–24.
Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, et al. Aging and environmental exposures alter tissuespecific DNA methylation dependent upon CpG island context. PLoS Genet. 2009;5(8):e1000602.
Monick MM, Beach SR, Plume J, Sears R, Gerrard M, Brody GH, et al. Coordinated changes in AHRR methylation in lymphoblasts and pulmonary macrophages from smokers. Am J Med Genet B Neuropsychiatr Genet. 2012;159(2):141–51.
Sung N, Choi K, Park E, Park K, Lee S, Lee A, et al. Smoking, alcohol and gastric cancer risk in Korean men: the National Health Insurance Corporation Study. British J Cancer. 2007;97(5):700–4.
Koestler DC, Marsit CJ, Christensen BC, Accomando W, Langevin SM, Houseman EA, et al. Peripheral blood immune cell methylation profiles are associated with nonhematopoietic cancers. Cancer Epidemiol Biomark Prev. 2012;21(8):1293–302.
Langevin SM, Houseman EA, Accomando WP, Koestler DC, Christensen BC, Nelson HH, et al. Leukocyteadjusted epigenomewide association studies of blood from solid tumor patients. Epigenetics. 2014;9(6):884–95.
Langevin SM, Koestler DC, Christensen BC, Butler RA, Wiencke JK, Nelson HH, et al. Peripheral blood DNA methylation profiles are indicative of head and neck squamous cell carcinoma. Epigenetics. 2012;7(3):291–9.
Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, et al. Epigenomewide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013;31(2):142–7.
Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13:86.
Houseman EA, Molitor J, Marsit CJ. Referencefree cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014;30(10):1431–9.
Charlton J, Williams RD, Weeks M, Sebire NJ, Popov S, Vujanic G, et al. Methylome analysis identifies a wilms tumour epigenetic biomarker detectable in blood. Genome Biol. 2014;15(8):434.
Guintivano J, Aryee MJ, Kaminsky ZA. A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics. 2013;8(3):290–302.
Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J. Epigenomewide association studies without the need for celltype composition. Nat Methods. 2014;11(3):309–11.
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3(9):1724–35.
Teschendorff AE, Zhuang J, Widschwendter M. Independent surrogate variable analysis to deconvolve confounding factors in largescale microarray profiling studies. Bioinformatics. 2011;27(11):1496–505.
Falvo JV, Jasenosky LD, Kruidenier L, Goldfeld AE. Epigenetic control of cytokine gene expression: regulation of the TNF/LT locus and T helper cell differentiation. Adv Immunol. 2013;118:37–128.
Walden TB, Hansen IR, Timmons JA, Cannon B, Nedergaard J. Recruited vs. nonrecruited molecular signatures of brown, “brite”, and white adipose tissues. Am J Physiol Endocrinol Metabol. 2012;302(1):E19–31.
Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, et al. Modelbased clustering of DNA methylation array data: a recursivepartitioning algorithm for highdimensional data arising as a mixture of beta distributions. BMC Bioinformatics. 2008;9(1):365.
DeNardo DG, Coussens LM. Balancing immune response: crosstalk between adaptive and innate immune cells during breast cancer progression. Breast Cancer Res. 2007;9(4):212.
Lederer SL, Walters KA, Proll S, Paeper B, Robinzon S, Boix L, et al. Distinct cellular responses differentiating alcoholand hepatitis C virusinduced liver cirrhosis. Virol J. 2006;3(1):98.
Nakashima Y, Fujii H, Sumiyoshi S, Wight TN, Sueishi K. Early human atherosclerosis accumulation of lipid and proteoglycans in intimal thickenings followed by macrophage infiltration. Arterioscler Thromb Vasc Biol. 2007;27(5):1159–65.
Zhou X, Stemme S, Hansson GK. Evidence for a local immune response in atherosclerosis. CD4+ T cells infiltrate lesions of apolipoproteinEdeficient mice. Am J Pathol. 1996;149(2):359.
Margueron R, Reinberg D. The Polycomb complex PRC2 and its mark in life. Nature. 2011;469(7330):343–9.
Simon JA, Kingston RE. Occupying chromatin: polycomb mechanisms for getting to genomic targets, stopping transcriptional traffic, and staying put. Mol Cell. 2013;49(5):808–24.
Kolybaba A, Classen AK. Sensing cellular states–signaling to chromatin pathways targeting Polycomb and Trithorax group function. Cell Tissue Res. 2014;356(3):477–93.
Zhuang J, Jones A, Lee SH, Ng E, Fiegl H, Zikan M, et al. The dynamics and prognostic potential of DNA methylation changes at stem cell gene loci in women’s cancer. PLoS Genet. 2012;8(2):e1002517.
Yao Y, Wang Y, Chen F, Huang Y, Zhu S, Leng Q, et al. NLRC5 regulates MHC class I antigen presentation in host defense against intracellular pathogens. Cell Res. 2012;22(5):836–47.
Davis BK, Roberts RA, Huang MT, Willingham SB, Conti BJ, Brickey WJ, et al. Cutting edge: NLRC5dependent activation of the inflammasome. J Immunol. 2011;186(3):1333–7.
Liu Y, LudesMeyers J, Zhang Y, MunozMedellin D, Kim HT, Lu C, et al. Inhibition of AP1 transcription factor causes blockade of multiple signal transduction pathways and inhibits breast cancer growth. Oncogene. 2002;21(50):7680–9.
Zheng X, Zhao Q, Wu HJ, Li W, Wang H, Meyer CA, et al. MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes. Genome Biol. 2014;15(8):419.
Barker DJP, Robinson RJ. Fetal and infant origins of adult disease. London: British Medical Journal; 1992.
Michels KB, Binder AM, Dedeurwaerder S, Epstein CB, Greally JM, Gut I, et al. Recommendations for the design and analysis of epigenomewide association studies. Nature Methods. 2013;10(10):949–55.
Hansen JPFKD. Minfi tutorial BioC2014. 2014.
Teschendorff AE, Liu X, Caren H, Pollard SM, Beck S, Widschwendter M, et al. The Dynamics of DNA Methylation Covariation Patterns in Carcinogenesis. PLoS Comput Biol. 2014;10(7):e1003709.
Teschendorff AE, Widschwendter M. Differential variability improves the identification of cancer risk markers in DNA methylation studies profiling precursor cancer lesions. Bioinformatics. 2012;28(11):1487–94.
Ya C, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al. Discovery of crossreactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013;8(2):203.
Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenomewide association studies. Genome Biol. 2014;15(2):R31.
Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlén SE, Greco D, et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PloS One. 2012;7(7):e41361.
Bracken AP, Dietrich N, Pasini D, Hansen KH, Helin K. Genomewide mapping of Polycomb target genes unravels their roles in cell fate transitions. Genes Dev. 2006;20(9):1123–36.
Lee TI, Jenner RG, Boyer LA, Guenther MG, Levine SS, Kumar RM, et al. Control of developmental regulators by Polycomb in human embryonic stem cells. Cell. 2006;125(2):301–13.
Schlesinger Y, Straussman R, Keshet I, Farkash S, Hecht M, Zimmerman J, et al. Polycombmediated methylation on Lys27 of histone H3 premarks genes for de novo methylation in cancer. Nature Genet. 2006;39(2):232–6.
Squazzo SL, O’Geen H, Komashko VM, Krig SR, Jin VX, Jang Sw, et al. Suz12 binds to silenced regions of the genome in a celltypespecific manner. Genome Res. 2006;16(7):890–900.
Accomando WP, Wiencke JK, Houseman EA, Butler RA, Zheng S, Nelson HH, et al. Decreased NK cells in patients with head and neck cancer determined in archival DNA. Clin Cancer Res. 2012;18(22):6147–54.
Zouridis H, Deng N, Ivanova T, Zhu Y, Wong B, Huang D, et al. Methylation subtypes and largescale epigenetic alterations in gastric cancer. Sci Transl Med. 2012, 4(156).156ra140156ra140.
Hlady RA, Tiedemann RL, Puszyk W, Zendejas I, Roberts LR, Choi JH, et al. Epigenetic signatures of alcohol abuse and hepatitis infection during human hepatocarcinogenesis. Oncotarget. 2014;5(19):9425.
Zaina S, Heyn H, Carmona FJ, Varol N, Sayols S, Condom E, et al. A DNA methylation map of human atherosclerosis. Circ Cardiovasc Genet. 2014;7(5):692–700.
Acknowledgments
Funding for this work was provided by grants from the National Institutes for Health: NIMH R01MH094609 (EAH and CJM), NIEHS P01 ES022832 (CJM), R01CA052689 (JKW), P50CA097257 (JKW). Funding was also provided by EPA grant RD83544201 (CJM).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
EAH: Developed methods, conducted analysis, served as primary author. KTK: Provided feedback on biological concepts. JKW: Provided feedback on biological concepts. CJM: Assisted in authoring of manuscript by providing feedback on biological concepts. All authors read and approved the final manuscript.
Additional file
Additional file 1:
Supplementary Materials.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Houseman, E.A., Kelsey, K.T., Wiencke, J.K. et al. Cellcomposition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinformatics 16, 95 (2015). https://doi.org/10.1186/s128590150527y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590150527y