 Research
 Open Access
 Published:
Deconvolution of tumor composition using partially available DNA methylation data
BMC Bioinformatics volume 23, Article number: 355 (2022)
Abstract
Background
Deciphering proportions of constitutional cell types in tumor tissues is a crucial step for the analysis of tumor heterogeneity and the prediction of response to immunotherapy. In the process of measuring cell population proportions, traditional experimental methods have been greatly hampered by the cost and extensive dropout events. At present, the public availability of large amounts of DNA methylation data makes it possible to use computational methods to predict proportions.
Results
In this paper, we proposed PRMeth, a method to deconvolve tumor mixtures using partially available DNA methylation data. By adopting an iteratively optimized nonnegative matrix factorization framework, PRMeth took DNA methylation profiles of a portion of the cell types in the tissue mixtures (including blood and solid tumors) as input to estimate the proportions of all cell types as well as the methylation profiles of unknown cell types simultaneously. We compared PRMeth with five different methods through three benchmark datasets and the results show that PRMeth could infer the proportions of all cell types and recover the methylation profiles of unknown cell types effectively. Then, applying PRMeth to four types of tumors from The Cancer Genome Atlas (TCGA) database, we found that the immune cell proportions estimated by PRMeth were largely consistent with previous studies and met biological significance.
Conclusions
Our method can circumvent the difficulty of obtaining complete DNA methylation reference data and obtain satisfactory deconvolution accuracy, which will be conducive to exploring the new directions of cancer immunotherapy. PRMeth is implemented in R and is freely available from GitHub (https://github.com/hedingqin/PRMeth).
Background
Intratumor heterogeneity is formed by the dynamic interactions of different tumor cell populations (or subclones), infiltrating immune cells, and stromal cells in the tumor microenvironment [1,2,3,4]. Studies have shown that intratumor heterogeneity is closely related to clinical prognoses such as tumor growth, metastasis, recurrence, and drug resistance [5]. The tumor heterogeneity can be measured by the number of cell populations in tumor tissues, their molecular profiles, and their proportions. Specifically, cell type proportion prediction is an important task for multiomic data analysis or clinical studies. For example, accounting for cell type proportions is proven to be helpful for EpigenomeWide Association Study (EWAS) [6], and the composition of infiltrating immune cells in tumor tissues is predictive of the response to checkpoint inhibitor immunotherapy [7].
Currently, experimental techniques including flow cytometry and singlecell techniques such as Dropseq [8], 10X Genomics, and sciRNAseq [9] have been used to study cellular components in complex tissues, but they are costly [10] and sensitive to technical changes during cell isolation. Thus, in recent years, computational estimation of cellular components using gene expression or DNA methylation data has become a hot topic in computational biology [10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. Compared to gene expression, DNA methylation has the advantage of being more stable [28], highly celltype specific [29], and easier to measure in formalinfixed paraffinembedded (FFPE) tissues [30]. As a result, DNA methylation is more suitable for studying cellular components. Currently, the methods based on DNA methylation can be broadly classified into two categories: referencebased methods and referencefree methods. Among the referencebased methods, Houseman et al. [11] proposed a linear regression method (QP) based on DNA methylation, which uses quadratic programming to ensure that the regression coefficients are nonnegative. Teschendorff et al. [16] developed EpiDISH, which uses nonconstrained weighted linear regression rather than linear regression to reduce the weights of data points with large residuals. Altboum et al. proposed DCQ [13], which modifies the deconvolution approach into a regularized regression model to reduce the number of model parameters. Inspired by the success of CIBERSORT [14] in gene expression decomposition, Chakravarthy et al. [10] analyzed the cell type composition of complex mixtures using support vector regression based on DNA methylation data and obtained more accurate estimates. The latest reference method based on DNA methylation is Emeth [21], which uses a mixture distribution based on ICeDT [20] to identify CpG sites whose DNA methylation in tumor samples is inconsistent with the reference methylation profiles and to reduce the contribution of these aberrant sites in cell type abundance estimation.
The general limitation of the above reference methods is that they require DNA methylation profiles of specific cell types as input, but in practice, it is difficult to obtain DNA methylation profiles of all cellular components in tumor tissues [31]. To overcome this limitation, many researchers have developed referencefree methods. For example, James et al. [22] proposed a combination (FASTLMMeWasher) of linear mixed models and principal components to correct the composition of cell types automatically. Houseman et al. [23] applied an iterative quadratic programming framework (RF) to DNA methylation for cell type analysis. Motivated by previous research, Lutsik et al. [26] developed MeDeCom by combining constrained nonnegative matrix factorization with a new biologically relevant regularization function. Such methods do not rely on reference information and aim to estimate molecular profiles and proportions of all cell types simultaneously, unfortunately, their prediction accuracies are far from satisfactory. However, in real clinical practice, gene expression or DNA methylation is often available for only a small fraction of cell types, and reference information for the remaining cell types is unknown. To overcome these limitations, easily available data for a portion of cell types in a tumor mixture can be used as a reference to deconvolute the entire tumor mixture.
In this paper, we proposed a method for partially–reference cell type decomposition using DNA methylation data (PRMeth). PRMeth used an iteratively optimized nonnegative matrix factorization framework, which took DNA methylation profiles of a portion of the cell types in the tissue mixtures (including blood and solid tumors) as input to estimate the proportions of all cell types as well as the methylation profiles of unknown cell types simultaneously. Based on three benchmark datasets, we compared PRMeth with five different methods (i.e., ReferenceFree (RF) [23], Quadratic Programming (QP) [11], CIBERSORT (CBS) [14], Digital cell quantification (DCQ) [13], and Epigenetic Dissection of IntraSample Heterogeneity (EpiDISH) [16]). The results showed that PRMeth outperformed the other five methods. PRMeth was then applied to four types of tumors from The Cancer Genome Atlas (TCGA) [32] database, i.e., skin cutaneous melanoma (SKCM), invasive breast carcinoma (BRCA), acute myeloid leukemia (LAML), and thymoma (THYM). The experimental results revealed that immune cell proportions estimated by PRMeth were in good agreement with previous studies and PRMeth could provide new insights into tumor heterogeneity and immunotherapy.
Methods
Simulation data
The simulation dataset was constructed from five immune cells (including neutrophils, CD4 + T cells, CD8 + T cells, natural killer cells (NK), CD19 + B cells) (GSE88824), one nonsmall cell lung cancer cell (A549), and one normal human bronchial epithelial cell (NHBEC) (GSE92843) available from the Gene Expression Omnibus (GEO) [33]. To obtain the methylation profiles of the cell types, we loaded their respective IDAT files using the champ.load (ChAMP package in R) and filtered out 79,818 probes with a detection p value > 0.01, a beadcount < 3 in at least 5% of samples, nonCPGs, SNPs, MultiHit, and locating on X, Y chromosome. Then, the filtered data were normalized by the champ.norm and their batch effects were eliminated by the champ.runCombat. Finally, we were able to obtain the methylation profiles for seven different cell types (recorded as base profiles).
Next, the base profiles were employed to generate the methylation profiles of nonsmall cell lung cancer (NSCLC) samples with different cell type proportions and levels of noise. In the first step, we randomly generated the proportions of all cell types for each NSCLC sample based on the Dirichlet distribution. In detail, the proportions of A549 cell, NHBEC, and immune cells are 60%, 10%, and 30%, respectively. These proportions are in accordance with the true proportions of the cell types found in NSCLC samples [25]. In the second step, we generated methylation profiles of the cell types with different levels of noise from an independent beta distribution with mean and variance inferred from the base profiles (see Results for details). In the third step, the methylation profiles of the cell types with different noise levels were linearly combined according to the above ratios as the methylation profiles of NSCLC samples. In the end, the methylation profiles of 100 NSCLC samples were obtained. We used the Dirichlet distribution to generate proportions of cell types 20 times randomly and then obtained 20 simulation datasets at each noise level. The 20 simulation datasets are used to validate the performance of the proposed method PRMeth.
Real data obtained from experiments
Besides the simulation dataset, we also applied our method to the following three datasets. In the first dataset, the methylation profiles of 100 mixture samples, the methylation profiles of seven types of immune cells (including CD4 + T cells, CD8 + T cells, monocytes, B cells, NK cells, neutrophils, and T regulatory cells) constituting mixture samples, and the proportions of all cell types for each sample were provided by Zhang et al. [21]. This dataset is referred to as the Zhang dataset in this paper.
In the second dataset, the methylation profiles of six whole blood samples and their constitutional cell types (including CD4 + T cells, CD8 + T cells, monocytes, B cells, NK cells, neutrophils, and eosinophils) were obtained from Chakravarthy et al. [10] via the GEO accession number GSE35069, and the proportions of each cell type were measured by flow cytometry as provided by the authors [34].
In the third dataset, the methylation profiles of skin cutaneous melanoma (SKCM), invasive breast carcinoma (BRCA), acute myeloid leukemia (LAML), and thymoma (THYM) samples were downloaded from the TCGA database. To facilitate the comparison, 100 tumor samples were randomly selected for each cancer type. As the reference for deconvolution, the methylation profiles of seven immune cells (including monocytes, dendritic cells, macrophages, eosinophils, naive T cells, CD8 + T cells, and NK cells) were obtained from Arneson et al. [35] via the GEO accession numbers GSE35069, GSE59250, and GSE71837. Meanwhile, the batch effects between the methylation profiles of tumor samples and those of immune cell types were eliminated by the ComBat function in sva package of R.
PRMeth model construction
The framework of PRMeth is illustrated in Fig. 1. It is assumed that the methylation profiles of tumor tissues are mixture signals from their constitutional cell types, where only a part of them have available methylation profiles. We proposed a nonnegative matrix factorization scheme (Fig. 1A) and an iterative algorithm (Fig. 1B) to estimate the proportions of all cell types and the methylation profiles of unknown cell types simultaneously.
We denote \(Y\in {R}_{+}^{m\times n}\) as the methylation profiles of m CpG sites in n tumor mixtures. Suppose that the tumor mixtures are made up of \(K\) cell types with a certain proportion. According to the deconvolution model:
where \({W}_{1}\in {R}_{+}^{m\times {K}_{1}}\), \({W}_{2}\in {R}_{+}^{m\times {K}_{2}}\) denote the methylation profiles of \({K}_{1}\) known cell types and \({K}_{2}\) unknown cell types (\(K={K}_{1}+{K}_{2}\)), \({H}_{1}\in {R}_{+}^{{K}_{1}\times n}\) and \({H}_{2}\in {R}_{+}^{{K}_{2}\times n}\) denote the proportions of known and unknown cell types, respectively. \(\varepsilon\) is an \(m\times n\) error matrix. Observing that \({y}_{ij}\in Y\), \({w}_{1\left(ij\right)}\in {W}_{1}\) and \({w}_{2(ij)}\in {W}_{2}\) represent the DNA methylation level (i.e., beta value) of a CpG site, then \(0\le {{y}_{ij},h}_{1\left(ij\right)},{h}_{2\left(ij\right)}\le 1\). And the proportions \({h}_{1\left(ij\right)},{ h}_{2\left(ij\right)}\) of \(K\) cell types in a mixture satisfy \(0\le {h}_{1\left(ij\right)},{h}_{2\left(ij\right)}\le 1\) and \(\sum_{i=1}^{{K}_{1}}{h}_{1\left(ij\right)} + \sum_{i={K}_{1}+1}^{{K}_{2}}{h}_{2\left(ij\right)}= 1\).
In this model, the methylation profiles \(Y\) of the mixtures and the methylation profiles \({W}_{1}\) of the partial cell types were known, and we aimed to estimate the proportions \(\left(\genfrac{}{}{0pt}{}{{H}_{1}}{{H}_{2}}\right)\) of all cell types and the methylation profiles \({W}_{2}\) of unknown cell types, which could be obtained by solving for the minimization error sum of squares, thus transforming Eq. (1) into:
where \(\cdot {}_{F}^{2}\) denotes the Frobenius norm.
Next, an iterative algorithm was used to estimate \(\left(\genfrac{}{}{0pt}{}{{H}_{1}}{{H}_{2}}\right)\) and \({W}_{2}\). As shown in Fig. 1B, we fixed the obtained partial reference data \({W}_{1}\) and iterated the values of \(\left(\genfrac{}{}{0pt}{}{{H}_{1}}{{H}_{2}}\right)\) and \({W}_{2}\) continuously by Eq. (2) to calculate their final results. The detailed algorithm flow is as follows:
where \(t\) is the number of iterations. In step a, we employed the RPMM [36] algorithm to initialize the methylation profiles \({W}_{2}\) of unknown cell types. In detail, RPMM is a clustering algorithm that clusters the methylation profiles \(Y\) of tumor samples into \({K}_{2}\) clusters by the binary distance formula and takes the clustering centers as the initial value of \({W}_{2}\). Furthermore, we compared RPMM with six initialization approaches including five different clustering algorithms (i.e., canberra, euclidean, manhattan, maximum, and minkowski) and a random generation algorithm (random). As shown in Additional file 1: Figure S1, there were no significant differences between the seven methods, but RPMM outperformed the other approaches in estimating proportions on the simulation dataset.
For PRMeth, if profiles of all constitutional cell types are available (i.e., \({K}_{1}=K\)), it is actually the QP method. On the contrary, if none of the constitutional cell types is known (\(\mathrm{i}.\mathrm{e}., {K}_{1}=0\)), the PRMeth method turns to the RF method. Therefore, the PRMeth method is a more general framework that includes the referencebased and referencefree methods as two special cases.
CpG site selection
The total number of CpG sites in the human genome is very huge. To reduce the potential noise and improve the computational efficiency, we selected CpG sites with high methylation variation in tumor samples by the coefficient of variation (\({c}_{v}\)) as follows:
where \(\sigma\) and \(\mu\) denote the standard deviation and mean of a CpG site in \(Y\), respectively. We sorted these sites according to \({c}_{v}\) and then selected the top n with the highest \({c}_{v}\) values as input features.
Cell type number prediction
In our method, the number \(K\) of cell types in tumor mixtures needs to be specified. The Bayesian information criterion (BIC) [37] is an important measure of model superiority that can give the optimal number of parameters in the model. Therefore, BIC was selected to identify \(K\) in the tumor mixtures. Furthermore, in order to weaken the penalty, a penalty factor \(\mathrm{was introduced}\). \(\lambda \_BIC\) is defined by the formula:
where \(N\) denotes the sample size, \(P\) denotes the number of model parameters, \(SSR\) denotes the residual sum of squares between the true and estimated methylation profiles of the tumor mixtures, and \(\lambda\) denotes the penalty factor, whose size is restricted to \((0, 1)\). In the PRMeth model, \(N=n\times m\) as well as \(P=K\left(n+m\right){nK}_{1}\), where \(n\), \(m\), \(K\) and \({K}_{1}\) denote the number of tumor mixtures, the number of CpG sites, the total number of cell types, and the number of known cell types, respectively. Different \(K\) values correspond to different \(\lambda \_BIC\) values, and the \(K\) value corresponding to the smallest \(\lambda \_BIC\) value is the optimal number of cell types for the tumor mixtures.
Results
Research design
The five methods, i.e., QP, DCQ, EpiDISH, RF, and CBS, are stateoftheart methods for the DNA methylation deconvolution task. Among them, RF, QP, DCQ, and EpiDISH used the linear model and CBS used the most popular nonlinear model (support vector regression). The two models were also adopted by the other deconvolution methods introduced in the Background section, so we compared PRMeth with the five methods.

1.
RF [23], a referencefree method for solving cell type proportions and cell type methylation profiles using iterative quadratic programming;

2.
QP [11], a referencebased method for solving cell type proportions using quadratic programming;

3.
CBS [14], a referencebased method for inferring the proportions of tumorinfiltrating immune cells using support vector regression;

4.
DCQ [13], a referencebased method for inferring the global dynamics of the number of immune cells in complex tissues using elastic net regularization.

5.
EpiDISH [16], a referencebased method for estimating cell type proportions using nonconstrained weighted linear regression.
The mean absolute error (MAE) and Pearson correlation coefficient (PCC) were used to evaluate the performance of different methods. In detail, MAE measures the mean absolute error between the estimated and true values of cell type proportions or cell type methylation profiles, and PCC quantifies the correlation coefficient between the estimated and true values of cell type proportions or cell type methylation profiles, with values ranging from \([1, 1]\).
Determination of the number of cell types
The number of cell types should be specified first for PRMeth. However, it is not a trivial task since we are infeasible to know the exact number without a singlecell sequencing experiment. We here determined the number \(K\) of cell types in mixture samples using \(\lambda \_BIC\), a modified Bayesian information criterion (see Methods for details). Assuming that the methylation profiles \(Y\) of tumor mixtures and the methylation profiles \({W}_{1}\) of \({K}_{1}\) cell types are known, penalty factor \(\lambda\) is taken as \(\mathrm{0.1,0.2},\dots ,0.9\), and \(K\) is chosen as \({K}_{1}+1,{K}_{1}+2,\dots ,{K}_{1}+k\), where \(k\le 30\). All \(\lambda\) and \(K\) were traversed to calculate their corresponding \(\lambda \_BIC\) values. The optimal number of cell types was determined as the \(K\) with the smallest \(\lambda \_BIC\) value.
\(\lambda \_BIC\) was tested on the Zhang dataset with in total 7 cell types by setting \({K}_{1}\) as 2, 3, 4, and 5, respectively. It was observed that the smallest \(\lambda \_BIC\) values corresponded to \(\lambda =0.3, 0.4, 0.4, 0.5\), and \(K=7\) for all \({K}_{1}\). When \(\lambda\) was fixed as \(0.3, 0.4, 0.4,\) or \(0.5\), we plotted the \(\lambda \_BIC\) values with \(K\) as shown in Fig. 2. As expected, all \(\lambda \_BIC\) values decreased first and then increased with the increase of \(K\), and PRMeth could successfully predict the correct number (\(K=7\)) of cell types in all scenarios.
Evaluation of different methods using simulation data
After successfully determining the total number of cell types, we next evaluated the estimation accuracy of PRMeth on the simulation dataset. First, the top 1000 CpG sites with the highest coefficient of variation (\({c}_{v}\)) were selected as the input for six methods. Then, we calculated the mean absolute error (MAE) between the true and predicted proportions of available cell types for each method at different noise levels. Here, the random noise was generated by a beta distribution whose mean is the methylation level of each site for each cell type in the base profiles and whose variance is a certain percentage of the maximum variance (i.e., \(mean*(1mean)\)) calculated by the above mean. In detail, we took 10%, 20%, 30%, and 40% of the maximum variance when processing lung cancer cell types and 5%, 10%, 15%, and 20% for normal cell types. As shown in Fig. 3A, the MAE of all six methods increased with the increase of the noise level. Compared to other methods, PRMeth consistently obtained the lowest bias and relatively stable results at all noise levels. When the noise level was (0.1, 0.05), we evaluated the performance of PRMeth in estimating the proportions of cell types at different numbers (\({K}_{2}=2, 3, 4, 5\)) of unknown cell types. It is shown that PRMeth always obtained the lowest and most stable bias, however, the MAE of the remaining methods all gradually increased with the increasing number of unknown cell types (Fig. 3B). For the three remaining noise levels (0.2, 0.1), (0.3, 0.15), and (0.4, 0.2), PRMeth performed similarly well (Additional file 1: Figure S2).
In addition to proportion prediction, PRMeth (as well as RF) can also infer the methylation profiles of cell types. Figure 3C–F show the MAE and Pearson correlation coefficient (PCC) between the true and predicted cell type methylation profiles calculated by the two methods at different noise levels or different numbers of unknown cell types. At all noise levels, PRMeth achieved consistently higher accuracy (Fig. 3C) and correlation (Fig. 3E) compared to RF. Furthermore, when the noise level was (0.1, 0.05), the MAE of PRMeth gradually increased but remained lower than RF (Fig. 3D) and its PCC decreased gradually but remained higher than RF (Fig. 3F) with the increasing number of unknown cell types. PRMeth exhibited the same results as Fig. 3D, F compared to the referencefree method at the three remaining noise levels (Additional file 1: Figure S3).
We also evaluated the computational performance of these six methods. As shown in Additional file 1: Table S1, executing 20 times at 100 samples and 1000 CpG sites, both the running time and memory usage of PRMeth is a little higher than the other methods. This is because many iterations are required to reach the optimal solution. In addition, we analyzed the running time and memory usage of PRMeth when the number of samples and features gradually increased. This reveals that the running time of PRMeth increased as the number of samples and features gradually increased, but there was no clear pattern in its memory usage (Additional file 1: Table S2).
Evaluation of different methods using Zhang data
We then evaluated different methods on the Zhang dataset from three aspects, i.e., the accuracies of six methods in estimating the proportions of known cell types, the accuracies of PRMeth and RF in estimating the proportions of all cell types, and the overall performance of proportion estimates at different numbers of unknown cell types. First, by setting \({K}_{1}\) as 4, we calculated the MAE between the true and predicted proportions of each of the four cell types using the six methods. Figure 4A, B demonstrate that PRMeth had the lowest MAE at both CD4 + T cells and monocytes compared to other methods. Figure 4C shows that RF had the lowest bias (\({MAE}_{RF}=0.0631\)) at CD8 + T cells, followed by PRMeth (\({MAE}_{PRMeth}=0.0775\)). About the MAE of B cells, PRMeth ranked fourth, which was slightly higher than EpiDISH, CBS, and QP (Fig. 4D). In general, PRMeth had better results for the proportion estimates of a single cell type compared to other methods. A similar performance was obtained by PRMeth when \({K}_{1}=2, 3, 5\) (Additional file 1: Figures S4, S5 and S6). Second, we obtained the MAE between the true and predicted proportions for each of all cell types using PRMeth and RF when \({K}_{1}=3\). Except for CD8 + T cells, the MAE of PRMeth was lower than RF for the remaining six cell types (Fig. 4E). Overall, our method had higher accuracy in predicting the proportions of each cell type compared to RF when \({K}_{1}=2, \mathrm{3,4}, 5\) (Fig. 4E and Additional file 1: Figure S7). Finally, the PCC between the true and predicted proportions of known cell types obtained by the six methods at different numbers of unknown cell types is shown in Fig. 4F. As the number of unknown cell types increased, the PCC of both PRMeth and referencebased methods decreased. An exception is RF, which does not require reference data as input. It is clear that the PCC of PRMeth was always the highest and that of the referencefree method was always the lowest. When calculating the MAE between the true and predicted proportions of known cell types using the six methods at different numbers of unknown cell types, it is found that PRMeth consistently showed superiority over other methods (Additional file 1: Figure S8).
In addition, we estimated the methylation profiles of cell types using PRMeth and RF. We found that the accuracy and correlation of the methylation profiles obtained by PRMeth at different numbers of unknown cell types were higher than RF (Additional file 1: Figure S9).
Evaluation of different methods using whole blood data
Next, we further validated our method on whole blood samples. We calculated MAE between the true and estimated proportions of known cell types by the six methods at \({K}_{1}=2, 3, 4, 5\). As shown in Fig. 5A–C, and Additional file 1: Figure S10, PRMeth showed the lowest bias at all values of \({K}_{1}\). We then compared all cell type proportions predicted by PRMeth with the true proportions measured by flow cytometry. This reveals that the estimation accuracy of PRMeth increased with increasing \({K}_{1}\) (Additional file 1: Figure S11 and Fig. 5D) and only a few predictions deviated from the true values at \({K}_{1}=5\) (Fig. 5D).
Similarly, we also estimated the cell type methylation profiles and found that the accuracy and correlation of PRMeth were consistently higher than RF (Additional file 1: Figure S12).
Application to TCGA data
Finally, we applied PRMeth to real tumor samples from TCGA. We selected seven types of immune cells (including monocytes, dendritic cells, macrophages, eosinophils, naive T cells, CD8 + T cells, and natural killer cells) as known partial reference data, and then deconvolved 400 tumor samples including 100 SKCM samples, 100 BRCA samples, 100 LAML samples, and 100 THYM samples. We first determined the total number of cell types in the four types of tumor samples using \(\lambda \_BIC\) and the \(K\) were 32, 29, 24, and 22, respectively. Because tumor tissue is a mixture of different cell types with a laminated structure that contains multiple cell types with different morphologies in each layer [38], we combined some cell types and assumed that the total numbers of cell types were 18, 16, 12, and 11 for SKCM, BRCA, LAML, and THYM, respectively. We then estimated the proportions of all cell types in these tumor samples using PRMeth and converted the absolute proportions of immune cells into relative proportions of each immune cell to all immune cells. As expected, different tumor samples showed different infiltration patterns of immune cells (Fig. 6A). In invasive breast carcinoma samples, macrophages occupied the highest proportion among all immune cells, which was consistent with previous literature findings [39] that a hallmark of breast cancer is high infiltration of M2 tumorassociated macrophages. The high infiltration levels of CD8 + T cells and macrophages in skin cutaneous melanoma samples were consistent with the study [40]. Acute myeloid leukemia and thymoma samples had high proportions of monocytes [35] and naive T cells [41], respectively.
To investigate the relationship between cell type proportions and tumor types, we used the Shannon index [42] representing the diversity of biomes to describe the heterogeneity degree of tumor samples. As shown in Fig. 6B, the heterogeneity scores (i.e., 1.6807, 1.6555, 1.3524, and 1.2401) of BRCA, SKCM, LAML, and THYM were significantly different, which illustrates the estimated proportions from PRMeth met the biological significance. We also analyzed the impact of the predicted proportions of cell types on the survival of cancer patients. We first used the surv_cutpoint function in the survminer package of R to divide cancer patients into high and lowinfiltrating groups based on the proportions of specific cell types (including known immune cells and estimated unknown cells), and then used Cox proportional hazards regression to calculate the survival rates of these two groups. We found that SKCM patients with a high infiltration level of CD8 + T cells and THYM patients with a high infiltration level of macrophages both had good overall survival (p = 0.0022 and 0.02, Fig. 6C, E), which was consistent with previous findings by Ma et al. [40] and Yang et al. [43]. In contrast, LAML patients with a high infiltration level of NK cells had poorer overall survival than those with a low infiltration level (p = 0.0449, Fig. 6D), which was consistent with the study's results [44] that the NK cells activated with high expression were associated with a poor prognosis. In addition, we also found that several unknown cell types had an impact on the survival of cancer patients (Fig. 6F–J).
Discussion
In this paper, we proposed a cell type decomposition model (PRMeth) based on partially available DNA methylation data, which employs a nonnegative matrix factorization and an iterative optimization algorithm. Given reasonable parameter settings, PRMeth could infer the proportions of all cell types and recover the methylation profiles of unknown cell types effectively. The study on the TCGA dataset showed that the immune cell proportions estimated by PRMeth were largely consistent with previous studies and met the biological significance. Compared to existing methods, the advantages of PRMeth are mainly reflected in the following points. First, PRMeth is applied to DNA methylation data that are relatively stable and easier to measure. Second, using partial DNA methylation data as a reference can reduce the difficulty of obtaining complete DNA methylation data. Third, PRMeth can infer not only the proportions of known cell types but also those of unknown cell types. Fourth, although the PRMeth method is driven by cancer research, it can be applied to other tissues, such as blood, to study the composition of cell types associated with other diseases, such as autoimmune diseases.
Despite its advantages, our study also suffers from the following limitations. First, our method requires the total number of cell types as input. The results on the Zhang dataset show that our method could obtain the exact total number of cell types using \(\lambda \_BIC\). However, the total number of cell types is often uncertain because all cells of a complex tumor tissue form a laminated structure. In other words, cells are grouped by similarities so the total number of cell types can be determined by different groupings. Therefore, we encourage users to conduct downstream association analysis by choosing a reasonable \(K\) in their study. Second, PRMeth does not apply to the estimation of cell type proportions for a single sample. In the future, we will expand the applicability of PRMeth and explore the relationship between cell type proportions and tumor subtypes, which may help to determine the optimal treatment regimen for a specific patient and predict potential targets for cancer immunotherapy.
Conclusion
Different from the available referencebased and referencefree methods, the proposed method PRMeth is based on partial reference information, which is more in line with real clinical practice. It not only circumvents the difficulty of obtaining complete DNA methylation reference data but also obtains satisfactory deconvolution accuracy, which will be conducive to the reduction of medical costs, the analysis of tumor heterogeneity, and the exploration of new directions of cancer immunotherapy.
Availability of data and materials
The datasets used during the current study include the simulation dataset, the Zhang dataset, the whole blood dataset, and the TCGA dataset. The GEO accession codes for the simulation dataset are GSE88824 and GSE92843. The Zhang dataset can be downloaded from https://github.com/Hanyuz1996/EMeth. The whole blood dataset can be accessed through the GEO accession number GSE35069 and the link https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0041361. The TCGA dataset is available via the Cancer Genome Atlas database and the GEO accession numbers GSE35069, GSE59250, and GSE71837. PRMeth is implemented in R and is freely available from GitHub (https://github.com/hedingqin/PRMeth).
Abbreviations
 TCGA:

The cancer genome atlas
 EWAS:

Epigenomewide association study
 FFPE:

Formalinfixed paraffinembedded tissues
 BRCA:

Invasive breast carcinoma
 SKCM:

Skin cutaneous melanoma
 LAML:

Acute myeloid leukemia
 THYM:

Thymoma
 NK:

Natural killer cell
 NHBEC:

Human bronchial epithelial cell
 GEO:

Gene expression omnibus
 NSCLC:

Nonsmall cell lung cancer
 BIC:

Bayesian information criterion
 RF:

ReferenceFree
 QP:

Quadratic programming
 CBS:

CIBERSORT
 DCQ:

Digital cell quantification
 EpiDISH:

Epigenetic dissection of intrasample heterogeneity
 MAE:

Mean absolute error
 PCC:

Pearson correlation coefficient
References
Baghba R, Roshangar L, JahanbanEsfahlan R, Seidi K, EbrahimiKalan A, Jaymand M, et al. Tumor microenvironment complexity and therapeutic implications at a glance. Cell Commun Signal. 2020;18(1):1–19.
Joyce JA, Fearon DT. T cell exclusion, immune privilege, and the tumor microenvironment. Science. 2015;348(6230):74–80.
Kessenbrock KPV, Werb Z. Matrix metalloproteinases: regulators of the tumor microenvironment. Cell. 2010;141(1):52–67.
Ren XKB, Zhang Z. Understanding tumor ecosystems by singlecell sequencing: promises and limitations. Genome Biol. 2018;19(1):211.
Oshimori N, Oristian D, Fuchs E. TGFbeta promotes heterogeneity and drug resistance in squamous cell carcinoma. Cell. 2015;160(5):963–76.
Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenomewide association studies. Genome Biol. 2014;15(2):1–9.
Ribas A, Wolchok JD. Cancer immunotherapy using checkpoint blockade. Science. 2018;359(6382):1350.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.
Cao JY, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, et al. Comprehensive singlecell transcriptional profiling of a multicellular organism. Science. 2017;357(6352):661–7.
Chakravarthy A, Furness A, Joshi K, Ghorani E, Ford K, Ward MJ, et al. Pancancer deconvolution of tumour composition using DNA methylation. Nat Commun. 2018;9:1–13.
Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinform. 2012;13:1–16.
Quon G, Haider S, Deshwar AG, Cui A, Boutros PC, Morris Q. Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction. Genome Med. 2013;5:1–20.
Altboum Z, Steuerman Y, David E, BarnettItzhaki Z, Valadarsky L, KerenShaul H, et al. Digital cell quantification identifies global immune cell dynamics during influenza infection. Mol Syst Biol. 2014;10(2):720.
Newman AM, Liu CL, Green MR, Gentles AJ, Feng WG, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12(5):453.
Li B, Severson E, Pignon JC, Zhao HQ, Li TW, Novak J, et al. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;17:1–16.
Teschendorff AE, Breeze CE, Zheng SJC, Beck S. A comparison of referencebased algorithms for correcting celltype heterogeneity in EpigenomeWide Association Studies. BMC Bioinform. 2017;18:1–14.
Racle J, de Jonge K, Baumgaertner P, Speiser DE, Gfeller D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. Elife. 2017;6.
Wang XR, Park J, Susztak K, Zhang NR, Li MY. Bulk tissue cell type deconvolution with multisubject singlecell expression reference. Nat Commun. 2019;10:1–9.
Wang ZY, Cao SL, Morris JS, Ahn J, Liu R, Tyekucheva S, et al. Transcriptome deconvolution of heterogeneous tumor samples with immune infiltration. Iscience. 2018;9:451.
Wilson DR, Jin C, Ibrahim JG, Sun W. ICeDT provides accurate estimates of immune cell abundance in tumor samples by allowing for aberrant gene expression patterns. J Am Stat Assoc. 2020;115(531):1055–65.
Zhang HY, Cai RY, Dai J, Sun W. EMeth: an EM algorithm for cell type decomposition based on DNA methylation data. Sci Rep. 2021;11(1):1–12.
Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J. Epigenomewide association studies without the need for celltype composition. Nat Methods. 2014;11(3):309U283.
Houseman EA, Molitor J, Marsit CJ. Referencefree cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014;30(10):1431–9.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu DL, Galanter J, et al. Sparse PCA corrects for cell type heterogeneity in epigenomewide association studies. Nat Methods. 2016;13(5):443.
Onuchic V, Hartmaier RJ, Boone DN, Samuels ML, Patel RY, White WM, et al. Epigenomic deconvolution of breast tumors reveals metabolic coupling between constituent cell types. Cell Rep. 2016;17(8):2075–86.
Lutsik P, Slawski M, Gasparoni G, Vedeneev N, Hein M, Walter J. MeDeCom: discovery and quantification of latent components of heterogeneous methylomes. Genome Biol. 2017;18:1–20.
Qin YF, Zhang WW, Sun XQ, Nan SW, Wei NN, Wu HJ, et al. Deconvolution of heterogeneous tumor samples using partial reference signals. Plos Comput Biol. 2020;16(11):e1008452.
Daugaard I, Kjeldsen TE, Hager H, Hansen LL, Wojdacz TK. The influence of DNA degradation in formalinfixed, paraffinembedded (FFPE) tissue on locusspecific methylation assessment by MSHRM. Exp Mol Pathol. 2015;99(3):632–40.
Baron U, Tuerbachova I, Hellwag A, Eckhardt F, Berlin K, Hoffmuller U, et al. DNA methylation analysis as a tool for cell typing. Epigenetics. 2006;1(1):55–60.
Thirlwell C, Eymard M, Feber A, Teschendorff A, Pearce K, Lechner M, et al. Genomewide DNA methylation analysis of archival formalinfixed paraffinembedded tissue using the Illumina Infinium HumanMethylation27 BeadChip. Methods. 2010;52(3):248–54.
Teschendorff AE, Relton CL. Statistical and integrative systemlevel analysis of DNA methylation data. Nat Rev Genet. 2018;19(3):129–47.
Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158(4):929–44.
Edgar R, Domrachev M, Lash AE. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–10.
Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlen SE, Greco D, et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS ONE. 2012;7(7):e41361.
Arneson D, Yang X, Wang K. MethylResolvera method for deconvoluting bulk DNA methylation profiles into known and unknown cell contents. Commun Biol. 2020;3(1):1–13.
Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, et al. Modelbased clustering of DNA methylation array data: a recursivepartitioning algorithm for highdimensional data arising as a mixture of beta distributions. BMC Bioinform. 2008;9:1–15.
Neath AA, Cavanaugh JE. The Bayesian information criterion: background, derivation, and applications. Wiley Interdiscip Rev Comput Stat. 2012;4(2):199–203.
Ahn J, Yuan Y, Parmigiani G, Suraokar MB, Diao LX, Wistuba II, et al. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics. 2013;29(15):1865–71.
ValetaMagara A, Gadi A, Volta V, Walters B, Arju R, Giashuddin S, et al. Inflammatory breast cancer promotes development of M2 tumorassociated macrophages and cancer mesenchymal cells through a complex chemokine network. Can Res. 2019;79(13):3360–71.
Ma JC, Jin Y, Gong BC, Li L, Zhao Q. Pancancer analysis of necroptosisrelated gene signature for the identification of prognosis and immune significance. Discover Oncol. 2022;13(1):1–24.
Strobel P, Helmreich M, Menioudakis G, Lewin SR, Rudiger T, Bauer A, et al. Paraneoplastic myasthenia gravis correlates with generation of mature naive CD4(+) T cells in thymomas. Blood. 2002;100(1):159–66.
Konopinski MK. Shannon diversity index: a call to replace the original Shannon’s formula with unbiased estimator in the population genetics studies. Peerj. 2020;8:e9391.
Yang J, Zhang Y, Song H. A disparate role of RP11424C20. 2/UHRF1 axis through control of tumor immune escape in liver hepatocellular carcinoma and thymoma. Aging. 2019;11(16):6422.
Huang L, Lin L, Fu X, Meng C. Development and validation of a novel survival model for acute myeloid leukemia based on autophagyrelated genes. PeerJ. 2021;9:e11968.
Acknowledgements
Not applicable.
Funding
This work was supported in part by Shanghai Science and Technology Innovation Action Planning (20dz1203800 to M.C.), Research and Development Planning in Key Areas of Guangdong Province (2021B0202070001 to M.C.), and the National Natural Science Foundation of China (61702325 to Y.Q.).
Author information
Authors and Affiliations
Contributions
DH conducted the experiments, performed the data analysis, and wrote the paper. YQ designed the study and supervised the research. MC contributed a critical review. WW embellished the language of the paper. CS provided some of the materials and experimental suggestions. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Supplementary Figures and Tables.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
He, D., Chen, M., Wang, W. et al. Deconvolution of tumor composition using partially available DNA methylation data. BMC Bioinformatics 23, 355 (2022). https://doi.org/10.1186/s12859022048937
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022048937
Keywords
 Cell population proportions
 Tumor heterogeneity
 Immunotherapy
 DNA methylation data
 Nonnegative matrix factorization