- Research Article
- Open Access
methCancer-gen: a DNA methylome dataset generator for user-specified cancer type based on conditional variational autoencoder
BMC Bioinformatics volume 21, Article number: 181 (2020)
Recently, DNA methylation has drawn great attention due to its strong correlation with abnormal gene activity and its informative representation of cancer status. As a growing number of studies focus on DNA methylation signatures in cancer, demand for publicly available methylome datasets has increased. To meet this demand, large-scale projects were launched to discover biological insights into cancer, providing collections of such datasets. However, public cancer data, especially for certain cancer types, remain too limited for many research uses. Several simulation tools for producing epigenetic datasets have been introduced to alleviate this issue, but to date no method has been proposed for generating a dataset for a user-specified cancer type.
In this paper, we present methCancer-gen, a tool for generating DNA methylome datasets for a specified cancer type. It employs a conditional variational autoencoder, a neural network-based generative model, to estimate the conditional distribution over latent variables and data, and generates samples for the specified cancer type.
To evaluate the simulation performance of methCancer-gen for a user-specified cancer type, we compared the proposed model with a benchmark method. methCancer-gen reproduced cancer type-wise data with high accuracy, helping to alleviate the lack of condition-specific data. methCancer-gen is publicly available at https://github.com/cbi-bioinfo/methCancer-gen.
DNA methylation is an epigenetic mechanism that plays a critical role in various biological processes, such as gene regulation, cell differentiation, and suppression of transposable elements [1–3]. Recent studies have reported that diverse types of neoplasia and cancer are related to changes in DNA methylation, and abnormal DNA methylation patterns are considered biomarkers for diagnosing cancer [5, 6]. In addition, tissue-specific DNA methylation patterns help determine the origin of a cancer.
To satisfy growing needs for better diagnosis and to advance understanding of the driver mutations that lead to uncontrolled cell growth and tumor formation, increasing amounts of genomic and epigenomic data have been made publicly available through large-scale projects aimed at comprehensive integrated analysis of cancer. The Cancer Genome Atlas (TCGA) program provided a collection of multi-platform molecular profiles across 33 different cancer types, composed of various clinical and genomic datasets. Multi-omics integrated analysis of these data provided evidence for biological mechanisms in cancers. The ENCyclopedia of DNA elements (ENCODE) project and the Roadmap Epigenomics Mapping Consortium produced public human epigenetic resources to investigate cancer biology. Through these projects, functional elements in the human genome sequence have been identified. Utilizing public cancer resources, studies have focused on discovering the relationship between DNA methylation signatures and cancer. MethyCancer presented and analyzed an integrated dataset of DNA methylation, mutation and gene expression profiling for tumor cells with cancer information. MethHC provided a systematic integration of DNA methylation and mRNA/microRNA profiles in normal and tumor tissues and demonstrated epigenetic patterns for cancer prognosis. MethCNA introduced a comprehensive database of DNA methylation and copy number alterations, which helps to explore epigenetic patterns and identify key factors in cancer. However, most public methylome datasets used in research are still limited to the above major repositories.
To overcome the limitation of public data, computational approaches for generating methylome datasets have been introduced to provide methylation levels and reproduce a wide range of experimental setups. M.R. Lacey et al. developed an algorithm for producing methylation profiles based on reduced representation bisulfite sequencing (RRBS) to identify interactions between technical and biological variables in RRBS dataset analysis. Based on observations from a subset of samples collected from the ENCODE database, parametric models were fit to the distributions of CpG site positions and methylation levels to perform the simulation. DNemulator simulated cytosine methylation rate, sequencing errors and bisulfite conversion by random assignment and probabilistic change for various bisulfite sequencing experiments based on DNA reads of the human reference genome. WGBSSuite was proposed as a simulation tool for single-base DNA methylation data based on whole genome bisulfite sequencing (WGBS), employing two hidden Markov models, one for CpG location and one for methylation status. Various experimental setups were reproduced to provide real case scenarios. pWGBSSimla generated WGBS data for a user-specified genomic region and cell type by simulating the methylated read count for a specific CpG from a binomial distribution with approximated parameters for read depth and CpG methylation rate. Although these simulation tools allow performance comparison among different methylation analysis methods and help to reproduce a wide range of experimental designs to support further analysis, they either do not provide condition-specific data generation, such as by cancer type, or allow only a limited number of pre-defined conditions.
In recent years, generative models based on deep neural networks (DNNs) have achieved remarkable results owing to their ability to capture nonlinear distributed representations. The variational autoencoder (VAE), a deep generative model based on variational inference, has been widely adopted for learning latent representations and performing generation tasks based on trained features. Several studies have employed VAEs to explore biological features in cancer based on DNA methylation datasets. By learning a lower-dimensional latent space on methylome data of lung cancers, signals representing the subtype of each sample were profiled. Based on cancer-relevant biological features extracted from a VAE, breast cancer subtypes were classified to show the effectiveness of unsupervised learning using DNA methylation. A.J. Titus et al. extracted latent features using a VAE to investigate a set of CpGs correlated with Estrogen Receptor status. VAEs have thus been employed on DNA methylation datasets to identify informative latent variables in specific types of cancer; however, to the best of our knowledge, simulation of an epigenetic dataset conditioned on a designated cancer type using a generative model has not yet been presented.
In this paper, we propose methCancer-gen, a tool for generating DNA methylome datasets based on a user-specified cancer type. We employed a conditional variational autoencoder (CVAE), an extension of the standard VAE that incorporates a control for the condition. By modeling the conditional distribution over latent variables and data, it generates samples that are similar but not identical to the input data. Unlike a VAE, a CVAE has control over the data generation process; therefore, by changing the conditional variable, which refers to cancer type in our model, DNA methylation simulation data for the specified cancer type can be generated. To demonstrate the data simulation of methCancer-gen for a user-specified cancer type, we compared the dataset generated by our model with that of a benchmark method and validated its functionality.
To evaluate methCancer-gen for DNA methylation data generation, a benchmark method for cancer data generation was designed under the assumption that the beta values for each CpG site follow a beta distribution. The distributional parameters (α and β) were estimated for each CpG and cancer type, and a methylation dataset was simulated from the approximated distribution models. For each cancer type, 100 DNA methylation datasets were generated from methCancer-gen and from the benchmark method. We compared the accuracies of the datasets generated by each method using five widely used machine learning (ML) based classification algorithms: decision tree (DT), Naive Bayes (NB), random forest (RF), K-nearest neighbor (KNN), and support vector machine (SVM). This evaluation validates whether a generated cancer dataset is predicted as the cancer type we specified to methCancer-gen. An overview of the performance evaluation design is given in Table 1.
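The benchmark idea can be illustrated with a minimal sketch. It assumes that α and β are estimated by the method of moments, which the article does not state, and the observed beta values and function names below are hypothetical.

```python
import random
import statistics


def fit_beta_moments(values):
    """Method-of-moments estimate of (alpha, beta) for values in (0, 1).

    Assumption: the article does not say which estimator was used.
    """
    mean = statistics.fmean(values)
    var = statistics.variance(values)  # sample variance
    # Moment matching requires var < mean * (1 - mean).
    common = mean * (1.0 - mean) / var - 1.0
    if common <= 0:
        raise ValueError("variance too large for a beta distribution")
    return mean * common, (1.0 - mean) * common


def simulate_cpg(values, n_samples, rng):
    """Draw n_samples beta values for one CpG site from the fitted model."""
    alpha, beta = fit_beta_moments(values)
    return [rng.betavariate(alpha, beta) for _ in range(n_samples)]


rng = random.Random(0)
# Hypothetical beta values of one CpG site within one cancer type.
observed = [0.81, 0.77, 0.85, 0.79, 0.90, 0.74, 0.83, 0.88]
alpha, beta = fit_beta_moments(observed)
simulated = simulate_cpg(observed, 100, rng)
```

Repeating this fit independently for every CpG site and cancer type gives a per-site parametric baseline; by construction it ignores correlations between CpG sites.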
We used a DNA methylome dataset composed of 8,051 primary solid tumor tissue samples from 25 cancer types measured by the Illumina Human Infinium 450K assay, obtained from TCGA. Of this dataset, 70% was randomly selected and used for generating the simulation datasets, by training methCancer-gen to learn latent representations and the benchmark to estimate the distributions, while the remaining 30% was used for training multi-class classifiers to predict the 25 cancer types. The cancer types and the number of samples used for training methCancer-gen and the 5 classifiers are listed in Table 2.
Performance evaluation for the simulation performance
To evaluate how well methCancer-gen generates DNA methylation data for a designated cancer type, and to test the accuracy of the simulated data with respect to real data, it was compared to the benchmark method based on estimating a beta distribution for each CpG site in each cancer. Based on the preprocessed dataset, both methods generated 100 simulated datasets composed of 394,355 CpGs for each cancer type. Five different multi-class classification algorithms were used to predict the cancer types of the simulated datasets from each generation method. Performance was measured as the classification accuracy averaged over ten repetitions. The evaluation results showed that methCancer-gen outperformed the benchmark, achieving an average classification accuracy of 0.967, 0.875, 0.877, 0.858, and 0.694 for SVM, RF, KNN, NB, and DT, respectively, while the benchmark achieved 0.964, 0.796, 0.875, 0.772 and 0.595, respectively (Fig. 1). The cancer type-wise accuracy and area under curve (AUC) results are shown in Supplementary material S1 (Table A) and S2.
Furthermore, we investigated whether a classifier trained with data generated by methCancer-gen would achieve better classification accuracy than a classifier trained with TCGA data only. For the experiments, three SVM classifiers were trained: the first model used only the 30% of TCGA data, and the other two were trained on a combined dataset of the same 30% TCGA data and the dataset generated by methCancer-gen or by the benchmark, respectively. During the experiment, the amount of generated data was gradually increased from 100 to 500 samples for each cancer type (Table 3). The 70% of TCGA data used for training methCancer-gen was not included in training the SVM classifiers. To evaluate the performance of each SVM classifier, we obtained 1,038 methylation samples of 8 cancer types from methCNA, a comprehensive database containing Infinium HumanMethylation450K data resources of human cancer collected from the Gene Expression Omnibus database. Each experiment was repeated five times.
In the results (Table 4), the classifier trained on TCGA data plus 300 generated samples per cancer type from methCancer-gen exhibited the highest average accuracy of 0.823 and AUC of 0.914, compared to 0.762 and 0.869 for the benchmark, and 0.751 and 0.864 for TCGA only. The cancer type-wise AUC results are shown in Supplementary material S3. Moreover, utilizing 300 generated samples for training the SVM classifier achieved a higher average accuracy of 0.823, compared to 0.809 and 0.799 for 200 and 100 simulation samples, respectively. Increasing the number of generated samples beyond 300 for each cancer type did not further improve the performance of the classifier. Overall, utilizing data generated by methCancer-gen improved the performance of the classifier on 6 of 8 cancer types.
In addition, we further investigated the simulation datasets from methCancer-gen and the benchmark method to assess whether each method approximates a distribution close to that of the original dataset. Using the t-distributed stochastic neighbor embedding (t-SNE) method, the original TCGA methylome datasets and the simulation datasets from methCancer-gen and the benchmark were compressed into three-dimensional t-SNE spaces. The datasets generated by methCancer-gen were clearly separated into individual cancer types, suggesting that methCancer-gen captures the high-dimensional latent features of the original dataset even for similar cancers whose clusters partially mix, while the benchmark method showed sporadic results on those cancers (Supplementary material S4).
Although genome-wide DNA methylation measurement methods such as WGBS have been introduced, most publicly available datasets are still array-based because of cost-efficiency. Moreover, because generating methylome data remains relatively costly, the lack of public data is still an open problem.
Our modeling and experiments indicate that methCancer-gen provides more accurate DNA methylation profiles for each cancer type than the benchmark method. Five different ML-based classifiers correctly classified the datasets generated by the proposed model into their cancer types, showing that our model successfully learned latent features and inferred the distribution of each cancer in an unsupervised manner.
methCancer-gen can be used as a data augmentation strategy: utilizing the dataset generated by methCancer-gen as a supplement to real data for model training improved the performance of a classifier. Up to a certain point, the larger the simulation dataset, the more accurate the classifier became. In addition, the generated dataset could be utilized for imputation by replacing missing values (Supplementary material S5).
In this paper, we presented methCancer-gen, a neural network-based tool for generating DNA methylome samples for user-specified cancer type. The proposed model employs CVAE as a generative model to estimate the distributions that underlie observed methylation values by variational inference while accounting for cancer type. The simulation performance of our model was evaluated with comparison to the benchmark method and the benefit of utilizing methCancer-gen was tested, showing improved performance in both evaluation results. We believe that the methCancer-gen could alleviate the lack of DNA methylation data issue, and promote further epigenetic cancer research.
With the matrix of DNA methylation beta values and matched cancer type information as input, the methCancer-gen approximates the underlying distribution model of the input data. After model training, methylation beta value for the specified cancer type can be generated as output. Figure 2 depicts a flowchart describing the process.
To eliminate the bias caused by a high frequency of missing values during model training, methCancer-gen provides a four-step preprocessing procedure. First, CpG sites with missing values in all samples are removed. Second, to retain as much data as possible, the dataset is divided into multiple subsets of 10,000 CpGs each, so that samples with missing values only at specific CpGs within a subset can still be utilized for model training. Third, samples with a significant number of missing values are detected as outliers by the inter-quartile range (IQR) method and discarded to minimize bias. Finally, the remaining missing values are imputed with median values.
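The four preprocessing steps can be sketched as follows. This is a toy illustration, not the tool's code: the 1.5 × IQR upper fence, the per-CpG median, the use of `None` for missing values, and all function names are assumptions, since the article does not give these details.

```python
import statistics


def drop_all_missing_cpgs(matrix):
    """Step 1: remove CpG columns that are missing (None) in every sample."""
    keep = [j for j in range(len(matrix[0]))
            if any(row[j] is not None for row in matrix)]
    return [[row[j] for j in keep] for row in matrix]


def split_columns(matrix, size):
    """Step 2: split CpG columns into consecutive subsets of at most `size`."""
    return [[row[start:start + size] for row in matrix]
            for start in range(0, len(matrix[0]), size)]


def drop_outlier_samples(matrix, k=1.5):
    """Step 3: discard samples whose missing count exceeds Q3 + k * IQR."""
    counts = [sum(v is None for v in row) for row in matrix]
    q1, _, q3 = statistics.quantiles(counts, n=4)
    upper = q3 + k * (q3 - q1)
    return [row for row, c in zip(matrix, counts) if c <= upper]


def impute_median(matrix):
    """Step 4: fill remaining gaps with the per-CpG median.

    Assumes every column still has at least one observed value.
    """
    medians = [statistics.median(r[j] for r in matrix if r[j] is not None)
               for j in range(len(matrix[0]))]
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


# Toy matrix: 8 samples (rows) x 4 CpGs (columns); None marks a missing value.
raw = [
    [0.1, None, 0.5, 0.9],
    [0.2, None, 0.6, 0.8],
    [0.3, None, None, 0.7],
    [0.2, None, 0.4, 0.9],
    [None, None, None, None],
    [0.1, None, 0.5, 0.8],
    [0.3, None, 0.5, 0.7],
    [0.2, None, 0.6, 0.9],
]
no_empty = drop_all_missing_cpgs(raw)    # second CpG is dropped
subsets = split_columns(no_empty, 2)     # 2 stands in for 10,000
filtered = drop_outlier_samples(no_empty)
clean = impute_median(filtered)
```

For brevity the outlier and imputation steps are applied to the whole toy matrix; in the described procedure they would run within each subset produced by the split.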
Generating DNA methylome dataset for a given cancer type
The methCancer-gen model is built on a CVAE, a neural network model that extends the VAE by conditioning on an input observation, where the VAE is a probabilistic generative model combining DNNs with a variational learning framework. It has been demonstrated that VAEs tend to be more stable during training and produce less obscure output than other generative models, as they optimize a clear objective function based on the log-likelihood. In the generative process, a set of latent variables z is drawn from the prior distribution pθ(z), and data x is then generated from the generative model pθ(x|z) conditioned on z with respect to the generative parameter θ, where the prior over z is assumed to be the standard normal distribution. To approximate the posterior distribution pθ(z|x), assumed to be Gaussian, variational inference is used by introducing a proposal distribution qϕ(z|x), known as the recognition model, where ϕ is the variational parameter. By applying the stochastic gradient variational Bayes (SGVB) framework, the Gaussian parameters of the VAE, μ and σ, are estimated, and the variational lower bound on the log-likelihood is used as the objective function:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right),$$
where the first term denotes an expectation over the approximate posterior distribution, called the reconstruction error, while the second term is a Kullback-Leibler (KL) divergence term that acts as a regularizer. When implemented as a neural network, an encoder, referred to as the inference network, models the recognition model, and a decoder, referred to as the generative network, defines the conditional probability pθ(x|z).
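The two terms of the lower bound can be made concrete with a small sketch. It assumes a diagonal Gaussian recognition model and a standard normal prior, for which the KL term has a closed form, and it uses a squared error as a stand-in for the reconstruction term; the article does not state which likelihood the tool uses, and the function names are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization used by SGVB: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def negative_lower_bound(x, x_recon, mu, log_var):
    """Loss to minimize: reconstruction error plus the KL regularizer.

    Assumption: squared error stands in for -log p(x|z).
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]
z = sample_latent(mu, log_var, rng)
loss = negative_lower_bound([0.8, 0.1], [0.7, 0.2], mu, log_var)
```

Writing the sample as a deterministic function of μ, σ and independent noise is what lets gradients flow through the sampling step during training.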
Extending the VAE, the CVAE imposes a condition y on z and x, where the recognition and generation models are extended to qϕ(z|x,y) and pθ(x|z,y), respectively. The parameters of the CVAE are estimated by maximizing the conditional log-likelihood during training, and the variational lower bound on the conditional log-likelihood is defined as follows:
$$\log p_\theta(x|y) \ge \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(x|z,y)\right] - \mathrm{KL}\left(q_\phi(z|x,y)\,\|\,p_\theta(z|y)\right)$$
After training, a simulated dataset inferred from the input data can be generated by sampling from the learned latent distribution and passing the samples through the generative network. In methCancer-gen, x represents the input data of DNA methylation beta values, and y is the cancer type.
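As a hedged illustration of the conditional sampling step, the sketch below builds the inputs that a trained decoder would receive. It assumes that z is drawn from a standard normal prior and that the cancer type is one-hot encoded and concatenated with z; these are common CVAE choices rather than details confirmed by the article, and the three type labels are an illustrative subset.

```python
import random

# Illustrative subset of cancer type labels (assumption).
CANCER_TYPES = ["BRCA", "LUAD", "COAD"]


def one_hot(label, labels):
    """Encode a cancer type as a one-hot condition vector y."""
    if label not in labels:
        raise ValueError(f"unknown cancer type: {label}")
    return [1.0 if label == name else 0.0 for name in labels]


def decoder_inputs(cancer_type, n_samples, latent_dim, rng):
    """Build [z ; y] vectors with z ~ N(0, I) and y the one-hot condition."""
    y = one_hot(cancer_type, CANCER_TYPES)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] + y
            for _ in range(n_samples)]


rng = random.Random(0)
batch = decoder_inputs("LUAD", n_samples=100, latent_dim=125, rng=rng)
```

Each vector in `batch` would then be mapped by the trained generative network to one simulated profile of beta values; changing only the `cancer_type` argument changes the type of the generated samples.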
The methCancer-gen model consists of an encoder and a decoder, each with two hidden layers. The encoder has fully connected layers of 500 and 250 hidden nodes, with empirically selected activation functions: exponential linear units (ELUs) and the tanh function. The encoder extracts 125 latent variables, and the decoder has a structure symmetrical to the encoder. During the training phase, the model was optimized with the adaptive optimization algorithm Adam by simultaneously minimizing the reconstruction error and the KL divergence loss. The learning rate and the number of training epochs were set to 1e-3 and 10,000, respectively. methCancer-gen is implemented in Python with the TensorFlow library (Version 1.8.0) and is publicly available at https://github.com/cbi-bioinfo/methCancer-gen.
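The layer sizes above can be traced with a small untrained forward pass. This is a sketch of the shapes only: the assignment of ELU to the first hidden layer and tanh to the second, the random weight initialization, the toy input size, and the concatenation of the condition with the input are all assumptions not confirmed by the article.

```python
import math
import random


def elu(v, alpha=1.0):
    """Exponential linear unit."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)


def identity(v):
    return v


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, layer, activation):
    """Fully connected layer: activation(W x + b)."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


rng = random.Random(0)
n_cpgs, n_types = 40, 3                     # toy sizes, not the real 394,355 CpGs
x = [rng.random() for _ in range(n_cpgs)]   # toy beta values
y = [0.0, 1.0, 0.0]                         # one-hot cancer type

hidden1 = init_layer(n_cpgs + n_types, 500, rng)
hidden2 = init_layer(500, 250, rng)
mu_head = init_layer(250, 125, rng)
log_var_head = init_layer(250, 125, rng)

h1 = dense(x + y, hidden1, elu)             # 500 hidden nodes
h2 = dense(h1, hidden2, math.tanh)          # 250 hidden nodes
mu = dense(h2, mu_head, identity)           # 125 latent means
log_var = dense(h2, log_var_head, identity) # 125 latent log-variances
```

A decoder mirroring this structure would map 125 latent variables plus the condition through layers of 250 and 500 nodes back to one value per CpG.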
Availability of data and materials
Abbreviations
TCGA: The Cancer Genome Atlas
ENCODE: ENCyclopedia of DNA elements
RRBS: Reduced representation bisulfite sequencing
WGBS: Whole genome bisulfite sequencing
DNN: Deep neural network
CVAE: Conditional variational autoencoder
SGVB: Stochastic gradient variational Bayes
ELUs: Exponential linear units
SVM: Support vector machine
AUC: Area under curve
t-SNE: t-distributed stochastic neighbor embedding
Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jaffe DB, et al. Genome-scale dna methylation maps of pluripotent and differentiated cells. Nature. 2008; 454(7205):766.
Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo Q-M, et al. Human dna methylomes at base resolution show widespread epigenomic differences. Nature. 2009; 462(7271):315.
Barwick BG, Scharer CD, Martinez RJ, Price MJ, Wein AN, Haines RR, Bally AP, Kohlmeier JE, Boss JM. B cell activation and plasma cell differentiation are inhibited by de novo dna methylation. Nat Commun. 2018; 9(1):1900.
Jones PA, Baylin SB. The fundamental role of epigenetic events in cancer. Nat Rev Genet. 2002; 3(6):415.
Meng H, Murrelle EL, Li G. Identification of a small optimal subset of cpg sites as bio-markers from high-throughput dna methylation profiles. BMC Bioinformatics. 2008; 9(1):457.
Daura-Oller E, Cabre M, Montero MA, Paternain JL, Romeu A. Specific gene hypomethylation and cancer: new insights into coding region feature trends. Bioinformation. 2009; 3(8):340.
Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell. 2018; 173(2):291–304.
Pavlopoulou A, Spandidos DA, Michalopoulos I. Human cancer databases. Oncol Rep. 2015; 33(1):3–18.
Tomczak K, Czerwińska P, Wiznerowicz M. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemp Oncol. 2015; 19(1A):68.
Consortium EP, et al. The encode (encyclopedia of dna elements) project. Science. 2004; 306(5696):636–40.
Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, et al. The nih roadmap epigenomics mapping consortium. Nat Biotechnol. 2010; 28(10):1045.
He X, Chang S, Zhang J, Zhao Q, Xiang H, Kusonmano K, Yang L, Sun ZS, Yang H, Wang J. Methycancer: the database of human dna methylation and cancer. Nucleic Acids Res. 2007; 36(suppl_1):836–41.
Huang W-Y, Hsu S-D, Huang H-Y, Sun Y-M, Chou C-H, Weng S-L, Huang H-D. Methhc: a database of dna methylation and gene expression in human cancer. Nucleic Acids Res. 2014; 43(D1):856–61.
Deng G, Yang J, Zhang Q, Xiao Z-X, Cai H. Methcna: a database for integrating genomic and epigenomic data in human cancer. BMC Genomics. 2018; 19(1):138.
Lacey MR, Baribault C, Ehrlich M. Modeling, simulation and analysis of methylation profiles from reduced representation bisulfite sequencing experiments. Stat Appl Genet Mol Biol. 2013; 12(6):723–42.
Frith MC, Mori R, Asai K. A mostly traditional approach improves alignment of bisulfite-converted dna. Nucleic Acids Res. 2012; 40(13):100.
Rackham OJ, Dellaportas P, Petretto E, Bottolo L. Wgbssuite: simulating whole-genome bisulphite sequencing data and benchmarking differential dna methylation analysis tools. Bioinformatics. 2015; 31(14):2371–3.
Chung R-H, Kang C-Y. pwgbssimla: a profile-based whole-genome bisulphite sequencing data simulator incorporating methylation qtls, allele-specific methylations and differentially methylated regions. bioRxiv. 2018:390633. https://doi.org/10.1093/bioinformatics/btz635.
Xu J, Li H, Zhou S. An overview of deep generative models. IETE Tech Rev. 2015; 32(2):131–9.
Kingma DP, Welling M. Auto-encoding variational bayes. arXiv preprint. 2013. arXiv:1312.6114.
Chang DT. Latent variable modeling for generative concept representations and deep generative models. arXiv preprint. 2018. arXiv:1812.11856.
Wang Z, Wang Y. Exploring dna methylation data of lung cancer samples with variational autoencoders. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE: 2018. p. 1286–9. https://doi.org/10.1109/bibm.2018.8621365.
Titus AJ, Bobak CA, Christensen BC. A new dimension of breast cancer epigenetics. 2018. https://doi.org/10.5220/0006636401400145.
Titus AJ, Wilkins OM, Bobak CA, Christensen BC. An unsupervised deep learning framework with variational autoencoders for genome-wide dna methylation analysis and biologic feature extraction applied to breast cancer. bioRxiv. 2018:433763. https://doi.org/10.1101/433763.
Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems: 2015. p. 3483–3491.
Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, Lin SM. Comparison of beta-value and m-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010; 11(1):587.
Safavian SR, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybernet. 1991; 21(3):660–74.
Rish I, et al. An empirical study of the naive bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3: 2001. p. 41–46.
Liaw A, Wiener M, et al. Classification and regression by randomforest. R News. 2002; 2(3):18–22.
Hechenbichler K, Schliep K. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. Collaborative Research Center 386, Discussion Paper 399. 2004.
Suykens JA, Vandewalle J. Least squares support vector machine classifiers. Neural Process Lett. 1999; 9(3):293–300.
Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, Fuks F. Evaluation of the infinium methylation 450k technology. Epigenomics. 2011; 3(6):771–84.
Maaten L. v. d., Hinton G. Visualizing data using t-sne. J Mach Learn Res. 2008; 9(Nov):2579–605.
Miller JN. Tutorial review: outliers in experimental data and their treatment. Analyst. 1993; 118(5):455–61.
Genevay A, Peyré G, Cuturi M. Gan and vae from an optimal transport point of view. arXiv preprint. 2017. arXiv:1706.01807.
Clevert D-A, Unterthiner T, Hochreiter S. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint. 2015. arXiv:1511.07289.
Karlik B, Olgac AV. Performance analysis of various activation functions in generalized mlp architectures of neural networks. Int J Artif Intell Expert Syst. 2011; 1(4):111–22.
Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint. 2014. arXiv:1412.6980.
This research was supported by Sookmyung Women's University Specialization Program Funding (SP1-201809-6). The funders had no roles in the design of the study and collection, analysis and execution of the study.
Ethics approval and consent to participate
This study utilized the public TCGA and methCNA datasets; ethics approval and consent are therefore not needed.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material S1. (A) Average classification accuracy results for each cancer type based on different classifiers from the simulation performance evaluation in Fig. 1. (B) False positive rate (FPR) of methCancer-gen for each cancer type from the simulation performance evaluation in Fig. 1. To measure the FPR, multi-class datasets are converted to binary classification problems using a one-class-vs.-others scheme. (C) False negative rate (FNR) of methCancer-gen for each cancer type from the simulation performance evaluation in Fig. 1. To measure the FNR, multi-class datasets are converted to binary classification problems using a one-class-vs.-others scheme.
Supplementary material S2. Average AUC results for each cancer type from the performance evaluation in Fig. 1. To measure the AUC, multi-class datasets are converted to binary classification problems using a one-class-vs.-others scheme.
Supplementary material S3. Average AUC results of the SVM classifier for each cancer type from the second experiment (Table 4), which validates whether training a classifier on a combined dataset of the original TCGA data and the data generated by methCancer-gen could improve classification performance. Each experiment was repeated five times.
Supplementary material S4. t-SNE visualization of the original dataset and simulation dataset from methCancer-gen and the benchmark method is shown.
Supplementary material S5. Performance comparison of two SVM classifiers trained on a median-imputed dataset and on a dataset imputed using methCancer-gen generated data, respectively. For the imputation test, 100,000 missing values (NA) were randomly created within the 30% of samples from the TCGA data.
Choi, J., Chae, H. methCancer-gen: a DNA methylome dataset generator for user-specified cancer type based on conditional variational autoencoder. BMC Bioinformatics 21, 181 (2020). https://doi.org/10.1186/s12859-020-3516-8
- DNA methylation
- Conditional variational autoencoder