methCancer-gen: a DNA methylome dataset generator for user-specified cancer type based on conditional variational autoencoder.

BACKGROUND
Recently, DNA methylation has drawn great attention due to its strong correlation with abnormal gene activities and informative representation of the cancer status. As a number of studies focus on DNA methylation signatures in cancer, demand for utilizing publicly available methylome dataset has been increased. To satisfy this, large-scale projects were launched to discover biological insights into cancer, providing a collection of the dataset. However, public cancer data, especially for certain cancer types, is still limited to be used in research. Several simulation tools for producing epigenetic dataset have been introduced in order to alleviate the issue, still, to date, generation for user-specified cancer type dataset has not been proposed.


RESULTS
In this paper, we present methCancer-gen, a tool for generating DNA methylome dataset considering type for cancer. Employing conditional variational autoencoder, a neural network-based generative model, it estimates the conditional distribution with latent variables and data, and generates samples for specified cancer type.


CONCLUSIONS
To evaluate the simulation performance of methCancer-gen for the user-specified cancer type, our proposed model was compared to a benchmark method and it could successfully reproduce cancer type-wise data with high accuracy helping to alleviate the lack of condition-specific data issue. methCancer-gen is publicly available at https://github.com/cbi-bioinfo/methCancer-gen.


Background
DNA methylation is one of the epigenetic mechanisms, playing a critical role in various biological processes, such as gene regulation, cell differentiation, and suppression of transposable elements [1][2][3]. Recent studies have reported that diverse types of neoplasia and cancer are related to changes in DNA methylation [4] and abnormal DNA methyl patterns are considered one of the biomarkers for diagnosing cancer [5,6]. In addition, the tissue-specific DNA methylation patterns determine the origin of the cancer [7].
To satisfy growing needs for better diagnosis and advance understanding of driver mutations leading to uncontrolled cell growth and tumor formation, increasing amounts of genomic and epigenomic data have been publicly available through large-scale projects aimed for comprehensive integrated analysis of cancer [8]. The Cancer Genome Atlas (TCGA) program provided a collection of multi-platform molecular profiles across 33 different cancer types, composed of various clinical and genomic datasets [9]. Based on the multi-omics integrated analysis, evidence for biological mechanism in cancers was provided. ENCyclopedia of DNA elements (ENCODE) project [10] and Roadmap Epigenomics Mapping Consortium [11] produced public human epigenetic resources to investigate cancer biology. Through these projects, the identification of functional elements in the human genome sequence has been made. Utilizing public cancer resources, studies have focused on discovering the relationship between DNA methylation signature and cancer. MethyCancer presented and analyzed an integrated dataset of DNA methylation, mutation and gene expression profiling for tumor cells with cancer information [12]. MethHC provided a systematic integration comprising DNA methylation and mRNA/microRNA profiles in normal and tumor tissues and demonstrated epigenetic patterns for cancer prognosis [13]. MethCNA introduced a comprehensive database of DNA methylation and copy number alterations, which assisted to explore epigenetic patterns and identify key factors in cancer [14]. However, most public methylome dataset utilized in research, are still limited to the above major repositories.
To overcome the limitation of public data, computational approaches for generating methylome dataset have been introduced to provide methylation levels and reproduce a wide range of experimental setups. M.R.Lacey et al. developed an algorithm for producing methylation profiles based on reduced representation bisulfite sequencing (RRBS) to identify interactions between technical and biological variables among the RRBS dataset analysis [15]. Based on the observation from a subset of samples collected from ENCODE database, parametric models were fit to the distributions of CpG site positions and methylation levels to perform the simulation. DNemulator simulated cytosine methylation rate, sequencing errors and bisulfite conversion by random assignment and change with probability for various bisulfite sequencing experiments based on DNA reads of human reference genome [16]. WGBSSuite was proposed as a simulation tool for single-base DNA methylation data based on whole genome bisulfite sequencing (WGBS), employing two hidden markov models each for CpG location and methylation status [17]. Various experiment setups were reproduced to provide real case scenarios. pWGBSSimla generated WGBS data for a given user-specified genomic region and cell type by simulating methylated read count for specific CpG based on binomial distribution with approximated parameters for read depth and methylation rate of CpG [18]. Although, these simulation tools allow performance comparison among different methylation analysis methods and help to reproduce a wide range of experimental design to support further analysis, however, either they do not provide condition-specific data generation such as cancer type or only allow limited number of pre-defined condition.
In recent years, deep neural network (DNN) based generative model has been presented and achieved remarkable results due to its ability for capturing nonlinear distributed representations [19]. Variational autoencoder (VAE) [20], one of the deep generative model based on variational inference, has been widely adopted for learning latent representations and performing generation task based on trained features [21]. Employing VAE, several studies have been introduced to explore biological features in cancer based on DNA methylation dataset. By learning lower dimensional latent space on methylome data of lung cancers, signals representing each subtype for the sample were profiled [22]. Based on cancer relevant biological features extracted from VAE, breast cancer subtypes were classified to show the effectiveness of unsupervised learning using DNA methylation [23]. A.J. Titus et al. extracted latent features using VAE to investigate a set of CpGs correlated to Estrogen Receptor status [24]. Utilizing DNA methylation dataset, VAE has been employed to identify informative latent variables in the specific type of cancer, however, to the best of our knowledge, simulation of epigenetic dataset conditioned to the designated cancer type based on the generative model has not been presented yet.
In this paper, we propose a methCancer-gen, a tool for generating DNA methylome dataset based on a user-specified cancer type. We employed a conditional variational autoencoder (CVAE) [25], an extension of a standard VAE, suitable for incorporating a control for the condition. It allows generating samples similar but not identical to input data from modeling conditional distribution with latent variables and data. Different from VAE, CVAE has control on the data generation process, therefore by changing the conditional variable which refers to cancer type in our model, DNA methylation simulation data for specified cancer type will be generated. To demonstrate the data simulation of methCancer-gen for the user-specified cancer type, we compared dataset generated from our model to a benchmark method and validated its functionality.

Benchmark method
To evaluate the methCancer-gen for DNA methylation data generation, a benchmark method for cancer data generation was designed under the assumption that beta values for each CpG site follow a beta distribution [26]. The distributional parameters (α and β) for each CpG and cancer type were estimated and methylation dataset was simulated from the approximated distribution models. For each cancer type, 100 DNA methylation datasets were generated from methCancer-gen and benchmark method. We compared the accuracies of dataset generated from each method using the most widely used, five different machine learning (ML) based classification algorithms: decision tree (DT) [27], Naive Bayes (NB) [28], random forest (RF) [29], K-nearest neighbor (KNN) [30], and support vector machine (SVM) [31]. This evaluation shows validation of whether the generated cancer dataset is predicted to the intended cancer type we specified to methCancer-gen. Overview of the performance evaluation design is described in Table 1.

Dataset
We used a DNA methylome dataset composed of 8,051 primary solid tumor tissue samples from 25 cancer types measured by Illumina Human Infinium 450K assay [32], obtained from TCGA. 70% of dataset was randomly selected and used for generating simulation dataset by training methCancer-gen to learn latent representations and benchmark to estimate the distribution, while 30% was used for training multi-class classifiers for predicting 25 cancer types. Cancer types and the number of samples used for training methCancer-gen and 5 classifiers are listed in Table 2.

Performance evaluation for the simulation performance
To evaluate the performance of DNA methylation dataset generation of the methCancergen for designated cancer type and test the accuracy of simulation data with respect to real data, it was compared to the benchmark method based on estimating beta distribution for each CpG site in each cancer. Based on the preprocessed dataset, both methods  (Fig. 1). The cancer-type wise accuracy and the area under curve (AUC) results are shown in the Supplementary material S1 (Table A) and S2. Furthermore, we investigated whether a classifier trained using methCancer-gen would improve the classification accuracy compared to a classifier trained with data from TCGA only. For the experiments, three SVM classifiers were trained, where the first model was based on only utilizing 30% of TCGA data and the other two classifiers were trained based on a combined dataset with the same 30% TCGA data and the generated dataset from methCancer-gen and benchmark, respectively. During the experiment, the amount of generated data was gradually increased from 100 to 500 samples for each cancer type ( Table 3). 70% of TCGA data used for training methCancer-gen was not included in training SVM classifiers. To evaluate the performance of each SVM classifier, we obtained 1,038 methylation samples of 8 cancer types from methCNA [14], a comprehensive database containing Infinium HumanMethylation450K data resources of human cancer collected from Gene Expression Omnibus database. Each experiment was repeated five times.
From the results (Table 4)   Moreover, utilizing 300 generated samples for training the SVM classifier achieved a higher average accuracy of 0.823, compared to 0.809 and 0.799 for using 200 and 100 simulation samples, respectively. Increasing the number of generated samples more than 300 for each cancer type did not help to improve the performance of the classifier. Overall, utilizing generated data by methCancer-gen improved the performance of the classifier on 6 of 8 cancer types.
In addition, we further investigated the simulation dataset from the methCancer-gen and benchmark method to assess whether each method approximates the distribution model closely to the original dataset. Utilizing t-distributed stochastic neighbor embedding (t-SNE) [33] method, the original methylome TCGA datasets and the simulation datasets from the methCancer-gen and benchmark were compressed into threedimensional t-SNE spaces. From the result, the generated dataset from methCancer-gen were clearly separated into individual cancer types, validating that methCancer-gen could capture high-dimensional latent features of original dataset even within the similar cancers showing clusters of partial mixing, while the benchmark method showed sporadic result on those cancers (Supplementary material S4).

Discussion
Although genome-wide DNA methylation measurement methods such as WGBS has been introduced, still most of the publically available dataset are array-based because of cost-efficiency. Besides, due to the relatively high cost of generating methylome data, the lack of public data issues is still an open problem.
From our modeling and experiments to alleviate the issue, it is proved that methCancergen provides more accurate DNA methylation profiles for each cancer type compared to the other method. Five different ML-based classifiers correctly classified the generated dataset from the proposed model to each cancer showing that our model successfully  learned latent features and inferred the distribution of each cancer in an unsupervised manner. methCancer-gen can be used for data augmentation strategy, where utilizing the generated dataset from methCancer-gen as a supplement of real data for model training could indeed improve the performance of a classifier. Up to the certain point, the larger the amount of simulation dataset, the more accurate performance could be achieved. In addition, the generated dataset could be utilized for imputation by replacing missing values (Supplementary material S5).

Conclusions
In this paper, we presented methCancer-gen, a neural network-based tool for generating DNA methylome samples for user-specified cancer type. The proposed model employs CVAE as a generative model to estimate the distributions that underlie observed methylation values by variational inference while accounting for cancer type. The simulation performance of our model was evaluated with comparison to the benchmark method and the benefit of utilizing methCancer-gen was tested, showing improved performance in both evaluation results. We believe that the methCancer-gen could alleviate the lack of DNA methylation data issue, and promote further epigenetic cancer research.

Methods
With the matrix of DNA methylation beta values and matched cancer type information as input, the methCancer-gen approximates the underlying distribution model of the input data. After model training, methylation beta value for the specified cancer type can be generated as output. Figure 2 depicts a flowchart describing the process.

Preprocessing
To eliminate the bias caused by a high frequency of missing values during model training, the methCancer-gen provides a four-step preprocessing. First, CpG sites having missing values for all samples were removed. To retrieve maximum data, the dataset is divided into (2) Generation of DNA methylation dataset for specified cancer type by CVAE neural network model multiple subsets of 10,000 CpGs each. Therefore, samples showing missing values only for specific CpGs within each subset can be utilized for model training. Then, samples having a significant number of missing values are detected as outliers and discarded to minimize bias by applying inter-quartile range (IQR) method [34]. Remaining missing values are imputed with median values.

Generating DNA methylome dataset for a given cancer type
The methCancer-gen model was constructed based on a CVAE neural network model conditioned on the input observation in VAE, where VAE is a probabilistic generative model combining DNN and variational learning framework. It has been demonstrated that VAE tends to be more stable in model training procedure and producing less obscure output than other generative models, as it is based on clear objective function to optimize based on log-likelihood [35]. Through a process of generating a set of latent variable z from the prior distribution p θ (z), data x is generated from the generative model p θ (x|z) conditioned on z with respect to generative parameter θ, where the prior over z is assumed to be the standard normal distribution. To approximate the posterior distribution p θ (z|x) assumed to be a Gaussian, variational inference is used by introducing a proposal distribution q φ (z|x), known as recognition model, where φ is the variational parameter. By applying the stochastic gradient variational bayes (SGVB) framework, the Gaussian parameters of VAE, μ and σ are estimated and the variational lower bound on log-likelihood is used as an objective function : , where the first term denotes an expectation over the approximate posterior distribution, called reconstruction error, while the second term is a Kullback-Leibler (KL) divergence term considered as a regularizer. Implemented in a neural network, an encoder referred to as inference network models the recognition model and a decoder defines the conditional probability p θ (x|z), which is referred to as generative network. In addition to VAE, CVAE imposes a condition y on the z and x, where the recognition and generation models are extended to q φ (z|x, y) and p θ (x|z, y), respectively. In training procedure to maximize the conditional log-likelihood, the parameters of CVAE are estimated, and the variational lower bound on log-likelihood is defined as follows: After training procedure, through sampling from the learned latent distribution with utilizing the generative network, simulated dataset inferred from input data can be generated. In methCancer-gen, x represents the input data of DNA methylation beta values, and y is a cancer type.
The methCancer-gen model consists of encoder and decoder with two hidden layers, where the encoder has an architecture of 500 and 250 hidden nodes with fully connected layers and activation functions of empirically-selected exponential linear units (ELUs) [36] and the tanh function [37] were applied. The decoder has a symmetrical structure to encoder extracting 125 latent variables. During the training phase, the model was optimized with the adaptive optimization algorithm, Adam [38] by simultaneously minimizing the reconstruction error and loss. The learning rate and training epoch were set to 1e-3 and 10,000, respectively. methCancer-gen is implemented in python with Tensorflow library (Version 1.8.0) and publicly available at https://github.com/cbi-bioinfo/ methCancer-gen.