Data extraction and normalization
Data were obtained from The Cancer Genome Atlas (TCGA) and the Chinese Glioma Genome Atlas (CGGA). TCGA, a project to understand the molecular mechanisms of cancer, has data on 1122 glioma samples [19]; 563 samples, with 25,292 genes from RNA-seq and 18,976 genes from DNAm data, were used as the training dataset. CGGA, a project to investigate brain tumors, has data on 2000 glioma samples collected from Chinese cohorts [20]; 970 samples with 23,271 genes from RNA-seq data and 140 samples with 14,476 genes from DNAm data were used as external validation datasets.
We applied a two-step normalization to both the training and validation datasets [21]. First, we used the median absolute deviation (MAD) on both datasets. Second, we applied robust scale normalization to the training dataset, and scaled the validation dataset using the means and standard deviations of the training dataset (Additional file 15).
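A minimal NumPy sketch of this two-step scheme on synthetic matrices. The matrix sizes and variable names are illustrative, and the MAD step is shown under one plausible reading (selecting the most variable genes); the essential point is that the validation data are scaled with the training-set statistics, never their own:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 500))   # hypothetical training matrix (samples x genes)
valid = rng.normal(size=(40, 500))    # hypothetical validation matrix

# Step 1 (one plausible reading): keep the genes with the largest
# median absolute deviation (MAD), i.e. the most variable genes.
mad = np.median(np.abs(train - np.median(train, axis=0)), axis=0)
top = np.argsort(mad)[::-1][:100]     # e.g. top 100 genes by MAD
train_f, valid_f = train[:, top], valid[:, top]

# Step 2: scale the training set, then apply the *training* means and
# standard deviations to the validation set so both share one scale.
mu, sd = train_f.mean(axis=0), train_f.std(axis=0)
train_s = (train_f - mu) / sd
valid_s = (valid_f - mu) / sd
```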
Construction of an autoencoder model
The autoencoder is a dimensionality-reduction method implemented with artificial neural networks. We used an autoencoder to reconstruct the input \(x\) as the output \({x}^{{\prime}}\). Tanh was used as the activation function for the \(i\)-th layer [9], that is:
$$\gamma ={f}_{i}\left(x\right)=\mathrm{tanh}\left({W}_{i}\cdot x+{b}_{i}\right)$$
where \(x\) and \(\gamma\) are vectors of size d and p, respectively; \({W}_{i}\) is the weight matrix of size p × d; \({b}_{i}\) is an intercept vector of size p; and \({W}_{i}\cdot x\) gives a vector of size p [9].
For a k-layer autoencoder model, \({x}^{{\prime}}\) is:
$${x}^{{\prime}}={F}_{1\to k}\left(x\right)={f}_{1}\left({f}_{2}\left(\dots {f}_{k}\left(x\right)\right)\right)$$
Logloss (binary cross-entropy) was used as the loss function to assess the reconstruction error between \(x\) and \({x}^{{\prime}}\) [9], that is:
$$logloss\left(x,{x}^{{\prime}}\right)=-\sum_{k=1}^{d}\left({x}_{k}\mathrm{log}\left({x}_{k}^{{\prime}}\right)+\left(1-{x}_{k}\right)\mathrm{log}\left(1-{x}_{k}^{{\prime}}\right)\right)$$
To control overfitting, L1 regularization penalty \({a}_{w}\) was added on \({W}_{i}\), and L2 regularization penalty \({a}_{a}\) was added on \({F}_{1\to i}\left(x\right)\) [9], that is:
$$L\left(x,{x}^{{\prime}}\right)=logloss\left(x,{x}^{{\prime}}\right)+\sum_{i=1}^{k}\left({a}_{w}{\left\|{W}_{i}\right\|}_{1}+{a}_{a}{\left\|{F}_{1\to i}\left(x\right)\right\|}_{2}^{2}\right)$$
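For concreteness, the forward pass and regularized loss can be sketched in NumPy for a toy two-layer autoencoder. The layer sizes, random weights, and the sigmoid output layer (needed so that reconstructions lie in (0, 1) for the logloss) are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 8, 3                                 # input size d, bottleneck size (toy values)
x = rng.uniform(0.05, 0.95, size=d)         # input assumed scaled into (0, 1)

# Encoding layer f_1(x) = tanh(W_1·x + b_1) and a sigmoid output layer.
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)), np.zeros(d)
gamma = np.tanh(W1 @ x + b1)                    # hidden representation
x_rec = 1.0 / (1.0 + np.exp(-(W2 @ gamma + b2)))  # reconstruction x'

# Binary cross-entropy ("logloss") between x and x'.
logloss = -np.sum(x * np.log(x_rec) + (1 - x) * np.log(1 - x_rec))

# Regularized objective: L1 penalty a_w on weights, L2 penalty a_a on
# the activations F_{1->i}(x) of each layer.
a_w, a_a = 1e-4, 1e-3
L = (logloss
     + a_w * (np.abs(W1).sum() + np.abs(W2).sum())
     + a_a * (np.sum(gamma ** 2) + np.sum(x_rec ** 2)))
```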
We used the preprocessed TCGA data as input to the autoencoder framework. We constructed a five-layer autoencoder model with three hidden layers (500, 100, and 500 nodes); the 100-node bottleneck layer was used to obtain 100 new features. We set the L1 regularization penalty to 0.0001 and the L2 regularization penalty to 0.001. The autoencoder was trained by gradient descent for 10 epochs with 50% dropout, a learning rate of 1E−06, and a batch size of 32 (using the Python Keras library).
Feature selection and K-means clustering
Survival-associated features were selected from the 100 new features using univariate Cox proportional hazards (Cox-PH) models (p < 0.05, using the R survival package). The labels for the different subtypes were obtained by K-means clustering of the survival-associated features (using the Python scikit-learn package). We determined the optimal number of clusters using the Calinski–Harabasz score and the silhouette score.
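The cluster-number selection can be sketched with scikit-learn on synthetic stand-in features (the Cox-PH screening itself was done in R and is not reproduced here; the candidate range of k and the tie-breaking rule are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(2)
# Stand-in for the survival-associated bottleneck features (samples x features):
# two well-separated synthetic groups for illustration.
feats = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(4, 1, (60, 5))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    scores[k] = (calinski_harabasz_score(feats, labels),
                 silhouette_score(feats, labels))

# Pick the k that maximizes the silhouette score (one simple rule).
best_k = max(scores, key=lambda k: scores[k][1])
```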
Robustness assessment
We assessed the robustness of the model using internal and external validation datasets. After obtaining the labels, we built a support vector machine (SVM) model with cross-validation. The 563 TCGA samples were split into 10 cross-validation folds for model training and testing at a 60/40 ratio. We selected the top 100 mRNA and top 100 methylation features of the TCGA training dataset based on analysis of variance (ANOVA) F values. To predict on the TCGA 2-omics testing dataset, we trained the SVM model on the combination of the top 100 mRNA and 100 methylation features selected above. We further predicted on the TCGA single-omic testing datasets using the corresponding top 100 mRNA or methylation features.
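One iteration of this scheme, sketched with scikit-learn on synthetic data (the matrix sizes, the planted informative features, and the linear kernel are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 500))      # stand-in for concatenated omics features
y = rng.integers(0, 2, size=200)     # stand-in subtype labels
X[:, :10] += y[:, None] * 2.0        # plant 10 informative features for the demo

# 60/40 training/testing split (one fold of the cross-validation).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

# Rank features by ANOVA F value on the training split only, keep the top 100.
sel = SelectKBest(f_classif, k=100).fit(X_tr, y_tr)
clf = SVC(kernel="linear").fit(sel.transform(X_tr), y_tr)
acc = clf.score(sel.transform(X_te), y_te)
```

Selecting features on the training split only, rather than the whole dataset, keeps the test-fold accuracy honest.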
We further used the CGGA RNA-seq and CGGA DNAm datasets as external validation datasets. To predict on these two datasets, we built the SVM model using the features shared between the top 100 mRNA or 100 methylation features of the whole TCGA dataset and the CGGA dataset.
Evaluation metrics
We used three metrics to assess the accuracy of the model's survival predictions. The log-rank p value, which evaluates the survival difference between subgroups [22], and the concordance index (C-index), which assesses the predictive ability of the model [23], were calculated using the R survival package. The Brier score, which measures the accuracy of probabilistic predictions [24], was calculated using the Python scikit-learn package.
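The latter two metrics can be illustrated on toy numbers. The `c_index` helper is a simplified, uncensored version written for this sketch (the actual analysis used the R survival package, which handles censoring); the Brier score call is the scikit-learn function named in the text:

```python
from sklearn.metrics import brier_score_loss

def c_index(times, risk):
    """Concordance over all comparable pairs (uncensored case, for illustration)."""
    num = den = 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j]:          # sample i fails before sample j
                den += 1
                num += risk[i] > risk[j]     # higher risk should fail earlier
    return num / den

# Perfect ranking: shorter survival paired with higher predicted risk.
ci = c_index([1, 2, 3, 4], [0.9, 0.7, 0.4, 0.1])

# Brier score: mean squared error between outcomes and predicted probabilities.
bs = brier_score_loss([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3])
```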
Two alternative approaches
We further compared the performance of the autoencoder-based approach with PCA and iCluster using the TCGA data. One hundred principal components were obtained by PCA (using the Python scikit-learn package), matching the number of features in the bottleneck layer of the autoencoder. Survival-associated principal components were selected from the 100 principal components using univariate Cox-PH models (p < 0.05, using the R survival package). Labels were obtained by K-means clustering of the survival-associated principal components (using the Python scikit-learn package). iCluster obtained labels directly from the initial features (using the R iCluster package) [25]. After obtaining the labels, we again built SVM models with cross-validation and evaluated performance using the three metrics above.
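The PCA step of this comparison is a one-liner in scikit-learn; the sketch below uses a synthetic stand-in matrix (the sample and gene counts are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 500))      # stand-in for the normalized expression matrix

# Reduce to 100 components, the same width as the autoencoder bottleneck.
pca = PCA(n_components=100).fit(X)
pcs = pca.transform(X)               # 100 principal components per sample
```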
Clinical covariate analysis
We examined the statistical differences in clinical covariates (age, gender, tumor grade, tumor type) between autoencoder subtypes, using Wilcoxon rank-sum tests for continuous variables and χ2 tests for categorical variables. To test whether prediction accuracy could be improved by adding clinical information, we built a multivariate Cox-PH model (age, gender, tumor grade, tumor type, and autoencoder subtypes) and compared it with models using autoencoder subtypes only or tumor type only.
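The two covariate tests can be sketched with SciPy on synthetic data (the group sizes, means, and counts are invented for illustration; the Wilcoxon rank-sum test is `mannwhitneyu` in SciPy):

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(5)

# Continuous covariate (e.g. age) in two hypothetical subtypes:
age_s1 = rng.normal(45, 10, 80)
age_s2 = rng.normal(60, 10, 80)
_, p_age = mannwhitneyu(age_s1, age_s2)     # Wilcoxon rank-sum test

# Categorical covariate (e.g. gender counts by subtype) as a contingency table:
table = np.array([[30, 50],
                  [55, 25]])
_, p_sex, _, _ = chi2_contingency(table)    # chi-squared test
```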
A systematic review reported that IDH1 mutation is an independent factor for longer overall survival in patients with glioblastoma [6]. We therefore performed χ2 tests on IDH mutation status between subtypes in the CGGA RNA-seq dataset.
Functional analysis
Functional analyses were performed to understand the characteristics of the autoencoder subtypes from TCGA dataset. DNAm-driven genes were identified by integrating DNAm and gene expression profiling analyses using the R MethylMix package [26]. DNAm-driven genes met the following two conditions: (1) DNAm levels of these genes were negatively correlated with the mRNA expression levels. The correlation coefficient was calculated using Spearman’s correlation test (correlation coefficients r < − 0.3). (2) There were significant differences in the levels of DNAm between autoencoder subtypes (Wilcoxon rank-sum tests p value < 0.05). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was performed to determine the enriched pathways of DNAm-driven genes (p < 0.01). The results of KEGG pathway analysis were visualized via ConsensusPathDB (http://cpdb.molgen.mpg.de/).
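The two MethylMix-style criteria can be sketched for a single hypothetical gene on synthetic data (the actual analysis used the R MethylMix package; the beta values, expression model, and thresholds below only mirror the conditions stated above):

```python
import numpy as np
from scipy.stats import spearmanr, mannwhitneyu

rng = np.random.default_rng(6)
subtype = np.repeat([0, 1], 50)                      # hypothetical subtype labels
meth = 0.2 + 0.4 * subtype + rng.uniform(0, 0.2, 100)  # DNAm beta values
expr = 10 - 8 * meth + rng.normal(0, 0.5, 100)       # expression anti-correlated with DNAm

# Condition 1: DNAm negatively correlated with expression (Spearman r < -0.3).
r, _ = spearmanr(meth, expr)

# Condition 2: DNAm differs between subtypes (rank-sum p < 0.05).
_, p = mannwhitneyu(meth[subtype == 0], meth[subtype == 1])

is_driver = (r < -0.3) and (p < 0.05)
```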