Uncovering the prognostic gene signatures for the improvement of risk stratification in cancers by using deep learning algorithm coupled with wavelet transform

Abstract

Background

The aim of gene expression-based clinical modelling in tumorigenesis is not only to accurately predict clinical endpoints, but also to reveal genome characteristics for downstream analysis aimed at understanding the mechanisms of cancers. Most conventional machine learning methods involve a gene filtering step, in which tens of thousands of genes are first filtered on the basis of their expression levels by a statistical method with an arbitrary cutoff. Although the gene filtering procedure helps to reduce the feature dimension and avoid overfitting, there is a risk that some pathogenic genes important to the disease will be ignored.

Results

In this study, we proposed a novel deep learning approach that combines a convolutional neural network with the stationary wavelet transform (SWT-CNN) to stratify cancer patients and predict their clinical outcomes from tumor genomic profiles without gene filtering. The proposed SWT-CNN outperformed state-of-the-art algorithms, including support vector machine (SVM) and logistic regression (LR), and produced prediction performance comparable to that of random forest (RF). Furthermore, for all the cancer types, we proposed a method that weights genes with scores derived from the representative features in the hidden layer of the convolutional neural network, and then selected the prognostic genes for Cox proportional-hazards regression. The results showed that risk stratification can be effectively improved by using the identified prognostic genes as features, indicating that the representative features generated by SWT-CNN correlate genes well with prognostic risk in cancers and are helpful for selecting prognostic gene signatures.

Conclusions

Our results indicated that the gene expression-based SWT-CNN model can be an excellent tool for stratifying the prognostic risk of cancer patients. In addition, the representative features of SWT-CNN were shown to be useful for evaluating the importance of genes in risk stratification and can be further used to identify prognostic gene signatures.

Background

Over the past decade, gene expression-based models have been widely used in cancer research for predicting clinical outcomes and have made considerable progress [1, 2]. A number of machine learning algorithms have been proposed to construct predictive models and validated in various cancer types [3,4,5,6,7], for the purpose of identifying genome characteristics, e.g. cancer-related differentially expressed genes or structural variations, as well as predicting clinical outcomes, such as the risk stratification of cancer patients. Although the performance of predictive models largely depends on the number of samples collected for model training and is restricted to a certain extent by the predictability of the endpoint [8], feature selection is also a vital step in gene expression-based modeling for clinical outcome prediction. In most cases, genes are first filtered by comparing the expression levels between two clinical phenotypic conditions with a statistical method and an arbitrary cutoff, e.g. p value < 0.05, and only the remaining genes are then used as features for model construction. Based on the filtered gene list, a series of variable selection methods, such as stepwise regression [9], simulated annealing [10] and variable combination population analysis (VCPA) [11], have also been developed to identify useful features for model construction [12, 13].

However, a statistics-based method with a ‘hard’ cutoff does not necessarily evaluate the contribution of a gene to the clinical prediction, so filtering out genes before model construction is not a reasonable approach and may omit genes that are still important to the disease. Therefore, we suggested a deep learning-based strategy as an alternative, which combines the convolutional neural network [14,15,16] with the stationary wavelet transform [17] (SWT-CNN), to predict survival in different cancer patients by using as many genes as possible to reduce the loss of feature information. In recent years, the emerging deep learning technique [18] has developed rapidly in the image processing field [19] as well as in related areas, such as voice recognition [20], natural language processing [21] and chemical pattern recognition [22], by virtue of its representation learning strategy [23], which is reported to be superior to conventional predictive learning because of its powerful ability to generate more complex representations of the target objectives by combining simple features [24]. As a result, representation learning algorithms are expected to perform better in the variable selection procedure than conventional methods. Several studies have also used deep learning methods to predict cancer prognosis with genomics data [25,26,27,28,29,30] and have evaluated the predictive performance of deep learning methods against conventional machine learning models [31]. These studies focused only on the performance of deep learning algorithms in predicting clinical endpoints and paid little attention to the contribution of genes in the prediction procedure, which is not conducive to improving predictive results or to finding the key diagnostic gene signatures needed to better understand the disease mechanism.

In this study, we proposed SWT-CNN to stratify the prognostic risk of cancer patients by using as many genes as possible and validated it with gene expression data of multiple cancer types downloaded from The Cancer Genome Atlas (TCGA) database [32]. Based on the evaluation of 15 tumor genomics datasets in TCGA, SWT-CNN provided superior performance compared to support vector machine (SVM) and logistic regression (LR), and yielded performance comparable to that of random forest (RF). In addition, we attempted to extract the representative features from the hidden layers of the CNN to evaluate the importance of genes in risk stratification and prediction. When generating representative features, a CNN tends to give heavy weights to features with large values while ignoring some small ones. This is not a problem for image recognition and classification, but it is unsuitable for genomics data modeling, because genes with relatively low expression levels may still be closely related to the clinical outcomes. Therefore, we introduced the wavelet transform, which has been successfully used for gene expression data analysis in previous studies [33,34,35,36,37,38,39,40,41], to enhance the significance of genes with relatively low expression levels in the gene list, so that the CNN can give them appropriate weight when abstracting and extracting features. For all the cancer types, we first evaluated the relationship between each gene and the clinical outcome by scoring the gene based on the representative features in the CNN, and then selected the genes closely related to the clinical outcome for the subsequent Cox proportional-hazards regression and prediction. Our results show that, compared with the SWT-CNN results, the risk prediction is further improved: the median overall survival time of the high-risk patients stratified by Cox regression was lower than that of the high-risk patients classified by SWT-CNN. This demonstrates that representative features are useful for identifying diagnostic genes and improving the stratification of cancer patients.

Results

Study design

In this study, the RNA-sequencing data and clinical information of all cancer types were collected from The Cancer Genome Atlas (TCGA) [32] database, and the patients were categorized into low- and high-risk groups according to their tumor stages or overall survival times. The gene expression profile of each patient was first decomposed by SWT, and the wavelet coefficients were then fed into the CNN for predictive model construction. A comparative analysis of model performance was conducted among SWT-CNN, SVM, RF and LR. More importantly, to explore the effectiveness of representative features in identifying diagnostic genes, we proposed a scoring function that estimates the weights of genes based on the representative features extracted from the hidden layer of the CNN, and selected gene signatures for stratifying the patients in all the cancer types. An overview of our study is depicted in Fig. 1.

Fig. 1

The workflow of our study

Selection of wavelet functions

Because different wavelet functions suit different signals and generate different wavelet coefficients, we applied 12 commonly used wavelet functions (4 wavelet families × 3 wavelet functions per family) to decompose the gene expression profiles of the patients and investigated the predictive performance of SWT-CNN. The AUCs for predicting risk stratification on the basis of tumor stages and 3-year overall survival across different cancer types in the validation set are shown in Fig. 2a and b, respectively. Fig. 2 shows that the AUCs for the prediction of tumor stages and overall survival differed mainly between cancer types. For the prediction of tumor stages and 3-year overall survival, the best AUCs were achieved for KIRP (AUC = 0.83) and LGG (AUC = 0.89), respectively.

Fig. 2

The AUCs achieved by using different wavelet functions for the prediction of tumor stages and 3-year overall survival. a The AUCs of predicting the tumor stages across different cancer types in validation set. b The AUCs of predicting the 3-year overall survivals across different cancer types in validation set

In addition, the choice of wavelet function used to generate the wavelet coefficients also had a certain impact on the prediction results. For each cancer type, we chose the most appropriate wavelet function for predicting the tumor stages and overall survival according to the AUCs. The optimal wavelet function and the corresponding best AUC for predicting the risk defined by tumor stages and by 3-year overall survival are listed in Tables 1 and 2, respectively. When predicting the tumor stages, the performance of the CNN models varied with the wavelet function and wavelet family across the 12 cancer types (Table 1). Interestingly, for the prediction of 3-year overall survival, more than half of the cancer types achieved the best AUC when the Daubechies wavelet family was used to generate the wavelet coefficients (Table 2). Compared with the functions in other wavelet families, the Daubechies wavelet functions are simple and have minimum support width, so the wavelet coefficients fluctuate little after decomposition. Although the magnitude of the wavelet coefficients of the original gene expression profiles became weaker after decomposition by Daubechies wavelets, the prediction results of the CNN model improved, which suggests that Daubechies wavelet decomposition highlights the expression signals better than the functions in other wavelet families when predicting overall survival. By contrast, the functions in the symlets wavelet family performed relatively poorly in predicting both tumor stages and overall survival. Finally, we used the optimal wavelet function for each cancer type, as listed in Tables 1 and 2, in the subsequent analysis.
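To make the decomposition step concrete, the following minimal sketch computes one level of a stationary (undecimated) wavelet transform with the Haar wavelet, the simplest member of the Daubechies family. It is an illustration only and not the implementation used in this study: the function name, the single decomposition level, the periodic boundary handling and the toy expression values are all assumptions. It shows that, unlike the decimated transform, the stationary transform returns approximation and detail coefficients of the same length as the input profile, so each coefficient stays aligned with a gene position.

```python
import math


def haar_swt_level1(signal):
    """One level of a stationary (undecimated) Haar wavelet transform.

    Returns (approximation, detail), each with the same length as the input.
    Periodic boundary handling is an assumption of this sketch.
    """
    n = len(signal)
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass (sum) and high-pass (difference) of each value and its neighbour.
    approx = [(signal[i] + signal[(i + 1) % n]) * scale for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) * scale for i in range(n)]
    return approx, detail


# Hypothetical expression profile: one highly expressed gene among low ones.
profile = [0.5, 0.7, 120.0, 0.6, 0.4, 0.9]
approx, detail = haar_swt_level1(profile)
```

In practice a wavelet library would be used so that other wavelet functions and deeper decomposition levels can be compared, as was done for the 12 wavelet functions above.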

Table 1 The detailed information of the data sets for tumor stage prediction
Table 2 The detailed information of the data sets for 3-year overall survival prediction

Performance of SWT-CNN on clinical prediction

After SWT decomposition, the wavelet coefficients of the gene expression profiles were fed into a one-layer CNN model, which has been reported to be sufficient for gene expression data modeling [31]. The area under the receiver operating characteristic curve (AUC) was used as the performance metric for evaluating the predictive models. For comparison, we also used SVM to predict the tumor stages and overall survival. The AUCs achieved by SWT-CNN and SVM for predicting the tumor stages of 12 cancer types and the 3-year overall survival of 10 cancer types are shown in Fig. 3.
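As a reminder of what the AUC measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen high-risk patient receives a higher predicted score than a randomly chosen low-risk patient, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for high-risk, 0 for low-risk; scores: predicted risk.
    Tied scores contribute 0.5 to the count.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four patients.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

Repeating this calculation over the random training/validation splits and averaging gives a mean AUC of the kind reported in Fig. 3.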

Fig. 3

The mean AUCs and the distribution of AUCs achieved by SWT-CNN and SVM over 100 random samplings for the prediction of tumor stages and 3-year overall survival. a Mean AUCs achieved by SWT-CNN and SVM for predicting the tumor stages. b The distribution of AUCs achieved by SWT-CNN and SVM for predicting the tumor stages. c Mean AUCs achieved by SWT-CNN and SVM for predicting the 3-year overall survival. d The distribution of AUCs achieved by SWT-CNN and SVM for predicting the 3-year overall survival

In general, model performance depended largely on the predictability of the endpoints, which is consistent with a previous study [8], even though RNA-sequencing data were used. The tumor stages of KIRP and the 3-year overall survival of LGG were easier to predict, and the mean AUCs achieved by both SWT-CNN and SVM were higher than 0.7. By contrast, the tumor stages of LUSC and the 3-year overall survival of OV were the most difficult to predict; the mean AUCs achieved by both SWT-CNN and SVM were near 0.5 (Fig. 3a and c). Compared with the results of SVM, the AUCs achieved by SWT-CNN were higher, except for the prediction of 3-year overall survival of BLCA.

In detail, for the easily predicted cancer types, SWT-CNN performed better than SVM. As Fig. 3a and c show, the mean AUCs for predicting the tumor stages of KIRP and the 3-year overall survival of LGG achieved by SWT-CNN (mean AUCs = 0.82 and 0.85, resp.) were roughly 0.1 higher than those achieved by SVM (mean AUCs = 0.71 and 0.74, resp.). For some cancer types that were difficult for SVM to predict, SWT-CNN still performed better. When predicting the tumor stages of KIRC, the mean AUC achieved by SWT-CNN (mean AUC = 0.74) was 0.19 higher than that achieved by SVM (mean AUC = 0.55), and almost all the AUCs from the 100 random samplings achieved by SWT-CNN were higher than those achieved by SVM (Fig. 3b). Similar results were found for the prediction of the tumor stages of COAD (mean AUCs for SWT-CNN and SVM = 0.64 and 0.51, resp.). For the prediction of 3-year survival of KIRC, although the mean AUC of SWT-CNN (0.66) was 0.14 higher than that of SVM (0.52), the AUCs of the 100 random samplings achieved by SWT-CNN were scattered (Fig. 3d), indicating that the prediction for this cancer by SWT-CNN might not be stable enough.

Some cancer types were difficult to predict with both models. When predicting the tumor stages of LUAD and LUSC, the mean AUCs achieved by SWT-CNN were only 0.55 and 0.53, respectively, and the prediction results of SVM were similar to random guessing (mean AUCs = 0.50 and 0.51 for LUAD and LUSC, resp.). Similar results were found for the prediction of 3-year overall survival of OV (mean AUCs = 0.53 and 0.48 for SWT-CNN and SVM, resp.) and LUSC (mean AUCs = 0.54 and 0.48 for SWT-CNN and SVM, resp.). For cancer types that are this difficult to predict, it may be necessary to further select key genes for prediction so as to eliminate the interference of uninformative gene signatures. The mean AUCs and standard errors of the AUCs for predicting the tumor stages and 3-year overall survival are listed in Additional file 1. The comparison with RF and LR is shown in Additional file 3. The performance of RF was similar to that of SWT-CNN in predicting the tumor stages, and slightly better than that of SWT-CNN in predicting 3-year overall survival. In the subsequent analysis, we investigated the utility of the representative features for selecting important genes and improving the prediction for such cancers. In addition, Kaplan-Meier survival analysis was conducted for all the cancer types. Figure 4 shows the survival curves of the patients predicted to be high-risk or low-risk in all the cancer types. The log-rank p values for all the cancer types were less than 0.0001, indicating that the survival times of the predicted high-risk patients were significantly different from those of the predicted low-risk patients.
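For readers unfamiliar with the Kaplan-Meier analysis mentioned above, the following sketch estimates a survival curve for one risk group and reads off the median survival time. It is a minimal illustration under stated assumptions: the function names and the toy follow-up data are hypothetical, an event value of 1 denotes an observed death and 0 a censored patient, and the log-rank test used to compare two groups is not included.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times: follow-up times; events: 1 = death observed, 0 = censored.
    One point is emitted per distinct time with at least one death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached within follow-up


# Hypothetical follow-up (days) for five patients; the third is censored.
curve = kaplan_meier([5, 8, 12, 20, 25], [1, 1, 0, 1, 1])
median = median_survival(curve)
```

Computing one such curve for the predicted high-risk group and one for the predicted low-risk group gives the pair of curves that a log-rank test then compares.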

Fig. 4

The results of Kaplan-Meier survival analysis of all cancer types

Identification of prognostic genes for further risk stratification

In the previous prediction, the SWT-CNN model used all the genes as features for the clinical prediction. For each cancer type, we applied our proposed scoring method to evaluate the importance of the genes in the gene list and used the genes highly associated with the cancer to predict the overall survival of the patients. We randomly selected 70% of the samples from the data set as the training set to build the model and kept the remaining samples as the validation set. For each cancer type, the SWT-CNN model was first constructed, and the representative features were then extracted from the hidden layer of the CNN to score the genes. Finally, the genes were ranked by their scores and the top n genes were used in the Cox proportional-hazards regression for the prediction of overall survival after treatment. We applied the Kendall tau measure to compare the top 100 genes in the gene lists generated from 5 bootstrap runs for the 10 TCGA datasets (see Additional file 6). For example, the Kendall tau values are all greater than 0.78, with corresponding p values less than 0.001, for the 5 gene lists in LGG, indicating the high stability of the gene lists generated by the proposed SWT-CNN.
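The stability check can be illustrated with a small sketch of Kendall's tau between two ranked gene lists. This is the simple tau-a variant without tie handling or a p value; it assumes that both lists contain exactly the same genes, which is a simplification, since this section does not state how genes missing from one of the top-100 lists were treated. The function name and the gene identifiers are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    rank_a and rank_b are lists of gene names ordered from most to least
    important. Returns a value in [-1, 1]; 1 means identical order.
    """
    if set(rank_a) != set(rank_b) or len(rank_a) != len(set(rank_a)):
        raise ValueError("rankings must contain the same unique items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items")
    position_b = {gene: i for i, gene in enumerate(rank_b)}
    concordant = 0
    discordant = 0
    # Each pair (i, j) with i < j is ordered i-before-j in rank_a.
    for i, j in combinations(range(len(rank_a)), 2):
        if position_b[rank_a[i]] < position_b[rank_a[j]]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Two hypothetical bootstrap rankings that differ by one swap.
tau = kendall_tau(["g1", "g2", "g3"], ["g1", "g3", "g2"])
```

A value close to 1 across all pairs of bootstrap lists indicates that the ranking is largely insensitive to resampling.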

We took the gene expression data of OV as an example, for which SWT-CNN achieved its worst result in Fig. 3c. From the 274 samples, 191 were randomly selected as the training set and the original gene expression matrix Y (26,270 genes × 191 samples) was constructed. SWT-CNN was run 100 times on the training set and the best model was kept for extracting the representative features, which formed a three-dimensional matrix (191 samples × 3284 convolutional features × 64 channels) produced by the pooling layer of the CNN. We then averaged the data over the 64 channels to obtain the representative feature matrix X, which contained 3284 features in rows and 191 samples in columns. According to our proposed method, the mapping coefficients from X to Y (matrix B, with 26,270 genes in rows and 3284 representative features in columns) were calculated. Finally, matrix B was averaged by rows to obtain a score vector of size 26,270 × 1, denoting the importance of the 26,270 genes in the classification. The genes were ranked by their scores and the top 700 genes were used in the univariate and multivariate Cox regression modeling. The genes significantly associated with overall survival are listed in Table 3. In total, 67 genes were found to be significantly associated with the 3-year overall survival of OV by univariate Cox regression, and 11 genes were significant in the multivariate Cox regression. The genes selected by univariate and multivariate Cox regression for the other cancer types are listed in Additional file 2. These genes might be considered as diagnostic genes in future studies.
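The sequence of matrix operations described above can be sketched on a toy example. This section does not state how the mapping coefficients B are estimated, so the sketch assumes a minimum-norm least-squares solution of Y ≈ BX, namely B = Y (XᵀX)⁻¹ Xᵀ, which requires no more samples than features and linearly independent sample columns in X. That choice, all function names and the toy data are assumptions for illustration, not the authors' implementation.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def invert(m):
    """Gauss-Jordan inverse of a small square matrix (partial pivoting)."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def average_channels(features):
    """features[sample][feature][channel] -> X[feature][sample]."""
    per_sample = [[sum(ch) / len(ch) for ch in sample] for sample in features]
    return transpose(per_sample)


def mapping_coefficients(Y, X):
    """Assumed estimator: B = Y (X^T X)^-1 X^T, so that B X reproduces Y.

    Y is genes x samples, X is features x samples, B is genes x features.
    """
    Xt = transpose(X)
    return matmul(matmul(Y, invert(matmul(Xt, X))), Xt)


def gene_scores(B):
    """Average each row of B to get one score per gene."""
    return [sum(row) / len(row) for row in B]


def top_genes(names, scores, n):
    """Names of the n genes with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order[:n]]
```

With the real data, X would be 3284 × 191 and Y would be 26,270 × 191, so a numerical linear algebra library would replace the hand-written matrix routines; the order of operations (channel averaging, mapping, row averaging, ranking) stays the same.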

Table 3 The genes considered to be significantly associated with the 3-year overall survivals of OV by the univariate Cox regression

After a risk score was assigned to each patient by multivariate Cox regression, the receiver operating characteristic (ROC) curve was used on the training set to determine the risk score cutoff for stratifying the patients. Patients with a risk score higher than the cutoff were assigned to the high-risk group and the rest to the low-risk group [42]. The stratification model was validated on the validation set. Figure 5 shows the stratification results for the OV data set. The K-M survival curves (Fig. 5a) show a significant difference (log-rank test p value < 0.0001) in overall survival time between the high-risk and low-risk patient groups divided by the risk scores. The median overall survival time for the high-risk and low-risk groups was 850.5 and 1355 days, respectively. Figure 5b shows the distribution of the survival times of the high- and low-risk groups divided by the risk score and by SWT-CNN. The mean survival time of the high-risk patients predicted by the risk score was lower than that of the high-risk patients predicted by SWT-CNN. The distribution of the overall survival time for the other cancer types is shown in Fig. 6. In general, the risk stratification of patients by the risk score with the diagnostic genes was more accurate than that by SWT-CNN with all the genes. The median survival times of the high- and low-risk groups divided by the risk score and by SWT-CNN for all the cancer types are listed in Table 4. Using the same labels defined in the previous prediction step, we evaluated the performance of the risk score model in predicting the 3-year overall survival for all the data sets. For the OV data set, Fig. 5c shows the ROC curves achieved by the risk score model (AUC = 0.66) and by 100 runs of SWT-CNN (mean AUC = 0.53). The prediction of 3-year overall survival was clearly improved by the risk score model with the diagnostic genes as features. Figure 7 shows the AUCs achieved by the risk score model as well as the mean AUCs of SWT-CNN. In general, compared with SWT-CNN, the prediction performance of the risk score model improved for all the cancer types except LGG, demonstrating that the representative features generated by the CNN can be helpful for identifying disease-related genes. More importantly, the risk score model generated a relatively small gene set, which can provide a more precise set of candidate genes for subsequent biological interpretation and experimental verification in clinics. For the data sets of OV, LUAD, LIHC and BLCA, the AUC increased by 0.13, 0.06, 0.05 and 0.08, respectively. The prediction performance of the risk score model for the other cancer types also improved to some extent.
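The stratification step can be sketched as follows. The risk score is taken to be the linear predictor of a fitted Cox model (the sum of coefficient times expression over the selected genes). The rule for choosing the cutoff from the ROC curve is not given in this section, so the sketch assumes Youden's J statistic (the cutoff that maximizes sensitivity minus false-positive rate) on the training set. The gene names, coefficients and scores are hypothetical.

```python
def risk_score(expression, coefficients):
    """Cox linear predictor: sum of coefficient * expression per gene."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def youden_cutoff(scores, labels):
    """Training-set cutoff maximizing TPR - FPR (assumed selection rule).

    labels: 1 = high-risk, 0 = low-risk. A patient is called high-risk
    when the score is strictly greater than the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_cut = None
    best_j = float("-inf")
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > cut and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut


def stratify(scores, cutoff):
    """Assign each patient to the high- or low-risk group."""
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical two-gene model and one patient's expression values.
coefs = {"GENE_A": 0.5, "GENE_B": -0.2}
patient = {"GENE_A": 2.0, "GENE_B": 1.0}
score = risk_score(patient, coefs)

# Cutoff learned on training scores, then applied to validation scores.
cutoff = youden_cutoff([0.2, 0.4, 0.6, 0.9], [0, 0, 1, 1])
groups = stratify([0.3, 0.7], cutoff)
```

Fixing the cutoff on the training set and applying it unchanged to the validation set keeps the validation result free of information from the validation labels.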

Fig. 5

The results of risk score model for predicting the 3-year overall survivals of OV. a The survival curves of high-risk and low-risk patients in OV data set stratified by risk score model. b The distribution of survival times of high-risk and low-risk patients stratified by the risk score model and SWT-CNN. c The ROC curves achieved by the risk score model and 100 runs of SWT-CNN

Fig. 6

The distribution of survival times of high-risk and low-risk patients for the other cancer types stratified by the risk score model and SWT-CNN

Table 4 Median survival time of the high-risk and low-risk patients divided by the risk score model and SWT-CNN
Fig. 7

The AUCs achieved by the risk score model and the mean AUCs achieved by 100 runs of SWT-CNN for predicting the 3-year overall survival for all the data sets

Discussion

Given that deep learning has been widely used in pattern recognition and has started to be applied to cancer prognosis prediction, we proposed a method called SWT-CNN and thoroughly investigated its performance in clinical cancer prediction. In our study, the gene expression profiles of the patients were first decomposed into wavelet coefficients by the stationary wavelet transform, in order to enhance the weights of genes with relatively low expression levels in the gene list, and were then used for model construction and clinical prediction with the convolutional neural network. In the modeling procedure, the CNN can efficiently abstract representative features from the gene expression patterns that are highly associated with the cancer type by using a representation learning strategy, which has been considered superior to conventional feature selection procedures. To evaluate the models, we followed the analytical pipeline of the MAQC-II study [8] and used AUC as the performance metric. Note that, for the prediction of continuous values related to survival, Harrell’s c-index would be more suitable than AUC for assessing model performance [43].
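To show how Harrell's c-index differs from the AUC, the sketch below computes it for right-censored survival data. A pair of patients is comparable when the one with the shorter follow-up time had an observed event; the pair is concordant when that patient also has the higher predicted risk. The function name and the toy data are hypothetical, and pairs with tied survival times are simply skipped in this simplified version.

```python
def concordance_index(times, events, risk):
    """Harrell's c-index for right-censored data.

    times: follow-up times; events: 1 = death observed, 0 = censored;
    risk: predicted risk, where higher means shorter expected survival.
    Tied risk predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if patient i died before patient j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort of four patients; the third is censored at day 6.
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.1, 0.5, 0.7])
```

Unlike the AUC, which needs the dichotomized 3-year labels, this measure uses the survival times directly and accounts for censoring.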

In general, the main discrepancy in the prediction results was between cancer types and depended on the predictability of the clinical endpoints (Fig. 3a and c). In addition, the partition of the data sets also had a certain impact on the prediction results, indicating that more samples should be added to improve the stability of the model (Fig. 3b and d). SWT-CNN showed superior prediction performance compared with SVM. For the cancer types that were easy to predict, both SWT-CNN and SVM achieved satisfactory prediction results. The mean AUCs for predicting the tumor stage of KIRP and the 3-year overall survival of LGG achieved by SWT-CNN and SVM were higher than 0.80 (mean AUCs = 0.82 and 0.85, resp.) and 0.70 (mean AUCs = 0.71 and 0.74, resp.), respectively. However, for some cancer types, the prediction performance of SVM was clearly insufficient. When predicting the tumor stage and the 3-year overall survival of KIRC, the mean AUCs achieved by SVM were only 0.55 and 0.52, respectively, while the mean AUCs achieved by SWT-CNN were 0.74 and 0.66, respectively. For the cancer types that were difficult to predict, neither method achieved satisfactory results, even though the results of SWT-CNN were slightly better than those of SVM. It is worth noting that almost all gene expression-based prediction models are data dependent. To elucidate this point, we predicted the tumor stages and 3-year overall survival of all the cancer types by using two other popular machine learning algorithms, namely random forest and logistic regression. The AUCs for predicting the tumor stages and 3-year overall survival of all the cancer types are shown in Additional file 3. The performance of random forest was comparable with that of SWT-CNN and superior to that of SVM and logistic regression. Therefore, in addition to prediction accuracy, it is necessary to pay more attention to whether the model can generate an interpretable gene set for subsequent downstream analysis.

For gene expression-based prediction in clinics, one expects not only that the model performs as well as possible, but also that it yields interpretable gene features, which both facilitate subsequent biological analysis of cancer mechanisms and provide candidates for the discovery of valuable clinical diagnostic genes. To improve the prediction performance, more reliable candidate genes should be selected for model construction. Therefore, in this study, we also proposed a strategy to map the representative features in the CNN to the original genes and weight the genes with scores. A higher score for a gene indicates that its expression level is more important for the clinical prediction. Because the CNN tends to give heavy weights to features with large values while ignoring some small ones, we introduced SWT to enhance the significance of genes with relatively low expression levels in the gene list and make the CNN weight the genes objectively. The prediction results for 3-year overall survival using the CNN with and without SWT are shown in Additional file 4. The performance of the models was comparable. However, when the genes were extracted from the representative features generated by the CNN alone and ranked by their scores, the mitochondrial function-related genes, namely MT-CO1, MT-ND4, MT-CO3, MT-ND1, MT-ND3, MT-CO2, MT-ND2, MT-ATP6, MT-ND4L and MT-CYB, were ranked at the top of the gene list in most of the models because their expression levels were much higher than those of other genes. When the gene expression profile was transformed by SWT before CNN modeling, the candidate genes could be correctly ranked in the gene list. By using these genes as features in the univariate and multivariate Cox regression, we finally established the risk score models for predicting the 3-year overall survival for all the data sets. For the data sets of BLCA, LIHC, LUAD and OV, the performance of the risk score models (AUCs = 0.65, 0.70, 0.67 and 0.66, resp.) was higher than that of SWT-CNN (mean AUCs = 0.57, 0.65, 0.61 and 0.53, resp.). Note that, for the other cancer types, the prediction performance of the risk score model was only slightly improved compared with that of SWT-CNN. The reason may be that the representative features cannot extract the gene expression patterns effectively. In addition, using the candidate genes generated by the risk score model as features, we reconstructed the SVM models for predicting the 3-year overall survival of all cancer types. The results showed that the performance of SVM improved to some extent with the new features compared with the original SVM models (Additional file 5), indicating the effectiveness of the candidate genes generated from the representative features of the CNN. To facilitate interpretation, we used a single-layer network for all data sets. For some cancer types that were hard to predict, the number of layers could be increased appropriately to ensure effective extraction of representative features. More detailed research on tuning the CNN parameters for specific endpoints can be conducted in future work.

To further validate the function of the genes that were selected as features in univariate and multivariate Cox regression modeling, we mapped the genes to the Cancer Gene Census (CGC) data set in the Catalogue Of Somatic Mutations In Cancer (COSMIC release v90) database [44]. For the data set of OV, 67 genes were identified by univariate Cox regression to be significantly associated with the cancer (Table 3) and seven of them, namely COL1A1, COL3A1, RPL10, ARHGAP5, LATS1, TSHR and SLC34A2, were found in the CGC data set. Genes COL1A1 and COL3A1 are demonstrated that played an unfavorable role in the development of ovarian cancer, and could be considered as the prognostic genes of OV [45]. Moreover, COL1A1 has also been found to be positively related with the degree of invasion, metastasis, and advanced stages of gastric cancer [45, 46]. COL3A1 can also be considered to be a potential biomarker for colon cancer [47], breast cancer [48] and brain tumor [49]. ARHGAP5 was identified as an oncogene which can promote tumor metastasis and proliferation [50, 51]. It was also proved that ARHGAP5 could be a prognostic marker of gastric cancer [50] and the expression level of it was associated with invasive and migrative ability of nasopharyngeal carcinoma cells [51]. LATS1 is one of family proteins of large tumor suppressor (LATS) and has been proposed to be a tumor suppressor [52]. It had been found that the expression level of LATS1 had decreased in serous ovarian cancer patients but this gene highly expressed in normal ovarian tissue [52, 53]. SLC34A2 was proved to have an evident effect in the progression of several types of cancers, such as in ovarian cancer [54, 55], breast cancer and non-small cell lung cancer [56, 57]. The gene fusion of SLC34A2 and ROS1 played an important role in the progression of non-small cell lung cancer [58]. 
In addition, genes LRP1B and CCR4, both of which are reported in the COSMIC database, were identified by univariate Cox regression from the BLCA and HNSC data sets, respectively (Additional file 2). LRP1B is one of the top ten genes mutated in human cancers [59, 60] and might be a potential contributor to the emergence of chemotherapy resistance [59]. Gene CCR4 was reported to be associated with adult T-cell leukaemia and lymphoma in the COSMIC database. The ligands for CCR4 are produced by tumor cells or the microenvironment, and can attract CCR4-expressing T regulatory cells and create a favorable survival environment for tumor cells [61]. Moreover, gene SMARCD1 was identified by univariate Cox regression from both the KIRC and LIHC data sets, and gene TSHR was identified by univariate Cox regression from the LUAD data set (Additional file 2). These results indicated that our proposed scoring method can effectively screen the diagnostic genes for clinical cancer prediction by using representative features to evaluate the importance of genes.

Several caveats should also be discussed: 1) The performance of SWT-CNN was comparable to that of CNN (Fig. 1 in Additional file 4), indicating that the main contribution of SWT in the model was to enhance the significance of genes with relatively low expression levels. 2) Only a small proportion of the genes identified by our strategy are included in the current version of the COSMIC database, and further experimental verification of the causal relationship between the remaining genes and cancer etiology is needed. 3) To facilitate the evaluation of SWT-CNN, SVM, RF and LR, we categorized the patients into two groups (high- and low-risk) according to their tumor stages or survival times. The prediction results of multivariate Cox regression were also dichotomized by using a risk score cutoff. In fact, for survival prediction, both deep learning-based algorithms (e.g., DeepSurv [43]) and Cox regression can directly use continuous values (e.g., survival time) for modeling, and the prediction results can be evaluated by AUC [62] or c-index [63]. 4) Compared with AUC, c-index is a more statistically stringent performance metric for evaluating survival models. Specifically, the c-index inspects not only the consistency between the predicted and actual labels (i.e., clinical outcomes), but also the correlation between the predicted values and the survival times. For binary classification, e.g., when evaluating logistic regression, the c-index is almost equivalent to AUC. 5) As a classical signal processing method, the Fourier transform also obtains its coefficients by convolving the original signal with Fourier basis functions. Compared with the Fourier transform, the wavelet transform has some advantages [64,65,66].
In this study, we used the wavelet transform to decompose the gene expression profile because the low frequency part of the wavelet coefficients is an approximation of the original signal, which makes it easier to map the score matrix back to the original gene list. The Fourier coefficients only reflect the frequencies of sine and cosine functions, so it is difficult to match them with the original genes. Further research could explore whether the Fourier transform can be combined with CNN to identify disease-related genes.
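The relationship between c-index and AUC noted in caveat 4 can be checked numerically. The sketch below is a plain-Python illustration and not the evaluation code used in this study; the function names `harrell_c_index` and `auc` and all toy values are hypothetical, and tied survival times are simply skipped. It counts concordant pairs among comparable pairs and shows that, when the outcome is a binary label without censoring, the same pair counting gives the AUC.

```python
from itertools import combinations


def harrell_c_index(times, events, risk):
    """Harrell's c-index: share of comparable pairs that are ranked concordantly.

    A pair is comparable when the patient with the shorter time had an
    observed event. A higher risk value should go with shorter survival.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # the earlier patient was censored, so the order is unknown
        comparable += 1
        if risk[first] > risk[second]:
            concordant += 1.0
        elif risk[first] == risk[second]:
            concordant += 0.5
    return concordant / comparable


def auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = high-risk (positive), 0 = low-risk (negative).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.1]

# Encode the binary outcome as a survival problem: positives "fail" at time 1
# and negatives are followed to time 2, so only positive/negative pairs count.
times = [1 if y == 1 else 2 for y in labels]
events = [y == 1 for y in labels]

c_binary = harrell_c_index(times, events, scores)
auc_binary = auc(labels, scores)
print(c_binary, auc_binary)
```

With real survival times and censoring, the number of comparable pairs changes and the two metrics are no longer guaranteed to coincide, which is the sense in which the c-index carries more information.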

Conclusions

In our study, we proposed a gene expression-based method called SWT-CNN as an alternative for stratifying the prognostic risk for cancer patients and thoroughly investigated the performance of the model in the large data sets. Our results indicated that SWT-CNN can be an excellent tool for risk stratification in cancers. When evaluating the genes by using the representative features in CNN, the diagnostic genes that were highly associated with the cancers can be effectively identified and used as features for improving the prediction performance of the models. In addition, these diagnostic genes can also be helpful for better understanding the mechanism of the diseases.

Methods

Data sets

The RNA-sequencing data in FPKM format and the clinical information of the patients were downloaded from The Cancer Genome Atlas (TCGA) database [32] (https://portal.gdc.cancer.gov/). The summarized fragments per kilobase million (FPKM) values of 60,483 transcripts (level 3 data) were first mapped to unique genes by using the comprehensive gene annotation file (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz). In total, the expression data of 34,534 unique protein-coding and lncRNA genes were generated for the subsequent predictive model construction. We removed the genes whose expression levels were zero in more than half of the patients [67], and the remaining subset of genes was kept for the subsequent analysis. After data processing, the gene expression data of twelve cancer types, namely BLCA, BRCA, COAD, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, SKCM, STAD and THCA, were used for the prediction of tumor stage, and the data of ten cancer types, namely BLCA, HNSC, KIRC, LGG, LIHC, LUAD, LUSC, OV, SKCM and UCEC, were used for the prediction of 3-year overall survival.

To investigate the model performance on predicting clinical endpoints, we categorized the patients into two groups according to their clinical information. For tumor stage prediction, patients with stage III or IV tumors were assigned to the high-risk group and those with stage I or II tumors were assigned to the low-risk group. For 3-year overall survival prediction, patients whose survival times were less than 3 years were assigned to the high-risk group, and the rest were assigned to the low-risk group. Note that patients who were still alive but whose recorded survival times were less than 3 years were removed from the data sets. For both tumor stage and 3-year overall survival, the patients in the high-risk group were considered the positive samples. The number of samples and the proportion of positive and negative samples in each cancer data set are listed in Tables 1 and 2.
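The grouping rules above can be written down as two small functions. This is a minimal sketch rather than the preprocessing code of this study: the function names are hypothetical, 3 years is taken as 3 × 365 days, and stage strings are assumed to be Roman numerals with an optional A/B/C sub-stage suffix.

```python
def stage_label(stage):
    """Return 1 (high-risk) for stage III/IV, 0 (low-risk) for stage I/II.

    Assumption: stages arrive as Roman numerals, optionally followed by a
    sub-stage letter such as "IIIA". Anything else returns None (excluded).
    """
    base = stage.strip().upper().rstrip("ABC")
    if base in ("III", "IV"):
        return 1
    if base in ("I", "II"):
        return 0
    return None


def survival_label(days, deceased, horizon_days=3 * 365):
    """Return 1 (high-risk), 0 (low-risk) or None (removed from the data set).

    Patients who died before the horizon are positive. Patients still alive
    with less than the horizon of recorded time are removed, because their
    3-year outcome is unknown. Everyone else is negative.
    """
    if days < horizon_days:
        return 1 if deceased else None
    return 0


# Toy records (hypothetical): (recorded days, deceased flag).
records = [(400, True), (400, False), (2000, False), (2000, True)]
labels = [survival_label(days, deceased) for days, deceased in records]
print(labels)
```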

Support vector machine

Support vector machine (SVM) [68, 69] is a popular machine learning algorithm, which was first proposed by Vapnik [70] and has been widely used in binary classification for decades. SVM classifies the samples by projecting them from the original space to a higher-dimensional space and searching for an optimal separating hyperplane. Before model construction, the genes were filtered by using Student's t-test and a fold change cutoff. Only the genes with p values < 0.05 and absolute fold changes > 2 were kept for the subsequent modeling. In our study, we used the radial basis function (RBF) as the kernel function in SVM and optimized the parameters (C and gamma) by a grid search approach. To reduce the impact of data partitioning on the results, we randomly selected 70% of the samples as the training set and validated the models on the remaining 30%. This procedure was repeated 100 times. The SVM modeling procedure was conducted in Python 2.7 by using the sklearn package.
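The repeated 70/30 partitioning described above can be sketched as a small generator. This is an illustration only: the function name `repeated_holdout`, the fixed seed and the absence of stratification are assumptions, since the text does not state how the random selection was implemented.

```python
import random


def repeated_holdout(n_samples, n_repeats=100, train_fraction=0.7, seed=0):
    """Yield (train_indices, validation_indices) for repeated random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = int(round(n_samples * train_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


# Example with 50 hypothetical patients: 35 for training, 15 for validation.
splits = list(repeated_holdout(n_samples=50, n_repeats=100))
train_idx, valid_idx = splits[0]
print(len(splits), len(train_idx), len(valid_idx))
```

A model would be fitted on each training part and scored on the matching validation part, and the 100 scores would then be averaged, which is how a mean AUC over repeated runs can be obtained.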

Stationary wavelet transform

The concept of wavelet was first introduced by Morlet and Grossmann [71] and has been successfully applied in the signal processing field for decades. In the square integrable space L2(R), the wavelet function is defined as:

$$ {\upvarphi}_{\mathrm{a},\mathrm{b}}\left(\mathrm{t}\right)=\frac{1}{\sqrt{\mathrm{a}}}\upvarphi \left(\frac{\mathrm{t}-\mathrm{b}}{\mathrm{a}}\right),\mathrm{a},\mathrm{b}\in \mathrm{R} $$
(1)

where a and b represent the scale and translation parameters, respectively. A wavelet family can be generated by means of translations and dilations of φ. The continuous wavelet transform can be described by the following equation:

$$ {\mathrm{F}}_{\mathrm{CWT}}\left(\mathrm{a},\mathrm{b}\right)=\frac{1}{\sqrt{\mid \mathrm{a}\mid }}{\int}_{-\infty}^{+\infty}\mathrm{f}\left(\mathrm{t}\right)\upvarphi \left(\frac{\mathrm{t}-\mathrm{b}}{\mathrm{a}}\right)\mathrm{dt} $$
(2)

where f(t) is the original signal. It can be seen that the transformed signal (wavelet coefficients) F_CWT(a,b) is the result of convolution between the original signal and the wavelet function. It is also a function of the scale parameter a and the translation parameter b. The inverse continuous wavelet transform can be conducted by calculating the convolution of the transformed signal and the wavelet function:

$$ \mathrm{f}\left(\mathrm{t}\right)=\frac{1}{{\mathrm{C}}_{\upvarphi}^2}{\int}_{-\infty}^{+\infty }{\int}_{-\infty}^{+\infty }{\mathrm{F}}_{\mathrm{C}\mathrm{WT}}\left(\mathrm{a},\mathrm{b}\right)\frac{1}{{\mathrm{a}}^2}\upvarphi \left(\frac{\mathrm{t}-\mathrm{b}}{\mathrm{a}}\right)\mathrm{dbda} $$
(3)

where Cφ is the admissibility constant, which depends on the chosen wavelet function.

In general, Eq. (1) is discretized by setting:

$$ \mathrm{a}={2}^{-\mathrm{j}},\mathrm{b}={2}^{-\mathrm{j}}\mathrm{k}\ \left(\mathrm{j},\mathrm{k}\in \mathrm{Z}\right) $$
(4)

Then, the Discrete Wavelet Transform (DWT) can be defined as:

$$ {\mathrm{F}}_{\mathrm{DWT}}\left(\mathrm{j},\mathrm{k}\right)={2}^{\mathrm{j}/2}{\int}_{-\infty}^{+\infty}\mathrm{f}\left(\mathrm{t}\right)\upvarphi \left({2}^{\mathrm{j}}\mathrm{t}-\mathrm{k}\right)\mathrm{dt} $$
(5)
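
As a check on the step from the continuous to the discrete transform, the dyadic choice in Eq. (4) can be substituted directly into the wavelet family of Eq. (1). This worked substitution is added for clarity and uses only the symbols defined above:

$$ \frac{\mathrm{t}-\mathrm{b}}{\mathrm{a}}=\frac{\mathrm{t}-{2}^{-\mathrm{j}}\mathrm{k}}{{2}^{-\mathrm{j}}}={2}^{\mathrm{j}}\mathrm{t}-\mathrm{k},\quad \frac{1}{\sqrt{\mathrm{a}}}={2}^{\mathrm{j}/2}\quad \Rightarrow \quad {\upvarphi}_{\mathrm{j},\mathrm{k}}\left(\mathrm{t}\right)={2}^{\mathrm{j}/2}\upvarphi \left({2}^{\mathrm{j}}\mathrm{t}-\mathrm{k}\right) $$

The argument of φ in the discrete transform is therefore 2^j t − k and the prefactor is 2^(j/2); inserting this wavelet into the integral of Eq. (2) gives the discrete coefficients.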

After transformation, the original signal is decomposed into the wavelet coefficients of the first layer, which represent the low frequency part (approximate profile) and the high frequency part (details) of the original signal. The low frequency part can then be further decomposed into the wavelet coefficients of the second layer, and so on. As the number of decomposition layers increases, the approximation becomes coarser and more of the original information is lost. We tested the prediction results by using the wavelet coefficients from 3 to 5 decomposition layers (data not shown) and found that the number of layers had little influence on the prediction performance of the models. We therefore chose a smaller number of decomposition layers to keep as much of the original information as possible. In most cases, e.g. in chemical signal processing, the low frequency part of the wavelet coefficients is an approximation of the original signal and reflects its profile to a certain extent. The high frequency part of the wavelet coefficients is usually considered to be related to the noise of the original signal, which is why it is usually discarded when the wavelet transform is used for signal denoising. In our study, we only used the low frequency part of the wavelet coefficients for the subsequent analysis. In addition, to maintain the number of features, we used the stationary wavelet transform (SWT), also known as the undecimated wavelet transform, which does not decimate the coefficients at each transformation level. It is a translation-invariant modification of the DWT [72]. Due to the up-sampling operation of the filter coefficients, the SWT has the advantage of being shift-invariant compared with DWT [73, 74].

In this study, SWT generalizes the expression profiles of grouped genes and denoises the gene expression signal. When the gene expression profile was decomposed into the wavelet coefficients, we kept the low frequency part of the coefficients to obtain a cleaner signal, which was an approximation of the original gene expression profile. Meanwhile, the differences in expression levels between genes are reduced in the process of approximation, which is conducive to generating the representative features by CNN. The gene expression profile of a patient was first decomposed into a certain number of layers by the stationary wavelet transform, and the wavelet coefficients were then input into the convolutional neural network. For instance, suppose a gene expression matrix X contains n samples in rows and p genes in columns. The wavelet transform decomposes the gene expression data sample by sample. For each sample, the gene expression profile is a vector of size 1 × p (1 sample × p genes). If we decompose the profile into m layers, the wavelet coefficient matrix is m × p (m layers × p wavelet coefficients). This decomposition is repeated n times so that the gene expression profiles of all samples are transformed to wavelet coefficients. As a result, the final wavelet coefficient matrix is n × p × m (n samples × p wavelet coefficients × m layers). Subsequently, the wavelet coefficient matrix is input into CNN for modeling. Note that the wavelet coefficients are only the result of a mathematical transform: they approximate the original gene expression profile but cannot be directly associated with the biological meaning of the genes. Because decomposing the signal with different wavelet functions may yield different wavelet coefficients, it is necessary to investigate the impact of the wavelet function on the predictive models.
After decades of development, many wavelet functions have been proposed for signal processing. Here, we chose four of the most commonly used wavelet families to test. Considering that there is little difference between the wavelet basis functions within the same family, we chose three wavelet basis functions from each family. Consequently, twelve commonly used wavelet functions were examined in this study, including the Daubechies wavelet family (db1, db3 and db5), the Coiflets wavelet family (coif1, coif3 and coif5), the biorthogonal wavelet family (bior3.1, bior3.3 and bior3.5) and the symlets wavelet family (sym2, sym4 and sym6). The number of decomposition layers was set to 3. The wavelet decomposition was conducted with a Python package called PyWavelets [75]. To choose the optimal wavelet function for each cancer type, we randomly selected 70% of the samples as the training set and used the remaining samples as the validation set. The twelve wavelet functions were separately used to decompose the gene expression profiles of the samples, and the wavelet coefficients were input into the CNN for modeling on the training set. The validation set was used to evaluate the performance of the models. In this way, the wavelet function that performed best in combination with CNN was selected for each cancer type.
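The n × p × m layout described above can be made concrete with a small sketch. It is a hand-written undecimated Haar (db1-type) transform with periodic boundary handling, written only to illustrate the shapes. It is not the PyWavelets routine used in this study, and its coefficient alignment and boundary treatment may differ from that library. The function names and the toy matrix are hypothetical.

```python
import math


def haar_swt_approximations(profile, levels):
    """Low-frequency (approximation) coefficients of an undecimated Haar transform.

    At level j the two-tap averaging filter is dilated by 2**(j - 1) instead of
    the signal being down-sampled, so every level keeps len(profile) values.
    """
    p = len(profile)
    current = list(profile)
    per_level = []
    for level in range(1, levels + 1):
        step = 2 ** (level - 1)
        current = [
            (current[i] + current[(i + step) % p]) / math.sqrt(2.0)
            for i in range(p)
        ]
        per_level.append(current)
    return per_level  # m lists, each of length p


def decompose_matrix(expression, levels):
    """Turn an n x p expression matrix into an n x p x m nested list."""
    tensor = []
    for profile in expression:
        per_level = haar_swt_approximations(profile, levels)
        # Re-order from (m, p) to (p, m): one row per gene position.
        tensor.append([list(column) for column in zip(*per_level)])
    return tensor


# Toy expression matrix (hypothetical values): n = 2 samples, p = 8 genes.
expression = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 4.0, 6.0, 8.0, 6.0, 4.0, 2.0],
]
coefficients = decompose_matrix(expression, levels=3)
shape = (len(coefficients), len(coefficients[0]), len(coefficients[0][0]))
print(shape)
```

Because no down-sampling takes place, each of the m levels still has one coefficient per gene position, which is what allows scores computed on the coefficients to be mapped back to the gene list.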

Convolutional neural network

As one of the classical deep learning algorithms, the convolutional neural network (CNN) [76, 77] is widely used in image processing. Similar to a conventional neural network, CNN includes an input layer, an output layer and a number of hidden layers. Among the hidden layers, CNN usually involves convolutional layers and pooling layers, which efficiently reduce the connections between the neurons and extract the features from the original image, respectively. In our study, we constructed CNN models involving an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer. The architecture of the CNN model and the parameters used are shown in Fig. 8. The wavelet coefficient matrices of the patients were directly input into the CNN models for classification. The optimizer, loss, activation and output functions were set to RMSprop, binary_crossentropy, relu and softmax, respectively. All the calculations of CNN modeling were conducted in Python 2.7 by using the TensorFlow and Keras packages. Similarly, we randomly selected 70% of the samples as the training set to construct the models and validated them on the remaining 30%. The sampling procedure was repeated 100 times. In addition, Kaplan-Meier survival analysis was applied to evaluate the stratification of the patients. The calculation was conducted in GraphPad Prism 8 software (https://www.graphpad.com/scientific-software/prism/).

Fig. 8 The architecture of the SWT-CNN model in our study

Scoring approach for evaluating gene importance

We proposed a scoring approach to identify the diagnostic genes for improving prediction performance. For each cancer type, 70% of the samples were randomly selected as the training set, which was first used to construct the SWT-CNN model. The modeling procedure was repeated 100 times and the best model was kept for extracting the representative features. We extracted the coefficients output by the pooling layer of the best model and averaged them over all channels to obtain the representative feature matrix X. Then, according to the principle of least squares, we calculated the scores for all genes as follows:

$$ \mathrm{B}=\mathrm{Y}\bullet {\mathrm{X}}^{\mathrm{T}}{\left(\mathrm{X}\bullet {\mathrm{X}}^{\mathrm{T}}\right)}^{-1} $$
(6)

where matrix Y is the wavelet coefficient matrix that was input into the CNN model, and B is the score matrix for the wavelet coefficients, with the number of coefficients in rows and the number of representative features in columns. Because the wavelet coefficients are an approximation of the original gene expression profile, the score matrix of the wavelet coefficients can also be regarded as the score matrix of the genes. Then, we averaged each row of matrix B to obtain the scores for all genes, which indicate the importance of the genes to the cancer.
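Eq. (6) can be illustrated with NumPy on synthetic data. The sketch below is not the released implementation. It assumes an orientation inferred from the stated shape of B: Y is coefficients × samples and X is representative features × samples, so that B is coefficients × features and solves Y ≈ B·X in the least-squares sense. It also assumes there are at least as many samples as features, so that X·X^T is invertible. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_coeffs, n_features = 40, 12, 5  # hypothetical sizes

# X: representative features (features x samples), random for illustration.
X = rng.normal(size=(n_features, n_samples))
# Y is built from a known B_true so that the recovery can be checked.
B_true = rng.normal(size=(n_coeffs, n_features))
Y = B_true @ X


def score_matrix(Y, X):
    """Least-squares scores B = Y X^T (X X^T)^{-1}, shape (coefficients, features)."""
    gram = X @ X.T  # symmetric (features x features) matrix
    # Solve gram @ B.T = X @ Y.T rather than forming the inverse explicitly.
    return np.linalg.solve(gram, X @ Y.T).T


B = score_matrix(Y, X)
gene_scores = B.mean(axis=1)        # average each row: one score per gene
ranking = np.argsort(-gene_scores)  # assumption: larger score ranks first
print(B.shape, ranking[:3])
```

Solving the linear system gives the same result as the explicit inverse in Eq. (6) but is numerically more stable when X·X^T is poorly conditioned.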

Cox proportional-hazards regression

We extracted the top n genes ranked by the scores for the Cox proportional-hazards regression. The number n was tested from 100 to 1000 genes with a step of 100, and the optimal value was determined by using the training set. Univariate Cox regression was conducted by using the expression data of the top n genes combined with patient survival time for gene selection (p < 0.05). Then, the selected genes were used in the multivariable Cox regression. Note that, with a limited sample size and a large gene set, the multivariable Cox regression may be unstable and fail to converge; in that case, penalized Cox regression [78] should be used instead. Finally, the genes significantly associated with the overall survival (p < 0.05) were considered the diagnostic genes. To use these genes as features for the prediction of 3-year overall survival, we calculated the risk scores for the patients and summarized them as the probability of overall survival from the cancer data set as follows [42, 79, 80]:

$$ \mathrm{Risk}\ \mathrm{score}={\upbeta}_1{\mathrm{x}}_1+{\upbeta}_2{\mathrm{x}}_2+{\upbeta}_3{\mathrm{x}}_3+\dots +{\upbeta}_{\mathrm{N}}{\mathrm{x}}_{\mathrm{N}} $$
(7)

where xi is the expression value of the ith gene and βi is the corresponding Cox coefficient. Then, the receiver operating characteristic (ROC) curve was computed on the training set to determine the optimal cut-off point for classification [81]. This cut-off was then used to stratify the patients in the validation set into low- and high-risk groups.
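The risk score of Eq. (7) and the ROC-based cut-off can be sketched in a few lines. The text does not state which ROC criterion was used, so maximizing Youden's index (sensitivity + specificity − 1) is an assumption here, and the Cox coefficients, expression values and function names are hypothetical.

```python
def risk_score(expression, coefficients):
    """Eq. (7): weighted sum of gene expression values and Cox coefficients."""
    return sum(b * x for b, x in zip(coefficients, expression))


def youden_cutoff(scores, labels):
    """Return the score threshold that maximizes sensitivity + specificity - 1.

    Assumption: samples with score >= threshold are called high-risk.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff


# Hypothetical Cox coefficients for two genes and a toy training set.
betas = [0.8, -0.5]
train_expression = [
    [2.0, 1.0], [1.5, 0.5], [0.5, 2.0], [1.0, 3.0], [2.5, 2.0], [0.2, 0.4],
]
train_labels = [1, 1, 0, 0, 1, 0]  # 1 = high-risk, 0 = low-risk

train_scores = [risk_score(x, betas) for x in train_expression]
cutoff = youden_cutoff(train_scores, train_labels)

# The cut-off learned on the training set is applied unchanged to new patients.
validation_expression = [[1.8, 0.6], [0.3, 1.5]]
groups = [
    "high" if risk_score(x, betas) >= cutoff else "low"
    for x in validation_expression
]
print(cutoff, groups)
```

Fixing the cut-off on the training set before looking at the validation set keeps the stratification of the validation patients free of information from their own outcomes.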

The source code can be downloaded from GitHub (https://github.com/zyrr183/TCGA_SWT-CNN-Risk-score-Method).

Availability of data and materials

The sample data and the source code used in our study can be accessed from GitHub (https://github.com/zyrr183/TCGA_SWT-CNN-Risk-score-Method). The full data sets can be freely downloaded from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). The datasets supporting the conclusions of this article are included within the article and its additional files.

Abbreviations

AUC: Area under the receiver operating characteristic curve

BLCA: Bladder urothelial carcinoma

BRCA: Breast invasive carcinoma

CNN: Convolutional neural network

COAD: Colon adenocarcinoma

HNSC: Head and neck squamous cell carcinoma

KIRC: Kidney renal clear cell carcinoma

KIRP: Kidney renal papillary cell carcinoma

LGG: Brain lower grade glioma

LIHC: Liver hepatocellular carcinoma

LUAD: Lung adenocarcinoma

LUSC: Lung squamous cell carcinoma

OV: Ovarian serous cystadenocarcinoma

SKCM: Skin cutaneous melanoma

STAD: Stomach adenocarcinoma

SVM: Support vector machine

SWT: Stationary wavelet transform

TCGA: The Cancer Genome Atlas

THCA: Thyroid carcinoma

UCEC: Uterine corpus endometrial carcinoma

References

  1. 1.

    Cosma G, Brown D, Archer M, Khan M, Pockley AG. A survey on computational intelligence approaches for predictive modeling in prostate cancer. Expert Syst Appl. 2017;70:1–19.

    Google Scholar 

  2. 2.

    Gyorffy B, Hatzis C, Sanft T, Hofstatter E, Aktas B, Pusztai L. 3 multigene prognostic tests in breast cancer: past, present, future. Breast Cancer Res. 2015;17:11.

    PubMed  PubMed Central  Google Scholar 

  3. 3.

    Wood DE, White JR, Georgiadis A, Van Emburgh B, Parpart-Li S, Mitchell J, et al. A machine learning approach for somatic mutation discovery. Sci Transl Med. 2018;10(457):eaar7939.

    PubMed  PubMed Central  Google Scholar 

  4. 4.

    Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16:321.

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Jurmeister P, Bockmayr M, Seegerer P, Bockmayr T, Treue D, Montavon G, et al. Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases. Sci Transl Med. 2019;11(509):eaaw8513.

    CAS  PubMed  Google Scholar 

  6. 6.

    Swan AL, Mobasheri A, Allaway D, Liddell S, Bacardit J. Application of machine learning to proteomics data: classification and biomarker identification in Postgenomics biology. OMICS. 2013;17(12):595–610.

    CAS  PubMed  PubMed Central  Google Scholar 

  7. 7.

    Goldenberg SL, Nir G, Salcudean SE. A new era: artificial intelligence and machine learning in prostate cancer. Nat Rev Urol. 2019;16(7):391–403.

    PubMed  Google Scholar 

  8. 8.

    Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, et al. The MicroArray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol. 2010;28(8):827–38.

    CAS  PubMed  Google Scholar 

  9. 9.

    Li R, Liang H. Variable selection in semiparametric regression modeling. Ann Stat. 2008;36(1):261–86.

    PubMed  PubMed Central  Google Scholar 

  10. 10.

    Bandyopadhyay S, Saha S, Maulik U, Deb K. A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans Evol Comput. 2008;12(3):269–83.

    Google Scholar 

  11. 11.

    Wu H, Liu SP, Gong J, Liu JY, Zhang Q, Leng XH, et al. VCPA, a novel synthetic derivative of alpha-tocopheryl succinate, sensitizes human gastric cancer to doxorubicin-induced apoptosis via ROS-dependent mitochondrial dysfunction. Cancer Lett. 2017;393:22–32.

    CAS  PubMed  Google Scholar 

  12. 12.

    Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.

    CAS  PubMed  Google Scholar 

  13. 13.

    Gheyas IA, Smith LS. Feature subset selection in large dimensionality domains. Pattern Recogn. 2010;43(1):5–13.

    Google Scholar 

  14. 14.

    Jurtz VI, Johansen AR, Nielsen M, Almagro Armenteros JJ, Nielsen H, Sonderby CK, et al. An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics. 2017;33(22):3685–90.

    CAS  PubMed  PubMed Central  Google Scholar 

  15. 15.

    Chen Y, Jiang H, Li C, Jia X, Ghamisi P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans Geosci Remote Sens. 2016;54(10):6232–51.

    Google Scholar 

  16. 16.

    Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

    CAS  PubMed  Google Scholar 

  17. 17.

    Yan R, Gao RX, Chen X. Wavelets for fault diagnosis of rotary machines: a review with applications. Signal Process. 2014;96:1–15.

    Google Scholar 

  18. 18.

    LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

    CAS  PubMed  Google Scholar 

  19. 19.

    Moen E, Bannon D, Kudo T, Graf W, Covert M, Van Valen D. Deep learning for cellular image analysis. Nat Methods. 2019;16(12):1233.

    CAS  PubMed  Google Scholar 

  20. 20.

    Ling ZH, Kang SY, Zen H, Senior A, Schuster M, Qian XJ, et al. Deep learning for acoustic modeling in parametric speech generation. IEEE Signal Process Mag. 2015;32(3):35–52.

    Google Scholar 

  21. 21.

    Li H. Deep learning for natural language processing: advantages and challenges. Natl Sci Rev. 2018;5(1):24–6.

    Google Scholar 

  22. 22.

    Baskin II, Winkler D, Tetko IV. A renaissance of neural networks in drug discovery. Expert Opin Drug Discovery. 2016;11(8):785–95.

    CAS  Google Scholar 

  23. 23.

    Bengio Y. Deep learning of representations: looking forward. In: Statistical language and speech processing first international conference, SLSP 2013 Proceedings: LNCS 7978. Berlin: Springer; 2013. p. 1–37.

    Google Scholar 

  24. 24.

    Min X, Zeng WW, Chen SQ, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18:478.

    PubMed  PubMed Central  Google Scholar 

  25. 25.

    Singh R, Lanchantin J, Robins G, Qi YJ. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016;32(17):639–48.

    Google Scholar 

  26. 26.

    Li X, Liu LY, Zhou J, Wang C. Heterogeneity analysis and diagnosis of complex diseases based on deep learning method. Sci Rep. 2018;8:1.

    Google Scholar 

  27. 27.

    Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep. 2018;8:1.

    CAS  Google Scholar 

  28. 28.

    Yousefi S, Amrollahi F, Amgad M, Dong C, Lewis JE, Song C, et al. Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. Sci Rep. 2017;7:1.

    CAS  Google Scholar 

  29. 29.

    Lao JW, Chen YS, Li ZC, Li QH, Zhang J, Liu J, et al. A deep learning-based Radiomics model for prediction of survival in glioblastoma Multiforme. Sci Rep. 2017;7:1.

    Google Scholar 

  30. 30.

    Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019;20(7):389–403.

    CAS  PubMed  Google Scholar 

  31. 31.

    Yu H, Samuels DC, Zhao Y-y, Guo Y. Architectures and accuracy of artificial neural network for disease classification from omics data. BMC Genomics. 2019;20(1):167.

    PubMed  PubMed Central  Google Scholar 

  32. 32.

    Tomczak K, Czerwinska P, Wiznerowicz M. The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn). 2015;19(1A):A68–77.

    Google Scholar 

  33. 33.

    Xu Z, Hu L, Shi B, Geng S, Xu L, Wang D, et al. Ribosome elongating footprints denoised by wavelet transform comprehensively characterize dynamic cellular translation events. Nucleic Acids Res. 2018;46(18):e109.

    PubMed  PubMed Central  Google Scholar 

  34. 34.

    Patil S, Naik G, Pai R, Gad R. Stacked autoencoder for classification of glioma grade III and grade IV. Biomed Signal Process Control. 2018;46:67–75.

    Google Scholar 

  35. 35.

    Sarhan AM. Wavelet-based feature extraction for DNA microarray classification. Artif Intell Rev. 2013;39(3):237–49.

    Google Scholar 

  36. 36.

    Wu Y, Zhang L, Liu L, Zhang Y, Yi D. Identification of differentially expressed genes using multi-resolution wavelet transformation analysis combined with SAM. Gene. 2012;509(2):302–8.

    CAS  PubMed  Google Scholar 

  37. 37.

    Nanni L, Brahnam S, Lumini A. Combining multiple approaches for gene microarray classification. Bioinformatics. 2012;28(8):1151–7.

    CAS  PubMed  Google Scholar 

  38. 38.

    Nanni L, Lumini A. Wavelet selection for disease classification by DNA microarray data. Expert Syst Appl. 2011;38(1):990–5.

    Google Scholar 

  39. 39.

    Kim B-R, McMurry T, Zhao W, Wu R, Berg A. Wavelet-based functional clustering for patterns of high-dimensional dynamic gene expression. J Comput Biol. 2010;17(8):1067–80.

    CAS  PubMed  PubMed Central  Google Scholar 

  40. 40.

    Zhang S-W, Huang D-S, Wang S-L. A method of tumor classification based on wavelet packet transforms and neighborhood rough set. Comput Biol Med. 2010;40(4):430–7.

    PubMed  Google Scholar 

  41. 41.

    Xie X-P, Ding X-H, Wang H-Q, Jiang Y-C. Continuous wavelet analysis of gene expression signals for cancer classification. J Biol Syst. 2009;17(03):377–96.

    CAS  Google Scholar 

  42. 42.

    Firmino N, Martinez VD, Rowbotham DA, Enfield KSS, Bennewith KL, Lam WL. HPV status is associated with altered PIWI-interacting RNA expression pattern in head and neck cancer. Oral Oncol. 2016;55:43–8.

    CAS  PubMed  PubMed Central  Google Scholar 

  43. 43.

    Kim DW, Lee S, Kwon S, Nam W, Cha I-H, Kim HJ. Deep learning-based survival prediction of oral cancer patients. Sci Rep. 2019;9(1):6994.

    PubMed  PubMed Central  Google Scholar 

  44. 44.

    Sondka Z, Bamford S, Cole CG, Ward SA, Dunham I, Forbes SA. The COSMIC cancer gene census: describing genetic dysfunction across all human cancers. Nat Rev Cancer. 2018;18(11):696–705.

    CAS  PubMed  PubMed Central  Google Scholar 

  45. 45.

    Yang L, Jing J, Sun L, Yue Y. Exploring prognostic genes in ovarian cancer stage-related coexpression network modules. Medicine. 2018;97(34):e11895.

    PubMed  PubMed Central  Google Scholar 

  46. 46.

    Yasui W, Oue N, Ito R, Kuraoka K, Nakayama H. Search for new biomarkers of gastric cancer through serial analysis of gene expression and its clinical implications. Cancer Sci. 2004;95(5):385–92.

    CAS  PubMed  Google Scholar 

  47. 47.

    Wang X-Q, Tang Z-X, Yu D, Cui S-J, Jiang Y-H, Zhang Q, et al. Epithelial but not stromal expression of collagen alpha-1(III) is a diagnostic and prognostic indicator of colorectal carcinoma. Oncotarget. 2016;7(8):8823–38.

    PubMed  PubMed Central  Google Scholar 

  48. 48.

    Xiong G, Deng L, Zhu J, Rychahou PG, Xu R. Prolyl-4-hydroxylase α subunit 2 promotes breast cancer progression and metastasis by regulating collagen deposition. BMC Cancer. 2014;14(1):1.

    PubMed  PubMed Central  Google Scholar 

  49. 49.

    Hao JM, Chen JZ, Sui HM, Si-Ma XQ, Li GQ, Liu C, et al. A five-gene signature as a potential predictor of metastasis and survival in colorectal cancer. J Pathol. 2010;220(4):475–89.

    CAS  PubMed  Google Scholar 

  50. 50.

    Dong G, Wang B, An Y, Li J, Wang X, Jia J, et al. SIRT1 suppresses the migration and invasion of gastric cancer by regulating ARHGAP5 expression. Cell Death Dis. 2018;9(10):977.

    PubMed  PubMed Central  Google Scholar 

  51. 51.

    Fang Y, Zhu X, Wang J, Li N, Li D, Sakib N, et al. MiR-744 functions as a proto-oncogene in nasopharyngeal carcinoma progression and metastasis via transcriptional control of ARHGAP5. Oncotarget. 2015;6(15):13164–75.

    PubMed  PubMed Central  Google Scholar 

  52. 52.

    Montavon C, Stricker GR, Schoetzau A, Heinzelmann-Schwarz V, Jacob F, Fedier A. Outcome in serous ovarian cancer is not associated with LATS expression. J Cancer Res Clin Oncol. 2019;145(11):2737–49.

    CAS  PubMed  PubMed Central  Google Scholar 

  53. 53.

    Xu B, Sun D, Wang Z, Weng H, Wu D, Zhang X, et al. Expression of LATS family proteins in ovarian tumors and its significance. Hum Pathol. 2015;46(6):858–67.

    CAS  Google Scholar 

  54. 54.

    Lv Y, Wang T, Fan J, Zhang Z, Zhang J, Xu C, et al. The effects and mechanisms of SLC34A2 on maintaining stem cell-like phenotypes in CD147+ breast cancer stem cells. Tumor Biol. 2017;39:4.

  55.

    Rangel LBA, Sherman-Baust CA, Wernyj RP, Schwartz DR, Cho KR, Morin PJ. Characterization of novel human ovarian cancer-specific transcripts (HOSTs) identified by serial analysis of gene expression. Oncogene. 2003;22(46):7225–32.

  56.

    Li X, Xing J, Wang H, Yu E. The SLC34A2-ROS-HIF-1-induced up-regulation of EZH2 expression promotes proliferation and chemo-resistance to apoptosis in colorectal cancer. Biosci Rep. 2019;39(5):BSR20180268.

  57.

    Jiang Z, Hao Y, Ding X, Zhang Z, Liu P, Wei X, et al. The effects and mechanisms of SLC34A2 on tumorigenicity in human non-small cell lung cancer stem cells. Tumor Biol. 2016;37(8):10383–92.

  58.

    Wu K, Liao X, Gong Y, He J, Zhou J-K, Tan S, et al. Circular RNA F-circSR derived from SLC34A2-ROS1 fusion gene promotes cell migration in non-small cell lung cancer. Mol Cancer. 2019;18(1):98.

  59.

    Cowin PA, George J, Fereday S, Loehrer E, Van Loo P, Cullinane C, et al. LRP1B deletion in high-grade serous ovarian cancers is associated with acquired chemotherapy resistance to liposomal doxorubicin. Cancer Res. 2012;72(16):4060.

  60.

    Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463(7283):899–905.

  61.

    Ishida T, Ueda R. Immunopathogenesis of lymphoma: focus on CCR4. Cancer Sci. 2011;102(1):44–50.

  62.

    Uno H, Cai T, Tian L, Wei LJ. Evaluating prediction rules for t-year survivors with censored regression models. J Am Stat Assoc. 2007;102(478):527–37.

  63.

    Harrell F, Califf R, Pryor D, Lee K, Rosati RA. Evaluating the yield of medical tests. JAMA. 1982;247:2543–6.

  64.

    Shao X-G, Leung AK-M, Chau F-T. Wavelet: a new trend in chemistry. Acc Chem Res. 2003;36(4):276–83.

  65.

    Meijering E. A chronology of interpolation: from ancient astronomy to modern signal and image processing. Proc IEEE. 2002;90(3):319–42.

  66.

    Strang G. Wavelet transforms versus Fourier transforms. Bull Am Math Soc. 1993;1:28.

  67.

    Kim SY, Kim TR, Jeong H-H, Sohn K-A. Integrative pathway-based survival prediction utilizing the interaction between gene expression and DNA methylation in breast cancer. BMC Med Genet. 2018;11(3):68.

  68.

    Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9:1.

  69.

    Orru G, Pettersson-Yeo W, Marquand AF, Sartori G, Mechelli A. Using support vector machine to identify imaging biomarkers of neurological and psychiatric disease: a critical review. Neurosci Biobehav Rev. 2012;36(4):1140–52.

  70.

    Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.

  71.

    Grossmann A, Morlet J. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM J Math Anal. 1984;15(4):723–36.

  72.

    Percival DB, Walden AT. Wavelet methods for time series analysis. Cambridge: Cambridge University Press; 2000.

  73.

    Frassineti L, Barba C, Melani F, Piras F, Guerrini R, Manfredi C. Automatic detection and sonification of nonmotor generalized onset epileptic seizures: preliminary results. Brain Res. 2019;1721:146341.

  74.

    Guo H, Burrus CS. Convolution using the undecimated discrete wavelet transform. IEEE Int Conf Acoustics Speech Signal Process Conf Proceed. 1996;3:1291–4.

  75.

    Lee GR, Gommers R, Wasilewski F, Wohlfahrt K, O’Leary A. PyWavelets: a python package for wavelet analysis. J Open Source Softw. 2019;4(36):1237.

  76.

    Yu SQ, Jia D, Xu CY. Convolutional neural networks for hyperspectral image classification. Neurocomputing. 2017;219:88–98.

  77.

    Shen W, Zhou M, Yang F, Yu DD, Dong D, Yang CY, et al. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recogn. 2017;61:663–73.

  78.

    Simon N, Friedman JH, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1.

  79.

    Sui Y, Ju C, Shao B. A lymph node metastasis-related protein-coding genes combining with long noncoding RNA signature for breast cancer survival prediction. J Cell Physiol. 2019;234(11):20036–45.

  80.

    Martinez VD, Enfield KSS, Rowbotham DA, Lam WL. An atlas of gastric PIWI-interacting RNA transcriptomes and their utility for identifying signatures of gastric cancer recurrence. Gastric Cancer. 2016;19(2):660–5.

  81.

    Krishnan P, Ghosh S, Graham K, Mackey JR, Kovalchuk O, Damaraju S. Piwi-interacting RNAs and PIWI genes as novel prognostic markers for breast cancer. Oncotarget. 2016;7(25):37944–56.

Acknowledgements

Not applicable.

Funding

This project was supported by a grant from the National Natural Science Foundation of China (No. 21575094), which also funded the publication cost of this article.

Author information

Contributions

ZW and CL designed the experiments. YZhao, YZhou, YH and YL performed data analysis. ZW and YZhao wrote the main manuscript text and prepared all the figures. YZhao, ML, XP, CL and ZW discussed the results and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Chuan Li or Zhining Wen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1. The mean AUCs and standard errors of AUCs on predicting the tumor stages and 3-year overall survivals.

Additional file 2. Results of the univariate Cox regression. The tables list the genes found by univariate Cox regression to be significantly associated with the 3-year overall survival in all the data sets.

Additional file 3. The performance of SWT-CNN, SVM, random forest and logistic regression on predicting the tumor stages and the 3-year overall survivals of all cancer types.

Additional file 4. The performance of CNN algorithm with and without SWT on predicting 3-year overall survival of all the cancer types.

Additional file 5. The performance of the models on predicting 3-year overall survivals of all cancer types.

Additional file 6. Kendall's tau values of the gene lists generated from the 5 bootstraps for the 10 TCGA datasets.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhao, Y., Zhou, Y., Liu, Y. et al. Uncovering the prognostic gene signatures for the improvement of risk stratification in cancers by using deep learning algorithm coupled with wavelet transform. BMC Bioinformatics 21, 195 (2020). https://doi.org/10.1186/s12859-020-03544-z

Keywords

  • Convolutional neural network
  • Stationary wavelet transform
  • Cancer prognosis prediction
  • RNA-sequencing
  • Cox regression