Investigate the relevance of major signaling pathways in cancer survival using a biologically meaningful deep learning model

Survival analysis and prediction are important in cancer studies. In addition to the Cox proportional hazards model, recently deep learning models have been proposed to integrate the multi-omics data for survival prediction. Cancer signaling pathways are important and interpretable concepts that define the signaling cascades regulating cancer development and drug resistance. Thus, it is interesting and important to investigate the relevance to patients’ survival of individual signaling pathways. In this exploratory study, we propose to investigate the relevance and difference of a small set of core cancer signaling pathways in the survival analysis of cancer patients. Specifically, we built a biologically meaningful and simplified deep neural network, DeepSigSurvNet, for survival prediction. In the model, the gene expression and copy number data of 1648 genes from 46 major signaling pathways are used. We applied the model on 4 types of cancer and investigated the relevance and difference of the 46 signaling pathways among the 4 types of cancer. Interestingly, the interpretable analysis identified the distinct patterns of these signaling pathways, which are helpful to understand the relevance of the signaling pathways in terms of their association with cancer survival time. These highly relevant signaling pathways can be novel targets, combined with other essential signaling pathways inhibitors, for drug and drug combination prediction to improve cancer patients’ survival time.


Introduction
Survival analysis is important in cancer prognosis based on the clinical factors, e.g., age, gender, race, stage. Moreover, it is important to identify and understand essential biomarkers with the availability of large-scale genomics data, e.g., gene expression and copy number variation, in addition to these clinical factors. The cox proportional hazards model (Cox PH) model 1 is the classic model for the survival analysis. Recently, the deep learning models have been widely used in image analysis 2,3 , medical informatics data analysis 4 , nature language process (NLP) 5 , and showed significantly improved performance than traditional machine learning models. Deep learning models have been developed for survival analysis.
Compared with the Cox PH model, the deep learning models showed improved prediction accuracy by integrating a large number of genomics features. For example, for the liver cancer subtyping and survival analysis 6 , the auto-encoder model was first employed to reduce the dimension of feature space, considering the large-number of genomics features, e.g., gene expression, miRNA, Methylation. The important features (non-linear combinations of raw genomics features) were identified using the Cox PH model 1 for the clustering analysis to identify sub-groups with distinct survival outcomes. Then the analysis of variance (ANOVA), based on the clustering results, was applied on the raw genomics features was further used to identify the important genes. However, the auto-encoder model itself was not analyzed to identify the important raw genomics features in a non-linear perspective. In the Cox-nnet model, the RNA-seq data of TCGA samples was used as the input a deep neural network to predict the survival time. To identify the potential associated signaling pathways of hidden nodes, the Pearson's correlation values between the expression of individual genes and the output of the given hidden nodes were calculated to identify the most linearly correlated genes. Then gene set enrichment analysis (GSEA) 7 was employed to link the hidden nodes with the enriched signaling pathways. Moreover, the Survival Convolutional Neural Networks (SCNN) 8 was developed to predict the survival using histologic images of cancer patients. Then heat map visualization of SCNN model output of regions of interest (image patches) was overlaid on the image to indicate the important regions in the images correlated with survival outcome.
In cancer studies, many dysfunctional signaling pathways are identified 9 , which play important roles in tumor development and drug response. In this study, we aim to investigate the relevance of these signaling pathways in the context of survival outcome prediction using a biologically meaningful and simplified deep learning model, DeepSigSurvNet. Specifically, only signaling pathways (46 pathways) were collected from KEGG 10 signaling database. The gene expression and copy number data of 1648 genes from the 46 major signaling pathways are from 4 types of cancer: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), glioblastoma multiforme (GBM), and skin cutaneous melanoma (SKCM). The model was evaluated using the c-index.
Interestingly, the interpretable analysis using the Layer-wise relevance propagation (LRP) 11 approach identified distinct probability density distribution patterns of these signaling pathways, which are helpful to understand the relevance of the signaling pathways in terms of the association with cancer patients' survival. The important signaling pathways can be novel targets for drug and drug combination prediction to improve cancer patients' survival time. In the following sections, the materials and methods, results and discussions are presented.

RNA-seq and Copy number data of 4 types of cancer
From the UCSC Xena data server, the mean-normalized log2 scaled RSEM 12 values (per gene) across all TCGA cohorts (HiSeqV2_PANCAN dataset) and integer copy number data (per gene) from GISTIC2 analysis were downloaded for 4 types of cancer: breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD), glioblastoma multiforme (GBM), and skin cutaneous melanoma (SKCM). The phenotype (clinical) data, including survival time, age, gender, stage, of the cancer samples are also available from the Xena data server. Table I shows the number of cancer samples, dataset and URLs to download these datasets. For the prediction purpose, cancer patients with survival time greater than 3000 days are not included.  10 is a database for the systematic understanding of gene functions. The KEGG signaling pathways provide the knowledge of signaling transduction and cellular processes. There are 303 pathways in KEGG database, and 45 of them are annotated as "signaling pathways". Many of the signaling pathways are important oncogenic signaling pathways 9 , e.g., EGFR, WNT, Hippo, Notch, PI3K-Akt, RAS, TGFβ, p53. The 'cell cycle' cellular process is also included. For simplification, the 'cell cycle' is also viewed as one 'signaling' pathway. In total, 46 signaling pathways (45 signaling pathways + cell cycle) are selected (see Table II). Among these 46 signaling pathways, there are 1648 genes with both gene expression and copy number variation data. In summary, there are gene expression (TPM) and copy number variation data of 1648 genes in 46 signaling pathways of 45 cancer cell lines, which was used as the input of the deep learning model.

Model performance evaluation
To evaluate the performance of the proposed model, the concordance index (c-index) metric was used. The c-index is defined as follows. Let " and # $ be the true and predicted survival time. The concordance is defined as ( # $ > ( $ | " > * ), where i and j are two randomly selected samples. The c-index indicates the probability that prediction and the real survival time are relatively consistent or concordant, i.e., # $ > ( $ , " > * , or # $ < ( $ , " < * . Let C, D, T represent the number of concordant, discordant, and equal survival time, then the c-index is defined as: .
We compared the proposed model with random forest model, which is available as in the RandomForestRegression from the scikit-learn package. We train the random forest model using the same training and test dataset setting for the four types of cancer. The "n_estimator" and "max_depth" parameters are used to find the best performance of the random forest models. For the DeepSigSurvNet model, we use same architecture for all four types of cancer, with different dropout rate, regularization value and epoch number for each cancer type. Table III and Table IV show the comparison results. As can be seen, the random forest model has the similar c-index values in the training datasets. However, it has much lower c-index values on the test datasets, compared with the proposed DeepSignSurvNet model, which indicates that the proposed deep learning model is robust.

Relevance of individual signaling pathways in 4 types of cancer
As discussed, it is interested to investigate and understand how the individual signaling pathways contribute to the cancer patients' survival prediction. After training the deep learning models, we employed the 'iNNvestigate' package to calculate the relevance scores of the individual signaling pathways on individual cancer patients in each of 4 types of cancer. Fig. 2 and Fig. 3 show the probability density distributions of 46 signaling pathways in the 4 types of cancer.
Specifically, for BRCA, the mTOR, Hedgehog, PI3K-Akt, TGF-beta, AMPK, VEGF, Apelin, Adipocytokine and Oxytocin signaling pathways have the strongest relevance scores. P53, Wnt, Notch, NF-Kaapa B, FoxO, cGMP-PKG, cAMP, Chemokine, Sphingolipid, Relaxin, Thyroid hormone signaling pathways have relative high relevance scores. Surprisingly, the MAPK, ErbB, Ras, Rap1, JAK-STAT signaling pathways as well as cell cycle are not well associated with patients' survival outcome. It is well known that these signaling pathways play important roles in cancer development. However, they can be separable in BRCA cancer samples to be the essential signaling pathways for patients' survival outcome prediction. For LUDA, the patterns of density distributions are different from BRCA. More signaling pathways show high but not very strong relevance scores. For example, the MAPK, Ras, Rap1, cGMP-PKG, HIF-1, mTOR, PI3K-Akt, Wnt, Notch Hedgehog, C-type lectin receptor, GnRH, Neurotrophin, and Thyroid hormone signaling pathways have relatively high and consistent relevance scores. On the other hand, the AMPK, Hippo and NOD-like signaling pathways have the zero-mean value but with great variance. Then it is hard to evaluate their relevance important in cancer patients' survival prediction analysis. For GBM, the Ras, p53, mTOR, PI3K-Akt, Notch, Hippo, TNF, Estrogen, Thyroid hormone and Relaxin signaling pathways have relatively high relevance scores; and the other signaling pathways are not correlated to the patients' survival.  In summary, the probability density distribution patterns of all the 46 signaling pathways vary significantly among the 4 types of cancer. For example, the p53 and mTOR signaling pathways are strongly relevant to patients' survival outcomes in BRCA, GBM, SKCM, but not in the LUDA cancer patients. The MAPK, RAS, Rap1, ErBB signaling pathways are the known important signaling pathways in cancer. However, they are not strongly correlated with cancer patients' survival outcome in the prediction models. It might be because all of these important signaling pathways are always activated in cancer patients. Thus, they are important targets for cancer therapy, but not informative in terms of the survival time prediction. Also, the cell cycle signaling does not play an important role in the survival time prediction. Moreover, a small set of signaling pathways, e.g., T cell receptor, B cell receptor, Fc epsilon RI, TNF signaling pathways do not show important contributions to the survival of cancer patients across all 4 types of cancer. Also, for each type of cancer, the less than half of the signaling pathways have strong effects to the survival prediction. Thus, drugs and drug combinations that can inhibit these essential signaling pathways, and that can inhibit the signaling pathways with strong relevance scores for each type of cancer might be effective to improve cancer patients' survival time and outcome.

Discussion and conclusion
Survival prediction is important in cancer studies. Deep learning models also have been proposed for the survival prediction, and outperformed the classic Cox PH model. However, it is challenging to understand the contributions of individual genes considering the non-linear combinations of a large number of genomics features, e.g., gene expression, copy number variation. Signaling pathways are important in cancer research to understand the signaling cascades regulating cancer development and drug response. Instead of using a large number of genomics features, in this study, we proposed a relatively biologically meaningful and simplified deep learning model, DeepSigSurvNet, for the survival prediction. In the model, the gene expression and copy number data of 1648 genes from 46 major signaling pathways are used. The analysis of deep learning model on 4 types of cancer can identify the distinct patterns of these signaling pathways, which are helpful to understand the relevance of the signaling pathways in the context of survival analysis, and can be novel targets for drug and drug combination prediction to improve cancer patients' survival outcome. In conclusion, the proposed deep learning model can correlate the signaling pathways (not individual genes level) with cancer patients' survival time and outcome.
There are some limitations of the proposed model that need to be further addressed. In addition to the 46 signaling pathways, other KEGG pathways, like metabolism pathways, will be further evaluated. Moreover, Gene oncology 17 (GO) terms provide alternative biologically meaningful biologically processes (BP) (gene sets). Also, other omics data, like protein, methylation, genetic mutation can be integrated conveniently to the model in addition to the copy number, gene expression data. As aforementioned, the important genes within the important signaling pathways can be used potential gene signatures to discover drugs using the connectivity map (CMAP) 15,16 . We will investigate these possible directions in the future work.