PathExpSurv: pathway expansion for explainable survival analysis and disease gene discovery

Background In the field of biology and medicine, the interpretability and accuracy are both important when designing predictive models. The interpretability of many machine learning models such as neural networks is still a challenge. Recently, many researchers utilized prior information such as biological pathways to develop neural networks-based methods, so as to provide some insights and interpretability for the models. However, the prior biological knowledge may be incomplete and there still exists some unknown information to be explored. Results We proposed a novel method, named PathExpSurv, to gain an insight into the black-box model of neural network for cancer survival analysis. We demonstrated that PathExpSurv could not only incorporate the known prior information into the model, but also explore the unknown possible expansion to the existing pathways. We performed downstream analyses based on the expanded pathways and successfully identified some key genes associated with the diseases and original pathways. Conclusions Our proposed PathExpSurv is a novel, effective and interpretable method for survival analysis. It has great utility and value in medical diagnosis and offers a promising framework for biological research. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05535-2.

1 Foundation of Survival Time Analysis

Notations of Common Terms in Survival Analysis
Survival analysis is used to handle the survival data which is right-censored (i.e.some patients may leave the study or the study ends before an event occurs).
The compositions of survival data are: a patient's observed covariates x, a survival time T , and an event indicator δ ∈ {0, 1}.
The survival function is denoted as which signifies the probability that the failure time is latter than t.
The hazard function is defined as: It is the probability an individual will not survive for an additional time ∆t, given they have already survived up to time t.

Cox Proportional Hazards Model
The Cox proportional hazards model is a common method for modeling an individual's survival given the feature x.The model assumes that the hazard function is composed of two non-negative functions: a baseline hazard function, h 0 (t), and a risk score, r(x) = exp(x T β), defined as the effect of an individual's obeserved covariates on the baseline hazard.Then the hazard function is assumed to have the form: The likelihood of the event to be observed occurring for subject i at time T i can be written as: The Cox log partial likelihood function is defined as: The commonly used metric in survival analysis is concordance index (C-index): 2 Details of Data

Cancer Datasets
For data acquisition, we obtained 3 different survival datasets from UCSC Xena: (1) Breast Cancer Dataset (BRCA), (2) Lower Grade Glioma Dataset (LGG), (3) Thyroid Cancer Dataset (THCA).Table S1 shows the number of samples and selected genes of 3 different datasets.For each cancer, we only used gene expression data as the feature and the total number of genes was 60489.Then we did some preprocessing for the gene expression data.First, we transformed the read counts through log 2 (x + 1).Second, we selected the top variable genes of which the standard deviations among the patients were larger than 1.In this way there were only 2005 (BRCA), 1061 (THCA) and 1126 (LGG) genes left.Third, we normalized the data into a standard normal distribution in order to overcome some problems like gradient vanishing in the deep learning models.

KEGG Signaling Pathways
We took the prior pathways as the functional modules and the source of the prior signaling pathways was KEGG DISEASE Database.The details of the composition of the pathways are showed in Table S2 .
Table S2: Compositions of the prior pathways.3 Experiment Results

Two-phase Training
Table S3 displayed the exact results of two phases, and Figure S1 showed the training curves.We found that the optimal C-indices of training phase were mostly better than those of pre-training phase, which means that the training of pre-trained network learned more useful information beyond the prior pathways modules.

Retraining
Table S4 and Figure S2 compared the results of model with original pathways mask and expanded pathways mask.The results showed that the expanded pathways achieved better performance than original pathways.We found that the expanded pathways outperformed significantly original modules at epoch 5, which indicated that the newly learned metapathways could help predictor to learn more quickly.We also noticed that the C-index on the training set of expanded pathways was worse than that of original pathways, but the result was converse on testing set, which might demonstrate that the meta-pathways overcome the over-fitting problem in some way.

Gene Ontology (GO) Term Enrichment Analysis
We performed Gene Ontology (GO) term enrichment analysis for the supplement genes of every pathway, so as to discover some relationships between original pathways and expanded pathways.The results are shown in Figure S3, and the details for the supplement genes of ERK signaling pathway (BRCA), NOTCH signaling pathway (BRCA) and WNT signaling pathway (THCA) are shown in Tables S5, S6 and S7, respectively.

Single-Gene Survival Analysis
We performed the single-gene survival analysis to validate the significance of the supplement genes in each disease.For one specific gene, we divided the dataset into two groups: high expression group contained the top 50% gene expression level and low expression group contained the others.Then we ploted the Kaplan-Meier curves of the two groups , and identified the most significantly different genes (p < 0.05).We displayed the Kaplan-Meier curves in Figure S4.For BRCA, we identified key genes: LINC01235, TTC36, H2BC4, THBS1, AGPAT2, MMP12.For THCA, we got STC1, ND4L, APOD.For LGG, we obtained H1-2, LYVE1, MFAP4, PCDHGB6.These genes were differentially expressed between two groups and might contribute to the performance improvement.

Figure S1 :
Figure S1: Training curves of pre-training and training phase.The model achieves better performance in training phase (2nd 100 epoches) than that in pre-training phase (1st 100 epoches).
Figure S3: Go Enrichment Analysis Figure S4: Kaplan Meier curves

Table S3 :
Results of pre-training & training phase.

Table S4 :
Comparison of original pathways and expanded pathways.
Figure S2: Training curves of retraining.The expanded-pathways model can achieve better performance quickly and finally outperform original-pathways model.