
An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome

Abstract

Background

Predicting outcome of breast cancer is important for selecting appropriate treatments and prolonging the survival periods of patients. Recently, different deep learning-based methods have been carefully designed for cancer outcome prediction. However, the application of these methods is still challenged by interpretability. In this study, we proposed a novel multitask deep neural network called UISNet to predict the outcome of breast cancer. The UISNet is able to interpret the importance of features for the prediction model via an uncertainty-based integrated gradients algorithm. UISNet improved the prediction by introducing prior biological pathway knowledge and utilizing patient heterogeneity information.

Results

The model was tested in seven public datasets of breast cancer, and showed better performance (average C-index = 0.691) than the state-of-the-art methods (average C-index = 0.650, ranged from 0.619 to 0.677). Importantly, the UISNet identified 20 genes as associated with breast cancer, among which 11 have been proven to be associated with breast cancer by previous studies, and others are novel findings of this study.

Conclusions

Our proposed method is accurate and robust in predicting breast cancer outcomes, and it is an effective way to identify breast cancer-associated genes. The method codes are available at: https://github.com/chh171/UISNet.


Background

According to the global cancer statistics in 2020, breast cancer is the most common malignant tumor, accounting for two million (11.7%) patients worldwide [1]. The outcomes of these breast cancer patients were observed to be significantly different under the same treatment, reflecting the heterogeneity of breast cancer. Accurate breast cancer outcome prediction is important for designing effective follow-up treatments, and improving the survival periods and life quality of patients [2].

With the advancement of molecular sequencing techniques, an increasing amount of high-dimensional genomic data has been used to evaluate cancer outcomes to support clinical decision-making [3, 4]. The most widely used method for evaluating patient risks is the Cox proportional hazard model [5]. This method analyses the influences of different factors on cancers by calculating the survival ratios of patients without knowing patients’ survival distributions. In addition, Wang et al. designed random survival forests (RSF) to predict cancer outcomes by utilizing the bootstrapping strategy [6]. However, these methods have limited performance on high-dimensional gene expression data [7]. To solve this problem, many feature dimensionality reduction techniques have been combined with these algorithms. Lin used features extracted by principal component analysis (PCA) in the Cox method to predict disease prognosis [8]. Considering that PCA performs poorly in a high-dimensional nonlinear space, Cai selected kernel-PCA instead of PCA to generate compressed features for patient risk prediction [9]. Another way to address the computational challenge caused by high-dimensional features is to add a regularization component to the Cox model. Boulesteix combined adaptive Lasso regularization and the Cox regression method (IPF-lasso) to estimate cancer prognosis [10]. The method adds an L1-norm penalty to the likelihood-based objective to shrink the coefficients of the features. In addition, the survival support vector machine (SSVM) proposed by Evers yielded improved cancer outcome prediction performance by using a sparse kernel function. However, choosing a suitable kernel function and hyperparameters is a complex process. Recently, Liu et al. designed an integrated learning method called EXSA based on the XGBoost framework to predict cancer outcomes. Their results show that it outperformed other traditional machine learning methods [11].

In recent years, deep learning (DL) technologies have demonstrated their efficacy in handling high-dimensional nonlinear features [12]. The residual neural network was used in Li’s work for breast cancer prognostic index classification [13]. Deep_surv, proposed by Katzman, was designed to estimate cancer outcomes by combining a deep neural network (DNN) and the proportional hazard loss function [14]. Chaudhary used an autoencoder to reconstruct high-dimensional expression features, and the generated features were used for liver cancer survival analysis [15]. Based on this method, Yang et al. proposed DCAP, which uses a denoising autoencoder to build a model that is robust to data noise [16]. Nonetheless, the separation of the feature extraction and risk prediction processes made this method inconvenient to use. To solve this problem, an end-to-end framework called TCAP was designed to combine the risk prediction loss and data recovery loss [17]. Bashier et al. proposed a multi-omics data integration approach that combines gene similarity networks with convolutional neural networks to accurately predict the stage of breast tumors [18]. On the basis of these studies, to speed up the convergence of the DNN model and reduce the risk of overfitting during model training, Qiu introduced a meta-learning-based network for cancer outcome prediction [19].

Although DL-based methods have achieved better results in cancer outcome prediction, the application of these methods is still limited by the lack of model interpretability. Interpreting the factors associated with cancer outcomes is important for medical decision-making and targeted drug development. The widely used method for solving this problem is differential expression analysis (DEA). Nevertheless, when the average expression of the given features is low, the log-fold change values computed in DEA are disproportionately affected by noise [20]. Hence, Hao proposed an interpretable DNN framework for cancer survival analysis that calculates the gradients in the neural network [21]. However, this approach may lead to gradient saturation, making it difficult for the neural network to identify important features [22]. Zhao et al. proposed DeepOmix for cancer prognosis prediction [23]. According to the predicted risks, DeepOmix performed the Kolmogorov-Smirnov test to identify prognosis-related genes. However, it is difficult to verify that the genes identified by these methods are truly related to cancer outcomes. Studies have indicated that removing these genes from the model does not reduce the prediction accuracy dramatically [24]. Thus, it is necessary to develop an interpretable model that accurately predicts cancer outcomes and identifies cancer-related key genes to reveal novel cancer mechanisms.

To address these problems, we propose an Uncertainty-based Interpretable deep Semi-supervised Network (UISNet) for breast cancer outcome prediction. The main contributions of our research are given as follows:

  1. An uncertainty-based algorithm that combines Monte Carlo dropout and integrated gradients is designed to improve the reliability of the interpretation results.

  2. By introducing prior biological pathway information as a sparse layer, UISNet handles high-dimensional gene expression data effectively.

  3. UISNet considers the heterogeneity of patients to extract useful information for cancer outcome prediction. This information is integrated into a unified framework after dimension reduction. All these tasks are simultaneously optimized through the shared representations in the neural network.

  4. UISNet was evaluated on seven breast cancer datasets from the TCGA and GEO databases, and the prediction results were analyzed comprehensively. The results indicated that UISNet is accurate and robust when predicting breast cancer outcomes and is able to identify prognosis-related genes efficiently.

The details of UISNet are introduced in Section "Methods". Section "Results" shows the experimental results and biological analysis. Finally, we present the conclusion and discussion in Section "Discussion".

Methods

Datasets

In this study, seven breast cancer datasets collected from GEO (https://www.ncbi.nlm.nih.gov) and TCGA (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) were used for method evaluations. Considering the requirement for cancer outcome prediction data with consistent gene expression profiles and comprehensive information on patient survival, we identified six available datasets from the GEO database, encompassing a total of 1323 breast cancer patients. The features of 4767 genes in these datasets were normalized by log transformation, and the batch effect was removed by using the “limma” package [25]. The statistical information of each used dataset is given in Table 1.

Table 1 The statistical information of the utilized cancer data in our study

The architecture of the proposed deep learning framework

As shown in Fig. 1, high-dimensional gene expression data are given in the input layer, and prior biological pathway information is introduced in the sparse layer. The uncertainty-based interpretable deep semi-supervised network (UISNet) can learn meaningful information by incorporating the prior biological knowledge in the sparse layer (e.g., KEGG and Reactome gene connection pathways). The knowledge regarding the learned relationships between genes and functional pathways is used to form sparse connections between the input layer and the sparse layer instead of full connections.

Fig. 1

The architecture of UISNet for breast cancer outcome evaluation. Prior biological pathway information is introduced in the sparse layer, and UISNet predicts breast cancer outcomes by integrating patient heterogeneity clustering, dimensionality reduction, and cancer outcome prediction tasks into a unified framework

Here, we constructed a binary biadjacency matrix to represent the sparse connections between the genes and functional pathways by using the strategy presented by Hao [21]. Supposing that p gene features and prior information, including q pathways from the KEGG and Reactome databases, are given in the neural network, the binary biadjacency matrix can be expressed as \(A\in {\mathbb{B}}^{q\times p}\), and the element \({a}_{ij}\) in \(A\) is given as:

$$a_{ij} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {\text{if gene } j \text{ belongs to pathway } i} \hfill \\ {0,} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right.\quad 1 \le i \le q,\;1 \le j \le p$$
(1)

The node values h in the UISNet framework are computed as follows:

$$h_{l + 1} = \left\{ {\begin{array}{*{20}l} {Relu\left( {\left( {W*A} \right)h_{gene} + b} \right),} \hfill & {sparse\;layer } \hfill \\ {Relu\left( {W*h_{l} + b} \right),} \hfill & {hidden\;layer} \hfill \\ \end{array} } \right.$$
(2)

where Relu is the rectified linear unit (ReLU) nonlinear activation function, \({h}_{gene}\) represents the gene expression values, \({h}_{l}\) is the output in layer l, W is the weight matrix, and b is the bias.
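The sparse layer in Eq. (2) can be illustrated with a minimal NumPy sketch. It assumes that \(W*A\) denotes the element-wise product of the weights with the binary mask \(A\), and that \(a_{ij}=1\) marks gene j as a member of pathway i (the convention of Hao [21]). The function name `sparse_layer` and the toy values are hypothetical and are not taken from the UISNet code.

```python
import numpy as np

def sparse_layer(h_gene, W, A, b):
    """ReLU((W * A) h_gene + b); '*' is assumed to be the element-wise mask product."""
    return np.maximum((W * A) @ h_gene + b, 0.0)

# Hypothetical toy setting: p = 4 genes, q = 2 pathways.
# A[i, j] = 1 means gene j is annotated to pathway i (assumed convention).
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
W = np.ones((2, 4))   # dense weights; the mask zeroes gene-pathway pairs absent from A
b = np.zeros(2)
h_gene = np.array([1.0, 2.0, 3.0, -10.0])

h_pathway = sparse_layer(h_gene, W, A, b)
print(h_pathway)
```

Because masked weights are multiplied by zero, they contribute nothing to the pathway nodes, so each pathway node only sees the genes annotated to it.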

The feature dimensionality reduction task is trained with the loss in Eq. (3), where \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{p}\right)\) represents the gene expressions of the breast cancer patients, and Z is the low-dimensional representation of X in the last hidden layer. This task is used to obtain high-quality compressed features in the hidden layer. Similar to the encoder-decoder structure, supposing that E is the encoder function and D is the decoder function, the compressed Z is written as Z = E(X), and the recovered X′ can be expressed as X′ = D(Z). The loss induced by the dimensionality reduction task is written as:

$${L}_{D}={\sum }_{i=1}^{p}{\left({x}_{i}-{x}_{i}^{\prime}\right)}^{2}$$
(3)

The subtype clustering task is designed to extract information on breast cancer patient heterogeneity. In the last hidden layer, a feature matrix is formed by integrating the produced Z and the cluster labels L. The clustering task loss in UISNet is defined as the KL divergence between the two distributions S and T [32]:

$${L}_{C}=KL(S||T)=\sum_{i}\sum_{j}{s}_{ij}log\frac{{s}_{ij}}{{t}_{ij}}$$
(4)

where \({t}_{ij}\) describes the similarity between the cluster center \({\mu }_{j}\) and an embedded point \({z}_{i}\) by Student’s t-distribution:

$${t}_{ij}=\frac{{\left(1+{\Vert {z}_{i}-{\mu }_{j}\Vert }^{2}\right)}^{-1}}{{\sum }_{j^{\prime}}{\left(1+{\Vert {z}_{i}-{\mu }_{j^{\prime}}\Vert }^{2}\right)}^{-1}}$$
(5)

\({s}_{ij}\) is the target distribution:

$${s}_{ij}=\frac{{t}_{ij}^{2}/{\sum }_{i}{t}_{ij}}{{\sum }_{j^{\prime}}\left({t}_{ij^{\prime}}^{2}/{\sum }_{i}{t}_{ij^{\prime}}\right)}$$
(6)

The initial cluster labels L of the patients are calculated by k-means. The number of clusters k is the value in [2,3,4] with the largest silhouette score. To ensure the accuracy of the clustering task in UISNet, the computed labels are updated in each epoch while the program runs.
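Equations (4) to (6) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the two-dimensional toy embeddings `Z`, and the cluster centers `mu` are hypothetical, and the normalizing sums are taken over all clusters for each patient.

```python
import numpy as np

def soft_assign(Z, mu):
    """Eq. (5): Student's t similarity between each embedding z_i and each center mu_j."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # squared distances, shape (n, k)
    t = 1.0 / (1.0 + d2)
    return t / t.sum(axis=1, keepdims=True)                   # normalize over clusters

def target_distribution(t):
    """Eq. (6): square t, divide by the soft cluster size, renormalize over clusters."""
    w = t ** 2 / t.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(s, t):
    """Eq. (4): KL(S || T)."""
    return float(np.sum(s * np.log(s / t)))

# Hypothetical embeddings of four patients and two cluster centers.
Z = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])

t = soft_assign(Z, mu)
s = target_distribution(t)
loss = clustering_loss(s, t)
```

The target distribution is sharper than the soft assignment, so minimizing the KL divergence pulls each embedding toward its most likely cluster center.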

The risk prediction task evaluates breast cancer prognoses based on the hazard function in Eq. (7). Here, \(S\left(t\right)=Pr(T\ge t)\) is the probability that a patient survives to at least time \(t\). The time interval \(T\) is the time elapsed between data collection and the patient’s last contact (the end of the experiment/patient death). The hazard (risk) function at t can be given as follows:

$$\lambda \left(t\right)=\underset{\delta \to 0}{{\text{lim}}}\frac{{\text{Pr}}(t\le T<t+\delta |T\ge t)}{\delta }$$
(7)

The loss function of the outcome prediction task can be expressed as Eq. (8) [14]:

$${L}_{P}=-\sum_{i}\left({h}_{\theta }\left({x}_{i}\right)-log\sum_{j\in \mathfrak{R}\left({T}_{i}\right)}exp\left({h}_{\theta }\left({x}_{j}\right)\right)\right)$$
(8)

where \({h}_{\theta }\left({x}_{i}\right)\) is the risk score that UISNet predicts for patient i with the network weights \(\theta\), and \(\mathfrak{R}\left({T}_{i}\right)\) represents the risk set of the breast cancer patients that are still alive at time point \({T}_{i}\).
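A minimal NumPy sketch of the loss in Eq. (8) is given below. It assumes, as in the standard Cox partial likelihood used by Deep_surv [14], that the outer sum runs only over patients with an observed event; the event indicator does not appear explicitly in Eq. (8), so this restriction is an assumption. The function name and the toy data are hypothetical.

```python
import numpy as np

def cox_partial_likelihood_loss(risk, time, event):
    """Negative log partial likelihood; only observed events (event == 1) contribute (assumption)."""
    loss = 0.0
    for i in range(len(time)):
        if event[i] == 1:
            at_risk = time >= time[i]   # risk set R(T_i): patients still under observation at T_i
            loss -= risk[i] - np.log(np.sum(np.exp(risk[at_risk])))
    return loss

# Hypothetical toy cohort: survival times, event indicators (1 = death, 0 = censored).
time = np.array([5.0, 3.0, 8.0])
event = np.array([1, 1, 0])

loss_flat = cox_partial_likelihood_loss(np.zeros(3), time, event)               # uninformative scores
loss_good = cox_partial_likelihood_loss(np.array([1.0, 2.0, 0.0]), time, event)   # higher risk, earlier death
loss_bad = cox_partial_likelihood_loss(np.array([-1.0, -2.0, 0.0]), time, event)  # reversed ordering
```

Risk scores that rank early deaths higher give a smaller loss than scores with the reversed ordering, which is the behavior the prediction task optimizes for.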

By integrating the feature dimensionality reduction, patient heterogeneity clustering, and cancer outcome prediction tasks into a unified framework, the total loss of UISNet can be given as follows:

$${l}_{UISNet}=\gamma {L}_{D}+\beta {L}_{C}+{L}_{P}$$
(9)

\(\gamma\) and \(\beta\) balance the importance of these tasks and are hyperparameters chosen by cross-validation. In this study, the value of \(\gamma\) was set to 1, and \(\beta\) was set to 10.
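As a small worked example of Eq. (9), the sketch below combines three task losses with the weights reported above (\(\gamma = 1\), \(\beta = 10\)). The individual loss values are hypothetical and are used only to show the arithmetic.

```python
def total_loss(loss_d, loss_c, loss_p, gamma=1.0, beta=10.0):
    """Eq. (9): weighted sum of the reconstruction, clustering, and prediction losses."""
    return gamma * loss_d + beta * loss_c + loss_p

# Hypothetical per-task losses for one training step.
example = total_loss(loss_d=0.40, loss_c=0.02, loss_p=1.50)
print(example)
```

With these weights, a clustering loss of 0.02 contributes 0.2 to the total, half as much as the reconstruction loss of 0.40.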

The uncertainty-based integrated gradients algorithm

In [21], the gradients of the output y with respect to the input x \((W=\partial y/\partial x)\) were used to quantify the importance of each gene to cancer prognosis. However, computing the gradients in a deep neural network may lead to gradient saturation. To interpret the feature importance of the prediction model, UISNet computes the gradients at inputs interpolated between a baseline \({x}^{\prime}\) (all zeros) and the input x, and approximates the integral of these gradients by Gauss-Legendre quadrature:

$$IG\left({x}_{i}\right)\colon\colon =\left({x}_{i}-{x}_{i}^{\prime}\right)\times {\int }_{\alpha =0}^{1}\frac{\partial F\left({x}^{\prime}+\alpha \times (x-{x}^{\prime})\right)}{\partial {x}_{i}}d\alpha$$
(10)
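The quadrature in Eq. (10) can be sketched with NumPy. The sketch assumes a toy differentiable model \(F(x)=\tanh(w\cdot x)\) with an analytic gradient in place of the trained network; `integrated_gradients`, `w_toy`, and the number of quadrature nodes are hypothetical choices for illustration.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, n_nodes=16):
    """Approximate Eq. (10) with an n-point Gauss-Legendre rule on [0, 1]."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)  # rule defined on [-1, 1]
    alphas = 0.5 * (nodes + 1.0)   # map nodes to [0, 1]
    weights = 0.5 * weights        # rescale weights for the interval of length 1
    avg_grad = np.zeros_like(x, dtype=float)
    for a, w in zip(alphas, weights):
        avg_grad += w * grad_f(baseline + a * (x - baseline))
    return (x - baseline) * avg_grad

# Hypothetical toy model with a saturating output, F(x) = tanh(w . x).
w_toy = np.array([0.5, -1.0, 2.0])
F = lambda x: np.tanh(w_toy @ x)
grad_F = lambda x: (1.0 - np.tanh(w_toy @ x) ** 2) * w_toy

x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros_like(x)       # zero baseline, as in the text
ig = integrated_gradients(grad_F, x, baseline)
```

At this input the plain gradient is close to zero because the tanh output is saturated, whereas the integrated gradients sum to \(F(x)-F(x^{\prime})\), which is how the method sidesteps gradient saturation.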

Nevertheless, the evaluation results given by the integrated gradients algorithm are not always reliable. Calculating the uncertainty of the predictions can enable the reliability of the results to be judged. Bayesian neural networks have been designed to quantify the uncertainty of results, but due to the large number of required computations, Monte Carlo dropout and Gaussian distribution models are often used as approximate solutions for Bayesian neural networks. Compared to Gaussian distribution models, Monte Carlo dropout can better approximate Bayesian neural networks by using the dropout term as one regularization term, for calculating the uncertainty of the results [33]. The objective function for using L2 regularization in Monte Carlo dropout can be expressed as:

$${l}_{MC}:=\frac{1}{n}{\sum }_{i=1}^{n}E\left({y}_{i},{\widehat{y}}_{i}\right)+\lambda {\sum }_{l=1}^{L}\left({\Vert {W}_{l}\Vert }_{2}^{2}+{\Vert {b}_{l}\Vert }_{2}^{2}\right)$$
(11)

where L is the number of layers in the deep neural network, \({W}_{l}\) and \({b}_{l}\) are the weights and biases of layer l, \(\lambda\) is the regularization coefficient, and \({y}_{i}\) and \({\widehat{y}}_{i}\) are the target and the output of the network, respectively. By following [34], we trained UISNet with different dropout settings at T inference times as the Monte Carlo dropout strategy. Supposing that \(lg{x}_{i}\) represents the integrated gradients importance of the ith node, the uncertainty of the ith feature is defined as:

$$U\left({x}_{i}\right)=\frac{std\left(lg{x}_{i}^{1},\dots ,lg{x}_{i}^{T}\right)}{ave\left(lg{x}_{i}^{1},\dots ,lg{x}_{i}^{T}\right)}$$
(12)

where std(*) and ave(*) are the standard deviation and average values of \(lg{x}_{i}\) at T inference times, respectively. The importance weight of the gene feature in the network is expressed as:

$$V\left({x}_{i}\right)=\left(1-{U}^{\prime}\left({x}_{i}\right)\right)* IG\left({x}_{i}\right)$$
(13)

where U′ \(\left({x}_{i}\right)\) is the adjusted uncertainty value of \(U\left({x}_{i}\right)\) after log transformation and min–max normalization. In summary, the UISNet algorithm is given as follows:


Algorithm of UISNet
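The uncertainty weighting in Eqs. (12) and (13) can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the integrated gradients from the T Monte Carlo dropout passes are stored in an array of shape (T, p), the average across passes is used as \(IG(x_i)\) in Eq. (13), the absolute value of the average is used in the denominator to avoid sign flips, and a small constant `eps` guards against division by zero. All names and values are hypothetical.

```python
import numpy as np

def importance_weights(ig_runs, eps=1e-12):
    """ig_runs: shape (T, p), integrated gradients from T stochastic forward passes."""
    mean_ig = ig_runs.mean(axis=0)
    # Eq. (12): standard deviation divided by the average across the T passes.
    U = ig_runs.std(axis=0) / (np.abs(mean_ig) + eps)
    # Adjusted uncertainty U': log transformation followed by min-max normalization.
    log_u = np.log(U + eps)
    U_adj = (log_u - log_u.min()) / (log_u.max() - log_u.min() + eps)
    # Eq. (13): features with unstable attributions are down-weighted.
    return (1.0 - U_adj) * mean_ig, U_adj

# Hypothetical attributions for p = 2 genes over T = 3 passes.
# Both genes have the same average attribution, but gene 1 varies far more.
ig_runs = np.array([[1.00, 1.0],
                    [1.02, 0.2],
                    [0.98, 1.8]])
V, U_adj = importance_weights(ig_runs)
```

In this toy case the plain average cannot separate the two genes, while the uncertainty-weighted importance ranks the stable gene well above the unstable one.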

Performance evaluations and parameter selection

In this study, the cancer outcome prediction performances of different methods were compared through the C-index and |log10(P)| values. The C-index value is the fraction of all pairs of patients whose predicted outcomes are correctly ordered based on Harrell’s C statistics [35]. A higher |log10(P)| value indicates more significant survival differences between the patient subgroups divided based on the predicted risks (Additional file 1).
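The C-index described above can be computed with a few lines of plain Python. This is a simplified sketch of Harrell's statistic under stated assumptions: a pair is treated as comparable when the patient with the shorter time has an observed event, tied risk scores count as half concordant, and tied survival times are ignored. The function name and the toy cohort are hypothetical.

```python
def harrell_c_index(time, event, risk):
    """Fraction of comparable patient pairs whose predicted risks are correctly ordered."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient i has an observed event before patient j's time.
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical cohort of four patients (event 1 = death, 0 = censored).
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
c_index = harrell_c_index(time, event, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk.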

The parameter list in this study is given below. The number of nodes in hidden layer 1 was set to 1000, and the number of nodes in hidden layer 2 was set to 500. The number of nodes Z in hidden layer 3 was selected from [10, 20, 50]. The learning rate (LR) was selected from [1e-6, 1e-7, 1e-8], and the maximum number of epochs in the neural network was set to 2000. The optimal parameters were selected by fivefold cross-validation. To assess the robustness of our method, Additional file 1: Supplementary Table S1 shows the deviation obtained by UISNet in fivefold and tenfold cross-validation.

Results

Method comparison

The UISNet model was evaluated by the C-index (CI) in predicting the outcome of the patients in seven breast cancer datasets. The average CI values are given in Table 2. UISNet was compared with five methods: the adaptive Lasso-penalized Cox model (IPF-lasso), the integrated learning-based Cox method (EXSA), the deep survival network (Deep_surv), the denoising autoencoder-based Cox model (DCAP), and the deep survival network with a meta-learning framework (MTC). As shown in Table 2, UISNet achieved an average CI of 0.691 across the seven datasets, which is significantly higher than the average CI values of 0.619, 0.638, 0.653, 0.665, and 0.677 achieved by IPF-lasso, EXSA, Deep_surv, DCAP, and MTC, respectively. A t-test performed on the results obtained from UISNet and the other methods indicated that the improvements of our method over the alternative approaches are significant.

Table 2 The CI values obtained by different methods on breast cancer datasets

When we divided the patients into subgroups according to the estimated prognosis risks, UISNet achieved an average |log10(P)| of 1.608 across the seven datasets, which is higher than the |log10(P)| value achieved by IPF-lasso (1.036). Meanwhile, UISNet performed better than the other four compared methods (Table 3, average |log10(P)| = 1.167). Additionally, we show the average time-dependent AUC scores [36] in Fig. 2. By testing on the eight breast cancer datasets, UISNet achieved the highest AUC score of 0.676 among the compared methods (IPF-lasso = 0.623, EXSA = 0.634, Deep_surv = 0.653, DCAP = 0.660, MTC = 0.670).

Table 3 The |log10(P)| values produced by different methods on breast cancer datasets
Fig. 2

The AUC scores obtained by different methods on eight breast cancer datasets

Parameter sensitivity analysis

To evaluate the effects of the hyperparameters on the prediction of UISNet, we examined the CI values obtained on BRCA and BRCA_all with different parameter combinations (Fig. 3). The number of nodes in hidden layer 3 was selected from [100, 50, 20, 10], while the learning rate was selected from [1e-6, 1e-7, 5e-7, 1e-8]. By comparing the standard deviations of the CI values while one parameter was fixed, we found that the effect of the node size in the network was relatively small (std = 0.010) and smaller than the effect of the learning rate (std = 0.028). Nevertheless, it is difficult to determine the optimal parameter combinations in different datasets. In this study, we used fivefold cross-validation to select suitable hyperparameters of UISNet in model training.

Fig. 3

Parameter sensitivity analysis for UISNet. The y-axis values represent the C-index values obtained by using different hyperparameter combinations. The number of nodes in hidden layer 3 was selected from [100, 50, 20, 10], and the learning rate was set to [1e-6, 1e-7, 5e-7, 1e-8]

Ablation experiment

To evaluate the contribution of each task in UISNet for cancer outcome prediction, we compared the performance achieved by excluding different tasks from the framework. As shown in Fig. 4, when only the single prediction task was used to construct the neural network (-DRSS), the DNN framework achieved an average CI value of 0.610, which is 6.87% lower than that obtained by UISNet. Excluding the clustering task (-SS) from UISNet decreased the CI value from 0.655 to 0.631 (−3.66%), and a smaller decrease was caused by the exclusion of the dimensionality reduction task (-DR, CI = 0.642, −1.98%). These results indicated that the clustering task provides more useful information than the dimensionality reduction task, and that the prediction accuracy benefits from integrating these tasks into a unified framework.

Fig. 4

Ablation experiment results obtained by UISNet for cancer outcome prediction. -DRSS represents the result achieved by only using the cancer outcome prediction task, and -SS and -DR indicate the results obtained when excluding the clustering task and the dimensionality reduction task, respectively

Independent tests

To further validate the performance of UISNet, we conducted independent tests by separating each breast cancer dataset from the integrated dataset BRCA_all as an independent test dataset (Fig. 5). The results indicated that the CI values obtained by UISNet in the independent tests are higher than 0.659 on average. The Kaplan-Meier survival curves illustrate the significant (P < 0.05) difference in survival between the two patient subgroups classified by UISNet. All these results support the robustness of our method.

Fig. 5

Kaplan‒Meier survival curves drawn based on the patient subgroups classified by UISNet. The red lines represent high-risk patients, and the black lines represent low-risk patients
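For readers unfamiliar with the curves in Fig. 5, the Kaplan-Meier estimator can be sketched in plain Python. This is a generic illustration of the product-limit estimator, not the plotting code used in the study; the function name and the toy subgroup are hypothetical.

```python
def kaplan_meier(time, event):
    """Return [(t, S(t))] at each distinct event time using the product-limit estimator."""
    curve, surv = [], 1.0
    event_times = sorted(set(t for t, e in zip(time, event) if e == 1))
    for t in event_times:
        at_risk = sum(1 for ti in time if ti >= t)                          # still under observation at t
        deaths = sum(1 for ti, e in zip(time, event) if ti == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical subgroup of five patients (event 1 = death, 0 = censored).
time = [1, 2, 2, 3, 4]
event = [1, 1, 0, 1, 0]
curve = kaplan_meier(time, event)
```

Computing one such curve per predicted risk subgroup and comparing them with a log-rank test gives the P values reported for the survival differences.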

Feature interpretability evaluation

To evaluate the feature interpretability of UISNet, we compared UISNet with the method without the uncertainty strategy (IG) and with differential expression analysis (DEA) on the BRCA_all dataset. The top 200 genes sorted based on the importance weights given by UISNet are shown in Fig. 6a. The IG values of these genes were calculated by Eq. (12). The result shows that UISNet selected some important genes, such as MAPK1, AKT1, and RAF1, that have been shown to be related to breast cancer prognosis. Figure 6b shows the top 200 genes ranked based on the |log(fold change)| values produced by DEA (adjusted p-values < 0.05), and the gene expression heatmap of these genes is shown in Fig. 6c. The names of the top 20 genes selected by UISNet and DEA are additionally annotated in Fig. 6a and b, respectively.

Fig. 6

Breast cancer-related gene selection results produced by different methods. a Top 200 genes selected by computing the importance of the features in UISNet. The x-axis represents the average importance values of the different genes, and the y-axis is the uncertainty weight value computed by Eq. (12). b Top 200 genes ranked based on the |log(fold change)| values of DEA (adjusted p-values < 0.05). c Heatmap of the 200 identified differentially expressed genes. d Breast cancer outcome prediction performance achieved by using different numbers of selected gene features based on DEA, the IG-based method without the uncertainty strategy (IG), and UISNet. A Venn diagram shows the number of overlapping genes selected by the different methods

The Venn diagram in Fig. 6d shows the number of overlapping genes selected by the different methods. There are 92 overlapping genes selected by IG and UISNet, while only four genes are shared by DEA and UISNet. Additionally, the breast cancer outcome prediction performance achieved by using different numbers of selected gene features based on DEA, IG, and UISNet is shown in Fig. 6d. When using the top 5 gene features, DEA and IG achieved lower CI values (0.559 and 0.542) than UISNet (0.565). When the number of used genes was 200, UISNet obtained the highest CI value of 0.652. The results demonstrate that, compared with IG and DEA, UISNet can find genes that have a greater impact on the prognosis of breast cancer.

Biointerpretability assessment

According to the importance weights (IWs) given by UISNet, we selected the top 20 genes for further analysis (Fig. 7a). Specifically, 17 of the selected genes have been validated to be associated with breast cancer. AKT1 encodes one of three human AKT serine-threonine protein kinase family members, and mutations in AKT1 are linked to breast cancer cell growth [37]. PTEN negatively regulates intracellular phosphatidylinositol-3,4,5-trisphosphate and exerts a tumor suppression effect by negatively regulating the PI3K-AKT signaling pathway [38]. In Lama’s study, molecular changes in MAPK1 led to overexpression of matrix metallopeptidase, which is associated with poor prognosis in breast cancer patients [39]. The upregulation of PRKCA has been found to be linked with resistance to antiestrogen treatment and the aggressive nature of tumors. PRKCA serves as a pivotal signaling hub and a potential therapeutic target in breast cancer stem cells, which exhibit comparable cell surface marker profiles to those observed in TNB [40]. CDC23, regulated by miR-34c, may be responsible for miR-34c-induced cell cycle arrest, as miR-34c can induce G2/M cell cycle arrest in breast cancer cells [41]. UBA52 has been reported to be potentially associated with the development of resistance to Lapatinib in breast cancer treatment [42].

Fig. 7

Interpretability assessment of the results obtained by UISNet. a The top 20 genes selected by UISNet ranked based on their importance weights. b The identified breast cancer-related KEGG pathways enriched by using the top 20 genes selected by UISNet

In addition, reduced expression of the RAF1 kinase inhibitor protein has been found to be associated with breast cancer metastasis [43]. The regulation of CDKN1A gene expression by LRH-1 influences the proliferation of breast tumor cells [44]. Downregulation of IKBKB expression by MicroRNA-16 enhances the sensitivity of breast cancer cells to paclitaxel treatment [45]. PPP2CA has been found to promote the proliferation and invasion of breast cancer cells [46]. Overexpression of PSMB4 promotes the proliferation and survival of breast cancer cells, leading to an unfavorable prognosis [47]. Previous studies revealed a significant upregulation in AP1M2, PSMD10, and RPL2 expression in breast cancer tissue compared to normal tissues [48,49,50]. RPS27A is overexpressed in breast cancer, enhancing EBV-encoded LMP1-mediated proliferation and invasion through stabilization of LMP1 [51]. The RPS4X protein was identified as a potential biomarker for controlling cisplatin resistance in breast cancer treatment [52]. Although functional studies have not directly implicated MAP2K1, POLR2J, and SEH1L in breast cancer development and progression, they have been associated with other malignancies [53,54,55]. The indications from our model imply that they could potentially emerge as targets for breast cancer.

We further performed KEGG enrichment analysis on these 20 genes (Fig. 7b) and found that they are enriched in many signaling pathways related to the occurrence and development of breast cancer, such as the mTOR and PI3K-AKT signaling pathways. The downstream transcription factors of the mTOR signaling pathway (with the highest enrichment score) include HIF1α, c-Myc, FoxO, and other important cancer regulatory molecules [56]. Paclitaxel has been shown to modulate the proliferation and migration of breast cancer cells via the mTOR signaling pathway [57]. The PI3K-AKT signaling pathway is a crucial component of many signaling pathways involving membrane-bound ligands, which are crucial for the survival of tumor cells [58]. The enrichment of the EGFR tyrosine kinase inhibitor resistance pathway suggests that the identified genes may affect tyrosine kinase inhibitor resistance in breast cancer treatment. The dysregulated activation of the ErbB signaling pathway plays a critical role in regulating cell growth, differentiation, and survival in breast cancer. Moreover, it is closely associated with tumor initiation, progression, and metastasis [59]. The PD-L1 expression and PD-1 checkpoint pathway in breast cancer is closely related to immune regulation and tumor evasion from immune surveillance, and plays a key role in the regulation of immune responses [60]. The activation of the HIF-1 pathway in breast cancer is intricately associated with tumorigenesis, disease progression, and acquisition of treatment resistance [61]. Additionally, the identified genes are enriched in several other breast cancer-related KEGG pathways, including the VEGF and Ras signaling pathways, the breast cancer pathway, pathways in cancer, and microRNAs in cancer. These findings demonstrate that UISNet can construct a breast cancer outcome prediction model with enhanced interpretability for biomedical applications.

Discussion

Although our method provides model biological interpretability while improving the prediction accuracy, there are still some questions worth discussing, as described below. Firstly, the high censoring rates (12.17–87.03%) in the breast cancer data affected the calculation of the true survival proportion and decreased the performance of our method. Secondly, previous studies have shown that multi-omics integration is helpful for improving cancer outcome prediction performance. Expanding UISNet to integrate multi-omics data in an interpretable manner will be a potential way to improve the prediction performance of the model.

In the future, we will update our interpretable method by incorporating more medical information, including DNA methylation and slide images. Additionally, we want to design an effective strategy to estimate the true survival times of censored patients to reduce the adverse impact of high censoring rates on model training. Furthermore, considering the molecular-level similarities observed between gynecologic and breast tumors [62], we will utilize the UISNet model to investigate gynecologic cancers such as cervical cancer and ovarian cancer, with the aim of identifying potential pan-gynecologic-cancer related biomarkers for effective therapeutic interventions.

Conclusions

DL-based methods have been proven to achieve accurate performance in cancer outcome prediction cases. Nevertheless, the lack of model interpretability limits the applicability of these methods. To address this challenge, we proposed a novel uncertainty-based interpretable deep neural network called UISNet for breast cancer outcome prediction. UISNet provides interpretable solutions by computing the integrated gradients of features with an uncertainty-based strategy. Furthermore, it improved model performance by introducing prior biological pathway knowledge and utilizing patient heterogeneity information. The experimental results show that UISNet achieved a 5.79% higher CI value than the compared state-of-the-art methods on average. Based on the feature interpretation results of the prediction model, 11 of the 20 identified genes have been proven to be associated with breast cancer. The comprehensive tests indicated that our proposed method is accurate and robust, and is an effective way to identify cancer-related genes. In summary, we believe that UISNet is a valuable and meaningful foundation for further cancer prognosis prediction research.

Availability of data and materials

All data analyzed in this study were downloaded from GEO (https://www.ncbi.nlm.nih.gov) and TCGA (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The source code of the method is available at https://github.com/chh171/UISNet.

References

  1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49.

  2. Kroemer G, Senovilla L, Galluzzi L, André F, Zitvogel L. Natural and therapy-induced immunosurveillance in breast cancer. Nat Med. 2015;21(10):1128–38.

  3. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8–17.

  4. Tran T-O, Vo TH, Le NQK. Omics-based deep learning approaches for lung cancer decision-making and therapeutics development. Brief Funct Genomics. 2023;elad031.

  5. Matsuo K, Purushotham S, Jiang B, Mandelbaum RS, Takiuchi T, Liu Y, Roman LD. Survival outcome prediction in cervical cancer: Cox models vs deep-learning model. Am J Obstet Gynecol. 2019;220(4):381.e1–381.e14.

  6. Wang H, Zhou L. Random survival forest with space extensions for censored data. Artif Intell Med. 2017;79:52–61.

  7. Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biom J. 2010;52(1):70–84.

  8. Lin D, Banjevic D, Jardine AK. Using principal components in a proportional hazards model with applications in condition-based maintenance. J Oper Res Soc. 2006;57(8):910–9.

  9. Cai T, Tonini G, Lin X. Kernel machine approach to testing the significance of multiple genetic markers for risk prediction. Biometrics. 2011;67(3):975–86.

  10. Boulesteix A-L, De Bin R, Jiang X, Fuchs M. IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med. 2017;2017.

  11. Liu P, Fu B, Yang SX, Deng L, Zhong X, Zheng H. Optimizing survival analysis of XGBoost for ties to predict disease progression of breast cancer. IEEE Trans Biomed Eng. 2020;68(1):148–60.

  12. Le NQK. Potential of deep representative learning features to interpret the sequence information in proteomics. Proteomics. 2022;22(1–2):2100232.

  13. Zhou L, Rueda M, Alkhateeb A. Classification of breast cancer Nottingham prognostic index using high-dimensional embedding and residual neural network. Cancers. 2022;14(4):934.

  14. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol. 2018;18(1):24.

  15. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep learning-based multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res. 2018;24(6):1248–59.

  16. Chai H, Zhou X, Zhang Z, Rao J, Zhao H, Yang Y. Integrating multi-omics data through deep learning for accurate cancer prognosis prediction. Comput Biol Med. 2021;134: 104481.

  17. Chai H, Zhang Z, Wang Y, Yang Y. Predicting bladder cancer prognosis by integrating multi-omics data through a transfer learning-based Cox proportional hazards network. CCF Trans High Perform Comput. 2021;3(3):311–9.

  18. ElKarami B, Alkhateeb A, Qattous H, Alshomali L, Shahrrava B. Multi-omics data integration model based on UMAP embedding and convolutional neural network. Cancer Inform. 2022;21:11769351221124204.

  19. Qiu YL, Zheng H, Devos A, Selby H, Gevaert O. A meta-learning approach for genomic survival analysis. Nat Commun. 2020;11(1):1–11.

  20. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform. 2013;14(1):1–18.

  21. Hao J, Kim Y, Mallavarapu T, Oh JH, Kang M. Interpretable deep neural network for cancer survival analysis by integrating genomic and clinical data. BMC Med Genomics. 2019;12(10):1–13.

  22. Qi Z, Khorram S, Li F. Visualizing deep networks by optimizing with integrated gradients. In: CVPR Workshops; 2019.

  23. Zhao L, Dong Q, Luo C, Wu Y, Bu D, Qi X, Luo Y, Zhao Y. DeepOmix: a scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis. Comput Struct Biotechnol J. 2021;19:2719–25.

  24. Petsiuk V, Das A, Saenko K. RISE: randomized input sampling for explanation of black-box models. arXiv:1806.07421 (2018).

  25. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7): e47.

  26. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, Haibe-Kains B. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006;98(4):262–72.

  27. Loi S, Haibe-Kains B, Desmedt C, Wirapati P, Lallemand F, Tutt AM, Gillet C, Ellis P, Ryder K, Reid JF. Predicting prognosis using molecular profiling in estrogen receptor-positive breast cancer treated with tamoxifen. BMC Genomics. 2008;9(1):1–12.

  28. Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, Lehr H-A, Hengstler JG, Kolbl H, Gehrmann M. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res. 2008;68(13):5405–13.

  29. Symmans WF, Hatzis C, Sotiriou C, Andre F, Peintinger F, Regitnig P, Daxenbichler G, Desmedt C, Domont J, Marth C. Genomic index of sensitivity to endocrine therapy for breast cancer. J Clin Oncol. 2010;28(27):4111.

  30. Li Y, Zou L, Li Q, Haibe-Kains B, Tian R, Li Y, Desmedt C, Sotiriou C, Szallasi Z, Iglehart JD. Amplification of LAPTM4B and YWHAZ contributes to chemotherapy resistance and recurrence of breast cancer. Nat Med. 2010;16(2):214–8.

  31. Hatzis C, Pusztai L, Valero V, Booser DJ, Esserman L, Lluch A, Vidaurre T, Holmes F, Souchon E, Wang H. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA. 2011;305(18):1873–81.

  32. Guo X, Gao L, Liu X, Yin J. Improved deep embedded clustering with local structure preservation. In: IJCAI; 2017. p. 1753–9.

  33. Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X, Khosravi A, Acharya UR. A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf Fusion. 2021;76:243–97.

  34. Wang Y, Zhang Y, Tian J, Zhong C, Shi Z, Zhang Y, He Z. Double-uncertainty weighted method for semi-supervised learning. In: International conference on medical image computing and computer-assisted intervention. Springer; 2020. p. 542–51.

  35. Van Belle V, Pelckmans K, Van Huffel S, Suykens JA. Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif Intell Med. 2011;53(2):107–18.

  36. Kamarudin AN, Cox T, Kolamunnage-Dona R. Time-dependent ROC curve analysis in medical research: current methods and applications. BMC Med Res Methodol. 2017;17(1):1–19.

  37. Hinz N, Jücker M. Distinct functions of AKT isoforms in breast cancer: a comprehensive review. Cell Commun Signal. 2019;17(1):1–29.

  38. Milella M, Falcone I, Conciatori F, Cesta Incani U, Del Curatolo A, Inzerilli N, Nuzzo C, Vaccaro V, Vari S, Cognetti F. PTEN: multiple functions in human malignant tumors. Front Oncol. 2015;5:24.

  39. Hamadneh L, Bahader M, Abuarqoub R, AlWahsh M, Alhusban A, Hikmat S. PI3K/AKT and MAPK1 molecular changes preceding matrix metallopeptidases overexpression during tamoxifen-resistance development are correlated to poor prognosis in breast cancer patients. Breast Cancer. 2021;28(6):1358–66.

  40. Tam WL, Lu H, Buikhuisen J, Soh BS, Lim E, Reinhardt F, Wu ZJ, Krall JA, Bierie B, Guo W. Protein kinase C α is a central signaling node and therapeutic target for breast cancer stem cells. Cancer Cell. 2013;24(3):347–64.

  41. Achari C, Winslow S, Ceder Y, Larsson C. Expression of miR-34c induces G2/M cell cycle arrest in breast cancer cells. BMC Cancer. 2014;14(1):1–9.

  42. Zhang L, Huang Y, Zhuo W, Zhu Y, Zhu B, Chen Z. Identification and characterization of biomarkers and their functions for Lapatinib-resistant breast cancer. Med Oncol. 2017;34:1–8.

  43. Hagan S, Al-Mulla F, Mallon E, Oien K, Ferrier R, Gusterson B, Curto García JJ, Kolch W. Reduction of Raf-1 kinase inhibitor protein expression correlates with breast cancer metastasis. Clin Cancer Res. 2005;11(20):7392–7.

  44. Bianco S, Jangal M, Garneau D, Gevry N. LRH-1 controls proliferation in breast tumor cells by regulating CDKN1A gene expression. Oncogene. 2015;34(34):4509–18.

  45. Tang X, Jin L, Cao P, Cao K, Huang C, Luo Y, Ma J, Shen S, Tan M, Li X. MicroRNA-16 sensitizes breast cancer cells to paclitaxel through suppression of IKBKB expression. Oncotarget. 2016;7(17):23668.

  46. Zeng Q, Jin F, Qian H, Chen H, Wang Y, Zhang D, Wei Y, Chen T, Guo B, Chai C. The miR-345-3p/PPP2CA signaling axis promotes proliferation and invasion of breast cancer cells. Carcinogenesis. 2022;43(2):150–9.

  47. Wang H, He Z, Xia L, Zhang W, Xu L, Yue X, Ru X, Xu Y. PSMB4 overexpression enhances the cell growth and viability of breast cancer cells leading to a poor prognosis. Oncol Rep. 2018;40(4):2343–52.

  48. Wu C-C, Kao T-J, Ta HDK, Anuraga G, Andriani V, Athoillah M, Chiao C-C, Wu Y-F, Lee K-H, Wang C-Y. Prognostic and immune infiltration signatures of proteasome 26S subunit, non-ATPase (PSMD) family genes in breast cancer patients. Aging. 2021;13(22):24882.

  49. Wilson-Edell KA, Kehasse A, Scott GK, Yau C, Rothschild DE, Schilling B, Gabriel BS, Yevtushenko MA, Hanson IM, Held JM. RPL24: a potential therapeutic target whose depletion or acetylation inhibits polysome assembly and cancer cell growth. Oncotarget. 2014;5(13):5165.

  50. Kim DH, Lee KE. Discovering breast cancer biomarkers candidates through mRNA expression analysis based on the cancer genome atlas database. J Person Med. 2022;12(10):1753.

  51. Li H, Zhang H, Huang G, Bing Z, Xu D, Liu J, Luo H, An X. Loss of RPS27a expression regulates the cell cycle, apoptosis, and proliferation via the RPL11-MDM2-p53 pathway in lung adenocarcinoma cells. J Exp Clin Cancer Res. 2022;41(1):33.

  52. Garand C, Guay D, Sereduk C, Chow D, Tsofack SP, Langlois M, Perreault E, Yin HH, Lebel M. An integrative approach to identify YB-1-interacting proteins required for cisplatin resistance in MCF7 and MDA-MB-231 breast cancer cells. Cancer Sci. 2011;102(7):1410–7.

  53. Choi YL, Soda M, Ueno T, Hamada T, Haruta H, Yamato A, Fukumura K, Ando M, Kawazu M, Yamashita Y. Oncogenic MAP2K1 mutations in human epithelial tumors. Carcinogenesis. 2012;33(5):956–61.

  54. Wang T, Liu D, Wang L, Liu M, Duan W, Yi J, Yi Y. DNA repair genes are associated with subtype classification, prognosis, and immune infiltration in uveal melanoma. J Oncol. 2022.

  55. Yu J, Liu T-T, Liang L-L, Liu J, Cai H-Q, Zeng J, Wang T-T, Li J, Xiu L, Li N. Identification and validation of a novel glycolysis-related gene signature for predicting the prognosis in ovarian cancer. Cancer Cell Int. 2021;21:1–14.

  56. Tarrado-Castellarnau M, de Atauri P, Cascante M. Oncogenic regulation of tumor metabolic reprogramming. Oncotarget. 2016;7(38):62726.

  57. Miricescu D, Totan A, Stanescu-Spinu I-I, Badoiu SC, Stefani C, Greabu M. PI3K/AKT/mTOR signaling pathway in breast cancer: from molecular landscape to clinical aspects. Int J Mol Sci. 2020;22(1):173.

  58. Hoxhaj G, Manning BD. The PI3K–AKT network at the interface of oncogenic signalling and cancer metabolism. Nat Rev Cancer. 2020;20(2):74–88.

  59. Hardy KM, Booth BW, Hendrix MJ, Salomon DS, Strizzi L. ErbB/EGF signaling and EMT in mammary development and breast cancer. J Mammary Gland Biol Neoplasia. 2010;15(2):191–9.

  60. Schütz F, Stefanovic S, Mayer L, von Au A, Domschke C, Sohn C. PD-1/PD-L1 pathway in breast cancer. Oncol Res Treat. 2017;40(5):294–7.

  61. Flamant L, Notte A, Ninane N, Raes M, Michiels C. Anti-apoptotic role of HIF-1 and AP-1 in paclitaxel exposed breast cancer cells under hypoxia. Mol Cancer. 2010;9:1–15.

  62. Berger AC, Korkut A, Kanchi RS, Hegde AM, Lenoir W, Liu W, et al. A comprehensive pan-cancer molecular study of gynecologic and breast cancers. Cancer Cell. 2018;33(4):690–705.

Acknowledgements

We thank Mr. Weizhen Deng at Foshan University for supporting our study with data collection.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62201150), Jihua Laboratory scientific project (X210101UZ210), and Research on in situ mass spectrometry for analyzing complex traditional Chinese medicine systems (2021ZDZX2060).

Author information

Contributions

HC and YY conceived the study. HC, SL, and JL performed the data analysis. HC, SL, YO, and HZ interpreted the results. HC and MH wrote the manuscript.

Corresponding authors

Correspondence to Yongzhong OuYang or Huiying Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Table S1. The deviation obtained by UISNet in 5-fold cross validation and 10-fold cross validation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chai, H., Lin, S., Lin, J. et al. An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome. BMC Bioinformatics 25, 88 (2024). https://doi.org/10.1186/s12859-024-05716-7

  • DOI: https://doi.org/10.1186/s12859-024-05716-7

Keyword