An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome

Chai, Hua; Lin, Siyin; Lin, Junqi; He, Minfan; Yang, Yuedong; OuYang, Yongzhong; Zhao, Huiying

doi:10.1186/s12859-024-05716-7

Research
Open access
Published: 29 February 2024

BMC Bioinformatics volume 25, Article number: 88 (2024)

Abstract

Background

Predicting the outcome of breast cancer is important for selecting appropriate treatments and prolonging the survival periods of patients. Recently, different deep learning-based methods have been carefully designed for cancer outcome prediction. However, the application of these methods is still limited by their lack of interpretability. In this study, we proposed a novel multitask deep neural network called UISNet to predict the outcome of breast cancer. UISNet is able to interpret the importance of features for the prediction model via an uncertainty-based integrated gradients algorithm. UISNet improved the prediction by introducing prior biological pathway knowledge and utilizing patient heterogeneity information.

Results

The model was tested on seven public breast cancer datasets and showed better performance (average C-index = 0.691) than the state-of-the-art methods (average C-index = 0.650, ranging from 0.619 to 0.677). Importantly, UISNet identified 20 genes as associated with breast cancer, among which 11 have been proven to be associated with breast cancer by previous studies, and the others are novel findings of this study.

Conclusions

Our proposed method is accurate and robust in predicting breast cancer outcomes, and it is an effective way to identify breast cancer-associated genes. The method codes are available at: https://github.com/chh171/UISNet.

Background

According to the global cancer statistics in 2020, breast cancer is the most common malignant tumor, accounting for two million (11.7%) patients worldwide [1]. The outcomes of these breast cancer patients were observed to be significantly different under the same treatment, reflecting the heterogeneity of breast cancer. Accurate breast cancer outcome prediction is important for designing effective follow-up treatments, and improving the survival periods and life quality of patients [2].

With the advancement of molecular sequencing techniques, an increasing amount of high-dimensional genomic data has been used to evaluate cancer outcomes to support clinical decision-making [3, 4]. The most widely used method for evaluating patient risks is the Cox proportional hazards model [5]. This method analyses the influences of different factors on cancers by calculating the hazard ratios of patients without knowing patients’ survival distributions. In addition, Wang et al. designed random survival forests (RSF) to predict cancer outcomes by utilizing the bootstrapping strategy [6]. However, these methods have limited performance on high-dimensional gene expression data [7]. To solve this problem, many feature dimensionality reduction techniques have been combined with these algorithms. Lin used features extracted by principal component analysis (PCA) in the Cox method to predict disease prognosis [8]. Considering that PCA performs poorly in a high-dimensional nonlinear space, Cai selected kernel-PCA instead of PCA to generate compressed features for patient risk prediction [9]. Another way to solve the computational challenge caused by high-dimensional features is to add a regularization component to the Cox model. Boulesteix combined adaptive Lasso regularization and the Cox regression method (IPF-lasso) to estimate cancer prognosis [10]. The method adds an L1-norm regularization term to the likelihood-based objective to shrink the coefficients of the features. In addition, the survival support vector machine (SSVM) proposed by Evers yielded improved cancer outcome prediction performance by using a sparse kernel function. However, choosing a suitable kernel function and hyperparameters is a complex process. Recently, Liu et al. designed an integrated learning method called EXSA based on the XGBoost framework to predict cancer outcomes. The results show that it outperformed other traditional machine learning methods [11].

In recent years, deep learning (DL) technologies have demonstrated their efficacy in handling high-dimensional nonlinear features [12]. A residual neural network was used by Zhou et al. for breast cancer prognostic index classification [13]. Deep_surv, proposed by Katzman, was designed to estimate cancer outcomes by combining a deep neural network (DNN) and the proportional hazards loss function [14]. Chaudhary used an autoencoder to reconstruct high-dimensional expression features, and the generated features were used for liver cancer survival analysis [15]. Based on this method, Yang et al. proposed DCAP, which uses a denoising autoencoder to build a model that is robust to data noise [16]. Nonetheless, the separation of the feature extraction and risk prediction processes made this method less convenient to use. To solve this problem, an end-to-end framework called TCAP was designed to combine the risk prediction loss and data recovery loss [17]. Bashier et al. proposed a multi-omics data integration approach that combines gene similarity networks with convolutional neural networks to accurately predict the stage of breast tumors [18]. On the basis of these studies, to speed up the convergence of the DNN model and reduce the risk of overfitting during model training, Qiu introduced a meta-learning-based network for cancer outcome prediction [19].

Although DL-based methods have achieved better results in cancer outcome prediction, the application of these methods is still limited by the lack of model interpretability. Interpreting the factors associated with cancer outcomes is important for medical decision-making and targeted drug development. The widely used method for solving this problem is differential expression analysis (DEA). Nevertheless, when the average expression of the given features is low, the log-fold change values computed in DEA are disproportionately affected by noise [20]. Hence, Hao proposed an interpretable DNN framework for cancer survival analysis that calculates the gradients in the neural network [21]. However, this approach may lead to gradient saturation, making it difficult for the neural network to identify important features [22]. Zhao et al. proposed DeepOmix for cancer prognosis prediction [23]. According to the predicted risks, DeepOmix performed the Kolmogorov-Smirnov test to identify prognosis-related genes. However, it is difficult to verify that the genes identified by these methods are truly related to cancer outcomes. Studies have indicated that removing these genes from the model does not reduce the prediction accuracy dramatically [24]. Thus, it is necessary to develop an interpretable model that accurately predicts cancer outcomes and identifies cancer-related key genes to reveal novel cancer mechanisms.

To address these problems, we propose an Uncertainty-based Interpretable deep Semi-supervised Network (UISNet) for breast cancer outcome prediction. The main contributions of our research are given as follows:

1. An uncertainty-based algorithm that combines Monte Carlo dropout and integrated gradients is designed to improve the reliability of the interpretation results.
2. By introducing prior biological pathway information as a sparse layer, UISNet deals with high-dimensional gene expression data effectively.
3. UISNet considers the heterogeneity of patients to extract useful information for cancer outcome prediction. This information is integrated into a unified framework after dimension reduction. All these tasks are simultaneously optimized through the shared representations in the neural network.
4. UISNet was applied to seven breast cancer datasets from the TCGA and GEO databases, and the prediction results were analyzed comprehensively. The results indicated that UISNet is accurate and robust when predicting breast cancer outcomes, and is able to identify prognosis-related genes efficiently.

The details of UISNet are introduced in Section "Methods". Section "Results" shows the experimental results and biological analysis. Finally, we present the conclusion and discussion in Section "Discussion".

Methods

Datasets

In this study, seven breast cancer datasets collected from GEO (https://www.ncbi.nlm.nih.gov) and TCGA (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) were used for method evaluations. Considering the requirement for cancer outcome prediction data with consistent gene expression profiles and comprehensive information on patient survival, we identified six available datasets from the GEO database, encompassing a total of 1323 breast cancer patients. The features of 4767 genes in these datasets were normalized by log transformation, and the batch effect was removed by using the “limma” package [25]. The statistical information of each used dataset is given in Table 1.

Table 1 The statistical information of the utilized cancer data in our study

The architecture of the proposed deep learning framework

As shown in Fig. 1, high-dimensional gene expression data are given in the input layer, and prior biological pathway information is introduced in the sparse layer. The uncertainty-based interpretable deep semi-supervised network (UISNet) can learn meaningful information by incorporating the prior biological knowledge in the sparse layer (e.g., KEGG and Reactome gene connection pathways). The knowledge regarding the learned relationships between genes and functional pathways is used to form sparse connections between the input layer and the sparse layer instead of full connections.

Here, we constructed a binary biadjacency matrix to represent the sparse connections between the genes and functional pathways by using the strategy presented by Hao [21]. Supposing that p gene features and prior information, including q pathways from the KEGG and Reactome databases, are given in the neural network, the binary biadjacency matrix can be expressed as $A\in {\mathbb{B}}^{q\times p}$, and the element ${a}_{ij}$ in $A$ $(1\le i\le q,\ 1\le j\le p)$ is given as:

$$a_{ij} = \left\{ \begin{array}{ll} 1, & \text{if gene } j \text{ belongs to pathway } i \\ 0, & \text{otherwise} \end{array} \right.$$

(1)

The node values h in the UISNet framework are computed as follows:

$$h_{l + 1} = \left\{ \begin{array}{ll} \mathrm{ReLU}\left( \left( W*A \right)h_{gene} + b \right), & \text{sparse layer} \\ \mathrm{ReLU}\left( W*h_{l} + b \right), & \text{hidden layer} \end{array} \right.$$

(2)

where ReLU is a nonlinear activation function, ${h}_{gene}$ represents the gene expression values, ${h}_{l}$ is the output in layer l, W is the weight matrix, and b is the bias.
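The sparse layer of Eqs. (1) and (2) can be illustrated with a short NumPy sketch. It is not taken from the UISNet repository: the helper names `build_biadjacency` and `sparse_layer`, the toy gene sets, and the reading of $W*A$ as an element-wise product that masks the weights are all assumptions made for illustration.

```python
import numpy as np

def build_biadjacency(genes, pathways):
    """Return a (q x p) 0/1 matrix with A[i, j] = 1 if gene j is in pathway i."""
    gene_index = {g: j for j, g in enumerate(genes)}
    A = np.zeros((len(pathways), len(genes)))
    for i, members in enumerate(pathways.values()):
        for g in members:
            if g in gene_index:              # skip pathway genes absent from the data
                A[i, gene_index[g]] = 1.0
    return A

def sparse_layer(h_gene, W, A, b):
    """ReLU((W * A) h_gene + b), with * assumed to be the element-wise product."""
    return np.maximum(0.0, (W * A) @ h_gene + b)

# Toy example: hypothetical gene sets, not real KEGG or Reactome pathways.
genes = ["AKT1", "PTEN", "MAPK1", "RAF1"]
pathways = {"pathway_a": ["AKT1", "PTEN"],
            "pathway_b": ["MAPK1", "RAF1", "AKT1"]}
A = build_biadjacency(genes, pathways)

rng = np.random.default_rng(0)
W = rng.normal(size=A.shape)                 # one weight per (pathway, gene) pair
b = np.zeros(A.shape[0])
h_gene = np.array([1.0, 2.0, 0.5, 1.5])      # expression values of the four genes
h_sparse = sparse_layer(h_gene, W, A, b)     # one activation per pathway
```

Because the mask zeroes every weight outside a pathway, each pathway node only sees its member genes, which is what keeps the first layer small when p is in the thousands.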

The feature dimensionality reduction is performed by minimizing the loss in Eq. (3), where $X=\left({x}_{1},{x}_{2},\dots ,{x}_{p}\right)$ represents the gene expression values of the breast cancer patients, and Z is the low-dimensional representation of X in the last hidden layer. The feature dimensionality reduction task is used to obtain high-quality compressed features in the hidden layer. Similar to the encoder-decoder structure, supposing that E is the encoder function and D is the decoder function, the compressed Z is written as Z = E(X), and the recovered X′ can be expressed as X′ = D(Z). The loss induced by the dimensionality reduction task is written as:

$${L}_{D}={\sum }_{i=1}^{p}{\left({x}_{i}-{x}_{i}^{\prime}\right)}^{2}$$

(3)

The subtype clustering task is designed to extract information on breast cancer patient heterogeneity. In the last hidden layer, a feature matrix is formed by integrating the produced Z and the cluster labels L. The clustering task loss in UISNet is defined as the KL divergence between the two distributions S and T [32]:

$${L}_{C}=KL(S||T)=\sum_{i}\sum_{j}{s}_{ij}\log \frac{{s}_{ij}}{{t}_{ij}}$$

(4)

where ${t}_{ij}$ describes the similarity between the cluster center ${\mu }_{j}$ and an embedded point ${z}_{i}$ by Student’s t-distribution:

$${t}_{ij}=\frac{{\left(1+{\Vert {z}_{i}-{\mu }_{j}\Vert }^{2}\right)}^{-1}}{{\sum }_{j^{\prime}}{\left(1+{\Vert {z}_{i}-{\mu }_{j^{\prime}}\Vert }^{2}\right)}^{-1}}$$

(5)

${s}_{ij}$ is the target distribution:

$${s}_{ij}=\frac{{t}_{ij}^{2}/{\sum }_{i}{t}_{ij}}{{\sum }_{j^{\prime}}\left({t}_{ij^{\prime}}^{2}/{\sum }_{i}{t}_{ij^{\prime}}\right)}$$

(6)

The initial cluster labels L of the patients are calculated by k-means. The number of clusters k is the value in [2,3,4] with the largest silhouette score. To ensure the accuracy of the clustering task in UISNet, the computed labels are updated in each epoch while the program runs.
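As a rough sketch of Eqs. (4) to (6), the clustering loss can be computed as follows. The function names and the toy embedding are hypothetical, and the small `eps` added inside the logarithm is a numerical safeguard that is not part of the equations.

```python
import numpy as np

def soft_assignment(Z, mu):
    """t_ij of Eq. (5): Student's t similarity between point z_i and center mu_j."""
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    t = 1.0 / (1.0 + d2)
    return t / t.sum(axis=1, keepdims=True)                    # normalise over clusters

def target_distribution(t):
    """s_ij of Eq. (6): square t and divide by the soft cluster frequency."""
    w = t ** 2 / t.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(s, t, eps=1e-12):
    """L_C = KL(S || T) of Eq. (4); eps only guards against log(0)."""
    return float(np.sum(s * np.log((s + eps) / (t + eps))))

# Toy embedding: two well-separated groups of five patients in a 2-D hidden space.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),
               rng.normal(3.0, 0.3, size=(5, 2))])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])      # cluster centers, e.g. from k-means

t = soft_assignment(Z, mu)
s = target_distribution(t)
loss = clustering_loss(s, t)
```

Minimizing this loss pulls the soft assignments toward the sharper target distribution, so confidently assigned patients dominate the update of the shared representation.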

The risk prediction task is used to evaluate breast cancer prognoses. Here, $S\left(t\right)=Pr(T\ge t)$ is the probability that a patient survives at least until time $t$. The time interval $T$ is the time elapsed between data collection and the patient’s last contact (the end of the experiment/patient death). The risk function at t can be given as follows:

$$\lambda \left(t\right)=\underset{\delta \to 0}{{\text{lim}}}\frac{{\text{Pr}}(t\le T<t+\delta |T\ge t)}{\delta }$$

(7)

The loss function of the outcome prediction task can be expressed as Eq. (8) [14]:

$${L}_{P}=-\sum_{i}\left({h}_{\theta }\left({x}_{i}\right)-\log \sum_{j\in \mathfrak{R}\left({T}_{i}\right)}\exp \left({h}_{\theta }\left({x}_{j}\right)\right)\right)$$

(8)

where UISNet updates ${h}_{\theta }\left(x\right)$ according to the weights $\theta$, and $\mathfrak{R}\left({T}_{i}\right)$ represents the risk set of the breast cancer patients that are still alive at time point ${T}_{i}$.
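A minimal NumPy sketch of the loss in Eq. (8) is given below. It assumes, as is standard for the Cox partial likelihood used in [14], that the outer sum runs only over patients with an observed event; the function name and the three-patient toy data are hypothetical.

```python
import numpy as np

def cox_partial_likelihood_loss(risk, time, event):
    """Negative log partial likelihood, summed over patients with an observed event.

    risk  : predicted log-risk h_theta(x_i), shape (n,)
    time  : follow-up time T_i, shape (n,)
    event : 1 if the event was observed, 0 if censored, shape (n,)
    """
    risk, time, event = (np.asarray(a) for a in (risk, time, event))
    loss = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]                         # risk set R(T_i)
        log_denominator = np.log(np.sum(np.exp(risk[at_risk])))
        loss -= risk[i] - log_denominator
    return float(loss)

# Toy data: three patients, the last one censored.
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 1, 0])

loss_flat = cox_partial_likelihood_loss([0.0, 0.0, 0.0], time, event)
loss_concordant = cox_partial_likelihood_loss([2.0, 1.0, 0.0], time, event)
loss_discordant = cox_partial_likelihood_loss([0.0, 1.0, 2.0], time, event)
```

The loss is smallest when patients who fail early receive the highest predicted risk, which is why `loss_concordant` is lower than `loss_discordant` in this toy case.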

By integrating the feature dimensionality reduction, patient heterogeneity clustering, and cancer outcome prediction tasks into a unified framework, the total loss of UISNet can be given as follows:

$${L}_{UISNet}=\gamma {L}_{D}+\beta {L}_{C}+{L}_{P}$$

(9)

$\gamma$ and $\beta$ can balance the importance of these tasks, which can be seen as the hyperparameters chosen by the cross-validation step. In this study, the value of $\gamma$ was set to 1, and $\beta$ was set to 10.

The uncertainty-based integrated gradients algorithm

In [21], the gradients of the output y with respect to the input x $(W=\partial y/\partial x)$ were used to quantify the importance of each gene to cancer prognosis. However, computing the gradients in a deep neural network may lead to gradient saturation. To interpret the feature importance of the prediction model, UISNet calculates the gradients at inputs scaled by $\alpha$ along the path from the baseline ${x}^{\prime}$ (all zeros) to the input x, and uses the Gauss-Legendre quadrature to approximate the integral of these gradients:

$$IG\left({x}_{i}\right):=\left({x}_{i}-{x}_{i}^{\prime}\right)\times {\int }_{\alpha =0}^{1}\frac{\partial F\left({x}^{\prime}+\alpha \times (x-{x}^{\prime})\right)}{\partial {x}_{i}}d\alpha$$

(10)
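Eq. (10) with Gauss-Legendre quadrature can be sketched as follows. The toy model $F(x)=\sigma(w\cdot x)$ with an analytic gradient is only a stand-in for the trained network, and the names `integrated_gradients`, `grad_fn`, and `n_nodes` are assumptions made for this illustration.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, n_nodes=20):
    """Approximate Eq. (10) with an n_nodes-point Gauss-Legendre rule on [0, 1]."""
    x = np.asarray(x, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(x)                  # zero baseline, as in the text
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    alphas = 0.5 * (nodes + 1.0)                     # map nodes from [-1, 1] to [0, 1]
    weights = 0.5 * weights                          # rescale weights accordingly
    path_integral = np.zeros_like(x)
    for alpha, weight in zip(alphas, weights):
        path_integral += weight * grad_fn(baseline + alpha * (x - baseline))
    return (x - baseline) * path_integral

# Toy differentiable model standing in for the network: F(x) = sigmoid(w . x).
w = np.array([1.5, -2.0, 0.5])

def F(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def grad_F(x):
    s = F(x)
    return s * (1.0 - s) * w                         # analytic gradient of F

x = np.array([1.0, 0.5, 2.0])
ig = integrated_gradients(grad_F, x)
```

A useful sanity check is the completeness property of integrated gradients: the attributions sum to $F(x)-F(x^{\prime})$, so `ig.sum()` should match `F(x) - F(np.zeros(3))` up to quadrature error.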

Nevertheless, the evaluation results given by the integrated gradients algorithm are not always reliable. Calculating the uncertainty of the predictions can enable the reliability of the results to be judged. Bayesian neural networks have been designed to quantify the uncertainty of results, but due to the large number of required computations, Monte Carlo dropout and Gaussian distribution models are often used as approximate solutions for Bayesian neural networks. Compared to Gaussian distribution models, Monte Carlo dropout can better approximate Bayesian neural networks by using the dropout term as one regularization term, for calculating the uncertainty of the results [33]. The objective function for using L2 regularization in Monte Carlo dropout can be expressed as:

$${l}_{MC}:=\frac{1}{n}{\sum }_{i=1}^{n}E\left({y}_{i},{\widehat{y}}_{i}\right)+\lambda {\sum }_{l=1}^{L}\left({\Vert {W}_{l}\Vert }_{2}^{2}+{\Vert {b}_{l}\Vert }_{2}^{2}\right)$$

(11)

where L is the number of layers in the deep neural network, ${W}_{l}$ and ${b}_{l}$ are the weight matrix and bias of layer l, and ${y}_{i}$ and ${\widehat{y}}_{i}$ are the target and the output of the network, respectively. By following [34], we trained UISNet with different dropout settings at T inference times as the Monte Carlo dropout strategy. Supposing that $lg{x}_{i}^{t}$ represents the integrated gradients importance of the ith node at the tth inference, the uncertainty of the ith feature is designed as:

$$U\left({x}_{i}\right)=\frac{std\left(lg{x}_{i}^{1},\dots ,lg{x}_{i}^{T}\right)}{ave\left(lg{x}_{i}^{1},\dots ,lg{x}_{i}^{T}\right)}$$

(12)

where std(*) and ave(*) are the standard deviation and average values of $lg{x}_{i}$ at T inference times, respectively. The importance weight of the gene feature in the network is expressed as:

$$V\left({x}_{i}\right)=\left(1-{U}^{\prime}\left({x}_{i}\right)\right)\times IG\left({x}_{i}\right)$$

(13)

where ${U}^{\prime}\left({x}_{i}\right)$ is the adjusted uncertainty value of $U\left({x}_{i}\right)$ after log transformation and min-max normalization. In summary, the UISNet algorithm is given as follows:
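As a rough illustration of the uncertainty-weighting step alone (not the authors' full algorithm), Eqs. (12) and (13) could be sketched as below. Several details are assumptions: the absolute value of the average is used in the denominator so that the logarithm is defined, a small `eps` avoids division by zero, and the average attribution over the T passes is used as $IG(x_i)$ in Eq. (13).

```python
import numpy as np

def uncertainty_weighted_importance(ig_runs, eps=1e-12):
    """ig_runs: (T, p) array; row t holds the integrated gradients of pass t."""
    ig_runs = np.asarray(ig_runs, dtype=float)
    mean_ig = ig_runs.mean(axis=0)
    std_ig = ig_runs.std(axis=0)
    U = std_ig / (np.abs(mean_ig) + eps)             # Eq. (12), assumed |ave| in denominator
    log_U = np.log(U + eps)                          # log transformation
    U_adj = (log_U - log_U.min()) / (log_U.max() - log_U.min() + eps)  # min-max to [0, 1]
    V = (1.0 - U_adj) * mean_ig                      # Eq. (13), assumed IG = mean over passes
    return U, U_adj, V

# Toy attributions from T = 4 stochastic passes for p = 3 features.
ig_runs = np.array([[1.0, 1.0, 0.5],
                    [1.1, 0.2, 0.6],
                    [0.9, 1.8, 0.4],
                    [1.0, 1.0, 0.5]])
U, U_adj, V = uncertainty_weighted_importance(ig_runs)
```

In this toy case the first feature is the most stable and keeps its full attribution, while the second feature, whose attribution varies most across passes, is down-weighted to almost zero.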

Performance evaluations and parameter selection

In this study, the cancer outcome prediction performances of different methods were compared through the C-index and |log10(P)| values. The C-index value is the fraction of all pairs of patients whose predicted outcomes are correctly ordered based on Harrell’s C statistics [35]. A higher |log10(P)| value indicates more significant survival differences between the patient subgroups divided based on the predicted risks (Additional file 1).
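Both metrics can be sketched in plain Python and NumPy. The function names and toy data are hypothetical, the C-index follows Harrell's pairwise definition with tied risks counted as one half, and the P value is assumed to come from a two-group log-rank test with one degree of freedom (the text does not name the test here).

```python
import math
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C: fraction of comparable pairs whose predicted risks are correctly ordered."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                          # a pair is anchored on an observed event
        for j in range(n):
            if time[i] < time[j]:             # patient i failed before j was last seen
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5         # ties in predicted risk count as half
    return concordant / comparable

def abs_log10_logrank_p(time, event, group):
    """|log10(P)| of a two-group log-rank test (group is 1 for high risk, 0 for low risk)."""
    time, event, group = (np.asarray(a) for a in (time, event, group))
    observed = expected = variance = 0.0
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    p = math.erfc(math.sqrt(chi2 / 2.0))      # chi-square survival function, 1 d.o.f.
    return abs(math.log10(p))

# Toy cohort of eight patients, all with observed events.
time = [1, 2, 3, 4, 5, 6, 7, 8]
event = [1, 1, 1, 1, 1, 1, 1, 1]
ci_perfect = concordance_index(time, event, risk=[8, 7, 6, 5, 4, 3, 2, 1])
ci_reversed = concordance_index(time, event, risk=[1, 2, 3, 4, 5, 6, 7, 8])
separated = abs_log10_logrank_p(time, event, group=[1, 1, 1, 1, 0, 0, 0, 0])
mixed = abs_log10_logrank_p(time, event, group=[1, 0, 1, 0, 1, 0, 1, 0])
```

A grouping that cleanly separates early from late events yields a larger |log10(P)| than an interleaved grouping, which is the behaviour the metric is meant to reward.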

The parameter list in this study is given below. The number of nodes in hidden layer 1 was set to 1000, and the number of nodes in hidden layer 2 was set to 500. The number of nodes Z in hidden layer 3 was selected from [10, 20, 50]. The learning rate (LR) was selected from [1e-6, 1e-7, 1e-8], and the maximum number of epochs in the neural network was set to 2000. The optimal parameters were selected by fivefold cross-validation. To assess the robustness of our method, Supplementary Table S1 in Additional file 1 shows the deviation obtained by UISNet in fivefold and tenfold cross-validation.

Results

Method comparison

The UISNet model was evaluated by the C-index (CI) in predicting the outcome of the patients in seven breast cancer datasets. The average CI values are given in Table 2. UISNet was compared with five methods: the adaptive Lasso-penalized Cox model (IPF-lasso), the integrated learning-based Cox method (EXSA), the deep survival network (Deep_surv), the denoising autoencoder-based Cox model (DCAP), and the deep survival network with a meta-learning framework (MTC). As shown in Table 2, UISNet achieved an average CI of 0.691 across the seven datasets, which is higher than the CI values of 0.619, 0.638, 0.653, 0.665, and 0.677 achieved by IPF-lasso, EXSA, Deep_surv, DCAP, and MTC, respectively. A t-test performed on the results obtained from UISNet and the other methods indicated that the improvements of our method over the alternative approaches are significant.

Table 2 The CI values obtained by different methods on breast cancer datasets

When we divided the patients into subgroups according to the estimated prognosis risks, UISNet achieved an average |log10(P)| = 1.608 across the seven datasets, which is higher than the |log10(P)| value achieved by IPF-lasso (1.036). Meanwhile, UISNet performed better than the other four compared methods (Table 3, average |log10(P)| = 1.167). Additionally, we show the average time-dependent AUC scores [36] in Fig. 2. By testing on the eight breast cancer datasets, UISNet achieved the highest AUC score of 0.676 among the compared methods (IPF-lasso = 0.623, EXSA = 0.634, Deep_surv = 0.653, DCAP = 0.660, MTC = 0.670).

Table 3 The |log10(P)| values produced by different methods on breast cancer datasets

Parameter sensitivity analysis

To evaluate the effects of the hyperparameters on the prediction of UISNet, we examined the CI values obtained on BRCA and BRCA_all with different parameter combinations (Fig. 3). The number of nodes in hidden layer 3 was selected from [100, 50, 20, 10], while the learning rate was selected from [1e-6, 1e-7, 5e-7, 1e-8]. By comparing the standard deviations of the CI values while one parameter was fixed, we found that the effect of the node size in the network was relatively small (std = 0.010) and lower than that of the learning rate (std = 0.028). Nevertheless, it is difficult to determine the optimal parameter combinations in different datasets. In this study, we used fivefold cross-validation to select suitable hyperparameters of UISNet in model training.

Ablation experiment

To evaluate the contribution of each task in UISNet to cancer outcome prediction, we compared the performance achieved by excluding different tasks from the framework. As shown in Fig. 4, when only the single prediction task was used to construct the neural network (-DRSS), the DNN framework achieved an average CI value of 0.610, which is 6.87% lower than that obtained by UISNet. Excluding the clustering task (-SS) from UISNet decreased the CI value from 0.655 to 0.631 (−3.66%), and a smaller decrease was caused by the exclusion of the dimensionality reduction task (-DR, CI = 0.642, −1.98%). These results indicated that the clustering task provides more useful information than the dimensionality reduction task, and the prediction accuracy benefits from integrating these tasks into a unified framework.

Independent tests

To further validate the performance of UISNet, we conducted independent tests by separating each breast cancer dataset from the integrated dataset BRCA_all as an independent test dataset (Fig. 5). The results indicated that the CI values obtained by UISNet in the independent tests are, on average, higher than 0.659. The Kaplan-Meier survival curves illustrate the significant (P < 0.05) difference in survival between the two patient subgroups classified by UISNet. All these results support the robustness of our method.

Feature interpretability evaluation

To evaluate the feature interpretability of UISNet, we compared UISNet with the method without the uncertainty strategy (IG) and with differential expression analysis (DEA) on the BRCA_all dataset. The top 200 genes sorted based on the importance weights given by UISNet are shown in Fig. 6a. The IG values of these genes were calculated by Eq. (10). The result shows that UISNet selected some important genes, such as MAPK1, AKT1, and RAF1, that have been shown to be related to breast cancer prognosis. Figure 6b shows the top 200 genes ranked based on the |log(fold change)| values produced by DEA (adjusted p-values < 0.05), and the gene expression heatmap of these genes is shown in Fig. 6c. The names of the top 20 genes selected by UISNet and DEA are additionally annotated in Fig. 6a and b, respectively.

The Venn diagram in Fig. 6d shows the number of overlapping genes selected by different methods. The figure indicates that there are 92 overlapping genes selected by IG and UISNet, while the number of common genes selected by DEA and UISNet is only four. Additionally, the breast cancer outcome prediction performance achieved by using different numbers of selected gene features based on DEA, IG, and UISNet is shown in Fig. 6d. When the top 5 gene features were used, DEA and IG achieved lower CI values (0.559 and 0.542) than UISNet (0.565). When the number of used genes was 200, UISNet obtained the highest CI value of 0.652. The results demonstrate that, compared with IG and DEA, UISNet can find genes that have a greater impact on the prognosis of breast cancer.

Biointerpretability assessment

According to the importance weights (IWs) given by UISNet, we selected the top 20 genes for further analysis (Fig. 7a). Specifically, 17 of the selected genes have been validated to be associated with breast cancer. AKT1 encodes one of three human AKT serine-threonine protein kinase family members, and mutations in AKT1 are linked to breast cancer cell growth [37]. PTEN can negatively regulate intracellular phosphatidylinositol-3,4,5-triphosphate and exerts a tumor suppression effect by negatively regulating the PI3K-AKT signaling pathway [38]. In Lama’s study, the molecular changes in MAPK1 lead to overexpression of matrix metallopeptidase, which is associated with poor prognosis in breast cancer patients [39]. The upregulation of PRKCA has been found to be linked with resistance to antiestrogen treatment and the aggressive nature of tumors. PRKCA serves as a pivotal signaling hub and a potential therapeutic target in breast cancer stem cells, which exhibit comparable cell surface marker profiles to those observed in TNBC [40]. CDC23, regulated by miR-34c, may be responsible for miR-34c-induced cell cycle arrest, where miR-34c can induce G2/M cell cycle arrest in breast cancer cells [41]. UBA52 has been reported to be potentially associated with the development of resistance to Lapatinib in breast cancer treatment [42].

In addition, reduced expression of the RAF1 kinase inhibitor protein has been found to be associated with breast cancer metastasis [43]. The regulation of CDKN1A gene expression by LRH-1 influences the proliferation of breast tumor cells [44]. Downregulation of IKBKB expression by MicroRNA-16 enhances the sensitivity of breast cancer cells to paclitaxel treatment [45]. PPP2CA has been found to promote the proliferation and invasion of breast cancer cells [46]. Overexpression of PSMB4 promotes the proliferation and survival of breast cancer cells, leading to an unfavorable prognosis [47]. Previous studies revealed a significant upregulation in AP1M2, PSMD10, and RPL2 expression in breast cancer tissue compared to normal tissues [48,49,50]. RPS27A is overexpressed in breast cancer, enhancing EBV-encoded LMP1-mediated proliferation and invasion through stabilization of LMP1 [51]. The RPS4X protein was identified as a potential biomarker for controlling cisplatin resistance in breast cancer treatment [52]. Although functional studies have not directly implicated MAP2K1, POLR2J, and SEH1L in breast cancer development and progression, they have been associated with other malignancies [53,54,55]. The indications from our model imply that they could potentially emerge as targets for breast cancer.

We further performed KEGG enrichment analysis on these 20 genes (Fig. 7b), and found that they are enriched in many signaling pathways related to the occurrence and development of breast cancer, such as the mTOR and PI3K-AKT signaling pathways. The downstream transcription factors of the mTOR signaling pathway (with the highest enrichment score) include HIF1α, c-Myc, FoxO, and other important cancer regulatory molecules [56]. It has been proven that Paclitaxel can modulate the proliferation and migration of breast cancer cells via the mTOR signaling pathway [57]. The PI3K-AKT signaling pathway is a crucial component of many signaling pathways involving membrane-bound ligands, which are crucial for the survival of tumor cells [58]. The enrichment of the EGFR tyrosine kinase inhibitor resistance pathway suggests that the identified genes may affect tyrosine kinase inhibitor resistance in breast cancer treatment. The dysregulated activation of the ErbB signaling pathway plays a critical role in regulating cell growth, differentiation, and survival in breast cancer. Moreover, it is closely associated with tumor initiation, progression, and metastasis [59]. The PD-L1 expression and PD-1 checkpoint pathway in breast cancer is closely related to immune regulation and tumor evasion from immune surveillance, and plays a key role in the regulation of immune responses [60]. The activation of the HIF-1 pathway in breast cancer is intricately associated with tumorigenesis, disease progression, and acquisition of treatment resistance [61]. Additionally, the identified genes are enriched in several breast cancer-related KEGG pathways, including the VEGF and Ras signaling pathways, the breast cancer pathway, pathways in cancer, and microRNAs in cancer. These findings demonstrate that UISNet can construct a breast cancer outcome prediction model with enhanced interpretability for biomedical applications.

Discussion

Although our method provides model biological interpretability while improving the prediction accuracy, there are still some questions worth discussing, as described below. Firstly, the high censoring rates (12.17–87.03%) in the breast cancer data affected the calculation of the true survival proportion and decreased the performance of our method. Secondly, previous studies have shown that multi-omics integration is helpful for improving cancer outcome prediction performance. Expanding UISNet to integrate multi-omics data in an interpretable manner will be a potential way to improve the prediction performance of the model.

In the future, we will update our interpretable method by incorporating more medical information, including DNA methylation and slide images. Additionally, we want to design an effective strategy to estimate the true survival times of censored patients to reduce the adverse impact of high censoring rates on model training. Furthermore, considering the molecular-level similarities observed between gynecologic and breast tumors [62], we will utilize the UISNet model to investigate gynecologic cancers such as cervical cancer and ovarian cancer, with the aim of identifying potential pan-gynecologic-cancer related biomarkers for effective therapeutic interventions.

Conclusions

DL-based methods have been proven to achieve accurate performance in cancer outcome prediction cases. Nevertheless, the lack of model interpretability limits the applicability of these methods. To address this challenge, we proposed a novel uncertainty-based interpretable deep neural network called UISNet for breast cancer outcome prediction. UISNet provides interpretable solutions by computing the integrated gradients of features with an uncertainty-based strategy. Furthermore, it improved model performance by introducing prior biological pathway knowledge and utilizing patient heterogeneity information. The experimental results show that UISNet achieved a 5.79% higher CI value than the compared state-of-the-art methods on average. Based on the feature interpretation results of the prediction model, 11 of the 20 identified genes have been proven to be associated with breast cancer. The comprehensive tests indicated that our proposed method is accurate and robust, and is an effective way to identify cancer-related genes. In summary, we believe that UISNet is a valuable and meaningful foundation for further cancer prognosis prediction research.

Availability of data and materials

All the data analyzed are downloaded from GEO (https://www.ncbi.nlm.nih.gov) and TCGA (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The method codes are available at https://github.com/chh171/UISNet.

Acknowledgements

We thank Mr. Weizhen Deng of Foshan University for supporting our study with data collection.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62201150), the Jihua Laboratory scientific project (X210101UZ210), and the project "Research on in situ mass spectrometry for analyzing complex traditional Chinese medicine systems" (2021ZDZX2060).

Author information

Hua Chai and Siyin Lin contributed equally to this article as co-first authors.

Authors and Affiliations

School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
Hua Chai, Junqi Lin, Minfan He & Yongzhong OuYang
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
Siyin Lin & Yuedong Yang
Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 510000, China
Huiying Zhao

Contributions

HC and YY conceived the study. HC, SL, and JL performed the data analysis. HC, SL, YO, and HZ interpreted the results. HC and MH wrote the manuscript.

Corresponding authors

Correspondence to Yongzhong OuYang or Huiying Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Table S1. The deviation obtained by UISNet in 5-fold cross validation and 10-fold cross validation.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chai, H., Lin, S., Lin, J. et al. An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome. BMC Bioinformatics 25, 88 (2024). https://doi.org/10.1186/s12859-024-05716-7

Received: 27 November 2023
Accepted: 21 February 2024
Published: 29 February 2024
DOI: https://doi.org/10.1186/s12859-024-05716-7
