
Occlusion enhanced pan-cancer classification via deep learning

Abstract

Quantitative measurement of RNA expression levels through RNA-Seq is an ideal replacement for conventional cancer diagnosis via microscope examination. Currently, cancer-related RNA-Seq studies focus on two aspects: classifying the status and tissue of origin of a sample and discovering marker genes. Existing studies typically identify marker genes by statistically comparing healthy and cancer samples. However, this approach overlooks marker genes with low expression level differences and may be influenced by experimental results. This paper introduces “GENESO,” a novel framework for pan-cancer classification and marker gene discovery using the occlusion method in conjunction with deep learning. We first train a baseline deep LSTM neural network capable of distinguishing the origins and statuses of samples from RNA-Seq data. Then, we propose a novel marker gene discovery method called “Symmetrical Occlusion (SO)”. It works with the baseline LSTM network, mimicking the “gain of function” and “loss of function” of genes to quantitatively evaluate their importance in pan-cancer classification. By identifying the genes of utmost importance, we then isolate them to train new neural networks, resulting in higher-performance LSTM models that utilize only a reduced set of highly relevant genes. The baseline neural network achieves a validation accuracy of 96.59% in pan-cancer classification. With the help of SO, the accuracy of the second network reaches 98.30%, while using 67% fewer genes. Notably, our method excels in identifying marker genes that are not differentially expressed. Moreover, we assessed the feasibility of our method using single-cell RNA-Seq data, employing known marker genes as a validation test.


Introduction

Cancer continues to be a major global health challenge and a leading cause of mortality, driving extensive research efforts to advance diagnosis and treatment techniques [1]. Cancer detection aims to accurately classify the tumor types within a sample, even in the early stages of the disease, and to identify specific markers for each cancer type. However, conventional clinical methods for cancer detection, relying on manual microscopic examination of morphological tumor characteristics, are labor-intensive, require extensive training, and are prone to human error [2].

In comparison, pan-cancer classification using Next-Generation Sequencing (NGS) holds great clinical potential in transcending traditional cancer diagnosis methods. It can not only contribute to the adoption of personalized medicine but may also reveal undiscovered connections between tumor types.

Pan-cancer classification methods that analyze Differentially Expressed Genes (DEGs) and use statistical algorithms with NGS data may fail to capture the interactions among genes that exhibit subtle variations in expression levels [3]. Beyond DEG analysis, several studies have adopted machine learning approaches, including k-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machine (SVM), and Multi-task learning, for pan-cancer classification [4,5,6]. These studies also utilize gene screening methods such as ANOVA tests, Principal Component Analysis (PCA), and autoencoders. However, there remains a significant challenge in the interpretability of marker gene selection, particularly in understanding and quantifying the impact of individual genes across different classes of pan-cancer classifications.

This paper introduces a novel framework, “GENESO,” designed for pan-cancer classification and marker gene discovery. GENESO utilizes deep learning techniques, particularly the symmetrical occlusion method, to analyze NGS RNA-Seq data from clinical samples. By employing symmetrical occlusion, GENESO not only predicts the status and tissue of origin of the samples but also assesses the significance of each gene in the classification process.

Method

Dataset preprocessing

As tumors can vary significantly, with different cells within the same tumor or among tumors of the same type having distinct characteristics, we gathered RNA-Seq data from various public sources to ensure a comprehensive dataset (detailed in Supplementary File 1). This dataset includes paired tumor and normal samples, along with some rare tumors. For example, Petrini et al.’s dataset comprises 286 thymus tumor samples from patients with diverse characteristics [7]. Additionally, RNA-Seq data on rare tumors such as paraganglioma and sarcoma were collected from Snezhkina et al. and Lesluyes et al., respectively [8, 9].

To organize the dataset, we categorized the RNA-Seq data based on the tissue of origin and further divided it according to the status of each sample (e.g., cancer, normal), resulting in 28 classes. We chose not to create finer categories to ensure sufficient samples in each class.

The RNA-Seq data were aligned against the human reference genome GRCh38 using “Bowtie2”, and “featureCounts” was used to quantify the number of mapped reads for each gene. Genes on sex chromosomes without coverage across all classes were excluded from analysis [10, 11]. No batch effect correction was applied to avoid introducing bias.

Gene expression level normalization

Various methods have been employed for the normalization of read count data in previous studies, with the most common approaches being “Transcripts Per Million” (TPM) and “Reads Per Kilobase per Million mapped reads” (RPKM). These methods are designed to mitigate the bias introduced by gene length by taking transcript length into account and multiplying the result by a constant, as shown in Eq. 1.

$$\begin{aligned} & RPKM = 10^{9} \times \frac{{{\text{Reads mapped to gene}}}}{{{\text{Total reads}} \times {\text{Gene length}}}} \\ & TPM = 10^{6} \times \frac{{\frac{{{\text{Reads mapped}}}}{{{\text{Gene length}}}}}}{{\sum {\frac{{{\text{Reads mapped}}}}{{{\text{Gene length}}}}} }} \\ & NRC = \frac{{{\text{Reads mapped}}}}{{{\text{Total mapped reads}}}} \\ \end{aligned}$$
(1)

However, neither TPM nor RPKM is suitable for pan-cancer classification using neural networks. Instead, we employ “Normalized Counts” (NRC), which removes both the constant and the gene length from the normalization process. There are three reasons why NRC is more suitable for neural network processing:

  1. Redundant constant number: The Z-score normalization process within the neural network renders constants in the equations of TPM and RPKM unnecessary.

  2. Suitability for cross-sample comparisons: Extensive research has shown that TPM and RPKM are more appropriate for comparing transcript expression within a single sample rather than facilitating cross-sample comparisons [12,13,14,15].

  3. Gene length influence in neural network: During neural network prediction, genes are compared across multiple samples. Consequently, the influence of gene length remains consistent throughout all samples, enabling us to remove it without compromising classification performance.
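The three normalizations in Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the toy read counts, and the gene lengths are all assumptions made for the example, and "total reads" is taken to be the sum of the counts passed in.

```python
def rpkm(counts, lengths_bp):
    """RPKM per Eq. 1: 1e9 * reads / (total reads * gene length in bp)."""
    total = sum(counts)
    return [1e9 * c / (total * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM per Eq. 1: length-scaled rates rescaled to sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    denom = sum(rates)
    return [1e6 * r / denom for r in rates]


def nrc(counts):
    """NRC per Eq. 1: reads divided by total mapped reads, no length or constant."""
    total = sum(counts)
    return [c / total for c in counts]


# Toy sample with three genes (fictional numbers).
counts = [100, 300, 600]
lengths = [1000, 2000, 3000]

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
nrc_values = nrc(counts)
```

NRC values always sum to 1 within a sample and TPM values to one million, whereas the RPKM total depends on the length distribution of the expressed genes.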

Neural network architecture

The LSTM (Long Short-Term Memory) cell, as depicted in Eq. 2, serves as the fundamental unit of the LSTM layer. It takes three inputs and produces two outputs: the sequence input \(x_t\), the cell state input from the previous LSTM cell \(c_{t-1}\), and the hidden state input from the previous LSTM cell \(h_{t-1}\) are the inputs, while the hidden state \(h_t\) and the cell state \(c_t\) are the outputs. In simple terms, the hidden state \(h_t\) retains short-term memory, while the cell state \(c_t\) preserves long-term or global memory.

$$\begin{aligned} & i_{t} = \sigma (W_{xi} x_{t} + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_{i}) \\ & f_{t} = \sigma (W_{xf} x_{t} + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_{f}) \\ & c_{t} = f_{t} c_{t-1} + i_{t} \tanh (W_{xc} x_{t} + W_{hc} h_{t-1} + b_{c}) \\ & o_{t} = \sigma (W_{xo} x_{t} + W_{ho} h_{t-1} + W_{co} c_{t} + b_{o}) \\ & h_{t} = o_{t} \tanh (c_{t}) \\ \end{aligned}$$
(2)
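To make the data flow of Eq. 2 concrete, the following sketch evaluates a single LSTM step with scalar weights. It is an illustration only: the dictionary keys (`"xi"`, `"hi"`, `"ci"`, and so on) are hypothetical names for the weights in Eq. 2, real layers use weight matrices and vectors, and the parameter values are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One scalar LSTM step following the five lines of Eq. 2."""
    # Input and forget gates see the previous cell state c_{t-1}.
    i_t = sigmoid(W["xi"] * x_t + W["hi"] * h_prev + W["ci"] * c_prev + b["i"])
    f_t = sigmoid(W["xf"] * x_t + W["hf"] * h_prev + W["cf"] * c_prev + b["f"])
    # New cell state: keep part of the old state, add gated candidate.
    c_t = f_t * c_prev + i_t * math.tanh(W["xc"] * x_t + W["hc"] * h_prev + b["c"])
    # Output gate sees the updated cell state c_t.
    o_t = sigmoid(W["xo"] * x_t + W["ho"] * h_prev + W["co"] * c_t + b["o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# With all weights and biases at zero, every gate equals 0.5.
zero_W = {k: 0.0 for k in ("xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co")}
zero_b = {k: 0.0 for k in ("i", "f", "c", "o")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, W=zero_W, b=zero_b)
```

In this degenerate case the cell state is simply halved (`c_t = 0.5`) and the hidden state is `0.5 * tanh(0.5)`, which shows how the forget and output gates scale the two memories.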

As illustrated in Fig. 1, the initial layer of the neural network serves as the input layer, performing Z-score normalization on the input data. This normalization step is critical for achieving optimal network performance, enabling unbiased comparison of gene expression levels across samples regardless of scale differences.
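A small sketch of the Z-score step is given below. It assumes the normalization is applied per gene across samples, which the text does not state explicitly, and the function name and toy matrix are hypothetical. The example also shows why a constant multiplier in the normalization formula becomes redundant: scaling every value by the same factor leaves the Z-scores unchanged.

```python
import statistics


def zscore_per_gene(matrix):
    """Z-score each gene (column) across samples (rows).

    Assumption: population standard deviation; genes with zero variance map to 0.
    """
    n_genes = len(matrix[0])
    out = [[0.0] * n_genes for _ in matrix]
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mu = statistics.fmean(column)
        sd = statistics.pstdev(column)
        for s, row in enumerate(matrix):
            out[s][g] = (row[g] - mu) / sd if sd > 0 else 0.0
    return out


# Three samples, two genes (fictional values).
raw = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = [[v * 1e6 for v in row] for row in raw]  # same data times a constant

z_raw = zscore_per_gene(raw)
z_scaled = zscore_per_gene(scaled)
```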

Following the input layer, the neural network comprises two LSTM layers: the first with 120 cells and the second with 80 cells. This architecture allows the first layer to abstract normalized inputs, extract high-level information, and outperform convolutional neural networks with multiple convolutional layers [16].

To prevent overfitting and facilitate the identification of general patterns in the dataset, a dropout layer with a dropout probability of 30% is applied after each LSTM layer [17].

At the end of the neural network, a “Fully Connected” (FC) layer connects the LSTM layer and the output layer, generating predicted labels. The softmax and classification layer then produce the prediction labels based on the FC layer output, identifying the label with the highest prediction score.
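The final softmax and classification step can be illustrated as follows. This is a generic sketch of softmax followed by an arg-max, not the authors' implementation; the logits and class names are invented for the example.

```python
import math


def softmax(logits):
    """Convert FC-layer outputs into prediction scores that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, class_names):
    """Return the label with the highest prediction score and that score."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores[best]


# Hypothetical FC outputs for three of the classes.
names = ["breast cancer", "breast normal", "lung cancer"]
label, score = predict([3.0, 1.0, 0.5], names)
```

The prediction scores produced here are the quantities whose changes the occlusion method later tracks.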

Fig. 1 Overview of neural network architecture

Marker gene identification

Overview

To identify marker genes and quantitatively evaluate their significance for pan-cancer classification, a novel algorithm named “Symmetrical Occlusion” (SO) is introduced. This method draws inspiration from “CNN occlusion” (CO), a feature identification technique used in neural network-based image classification, but it overcomes the limitations of CO in identifying signature genes.

SO assesses the significance of individual genes in pan-cancer classification using a neural network. It operates by mimicking both the “gain of function” and “loss of function” of genes. First, SO manipulates gene expression levels relative to the original values to create pseudo samples. Then, these pseudo samples are input into baseline neural networks to observe changes in the network’s output. These fluctuations are then used to quantify the importance of individual genes.

CNN occlusion method and its drawbacks

The CO method is widely used for identifying important regions within an image. The theory behind CO is that blocking or occluding a crucial region typically results in a sharp decrease in the prediction score, indicating the probability of the input image belonging to the corresponding class [18, 19]. This change can rapidly and accurately identify important regions for the neural network.

However, directly applying CO to marker gene identification faces challenges due to disparities in data structure between images and gene expression data. Unlike an image, which is a two-dimensional matrix, gene expression data is represented as a vector based on gene position, lacking the inherent spatial relationships present in pixel data. Moreover, gene expression ranges vary between samples, in contrast to the fixed pixel value range in images.

Symmetrical occlusion

Training of baseline neural network

The SO method employs a multi-step process to assess the importance of a specific gene in pan-cancer classification. Initially, a baseline LSTM neural network is trained using the entire dataset of genes (Fig. 2). This step establishes the baseline performance for pan-cancer classification which is essential for subsequent comparisons.

Fig. 2 A baseline LSTM model is trained first for the occlusion

“Gain of function” and “loss of function” simulation

Next, the importance of a specific gene in pan-cancer classification is quantified using SO by simulating its “gain of function” and “loss of function” effects.

In the “gain of function” simulation, the expression level of a gene in reference samples from the validation dataset is systematically occluded or replaced with increasing “occlusion values” in a step-wise manner, with each step being one-tenth of the original gene expression level. This process generates new pseudo samples with the gene replaced by new expression levels until the expression level exceeds twice the maximum value of the gene with the highest expression level.

Similarly, the “loss of function” simulation involves replacing the gene’s expression level with lower occlusion values than reference samples until the expression level reaches zero. This procedure is repeated for all reference samples across the 28 classes to enhance robustness. While it is biologically impossible for the gene to reach certain expression levels in some of the pseudo samples, the existence of these “imaginary” pseudo samples is a vital part of the simulation process.

For example, as illustrated in Fig. 3, during the “gain of function” simulation of gene BRCA1 on a sample where its expression level is 10 and the maximum expression level in this sample is 20, “gain of function” simulation would generate pseudo samples with BRCA1’s expression level being replaced by occlusion values ranging from 10.1 to 40. Likewise, “loss of function” would yield pseudo samples with decreasing occlusion values. Among these pseudo samples, those with biologically impossible gene expression levels are defined as “imaginary” pseudo samples, while the rest are categorized as “real” pseudo samples.
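The pseudo-sample generation can be sketched as below. The sketch follows the stated rule of a step equal to one-tenth of the original expression level, an upper bound of twice the highest expression level in the sample, and a lower bound of zero; the exact step granularity and the handling of genes with zero expression in the authors' code are not given in the text, so those details, along with all function names, are assumptions.

```python
def occlusion_values(original, sample_max, steps_per_level=10):
    """Return (loss_values, gain_values) for one gene in one reference sample."""
    if original <= 0:
        # Assumption: the text does not say how unexpressed genes are handled.
        raise ValueError("original expression level must be positive")
    step = original / steps_per_level
    upper = 2 * sample_max
    # Loss of function: step down from the original level, ending exactly at zero.
    loss = [original - k * step for k in range(1, steps_per_level)] + [0.0]
    # Gain of function: step up until twice the sample maximum is reached.
    gain, k = [], 1
    while True:
        value = original + k * step
        gain.append(value)
        if value >= upper:
            break
        k += 1
    return loss, gain


def make_pseudo_samples(sample, gene_index):
    """Copy the reference sample once per occlusion value, replacing one gene."""
    loss, gain = occlusion_values(sample[gene_index], max(sample))
    pseudo = []
    for value in loss + gain:
        copy = list(sample)
        copy[gene_index] = value
        pseudo.append(copy)
    return pseudo


# Fictional reference sample: the occluded gene has level 10, the sample maximum is 20.
reference = [10.0, 20.0, 5.0]
loss, gain = occlusion_values(reference[0], max(reference))
pseudo = make_pseudo_samples(reference, gene_index=0)
```

Only the occluded gene changes between pseudo samples; every other gene keeps its value from the reference sample, so any change in the network output can be attributed to that gene.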

Fig. 3 Generation of pseudo samples from a reference sample using symmetrical occlusion. The biologically impossible pseudo samples are categorized as “imaginary” pseudo samples. The gene expression levels in this figure are fictional

Quantification of gene importance

Pseudo samples are input into the baseline LSTM neural network which outputs prediction scores measuring the confidence of the neural network in classifying the samples. By combining the results from the “gain of function” and “loss of function” simulations, the influence of the occlusion value on prediction scores for all 28 classes can be visualized through a series of line plots.

The occlusion score, indicating the importance of a gene in a specific class, is determined by calculating the absolute difference between the maximum and minimum prediction scores for the corresponding class across pseudo samples generated from the same reference sample. Subsequently, the occlusion score is obtained by averaging the absolute values obtained from all reference samples. Finally, the mean occlusion score of a gene across all classes is calculated, signifying its importance in pan-cancer classification.
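The three averaging steps above can be written compactly. The notation here is introduced only for illustration and does not appear in the original text: \(P_g(r)\) is the set of pseudo samples generated from reference sample \(r\) by occluding gene \(g\), \(s_c(p)\) is the prediction score of class \(c\) for pseudo sample \(p\), \(R_c\) is the set of reference samples of class \(c\), and \(C = 28\) is the number of classes.

$$\begin{aligned} & d_{g,c}(r) = \left| \max_{p \in P_g(r)} s_c(p) - \min_{p \in P_g(r)} s_c(p) \right| \\ & O_{g,c} = \frac{1}{|R_c|} \sum_{r \in R_c} d_{g,c}(r) \\ & \bar{O}_g = \frac{1}{C} \sum_{c=1}^{C} O_{g,c} \\ \end{aligned}$$

Here \(O_{g,c}\) is the occlusion score of gene \(g\) in class \(c\), and \(\bar{O}_g\) is its mean occlusion score.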

Continuing the previous example, as illustrated in Fig. 4, a collection of prediction scores is obtained after passing the pseudo samples into the baseline neural network. For the “gain of function” simulation, an increase in the prediction score for the “breast cancer” class and a decrease in other classes such as “breast normal” is expected. After combining the results from “loss of function” simulation, the influence of occlusion value of gene BRCA1 on the reference sample can be visualized in a line plot.

Fig. 4 Visualization of the influence of symmetrical occlusion on gene BRCA1 by classifying the pseudo samples derived from reference sample using the baseline model and plotting the change in corresponding prediction score

Repeating the step on the reference samples from the remaining 27 classes, the influence of gene BRCA1 on prediction scores on reference samples for all 28 classes can be visualized through a series of plots. As shown in Fig. 5a, gene BRCA1 exhibits great influence on the prediction score of the “breast cancer” class, resulting in an absolute difference between the maximum and minimum of 0.15 for breast cancer in one of the reference samples. This is higher than the absolute difference of 0.05 in one of the reference samples in lung cancer, as shown in Fig. 5b. This suggests that gene BRCA1 is more important in breast cancer than in lung cancer.

Fig. 5 Comparison of prediction score differences in breast cancer (BRCA) and lung cancer (LUNC) after conducting symmetrical occlusion on gene BRCA1 in reference samples from corresponding classes. Higher differences in the prediction score suggest that gene BRCA1 is a more important gene in breast cancer

To enhance the robustness of the occlusion score, the difference between the maximum and minimum prediction scores from multiple reference samples is collected, and the mean is calculated and marked as the occlusion score. As illustrated in Fig. 6, the occlusion score of gene BRCA1 in breast cancer is 0.16, while its occlusion score in lung cancer is 0.048. By combining the occlusion scores of BRCA1 in other classes, the mean occlusion score of BRCA1, which measures its importance in pan-cancer classification, is 0.097.
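The aggregation from per-sample differences to a mean occlusion score can be sketched as follows. The function names and all prediction scores are made up for illustration, and only two classes are used rather than the 28 in the study, so the resulting numbers are not those reported above.

```python
def sample_difference(prediction_scores):
    """Absolute max-min difference over the pseudo samples of one reference sample."""
    return abs(max(prediction_scores) - min(prediction_scores))


def occlusion_score(scores_per_reference):
    """Mean of the per-reference-sample differences for one gene in one class."""
    diffs = [sample_difference(scores) for scores in scores_per_reference]
    return sum(diffs) / len(diffs)


def mean_occlusion_score(score_by_class):
    """Average the per-class occlusion scores of one gene."""
    return sum(score_by_class.values()) / len(score_by_class)


# Fictional prediction scores for pseudo samples of two reference samples per class.
breast = occlusion_score([[0.80, 0.95, 0.85], [0.70, 0.87, 0.75]])
lung = occlusion_score([[0.90, 0.95, 0.93], [0.88, 0.92, 0.90]])
overall = mean_occlusion_score({"breast cancer": breast, "lung cancer": lung})
```

With these toy inputs the gene scores higher in the breast cancer class than in the lung cancer class, mirroring the kind of comparison made in the text.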

Fig. 6 Calculation of occlusion score and the mean occlusion score of gene BRCA1

Neural network optimization

As illustrated in Fig. 7, after completing the steps in the previous section, a table summarizing the mean occlusion score of all genes is obtained. To enhance the pan-cancer classification accuracy of the LSTM neural network while using fewer genes, the genes in the summary table are initially ranked based on their mean occlusion score. Subsequently, new training datasets are generated by selecting different subsets of the top-ranking genes.

Using repeated fivefold cross-validation, new LSTM neural networks are trained on these datasets. The network exhibiting the highest validation accuracy is chosen as the final neural network. This approach facilitates the identification of a smaller subset of genes capable of achieving high classification accuracy, thereby improving efficiency and cost-effectiveness for future studies.
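The ranking and subset search can be outlined as below. This is a sketch under stated assumptions: `evaluate` is a hypothetical callable standing in for training an LSTM with repeated fivefold cross-validation and returning its validation accuracy, and the candidate fractions and gene names are invented.

```python
def top_fraction(mean_scores, fraction):
    """Return the top-ranking genes by mean occlusion score."""
    ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


def select_best_subset(mean_scores, fractions, evaluate):
    """Try each fraction and keep the subset with the highest accuracy.

    `evaluate(genes)` is a placeholder for cross-validated training.
    """
    best = None
    for fraction in fractions:
        genes = top_fraction(mean_scores, fraction)
        accuracy = evaluate(genes)
        if best is None or accuracy > best[1]:
            best = (genes, accuracy, fraction)
    return best


# Fictional mean occlusion scores for four genes.
scores = {"GENE_A": 0.30, "GENE_B": 0.10, "GENE_C": 0.20, "GENE_D": 0.05}


def dummy_evaluate(genes):
    # Stand-in for model training: pretends two genes is the sweet spot.
    return 1.0 - 0.1 * abs(len(genes) - 2)


best_genes, best_accuracy, best_fraction = select_best_subset(
    scores, fractions=[0.25, 0.5, 1.0], evaluate=dummy_evaluate
)
```

In a real run, the cost is dominated by `evaluate`, since each candidate subset requires training new networks.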

Fig. 7 Training of new neural networks using training datasets containing different combinations of genes ranked by their mean occlusion score to optimize the performance of the neural network

Result

Overview

In this study, we developed an LSTM neural network for pan-cancer classification and introduced the “Symmetrical Occlusion” algorithm to identify pivotal genes in classification while enhancing network performance with fewer genes. Our dataset comprised 3,524 samples classified into 28 classes, including 22 pairs of tumor and normal samples from the same tissue of origin. Through fivefold cross-validation, the network achieved an accuracy of 96.59% using NRC as a normalization method with the complete gene set. Conversely, the accuracy dropped to 91.84% and 89.53% when utilizing the complete gene sets of TPM and RPKM, respectively, indicating the superiority of NRC for gene expression quantification in pan-cancer classification.

As demonstrated in Fig. 8, sorting genes based on their mean occlusion scores led to a notable enhancement in validation accuracy, reaching 98.30% with only the top-ranking 33% of genes (Fig. 8a). Additionally, two example confusion matrices are provided in Supplementary Fig. 1. Moreover, by constructing a training dataset specific to top-ranking genes sorted by their occlusion scores in each class, we achieved validation accuracies ranging between 94% and 96% (Fig. 8b). These findings underscore the efficacy of our proposed framework in enhancing pan-cancer classification accuracy while reducing the gene count necessary for precise classification.

Fig. 8 Comparison of validation accuracy of neural networks using different gene selection strategies

Performance comparisons

Pan-cancer classification

In this paper, LSTM achieved the best performance in validation accuracy and prediction classes compared to previously reported works on pan-cancer classification. As shown in Table 1, Mostavi et al. achieved an accuracy of 92.5% by implementing various CNN networks featuring a vector of 7100 selected genes and a reshaped 100 \(\times\) 71 matrix as input [20]. Similarly, Zhao et al. constructed a network featuring 1D inception architecture and achieved an accuracy of 92.89%, while de Guia et al. and Khalifa et al. converted the gene expression vector into a matrix for CNN training and obtained accuracies of 95.65% and 96.90% [21,22,23].

Table 1 Performance comparison with existing methods

Additionally, compared to the CNN-based methods, our LSTM neural network can distinguish 28 classes of various tissues of origin and status, considerably more than the other methods. For example, Sun et al. implemented a model to distinguish normal samples and tumor samples without tissue of origin (binary classification) and achieved an accuracy of 96.0%, while their second model for classifying 11 tumor classes with tissue of origin achieved an accuracy of 98.6% [24]. Similarly, Khalifa et al. achieved an accuracy of 96.9% with five tumor classes and no normal class [22].

Metastasized cancer classification

To compare the classification performance against Sun et al., metastasized colorectal cancer samples from Kim et al. were utilized as a test dataset for direct comparison [25, 26]. Notably, our accuracy is significantly higher than that of the reported work using the same dataset. Our study achieved an accuracy of 88.89%, with 16 out of 18 metastasized colorectal cancer samples correctly classified as “colorectal cancer”. The remaining two samples were classified as “colorectal normal” and “liver normal”, respectively.

Identification of marker gene

Marker gene selection method comparison

Selecting appropriate marker genes is crucial in improving pan-cancer classification performance. We compared the performance of several published marker gene selection methods on the same LSTM network, using marker genes selected according to their original method. Additionally, we compared the occlusion score and tissue specificity entropy score and found that the occlusion method is superior to the entropy method in identifying pan-cancer marker genes.

In the study conducted by Mostavi et al., a fixed threshold was used to select genes with an FPKM mean or standard deviation above the threshold [20]. However, because this method relies on two criteria, the number of selected genes is difficult to control, so the selection was left unchanged. Consequently, a gene list containing 29,777 genes was used for preparing the LSTM neural network dataset. In Zhao et al.’s study, the top 40 genes with the highest difference between the median expression of each gene in the in-class samples relative to the out-of-class samples were selected [23]. A total of 1120 genes were selected using this method. To make it comparable to the occlusion method, the selection criteria were widened to increase the number of marker genes selected.

Classification performance comparison

To compare the performance of different marker gene selection strategies, we trained separate LSTM neural networks with the same architecture using the selected marker genes and applied a fivefold cross-validation strategy. As shown in Fig. 9a, the LSTM network trained with a dataset containing genes selected by the occlusion method achieved the highest median validation accuracy of 97.51% with the fewest number of genes. Meanwhile, the LSTM network trained using a modified selection strategy by selecting the top 33% unique genes from individual 28 classes according to their occlusion score (occlusion unique) attained a median validation accuracy of 95.10%. In comparison, the LSTM network trained with genes selected by Mostavi’s method obtained a median validation accuracy of 96.09%. The original method from Zhao et al. had the lowest median validation accuracy of 94.74%, which increased to 96.73% after widening the criteria to include a similar number of genes as other methods. Further analysis suggests that the mean occlusion score of the genes is an important factor in determining the validation accuracy of the LSTM network.

To investigate the relationship between genes selected by different strategies and the performance of the neural network, their mean occlusion scores are visualized in Fig. 9a for comparison.

Occlusion Mean: 19,002 genes selected by their mean occlusion score showed a concentrated right-skewed distribution in the histogram and thus exhibited the best classification performance.

Occlusion Unique: The modified occlusion method selects 40,059 unique genes, the greatest number among the methods. However, its histogram distribution is wider than those of both the original occlusion method and Mostavi’s method, resulting in lower classification performance.

Mostavi’s: Mostavi’s method selects 29,777 genes with a distribution similar to “occlusion unique” but fewer genes with mean occlusion score lower than 130, and thus outperforms “occlusion unique”.

Zhao’s Modified: Since the original method from Zhao et al. selected only 262 genes and obtained the lowest performance, the criteria are widened to select 31,145 genes for a fair comparison with other methods. Compared with Mostavi’s method, it is more biased towards genes with higher mean occlusion scores and outperforms Mostavi’s method. It is worth noting that both methods contain genes with the lowest mean occlusion score which hampers the classification performance.

In summary, the mean occlusion score of the selected genes has a significant impact on the neural network’s accuracy; genes with a mean occlusion score higher than 134 improve the accuracy of the network while genes with a lower mean occlusion score have a detrimental effect.

Fig. 9 Comparison of the validation accuracy of neural networks trained using genes selected by various methods and their distribution of mean occlusion scores

Absence of correlation between tissue specificity score and mean occlusion score

Genes expressed exclusively in specific cancer types or tissues, and not in others, often exhibit high tissue specificity and may serve as markers for cancer. To quantify this specificity, Schug et al. introduced a tissue specificity score based on Shannon entropy, a concept commonly employed in information theory [27]. Later, an enhanced method known as “ROKU entropy” was published, which further refined the sequencing of genes according to tissue specificity across tissues [28]. To examine whether the occlusion score in this study correlates with these entropies, tissue specificity scores based on Shannon entropy and ROKU entropy were obtained using the software “TSPEX” [29]. The maximum entropy value of Shannon entropy and ROKU entropy output by TSPEX is \(\log_2 N \approx 4.81\) (N = 28), indicating complete tissue specificity. The gene expression matrix for all 28 classes is constructed using the median of FPKM for all samples, as required by the software.
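A Shannon-entropy-based specificity score can be sketched as below. The sketch assumes the score is defined as \(\log_2 N\) minus the Shannon entropy of the gene's normalized expression profile, which is consistent with a maximum of \(\log_2 N\) for a gene expressed in a single class; TSPEX's actual implementation and scaling may differ, and the function name and profiles are hypothetical.

```python
import math


def shannon_specificity(expression):
    """log2(N) minus the Shannon entropy of the expression profile.

    Assumption: 0 means uniform expression, log2(N) means one class only.
    """
    total = sum(expression)
    if total == 0:
        return 0.0  # assumption: unexpressed genes get no specificity
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return math.log2(len(expression)) - entropy


n_classes = 28
exclusive = [0.0] * (n_classes - 1) + [5.0]  # expressed in one class only
uniform = [1.0] * n_classes                  # equally expressed everywhere

max_score = shannon_specificity(exclusive)
min_score = shannon_specificity(uniform)
```

For 28 classes the exclusive profile reaches the upper bound \(\log_2 28 \approx 4.807\), and the uniform profile scores zero.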

As shown in Fig. 10, there is only a weak positive correlation between Shannon entropy and mean occlusion score (Fig. 10a) and a weak positive correlation between ROKU entropy and mean occlusion score (Fig. 10b). Meanwhile, there is a strong positive correlation between Shannon entropy and ROKU entropy (Fig. 10c). This suggests that, despite lower tissue specificity, some genes are still considered important by the occlusion method. Most importantly, this indicates that the occlusion method evaluates the importance of each gene in pan-cancer classification beyond its tissue specificity.
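The text does not state which correlation coefficient was used. As one possible way to quantify such a relationship, the sketch below computes a Pearson coefficient between two score lists; the function name and the example values are fictional.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Fictional scores for five genes.
specificity = [0.3, 4.8, 1.2, 0.6, 2.5]
mean_occlusion = [0.20, 0.18, 0.05, 0.15, 0.07]
r = pearson(specificity, mean_occlusion)
```

A value near zero would correspond to the weak relationship described for the tissue specificity and mean occlusion scores, while a value near one would correspond to the strong agreement between the two entropy measures.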

Fig. 10 Scatter plots of Shannon tissue specificity entropy score, ROKU entropy score and mean occlusion score

High tissue specificity but low mean occlusion score

To investigate the disparity between tissue specificity score and mean occlusion score, it is crucial to understand why some genes are not ideal markers for pan-cancer classification despite their high tissue specificity. Investigations were first conducted on genes with a high tissue specificity score but a low mean occlusion score. As shown in Table 2, the gene GFAP is a classical marker of astrocytoma and is exclusively expressed in brain tissue [30, 31]. Therefore, GFAP exhibits one of the highest tissue specificity scores. However, this extremely high tissue specificity also prevents GFAP from being an ideal pan-cancer classification marker, resulting in a very low mean occlusion score. A similar phenomenon was observed with genes such as KRT4, LIPF, TG, and PGC, which are also almost exclusively expressed in specific tissues, making them less suitable for pan-cancer classification [30, 32,33,34].

Table 2 Example genes with high tissue specificity score but low mean occlusion score

High tissue specificity and high mean occlusion score

Investigations were also conducted on genes with high tissue specificity and a high mean occlusion score. As shown in Table 3, the microRNA MIR663B exhibits both a high mean occlusion score and a high tissue specificity score. Upon closer examination, it is evident that while MIR663B is expressed in several tissues, its expression level in specific tissues such as the brain is significantly higher than in others [35,36,37].

Table 3 Example genes with high tissue specificity score and high mean occlusion score

Low tissue specificity but high mean occlusion score

In contrast, Table 4 illustrates genes with a high mean occlusion score but a low tissue specificity score. This discrepancy arises from their broad expression across multiple tissues coupled with a lower standard deviation of expression levels. Despite their low tissue specificity, some research has indicated their association with cancer. For instance, the lncRNA ADPGK-AS1 is linked to osteosarcoma, colorectal cancer, pancreatic cancer, and breast cancer through various pathways [38,39,40,41]. Another example is the lncRNA DNM1P35, which serves as a novel prognostic factor for kidney cancer [42]. Additionally, STK24-AS1 has been implicated in predicting patient survival rates in colon cancer patients [43].

Table 4 Example genes with high mean occlusion score but low tissue specificity score

These findings highlight the superiority of the symmetrical occlusion method over tissue specificity methods in identifying optimal pan-cancer marker genes. Tissue specificity methods, akin to other statistical techniques, may overlook genes with subtle expression level differences. For the complete table, please refer to Supplementary File 2.

Literature search on top marker genes

To demonstrate the clinical significance of our findings and the relevance of marker genes with the highest occlusion scores to cancer, literature searches were conducted on ZNF709 (protein coding), MIR663B (MicroRNA), and FGF14-AS2 (lncRNA) as examples.

ZNF709

Zinc Finger Protein 709 (ZNF709) is a protein-coding gene belonging to the zinc finger family, involved in cellular processes like transcriptional regulation, DNA repair, and cell differentiation [44,45,46]. Despite its low tissue specificity score (Shannon entropy: 0.3253, ROKU entropy: 0.4742), it ranks 4th among all genes by mean occlusion score. The Human Protein Atlas reports strong expression of ZNF709 in cancers such as thyroid, colorectal, and breast cancer [47, 48].

Heyliger et al. discussed the clinical relevance of ZNF709 in clear cell renal carcinoma, suggesting its downregulation is associated with significantly favorable survival outcomes [49]. Wang et al. identified ZNF709 as one of the independent prognostic factors for pancreatic cancer [50]. Knockdown studies conducted by Yan et al. showed that downregulation of ZNF709 led to increased expression of p53, a well-known therapeutic target for cancer treatment [51,52,53].

Therefore, ZNF709 could potentially be a target for therapeutic intervention, where reducing its expression levels might enhance the tumor suppressor functions of p53.

MIR663B

MicroRNA 663B (MIR663B) is a small RNA molecule involved in gene regulation, specifically as part of the microRNA family. MicroRNAs play a crucial role in post-transcriptional regulation by binding to target messenger RNA molecules, thereby modulating their stability and translation. MIR663B exhibits very high tissue specificity scores (Shannon entropy: 4.8074, ROKU entropy: 4.8074) and is ranked 7th by mean occlusion score.

Recent studies have highlighted the potential significance of MIR663B in cancer progression and treatment. For instance, Jiang et al. elucidated the role of MIR663B in tamoxifen resistance in breast cancer, suggesting its involvement in modulating TP73 expression, a key factor in drug resistance mechanisms [54, 55]. Wang et al. demonstrated that MIR663B promotes cell proliferation and epithelial-mesenchymal transition in nasopharyngeal carcinoma by directly targeting SMAD7 [56]. Additionally, You et al. found that MIR663B exposed to TGF-\(\beta\)1 promotes cervical cancer metastasis and epithelial-mesenchymal transition by targeting MGAT3 [57]. Guo et al. also revealed that MIR663B targets GAB2 to restrict cell proliferation and invasion in hepatocellular carcinoma [58].

FGF14-AS2

Fibroblast Growth Factor 14 Antisense RNA 2 (FGF14-AS2) is a long non-coding RNA involved in gene regulation, particularly in post-transcriptional regulation by binding to target messenger RNA molecules and modulating their stability and translation. Despite its low tissue specificity scores (Shannon entropy: 0.6033, ROKU entropy: 0.9991), FGF14-AS2 ranks 61st (top 1%) in terms of mean occlusion score.

Experimental studies by Yang et al. and Jin et al. have explored the function of long non-coding RNA FGF14-AS2 in breast cancer, revealing its role in repressing metastasis and suggesting its potential therapeutic implications [59, 60]. Additionally, Hou et al. elucidated the inhibitory effect of FGF14-AS2 overexpression on colorectal cancer proliferation via the RERG/Ras/ERK signaling pathway by sponging microRNA-1288-3p [61]. Moreover, Li et al. demonstrated that FGF14-AS2 inhibits prostate carcinoma cell growth by modulating the miR-96-5p/AJAP1 axis, indicating its tumor-suppressive role in prostate cancer [62].

Marker gene identification in single cell RNA-Seq

To further verify whether the symmetrical occlusion workflow could accurately classify cell types, a separate LSTM neural network was trained, using the same workflow, on human muscle single-cell RNA sequencing (scRNA-Seq) data acquired from public sources. The neural network obtained a validation accuracy of 96.0%, indicating that the network can accurately identify cell types using scRNA-Seq data.

Loss of function simulation

To verify that symmetrical occlusion is able to identify known marker genes, 100 randomly selected muscle stem cells (MuSC) were extracted from the validation dataset. Then, the expression level of PAX7, a MuSC marker gene that is expressed only in MuSC, was reduced to simulate “loss of function”.

As shown in Fig. 11a, from top to bottom, as the expression level of PAX7 decreases during the occlusion process, many of these cells begin to be misclassified as non-MuSC, indicating that PAX7 is indeed an essential gene for the identification of MuSC cells. Meanwhile, Fig. 11b shows that reducing the expression level of the randomly selected gene Dido1 had no impact on the prediction results for another 100 randomly selected MuSC cells.

Fig. 11

“Loss of function” simulation on a known marker gene of MuSC cells and a randomly selected gene in MuSC cells

Gain of function simulation

To underscore the significance of PAX7 as a key marker gene for MuSC cells, the expression level of PAX7 was increased in 100 randomly selected non-MuSC cells to simulate the “gain of function” of PAX7. As shown in Fig. 12a, from bottom to top, the classification results reveal a notable shift, with the majority of non-MuSC cells being classified as “MuSC” as the expression level of PAX7 is systematically increased.

In stark contrast, as shown in Fig. 12b, increasing the expression level of the randomly selected gene CRYZ in 100 non-MuSC cells had little to no impact on the classification result. This difference in outcomes emphasizes the influential role that PAX7 plays in the accurate identification of MuSC cells compared to randomly selected genes.

Fig. 12

“Gain of function” simulation on a known marker gene of MuSC cells and a randomly selected gene in non-MuSC cells
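The two simulations above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this study: the trained LSTM is replaced by a hypothetical two-gene logistic scorer (`predict_musc_probability`), gene 0 stands in for a marker such as PAX7, gene 1 stands in for a randomly selected gene, and the weights, cells, and expression levels are invented toy values.

```python
import math

def predict_musc_probability(cell, weights, bias):
    """Toy stand-in for the trained network: logistic score for the 'MuSC' class."""
    z = bias + sum(w * x for w, x in zip(weights, cell))
    return 1.0 / (1.0 + math.exp(-z))

def occlusion_sweep(cells, gene_index, levels, weights, bias):
    """For each level, overwrite one gene in every cell and return the
    fraction of cells that the toy model predicts as 'MuSC'."""
    fractions = []
    for level in levels:
        hits = 0
        for cell in cells:
            occluded = list(cell)         # copy so the input stays untouched
            occluded[gene_index] = level  # change the chosen gene only
            if predict_musc_probability(occluded, weights, bias) >= 0.5:
                hits += 1
        fractions.append(hits / len(cells))
    return fractions

# Hypothetical toy model: only gene 0 (the marker) influences the prediction.
WEIGHTS, BIAS = [4.0, 0.0], -1.5
musc_cells = [[1.0, 0.3], [0.9, 0.8], [1.0, 0.1]]
other_cells = [[0.0, 0.5], [0.1, 0.2], [0.0, 0.9]]
decreasing = [1.0, 0.75, 0.5, 0.25, 0.0]   # "loss of function"
increasing = decreasing[::-1]               # "gain of function"

loss_marker = occlusion_sweep(musc_cells, 0, decreasing, WEIGHTS, BIAS)
loss_random = occlusion_sweep(musc_cells, 1, decreasing, WEIGHTS, BIAS)
gain_marker = occlusion_sweep(other_cells, 0, increasing, WEIGHTS, BIAS)
gain_random = occlusion_sweep(other_cells, 1, increasing, WEIGHTS, BIAS)
print(loss_marker, loss_random, gain_marker, gain_random)
```

In this toy setting, sweeping the marker flips the predicted class in both directions, while sweeping the random gene leaves every prediction unchanged, which mirrors the qualitative pattern reported in Figs. 11 and 12.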

This result highlights the potential of the LSTM neural network for cell type classification using scRNA-Seq data. Moreover, by applying symmetrical occlusion to a marker gene in both MuSC and non-MuSC cells, we have demonstrated the effectiveness of the occlusion method in identifying marker genes. These findings may contribute to more accurate and efficient methods for single-cell classification and annotation, offering researchers working with scRNA-Seq data a tool for studying cellular heterogeneity.

Gene ontology analysis on identified marker genes

The top 33% of genes, ranked by occlusion score, were subjected to “Gene Ontology” (GO) term analysis using DAVID to confirm their relevance to cancer [63]. As shown in Fig. 13, the most significant molecular function identified is “olfactory receptor activity”. Beyond their role in the perception of smell, olfactory receptors have been linked to cancer in prior publications. For instance, Shibel et al. reported that olfactory receptor OR5H2 regulates the proliferation of endometrial cancer cells through the IGF1 signaling pathway [64].

Furthermore, PSGR, a prostate-specific G protein-coupled receptor, has been found to be upregulated in prostate cancer. Another study by Weber et al. reported that olfactory receptor OR10H1 is primarily expressed in human bladder cancer [65]. Pathway analysis also revealed olfactory transduction to be the most significant pathway, followed by “microRNAs in cancer” and JAK-STAT signaling. The JAK-STAT pathway is known to promote tumorigenesis, and its inhibition can impede cancer cell growth [66]. Overall, these findings suggest that the selected genes may be implicated in cancer-related processes. The complete result can be found in Supplementary File 3.

Fig. 13

Gene ontology analysis results for the top 33% of genes ranked by mean occlusion score

Pseudogenes among the top-ranking genes

Pseudogenes were once considered non-functional copies of parental genes resulting from mRNA retrotransposition or genomic duplication, seemingly lacking biological significance. However, recent research has unveiled their diverse roles in physiological and pathological processes, particularly in cancer contexts [67]. Studies indicate that pseudogene expression patterns vary across tumor subtypes and can even impact patient survival in specific cancers like kidney cancer [68,69,70].

Among the top 33% of genes ranked by mean occlusion score, 6,108 are pseudogenes. To explore any correlation with previous findings, 2,616 pseudogenes reported by Han et al. as differentially expressed across four tumor types were selected [70]. Corresponding Ensembl gene IDs were extracted via the Ensembl REST API and cross-referenced to ensure consistency in genomic location. Of these, 974 pseudogenes had valid gene IDs, with significant representation from different tumor types: 34.85% from Glioblastoma (GBM), 27.76% from Breast Cancer (BRCA), 27.67% from Lung Squamous Cell Carcinoma (LUSC), and 10.47% from Uterine Corpus Endometrial Carcinoma (UCEC), all within the top 33% of genes selected by the occlusion method. This suggests that pseudogenes may significantly contribute to pan-cancer classification.

To gauge the contribution of pseudogenes, a depletion study was conducted. LSTM neural networks were trained using updated datasets with pseudogenes excluded from the top 33% of genes. Networks trained without pseudogenes exhibited only a marginal decrease in average accuracy (0.5%) compared to networks trained with pseudogenes included. This indicates that pseudogenes have a limited but measurable impact on pan-cancer classification accuracy, and that including them helps achieve the best performance.
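The gene selection step behind such a depletion study can be sketched as follows. This is a minimal illustration under stated assumptions: the gene identifiers, scores, and pseudogene set are invented toy values, and the helper names `top_fraction` and `exclude` are hypothetical rather than taken from the study's code.

```python
def top_fraction(scores, fraction):
    """Return gene IDs in the top `fraction` by score, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

def exclude(genes, excluded):
    """Drop every gene that appears in `excluded`, preserving order."""
    return [gene for gene in genes if gene not in excluded]

# Hypothetical mean occlusion scores for ten toy genes.
scores = {
    "G1": 0.91, "G2": 0.85, "P1": 0.80, "G3": 0.40, "P2": 0.35,
    "G4": 0.30, "G5": 0.22, "G6": 0.15, "P3": 0.10, "G7": 0.05,
}
pseudogenes = {"P1", "P2", "P3"}  # toy annotation, not real gene IDs

selected = top_fraction(scores, 0.33)      # feature set with pseudogenes
depleted = exclude(selected, pseudogenes)  # feature set without pseudogenes
print(selected, depleted)
```

Training one network on each of the two feature sets and comparing their validation accuracies would then give the kind of difference reported above.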

Discussion

This paper showcases the effectiveness of the NRC normalization method and the LSTM neural network in pan-cancer classification and marker gene prediction using both RNA-Seq and scRNA-Seq data. Additionally, the proposed occlusion algorithm proves its efficacy in identifying marker genes and enhancing the classification accuracy of the neural network with fewer genes.

Moreover, the occlusion algorithm has the potential to unveil gene-gene interactions by testing combinations of candidate genes, although this approach may pose computational challenges due to the vast number of possible combinations. These findings underscore the versatility and promise of the occlusion algorithm for diverse applications in genomics research.

Beyond cell type classification using scRNA-Seq data, the LSTM neural network shows promise in detecting novel cell types absent from the training dataset. This is evidenced by the network’s tendency to misclassify novel cell types with low prediction scores. For instance, when MuSC cells are excluded from the training dataset but included for prediction, the network misclassifies all MuSC cells as other cell types, yet with a markedly lower prediction score (0.4–0.6) compared to true positive results (approximately 1). Leveraging this discrepancy enables the detection of novel cell types, achieving an accuracy of 85–88% when employing machine learning classifiers such as linear discriminant and SVM.
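The idea of using the prediction score to flag novel cell types can be illustrated with a minimal sketch. Note the simplification: a single fixed cut-off replaces the linear discriminant and SVM classifiers mentioned above, and the threshold of 0.8, the helper names, and the toy probability vectors are all hypothetical.

```python
def flag_novel(probabilities, threshold=0.8):
    """Flag a cell as a candidate novel type when its top class probability
    falls below `threshold` (a hypothetical cut-off, not taken from the paper)."""
    return max(probabilities) < threshold

def detection_accuracy(cells, is_novel, threshold=0.8):
    """Fraction of cells whose novel/known flag matches the ground truth."""
    correct = sum(
        flag_novel(probs, threshold) == truth
        for probs, truth in zip(cells, is_novel)
    )
    return correct / len(cells)

# Toy class-probability vectors: known types score near 1, novel ones near 0.5.
cells = [
    [0.98, 0.01, 0.01],  # known type, confident
    [0.95, 0.03, 0.02],  # known type, confident
    [0.70, 0.20, 0.10],  # known type, but uncertain: will be flagged wrongly
    [0.55, 0.30, 0.15],  # novel type, low top score
    [0.45, 0.40, 0.15],  # novel type, low top score
]
is_novel = [False, False, False, True, True]

print(detection_accuracy(cells, is_novel))
```

A learned classifier over the score distribution, as described in the text, can place the decision boundary from data instead of relying on a hand-picked threshold.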

Availability of data and materials

All data generated or analyzed during this study are publicly available using their accession numbers in the supplementary files. The code is available from the corresponding author upon reasonable request.

References

  1. Xia C, Dong X, Li H, Cao M, Sun D, He S, Yang F, Yan X, Zhang S, Li N. Cancer statistics in China and United States, 2022: profiles, trends, and determinants. Chin Med J. 2022;135(05):584–90.

  2. Tanaka N, Kaczynska D, Kanatani S, Sahlgren C, Mitura P, Stepulak A, Miyakawa A, Wiklund P, Uhlen P. Mapping of the three-dimensional lymphatic microvasculature in bladder tumours using light-sheet microscopy. Br J Cancer. 2018;118(7):995–9.

  3. Chen JJ, Wang SJ, Tsai CA, Lin CJ. Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J. 2007;7(3):212–20.

  4. Mahin KF, Robiuddin Md, Islam M, Ashraf S, Yeasmin F, Shatabda S. PanClassif improving pan cancer classification of single cell RNA-Seq gene expression data using machine learning. Genomics. 2022;114(2): 110264.

  5. Hossain SM, Khatun L, Ray S, Mukhopadhyay A. Pan-cancer classification by regularized multi-task learning. Sci Rep. 2021;11(1):24252.

  6. Khadirnaikar S, Shukla S, Prasanna SR. Integration of pan-cancer multi-omics data for novel mixed subgroup identification using machine learning methods. PLoS ONE. 2023;18(10): e0287176.

  7. Petrini I, Meltzer PS, Kim I-K, Lucchi M, Park K-S, Fontanini G, Gao J, Zucali PA, Calabrese F, Favaretto A, Rea F, Rodriguez-Canales J, Walker RL, Pineda M, Zhu YJ, Lau C, Killian KJ, Bilke S, Voeller D, Dakshanamurthy S, Wang Y, Giaccone G. A specific missense mutation in GTF2I occurs at high frequency in thymic epithelial tumors. Nat Genet. 2014;46(8):844–9.

  8. Snezhkina AV, Lukyanova EN, Zaretsky AR, Kalinin DV, Pokrovsky AV, Golovyuk AL, Krasnov GS, Fedorova MS, Pudova EA, Kharitonov SL, Melnikova NV, Alekseev BY, Kiseleva MV, Kaprin AD, Dmitriev AA, Kudryavtseva AV. Novel potential causative genes in carotid paragangliomas. BMC Med Genet. 2019;20(Suppl 1):48.

  9. Lesluyes T, Baud J, Pérot G, Charon-Barra C, You A, Valo I, Bazille C, Mishellany F, Leroux A, Renard-Oldrini S, Terrier P, Cesne AL, Laé M, Piperno-Neumann S, Bonvalot S, Neuville A, Collin F, Maingon P, Coindre J-M, Chibon F. Genomic and transcriptomic comparison of post-radiation versus sporadic sarcomas. Mod Pathol Off J US Can Acad Pathol. 2019;32(12):1786–94.

  10. Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012;9(4):357–9.

  11. Liao Y, Smyth GK, Shi W. featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30.

  12. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X. A survey of best practices for RNA-Seq data analysis. Genome Biol. 2016;17(1):1–19.

  13. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J. A comprehensive evaluation of normalization methods for illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83.

  14. Zhao S, Ye Z, Stanton R. Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols. RNA. 2020;26(8):903–9.

  15. Zhao Y, Li M-C, Konaté MM, Chen L, Das B, Chris Karlovich P, Williams M, Evrard YA, Doroshow JH, McShane LM. TPM, FPKM, or normalized counts? A comparative study of quantification measures for the analysis of RNA-Seq data from the NCI patient-derived models repository. J Transl Med. 2021;19(1):269.

  16. Mohamed A, Graves A, Hinton G. Speech recognition with deep recurrent neural networks. In: IEEE international conference on acoustics, speech and signal processing; 2013. p. 6645–9

  17. Pierre B, Sadowski Peter J. Understanding dropout. In: Advances in neural information processing systems; 2013. vol. 26, p. 2814–22.

  18. Huang H, Li D, Zhang Z, Chen X, Huang K. Adversarially occluded samples for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 5098–5107.

  19. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Computer vision-ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I 13. Springer; 2014. p. 818–33

  20. Mostavi M, Chiu YC, Huang Y, Chen Y. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med Genomics. 2020;13(Suppl 5):44.

  21. de Guia JM, Devaraj M, Leung CK. DeepGX: deep learning using gene expression for cancer classification. In: Proceedings of the 2019 IEEE/ACM international conference on advances in social networks analysis and mining; 2019. p. 913–20.

  22. Khalifa NE, Taha MH, Ali DE, Slowik A, Hassanien AE. Artificial intelligence technique for gene expression by tumor RNA-Seq data: a novel optimized deep learning approach. IEEE Access. 2020;8:22874–83.

  23. Zhao Y, Pan Z, Namburi S, Pattison A, Posner A, Balachander S, Paisie CA, Reddi HV, Rueter J, Gill AJ, Fox S, Raghav KPS, Flynn WF, Tothill RW, Li S, Karuturi RKM, George J. CUP-AI-Dx: a tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine. 2020;61: 103030.

  24. Sun K, Wang J, Wang H, Sun H. Genect: a generalizable cancerous status and tissue origin classifier for pan-cancer biopsies. Bioinformatics. 2018;34(23):4129–30.

  25. Fan F, Chen D, Zhao Y, Wang H, Sun H, Sun K. Rapid preliminary purity evaluation of tumor biopsies using deep learning approach. Comput Struct Biotechnol J. 2020;18:1746–53.

  26. Kim SK, Kim SY, Kim JH, Roh SA, Cho DH, Kim YS, Kim JC. A nineteen gene-based risk score classifier predicts prognosis of colorectal cancer patients. Mol Oncol. 2014;8(8):1653–66.

  27. Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ Jr. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol. 2005;6(4):R33.

  28. Kadota K, Ye J, Nakai Y, Terada T, Shimizu K. Roku: a novel method for identification of tissue-specific genes. BMC Bioinform. 2006;7:294.

  29. Camargo AP, Vasconcelos AA, Fiamenghi MB, Pereira GAG, Carazzolle MF. Tspex: a tissue-specificity calculator for gene expression data. Res Square; 2020.

  30. Fagerberg L, Hallström BM, Oksvold P, Kampf C, Djureinovic D, Odeberg J, Habuka M, Tahmasebpoor S, Danielsson A, Edlund K, Asplund A, Sjöstedt E, Lundberg E, Szigyarto CA, Skogs M, Takanen JO, Berling H, Tegel H, Mulder J, Nilsson P, Schwenk JM, Lindskog C, Danielsson F, Mardinoglu A, Sivertsson A, von Feilitzen K, Forsberg M, Zwahlen M, Olsson I, Navani S, Huss M, Nielsen J, Ponten F, Uhlén M. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol Cell Proteom. 2014;13(2):397–406.

  31. van Bodegraven EJ, van Asperen JV, Robe PAJ, Hol EM. Importance of GFAP isoform-specific analyses in astrocytoma. Glia. 2019;67(8):1417–33.

  32. Duff MO, Olson S, Wei X, Garrett SC, Osman A, Bolisetty M, Plocik A, Celniker SE, Graveley BR. Genome-wide identification of zero nucleotide recursive splicing in drosophila. Nature. 2015;521(7552):376–9.

  33. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B, Moser M, Karasik E, Gillard B, Ramsey K, Sullivan S, Bridge J, Magazine H, Syron J, Fleming J, Siminoff L, Traino H, Mosavel M, Barker L, Jewell S, Rohrer D, Maxim D, Filkins D, Harbach P, Cortadillo E, Berghuis B, Turner L, Hudson E, Feenstra K, Sobin L, Robb J, Branton P, Korzeniewski G, Shive C, Tabor D, Qi L, Groch K, Nampally S, Buia S, Zimmerman A, Smith A, Burges R, Robinson K, Valentino K, Bradbury D, Cosentino M, Diaz-Mayoral N, Kennedy M, Engel T, Williams P, Erickson K, Ardlie K, Winckler W, Getz G, DeLuca D, MacArthur D, Kellis M, Thomson A, Young T, Gelfand E, Donovan M, Meng Y, Grant G, Mash D, Marcus Y, Basile M, Liu J, Zhu J, Tu Z, Cox NJ, Nicolae DL, Gamazon ER, Im HK, Konkashbaev A, Pritchard J, Stevens M, Flutre T, Wen X, Dermitzakis ET, Lappalainen T, Guigo R, Monlong J, Sammeth M, Koller D, Battle A, Mostafavi S, McCarthy M, Rivas M, Maller J, Rusyn I, Nobel A, Wright F, Shabalin A, Feolo M, Sharopova N, et al. The genotype-tissue expression (GTEx) project. Nat Genet. 2013;45(6):580–5.

  34. Pontén F, Jirström K, Uhlen M. The human protein atlas—a tool for pathology. J Pathol J Pathol Soc Great Br Ireland. 2008;216(4):387–93.

  35. Cai H, An Y, Chen X, Sun D, Chen T, Peng Y, Zhu F, Jiang Y, He X. Epigenetic inhibition of miR-663b by long non-coding RNA HOTAIR promotes pancreatic cancer cell proliferation via up-regulation of insulin-like growth factor 2. Oncotarget. 2016;7(52):86857.

  36. Mulong D, Shi D, Yuan L, Li P, Chu H, Qin C, Yin C, Zhang Z, Wang M. Circulating miR-497 and miR-663b in plasma are potential novel biomarkers for bladder cancer. Sci Rep. 2015;5(1):10437.

  37. Hong S, Yan Z, Wang H, Ding L, Song Y, Bi M. miR-663b promotes colorectal cancer progression by activating RAS/RAF signaling through downregulation of TNK1. Hum Cell. 2020;33(1):104–15.

  38. Luo XF, Wu XJ, Wei X, Wang AG, Wang SH, Wang JL. LncRNA ADPGK-AS1 regulated cell proliferation, invasion, migration and apoptosis via targeting miR-542-3p in osteosarcoma. Eur Rev Med Pharmacol Sci. 2019;23(20):8751–60.

  39. Jiang HY, Wang ZJ. ADPGK-AS1 promotes the progression of colorectal cancer via sponging miR-525 to upregulate FUT1. Eur Rev Med Pharmacol Sci. 2020;24(5):2380–6.

  40. Song S, Weihua Yu, Lin S, Zhang M, Wang T, Guo S, Wang H. LncRNA ADPGK-AS1 promotes pancreatic cancer progression through activating ZEB1-mediated epithelial-mesenchymal transition. Cancer Biol Therapy. 2018;19(7):573–83.

  41. Yang J, Weizhu W, Minhua W, Ding J. Long noncoding RNA ADPGK-AS1 promotes cell proliferation, migration, and EMT process through regulating miR-3196/otx1 axis in breast cancer. In Vitro Cel Dev Biol Anim. 2019;55(7):522–32.

  42. Song J, Peng J, Zhu C, Bai G, Liu Y, Zhu J, Liu J. Identification and validation of two novel prognostic LncRNAs in kidney renal clear cell carcinoma. Cell Physiol Biochem. 2018;48(6):2549–62.

  43. Yang L, Yang T, Wang H, Dou T, Fang X, Shi L, Li X, Feng M. DNMBP-AS1 regulates NHLRC3 expression by sponging miR-93-5p/17-5p to inhibit colon cancer progression. Front Oncol. 2022;12: 765163.

  44. Liu Z, Lam N, Thiele CJ. Zinc finger transcription factor CASZ1 interacts with histones, DNA repair proteins and recruits NuRD complex to regulate gene transcription. Oncotarget. 2015;6(29):27628–40.

  45. Kwak S, Kim TW, Kang B-H, Kim J-H, Lee J-S, Lee H-T, Hwang I-Y, Shin J, Lee J-H, Cho E-J, Youn H-D. Zinc finger proteins orchestrate active gene silencing during embryonic stem cell differentiation. Nucleic Acids Res. 2018;46(13):6592–607.

  46. Cassandri M, Smirnov A, Novelli F, Pitolli C, Agostini M, Malewicz M, Melino G, Raschellá G. Zinc-finger proteins in health and disease. Cell Death Discov. 2017;3(1):1–12.

  47. Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, Benfeitas R, Arif M, Liu Z, Edfors F, Sanli K, von Feilitzen K, Oksvold P, Lundberg E, Hober S, Nilsson P, Mattsson J, Schwenk JM, Brunnström H, Glimelius B, Sjöblom T, Edqvist P-H, Djureinovic D, Micke P, Lindskog C, Mardinoglu A, Ponten F. A pathology atlas of the human cancer transcriptome. Science. 2017;357(6352):eaan2507.

  48. Uhlén M, Björling E, Agaton C, Al-Khalili Szigyarto C, Amini B, Andersen E, Andersson A-C, Angelidou P, Asplund A, Asplund C, Berglund L, Bergström K, Brumer H, Cerjan D, Ekström M, Elobeid A, Eriksson C, Fagerberg L, Falk R, Fall J, Forsberg M, Björklund MG, Gumbel K, Halimi A, Hallin I, Hamsten C, Hansson M, Hedhammar M, Hercules G, Kampf C, Larsson K, Lindskog M, Lodewyckx W, Lund J, Lundeberg J, Magnusson K, Malm E, Nilsson P, Ödling J, Oksvold P, Olsson I, Öster E, Ottosson J, Paavilainen L, Persson A, Rimini R, Rockberg J, Runeson M, Sivertsson Å, Sköllermo A, Steen J, Stenvall M, Sterky F, Strömberg S, Sundberg M, Tegel H, Tourle S, Wahlund E, Waldén A, Wan J, Wernérus H, Westberg J, Wester K, Wrethagen U, Xu LL, Hober S, Pontén F. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cel Proteom. 2005;4(12):1920–32.

  49. Heyliger SO, Soliman KFA, Saulsbury MD, Renee RR. Prognostic relevance of ZNF844 and Chr 19p13.2 KRAB-zinc finger proteins in clear cell renal carcinoma. Cancer Genom Proteom. 2022;19(3):305–27.

  50. Wang W, Zhijian X, Wang N, Yao R, Qin T, Lin H, Yue L. Prognostic value of eight immune gene signatures in pancreatic cancer patients. BMC Med Genom. 2021;14(1):42.

  51. Yan W, Scoumanne A, Jung Y-S, Xu E, Zhang J, Zhang Y, Ren C, Sun P, Chen X. Mice deficient in poly(C)-binding protein 4 are susceptible to spontaneous tumors through increased expression of ZFP871 that targets p53 for degradation. Genes Dev. 2016;30(5):522–34.

  52. Hibino E, Hiroaki H. Potential of rescue and reactivation of tumor suppressor p53 for cancer therapy. Biophys Rev. 2022;14(1):267–75.

  53. Farnebo M, Bykov VJN, Wiman KG. The p53 tumor suppressor: a master regulator of diverse cellular processes and therapeutic target in cancer. Biochem Biophys Res Commun. 2010;396(1):85–9.

  54. Jiang H, Cheng L, Hu P, Liu R. MicroRNA-663b mediates TAM resistance in breast cancer by modulating TP73 expression. Mol Med Rep. 2018;18(1):1120–6.

  55. Howell A, Howell SJ. Tamoxifen evolution. Br J Cancer. 2023;128(3):421–5.

  56. Wang M, Jia M, Yuan K. MicroRNA-663b promotes cell proliferation and epithelial mesenchymal transition by directly targeting SMAD7 in nasopharyngeal carcinoma. Exp Ther Med. 2018;16(4):3129–34.

  57. You X, Wang Y, Meng J, Han S, Liu L, Sun Y, Zhang J, Sun S, Li X, Sun W, Dong Y, Zhang Y. Exosomal miR-663b exposed to TGF-β1 promotes cervical cancer metastasis and epithelial-mesenchymal transition by targeting MGAT3. Oncol Rep. 2021;45(4):1.

  58. Guo L, Li B, Miao M, Yang J, Ji J. MicroRNA-663b targets GAB2 to restrict cell proliferation and invasion in hepatocellular carcinoma. Mol Med Rep. 2019;19(4):2913–20.

  59. Yang F, Liu Y, Dong S, Ma R, Bhandari A, Zhang X, Wang O. A novel long non-coding RNA FGF14-AS2 is correlated with progression and prognosis in breast cancer. Biochem Biophys Res Commun. 2016;470(3):479–83.

  60. Jin Y, Zhang M, Duan R, Yang J, Yang Y, Wang J, Jiang C, Yao B, Li L, Yuan H, Zha X, Ma C. Long noncoding RNA FGF14-AS2 inhibits breast cancer metastasis by regulating the miR-370-3p/FGF14 axis. Cell Death Discov. 2020;6(1):1–14.

  61. Hou R, Liu Y, Yanzhuo S, Shu Z. Overexpression of long non-coding RNA FGF14-AS2 inhibits colorectal cancer proliferation via the RERG/Ras/ERK signaling by sponging microRNA-1288-3p. Pathol Oncol Res. 2020;26(4):2659–67.

  62. Li R, Chen Y, Wu J, Cui X, Zheng S, Yan H, Wu Y, Wang F. LncRNA FGF14-AS2 represses growth of prostate carcinoma cells via modulating miR-96-5p/AJAP1 axis. J Clin Lab Anal. 2021;35(11): e24012.

  63. Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC, et al. David bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007;35(suppl–2):W169–75.

  64. Shibel R, Sarfstein R, Nagaraj K, Lapkina-Gendler L, Laron Z, Dixit M, Yakar S, Werner H. The olfactory receptor gene product, OR5H2, modulates endometrial cancer cells proliferation via interaction with the IGF1 signaling pathway. Cells. 2021;10(6):1483.

  65. Weber L, Schulz WA, Philippou S, Eckardt J, Ubrig B, Hoffmann MJ, Tannapfel A, Kalbe B, Gisselmann G, Hatt H. Characterization of the olfactory receptor or10h1 in human urinary bladder cancer. Front Physiol. 2018;9:456.

  66. Bose S, Banerjee S, Mondal A, Chakraborty U, Pumarol J, Croley CR, Bishayee A. Targeting the JAK/STAT signaling pathway using phytocompounds for cancer prevention and therapy. Cells. 2020;9(6):1451.

  67. Xiao-Jie L, Ai-Mei G, Li-Juan J, Jiang X. Pseudogene in cancer: real functions and promising signature. J Med Genet. 2015;52(1):17–24.

  68. Pan Y, Sun C, Huang M, Liu Y, Qi F, Liu L, Wen J, Liu J, Xie K, Ma H, Hu Z, Shen H. A genetic variant in pseudogene E2F3P1 contributes to prognosis of hepatocellular carcinoma. J Biomed Res. 2014;28(3):194–200.

  69. Loh YH, Wu Q, Chew JL, Vega VB, Zhang W, Chen X, Bourque G, George J, Leong B, Liu J, Wong KY, Sung KW, Lee CW, Zhao XD, Chiu KP, Lipovich L, Kuznetsov VA, Robson P, Stanton LW, Wei CL, Ruan Y, Lim B, Ng HH. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat Genet. 2006;38(4):431–40.

  70. Han L, Yuan Y, Zheng S, Yang Y, Li J, Edgerton ME, Diao L, Xu Y, Verhaak RGW, Liang H. The pan-cancer analysis of pseudogene expression reveals biologically and clinically relevant tumour subtypes. Nat Commun. 2014;5:3963.

Acknowledgements

The authors thank the reviewers for their valuable suggestions and Yuanyuan Wei from the Department of Biomedical Engineering at the Chinese University of Hong Kong for her help in polishing the manuscript.

Funding

This work was supported by General Research Funds (GRF) from the Research Grants Council (RGC), University Grants Committee of the Hong Kong Special Administrative Region, China. GRF Project Codes: 2141109, 2141157, 2141261, 14105123, 14103522, 14120420, 14120619 to H.S. GRF Project Codes: 14106521, 14100620, 14105823, 14115319 to H.W. The Warshel Institute for Computational Biology funding from Shenzhen City and Longgang District (LGKCSDPT2024001); University Development Fund -Research Start-up Fund UDF01003011, The Chinese University of Hong Kong, Shenzhen.

Author information

Contributions

X.Z. conducted the design and programming of the software as well as drafting of the manuscript. Z.C., H.W., and H.S. contributed to the conceptualization of the methods used and reviewed the draft. All authors reviewed the manuscript and approved the final version of the manuscript.

Corresponding author

Correspondence to Hao Sun.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhao, X., Chen, Z., Wang, H. et al. Occlusion enhanced pan-cancer classification via deep learning. BMC Bioinformatics 25, 260 (2024). https://doi.org/10.1186/s12859-024-05870-y

  • DOI: https://doi.org/10.1186/s12859-024-05870-y

Keywords