DeeP4med: deep learning for P4 medicine to predict normal and cancer transcriptome in multiple human tissues

Mahdi-Esferizi, Roohallah; Haji Molla Hoseyni, Behnaz; Mehrpanah, Amir; Golzade, Yazdan; Najafi, Ali; Elahian, Fatemeh; Zadeh Shirazi, Amin; Gomez, Guillermo A.; Tahmasebian, Shahram

doi:10.1186/s12859-023-05400-2

Research
Open access
Published: 04 July 2023

Roohallah Mahdi-Esferizi¹,
Behnaz Haji Molla Hoseyni²,
Amir Mehrpanah³,
Yazdan Golzade⁴,
Ali Najafi⁵,
Fatemeh Elahian¹,
Amin Zadeh Shirazi⁶ (ORCID: orcid.org/0000-0002-1906-9900),
Guillermo A. Gomez⁶ (ORCID: orcid.org/0000-0002-0494-2404) &
Shahram Tahmasebian⁷

BMC Bioinformatics volume 24, Article number: 275 (2023)

Abstract

Background

P4 medicine (predict, prevent, personalize, and participate) is a new approach to diagnosing and predicting diseases on a patient-by-patient basis. For the prevention and treatment of diseases, prediction plays a fundamental role. One of the intelligent strategies is the design of deep learning models that can predict the state of the disease using gene expression data.

Results

We create an autoencoder deep learning model called DeeP4med, including a Classifier and a Transferor that predicts cancer's gene expression (mRNA) matrix from its matched normal sample and vice versa. The range of the F1 score of the model, depending on tissue type in the Classifier, is from 0.935 to 0.999 and in Transferor from 0.944 to 0.999. The accuracy of DeeP4med for tissue and disease classification was 0.986 and 0.992, respectively, which performed better compared to seven classic machine learning models (Support Vector Classifier, Logistic Regression, Linear Discriminant Analysis, Naive Bayes, Decision Tree, Random Forest, K Nearest Neighbors).

Conclusions

With DeeP4med, given the gene expression matrix of a normal tissue, we can predict its tumor gene expression matrix and, in this way, find the genes that are effective in transforming a normal tissue into a tumor tissue. Differentially expressed gene (DEG) and enrichment analyses on the predicted matrices for 13 types of cancer showed good agreement with the literature and biological databases. This suggests that a model trained on the gene expression features of each person in the normal and cancer states could predict a diagnosis from the gene expression data of healthy tissue and could be used to identify possible therapeutic interventions for those patients.

Background

In the past, diseases were considered the result of alterations in the function of one or more genes, so the diagnosis and treatment of patients were based on reductionist approaches to correct these genetic alterations. However, medicine is undergoing a fundamental shift from this reductionist view to a more holistic (systems biology) approach to understanding the biology of disease [1,2,3]. In systems biology, organ function results from the simultaneous interaction of all genes, mRNAs, proteins, and metabolites across the different cells that constitute different types of tissues. Omics studies therefore aim to collect high-throughput genomic, epigenomic, transcriptomic, proteomic, and metabolomic data [4]. From this perspective, each omics dataset is a network layer and the cell is considered as several integrated networks, so disease is defined as a disorder or change in these networks [5].

P4 medicine (predict, prevent, personalize, and participate) is the latest approach to overcoming complex diseases like cancer. Developing computational models that can use omics data to predict disease and suggest suitable drugs for each person is very challenging [6, 7]. Transcriptomics is one of the essential omics layers, and deep learning is a powerful method for processing gene expression data and extracting new knowledge about disease [8]. In 2019, Lotfollahi et al. developed scGen to analyze and predict the effect of a perturbation (e.g., drug, disease) at single-cell resolution [9]. Several review articles have explained the role of data science and machine learning in precision medicine (published by Fröhlich in 2018, Papadakis in 2019, and MacEachern in 2020) [10,11,12]. In 2022, Hetzel et al. developed a deep learning model for drug discovery based on cellular response to perturbations in a single-cell transcriptomics context [13]. Many research consortia worldwide have also started working in this field, including MLPM (Machine learning for personalized medicine), a Marie Curie Initial Training Network funded by the European Union [14,15,16,17].

Many studies have aimed to identify genes that are expressed differently in tumor and normal samples. These genes are critical to understanding the function of the disease, but these studies compared two groups of individuals [18,19,20]. Cancer, however, is a complex disease, and patients with the same type of cancer may have different gene expression profiles. Some studies have also repurposed drugs for diseases based on these genes [21], but a given drug can be effective in some patients and ineffective in others. Our goal is to get one step closer to personalized medicine by identifying the cancer-related genes for each person individually. We therefore try to bring every tumor sample as close to normal as possible in order to find the effective genes specific to that patient.

In this study, we developed a model called DeeP4med to apply deep learning in P4 medicine. We used a previously collected and preprocessed dataset [22], which combines The Cancer Genome Atlas (TCGA) [23] and the Genotype-Tissue Expression project (GTEx) [24]. This dataset contains 6111 tumor and 2996 normal samples in total, drawn from 13 different tissues. We selected 18,154 features (genes) that were common across all samples. For simplicity, we ignored the sub-tissue classification within each tissue type. In the preprocessing step, we divided each feature by its maximum value in the dataset. DeeP4med comprises a Classifier and a Transferor. The Classifier identifies the tissue type and the tissue condition (normal or tumor). The Transferor takes a person's normal expression matrix and predicts the tumor matrix for the same person, and vice versa. In this way, the model tries to learn the features that are important for converting a normal sample to a tumor sample. Then, based on a sample's important features and other personal features, it predicts and generates a new expression matrix. This ability lets the model address two components of P4 medicine, predict and personalize, and through them we can work towards the other two, prevent and participate. To evaluate the results predicted by the model, we analyzed them with conventional machine learning and bioinformatics methods, which are reviewed in the Results section (Fig. 1).
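The max-scaling step mentioned in this paragraph (each feature divided by its maximum value in the dataset) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `max_scale`, the toy matrix, and the treatment of genes that are zero in every sample are assumptions.

```python
def max_scale(matrix):
    """Divide every column (gene) by its maximum over all samples.

    `matrix` is a list of rows (samples); each row holds one non-negative
    expression value per gene.
    """
    n_genes = len(matrix[0])
    # Column-wise maxima over the whole dataset.
    col_max = [max(row[j] for row in matrix) for j in range(n_genes)]
    # Assumption: a gene that is zero everywhere stays at zero
    # (the paper does not say how such genes are handled).
    return [
        [value / col_max[j] if col_max[j] > 0 else 0.0
         for j, value in enumerate(row)]
        for row in matrix
    ]


# Toy data: 2 samples x 3 genes (the real dataset has 18,154 genes).
expression = [
    [2.0, 0.0, 5.0],
    [4.0, 0.0, 10.0],
]
scaled = max_scale(expression)
```

After scaling, every gene lies in the range [0, 1], which keeps highly expressed genes from dominating the reconstruction loss.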

Results

After creating the model, we evaluated its performance with two different approaches: (1) performance analysis of the Transferor and Classifier of DeeP4med; (2) investigation of the validity and biological significance of the data generated by the model, using differentially expressed gene (DEG) analysis and enrichment analysis.

Performance analysis of transferor and classifier

To show the Transferor’s performance in changing the type of mRNA (normal → tumor and tumor → normal), we computed its F1 score (see Table 1), precision and recall (Additional file 1: Tables S1 and S2, respectively). To achieve this, we first needed to evaluate the Classifier’s performance with respect to tissue (breast, prostate, lung, etc.) and disease (tumor, normal). For this purpose, we report its F1 score, precision and recall for tissue and disease classification. These performance measures, along with their corresponding confusion matrices, are summarised in Fig. 2.
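For reference, the three measures reported here can be computed per class as in the sketch below. The function name and the toy labels are illustrative assumptions. For the Transferor, `y_true` would hold the expected labels of the generated samples and `y_pred` the labels that the Classifier assigns to them.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, not taken from the paper.
y_true = ["tumor", "tumor", "normal", "normal", "tumor"]
y_pred = ["tumor", "normal", "normal", "tumor", "tumor"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="tumor")
```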

Table 1 F1 score of tissue classification. (Left) F1 score of tissue classification with the Classifier. (Right) F1 score for tissue classification of data generated by the Transferor network, as evaluated with the Classifier

Performance of classifier compared with other machine learning models

To reduce the data dimensionality, we used principal component analysis (PCA) as a preprocessing step [25]. When dealing with high-dimensional data, it is natural to assume that the latent variables of the data-generating distribution lie on a much lower-dimensional manifold. By finding a lower-dimensional representation through PCA, we preserve important information while removing redundant dimensions, which simplifies analysis and modeling. After tuning the parameters of seven traditional machine learning models, we used K-fold cross-validation with K = 5: in each fold, one-fifth of the data was held out for testing and the other four-fifths were used for training and validation. To report each model's performance, we averaged its performance over the folds, and we then compared these averages with the performance of DeeP4med. The selected baselines are Support vector classifier (SVC), Logistic regression (LR), Linear discriminant analysis (LDA), Naive Bayes (NBayes), Decision tree (DTree), Random forest (RForest), and K nearest neighbors (KNN). The results show that DeeP4med performs better at identifying tissue types (Additional file 1: Table S3 (left)) and outperforms the baselines in classifying disease samples (Additional file 1: Table S3 (right)). The results are consistent across different PCA dimensions (PCA dim = 120, PCA dim = 90, and PCA dim = 150; see Additional file 1: Tables S3, S4, and S5, respectively).
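The fold construction and averaging described in this paragraph can be sketched with the standard library alone. The names `k_fold_indices`, `cross_validate`, and `fit_and_score` are hypothetical, and the shuffling seed is an assumption. In practice `fit_and_score` would fit PCA and one baseline classifier on the training indices and return its score on the test indices.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Average the score returned by `fit_and_score` over the k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Toy scorer: returns the held-out fraction instead of a real model score.
mean_score = cross_validate(10, lambda train, test: len(test) / 10)
```

Fitting PCA inside each fold, on the training indices only, avoids leaking information from the held-out samples into the reduced representation.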

Biological benchmark

Since our primary purpose was to develop a model that (i) can predict the disease state (i.e., tumor transcriptome) on a patient-by-patient basis based on (RNAseq) healthy tissue information and (ii) predict the healthy state from known tumor information (i.e., RNAseq from tumor biopsies), we designed DeeP4med to produce two types of expression matrices for each tissue:

(1) “transfer tumor” (TT). This data is generated by applying DeeP4med to RNAseq data from normal tissue samples (i.e., original normal data, ON).

For notation clarity, we label this data set as ON_TT.

(2) “transfer normal” (TN). This data is generated by applying DeeP4med to RNAseq data from tumor tissue samples (i.e., original tumor data, OT).

For notation clarity, we label this data set as OT_TN.

In each of these two types of expression matrices, half of the data are original and the other half are transfers. To evaluate the model's performance on these two matrices in each tissue, (1) DEG analysis and (2) enrichment analysis were performed, and the results were compared against each other. The number of samples in each tissue is shown in Additional file 1: Table S6, and the expression matrices of all tissues are provided in Additional file 2: part 1.
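As a small illustration of how such a half-original, half-transferred matrix could be assembled, consider the sketch below. The helper `build_paired_matrix` and the stand-in `toy_transfer` function are hypothetical. In the paper, the transfer is performed by the trained Transferor.

```python
def build_paired_matrix(originals, transfer, original_label, transfer_label):
    """Stack original samples with their transferred counterparts.

    Returns a list of (label, profile) rows: first all originals,
    then one transferred profile per original.
    """
    rows = [(original_label, list(sample)) for sample in originals]
    rows += [(transfer_label, transfer(sample)) for sample in originals]
    return rows


def toy_transfer(sample):
    # Stand-in for the Transferor: NOT the real model, just a placeholder.
    return [value * 2 for value in sample]


original_normal = [[0.1, 0.2], [0.3, 0.4]]
on_tt = build_paired_matrix(original_normal, toy_transfer, "ON", "TT")
```

The OT_TN matrix would be built the same way from original tumor samples, with the labels "OT" and "TN".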

DEG analysis

DEG analysis between tumor and normal states was performed using the limma package [26] on the idep.951 platform [27] between (i) ON and TT groups and (ii) OT and TN groups.

We predicted that if DeeP4med works accurately, there should be a significant overlap of DEGs identified in conditions (i) and (ii).

To test this, genes with adjusted p-value ≤ 0.05 and either LFC (log fold change, tumor versus normal) ≤ -1 (down-regulated) or LFC ≥ 1 (up-regulated, i.e., the gene is expressed at a higher level in the tumor than in the normal sample) were considered for further analysis. The result files from idep.951, including DEGs and PCA plots for each tissue, are in Additional file 2: part 2. We used a Venn diagram tool [28] to identify the intersecting DEGs (up- or down-regulated) that are common to conditions (i) and (ii).
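The selection rule above (adjusted p-value ≤ 0.05 with LFC ≥ 1 or LFC ≤ -1) can be written as a short filter. This sketch assumes the differential expression results have already been exported as a mapping from gene name to (LFC, adjusted p-value). The function name and the toy genes are illustrative, not taken from the paper.

```python
def classify_degs(results, max_adj_p=0.05, min_abs_lfc=1.0):
    """Split genes into up- and down-regulated sets.

    `results` maps gene name -> (log fold change, adjusted p-value),
    with the fold change taken as tumor versus normal.
    """
    up, down = set(), set()
    for gene, (lfc, adj_p) in results.items():
        if adj_p > max_adj_p:
            continue  # not significant
        if lfc >= min_abs_lfc:
            up.add(gene)
        elif lfc <= -min_abs_lfc:
            down.add(gene)
    return up, down


# Toy results with placeholder gene names.
toy_results = {
    "GENE_A": (2.3, 0.001),   # significant, up
    "GENE_B": (-1.0, 0.05),   # on both thresholds, down
    "GENE_C": (1.5, 0.20),    # large change but not significant
    "GENE_D": (0.4, 0.001),   # significant but small change
}
up_genes, down_genes = classify_degs(toy_results)
```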

Using Eqs. (1) and (2), the true positive rate was calculated (Additional file 1: Table S7).

$$True\;positive_{{UP}} = {\text{ }}\frac{{intersect\;(UP)}}{{mean\;UP\;\left( {ON/TT\& OT/TN} \right)}}$$

(1)

$$True\;positive_{{Down}} = \frac{{intersect\;(Down)}}{{mean\;Down\;\left( {ON/TT\& OT/TN} \right)}}$$

(2)
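Equations (1) and (2) can be evaluated directly on the two DEG sets. The sketch below assumes that the denominator, "mean UP (ON/TT & OT/TN)", is the average of the two set sizes. That reading, the function name, and the toy gene identifiers are assumptions.

```python
def true_positive_rate(degs_on_tt, degs_ot_tn):
    """Intersection size divided by the mean size of the two DEG sets."""
    intersect = len(degs_on_tt & degs_ot_tn)
    mean_size = (len(degs_on_tt) + len(degs_ot_tn)) / 2
    return intersect / mean_size if mean_size else 0.0


# Toy up-regulated sets for the two comparisons.
up_on_tt = {"G1", "G2", "G3", "G4"}
up_ot_tn = {"G2", "G3", "G4", "G5", "G6", "G7"}
tpr_up = true_positive_rate(up_on_tt, up_ot_tn)  # 3 shared / mean size 5
```

The same function applies unchanged to the down-regulated sets for Eq. (2).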

The prostate had the highest true positive rate, so we chose it to evaluate the model's performance from a biological perspective. Figure 3 shows the Venn diagram and PCA results for the prostate, which indicate that DeeP4med can distinguish between normal and tumor states in each matrix (Venn diagrams of the other tissues are in Additional file 2: part 3).

Enrichment analysis

Using the DEGs and the Enrichment Analysis Visualization Appyter website [29], we performed four enrichment analyses, which are discussed below: (1) Gene ontology (GO) biological process; (2) Cancer cell line encyclopedia (CCLE) proteomics; (3) Kyoto encyclopedia of genes and genomes (KEGG) pathway; (4) ChIP enrichment analysis (ChEA). Table 2 shows the results of these four types of enrichment and some examples of intersecting results between the two types of matrices in the 13 types of cancer. Although the enrichment results for all tissues are shown in Table 2, below we evaluate only the enrichment results for prostate cancer against the literature (the results of the enrichment analysis for all tissues are presented in Additional file 2: part 4).

Table 2 Enrichment results. Four types of enrichment were performed on ON_TT and OT_TN in the 13 types of cancer. The number of results for each matrix is shown as well as the number of intersecting results. In the last column, the names of some intersecting results are shown

CCLE_Proteomics_2020 enrichment analysis

According to Table 2, VCAP is present in the enrichment results of both types of prostate matrix. Using the TCGA-110CL (https://comphealth.ucsf.edu/app/tcga-110) website, the expression profiles of real prostate cancer samples in the TCGA database were compared with the expression profiles of different cell lines, as shown in Additional file 1: Fig. S1. This figure shows that the VCAP cell line correlates most strongly with prostate cancer samples. Therefore, the nature of the data produced by DeeP4med is consistent with real data. The number of cell lines correctly identified by enrichment analysis for each tissue, together with the best cell line and its P-value, is shown in Additional file 1: Table S8. The expression matrix generated by the model was correctly identified in 11 of the 13 tissues. Only salivary and cervical tissues lacked appropriate cell lines. However, for salivary tissue, five cell lines (BICR6, SCC25, HSC4, BICR22, and CAL27) were identified that are anatomically close to this tissue (Additional file 2: part 4_salivary section).

KEGG 2021 human enrichment analysis

According to Table 2, there are 21 intersecting KEGG pathways in the prostate, and the Ras signaling pathway is one of the most important of them, so we discuss its role in prostate cancer (a complete list of pathways is shown in Additional file 2: part 4_prostate section). In 2009, Pearson et al. [30] showed that malfunction in Wnt and Ras signaling, and mutations in K-ras and beta-catenin, can lead to invasive carcinoma in the prostate. In 2016, Chen et al. [31] extracted 41 important proteins by text mining prostate cancer articles and created a protein–protein interaction (PPI) network. By applying functional annotation to this network, they found that Ras protein signal transduction is one of the important signaling pathways in prostate cancer. In 2021, Strittmatter et al. [32] showed that changes in ERG gene expression driven by the Ras/ERK and PI3K/AKT signaling pathways promote prostate tumors.

GO_Biological_Process_2021 analysis

Based on the enrichment results in Table 2, 264 intersecting biological processes were obtained in prostate cancer (Additional file 2: part 4_prostate section). The MAPK cascade intersects between the two matrices and is a key downstream component of Ras signaling, so we chose it to discuss its role in prostate cancer. A search of the Coremine (https://www.coremine.com/medical/#search) database revealed approximately 20 articles related to prostate cancer and the MAPK cascade (GO:0000165) and 12 articles related to prostate cancer and actin cytoskeleton reorganization (GO:0031532). In 2019, Wu et al. [33] analyzed expression and methylation profiles of prostate cancer and found 322 genes that were hypermethylated and downregulated. By enriching these genes, they found that the MAPK cascade was one of the important biological processes. In 2020, Singh et al. [34], aiming to identify biomarkers in prostate cancer, analyzed the proteomic profiles of prostate cancer cell lines such as LNCaP and PC-3 by mass spectrometry and found that 474 proteins were deregulated. Their enrichment analysis revealed that some biological processes, such as the MAPK cascade, have an essential role in the initiation and progression of cancer. In 2021, Shen et al. [35] showed that expression of MAPK4 (one gene of the MAPK cascade) promoted prostate cancer cell proliferation, making this gene a potential target for prostate cancer treatment.

ChEA_2016 enrichment analysis

According to Table 2, the transcription factors FOXP1, RELA, and KDM2B intersect between OT_TN and ON_TT in prostate cancer. The Coremine website finds approximately 18, 250, and 4 articles relating FOXP1, RELA, and KDM2B, respectively, to prostate cancer. In 2021, Panigrahi et al. [36] knocked down the RAD9 gene in prostate cancer DU145 cells and found that expression of FOXP1 was down-regulated, so migration and proliferation of tumor cells decreased. In 2022, Raspin et al. [37] investigated gene fusions in prostate cancer in TCGA data; one of these fusions was RYBP:FOXP1 (the complete list of enriched transcription factors is shown in Additional file 2: part 4_prostate section).

To better evaluate the performance of our model, we compared the matrices produced by the model with the original matrix (ON/OT). We used two different approaches: a biological approach using DEG analysis, and a statistical approach using PCA and UMAP, which show the distribution of samples after dimensionality reduction. The output of our model for each tissue is two types of matrices: OT/TN and ON/TT. We separately compared the DEGs obtained from each of these matrices with the DEGs obtained from the true matrix using Venn diagrams. There are 13 tissues; for each tissue, there are two comparisons (OT/TN vs. original and ON/TT vs. original), and in each comparison the up and down states were analyzed separately. Therefore, 52 Venn diagrams were obtained (Additional file 2: part 5_Venny and DEGs). The results show that, depending on the type of tissue, a sufficient number of DEGs (up and down genes) are common to these three types of matrices, which indicates that our model was able to produce matrices similar to the true matrix. The PCA and UMAP plots also show that, in all three types of matrices (original, OT/TN, and ON/TT), the distribution of normal samples differs from that of tumor samples. These results indicate that our model has learned the pattern of normal and tumor samples and produced new matrices accordingly (Additional file 2: part 5_PCA&UMAP). The list of DEGs common to all three matrices, and their Venn diagram, is available in Additional file 2: part 6. These genes are the most important because they appear in all three matrices.

Discussion

By focusing on each patient and understanding the complex molecular mechanisms of the disease and its interaction with environmental factors and individual genetic diversity, P4 medicine has become the most effective approach in personalized medicine. By applying systems biology methods, P4 medicine's primary goal is to make the disease state predictable, preventable, and curable. However, individuals' genetic and demographic information affects the molecular mechanisms that drive the disease stage, and identifying them requires deep learning approaches. In this work, we developed a transfer model capable of predicting the disease state using RNAseq data (i.e., bulk transcriptomics). Transcriptomic data is readily available through different projects (e.g., TCGA) and is more dynamic than genomic data alone: it reflects changes in the epigenome of cells and shows how gene expression is modulated by different disease conditions and, in the context of cancer, by the interaction between tumor cells and the tumor microenvironment. In other words, RNAseq captures the changes in diseased cells by measuring their gene expression profile. Because the changes in all genes are measured, RNAseq data is very comprehensive and well suited to deep learning applications. Our fundamental goal in developing DeeP4med was to use deep learning to predict changes in gene expression profiles. Previous work has attempted this using different kinds of data. For example, DeepChrome uses histone modifications to predict gene expression profiles [38]. HE2RNA uses histopathology images to predict gene expression profiles in tumors [39] or tuberculosis [40]. Some models, like Enformer [41] or similar models [42, 43], predict gene expression from DNA sequences. DeeP4med predicts normal gene expression from tumor gene expression and vice versa. One of DeeP4med's uses is to predict how cancer would look if it developed in a currently healthy person. Suppose we have the normal gene expression profile of a healthy person in a specific tissue. The model can predict the probable future tumor profile of that person. We can then find out which genes are involved in this process and reduce that person's risk of cancer by prescribing certain drugs or taking special care. By developing such models with other omics data, such as genomic, proteomic, or metabolomic data, we can hope that, besides predicting the expression profile, the model can also suggest specific and suitable drugs for treatment. The development of this model and its capacity to predict the tumor state from healthy conditions will stimulate further research in P4 medicine. One therapeutic use of such models is integrating them with models that apply deep learning to drug repositioning [44, 45]. The use of combined models is a new horizon in the diagnosis and treatment of diseases.

Method

Our deep learning model contains two separate deep models, Classifier and Transferor, named after their functions. We first trained a neural network called Classifier to classify the type of gene expression profiles (tumor or normal) and their corresponding tissues. Then, using Classifier as our discriminator, we trained Transferor, an autoencoder that transfers gene expression profiles from the tumor type to their nearest normal version and vice versa while keeping their tissue of origin unchanged. The Transferor is conditioned on the type of the input sample (tumor or normal) and simultaneously generates the normal and tumoral versions of the input. Classifier lets us measure the performance of the Transferor in terms of the concordance between the expected type and tissue and those predicted for the generated mRNA of the opposite type. The other performance measure in the loss function of the Transferor is the mean squared distance between the input sample and the generated mRNA of the same type.

Experimental setup

The loss function of the Transferor is a weighted sum of three losses. The first two losses are computed based on the Classifier's output and measure the Transferor's performance as a classification task. The third loss computes the distance between the generated mRNA and the input and can be considered a regression task. We used mean squared error to measure the distance between the input and the generated mRNA. We use cosine similarity to measure the correspondence between the type and tissue of the input and generated mRNA [46]. To assess the performance of the model, we used fivefold cross-validation.

Classifier

The classifier has an architecture similar to that of the model proposed in DeePathology [46], which is an autoencoder augmented with two classifiers (see Fig. 4). In this work, after training the whole proposed architecture, we remove all layers related to mRNA reconstruction and only use the type and tissue classifier layers. Following DeePathology [46], to show that our autoencoder effectively separates the input samples, we visualize the embeddings at the bottleneck layer of DeeP4med.

This network can be symbolized as:

$$({tissue}_{output},{type}_{output},{mRNA}_{output})={MLP}_{\gamma }^{autoencoder}({mRNA}_{input})$$

(3)

The autoencoder and the two classifiers are trained jointly with the following weighted loss:

$$\begin{aligned} & w_{1} *MSE\left( {mRNA_{{input}} ,mRNA_{{output}} } \right) + w_{2} *Cosine\;Distance\left( {type_{{input}} ,type_{{output}} } \right) \\ & + w_{3} *Cosine\;Distance\left( {tissue_{{input}} ,tissue_{{output}} } \right) \\ \end{aligned}$$

(4)

Here ${w}_{1},{w}_{2}$ and ${w}_{3}$ are weights, which we set to the values used previously [46]. Only the classifier part of this network is used when training the Transferor, as shown in Eq. (8).
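The weighted loss in Eq. (4) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: type and tissue are represented as non-zero vectors (for example one-hot targets and softmax outputs), and the default weights of 1.0 are placeholders, since the actual values follow [46] and are not given here.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classifier_loss(mrna_in, mrna_out, type_in, type_out,
                    tissue_in, tissue_out, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms of Eq. (4); weights are placeholders."""
    w1, w2, w3 = weights
    return (w1 * mse(mrna_in, mrna_out)
            + w2 * cosine_distance(type_in, type_out)
            + w3 * cosine_distance(tissue_in, tissue_out))


# Toy example: correct type, wrong tissue, small reconstruction error.
loss = classifier_loss(
    mrna_in=[0.2, 0.8], mrna_out=[0.2, 0.6],
    type_in=[1.0, 0.0], type_out=[1.0, 0.0],
    tissue_in=[0.0, 1.0, 0.0], tissue_out=[1.0, 0.0, 0.0],
)
```

In this toy case the type term is zero, the tissue term is 1 (orthogonal vectors), and the reconstruction term is about 0.02.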

Transferor

The Transferor consists of an encoder and two decoders that share parameters. In each forward pass of the model, an mRNA profile and its type are encoded, and the resulting embedding is then concatenated with each type separately. This yields two vectors: the encoded mRNA concatenated with the same type as the input, and the encoded mRNA concatenated with the opposite type. These type-augmented embeddings are then fed to the decoders individually (Fig. 5). Formally, the encoding process is summarised in Eq. (5).

$$h={MLP}_{\phi }^{enc}({mRNA}_{input};type)$$

(5)

MLPφ denotes our encoder, parametrized by φ, and h is the embedding of an mRNA given its type. Equations (6) and (7) show the decoding.

$${mRNA}_{dec}^{(1)}={MLP}_{\psi }^{dec}(h;{type}_{original})$$

(6)

$${mRNA}_{dec}^{(2)}={MLP}_{\psi }^{dec}(h;{type}_{opposite})$$

(7)

We want DeeP4med to keep the tissue unchanged but control the type of mRNA. Formally, it should satisfy Eq. (8):

$$({type}_{output};{tissue}_{output})={MLP}_{\theta }^{Classifier}({mRNA}_{dec}^{(2)})$$

(8)

Each output of the Transferor contributes to the loss function. The opposite-type output, ${mRNA}_{dec}^{(2)}$ in Eq. (7), which should have the same tissue as the input but the opposite type, is evaluated by the Classifier (Eq. 8). The same-type output, ${mRNA}_{dec}^{(1)}$ in Eq. (6), which should have the same tissue and the same type as the input, is used to measure the similarity between the input and the output. The total loss for this network is therefore a weighted sum of the cosine distances between the Classifier’s outputs and the expected tissue and type, and the mean squared distance between the generated mRNA and the input. Mathematically, we have Eq. (9):

$$\begin{aligned} Loss_{{Total}} & = w_{1} \cdot MSE\left( {mRNA_{{dec}}^{{\left( 1 \right)}} ;mRNA_{{input}} } \right) \\ & \quad + w_{2} \cdot Cosine\,Distance\left( {type_{{opposite}} ;type_{{output}} } \right) \\ & \quad + w_{3} \cdot Cosine\,Distance\left( {tissue_{{input}} ;tissue_{{output}} } \right) \\ \end{aligned}$$

(9)

At this stage, only the Transferor parameters are updated, and the classifier parameters are frozen.
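The forward pass and loss of Eqs. (5) to (9) can be traced step by step in the sketch below. The encoder, decoder, and classifier are passed in as plain callables so that the data flow is visible. The toy callables, the one-hot encoding of type and tissue, and the unit weights are assumptions made for illustration and do not reflect the trained networks.

```python
import math


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def transferor_loss(mrna_in, type_in, type_opposite, tissue_in,
                    encoder, decoder, classifier, weights=(1.0, 1.0, 1.0)):
    h = encoder(mrna_in, type_in)                      # Eq. (5)
    mrna_same = decoder(h, type_in)                    # Eq. (6)
    mrna_opposite = decoder(h, type_opposite)          # Eq. (7)
    type_out, tissue_out = classifier(mrna_opposite)   # Eq. (8), frozen model
    w1, w2, w3 = weights
    return (w1 * mse(mrna_same, mrna_in)               # Eq. (9)
            + w2 * cosine_distance(type_opposite, type_out)
            + w3 * cosine_distance(tissue_in, tissue_out))


# Toy stand-ins (hypothetical): type is one-hot, [1, 0] normal, [0, 1] tumor.
def toy_encoder(mrna, sample_type):
    return list(mrna)


def toy_decoder(h, sample_type):
    return [v + 0.5 * sample_type[1] for v in h]  # shift profile if tumor


def toy_classifier(mrna):
    is_tumor = sum(mrna) / len(mrna) > 0.5
    return ([0.0, 1.0] if is_tumor else [1.0, 0.0]), [1.0, 0.0]


loss = transferor_loss([0.1, 0.3], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       toy_encoder, toy_decoder, toy_classifier)
```

In a real training loop only the encoder and decoder parameters would receive gradients from this loss, matching the statement that the Classifier is frozen.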

Hyperparameter tuning

In Neural Networks (NN), there are many hyperparameters, and tuning them is critical for finding the best model. Given that we used multilayer perceptron networks, the hyperparameters are:

Units: The number of neurons in each layer, which is critical to finding the best architecture.
Activation function: In an artificial neural network (ANN), an activation function is applied to the weighted sum of the inputs of each neuron. ReLU (Rectified linear unit) [47], defined as f(x) = max(0, x) and widely used in ANNs, was one of our candidate activation functions. Softplus (f(x) = ln(1 + e^x)) [48] and Linear (f(x) = x) were further candidates. We also used Elu (Exponential linear unit), which is defined in Eq. (10):
$$f(x) = \left\{ \begin{array}{ll} e^{x} - 1, & x \le 0 \\ x, & x > 0 \end{array} \right.$$
(10)
Dropout rate: A dropout layer sets each of its input values to zero with a probability equal to the defined rate. Dropout has been widely used in recent years to prevent overfitting.

The hyperparameter search space is shown in the Additional file 1: Tables S9 and S10.
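For concreteness, the four candidate activation functions listed above can be written as scalar functions. This is a plain-Python sketch of the stated definitions, with ELU taken exactly as in Eq. (10), that is, without an extra scaling parameter.

```python
import math


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def softplus(x):
    """f(x) = ln(1 + e^x); math.exp overflows for very large x."""
    return math.log1p(math.exp(x))


def linear(x):
    """f(x) = x"""
    return x


def elu(x):
    """f(x) = x for x > 0, e^x - 1 otherwise (Eq. 10)."""
    return x if x > 0 else math.expm1(x)


values = [f(-1.0) for f in (relu, softplus, linear, elu)]
```

Unlike ReLU, both Softplus and ELU are smooth around zero, and ELU keeps small negative outputs instead of clipping them to zero.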

Conclusion

A general review of all the results shows that DeeP4med is successful by machine learning measures. In terms of biological results, DeeP4med performs relatively well, depending on the tissue type. DeeP4med is not yet complete and does not consider all aspects of the problem, and its performance can be improved in future studies.

Availability of data and materials

The datasets analysed during the current study and its supplementary information files are available in a Google Drive repository (https://drive.google.com/drive/folders/1lMMQdMXsHT8fcP9Mz9sb6NpyFByI7rcj?usp=share_link). The code is available from the corresponding author upon reasonable request (stahmasebian@gmail.com).

Abbreviations

DEGs: Differentially expressed genes
TCGA: The Cancer Genome Atlas
GTEx: Genotype-tissue expression
SVC: Support vector classifier
LR: Logistic regression
LDA: Linear discriminant analysis
NBayes: Naive Bayes
DTree: Decision tree
RForest: Random forest
KNN: K nearest neighbors
PCA: Principal component analysis
TT: Transfer tumor
ON: Original normal
TN: Transfer normal
OT: Original tumor
LFC: Log fold change
GO: Gene ontology
CCLE: Cancer cell line encyclopedia
KEGG: Kyoto encyclopedia of genes and genomes
ChEA: ChIP enrichment analysis
PPI: Protein–protein interaction
ReLU: Rectified linear unit
Elu: Exponential linear unit
MLPM: Machine learning for personalized medicine
ANN: Artificial neural network

References

Schleidgen S, Fernau S, Fleischer H, Schickhardt C, Oßa A-K, Winkler EC. Applying systems biology to biomedical research and health care: a précising definition of systems medicine. BMC Health Serv Res. 2017;17:761.
Article PubMed PubMed Central Google Scholar
Beresford MJ. Medical reductionism: lessons from the great philosophers. QJM: Int J Med. 2010;103:721–4.
Article Google Scholar
Ayers D, Day PJ. Systems medicine: the application of systems biology approaches for modern medical research and drug development. Mol Biol Int. 2015;2015:698169.
Article PubMed PubMed Central Google Scholar
Seo J, Shin JY, Leijten J, Jeon O, Camci-Unal G, Dikina AD, et al. High-throughput approaches for screening and analysis of cell behaviors. Biomaterials. 2018;153:85–101.
Article CAS PubMed Google Scholar
Zheng F, Wei L, Zhao L, Ni F. Pathway network analysis of complex diseases based on multiple biological networks. BioMed Res Int. 2018;2018:1–12.
Article Google Scholar
Huang S, Chaudhary K, Garmire LX. More is better: recent progress in multi-omics data integration methods. Front Genet. 2017. https://doi.org/10.3389/fgene.2017.00084.
Article PubMed PubMed Central Google Scholar
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype-phenotype interactions. Nat Rev Genet. 2015;16:85–97.
Article CAS PubMed Google Scholar
Casamassimi A, Federico A, Rienzo M, Esposito S, Ciccodicola A. Transcriptome profiling in human diseases: New advances and perspectives. Int J Mol Sci. 2017;18:1652.
Article PubMed PubMed Central Google Scholar
Lotfollahi M, Wolf FA, Theis FJ. scGen predicts single-cell perturbation responses. Nat Methods. 2019;16:715–21.
Article CAS PubMed Google Scholar
Maceachern SJ, Forkert ND. Machine learning for precision medicine. Genome. 2021;64:416–25.
Article PubMed Google Scholar
Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, et al. From hype to reality: Data science enabling personalized medicine. BMC Med. 2018;16:150.
Article PubMed PubMed Central Google Scholar
Papadakis GZ, Karantanas AH, Tsikankis M, Tsatsakis A, Spandidos DA, Marias K. Deep learning opens new horizons in personalized medicine (Review). Biomed Rep. 2019;10:215–7.
Hetzel L, Böhm S, Kilbertus N, Günnemann S, Lotfollahi M, Theis F. Predicting single-cell perturbation responses for unseen drugs. 2022.
Weiss JC, Natarajan S, Peissig PL, McCarty CA, Page D. Machine learning for personalized medicine: predicting primary myocardial infarction from electronic health records. AI Mag. 2012;33:33–45.
Papaxanthos L, Llinares-López F, Bodenham D, Borgwardt K. Finding significant combinations of features in the presence of categorical covariates. 2016.
Llinares-López F, Grimm DG, Bodenham DA, Gieraths U, Sugiyama M, Rowan B, et al. Genome-wide detection of intervals of genetic heterogeneity associated with complex traits. Bioinformatics. 2015;31:i240-9.
Sugiyama M, López FL, Kasenburg N, Borgwardt KM. Significant subgraph mining with multiple testing correction. 2014.
Zhao H-B, Xu G-B, Yang W-Q, Li X-Z, Chen S-X, Gan Y, et al. Bioinformatics-based identification of the key genes associated with prostate cancer. Zhonghua Nan Ke Xue. 2021;27:489–98.
Wang KP, Yuan YJ, Zhu JQ, Li BL, Zhang TT. Analysis of key genes and signal pathways of human papilloma virus-related head and neck squamous cell carcinoma. Zhonghua Kou Qiang Yi Xue Za Zhi. 2020;55:571–7.
Wang Y, Wang Y-S, Hu N-B, Teng G-S, Zhou Y, Bai J. Bioinformatics analysis of core genes and key pathways in myelodysplastic syndrome. Zhongguo Shi Yan Xue Ye Xue Za Zhi. 2022;30:804–12.
Pan Z, Fang Q, Zhang Y, Li L, Huang P. Identification of key pathways and drug repurposing for anaplastic thyroid carcinoma by integrated bioinformatics analysis. Zhejiang Da Xue Xue Bao Yi Xue Ban. 2018;47:187–93.
Wang Q, Armenia J, Zhang C, Penson AV, Reznik E, Zhang L, et al. Data descriptor: Unifying cancer and normal RNA sequencing data from different sources. Scientific Data. 2018. https://doi.org/10.1038/sdata.2018.61.
Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Wspolczesna Onkol. 2015;1A:A68-77.
Ardlie KG, DeLuca DS, Segrè AV, Sullivan TJ, Young TR, Gelfand ET, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–60.
Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom Intell Lab Syst. 1987;2:37–52.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
Ge SX, Son EW, Yao R. iDEP: An integrated web application for differential expression and pathway analysis of RNA-Seq data. BMC Bioinform. 2018;19:1–24.
Oliveros JC. Venny. An interactive tool for comparing lists with Venn's diagrams. 2007–2015. https://www.scirp.org/(S(lz5mqp453edsnp55rrgjct55))/reference/referencespapers.aspx?referenceid=2904043. Accessed 26 Jun 2022.
Clarke DJB, Jeon M, Stein DJ, Moiseyev N, Kropiwnicki E, Dai C, et al. Appyters: turning Jupyter Notebooks into data-driven web apps. Patterns. 2021;2:100213.
Pearson HB, Phesse TJ, Clarke AR. K-ras and Wnt signaling synergize to accelerate prostate tumorigenesis in the mouse. Cancer Res. 2009;69:94–101.
Chen C, Shen H, Zhang LG, Liu J, Cao XG, Yao AL, et al. Construction and analysis of protein-protein interaction networks based on proteomics data of prostate cancer. Int J Mol Med. 2016;37:1576–86.
Strittmatter BG, Jerde TJ, Hollenhorst PC. Ras/ERK and PI3K/AKT signaling differentially regulate oncogenic ERG mediated transcription in prostate cells. PLoS Genet. 2021;17:e1009708.
Wu K, Yin X, Jin Y, Liu F, Gao J. Identification of aberrantly methylated differentially expressed genes in prostate carcinoma using integrated bioinformatics. Cancer Cell Int. 2019. https://doi.org/10.1186/s12935-019-0763-8.
Singh AN, Sharma N. Quantitative SWATH-based proteomic profiling for identification of mechanism-driven diagnostic biomarkers conferring in the progression of metastatic prostate cancer. Front Oncol. 2020;10:493.
Shen T, Wang W, Zhou W, Coleman I, Cai Q, Dong B, et al. MAPK4 promotes prostate cancer by concerted activation of androgen receptor and AKT. J Clin Investig. 2021. https://doi.org/10.1172/JCI135465.
Panigrahi SK, Broustas CG, Cuiper PQ, Virk RK, Lieberman HB. FOXP1 and NDRG1 act differentially as downstream effectors of RAD9-mediated prostate cancer cell functions. Cellular Signal. 2021;86:110091.
Raspin K, O’Malley DE, Marthick JR, Donovan S, Malley RC, Banks A, et al. Analysis of a large prostate cancer family identifies novel and recurrent gene fusion events providing evidence for inherited predisposition. Prostate. 2022;82:540–50.
Singh R, Lanchantin J, Robins G, Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics. 2016;32:i639–48.
Schmauch B, Romagnoni A, Pronier E, Saillard C, Maillé P, Calderaro J, et al. A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nat Commun. 2020;11:1–15.
Tavolara TE, Niazi MKK, Gower AC, Ginese M, Beamer G, Gurcan MN. Deep learning predicts gene expression as an intermediate data modality to identify susceptibility patterns in Mycobacterium tuberculosis infected Diversity Outbred mice. EBioMedicine. 2021;67:103388.
Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18:1196–203.
Vaishnav ED, de Boer CG, Molinet J, Yassour M, Fan L, Adiconis X, et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature. 2022;603:455–63.
Washburn JD, Mejia-Guerra MK, Ramstein G, Kremling KA, Valluru R, Buckler ES, et al. Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence. Proc Natl Acad Sci USA. 2019;116:5542–9.
Zhao B-W, Wang L, Hu P-W, Wong L, Su X-R, Wang B-Q, et al. Fusing Higher and Lower-order Biological Information for Drug Repositioning via Graph Representation Learning. IEEE Trans Emerg Topics Comput. 2023. https://doi.org/10.1109/TETC.2023.3239949.
Zhao B-W, You Z-H, Hu L, Guo Z-H, Wang L, Chen Z-H, et al. A novel method to predict drug-target interactions based on large-scale graph representation learning. Cancers. 2021;13:2111.
Azarkhalili B, Saberi A, Chitsaz H, Sharifi-Zarchi A. DeePathology: deep multi-task learning for inferring molecular pathology from cancer transcriptome. Sci Rep. 2019;9:1–14.
Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. https://www.researchgate.net/publication/221345737_Rectified_Linear_Units_Improve_Restricted_Boltzmann_Machines_Vinod_Nair. Accessed 26 Jun 2022.
Glorot X, Bordes A, Bengio Y. Deep Sparse Rectifier Neural Networks. 2011.

Acknowledgements

The authors would like to thank the Cellular and Molecular Research Center, Basic Health Sciences Institute, Shahrekord University of Medical Sciences, Shahrekord, Iran. We are grateful to Dr. Seyed Abbas Mirzaei (Med. Biotechnol. Dep.) for his helpful comments and suggestions.

Funding

This study was supported by a grant from Shahrekord University of Medical Sciences (IR.SKUMS.REC.1397.293) and by complementary grant no. 980201 from the Biotechnology Development Council of the Islamic Republic of Iran.

Author information

Roohallah Mahdi-Esferizi
Present address: Department of Medical Biotechnology, School of Advanced Technologies, Shahrekord University of Medical Sciences, Shahrekord, Iran
Roohallah Mahdi-Esferizi and Behnaz Haji Molla Hoseyni have contributed equally.

Authors and Affiliations

Department of Medical Biotechnology, School of Advanced Technologies, Shahrekord University of Medical Sciences, Shahrekord, Iran
Fatemeh Elahian
Laboratory of Systems Biology and Bioinformatics (LBB), University of Tehran, Tehran, Iran
Behnaz Haji Molla Hoseyni
Faculty of Mathematics, Shahid Beheshti University, Tehran, Iran
Amir Mehrpanah
Department of Mathematics, Faculty of Basic Sciences, Iran University of Science and Technology (IUST), Tehran, Iran
Yazdan Golzade
Molecular Biology Research Center, Systems Biology and Poisonings Institute, Baqiyatallah University of Medical Sciences, Tehran, Iran
Ali Najafi
Centre for Cancer Biology, SA Pathology and University of South Australia, Adelaide, SA, 5000, Australia
Amin Zadeh Shirazi & Guillermo A. Gomez
Cellular and Molecular Research Center, Basic Health Sciences Institute, Shahrekord University of Medical Sciences, Shahrekord, Iran
Shahram Tahmasebian

Contributions

RME, BHMH, and ST conceptualized the main idea. RME, BHMH, AN, FE, GAG, AZS, and ST conceived and designed the experiments. RME and BHMH performed the experiments. RME, AM, and YG analyzed the data. RME, BHMH, AM, YG, AN, GAG, AZS, and ST contributed materials/analysis tools. RME, AM, FE, GAG, AZS, and ST wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shahram Tahmasebian.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Details of computational performance and hyperparameter space of DeeP4med (Transferor and Classifier), as well as biological evaluation of the model.

Additional file 2.

Dataset (gene expression matrix of different tissues), DEGs and enrichment analysis results, common and important genes between different matrices.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Mahdi-Esferizi, R., Haji Molla Hoseyni, B., Mehrpanah, A. et al. DeeP4med: deep learning for P4 medicine to predict normal and cancer transcriptome in multiple human tissues. BMC Bioinformatics 24, 275 (2023). https://doi.org/10.1186/s12859-023-05400-2

Received: 02 January 2023
Accepted: 25 June 2023
Published: 04 July 2023
DOI: https://doi.org/10.1186/s12859-023-05400-2