  • Methodology article
  • Open access

PASNet: pathway-associated sparse deep neural network for prognosis prediction from high-throughput data



Predicting prognosis in patients from large-scale genomic data is a fundamentally challenging problem in genomic medicine. However, the prognosis still remains poor in many diseases. The poor prognosis may be caused by high complexity of biological systems, where multiple biological components and their hierarchical relationships are involved. Moreover, it is challenging to develop robust computational solutions with high-dimension, low-sample size data.


In this study, we propose a Pathway-Associated Sparse Deep Neural Network (PASNet) that not only predicts patients’ prognoses but also describes complex biological processes regarding biological pathways for prognosis. PASNet models a multilayered, hierarchical biological system of genes and pathways to predict clinical outcomes by leveraging deep learning. The sparse solution of PASNet provides the capability of model interpretability that most conventional fully-connected neural networks lack. We applied PASNet for long-term survival prediction in Glioblastoma multiforme (GBM), which is a primary brain cancer that shows poor prognostic performance. The predictive performance of PASNet was evaluated with multiple cross-validation experiments. PASNet showed a higher Area Under the Curve (AUC) and F1-score than previous long-term survival prediction classifiers, and the significance of PASNet’s performance was assessed by Wilcoxon signed-rank test. Furthermore, the biological pathways, found in PASNet, were referred to as significant pathways in GBM in previous biology and medicine research.


PASNet can describe the different biological systems of clinical outcomes for prognostic prediction as well as predicting prognosis more accurately than the current state-of-the-art methods. PASNet is the first pathway-based deep neural network that represents hierarchical representations of genes and pathways and their nonlinear effects, to the best of our knowledge. Additionally, PASNet would be promising due to its flexible model representation and interpretability, embodying the strengths of deep learning. The open-source code of PASNet is available at


Predicting prognosis in patients from large-scale genomic data is a fundamentally challenging problem in genomic medicine [1–3]. Along with the rapid advances in high-throughput technologies and their effectiveness, high-dimensional genomic data provide more accurate and richer biological descriptions of clinical phenotypes of interest than ever before. Therefore, translating large-scale genomic profiles to clinical outcomes not only improves the prediction of patient prognosis but also helps in identifying prognostic factors and biological processes.

The capabilities of high-level biological representation and interpretation of the prognosis are often more desirable in biomedical research than merely improving predictive performance. Pathway-based analysis is an approach that a number of studies have investigated to improve both predictive performance and biological interpretability [4–6]. In pathway-based analyses, incorporating biological pathway databases into a model leverages prior biological knowledge so that potential prognostic factors of well-known biological functionality can be identified. Pathway-based analyses identify biological links between pathways and clinical outcomes and enable the interpretation of biological processes in which their corresponding genes and proteins are involved. Thus, pathway-based interpretation and visualization provide an intuitive and comprehensive understanding of functionally-related molecular mechanisms.

Moreover, pathway-based approaches have shown more reproducible analysis results than gene expression data analysis alone [4, 7–10]. Most pathway-based analyses consider high-level representations of gene co-expressions, each of which represents a biological pathway while preserving the original information. Thus, pathway-based analyses remedy the limitations of gene expression data, which are intrinsically sensitive to stochastic fluctuations arising from multiple potential sources, such as inherent stochasticity of biochemical processes, environmental differences, and genetic mutation [11]. Pathway-based markers were proposed for classifying breast cancer metastasis and ovarian cancer survival time [5]. Cancer subtypes were discovered with pathway-based markers via Restricted Boltzmann Machine (RBM) [8]. A group LASSO-based approach associated genes with pathways and characterized them based on biological pathways [10]. Higher-order functional representation of pathway-based metabolic features provided reproducible biomarkers for breast cancer diagnosis [9].

However, reliable and accurate prognosis still remains poor in many diseases due to the following challenges: high-dimension, low-sample size data and complex nonlinear effects between biological components.

Genomic data are highly dimensional relative to their sample sizes. High-dimension, low-sample size (HDLSS) data often make prediction models sensitive to noise and false positive associations, which consequently make predicting accurate prognoses difficult. LASSO-based approaches have been mainly considered to estimate the effects of a gene set that are associated with various types of clinical outcomes on HDLSS data. The LASSO-based approaches embed sparse coding schemes into linear or logistic regression models for selecting few but greatly informative features among the high-dimensional data. For instance, a logistic regression with sparse regularization was applied for the prognostic model of mortality after acute myocardial infarction [12]. Random LASSO was proposed to enhance the LASSO solution by applying multiple bootstrapping and was applied to predict patients’ survival times with glioblastoma gene expression data [13]. LASSO-based regression models as a prediction model were validated with multiple imputed data in chronic obstructive pulmonary disease patients [14].

Pathway-based analysis also helps to reduce data dimensionality. The number of biological pathways is relatively smaller than the number of genes, and a set of genes in the same pathway can be represented by the pathway’s effect. Thus, pathways can be used as summary variables for the input of the predictive model instead of including all genes, which consequently reduces the model complexity.

Most association studies between a gene set and various clinical outcomes have considered linear or logistic regression models for identifying prognostic factors as well as understanding the biological mechanism of disease progression. However, nonlinear effects of genes or pathways may fail to be identified by linear approaches. As a solution, kernel-based models have been proposed to capture nonlinear effects of complex pathways [15, 16]. Multiple kernel learning models were introduced to aggregate complex effects from multiple pathways [17, 18]. Kernel Principal Component Analysis (KPCA) was applied to reduce the dimensionality of the feature space by using the correlation structure of the pathways [18].

Recently, several attempts to capture hierarchical effects of genes and pathways have been made. Inferences of multilayered hierarchical gene regulatory networks have been considered to understand how pathways regulate each other hierarchically. A bottom-up graphic Gaussian model [19] and a recursive random forest algorithm [20] were proposed to construct multilayered hierarchical gene regulatory networks. Moreover, complex biological networks were modeled by inferring the multiple hierarchical models (1) between gene expression and pathways and (2) within pathways [21]. However, complex hierarchical relationships between pathways have not been considered for prognostic studies yet, to the best of our knowledge, although hierarchical effects of pathways are prevalent in biological systems [22].

In this paper, we propose a Pathway-Associated Sparse Deep Neural Network (PASNet) to achieve the following goals: (1) to predict prognosis in patients accurately by incorporating biological pathways, (2) to provide a systematic solution for hierarchical interpretation of nonlinear relationships between the biological pathways of a disease, and (3) to handle computational problems on HDLSS data with unbalanced classes. An innovative aspect of our model is biological interpretability; we achieved this with sparse coding and by constructing hidden layers with biological pathways, which counters the black-box nature of deep learning. Our new sparse deep learning architecture represents multiple molecular biological layers, such as a gene layer and a pathway layer, along with their hierarchical relationships, using sparse regularization.


Pathway-Associated Sparse Deep Neural Network (PASNet) identifies a subset of genes and pathways involved in a disease as prognostic biomarkers, as well as their interactions. PASNet models a multilayered, hierarchical biological system of genes and pathways in a disease, while leveraging the strengths of deep learning for competitive predictive performance. The sparsity of PASNet allows one to interpret the model, which conventional fully-connected networks do not. The architecture of PASNet and the strategies for training a sparse neural network model with HDLSS and imbalanced data are described in the “Methods” section.

We conducted experiments to evaluate PASNet’s predictive performance for long-term survival prediction in Glioblastoma multiforme (GBM). The capability of the prediction was assessed by comparing our model with the classifiers that have been used for long-term survival prediction. Furthermore, we will describe how PASNet can represent the biological system of GBM in the following section.


GBM is a primary brain cancer that shows poor prognosis performance due to the above challenges. Comprising more than half of all brain tumors, GBM is the most prevailing and aggressive malignant type of primary astrocytomas [23]. Patients with GBM have a median survival time of approximately 15 months with intensive treatments [24]. Furthermore, long-term survival patients with GBM are rare as more than 90% of patients are deceased within three years of diagnosis. Although treatments in neurosurgery, chemotherapy, and radiotherapy have improved, the prognosis of GBM remains poor [25]. Hence, the advancement in understanding molecular mechanisms and related biological pathways of GBM is significant to accelerating the progress for new treatments [24].

We used the gene expression data of GBM patients, which is available at The Cancer Genome Atlas (TCGA). The dataset includes the gene expression data of 522 samples and 12,042 genes and provides survival time and status. We considered patients who survived past 24 months (regardless of survival status) as long-term survivors (LTS) and patients who died in less than 24 months as short-term survivors (non-LTS). Living patients with a survival time of less than 24 months were considered censored and were excluded from the experiments. Finally, we obtained 99 LTS and 376 non-LTS samples, where around 20% of the samples were LTS patients.
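The labeling rule above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `label_patient` and the record layout (survival time in months plus a deceased flag) are hypothetical, and assigning a survival time of exactly 24 months to LTS is an assumption, since the text does not specify that boundary case.

```python
from typing import Optional

LTS_THRESHOLD_MONTHS = 24  # cutoff stated in the text


def label_patient(survival_months: float, deceased: bool) -> Optional[str]:
    """Return "LTS", "non-LTS", or None for a censored (excluded) patient."""
    if survival_months >= LTS_THRESHOLD_MONTHS:
        # Survived to the cutoff, regardless of survival status.
        # Assumption: exactly 24 months counts as LTS.
        return "LTS"
    if deceased:
        # Died before the cutoff.
        return "non-LTS"
    # Alive with less than 24 months of follow-up: censored, so excluded.
    return None
```

Records for which the function returns `None` would be dropped before training.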

For pathway-based analysis, we utilized a biological pathway database from the Molecular Signatures Database (MSigDB) [26]. From MSigDB, we extracted the biological pathways of Reactome. Then, we excluded the pathways that include fewer than ten genes, because small pathways are often redundant with larger pathways. As the input features, we considered the genes that belong to at least one pathway, since pathway annotations of genes are essential to construct the mask matrix M between the gene layer and the pathway layer. Finally, we considered 574 pathways and 4359 genes in the experiments. The gene expression data were standardized to a mean of zero and a standard deviation of one.
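The two filtering steps above might look like the sketch below. It assumes the pathway database has been loaded into a dictionary mapping pathway names to gene-symbol sets; the function name `select_pathways_and_genes` is hypothetical, and applying the size threshold to the full pathway (rather than to its overlap with the measured genes) is an assumption.

```python
from typing import Dict, List, Set, Tuple


def select_pathways_and_genes(
    pathways: Dict[str, Set[str]],
    measured_genes: Set[str],
    min_size: int = 10,
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Drop small pathways, then keep genes annotated to a retained pathway."""
    # Exclude pathways with fewer than `min_size` genes.
    kept = {name: genes for name, genes in pathways.items() if len(genes) >= min_size}
    # Union of all genes that belong to at least one retained pathway.
    annotated: Set[str] = set().union(*kept.values())
    # Input features: measured genes with at least one pathway annotation.
    input_genes = sorted(measured_genes & annotated)
    return kept, input_genes
```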

Experimental setting

We followed a typical design of conventional deep neural networks for PASNet. A sigmoid function and cross-entropy were used as the activation and the cost function, respectively. A softmax function was used in the output layer so that the probabilities of the output nodes add up to one. To tune PASNet’s training, we empirically determined the hyper-parameters by random search before the cross-validation experiments. The learning rate (η) was set to 1e−4, and L2 regularization (λ) was set to 3e−4. Adaptive Moment Estimation (Adam) was used as the stochastic optimizer [27]. Dropout was also applied to the two intermediate layers with dropping probabilities of 0.8 and 0.7, respectively. PASNet was implemented in PyTorch, and the source code is available at


We evaluated PASNet by comparing the performance with classifiers that have been used for prognosis prediction: Support Vector Machine (SVM), Random LASSO [13], LASSO Logistic Regression (LLR) [1], and neural network with dropout (Dropout NN).

Specifically, we used an SVM with a radial basis function (RBF) kernel (\(\gamma = 2^{-16}\) and \(C = 2^{3.9}\) by two-step grid search [28]). Random LASSO was trained so that every feature could be selected 20 times on average by bootstrapping, and the L1 regularization parameter was determined by 10-fold cross-validation. The LASSO parameter for LLR was also selected by 10-fold cross-validation. The fully-connected Dropout NN was designed with the same numbers of intermediate layers and neurons as the proposed PASNet, as well as the same dropout probabilities. The learning rate was 0.01 and the L2 regularization was 0.005. Note that PASNet has fewer weights to be trained in each epoch than Dropout NN because of sparse coding. Hence, the optimal hyper-parameters of L2 regularization and learning rate should differ between PASNet and Dropout NN. We empirically searched for the optimal hyper-parameters for PASNet and Dropout NN separately through multiple experiments. Dropout NN was implemented in PyTorch (

The experiments were carried out by stratified 5-fold cross-validation for maintaining the same proportions of the imbalanced samples in the classes. The cross-validation experiments were repeated ten times for performance reproducibility. Data preprocessing, such as data normalization, was separately applied on each fold. The testing data on each fold was scaled with the mean and standard deviation of the training data of the same fold.
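The fold construction and fold-wise scaling described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the helper names `stratified_folds`, `fit_scaler`, and `scale` are hypothetical, the round-robin assignment is one simple way to stratify, and the use of the population standard deviation is an assumption.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def fit_scaler(train_column: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation from the training fold only."""
    mu = mean(train_column)
    sigma = pstdev(train_column) or 1.0  # guard against constant features
    return mu, sigma


def scale(column: Sequence[float], mu: float, sigma: float) -> List[float]:
    """Standardize values with statistics estimated on the training data."""
    return [(v - mu) / sigma for v in column]
```

For each fold, `fit_scaler` would be called on every gene column of the training samples, and the returned mean and standard deviation would then be passed to `scale` for both the training and the test samples, so that no test-set statistics leak into training.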

The predictive performance of the five models was evaluated with two metrics: Area Under the Curve (AUC) and F1-score. The Receiver Operating Characteristic (ROC) curve (see Fig. 1) was traced over the score thresholds to examine the trade-off between True Positive Rate (TPR=TP/(TP+FN)) and False Positive Rate (FPR=FP/(FP+TN)), where LTS was considered positive. The AUC was computed as the area under the ROC curve. The F1-score, the harmonic mean of Positive Predictive Value (PPV=TP/(TP+FP)) and TPR, is calculated as 2(PPV×TPR)/(PPV+TPR). The F1-score was computed for the LTS class.
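As a worked illustration of the two metrics, the sketch below computes the F1-score from the definitions above and the AUC through its rank-based formulation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve. The function names are hypothetical, labels are assumed to be 1 for LTS and 0 for non-LTS, and this is not the evaluation code used in the study.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2(PPV x TPR)/(PPV + TPR) for the positive (LTS = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a few hundred patients but would be replaced by a sort-based computation on larger data.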

Fig. 1

ROC Curves. PASNet produces the highest AUC of 0.6622 while the AUC of Dropout NN, SVM, random LASSO, and LLR is 0.6408, 0.6337, 0.6209, and 0.5899, respectively

The average AUC and the average F1-score of the five methods on the test datasets are shown in Table 1. PASNet outperformed the other methods on both AUC and F1-score. PASNet produced an AUC of 0.6622 ±0.013 (mean ±std) and an F1-score of 0.3978 ±0.016. Following PASNet, Dropout NN produced an AUC of 0.6408 ±0.014, and SVM produced an AUC of 0.6337 ±0.015.

Table 1 Comparison of AUC and F1-score in over ten stratified 5-fold cross-validations

To statistically assess the performance of PASNet (AUC) against the other methods, we conducted the Wilcoxon signed-rank test: a non-parametric, paired, two-sided test for the null hypothesis that the median difference in paired samples is zero. Specifically, the null hypothesis is that the benchmark classifier has equal or better performance than our proposed algorithm. Table 2 shows that the performance of PASNet is significantly better than that of the others, where the null hypotheses are rejected at the 5% significance level (p-value <0.05). Hence, PASNet's performance advantage over the benchmark classifiers was statistically significant.

Table 2 The Wilcoxon signed-rank tests for comparing PASNet with the Benchmark Classifiers

SVM and Dropout NN showed a higher AUC than LASSO logistic regression and Random LASSO, probably because of their capability of capturing nonlinear effects of genes. Compared to Dropout NN, PASNet is a relatively thin network, where the connections between layers are very sparse. However, PASNet interestingly produced higher performance than Dropout NN. It shows that PASNet builds a robust network model, which is simplified to represent the biological processes for prognosis prediction by incorporating biological prior knowledge.


Although PASNet yielded competitive predictive performance in the experiments, a more promising contribution of PASNet is in the model’s interpretability. In this section, we demonstrate a plausible biological mechanism inferred by PASNet for long-term survival prediction in GBM. The graphical representations of the PASNet model are illustrated in Figs. 2, 3 and 4 in the top-down order. The heatmaps were generated by sorting the weights and node values of LTS, and positive and negative weight values are colored in red and blue, respectively.

Fig. 2

Graphical representation of the output node values over the samples by PASNet. LTS samples obtain higher node values in LTS node than non-LTS samples. Similarly, non-LTS samples obtain higher node values in non-LTS node than LTS samples

Fig. 3

Graphical representation among the output layer, hidden layer, and pathway layer in PASNet. (a) The weights between the hidden layer and the output layer. Hidden nodes are sorted in descending order. (b) The node values in the hidden layer. The horizontal dotted lines separate LTS and non-LTS samples. The vertical dotted lines indicate that LTS/non-LTS samples are significantly distinguished by the top 16 pathways. (c) The absolute weights between the pathway layer and the hidden layer

Fig. 4

Graphical representation of the 10 top-ranked pathways by PASNet. (a) The absolute weights between the 10 top-ranked pathway nodes and the hidden layer. It is a zoomed-in view of Fig. 3c. (b) Weights between the gene layer and the 10 top-ranked pathway nodes. The connections are determined by the Reactome database

First, Fig. 2 shows the posterior probability of the samples for the clinical outcomes. The dark block on the top shows the output node values (\(-\log_2(\text{node value})\)) of the LTS samples, while the remaining ones are non-LTS samples. The weight values of the connections from the hidden nodes to the output nodes are depicted in Fig. 3a, where dropped connections are colored in white. The figure reveals distinct patterns of weights (opposite signs) to the two output neurons. Note that some hidden nodes are disconnected from the neurons in the output layer (colored in white) by sparse coding, which indicates that these hidden nodes are insignificant.

The hidden node values of the samples are shown in Fig. 3b. The values of the hidden nodes indicate the intensity of the group effects of the pathways that are connected to the hidden nodes. For instance, the first 16 hidden nodes in Fig. 3b show distinguishable intensities for LTS and non-LTS patients. The LTS patients present significant intensities of the group effects of the 16 pathways, while non-LTS patients show significantly lower values.

The weights between the pathway nodes and the hidden nodes are exhibited in Fig. 3c, and the top-10 ranked pathways among them are zoomed in Fig. 4a. It appears that a small number of pathways mainly contribute to the hidden nodes simultaneously, which implies that the cohort of the pathways may be candidates of prognostic biomarkers in long-term survival of GBM. The top-10 ranked pathways include signaling by GPCR, GPCR downstream signaling, innate immune system, adaptive immune system, metabolism of carbohydrates, transmembrane transport of small molecules, developmental biology, metabolism of proteins, class A/1 (rhodopsin-like receptors), and axon guidance. Most of the pathways are referred to as significant pathways in GBM in biological literature. The pathways and the references are listed in Table 3. Since the top-10 ranked pathways are all large (gene numbers >200), we further explored small pathways as well. Class B/2 (Secretin family receptors) pathway which includes 88 genes is ranked 14th. One of the subgroups in Class B/2 family is categorized as brain-specific angiogenesis inhibitors that are growth suppressors of glioblastoma cells [29]. Hence, Class B/2 pathway may play an important role in inhibition of GBM.

Table 3 Top-10 ranked pathways for survival prediction in GBM by PASNet

The genes of the pathways are illustrated by the weight values in Fig. 4b. Since the connections between the gene layer and the pathway layer are given by pathway databases, e.g., Reactome, they are very sparse. It also shows that multiple pathways share genes in common. The genes, which are most frequently shown in the ten pathways, include CDC42, PRKCQ, RAC1, AKT1, AKT2, AKT3, C3, CREB1, GRB2, HRAS, KRAS, NRAS, PRKACA, PRKACB, PRKACG, RAF1, and YWHAB, where CDC42, PRKCQ, and RAC1 are shown in six pathways and others are in five pathways. Among them, several genes have been reported as biomarkers in GBM. For instance, AKT1, AKT2, and AKT3, belonging to the five pathways of signaling by GPCR, GPCR downstream signaling, innate immune system, adaptive immune system, and developmental biology, are three isoforms of AKT in PI3K/AKT pathway, which is an important drug target in many cancers including GBM [30]. In particular, AKT2 is a well-known proto-oncogene that promotes the growth of tumors and reduces the survival of patients in GBM [31, 32].

Finally, we demonstrate a hierarchical representation of genes and pathways in PASNet. In Fig. 5a, PASNet is partially visualized, where positive and negative weights are colored in red and blue, respectively. The pathways are represented by the corresponding genes in the pathway layer, and then the nonlinear effects of the pathways are described in the hidden layer. The hierarchical representations can be captured in the output layer, which produces a posterior probability for prognosis prediction. Although we considered a single hidden layer to simplify the model with HDLSS data in this study, multiple hidden layers may be able to capture the biological processes and their effects more accurately if a sufficient number of samples are available. Figure 5b–c illustrates distinctive representations of LTS and non-LTS samples in PASNet. The color of the nodes in the figures shows the values averaged over the LTS/non-LTS samples. Note that node values from the pathway layer to the output layer are between zero and one. A node with a high value may be a potential prognostic biomarker in the group. Figure 5b shows that pathways including aquaporin-mediated transport, signaling by BMP, and cytokine signaling in immune system are activated with LTS samples. The second node in the hidden layer is triggered by the active pathways, and the hidden node activates the LTS node in the output layer. On the other hand, Fig. 5c shows that additional pathways of signaling by GPCR and innate immune system are also activated for non-LTS samples. The other two hidden nodes take the active pathways into account, and they activate the non-LTS node in the output layer. Hence, the two pathways of signaling by GPCR and innate immune system may be potential prognostic biomarkers for predicting LTS/non-LTS. The signaling by GPCR pathway has been investigated as a potential therapeutic target to inhibit the progression of glioblastomas [33]. Activating the innate immune system, i.e., immunotherapy, is a promising strategy for the treatment of GBM [34]. Vascular endothelial growth factor (VEGF), a modulator of the innate immune system, is reported to be crucial for tumor progression [35]. Moreover, aquaporin-mediated transport, signaling by BMP, and cytokine signaling in immune system may play an important role in GBM, since they are active in both LTS and non-LTS. Note that the activation/inactivation of a node in PASNet does not directly represent biological activation in the system; rather, it indicates different states of the biological components in the groups.

Fig. 5

Hierarchical representation of pathways in PASNet. (a) PASNet is partially visualized showing the five pathways. Distinct neural network activations between LTS (b) and non-LTS (c) are shown via PASNet. The nodes of the neural network of (b) and (c) correspond to (a). For instance, the nodes in the pathway layer of (b) and (c) represent signaling by GPCR, innate immune system, aquaporin-mediated transport, signaling by BMP, and cytokine signaling in immune system. The pathways of signaling by GPCR and innate immune system are inactive with LTS patients, whereas both pathways are active with non-LTS patients


In this paper, we proposed a Pathway-Associated Sparse Deep Neural Network (PASNet) for prognosis prediction (long-term survival in GBM in this study). PASNet builds a network model by leveraging prior biological knowledge from pathway databases and by taking hierarchical nonlinear relationships of biological processes into account. To improve model interpretability, PASNet introduces sparse coding. Moreover, we developed a training strategy to avoid overfitting with HDLSS data and to handle class imbalance.

To investigate the performance of PASNet, we used gene expression data of GBM patients in TCGA. PASNet was assessed by comparing the predictive performance with support vector machine, random LASSO, LASSO logistic Regression, and neural network with dropout that have been widely used for prognosis prediction. PASNet outperformed them with respect to both AUC and F1-score in the multiple stratified 5-fold cross-validation experiments. Furthermore, we discussed how PASNet can describe the biological system of GBM.

PASNet is the first deep neural network-based model that represents hierarchical representations of genes and pathways and their nonlinear effects, to the best of our knowledge. Additionally, PASNet would be promising due to its flexible model representation and interpretability, embodying the strengths of deep learning.


The architecture of PASNet

PASNet incorporates biological pathways and the concept of sparse modeling into a Deep Neural Network (DNN). The neural network architecture of PASNet consists of a gene layer (an input layer), a pathway layer that represents the biological pathways linked with input genes, a hidden layer that represents hierarchical relationships among biological pathways, and an output layer that corresponds to clinical outcomes, e.g., a binary class of long-term and short-term survival, or stages of cancer (see Fig. 6).

Fig. 6

Architecture of PASNet. PASNet consists of a gene layer (an input layer), a pathway layer that represents the biological pathways linked with input genes, a hidden layer that represents hierarchical relationships among biological pathways, and an output layer that corresponds to clinical outcomes, e.g., a binary class of long-term and short-term survival, or stages of cancer

In PASNet, sparse coding is applied to the connections between layers for model interpretability. Sparse coding provides a way to capture the significant components of a biological mechanism in the model, since biological processes may involve only a few biological components. Conventional fully-connected networks, on the other hand, lack the ability to represent biological mechanisms.

Gene layer

The gene layer (as an input layer) corresponds to gene expression data. A patient sample of m gene expressions is formed as a column vector, which is denoted by \(\textbf{x}=\{x_{1},x_{2},\ldots,x_{m}\}\). Each input node represents one gene feature.

Pathway layer

The pathway layer represents biological pathways, where each node indicates an individual pathway. The connections between the gene layer and the pathway layer are established by well-known pathway databases (e.g., Reactome and KEGG). Pathway databases contain associations between pathways and genes; each of which provides a set of gene components. Therefore, the pathway layer makes it possible to interpret the model as a pathway-based analysis.

To initialize the connections between the gene layer and the pathway layer, we consider a binary biadjacency matrix (A) derived from biological pathway databases. The biadjacency matrix can be defined as \(\textbf {A} \in \mathbb {B}^{n \times m}\), where n is the number of pathways and m is the number of genes. An element of A, i.e., \(a_{ij}\), is set to one if gene j belongs to pathway i, and zero otherwise. Sparse coding is applied based on the matrix A to represent the relationships between genes and pathways in the model.
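A minimal sketch of constructing the biadjacency matrix A from pathway annotations is given below. The function name `build_biadjacency` and the dictionary-based input are hypothetical, chosen only for illustration; rows follow the given pathway order (n rows) and columns follow the given gene order (m columns).

```python
from typing import Dict, List, Set


def build_biadjacency(
    pathway_names: List[str],
    gene_names: List[str],
    members: Dict[str, Set[str]],
) -> List[List[int]]:
    """Return the n x m binary matrix A with a_ij = 1 if gene j is in pathway i."""
    return [
        [1 if gene in members[pathway] else 0 for gene in gene_names]
        for pathway in pathway_names
    ]
```

For example, with pathways P1 = {A, B} and P2 = {B, C} over the genes A, B, C, the result is the 2 × 3 matrix with rows (1, 1, 0) and (0, 1, 1).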

Hidden layer

Biological components may cooperate with others instead of functioning alone. A node in the pathway layer indicates a single biological pathway, whereas a biological system involves multiple pathways that interact with one another. The associative interactions between pathways can be represented in the hidden layer. In PASNet, the hidden layer represents nonlinear biological associations between the pathways and the outputs.

Sparse coding between the pathway and the hidden layers enables one to interpret these relationships. Although we consider only a single hidden layer in this study for simplicity’s sake, multiple hidden layers can be used for deeper hierarchical representations of pathways. For example, if there are two hidden layers, the second hidden layer will represent deeper hierarchical associations of the nodes of the first hidden layer, which are association effects of pathways.

Output layer

The output layer represents clinical outcomes, for which the nodes compute the posterior probabilities. In this layer, sparse coding allows one to distinguish the hierarchical groups of pathways (which are detected from the hidden layers) that predict each clinical outcome. In PASNet, more than two clinical outcomes can be easily represented with multiple nodes in the output layer.

Consequently, PASNet can dissect distinguishable biological processes of hierarchical nonlinear relationships and associations of genes and pathways to predict clinical outcomes. Furthermore, this generative model-based approach would be often useful to predict prognosis accurately with complex data of HDLSS. When data is highly complex and only small sample sizes are available, model optimization may be easily biased to the training data rather than providing a general solution. On the other hand, the integration of the biological structures and prior knowledge to the model would produce a robust solution.

Overall description of PASNet training

The main challenge in training PASNet is to reduce both the risk of overfitting and the computational complexity of training on HDLSS data. Related works that have handled the HDLSS data problem are discussed in the “Related works in deep learning” section. To address these problems, PASNet optimizes a small sub-network, which involves a feasible number of nodes and parameters to train instead of the whole network, and then makes the sub-network sparse. Figure 7 illustrates the overall training flow of PASNet.

Fig. 7

Training of PASNet. (a) Weights and biases are randomly initialized. Connections between the gene layer and the pathway layer are determined by biological pathway databases, and the remaining layers are considered as fully-connected in this step. (b) A sub-network is randomly selected using a dropout technique and trained. (c) Sparse coding optimizes the sparsity of connections in the sub-network

First, we initialize the connections between the gene layer and the pathway layer with prior biological knowledge of pathways (see Fig. 7a). Active and inactive connections are determined by the biadjacency matrix, A. The weights of active connections and the biases are randomly initialized from a standard normal distribution, while the weights of inactive connections are set to zero. The sparsity of the connections between the gene layer and the pathway layer is invariant over the entire training. The remaining layers are initially fully connected.
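The initialization of the gene-to-pathway connections can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the orientation of A (rows as pathways, columns as genes), and the toy matrix are all assumptions.

```python
import random

def init_gene_pathway_layer(A, seed=0):
    """Initialize weights and biases between the gene and pathway layers.

    A is a 0/1 biadjacency matrix; here rows are pathways and columns are
    genes (an assumed orientation). Active connections get standard normal
    weights, inactive connections are fixed at zero.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) if a else 0.0 for a in row] for row in A]
    b = [rng.gauss(0.0, 1.0) for _ in A]  # one bias per pathway node
    return W, b

# Toy example: 3 pathways, 4 genes (hypothetical memberships)
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
W, b = init_gene_pathway_layer(A)
```

Because A also serves as the mask of this layer during training, the zero entries stay zero and the sparsity pattern never changes.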

In the training phase, we repeatedly train sub-networks and apply sparse coding to them until convergence (Fig. 7b–c). A sub-network is selected by a dropout technique, where neurons are randomly dropped in the intermediate layers. In Fig. 7b, a small sub-network is shown with bold solid circles and lines. Then, the small sub-network is trained by feed-forward and backpropagation. Note that only the weights and biases of the sub-network are trained. Upon completion of the sub-network’s training, sparse coding is applied to the sub-network by trimming the connections that do not help to minimize the loss or that increase it. In Fig. 7c, the dropped connections and nodes are marked with bold, dashed lines. The details of the training are elucidated in the following sections.
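The sub-network selection step can be pictured with the short sketch below, which samples the nodes to keep in each intermediate layer. The layer widths, the keep probabilities, and the rule of always retaining at least one node are assumptions made for illustration.

```python
import random

def sample_subnetwork(layer_sizes, keep_prob, rng):
    """Pick the node indices that survive dropout in each intermediate layer."""
    kept = []
    for size, p in zip(layer_sizes, keep_prob):
        idx = [j for j in range(size) if rng.random() < p]
        if not idx:
            # Assumed safeguard: keep one node so the layer is not cut off
            idx = [rng.randrange(size)]
        kept.append(idx)
    return kept

rng = random.Random(0)
layer_sizes = [8, 4]    # hypothetical pathway and hidden layer widths
keep_prob = [0.5, 0.5]  # hypothetical keep probabilities
kept = sample_subnetwork(layer_sizes, keep_prob, rng)
print(kept)
```

Only the weights and biases attached to the kept nodes would then be updated and later pruned in that iteration.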

Sparse coding

Once a small sub-network has been trained with the HDLSS data, sparsity is imposed on the sub-network for model interpretation. The sparsity of the sub-network is determined by the mask matrix M on each layer as:

$$ \textbf{h}^{(\ell +1)} = a\left(\left(\textbf{W}^{(\ell)}\star\textbf{M}^{(\ell)}\right)\textbf{h}^{(\ell)}+\textbf{b}^{(\ell)}\right), $$

where \(\star\) denotes element-wise multiplication, and a(·) is an activation function. \(\textbf{h}^{(\ell)}\) denotes the output vector of the \(\ell\)-th layer, and \(\textbf{W}^{(\ell)}\) and \(\textbf{b}^{(\ell)}\) are a weight matrix and a bias vector, respectively. Each element of M is either one or zero, which determines whether the associated weight is dropped in the current epoch.
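A small worked instance of the masked forward pass may help. The sketch assumes tanh as the default activation (the text leaves a(·) generic) and uses made-up numbers; the example call passes an identity activation so the arithmetic is easy to follow by hand.

```python
import math

def masked_forward(W, M, h, b, activation=math.tanh):
    """Compute a((W * M) h + b), where * is element-wise multiplication."""
    out = []
    for w_row, m_row, bias in zip(W, M, b):
        z = sum(w * m * x for w, m, x in zip(w_row, m_row, h)) + bias
        out.append(activation(z))
    return out

# Toy layer with 3 inputs and 2 outputs (illustrative values)
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
M = [[1, 0, 1],
     [1, 1, 0]]
h = [1.0, 2.0, 3.0]
b = [0.0, 0.0]

# Identity activation: first node 0.5*1 + 2.0*3 = 6.5, second node 1.5*1 = 1.5
linear_out = masked_forward(W, M, h, b, activation=lambda z: z)
print(linear_out)
```

The masked entries contribute nothing to the sum, which is exactly what dropping a connection means.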

The mask matrix M is generated with respect to a sparsity level (S) that indicates the proportion of weights to be dropped in a single layer. S is a value between 0 and 100, where 0 creates a fully-connected layer and 100 removes all connections. The optimal S is approximated on each layer individually in the sub-network, whereas most related methods use a single hyper-parameter for the sparsity of all layers [36, 37]. Setting the sparsity individually on each layer reflects different levels of biological associations among the genes and pathways.

We obtain the optimal sparsity level S that minimizes the cost score. For efficient computation, the cost scores are computed with a small number of finite sparsity levels. Then, the optimal sparsity level is estimated by applying a cubic-spline interpolation to the cost scores with the assumption that the cost function, with respect to the sparsity level, is continuous.
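The interpolation step can be sketched with a natural cubic spline written in plain Python. The natural boundary condition, the candidate sparsity levels, the cost values, and the search grid are all assumptions for illustration; the text only states that a cubic spline is fitted to a few cost scores.

```python
import bisect

def natural_cubic_spline(xs, ys):
    """Return a function interpolating (xs, ys) with a natural cubic spline."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; zero at both ends (natural)
    if n >= 2:
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        # Forward elimination of the tridiagonal system
        for k in range(1, n - 1):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        # Back substitution
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def spline(t):
        i = min(max(bisect.bisect_right(xs, t) - 1, 0), n - 1)
        hi, a, c = h[i], xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * c ** 3) / (6.0 * hi)
                + (ys[i] / hi - m[i] * hi / 6.0) * a
                + (ys[i + 1] / hi - m[i + 1] * hi / 6.0) * c)

    return spline

# Hypothetical cost scores evaluated at a few sparsity levels
levels = [0, 20, 40, 60, 80, 100]
costs = [0.70, 0.62, 0.55, 0.53, 0.60, 0.90]
spline = natural_cubic_spline(levels, costs)

# Search a fine grid of sparsity levels for the lowest interpolated cost
grid = [s / 10.0 for s in range(0, 1001)]
best_s = min(grid, key=spline)
print(best_s, spline(best_s))
```

Evaluating the cost at six levels and interpolating is far cheaper than retraining at every candidate level, which is the efficiency argument made above.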

In particular, an element of M is set to one if the absolute value of the corresponding weight is greater than or equal to the threshold Q; otherwise, the element is zero, where Q is the S-th percentile of the absolute values of W. Note that the mask between the gene layer and the pathway layer, i.e. \(\textbf{M}^{(0)}\), is determined by the biadjacency matrix A of biological pathways. Thus, the mask matrices are formulated as

$$ \textbf{M}^{(\ell)}= \begin{cases} \mathbbm{1}\left(\lvert \textbf{W}^{(\ell)} \rvert \geq Q^{(\ell)}\right), & \text{if}\, \ell \neq 0 \\ \textbf{A}, & \text{if}\, \ell = 0 \end{cases} $$

where \(Q^{(\ell)}\) is the S-th percentile of \(\lvert \textbf{W}^{(\ell)} \rvert\) if \(\ell \neq 0\).
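The mask construction for a layer with \(\ell \neq 0\) can be sketched as below. The percentile is computed with linear interpolation between order statistics, which is one common convention and an assumption here, since the text does not say which percentile definition is used. The weights are toy values.

```python
def percentile(values, s):
    """S-th percentile (0-100) using linear interpolation between ranks."""
    v = sorted(values)
    if len(v) == 1:
        return v[0]
    pos = (len(v) - 1) * s / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)

def sparsity_mask(W, s):
    """Mask entry is 1 where |w| >= Q, with Q the S-th percentile of |W|."""
    q = percentile([abs(w) for row in W for w in row], s)
    return [[1 if abs(w) >= q else 0 for w in row] for row in W]

W = [[0.1, -0.4],
     [0.9, -0.2]]
mask_50 = sparsity_mask(W, 50)  # threshold falls between 0.2 and 0.4
print(mask_50)
```

With the greater-than-or-equal comparison, the largest weight still passes the threshold at S = 100 under this percentile convention, so an implementation that needs a fully disconnected layer at S = 100 would have to treat that case separately.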

Cost-sensitive learning for imbalanced data

We refine the cost function and the backpropagation for cost-sensitive learning, since imbalanced data causes bias of the predictions towards the majority class. We adapt the Mean False Error (MFE) method [38], which penalizes the errors of the majority class.

Let K be the number of clinical outcomes. The normalized cost is computed separately for each class by:

$$ \mathcal{L} = \sum\limits_{k=1}^{K} \mathcal{C}_{k} + \frac{1}{2}\lambda\lVert \textbf{W}\rVert_{2}, $$
$$ \mathcal{C}_{k} = \frac{1}{n_{k}}\sum\limits_{i=1}^{n_{k}}c\left(\textbf{y}_{i}, \tilde{\textbf{y}}_{i}\right), $$

where \(\mathcal {C}_{k}\) denotes the mean error on class k, and \(n_{k}\) is the number of samples in class k. \(\textbf{y}_{i}\) is the vectorized ground truth class label of the i-th sample, and \(\tilde {\textbf {y}}_{i}\) is its vectorized prediction. c(·) denotes a cost function (e.g., cross-entropy loss), and \(\mathcal {L}\) is the total cost. \(\lVert \textbf{W}\rVert_{2}\) denotes the L2-norm of W, and λ>0 is a regularization hyperparameter.
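The class-wise averaging can be made concrete with a short sketch. It computes only the data term, the sum of the per-class mean errors, and leaves the regularization penalty out; cross-entropy is used as c(·) following the example given in the text, and the labels and predictions are invented.

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))

def mfe_data_cost(Y, Y_hat):
    """Sum over classes of the mean per-sample error within each class."""
    total = 0.0
    for k in range(len(Y[0])):
        errs = [cross_entropy(y, p) for y, p in zip(Y, Y_hat) if y[k] == 1]
        if errs:  # skip classes that have no samples in this batch
            total += sum(errs) / len(errs)
    return total

# Imbalanced toy batch: three samples of class 0, one sample of class 1
Y = [[1, 0], [1, 0], [1, 0], [0, 1]]
Y_hat = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
cost = mfe_data_cost(Y, Y_hat)
print(cost)
```

In a plain average over all four samples, the single minority sample would carry a quarter of the weight; here its class contributes one full term, the same as the majority class.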

In the backpropagation phase, the gradient is also computed separately for each class. Hence, the weights and biases on the \(\ell\)-th layer are updated by:

$$\begin{array}{@{}rcl@{}} \textbf{W}^{(\ell)} &\leftarrow& (1-\eta\lambda)\textbf{W}^{(\ell)} - \eta\sum\limits_{k=1}^{K} \frac{\partial \mathcal{C}_{k}}{\partial \textbf{W}^{(\ell)}}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \textbf{b}^{(\ell)} &\leftarrow& \textbf{b}^{(\ell)} - \eta\sum\limits_{k=1}^{K} \frac{\partial \mathcal{C}_{k}}{\partial \textbf{b}^{(\ell)}}, \end{array} $$

where η is a learning rate. The algorithm of PASNet is briefly described in Algorithm 1.
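The weight update can be checked against the cost with a short derivation. This is a sketch under one assumption that the text does not state: the penalty is read as the squared L2-norm, \(\frac{1}{2}\lambda\lVert \textbf{W}\rVert_{2}^{2}\), which is the reading that produces the decay factor \((1-\eta\lambda)\).

$$ \frac{\partial \mathcal{L}}{\partial \textbf{W}^{(\ell)}} = \sum\limits_{k=1}^{K} \frac{\partial \mathcal{C}_{k}}{\partial \textbf{W}^{(\ell)}} + \lambda\textbf{W}^{(\ell)}, $$
$$ \textbf{W}^{(\ell)} - \eta\frac{\partial \mathcal{L}}{\partial \textbf{W}^{(\ell)}} = (1-\eta\lambda)\textbf{W}^{(\ell)} - \eta\sum\limits_{k=1}^{K} \frac{\partial \mathcal{C}_{k}}{\partial \textbf{W}^{(\ell)}}. $$

The bias update carries no decay term because the penalty does not involve \(\textbf{b}^{(\ell)}\).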

Related works in deep learning

In recent years, deep learning has been one of the most active research fields in machine learning communities such as image analysis, speech recognition, and natural language processing, and its potential is being actively discussed in bioinformatics and biomedicine [39]. Most deep learning-based approaches in bioinformatics have been developed for classification and association studies. For instance, D-GEX infers the expression of target genes from landmark genes, capturing the nonlinear relationships by combining gene expression, DNA methylation, and miRNA expression data [40]. A convolutional neural network (CNN) was adapted to predict DNA-protein binding sites with Chromatin Immunoprecipitation sequencing (ChIP-seq) data [41]. Additionally, the CNN-based DeepBind was proposed to predict whether a specific DNA/RNA binding protein will bind to a specific DNA sequence [42]. DeepSEA employs a CNN model to predict the functionality of non-coding variants [43].

Although only a small subset of deep learning research has been reported in bioinformatics due to the difficulty of structure definition and interpretation, the future of deep learning in biology and medicine is promising [44]. First, since a neural network is inspired by the neurons in the human brain, a neural network architecture is applicable to modeling a mechanism for a complex biological system. Specifically, deep learning approaches take advantage of a flexible representation of hierarchical structures from inputs to outputs. The representation of nonlinear effects of neurons in multiple layers of a neural network may be able to model hierarchical biological signals. DCell constructs a multi-layer neural network based on extensive prior biological knowledge to simulate the growth of a eukaryotic cell [45]. However, DCell’s network architecture is entirely based on well-known prior biological knowledge, so the model was applied to the relatively simple biological system of yeast. Moreover, deep learning captures nonlinear effects of variables with high-level feature representation, which allows deep learning to outperform other state-of-the-art methods.

However, training deep neural networks with HDLSS data poses a computational problem. Deep neural networks involve a large number of parameters, which often makes the training infeasible or causes the model to overfit on HDLSS data. In particular, backpropagation gradients in neural networks have high variance on HDLSS data, which consequently causes the model to overfit [46]. To tackle the HDLSS problem, a leave-one-out approach was used to avoid overfitting in backpropagation [47]. In that approach, the risk of overfitting is examined on validation data, and the training is terminated early when overfitting occurs. As an alternative solution, an attempt has been made to reduce the dimensionality of the input space to a feasible size [48]. Dimension reduction techniques, such as subsampled randomized Hadamard transform (SRHT) and Count Sketch-based construction, were used to reduce the dimension of the input data. Then, the data projected into the lower-dimensional space were fed to a neural network for training.
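To give a feel for the random-projection idea, here is a generic Count Sketch projection in plain Python. It is a textbook-style sketch, not the construction used in [48]: the bucket and sign assignments are drawn from a seeded random generator rather than from hash functions, and the dimensions and input vector are made up.

```python
import random

def make_count_sketch(d, k, seed=0):
    """Build a projection from d input features to k buckets.

    Each feature is assigned one bucket and one random sign; the projected
    value of a bucket is the signed sum of the features assigned to it.
    """
    rng = random.Random(seed)
    bucket = [rng.randrange(k) for _ in range(d)]
    sign = [rng.choice((-1.0, 1.0)) for _ in range(d)]

    def project(x):
        out = [0.0] * k
        for j, value in enumerate(x):
            out[bucket[j]] += sign[j] * value
        return out

    return project

project = make_count_sketch(d=6, k=3)
x = [1.0, 0.0, 2.0, 0.0, 0.0, -1.0]  # hypothetical 6-dimensional sample
z = project(x)
print(z)
```

The projection is linear and touches each input feature once, which keeps the cost of reducing a very wide input modest.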

For HDLSS data, feature selection is one of the conventional approaches. Deep Feature Selection (DFS) was developed to select a discriminative feature subset in a deep learning model [49]. Although DFS is not the optimal solution to low-sample size data, DFS shows that deep learning can detect informative and discriminative features of nonlinearity effects through multiple layers with high-dimensional data. Then, Deep Neural Pursuit (DNP) improved the solution of the feature selection in deep learning, taking the HDLSS data problem into account [46]. DNP iteratively augments features in the input layer by performing multiple dropouts. The multiple dropouts grant the ability to train a small-sized sub-network at a time and to compute gradients with low variance for alleviating the overfitting problem.



Abbreviations

Adam: Adaptive moment estimation

AUC: Area under the curve

ChIP-seq: Chromatin Immunoprecipitation sequencing

CNN: Convolutional neural network

DFS: Deep feature selection

DNN: Deep neural network

DNP: Deep neural pursuit

Dropout NN: Neural network with dropout

GBM: Glioblastoma multiforme

HDLSS: High-dimension, low-sample size

KPCA: Kernel principal component analysis

LLR: LASSO logistic regression

MFE: Mean false error

MSigDB: Molecular Signatures Database

PASNet: Pathway-associated sparse deep neural network

RBF: Radial basis function

RBM: Restricted Boltzmann machine

ROC: Receiver operating characteristic

SRHT: Subsampled randomized Hadamard transform

SVM: Support vector machine

TCGA: The Cancer Genome Atlas


  1. Lu J, Cowperthwaite MC, Burnett MG, Shpak M. Molecular Predictors of Long-Term Survival in Glioblastoma Multiforme Patients. PloS ONE. 2016; 11(4):0154313.

  2. Onaitis MW, et al. Prediction of Long-Term Survival After Lung Cancer Surgery for Elderly Patients in The Society of Thoracic Surgeons General Thoracic Surgery Database. Ann Thorac Surg. 2018; 105(1):309–16.

  3. Cao Y, et al. Prediction of long-term survival rates in patients undergoing curative resection for solitary hepatocellular carcinoma. Oncol Letters. 2018; 15(2):2574–82.

  4. Jin L, et al. Pathway-based Analysis Tools for Complex Diseases: A Review. Genomics Proteomics Bioinforma. 2014; 12(5):210–20.

  5. Kim S, Kon M, DeLisi C. Pathway-based classification of cancer subtypes. Biol Direct. 2012; 7:21.

  6. Cirillo E, Parnell LD, Evelo CT. A review of pathway-based analysis tools that visualize genetic variants. Front Genet. 2017; 8(174):174.

  7. Drier Y, Sheffer M, Domany E. Pathway-based personalized analysis of cancer. Proc Natl Acad Sci U S A. 2013; 110(16):6388–93.

  8. Mallavarapu T, Kim Y, Oh JH, Kang M. R-pathcluster: Identifying cancer subtype of glioblastoma multiforme using pathway-based restricted boltzmann machine. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).2017. p. 1183–8.

  9. Huang S, et al. Novel personalized pathway-based metabolomics models reveal key metabolic pathways for breast cancer diagnosis. Genome Med. 2016; 8(1):34.

  10. Li Y, Nan B, Zhu J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics. 2015; 71(2):354–63.

  11. Raser JM, O’Shea EK. Noise in Gene Expression: Origins, Consequences, and Control. Science. 2005; 309(5743):2010–3.

  12. Steyerberg EW, Eijkemans MJC, Habbema JDF. Application of Shrinkage Techniques in Logistic Regression Analysis: A Case Study. Statistica Neerlandica. 2001; 55(1):76–88.

  13. Wang S, Nan B, Rosset S, Zhu J. Random lasso. Ann Appl Stat. 2011; 5(1):468–85.

  14. Musoro JZ, Zwinderman AH, Puhan MA, Ter Riet G, Geskus RB. Validation of prediction models based on lasso regression with multiply imputed data. BMC Med Res Methodol. 2014;14(1).

  15. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics. 2007; 63(4):1079–88.

  16. Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics. 2008;9.

  17. Bach FR, Lanckriet GRG, Jordan MI. Multiple kernel learning, conic duality, and the SMO algorithm. In: Twenty-first International Conference on Machine Learning - ICML ’04. 2004. p. 6.

  18. Sinnott JA, Cai T. Pathway aggregation for survival prediction via multiple kernel learning. Stat Med. 2018;0(0).

  19. Kumari S, et al. Bottom-up GGM algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways or processes. BMC Bioinformatics. 2016;17(1).

  20. Deng W, Zhang K, Busov V, Wei H. Recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways. PLoS ONE. 2017;12(2).

  21. Pham LM, Carvalho L, Schaus S, Kolaczyk ED. Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian Hierarchical Approach. J Am Stat Assoc. 2016; 111(513):73–92.

  22. Kher S, Peng J, Wurtele ES, Dickerson J. In: Pérez-Sánchez H, (ed).Hierarchical Biological Pathway Data Integration and Mining, Bioinformatics: IntechOpen; 2012. Available from:

  23. Hanif F, Muzaffar K, Perveen K, Malhi SM, Simjee SU. Glioblastoma Multiforme: A Review of its Epidemiology and Pathogenesis through Clinical Presentation and Treatment. Asian Pac J Cancer Prev. 2017; 18(1):3–9.

  24. Davis ME. Glioblastoma: Overview of Disease and Treatment. Clin J Oncol Nurs. 2016; 20(5):1–14.

  25. Walid MS. Prognostic factors for long-term survival after glioblastoma. Permanente J. 2008; 12(4):45–8.

  26. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov J, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015; 1(6):417–25.

  27. Kingma DP, Ba J. Adam: A method for stochastic optimization. CoRR. 2014;abs/1412.6980.

  28. Hsu C-W, Chang C-C, Lin C-J. A Practical Guide to Support Vector Classification. Available from: Accessed 15 June 2008.

  29. Harmar AJ. Family-B G-protein-coupled receptors. Genome Biol. 2001; 2(12):3013–1301310.

  30. Joy A, et al. The role of AKT isoforms in glioblastoma: AKT3 delays tumor progression. J Neuro-Oncol. 2016; 130(1):43–52.

  31. Hu B, et al. Astrocyte elevated gene-1 interacts with Akt isoform 2 to control glioma growth, survival, and pathogenesis. Cancer Res. 2014; 74(24):7321–32.

  32. Hinske LC, et al. Intronic mirna-641 controls its host gene’s pathway pi3k/akt and this relationship is dysfunctional in glioblastoma multiforme. Biochem Biophys Res Commun. 2017; 489(4):477–83.

  33. Cherry AE, Stella N. G protein-coupled receptors as oncogenic signals in glioma: Emerging therapeutic avenues. Neuroscience. 2014; 278(1):222–36.

  34. Lim M, Xia Y, Bettegowda C, Weller M. Current state of immunotherapy for glioblastoma. Nat Rev Clin Oncol. 2018; 15(7):422–42.

  35. Turkowski K, et al. VEGF as a modulator of the innate immune response in glioblastoma. GLIA. 2018; 66(1):161–74.

  36. Han S, et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks. Int Conf Learn Represent. 2017.

  37. Wang B, Klabjan D. Regularization for Unsupervised Deep Neural Nets. CoRR. 2016; 1:1–7.

  38. Wang S, Liu W, Wu J, Cao L, Meng Q, Kennedy PJ. Training deep neural networks on imbalanced data sets. 2016 Int Jt Conf Neural Netw. 2016:4368–74.

  39. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform. 2017; 18(5):851–69.

  40. Liang M, Li Z, Chen T, Zeng J. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach. IEEE/ACM Trans Comput Biol Bioinforma. 2015; 12(4):928–37.

  41. Zeng H, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016; 32(12):121–7.

  42. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015; 33(8):831–8.

  43. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015; 12(10):931–4.

  44. Ching T, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141).

  45. Ma J, Yu MK, Fong S, Ono K, Sage E, Demchak B, Sharan R, Ideker T. Using deep learning to model the hierarchical structure and function of a cell. Nat Methods. 2018; 15(4):290–8.

  46. Liu B, Wei Y, Zhang Y, Yang Q. Deep Neural Networks for High Dimension, Low Sample Size Data. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17: 2017. p. 2287–93.

  47. Pasini A. Artificial neural networks for small dataset analysis. J Thorac Dis. 2015; 7(5):953–60.

  48. Wójcik PI, Kurdziel M. Training neural networks on high-dimensional data using random projection. Pattern Anal Applic. 2018.

  49. Li Y, Chen C-Y, Wasserman WW. Deep feature selection: Theory and application to identify enhancers and promoters. J Comput Biol. 2016; 23(5):322–36. PMID: 26799292.

  50. Zhang J, Feng H, Xu S, Feng P. Hijacking GPCRs by viral pathogens and tumor. 2016.

  51. Feng L, et al. Heterogeneity of tumor-infiltrating lymphocytes ascribed to local immune status rather than neoantigens by multi-omics analysis of glioblastoma multiforme. Sci Reports. 2017;1(7).

  52. Zhou C, et al. Analysis of the gene-protein interaction network in glioma. Genet Mol Res. 2015; 14(4):14196–206.

  53. Choi HY, et al. G protein-coupled receptors in stem cell maintenance and somatic reprogramming to pluripotent or cancer stem cells. BMB Rep. 2015; 48(2):68–80.

  54. Chédotal A, Kerjan G, Moreau-Fauvarque C. The brain within the tumor: New roles for axon guidance molecules in cancers. 2005.



We would like to thank Dr. Jung Hun Oh for his help and advice in this study.


Not applicable.

Availability of data and materials

The datasets are publicly available and accessible at

Author information

Authors and Affiliations



MK designed and supervised the research project. JH developed and implemented the algorithms, and performed the data analyses. JH and MK wrote the manuscript. YK helped the implementation and performed the experiments. TK discussed and verified the biological interpretation of PASNet. All authors have read and approved the final version of the manuscript.

Corresponding author

Correspondence to Mingon Kang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Hao, J., Kim, Y., Kim, TK. et al. PASNet: pathway-associated sparse deep neural network for prognosis prediction from high-throughput data. BMC Bioinformatics 19, 510 (2018).
