Gene expression and clinical datasets
The gene expression matrices and clinical data of cancer patients were downloaded from TCGA [21]. Individuals with missing clinical or gene expression data were excluded. Long-term survival (LTS) was defined as survival > \(k\) years after diagnosis, whereas non-LTS was defined as survival ≤ \(k\) years (\(k \in \{1, 3, 5, 10\}\)). The gene expression matrix is denoted as \(M \in \mathbb{R}^{n \times r}\), where \(n\) and \(r\) indicate the number of samples and genes, respectively, and each value in the matrix is an FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) value that quantifies gene expression.
After data preprocessing, we selected four types of cancer (lung adenocarcinoma, LUAD, N = 515; kidney renal clear cell carcinoma, KIRC, N = 530; low grade glioma, LGG, N = 511; skin cutaneous melanoma, SKCM, N = 468). We set the survival threshold to k = 3 years to validate our method, as this threshold yielded a larger sample size and better class balance. After removing individuals with missing clinical information, as well as those who were still alive but whose last follow-up was ≤ 3 years, the LTS and non-LTS groups contained 132 and 137, 232 and 110, 155 and 75, and 290 and 111 cases for LUAD, SKCM, LGG and KIRC, respectively. Age and tumor stage were retained as clinical features through feature selection.
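To make the labelling rule concrete, the following is a minimal sketch of how LTS/non-LTS labels and exclusions could be derived from TCGA-style clinical fields; the column names (`vital_status`, `days_to_death`, `days_to_last_follow_up`) are standard TCGA fields, but their use here is an assumption about the exact preprocessing, not the published code.

```python
# A minimal sketch of the LTS labelling rule, assuming a pandas DataFrame
# with TCGA-style clinical columns (an assumption).
import numpy as np
import pandas as pd

def label_lts(clinical: pd.DataFrame, k: int = 3) -> pd.Series:
    """Return 1 (LTS), 0 (non-LTS) or NaN (excluded) per patient."""
    days = k * 365.25
    dead = clinical["vital_status"].eq("Dead")
    labels = pd.Series(np.nan, index=clinical.index)
    # Deceased patients: LTS if they survived longer than k years.
    labels[dead & (clinical["days_to_death"] > days)] = 1
    labels[dead & ~(clinical["days_to_death"] > days)] = 0
    # Alive patients: LTS only if follow-up already exceeds k years;
    # alive patients with follow-up <= k years stay NaN and are excluded.
    labels[~dead & (clinical["days_to_last_follow_up"] > days)] = 1
    return labels
```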
Graph construction with gene expression and pathway network data
Typically, a graph is defined as \(\mathcal{G} = (v, e)\), where \(v\) is a set of nodes and \(e\) is the set of edges between them. We proposed to construct pathway graphs with genes as nodes \(v\), metabolite- and protein-mediated gene interactions [22] as edges \(e\), and gene expression values as node features \(x\). To construct the graphs, pathway and gene network information was retrieved from the Reactome database [23] using graphite [22], an R package that automatically downloads pathway information and converts pathways into graphs. We then removed pathways with fewer than 15 or more than 400 genes, as well as duplicated pathways. A sketch of the per-pathway graph construction is given below.
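As an illustration, a per-patient pathway graph could be assembled as follows, assuming the pathway's gene and edge lists have already been exported from graphite; all variable names are illustrative.

```python
# A sketch of turning one pathway into a PyTorch Geometric graph for one
# patient; gene/edge lists are assumed to be pre-exported from graphite.
import torch
from torch_geometric.data import Data

def build_pathway_graph(genes, edges, expression):
    """genes: list of gene symbols in the pathway.
    edges: list of (gene, gene) interaction pairs.
    expression: dict mapping gene symbol -> FPKM value for one patient."""
    idx = {g: i for i, g in enumerate(genes)}
    # Node features: the patient's FPKM value per gene (1-dim feature).
    x = torch.tensor([[expression.get(g, 0.0)] for g in genes],
                     dtype=torch.float)
    # Edge index in COO format; interactions are made undirected here.
    pairs = [(idx[a], idx[b]) for a, b in edges if a in idx and b in idx]
    pairs += [(b, a) for a, b in pairs]
    edge_index = torch.tensor(pairs, dtype=torch.long).t().contiguous()
    return Data(x=x, edge_index=edge_index)
```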
In total, 2390 pathways were downloaded from the Reactome database, of which 855 remained for subsequent analysis (Additional file 1). These 855 pathways involved 8611 unique genes, with a median of 50 genes per pathway (Fig. 1A). We also counted how genes were shared among pathways: 1613 genes belonged to a single pathway, 50% of genes were shared by fewer than 3 pathways, 95% were shared by fewer than 22 pathways, and the maximum number of pathways sharing a gene was 275 (Fig. 1B).
PathGNN architecture
An overview of the proposed PathGNN model is shown in Fig. 2A. We considered the clinical outcome to be associated with both pathway abnormalities and clinical features. Feature selection was performed on the clinical features using statistical tests (t-test or chi-square test). PathGNN consists of two components, Subnetwork1 and Subnetwork2: Subnetwork1 captures features hidden in pathways, and Subnetwork2 classifies patients into LTS and non-LTS.
The architecture of Subnetwork1 comprises three blocks (BLOCK-1/2/3), a readout layer after each block (Readout), graph normalization layers between blocks (Graph Normalization), and multilayer perceptrons (MLP) (Fig. 2B). Each block learns a feature representation of the pathway and includes a graph convolution layer and a graph pooling layer.
In more detail, all pathway graphs (\(G\)) are input to Subnetwork1, and each graph (\(g_i\), where \(i\) denotes the \(i\)-th graph) is passed through multiple graph neural network blocks (\(N\)). In each block, three layers are applied to the graph, as follows: (1) feature representation learning is first performed on the pathway graph using GraphSAGE [24]. GraphSAGE generates node embeddings by sampling and aggregating feature information from local neighborhoods, which makes it well suited to inductive learning, since each patient contributes hundreds of graphs. (2) Because the number of genes varies greatly across pathway graphs, a hierarchical representation is needed to make the model adaptive to different numbers of nodes; the graph pooling layer is therefore implemented with the SAGPool algorithm [25] to yield hierarchical representations. (3) The readout layer computes a feature vector for the whole graph using the Set2Set algorithm [26], which is based on iterative content-based attention. A minimal sketch of one block is given below.
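The following sketch assembles one such block from the PyTorch Geometric operators named above; the hidden width, pooling ratio, and number of Set2Set processing steps are assumptions, not the published hyperparameters.

```python
# A minimal sketch of one Subnetwork1 block: GraphSAGE convolution,
# SAGPool hierarchical pooling, Set2Set readout, graph normalization.
import torch
from torch_geometric.nn import SAGEConv, SAGPooling, Set2Set, GraphNorm

class Block(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, ratio: float = 0.5):
        super().__init__()
        self.conv = SAGEConv(in_dim, hidden)      # GraphSAGE node embeddings
        self.pool = SAGPooling(hidden, ratio)     # self-attention graph pooling
        self.readout = Set2Set(hidden, processing_steps=3)  # attention readout
        self.norm = GraphNorm(hidden)             # normalization between blocks

    def forward(self, x, edge_index, batch):
        x = torch.relu(self.conv(x, edge_index))
        x, edge_index, _, batch, _, _ = self.pool(x, edge_index, batch=batch)
        h = self.readout(x, batch)                # graph-level vector (2*hidden)
        x = self.norm(x, batch)
        return x, edge_index, batch, h            # h feeds the Readout output H
```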
A graph normalization layer follows each block, which has been demonstrated to be effective for GNNs with multiple layers [27]. Finally, let \(H = [h_1, h_2, h_3]\) be the input of the MLP, where \(h_1, h_2, h_3\) are the outputs of the readout layers. An MLP module with two neural network layers is used to calculate \(S_i\), as follows:
$$\mathrm{hidden} = \sigma\left(W_1 \cdot H + b_1\right)$$
$$S_i = \sigma\left(W_2 \cdot \mathrm{hidden} + c\right)$$
where \(\sigma\) is the \(\tanh\) activation function, \(W_1\) and \(W_2\) are weight matrices, \(b_1\) and \(c\) are bias terms, and \(S_i\) represents the comprehensive features of the \(i\)-th pathway extracted by Subnetwork1.
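In code, this two-layer MLP head could look as follows; the hidden width is an assumption.

```python
# A sketch of the two-layer MLP mapping concatenated readouts H = [h1, h2, h3]
# to the pathway feature S_i; the hidden width of 32 is an assumption.
import torch

class PathwayMLP(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 32, out_dim: int = 1):
        super().__init__()
        self.fc1 = torch.nn.Linear(in_dim, hidden)   # W1, b1
        self.fc2 = torch.nn.Linear(hidden, out_dim)  # W2, c

    def forward(self, h1, h2, h3):
        H = torch.cat([h1, h2, h3], dim=-1)   # concatenate readout outputs
        hidden = torch.tanh(self.fc1(H))      # sigma = tanh
        return torch.tanh(self.fc2(hidden))   # S_i
```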
The representations (\(S\)) output by Subnetwork1 encode comprehensive features, including topological and gene-set information in pathways. We then designed Subnetwork2 to predict clinical outcome, with \(S\) and the clinical features concatenated as input (Fig. 2A). More specifically, Subnetwork2 is an MLP consisting of an input layer, two hidden layers with 128 and 32 nodes, respectively, a dropout layer with a dropout rate of 40%, and an output layer. Each hidden layer is immediately followed by a rectified linear unit (ReLU) activation function. Finally, we used the cross-entropy loss and fivefold cross-validation (CV) to train the model. A sketch of this component follows.
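The sketch below follows the stated specification (128- and 32-node hidden layers, ReLU, 40% dropout, two-class output); the exact placement of the dropout layer is an assumption.

```python
# A sketch of Subnetwork2: pathway features S and clinical features are
# concatenated and classified into LTS vs. non-LTS.
import torch

class Subnetwork2(torch.nn.Module):
    def __init__(self, n_pathways: int, n_clinical: int):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_pathways + n_clinical, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 32),
            torch.nn.ReLU(),
            torch.nn.Dropout(p=0.4),     # 40% dropout; placement assumed
            torch.nn.Linear(32, 2),      # LTS vs. non-LTS logits
        )

    def forward(self, s, clinical):
        # Concatenate pathway features S with selected clinical features.
        return self.net(torch.cat([s, clinical], dim=-1))
```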
Training the PathGNN model
The models were built with PyTorch (version 1.8) [28] and PyTorch Geometric (version 2.0.3) [29]. We trained the model for 250 epochs with a batch size of 60. Adam was used to optimize the model, with a learning rate of 0.001 for the first 150 epochs and 0.0005 for the last 100 epochs. Model performance was evaluated with the area under the curve (AUC) metric. Models were trained on an NVIDIA Tesla V100 GPU with 32 GB of memory.
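The stated schedule could be implemented as in the following sketch; `model` and `train_loader` are placeholders for the PathGNN model and the data loader.

```python
# A sketch of the training schedule described above: Adam, 250 epochs,
# batch size 60, lr 0.001 for epochs 0-149 and 0.0005 thereafter.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(250):
    if epoch == 150:                       # drop the learning rate to 0.0005
        for group in optimizer.param_groups:
            group["lr"] = 5e-4
    for graphs, clinical, y in train_loader:
        optimizer.zero_grad()
        logits = model(graphs, clinical)   # Subnetwork1 + Subnetwork2
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
```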
Ablation study
To understand the contribution of each component to the overall system, an ablation study was conducted on LUAD by removing or replacing individual modules. Specifically, we first replaced GraphSAGE with a graph attention network (GAT) or a graph convolutional network (GCN) to investigate the impact of different graph convolution algorithms. Next, we removed or added blocks in Subnetwork1 to explore the effect of varying the number of blocks. In addition, Set2Set, the global pooling algorithm adopted in the PathGNN architecture, is based on iterative content-based attention; we replaced it with the global add pooling and global average pooling algorithms, which return graph-level outputs by summing and averaging node features across the node dimension, respectively. The training procedure for these ablation experiments was the same as before. The corresponding substitutions are sketched below.
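In PyTorch Geometric these substitutions amount to one-line swaps in the block sketch shown earlier; the dimensions below simply mirror that sketch and are assumptions.

```python
# Illustrative module swaps for the ablation study.
from torch_geometric.nn import (GATConv, GCNConv,
                                global_add_pool, global_mean_pool)

in_dim, hidden = 1, 64  # matching the Block sketch above (assumed sizes)

# Graph-convolution variants tried in place of GraphSAGE:
conv_gat = GATConv(in_dim, hidden)   # attention-weighted neighborhood aggregation
conv_gcn = GCNConv(in_dim, hidden)   # spectral-style graph convolution

# Global-pooling variants tried in place of Set2Set; each returns one
# `hidden`-dim vector per graph instead of Set2Set's 2*hidden:
# h = global_add_pool(x, batch)    # sum node features per graph
# h = global_mean_pool(x, batch)   # average node features per graph
```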
Biological interpretation using integrated gradient
The aberrant pathways associated with clinical prognosis were identified by IG. We calculated the importance value (\(\mathrm{IG\,score}_i\)) for the \(i\)-th pathway as follows:
$$\mathrm{IG\,score}_i = \left(S_i - S_i^{\prime}\right)\int_{0}^{1}\frac{\partial f\left(S_i^{\prime} + \alpha\left(S_i - S_i^{\prime}\right)\right)}{\partial S_i}\,d\alpha$$
where \(S_i\) denotes the input feature output by Subnetwork1, \(f\) represents the neural network model (i.e., the PathGNN model), and \(S_i^{\prime}\) represents the baseline. The subscript \(i\) indexes the \(i\)-th pathway.
The IG scores were z-transformed, and pathways with a z-score (\(\mathrm{IG\,score}_i^{z}\)) greater than 1.96 (i.e., P value < 0.05) [30] were considered candidate aberrant pathways. One possible implementation is sketched below.
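As one possible implementation (the paper does not name a library, so this is an assumption), the Captum implementation of integrated gradients can produce these scores with a zero baseline; for brevity, the sketch assumes a model head that takes only the pathway features \(S\).

```python
# A sketch of the IG score computation using Captum (library choice is an
# assumption); `model_head` is assumed to map pathway features S to logits.
import torch
from captum.attr import IntegratedGradients

def candidate_aberrant_pathways(model_head, s, target_class=1):
    """s: tensor of pathway features S, one row per patient."""
    ig = IntegratedGradients(model_head)
    baseline = torch.zeros_like(s)                 # S' = 0 baseline (assumed)
    attr = ig.attribute(s, baselines=baseline, target=target_class)
    scores = attr.mean(dim=0)                      # average over patients
    z = (scores - scores.mean()) / scores.std()    # z-transform the IG scores
    return z > 1.96                                # z > 1.96, i.e. P < 0.05
```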
Comparison with benchmark methods
The predictive performance of PathGNN was compared with several machine learning methods, including random forest (RF), deep neural network (DNN), logistic regression (LR) and a pathway-guided deep neural network (PGDNN) [7,8,9]. For all experiments, the same settings were used as described in the cross-validation section above. RF and LR were implemented with the scikit-learn package (version 1.0.1) [31], while the DNN and PGDNN were built with PyTorch.
The DNN model was constructed with four hidden layers containing 855, 128, 64 and 16 nodes, respectively. The PGDNN architecture was the same as the DNN, except that its first hidden layer, named the pathway layer, represents the biological pathways linked to the input genes. The construction of these two models can be found in Additional file 2. A sketch of such a pathway layer is given below.
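For illustration, a pathway layer can be realized as a linear layer whose weight matrix is masked by gene-pathway membership, so each pathway node connects only to its member genes; the mask construction from the 855 pathways is assumed.

```python
# A sketch of a PGDNN-style pathway layer: a masked linear layer whose
# connectivity encodes gene membership in pathways.
import torch

class PathwayLayer(torch.nn.Module):
    def __init__(self, mask: torch.Tensor):
        """mask: (n_pathways, n_genes) binary gene-membership matrix."""
        super().__init__()
        self.linear = torch.nn.Linear(mask.shape[1], mask.shape[0])
        self.register_buffer("mask", mask.float())

    def forward(self, x):
        # Zero out weights between genes and pathways they do not belong to.
        return torch.nn.functional.linear(
            x, self.linear.weight * self.mask, self.linear.bias)
```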