Figure 1 illustrates the framework of our approach MRNGCN. First, MRNGCN builds three gene relationship networks: a gene–gene network, a gene–outlying gene network and a gene–miRNA network. Then, it learns gene features from the three networks through three parameter-sharing heterogeneous graph convolution network models and a shared self-attention layer. Next, our model jointly utilizes 1D and 2D convolution operations to fuse the gene features learned from the three networks. Finally, we leverage the fused gene features, the original gene features and the gene features learned from the gene–gene network to optimize the model by minimizing the node and link prediction. Meanwhile, a logistic regression model combines these features to predict cancer driver genes.

### Experimental data

We downloaded gene mutation, DNA methylation data, and miRNA expression data from TCGA [2], containing more than 8,000 samples for 16 cancer types. Gene expression data were obtained from Wang et al. [20], which were collected from TCGA and were further normalized and batch corrected by ComBat [21]. Like MTGCN [16], we only focused on cancer types for which gene expression data and DNA methylation information are available in both tumor and normal tissues. Hence, this work involves 16 cancer types. We constructed the gene–gene network based on PPI data from the Consensus Path DB (CPDB) [22] database. We got 13,627 genes and 504,378 gene–gene edges with interaction scores above 0.5. We used the genes in the PPI networks for cancer driver prediction. The miRNA-gene associations from the mirTarbase V8.0 database [23] contain 2,599 miRNAs, 15,064 genes, and 380,639 miRNA-gene associations. The miRNA-disease associations were from the HMDD database version 3.0 (http://www.cuilab.cn/hmdd). The benchmark driver gene of pan-cancer was from MTGCN [16] Additional file 1. They were 796 high confidence driver genes in NGC 6.0. To obtain a negative sample list, we started with all genes and recursively removed genes from the NCG, COSMIC, OMIM databases and KEGG cancer pathways. Hence, our pan-cancer dataset consists of 796 positive and 2,187 negative samples. Cancer type-specific positive samples were from NCG 6.0 tagged with that cancer type and shared the same negative samples with pan-cancer.

### Building multiple gene relationship networks

#### Gene–gene network

We used the PPI network to construct the gene–gene network. Let \(A_{PP} \in \left\{ {0,1} \right\}^{n \times n}\) be the adjacency matrix of the gene–gene network with the number n of genes. If two genes connect through an edge in the PPI network, the corresponding value in the matrix \(A_{PP}\) is 1. Otherwise, it is 0. We used MTGCN [16] to prepare initial attributes for gene nodes in the network (See Additional file 1 for details). Let \(X_{P} \in R^{{n \times F_{1} }}\) denote the initial gene attributes consisting of biological and topological features. For each cancer type, we calculated gene mutation rate, differential DNA methylation rate, and gene differential expression rate as the biological features of the genes. Since we only focused on 16 cancer types, each gene had a 48-dimensional biological feature vector, including 16 mutation rates, 16 methylation values, and 16 differential expression rates, which were min–max normalized. The 16-dimensional topological features of genes resulted from the note2Vec on the gene–gene network. We concatenated the biological and topological features to get the 64-dimensional initial attributes of genes.

#### Gene–outlying gene network

We considered that a driver gene usually affects the expression of genes linked to it in a biological network, so we constructed a gene–outlying gene network. A gene is considered to be outlying if the absolute value of its z-score is above the threshold (this work sets the threshold to 2) [10]. We collected all outlying genes at least expressed abnormally in one sample of the 16 cancer types. Let \(A_{PO} \in \left\{ {0,1} \right\}^{n \times m}\) be the adjacency matrix of the gene–outlying network with the number n of genes and m of outlying genes. We connected a gene and an outlying gene and set the corresponding value of \(A_{PO}\) as 1 if the gene mutates in at least one cancer sample and links to the outlying gene in the PPI network. Hence, the gene–outlying gene network of pan-cancer contains 13,627 genes, 12,248 outlying genes and 469,078 edges. We initialized the attributes of the gene nodes in the gene–outlying network as \(X_{P}\), the same attributes of the gene nodes in the gene–gene network. The initial attributes of the outlying genes consist of the average z-scores across all samples of a cancer type and the frequencies of being outlying among the samples of the cancer type. Let \(X_{O} \in R^{{m \times F_{2} }}\) be the vector of the initial attributes of the outlying genes, which would consist of 16 z-score features and 16 frequency features in the pan-cancer dataset. For convenience, the outlying gene initial features underwent linearly transformation from 32 dimensions to 64, the dimension of initial gene features.

#### Gene–miRNA network

We constructed a gene–miRNA network considering the regulatory relationship of miRNAs on gene expression. The known associations between miRNAs and their targeted genes were from mirTarbase V8.0 [23]. Let \(A_{PR} \in \left\{ {0,1} \right\}^{n \times t}\) be the adjacency matrix of the gene–miRNA network with n genes and t miRNAs. The genes were the mutated gene of TCGA samples. The miRNAs were those that both appeared in the mirTarbase database and the TCGA samples. Hence, the gene–miRNA network involves 1390 miRNAs and 153,913 edges connected with the 13,627 genes. The value of \(A_{PR}\) is 1 if a gene is associated with a miRNA. Otherwise, it is 0. We also initialized the attributes of the gene nodes in the gene–miRNA network as \(X_{P}\). The initial attributes of miRNAs denoted by \(X_{R} \in R^{{t \times F_{3} }}\) include the average z-scores and the average different expression values across all samples of every cancer type and the similarities with other miRNAs. Since miRNAs link to the pathologies of cancers by regulating the expression of their targeted genes and the dysfunction of similar miRNAs would lead to a similar phenotype, we introduced the miRNA similarities as part of initial miRNA attributes. Similar to previous works [19], the miRNA similarity was measured by the miRNA GIP (Gaussian Interaction Profile) kernel based on known miRNA-disease associations. We linearly transform the miRNA GIP similarity matrix and obtained 16-dimensional miRNA similarity features to avoid bias. Finally, for the pan-cancer with 16 cancer types, \(X_{R}\) would have 16 z-scores and 16 differential expression values and the miRNA similarities of length 16 and the number of genes connected to each miRNA. Similarly, for convenience, we linearly transformed the miRNA initial features to the same dimension as initial gene features. To fully use the valuable gene–miRNA associations, we input \(X_{P}\) and \(X_{R}\) into a two-layer heterogeneous graph convlutional network (HGCN) model the same as the model in section "the heterogeneous graph convolutional network" to implement pre-training on the gene–miRNA network and learn new features for genes and miRNA, called \(X_{{P_{pre} }}\) and \(X_{{R_{pre} }}\). Hence, we used \(X_{{P_{pre} }}\) and \(X_{{R_{pre} }}\) as the gene and miRNA initial features of the gene–miRNA network for following feature learning (See Additional file 1 for details).

### Learning node features from multiple networks

#### The heterogeneous graph convolutional network

We employed three two-layer heterogeneous graph convolutional network (HGCN) modules to learn feature representations for the nodes of the three relationship networks. The HGCN modules update the node features by aggregating both neighbourhood features and neighbourhood interactions. To extract the common features of the three relationship networks, we input the three networks and their initial node attributes into three parameter-sharing HGCN models.

Aggregating neighbourhood feature captures the interaction pattern of the node in the network. We started the aggregation by normalizing the adjacency matrix of the three networks. Let \(A_{ij} \in \left\{ {A_{PP} ,A_{PO} ,A_{PR} } \right\}\) be one of the adjacent matrices of the three networks. \(P_{ij} \in \left\{ {P_{PP} ,P_{PO} ,P_{PR} } \right\}\) is its normalized matrix. \(P_{ij} = D_{i}^{{ - \frac{1}{2}}} A_{ij} D_{j}^{{ - \frac{1}{2}}}\), \(D_{i} = \mathop \sum \limits_{j} A_{ij} + 1\) and \(D_{j} = \mathop \sum \limits_{i} A_{ji} + 1\). Since \(A_{ji} = A_{ij}^{T}\), then \(P_{ji} = P_{ij}^{T}\). We took Eqs. (1)–(3) to aggregate neighbourhood features for the nodes in the gene–gene, gene–outlying gene and gene–miRNA networks, respectively. Here, \(\theta_{k} \in R^{F1 \times F2}\) are shared by the three networks.

$$AGG_{NF_{P1}} = P_{PP} X_{P} \theta_{k}$$

(1)

$$AGG_{{NF_{P2} }} = P_{PO} X_{O} \theta_{k} ,\quad AGG_{{NF_{O} }} = P_{OP} X_{P} \theta_{k}$$

(2)

$$AGG_{{NF_{P3} }} = P_{PR} X_{{R_{pre} }} \theta_{k} , \quad AGG_{{NF_{R} }} = P_{RP} X_{{P_{pre} }} \theta_{k}$$

(3)

To further capture the interaction patterns of the network nodes, we considered the neighbourhood interactions, which were measured by the element-wise dot product between the node features and its neighbours' features. Equations (4)–(6) aggregate neighbourhood interactions for nodes of the gene–gene, gene–outlying and gene–miRNA networks, respectively. Here \(W_{1} \in R^{F1 \times F2}\), \(b_{1} \in R^{F1 \times F2}\) are shared by the three networks. \(\odot\) denotes the element-wise product.

$$AGG_{{NI_{P1} }} = P_{PP} X_{P} \odot X_{P} W_{1} + b_{1}$$

(4)

$$AGG_{{NI_{P2} }} = P_{PO} X_{O} \odot X_{P} W_{1} + b_{1} ,\quad AGG_{{NI_{O} }} = P_{OP} X_{P} \odot X_{O} W_{1} + b_{1}$$

(5)

$$AGG_{{NI_{P3} }} = P_{PR} X_{{R_{pre} }} \odot X_{{P_{pre} }} W_{1} + b_{1} , \quad AGG_{{NI_{R} }} = P_{RP} X_{{P_{pre} }} \odot X_{{R_{pre} }} W_{1} + b_{1}$$

(6)

Finally, the HGCN models learned features for nodes of the three networks by aggregating neighbourhood features and interactions. Mathematically, the process can be defined as follows.

$$H\left( X \right)_{V} = HGCN_{V} (\{ H\left( X \right)_{i} \}_{i \in N\left( V \right)} ) = \sigma (AGG_{{NF_{V} }} + AGG_{{NI_{V} }} )$$

(7)

where \(V \in \left\{ {P1,P2,P3,O,R} \right\}\) denotes the target node, \(H\left( X \right)_{V}\) denotes the features of the node \(V\) learned from the corresponding network through the HGCN model. \(H\left( X \right)_{i}\) is features of the neighbores of the target node, and \(N\left( V \right)\) denotes neighbor set of the node \(V\) in one of the three networks, and \(\sigma\) denotes the activation function, e.g. ReLU.

Our graph convolution module contains multiple graph convolution layers. Setting the number \(L\) of graph convolution layers, \(H\left( X \right)_{P1}\), \(H\left( X \right)_{P2}\) and \(H\left( X \right)_{P3}\) denote the final gene features learned by the HGCN models from the gene–gene, gene–outlying gene and gene–miRNA networks, respectively. Equation (8) express the process. Here, \(L = 2\).

$$\left\{ {\begin{array}{*{20}l} {H\left( X \right)_{P1} = HGCN_{P}^{L} \left( {\left\{ {HGCN_{P}^{L - 1} \cdots HGCN_{P}^{1} \left( {\left\{ {X_{P} } \right\}_{P \in N\left( P \right)} } \right)} \right\}_{P \in N\left( P \right)} } \right)} \hfill \\ {H\left( X \right)_{P2} = HGCN_{P}^{L} \left( {\left\{ {HGCN_{P}^{L - 1} \cdots HGCN_{P}^{1} \left( {\left\{ {X_{O} } \right\}_{O \in N\left( P \right)} } \right)} \right\}_{O \in N\left( P \right)} } \right)} \hfill \\ {H\left( X \right)_{P3} = HGCN_{P}^{L} \left( {\left\{ {HGCN_{P}^{L - 1} \cdots HGCN_{P}^{1} \left( {\left\{ {X_{{R_{pre} }} } \right\}_{R \in N\left( P \right)} } \right)} \right\}_{R \in N\left( P \right)} } \right)} \hfill \\ \end{array} } \right.$$

(8)

#### Bilinear aggregation layer

Since the cancer driver genes cause abnormal expression of their downstream genes and lead to cancer development, we constructed the gene–outlying gene network to learn gene features that help detect cancer driver genes. The outlying genes working together contribute to cancer progression [9, 10]. To take advantage of the interactions between the outlying genes, we introduced a bilinear graph neural network (BGNN) [24] layer to learn feature representations for the nodes of the gene–outlying network. In Eq. (9), genes in the gene–outlying network can also aggregate the interaction features between their neighbors. Here, we considered the gene nodes themselves and merge them into the neighbor set to obtain the extended neighbourhood \(\tilde{N}\left( P \right)\). Moreover, we ignored self-interactions in the neighbourhood to avoid introducing additional noise. The bilinear aggregation features of genes in the gene–outlying gene network defined as follows.

$$H\left( X \right)_{P}^{BA} = \frac{1}{{b_{P} }}\mathop \sum \limits_{{i \in \tilde{N}\left( P \right)}} \mathop \sum \limits_{{j \in \tilde{N}\left( P \right)\& i < j}} \left( {X\left( i \right)W + b} \right) \odot \left( {X\left( j \right)W + b} \right)$$

(9)

where \(W\) and \(b\) denote the learnable parameters, ⊙ is the product of elements. and are two different node indices from \(\tilde{N}\left( P \right)\). \(X \in \{ X_{P} ,X_{O} \}\) is the initial features of genes or outlying genes. \(b_{P} = \frac{1}{2}\tilde{d}_{p} \left( {\tilde{d}_{p} - 1} \right)\) denotes the number of node interactions.

Hence, the genes in the gene–outlying gene network can learn two features from the network. The one is \(H\left( X \right)_{P2}\), learned by a two-layer HGCN. The other is \(H\left( X \right)_{P}^{BA}\), learned by a BGNN layer. We summarized the two features and balanced their weights using a parameter to get the final gene features, \({\tilde{\text{H}}}\left( X \right)_{P2}\), from the gene–outlying network.

$${\tilde{\text{H}}}\left( X \right)_{P2} = \left( {1 - \alpha } \right)*H\left( X \right)_{P2} + \alpha *H\left( X \right)_{P}^{BA}$$

(10)

#### Self-attention layer

After running HGCN models on the three relational networks, we learnt gene features \(H\left( X \right)_{P1}\), \({\tilde{\text{H}}}\left( X \right)_{P2}\) and \(H\left( X \right)_{P3}\) from the gene–gene, gene–outlying gene and gene–miRNA networks, respectively. Previous studies observed that driver genes often work together to form protein complexes or are enriched in some signal pathways. To use the interactions between genes and pay more attention to the crucial interactions when learning features for genes, we took the three features as inputs of a self-attention layer seperately. The self-attention module can naturally combine all gene features from a network as inputs, allowing the inputs to interact with each other and find out who they should pay more attention to. For example, we took the gene features from the gene–gene network as inputs of the self-attention model (See Eq. (11)). The self-attention model multiplies every input with \(W^{Q}\), \(W^{K}\) and \(W^{V}\) to obtain its query(\(Q_{1}\)), key(\(K_{1}\)) and value(\(V_{1}\)) representations. The genes in the gene–gene network(Attention(\(Q_{1}\), \(K_{1}\), \(V_{1}\)) generated their contextual representations through multiplication between the weighted attention-score matrix and all inputs' values. The weighed attention-score matrix was calculated by applying a softmax on a dot product between the queries with all inputs’ keys, divided each by \(\sqrt d\).

$$\left\{ {\begin{array}{*{20}l} {\left[ {Q_{i} ,K_{i} ,V_{i} } \right] = H\left( X \right)_{Pi} \left[ {W^{Q} ,W^{K} ,W^{V} } \right]} \hfill \\ {Attention\left( {Q_{i} ,K_{i} ,V_{i} } \right) = softmax(Q_{i} K_{i}^{T} /\sqrt d )V_{i} } \hfill \\ \end{array} } \right.$$

(11)

where i = 1,2,3, which represents self-attention layer for three networks, \(d\) is the dimensionality of \(Q\) and \(K\). \(K_{*}^{T}\) is the matrix transpose. To preserve the uniqueness of the gene features learned for every network, we added the gene features before the self-attention layer with the gene features after the self-attention to obtain the gene features \(H\left( X \right)_{att}^{i} , i = 1,2,3\) for the gene–gene, gene–outlying gene and gene–miRNA networks, respectively (see Eq. (12)).

$$H\left( X \right)_{att}^{i} = Attention\left( {Q_{i} ,K_{i} ,V_{i} } \right) + H\left( X \right)_{Pi} .$$

(12)

#### Feature fusion

After passing the initial gene features through the HGCN model and self-attention layer, we obtained three gene features from the three networks, denoted by \(H\left( X \right)_{att}^{1}\), \(H\left( X \right)_{att}^{2}\), \(H\left( X \right)_{att}^{3}\). Then, we employed three 1D-convolution modules to reduce the dimensions of the three gene features and a 2D-convolution module to fuse the gene features.

Every 1D-convolution module consists of two convolution layers. The size of the convolution kernel for both convolution layers is 1. The number of input channels of the module is the dimension of the feature matrix and the number of output channels is 1. The learned gene features are denoted as \(H\left( X \right)_{1D}^{1}\), \(H\left( X \right)_{1D}^{2}\), \(H\left( X \right)_{1D}^{3}\) for the gene–gene, gene–outlying gene and gene–miRNA networks, respectively.

We integrated the three gene features from the output of the 1D-convolution module to generate fused gene features (denoted as \(H\left( X \right)_{2D}\)) through a 2D-convolution module. The 2D-convolution module consists of a two-dimensional convolutional layer (see Additional file 1: Figure S1). Firstly, we stack the three gene features from the output of the 1D-convolution module (i.e., \(H\left( X \right)_{1D}^{1}\), \(H\left( X \right)_{1D}^{2}\), \(H\left( X \right)_{1D}^{3}\)) to obtain a feature matrix \(H\left( X \right)_{stack} \in R^{3 \times n \times 1}\), with \(n\) denoting the number of gene nodes. Then we padded out a circle of zeros around the \(H\left( X \right)_{stack}\) and implemented two-dimensional convolutional operations on the \(H\left( X \right)_{stack}\). The convolution kernel size is \(\left( {w_{c} \times h_{c} } \right)\) and the perceptual field of \(H\left( X \right)_{stack}\) is \(3 \times w_{c} \times h_{c}\). Here, we set \(w_{c} = h_{c} = 3\). The input channel is 3, and the output channel is 1. We summed the feature maps on each channel to form the fused gene features, \(H\left( X \right)_{2D} \in R^{n \times 1}\).

### Model optimization and driver gene prediction

#### Model optimization

After passing the initial input gene features through the HGCN modules, the self-attention layer and the feature fusion module, we obtained the gene feature representations containing the network context information (\(H\left( X \right)_{2D}\)). The original gene features characterizing the gene themselves also play an important role in driver gene identification. Hence we pass the initial gene features through a three-layer multi-layer perceptron (MLP) to get another gene feature representations, called \(H\left( X \right)_{mlp}\), defined in Eq. (13).

$$H\left( X \right)_{mlp} = Linear^{3} \left( {\sigma \left( {Linear^{2} \left( {\sigma \left( {Linear^{1} (X_{P} )} \right)} \right)} \right)} \right)$$

(13)

where \(X_{P}\) stores initial gene features. We added \(H\left( X \right)_{mlp}\) with \(H\left( X \right)_{2D}\) to generate synthesis gene feature representations denoted by \(H\left( X \right)_{syn}\) and used the synthesis gene features to predict cancer driver genes after passing the sigmoid function. A binary cross-entropy was employed to quantify the node prediction loss.

$$L_{n\_loss} \left\{ {H\left( X \right)_{syn} } \right\} = - \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left[ {y_{i} log\left( {\widehat{{y_{i} }}} \right) + \left( {1 - y_{i} } \right)log\left( {1 - \widehat{{y_{i} }}} \right)} \right]$$

(14)

where \(\widehat{{y_{i} }}\) is the predicted score of the gene \(i\), and \(y_{i}\) is its real label whose value is 0 or 1, \(n\) is the number of genes in the training dataset. To ensure the reliability of the gene features learned in the network context and their independent predictive power, we applied the sigmoid function on the \(H\left( X \right)_{2D}\) for cancer gene predictions. The node prediction loss was calculated as follows.

$$L_{n\_loss1} \left\{ {H\left( X \right)_{2D} } \right\} = - \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left[ {y_{i} log\left( {\widehat{{y_{i} }}} \right) + \left( {1 - y_{i} } \right)log\left( {1 - \widehat{{y_{i} }}} \right)} \right]$$

(15)

The HGCN model updates the representation of the gene nodes in the networks by aggregating their neighbors' features and interactions with their neighbors. However, this model does not ensure that the learned gene features maintain the original network structure. Moreover, the model parameters are optimized on the limited number of known driver gene labels. To solve this problem, we implemented an inner product between the gene features learned from the gene–gene network, (i.e. \(H\left( X \right)_{1D}^{1}\)) to predict the network links [16]. The reconstructed adjacency matrix is \(\hat{A}_{PP}\).

$$\hat{A}_{PP} = \sigma (\{ H\left( X \right)_{1D}^{1} \} \{ H\left( X \right)_{1D}^{1} \;^{T} \} )$$

(16)

where \(\sigma\) is the sigmoid function. We then calculated the binary cross-entropy loss of the link prediction (see Eq. 17)).

$$L_{r\_loss} \left\{ {H\left( X \right)_{1D}^{1} } \right\} = - \frac{1}{n}\left\{ {\mathop \sum \limits_{i,j \in E} \left\{ {log\hat{a}_{i,j} } \right\} + \mathop \sum \limits_{i,j \in Neg} \left\{ {1 - log\hat{a}_{i,j} } \right\}} \right\}$$

(17)

where \(E\) is the edge set of the gene–gene network, and \(n\) is the size of \(E\). \(Neg\) is the set of negative samples with size \(n\), obtained by negative sampling, and \(\hat{a}_{i,j}\) is the value of the reconstructed adjacency matrix.

Our final loss function consists of two node-prediction losses and a link-prediction loss, defined in Eq. (18). \(\omega_{1}\) and \(\omega_{2}\) are hyper-parameters, regulating the weight of each loss term in training.

$$L_{total} = L_{n\_loss} + \omega_{1} *L_{n\_loss1} + \omega_{2} *L_{r\_loss} .$$

(18)

#### Predicting cancer driver genes

Our MRNGCN approach is composed of several modules. Every modules encodes different kinds of gene features. These features character genes' roles in cell life from different views. For example, the gene features from the 1D-convolution modules (i.e. \(H\left( X \right)_{1D}^{1}\), \(H\left( X \right)_{1D}^{2}\), \(H\left( X \right)_{1D}^{3}\)) represent gene characteristics in the three networks. The 2D-convolution module produces the fused gene features \(H\left( X \right)_{2D}\), and \(H\left( X \right)_{mlp}\) represents the original gene features. We leveraged a logistic regression (LR) model to combine these gene features to predict cancer driver genes in the test set. The definition of the LR model is as follows.

$$x = w_{1} H\left( X \right)_{1D}^{1} + w_{2} H\left( X \right)_{1D}^{2} + w_{3} H\left( X \right)_{1D}^{3} + w_{4} H\left( X \right)_{2D} + w_{5} H\left( X \right)_{mlp} + \varepsilon$$

$$f\left( x \right) = \frac{1}{{1 + e^{ - x} }}$$

(19)

where \({w}_{1}\), \({w}_{2}\), \({w}_{3}\), \(w_{4}\), \(w_{5}\) are weights of the LR model, illustrating the contribution of each feature to the driver gene identification. See Additional file 1 for the pseudo-code of MRNGCN.