CGINet: graph convolutional network-based model for identifying chemical-gene interaction in an integrated multi-relational graph

Background Elucidation of interactive relation between chemicals and genes is of key relevance not only for discovering new drug leads in drug development but also for repositioning existing drugs to novel therapeutic targets. Recently, biological network-based approaches have been proven to be effective in predicting chemical-gene interactions. Results We present CGINet, a graph convolutional network-based method for identifying chemical-gene interactions in an integrated multi-relational graph containing three types of nodes: chemicals, genes, and pathways. We investigate two different perspectives on learning node embeddings. One is to view the graph as a whole, and the other is to adopt a subgraph view that initial node embeddings are learned from the binary association subgraphs and then transferred to the multi-interaction subgraph for more focused learning of higher-level target node representations. Besides, we reconstruct the topological structures of target nodes with the latent links captured by the designed substructures. CGINet adopts an end-to-end way that the encoder and the decoder are trained jointly with known chemical-gene interactions. We aim to predict unknown but potential associations between chemicals and genes as well as their interaction types. Conclusions We study three model implementations CGINet-1/2/3 with various components and compare them with baseline approaches. As the experimental results suggest, our models exhibit competitive performances on identifying chemical-gene interactions. Besides, the subgraph perspective and the latent link both play positive roles in learning much more informative node embeddings and can lead to improved prediction.

to screen as many potential drug candidates as possible in the prophase. Over 80% of FDA-approved drugs are small-molecule chemicals that act on single or multiple gene (or protein) targets, ultimately achieving curative effects [2,3]. Elucidating the interactive relations between chemicals and genes, named chemical-gene interactions (CGIs), is therefore of key relevance not only for discovering new drug leads in drug development but also for repositioning existing drugs to novel therapeutic targets. With known CGIs, numerous studies have provided new insights into rapidly screening candidate chemicals for the treatment of corresponding diseases, such as HIV [4], HCV [5], and lung cancer [6]. Unfortunately, proven CGIs are available only in limited amounts. For example, the PubChem database contains more than 30 million chemicals, but few have confirmed gene targets [7]. This predicament drives the need for automatic and efficient methods that infer chemical-gene interactions as a preliminary step, rather than experimentally testing every possible chemical-gene pair, which is time-consuming and costly. According to the kind of data used, we roughly divide the computational methods for CGI prediction into three categories: biomedical literature-based, molecular structure-based, and biological network-based.

Biomedical literature-based approaches
A wealth of knowledge about chemical-gene interactions is scattered over the published biomedical literature, which makes querying CGI information of interest inefficient. The challenge is to detect chemicals and genes that are closely associated in unstructured text and to determine which type of interaction they share. Biomedical literature-based methods tackle these problems with hand-designed or deep-learning features enhanced by natural language processing (NLP) techniques [8][9][10]. In recent studies, multiple deep neural network (DNN) models, including the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), and attention-based DNN, have been applied to learn CGI classifiers [11][12][13]. These approaches feed the DNN models with low-dimensional pre-trained word embeddings and avoid complicated feature engineering. Notably, attention-based DNN models exhibit competitive performance compared with other models and have the inherent ability to extract salient features for CGI identification as needed. In addition, some studies extend the language models with syntactic and semantic information, such as part of speech (POS), syntactic structure, dependency trees, and knowledge graphs, for a better understanding of the context [8,14,15]. However, methods based on biomedical articles are limited in that they cannot predict unpublished or unknown CGIs.

Molecular structure-based approaches
Among these methods, molecular docking, which explores the predominant binding modes of two interacting molecules using known 3D structures, was studied first [16,17]. It uses various scoring functions to predict the binding affinity of molecules. Its limitations are that it depends critically on the availability of high-quality 3D-structure data and generally requires excessive computing resources. Follow-up studies focus on representing chemicals and genes by fingerprints as inputs to machine-learning models [7,18,19], such as logistic regression, k-nearest neighbor (KNN), and support vector machine (SVM). The fingerprint is the most commonly used descriptor of the substructures of a molecule. However, a fingerprint is defined as a binary vector in which each entry indicates whether a given substructure exists in the molecule, which makes it quite sparse and not sufficiently informative for CGI prediction. Recent studies have paid more attention to applying end-to-end models to the simplified molecular-input line-entry system (SMILES) string for chemicals and the structural property sequence (SPS) for genes to learn super representations [2,20,21,22]. The results demonstrate that models trained with super representations are more robust than those trained with traditional descriptors.

Biological network-based approaches
Compared with molecular structure-based approaches, biological network-based approaches combine the chemical space and the gene space into a consistent space through a constructed heterogeneous network/graph. Chemicals and genes are treated as nodes of the network. The links between two nodes denote their interactive relations, including intra-domain relations between two nodes of the same type, e.g. chemical-chemical interactions, and cross-domain relations between two nodes of different types, e.g. chemical-gene interactions [23]. Multiple large-scale databases have captured as much knowledge as possible about chemical-gene interactions from publicly accessible data, such as STITCH (Search Tool for InTeractions of Chemicals) [24] and CTD (Comparative Toxicogenomics Database) [25]. The emergence of these aggregated databases provides new opportunities for CGI prediction. Numerous studies have developed network-based inference models that integrate diverse CGI-related information from the heterogeneous network and automatically learn the features of individual nodes for predicting missing relations [26][27][28]. The biological network-based approach is well suited to extracting potential CGIs because it does not rely on descriptions of specific biological properties or on 3D-structure data of molecules.
Research on identifying chemical-gene interactions is still in its infancy, and there is much room for improvement in performance. In this manuscript, we present the CGINet model, which uses an encoder-decoder framework to formulate the CGI identification problem as a task of multi-relational link prediction between chemicals and genes in a heterogeneous network/graph containing three types of nodes: chemicals, genes, and pathways. CGINet employs the graph convolutional network (GCN) as an encoder that aggregates, transforms, and propagates neighborhood information over the graph. We investigate two different perspectives on learning node embeddings. One is to view the graph as a whole, and the other is to adopt a subgraph view in which initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for learning the final node embeddings. Lastly, the node embeddings are sent to the decoder, which uses a tensor decomposition model to formulate chemical-gene interactions. CGINet is trained end to end: the encoder and the decoder are trained jointly with known CGIs in a multi-relational graph.
We study three implementations of the CGINet model with various components and compare them with baseline approaches. As the experimental results suggest, our models exhibit competitive performance in predicting chemical-gene interactions. The main contributions of our work are: (1) We present a graph convolutional network-based model to predict the missing links between chemicals and genes in a heterogeneous graph. Our model takes advantage of the information from latent links based on biological insights and outperforms the baseline models. (2) The model that adopts a subgraph perspective can dramatically reduce the training time and also improves performance. (3) Our model is capable of predicting novel chemical-gene interactions that do not appear in the original graph.

Experimental settings
We construct a multi-relational graph containing 65 types of chemical-gene interaction. Every given chemical-gene pair is assigned to none, one, or more interaction types. As in most graph-based approaches [26][27][28], we randomly split the CGI instances of each interaction type into training, validation, and test sets in an 8:1:1 ratio. The CGINet model is optimized with the Adam optimizer [29], and the parameters used in our models are summarized in Table 1. We measure the performance of each interaction type individually using the area under the receiver-operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the average precision for the top-k identifications (AP@k). To avoid overfitting, we perform cross-validation and initialize the trainable parameters with multiple random seeds. The experimental results are reported as average performance. We implement the CGINet model in Python using the TensorFlow package [30].
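The per-type 8:1:1 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_type`, the triple layout `(chemical, interaction_type, gene)`, and the toy data are assumptions made for this example.

```python
import random
from collections import defaultdict

def split_by_type(triples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (chemical, interaction_type, gene) triples 8:1:1 within each type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for chem, rel, gene in triples:
        by_type[rel].append((chem, rel, gene))

    train, valid, test = [], [], []
    for rel in sorted(by_type):          # fixed order keeps the split reproducible
        edges = by_type[rel]
        rng.shuffle(edges)
        n_train = int(ratios[0] * len(edges))
        n_valid = int(ratios[1] * len(edges))
        train += edges[:n_train]
        valid += edges[n_train:n_train + n_valid]
        test += edges[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Toy data (hypothetical identifiers): 20 edges of one type, 10 of another.
triples = [(f"c{i}", "decreases^activity", f"g{i}") for i in range(20)]
triples += [(f"c{i}", "increases^expression", f"g{i}") for i in range(10)]
train, valid, test = split_by_type(triples)
print(len(train), len(valid), len(test))
```

Splitting inside each interaction type, rather than over the pooled edge list, keeps rare types represented in all three sets, which matters when performance is reported per type.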

Baseline approaches
(1) Random walk-based embeddings. The DeepWalk model learns node embeddings by randomly capturing neighborhood information on the basis of depth-first search, while the Node2Vec model combines depth-first search and breadth-first search to aggregate proximal nodes. (2) Matrix factorization-based embeddings. The SVD and the Laplacian models both factorize the adjacency matrix of the graph to obtain the node embeddings. For (1) and (2), we use the learned node embeddings as input to train a logistic regression classifier for each interaction type. (3) Graph convolutional network-based methods. We employ a 2-layer GCN to learn node embeddings with the CG-graph or the total graph, named GCN-CG and GCN-Total, respectively.

Performance comparison of different thresholds
A threshold coefficient is designed in our model as a gatekeeper that controls the requirement for a definite latent link. We investigate how the performance of our model changes with different thresholds. As shown in Fig. 1, a larger threshold leads to fewer latent links. The overall performance of CGINet-2 and CGINet-3 increases as the threshold grows. Specifically, CGINet-2 with a threshold of 0.4 and CGINet-3 with a threshold of 0.5 perform best. This suggests that a stricter threshold makes the latent links more credible for updating the topological structure of the graph.
We proceed by comparing the CGINet models with various components against the baseline models. Table 2 gives the performance comparison of our models with the baseline methods. Matrix factorization-based approaches and random walk-based approaches both learn node embeddings and train relation classifiers in two separate stages. The matrix factorization-based methods (SVD, Laplacian) perform better than the random walk-based methods (DeepWalk, Node2Vec) on such a heterogeneous multi-relational graph, because random walk-based approaches depend excessively on the specific structure of the graph. In contrast, the CGINet models train the encoder and the decoder jointly. Most of our models outperform the baseline models; in particular, the CGINet-2 model achieves a 5.7% relative improvement in AUPRC over the best baseline result (Laplacian). Compared to GCN-CG, GCN-Total shows a clear performance degradation; in particular, its AP@20 drops to 57.1%. We hypothesize that this is due to the limited ability of the GCN-Total model to focus on the interactions of interest in an integrated multi-relational graph that also contains non-target associations (e.g. chemical-pathway associations, gene-pathway associations). Based on this assumption, we investigate a subgraph view that learns target node representations in two steps in the CGINet-1 model.
It is encouraging to see that CGINet-1 outperforms GCN-Total by 7.8% (AUROC), 10.4% (AUPRC), and 19.9% (AP@20), indicating that more focused learning of node embeddings makes better use of the graph data. Furthermore, compared with GCN-CG, CGINet-1 yields a relative improvement of about 4% in AUPRC. This verifies that initial node embeddings pre-trained with the binary association subgraphs provide practical knowledge for learning the final node embeddings.

Comparison with baseline models
A further comparison among our models (CGINet-1, CGINet-2, and CGINet-3) reveals that the models that aggregate information from new neighbor nodes across latent links perform better than the model that only captures labeled neighborhood information. Specifically, CGINet-2 and CGINet-3 lead to an increase of about 2% in AUPRC compared with CGINet-1. This is consistent with our findings in the "Data observation" section that updating the topological properties of nodes with latent links provides informative features for learning more effective node embeddings. In addition, CGINet-2 achieves the best AUROC (92.7%) but is inferior to CGINet-3 in AP@20, where it drops to 76.5%. Overall, the latent rate setting enhances the classification power of the model at the cost of ranking ability. Consequently, the CGINet-3 model, which assumes an equal contribution of latent links for each interaction type, has better overall performance.
The above analysis illustrates that our models that adopt the subgraph view can significantly improve performance. We also calculate the average training time per epoch for the GCN-based models, as shown in the last column of Table 2. Compared with GCN-Total, our models reduce the training time by at least 65% while achieving much better performance.

Comparison on interaction type-wise performance
As shown in Fig. 2, compared to CGINet-1, CGINet-3 achieves improved performance on more than half of the interaction types (34 of 65 types; right side of Fig. 2) but degraded performance on the other types (left side of Fig. 2). A detailed investigation of the performance per interaction type shows that encoding updated neighborhood information across latent links tends to play a positive role in predicting some specific interaction types regardless of the degree of action (e.g. cleavage, sumoylation, metabolic processing, and glucuronidation), but contributes negatively to identifying some other types (e.g. secretion, transport, and reaction). More interestingly, metabolic processing is the parent interaction type of cleavage, sumoylation, and glucuronidation. This inspires us to optimize our models in later research by paying more attention to the underlying mechanisms of the biological reactions. We list the 15 best-performing interaction types of the CGINet-3 model in Table 3. It is also worth noting that even though some interaction types have extremely few known edges for training, the model can still predict them well, e.g. decreases^acetylation (147 edges), affects^chemical synthesis (181 edges), and decreases^cleavage (188 edges). We believe that a global decoder associated with all interaction types enables our model to share information across different types of interactions.

Discussion
For the random walk-based approaches, the chemical space and the gene space are combined into a consistent space, and the node embeddings are learned in a homogeneous graph. In contrast, the essence of our model is to analyze the dependency between different semantic spaces in a heterogeneous graph. This allows us to integrate more diverse biomedical data into our model, such as disease and phenotype information. We can not only explore the relation between chemicals and genes but also discover more internal connections in the molecular and patient population data.
It is also worth noting that our model is capable of predicting novel chemical-gene interactions that do not appear in the original graph. With Eqs. (4) and (5), we can calculate the probability $P_r^{ij}$ of unknown chemical-gene pairs $(c_i, g_j)$ under each interaction type $r$. A higher probability indicates that chemical $c_i$ is more likely to interact with gene $g_j$. We can then turn to online public databases to check whether corresponding literature evidence can be retrieved. Table 4 provides some novel predictions with literature evidence.

Fig. 2 Comparison of CGINet-1 and CGINet-3 on interaction type-wise performance in AUPRC. The interaction type identifiers are sorted by the difference between the performances of CGINet-3 and CGINet-1 in AUPRC

Conclusions
In this paper, we present CGINet, a graph convolutional network-based method for predicting chemical-gene interactions in an integrated multi-relational graph. CGINet adopts a subgraph view in which the initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for more focused learning of higher-level target node representations. The experimental results show that the CGINet models exhibit competitive performance compared with the baseline models. Moreover, learning node embeddings with latent links can lead to improved performance.
CGINet is a transductive learning method that is applied to a static graph. Specifically, we train the graph neural network with all known nodes and part of the edges (the training edges) in the graph, producing an embedding for each node. The graph neural network learns the node embeddings from neighborhood information through the adjacency matrix (or Laplacian matrix). Adding new nodes to the graph therefore changes the adjacency matrix (or Laplacian matrix), and the model has to be retrained. This inherent property makes the graph neural network poorly suited to dynamic graphs.
In future work, we are interested in enhancing the capacity of our model to deal with dynamic graphs. Moreover, we will gather more diverse biomedical information (e.g. compound-disease associations, gene-disease associations, and pathway-disease associations) and construct a larger-scale bio-network for thoroughly analyzing the mechanisms of action of the biological reactions. We aim to build a robust model that captures long-range dependencies between different molecules with better interpretability.

Integrated multi-relational graph
We construct a heterogeneous graph containing three types of nodes: chemicals, genes, and pathways, where pathways can shed light on the mechanism of action underlying a CGI. A total of five individual graphs related to chemicals, genes, and pathways, including four binary association subgraphs [chemical-chemical graph (CC-graph), gene-gene graph (GG-graph), chemical-pathway graph (CP-graph), and gene-pathway graph (GP-graph)] and one multi-interaction subgraph [chemical-gene graph (CG-graph)], are collected from multiple curated databases and used to construct an integrated multi-relational graph.

Binary association subgraphs
We extract the CC-graph from the STITCH database, which contains 17,705,818 chemical-chemical associations across 389,393 chemicals. For the GG-graph, we take 715,612 gene-gene associations between 19,081 genes compiled by Decagon [40]. We obtain the CP-graph and the GP-graph from the Comparative Toxicogenomics Database. There are 1,285,158 chemical-pathway associations and 135,809 gene-pathway associations involving 10,034 chemicals, 11,588 genes, and 2,352 pathways.

Multi-interaction subgraph
A link in the multi-interaction graph represents the association between two nodes as well as their interaction type. We construct the CG-graph from 13,488 chemicals, 50,876 genes, and 1,935,152 chemical-gene interactions pulled from the Comparative Toxicogenomics Database. Each CGI has a degree (increases, decreases, or affects) and a type (e.g. activity, expression, and reaction). For example, "Chemical X decreases the activity of Gene Y" is denoted as the triple (chemical X, decreases^activity, gene Y).
Herein, we consider only the 65 types of interactions between chemicals and genes that each appear in at least 180 CGIs. In addition, the CC-graph and the GG-graph are both trimmed by deleting nodes not involved in the CP-graph, GP-graph, and CG-graph. The final integrated graph has 14,269 chemicals, 51,069 genes, and 2,363 pathways. These nodes are connected by a total of 4,653,387 associations/interactions. An example of the integrated multi-relational graph and the detailed statistics of the final graph are shown in Fig. 3 and Table 5, respectively.
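The triple encoding and the frequency filter described above can be illustrated with a short sketch. The helper names `make_relation` and `filter_frequent_types`, and the toy records, are hypothetical; only the `degree^type` label format and the 180-CGI cutoff come from the text.

```python
from collections import Counter

def make_relation(degree, interaction):
    """Build a relation label such as 'decreases^activity'."""
    return f"{degree}^{interaction}"

def filter_frequent_types(triples, min_count=180):
    """Keep only triples whose interaction type occurs at least min_count times."""
    counts = Counter(rel for _, rel, _ in triples)
    kept = {rel for rel, n in counts.items() if n >= min_count}
    return [t for t in triples if t[1] in kept], kept

# Toy example with a small cutoff so the effect is visible.
toy = [
    ("chemical X", make_relation("decreases", "activity"), "gene Y"),
    ("chemical X", make_relation("decreases", "activity"), "gene Z"),
    ("chemical W", make_relation("decreases", "activity"), "gene Y"),
    ("chemical W", make_relation("affects", "binding"), "gene Z"),
]
filtered, kept_types = filter_frequent_types(toy, min_count=2)
print(kept_types)     # only the type that occurs at least twice
print(len(filtered))  # number of triples that survive the filter
```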

Data observation
The clustering result achieved in Parsons et al. [41] suggests that chemicals tend to cluster with genes that are related to each other. More specifically, if chemical $c_1$ interacts with gene $g_1$, and gene $g_1$ is genetically associated with gene $g_2$, then we can reasonably assume that chemical $c_1$ and gene $g_2$ interact chemical-genetically. In other words, there is a latent link connecting chemical $c_1$ and gene $g_2$. Based on this assumption, we examine two types of topological substructures, S-G and S-G-P. Figure 4 gives examples of these two substructures.
Firstly, the substructures matching the S-G and the S-G-P are extracted separately from the entire multi-relational graph. Secondly, we count the number of CGIs contained in the S-G or the S-G-P, with de-duplication. After that, we investigate the frequency distribution of interaction types and the proportion of CGIs involved in the S-G or the S-G-P for each interaction type. We find that: (1) on average, more than 62% of individual CGIs are involved in the S-G, and about 50% of individual CGIs are involved in the S-G-P, suggesting that it is important to capture unknown but potential links to update the topological properties of chemicals and genes for learning much more informative node embeddings. (2) The frequencies of CGIs involved in the S-G and in the S-G-P both decrease as the total number of CGIs in each interaction type group decreases (Fig. 5). The reason probably lies in the extreme imbalance of the data, where 20% of the interaction types cover about 93% of the CGIs (e.g. increases^expression, decreases^expression, and affects^cotreatment). Therefore, in the "Results" section we specifically investigate whether different contributions of latent links should be considered for each interaction type. These findings directly inform the development of the model in the following section.

Fig. 3 An example of the integrated multi-relational graph. The links shown in red indicate that Dichlorodiphenyl Dichloroethylene (node $c_1$) results in decreased activity of CYP19A1 (node $g_1$), SRD5A2 (node $g_2$), and increased secretion of ADIPOQ (node $g_3$). Chemical-chemical associations, gene-gene associations, chemical-pathway associations, and gene-pathway associations involved in this case are marked as highlighted blue links
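The latent-link idea can be sketched in a few lines. The sketch below uses the simpler S-G pattern (a known chemical-gene edge plus a gene-gene association); the S-G-P pattern would additionally require a shared pathway node. The function names, the toy edges, and the name `eps` for the threshold coefficient are assumptions made for this illustration, and the selection rule is one reading of the threshold criterion given in the "Subgraph perspective" section: a candidate is kept when its support is at least max(2, largest support under that interaction type × `eps`).

```python
from collections import Counter, defaultdict

def candidate_latent_links(cg_edges, gg_edges):
    """Count S-G substructures supporting each candidate latent link (c, r, g2)."""
    gene_nbrs = defaultdict(set)
    for g1, g2 in gg_edges:              # gene-gene associations are undirected
        gene_nbrs[g1].add(g2)
        gene_nbrs[g2].add(g1)
    known = set(cg_edges)
    support = Counter()
    for chem, rel, g1 in known:
        for g2 in gene_nbrs[g1]:
            if (chem, rel, g2) not in known:   # only unobserved pairs are candidates
                support[(chem, rel, g2)] += 1
    return support

def select_latent_links(support, eps):
    """Keep candidates whose support passes the per-type threshold (assumed form)."""
    max_by_rel = defaultdict(int)
    for (_, rel, _), n in support.items():
        max_by_rel[rel] = max(max_by_rel[rel], n)
    return {link for link, n in support.items()
            if n >= max(2, max_by_rel[link[1]] * eps)}

# Toy graph with hypothetical node names.
cg_edges = [("c1", "r", "g1"), ("c1", "r", "g3"), ("c2", "r", "g1")]
gg_edges = [("g1", "g2"), ("g3", "g2"), ("g1", "g4")]
support = candidate_latent_links(cg_edges, gg_edges)
latent = select_latent_links(support, eps=0.5)
print(support)
print(latent)
```

In the toy graph, the pair (c1, g2) is supported by two substructures (through g1 and through g3), so it is the only candidate that survives the lower bound of 2.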

Problem formulation
The CGI identification problem is formulated as a task of link prediction in the integrated multi-relational graph, which includes four binary association subgraphs and one multi-interaction subgraph. We denote the associated relation set as $\bar{R} = \{r_{cc}, r_{gg}, r_{cp}, r_{gp}\}$, and the interactive relation set as $\tilde{R}$, where $N_{cg}$ is [...]. Using the graph $G$, our goal is to calculate the probability that an edge $e_{ij} = (v_i, r, v_j)$, with $i \in [N_c]$ and $j \in [N_g]$, of interaction type $r$ is assigned to $\tilde{R}$, which indicates how likely chemical $v_i$ results in an interaction of type $r$ with gene $v_j$. To achieve that, we develop an end-to-end trainable model, CGINet (Fig. 6a), that has two main components: a graph convolutional encoder (Fig. 6b) and a tensor decomposition decoder (Fig. 7).

Fig. 4 The examples of the S-G and the S-G-P. a S-G. Nodes $c_1$, $g_1$, and $g_2$ are linked in pairs. b S-G-P. Based on the structure of the S-G, chemical and gene nodes also share node $p_1$ in the S-G-P. The S-G indicates the potential interaction between chemical $c_1$ and gene $g_2$ if chemical $c_1$ interacts with gene $g_1$, which associates with gene $g_2$. Besides that, the S-G-P also considers the mechanism of action (pathway $p_1$) underlying the chemical-gene interaction and gene-gene association

Fig. 6 The flowchart of the CGINet pipeline. a The framework of CGINet. The graph convolutional encoder takes the integrated multi-relational graph as input (the one-hot vectors for each node and the adjacency matrices) and returns a chemical embedding matrix and a gene embedding matrix. The tensor decomposition decoder uses these node embeddings to compute the probabilities of interactions between the chemicals and the candidate genes. b Graph convolutional encoder. We take the subgraph perspective as an example. Initial embeddings of chemicals $c_1$ and genes $g_1$ are learned with the binary association subgraph. For example, $c_1$ receives information from neighbor nodes, including chemical nodes ($c_2$, $c_3$, $c_4$) and pathways ($p_1$, $p_2$). The initial embeddings are then transferred to the multi-interaction subgraph for learning final embeddings. In the multi-interaction subgraph, the encoder aggregates information not only from the neighbor nodes across known edges but also from the new neighbors connected by latent links (shown in dotted line). For example, $c_1$ encodes neighborhood information from $g_1$, $g_2$, $g_3$ and $g_4$

Fig. 7 Tensor decomposition decoder. The chemical embedding matrix and the gene embedding matrix are learned from the graph convolutional encoder. Tensor $D$ is a set of the matrix $D_r$

Graph convolutional encoder
Much research has proved graph convolutional networks to be effective in node/graph representation learning [42,43]. The graph convolutional network usually extracts local substructure features for individual nodes by iteratively aggregating, transforming, and propagating information from neighbor nodes. A deeper graph convolutional network can integrate the normalized message from all neighbors up to k-hops away. Notably, 2-layer graph convolutional network models yield the best performance based upon empirical observation [44].
Herein, we propose an encoder equipped with 2-layer graph convolutional networks that takes the graph $G$ as input and produces a topology-preserving embedding $z_i$ for each node. We investigate two perspectives on encoding neighborhood information with the graph $G$: the total graph perspective and the subgraph perspective. The former views the graph as a whole, while the latter adopts a subgraph view in which initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for learning the final node embeddings.

Total graph perspective
A 2-layer graph convolutional network operates directly on the entire multi-relational graph $G$. In each layer, the GCN updates the embedding of each node by summing the information propagated from its neighbors across the different types of edges. Given the $k$-th hidden state $h_i^k$ of node $v_i$, where $v_i \in \{V_c \cup V_g \cup V_p\}$, the $(k+1)$-th hidden state $h_i^{k+1}$ of node $v_i$ is updated according to Eq. (1), where $h_i^k \in \mathbb{R}^{d_k}$ and $d_k$ denotes the embedding size of the $k$-th hidden layer. $r \in \bar{R} \cup \tilde{R}$ denotes one of the interaction types. $W_r^k$ is the trainable parameter matrix of interaction type $r$. $N_i^r$ is the neighbor set of node $v_i$ under interaction type $r$. $1/\sqrt{|N_i^r||N_j^r|}$ and $1/\sqrt{|N_i^r|}$ are normalization constants. $\sigma$ is a non-linear activation function such as ReLU. The node features are initialized as one-hot vectors and input to the first layer, denoted as $h_i^0 = x_i$. We stack two graph convolutional layers such that the final node embedding is computed as $z_i = h_i^K$ with $K = 2$.
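To make the layer update concrete, the following NumPy sketch implements one multi-relational graph convolution step built only from the ingredients named above: a sum over relation types, a relation-specific weight matrix, neighbor messages scaled by $1/\sqrt{|N_i^r||N_j^r|}$, a self-connection scaled by $1/\sqrt{|N_i^r|}$, and a ReLU. The exact placement of the weight matrix on the self-connection term is an assumption of this sketch, as are the function name `rgcn_layer`, the symmetric 0/1 adjacency matrices, and the toy sizes; it is not the authors' TensorFlow implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def rgcn_layer(h, adj_by_rel, w_by_rel):
    """One multi-relational graph convolution step (assumed form).

    h          : (n_nodes, d_in) hidden states of the current layer
    adj_by_rel : {relation: (n_nodes, n_nodes) symmetric 0/1 adjacency matrix}
    w_by_rel   : {relation: (d_in, d_out) trainable weight matrix}
    """
    n_nodes = h.shape[0]
    d_out = next(iter(w_by_rel.values())).shape[1]
    out = np.zeros((n_nodes, d_out))
    for rel, adj in adj_by_rel.items():
        deg = adj.sum(axis=1)
        # 1/sqrt(degree), set to 0 for nodes without neighbors under this relation
        inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1.0)), 0.0)
        # neighbor messages scaled by 1/sqrt(|N_i| * |N_j|)
        norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]
        # self-connection scaled by 1/sqrt(|N_i|); sharing W here is an assumption
        msg = norm_adj @ h + inv_sqrt[:, None] * h
        out += msg @ w_by_rel[rel]
    return relu(out)

# Toy graph: 4 nodes, two relation types, one-hot input features.
rng = np.random.default_rng(0)
n_nodes, d_hidden, d_out = 4, 5, 3
adj_by_rel = {
    "cc": np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=float),
    "cg": np.array([[0, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]], dtype=float),
}
w0 = {r: rng.normal(scale=0.1, size=(n_nodes, d_hidden)) for r in adj_by_rel}
w1 = {r: rng.normal(scale=0.1, size=(d_hidden, d_out)) for r in adj_by_rel}
h0 = np.eye(n_nodes)                       # one-hot node features
z = rgcn_layer(rgcn_layer(h0, adj_by_rel, w0), adj_by_rel, w1)
print(z.shape)
```

Stacking the layer twice lets each node embedding depend on neighbors up to two hops away, which matches the 2-layer setting used in the text.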

Subgraph perspective
Instead of taking the graph as a whole, we split the graph $G$ into two subgraphs: the binary association subgraph $\bar{G}$ (including the CC-graph, GG-graph, CP-graph, and GP-graph) and the multi-interaction subgraph $\tilde{G}$ (the CG-graph). We use two separate 2-layer graph convolutional networks to learn node embeddings in these two subgraphs.
In the binary association subgraph $\bar{G}$, chemical nodes only encode information from neighboring chemicals and pathways, while gene nodes receive messages from neighboring genes and pathways. The hidden state $h_i^k \in \mathbb{R}^{d_k}$ of each hidden layer in the first 2-layer graph convolutional network is updated as in Eq. (1). The only difference is that $r \in \bar{R}$. We assign the output node embedding as $z_i = h_i^K$ with $K = 2$. These embeddings are then transferred to the subgraph $\tilde{G}$ to initialize the corresponding chemical and gene features.
As the observations in the "Data observation" section suggest, we extract latent links to reconstruct the topological structures of nodes in the multi-interaction subgraph $\tilde{G}$. By searching over the entire graph $G$ with the substructure S-G-P, we screen out candidate latent links under each interaction type, denoted as $L^r = \{l_i^r\}_{i \in [N_r]}$, where $N_r$ is the number of candidate latent links under interaction type $r$. Let $N_i^r$ denote the number of substructures containing latent link $l_i^r$. A candidate latent link $l_i^r$ is accepted as a definite latent link if:

$$N_i^r \ge \max\left(2,\ \max\left(N_0^r, N_1^r, \ldots, N_{N_r}^r\right) \times \epsilon\right), \qquad (2)$$

where $\epsilon$ is the threshold coefficient.
We use the confirmed latent links to update the topological properties of each node $v_i$. The set of new neighbors of node $v_i$ under interaction type $r$ is denoted as $L_i^r$. Taking into account the information propagated across latent edges, the hidden layer of the second 2-layer graph convolutional network is given by Eq. (3), where $h_i^k \in \mathbb{R}^{d_k}$ and $d_k$ denotes the dimensionality of the $k$-th hidden layer. $r \in \tilde{R}$ denotes one of the interaction types. Note that $\mu_r \in [0, 1]$ is a trainable parameter, defined as the latent rate, that measures the contribution of latent links for interaction type $r$. The final node embedding is assigned as $z_i = h_i^K$, where $K = 2$ and $v_i \in \{V_c \cup V_g\}$.

Tensor decomposition decoder
Given a chemical $v_i$ and a gene $v_j$, the decoder returns the probability $P_r^{ij}$ of an edge $e_{ij} = (v_i, r, v_j)$, which represents how likely chemical $v_i$ results in an interaction of type $r$ with gene $v_j$. The decoder uses a tensor decomposition model called DEDICOM [45] to formulate chemical-gene interactions, as shown in Fig. 7.
Based on the node embeddings $z_i$ and $z_j$ learned by the encoder, the decoder computes a score $G(z_i, r, z_j)$ for the edge $e_{ij}$ and then applies a sigmoid function $\sigma$ to it:

$$P_r^{ij} = \sigma\left(G(z_i, r, z_j)\right), \qquad G(z_i, r, z_j) = z_i^{T} D_r R D_r z_j,$$

where $D_r$ is a local diagonal matrix giving weights to each dimension of the node embedding under interaction type $r$. $R$ is a global parameter matrix associated with all interaction types, which enables the model to share information across different interaction types. Note that the matrices $D_r$ and $R$ are both trainable parameters of shape $d_k \times d_k$. These two matrices are initialized using the method introduced in Glorot et al. [46].
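The decoder score can be sketched directly from the formula $z_i^T D_r R D_r z_j$. In this illustration the diagonal matrix $D_r$ is stored as a vector `d_r`, which is a storage choice made for the sketch rather than something stated in the text; the function names and toy values are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dedicom_score(z_i, z_j, d_r, R):
    """Compute z_i^T D_r R D_r z_j, with D_r = diag(d_r) stored as a vector."""
    return (z_i * d_r) @ R @ (d_r * z_j)

def interaction_probability(z_i, z_j, d_r, R):
    """Probability that the chemical with embedding z_i has interaction r with gene z_j."""
    return sigmoid(dedicom_score(z_i, z_j, d_r, R))

# Toy embeddings and parameters (illustrative values only).
z_i = np.array([1.0, 0.0])        # chemical embedding
z_j = np.array([0.0, 1.0])        # gene embedding
d_r = np.array([2.0, 3.0])        # diagonal of D_r for one interaction type
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])        # global matrix shared by all interaction types
score = dedicom_score(z_i, z_j, d_r, R)
prob = interaction_probability(z_i, z_j, d_r, R)
print(score, prob)
```

Because $R$ is shared across all interaction types while each $D_r$ only rescales embedding dimensions, interaction types with few training edges can still benefit from what $R$ learns from frequent types.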

Model training
We perform negative sampling during training, which greatly reduces the training time. We generate a negative sample $(v_i, r, v_n)$ by replacing the node $v_j$ of the known edge $(v_i, r, v_j)$ with a node $v_n$ chosen randomly according to the sampling distribution in Mikolov et al. [47], in which the probability of node $v_n$ is calculated from its degree $d(v_n)$. Given a set of chemical-gene pairs and their labels, we encourage the model to enlarge the margin $m$ by minimizing the hinge loss function [48] over the set of neural network parameters. $P_r^{in}$ denotes the probability of the negative sample $(v_i, r, v_n)$ associated with the known edge $(v_i, r, v_j)$. With the hinge loss, any case where the difference between $P_r^{ij}$ and $P_r^{in}$ is larger than the margin $m$ is not penalized.
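A minimal sketch of degree-based negative sampling and a margin-based hinge loss is given below. Several details are assumptions rather than facts from the text: the exponent 3/4 follows the negative-sampling distribution commonly associated with Mikolov et al., the loss is summed over pairs, and the margin value 0.1 is a placeholder (the actual setting would come from Table 1). All function names are hypothetical.

```python
import numpy as np

def sampling_distribution(degrees, power=0.75):
    """Probability of drawing each node, proportional to degree**power (assumed 3/4)."""
    weights = np.asarray(degrees, dtype=float) ** power
    return weights / weights.sum()

def sample_negative_genes(rng, degrees, n_samples):
    """Draw gene indices to replace the true gene of known edges."""
    probs = sampling_distribution(degrees)
    return rng.choice(len(degrees), size=n_samples, p=probs)

def hinge_loss(p_pos, p_neg, margin=0.1):
    """Sum of max(0, margin - (p_pos - p_neg)) over positive/negative pairs."""
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    return np.maximum(0.0, margin - (p_pos - p_neg)).sum()

# Toy usage with hypothetical gene degrees and predicted probabilities.
rng = np.random.default_rng(0)
gene_degrees = [1, 16, 81]
negatives = sample_negative_genes(rng, gene_degrees, n_samples=5)
loss = hinge_loss(p_pos=[0.9, 0.5], p_neg=[0.1, 0.5], margin=0.1)
print(sampling_distribution(gene_degrees))
print(negatives, loss)
```

In the toy call, the first pair is separated by more than the margin and contributes nothing, while the second pair (equal probabilities) contributes the full margin to the loss.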