 Methodology article
 Open access
CGINet: graph convolutional network-based model for identifying chemical-gene interaction in an integrated multi-relational graph
BMC Bioinformatics volume 21, Article number: 544 (2020)
Abstract
Background
Elucidating the interactions between chemicals and genes is of key relevance not only for discovering new drug leads in drug development but also for repositioning existing drugs to novel therapeutic targets. Recently, biological network-based approaches have proven effective in predicting chemical-gene interactions.
Results
We present CGINet, a graph convolutional network-based method for identifying chemical-gene interactions in an integrated multi-relational graph containing three types of nodes: chemicals, genes, and pathways. We investigate two perspectives on learning node embeddings. One views the graph as a whole; the other adopts a subgraph view in which initial node embeddings are learned from the binary association subgraphs and then transferred to the multi-interaction subgraph for more focused learning of higher-level target node representations. In addition, we reconstruct the topological structures of target nodes with the latent links captured by the designed substructures. CGINet is trained end to end: the encoder and the decoder are trained jointly on known chemical-gene interactions. We aim to predict unknown but potential associations between chemicals and genes as well as their interaction types.
Conclusions
We study three model implementations, CGINet-1/2/3, with various components and compare them with baseline approaches. The experimental results suggest that our models perform competitively in identifying chemical-gene interactions. Moreover, the subgraph perspective and the latent links both help to learn more informative node embeddings and lead to improved prediction.
Background
Drug discovery is a complex, lengthy, inefficient, and expensive process. The estimated average time needed to launch a new drug is around 10–15 years at an average cost of about $1.8 billion [1]. To expedite the drug development process, it is critical to screen as many potential drug candidates as possible at an early stage. Over 80% of FDA-approved drugs are small-molecule chemicals that act on single or multiple gene (or protein) targets, ultimately achieving curative effects [2, 3]. Elucidating the interactions between chemicals and genes, named chemical-gene interactions (CGIs), is therefore of key relevance not only for discovering new drug leads in drug development but also for repositioning existing drugs to novel therapeutic targets. With known CGIs, numerous studies have provided new insights into rapidly screening candidate chemicals for the treatment of corresponding diseases, such as HIV [4], HCV [5], and lung cancer [6]. Unfortunately, proven CGIs are available only in limited amounts. For example, the PubChem database contains more than 30 million chemicals, but few have confirmed gene targets [7]. This predicament drives the need for automatic and efficient methods that infer chemical-gene interactions as a preliminary step, rather than experimentally testing every possible chemical-gene pair, which is time-consuming and costly. According to the kind of data used, we roughly divide the computational methods for CGI prediction into three categories: biomedical literature-based, molecular structure-based, and biological network-based.
Biomedical literature-based approaches
A wealth of knowledge about chemical-gene interactions is scattered over the published biomedical literature, which makes querying the CGI information of interest inefficient. The challenge is to detect the closely associated chemicals and genes mentioned in unstructured text and to determine which type of interaction they share. Biomedical literature-based methods tackle the problem with hand-designed or deep-learning features enhanced by natural language processing (NLP) techniques [8,9,10]. In recent studies, multiple deep neural network (DNN) models, including convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), and attention-based DNN, have been applied to learn CGI classifiers [11,12,13]. These approaches feed the DNN models with low-dimensional pre-trained word embeddings without complicated feature engineering. Notably, attention-based DNN models exhibit competitive performance compared with other models and have the inherent ability to extract salient features for CGI identification as needed. In addition, some studies extend the language models with syntactic and semantic information, such as part of speech (POS), syntactic structure, dependency tree, and knowledge graph, for a better understanding of the context [8, 14, 15]. However, methods based on biomedical articles are limited in predicting unpublished and unknown CGIs.
Molecular structure-based approaches
Among these methods, molecular docking, which explores the predominant binding modes of two interacting molecules using known 3D structures, was studied first [16, 17]. It uses various scoring functions to predict the binding affinity of molecules. Its limitations are that it depends critically on the availability of high-quality 3D-structure data and generally consumes excessive computing resources. Follow-up studies focus on representing chemicals and genes by fingerprints as inputs to machine-learning models [7, 18, 19], such as logistic regression, k-nearest neighbor (KNN), and support vector machine (SVM). The fingerprint is the most commonly used descriptor of the substructures of a molecule. However, a fingerprint is defined as a binary vector in which each index indicates whether a given substructure exists in the molecule, making it quite sparse and not sufficiently informative for CGI prediction. Recent studies have paid more attention to end-to-end models that operate on the simplified molecular-input line-entry system (SMILES) string for chemicals and the structural property sequence (SPS) for genes to learn richer representations [2, 20,21,22]. The results demonstrate that models trained with these learned representations are more robust than those trained with traditional descriptors.
Biological network-based approaches
Compared with molecular structure-based approaches, biological network-based approaches combine the chemical space and the gene space into a consistent space by constructing a heterogeneous network/graph. Chemicals and genes are treated as nodes of the network. The links between two nodes denote their interactive relations, including intra-domain relations between two nodes of the same type, e.g. chemical-chemical interactions, and cross-domain relations between two nodes of different types, e.g. chemical-gene interactions [23]. Multiple large-scale databases, such as STITCH (Search Tool for InTeractions of Chemicals) [24] and CTD (Comparative Toxicogenomics Database) [25], have captured as much knowledge about chemical-gene interactions as possible from publicly accessible data. The emergence of these aggregated databases provides new opportunities for CGI prediction. Numerous studies have developed network-based inference models that integrate diverse CGI-related information from the heterogeneous network and automatically learn the features of individual nodes for predicting missing relations [26,27,28]. The biological network-based approach is well suited to extracting potential CGIs because it does not rely on descriptions of specific biological properties or on 3D-structure data of molecules.
Research on identifying chemical-gene interactions is still in its infancy, and there is much room for improvement in performance. In this manuscript, we present the CGINet model, which uses an encoder-decoder framework to formulate the CGI identification problem as a task of multi-relational link prediction between chemicals and genes in a heterogeneous network/graph containing three types of nodes: chemicals, genes, and pathways. CGINet employs the graph convolutional network (GCN) as an encoder that aggregates, transforms, and propagates neighborhood information over the graph. We investigate two perspectives on learning node embeddings. One views the graph as a whole; the other adopts a subgraph view in which initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for learning the final node embeddings. Lastly, the node embeddings are sent to the decoder, which uses a tensor decomposition model to formulate chemical-gene interactions. CGINet is trained end to end: the encoder and the decoder are trained jointly with known CGIs in a multi-relational graph.
We study three implementations of the CGINet model with various components and compare them with baseline approaches. The experimental results suggest that our models perform competitively in predicting chemical-gene interactions. The main contributions of our work are: (1) We present a graph convolutional network-based model to predict the missing links between chemicals and genes in a heterogeneous graph. Our model takes advantage of the information from latent links based on biological insights and outperforms the baseline models. (2) The model that adopts a subgraph perspective substantially reduces the training time and also improves performance. (3) Our model is capable of predicting novel chemical-gene interactions that do not appear in the original graph.
Results
Experimental settings
We construct a multi-relational graph containing 65 types of chemical-gene interaction. Every given chemical-gene pair is assigned none, one, or more interaction types. As in most graph-based approaches [26,27,28], we randomly split the CGI instances of each interaction type into training, validation, and test sets at an 8:1:1 ratio. The CGINet model is optimized with the Adam optimizer [29], and the parameters used in our models are summarized in Table 1. We measure the performance of each interaction type individually using the area under the receiver-operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the average precision for the top-k identifications (AP@k). To avoid overfitting, we perform cross-validation and initialize the trainable parameters with multiple random seeds. The experimental results are reported as average performance. We implement the CGINet model in Python using the TensorFlow package [30].
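As an illustration of the ranking metric, the following is a minimal sketch of one common definition of AP@k: precision is accumulated at each rank where a true interaction appears within the top k, then divided by the smaller of k and the number of true interactions. The function name `average_precision_at_k` and the choice of normalizer are assumptions; the text does not state which variant the evaluation uses.

```python
def average_precision_at_k(ranked_labels, k):
    """AP@k for a list of 0/1 labels already sorted by descending predicted score.

    Assumed definition: the sum of precision values at the ranks of the hits
    within the top k, divided by min(k, number of positives).
    """
    hits = 0
    score = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            score += hits / rank  # precision at this rank
    total_positives = sum(1 for label in ranked_labels if label)
    denom = min(k, total_positives)
    return score / denom if denom else 0.0


# Example: true interactions sit at ranks 1, 3 and 4 of the ranked candidates.
example_ap = average_precision_at_k([1, 0, 1, 1, 0], k=3)
```

With k = 20, the same function would give the AP@20 variant under this assumed definition.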
We study three model implementations, CGINet-1/2/3, with various components and compare them with baseline approaches (DeepWalk [31], Node2Vec [32], SVD [33], Laplacian [34], GCN [35]). Brief descriptions of these approaches are given as follows:
Baseline approaches
(1) Random walk-based embeddings. The DeepWalk model learns node embeddings by randomly capturing neighborhood information on the basis of the depth-first search method, while the Node2Vec model combines the depth-first search and the breadth-first search methods to aggregate proximal nodes. (2) Matrix factorization-based embeddings. The SVD and the Laplacian models both factorize the adjacency matrix of the graph to obtain the node embeddings. For both families, we use the learned node embeddings as input to train a logistic regression classifier for each interaction type. (3) Graph convolutional network-based methods. We employ a 2-layer GCN to learn node embeddings with the CG-graph or the total graph, named GCN-CG and GCN-Total, respectively.
CGINet-1/2/3 approaches
CGINet-1, CGINet-2, and CGINet-3 all adopt the subgraph view and learn the final node embeddings in two steps. In addition, CGINet-2 and CGINet-3 encode information across latent links. The latent rate \(\mu\) is a trainable parameter in the CGINet-2 model (\(\mu \in [0,1]\)), while it is fixed to 1 in CGINet-3 (\(\mu = 1\)).
Performance comparison of different thresholds
A threshold coefficient \(\lambda\) is designed in our model as a gatekeeper that controls when a candidate latent link is accepted as a definite latent link. We investigate how the performance of our model changes with different thresholds. As shown in Fig. 1, a larger threshold leads to fewer latent links. The overall performance of CGINet-2 and CGINet-3 increases as the threshold grows. Specifically, CGINet-2 performs best with \(\lambda =0.4\) and CGINet-3 with \(\lambda =0.5\). These results suggest that a stricter threshold makes the latent links more credible for updating the topological structure of the graph. We proceed by comparing the CGINet models with various components against the baseline models.
Comparison with baseline models
Table 2 gives the performance comparison of our models with the baseline methods. Matrix factorization-based approaches and random walk-based approaches both learn node embeddings and train relation classifiers in two separate stages. The former (SVD, Laplacian) perform better than the latter (DeepWalk, Node2Vec) on such a heterogeneous multi-relational graph, because random walk-based approaches depend excessively on the specific structure of the graph. In contrast, the CGINet models train the encoder and the decoder jointly. Most of our models outperform the baseline models; in particular, the CGINet-2 model achieves a 5.7% relative improvement in AUPRC over the best baseline (Laplacian).
Compared to GCN-CG, GCN-Total shows a clear performance degradation; in particular, its AP@20 drops to 57.1%. We hypothesize that GCN-Total is limited in focusing on the interactions of interest because the integrated multi-relational graph also contains non-target associations (e.g. chemical-pathway associations, gene-pathway associations). Based on this assumption, we investigate a subgraph view that learns target node representations in two steps in the CGINet-1 model. CGINet-1 outperforms GCN-Total by 7.8% (AUROC), 10.4% (AUPRC), and 19.9% (AP@20), indicating that more focused learning of node embeddings makes better use of the graph data. Furthermore, compared with GCN-CG, CGINet-1 yields a relative improvement of about 4% in AUPRC. This indicates that initial node embeddings pre-trained with the binary association subgraph provide useful knowledge for learning the final node embeddings.
A further comparison among our models (CGINet-1, CGINet-2, and CGINet-3) reveals that the models that aggregate information from new neighbor nodes across latent links perform better than the model that only captures labeled neighborhood information. Specifically, CGINet-2 and CGINet-3 achieve an increase of about 2% in AUPRC compared with CGINet-1. This is consistent with our findings in the “Data observation” section that updating the topological properties of nodes with latent links provides informative features for learning more effective node embeddings. CGINet-2 exhibits the best AUROC (92.7%) but falls behind CGINet-3 in AP@20, where it drops to 76.5%. Overall, the trainable latent rate enhances the classification power of the model at the cost of ranking ability. Consequently, the CGINet-3 model, which assumes an equal contribution of latent links for each interaction type, has the better overall performance.
The above analysis illustrates that our models, which adopt the subgraph view, can significantly improve performance. We also calculate the average training time per epoch for the GCN-based models, as shown in the last column of Table 2. Compared with GCN-Total, our models reduce the training time by at least 65% while achieving much better performance.
Comparison on interaction type-wise performance
As shown in Fig. 2, compared to CGINet-1, CGINet-3 achieves improved performance on over half of the interaction types (34 of 65 types; right side of Fig. 2) but degraded performance on the other types (left side of Fig. 2). Through a detailed investigation of the performance per interaction type, we find that encoding updated neighborhood information across latent links tends to help in predicting some specific interaction types regardless of the degree of action (e.g. cleavage, sumoylation, metabolic processing, and glucuronidation), but hurts in identifying some other types (e.g. secretion, transport, and reaction). Interestingly, metabolic processing is the parent interaction type of cleavage, sumoylation, and glucuronidation. This inspires us to optimize our models in later research by paying more attention to the underlying mechanism of the biological reaction.
We list the 15 best-performing interaction types of the CGINet-3 model in Table 3. It is also worth noting that even though some interaction types have extremely few known edges for training, the model can still predict them well, e.g. decreases^acetylation (147 edges), affects^chemical synthesis (181 edges), and decreases^cleavage (188 edges). We believe that a global decoder associated with all interaction types enables our model to share information across different types of interactions.
Discussion
For the random walk-based approaches, the chemical space and the gene space are combined into a consistent space. The node embeddings are learned in a homogeneous graph. In contrast, the essence of our model is to analyze the dependency between different semantic spaces in a heterogeneous graph. It allows us to integrate more diverse biomedical data into our model, such as the disease and the phenotype information. We can not only explore the relation between chemicals and genes but also discover more internal connections in the molecular and patient population data.
It is also worth noting that our model is capable of predicting novel chemical-gene interactions that do not appear in the original graph. With Eqs. (4) and (5), we can calculate the probability \({\mathcal{P}}_{r}^{ij}\) of unknown chemical-gene pairs \(({c}_{i},{g}_{j})\) under each interaction type \(r\). A higher probability indicates that chemical \({c}_{i}\) is more likely to interact with gene \({g}_{j}\). We can then turn to online public databases to see whether corresponding literature evidence can be retrieved. Table 4 provides some novel predictions with literature evidence.
Conclusions
In this paper, we present CGINet, a graph convolutional network-based method for predicting compound-gene interactions in an integrated multi-relational graph. CGINet adopts a subgraph view in which the initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for more focused learning of higher-level target node representations. The experimental results show that the CGINet models perform competitively compared with the baseline models. Moreover, learning node embeddings with latent links can lead to improved performance.
CGINet is a transductive learning method that is applied to a static graph. Specifically, we train the graph neural network with all known nodes and part of the edges (training edges) in the graph, producing a node embedding for each node. The graph neural network learns the node embedding from neighborhood information through the adjacency matrix (or Laplacian matrix). Adding new nodes to the graph therefore changes the adjacency matrix (or Laplacian matrix), and the model must be retrained. This inherent property makes the graph neural network poorly suited to dynamic graphs. In future work, we are interested in enhancing the capacity of our model to deal with dynamic graphs. Moreover, we will gather more diverse biomedical information (e.g. compound-disease associations, gene-disease associations, and pathway-disease associations) and construct a larger-scale bio-network for thoroughly analyzing the mechanism of action of the biological reactions. We aim to build a robust model that captures long-range dependencies between different molecules with better interpretability.
Methods
Integrated multi-relational graph
We construct a heterogeneous graph containing three types of nodes: chemicals, genes, and pathways, where pathways can shed light on the mechanism of action underlying CGIs. A total of five individual chemical/gene/pathway-related graphs, including four binary association subgraphs [chemical-chemical graph (CC-graph), gene–gene graph (GG-graph), chemical-pathway graph (CP-graph), and gene-pathway graph (GP-graph)] and one multi-interaction subgraph [chemical-gene graph (CG-graph)], are collected from multiple curated databases and used to construct an integrated multi-relational graph.
Binary association subgraphs
We extract the CC-graph from the STITCH database, which contains 17,705,818 chemical-chemical associations across 389,393 chemicals. For the GG-graph, we take 715,612 gene–gene associations between 19,081 genes compiled by Decagon [40]. We obtain the CP-graph and the GP-graph from the Comparative Toxicogenomics Database. They comprise 1,285,158 chemical-pathway associations and 135,809 gene-pathway associations among 10,034 chemicals, 11,588 genes, and 2,352 pathways.
Multi-interaction subgraph
A link in the multi-interaction graph represents the association between two nodes as well as their interaction type. We construct the CG-graph from 13,488 chemicals, 50,876 genes, and 1,935,152 chemical-gene interactions pulled from the Comparative Toxicogenomics Database. Each CGI has a degree (increases, decreases, or affects) and a type (e.g. activity, expression, and reaction). For example, “Chemical X decreases the activity of Gene Y” is denoted as the triple (chemical X, decreases^activity, gene Y).
Herein, we consider only the 65 types of interactions between chemicals and genes that each appear in at least 180 CGIs. In addition, the CC-graph and the GG-graph are both trimmed by deleting nodes not involved in the CP-graph, GP-graph, or CG-graph. The final integrated graph has 14,269 chemicals, 51,069 genes, and 2,363 pathways. These nodes are connected by a total of 4,653,387 associations/interactions. An example of the integrated multi-relational graph and the detailed statistics of the final graph are shown in Fig. 3 and Table 5, respectively.
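The frequency filter on interaction types can be sketched as follows. This is an illustrative snippet, not the authors' preprocessing code: the function name `filter_by_type_frequency` and the toy triples are hypothetical, and each CGI is assumed to be stored as a (chemical, degree^type, gene) triple as described above.

```python
from collections import Counter


def filter_by_type_frequency(triples, min_count=180):
    """Keep only triples whose interaction type occurs at least `min_count` times.

    `triples` is an iterable of (chemical, interaction_type, gene) tuples.
    Returns the filtered triples and the set of retained interaction types.
    """
    triples = list(triples)
    counts = Counter(rel for _, rel, _ in triples)
    kept_types = {rel for rel, n in counts.items() if n >= min_count}
    kept_triples = [t for t in triples if t[1] in kept_types]
    return kept_triples, kept_types


# Toy data with a lowered threshold so the effect is visible.
toy = [
    ("chemX", "decreases^activity", "geneY"),
    ("chemX", "decreases^activity", "geneZ"),
    ("chemW", "affects^transport", "geneY"),
]
kept, types = filter_by_type_frequency(toy, min_count=2)
```

On the toy data only the two decreases^activity triples survive; with the default threshold of 180 the same call would mirror the cutoff stated in the text.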
Data observation
The clustering result achieved in Parsons et al. [41] suggests that chemicals tend to cluster with genes that are related to each other. More specifically, if chemical \({c}_{1}\) interacts with gene \({g}_{1}\), and gene \({g}_{1}\) is genetically associated with gene \({g}_{2}\), then we can reasonably assume that chemical \({c}_{1}\) and gene \({g}_{2}\) also interact. In other words, there is a latent link connecting chemical \({c}_{1}\) and gene \({g}_{2}\). Based on this assumption, we examine two types of topological substructures, SG and SGP. Figure 4 gives examples of these two substructures.
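The inference rule in this paragraph can be sketched as a small graph search. The exact shapes of the SG and SGP substructures are given in Fig. 4 and are not reproduced here, so the snippet below only implements the rule as stated in the text (a known chemical-gene interaction plus a gene–gene association yields a candidate latent link) and counts how many such patterns support each candidate. The function name `candidate_latent_links` and the data layout are assumptions.

```python
from collections import defaultdict


def candidate_latent_links(cg_edges, gg_edges):
    """Propose latent chemical-gene links and count their supporting patterns.

    cg_edges: iterable of (chemical, interaction_type, gene) known interactions.
    gg_edges: iterable of (gene, gene) undirected associations.
    Returns {(chemical, interaction_type, gene): support_count} for pairs that
    are not already known interactions of that type.
    """
    gene_neighbors = defaultdict(set)
    for g1, g2 in gg_edges:
        gene_neighbors[g1].add(g2)
        gene_neighbors[g2].add(g1)

    known = set(cg_edges)
    support = defaultdict(int)
    for chem, rel, gene in known:
        for other in gene_neighbors.get(gene, ()):
            candidate = (chem, rel, other)
            if candidate not in known:
                support[candidate] += 1  # one more pattern backs this link
    return dict(support)


# c1 interacts with g1 and g3; both are associated with g2.
links = candidate_latent_links(
    [("c1", "r", "g1"), ("c1", "r", "g3")],
    [("g1", "g2"), ("g3", "g2")],
)
```

Here the single candidate (c1, r, g2) is supported by two patterns; a support count of this kind is the sort of quantity a threshold such as \(\lambda\) could act on.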
First, the substructures matching the SG and the SGP are extracted separately from the entire multi-relational graph. Second, we count the number of distinct CGIs that occur in the SG or the SGP. After that, we investigate the frequency distribution of interaction types and the proportion of CGIs involved in the SG or the SGP for each interaction type. We find that: (1) On average, more than 62% of individual CGIs are involved in the SG, and about 50% of individual CGIs are involved in the SGP, suggesting that capturing unknown but potential links to update the topological properties of chemicals and genes is important for learning more informative node embeddings. (2) The frequency of CGIs involved in the SG and in the SGP both decrease as the total number of CGIs in each interaction type group decreases (Fig. 5). The reason probably lies in the extreme imbalance of the data, where 20% of interaction types account for about 93% of CGIs (e.g. increases^expression, decreases^expression, and affects^cotreatment). Therefore, in the “Results” section we specifically investigate whether different contributions of latent links should be considered for each interaction type. These findings inform the development of the model in the following section.
Problem formulation
The CGI identification problem is formulated as a task of link prediction in the integrated multi-relational graph, which includes four binary association subgraphs and one multi-interaction subgraph. We denote the associated relation set as \(\overline{R}=\{r^{cc},r^{gg},r^{cp},r^{gp}\}\), and the interactive relation set as \(\tilde{R}=\{r_{i}^{cg}\}_{i\in [N^{cg}]}\), where \(N^{cg}\) is the number of interaction types. Given a set of chemicals \(V_{c}=\{v_{i}\}_{i\in [N^{c}]}\), a set of genes \(V_{g}=\{v_{i}\}_{i\in [N^{g}]}\), and a set of pathways \(V_{p}=\{v_{i}\}_{i\in [N^{p}]}\), where \(N^{c/g/p}\) is the number of chemicals/genes/pathways, the entire graph can be denoted as \(G=(V,E)\), where \(V=\{v_{i}\mid v_{i}\in V_{c}\cup V_{g}\cup V_{p}\}\) and \(E=\{(v_{i},r,v_{j})\mid r\in \overline{R}\cup \tilde{R}\}\). Using the graph \(G\), our goal is to calculate the probability that an edge \(e_{ij}=(v_{i},r,v_{j})\), with \(i\in [N^{c}]\) and \(j\in [N^{g}]\), carries an interaction type \(r\in \tilde{R}\), which indicates how likely chemical \(v_{i}\) is to have an interaction of type \(r\) with gene \(v_{j}\). To achieve that, we develop an end-to-end trainable model, CGINet (Fig. 6a), that has two main components: a graph convolutional encoder (Fig. 6b) and a tensor decomposition decoder (Fig. 7).
Graph convolutional encoder
Much research has shown graph convolutional networks to be effective in node/graph representation learning [42, 43]. A graph convolutional network usually extracts local substructure features for individual nodes by iteratively aggregating, transforming, and propagating information from neighbor nodes. A deeper graph convolutional network can integrate the normalized messages from all neighbors up to k hops away. Notably, 2-layer graph convolutional network models yield the best performance based upon empirical observation [44].
Herein, we propose an encoder equipped with 2-layer graph convolutional networks that takes the graph \(G\) as input and produces a topology-preserving embedding \({z}_{i}\) for each node. We investigate two perspectives on encoding neighborhood information with the graph \(G\): the total graph perspective and the subgraph perspective. The former views the graph as a whole, while the latter adopts a subgraph view in which initial node embeddings are learned with the binary association subgraphs and then transferred to the multi-interaction subgraph for learning the final node embeddings.
Total graph perspective
A 2-layer graph convolutional network operates directly on the entire multi-relational graph \(G\). In each layer, the GCN updates the embedding of each node by summing the neighborhood information propagated across the different types of edges. Given the \(k\)th hidden state \(h_{i}^{k}\) of node \(v_{i}\), where \(v_{i} \in \{ V_{c} \cup V_{g} \cup V_{p} \}\), the (\(k + 1)\)th hidden state \(h_{i}^{k + 1}\) of node \(v_{i}\) is updated as follows:
where \(h_{i}^{k} \in {\mathbb{R}}^{{d_{k} }}\), with \(d_{k}\) denoting the embedding size of the \(k\)th hidden layer. \(r \in \left\{ {\overline{R} \cup \tilde{R}} \right\}\) denotes one of the relation types. \(W_{r}^{k}\) is the trainable parameter matrix of relation type \(r\). \({\mathcal{N}}_{i}^{r}\) is the neighbor set of node \(v_{i}\) under relation type \(r\). \(1/\sqrt{\left| {\mathcal{N}}_{i}^{r} \right| \left| {\mathcal{N}}_{j}^{r} \right|}\) and \(1/\sqrt{\left| {\mathcal{N}}_{i}^{r} \right|}\) are normalization constants. \(\sigma\) is a nonlinear activation function such as \(ReLU\). The node features are initialized as one-hot vectors and input to the first layer, denoted as \(h_{i}^{0} = x_{i}\). We stack two graph convolutional layers such that the final node embedding is computed as \(z_{i} = h_{i}^{K}\) with \(K = 2\).
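To make the update concrete, the following is a minimal pure-Python sketch of one relational graph convolution layer built from the quantities defined above: messages from neighbors \(j \in {\mathcal{N}}_{i}^{r}\) are transformed by \(W_{r}^{k}\) and scaled by \(1/\sqrt{|{\mathcal{N}}_{i}^{r}||{\mathcal{N}}_{j}^{r}|}\), a self term is scaled by \(1/\sqrt{|{\mathcal{N}}_{i}^{r}|}\), everything is summed over relation types, and ReLU is applied. Two points are assumptions rather than statements from the text: the self term is also multiplied by \(W_{r}^{k}\) so that dimensions agree, and the helper names (`matvec`, `gcn_layer`) are hypothetical. It is a sketch, not the TensorFlow implementation.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gcn_layer(h, neighbors, weights):
    """One relational GCN layer.

    h:         {node: hidden vector}
    neighbors: {relation: {node: set of neighbor nodes}} (symmetric sets)
    weights:   {relation: weight matrix of shape d_out x d_in}
    """
    d_out = len(next(iter(weights.values())))
    out = {}
    for i, h_i in h.items():
        acc = [0.0] * d_out
        for rel, W in weights.items():
            n_i = neighbors.get(rel, {}).get(i, set())
            if not n_i:
                continue  # node i has no edges of this relation type
            for j in n_i:
                n_j = neighbors[rel][j]
                c_ij = 1.0 / math.sqrt(len(n_i) * len(n_j))
                acc = [a + c_ij * m for a, m in zip(acc, matvec(W, h[j]))]
            c_i = 1.0 / math.sqrt(len(n_i))
            # Assumption: the self term shares W so that dimensions match.
            acc = [a + c_i * m for a, m in zip(acc, matvec(W, h_i))]
        out[i] = [max(0.0, a) for a in acc]  # ReLU
    return out


# Two nodes joined by one relation, one-hot features, identity weights.
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
nbrs = {"r": {"a": {"b"}, "b": {"a"}}}
W = {"r": [[1.0, 0.0], [0.0, 1.0]]}
h1 = gcn_layer(h0, nbrs, W)
```

Calling `gcn_layer` twice, with a second set of weight matrices, would correspond to the two stacked layers with \(K = 2\).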
Subgraph perspective
Instead of taking the graph as a whole, we split the graph \(G\) into two subgraphs, the binary association subgraph \(\overline{G}\) (including the CC-graph, GG-graph, CP-graph, and GP-graph) and the multi-interaction subgraph \(\tilde{G}\) (the CG-graph). We use two separate 2-layer graph convolutional networks to learn node embeddings in these two subgraphs.
In the binary association subgraph \(\overline{G}\), chemical nodes only encode information from neighboring chemicals and pathways, while gene nodes receive messages from neighboring genes and pathways. The hidden state \(\overline{h}_{i}^{{\overline{k}}} \in {\mathbb{R}}^{{\overline{d}_{{\overline{k}}} }}\) of each hidden layer in the first 2-layer graph convolutional network is updated as in Eq. (1). The only difference is that \(r \in \overline{R}\). We assign the output node embedding as \(\overline{z}_{i} = \overline{h}_{i}^{{\overline{K}}}\) with \(\overline{K} = 2\). These embeddings are then transferred to the subgraph \(\tilde{G}\) to initialize the corresponding chemical and gene features, denoted as \(\tilde{x}_{i} = \overline{z}_{i}\), where \(v_{i} \in \{ V_{c} \cup V_{g} \}\).
As the observations in the “Data observation” section suggest, we extract latent links to reconstruct the topological structures of nodes in the multi-interaction subgraph \(\tilde{G}\). By searching over the entire graph \(G\) with the substructure SGP, we screen out candidate latent links under each interaction type, denoted as \(L_{r} = \left\{ {l_{i}^{r} } \right\}_{{i \in \left[ {N^{r} } \right]}}\), where \(N^{r}\) is the number of candidate latent links under interaction type \(r\). Let \(\hat{N}_{i}^{r}\) denote the number of substructures containing latent link \(l_{i}^{r}\). A candidate latent link \(l_{i}^{r}\) is accepted as a definite latent link if:
where \(\lambda\) is the threshold coefficient.
We use the confirmed latent links to update the topological properties of each node \(v_{i}\). The set of new neighbors of node \(v_{i}\) under interaction type \(r\) is denoted as \({\mathcal{L}}_{i}^{r}\). Taking into account the information propagated across latent edges, the hidden layer of the second 2-layer graph convolutional network is defined as follows:
where \(\tilde{h}_{i}^{{\tilde{k}}} \in {\mathbb{R}}^{{\tilde{d}_{{\tilde{k}}} }}\), with \(\tilde{d}_{{\tilde{k}}}\) denoting the dimensionality of the \(\tilde{k}\)th hidden layer. \(r \in \tilde{R}\) denotes one of the interaction types. Importantly, \(\mu^{r} \in \left[ {0,1} \right]\) is a trainable parameter, defined as the latent rate, that measures the contribution of latent links for interaction type \(r\). The final node embedding is assigned as \(z_{i} = \tilde{h}_{i}^{{\tilde{K}}}\), where \(\tilde{K} = 2\) and \(v_{i} \in \{ V_{c} \cup V_{g} \}\).
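The role of the latent rate can be illustrated with a simplified aggregation step for a single node and a single interaction type. This sketch leaves out the weight matrix and the activation, and it assumes that messages from latent neighbors in \({\mathcal{L}}_{i}^{r}\) are averaged and scaled by \(\mu^{r}\); the normalization of the latent term is an assumption, and the function name `aggregate_with_latent` is hypothetical.

```python
import math


def aggregate_with_latent(node, h, labeled, latent, mu):
    """Sum neighbor messages for `node` under one interaction type.

    h:       {node: hidden vector}
    labeled: {node: set of neighbors over known (labeled) edges}
    latent:  {node: set of new neighbors reached over latent links}
    mu:      latent rate in [0, 1]; 1.0 mimics the fixed setting of CGINet-3
    """
    acc = [0.0] * len(h[node])
    n_i = labeled.get(node, set())
    for j in n_i:
        n_j = labeled.get(j, set())
        c_ij = 1.0 / math.sqrt(len(n_i) * max(1, len(n_j)))
        acc = [a + c_ij * x for a, x in zip(acc, h[j])]
    l_i = latent.get(node, set())
    for j in l_i:
        # Assumption: latent messages are averaged, then scaled by mu.
        c_lat = mu / len(l_i)
        acc = [a + c_lat * x for a, x in zip(acc, h[j])]
    return acc


h = {"c": [0.0, 0.0], "g1": [1.0, 0.0], "g2": [0.0, 2.0]}
labeled = {"c": {"g1"}, "g1": {"c"}}
latent = {"c": {"g2"}}
no_latent = aggregate_with_latent("c", h, labeled, latent, mu=0.0)
full_latent = aggregate_with_latent("c", h, labeled, latent, mu=1.0)
```

With `mu=0.0` the latent neighbor g2 is ignored and the result reduces to the labeled-neighbor message; with `mu=1.0` it contributes fully, and intermediate values interpolate between the two.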
Tensor decomposition decoder
Given a chemical \(v_{i}\) and a gene \(v_{j}\), the decoder returns the probability \({\mathcal{P}}_{r}^{ij}\) of an edge \(e_{ij} = \left( {v_{i} ,r,v_{j} } \right)\), which represents how likely chemical \(v_{i}\) is to have an interaction of type \(r\) with gene \(v_{j}\). The decoder uses a tensor decomposition model called DEDICOM [45] to formulate chemical-gene interactions, as shown in Fig. 7.
Based on the node embeddings \(z_{i}\) and \(z_{j}\) learned by the encoder, the decoder computes a score \({\mathcal{G}}\left( {z_{i} ,r,z_{j} } \right)\) for the edge \(e_{ij}\) and then applies a sigmoid function \(\sigma\) to it as follows:
where \({\mathcal{D}}_{r}\) is a local diagonal matrix that weights each dimension of the node embedding under interaction type \(r\), and \({\mathcal{R}}\) is a global parameter matrix associated with all interaction types, which enables the model to share information across different interaction types. Note that the matrices \({\mathcal{D}}_{r}\) and \({\mathcal{R}}\) are both trainable parameters of shape \(d_{k} \times d_{k}\). Both matrices are initialized using the method introduced by Glorot et al. [46].
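The decoder can be sketched in a few lines of NumPy. The sketch assumes the usual DEDICOM bilinear form, \({\mathcal{G}}\left( {z_{i} ,r,z_{j} } \right) = z_{i}^{T} {\mathcal{D}}_{r} {\mathcal{R}} {\mathcal{D}}_{r} z_{j}\), which matches the roles of \({\mathcal{D}}_{r}\) and \({\mathcal{R}}\) described above but should be checked against the original equation. The function names are hypothetical, and the diagonal matrix \({\mathcal{D}}_{r}\) is stored as a vector `d_r` of its diagonal entries.

```python
import numpy as np


def glorot_uniform(rng, fan_in, fan_out):
    """Glorot (Xavier) uniform initialization for a (fan_in, fan_out) matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dedicom_score(z_i, z_j, d_r, R):
    """Assumed score z_i^T D_r R D_r z_j, with D_r = diag(d_r)."""
    # Multiplying by a diagonal matrix is an element-wise product.
    return (z_i * d_r) @ R @ (d_r * z_j)


def edge_probability(z_i, z_j, d_r, R):
    """Probability that the edge (v_i, r, v_j) exists: sigmoid of the score."""
    return sigmoid(dedicom_score(z_i, z_j, d_r, R))
```

In use, a single `R = glorot_uniform(rng, d, d)` would be shared by every interaction type, while each type \(r\) keeps its own vector `d_r`. This is how the global matrix lets interaction types with few known edges borrow information from the others.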
Model training
We perform negative sampling during training, which greatly reduces the training time. We generate a negative sample \(\left( {v_{i} ,r,v_{n} } \right)\) by replacing the node \(v_{j}\) of the known edge \(\left( {v_{i} ,r,v_{j} } \right)\) with a node \(v_{n}\), which is chosen randomly according to the sampling distribution in Mikolov et al. [47]. Specifically, the sampling probability of node \(v_{n}\) is calculated from its degree \(d\left( {v_{n} } \right)\) as follows:
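The sampling step can be sketched as follows. The sketch assumes the distribution of Mikolov et al. [47] in its usual form, where a node is drawn with probability proportional to its degree raised to the power 3/4. The function names are hypothetical, and the sketch does not check whether the corrupted triple happens to be a known edge.

```python
import numpy as np


def sampling_distribution(degrees, power=0.75):
    """Probability of each node, proportional to degree ** power (assumed 3/4)."""
    weights = np.asarray(degrees, dtype=float) ** power
    return weights / weights.sum()


def corrupt_edge(edge, candidate_nodes, probs, rng):
    """Replace the tail node v_j of a known edge (v_i, r, v_j) with a sampled v_n."""
    v_i, r, _ = edge
    idx = rng.choice(len(candidate_nodes), p=probs)
    return (v_i, r, candidate_nodes[idx])
```

For example, with candidate genes `["g1", "g2"]` of degrees 1 and 16, `sampling_distribution([1, 16])` assigns them probabilities 1/9 and 8/9. Raising the degree to a power below 1 flattens the distribution, so high-degree nodes are sampled less often than their raw degree would suggest.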
Given a set of chemical-gene pairs and their labels, we encourage the model to enlarge the margin \(m\) by minimizing the hinge loss function [48]:
where \({\Theta }\) is the set of neural network parameters and \({\mathcal{P}}_{r}^{in}\) denotes the probability of the negative sample \(\left( {v_{i} ,r,v_{n} } \right)\) associated with the known edge \(\left( {v_{i} ,r,v_{j} } \right)\). With the hinge loss, any case where the difference between the two probabilities is larger than the margin \(m\) is not penalized.
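A small sketch of this objective is given below. It assumes the loss sums \(\max \left( {0, m - \left( {{\mathcal{P}}_{r}^{ij} - {\mathcal{P}}_{r}^{in} } \right)} \right)\) over pairs of a known edge and its negative sample, which is consistent with the description above. The function name and the default margin value are hypothetical.

```python
import numpy as np


def hinge_loss(p_pos, p_neg, margin=0.1):
    """Sum of max(0, margin - (p_pos - p_neg)) over positive/negative pairs.

    p_pos : probabilities of known edges (v_i, r, v_j)
    p_neg : probabilities of the matching negative samples (v_i, r, v_n)
    margin: the margin m (the default here is an arbitrary placeholder)
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    return float(np.maximum(0.0, margin - (p_pos - p_neg)).sum())
```

With `margin=0.5`, a pair scored 0.9 against 0.1 contributes nothing because the gap of 0.8 already exceeds the margin, whereas a pair scored 0.5 against 0.6 contributes 0.6.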
Availability of data and materials
The code files are available at: https://github.com/WebyGit/CGINet.
Abbreviations
CGIs: Chemical-gene interactions
NLP: Natural language processing
DNN: Deep neural network
CNN: Convolutional neural network
RNN: Recurrent neural network
LSTM: Long short-term memory network
POS: Part of speech
KNN: K-nearest neighbor
SVM: Support vector machine
SMILES: Simplified molecular-input line-entry system
SPS: Structural property sequence
STITCH: Search Tool for InTeractions of Chemicals
CTD: Comparative Toxicogenomics Database
GCN: Graph convolutional network
AUROC: Area under the receiver-operating characteristic
AUPRC: Area under the precision-recall curve
AP@k: Average precision for the top-k identifications
References
Paul SM, Mytelka DS, Dunwiddie CT, et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov. 2010;9(3):203–14.
Karimi M, Wu D, Wang Z, et al. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35(18):3329–38.
Shi Y, Zhang X, Liao X, et al. Protein-chemical interaction prediction via kernelized sparse learning SVM. Biocomputing. 2013;2013:41–52.
Li BQ, Niu B, Chen L, et al. Identifying chemicals with potential therapy of HIV based on protein-protein and protein-chemical interaction network. PLoS ONE. 2013;8(6):e65207.
Chen L, Lu J, Huang T, et al. Finding candidate drugs for hepatitis C based on chemical-chemical and chemical-protein interactions. PLoS ONE. 2014;9(9):e107767.
Lu J, Chen L, Yin J, et al. Identification of new candidate drugs for lung cancer using chemical–chemical interactions, chemical–protein interactions and a K-means clustering algorithm. J Biomol Struct Dyn. 2016;34(4):906–17.
Cheng Z, Zhou S, Wang Y, et al. Effectively identifying compound-protein interactions by learning from positive and unlabeled examples. IEEE/ACM Trans Comput Biol Bioinform. 2016;15(6):1832–43.
Lung PY, He Z, Zhao T, et al. Extracting chemical–protein interactions from literature using sentence structure analysis and feature engineering. Database. 2019;2019(1):8.
Peng Y, Rios A, Kavuluru R, et al. Extracting chemical–protein relations with ensembles of SVM and deep learning models. Database. 2018;2018(1):9.
Sun C, Yang Z, Wang L, et al. Attention guided capsule networks for chemical-protein interaction extraction. J Biomed Inform. 2020;103:103392.
Lu H, Li L, He X, et al. Extracting chemical-protein interactions from biomedical literature via granular attention based recurrent neural networks. Comput Methods Programs Biomed. 2019;176:61–8.
Corbett P, Boyle J. Improving the learning of chemical-protein interactions from literature using transfer learning and specialized word embeddings. Database. 2018;2018(1):10.
Liu S, Shen F, Komandur Elayavilli R, et al. Extracting chemical–protein relations using attention-based neural networks. Database. 2018;2018(1):12.
Sun C, Yang Z, Su L, et al. Chemical-protein interaction extraction via Gaussian probability distribution and external biomedical knowledge. Bioinformatics (Oxford, England). 2020;36:4323–30.
Sun C, Yang Z, Luo L, et al. A deep learning approach with deep contextualized word representations for chemical–protein interaction extraction from biomedical literature. IEEE Access. 2019;7:151034–46.
Donald BR. Algorithms in structural molecular biology. Cambridge: MIT Press; 2011.
Morris GM, Huey R, Lindstrom W, et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem. 2009;30(16):2785–91.
Tabei Y, Yamanishi Y. Scalable prediction of compound-protein interactions using minwise hashing. BMC Syst Biol. 2013;7(S6):S3.
Fang J, Li Y, Liu R, et al. Discovery of multitarget-directed ligands against Alzheimer’s disease through systematic prediction of chemical–protein interactions. J Chem Inf Model. 2015;55(1):149–64.
Lee I, Keum J, Nam H. DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol. 2019;15(6):e1007129.
Monteiro NRC, Ribeiro B, Arrais JP. Deep neural network architecture for drug-target interaction prediction. In: International conference on artificial neural networks. Springer, Cham (2019), p. 804–809
Li S, Wan F, Shu H, et al. MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Syst. 2020;10(4):308–322.e11.
Lee B, Zhang S, Poleksic A, et al. Heterogeneous multilayered network model for omics data integration and analysis. Front Genet. 2020;10:1381.
Kuhn M, Szklarczyk D, Franceschini A, et al. STITCH 3: zooming in on protein–chemical interactions. Nucleic Acids Res. 2012;40(D1):D876–80.
Davis AP, Grondin CJ, Johnson RJ, et al. The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 2019;47(D1):D948–54.
Luo Y, Zhao X, Zhou J, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun. 2017;8(1):1–13.
Wu Z, Li W, Liu G, et al. Network-based methods for prediction of drug-target interactions. Front Pharmacol. 2018;9:1134.
Wan F, Hong L, Xiao A, et al. NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug–target interactions. Bioinformatics. 2019;35(1):104–11.
Kingma DP, Ba J. Adam: a method for stochastic optimization; 2014. arXiv:1412.6980
Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16) (2016), p. 265–283
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (2014), p. 701–710
Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (2016), p. 855–864
Golub GH, Reinsch C. Singular value decomposition and least squares solutions. In: Linear algebra (Springer, Berlin 1971), p. 134–151
Cai D, He X, Han J, et al. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans Pattern Anal Mach Intell. 2010;33(8):1548–60.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks (2016). arXiv:1609.02907
Jiang K, Li K, Qin F, et al. Assessment of a novel β2-adrenoceptor agonist, trantinterol, for interference with human liver cytochrome P450 enzymes activities. Toxicol In Vitro. 2011;25(5):1033–8.
Slavov S, Stoyanova-Slavova I, Li S, et al. Why are most phospholipidosis inducers also hERG blockers? Arch Toxicol. 2017;91(12):3885–95.
Abe H, Saito F, Tanaka T, et al. Developmental cuprizone exposure impairs oligodendrocyte lineages differentially in cortical and white matter tissues and suppresses glutamatergic neurogenesis signals and synaptic plasticity in the hippocampal dentate gyrus of rats. Toxicol Appl Pharmacol. 2016;290:10–20.
Liang S, Liang S, Yin N, et al. Toxicogenomic analyses of the effects of BDE-47/209, TBBPA/S and TCBPA on early neural development with a human embryonic stem cell in vitro differentiation system. Toxicol Appl Pharmacol. 2019;379:114685.
Zitnik M, Agrawal M, Leskovec J. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics. 2018;34(13):i457–66.
Parsons AB, Brost RL, Ding H, et al. Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nat Biotechnol. 2004;22(1):62–9.
Sun M, Zhao S, Gilvary C, et al. Graph convolutional networks for computational drug development and discovery. Brief Bioinform. 2020;21(3):919–35.
Harada S, Akita H, Tsubaki M, et al. Dual graph convolutional neural network for predicting chemical networks. BMC Bioinform. 2020;21:1–13.
Xu K, Li C, Tian Y, et al. Representation learning on graphs with jumping knowledge networks (2018). arXiv:1806.03536
Papalexakis EE, Faloutsos C, Sidiropoulos ND. Tensors for data mining and data fusion: models, applications, and scalable algorithms. ACM Trans Intell Syst Technol (TIST). 2016;8(2):1–44.
Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics (2010), p. 249–256
Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems (2013), p. 3111–3119
Srebro N, Rennie J, Jaakkola TS. Maximum-margin matrix factorization. In: Advances in neural information processing systems (2005), p. 1329–1336
Acknowledgements
Not applicable.
Funding
This work is jointly funded by the National Science Foundation of China (U1811462), the National Key R&D project by Ministry of Science and Technology of China (2018YFB1003203), and the open fund from the State Key Laboratory of High Performance Computing (No. 20190111). The funder CW took part in the formulation and development of methodology, and provided financial support for this study.
Author information
Contributions
WW and XY developed the algorithms and drafted the manuscript; they developed the codes, prepared the datasets for testing, drafted the discussion and revised the whole manuscript together with CW and CY. All authors have read and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
No ethics approval and consent were required for the study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, W., Yang, X., Wu, C. et al. CGINet: graph convolutional network-based model for identifying chemical-gene interaction in an integrated multi-relational graph. BMC Bioinformatics 21, 544 (2020). https://doi.org/10.1186/s12859020038993
DOI: https://doi.org/10.1186/s12859020038993