HGDTI: predicting drug–target interaction by using information aggregation based on heterogeneous graph neural network

Background In research on new drug discovery, the traditional wet experiment has a long period. Predicting drug–target interaction (DTI) in silico can greatly narrow the scope of search of candidate medications. Excellent algorithm model may be more effective in revealing the potential connection between drug and target in the bioinformatics network composed of drugs, proteins and other related data. Results In this work, we have developed a heterogeneous graph neural network model, named as HGDTI, which includes a learning phase of network node embedding and a training phase of DTI classification. This method first obtains the molecular fingerprint information of drugs and the pseudo amino acid composition information of proteins, then extracts the initial features of nodes through Bi-LSTM, and uses the attention mechanism to aggregate heterogeneous neighbors. In several comparative experiments, the overall performance of HGDTI significantly outperforms other state-of-the-art DTI prediction models, and the negative sampling technology is employed to further optimize the prediction power of model. In addition, we have proved the robustness of HGDTI through heterogeneous network content reduction tests, and proved the rationality of HGDTI through other comparative experiments. These results indicate that HGDTI can utilize heterogeneous information to capture the embedding of drugs and targets, and provide assistance for drug development. Conclusions The HGDTI based on heterogeneous graph neural network model, can utilize heterogeneous information to capture the embedding of drugs and targets, and provide assistance for drug development. For the convenience of related researchers, a user-friendly web-server has been established at http://bioinfo.jcu.edu.cn/hgdti.

node features may reduce prediction precision. Besides, it adversely affects the prediction result when NeoDTI fuses neighbor features and ignores the importance of each neighbor. Recently, the theory of graph neural network (GNN) [26] has matured, and the algorithm framework has gradually enriched, including GCN (Graph Convolution Networks), GAT (Graph Attention Networks) [27], GAE (Graph Autoencoders) [28]. Zhang et al. [29] proposed a heterogeneous graph neural network (HetGNN), which applies a series of aggregation operations to heterogeneous neighbors to obtain the ultimate node embedding. This inspired us to build our own model for discovering new DTIs.
In this paper, we present HGDTI model, a heterogeneous graph neural network for predicting DTI. Firstly, in the pre-processing step, we sample negative pairs from unknown DTI pairs by employing negative sampling technology. Then, HGDTI uses LSTM to abstract content of the node (e.g. drug, protein, disease, and side-effect), and extracts the final embedding of drugs and proteins by aggregating the contents of heterogeneous neighbors. Finally, the obtained drug and protein embeddings are used to predict DTI through a fully connected neural network. The entire learning and prediction process is an end-to-end workflow. Hence, it is possible to obtain the feature representation of drugs and targets closest to the DTI network. Through comprehensive tests, we compare the performance of DTI prediction between HGDTI and other state-of-theart predictors. In addition, the robustness and extendability of HGDTI are inspected by testing partial heterogeneous networks. Overall, HGDTI can integrate more heterogeneous data sources to provide more accurate results for DTI prediction, which may also provide a better solution for drug discovery and repositioning.

DTI problem formulation
In this work, the dataset is a heterogeneous graph composed of various nodes and edges. Nodes include drugs, proteins, diseases, and side effects. Edges include interactions, similarities, and associations. Our model learns embedded representations of drugs and proteins from this graph to predict DTIs. Next, the definition of heterogeneous graph is given.
Definition HG (Heterogeneous Graph). HG is defined as an undirected graph G = (V , E, O V , R E ) , where V is the node set, E is the edge set, the object type of each node v ∈ V belongs to the object type set O V , the relation type of each edge e ∈ E belongs to the relation type set R E . Besides, we define that C(v) ∈ R |V |×dim (dim: feature dimension) maps the initial feature set of nodes, F (v) ∈ R |V |×dim indicates final embeddings.
The node type set O V includes drug, target, side-effect and disease. The link type set R E is composed of drug-similarity-drug, drug-interaction-drug, protein-similarity-protein, drug-interaction-protein, drug-association-disease, etc., total 8 types (as shown in Fig. 1, also available See "Datasets" section). It is noted that all nodes are connected via interaction, similarity, or association edges with non-negative weight W e . Among that, interaction edge or association edge with value 1. In addition, the edge weight between two "unrelated" nodes is set to 0, such as unknown DTIs. Besides, there are two edges connected between two nodes simultaneously. For example, two drugs are connected through the drug-similarity-drug edge and drug-interaction-drug edge.

Embedding learning
In the graph network, the embedding learning model is to use the topology structure and the content information of the node in the network to obtain the final representation of the node. For example, DeepWalK [30], node2vec [31] and metapath2vec [32] employ random walk strategies to get the context sequence of the node in the network and learn node embedding with the help of word2vec [33]. struc2vec [34] leveraging local network structure information to differentiate node representation. GCN [26], the graph neural network version of CNNs, aggregates local (i.e. adjacency) context information of the node through a series of convolution operations. Different from the random walk strategy and simple convolution operation in the above methods, HGDTI only considers the first-order relationship (i.e. direct relationship) between nodes and convolves the information of adjacent neighbors. Moreover, in order to distinguish the importance of different types of neighbors, different weights are set for different types of neighbors during the aggregation process.

Pre-processing
In the actual training scenario, the number of known DTIs is much lower than unknowns. Such an extremely unbalanced dataset brings incredible difficulty to DTI network prediction. A solution is to employ random sampling to construct negative samples from unknown DTIs. Nevertheless, this way may reduce the accuracy of prediction and treat unknown drug-target pairs that exist possible interactions as non-interactions. A previous research by Liu et.al. [35] demonstrated the correctness of negative samples sampling method directly affected the prediction performance. Recently, Eslami et.al. [15] also utilized a similar method to preprocess the negative sample dataset and obtained remarkable experimental results. Similarly, we screen out reliable negative samples. The screening basis is that drugs that are not similar to or do not interact with all drugs corresponding to the target in known DTIs are unlikely to interact with the target and vice versa. Firstly, we denote the drug set D and the target set T, sort out the target list T d i (d i ∈ D) corresponding to each drug d i and the drug list D t j t j ∈ T corresponding to each protein t j from known DTIs, respectively. Secondly, give drug matrix A ∈ R |D|×|D| representing DDS matrix (i.e. drug-drug chemistry similarity matrix), and target matrix B ∈ R |T |×|T | representing PPS matrix (i.e. protein-protein sequence similarity matrix). Then, define reliable score s ij of the drug-target pair d i − t j in unknown DTIs. Define s DT ij = t k ∈T d i B t j t k , that sum up the similarity between the target list T d i that interact with drug d i and target t j . Similarly, define s TD ji = d k ∈D t j A d i d k , which sums up the similarity between the drug list D t j that interact with target t j and drug d i . Finally, a reliable score s ij between drug d i and protein t j is computed as: The negative candidate pairs are arranged in descending order according to the reliable score calculated by the above formula, and the high score is selected as the reliable negative DTIs. Sample a certain number of unknown DTIs as negatives and known DTIs as positives to form the complete data set for subsequent model training and testing.

Representing drug molecules with the 2D molecular fingerprint
HGDTI leverages the molecular fingerprint approach to extract the initial feature of the drug, which is frequently employed in drug-related prediction problems [36][37][38][39]. Molecular fingerprint is a method of binary coding of molecular structure to describe the presence or absence of particular substructures. Xiao et.al. [37] has given a crystal clear description of how to obtain the molecular fingerprint of the drug compound, and hence there is no need to repeat here. It is noted that we download the SMILES file of the drug from https:// go. drugb ank. com/. Drug molecular fingerprint C d is represented as a 256-digit hexadecimal string. In particular, the optimal dimension dim of drug feature C d in HGDTI is 128 (See "Hyperparameter Selection" section). Therefore, the dimension of C d needs to be reduced. Generally, the feature size reduction methods include embedding and fully connection. Here the average approach is adopted. Formally, the content feature of drug d is computed as follows: where C d [0 : 127] and C d [128 : 255] stand for the pre-128 bits and the post-128 bits of C d respectively.

Representing protein sequences with pseudo amino acid composition
Pseudo amino acid composition(PseAAC) [40] can capture the amino acid composition information of protein sequence and preserve the sequence-order information. Above all, there are ten kinds of physical and chemical properties representing protein properties [37] to convert protein sequences into real strings. HGDTI chooses hydrophobicity, hydrophilicity and side-chain mass as three types of amino acid properties, and the dimension of protein feature vector C t is set to 64. For the specific calculation method, refer to PseAAC or visit https:// ifeat ure. erc. monash. edu/. Finally, we elevate the optimal (1) dimension of protein feature C t to 128, and the duplicate concatenation method is adopted. Thus the content feature of target t is formulated as: The operator denotes concatenation.
Steps (1)(2)(3) are to learn the node embeddings that encode both heterogeneous neighbors and itself characteristic contents.
Step (4) is a deep neural network classifier, which is used to predict DTIs by training the node embedding to obtain a 0-1 threshold. Next, we will introduce the algorithm formula for each step in detail. The whole process is illustrated in Fig. 2.
Step 1: Node Features Encoding. We have defined the initial features of nodes as C(v) , where the drug feature C d is extracted from the molecular fingerprint, the protein feature C t is extracted from PseAAC, the disease and side-effect features are represented by parameterized 0-1 standardized stochastic vector [25] to learn the optimal representation and speed up convergence. In this step, we define a submodule based on bi-directional LSTM (Bi-LSTM) [41] to capture "deep" feature interactions and obtain more abstract nonlinear expressions. The feature encoding for node v is defined as: where f 1 (v) ∈ R dim×1 (dim: feature dimension), the operator ⊕ denotes concatenation. Bi-LSTM block treats each one-dimensional input (vector) as a sentence with only one word ( 1 × dim tensor). Overall, the above formula uses Bi-LSTM to extract the general content embedding of v, as illustrated in Fig. 2a. Note that single feature C(v) can flexibly extend the model by adding other features (e.g. the physical and chemical properties of drugs [42], the PSSM profile of proteins [43]) for weighted average. In particular, four Bi-LSTM models are utilized to extract the content of different types of nodes respectively.
Step 2: Homogeneous Neighbors Aggregation. In this step, we design a submodule that aggregates heterogeneous adjacent node features. N r (v) = {u, u ∈ V , u � = v, r ∈ R E } denotes neighbor set that links to node v via edges of type r. Then, we employ an aggregated function G r to fuse features of u ∈ N r (v) . G r is a weighted summation that is not alike from neighbors aggregation approach of HetGNN [29], which treats all edges as equal. Formally, the aggregated embedding of N r (v) is defined as: where G r ∈ R dim×1 (dim: feature dimension), f 1 (u) is feature encoding of node u which is calculated by step (1), W e is a non-negative weight which represents a score of edge e. M r (v) = u∈N r (v),e=(v,u,r) W e stands for a normalization term. To be more specific, r-type aggregated embedding for node v is summed by same type neighbors feature to 23:126 multiply by ratio which is the normalized weight (e.g. W e M r (v) ) with respect to edges of type r.
Step 3: Heterogeneous Neighbors Aggregation. Continue to the previous step, we have got the aggregated embedding G r (v) with respect to edge-type r for node v. Taking into account that heterogeneous nodes have different degrees of impact on the final embeddings, we employ the attention mechanism [27] to incorporate the aggregated Fig. 2 a Encode each node feature via BiLSTM. b Aggregate neighbors to obtain drug and protein embedding, predict label via a two-layers neural network, finally optimizes the model via a cross-entropy loss embedding G r (v) with the initial feature C(v) of node v. Formally, the final embedding of node v is formulated as follow: where F (v) ∈ R |V |×dim ( |V | : node size, dim: feature dimension), α * (e.g. α v , α r ) indicates influence level for the final embeddings. Then, we define ϕ(v) that stands for C(v) and LeakyReLU denotes a leaky version of a Rectified Linear Unit, u ∈ R 2dim×1 is the attention parameter.
Our task is to predict the drug-target interaction. In the final prediction step, only the final embeddings of drug and target are involved. Therefore, node v in steps(2-3) refers to drugs and targets.
Step 4: DTI Classification. To determine whether there is an interaction between the drug-target pair, we employ a fully connected neural network to train the drug embedding F d (u) and the protein embedding F t (v) and predict DTIs. Thus, the predict probability function O is defined as follow: where FC 1 and FC 2 form a two-layer fully connected neural network that performs a linear transformation on embeddings, ReLU (Rectified Linear Unit) indicates nonlinearity capability of the model. The operator ⊕ denotes concatenation between the drug embedding and the protein embedding to obtain 2 × dim dimension embedding, which is the input of first layer FC 1 . Specifically, FC 1 has dim/2 neurons which are connected to each dimension of the input embedding, FC 2 that the final output layer contains only one neuron corresponding to output result which is fully connected to the previous layer, sigmoid stands for a nonlinear activation function that projects from the result of a final layer onto DTI probability. Steps (2)(3)(4) are shown in Fig. 2b.
At last, we adopt cross-entropy loss function that calculates the difference between DTI probability and drug-target pair label.
In general, all the above steps can be trained through an end-to-end manner by performing Adam optimizer [44] and 0.001 learning rate to minimize the final loss function and update the model parameters. We repeat the training iterations until the change between two consecutive iterations is less than the threshold. The entire framework is implemented on the PyTorch platform and GPU hardware.

Datasets
The datasets are collected from previous research [24], include 4 types of nodes and 8 types of edges. Specifically, 708 drugs, 1,923 known DTIs as well as drugdrug interaction network have been extracted from DrugBank (Version 3.0) [45]. 1,512 proteins and protein-protein interaction network have been extracted from the HPRD database (Release 9) [46]. 5,603 diseases, drug-disease association and protein-disease association networks have been extracted from the Comparative Toxicogenomics Database [47]. 4,192 side-effects and drug-side-effect association network have been extracted from the SIDER database [48]. In addition, 364 sideeffects and 161 diseases are isolated. Besides, we adopt two similarity information, a drug-structure similarity network (i.e. a pair-wise chemical structure similarity network measured by the dice similarities of the Morgan fingerprints with radius 2, which have been computed by RDKit [49]), and a protein sequence similarity network (which have been obtained based on the pair-wise Smith-Waterman score [50]). The datasets have been utilized in previous researches [15,25]. As shown in the statistics in Table 1, tests a-f same as in NeoDTI [25] corresponds to Figs. 4 and 5.

Reliable negatives
In the original dataset, the vast majority of DTIs are unknown, including potential DTIs and non-DTIs. Unlike the previous model which treats all unknown DTI pairs as negative samples, we consider selecting the "correct" unknown DTI pairs as negative samples as much as possible. We employ negative sampling technique (See "Preprocessing" section) to calculate reliable scores between drugs and targets, and divide reliable negative samples according to the distribution of reliable scores of drug-target pairs (Fig. 3). As the figure, the reliable scores of unknown DTIs are mainly concentrated around 0 score and 1 score. Combined with specific numerical analysis, we choose DTI with a reliable value greater than 0.1 as a negative sample, which is equivalent to nearly half of the unknown in benchmark (Fig. 3a), 30% in non-unique and 80% in unique (Fig. 3b).

HGDTI yields significant capability for DTIs prediction
For the sake of comparing HGDTI with the previous state-of-the-art DTI prediction methods, we use the same dataset and the 10-fold cross-validation method. To mimic this scenario that only a minimal number of drug-target pairs are known DTIs in the practical situation, we sample all positive samples (known DTIs) and negative samples, which are selected based on the method explained in "Reliable negatives" section, in which negative samples are 10 times that of positive samples. During the experiment, the dataset will be cross-cut by hierarchical sampling to ensure that the proportions of various samples in the training set and test set are the same as the original dataset. The dataset is divided into 10 non-overlapping subsets according to the ratio (i.e. 1:10) of positive and negative samples in the original data set, 9 subsets are used as the training set and the remaining 1 subset is used as the test set. Like other predictive methods, we employed the Area Under Receiver Operating Characteristic (AUROC) curve and Area Under Precision-Recall (AUPR) curve to evaluate prediction performance for all methods. In general, ROC curves present the trend between true positive rate (TPR) and false positive rate (FPR), and PR curves reveal the trend between precision and recall using several classification thresholds. AUPR is more sensitive than AUROC for extremely skewed datasets. Therefore, the predictive ability of model can be better explained in such a scenario. Since random sampling will cause jitter in the prediction results, we randomly select 10 sets of samples through 10 fixed second-level random seeds generated from a first-level random seed "10". The second-level random seeds are shown in Table 2. The final result is summarized over 10 trials and expressed as mean ± standard deviation. We compare the performance of HGDTI with six predictive models, including NeoDTI [25], DTINet [24], MSCMF [13], NetLapRLS [12] and BLMNII [11]. The result of the comparison shows that HGDTI remarkably outperforms other models, with 11.1% higher AUPR and 4.5% higher AUROC than the second-best method (Figs. 4a, 5a). DTINet generates low-dimensional features representing the structure of nodes in context through a network diffusion algorithm (random walk with restart, RWR). HGDTI adopts the fingerprint features of drug molecules and the PseAAC  features of proteins, and enhances feature learning through the neighborhood aggregation of nodes. Comparing with NeoDTI, HGDTI uses weighted aggregation of heterogeneous neighbors and utilizes reliable negative samples. The process of searching the hyperparameter of feature dimension in these baseline methods can be found in "Hyperparameter selection" section. The original dataset may contain approximate samples (i.e. sharing homologous proteins and similar drugs between know DTIs), which may affect the veracity of the predictive power by easy predictions. To explore this issue, we perform the following additional tests (Figs. 4b-e, 5b-e): (1) the removal of DTIs with similar drugs (i.e. drug chemical structure similarities > 0.6) or homologous proteins (i.e. protein sequence similarities > 0.4); (2) the removal of DTIs with drugs sharing similar drug interactions (i.e. Jaccard similarities > 0.6); (3) the removal of DTIs with drugs sharing similar side-effects (i.e. Jaccard similarities > 0.6); (4) the removal of DTIs with drugs or proteins sharing similar diseases (i.e. Jaccard similarities > 0.6). In the above experimental scenarios, we adopt the same positive and negative sample ratio and the uniform 10-fold cross-validation method. All test results demonstrate that HGDTI still remarkably outperforms other prediction methods after the removal of redundant samples, which also certifies the stability of HGDTI.
In addition, we also conducted comparative experiments on "unique" data, in which drugs interact with only one target and vice versa. In that, the unique DTIs prediction lacks sufficient neighbors. To assess the performance of DTIs prediction methods in this scenario, we split the dataset into non-unique DTIs and unique DTIs, which are used in the training phase and the test phase respectively, and the ratios between positive and negative remain unchanged. We detect that HGDTI is unsatisfactory in terms of AUPR (Fig. 4f ), which indicates that HGDTI is not suitable for improving model performance by capturing rich neighborhood information in sparse networks.
It can be seen that discrete nodes that are more extreme than "unique" have worse prediction results, which is also the limitation of graph neural networks. Therefore, for new drugs and new targets that are not in the graph HG, HGDTI cannot aggregate the multi-source information around the node, resulting in unsatisfactory predictive performance. Fig. 6 Optimal dimension of feature. All results were summarized over 10 trials and expressed as mean ± standard deviation

Hyperparameter selection
All node features adopt a uniform dimension d ∈ 64, 128, 256 . To determine the optimal representation dimension of feature, we randomly divide the training set into 5% as the validation set to select the best hyperparameter. The result is shown in Fig. 6. When d = 64, 128 and 256, the corresponding AUPR scores were 0.899, 0.961 and 0.585 respectively, while the corresponding AUROC scores were 0.946, 0.979 and 0.795 respectively. Consequently, HGDTI has the best prediction effect and the smallest variance result when d = 128.

The rationality of negative sampling technique
In order to prove that the superiority of the HGDTI algorithm is not contributed by the negative sampling technique, we compare the second-best NeoDTI with HGDTI under the condition of the negative sampling technique. As presented in the results, HGDTI outperforms NeoDTI by 1.7% in terms of AUPR and 0.9% in terms of AUROC (Fig. 7a). At the same time, we test the performance of HGDTI without negative sampling technology on several scenarios (Fig. 7b-f ). In the first test, we observe a significant improvement (4.5% in terms of AUPR and 1.3% in terms of AUROC) over the second-best NeoDTI. These results indicate that under the same sampling conditions, the power of HGDTI to identify DTI is better than other models, and negative sampling technology can further narrow the prediction range of model. To study the impact of negative sampling technology on the classification ability of HGDTI, we further achieve model's DTIs prediction results using random sampling. As expected, model's ability to identify DTIs dropped prominently by 6.9% in terms of AUPR and 3.4% in terms of AUROC (Fig. 8). The importance of negative sampling technology is self-evident.

Robustness of HGDTI
In the following section, we would like to discuss the robustness of model and the correctness of design. Above all, we further explore the influence of integrating multiple heterogeneous data on DTIs prediction. The experimental data is formed by deleting heterogeneous networks on the basis of the benchmark dataset, and the experimental evaluation method remains unchanged. We first remove the side-effect network, and model prediction results decrease slightly with 0.9% in terms of AUPR and 0.5% in terms of AUROC (Fig. 9a). Then contrast the experimental results of removing the drug or protein interaction network in the heterogeneous network (Fig. 9b). Subsequently, the disease network is removed from the benchmark dataset, and the evaluation metric is significantly reduced by 2.0% in terms of AUPR and 1.2% in terms of AUROC (Fig. 9c). The contrast of these experiments indicates that the fusion of different individual networks can more accurately express the characteristics of drugs and targets and improve the performance of DTIs prediction.
In the benchmark dataset, we find that the effective representation of the node itself is missing. In order to complement the features of drugs and proteins, HGDTI introduces drug molecular fingerprint features ("Representing drug molecules with the 2D molecular fingerprint" section) and protein pseudo-amino acid composition information ("Representing protein sequences with pseudo amino acid composition" section). We further investigate the effect of these features on the model. The experimental results show that the absence of molecular fingerprint information leads to 9.7% reduction in the AUPR metric and 3.7% decrease in the AUROC metric, and the absence of pseudo-amino acid component results in loss with 13.7% in the AUPR metric and 6.9% in the AUROC metric (Fig. 9d), which sufficiently proves the contribution of molecular fingerprint and pseudo-amino acid component to the predictive ability of HGDTI. Fig. 8 Comparison result of negative sampling with random sampling in terms of the AUPR and AUROC scores. All results were summarized over 10 trials and expressed as mean ± standard deviation Fig. 9 Remove HGDTI's drug or target-related information can reduce predictive performance. a Remove drug-side-effect association network. b Remove drugs and proteins interaction networks. c Remove disease association networks. d Remove drug fingerprint and protein PseAAC. All results are summarized in 10 trials and expressed as mean ± standard deviation Fig. 10 Comparison result of different aggregation layers in terms of the AUPR and AUROC scores. All results were summarized over 10 trials and expressed as mean ± standard deviation According to Henaff 's conclusion [51] that higher layers have lower performance, we only construct one layer of neighborhood aggregation. To illustrate the correctness of the structural design, we experiment with the effect of various neighborhood extents on predictive capability. The comparison (Fig. 10) reveals that the aggregation operation significantly improves the performance, but the results decrease slightly as the aggregation layer deepens. The fifth-order aggregation has only more than 1% AUPR difference.

Conclusion
We have proposed a DTI prediction methodology, called HGDTI, to learn the embedding of drugs and targets hidden in various heterogeneous network and input into a fully connected neural network to predict DTIs. The entire framework is divided into a feature learning neural network and a label prediction neural network. By optimizing the parameters of HGDTI through an end-to-end approach, the former can capture more reliable features, and the latter can predict closer labels. After several realistic test scenarios, it is proved that HGDTI is superior to other methods in terms of prediction performance and can integrate more heterogeneous networks to improve prediction accuracy. Moreover, negative sampling technology can further narrow the prediction range. In general, HGDTI can be utilized as an excellent tool for computational drug discovery and drug repositioning.