DGDTA: dynamic graph attention network for predicting drug–target binding affinity

Background Obtaining accurate drug–target binding affinity (DTA) information is important for drug discovery and drug repositioning. Although several methods have been proposed for predicting DTA, the features of proteins and drugs still need further analysis. Deep learning has recently been applied successfully in many fields, so designing a more effective deep learning method for predicting DTA remains attractive. Results This paper proposes dynamic graph DTA (DGDTA), which combines a dynamic graph attention network with a bidirectional long short-term memory (Bi-LSTM) network to predict DTA. DGDTA takes as input the simplified molecular input line entry system (SMILES) string of a drug compound and the amino acid sequence of a protein. First, each drug is treated as a graph of interactions between atoms and edges, and dynamic attention scores are used to determine which atoms and edges in the drug are most important for predicting DTA. Then, Bi-LSTM is used to better extract the contextual features of protein amino acid sequences. Finally, the obtained drug and protein feature vectors are combined, and the DTA is predicted by a fully connected layer. The source code is available from GitHub at https://github.com/luojunwei/DGDTA. Conclusions The experimental results show that DGDTA can predict DTA more accurately than some other methods.


Background
Drug-target interaction (DTI) prediction is a critical task in drug discovery and drug repositioning [1,2]. Structural changes to a drug can significantly alter its binding affinity with proteins [3], making it important to predict whether a drug can bind to a specific protein. However, the traditional high-throughput screening experiments used to detect this activity are expensive and time-consuming [4]. Therefore, computational methods for DTI prediction have become popular and effective [5,6].
DTI calculation methods focus on binary classification [2,7], and their main goal is to determine whether a drug and a target interact with each other [8]. However, the binding strength between a given protein and a drug compound is a continuous value, referred to as their binding affinity. The drug-target pair prediction task can therefore be described as an affinity prediction problem [8] in which the binding affinity score is used directly, thus creating a more realistic experiment. In addition, regression-based models are better at approximating the strength of DTIs [9], making them more conducive to the discovery of new drug compounds in the limited drug research space.
Recently, some methods [10,11] for predicting drug-target affinity (DTA) have been developed. SimBoost [11] enhances the performance of learning-based methods by extracting features from drugs, targets, and drug-target pairs and providing them to a gradient-boosted supervised learning method. Affinity is characterized by an inhibition constant ($K_i$), a dissociation constant ($K_d$), changes in free energy measures ($\Delta G$, $\Delta H$), the half-maximal inhibitory concentration ($IC_{50}$) [12], the half-maximal activity concentration ($AC_{50}$) [13] or the KIBA score [14]. Stronger affinity readings indicate stronger DTIs [15]. In the KronRLS [10,16] model, the Kronecker product of the drug and target kernels is used to calculate the kernel K of the drug-protein pairs, which is fed into a regularized least-squares (RLS) regression model to predict the binding affinity.
With the success of deep learning, various deep networks have been used for DTA prediction [8,13] and have achieved better performance than traditional machine learning. Some prediction methods are summarized in Table 1. In the DeepDTA [8] model, one-dimensional sequences of drugs and proteins are fed into a convolutional neural network (CNN), which extracts the features of drugs and their targets from the simplified molecular input line entry system (SMILES) string representations of the drugs and the protein sequences, and good results have been achieved. The PADME [13] model combines molecular graph convolution of compounds with protein features and uses fixed-rule descriptors to represent proteins, improving the predictive performance of the model. The model is more scalable than traditional machine learning models. WideDTA [17] builds on DeepDTA [8] by representing drugs and proteins as words, learning more latent characteristics of drugs and proteins. However, since the convolution window of a CNN is fixed, this network is unable to extract contextual features. To represent molecules in a natural way that preserves as much molecular structure information as possible, and thus allows the model to better learn the relevance of the underlying space, an increasing number of approaches use graph neural networks to predict DTA. MT-DTI [18] introduces the attention mechanism into the drug representation and takes more account of the correlation between different molecules, which improves the DTA prediction performance and greatly increases interpretability. In DeepGS [19], the topological structure information of a drug is extracted by a graph attention network (GAT) [20], while the local chemical context of the drug is captured by a bidirectional gated recurrent unit (Bi-GRU) [21] and combined with the protein sequence features extracted by a CNN for prediction. rzMLP [22] uses a gMLP model to aggregate input features with constant size, and uses a ReZero layer to smooth the training process for that block. The model is able to learn more complex global features while avoiding the poor predictions caused by an overly deep model. EnsembleDLM [23] aggregates predictions from multiple deep neural networks, not only obtaining better predictions but also exploring how much data deep learning networks need to achieve better prediction performance. GANsDTA [24] employs a generative adversarial network (GAN) [25] to extract features of protein sequences and compound SMILES in an unsupervised manner. Because the GAN's feature extractor does not require labeled data, the model is able to accommodate unlabeled data for training. The model can therefore use more datasets to learn protein and drug features, achieving correspondingly better feature representation and prediction performance. GraphDTA [26] models drugs as molecular graphs built from their one-dimensional sequences and feeds the graphs into several graph network models; the resulting deep learning models were excellent at the time. GraphDTA [26] demonstrated that representing drugs as graphs can further improve the DTA prediction capability of deep learning models.
However, two problems remain that prevent accurate DTA prediction. (1) The GAT model used by some contemporary methods computes a restricted form of static attention, and the attention coefficient function of the nodes in the drug graph is monotonic, which makes it impossible to extract drug features comprehensively. (2) When processing protein sequences, the contextual association information of amino acid sequences is not acquired, and the protein association features are thus ignored. To solve these problems, this paper proposes a method named dynamic graph DTA (DGDTA). In DGDTA, each drug is considered a graph of interactions between atoms and edges, and a dynamic attention score is used to determine which atoms and edges in the drug graph play more critical roles in predicting DTA. Compared with static attention, DGDTA is able to extract a more comprehensive drug signature. To better obtain the contextual features of amino acid sequences in proteins, DGDTA introduces bidirectional long short-term memory (Bi-LSTM) [27] to extract more comprehensive amino acid sequence features, which are combined with the drug features. In validations conducted on the Davis [28] and KIBA [14] datasets, DGDTA achieves better performance than the competing methods. In this paper, a dynamic graph attention network example is given to further illustrate the representativeness and effectiveness of drug molecular graphs. The experimental results demonstrate the effectiveness of DGDTA.

Methods
DGDTA is a deep learning method for predicting DTA, and its architecture (shown in Fig. 1) is divided into three main steps. (1) Obtaining drug features. DGDTA takes the SMILES [29] string of a drug compound as input and transforms the drug into a drug graph consisting of atoms and edges, following the natural structure of the drug. According to the literature, a two-layer graph network structure has better feature extraction performance. DGDTA uses a two-layer dynamic graph attention network (GATv2) [30].

Obtaining drug features
With the development of graph neural networks for DTA prediction, many approaches have been presented. When a graph is used to represent a drug, it is difficult to extract graph features accurately because drug graphs are complex. DGDTA adopts a dynamic GAT to obtain drug features. From the SMILES code, the atomic composition of a drug and the valence of its atoms can be inferred, which in turn indicates drug information such as the number of hydrogen bonds; this information is then used for the drug's feature representation in affinity prediction. To better extract drug features, DGDTA uses the SMILES [29] sequences of drugs as inputs and uses RDKit to extract the atoms and their interactions from the SMILES sequences. Then, DGDTA constructs a graph for each drug based on its SMILES sequence. A drug graph is denoted as $G = (V, E)$, where $V$ is the set of nodes, each representing a drug atom, and $E$ is the set of edges between nodes. Each node is represented by an n-dimensional vector from DeepChem [31]. This n-dimensional vector includes the atomic symbol, the number of adjacent hydrogen atoms, the number of adjacent atoms, the implicit valence of the atom and whether the bonds are aromatic. One node is represented as $d = f_1, f_2, f_3, \ldots, f_n$. By representing the atoms $d$ of each drug as the vertices of the drug graph, the features $D = d_1, d_2, d_3, \ldots, d_D$ of each drug are obtained. To obtain more information about the graph structure in n-dimensional space, this paper adopts a dynamic attention mechanism for the graph:

$$e(d_i, d_j) = a^{\top}\,\mathrm{LeakyReLU}\left(W \cdot [d_i \,\Vert\, d_j]\right), \quad j \in N_i \tag{1}$$

where $e(d_i, d_j)$ denotes the importance of the features of neighbour node $j$ to node $i$, $N_i$ represents the neighbours of node $i$, $a \in \mathbb{R}^{2d'}$ and $W \in \mathbb{R}^{2d' \times d}$ are learned, and $\Vert$ denotes vector concatenation. Using the softmax function to normalize over all neighbours, we obtain the following attention function:

$$\alpha_{ij} = \mathrm{softmax}_j\left(e(d_i, d_j)\right) = \frac{\exp\left(e(d_i, d_j)\right)}{\sum_{j' \in N_i} \exp\left(e(d_i, d_{j'})\right)} \tag{2}$$

Combining Eqs. (1) and (2), the attention coefficients are expressed as:

$$\alpha_{ij} = \frac{\exp\left(a^{\top}\,\mathrm{LeakyReLU}\left(W \cdot [d_i \,\Vert\, d_j]\right)\right)}{\sum_{j' \in N_i} \exp\left(a^{\top}\,\mathrm{LeakyReLU}\left(W \cdot [d_i \,\Vert\, d_{j'}]\right)\right)} \tag{3}$$

After integrating the feature information of the neighbouring nodes, we apply the nonlinearity $\sigma$ to obtain the output features of each node:

$$d_i' = \sigma\left(\sum_{j \in N_i} \alpha_{ij}\, W d_j\right) \tag{4}$$

Nodes are thus represented as the weighted averages of their neighbouring feature vectors. To further stabilize the learning process of dynamic graph self-attention and improve the learning effect, the attention is extended to multi-head attention:

$$d_i' = \Big\Vert_{h=1}^{H}\, \sigma\left(\sum_{j \in N_i} \alpha_{ij}^{h}\, W^{h} d_j\right) \tag{5}$$

$H$ independent attention mechanisms concatenate the semantic feature vectors of the nodes through Eq. (5) and yield an updated drug feature representation $D' = d_1', d_2', \ldots, d_D'$. Based on a combination of research and experiments, a two-layer graph network structure is able to obtain more accurate prediction results. First, the graph network in the second layer uses a dynamic graph neural network and obtains the drug feature representations $D'$; this version is named DGDTA-AL. After many experiments and comparisons, the graph network in the second layer is replaced with a GCN, whose propagation rule is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\, \tilde{A}\, \tilde{D}^{-\frac{1}{2}}\, H^{(l)}\, W^{(l)}\right) \tag{6}$$

where $H^{(l)}$ denotes the node feature matrix of the $l$-th layer, $\tilde{A} = A + I$, $A$ is the adjacency matrix, $I$ is the identity matrix, $\tilde{D} = D + I$, $D$ is the degree matrix, and $W$ is a trainable weight. A drug feature representation $D'$ is obtained. The GCN is applied to the full graph via the Laplacian matrix, which captures the connectivity relationships between the graph nodes and updates the node features of the full graph. In this paper, this version is named DGDTA-CL. We use the rectified linear unit (ReLU) activation function after each layer and use global max pooling in the last layer to obtain the vector representation of the drug.
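The dynamic attention step described above can be illustrated with a small sketch in plain Python. It is not the authors' implementation: it assumes a single attention head, splits the weight matrix into two hypothetical blocks `W_l` and `W_r` so that applying the weights to the concatenated pair equals `W_l d_i + W_r d_j`, adds self-loops to each neighbourhood, and uses ReLU as the nonlinearity. The toy features and weights are made up for illustration.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gatv2_attention(features, neighbours, W_l, W_r, a):
    """Return ({i: {j: alpha_ij}}, projected features) for one GATv2-style head.

    Score: e(d_i, d_j) = a^T LeakyReLU(W_l d_i + W_r d_j), normalised by a
    softmax over the neighbours j of node i.
    """
    left = [matvec(W_l, d) for d in features]
    right = [matvec(W_r, d) for d in features]
    alpha = {}
    for i, nbrs in neighbours.items():
        scores = [
            sum(a_k * leaky_relu(l + r) for a_k, l, r in zip(a, left[i], right[j]))
            for j in nbrs
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha, right

def aggregate(alpha, right):
    """d_i' = ReLU(sum_j alpha_ij * W_r d_j): attention-weighted neighbour average."""
    out = {}
    for i, weights in alpha.items():
        dim = len(right[i])
        summed = [sum(w * right[j][k] for j, w in weights.items()) for k in range(dim)]
        out[i] = [max(0.0, s) for s in summed]
    return out

# Toy 3-atom chain 0-1-2 with self-loops (an assumption of this sketch).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W_l = [[0.5, -0.2], [0.1, 0.3]]
W_r = [[0.4, 0.6], [-0.3, 0.2]]
a = [1.0, -1.0]

alpha, projected = gatv2_attention(features, neighbours, W_l, W_r, a)
updated = aggregate(alpha, projected)
```

Each row of `alpha` sums to one, so every updated node vector is a convex combination of its projected neighbours before the nonlinearity is applied. Multi-head attention would repeat this with independent weights per head and concatenate the results.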

Extracting protein features
A protein sequence is a string of ASCII characters representing amino acids. In many methods, one-hot codes are used to represent drugs and proteins, as well as other biological sequences, such as DNA and RNA. We use one-hot encoding to represent the atoms of the drug and incorporate atomic properties for drug initialization. Because drug molecules are shorter and simpler in structure than proteins, we use one-hot encoding to expand the dimensionality of the drug's representation. This enables the model to capture specific information associated with each drug atom. For proteins, in order to prevent feature singularity, we employ a different approach for the initialization. In this paper, we map each amino acid to a numerical value and represent one protein as a sequence of integers. An embedding layer is then applied to the sequence, where each character is represented by a 128-dimensional vector. For training purposes, the sequences are cut or padded to a fixed length of 1000. If a sequence is shorter, it is padded with 0 values. In this paper, the embedding representation ($c \in \mathbb{R}^{d_p}$, where $d_p$ is the dimensionality of the protein embedding) is fed into a Bi-LSTM layer that captures the dependencies between the characters in a sequence of length $n$ ($C = [c_1, c_2, \ldots, c_n]$). We obtain $p_i \in \mathbb{R}^{2d_1}$, where $d_1$ denotes the number of output units used in each LSTM cell. The vector $P$ is composed of the output vectors generated by the Bi-LSTM; i.e., $P = [p_1, p_2, \ldots, p_n]$. Finally, we use a one-dimensional convolutional layer to learn different levels of abstract features and obtain a vector representation of the protein sequence.
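The integer encoding and fixed-length padding described above can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary `AMINO_ACIDS` (the 20 standard residues in alphabetical order) and the choice to map unknown characters to the padding value 0 are hypothetical and may differ from the mapping used in the DGDTA code.

```python
# Assumed vocabulary: the 20 standard amino acids; index 0 is reserved for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: idx + 1 for idx, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 1000  # fixed sequence length stated in the text

def encode_protein(sequence, max_len=MAX_LEN):
    """Map a protein string to a list of integers of length max_len.

    Longer sequences are cut; shorter ones are right-padded with 0.
    Characters outside the assumed vocabulary are also mapped to 0.
    """
    ids = [AA_TO_INT.get(ch, 0) for ch in sequence[:max_len]]
    ids += [0] * (max_len - len(ids))
    return ids

short = encode_protein("MKV")
long_seq = encode_protein("A" * 1500)
```

The resulting integer sequence is what an embedding layer (128 dimensions per character in the text) would consume before the Bi-LSTM.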

Performing DTA prediction
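The prediction layer concatenates the learned drug vector representation with the vector representation of the protein sequence. The concatenated vector is then used as the input of the fully connected layer, from which the output $y$ is obtained:

$$y = W_{output}\,[\,v_{drug} \,\Vert\, v_{protein}\,] + b_{output}$$

where $v_{drug}$ and $v_{protein}$ denote the learned drug and protein vectors, $W_{output}$ denotes the weight matrix of the fully connected layer and $b_{output}$ denotes the bias of the fully connected layer.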
We choose the mean square error (MSE) as the loss function. Its curve is smooth, continuous and differentiable everywhere, which makes it convenient for gradient descent. As the error decreases, the gradient also decreases, which makes convergence easier and more stable:

$$\mathrm{MSE} = \frac{1}{B} \sum_{i=1}^{B} \left(Y_i - y_i\right)^2$$

where $Y, y \in \mathbb{R}^{B}$, $Y_i$ and $y_i$ denote the predicted affinity value and the affinity label of the $i$-th sample, respectively, and $B$ denotes the batch size.
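As a small worked illustration of the prediction layer and the loss, the sketch below uses plain Python lists. The function names `predict`, `mse_loss` and `mse_grad` are hypothetical, a single linear layer with a scalar output is assumed, and the numbers are made up.

```python
def predict(drug_vec, protein_vec, weights, bias):
    """Affinity from one linear layer applied to the concatenated vectors."""
    joint = list(drug_vec) + list(protein_vec)
    if len(joint) != len(weights):
        raise ValueError("weights must match the concatenated vector length")
    return sum(w * x for w, x in zip(weights, joint)) + bias

def mse_loss(predicted, target):
    """Mean squared error over a batch of B samples."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def mse_grad(predicted, target):
    """Gradient of the MSE with respect to each prediction: 2 (Y_i - y_i) / B."""
    batch = len(predicted)
    return [2.0 * (p - t) / batch for p, t in zip(predicted, target)]

affinity = predict([1.0, 2.0], [3.0], weights=[0.5, 0.5, 1.0], bias=0.5)
loss = mse_loss([5.0, 7.0], [5.5, 6.0])
grad = mse_grad([5.0, 7.0], [5.5, 6.0])
```

The gradient is proportional to the residual, which is the property the text refers to: as predictions approach the labels, the update steps shrink.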

Model training
DGDTA takes drug SMILES strings and protein amino acid sequences as inputs. Python 3.9, PyTorch 1.12.1 and PyG 2.1 are used to implement the dynamic GAT and the LSTM. The number of layers in the graph neural network is set to 2, Bi-LSTM is applied, the number of hidden states is set to 10, and the dropout rate is set to 0.2. The proposed method is trained on the Davis and KIBA datasets for 1000 epochs, and the adaptive moment estimation (Adam) optimizer is used with a learning rate of 0.0005. The experiments are run on an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.30 GHz and an NVIDIA GeForce RTX 3090 GPU.

Results
In this section, we present the dataset used, the evaluation metrics, an ablation study and the results of a comparison with state-of-the-art methods. This section also illustrates the advantage of the dynamic GAT and gives an example of a real drug-target combination.

Dataset and evaluation metrics
We use the Davis [28] and KIBA [14] datasets to evaluate the performance of the proposed method. The numbers of drugs and targets in each dataset and the sample sizes used for training and testing are shown in Table 2. The concordance index (CI; the larger the better) [32] and the MSE (the smaller the better) are used as the main indicators for evaluating the performance of the tested models.
The GAT and GAT_GCN models are chosen as baseline 1 and baseline 2 of the ablation study, respectively.
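For readers unfamiliar with the concordance index, the sketch below shows the usual pairwise definition: over all pairs whose true affinities differ, it counts the fraction whose predictions are ordered the same way, with prediction ties counted as one half. The function name is hypothetical, and the quadratic loop is meant for clarity rather than speed.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:  # pair (i, j) is comparable
                comparable += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    concordant += 1.0
                elif diff == 0:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable

perfect = concordance_index([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
reversed_order = concordance_index([1.0, 2.0, 3.0], [0.3, 0.2, 0.1])
one_swap = concordance_index([1.0, 2.0, 3.0], [0.1, 0.3, 0.2])
constant = concordance_index([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A perfectly ordered prediction gives 1.0, a fully reversed one gives 0.0, and a constant prediction gives 0.5, which is why values close to 0.9 in the tables indicate that most pairs are ranked correctly.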

Ablation study
In the ablation study, we analyse the effectiveness of the innovative elements of our method. To be as fair as possible, we use the same training and testing sets as those employed by the baselines and the same evaluation metrics. In this paper, a dynamic graph neural network is incorporated into the drug graph, and Bi-LSTM is added to extract protein amino acid sequence features to further improve the model accuracy. The popular GRU model is added as a comparison method. GRU and LSTM are important variants of recurrent neural networks, and both have strong memory and can capture long-distance dependencies when processing sequence data. GRU has fewer parameters and is therefore more computationally efficient than LSTM, but in some cases this also leads to some loss of information over longer distances. To better capture the contextual association information of amino acid sequences and further demonstrate the effectiveness of the LSTM method, GRU is introduced as a comparison in the ablation study. The results of the ablation study are shown in Figs. 2 and 3. Figure 2 shows that on the Davis and KIBA datasets, the DTA prediction results obtained by Model-2 using the dynamic GAT achieve a higher CI and a smaller MSE than those of baseline 1 in the same number of epochs. Model-1, which adds the Bi-LSTM method, is also better than baseline 1. Based on Model-2, Bi-LSTM is used to improve the ability to extract contextual protein amino acid sequence features. The evaluation score of the resulting Model-4 is improved further, and its prediction result is better than that of the GRU in Model-3 with the same parameters. Model-4 achieves the best results in the 200-epoch and 1000-epoch comparisons conducted on both datasets, and Model-4 is the DGDTA-AL method described in the section "Obtaining drug features". As shown in Fig. 3, Model-8 obtains the highest CI and the lowest MSE in the comparison with baseline 2 over the same number of epochs; Model-8 is the DGDTA-CL method. The results obtained by the different models in the ablation study are presented in Table 3. On the Davis dataset, DGDTA-AL achieves the best results (in bold), reaching a CI of 0.899 and an MSE of 0.225, which are improvements of 0.7% and 0.7% over those of baseline 1, respectively. DGDTA-CL achieves a CI of 0.902 and an MSE of 0.125 on the KIBA dataset, which are improvements of 1.1% and 1.4% over those of baseline 2, respectively. The results of the ablation study demonstrate the effectiveness of the innovative elements proposed in this paper.

Comparison with the state-of-the-art methods
Table 4 shows the experimental results obtained by DGDTA and the comparison methods. To be consistent with the ablation study, we use the same datasets and evaluation metrics, and we add the $r_m^2$ evaluation metric. As shown in Table 4, DGDTA-AL is better than the mainstream DTA methods in terms of the CI, MSE and $r_m^2$ on the Davis dataset. Compared with DeepGLSTM [33], which has the best results among the comparison methods, the CI and MSE of the proposed approach are improved by 0.6% and 1.1%, respectively. Additionally, the CI and MSE are improved by 0.9% and 0.4%, respectively, over those of the MATT-DTI [34] method, and $r_m^2$ reaches 0.707. As shown in Table 4, DGDTA-CL achieves a more significant improvement on the KIBA dataset. Compared with the DeepGLSTM [33] method, DGDTA-CL attains 1.2% and 1.8% improvements in terms of the CI and MSE metrics, respectively, and 1.3% and 2.5% CI and MSE improvements are achieved over the MATT-DTI [34] method, respectively; $r_m^2$ reaches 0.809. Figure 4 plots the CI scores obtained by the methods in the table for both datasets to further demonstrate the performance improvement provided by the DGDTA method. The experimental results show that DGDTA is better than the comparative methods, and that using a dynamic graph with attention to extract drug features, together with effective contextual protein information, is significant for predicting DTA.

Advantages of the DGDTA model
The dynamic GAT work shows that a traditional GAT computes only a restricted, "static" form of attention: for any query node, the attention function is monotonic with respect to the key scores [30]. As shown in the GAT heatmap presented in Fig. 5, the ordering of the attention coefficients is global, and all queries focus primarily on the 7th key.
Formula (10) is the method for calculating the attention coefficients in the GAT, indicating the importance of the features of node $j$ to node $i$. Because $N_i$ is finite, there exists a node $j_{max}$ whose score is the maximum, and the attention distribution $\alpha$ is dominated by $j_{max}$ for every query; that is, the attention is static. The scoring function of the GAT is

$$e(d_i, d_j) = \mathrm{LeakyReLU}\left(a^{\top} \cdot [W d_i \,\Vert\, W d_j]\right), \quad j \in N_i \tag{12}$$

To overcome the monotonicity restriction of the key score, Formula (12) is transformed into Formula (1). This variant is more expressive than the GAT, as shown in the attention maps of GATv2 in Fig. 5.
Since static attention cannot produce different rankings of the keys for different queries, if one key has a higher attention score than the others, then no query can ignore this key, which makes static attention very limited. Among the datasets, Davis contains 2457 positive samples and 27,599 negative samples; the total number of samples is small, and the label distribution in the dataset is unbalanced. KIBA has 22,729 positive samples and 95,525 negative samples, so it contains more samples than Davis, but most of the labels in KIBA are very concentrated, and the label distribution is relatively normal. These problems create barriers for the model in terms of affinity prediction. Dynamic graph attention pays different amounts of attention to different queries in the attention score, enabling it to better distinguish the similarities and differences between samples. It is more discriminative during drug graph extraction and alleviates the imbalance problem in the given dataset. Figure 6 shows the MSE changes exhibited by the DGDTA-AL, DGDTA-CL, baseline 1 and baseline 2 models on Davis and KIBA at 200 and 500 epochs. Blue and green represent our proposed models, which show faster decreasing trends. The results demonstrate the more significant improvement yielded by the dynamic GAT in terms of predicting DTA.
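The difference between static and dynamic attention can be made concrete with a two-node toy example. The sketch below is an illustration under simplifying assumptions, not the paper's experiment: the weight matrices are taken to be the identity, the attention vectors and node features are hand-picked, and only the unnormalised scores are compared (the softmax preserves their ranking).

```python
def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def gat_score(d_i, d_j, a_left, a_right):
    # Static (GAT) score with W = identity: LeakyReLU(a^T [d_i || d_j]).
    return leaky_relu(dot(a_left, d_i) + dot(a_right, d_j))

def gatv2_score(d_i, d_j, a):
    # Dynamic (GATv2) score with identity weights: a^T LeakyReLU(d_i + d_j).
    return dot(a, [leaky_relu(x + y) for x, y in zip(d_i, d_j)])

def top_key(query, keys, score):
    # Index of the key that receives the highest score for this query.
    return max(range(len(keys)), key=lambda j: score(query, keys[j]))

nodes = [[1.0, -1.0], [-1.0, 1.0]]

gat_top = [
    top_key(q, nodes, lambda i, j: gat_score(i, j, [1.0, 0.0], [1.0, 0.0]))
    for q in nodes
]
gatv2_top = [
    top_key(q, nodes, lambda i, j: gatv2_score(i, j, [1.0, 1.0]))
    for q in nodes
]
```

With the static score, the query contributes only an additive constant inside a monotonic function, so both queries select the same key. With the dynamic score, the nonlinearity is applied before the attention vector, and each query here selects a different key.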

Example of a realistic drug-target combination
To further demonstrate the validity of the proposed method, this paper gives an example showing the 3D model of a real tested sample. As shown in Fig. 7, the targeted drug (sunitinib) inhibits receptor tyrosine kinases (RTKs), some of which are involved in tumour growth, pathological blood vessel formation and tumour metastasis. In biological and cytometric assays, sunitinib has been shown to inhibit tumour growth, cause tumour regression and inhibit tumour metastasis. In the figure, the bound small drug molecule is scaled up on the right side, and the drug and its binding target correspond to the drug 'DB5329102' and the target 'ITK' in the test set, respectively. This known drug-target binding example is used to verify the validity and practicality of the proposed model in practical applications.

Discussion
In this paper, DGDTA is proposed based on the dynamic graph attention model and is divided into two versions, DGDTA-AL and DGDTA-CL, to predict the affinity values between drugs and proteins. Ablation experiments are performed on the Davis and KIBA datasets, and the proposed approach is compared with the DTA models that are popular today. The experimental results show that DGDTA achieves better prediction performance and demonstrate that the dynamic graph attention model can extract more comprehensive feature representations from drug molecular graphs.

Conclusions
DGDTA can effectively predict DTA via deep learning, and it obtains a high CI and a low MSE on the experimental datasets, but it still has shortcomings. First, while dynamic graph attention models attain good prediction performance, they also require increased prediction time and computational cost. Second, drugs and proteins have very complex spatial structures, and much characteristic drug and protein information is lost in one-dimensional sequences.
In the future, further consideration will be given to fusing other characteristic drug information, such as side effects, physicochemical properties and deep structures. This will help improve the performance of drug-target binding prediction models from various aspects.
Project name: DGDTA. Project home page: https://github.com/luojunwei/DGDTA. Operating system(s): Linux or other Unix-like systems. Programming language: Python 3.x. License: GNU GPL v3. Any restrictions to use by non-academics: license needed.

Fig. 2 Comparison between baseline1 and different models at 200 and 1000 epochs

Fig. 3 Comparison between baseline2 and different models at 200 and 1000 epochs

Fig. 4 CI comparison among the experimental methods on the Davis and KIBA datasets

Fig. 5 Attention coefficients of the GAT and GATv2

Table 3 Ablation study on the Davis and KIBA datasets. *Bold values represent the best result

Table 4 Comparison with the state-of-the-art methods. *Bold values represent the best result