SubMDTA: drug target affinity prediction based on substructure extraction and multiscale features
BMC Bioinformatics volume 24, Article number: 334 (2023)
Abstract
Background
Drug–target affinity (DTA) prediction is a critical step in drug discovery. In recent years, deep learning-based methods have emerged for DTA prediction. To address the fusion of substructure information from drug molecular graphs and to exploit the multiscale information of proteins, this paper proposes a self-supervised pretraining model based on substructure extraction and multiscale features.
Results
For drug molecules, the model obtains substructure information through a probability matrix, and contrastive learning is applied to the graph-level and subgraph-level representations to pretrain the graph encoder for downstream tasks. For targets, a BiLSTM method that integrates multiscale features is used to capture long-distance relationships in the amino acid sequence. The experimental results show that our model achieves better performance for DTA prediction.
Conclusions
The proposed model improves the performance of DTA prediction and provides a novel strategy based on substructure extraction and multiscale features.
Introduction
Drug development is a complex process involving long research cycles, high costs, and low success rates; bringing a new drug from small-molecule screening to market approval can take several decades and cost 400–900 million dollars [1]. In the past few years, information technology has been widely applied in computer-aided drug design (CADD) to accelerate drug development [2]. The prediction of drug–target binding affinity (DTA) is an important step in drug discovery because it provides information on the strength of the interaction between drug molecules and target proteins. Therefore, the development of efficient and accurate algorithms for DTA prediction is of great significance in CADD.
Early computational virtual screening mainly focused on two types of methods: molecular docking [3,4,5] and ligand-based similarity [6, 7]. Molecular docking uses the three-dimensional structures of protein targets and drug molecules and predicts affinity by simulating the docking process between them [8, 9]. However, three-dimensional structures are difficult to acquire, and large-scale molecular docking is time-consuming. In contrast, ligand-based methods do not rely on three-dimensional structures; they predict DTA by comparing new ligands with known ligands. However, when the number of known ligands is insufficient, the ligand-based approach is limited. In response to these challenges, machine learning methods for DTA prediction [10,11,12] have gradually been introduced into virtual screening and have improved prediction performance. Wang et al. [10] treated the interaction between drugs and targets as a binary classification problem: after extracting chemical descriptors of drugs and protein sequence information, an SVM model was used for prediction. KronRLS [11] used the PubChem structure clustering tool [13] and the Smith–Waterman algorithm [14] to obtain similarity matrices for drugs and proteins, and used the Kronecker product of these matrices to define similarity scores for drug–target pairs. To alleviate the limitation of linear dependence in KronRLS, SimBoost [12] constructed a drug–target similarity network and built a gradient boosting regression tree model for prediction.
However, these machine learning methods rely on carefully designed handcrafted features, and the selection of these features depends on specific domain knowledge and experience [15]. As deep learning (DL) methods have demonstrated superior learning capabilities over traditional machine learning methods in multiple fields, they have gradually been applied to problems in bioinformatics, including DTA prediction [16,17,18,19,20,21,22]. DeepDTA [16] processed protein sequences and molecular sequences with two separate CNNs; the output feature vectors were concatenated and fed into three fully connected layers to predict binding affinity. DeepCDA [17] combined CNN and LSTM to encode protein and molecular sequences and proposed a bidirectional attention mechanism to predict DTA. FusionDTA [18] replaced the coarse pooling method with a novel multi-head linear attention mechanism that aggregates global information, addressing the issue of information loss. Additionally, knowledge distillation was applied to transfer learnable information from a teacher model to a student model to reduce parameter redundancy. A molecule can also be represented as a graph in which atoms and bonds correspond to nodes and edges. With the rapid development of graph neural networks (GNNs), researchers have applied GNN models to DTA prediction. GraphDTA [19] used the topological structure information of molecular graphs and different GNN models for drug representation, while a CNN was used to learn the protein representation, similarly to DeepDTA. DGraphDTA [20] constructed protein graphs based on protein contact maps for the first time and then used a GNN to predict DTA from molecular graphs and protein graphs. MGraphDTA [21] constructed a super-deep GNN with 27 graph convolution layers by introducing dense connections to capture both local and global structures of molecules.
These methods indicate that deep learning networks can better capture the features of drugs and proteins. Because laboratory experiments are costly and time-consuming, the size of training datasets for drug discovery is limited, which may cause overfitting in machine learning methods and affect the generalization of the learned features. Self-supervised learning can use unlabeled data for pretraining and transfer the learned model to downstream tasks, which alleviates the requirement for labeled data. Self-supervised learning methods have also been used in drug discovery [23]. InfoGraph [24] maximized the mutual information between the graph embedding and substructure embeddings at different scales to learn graph representations. MPG [25] compared two half-graphs and distinguished whether they came from the same source as a self-supervised learning strategy. GROVER [26] proposed two pretraining tasks: for the node/edge-level task, it randomly masked a local subgraph of the target node/edge and predicted the contextual property; for the graph-level task, it extracted the semantic motifs existing in molecular graphs (such as functional groups) and predicted whether these motifs existed in a molecule. However, most existing research integrates all structural features and node attributes of the graph to provide an overview of the graph, ignoring more fine-grained substructure semantics.
Proteins are macromolecules composed of amino acids. Organisms use 22 amino acids, each denoted by a letter, so a protein can be naturally represented as a sequence of letters. Sequence-based DL models can effectively consider the contextual relationships within such sequences. MATT_DTI [27] utilized three convolutional layers as the feature extractor, followed by a max pooling layer. A multi-head attention block was built to model the similarity of drug–target pairs as the interaction information for DTA prediction.
TransformerCPI [28] used a one-dimensional gated convolutional network with gated linear units in place of the self-attention layer in the Transformer encoder. However, current studies focus only on a single scale of protein sequences, and traditional sequence-based approaches that process the whole sequence at once may lose local information and neglect the multiscale features of proteins. How to combine multiscale information to improve the robustness of protein representations therefore remains an open issue.
To overcome the limitations of existing methods, we propose a novel framework, SubMDTA, a drug–target affinity prediction method based on substructure extraction and multiscale features. For molecules, inspired by Wang et al. [29], a self-supervised learning method based on molecular substructure is proposed for molecular representation. During the pretraining phase, subgraphs are generated to obtain substructure information, and subgraphs are replaced according to their similarity relationships to generate reconstructed graphs. We simultaneously maximize the mutual information between the subgraph and the original graph, as well as between the reconstructed graph and the original graph, to improve the correlation between subgraph-level and graph-level representations. After pretraining, the trained model is fine-tuned on downstream tasks. For proteins, a BiLSTM method that integrates multiscale information based on the n-gram method is proposed for feature extraction. Finally, the drug and protein features are concatenated and fed into a multilayer perceptron (MLP) for DTA prediction. We compared our proposed method with several state-of-the-art methods, and the experimental results demonstrate that our method outperforms them on the Davis [30] and KIBA [31] datasets.
Materials and methods
SubMDTA performs DTA prediction by integrating structural information of drug molecules and sequence features of targets; its general architecture is shown in Fig. 1. It consists of a pretraining part and a DTA prediction part. In the pretraining part, the drug SMILES (Simplified Molecular Input Line Entry System) [32] strings in the pretraining dataset are first converted into molecular graphs, and the graph representations are then encoded using the GIN [33] network. Next, the subgraphs and reconstructed graphs are extracted. After these two types of features are obtained, the mutual information between each of them and the original graph is maximized. The DTA prediction part uses the trained GIN encoder for molecular representation. Protein sequences are first embedded by n-gram coding and then fed into a BiLSTM to obtain their representations. Finally, the drug representation and the protein representation are concatenated and fed into fully connected layers to predict the binding affinity.
Datasets
The Davis and KIBA datasets were used to evaluate the performance of the proposed model. The Davis dataset was obtained by selecting certain kinase proteins and their corresponding inhibitors, with binding affinity represented by the dissociation constant \(K_d\); the affinity values were processed using Eq. 1. It contains 442 proteins, 68 drugs, and 30,056 drug–target interactions. The average length of the drug SMILES strings is 64, and the average length of the protein sequences is 788. The KIBA dataset combines kinase inhibitor bioactivities from various sources, such as the inhibition constant (\(K_i\)), the dissociation constant (\(K_d\)), and the half-maximal inhibitory concentration (\(IC_{50}\)), and expresses bioactivity as the KIBA score. It consists of 229 proteins, 2111 drugs, and 118,254 drug–target interactions. The average length of the drug SMILES strings is 58, and the average length of the protein sequences is 728. Detailed information on the two datasets is shown in Table 1.
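The affinity transformation referred to as Eq. 1 is not reproduced in this text. The sketch below assumes the log transformation commonly applied to Davis \(K_d\) values (measured in nM) in the DTA literature, \(pK_d = -\log_{10}(K_d / 10^9)\); the function name and the example values are illustrative only.

```python
import math


def kd_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant given in nanomolar to pKd.

    Assumes the common transformation pKd = -log10(Kd / 1e9).
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm / 1e9)


# Illustrative Kd values in nM (not taken from the dataset).
example_kd_nm = [10000.0, 100.0, 1.0]
example_pkd = [kd_to_pkd(kd) for kd in example_kd_nm]
```

A smaller \(K_d\) means tighter binding, so it maps to a larger \(pK_d\); the log scale also compresses the wide range of raw values into a range better suited to regression.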
Molecular encoder
Each drug molecule in the experimental dataset is represented by its SMILES string. The open-source cheminformatics software RDKit [34] is used to convert each SMILES string into the corresponding molecular graph. For the node features, we use a set of atomic feature representations adopted from DeepChem [35]. To better explore molecular features, the Graph Isomorphism Network (GIN) is used as the graph encoder in this paper. GIN provides a better inductive bias for graph representation learning and generates node representations by repeatedly aggregating information from local neighborhood nodes. Each GIN layer is followed by a batch normalization layer and a ReLU activation. Specifically, GIN uses an MLP to update the node features, and its update process can be written as:
\[ \textbf{x}_{i}^{l+1} = \textbf{MLP}\left( (1 + \varepsilon )\,\textbf{x}_{i}^{l} + \sum_{j \in \mathcal {N}_{i}} \textbf{x}_{j}^{l} \right) \tag{2} \]
where \(\varepsilon\) is either a learnable parameter or a fixed scalar, \(\textbf{x}_{i}^{l}\) denotes the feature of the ith node in the lth layer, \(\mathcal {N}_{i}\) is the set of neighbors of node i, and \(\textbf{x}_{j}^{l}\) denotes the feature of the jth node in the lth layer.
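As an illustration of the GIN update rule, the following is a minimal pure-Python sketch of one layer, not the authors' implementation. The adjacency dictionary, the toy features, and the identity function standing in for the MLP are assumptions made for the example.

```python
from typing import Callable, Dict, List

Vector = List[float]


def gin_update(features: List[Vector],
               neighbors: Dict[int, List[int]],
               mlp: Callable[[Vector], Vector],
               eps: float = 0.0) -> List[Vector]:
    """One GIN layer: x_i <- MLP((1 + eps) * x_i + sum of neighbor features)."""
    updated = []
    for i, x_i in enumerate(features):
        # Scale the node's own feature by (1 + eps).
        agg = [(1.0 + eps) * value for value in x_i]
        # Add the (unweighted) sum of neighbor features.
        for j in neighbors[i]:
            agg = [a + b for a, b in zip(agg, features[j])]
        updated.append(mlp(agg))
    return updated


# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors = {0: [1], 1: [0, 2], 2: [1]}
# The identity function is a placeholder for a trained MLP.
toy_output = gin_update(toy_features, toy_neighbors, mlp=lambda v: v)
```

Stacking several such layers lets each node see progressively larger neighborhoods, which is why the representations move from local to global information as the depth grows.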
Stacking multiple GIN layers lets each node aggregate information from its multi-hop neighbors, and the information embedded in the representations shifts gradually from local to global as the number of hops grows. After L layers of GIN, a list of node representations \(\left\{ {\textbf{x}_{i}^{0},\textbf{x}_{i}^{1},\ldots ,\textbf{x}_{i}^{L}} \right\}\) is generated. To avoid losing node information, a convolution kernel of size (L, 1), called \(\textbf{Conv}\), is used to aggregate the node representations from different layers (Eq. 3), so that local and global information can be combined.
After the final node embeddings containing information at different levels of the graph are obtained, they are aggregated into fixed-length graph-level representations using a readout function. In this paper, we use a global summation pooling function, which we call \(\textbf{GlobalAddPool}\), to read out the representation h(G) of the nodes as Eq. 4:
\[ h(G) = \textbf{GlobalAddPool}\left( {X}^{G} \right) \tag{4} \]
where \({X}^{G}\) represents the node feature matrix. The function returns batched graph-level outputs by summing node features across the node dimension, so that the global representation reflects every node in the graph.
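The two aggregation steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the (L, 1) convolution is treated as a single-channel, bias-free weighted sum over the stacked layer outputs, the weights are hypothetical fixed values rather than learned ones, and `global_add_pool` is a pure-Python stand-in for a batched sum readout.

```python
from typing import List, Sequence

Vector = List[float]


def aggregate_layers(layer_outputs: Sequence[Sequence[Vector]],
                     layer_weights: Sequence[float]) -> List[Vector]:
    """Weighted sum of each node's representations across layers."""
    if len(layer_outputs) != len(layer_weights):
        raise ValueError("one weight per stacked layer output is required")
    num_nodes = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_nodes)]
    for weight, nodes in zip(layer_weights, layer_outputs):
        for i, vec in enumerate(nodes):
            for d, value in enumerate(vec):
                combined[i][d] += weight * value
    return combined


def global_add_pool(node_features: Sequence[Vector],
                    batch: Sequence[int]) -> List[Vector]:
    """Sum node features per graph; batch[i] is the graph index of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_features[0])
    pooled = [[0.0] * dim for _ in range(num_graphs)]
    for vec, graph_index in zip(node_features, batch):
        for d, value in enumerate(vec):
            pooled[graph_index][d] += value
    return pooled


# Two stacked layer outputs for three nodes; nodes 0 and 1 belong to graph 0.
layers = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [[3.0, 2.0], [2.0, 1.0], [0.0, 0.0]],
]
nodes = aggregate_layers(layers, layer_weights=[0.5, 0.5])
graph_repr = global_add_pool(nodes, batch=[0, 0, 1])
```

Sum pooling keeps the readout sensitive to graph size, unlike mean pooling, which is one reason it is commonly paired with GIN.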
Contrastive learning method for molecular representation
Inspired by mutual information-based contrastive learning algorithms [36, 37], we maximize the mutual information of molecular graphs to obtain more informative feature representations. The overall framework of our approach is shown in Fig. 2. The drug molecule graph acquires its original features after GIN encoding, followed by substructure extraction. For each original graph, its original feature forms a positive sample pair with each of its subgraph representations, whereas the subgraphs of other graphs in the same batch form negative sample pairs. To capture the inherent relations between graphs, subgraphs are ranked according to similarity and half of them are replaced to obtain the reconstructed graph. The original graph and its own reconstructed graph form a positive pair, whereas the original graph and the reconstructed graphs of other graphs in the same batch form negative pairs.
Subgraphlevel contrastive learning
In this paper, a subgraph generation method [29] is utilized in contrastive learning. After the node feature matrix \(X^{G}\) is obtained, it is transformed by a linear function with the learnable matrix W, and the row-by-row \(\textbf{Softmax}\) function is applied to obtain a probability matrix A as Eq. 5:
\[ A = \textbf{Softmax}\left( X^{G}W \right) \tag{5} \]
where \(A_{ij}\) denotes the probability that the ith node belongs to the jth subgraph. The \(\textbf{Softmax}\) function exponentiates each element of an input row and sums the exponentials to obtain a scalar; each exponential is then divided by this scalar to obtain a normalized probability.
Based on the probability matrix A, we divide the original graph into two subgraphs using a predefined probability threshold of 0.5. After T rounds of splitting, we obtain \(S = 2^{T}\) subgraphs. The node representations of each subgraph are denoted as \(X^{G_{i}},~i = 1,~2,~\ldots ,~S\). Here, we adopt the same pooling function as in Eq. 4 to obtain the graph-level representation. After the readout function, we obtain the subgraph representation \(h\left( G_{i} \right)\) as Eq. 6:
\[ h\left( G_{i} \right) = \textbf{GlobalAddPool}\left( X^{G_{i}} \right) \tag{6} \]
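A single splitting round can be sketched as below. This is an illustration under stated assumptions rather than the published procedure: the scores stand in for the linear projection of the node features, the matrix has two columns (one per candidate subgraph), and a node is assigned to the first subgraph when its first-column probability reaches the 0.5 threshold.

```python
import math
from typing import List, Sequence, Tuple


def row_softmax(scores: Sequence[Sequence[float]]) -> List[List[float]]:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    probs = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def split_nodes(prob_matrix: Sequence[Sequence[float]],
                threshold: float = 0.5) -> Tuple[List[int], List[int]]:
    """Assign each node to one of two subgraphs by its first-column probability."""
    first, second = [], []
    for node, row in enumerate(prob_matrix):
        (first if row[0] >= threshold else second).append(node)
    return first, second


# Hypothetical projected scores for three nodes and two subgraphs.
scores = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
A = row_softmax(scores)
sub_a, sub_b = split_nodes(A)
```

Applying the same split recursively to each resulting node set for T rounds yields \(2^{T}\) node sets, matching the subgraph count stated above.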
Mutual information (MI) quantifies the relationship between two random variables. Let \(\phi\) denote the parameters of the graph neural network. A discriminator \(T_{\omega }\), which takes a subgraph/graph embedding pair \(\left( h_{\phi }(G),h_{\phi }\left( G_{i} \right) \right)\) as input and determines whether the two embeddings come from the same graph, is used to define the objective in Eq. 7. In that objective, \(I_{\phi ,\omega }\left( h_{\phi }(G);h_{\phi }\left( G_{i} \right) \right)\) is a mutual information estimator modeled by the discriminator and parameterized by the neural network.
We use the Jensen–Shannon (JS) mutual information estimator [38] on local/global pairs to maximize the mutual information for a given subgraph/graph embedding pair as Eq. 8. The JS mutual information estimator is approximately monotonic with respect to the KL divergence (the traditional definition of mutual information), but it is more stable and can provide better results [39].
\[ I_{\phi ,\omega }^{JSD}\left( h_{\phi }(G);h_{\phi }\left( G_{i} \right) \right) = \mathbb {E}_{P}\left[ - sp\left( - T_{\omega }\left( h_{\phi }(G),h_{\phi }\left( G_{i} \right) \right) \right) \right] - \mathbb {E}_{Q}\left[ sp\left( T_{\omega }\left( h_{\phi }(G),h_{\phi }\left( G_{i} \right) \right) \right) \right] \tag{8} \]
where \(P = p\left( h_{\phi }(G),h_{\phi }\left( G_{i} \right) \right)\) is the joint distribution of the global graph representation and the subgraph representation, and \(Q = p\left( h_{\phi }(G) \right) p\left( h_{\phi }\left( G_{i} \right) \right)\) denotes the product of the marginal distributions of the two embeddings. In contrastive learning, P is the distribution of positive pairs, Q is the distribution of negative pairs, and \(sp(x) = \log \left( 1 + e^{x} \right)\) is the softplus function.
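To make the estimator concrete, the sketch below evaluates the JS-based mutual information estimate from discriminator scores. It assumes the scores for positive pairs (samples from P) and negative pairs (samples from Q) have already been computed; the function names and the example scores are illustrative.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(pos_scores: Sequence[float],
                    neg_scores: Sequence[float]) -> float:
    """E_P[-sp(-T)] - E_Q[sp(T)] from discriminator scores T."""
    e_pos = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    e_neg = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return e_pos - e_neg


# An uninformative discriminator scores every pair as 0.
uninformative = jsd_mi_estimate([0.0, 0.0], [0.0, 0.0])
# A discriminator that separates positive from negative pairs.
informative = jsd_mi_estimate([5.0, 4.0], [-5.0, -4.0])
```

Training minimizes the negative of this estimate, which pushes scores for positive pairs up and scores for negative pairs down.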
Graphlevel contrastive learning
The reconstructed graph generation method is based on the strategy of similar-subgraph substitution. To better capture the structural information of the graph, given a generated subgraph \(G_i\) of a certain original graph, we compute its cosine similarity to the generated subgraphs \(G_j\) of other original graphs in the same batch (Eq. 9).
After ranking, half of the original subgraphs are replaced according to the similarity values, and the resulting subgraphs are aggregated into a reconstructed graph using a convolution kernel of size (S, 1).
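The similarity ranking and substitution can be sketched as follows. The cosine similarity itself is standard; the replacement policy is an assumption made for illustration, since the text does not specify which half is replaced. Here each subgraph embedding is matched to its most similar embedding from other graphs, and the half with the highest match similarity is replaced by those matches.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def replace_half(own: Sequence[Vector], others: Sequence[Vector]) -> List[Vector]:
    """Hypothetical policy: swap the best-matched half of `own` with its matches."""
    matches = []
    for idx, emb in enumerate(own):
        sims = [cosine_similarity(emb, other) for other in others]
        best = max(range(len(others)), key=lambda k: sims[k])
        matches.append((sims[best], idx, best))
    matches.sort(reverse=True)  # highest similarity first
    result = [list(vec) for vec in own]
    for _, idx, best in matches[: len(own) // 2]:
        result[idx] = list(others[best])
    return result


own_subgraphs = [[1.0, 0.0], [0.0, 1.0]]
other_subgraphs = [[2.0, 0.1], [-1.0, -1.0]]
reconstructed = replace_half(own_subgraphs, other_subgraphs)
```

In the full method the resulting set of subgraph embeddings would then be combined by the (S, 1) convolution; that step is omitted here.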
For the reconstructed representation \(h( \hat{G} )\), the global feature h(G) of its original graph is selected to form a positive sample pair, while the reconstructed representations \(h( \hat{G}' )\) of other graphs in the same batch form the negative sample pairs. We use the same mutual information calculation method to maximize the mutual information between the original and reconstructed graphs, denoted as \(I_{\phi ,\omega }^{JSD}( {h_{\phi }(G);h_{\phi }( \hat{G} )} )\) (Eqs. 10 and 11).
The final loss is the sum of the two mutual information losses (Eq. 12).
To enhance the generalization of the self-supervised features, 50,000 molecules are randomly selected from the ZINC database to pretrain the self-supervised model, and a high-quality molecular encoder is obtained by learning rich molecular structure and semantic information from unlabeled molecular data.
Protein representation
For each protein in the experimental dataset, the protein sequence is obtained from the UniProt database through its gene name. The sequence is a string of ASCII characters representing amino acids. The n-gram method [40] is used to define the "words" in the amino acid sequence, and the protein sequence is split into multiple overlapping n-gram amino acid words. Counting all permutations and combinations, there are \(22^n\) possible n-gram words; however, if n is too large, the frequency of each word may become too low. Taking n = 3 as an example, given a protein sequence \(S = s_{1}s_{2}s_{3}\ldots s_{|S|}\), where \(|S|\) is the length of the protein sequence, we divide it into n-gram words:
\[ s_{1}s_{2}s_{3},\; s_{2}s_{3}s_{4},\; \ldots ,\; s_{|S| - 2}s_{|S| - 1}s_{|S|} \tag{13} \]
We use the symbol \(s_{i:i + 2}\) to represent the protein word \(\left[s_{i};s_{i + 1};s_{i + 2} \right]\), and then encode the word using Eq. 14:
\[ c_{i}^{3} = \textbf{Embedding}\left( s_{i:i + 2} \right) \tag{14} \]
where the \(\textbf{Embedding}\) function initializes its weights from the standard normal distribution according to the input vocabulary size and embedding dimension, and outputs the word vector stored at the corresponding vocabulary index.
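The overlapping n-gram split and the vocabulary lookup that precedes the embedding layer can be sketched as follows. The example sequence is made up, and reserving index 0 for padding is an assumption of this sketch rather than something stated in the text.

```python
from typing import Dict, List


def ngram_words(sequence: str, n: int) -> List[str]:
    """Split a sequence into overlapping n-gram words with stride 1."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_vocab(words: List[str]) -> Dict[str, int]:
    """Map each distinct word to an index; 0 is kept free for padding."""
    vocab: Dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab


toy_sequence = "MKTAYI"  # illustrative fragment, not a real protein
words_by_scale = {n: ngram_words(toy_sequence, n) for n in (2, 3, 4)}
vocab_3 = build_vocab(words_by_scale[3])
indices_3 = [vocab_3[word] for word in words_by_scale[3]]
```

A sequence of length \(|S|\) yields \(|S| - n + 1\) words, so larger n gives slightly shorter inputs drawn from a much larger vocabulary.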
In this work, inspired by MGraphDTA [21], we set n = 2, 3, and 4 to encode the protein separately in order to detect local residue patterns at different scales. This yields three types of embeddings, \(c_{i}^{2},c_{i}^{3},c_{i}^{4}\). Sequence-based models are well suited to extracting features from protein sequences. The long short-term memory network (LSTM) [41] is a DL model for sequence data that overcomes the vanishing gradient problem. Its main idea is an adaptive gating mechanism that determines the extent to which the LSTM unit maintains its previous state and remembers the features extracted from the current input.
Bidirectional LSTM (BiLSTM) [42] is a variant of LSTM that combines the outputs of two LSTMs, one processing the sequence from left to right and the other from right to left, to capture long-term dependencies and contextual relationships. Since each amino acid residue in a protein sequence is related to residues in both directions, BiLSTM is well suited to processing protein sequences. It is defined as Eq. 15:
\[ \overset{\rightarrow }{h_{i}} = \overset{\rightarrow }{\textbf{LSTM}}\left( c_{i},\overset{\rightarrow }{h_{i - 1}} \right) ,\quad \overset{\leftarrow }{h_{i}} = \overset{\leftarrow }{\textbf{LSTM}}\left( c_{i},\overset{\leftarrow }{h_{i + 1}} \right) ,\quad h_{i} = \left[ \overset{\rightarrow }{h_{i}};\overset{\leftarrow }{h_{i}} \right] \tag{15} \]
where \(\overset{\rightarrow }{h_{i}}\) and \(\overset{\leftarrow }{h_{i}}\) denote the hidden states at time step i computed from left to right and from right to left, respectively, and \(h_i\) denotes the representation of the ith time step obtained by concatenating them.
The word vectors \(c_{i}^{2},c_{i}^{3},c_{i}^{4}\) are fed into the BiLSTM layer to capture the dependencies between characters in the sequence. After the max-pooling layer, the three features are concatenated to obtain the final protein representation. The BiLSTM framework is shown in Fig. 3.
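The pooling and fusion step can be illustrated with the sketch below. It assumes the BiLSTM has already produced one hidden-state vector per position for each n-gram scale; the toy numbers are placeholders, and the recurrent layer itself is not implemented here.

```python
from typing import List, Sequence

Vector = List[float]


def max_pool_over_time(states: Sequence[Vector]) -> Vector:
    """Element-wise maximum over all positions of one sequence."""
    return [max(column) for column in zip(*states)]


def fuse_scales(per_scale_states: Sequence[Sequence[Vector]]) -> Vector:
    """Max-pool each scale separately, then concatenate the pooled vectors."""
    fused: Vector = []
    for states in per_scale_states:
        fused.extend(max_pool_over_time(states))
    return fused


# Placeholder hidden states for the 2-gram, 3-gram, and 4-gram branches.
states_2gram = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
states_3gram = [[0.7, 0.0], [0.2, 0.6]]
states_4gram = [[0.5, 0.5]]
protein_repr = fuse_scales([states_2gram, states_3gram, states_4gram])
```

Max pooling makes the output length independent of the sequence length, so branches with different numbers of n-gram words can be concatenated directly.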
DTA prediction
In this paper, we treat drug–target binding affinity prediction as a regression task. With the representations learned in the previous sections, we can integrate all the information from the drug and the target to predict the DTA value. As shown in Fig. 4, the drug representation and the protein representation are concatenated and fed into two dense fully connected layers to predict the DTA value. ReLU is used as the activation function to introduce nonlinearity. Given the set of drug–target pairs and the ground-truth labels, we use the mean squared error (MSE) as the loss function.
Results and discussion
Metrics
DTA prediction is regarded as a regression problem, and our model was evaluated using three metrics: mean squared error (MSE), concordance index (CI), and regression toward the mean (\(r_m^2\) index). MSE measures the difference between the predicted and actual values using the squared loss:
\[ MSE = \frac{1}{n}\sum_{i = 1}^{n}\left( {\hat{y}}_{i} - y_{i} \right)^{2} \]
where \({\hat{y}}_{i}\) is the predicted value, \(y_i\) is the true value, and n is the number of drug–target pairs. CI measures whether the predicted DTA values of two random drug–target pairs are in the same order as their true values:
\[ CI = \frac{1}{Z}\sum_{d_{x} > d_{y}} h\left( b_{x} - b_{y} \right) ,\qquad h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases} \]
where \(b_x\) is the predicted value for the larger affinity \(d_x\), \(b_y\) is the predicted value for the smaller affinity \(d_y\), and h(x) is the step function. Z is the normalization constant, equal to the number of compared pairs with \(d_x > d_y\).
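Both metrics can be computed directly from their definitions, as in the sketch below. It uses a straightforward quadratic-time loop over pairs, which is adequate for illustration but slow for large test sets; the function names are illustrative.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between true and predicted affinities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of pairs with different true values that are ranked correctly."""
    total = 0.0
    pairs = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:  # pair with d_x > d_y
                pairs += 1
                diff = y_pred[i] - y_pred[j]
                if diff > 0:
                    total += 1.0
                elif diff == 0:
                    total += 0.5
    if pairs == 0:
        raise ValueError("CI is undefined when all true values are equal")
    return total / pairs


toy_true = [5.0, 6.0, 7.0]
ci_perfect = concordance_index(toy_true, [5.1, 6.2, 6.9])
ci_reversed = concordance_index(toy_true, [3.0, 2.0, 1.0])
ci_constant = concordance_index(toy_true, [1.0, 1.0, 1.0])
```

CI depends only on the ordering of predictions, so a model can score well on CI while still having a large MSE, which is why both are reported.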
\(r_m^2\) is used to evaluate the external predictive performance of the model as follows:
\[ r_{m}^{2} = r^{2}\left( 1 - \sqrt{r^{2} - r_{0}^{2}} \right) \]
where \(r^2\) and \(r_0^2\) are the squared correlation coefficients between the true and predicted values with and without intercepts, respectively.
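One common way to compute this index is sketched below. The through-origin coefficient follows the convention used in several public DTA evaluation scripts (slope \(k = \sum y\hat{y} / \sum \hat{y}^2\)); treating that convention as the one used here is an assumption, as is taking the absolute value under the square root to guard against small negative differences.

```python
import math
from typing import Sequence


def squared_pearson(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Pearson correlation (regression with intercept)."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return (cov * cov) / (var_t * var_p)


def squared_r0(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared correlation coefficient for regression through the origin."""
    k = sum(t * p for t, p in zip(y_true, y_pred)) / sum(p * p for p in y_pred)
    mean_t = sum(y_true) / len(y_true)
    residual = sum((t - k * p) ** 2 for t, p in zip(y_true, y_pred))
    total = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - residual / total


def rm2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """r_m^2 = r^2 * (1 - sqrt(|r^2 - r_0^2|))."""
    r2 = squared_pearson(y_true, y_pred)
    r02 = squared_r0(y_true, y_pred)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


toy_true = [1.0, 2.0, 3.0, 4.0]
rm2_perfect = rm2(toy_true, [1.0, 2.0, 3.0, 4.0])
rm2_noisy = rm2(toy_true, [1.1, 1.9, 3.2, 3.8])
```

The index penalizes models whose fit depends heavily on the intercept, so it is stricter than \(r^2\) alone.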
Comparison with existing methods
To evaluate the performance of our model, we compared it with other methods for DTA prediction, including KronRLS [11], SimBoost [12], DeepDTA [16], WideDTA [43], MATT_DTI [27], DeepGS [44], AttentionDTA [45], GraphDTA [19], and DeepGLSTM [46]. Table 2 shows the performance of the different models in terms of MSE, CI, and \(r_m^2\) on the Davis dataset. On this dataset, our method outperformed the other methods in terms of MSE (0.218) and \(r_m^2\) (0.719), which are 4.8% and 4.8% better than the previous best method, respectively. The CI of SubMDTA was within 0.001 of that of the best method, DeepGLSTM.
In addition, we evaluated our model on the KIBA dataset. As shown in Table 3, SubMDTA achieved the best performance among existing methods, with an MSE of 0.129, a CI of 0.898, and an \(r_m^2\) of 0.793; its MSE was 3% lower than that of the previous best method.
The above results show that the proposed method can be considered an accurate and effective tool for DTA prediction. The advantage of our model over other models can be attributed to two factors: (i) to obtain more discriminative molecular representations, we utilized the local and global information of molecules through a pretraining task that focuses on the structural features of the molecular graph; (ii) compared with conventional embedding methods for protein sequences, our method used multiple n-gram sequence representations containing multilevel information. Thus, our model can integrate the intrinsic information of compounds and protein sequences into a more comprehensive representation, which helps to improve the accuracy and robustness of the model.
Comparison with different drug molecular representations
Because drug molecules have complex structures, their features are difficult to obtain directly, so dedicated representation methods are required. We compared graph-based representation methods with molecular fingerprint methods. SubMDTA first converts the SMILES string of the drug molecule into a molecular graph and then uses one-hot encoding of the atomic attributes to obtain the features of the drug molecule. A molecular fingerprint converts a molecular structure into a binary or sparse vector representation in which each bit or feature represents a specific substructure or chemical property of the molecule. In this section, the Morgan fingerprint [47] and the MACCS fingerprint [48] were used for comparison. SubMDTA, Morgan, and MACCS achieved MSEs of 0.218, 0.221, and 0.222, respectively. As shown in Fig. 5, SubMDTA obtained the best result among the three, which may be because the graph-based method better captures the detailed structure of molecules.
Constructing effective GNNs that extract discriminative drug features is essential for improving the accuracy of DTA prediction. Empirically, single-layer networks often cannot obtain sufficient information compared with multilayer networks, whereas too many layers may cause over-smoothing. Therefore, a four-layer GNN was used in the proposed method. We tried three types of GNN architectures (GCN, GAT, and GIN) for performance comparison. Fig. 6a and Fig. 6c show that the GIN model achieves an MSE of 0.218 and an \(r_m^2\) of 0.719, which is the best performance. As shown in Fig. 6b, the CI of the GAT model is 0.897, which is slightly higher than the 0.894 of GIN, but the difference is small. The advantage of GIN may arise because it can capture local features in the graph while retaining global information, thus improving its representation ability.
Comparison with different protein representations
For protein feature representation, we propose a method based on the fusion of n-gram multiscale features. We therefore explored the effects of different protein sequence embedding methods, namely one-hot coding, 2-gram, 3-gram, and 4-gram coding, and n-gram fusion coding; the experimental results are shown in Fig. 7. One-hot coding achieved an MSE of 0.234, a CI of 0.892, and an \(r_m^2\) of 0.695. Compared with one-hot encoding, n-gram encoding provided better representations by capturing multiple characters in the sequence, and the MSE reached 0.226, 0.225, and 0.223 using 2-gram, 3-gram, and 4-gram coding, respectively.
The multiscale representation performed best among them. This is because the whole protein sequence contains many subsequences or structural domains, and introducing multiscale features captures more amino acid combinations and leads to better performance.
For protein feature extraction, we chose a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU) as comparison methods. A CNN extracts features from input data through convolution operations. BiGRU is a variant of the recurrent neural network that consists of two GRUs for forward and backward processing. CNN and BiGRU achieved MSEs of 0.225 and 0.232, respectively. SubMDTA achieved an MSE of 0.218, an improvement of 3.1% and 6.0% over CNN and BiGRU, respectively. As can be seen from Fig. 8, SubMDTA obtained the best MSE, which demonstrates its advantage in processing protein sequence data.
Ablation study
To verify the effectiveness of the proposed model, we designed and conducted ablation experiments to determine the contributions of different factors of the model.
In the proposed model, maximizing the mutual information between the graph and the subgraph representations in the SSL task helps to preserve substructure information. To demonstrate the advantages of substructures, we designed three variants, SubMDTA-a, SubMDTA-b, and SubMDTA-c, to evaluate the importance of the pretraining task module. As shown in Table 4, SubMDTA-a obtained an MSE of 0.224, and SubMDTA-b and SubMDTA-c, which introduce contrastive learning, obtained MSEs of 0.225 and 0.230, respectively. The full SubMDTA, which combines the two contrastive learning methods, reached the lowest MSE of 0.218. This may be because using one type of mutual information alone cannot obtain comprehensive features. Meanwhile, maximizing the mutual information between the graph representation and the reconstructed graph representation enables the embedding to focus on the global features of the graph.
Case study
To verify the robustness of the proposed method, we carried out a case study on approved drugs in DrugBank that target the Type-1 angiotensin II receptor. Following steps similar to those of MSF-DTA [49], after training SubMDTA on the Davis dataset, we predicted the affinities between the receptor and 1781 available small-molecule drugs, 9 of which are known to bind this receptor. To ensure a fair comparison, this receptor never appeared in the Davis dataset. The predicted affinities between the nine drugs and the receptor are listed in descending order in Table 5. According to the ranking produced by SubMDTA, 8 drugs are ranked in the top 13% of the 1781 drugs, and 7 drugs appear in the top 4%. These results suggest that SubMDTA can identify drugs that interact with a novel target protein and has the potential to be developed into a predictive tool.
Conclusion
In this paper, we present SubMDTA, a new model that uses self-supervised learning and multiscale features for DTA prediction. The drug representations are extracted by contrastive learning between graph-level and subgraph representations and between graph-level and reconstructed graph representations, and are then refined by the downstream task. In addition, multiscale sequence features were fused to learn protein representations, which captured long-distance and multiple relationships in amino acid sequences. The experimental results showed that our method outperformed existing methods. In future work, we will take into account the progress made in heterogeneous information networks [50] and incorporate it to enhance the prediction ability of our models.
Availability of data and materials
The code and data are provided at https://github.com/1q84er/SubMDTA
Abbreviations
DTA: Drug–target affinity
DL: Deep learning
SMILES: Simplified molecular input line entry system
BiLSTM: Bidirectional long short-term memory
LSTM: Long short-term memory
GCN: Graph convolutional networks
GAT: Graph attention networks
GIN: Graph isomorphism networks
SSL: Self-supervised learning
References
Vermaas JV, Sedova A, Baker MB, Boehm S, Rogers DM, Larkin J, Glaser J, Smith MD, Hernandez O, Smith JC. Supercomputing pipelines search for therapeutics against covid-19. Comput Sci Eng. 2020;23(1):7–16.
Lin X, Li X, Lin X. A review on applications of computational methods in drug screening and design. Molecules. 2020;25(6):1375.
Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS, Olson AJ. Autodock4 and autodocktools4: automated docking with selective receptor flexibility. J Comput Chem. 2009;30(16):2785–91.
Trott O, Olson AJ. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455–61.
John S, Thangapandian S, Sakkiah S, Lee KW. Potent bace1 inhibitor design using pharmacophore modeling, in silico screening and molecular docking studies. BMC Bioinform. 2011;12(1):1–11.
Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. Similarity metrics for ligands reflecting the similarity of the target proteins. J Chem Inf Comput Sci. 2003;43(2):391–405.
Klabunde T. Chemogenomic approaches to drug discovery: similar receptors bind similar ligands. Br J Pharmacol. 2007;152(1):5–7.
Shaik NA, Hakeem KR, Banaganapalli B, Elango R. Essentials of bioinformatics, vol. i. Cham: Springer International Publishing; 2019.
Yang C, Chen EA, Zhang Y. Protein–ligand docking in the machine-learning era. Molecules. 2022;27(14):4568.
Wang F, Liu D, Wang H, Luo C, Zheng M, Liu H, Zhu W, Luo X, Zhang J, Jiang H. Computational screening for active compounds targeting protein sequences: methodology and experimental validation. J Chem Inf Model. 2011;51(11):2821–8.
Pahikkala T, Airola A, Pietilä S, Shakyawar S, Szwajda A, Tang J, Aittokallio T. Toward more realistic drug–target interaction predictions. Brief Bioinform. 2015;16(2):325–37.
He T, Heidemeyer M, Ban F, Cherkasov A, Ester M. Simboost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J Cheminform. 2017;9(1):1–14.
Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37(suppl_2):623–33.
Smith TF, Waterman MS, et al. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
Wu Y, Gao M, Zeng M, Zhang J, Li M. Bridgedpi: a novel graph neural network for predicting drug-protein interactions. Bioinformatics. 2022;38(9):2571–8.
Öztürk H, Özgür A, Ozkirimli E. Deepdta: deep drug–target binding affinity prediction. Bioinformatics. 2018;34(17):821–9.
Abbasi K, Razzaghi P, Poso A, Amanlou M, Ghasemi JB, Masoudi-Nejad A. Deepcda: deep cross-domain compound-protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics. 2020;36(17):4633–42.
Yuan W, Chen G, Chen CYC. Fusiondta: attention-based feature polymerizer and knowledge distillation for drug–target binding affinity prediction. Brief Bioinform. 2022;23(1):506.
Nguyen T, Le H, Quinn TP, Nguyen T, Le TD, Venkatesh S. Graphdta: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37(8):1140–7.
Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z. Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 2020;10(35):20701–12.
Yang Z, Zhong W, Zhao L, Chen CYC. Mgraphdta: deep multiscale graph neural network for explainable drug–target binding affinity prediction. Chem Sci. 2022;13(3):816–33.
Lin S, Shi C, Chen J. Generalizeddta: combining pretraining and multitask learning to predict drug–target binding affinity for unknown drug discovery. BMC Bioinform. 2022;23(1):1–17.
Li Z, Jiang M, Wang S, Zhang S. Deep learning methods for molecular representation and property prediction. Drug Discov Today. 2022. https://doi.org/10.1016/j.drudis.2022.103373.
Sun FY, Hoffmann J, Verma V, Tang J. Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000 2019.
Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X, Gao P, Xie G, Song S. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform. 2021;22(6):109.
Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst. 2020;33:12559–71.
Zeng Y, Chen X, Luo Y, Li X, Peng D. Deep drug–target binding affinity prediction with multiple attention blocks. Brief Bioinform. 2021;22(5):117.
Chen L, Tan X, Wang D, Zhong F, Liu X, Yang T, Luo X, Chen K, Jiang H, Zheng M. Transformercpi: improving compound-protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics. 2020;36(16):4406–14.
Wang C, Liu Z. Learning graph representation by aggregating subgraphs via mutual information maximization. arXiv preprint arXiv:2103.13125 2021.
Davis MI, Hunt JP, Herrgard S, Ciceri P, Wodicka LM, Pallares G, Hocker M, Treiber DK, Zarrinkar PP. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol. 2011;29(11):1046–51.
Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Aittokallio T. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model. 2014;54(3):735–43.
Weininger D. Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–6.
Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 2018.
Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F, Bellis LJ, De Veij M, Leach AR. An open source chemical structure curation pipeline using RDKit. J Cheminform. 2020;12:1–16.
Ramsundar B, Eastman P, Walters P, Pande V. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O'Reilly Media; 2019.
Velickovic P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD. Deep graph infomax. ICLR (Poster). 2019;2(3):4.
Park C, Han J, Yu H. Deep multiplex graph infomax: attentive multiplex network embedding using global information. Knowl-Based Syst. 2020;197:105861.
Nowozin S, Cseke B, Tomioka R. f-gan: Training generative neural samplers using variational divergence minimization. Adv Neural Inf Process Syst 2016;29.
Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, Bengio Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 2018.
Dong QW, Wang XL, Lin L. Application of latent semantic analysis to protein remote homology detection. Bioinformatics. 2006;22(3):285–90.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Sign Process. 1997;45(11):2673–81.
Öztürk H, Ozkirimli E, Özgür A. Widedta: prediction of drug–target binding affinity. arXiv preprint arXiv:1902.04166 2019.
Lin X. Deepgs: Deep representation learning of graphs and sequences for drug–target binding affinity prediction. arXiv preprint arXiv:2003.13902 2020.
Zhao Q, Xiao F, Yang M, Li Y, Wang J. Attentiondta: prediction of drug–target binding affinity using attention model. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM), 2019; IEEE, pp. 64–69.
Mukherjee S, Ghosh M, Basuchowdhuri P. Deepglstm: deep graph convolutional network and lstm based approach for predicting drug–target binding affinity. In: Proceedings of the 2022 SIAM international conference on data mining (SDM), 2022; SIAM, 729–737.
Zhao BW, You ZH, Hu L, Guo ZH, Wang L, Chen ZH, Wong L. A novel method to predict drug–target interactions based on large-scale graph representation learning. Cancers. 2021;13(9):2111.
Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of mdl keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42(6):1273â€“80.
Ma W, Zhang S, Li Z, Jiang M, Wang S, Guo N, Li Y, Bi X, Jiang H, Wei Z. Predicting drug–target affinity by learning protein knowledge from biological networks. IEEE J Biomed Health Inform. 2023;27(4):2128–37.
Zhao BW, Wang L, Hu PW, Wong L, Su XR, Wang BQ, You ZH, Hu L. Fusing higher and lower-order biological information for drug repositioning via graph representation learning. IEEE Trans Emerg Topics Comput. 2023. https://doi.org/10.1109/TETC.2023.3239949.
Acknowledgements
Not applicable.
Funding
This work has been supported by Shandong Key Science and Technology Innovation Project [2021CXGC011003] and Qingdao Key Technology Research and Industrialization Projects [2232qljh8gx].
Author information
Contributions
Conceptualization, SP and ZL; methodology, SP; validation, LX; dataset, LX; writing – original draft preparation, SP; writing – review and editing, ZL. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Pan, S., Xia, L., Xu, L. et al. SubMDTA: drug target affinity prediction based on substructure extraction and multiscale features. BMC Bioinformatics 24, 334 (2023). https://doi.org/10.1186/s12859-023-05460-4
DOI: https://doi.org/10.1186/s12859-023-05460-4
Keywords
 Drug–target binding affinity
 Self-supervised learning
 Mutual information
 Multiscale features