To enable DTA prediction for unknown drug discovery, this study proposes the GeneralizedDTA model, which combines self-supervised pre-training and multi-task learning. Two protein pre-training tasks are adopted to learn structural information from amino acid sequences, and a new drug pre-training task is designed to learn structural information from the molecular graphs of drug compounds. To alleviate catastrophic forgetting of the pre-trained parameters, a multi-task learning framework with a dual adaptation mechanism is developed to prevent the prediction model from overfitting. Figure 2 shows the architecture of GeneralizedDTA, which includes four modules: the protein encoding layer, the drug encoding layer, the DTA prediction layer, and the multi-task learning framework.
Protein encoding layer
The protein encoding layer encodes the amino acid sequences of proteins as vectors by means of protein pre-training tasks. Inspired by BERT [27], this study adopts a transformer model with multi-head attention as the encoder for amino acid sequences. Given an amino acid sequence \(t=\left[ t_{1}, \ldots , t_{n}\right]\), where each \(t_{i}\) is one of the 21 amino acid types, the transformer model converts it into \(z=\left[ z_{1}, \ldots , z_{n}\right]\) as follows:
$$\begin{aligned}&z={\text {Transformer}}(Q, K, V ; t)=\text {Concat}\left( \text {head}_{1}, \ldots , \text {head}_{h}\right) W^{o} \end{aligned}$$
(1)
$$\begin{aligned}&\text{ head}_{i}=\text{ Attention }(Q, K, V) \end{aligned}$$
(2)
$$\begin{aligned}&{\text {Attention}}(Q, K, V)={\text {softmax}}\left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned}$$
(3)
where \(Q \in {\mathbb {R}}^{d_{1} \times d_{2}}, K \in {\mathbb {R}}^{d_{1} \times d_{2}}, V \in {\mathbb {R}}^{d_{1} \times d_{2}}\) are the parameters of attention, h is the number of heads, \(W^{o} \in {\mathbb {R}}^{d_{1} \times d_{1}}\) is the weight of the heads, and \(d_{k}\) is the dimension of the keys. The self-attention function computes the dot products of each query with all keys simultaneously, scales them by \(\sqrt{d_{k}}\), and applies a softmax function to obtain the weights on the values [28]. It can be simplified as a parameterized function Transformer(\(\bullet\)) with the parameter set \(\theta\):
$$\begin{aligned} z=\text{ Transformer }(\theta ; t), \theta =\left\{ Q, K, V, W^{o}\right\} \end{aligned}$$
(4)
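As an illustration, the following sketch (in PyTorch) shows how a multi-head transformer encoder of the form of Eq. (4) can map a tokenized amino acid sequence t to contextual vectors z; the vocabulary size, model dimension, head count and layer count are illustrative assumptions, not the settings used in this study.

```python
# A minimal sketch (PyTorch): a transformer encoder mapping a tokenized amino acid
# sequence t to contextual vectors z, as in Eq. (4). VOCAB_SIZE, D_MODEL, N_HEADS and
# N_LAYERS are illustrative values, not the hyper-parameters of this study.
import torch
import torch.nn as nn

VOCAB_SIZE = 25                      # 21 amino acid types plus a few special tokens (assumed)
D_MODEL, N_HEADS, N_LAYERS = 128, 8, 6

class ProteinEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=N_LAYERS)

    def forward(self, t):                        # t: (batch, seq_len) token ids
        return self.transformer(self.embed(t))   # z: (batch, seq_len, D_MODEL)

z = ProteinEncoder()(torch.randint(0, VOCAB_SIZE, (2, 50)))   # toy usage
```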
Based on the transformer model, this study adopts two pre-training tasks to obtain structural information of amino acid sequences of proteins.
Masked Language Modeling (MLM) Task [28]: this task masks some amino acids at random and predicts their types. Given a masked amino acid sequence t and a masked amino acid set \(m=\left\{ m_{1}, m_{2}, \ldots , m_{N}\right\}\), the MLM decoder predicts the masked amino acids of t as follows:
$$\begin{aligned} z= & {} \text{ Transformer }(\theta ; t) \end{aligned}$$
(5)
$$\begin{aligned} m^{\prime }= & {} F C\left( \theta _{1}; z\right) \end{aligned}$$
(6)
where \(F C(\bullet )\) is a fully connected neural network (FC) with the parameter \(\theta _{1}\) and \(m^{\prime }=\left\{ m_{1}^{\prime }, m_{2}^{\prime }, \ldots , m_{N}^{\prime }\right\}\) is the set of predicted amino acids for the whole masked amino acid set. The negative log-likelihood is then used as the loss function of the MLM task:
$$\begin{aligned} {\mathcal {L}}^{\mathbf {MLM}}\left( \theta , \theta _{1} ; m\right) =-\sum _{i=1}^{N}\left[ m_{i} \ln m_{i}^{\prime }+\left( 1-m_{i}\right) \ln \left( 1-m_{i}^{\prime }\right) \right] \end{aligned}$$
(7)
Through the MLM task, the transformer model can effectively learn bidirectional contextual representations of the amino acid sequences of proteins.
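A minimal sketch of the MLM task is given below (PyTorch); the 15% masking rate, the mask token id, and the FC head dimensions are illustrative assumptions, and the loss is the multi-class cross-entropy form of Eq. (7) computed over the masked positions only.

```python
# A minimal MLM sketch (PyTorch). MASK_ID, the 15% masking rate and the FC head
# are assumptions made for illustration; the encoder can be the hypothetical
# ProteinEncoder sketched above.
import torch
import torch.nn as nn

MASK_ID, N_TYPES, D_MODEL = 21, 21, 128        # assumed mask token id and class count

def mlm_step(encoder, fc_head, t):
    """t: (batch, seq_len) amino acid ids in [0, 21)."""
    mask = torch.rand(t.shape) < 0.15                   # choose ~15% of positions
    t_masked = t.masked_fill(mask, MASK_ID)             # replace them by the mask token
    z = encoder(t_masked)                               # (batch, seq_len, D_MODEL)
    logits = fc_head(z)                                 # (batch, seq_len, N_TYPES)
    return nn.functional.cross_entropy(logits[mask], t[mask])

fc_head = nn.Linear(D_MODEL, N_TYPES)                   # the FC decoder of Eq. (6)
```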
Same Family Prediction (SFP) Task [29, 30]: this task trains the model to determine whether two proteins belong to the same family. To pre-train the transformer model with the SFP task, this study selects two amino acid sequences \(t^{1}\) and \(t^{2}\) from the Pfam dataset. Random sampling ensures that the pair is equally likely to come from the same family or from different families. For the protein pair \(\left\langle t^{1}, t^{2}\right\rangle\), an FC with dropout [31] is used to calculate their similarity value:
$$\begin{aligned} {\hat{c}}=F C\left( \theta _{2} ; z_p\right) \end{aligned}$$
(8)
where \(\theta _{2} \in {\mathbb {R}}^{|z| \times 2}\) is the parameter of the FC, \(z_p=\left[ z_{1}^{1}, \cdots , z_{n_{1}}^{1}, z_{1}^{2}, \cdots , z_{n_{2}}^{2}\right] \in {\mathbb {R}}^{|z|\times 1}\) is the vector representation of \(\left\langle t^{1}, t^{2}\right\rangle\), and \({\hat{c}} \in {\mathbb {R}}^{2 \times 1}\) is the predicted similarity value, i.e., the probability that the protein pair belongs to the same protein family. The SFP task trains the model to minimize the cross-entropy loss, which penalizes prediction errors on probabilities. Therefore, this study adopts the negative log-likelihood to measure the SFP loss:
$$\begin{aligned} {\mathcal {L}}^{\mathrm {SFP}}\left( \theta , \theta _{2} ; t\right) =-\ln p\left( n=n_{i} \mid \theta , \theta _{2}\right) , \quad n_{i} \in \{ \text{ same } \text{ family, } \text{ not } \text{ same } \text{ family}\} \end{aligned}$$
(9)
Because the transformer model is required to produce a higher similarity value for proteins from the same family, the SFP task enables it to better absorb global structural information of the amino acid sequences of proteins.
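The following sketch illustrates the SFP task under the same assumptions as above; mean-pooling of the token representations before concatenation is a simplifying assumption rather than the exact construction of \(z_p\).

```python
# A minimal SFP sketch (PyTorch, using the hypothetical ProteinEncoder above). The
# two sequences are encoded, their pooled vectors are concatenated into z_p, and an
# FC head with dropout predicts whether they share a Pfam family (Eqs. 8-9).
import torch
import torch.nn as nn

class SFPHead(nn.Module):
    def __init__(self, d_model=128, p_drop=0.1):
        super().__init__()
        self.fc = nn.Sequential(nn.Dropout(p_drop), nn.Linear(2 * d_model, 2))

    def forward(self, z1, z2):                    # each: (batch, seq_len, d_model)
        z_p = torch.cat([z1.mean(dim=1), z2.mean(dim=1)], dim=-1)
        return self.fc(z_p)                       # logits: same family / not same family

def sfp_loss(logits, labels):                     # labels: 1 = same family, 0 = not
    return nn.functional.cross_entropy(logits, labels)
```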
Drug encoding layer
The drug encoding layer encodes the molecular graphs of drug compounds as vectors through a new drug pre-training task. It adopts a graph convolutional network (GCN) [32] to mine potential relationships from the molecular graphs of drug compounds.
Given a molecular graph of a drug compound \({\mathcal {G}}=({\mathcal {V}}, {\mathcal {E}}, {\mathcal {X}}, {\mathcal {Z}})\), where \({\mathcal {V}}\) is the chemical atom set, \({\mathcal {E}}\) is the chemical bond set, and \({\mathcal {X}} \in {\mathbb {R}}^{|{\mathcal {V}}| \times d_{v}}\) and \({\mathcal {Z}} \in {\mathbb {R}}^{|{\mathcal {E}}| \times d_{e}}\) are the atom and bond feature sets, respectively, the GCN performs two key computations, “update” and “aggregate”, for each atom at every layer. They can be represented as a parameterized function \(\Psi (\bullet )\) with the parameter \(\psi\):
$$\begin{aligned} \begin{aligned} {\mathbf {h}}_{v}^{l}&=\Psi (\psi ; {\mathcal {A}}, {\mathcal {X}}, {\mathcal {Z}})^{l}\\&={\text {UPDATE}}\left( {\mathbf {h}}_{v}^{l-1}, {\text {AGGREGATE}}\left( \left\{ \left( {\mathbf {h}}_{v}^{l-1}, {\mathbf {h}}_{u}^{l-1}, {\mathbf {z}}_{u v}\right) : u \in {\mathcal {N}}_{v}\right\} \right) \right) \end{aligned} \end{aligned}$$
(10)
where \(u, v \in {\mathcal {V}}\) are two chemical atoms, \(z_{u v}\) is the feature vector of the chemical bond (u, v), \(\mathrm {h}_{v}^{0}=\mathrm {x}_{v} \in {\mathcal {X}}\) is the input of GCN and represents the feature of atom v, \(\mathrm {h}_{v}^{l}\) represents the feature of atom v on the l-th layer of GCN, \({\mathcal {A}}\) is the adjacency matrix of drug compound \({\mathcal {G}}\), and \({\mathcal {N}}_{v}\) is the neighborhood atom set of atom v.
In order to get a representation of drug compound \({\mathcal {G}}\), the POOLING function on the last GCN layer is used to transform the molecular graph into a vector:
$$\begin{aligned} {\mathbf {h}}_{{\mathcal {G}}}={\text {POOLING}}\left( \left\{ {\mathbf {h}}_{v}^{l} \mid v \in {\mathcal {V}}\right\} \right) \end{aligned}$$
(11)
where \({\mathbf {h}}_{{\mathcal {G}}}\) is the vector representation of drug compound \({\mathcal {G}}\) and POOLING is a simple pooling function such as max- or mean-pooling [33, 34]. For simplicity, we represent the GCN as follows:
$$\begin{aligned} {\mathbf {h}}_{{\mathcal {G}}}=G C N (\psi ; {\mathcal {G}}) \end{aligned}$$
(12)
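For illustration, the following sketch implements Eqs. (10)-(12) in plain PyTorch with a dense adjacency matrix; the bond features \({\mathcal {Z}}\) are omitted for brevity, and practical implementations usually rely on a graph library such as PyTorch Geometric.

```python
# A minimal sketch of Eqs. (10)-(12) in plain PyTorch with a dense adjacency matrix.
# Bond features Z are omitted for brevity; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    def __init__(self, d_in, d_hidden, n_layers=5):
        super().__init__()
        dims = [d_in] + [d_hidden] * n_layers
        self.layers = nn.ModuleList([nn.Linear(dims[i], dims[i + 1])
                                     for i in range(n_layers)])

    def forward(self, A, X):               # A: (n_atoms, n_atoms), X: (n_atoms, d_in)
        A_hat = A + torch.eye(A.size(0))   # add self-loops so each atom keeps its own state
        deg = A_hat.sum(dim=1, keepdim=True)
        h = X
        for layer in self.layers:          # AGGREGATE over neighbours, then UPDATE
            h = torch.relu(layer(A_hat @ h / deg))
        h_G = h.mean(dim=0)                # POOLING of Eq. (11): mean over all atoms
        return h, h_G                      # atom embeddings and the graph embedding h_G
```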
Based on the GCN model, this study designs a new pre-training task to learn structural information of molecular graphs of drug compounds.
Drug Pre-training (DP) Task: this new task is designed to improve the representation learning capability on drugs by encouraging similar embeddings for neighboring chemical atoms in the molecular graph of a drug compound [35]. Aggregation is a key computation in each layer of the GCN: in compound-level aggregation, neighboring chemical atoms aggregate their information according to Eq. (10) [36, 37]. For each chemical atom \(v \in {\mathcal {V}}\), the GCN obtains its representation \({\mathbf {h}}_{v}\) via \(\Psi (\bullet )\) in Eq. (10). Therefore, as shown in Fig. 3, given a random atom u as the center node, the self-supervised loss function [38] is chosen to realize the DP task, i.e., to encourage similar embeddings for neighboring chemical atoms:
$$\begin{aligned} {\mathcal {L}}^{\text{ atom } }\left( \psi ; {\mathcal {G}}\right) =\sum _{(u, v) \in {\mathcal {G}}}-\ln \left( \sigma \left( {\mathbf {h}}_{u}^{\top } {\mathbf {h}}_{v}\right) \right) -\ln \left( \sigma \left( -{\mathbf {h}}_{u}^{\top } {\mathbf {h}}_{v^{\prime }}\right) \right) \end{aligned}$$
(13)
where v is the context anchor node directly connected to the center node u, \(v^{\prime }\) is a negative context node not directly connected to u, \(\psi\) is the parameter of the GCN, and \(\sigma\) is the sigmoid function. With five GCN layers, each atom embedding absorbs almost all of the small local structures in the molecular graph [39, 40].
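A minimal sketch of the DP loss in Eq. (13) is given below; it assumes the atom embeddings produced by the GCN sketch above and draws one random negative context node per chemical bond.

```python
# A minimal sketch of the DP loss in Eq. (13), assuming atom embeddings h from the
# SimpleGCN sketch above. One random negative context node is drawn for each bond;
# it may occasionally be a true neighbour, which a fuller implementation would exclude.
import torch
import torch.nn.functional as F

def dp_loss(h, edge_index):
    # h: (n_atoms, d) atom embeddings; edge_index: (2, n_edges) atom index pairs (u, v)
    u, v = edge_index
    v_neg = torch.randint(0, h.size(0), (u.size(0),))   # random negative context nodes
    pos = (h[u] * h[v]).sum(dim=-1)                      # h_u^T h_v for connected pairs
    neg = (h[u] * h[v_neg]).sum(dim=-1)                  # h_u^T h_v' for negative pairs
    return (-F.logsigmoid(pos) - F.logsigmoid(-neg)).sum()
```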
DTA prediction layer
The DTA prediction layer associates the drug compound with the protein to predict their binding affinity. This study adopts an FC for DTA prediction. For a given drug–protein pair \(\left\langle {\mathcal {G}}, t\right\rangle\), where \({\mathcal {G}}\) is the molecular graph of the drug compound and t is the amino acid sequence, the drug compound vector \({\varvec{h}}_{\mathcal {G}}\) and the protein vector \(z_p\) are obtained from the drug encoding layer and the protein encoding layer, respectively. Their binding affinity \({\hat{y}}\) is then predicted as follows:
$$\begin{aligned} {\hat{y}}=F C\left( \gamma ; {\text {Concat}}\left( {\varvec{h}}_{\mathcal {G}}, z_p\right) \right) \end{aligned}$$
(14)
where \(\gamma\) is the parameter of the fully connected layers and Concat(\(\bullet\)) indicates that the input is the concatenated vector of \({\varvec{h}}_{\mathcal {G}}\) and \(z_p\).
The DTA prediction task trains the model to minimize the loss function. This study adopts the mean squared error (MSE) as the loss function:
$$\begin{aligned} {\mathcal {L}}^{\text{ affinities } }(\theta , \psi , \gamma ;\left\langle {\mathcal {G}}, t\right\rangle )=\frac{1}{2}({\hat{y}}-y)^{2} \end{aligned}$$
(15)
where \({\hat{y}}\) is the predicted binding affinity of the drug–protein pair, y is the true value, and \(\theta , \psi , \gamma\) together constitute the model parameters.
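The following sketch illustrates Eqs. (14)-(15); the hidden layer size is an illustrative assumption, and \({\varvec{h}}_{\mathcal {G}}\) and \(z_p\) are the vectors produced by the drug and protein encoders.

```python
# A minimal sketch of the DTA prediction layer (Eqs. 14-15). The hidden size is an
# illustrative assumption; h_G and z_p come from the drug and protein encoding layers.
import torch
import torch.nn as nn

class DTAPredictor(nn.Module):
    def __init__(self, d_drug=128, d_protein=128, d_hidden=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(d_drug + d_protein, d_hidden),
                                nn.ReLU(),
                                nn.Linear(d_hidden, 1))

    def forward(self, h_G, z_p):                         # Concat then FC, as in Eq. (14)
        return self.fc(torch.cat([h_G, z_p], dim=-1)).squeeze(-1)

def affinity_loss(y_hat, y):                             # MSE loss of Eq. (15)
    return 0.5 * ((y_hat - y) ** 2).mean()
```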
Multi-task learning framework with a dual adaptation mechanism
This study adopts multi-task learning to link the encoder, i.e., the pre-training tasks, and the decoder, i.e., the DTA prediction task, in order to prevent the overfitting caused by local optimality when the supervised samples are relatively few. To bias the overall model toward the main task, DTA prediction, this study adopts the update strategy of MAML [41].
The drug pre-training task is defined as the query set. For this task, we adjust the prior parameter \(\psi\) of compound-level aggregation with one or a few gradient descent steps, using a learning rate \(\alpha\) for dual adaptation. The new prior parameter \(\psi ^{\prime }\) is obtained as follows:
$$\begin{aligned} \psi ^{\prime }=\psi -\alpha \frac{\partial {\mathcal {L}}^{\text{ atom } }\left( \psi ; {\mathcal {G}}\right) }{\partial \psi } \end{aligned}$$
(16)
Then, the FC parameter \(\gamma\) in the DTA prediction layer, which is defined as the support set, is updated as follows:
$$\begin{aligned} \gamma ^{\prime }=\gamma -\alpha \frac{\partial {\mathcal {L}}^{\text{ affinities } }\left( \psi ^{\prime }, \gamma ;( {\mathcal {G}}, t)\right) }{\partial \gamma } \end{aligned}$$
(17)
After that, all the parameters are updated through backpropagation of the overall loss function of the multi-task learning. We define the overall loss function as follows:
$$\begin{aligned} {\mathcal {L}}^{\text{ all } }=\lambda _{\text{ atom } } {\mathcal {L}}^{\text{ atom } }+{\mathcal {L}}^{\text{ affinities } } \end{aligned}$$
(18)
where \(\lambda _{\text{ atom }}\), set manually, is the weight of the loss function of the drug pre-training task. This study updates all learnable parameters by gradient descent. Before the drug pre-training task, we record the original model parameters, and take the parameters (query set) updated for the first time during pre-training as the prior parameters of the subsequent DTA prediction. The combined loss function of DTA prediction and the drug pre-training task is taken as the objective function of dual adaptation. The original parameters are then updated through the multi-task learning framework. In contrast to the frozen strategy, the model parameters updated in this step are the original parameters rather than the prior parameters.
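The following sketch illustrates one dual adaptation step under the assumptions of the earlier sketches (SimpleGCN, DTAPredictor, dp_loss and affinity_loss). The inner updates of Eqs. (16)-(17) use plain gradient steps, and the outer update of Eq. (18) is applied to the recorded original parameters in a first-order manner, which is a simplification of the full MAML update.

```python
# A minimal sketch of one dual adaptation step (Eqs. 16-18), assuming the earlier
# SimpleGCN, DTAPredictor, dp_loss and affinity_loss sketches. psi denotes the GCN
# parameters and gamma the FC parameters; learning rates are illustrative.
import torch

def dual_adaptation_step(gcn, predictor, z_p, A, X, edge_index, y,
                         alpha=1e-3, outer_lr=1e-3, lam_atom=0.1):
    params = list(gcn.parameters()) + list(predictor.parameters())
    original = [p.detach().clone() for p in params]      # record original parameters

    # Eq. (16): adapt the GCN prior psi with one gradient step on the DP loss
    h, _ = gcn(A, X)
    g_psi = torch.autograd.grad(dp_loss(h, edge_index), list(gcn.parameters()))
    with torch.no_grad():
        for p, g in zip(gcn.parameters(), g_psi):
            p -= alpha * g

    # Eq. (17): adapt the FC parameter gamma on the DTA loss under the new prior
    h, h_G = gcn(A, X)
    g_gamma = torch.autograd.grad(affinity_loss(predictor(h_G, z_p), y),
                                  list(predictor.parameters()))
    with torch.no_grad():
        for p, g in zip(predictor.parameters(), g_gamma):
            p -= alpha * g

    # Eq. (18): overall loss at the adapted parameters; its gradients update the
    # recorded original parameters rather than the adapted (prior) ones
    h, h_G = gcn(A, X)
    l_all = lam_atom * dp_loss(h, edge_index) + affinity_loss(predictor(h_G, z_p), y)
    grads = torch.autograd.grad(l_all, params)
    with torch.no_grad():
        for p, p0, g in zip(params, original, grads):
            p.copy_(p0 - outer_lr * g)
    return l_all.item()
```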
The dual adaptation mechanism requires saving all learnable parameters of a pre-training task. For the multi-head transformer, this would greatly increase training time. Moreover, this study mainly focuses on unknown drugs and therefore introduces the drug pre-training task into DTA prediction. Consequently, the multi-task learning framework in this study combines only the drug pre-training task with the DTA prediction task via the above dual adaptation mechanism.