Building the datasets and preprocessing
In this study, we used the PDBbind database (version 2016), which contains 13,308 protein–ligand complexes. PDBbind provides the 3D structures of protein–ligand complexes together with their experimentally determined binding affinities, expressed as dissociation (Kd), inhibition (Ki), or half-maximal inhibitory concentration (IC50) constants. From the PDBbind database, Wang et al. [10] compiled a refined set of high-quality complexes and binding data according to the following conditions. First, only complexes whose X-ray crystal structures had a resolution of 2.5 Å or better were considered. Second, only complexes with known dissociation or inhibition constants within an acceptable range (pKd and pKi values between 2 and 12) were considered. Third, only non-covalently bound complexes were considered. Fourth, ligands containing uncommon elements, such as Be, B, Si, and metal atoms, were excluded; thus, only complexes whose ligands contained the common heavy atoms (C, N, O, F, P, S, Cl, Br, and I) were considered. Because the quality of the dataset used to develop a scoring function strongly influences its performance, we adopted the v2016 refined set, consisting of 4,507 complexes.
We also used the PDBbind v2018 refined set (4,463 complexes), the CASF-2016 benchmark set (285 complexes), and the CASF-2013 benchmark set (195 complexes); the latter two were used only as test datasets for performance evaluation. To prevent any protein–ligand complex from appearing simultaneously in the training, validation, and test datasets, each dataset was constructed according to the following rules. The training dataset comprised 3,689 complexes, obtained by removing from the v2016 refined set all complexes overlapping with the CASF-2016 and CASF-2013 sets. The validation dataset, used for model parameter optimization, comprised 677 complexes, obtained by removing from the v2018 refined set all complexes overlapping with the training, CASF-2016, and CASF-2013 sets. Overlaps were identified by PDB ID, as sketched below.
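As an illustration, the split construction by PDB ID amounts to simple set differences. The file names and the helper below are hypothetical; only the resulting counts come from the datasets described above.

```python
def load_pdb_ids(path):
    """Hypothetical helper: read one PDB ID per line into a set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

refined_2016 = load_pdb_ids("refined_2016_ids.txt")   # assumed file names
refined_2018 = load_pdb_ids("refined_2018_ids.txt")
casf_2016 = load_pdb_ids("casf_2016_ids.txt")
casf_2013 = load_pdb_ids("casf_2013_ids.txt")

# Training: v2016 refined set minus CASF-2016/2013 overlaps (3,689 complexes)
training = refined_2016 - casf_2016 - casf_2013
# Validation: v2018 refined set minus training and CASF overlaps (677 complexes)
validation = refined_2018 - training - casf_2016 - casf_2013
```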
Previous studies have shown that training and testing on data drawn from a single database tends to yield overly optimistic results [38,39,40]. We therefore collected the CSAR NRC-HiQ set1 and CSAR NRC-HiQ set2, comprising 176 and 167 complexes, respectively, as external test datasets. From each set, we removed complexes overlapping with the training, validation, CASF-2016, and CASF-2013 datasets, and then applied the four conditions proposed by Wang et al. This left 50 complexes in the CSAR NRC-HiQ set1 and 44 complexes in the CSAR NRC-HiQ set2, which were used as test datasets. A summary of each dataset is shown in Fig. 1, and the PDB IDs of all complexes in each dataset are provided in Additional file 1: Table S5. All water molecules and cofactors were removed from the crystal structures, and UCSF Chimera [41] and Open Babel [42] were used for preprocessing.
Structure similarity
Here we describe how the protein (and ligand) structural similarity between each complex in a test dataset and the training dataset was computed. The structural similarity between two proteins was calculated with TM-align and expressed as a TM-score. The TM-score ranges from 0 to 1, with higher values indicating greater structural similarity. Because most proteins have multiple chains, all chains of each protein were extracted and TM-scores were calculated for all chain pairs. The structural similarity between two proteins was defined as the lowest pairwise-chain TM-score; we also calculated the highest pairwise-chain TM-score. Finally, the protein structural similarity of each test complex to the training dataset was defined as the maximum of these TM-scores over the training complexes. The structural similarity between two ligands was measured by the Tanimoto coefficient, which also ranges from 0 to 1, with higher values indicating greater structural similarity. As with the proteins, the ligand structural similarity of each test complex to the training dataset was defined as the maximum Tanimoto coefficient over the training complexes (see the sketch below). Protein and ligand structural similarities for each complex are provided in Additional file 1: Table S6.
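For the ligand side, a minimal sketch of the maximum-Tanimoto computation is shown below. It assumes RDKit with Morgan (ECFP-like) fingerprints; the text does not state which fingerprint was used, so this is illustrative only (TM-align for proteins is a standalone external program and is not shown).

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def max_tanimoto_to_training(test_smiles, training_smiles_list):
    """Maximum Tanimoto coefficient between one test ligand and all training ligands."""
    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

    test_fp = fingerprint(test_smiles)
    return max(DataStructs.TanimotoSimilarity(test_fp, fingerprint(s))
               for s in training_smiles_list)
```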
Definition of descriptors
BAPA's input, a molecular complex, is represented as a 1D vector calculated from descriptor information obtained from the training dataset. Using the nine heavy atom types commonly observed in protein–ligand complexes, descriptors were generated from the protein and ligand atom pairs in contact within the complex. Let L be the list of ligand heavy atom types [CL, NL, OL, FL, PL, SL, ClL, BrL, IL], where L[i] is the i-th ligand atom type (\(0 \le i \le 8\)). Likewise, let P be the list of protein heavy atom types [CP, NP, OP, FP, PP, SP, ClP, BrP, IP], where P[j] is the j-th protein atom type (\(0 \le j \le 8\)). For each i and j, the set of contacts X[i][j] is defined by:
$$X[i][j]=\left\{\left(L[i]_{l},\,P[j]_{k}\right)\;\middle|\;d\!\left(L[i]_{l},P[j]_{k}\right)\le d_{\mathrm{cutoff}}\right\}$$
where \(L[i]_{l}\) and \(P[j]_{k}\) are the l-th atom of the i-th ligand atom type and the k-th atom of the j-th protein atom type, respectively, and \(d(\cdot,\cdot)\) is the Euclidean distance between the two atoms. We used \(d_{\mathrm{cutoff}} = 12\) Å, based on previous studies [13, 43]. For example, in Fig. 4 there are two atom pairs within 12 Å, so \(X[2][2]=\{(L[2]_{4},\,P[2]_{2})\}=\{(O_{L4}\leftrightarrow O_{P2})\}\) and \(X[2][0]=\{(L[2]_{1},\,P[0]_{21})\}=\{(O_{L1}\leftrightarrow C_{P21})\}\).
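A minimal sketch of this contact-set construction is given below; the element-and-coordinate representation of the atoms is an assumption for illustration.

```python
import numpy as np

HEAVY_ATOMS = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]  # shared ordering for L and P
D_CUTOFF = 12.0  # Å

def contact_sets(ligand_atoms, protein_atoms, cutoff=D_CUTOFF):
    """ligand_atoms / protein_atoms: lists of (element, xyz) with xyz a length-3 np.array.
    Returns X, a 9 x 9 grid of sets; X[i][j] holds (ligand_index, protein_index) pairs
    whose Euclidean distance is within the cutoff."""
    X = [[set() for _ in HEAVY_ATOMS] for _ in HEAVY_ATOMS]
    for l, (el_l, xyz_l) in enumerate(ligand_atoms):
        if el_l not in HEAVY_ATOMS:
            continue
        for k, (el_p, xyz_p) in enumerate(protein_atoms):
            if el_p not in HEAVY_ATOMS:
                continue
            if np.linalg.norm(xyz_l - xyz_p) <= cutoff:
                X[HEAVY_ATOMS.index(el_l)][HEAVY_ATOMS.index(el_p)].add((l, k))
    return X
```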
The elements of the sets X[i][j] form imaginary edges whose two nodes are a ligand atom type and a protein atom type. A graph extended by the one-step bonded neighborhoods of an imaginary edge is defined as a descriptor. Each edge between the one-step neighbors and the imaginary edge carries one of five bond types (single, double, triple, amide, or aromatic). Following the previous example, \((O_{L4}\leftrightarrow O_{P2})\) is extended to \('C_{L6}-(O_{L4}\leftrightarrow O_{P2})=C_{P9}'\) and \((O_{L1}\leftrightarrow C_{P21})\) is extended to \('C_{L1}-(O_{L1}\leftrightarrow C_{P21})-C_{P20}=C_{P16}'\). Because the order of the bonds (edges) is not considered, \('C_{L1}-(O_{L1}\leftrightarrow C_{P21})-C_{P20}=C_{P16}'\) and \('C_{L1}-(O_{L1}\leftrightarrow C_{P21})=C_{P16}-C_{P20}'\) are the same. Removing the atom indexes yields the two descriptors in Fig. 4, \('C_{L}-(O_{L}\leftrightarrow O_{P})=C_{P}'\) and \('C_{L}-(O_{L}\leftrightarrow C_{P})=C_{P}-C_{P}'\). In this way, u unique descriptors were collected from the training dataset, and each protein–ligand complex was represented as a descriptor vector d whose elements are the frequencies of the u unique descriptors (a minimal sketch follows).
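The sketch below illustrates how one contact pair plus its one-step bonded neighborhoods could be turned into a canonical, index-free descriptor key and how the descriptor vector d is assembled. The string encoding and the neighbor representation are assumptions, since the text does not specify an internal format.

```python
from collections import Counter

def descriptor_key(ligand_type, protein_type, ligand_neighbors, protein_neighbors):
    """ligand_neighbors / protein_neighbors: lists of (bond_type, atom_type) pairs for the
    one-step bonded neighbors of the two contact atoms, e.g. [("single", "C")].
    Sorting removes the dependence on bond (edge) order, giving a canonical form."""
    lig_part = ",".join(sorted(f"{b}:{a}L" for b, a in ligand_neighbors))
    prot_part = ",".join(sorted(f"{b}:{a}P" for b, a in protein_neighbors))
    return f"[{lig_part}]({ligand_type}L<->{protein_type}P)[{prot_part}]"

def descriptor_vector(complex_descriptor_keys, unique_descriptors):
    """Frequency vector d of one complex over the u unique descriptors from the training set."""
    counts = Counter(complex_descriptor_keys)
    return [counts.get(key, 0) for key in unique_descriptors]
```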
AutoDock Vina-based additional features
BAPA also exploits Vina terms, which reflect the distance information of intermolecular interactions in a protein–ligand complex. We used five intermolecular Vina terms and one flexibility term as additional features. The intermolecular terms comprise three steric interaction terms (gauss1, gauss2, and repulsion), a hydrophobic term, and a hydrogen-bond term. The flexibility term is the number of active rotatable bonds between heavy atoms of the ligand [44]. A hedged sketch of the intermolecular terms follows.
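For reference, the functional forms of the five intermolecular terms, as defined in the AutoDock Vina paper, are sketched below as functions of the surface distance d (interatomic distance minus the van der Waals radii of the two atoms). In BAPA these values were obtained from Vina itself rather than re-implemented, and the hydrophobic and hydrogen-bond terms apply only to the appropriate atom types.

```python
import math

def intermolecular_vina_terms(d):
    """Approximate functional forms of the five intermolecular Vina terms
    for a single atom pair with surface distance d (in Å)."""
    gauss1 = math.exp(-(d / 0.5) ** 2)
    gauss2 = math.exp(-(((d - 3.0) / 2.0) ** 2))
    repulsion = d ** 2 if d < 0 else 0.0
    # Linear ramps between the stated thresholds
    hydrophobic = 1.0 if d < 0.5 else 0.0 if d > 1.5 else (1.5 - d)
    hydrogen_bond = 1.0 if d < -0.7 else 0.0 if d > 0.0 else (-d / 0.7)
    return gauss1, gauss2, repulsion, hydrophobic, hydrogen_bond
```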
Proposed model
Overall schema of deep neural network
The proposed model, BAPA, uses three kinds of neural network layers (convolutional, attention, and dense) for binding affinity prediction. The model is designed to extract local structural patterns from the descriptor vector d via the convolutional layers. The latent representation of the complex (encoded vector e) is computed from the output of the convolutional layers and the Vina terms via a feedforward network. Based on this representation, the attention layer identifies the descriptors that are important for affinity prediction and yields the encoded context vector c. Finally, the concatenation of the encoded vector e and the encoded context vector c is fed to a feedforward network to predict the binding affinity. Every layer is activated with the exponential linear unit (ELU) function, and the whole network was implemented in TensorFlow (1.12.0). The overall architecture is depicted in Additional file 2: Fig. S2.
Convolutional layer with descriptor embeddings
The model starts with an embedding matrix \(E\in\mathbb{R}^{u\times h}\) that transforms each descriptor into a corresponding embedding vector \(E_i\in\mathbb{R}^{h}\), where u is the number of unique descriptors and h is the embedding size. The embedding matrix is initialized from a truncated standard normal distribution. To add local structural information to each descriptor embedding vector, the corresponding element of the descriptor vector (the frequency of that descriptor) is multiplied by a learned weight. The input to the convolutional layers is then generated through a dense layer applied to each column of the embedding matrix to which the local structural information has been added.
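One possible reading of this step is sketched below with Keras-style layers; the original implementation is in TensorFlow 1.12, and the sizes u and h as well as the exact wiring of the frequency weighting are assumptions, not taken from the text.

```python
import tensorflow as tf

u, h = 2500, 64   # number of unique descriptors and embedding size (assumed values)

# Embedding matrix initialized from a truncated standard normal distribution
E = tf.Variable(tf.random.truncated_normal([u, h]), name="descriptor_embedding")
freq_weight = tf.Variable(tf.ones([u, 1]), name="frequency_weight")

def embed_descriptors(d):
    """d: descriptor frequency vectors of shape (batch, u).
    Each embedding row E_i is scaled by its learned weight times the descriptor
    frequency, then passed through a dense layer to form the convolution input."""
    d = tf.cast(d, tf.float32)
    scaled = tf.expand_dims(d, -1) * (E * freq_weight)          # (batch, u, h)
    return tf.keras.layers.Dense(h, activation="elu")(scaled)   # (batch, u, h)
```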
All convolutional layers apply one-dimensional (1D) convolutions to find patterns in the input. For example, the first convolutional layer uses three filters with a stride of one, so a feature map of size (3 × N × 1) is generated. To extract varied descriptor patterns, five different window sizes (2, 4, 6, 8, and 10) are used, so five (3 × N × 1) feature maps are generated in the first convolutional layer. Each of these five feature maps passes through a max-pooling layer, which reduces its size. The network has four convolutional layers, and the convolution operation is performed in the same fashion in each of them, except that the number of filters is six in the second and third convolutional layers and nine in the fourth. Detailed parameters of the convolutional layers are provided in Additional file 1: Table S7. The outputs of the fourth convolutional layer are flattened and concatenated into a single vector of size 513. This vector and the Vina terms are merged into the encoded vector e, the latent vector of the complex, through a feedforward network.
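A hedged sketch of the convolutional block follows, again in Keras style. The filter counts (3, 6, 6, 9) and the five window sizes follow the text; the padding, pooling sizes, and the size of the encoded vector e are assumptions.

```python
import tensorflow as tf

def conv_branch(x, window_size):
    """One branch of 1D convolutions for a single window size; x has shape (batch, u, h)."""
    for n_filters in (3, 6, 6, 9):          # four convolutional layers
        x = tf.keras.layers.Conv1D(n_filters, window_size, strides=1,
                                   padding="same", activation="elu")(x)
        x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)   # shrink the feature map
    return tf.keras.layers.Flatten()(x)

def encode_complex(x, vina_terms):
    """Run the five parallel window sizes, flatten and concatenate (size 513 in the text),
    merge with the six Vina terms, and produce the encoded vector e."""
    branches = [conv_branch(x, w) for w in (2, 4, 6, 8, 10)]
    flat = tf.keras.layers.Concatenate()(branches)
    merged = tf.keras.layers.Concatenate()([flat, vina_terms])
    return tf.keras.layers.Dense(256, activation="elu")(merged)   # size of e is assumed
```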
Attention layers for important descriptors
In machine translation, the attention mechanism was designed mainly to address long-term dependencies when the input sequence is long: when the decoder predicts a word, attention places more focus on the most relevant input words. In this study, we designed the attention layer to focus on the most relevant descriptors. The latent representation of the complex (encoded vector e) is input to the attention layer to calculate the contribution of each descriptor to the affinity prediction.
The encoded vector e and each row Ei of the embedding matrix are transformed into a query vector q, key vectors ki, and value vectors vi through dense layers; note that in this study the key vector ki and the value vector vi are identical. The similarity between the query vector q and each key vector ki (\(0 \le i \le u-1\)) is calculated as their inner product, and the similarities are converted into descriptor weights via the softmax function. The weighted sum of the value vectors vi over the descriptor weights forms the context vector, which is passed through one dense layer to generate the encoded context vector c, used together with the encoded vector e to predict the binding affinity (a minimal sketch follows).
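A minimal sketch of the attention layer, under the same Keras-style assumptions as above, is shown below; the attention dimension is an assumed value.

```python
import tensorflow as tf

def attention_layer(e, E, attn_dim=64):
    """e: encoded vectors, shape (batch, dim_e); E: embedding matrix, shape (u, h).
    Keys and values are the same projection of the embedding rows."""
    q = tf.keras.layers.Dense(attn_dim)(e)                 # query from the encoded vector
    kv = tf.keras.layers.Dense(attn_dim)(E)                # keys = values, one per descriptor
    scores = tf.matmul(q, kv, transpose_b=True)            # inner products, shape (batch, u)
    weights = tf.nn.softmax(scores, axis=-1)               # descriptor weights
    context = tf.matmul(weights, kv)                       # weighted sum of value vectors
    return tf.keras.layers.Dense(attn_dim, activation="elu")(context)  # encoded context vector c
```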
Feedforward network for binding affinity
The encoded context vector c, the output of the attention layer, and the encoded vector e are concatenated and used to predict the binding affinity through a feedforward network consisting of layers of 512, 256, and 128 neurons.
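The corresponding prediction head is sketched below; the single linear output neuron is an assumption.

```python
import tensorflow as tf

def predict_affinity(e, c):
    """Concatenate the encoded vector e and the encoded context vector c and
    regress the binding affinity through 512-256-128 dense layers."""
    x = tf.keras.layers.Concatenate()([e, c])
    for units in (512, 256, 128):
        x = tf.keras.layers.Dense(units, activation="elu")(x)
    return tf.keras.layers.Dense(1)(x)   # predicted binding affinity
```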
Definition of loss function and weight optimization
In the proposed neural network model, the input flows to the output layer in a feedforward fashion. The mean squared error was used as the loss function for training the weights and biases. To prevent overfitting, we applied L2 regularization, so a penalty on the norm of the weights is added to the loss. The Adam optimizer was used to train the network (learning rate 0.005, batch size 256), as sketched below.
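A hedged sketch of the training objective; the L2 regularization coefficient is an assumed value not reported in this section.

```python
import tensorflow as tf

L2_COEFF = 1e-4   # assumed regularization strength

def total_loss(y_true, y_pred, trainable_weights):
    """Mean squared error plus an L2 penalty on the network weights."""
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    l2 = tf.add_n([tf.nn.l2_loss(w) for w in trainable_weights])
    return mse + L2_COEFF * l2

optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)  # batch size 256 during training
```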