End-to-end learning for compound activity prediction based on binding pocket information

Background Recently, machine learning-based ligand activity prediction methods have been greatly improved. However, if known active compounds of a target protein are unavailable, the machine learning-based method cannot be applied. In such cases, docking simulation is generally applied because it only requires a tertiary structure of the target protein. However, the conformation search and the evaluation of binding energy of docking simulation are computationally heavy and thus docking simulation needs huge computational resources. Thus, if we can apply a machine learning-based activity prediction method for a novel target protein, such methods would be highly useful. Recently, Tsubaki et al. proposed an end-to-end learning method to predict the activity of compounds for novel target proteins. However, the prediction accuracy of the method was still insufficient because it only used amino acid sequence information of a protein as the input. Results In this research, we proposed an end-to-end learning-based compound activity prediction using structure information of a binding pocket of a target protein. The proposed method learns the important features by end-to-end learning using a graph neural network both for a compound structure and a protein binding pocket structure. As a result of the evaluation experiments, the proposed method has shown higher accuracy than an existing method using amino acid sequence information. Conclusions The proposed method achieved equivalent accuracy to docking simulation using AutoDock Vina with much shorter computing time. This indicated that a machine learning-based approach would be promising even for novel target proteins in activity prediction.

initial stage of drug discovery, a compound screening is often performed for selecting drug candidate compounds from a chemical compound library. Such a library sometimes contains millions to tens of millions of compounds and the cost of the screening cannot be ignored. Virtual screening is a computational method to predict the activity of chemical compounds without biochemical experiments. The accuracy has been improved recently and the technology has been applied successfully [3].
The virtual screening technique can be roughly divided into two categories: ligandbased virtual screening (LBVS) methods and structure-based virtual screening (SBVS) methods. LBVS predicts the activity of a given compound only from its chemical structure and supervised machine learning algorithms are used for the prediction. The method generally shows a good performance if we have a sufficient number of activity data for a target protein. However, LBVS cannot be applied if we have no compound activity data of a target protein. In contrast, SBVS can be used even if there is no activity information of a target protein. SBVS generally performs a simulation based on physicochemistry using the 3-dimensional structure of a target protein for predicting the activity of a compound. Docking simulation is one of the most major SBVS methods. It predicts the binding energy of a compound to a target protein by virtually docking the compound (i.e., ligand) with a binding site (i.e., pocket) of a target protein surface.
Currently, various docking simulation software, such as AutoDock Vina [4], Glide [5], and eHiTS [6], have been developed and widely used in practical research projects [7]. However, docking simulation has a problem. Docking simulation needs to search a large conformational space by rotating and translating a compound and calculate the binding energy of a protein-compound interaction using a complicated score function. This process is computationally expensive and takes a few minutes to predict the activity of one compound with a single CPU core [5]. The execution time of docking simulation of a compound is acceptable. However, for virtual screening, we have to perform the process for a large number of compounds. Therefore, even if a docking simulation can evaluate one compound in 10 s, a screening of a compound library includin10 million compounds requires approximately 1200 CPU days.
In order to tackle this problem, several researchers proposed machine learning-based virtual screening methods to predict the activity of a compound for a novel target protein. Such researches used not only a chemical compound structure but also protein information as the input of the machine learning models. Recently, Tsubaki et al. proposed an end-to-end learning prediction method for the problem and achieved better accuracy than the previous methods [8]. End-to-end learning combines feature design and task learning at the same time. By directly learning the relationship between the input and the output, it is possible to obtain a better representation of the input data rather than to manually encode it. The accuracies of end-to-end learning-based methods were often higher than those of conventional methods in various applications, such as Diagnosis of X-ray Images [9]. In the research of Tsubaki et al., they used a graph neural network for a compound and a one-dimensional convolutional neural network for a protein amino acid sequence. They managed to avoid manual encoding of input vectors. However, this method has some room for improvement because the method used only amino acid information. Structure information of a protein, especially the pocket structure, is considered to contain more useful information for ligand binding prediction. As illustrated by the key and keyhole theory, a binding compound often has a specific shape for the pocket structure of a target protein. Thus, the pocket structure information is more important for binding estimation than the amino acid sequence which only implies the pocket information. In addition, when using an amino acid sequence, it is assumed that there is sequence homology between a target protein and a protein in the training data set. However, using a pocket structure, it may be possible to predict the activity even for a novel target protein without any sequence homology.
In this paper, we proposed a method to predict the activity of a compound by end-toend learning using the pocket structure information of a protein. The proposed method uses graph neural networks both for compound and protein pocket structures. As a result of the benchmark on a data set for virtual screening performance evaluation, the proposed method achieved higher accuracy compared to the previous method which uses amino acid sequence information as the input.

Dataset
We used the DUD-E (a Database of Useful Decoys: Enhanced) dataset [10] as a training dataset of machine learning-based prediction models. DUD-E is a dataset constructed for the performance evaluation of the structure-based screening method created by Mysinger et al. A total of 102 target proteins were selected considering diversity, and active compounds and decoy compounds were prepared for each target. In total, the dataset contains 22,886 active compounds and more than 1 million decoy compounds. In this study, a down-sampled version at a 1:1 ratio of active to decoy was used as the training dataset. We checked that proteins in the training dataset had no sequence similarity to those in the test dataset using NCBI-BLAST [11] as described below (the sequence identity was lower than 30%. ) We used the MUV (Maximum Unbiased Validation) dataset [12] as our test dataset because the three-dimensional structure of the target protein is required. Rohrer et al. obtained the assay data for 17 target proteins from the bioactivity data contained in PubChem [13], and assigned 30 active compounds and 15,000 decoy compounds to each target protein. In this study, a total of 9 target proteins, whose protein-ligand complex structure has been solved, was used. Those proteins are described in Table 1. This is the same dataset as used in the research by Ragoza et al. [14]

Evaluation measure
We used AUROC (Area Under Receiver Operating Characteristic) as an evaluation index. AUROC is an index using the area under ROC curve, and it is an evaluation index mainly used for binary classification problems. ROC is a curve with a true positive rate (TPR) on the vertical axis and a false positive rate (FPR) on the horizontal axis. TPR is the proportion of positives correctly identified as positive in the dataset, and FPR is the proportion of negatives incorrectly identified as positive in the dataset. TPR and FPR can be calculated by the following formulas.
We also used F1-score for the evaluation. F1-score is the harmonic mean of the precision and recall. The precision is the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly. The recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.

Prediction accuracy evaluation
For checking the improvement of the prediction accuracy of the proposed method, we performed the evaluation on the MUV dataset. We compared the prediction accuracy of the proposed method with the method using sequence information by Tsubaki et al. and docking simulation using AutoDock Vina, which is one of the popular docking simulation software. Table 2 shows the results of AUROC. The proposed method achieved better accuracy than both, Tsubaki et al. and AutoDock Vina. However, the improvement of the proposed method compared to Tsubaki et al. 's method was not that big. As shown in Table 2, the proposed method has shown better accuracy for almost all targets (8/9 targets). In addition, judging from the results in Table 2 Table 3 shows the results of the F1-score evaluation. The proposed method has shown better accuracy in comparison to Tsubaki et al. and AutoDock Vina as well as in AUROC. We also checked the statistical significance of the improvement for the F1-score. The p-value of the improvements for Tsubaki et al. and AutoDock Vina were 0.047 and 0.051, respectively. For the F1-score, the difference between the accuracy of the proposed method and that of AutoDock Vina was clearer than in AUROC, but the improvement was also not significant.

Evaluation of computing time
The proposed method achieved to improve overall prediction accuracy. However, such methods are not so useful if they require significant computing resources. Therefore, we also evaluated the computing time. The prediction time was measured on an f-node of supercomputer TSUBAME3.0 at Tokyo institute of Technology. The details are shown in Table 4. AutoDock Vina used a single CPU core. The proposed method and Tsubaki et al. 's method used a single CPU core and a GPU card. The results of the prediction time per compound are shown in Table 5. The proposed method took a long time to complete the prediction compared to Tsubaki et al. because the pocket graph used in the proposed method is more complicated than the 1-dimensional convolution neural network  used in that research. However, compared with AutoDock Vina, the prediction time is much faster (more than 1000-fold acceleration), and we think that the performance is still sufficient to be useful for practical usage.

Conclusions
In this research, we proposed a new end-to-end learning method for activity prediction using protein pocket structure information. The proposed method has shown higher prediction accuracy than previous methods. Furthermore, compared with docking simulation software, the proposed method has shown higher accuracy in a much shorter computing time. Unfortunately, the improvement of the proposed method was not statistically significant in the case of AutoDock Vina. One clear reason why there was no statistical significance is the small size of the dataset used in the evaluation.
In this research, we used only 9 protein targets for the evaluation, which clearly is not enough for this type of evaluation. However, besides the MUV dataset, there is currently no available dataset that is sufficiently unbiased. For instance, the DUD-E dataset, which was used in previous research, has a bias and it is not suitable for the comparison between docking and machine learning-based methods [15]. Thus, we used it only as the training dataset and evaluated the performance based on a subset of the MUV dataset. The development of a larger dataset that is more suitable for our approach is one of our future works. In this research, the proposed method has shown worse accuracy than AutoDock Vina for 2 proteins (MUV ID 689 and 466). Unfortunately, we could not find any clear reason for that, but we will be able to analyze such things based on a larger dataset.
In addition to a new dataset construction, currently, we use only protein pocket structure information as a protein feature, but a combination of amino acid information and protein pocket structure information may improve overall prediction accuracy.

Materials and methods
The proposed method uses pocket structure information instead of amino acid sequence information used in the previous method as the feature of a protein and applies a graph neural network for not only a compound but also a protein. The generation of compound features is the same as the previous method by Tsubaki et al. [8], but protein feature generation is highly different. The proposed method consists of the following three parts; (1) Graph generation of a compound structure and a protein pocket structure. (2) Feature learning by graph neural network. (3) Activity prediction by a classifier.

Graph generation of a compound structure and protein pocket structure
A compound structure was firstly converted from a SMILES format string into a graph structure consisting of vertices (atom types) and edges (chemical bonds) by RDKit. If a SMILES-formatted file contains dots (represents non-concatenation), it is excluded from the data set because the generation of a single graph is not possible. Protein pocket structure information was extracted as the information (residue type, coordinates) of ligand contact residues identified by LPC software [16]. Then the information was converted into a graph. The vertices of the graph correspond to each residue. The edges of the graph correspond to residue interactions including bonded and nonbonded ones (Fig. 1). The vertices were categorized by their amino acid type (20 types) and represented as 20-dimensional one-hot vectors. The edges were categorized into five types according to the distance between Cα atoms of residues (I: 1.0-4.8 Å, II: 4.8-7.0 Å, III: 7.0-9.2 Å, IV: 9.2-11.4 Å, V: 11.4-13.6 Å). The types of edges were defined according to previous related research by Ito et al. [17]. By roughly grouping the distance between Cα atoms, we expected that it can cope with the structural change of a protein pocket.

Graph neural network
The compound and protein pocket graphs generated by the procedure described above are each converted to real-valued vectors by using a graph neural network. The procedure of the graph neural network consists of three parts (embedding, transition, and averaging) both for compound graphs and protein pocket graphs (Fig. 2). In the embedding part, we employed the same method used in research by Tsubaki et al. [8]. We defined r-radius vertex and r-radius edge, which introduced the concept of r-radius subgraphs [18] to vertex and edge, respectively. The r-radius subgraphs include all neighboring vertices and edges within a radius r (r is the number of hops on the graph) from a certain vertex. In the transition part, the following operation (i) and (ii) for each r-radius vertex and each r-radius edge are repeated. (i) Add adjacent r-radius vertex and r-radius edge vector. (ii) The vector generated by the operation (i) is input to the non-linear function and updated. Finally, all r-radius vertex vectors generated by the operation (ii) are averaged and one real-valued vector is output in the averaging part.

Activity prediction by a classifier
The activity of a compound was predicted using the d-dimensional compound vector y molecule and the protein vector y protein obtained from a graph neural network described above. Firstly, we simply concatenated y molecule and y protein as follows: [ y molecule ; y protein ] . Then, the input z ∈ R 2 to a softmax layer is obtained by the following equation: Here, W output ∈ R 2×2d , b output ∈ R 2 . Finally, z = [y 0 , y 1 ] was inputted into the softmax layer and binary classification on whether the ligand is active or not was performed.

Hyperparameter optimization and docking simulation settings
We used almost the same neural network structure and hyperparameters used in the previous method by Tsubaki et al. Thus, the hyperparameter values are the same as previous research except for the parameters related to protein features (i.e. r-radius of a protein pocket subgraph). The value was determined by using the training dataset. Details of hyperparameters used in the proposed method are shown in Table 6.
We performed experiments on target proteins for which protein-ligand complex structures were obtained. Therefore, we assumed that the central coordinates of a ligand were the central coordinates of the pocket. A 24 Å × 24 Å × 24 Å cube centered on that z = W output y molecule ; y protein + b output