This section introduces and motivates the four scoring functions built upon AutoDock Vina, the two benchmarks used to evaluate and compare their performance, the performance metrics, and the experimental setup.

### Model 1 - AutoDock Vina

AutoDock Vina [1] was chosen as a baseline scoring function because of its popularity among the research community. Vina’s popularity is rooted in its substantial improvements in both the average accuracy of binding pose prediction and the running speed. Its remarkable performance in pose generation as well as its open-source nature are other appealing aspects of this widely-used tool.

Like all classical scoring functions [6], Vina assumes a predetermined functional form. Vina’s score for the *k*th pose of a molecule is given by the predicted free energy of binding to the target protein and computed as:

$$ e'_{k}=\frac{e_{k,inter}+e_{k,intra}-e_{1,intra}}{1+w_{6}N_{rot}} $$

(1)

where

$$\begin{array}{@{}rcl@{}} e_{k,inter} &=& w_{1} \cdot Gauss1_{k} \\ &&+ w_{2} \cdot Gauss2_{k} \\ &&+ w_{3} \cdot Repulsion_{k} \\ &&+ w_{4} \cdot Hydrophobic_{k} \\ &&+ w_{5} \cdot HBonding_{k} \end{array} $$

(2)

$$\begin{array}{@{}rcl@{}} w_{1} &=& -0.035579 \\ w_{2} &=& -0.005156 \\ w_{3} &=& 0.840245 \\ w_{4} &=& -0.035069 \\ w_{5} &=& -0.587439 \\ w_{6} &=& 0.05846 \end{array} $$

(3)

\(e^{\prime }_{k}\) is the predicted free energy of binding reported by the Vina software when scoring the *k*th docked pose. \(e_{k,inter}\) and \(e_{k,intra}\) are the inter-molecular and intra-molecular contributions, respectively; both have the same functional form described in Eq. 2 but are summed over different atom pairs. The values for the six weights were calculated by OLS (Ordinary Least Squares) via a nonlinear optimisation algorithm, as has been the case in related force-field scoring functions [10], although this process was not fully disclosed in the original publication [1]. \(N_{rot}\) is the calculated number of rotatable bonds. The predicted free energy of binding in kcal/mol units was converted into \(pK_{d}\) units with \(pK_{d}=-0.73349480509\,e\) so as to compare to binding affinities in \(pK_{d}\) or \(pK_{i}\) units. Mathematical expressions and further explanations can be found in [2].

Unlike our previous study [6] on scoring crystal poses, where *k*=1 because only the crystal pose was considered, this study aims at training and testing on docked poses, so *k* will range from 1 to 9 depending on the specific pose used for each molecule (Vina returns a maximum of 9 poses per docking run). Thus \(e_{k,intra}\) and \(e_{1,intra}\) do not necessarily cancel out. As a result, the five terms from \(e_{k,intra}\) were considered as additional features in models 2, 3 and 4.

### Model 2 - MLR::Vina

This model retains the 11 unweighted Vina terms (5 from \(e_{k,inter}\), 5 from \(e_{k,intra}\), and \(N_{rot}\)) as features, but changes the regression method to multiple linear regression (MLR), a regression model commonly adopted by classical scoring functions, such as empirical scoring functions. The use of MLR implies an additive functional form, and therefore MLR::Vina is a classical scoring function [6].

Vina’s scoring function is not exactly a sum of energetic terms because \(w_{6}\neq 0\) (although the denominator of Eq. 1 is close to 1 because of the low value of \(w_{6}\)). In order to make the problem amenable to MLR, we performed a grid search on \(w_{6}\) and thereafter ran MLR on the remaining weights. More precisely, we first sampled 101 values for \(w_{6}\) from 0 to 1 with a step size of 0.01. Interestingly, we found that the \(w_{6}\) values of the best models were always between 0.000 and 0.030. We then sampled 31 values for \(w_{6}\) in this range with a step size of 0.001, and used the \(w_{6}\) value that resulted in the lowest RMSE (Root Mean Square Error) on the test set.
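
The two-stage procedure above can be sketched as follows. This is a minimal illustration, assuming the ten energy terms are scaled by \(1/(1+w_{6}N_{rot})\) before fitting and that \(N_{rot}\) is kept as an extra feature; note also that the paper selects \(w_{6}\) by test-set RMSE, whereas here the RMSE is computed on whatever data is passed in:

```python
import numpy as np

def grid_search_w6(X_terms, n_rot, y, w6_values):
    """For each fixed w6, scale the ten energy terms (rows of X_terms) by
    1/(1 + w6 * N_rot), append N_rot and an intercept column, fit the
    remaining weights by MLR (ordinary least squares), and keep the w6
    that gives the lowest RMSE."""
    best_rmse, best_w6, best_coef = np.inf, None, None
    for w6 in w6_values:
        scale = 1.0 / (1.0 + w6 * n_rot)
        X = np.column_stack([X_terms * scale[:, None], n_rot, np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rmse = float(np.sqrt(np.mean((X @ coef - y) ** 2)))
        if rmse < best_rmse:
            best_rmse, best_w6, best_coef = rmse, w6, coef
    return best_w6, best_coef, best_rmse
```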

### Model 3 - RF::Vina

This model also retains the 11 unweighted Vina terms as features, but changes the regression method to Random Forest (RF) [11], so as to implicitly learn the functional form from the data. Hence this model circumvents the modelling assumption of a predetermined functional form and thus allows us to investigate the impact of such an assumption by comparing RF::Vina to MLR::Vina. Besides RF, other machine learning techniques such as SVR (Support Vector Regression) [12] can certainly be applied to this problem, although this is outside the scope of this study.

A RF is an ensemble of different decision trees randomly generated from the same training data via bootstrapping [11]. RF trains its constituent trees using the CART algorithm [13], and selects the best data split at each node of a tree from a typically small number (mtry) of randomly chosen features. In regression applications, the RF prediction is given by the arithmetic mean of all the individual tree predictions in the forest.

For each value of the mtry parameter from 1 to all 11 features, we built a RF model with 500 trees, as we and others [14] have not observed any substantial gain in performance by training RF with a higher number of trees on this class of problems. For each tree, the training instances left out of its bootstrap sample are collectively known as the OOB (Out of Bag) data; the selected model was the one that led to the lowest RMSE on the OOB data. Because RF is stochastic, this process was repeated ten times with ten different random seeds. The predictive performance was reported for the RF with the seed that resulted in the lowest RMSE on the test set. Further details on RF model building in this context can be found in [6].
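
The seed/mtry selection loop can be sketched with scikit-learn's `RandomForestRegressor` (an assumption: the original work used an RF implementation, not necessarily scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_rf(X, y, n_features=11, seeds=range(10)):
    """For each seed and each mtry in 1..n_features, train a 500-tree RF
    and keep the model with the lowest RMSE on the out-of-bag (OOB) data."""
    best_rmse, best_model = np.inf, None
    for seed in seeds:
        for mtry in range(1, n_features + 1):
            rf = RandomForestRegressor(n_estimators=500, max_features=mtry,
                                       bootstrap=True, oob_score=True,
                                       random_state=seed, n_jobs=-1)
            rf.fit(X, y)
            # OOB predictions: each sample predicted only by trees that
            # did not see it during training
            oob_rmse = float(np.sqrt(np.mean((rf.oob_prediction_ - y) ** 2)))
            if oob_rmse < best_rmse:
                best_rmse, best_model = oob_rmse, rf
    return best_model
```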

### Model 4 - RF::VinaElem

This model retains RF as the regression method, but expands the feature set to 47 features by adding the 36 RF-Score [4] features. As in the training process of RF::Vina, the same ten seeds were used; for a given random seed, a RF model was built for each mtry value from 1 to 47, and the model with the lowest RMSE on the OOB data was selected. The predictive performance was reported for the RF with the seed that led to the lowest RMSE on the test set.

RF-Score features are defined as the occurrence count of intermolecular contacts between elemental atom types *i* and *j*, as shown in Eqs. 4 and 5, where \(d_{kl}\) is the Euclidean distance between the *k*th protein atom of type *j* and the *l*th ligand atom of type *i* calculated from a structure; \(K_{j}\) is the total number of protein atoms of type *j* (#{*j*}=4, considered protein atom types are C, N, O, S) and \(L_{i}\) is the total number of ligand atoms of type *i* (#{*i*}=9, considered ligand atom types are C, N, O, F, P, S, Cl, Br, I); \(\mathcal {H}\) is the Heaviside step function that counts contacts within a neighbourhood of \(d_{cutoff}\) Å. For instance, \(x_{7,8}\) is the number of occurrences of ligand nitrogen atoms (*i*=7) hypothetically interacting with protein oxygen atoms (*j*=8) within a chosen neighbourhood. Full details on RF-Score features are available in [4, 12].

$$ x_{ij}=\sum\limits_{k=1}^{K_{j}}\sum\limits_{l=1}^{L_{i}}\mathcal{H}(d_{cutoff}-d_{kl}) $$

(4)

$$ \mathbf x=\{x_{ij}\}\in \mathbb{N}^{36} $$

(5)
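
Eqs. 4 and 5 can be sketched as follows. The coordinates and element labels are hypothetical inputs that would normally be parsed from a protein-ligand structure, and the 12 Å default cutoff is an assumption taken from the original RF-Score description:

```python
import numpy as np

PROTEIN_TYPES = ["C", "N", "O", "S"]                            # j: 4 types
LIGAND_TYPES = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]  # i: 9 types

def rf_score_features(prot_xyz, prot_elem, lig_xyz, lig_elem, d_cutoff=12.0):
    """Count protein-ligand atom pairs of each elemental type pair within
    d_cutoff Angstroms (Eq. 4), returning the 36-dimensional vector of Eq. 5."""
    x = np.zeros((len(LIGAND_TYPES), len(PROTEIN_TYPES)), dtype=int)
    # Pairwise Euclidean distances d_kl between all ligand and protein atoms
    d = np.linalg.norm(np.asarray(lig_xyz)[:, None, :]
                       - np.asarray(prot_xyz)[None, :, :], axis=-1)
    within = d <= d_cutoff  # Heaviside step: contact iff within the cutoff
    for l, le in enumerate(lig_elem):
        for k, pe in enumerate(prot_elem):
            if within[l, k]:
                x[LIGAND_TYPES.index(le), PROTEIN_TYPES.index(pe)] += 1
    return x.ravel()
```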

### PDBbind v2007 benchmark

We adopted the PDBbind v2007 benchmark [15], arguably the most widely used [6, 7] for binding affinity prediction of diverse complexes. Its test set comprises 195 diverse complexes from the core set, whereas its training set comprises 1105 non-overlapping complexes from the refined set. Both the test and training sets come with measured binding affinities spanning more than 12 orders of magnitude. This benchmark has the advantage of permitting a direct comparison against the same four models that were trained and tested on crystal poses [6] of this benchmark.

### PDBbind v2013 blind benchmark

We also adopted the PDBbind v2013 blind benchmark [6], a recently proposed new benchmark mimicking a blind test to provide a more realistic validation than the PDBbind v2007 benchmark. Its test set is composed of all the complexes in the PDBbind v2013 refined set that were not in the v2012 refined set, i.e. those 382 complexes that were newly added in the v2013 release. Its training set is simply the v2012 refined set, which contains 2897 complexes. By construction, this benchmark can be regarded as a blind test in that only data available until a certain year is used to build the scoring function that will be used to predict the binding affinity of future complexes as if these had not yet been measured. Consequently, the test set and training set do not overlap. Again, this benchmark has the advantage of permitting a direct comparison against the same four models that were trained and tested on crystal poses [6] of this benchmark.

In addition to the above training set, three more training sets were added in order to study how the performance of the four models would vary given different number of training complexes. The refined sets of PDBbind v2002 (N=792), v2007 (N=1300), v2010 (N=2057) and v2012 (N=2897) were chosen so that there is approximately the same number of complexes between consecutive releases. Complexes containing metal ions not supported by Vina were discarded. More details about this benchmark can be found in [6].

### Performance measures

As usual [15], predictive performance was quantified by the Root Mean Square Error (RMSE), Standard Deviation (SD), Pearson correlation (Rp) and Spearman rank-correlation (Rs) between predicted and measured binding affinities. Their mathematical expressions are shown in Eqs. 6, 7, 8, and 9. Given a scoring function *f*, the measured binding affinity \(y^{(n)}\) and the features \(\overrightarrow {x}^{(n)}\) characterising the *n*th complex out of *N* complexes in the test set, \(p^{(n)}=f(\overrightarrow {x}^{(n)})\) is the predicted binding affinity, \(\{\hat {p}^{(n)}\}\) are the fitted values from the linear model between \(\{y^{(n)}\}\) and \(\{p^{(n)}\}\) on the test set, whereas \(\{y_{r}^{(n)}\}\) and \(\{p_{r}^{(n)}\}\) are the rankings of \(\{y^{(n)}\}\) and \(\{p^{(n)}\}\), respectively. Note that SD is calculated after a linear correlation between predicted and measured affinities, whereas RMSE is calculated directly on the predictions. Lower values of RMSE and SD and higher values of Rp and Rs indicate a better predictive performance.

$${} RMSE = \sqrt{\frac{1}{N}\sum\limits_{n=1}^{N}\left(p^{(n)}-y^{(n)}\right)^{2}} $$

(6)

$${} SD = \sqrt{\frac{1}{N-2}\sum\limits_{n=1}^{N}\left(\hat{p}^{(n)}-y^{(n)}\right)^{2}} $$

(7)

$${} {{\begin{aligned} R_{p} = \frac{N\sum_{n=1}^{N}p^{(n)}y^{(n)}-\sum_{n=1}^{N}p^{(n)}\sum_{n=1}^{N}y^{(n)}}{\sqrt{\!\left(\!N\sum_{n=1}^{N}(p^{(n)})^{2}-\left(\!\sum_{n=1}^{N}p^{(n)}\right)^{2}\!\right)\!\!\left(\!N\!\sum_{n=1}^{N}(y^{(n)})^{2}-\left(\!\sum_{n=1}^{N}y^{(n)}\right)^{2}\!\right)}} \end{aligned}}} $$

(8)

$$ {{\begin{aligned} R_{s} = \frac{N\sum_{n=1}^{N}p_{r}^{(n)}y_{r}^{(n)}-\sum_{n=1}^{N}p_{r}^{(n)}\sum_{n=1}^{N}y_{r}^{(n)}}{\sqrt{\left(\!N\sum_{n=1}^{N}(p_{r}^{(n)})^{2}-\left(\!\sum_{n=1}^{N}p_{r}^{(n)}\!\right)^{2}\!\right)\!\!\left(\!N\sum_{n=1}^{N}(y_{r}^{(n)})^{2}-\left(\!\sum_{n=1}^{N}y_{r}^{(n)}\!\right)^{2}\!\right)}} \end{aligned}}} $$

(9)
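
The four measures of Eqs. 6–9 can be sketched as follows. This minimal version assumes no ties among the affinities, so Spearman's Rs reduces to Pearson correlation on the rankings:

```python
import numpy as np

def metrics(p, y):
    """RMSE, SD, Pearson Rp and Spearman Rs (Eqs. 6-9) between predicted
    affinities p and measured affinities y."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    N = len(y)
    rmse = np.sqrt(np.mean((p - y) ** 2))
    a, b = np.polyfit(p, y, 1)                    # linear model y ~ a*p + b
    sd = np.sqrt(np.sum((a * p + b - y) ** 2) / (N - 2))
    rp = np.corrcoef(p, y)[0, 1]
    rank = lambda v: np.argsort(np.argsort(v)) + 1  # rankings (assumes no ties)
    rs = np.corrcoef(rank(p), rank(y))[0, 1]
    return rmse, sd, rp, rs
```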

The Root Mean Square Deviation (RMSD) measures how geometrically different the redocked pose is from the corresponding co-crystallized pose of the same ligand molecule, i.e. the pose generation error. Suppose \(N_{a}\) is the number of heavy atoms, and \(\left (x_{c}^{(n)}, y_{c}^{(n)}, z_{c}^{(n)}\right)\) and \(\left (x_{d}^{(n)}, y_{d}^{(n)}, z_{d}^{(n)}\right)\) are the 3D coordinates of the *n*th heavy atom of the crystal and docked poses, respectively. The pose generation error is then calculated as:

$${} {{\begin{aligned} RMSD =\! \sqrt{\!\!\frac{1}{N_{a}}\sum_{n=1}^{N_{a}}\!\left[\!\left(\!x_{c}^{(n)}-x_{d}^{(n)}\!\right)^{2}\,+\,\left(\!y_{c}^{(n)}\,-\,y_{d}^{(n)}\!\right)^{2}\,+\,\left(\!z_{c}^{(n)}-z_{d}^{(n)}\!\right)^{2}\right]} \end{aligned}}} $$

(10)
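
Eq. 10 translates directly into code, assuming both coordinate arrays list the heavy atoms in the same order:

```python
import numpy as np

def rmsd(crystal_xyz, docked_xyz):
    """Heavy-atom RMSD between crystal and docked poses (Eq. 10); both
    inputs are (N_a, 3) arrays with atoms in matching order."""
    diff = np.asarray(crystal_xyz, float) - np.asarray(docked_xyz, float)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```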

### Experimental setup

To generate docked poses, each ligand in the two benchmarks was docked into the binding site of its target protein using Vina with its default settings. This process is known as redocking. The search space was defined by finding the smallest cubic box that covers the entire ligand and then extending the box by 10 Å in the X, Y and Z dimensions. Water molecules were removed, while metal ions recognized by Vina were retained as part of the protein. This preprocessing procedure is commonly adopted in the development of both classical scoring functions [1] and machine-learning scoring functions [16].
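
The search-space definition can be sketched as follows. Whether the 10 Å extension is applied once per dimension or on each side of the box is ambiguous in the text; the sketch assumes once per dimension:

```python
import numpy as np

def docking_box(ligand_xyz, margin=10.0):
    """Smallest cubic box covering the ligand, extended by `margin`
    Angstroms in each dimension; returns (center, size) as 3-vectors."""
    xyz = np.asarray(ligand_xyz, float)
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    center = (lo + hi) / 2.0
    # Cubic: the edge length is the largest extent, plus the margin
    size = np.full(3, (hi - lo).max() + margin)
    return center, size
```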

Redocking a ligand into its cognate protein resulted in up to nine docked poses, so the question arises of which pose best represents the molecule when calculating the values of the features. Here we evaluate different schemes, each referring to the specific pose from which the features are extracted. In scheme 1, the chosen pose is the crystal pose. In scheme 2, the chosen pose is the docked pose with the best Vina score, i.e. the one with the lowest estimated free energy of binding in kcal/mol units. We trained the four models on both crystal and docked poses (both schemes), and tested them also on both crystal and docked poses (both schemes).

To make our experiments comprehensive, we also evaluated additional schemes. In scheme 3, the chosen pose is the docked pose with the lowest RMSD. In scheme 4, the chosen pose is the docked pose with a Vina score closest to the measured binding affinity. In scheme 5, the chosen poses are all 9 docked poses, which results in a 9 times larger feature set (the number of features is 91 for models 2 and 3, and 415 for model 4). For ligands with fewer than 9 docked poses returned, the features extracted from the pose with the lowest Vina score are repeated as many times as poses are missing. In scheme 6, the chosen poses are the 2 docked poses with the lowest and second lowest Vina scores, which results in a double-sized feature set (the number of features is 21 for models 2 and 3, and 93 for model 4). For ligands with fewer than 2 docked poses returned, the features extracted from the pose with the lowest Vina score are repeated. The rationale for introducing these schemes is that schemes 1 to 4 help to determine which particular pose is useful for improving predictive accuracy, while schemes 5 and 6 help to examine the effect of using a pose ensemble instead of a single pose.
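
The per-pose feature concatenation with padding used in schemes 5 and 6 can be sketched as follows, assuming the feature vector of the best-scored pose is first in the list:

```python
def pad_pose_features(pose_features, n_poses):
    """Concatenate per-pose feature vectors into one flat vector; when a
    ligand has fewer docked poses than required, repeat the features of
    the best-scored pose (assumed to be pose_features[0])."""
    padded = list(pose_features)
    while len(padded) < n_poses:
        padded.append(pose_features[0])
    return [f for pose in padded[:n_poses] for f in pose]
```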

Hereafter whenever we mention the docked pose, we implicitly refer to the one with the best Vina score (scheme 2), if not specified otherwise.