Data Sets
The data set consists of protein complexes whose structures have been solved by Xray crystallography and for which alanine mutational data are available. Structures are obtained from the Protein Data Bank (PDB) [36]. Alanine mutation data are collected from the Alanine Scanning Energetics database (ASEdb) [37] and from previous publications [11, 13]. We have only considered proteinprotein interactions involving an extended interface. Mutation data related to proteinpeptide complexes have not been included. For each reported mutation we have verified the original reference and checked that the residue mapping from sequence to structure is consistent (few discrepancies have been found). Only mutations occurring at the complex interface have been retained. Interface residues are defined as those having at least one heavy atom within 5 Å of an heavy atom in the binding partner. Similarly, two residues across the interface are considered in contact if any of their heavy atoms are within 5 Å.
To ensure that the data set is sufficiently diverse and representative of proteinprotein interfaces in general, we have analysed the complexes in terms of interacting domains and of location of the binding interface (i.e. which residues and residue contacts are at the interface). Domain structures can be classified according to CATH [38], which follows a hierarchical scheme. The first five levels in this hierarchy are Class, Architecture, Topology (fold), Homologous superfamily and Sequence family (clustered at a threshold of 35% identity). We have required that no two pairs of interacting domains have the same CATH numbers at the Slevel. If two domain pairs have the same CATH numbers, we verify if they use the same or a different interface to interact. In the latter case, both structure are included in the data set (in practise this issue arises for only one pair of structures, PDB codes 3HFM and 1VFB).
The data set contains 20 protein complex structures. Following previous publications [19], we define hot spots as those alanine mutations for which ΔΔG ≥ 2 kcal/mol (ΔΔG is the change in binding free energy). In total the data set comprises 349 mutations, of which 81 correspond to hot spots. Some studies such as for example in the Robetta paper [11] define hot spots using instead a threshold of 1 kcal/mol. According to this definition, in our data set there would be 165 hot spots and 184 non hot spots. We have tested our approach in this case as well and report the results in Table 1b.
The list of the 20 protein structures and their interacting domains is reported in Additional file 2: Supplemental Tables S5 and S6. Note that homologous protein complexes are present in our data set but at least one of the proteins involved has less than 35% sequence similarity to its homologue. We remark that in these cases only a limited number of mutations occurs at equivalent sequence positions, as deduced from the alignment, and even removing them from the data set does not affect the overall results. As can be noted, the data set is dominated by the immunoglobin superfamily (10 structures out of 20 comprise at least one immunoglobin domain). We have verified a posteriori that this does not introduce a bias and that mutations on immunoglobins are not predicted with higher accuracy than on other proteins (see Results and Discussion section).
A detailed list of the individual mutations with their respective ΔΔG is reported in Additional file 3: Data set S1. We have analysed the distribution of amino acid types that occur in hot spots, in relation to the distribution of amino acid types in the database (Table 6). A similar analysis was performed in ref [19] and the results roughly agree with ours (e.g. tryptophan and tyrosine are among the most frequent amino acids in hot spots). We find however some notable differences, e.g. hot spots are not enriched in arginine (they are instead substantially enriched in lysine). Whereas the analysis in ref [19] was based on the whole ASEdb database, we have filtered out redundant entries and included only mutations located at a proteinprotein interface for which the crystal structure of the complex is available. This might be the origin of the observed differences.
In recent publications [13, 26, 27], mutation data extracted from the Binding Interface Database (BID) [39] have been used to test proposed hot spot prediction models. In our opinion, these data have several problems (e.g. they are not associated to ΔΔG values) and it is questionable whether they are useful in assessing the predictive power of a method. Nonetheless, for completeness we have extracted from BID our own data set and tested our method on it. We report the results in Additional file 4 together with a discussion of our concerns about this data set.
Data clustering for crossvalidation
Mutations in the same or homologous binding interfaces can not in general be expected to be unrelated. This constitutes a problem when estimating the performance of a model through, e.g., crossvalidation as there might be some 'trivial' similarities between data in the training and the test sets. In accordance, to avoid any potential bias, we have grouped together mutations belonging to the same or homologous complexes and assigned them to the same crossvalidation fold. For our purposes, we have considered as homologous two complexes that share at least one pair of interacting domains similarly classified at the Hlevel in CATH, i.e. domain pairs with the same first 4 CATH numbers.
The above protocol produces 16 clusters (see Additional file 2: Supplemental Table S6) and accordingly a 16fold crossvalidation strategy is employed to assess the performance of the method (see below for more details on crossvalidation). In order to assess the robustness of the approach and the presence of a residual redundancy in the 16fold partition of the data set, we have also tested more stringent clustering criteria, i.e. complexes are grouped together if at least one of the two following conditions is met

Interacting domains pairs are the same at the Tlevel (i.e. have the same fold),

Individual domains are the same at the Slevel and use the same binding interface, irrespective of their interacting partners.
This latter set of criteria results in 12 clusters (see Additional file 2: Supplemental Table S6) and in a 12fold crossvalidation strategy.
Input features: energy components
As input features for the machine learning algorithms we have used basic energy terms that have been found to be important for the stability of protein complexes. These are van der Waals potential, hydrogen bonds, Coulomb electrostatics and desolvation energy. For each of the four energy components we have separately calculated 3 different energy contributions (schematised in Figure 1):

Sidechain intermolecular energies: interaction energies between sidechain atoms of the mutated residue and atoms in the partner protein (respectively atoms in the red filled area and blue striped area in Figure 1).

Environment intermolecular energies: interaction energies between atoms in the two proteins that are within 10 Å of the C_{
β
}of the mutated residue (respectively atoms in the red striped area and blue striped area in Figure 1). We do not include the contribution from the mutated sidechain in this term.

Sidechain intramolecular energies: interaction energies between sidechain atoms of the mutated residue and other atoms in the same protein (respectively atoms in the red filled area and red striped area in Figure 1).
In total therefore there are 12 input features (4 × 3), although e.g. for Support Vector Machines we have used only a subset of them as this would give better or equivalent performances to the full set (see below for more details).
The energy components are calculated from the PDB structures. Only heavy atoms are considered and hydrogen atoms, if present, have been discarded. The computation of the different energy terms draws on established forcefields [11, 40, 41] but it incorporates some adjustments. It is important to remark that all terms are pairwise additive and therefore their calculation is formally equivalent. Before applying machine learning methods input features are rescaled so that they vary within similar numerical ranges. In the classification problem, each feature is standardised by subtracting its mean and dividing by its standard deviation. For regression purposes instead features are rescaled by their respective quadratic means. The latter normalisation makes it easier to search for solutions with no constant term, i.e. such that ΔΔG = 0 if all energy terms are zero.
van der Waals energy
The energy of van der Waals interactions between two atoms i and j is calculated using a "smoothed" 6  12 LennardJones potential
where r is the distance between the two atoms and the parameters r_{
ij
}and ϵ_{
ij
}are taken from CHARMM19 force fields [40]. The parameter r_{0} = 0.5 Å has the effect of widening the region of maximum affinity and to reduce the potential energy at r = 0 to a finite value. It makes the potential less sensitive to the precise position of atoms and to minor coordinate errors and local clashes. A similar option is available in the AutoDock suite [42]. We have also introduced a long distance cutoff, i.e. V_{
vdW
}(r) = 0 if r > 8 Å.
Electrostatics energy
Electrostatic interactions are evaluated with a screened Coulomb potential, in which the dielectric constant increases linearly with distance. The atomic partial charges have been taken from the CHARMM19 parameter set [40]. In the case of sidechain intermolecular interactions the charged forms of Arg, Lys, Asp and Glu is used [40]. For the environment intermolecular and sidechain intramolecular interactions instead a neutralised form of the ionic groups in these residues is employed [41]. This choice reflects the consideration that salt bridges in deleted sidechains might have a significant impact on binding free energy changes. In all cases we consider a neutral form of His (calculated as an average over the two possible single protonated configurations). As we do not explicitly include hydrogen atoms, their partial charges are simply added to those of the heavy atom to which they are covalently bound (e.g. the partial charge on a backbone N atom is taken to be 0.1 = 0.35 + 0.25, where 0.35 and 0.25 are respectively the partial charges associated with a bare backbone N atom and its attached hydrogen).
Hydrogen bond energy
We have followed the approach outlined in [11]. The energy of a hydrogen bond is computed as a linear combination of a distancedependent part and two angular dependent components. The distance dependent term is modelled with a 1012 potential, whereas the angular dependencies are derived from the probability distributions observed in highresolution crystal structures. The same energy function is used for sidechain/sidechain and sidechain/backbone hydrogen bonds but they are weighted differently. Sidechain/sidechain hydrogen bonds are further divided into three classes, depending on the extent of burial of participating residues. Weights for intermolecular and intramolecular hydrogen bonds are also different. Details and parametrisation can be found in [11].
In our implementation, we have introduced a smoothing parameter, r_{0} = 0.25 Å, in the distance dependent component, similarly to the van der Waals potential described above. We have not distinguished between sp^{2} and sp^{3} hybridised acceptor atoms, rather we have taken an average of the knowledgebased angular potential in these two cases. In ref [11] backbone/backbone hydrogen bonds are not included in the free energy function as alanine scanning does not directly probe them. There is therefore no reported weight associated to them. In our formulation instead backbone/backbone hydrogen bonds contribute to the environment intermolecular energies. For simplicity we have associated them the same weight as for the sidechain/backbone hydrogen bonds. We have used the program HBPLUS [43] to determine the characteristics (i.e. atomic distances and angles) of hydrogen bonds in the complexes.
Desolvation energy
The desolvation energy is evaluated using an implicit solvation model [41], which decomposes the solvation free energy into a sum of pairwise atomic interactions. The calculation of the input features associated with desolvation energy is therefore formally equivalent to the other energy terms. We have used a set of improved solvation parameters described in [44].
Support Vector Machine models
Support Vector Machines (SVMs) are playing an increasingly important role in the field of computational biology [20, 21]. They can be applied both to classification and regression problems. In the context of our investigation the former corresponds to predict whether a residue is a hot spot or not, the latter to estimate the actual value of ΔΔG. We have used the program package SVM^{light} [45], which is available at the website http://svmlight.joachims.org/. The program provides several standard kernel functions and we have experimented with linear, polynomial and Gaussian kernels, and with different combinations of input features. For classification purposes, we find the best results using a linear kernel and a set of 8 input features corresponding to the sidechain and environment intermolecular energies. Using either a polynomial or a Gaussian kernel and a different (possibly larger) combination of input features, the performance would not improve, so we have opted for the simpler model. In the regression task, the best results are obtained with a linear kernel and by adding the van der Waals term from the sidechain intramolecular energies to the features set (9 input features in total). The bias parameter in the kernel has been set to zero.
Standard SVM classifiers follow a supervised ("inductive") learning approach, whereby a predictive model is built on the basis of available positive and negative examples. The SVM^{light} package implements also a semisupervised learning algorithm, the socalled Transductive SVMs (TSVMs), in which unlabelled data are exploited in order to develop a better discriminatory model. In our case, unlabelled data correspond to interface residues that have not been mutated to alanine and for which therefore ΔΔG is not known. The unlabelled data are extracted from the 20 protein complexes in the data set and used to integrate the original mutational data (maintaining the 16fold partition described above for crossvalidation). Optimal performance for TSVM classifier is obtained with a linear kernel and the same 8 features as for the SVM classifier.
CrossValidation and Model Selection
SVMs have a number of tunable parameters (hyperparameters) which should be chosen in order to achieve good generalisation performance and avoid overfitting. For the classification task using a linear kernel, for example, two parameters, C and j, have to be set. The parameter C controls the trade off between the training error and the square norm of the weights associated to the features; the cost factor j determines how training errors on positive examples outweight errors on negative examples. An additional parameter is present in TSVMs, the fraction p of unlabelled examples to be classified into the positive class. In a regression setting the hyperparameter ϵ controls the tolerance to errors. The optimal values for the hyperparameters are not known in advance. In order to both evaluate fairly the performance of the methods and do not introduce biases in the choices of hyperparameters, we have implemented a nestedloop crossvalidation scheme [46] (see scheme in Additional file 2: Supplemental Figure S1) coupled with a grid search. The procedure ensures that hyperparameters are selected without ever considering the performance on the test sets.
In nestedloop crossvalidation, the data set is split into n folds (in our case n = 16 or n = 12). (n  1) folds are used as the training set and the remaining fold is used as test set. The hyperparameters are optimised on the training set, by applying a grid search and an internal (n  1)fold crossvalidation strategy. The hyperparameters that give the best crossvalidated performance are selected and the associated model is then applied to the heldout test set. This process is repeated n times so that each of the original n folds is used as the test set once and predictions are obtained for each entry in data set. The procedure therefore consists of two nested crossvalidation loops: an outer one for testing, an inner one for choosing hyperparameters. In the inner cycle, we have assessed the model performance by means of the F1 score and the rootmeansquare error (RMSE) for classification and regression tasks respectively (see below for more details).
For a given performance measure (e.g. the F1 score), we estimate its value f by considering the whole data set and by comparing predictions (obtained as described above) to the observed ΔΔG. The associated statistical error is instead evaluated as follows. Let {f_{
i
}}_{i = 1... n}be the set of optimised values obtained on the n training sets, as a result of the inner loops of crossvalidation. We then write
Note that the more standard procedure of calculating the score and its error from the average and standard deviation on the n test sets is not applicable in our case as individual test sets are rather small and of different sizes.
We have verified that our approach does not produce overfitting of the data by comparing results obtained from a 16fold and from a more stringent 12fold crossvalidation. A significant better performance for the 16fold crossvalidation would suggest overfitting. In addition we have tested the n models on their own training sets. Better results can clearly be expected in this case with respect to the test set but a large difference would possibly indicate overfitting of the data.
Gaussian Process models
Gaussian Processes (GPs) are machine learning methods rooted in a Bayesian probabilistic framework [16]. They can be used for both regression and classification tasks. GPs assume that the data are a finite number of noisy observations generated by an unknown function. In line with Bayesian approaches [47], GPs assign a prior probability to each possible function which is then updated into a posterior probability in light of the observed data using Bayes' rule. Predictions on new data are made by a weighted average over all possible functions (models), the weights being equal to the posterior probabilities. Notice the different learning strategy between SVMs and GPs: SVMs determine one optimal model and make predictions based on it, GPs estimate a probability distribution over all possible models and predictions are model averages. One advantage of the latter approach is that it provides also confidence measures (i.e. error bars) on the predictions.
A GP is defined by a covariance function, which is the analogous of the kernel function in SVMs. Learning in GPs involves selecting a suitable covariance function for the problem at hand. As for SVMs, standard covariance functions include linear, polynomial and Gaussian functions. For both regression and classification we find the best results using a linear kernel and the full set of 12 input features. In the regression problem, the bias parameter of the kernel is set to zero. The data likelihood is assumed to be isotropic Gaussian. Notice that by using a linear kernel, a GP regression becomes equivalent to a Bayesian linear model with a Gaussian prior probability distribution over the linear coefficients. The GP classifier is computed based on the Laplace approximation [16] and using a logistic likelihood function. Intuitively, GP classification can be viewed as a generalisation of logistic regression (see e.g. [48]). For the calculations, we have used a software developed by one of us (CA).
CrossValidation and Model Selection
Unlike SVMs, GPs do not require crossvalidation to select the kernel hyperparameters. They are learnt from the data by maximising a quantity called the logmarginal likelihood, which is obtained by integrating out the latent function values. This procedure, known as Occam's razor [49], implicitly penalises overcomplex models and is thus not prone to overfitting. In the case of a linear kernel there are no hyperparameters to select. The performance of the GPbased model is estimated using the same test set as the one used for the SVMs, i.e. the 16 tests corresponding to the 16 partitions used for the outer loop of the nested crossvalidation scheme.
Linear regression models
As a baseline regression method to compare results with we have built a linear model by least squares fitting (LLSF). The model has no constant term and the best fit is calculated by minimising the deviation between calculated and observed ΔΔG values. It includes all 12 energy features. We have also constructed "minimal" versions of the model based on only one or two features. Model parameters are calculated on the training sets and performances evaluated on the test sets, following similar procedia employed for SVM and GP regressions.
Measures of prediction performance
For classification, we primarily assess the prediction performances of our and related methods using the F1 score. Let TP, FP, FN refer to the number of true positives, false positives and false negative respectively. Precision (P, also called specificity) and recall (R, also called sensitivity) are defined as
The F1 score is the harmonic mean of precision and recall
Precision, recall and the F1 score are comprised between 0 and 1, the larger the value the better the performance. A random predictor on our data set would score P_{
ran
}= 0.23, R_{
ran
}= q and , where 0.23 is the fraction of hot spots in our database and 0 ≤ q ≤ 1 is the fraction of presumed hot spots in the predicted set. It follows that 0 ≤ F 1_{
ran
}≤ 0.37 and if the expected frequency is equal to the observed frequency of hot spots (i.e. R_{
ran
}= q = 0.23), then F 1_{
ran
}= 0.23. For each prediction set, we have also calculated the Matthew's correlation coefficient (MCC) given by
where TN is the number of true negative and TP, FP and FN are as above. The MCC value is between 1 and 1: a perfect predictor has MCC = 1 whereas for a random predictor MCC = 0 (MCC = 1 for a perfect inverted predictor)
The performance of the regression models is evaluated using the rootmeansquare error (RMSE) and the correlation coefficient (r) between the predicted (calc) and the observed (obs) ΔΔG
where sums are over the N entries in the data set and the symbols ⟨·⟩ and σ denote averages and standard deviations respectively. A regression model can be easily mapped into a classifier by associating a residue to an observed or predicted hot spot if ΔΔG ≥ 2 kcal/mol. The same performance measures used in the classification problem can then be used to assess the regression models as well.