Performance of Web tools for predicting changes in protein stability caused by mutations

Background: Despite decades of development of dedicated Web tools, it is still difficult to correctly predict the changes in the thermodynamic stability of proteins caused by mutations. Here, we assessed the reliability of five recently developed Web tools, in order to evaluate the progress made in the field.

Results: The results show that, although there have been improvements in the field, the assessed predictors are still far from ideal. Prevailing problems include a bias towards destabilizing mutations and, in general, unreliable results when the mutation causes a ΔΔG within the interval ± 0.5 kcal/mol. We found that using several predictors and combining their results into a consensus is a rough, but effective, way to increase the reliability of the predictions.

Conclusions: We suggest that developers consider using balanced data sets to train their future predictors, and that users combine the results of multiple tools to increase the chances of obtaining correct predictions of the effect of mutations on the thermodynamic stability of a protein.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04238-w.

on the stability of a protein is useful to identify therapeutic approaches for the treatment of diseases involving protein misfolding, such as the use of pharmacological chaperones [4]. These compounds either bind specifically to target proteins and stabilize their native conformation, or decrease the folding energy barriers, thereby promoting the transition of the misfolded forms into the native conformation [5].
The experimental assessment of the impact of a mutation on protein stability is not trivial, especially if hundreds of mutations are associated with a protein. Therefore, during the last three decades, dozens of computational methods have been developed to predict the alteration of the thermodynamic stability of a protein following mutations (for a recent and thorough review, see [6]). Their reliability has been questioned repeatedly by assessments performed in past years [7][8][9][10][11][12][13][14]; as a result, newer methods have been developed that claim to overcome the limitations of the older ones [6]. Several of these new methods, often available online through Web servers, are based on strategies, such as machine learning, that are supposed to produce more accurate results, but most have not been assessed by independent researchers.
In the past, we extensively used several of these methods to predict the impact of mutations on the stability of enzymes involved in the genetic disease galactosemia [15, 16], and we noticed that their predictions were often not in agreement with each other. Moreover, comparing their predictions with the ΔΔG of the mutation obtained experimentally, we often found that the predictors were not fully reliable. In order to increase the reliability of our predictions, we performed an independent assessment of predictors that have emerged as promising or more reliable in the last few years. Here, we describe the problems that we dealt with, the solutions that we adopted, the results that we obtained, and the take-home messages learned from this study.

Results
We evaluated five Web tools for the prediction of the effect of mutations on protein stability: DUET [17], INPS-3D [18], MAESTROweb [19], PoPMuSiC [12], DynaMut [20]. The most important features of each predictor are summarized in Table 1; for more details, the reader is referred to ref. [6] or to the original publications of the methods.
We collected a benchmark dataset of monomeric and multimeric proteins (Additional files 13 and 14: Files S1 and S2) with known values of ΔΔG associated with different mutations, starting from the dataset used by Vihinen and coworkers to develop their PON-tstab predictor [21], freely available in the VariBench benchmark database suite [22]. We predicted the change in stability of these mutant proteins and analysed the results of each predictor in order to determine their performance. As in this case the definition of both positive and negative predictions is arbitrary, we defined as positive predictions those mutations that caused a positive ΔΔG variation, and as negative predictions those mutations that caused a negative ΔΔG variation. We also separated those mutations that caused a ΔΔG variation ≤ |0.5| kcal/mol from those that caused a ΔΔG variation > |0.5| kcal/mol. In fact, the predictions for mutations with an experimental ΔΔG variation in the range ± 0.5 kcal/mol are likely less reliable, because of their proximity to the experimental error [8]. The "Methods" section reports detailed information about the criteria used for creating the dataset and the statistical analysis used for the comparison.

Analysis of the performance of stability predictors on the full dataset of monomeric proteins

Since most predictors of protein stability are set up to analyse single-chain proteins with single mutations, we first assessed their predictions on the full dataset of monomeric proteins (see "Methods" section). The results are shown in Table 2. Considering the data predicted for mutations with a ΔΔG value outside the range of the experimental error (Table 2, upper part), we found PoPMuSiC and INPS-MD to be the predictors with the highest number of true negatives (i.e. destabilizing mutations that are correctly predicted as destabilizing) and false negatives (i.e. stabilizing mutations that are wrongly predicted as destabilizing).
Conversely, we found DynaMut to be the predictor with the highest number of false positives (i.e. destabilizing mutations that are wrongly predicted as stabilizing) and true positives (i.e. stabilizing mutations that are correctly predicted as stabilizing), and with the lowest MCC value. However, the MCC values appeared to be relatively poor for all the predictors, considering that MCC = 1 indicates the best possible prediction and 0 a random prediction. As expected, the predictions made on mutations with a ΔΔG inside the range of the experimental error are poor (Table 2, lower part). In this case, PoPMuSiC gave the best result in terms of MCC, and INPS-MD and DynaMut gave the worst results. However, the MCC value appeared to be close to randomness for all predictors used on this subset of mutations.

Table 2: General results from the assessment on the full dataset of monomeric proteins. True negative and true positive values have been considered as those predictions that correctly predicted a negative and a positive sign for destabilizing and stabilizing mutations, respectively.

Figure 1 shows the ROC (Receiver Operating Characteristic) and PRC (Precision-Recall Curve) calculated on the full dataset of monomeric proteins. Considering the ROC for the subset of mutations with a ΔΔG inside the range of the experimental error (Fig. 1c), all five methods analysed were very close to the random dotted line: as expected, these ΔΔG values could not be reliably predicted. The PRC for this subset is shown in Fig. 1d. Again, all five methods were very close to the random dotted line, with rippled trends that do not indicate whether one method is better than another. This qualitative evaluation was confirmed by the AUC values for both curves, reported in Additional file 10: Table S1, from which it appears that all the methods but DynaMut have a difference of < 7% for the AUC related to both ROC and PRC, calculated for the monomeric subset outside the range of the experimental error, whereas DynaMut has a difference of around 16-24%.
The AUC values for the curves related to predictions for ΔΔG inside the range of experimental error are lower and similar for all predictors (Additional file 10: Table S1).
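The sign-based bookkeeping behind these tables and MCC values can be sketched in a few lines of Python. This is our illustrative sketch, not the code used in the study, and the function names are our own:

```python
import math

def confusion_counts(experimental, predicted):
    """Count TP/TN/FP/FN by sign agreement.
    Positive = stabilizing (ΔΔG > 0), negative = destabilizing (ΔΔG <= 0)."""
    tp = tn = fp = fn = 0
    for e, p in zip(experimental, predicted):
        if e > 0 and p > 0:
            tp += 1  # stabilizing, correctly predicted as stabilizing
        elif e <= 0 and p <= 0:
            tn += 1  # destabilizing, correctly predicted as destabilizing
        elif e <= 0 and p > 0:
            fp += 1  # destabilizing, wrongly predicted as stabilizing
        else:
            fn += 1  # stabilizing, wrongly predicted as destabilizing
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: 1 = perfect, 0 = random prediction."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Mutations with |ΔΔG| ≤ 0.5 kcal/mol would simply be evaluated as a separate subset, as done for the lower parts of Tables 2 and 3.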
Since in many cases the PDB file of a monomeric protein contains more than one chain, we performed a check to determine whether the positive or negative signs of the predictions made on each chain were in agreement. The results are shown in Additional file 11: Table S2. We found that PoPMuSiC was the only predictor with identical results (both in sign and in value) for all multiple chains, whereas the other predictors showed different results among the chains, indicating that they are influenced by subtle differences in the structures of the different chains. Nevertheless, the sign of the predictions was usually the same across the different chains for all predictors, although DynaMut showed the lowest agreement among the chains, especially for monomeric proteins with three chains in their PDB file.
We also analysed the agreement among the different predictors for mutations on the monomeric dataset. The results are shown in Additional file 12: Table S3. As expected from the previous results, we found that DynaMut was the predictor that agreed least with the others, while the other predictors were in agreement for more than 80% of the predictions, with no significant differences among them.
In the database used for these analyses, mutations to Alanine are about five times more frequent than mutations to any other type of residue (Additional file 14: File S2). To determine whether this unbalanced composition might bias the results of our assessment, we created a subset of monomeric proteins including all mutations except those to Alanine, and we recalculated ROC and PRC on this subset. The results (Additional file 1: Figure S1) showed that the ROC and PRC calculated for the subset of data without Alanine mutations were indistinguishable from the curves obtained for the entire dataset (Fig. 1a, b). In addition, the ROC and PRC calculated on the subset made up solely of monomeric proteins with mutations to Alanine are superimposable onto the general ROC and PRC (data not shown). Therefore, we concluded that the unbalanced composition of the dataset did not affect the results.
Another possible source of bias is the overrepresentation of mutations coming from a few specific proteins in the dataset. As shown in Additional file 14: File S2, three proteins each contributed more than 5% of the total mutations in the dataset of monomeric proteins. We therefore excluded these proteins and re-calculated the ROC and PRC curves. The results, reported in Additional file 2: Figure S2, showed that only the performance of DynaMut improved remarkably in this subset.
In order to determine whether the size of the dataset affected the results, we performed a bootstrap analysis (sampling with replacement), increasing the size of the subsets from 100 to 700 mutations, and measured how this affected the AUC values of the ROCs. The results for the largest and the smallest subsets are shown in Additional file 3: Figure S3. Increasing the size of the dataset produced narrower distributions of AUC values, but without a remarkable shift in their median, indicating that even the smallest subset, despite its broader distribution, was of sufficient size to give reliable results. DynaMut remained the predictor with the lowest AUC value in both subsets, followed by MAESTROweb, while the performances of the other methods were similar.
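A minimal version of this bootstrap procedure can be written as follows. This is an illustrative sketch with our own function names (the study used subsets of 100-700 mutations); the AUC is computed with the rank-based (Mann-Whitney) formulation, which is equivalent to the area under the ROC curve:

```python
import random

def auc(scores, labels):
    """Probability that a randomly chosen positive scores higher than a
    randomly chosen negative (ties count 0.5) -- equivalent to ROC AUC."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(scores, labels, n_boot=1000, size=100, seed=0):
    """Resample (with replacement) subsets of a given size, collecting the
    AUC of each resample; skip samples that contain only one class."""
    rng = random.Random(seed)
    idx = range(len(scores))
    aucs = []
    for _ in range(n_boot):
        sample = [rng.choice(idx) for _ in range(size)]
        s = [scores[i] for i in sample]
        l = [labels[i] for i in sample]
        if 0 < sum(l) < len(l):  # need both classes to compute an AUC
            aucs.append(auc(s, l))
    return aucs
```

The spread of the resulting AUC distribution (e.g. its interquartile range) is what narrows as the subset size grows, while the median stays roughly in place.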

Analysis of the performance of stability predictors on a balanced dataset of monomeric proteins
Destabilizing mutations are overrepresented in the full dataset of monomeric proteins; thus, we calculated the performance of the predictors using a subset in which the number and the distribution of the ΔΔG values of destabilizing and stabilizing mutations were well balanced (see "Methods"). The results are shown in Table 3. Using this dataset, we found that PoPMuSiC and INPS-MD had the highest numbers of true negative and false negative predictions, indicating that they are biased towards destabilizing mutations. A similar bias, although less evident, was also present for DUET and MAESTROweb. Conversely, DynaMut appeared to be shifted towards stabilizing mutations, being the tool with the highest numbers of true positive and false positive predictions.
Overall, considering the predictions for mutations with a ΔΔG value outside the range of the experimental error (Table 3, upper part), we found that on the balanced dataset the performance of all of the predictors except DynaMut was worse than that obtained on the full dataset. In the full dataset, the distributions of the predicted ΔΔG values were shifted towards negative values (Fig. 2a); in the balanced dataset (Fig. 2b), all methods but DynaMut were still shifted towards negative values. Finally, even in this balanced dataset, when the ΔΔG was inside the range of the experimental error (Table 3, lower part), the predictions of all of the Web tools were close to randomness, as indicated by MCC values close to 0.

Table 3: General results from the assessment on the balanced dataset of monomeric proteins. True negative and true positive values have been considered as those predictions that correctly predicted a negative and a positive sign for destabilizing and stabilizing mutations, respectively.

Analysis of performance of stability predictors on full and balanced datasets of multimeric proteins
The performance of the different predictors on the full dataset of multimeric proteins, considering only the mutations outside the range of the experimental error, is shown in Table 4, upper part, and Additional file 4: Figure S4. The performance of the predictors worsened slightly compared with the results obtained for the monomeric dataset, but there was no significant difference between the predictions made by MAESTROweb, which has a specific input option for multimeric proteins, and the other predictors. The analysis of the predictions based on the balanced dataset (Table 4, lower part, and Additional file 5: Figure S5) led to the same conclusions as those obtained from the monomeric dataset: all predictors except DynaMut were biased towards destabilizing mutations, while DynaMut still had the worst performance in terms of MCC values.

Analysis of performance of predictors on the full and balanced datasets of monomeric and multimeric proteins using a consensus of predictors
The MCC obtained for each predictor on our different subsets was still far from the ideal value of 1. In the past, we applied an empirical consensus method to the analysis of our Galactosemia Protein Dataset by attributing a classification to the mutants ("more stable"/"less stable") based on the consensus of at least two out of three (2/3) selected methods [15]. In order to determine whether this empirical procedure improved the performance of the available predictors, we assessed the reliability of the consensus of the predictors for attributing an effect (destabilizing or stabilizing) to the mutants in all of our datasets. We analysed the results of the consensus both of three out of five (3/5) predictors and of two out of three (2/3), checking in the latter case all possible combinations and selecting those that obtained the best MCC. The results are shown in Table 5. In general, we found that the MCC resulting from the consensus of the predictors was higher (in both the full and the balanced datasets) than the MCC values calculated for every single predictor. In addition, the consensus of 3/5 predictors performed better in the full dataset of monomeric proteins, whereas the consensus of 2/3 predictors performed better in the balanced dataset of monomeric proteins and in the full dataset of multimeric proteins. In the case of the balanced dataset of multimeric proteins, both consensus procedures performed as well as the method with the highest MCC value (PoPMuSiC).
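The consensus procedure amounts to simple majority voting on the sign of the predicted ΔΔG. A minimal sketch (our own naming, not the authors' code):

```python
def consensus(ddg_predictions, k):
    """k-of-n consensus: call a mutation stabilizing if at least k of the
    predictors assign it a positive ΔΔG, destabilizing otherwise."""
    votes = sum(1 for p in ddg_predictions if p > 0)
    return "stabilizing" if votes >= k else "destabilizing"

# 3-of-5 consensus over five hypothetical predictor outputs for one mutation:
call = consensus([-1.2, 0.4, 0.9, -0.1, 0.7], k=3)
```

The 2/3 variant is the same function with k = 2 applied to the outputs of the three selected predictors.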

Discussion
It is not a trivial undertaking to measure the effect of point mutations on the thermodynamic stability of a protein on a large scale. Many computational methods have been developed to predict the ΔΔG associated with protein mutations, and most of them are freely available via Web interfaces and provide their predictions quickly (from a few minutes to a few hours) [6]; thus, they are easy to use for "naïve users" in the field of bioinformatics. Nevertheless, several evaluations made in the last few years have questioned their reliability [7][8][9][10][11][12][13][14]. For this reason, we decided to perform a new assessment, approaching from a "naïve user" perspective the prediction of whether a particular mutation, or set of mutations, might destabilize a particular protein. From our recent overview of the predictors that are currently available [6], we selected five Web server tools with the aim of testing structure-based predictors that are representative of the different approaches developed to cope with the abovementioned issue, freely accessible to users, and able to evaluate many mutations in a reasonable time. Moreover, considering that several past assessments have raised the question of whether many predictors are biased towards destabilizing mutations, we specifically addressed this issue by creating a subset of balanced mutations, upon which we evaluated the performance of the different predictors. The results on the balanced dataset indicate worse performance of the predictors, suggesting that the abundance of destabilizing mutations in unbalanced datasets might mask the bias of the predictors towards destabilizing mutations. Comparing our results with those obtained 10 years ago by Khan and Vihinen [8] in an assessment that used the same metrics, it is evident that the performance of the most recent predictors is generally improved, but the road towards fully reliable predictions is still long and winding.
Indeed, although the accuracy and true negative rate are greatly improved, the true positive rate remains unchanged with respect to the older statistics. The MCC calculated for most of the methods available 10 years ago was close to 0, indicating that most of the predictions were close to randomness, and the best performing method showed an MCC of 0.27 [8]. In this study, the highest MCC value obtained by the best performing method is 0.58, more than twice that value, but still far from the ideal value of 1.
The results we obtained on the balanced dataset (Table 3) show that destabilizing mutations are better predicted than stabilizing ones by all predictors except DynaMut. Interestingly, we found that this tool had the highest number of true positive predictions (i.e. stabilizing mutations correctly predicted to have a stabilizing effect) in all the tested conditions. However, it was also the one with the highest number of false positive predictions (i.e. destabilizing mutations wrongly predicted to have a stabilizing effect), and this is the reason for its poor results in terms of MCC in all tested datasets.
Most of the available tools are not able to predict the effect of concurrent mutations on the different chains of a multimeric protein, so the predictions have to be made for a single subunit at a time. It is therefore difficult to derive an overall result for the whole protein, as the developers do not supply any indication on how to use their predictors in this case. To address this issue, we tested whether the prediction of the ΔΔG variation on the whole protein obtained with MAESTROweb (the only tool able to manage multimeric proteins) was comparable with the sum of the ΔΔG values of the single subunits predicted by the same tool. We calculated a correlation coefficient of 0.728; we therefore deduced that summing up the predicted ΔΔG variations of the single subunits is an acceptable approximation to evaluate the effect of concurrent mutations on the stability of a multimeric protein. The results obtained on multimeric proteins for all predictors (Table 4) show that MAESTROweb does not outperform the other tools, even though it has been tailored to manage multimeric proteins.
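The comparison between the whole-protein prediction and the sum of the per-subunit predictions boils down to a Pearson correlation. A sketch with hypothetical numbers (the coefficient of 0.728 reported above comes from the MAESTROweb data, not from these illustrative values):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series.
    Assumes neither series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: ΔΔG predicted on the whole multimer vs the sum of the
# per-subunit predictions for the same four mutations.
whole = [1.1, -0.5, 2.0, -1.8]
summed = [sum(chains) for chains in
          [(0.6, 0.6), (-0.2, -0.4), (1.1, 1.0), (-0.9, -0.8)]]
r = pearson(whole, summed)
```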
When the experimental ΔΔG value of a mutation is close to the experimental error (± 0.5 kcal/mol) [8], all the tools return essentially random predictions for this mutation. It is worth noting that the predictors currently associate a binary output (destabilizing/stabilizing) with the predicted ΔΔG value, with two negative consequences: (1) in the "twilight zone" of the experimental error, the risk of obtaining an unreliable prediction is very high; (2) the naïve user is misled in interpreting the real effect of the mutation. Indeed, a ΔΔG value of − 0.01 kcal/mol measured for a mutation means that the stability of the protein is unchanged, but these predictors would classify such a result as a "destabilizing effect". Therefore, we advise all users to treat with caution the predictions whose ΔΔG values fall within the range of the experimental error, and we suggest that developers define the effect of a mutation on the stability of the protein as "uncertain" when its predicted ΔΔG falls within the range ± 0.5 kcal/mol.
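The suggested three-way output could be implemented trivially on top of any predictor; this is a sketch of our proposal, not a feature of the assessed tools:

```python
def classify(ddg, error=0.5):
    """Ternary call: 'uncertain' inside the ±error kcal/mol twilight zone,
    otherwise stabilizing/destabilizing by the sign of the predicted ΔΔG."""
    if abs(ddg) <= error:
        return "uncertain"
    return "stabilizing" if ddg > 0 else "destabilizing"
```

With this convention, the − 0.01 kcal/mol example above would be reported as "uncertain" rather than as a destabilizing effect.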
Finally, using the results of single predictors to perform an in-house consensus procedure, we obtained predictions with increased reliability compared with the single best performing method. This solution provides a sort of compensation among the predictions, which helps overcome the shortcomings of each individual method. Moreover, even the single best performing method in our assessment in terms of MCC value (DUET) is, in its turn, a consensus method. Therefore, based on our experience, we advise all users not to settle for a single prediction, but to combine the results of multiple tools to increase the chances of correctly predicting the effect of a mutation on the thermodynamic stability of a protein.

Conclusions
The Web tools used in this work certainly perform better than the predictors assessed about 10 years ago, although improvements are still needed. As take-home messages, this work suggests, on the one hand, that developers should train their predictors on balanced datasets and carefully evaluate the cases with a very low impact on thermostability, and, on the other hand, that users should compare results from different predictors to obtain more reliable evaluations.

Methods

Web tools assessed for the prediction of protein stability
After a careful evaluation of the different approaches available to predict the impact of mutations on the thermodynamic stability of a protein [6], we selected the following predictors to perform our assessment: DUET [17] (http://biosig.unimelb.edu.au/duet/), a web tool combining two other complementary predictors (named SDM and mCSM), previously developed by the same research group, into a consensus predictor; INPS-3D [18] (https://inpsmd.biocomp.unibo.it/inpsSuite/default/index3D), a machine-learning method tailored to address the problem of the anti-symmetric property; MAESTROweb [19] (https://pbwww.che.sbg.ac.at/maestro/web), the only Web server able to manage both multimeric proteins and compound heterozygous multiple mutations; and PoPMuSiC (https://soft.dezyme.com/query/create/pop; accessible to registered users only, freely available for academic users), a method formed by a linear combination of statistical potentials, the coefficients of which depend on the solvent accessibility of the mutated residue. The new version of this predictor, developed by Pucci and coworkers in 2018 and called PoPMuSiCSym [12], has been tailored to correct the bias towards destabilizing mutations by imposing physical symmetries under inverse mutations; however, the version of the predictor executed by the Web server is called "PoPMuSiC". Finally, DynaMut [20] (http://biosig.unimelb.edu.au/dynamut/) is one of the most recent Web servers developed, based on a quite original approach in which Normal Mode analysis is performed to take into account the contribution of protein flexibility.
These predictors were selected because they are freely available online and representative of the different approaches developed during the last few years [6]. Moreover, PoPMuSiC was selected because it appears to be the best tool available according to the last extensive comparison made by Pucci et al. [12], and DUET because it is among the most popular stability predictor servers (as per Google Scholar citations). The other three predictors are more recent and had not previously been assessed by researchers other than their developers. In particular, INPS-MD was selected because its authors stated that they carefully designed the training set for its development to overcome the bias towards destabilizing mutations [18]. Finally, DynaMut was selected because it is very popular (as per Google Scholar citations), and MAESTROweb for its unique ability to manage multimeric proteins.
In order to predict the change in protein stability, all of these tools require: (a) the structure of the wild-type protein in PDB format; (b) the specification of the missense mutation that the user wishes to study; (c) the protein chain involved in the mutation. As output, each tool provides the predicted ∆∆G value of the mutant compared with the wild-type protein, and a binary classification of the mutation (either stabilizing or destabilizing) based on the sign of the predicted ∆∆G value. A negative sign for this value is associated with destabilizing mutations for DynaMut, DUET and INPS-MD, whereas PoPMuSiC and MAESTROweb associate a negative sign with stabilizing mutations. In order to standardize this binary output and make the statistical analyses easier, we inverted the sign of the output of PoPMuSiC and MAESTROweb in our statistical evaluations.
We arbitrarily defined as positive predictions those mutations associated with a positive ∆∆G sign, and as negative predictions those mutations associated with a negative ∆∆G sign. As this assumption can affect non-symmetric evaluations such as ROC and PRC curves, we report in Additional files 6 to 9 (Figures S6 to S9) the ROC and PRC curves obtained when the opposite choice is applied, i.e. when positive predictions are associated with negative ΔΔG predictions.

Dataset used for the assessment
To create the reference dataset for our assessment, we started from the benchmark dataset of proteins used by Vihinen and coworkers to develop their PON-tstab predictor [21], freely available at the VariBench database suite [22] (http://structure.bmc.lu.se/VariBench/stability.php; Dataset 5). This dataset, extracted from ProTherm [23], is probably the best curated resource of its kind currently available, but it still has limitations due to its redundancy in terms of protein families, its compositional bias in terms of the types of mutations included, an imbalance between destabilizing and stabilizing mutations, and the quality of the reference structures indicated [6]. To overcome these limitations, we created a "reference standard dataset" by extracting only high-quality reference proteins from VariBench. For this purpose, we first discarded: 130 mutations for which a reference structure was not cited in ProTherm; three proteins for which a discrepancy was present between the mutations reported in the sequence and in the structure (e.g. a mutation in the sequence not corresponding to the position of the mutation in the structure); several proteins for which the reference structure was obtained by NMR; one protein in which only the Cα trace was present in the reference structure; and one protein in which 19 mutations involve a residue included in a missing segment of the structure. We also discarded several proteins for which the quality of the structure was judged unsatisfactory, when evaluated according to the following criteria: resolution, Rfree/R value difference, Ramachandran outliers, sidechain outliers, normalized real-space R-value (RSRZ) outliers, and geometric quality of the chain [24]. Finally, we discarded several proteins in which the reference (wild type) structure contained some mutations. The group of mutations of phage T4 lysozyme represented a particular case.
The two reference structures for this protein indicated in the original dataset were 2LZM [25] and 1L63 [26]. Both are considered wild-type proteins, but 1L63 is the structure of the so-called "pseudo wild type" or "cysteine-free lysozyme", in which Cysteine 54 and Cysteine 97 were replaced by Threonine and Alanine, respectively, in order to disrupt the existing disulfide bridge between these two residues [27]. Comparing the mutations reported in the PON-tstab dataset with their references in the literature, we kept in our "reference" dataset only those mutations for which we could identify the reference lysozyme used for the experimental determination of ΔΔG.
Our final "reference standard" dataset for benchmarking includes 48 proteins, of which 10 are homodimeric, 1 is homotetrameric and the others are monomeric, according to the information on their biological assembly, derived from the indications in the RCSB PDB database [28], from the literature associated with each structure, and from visual inspection to solve ambiguities. In some cases, the multimeric proteins show only one chain in the PDB file; in these cases, we reconstructed the biological assembly by means of the MakeMultimer.py tool, available at http://watcut.uwaterloo.ca/tools/makemultimer/. On the other hand, many monomeric proteins show more than one (identical) chain in their PDB file; in these cases, we submitted each chain independently for prediction, and then averaged the ΔΔG values for each mutation and predictor. For multimeric proteins, only MAESTROweb allows the effect of mutations on a protein composed of more than one chain to be calculated directly. Therefore, we performed the prediction using the appropriate settings for multimeric prediction available on that Web server, whereas for the other predictors we performed the predictions separately for each chain, and then summed the ΔΔG values for each mutation and predictor.
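The per-chain aggregation rules described above (average over the identical chains of a monomer, sum over the subunits of a multimer) can be sketched as:

```python
def combine_chain_predictions(per_chain_ddg, multimeric):
    """Combine per-chain ΔΔG predictions into one value per mutation:
    sum for the subunits of a multimer, average for the identical chains
    of a monomeric protein whose PDB file contains several copies."""
    if multimeric:
        return sum(per_chain_ddg)
    return sum(per_chain_ddg) / len(per_chain_ddg)
```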
From these data, it appears that the distribution of mutations with ΔΔG < 0 and ΔΔG > 0 is unbalanced; in particular, mutations with ΔΔG < 0 are three times as numerous as those with ΔΔG > 0. Therefore, we derived two subsets of mutations, one for monomeric and one for multimeric proteins, well balanced in the distribution of ΔΔG, by keeping all stabilizing mutations and selecting an equal number of destabilizing mutations. The criteria used to balance the dataset were: (1) ΔΔG distribution: we divided the dataset into ΔΔG ranges (in absolute value), and selected for each ΔΔG range an equal number of destabilizing and stabilizing mutations; (2) mutant protein: for each ΔΔG range, we selected destabilizing mutations coming from the same protein, or at least from proteins of the same structural family, as identified by the CATH code [29]; (3) type of mutation: for each ΔΔG range, we selected a balanced number of mutations of the same type (e.g. a mutation replacing a hydrophobic residue with a negatively charged one). These two subsets were used to assess the bias of the predictors towards destabilizing mutations.
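A simplified version of the first balancing criterion (|ΔΔG| ranges only; the procedure above additionally matched protein family and mutation type) could look like this. This is an illustrative sketch, not the selection actually used for the paper:

```python
import random

def balance_by_bins(mutations, bin_width=0.5, seed=0):
    """Keep all stabilizing mutations and, for each |ΔΔG| bin, draw an
    equal number of destabilizing ones (sampling without replacement)."""
    rng = random.Random(seed)
    stab = [m for m in mutations if m["ddg"] > 0]
    destab = [m for m in mutations if m["ddg"] <= 0]
    # Group the destabilizing mutations by |ΔΔG| bin.
    bins = {}
    for m in destab:
        bins.setdefault(int(abs(m["ddg"]) // bin_width), []).append(m)
    # For each stabilizing mutation, draw one destabilizing partner
    # from the same bin, if available.
    balanced = list(stab)
    for m in stab:
        pool = bins.get(int(abs(m["ddg"]) // bin_width), [])
        if pool:
            balanced.append(pool.pop(rng.randrange(len(pool))))
    return balanced
```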
The original dataset, the reference dataset and the two balanced subsets are provided as Additional file 13: File S1. A detailed analysis of our reference dataset composition in terms of type of proteins, type of mutations, distribution according to experimental ΔΔG, residue distribution, contributions of different proteins, structural features is reported in Additional file 14: File S2.
As reported in Table 1, the datasets used for training all the predictors include, or are derived from, the so-called "S2648 dataset", comprising 2648 different point mutations (602 stabilizing and 2046 destabilizing) across 131 globular proteins with experimentally determined structures and impacts on protein stability, extracted from ProTherm and freely available at the VariBench database suite [22] (http://structure.bmc.lu.se/VariBench/stability.php; Dataset 10). This dataset was originally developed in 2009 to train a previous version of PoPMuSiC and is therefore less curated than the one we used to assess the performance of the predictors. Our dataset has 28 proteins (out of 48) and 641 mutations (out of 1024) in common with the S2648 dataset (Additional file 14: File S2 also details this information). However, as pointed out previously, in our dataset all chains included in the PDB files of the 28 proteins were taken into account, whereas only the first chain of the PDB file was taken as reference in the S2648 dataset. Moreover, in our dataset multimeric proteins were considered in their quaternary assembly, whereas in the S2648 dataset all proteins were treated as monomeric. For these reasons, our dataset cannot be considered truly overlapping with the S2648 dataset.
Criteria and statistical analysis used to perform the assessment

As our main goal was to assess whether the predictors correctly identify those mutations that destabilize or stabilize the protein of interest, we evaluated whether the sign of the ΔΔG predicted by the different tools was in agreement with the sign of the experimental measurement associated with the same mutation.
In our assessment, we evaluated the performance of the predictors separately for mutations with a ΔΔG value comparable to the experimental error of ΔΔG measurements (between − 0.5 and 0.5 kcal/mol [8]), and for mutations with a ΔΔG value falling outside this range.