 Research
 Open access
Statistical modeling to quantify the uncertainty of FoldX-predicted protein folding and binding stability
BMC Bioinformatics volume 24, Article number: 426 (2023)
Abstract
Background
Computational methods of predicting protein stability changes upon missense mutations are invaluable tools in high-throughput studies involving a large number of protein variants. However, they are limited by wide variation in accuracy and the difficulty of assessing prediction uncertainty. Using a popular computational tool, FoldX, we develop a statistical framework that quantifies the uncertainty of predicted changes in protein stability.
Results
We show that multiple linear regression models can be used to quantify the uncertainty associated with FoldX prediction for individual mutations. Comparing the performance among models with varying degrees of complexity, we find that the model precision improves significantly when we utilize molecular dynamics simulation as part of the FoldX workflow. Based on the model that incorporates information from molecular dynamics, biochemical properties, as well as FoldX energy terms, we can generally expect upper bounds on the uncertainty of folding stability predictions of ± 2.9 kcal/mol and ± 3.5 kcal/mol for binding stability predictions. The uncertainty for individual mutations varies; our model estimates it using FoldX energy terms, biochemical properties of the mutated residue, as well as the variability among snapshots from molecular dynamics simulation.
Conclusions
Using a linear regression framework, we construct models to predict the uncertainty associated with FoldX prediction of stability changes upon mutation. This technique is straightforward and can be extended to other computational methods as well.
Background
Computational methods that predict protein folding and binding stability are crucial in diverse areas of study ranging from molecular evolution to biomedicine. The changes in stability due to missense mutation can alter protein function, with implications for disease mechanisms and evolutionary trajectory. The ability to efficiently screen protein variants is also essential in protein engineering and drug discovery. Protein stability is quantified by a thermodynamic measure, Gibbs free energy (\(\Delta G\)), and the difference between the \(\Delta G\) of a wildtype and the \(\Delta G\) of a mutant (\(\mathrm{\Delta \Delta }G\)) indicates how strongly a mutation stabilizes or destabilizes folding or binding (Fig. 1).
While experimental measurement of \(\Delta G\) is laborious and costly, numerous computational methods that predict protein stability are available and enable quick screening of a large number of protein variants. There are several methods of constructing force fields to calculate protein stability. Physical effective energy functions, also referred to as physical-based potentials, explicitly calculate energies based on the mechanics of physical forces between atoms. Statistical effective energy functions, or knowledge-based potentials, use statistical analysis of existing structure data [1,2,3]. A third category, empirical effective energy functions, is based on empirical data from experiments on single or multiple site mutations, and is the basis of the FoldX program that is the focus of this work [2, 4, 5]. Other methods, such as those utilizing machine-learning algorithms rather than explicit computation of energy contributions, have also proliferated in recent years, adding to the wide array of tools available for in silico experiments in various fields [6,7,8,9,10,11].
FoldX is a popular tool due to its ease of use and computational speed. It calculates \(\Delta G\) as a linear combination of contributing energy terms (see Table 2 for the list of energy terms) which is fitted to experimental data [2, 5]. After computing \(\Delta G\) between unfolded and folded (or unbound and bound) states, \(\mathrm{\Delta \Delta }G\) is then calculated as the difference between the \(\Delta G\) of the wildtype and the \(\Delta G\) of the mutant protein, i.e., \(\mathrm{\Delta \Delta }G={\Delta G}^{Mutant}-{\Delta G}^{WT}\). In this definition, a negative \(\mathrm{\Delta \Delta }G\) indicates a stabilizing mutation while a positive \(\mathrm{\Delta \Delta }G\) indicates a destabilizing mutation.
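As a minimal sketch of this sign convention, the snippet below computes \(\mathrm{\Delta \Delta }G\) as a weighted linear combination of energy terms; the term names, weights, and values are hypothetical placeholders, not FoldX's actual parameters.

```python
def delta_g(terms, weights):
    """Weighted linear combination of energy terms (kcal/mol)."""
    return sum(weights[name] * value for name, value in terms.items())

def ddg(mutant_terms, wt_terms, weights):
    """ddG = dG(mutant) - dG(wild type); negative means stabilizing."""
    return delta_g(mutant_terms, weights) - delta_g(wt_terms, weights)

# Hypothetical term values and unit weights, for illustration only.
weights = {"vdw": 1.0, "solvation": 1.0, "entropy": 1.0}
wt = {"vdw": -3.2, "solvation": 1.1, "entropy": 0.4}
mut = {"vdw": -2.0, "solvation": 1.5, "entropy": 0.6}
print(round(ddg(mut, wt, weights), 2))  # 1.8 -> positive, i.e., destabilizing
```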
FoldX has been an integral tool in our group’s body of work that combines molecular modeling and mutagenesis experiments [12,13,14]. However, a measure of uncertainty significantly improves the usefulness of a computational prediction. With the exception of MAESTRO, a multi-agent, machine-learning-based algorithm that returns a consensus predicted value along with a confidence estimate [10], the only way to estimate the accuracy of most programs is through the published performance metrics available in the literature. However, benchmarking results from different sources vary widely and are not likely to offer practical insight for a particular experiment. For example, the developers of FoldX reported the correlation between predicted \(\mathrm{\Delta \Delta }G\) and experimentally-determined \(\mathrm{\Delta \Delta }G\) to be 0.81 [5], but other studies have reported different levels of correlation, from as low as 0.19 to as high as 0.73 [15, 16]. In addition to this wide variation, interpreting these values is complicated by the fact that correlation depends not only on the error in FoldX but also on the range and distribution of \(\mathrm{\Delta \Delta }G\) values among the mutations studied.
We also note that the error of an experimental measurement of \(\mathrm{\Delta \Delta }G\) is another source of uncertainty. It is generally assumed that the experimental error is small, given identical experimental conditions, but can range from 0.1 to 0.5 kcal/mol [17]. While the issue of experimental error is not the focus of this study, it should be clear that this is a source of potential uncertainty. Here, we make the assumption that the experimental \(\mathrm{\Delta \Delta }G\) values in databases represent the “truth.”
Our goal in this study is to develop a framework for quantifying the uncertainty associated with \(\mathrm{\Delta \Delta }G\) prediction by FoldX. To that end, we constructed a set of linear regression models using datasets containing 1187 mutations (672 for folding stability and 515 for binding stability), generated prediction intervals around the point estimate of each individual mutation, and assessed model performance based on interval width and coverage. We built models across a spectrum of complexity: obtaining \(\mathrm{\Delta \Delta }G\) by running FoldX on a single experimental protein structure vs. on snapshot structures from a molecular dynamics (MD) simulation, and predictor variables ranging from FoldX energy terms to biochemical properties. We find that incorporating MD simulation greatly improves model performance, and that the uncertainty of FoldX is typically on the order of ± 3 kcal/mol for folding stability, and even larger for binding stability. The result further shows that we can infer the magnitude of the error based on major energy contribution terms in FoldX and biochemical properties intrinsic to the mutated residue. Our findings offer a more realistic interpretation of FoldX predictions and provide a framework that can be extended to other similar programs.
Methods
Datasets
Ten protein systems with experimentally-determined \(\mathrm{\Delta \Delta }G\) of folding were selected from the ProTherm database [18], and ten protein complex systems with experimentally-determined \(\mathrm{\Delta \Delta }G\) of binding from the Skempi database [19], as we have previously described [20]. The criteria for system selection included having both stabilizing and destabilizing mutations and having more than 20 mutations that were not all to alanine. Table 1 lists the selected systems along with their PDB IDs, the number of residues in the protein, the number of mutated residues, and the number of mutations.
Obtaining predicted \(\mathrm{\Delta\Delta}G\)
The details of preparing structures, MD simulation, and FoldX analyses are described in Miller et al. [20]. To summarize, structure files downloaded from the PDB website using their PDB IDs (Table 1) were prepared for analysis by editing out unnecessary chains, fixing missing residues, and standardizing the nomenclature. The edited structure files were then used to perform 100 ns-long MD simulations with the GROMACS 5.0.3 MD package to sample the configurational variation of proteins observed under physiological conditions. One hundred snapshots, each 1 ns apart, were captured from each MD simulation. Each snapshot was then analyzed using FoldX 4.0 to calculate the \(\mathrm{\Delta \Delta }G\) of folding and binding upon mutations with available experimentally measured \(\mathrm{\Delta \Delta }G\) values. The \(\mathrm{\Delta \Delta }G\) values per mutation from each of the 100 snapshots were then averaged to obtain the final \(\mathrm{\Delta \Delta }G\). For comparison, we also built a set of analogous but separate datasets with \(\mathrm{\Delta \Delta }G\) values calculated by running FoldX on a single experimental structure file alone. Typically, a FoldX \(\Delta G\) calculation output file contains the \(\Delta G\) values associated with each energy term (Table 2). We used these output files to generate additional datasets with \(\mathrm{\Delta \Delta }G\) values of individual energy terms resulting from FoldX calculations with MD snapshots and with an experimental structure alone. The subsequent model selection and validation procedures described below were performed separately on these datasets.
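The snapshot-averaging step can be sketched as follows. The actual pipeline used GROMACS and FoldX; the values below are toy numbers standing in for the 100 per-snapshot \(\mathrm{\Delta \Delta }G\) values of one mutation, and the sample standard deviation is retained because it later serves as a predictor.

```python
import statistics

def summarize_snapshots(ddg_values):
    """Return (mean, sample SD) of per-snapshot ddG values for one mutation."""
    return statistics.mean(ddg_values), statistics.stdev(ddg_values)

# Toy values; the real workflow averages 100 snapshots, one per ns of MD.
snapshot_ddg = [1.8, 2.1, 1.5, 2.4, 1.9]
mean_ddg, sd_ddg = summarize_snapshots(snapshot_ddg)
print(round(mean_ddg, 2))  # 1.94
```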
Defining variables
Our goal was to predict the uncertainty of the FoldX-calculated \(\mathrm{\Delta \Delta }G\) (\({\mathrm{\Delta \Delta }G}_{FoldX}\)), assuming that the experimentally-determined \(\mathrm{\Delta \Delta }G\) (\({\mathrm{\Delta \Delta }G}_{exp}\)) represents the truth. We defined this quantity, Error (Fig. 2), as the absolute difference between \({\mathrm{\Delta \Delta }G}_{FoldX}\) and \({\mathrm{\Delta \Delta }G}_{exp}\), and assigned it as the response variable in the model search.
The \(\mathrm{\Delta \Delta }G\) of individual energy terms, e.g., van der Waals energy, solvation energy, entropy terms, etc. (Table 2), that contribute to the calculation of the total \({\mathrm{\Delta \Delta }G}_{FoldX}\) were considered as potential predictor variables in the model search. This allows us to learn whether an increase in any particular energy type correlates with increased error. When fitting models on the dataset generated from the MD snapshots, where \({\mathrm{\Delta \Delta }G}_{FoldX}\) is an average over 100 snapshots, the standard deviation (SD) of \({\mathrm{\Delta \Delta }G}_{FoldX}\) was also included in the pool of predictors. A large SD indicates that a large amount of conformational variation is observed across the MD simulation, which may have an effect on the Error. We also considered biochemical properties of the mutated residues, such as secondary structure and solvent accessibility, as potential predictors (Table 2).
Model selection and evaluation
All model search, validation, and subsequent analyses were performed using R 4.1. Two methods were used in the model selection: the stepwise selection method and the best subset selection method. In stepwise selection, predictor variables are alternately added and removed one at a time until the best model is reached based on a selection criterion (i.e., all further steps produce models with poorer fit). On the other hand, best subset selection compares models with every possible combination of predictor variables. The step function in the stats package and the regsubsets function in the leaps package were used for these methods, respectively. Both functions take a dataframe consisting of a response variable and all potential predictor variables as input, perform the described search algorithm, and output the best model based on a given criterion. Bayesian information criterion (BIC) was our selection criterion of choice and is given by: \(BIC=-2\,loglike+\mathrm{log}\left(n\right)\cdot d\) where \(loglike\) is the maximum log likelihood of the fitted model, \(n\) is the number of observations, and \(d\) is the number of predictor variables. BIC applies a heavy penalty for an increased number of parameters and thereby avoids overfitting.
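The study used R (step and regsubsets); as a language-neutral illustration, the sketch below implements exhaustive best subset selection scored by the BIC formula above (for a Gaussian linear model, \(-2\,loglike\) equals \(n\log(RSS/n)\) up to an additive constant). The toy data and predictor names x1 and x2 are hypothetical.

```python
import itertools
import math

def solve(M, b):
    """Solve M @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (A[r][n] - sum(A[r][c] * beta[c] for c in range(r + 1, n))) / A[r][r]
    return beta

def ols_rss(X, y):
    """Residual sum of squares of an ordinary least squares fit with intercept."""
    D = [[1.0] + row for row in X]
    p = len(D[0])
    ata = [[sum(d[r] * d[c] for d in D) for c in range(p)] for r in range(p)]
    aty = [sum(d[r] * yi for d, yi in zip(D, y)) for r in range(p)]
    beta = solve(ata, aty)
    return sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2 for d, yi in zip(D, y))

def bic(rss, n, d):
    """BIC up to an additive constant: -2 * max log likelihood + log(n) * d."""
    return n * math.log(rss / n) + math.log(n) * d

def best_subset(rows, y, names):
    """Score every non-empty combination of predictors; return the BIC-best one."""
    best_score, best_combo = float("inf"), ()
    for k in range(1, len(names) + 1):
        for combo in itertools.combinations(names, k):
            X = [[row[name] for name in combo] for row in rows]
            score = bic(ols_rss(X, y), len(y), k)
            if score < best_score:
                best_score, best_combo = score, combo
    return best_combo

# Toy data: y depends on x1 but not on x2, so the selected subset includes x1.
rows = [{"x1": float(i), "x2": float(i % 2)} for i in range(6)]
noise = [0.1, 0.2, -0.15, -0.05, 0.12, -0.1]
y = [2.0 * i + e for i, e in zip(range(6), noise)]
print(best_subset(rows, y, ["x1", "x2"]))
```

The exhaustive search is feasible here because the predictor pool is small; with the paper's larger pool, regsubsets prunes the search with a branch-and-bound strategy.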
We searched for five different types of models of varying levels of complexity, using the two selection methods (Table 3). For Model 1, the response variable, Error, is calculated based on a single experimental structure \({\mathrm{\Delta \Delta }G}_{FoldX}\), and the potential predictor variables made available to the model search were restricted to the individual energy terms from FoldX output (Table 2, “FoldX energy terms”). Model 2 is similar to Model 1 but the pool of predictor variables also included biochemical properties (Table 2, “Biochemical properties”). For Model 3, the response variable, Error, is calculated based on average \({\mathrm{\Delta \Delta }G}_{FoldX}\) from 100 MD snapshots. The potential predictor variables during model search were the individual energy terms from FoldX output (MD snapshot averages). Model 4 is similar to Model 3 but the potential predictors also included biochemical properties, and Model 5 also included SD values associated with each averaged energy term.
We compared model performance using a modified leave-one-out cross validation, a common method in which the \(i\)th observation in a dataset is excluded from model training. The model fitted with \(n-1\) observations is then tested on the excluded datapoint. This process is repeated to obtain \(n\) testing results, and the aggregate is an indication of overall model performance. This can take the form of an average mean squared error in cases involving continuous response variables, or an error rate in classification problems. While our model predicts Error on a continuous scale, we used the upper bound of the 95% prediction interval (Fig. 2) to calculate coverage as a performance metric. Also, because we were concerned about non-independence among data from the same system and suspected qualitative differences among the systems, we modified the method by excluding all datapoints from one system as the testing set, i.e., we leave one system out. The candidate model is fit using the datapoints from the remaining nine systems and tested on the excluded datapoints. This process is repeated for each system, and the coverage of a model is calculated as follows:
\(\mathrm{Coverage}=\frac{1}{N}\sum_{i=1}^{10}\sum_{j=1}^{{n}_{i}}I\left({\mathrm{Error}}_{ij}\le {B}_{ij}\right)\)

where \(N\) is the total number of observations in the dataset, \({n}_{i}\) is the number of observations in the \(i\)th system, \(I\) is the indicator function, and \({B}_{ij}\) is the upper bound of the prediction interval of the Error at the \(j\)th observation in the \(i\)th system. The magnitude of the upper bound is an indication of the model precision, and we used the median width (to avoid over-influence of large values) as a part of the performance metric. Altogether, we produced and evaluated a total of ten models: one set of five models (Table 3) using the folding stability dataset, and another set using the binding stability dataset.
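The pooled coverage computation (Eq. 2) can be sketched as below: the fraction of held-out observations whose Error falls at or below the model's upper prediction bound, pooled across all leave-one-system-out folds. The system names are reused from the datasets, but the Error and bound values are invented for illustration.

```python
def coverage(errors_by_system, bounds_by_system):
    """Proportion of observations whose Error is captured by its upper bound.

    Both arguments map system ID -> list of per-observation values.
    """
    captured = total = 0
    for system, errors in errors_by_system.items():
        for err, bound in zip(errors, bounds_by_system[system]):
            captured += err <= bound
            total += 1
    return captured / total

# Toy example: five held-out observations across two systems; three captured.
errors = {"1BNI": [0.5, 1.2, 3.0], "3HFM": [0.8, 2.5]}
bounds = {"1BNI": [2.9, 2.9, 2.9], "3HFM": [2.0, 2.0]}
print(coverage(errors, bounds))  # 0.6
```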
Results
Leave-one-system-out cross validation result
With datasets comprising single amino acid mutations with known experimental \(\mathrm{\Delta \Delta }G\) (\({\mathrm{\Delta \Delta }G}_{exp}\)) and FoldX-predicted \(\mathrm{\Delta \Delta }G\) (\({\mathrm{\Delta \Delta }G}_{FoldX}\)), we used the stepwise selection and best subset selection methods of model search to produce five pairs—one from each selection method—of different models (Table 3) for each of the folding and binding datasets. We then determined the better model from each pair by leave-one-system-out cross validation, where the model is fit with data from nine protein systems and tested on the remaining datapoints iteratively. The coverage rate from this process (Eq. 2) is the proportion of the prediction intervals that capture the Error when tested on the data the model was not trained on. Based on this measure of model accuracy and the median interval widths, a measure of model precision, we found that models from the best subset selection method performed better than those from the stepwise selection method. We tested backward elimination as well, but further removal of predictor variables did not improve performance when tested with cross validation.
After we determined the final five sets of models (Table 3) for each of the folding and binding datasets, we assessed the performance among them. Comparing models built with single-structure \(\mathrm{\Delta \Delta }G\) (Models 1 and 2) and those with MD snapshots (Models 3–5), coverage was similar in all cases at approximately 95%. However, the median width of the prediction interval decreased considerably in Models 3–5, especially in the folding dataset (Table 4). This implies that the additional information supplied by MD simulation brings a marked improvement in model precision. This is also evidenced by the large decrease in BIC between Models 1–2 and Models 3–5 (Table 4). Subsequent results and discussions will focus on the full model (Model 5), but we emphasize that qualitative results among shared variables are similar in all models.
While the models achieve 95% coverage overall, a closer analysis of the cross validation results reveals that the coverage varies widely among protein systems, as shown in Fig. 3 (panels B and D) and Table 5. For the folding energy dataset, there were a total of 36 outliers—i.e., those not captured by the prediction interval—out of 672 datapoints; 19 of them were mutations in 1VQB, 7 in 1BNI, and 5 in 1WQ5 (p-value = 0.00050, chi-squared goodness-of-fit test). Out of 27 outliers for the binding energy dataset, 15 were from 3HFM and 4 from 1PPF (p-value = 0.00050, chi-squared goodness-of-fit test).
Significant predictors
For the folding energy dataset, the predictors of the final models are van der Waals energy, van der Waals clash, side chain entropy, SD of total energy, involvement of proline, and secondary structure. For binding, the predictors are van der Waals clash, SD of backbone clash, SD of side chain entropy, SD of total energy, secondary structure, and relative solvent accessibility (RSA). Table 6 lists the predictor variables and their statistical details. Details for the entire ten models are available in the Additional File 1. Table 6 also shows the effect of each predictor on the prediction interval width (see details in table legend).
Discussion
A novel approach to estimate the error of predicted \(\mathrm{\Delta\Delta}G\)
This study seeks to quantify the uncertainty associated with computational prediction of \(\mathrm{\Delta \Delta }G\) to aid researchers in the informed usage of the prediction. Specifically, we focused on FoldX because it is one of the most popular and readily-available programs. We built multiple linear regression models with predictor variables selected from a pool of FoldX energy terms and biochemical properties to predict the errors associated with each \({\mathrm{\Delta \Delta }G}_{FoldX}\) value. We assessed models of varying complexity using an empirical test of leave-one-system-out cross validation and determined that the performance improves considerably in models trained on datasets based on \({\mathrm{\Delta \Delta }G}_{FoldX}\) from MD snapshots (Models 3–5). The coefficients (sign and rank) of the shared predictors among these models were qualitatively similar. Furthermore, the full model had the best BIC score for binding (\(\Delta BIC\) = −9) and was similar to Model 3 for folding (\(\Delta BIC\) = +2). Given this, we focused on the full model, which contains the most diverse set of predictors, in order to explore the effects of the FoldX energy terms, MD simulation, and biochemical properties of the mutated residues on estimating the error.
One of our major findings is that the margin of error in \({\mathrm{\Delta \Delta }G}_{FoldX}\) is much larger than generally assumed in the field. According to the cross validation results of Models 3–5, which utilize MD simulation, intervals of approximately ± 2.9 kcal/mol for folding stability and ± 3.5 kcal/mol for binding stability around a given \({\mathrm{\Delta \Delta }G}_{FoldX}\) are necessary in order to capture \({\mathrm{\Delta \Delta }G}_{exp}\) within a 95% prediction interval. The intervals are even wider for models based on single-structure calculations (i.e., no MD simulation). Given this wide margin, researchers should be cautious in how FoldX predictions are utilized.
While the median interval width provides a general idea of potential errors, the intervals can be smaller or larger based on the unique characteristics of the mutated residue and the nature of the amino acid substitution. The mutation-specific estimated error can then be used as a weighting factor in downstream usage of \({\mathrm{\Delta \Delta }G}_{FoldX}\). The conventional way of error estimation is limited to a blanket rule-of-thumb based on published software performance data, which has a wide range, as pointed out previously. While the developers of FoldX reported a correlation of 0.81 with an SD of 0.46 kcal/mol between \({\mathrm{\Delta \Delta }G}_{exp}\) and \({\mathrm{\Delta \Delta }G}_{FoldX}\) based on more than 1000 mutations [5], other studies have reported lower correlations: 0.54 based on a curated set of 605 mutations [25] and 0.50 based on another curated set of 1200 mutations [3]. Much lower correlations—as low as 0.19—come from studies that tested FoldX on a limited set of mutations from a single protein [15, 26]. In addition to the wide variation of reported accuracy, another limitation is that a summary measure such as the correlation coefficient does not fully inform the extreme ends of the variability in error.
Some studies suggest FoldX performs better as a qualitative predictor. In the aforementioned study by Potapov et al., the accuracy (the percentage of correctly predicted stabilizing or destabilizing mutations out of all mutations) of FoldX was 69.5% based on 1200 mutations, with the classification determined by whether \(\mathrm{\Delta \Delta }G\) was greater or less than 0 kcal/mol. The accuracy increased to 74.2% when only the mutations with \(\left|\mathrm{\Delta \Delta }G\right| >\) 2 kcal/mol were considered [3]. However, FoldX has a tendency to predict destabilizing mutations with higher accuracy than stabilizing mutations. This biased prediction accuracy is a well-known drawback of many computational programs, arising from the fact that a random mutation tends to be destabilizing and thus most training datasets are composed mostly of destabilizing mutations [27,28,29].
The overall accuracy of qualitative prediction is mostly driven by the fact that FoldX tends to correctly predict destabilizing mutations, which often make up the larger part of the training dataset, as in our case (Fig. 4A). In panel B, we can see that more than 75% of the mutations predicted to be destabilizing are correctly identified (pink, non-shaded areas). In contrast, more than half of the mutations which FoldX predicted to be stabilizing (\(\mathrm{\Delta \Delta }G<\) −0.5 kcal/mol, the two leftmost bins) are actually neutral or destabilizing (shaded white and pink areas). In Fig. 4C, where the colors indicate the possible classification based on our error bounds instead of \({\mathrm{\Delta \Delta }G}_{exp}\), we can see that most of the mutations near zero (\({\mathrm{\Delta \Delta }G}_{FoldX}\) within ±2 kcal/mol) cannot be classified with confidence. With the folding dataset, we can be more confident that mutations with \({\mathrm{\Delta \Delta }G}_{FoldX}>\) 2.5 kcal/mol are truly destabilizing. The unexpected trend in the binding dataset, where the uncertainty increases again for \({\mathrm{\Delta \Delta }G}_{FoldX}>\) 8 kcal/mol in particular, is due to all 15 mutations in that bin belonging to a single protein system, 1PPF, which has an unusually large \({\mathrm{\Delta \Delta }G}_{FoldX}\) and error bounds.
We were next interested in identifying the extent to which the type of mutation (stabilizing vs. destabilizing) affects the misclassification problem. The misclassification rates seen in Fig. 4B are influenced by the underlying distribution of \(\mathrm{\Delta \Delta }G\) for the mutations chosen in a particular study, as well as by the variance and bias of the prediction method, such as FoldX. Specifically, the low proportion of correctly-identified stabilizing mutations in Fig. 4B is a result of the inherent bias of FoldX and the low frequency of stabilizing mutations in our data. In order to understand the effect of the underlying distribution, we created a dataset by resampling (with replacement) 1000 observations from each bin along \({\mathrm{\Delta \Delta }G}_{exp}\), essentially simulating a scenario of picking mutations from a uniform distribution of \({\mathrm{\Delta \Delta }G}_{exp}\). When we visualize the proportion of correctly-predicted classifications, we see similar accuracy among stabilizing and destabilizing mutations (Fig. 5A). Essentially, a uniform sampling of \({\mathrm{\Delta \Delta }G}_{exp}\) yields high classification accuracy among mutations observed to be strongly stabilizing or destabilizing. However, misclassification remains a problem for mutations with \({\mathrm{\Delta \Delta }G}_{exp}\) values near zero. As Fig. 5B shows, only those predicted to be strongly destabilizing can be classified unambiguously, which was also the case before resampling (Fig. 4C). Altogether, we find that the uncertainty of the FoldX prediction is large enough that classifying mutations based on \({\mathrm{\Delta \Delta }G}_{FoldX}\) is not reliable.
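The bin-resampling step described above can be sketched as follows; the bin edges, helper name, and input values are illustrative assumptions, not the paper's actual binning.

```python
import random

def resample_uniform(ddg_exp, bin_edges, n_per_bin=1000, seed=0):
    """Draw n_per_bin values with replacement from each non-empty ddG bin,
    approximating a uniform distribution of experimental ddG."""
    rng = random.Random(seed)
    bins = [[] for _ in range(len(bin_edges) - 1)]
    for value in ddg_exp:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= value < bin_edges[i + 1]:
                bins[i].append(value)
                break
    return [rng.choice(b) for b in bins if b for _ in range(n_per_bin)]

# Toy input: six observations spread over three bins -> 3000 resampled values.
resampled = resample_uniform([-2.5, -0.5, 0.3, 1.7, 3.2, 4.8], [-3, 0, 3, 6])
```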
Predictor variables and their role
In addition to providing a measure of uncertainty of individual \({\mathrm{\Delta \Delta }G}_{FoldX}\) values, our model also gives insight into the factors that influence that uncertainty. In their publication describing the development of the energy function used in the FoldX program, Guerois et al. conducted a detailed analysis of outlier mutations in both the training database and the test database. Their explanations for the large discrepancies between predicted and experimental \(\mathrm{\Delta \Delta }G\) include mutations that greatly affect the unfolded state, possible structural relaxation from the removal of large side chains (most of these are mutations to alanine or glycine), and mutations from proline. Some of these outliers originate from a specific protein (1FMK in the training database and 1STN in the test database), and their analysis surmised that these discrepancies were not generalizable [2]. While we also observe system-specific differences in the median Error and the model’s predictive accuracy (Fig. 3 and Table 5), our model shows that the Error is influenced by factors that are generalizable across protein systems.
For the folding energy dataset, the van der Waals and van der Waals clash terms appear consistently with significant effects in all five models. Holding other variables equal, high values for these predictors cause a significant increase in uncertainty, as can be seen in the Coefficient estimates and the Effects on interval width in Table 6. Mechanistically, van der Waals energy arises from the electrostatic forces between atoms, relative to the distance between them. In the FoldX algorithm, the van der Waals term is determined by the transfer energy of the molecule from vapor to water. An increased van der Waals term implies a large change in interatomic forces as a residue is substituted. It may be that the larger this change, the more difficult and error-prone predicting the stability becomes. Similarly, amino acid substitutions that result in a large change in van der Waals clash appear to increase the Error as well. Clashes occur when there is an overlap of atomic radii, such as when a small amino acid is mutated to a large amino acid. While FoldX can compute local rearrangement around the mutated residue, it cannot account for the more global rearrangement of the protein that may occur, leading to an inaccurate prediction. This may also be the reason that using MD snapshots tends to improve FoldX prediction, since the protein can change conformation near the mutation site. The van der Waals clash in FoldX is a corrective term to compensate for the overestimation of solvation and van der Waals energies arising from a steric clash [2]. Large clash values thus imply that energies are calculated from incorrect atomic positions, and the compensation by the clash term may not be adequate for some mutations.
The effect of another significant predictor, side chain entropy, may follow a similar reasoning. The side chain entropy represents the entropic cost of fixing the side chain upon folding (or binding) —i.e., unfavorable \(\Delta G\) contribution due to the decrease in the conformational space available to the side chain arrangement. A large difference in \(\Delta S\) (entropy) between the wildtype and the mutated residue implies a severe steric restriction such as when a bulky amino acid replaces a small one. As with van der Waals energy and van der Waals clash terms, the accuracy of FoldX seems to decline when it computes a large change in entropy.
The SD of total energy indicates the variation among the \(\mathrm{\Delta \Delta }G\) of total energy calculated for each of the 100 MD snapshots, and is a significant predictor in the folding energy model. For binding energy, several SD terms, including van der Waals clash, backbone van der Waals clash, and side chain entropy, appear significant, and their contribution to the model performance is evidenced by the lowest BIC values (Table 4). Structures that exhibit large fluctuations of coordinates during MD simulation produce a large SD of the \({\mathrm{\Delta \Delta }G}_{FoldX}\) values calculated from those snapshots. This may mean that \(\mathrm{\Delta \Delta }G\)—whether experimental or computational—is difficult to measure accurately for mutated residues occurring in regions with high conformational variability. Other SD terms in the binding model have negative effects on the Error (the larger the SD, the smaller the Error), making it difficult to deduce the mechanism behind the observation. We can only surmise that the degree of conformational variation impacts the accuracy of the FoldX calculation.
As for biochemical properties, a notable predictor is proline—either a proline residue in wildtype or a mutation to proline. With all other variables being equal, the involvement of proline increases the uncertainty (i.e. interval width) by approximately 0.761 kcal/mol (Table 6). The cyclic structure of proline constricts bond angles and may disrupt the secondary structure of a peptide chain or introduce steric clashes, resulting in inaccurate \({\mathrm{\Delta \Delta }G}_{FoldX}\). Mutations involving glycine and alanine are also known to affect stability prediction and were tested during model search. However, in the presence of a larger pool of potential predictors (energy terms, etc.), only proline was significant enough to remain in the final models. We verified that this result was not due to sample sizes as the number of mutations involving proline (n = 27) was smaller than that of alanine (n = 251) or glycine (n = 105). The significance of proline mutations in spite of the small number of datapoints indicates that the unique biochemistry of proline makes accurate prediction of \(\mathrm{\Delta \Delta }G\) challenging.
The main driver of protein folding is hydrophobic packing where hydrophobic residues aggregate in the core and minimize their exposure to the polar solvent. Since a mutation of a core residue is more likely to disrupt the stability of the folded protein than a surface residue, we suspected that RSA, which measures how exposed or buried a residue is, might have a role in estimating the Error. Interestingly, RSA appeared only with the binding energy dataset, with a negative effect. A mutation occurring at a residue that has high RSA—i.e., more exposed and away from the binding interface—will not change \(\mathrm{\Delta \Delta }G\) significantly and therefore will have less error associated with the prediction. It appears that this effect of RSA on the Error is not as pronounced with \(\mathrm{\Delta \Delta }G\) of folding stability in the presence of many other potential predictors as with \(\mathrm{\Delta \Delta }G\) of binding stability.
The role of secondary structure as a predictor is less straightforward. We treated secondary structure as a categorical variable with seven levels following the DSSP classification scheme [23], excluding I, which is a rare \(\pi\)-helix and not present in our datasets. We suspected that highly organized structures such as helices or \(\beta\)-sheets (H, G, E) might show a significant contribution to the Error compared to less rigid structures such as loops (S, T), but did not see this pattern in our results. In spite of the unclear directionality of their effects and high p-values (Table 6), eliminating this predictor from the model resulted in inferior performance in cross validation.
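The seven-level encoding could be sketched as follows. This is illustrative Python: the coil symbol '-' and the choice of reference level are our assumptions, and a statistical package such as R would handle the factor encoding automatically.

```python
# Illustrative encoding of the seven-level DSSP secondary-structure
# predictor; the rare pi-helix 'I' is excluded, as in the paper.
DSSP_LEVELS = ['H', 'G', 'E', 'B', 'T', 'S', '-']  # '-' = coil (assumed symbol)

def one_hot(ss, reference='-'):
    """Indicator columns for every level except the reference level."""
    return {lvl: int(ss == lvl) for lvl in DSSP_LEVELS if lvl != reference}

helix = one_hot('H')  # {'H': 1, 'G': 0, 'E': 0, 'B': 0, 'T': 0, 'S': 0}
```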
Leave-one-system-out cross validation to account for the variation among protein systems and complexes
The applicability of this study is twofold: (1) we learn the factors that influence the Error, as discussed in the previous section, and (2) we can apply our trained models to a new protein system or complex for which we have no experimental data on stability. To achieve the second point, it was critical that we avoid overfitting and prioritize generalizability while keeping the loss of precision as small as possible. We used BIC, which applies a heavier penalty on the number of parameters than AIC, during the model search, and relied on leave-one-system-out cross validation rather than BIC alone to determine the best-performing model.
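The splitting scheme can be sketched in a few lines. This is a pure-Python illustration, not the authors' R implementation: each fold holds out all mutations from one protein system, so every model is evaluated on a system it has never seen during training.

```python
from collections import defaultdict

def loso_folds(records):
    """Leave-one-system-out splits.

    records: iterable of (system_id, features, target) tuples.
    Yields (held_out_system, train_rows, test_rows), one per system.
    """
    by_system = defaultdict(list)
    for rec in records:
        by_system[rec[0]].append(rec)
    for held_out in by_system:
        test = by_system[held_out]
        train = [r for s, rows in by_system.items() if s != held_out
                 for r in rows]
        yield held_out, train, test
```

For the ten systems in each dataset this yields ten train/test splits, with each test set consisting of one whole system.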
The result of this cross validation method simulates a real-world scenario: applying a model trained on a dataset with known \({\mathrm{\Delta \Delta }G}_{exp}\) to a new set of mutations, most likely in a protein that was not in the training data, where \({\mathrm{\Delta \Delta }G}_{exp}\) is unknown. The system-by-system breakdown in Fig. 3 and Table 5 provides a glimpse of the best-case and worst-case scenarios. For example, we can expect coverage greater than 95% for most systems (7 out of 10 for folding stability and 8 out of 10 for binding stability). It may drop below 80% in a minority of cases (as with 1VQB for folding energy and 3HFM for binding energy). The model precision (prediction interval width) also varies among systems, but not as widely as the coverage.
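Coverage, as reported per system here, is straightforward to compute (a minimal sketch; function and variable names are illustrative):

```python
# Per-system coverage: the fraction of mutations whose experimental ddG
# lies inside the model's prediction interval.
def coverage(ddg_exp, lower, upper):
    inside = sum(1 for y, lo, hi in zip(ddg_exp, lower, upper)
                 if lo <= y <= hi)
    return inside / len(ddg_exp)

# Two of three made-up mutations fall inside their intervals:
c = coverage([1.0, 2.0, 5.0], [0.0, 1.0, 1.0], [2.0, 3.0, 3.0])
```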
The uneven distribution of the model accuracy (in terms of coverage) suggests there are system-specific factors that we are not able to incorporate into our models. Nevertheless, the utility of the model lies in comparative analysis of mutations: given a group of mutations with FoldX predictions, the error bounds can alert the user to particular mutations whose true \({\mathrm{\Delta \Delta }G}_{exp}\) are likely to deviate greatly from \({\mathrm{\Delta \Delta }G}_{FoldX}\) based on mutation-specific characteristics.
The role of molecular dynamics simulation
While MD is not a part of the conventional FoldX workflow, it has been shown that averaging \(\mathrm{\Delta \Delta }G\) over MD snapshots improves the correlation between \({\mathrm{\Delta \Delta }G}_{FoldX}\) and \({\mathrm{\Delta \Delta }G}_{exp}\) [20, 30]. To test whether this improvement extends to error estimation, we built models separately using datasets with \(\mathrm{\Delta \Delta }G\) calculated from a single experimental structure (Models 1–2) and datasets with \(\mathrm{\Delta \Delta }G\) averaged over MD snapshots (Models 3–5). While all models maintained a coverage rate of approximately 95%, the median width of the prediction interval noticeably decreased in the models built on MD snapshot data (Table 4). The rationale for MD simulation stems from the fact that a single structure does not accurately represent the true conformational space of a protein in a natural, aqueous environment, which can lead to inaccurate \({\mathrm{\Delta \Delta }G}_{FoldX}\). MD snapshots sample various conformations of the protein, and the average \({\mathrm{\Delta \Delta }G}_{FoldX}\) across these structures takes this variation into account. Implicit in the SD value associated with each averaged \({\mathrm{\Delta \Delta }G}_{FoldX}\) is the degree of conformational variability, and we found these SDs to be significant predictors of the Error (Table 6). However, we are unable to delineate whether the extra information from MD simulation directly contributes to the improved precision, or whether it only decreases the Error itself, making the tighter interval a consequence of the smaller Error estimate.
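The MD-averaging step described above can be sketched as follows (illustrative Python; the snapshot values are made up, and the authors' actual analysis code, in R, is linked in the availability section):

```python
# FoldX is run on each MD snapshot; the per-snapshot ddG values are then
# summarized by their mean (the reported prediction) and standard
# deviation (which enters the regression as a predictor of the Error).
from statistics import mean, stdev

def summarize_snapshots(ddg_per_snapshot):
    """Return (mean ddG, sample SD of ddG) across MD snapshots."""
    return mean(ddg_per_snapshot), stdev(ddg_per_snapshot)

ddg_mean, ddg_sd = summarize_snapshots([1.2, 0.8, 1.6, 1.0])
```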
Altogether, the models are able to predict potential errors with much better precision when MD simulation is utilized. We recognize that MD simulation is a technical resource that may not be readily available. Even without it, we have shown that models built from single-structure datasets achieve comparable coverage, albeit with wider prediction intervals. Adding biochemical properties, which can be easily calculated, improves the precision slightly by narrowing the intervals.
Conclusions
Computational methods of predicting mutational effects on protein stability are invaluable in high-throughput mutational studies. While there are abundant benchmark studies on the performance of various methods, few programs offer a measure of the uncertainty associated with individual predictions. Focusing on the popular program FoldX, we built multiple linear regression models to predict the magnitude of the discrepancy between the experimental and computational stability changes due to single amino acid mutations. We found that the models perform best when supplied with information from MD simulation of the protein and biochemical properties of the mutated residues. However, simpler models based only on FoldX information can still be useful. Our models also provided mechanistic insight into the factors that contribute to the error. Because our models predict errors based on mutation-specific predictors, the resulting error estimates can be used as weighting factors in downstream analyses of FoldX results. This method can be extended beyond FoldX and has the potential to be a tool that helps researchers guide their computational predictions.
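As one plausible example of such downstream weighting (our illustration, not a procedure prescribed in the paper), each FoldX prediction could be weighted by the inverse square of its estimated error bound, treating the bound like a standard error:

```python
# Inverse-variance-style weights from per-mutation error estimates.
# Assumption: the estimated error bound behaves like a standard error,
# so smaller predicted errors earn proportionally larger weights.
def inverse_variance_weights(errors):
    w = [1.0 / e ** 2 for e in errors]
    total = sum(w)
    return [wi / total for wi in w]  # normalized to sum to 1

# A mutation with half the predicted error gets four times the weight:
weights = inverse_variance_weights([1.0, 2.0])
```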
Availability of data and materials
Datasets and the code to generate the results in the paper are available at https://github.com/yesols/foldx_uncertainty, archived at https://doi.org/10.5281/zenodo.7897628. The FoldX program is publicly available from its website: https://foldxsuite.crg.eu/. The DSSP program for obtaining secondary structure information is available at https://swift.cmbi.umcn.nl/gv/dssp/. PDB files for all twenty protein systems were downloaded from the RCSB Protein Data Bank website: https://www.rcsb.org/.
Abbreviations
BIC: Bayesian information criterion
MD: Molecular dynamics
RSA: Relative solvent accessibility
SD: Standard deviation
References
Lazaridis T, Karplus M. Effective energy functions for protein structure prediction. Curr Opin Struct Biol. 2000;10(2):139–45.
Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002;320(2):369–87.
Potapov V, Cohen M, Schreiber G. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel. 2009;22(9):553–60.
Mendes J, Guerois R, Serrano L. Energy estimation in protein design. Curr Opin Struct Biol. 2002;12(4):441–6.
Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005;33(2):W382–8.
Capriotti E, Fariselli P, Casadio R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 2005;33(2):W306–10.
Dehouck Y, Kwasigroch JM, Gilis D, Rooman M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinform. 2011;12(1):151.
Giollo M, Martin AJ, Walsh I, Ferrari C, Tosatto SC. NeEMO: a method using residue interaction networks to improve prediction of protein stability upon mutation. BMC Genom. 2014;15(Suppl 4):S7.
Pires DEV, Ascher DB, Blundell TL. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res. 2014;42(W1):W314–9.
Laimer J, Hofer H, Fritz M, Wegenkittl S, Lackner P. MAESTRO—multi agent stability prediction upon point mutations. BMC Bioinform. 2015;16(1):116.
Cao H, Wang J, He L, Qi Y, Zhang JZ. DeepDDG: predicting the stability change of protein point mutations using neural networks. J Chem Inf Model. 2019;59(4):1508–14.
Yang J, Naik N, Patel JS, Wylie CS, Gu W, Huang J, et al. Predicting the viability of beta-lactamase: how folding and binding free energies correlate with beta-lactamase fitness. PLoS ONE. 2020;15(5):e0233509.
Beach SS, Hull MA, Ytreberg FM, Patel JS, Miura TA. Molecular modeling predicts novel antibody escape mutations in the respiratory syncytial virus fusion glycoprotein. J Virol. 2022;96(13):e00353-22.
Li S, Patel JS, Yang J, Crabtree AM, Rubenstein BM, Lund-Andersen PK, et al. Defining the HIV capsid binding site of nucleoporin 153. mSphere. 2022;7(5):e00310-22.
Song X, Wang Y, Shu Z, Hong J, Li T, Yao L. Engineering a more thermostable blue light photo receptor Bacillus subtilis YtvA LOV domain by a computer aided rational design method. PLoS Comput Biol. 2013;9(7):e1003129.
Buß O, Rudat J, Ochsenreither K. FoldX as protein engineering tool: better than random based approaches? Comput Struct Biotechnol J. 2018;16:25–33.
Montanucci L, Martelli PL, BenTal N, Fariselli P. A natural upper bound to the accuracy of predicting protein stability changes upon mutations. Bioinformatics. 2019;35(9):1513–7.
Kumar MDS, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, Uedaira H, et al. ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res. 2006;34(1):D204–6.
Moal IH, Fernández-Recio J. SKEMPI: a Structural Kinetic and Energetic database of Mutant Protein Interactions and its use in empirical models. Bioinformatics. 2012;28(20):2600–7.
Miller CR, Johnson EL, Burke AZ, Martin KP, Miura TA, Wichman HA, et al. Initiating a watch list for Ebola virus antibody escape mutations. PeerJ. 2016;4:e1674.
Zamyatnin AA. Protein volume in solution. Prog Biophys Mol Biol. 1972;24:107–23.
Monera OD, Sereda TJ, Zhou NE, Kay CM, Hodges RS. Relationship of side-chain hydrophobicity and α-helical propensity on the stability of the single-stranded amphipathic α-helix. J Pept Sci. 1995;1(5):319–29.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–637.
Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilities of residues in proteins. PLoS ONE. 2013;8(11):e80635.
Broom A, Jacobi Z, Trainor K, Meiering EM. Computational tools help improve protein stability but with a solubility tradeoff. J Biol Chem. 2017;292(35):14349–61.
Ayuso-Tejedor S, Abián O, Sancho J. Underexposed polar residues and protein stabilization. Protein Eng Des Sel. 2011;24(1–2):171–7.
Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. The stability effects of protein mutations appear to be universally distributed. J Mol Biol. 2007;369(5):1318–32.
Pucci F, Bernaerts KV, Kwasigroch JM, Rooman M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics. 2018;34(21):3659–65.
Bæk KT, Kepp KP. Data set and fitting dependencies when estimating protein mutant stability: Toward simple, balanced, and interpretable models. J Comput Chem. 2022;43(8):504–18.
Gonzalez TR, Martin KP, Barnes JE, Patel JS, Ytreberg FM. Assessment of software methods for estimating protein–protein relative binding affinities. PLoS ONE. 2020;15(12):e0240573.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Science Foundation EPSCoR Track-II, Award No. OIA-1736253, and the Institute for Modeling Collaboration and Innovation at the University of Idaho sponsored by the National Institute of General Medical Sciences, award number P20GM104420. CM was also supported by National Institutes of Health Grant R01 GM076040.
Author information
Authors and Affiliations
Contributions
CM and MY conceived the project. JP generated computational data. YS wrote the code, analyzed data, drafted and edited the manuscript. CM supervised the analysis and revised the manuscript. JP and MY reviewed and revised the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
An R output file showing a performance comparison summary of all models and model details of each, listing the significant predictors.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sapozhnikov, Y., Patel, J.S., Ytreberg, F.M. et al. Statistical modeling to quantify the uncertainty of FoldX-predicted protein folding and binding stability. BMC Bioinformatics 24, 426 (2023). https://doi.org/10.1186/s12859-023-05537-0
DOI: https://doi.org/10.1186/s12859-023-05537-0