Relating destabilizing regions to known functional sites in proteins

Background Most methods for predicting functional sites in protein 3D structures, rely on information on related proteins and cannot be applied to proteins with no known relatives. Another limitation of these methods is the lack of a well annotated set of functional sites to use as benchmark for validating their predictions. Experimental findings and theoretical considerations suggest that residues involved in function often contribute unfavorably to the native state stability. We examine the possibility of systematically exploiting this intrinsic property to identify functional sites using an original procedure that detects destabilizing regions in protein structures. In addition, to relate destabilizing regions to known functional sites, a novel benchmark consisting of a diverse set of hand-curated protein functional sites is derived. Results A procedure for detecting clusters of destabilizing residues in protein structures is presented. Individual residue contributions to protein stability are evaluated using detailed atomic models and a force-field successfully applied in computational protein design. The most destabilizing residues, and some of their closest neighbours, are clustered into destabilizing regions following a rigorous protocol. Our procedure is applied to high quality apo-structures of 63 unrelated proteins. The biologically relevant binding sites of these proteins were annotated using all available information, including structural data and literature curation, resulting in the largest hand-curated dataset of binding sites in proteins available to date. Comparing the destabilizing regions with the annotated binding sites in these proteins, we find that the overlap is on average limited, but significantly better than random. Results depend on the type of bound ligand. Significant overlap is obtained for most polysaccharide- and small ligand-binding sites, whereas no overlap is observed for most nucleic acid binding sites. These differences are rationalised in terms of the geometry and energetics of the binding site. Conclusion We find that although destabilizing regions as detected here can in general not be used to predict binding sites in protein structures, they can provide useful information, particularly on the location of functional sites that bind polysaccharides and small ligands. This information can be exploited in methods for predicting function in protein structures with no known relatives. Our publicly available benchmark of hand-curated functional sites in proteins should help other workers derive and validate new prediction methods.


Background
Available three-dimensional structures of proteins of unknown biological role are rapidly increasing as a result of structural genomics initiatives [1,2]. This prompted the development of methods for annotating protein structures at the residue level and inferring binding sites using information from related proteins [3][4][5]. A common approach to detect functional sites in proteins has been to identify evolutionarily conserved residues that are spatially contiguous in the protein structure [6][7][8][9]. This approach has lately been extended and integrated with detailed analyses of structural features, related protein structures and sequence information by several groups [10][11][12][13]. But the lack of related proteins, a common occurrence with structural genomics targets, hinders the wide applicability of many of these methods [14]. In addition, there are now many examples where above-average sequence variability rather than sequence conservation is associated with functional regions [15,16].
For these reasons, methods capable of identifying functional residues in absence of information on conserved residues [17] have attracted considerable attention. Available methods of this type are based on the detection of particular geometrical features in the protein structure, such as clefts [18], proximity of residues to the protein center [19], mutual spatial proximity of residues [20], or spatial motifs such as the well-known catalytic triad in serine proteases [21,22]. More recently, methods using a combination of evolutionary, geometrical and stabilityrelated information to identify functional residues have also been proposed [23,24].
There are good indications that evolution often optimizes functional properties at the expense of thermodynamic stability. Site-directed mutagenesis of the catalytic residues in T4 lysozyme yields inactive mutant proteins that are more stable than the wild-type [25], indicating that the catalytic residues destabilize the wild-type enzyme. Similar conclusions were drawn from mutagenesis experiments on other proteins [26][27][28][29][30]. In particular, the catalytic power of enzymes is believed to result from the presence of specific constellations of polar residues in the active site, which can introduce either electrostatic [31] or steric [32,33] strain into the folded protein conformation in absence of the bound ligand [34]. Related to this observation is the recent finding that residues in left handed helices, which occur rarely in proteins, are often important for function [35].
Further evidence that protein sequences may not be optimized for protein stability has been provided by studies using computational protein design procedures. Those procedures select sequences that optimize the stability of a given protein three dimensional structure. They were recently shown to generate native-like sequences in the protein core but not on the surface, suggesting that surface residues may be selected primarily for functional reasons at the expense of stability [36].
Following this reasoning, several studies have shown that functional sites in protein structures can be detected by identifying residues positioned in unfavorable or unusual energetic environments. This includes the analysis of ionisable groups with perturbed titration curves [37] and the use of continuum electrostatics methods for the identification of polar residues engaged in unfavorable electrostatic interactions [38] in enzyme active sites [38][39][40]. Binding sites in proteins were also shown to consist of neighbouring regions of low and high stability [41].
A major challenge for functional site prediction methods in proteins is their validation against a benchmark set of known functional sites. Such benchmark must be large and diverse enough so as to cover many types of functional sites. These sites should furthermore be described in a standard fashion and this description should be based on all available information (structural, biochemical, sitedirected mutagenesis etc.). Unfortunately, such benchmarks are still unavailable, although efforts are currently in progress to address this issue. Resources like the Catalytic Site Atlas [42] are very helpful, but are limited to catalytic residues in enzymes. They hence lack information on non-catalytic ligand-binding residues or other types of functionally important residues. Resources such as Pdb-Sum [43], Pdb SITE records [44] or SwissProt [45] also provide useful information, but only for a subset of the entries. The BIND database [46] provides annotations for residues involved in ligand binding and different types of function, but makes no distinction between biologically relevant association modes and non-relevant ones. Other structure-based binding site databases suffer from that same limitation, and other issues that cannot be addressed without manual verification, such as inclusion of residues known to be important for function from nonstructural evidence [47,48].
Providing a comprehensive and relevant functional site benchmark for proteins is not straightforward and reflects the difficulty to define what a functional site is. Where should one draw the limit? Should residues important for maintaining the native 3D structure or for enabling conformational changes, both of which may be required for function, also be considered as functional residues? Even when focusing on ligand binding alone, important choices need to be made in defining the ligand binding residues. Are those the residues that are directly involved in non-bonded interactions with the ligand in the holoprotein, or should one include other residues in the neighbourhood? One may choose to define ligand bind-ing residues as those contributing significantly to the protein-ligand binding free energy. But this may likewise require the consideration of residues remote from the binding site, which might be involved in electrostatic steering effects [49,50]. Waiting for these issues to be addressed, functional site predictions are currently validated against information that is either approximate or incomplete. Some methods have been validated against sets of functional residues defined on the basis of thorough literature curation, but those are generally restricted to a very small numbers of proteins, and the definitions used are somewhat ad-hoc (e.g. [7]). This paper presents a procedure for detecting destabilizing regions in protein three-dimensional structures solely on the basis of objective energetic criteria. The correspondence between these regions and known functional sites is quantitatively evaluated in order to assess the effectiveness of energetic criteria alone in functional site prediction. Our analysis focuses entirely on ligand-binding sites. To enable adequate validation we build a benchmark of 74 such binding sites from a non-redundant set (with sequence identity of at most 25%) of 63 proteins having a high quality crystal apo-structure, and at least one characterised binding site. These are defined here using a set of objective criteria and information extracted from the 3D structures and from a comprehensive analysis of the associated literature.
In our procedure the contribution of each residue to the protein folding free energy (e.g. its stability) is evaluated using an all-atom force-field developed previously for protein design applications [36,51]. Residues providing destabilizing free energy contributions are identified and grouped together to yield the destabilizing regions in a stepwise protocol, which takes into account their proximity in the 3D structure and the level of their destabilizing contribution. This protocol is governed by 4 adjustable parameters, which have straightforward physical meanings. These parameters are adjusted so as to optimize the overlap between the identified destabilizing regions and known binding sites in a set of 7 proteins (the learning set), which are unrelated (< 25% sequence identity) to the set of 63 proteins used for the analysis (the test set).
A systematic comparison between the destabilizing regions identified by our procedure in the 63 apo-protein structures and the known ligand binding sites reveals that their overlap is on average limited, but significantly better than random. A statistically significant overlap between the two types of regions (destabilizing and binding sites) is obtained in 77% of the proteins in which destabilizing regions are detected. Most interestingly, our study shows that the extent of overlap largely depends on the type of ligand whose binding site is being considered. Largest overlaps are obtained for sites binding small ligands and polysaccharides, while very poor overlap is almost systematically obtained with nucleic acid-binding sites. These differences are rationalized in terms of the geometric and energetic properties of the various binding sites, and the potential of using energetic criteria such as those proposed here for the prediction of functional sites in solved protein structures with no known relatives is discussed. The software DESITE for identifying destabilizing regions in protein structure is available upon request.

Functional sites in proteins
In order to evaluate the degree of overlap between the destabilizing regions identified with our procedure and the regions that actually mediate function in the proteins of interest, an objective and unified description of the latter regions, termed here functional sites is required. To derive such description a detailed analysis that combined information from PDB entries and from biochemical and mutagenesis data extracted from the literature (see Methods) was conducted on the 63 proteins of our test set (see [52]).
A functional site was defined as a group of residues. In the vast majority of the cases the defined groups represent residues involved in ligand binding, where the ligands encompass molecules of different types and sizes. The defined sites are therefore strictly speaking ligand-binding sites. Hence residues not directly involved in ligand binding, but required for maintaining the stability of the native conformation or for enabling conformational changes required for function are not explicitly considered as being part of functional sites. Table 1 summarizes the salient features of the characterized sites by protein and ligand type. Further details on the properties of each binding site and the full list of residues in the sites can be found at [53]. For all homo-multimers in the dataset, equivalent copies of the binding sites occur in the different subunits, but only a single copy is discussed here.
The 63 proteins of our dataset were found to contain a total of 74 binding sites, with 9 proteins containing 2 binding sites, and one protein (CheY) with 3 binding sites. The analyzed proteins contain 10 nucleic acid-binding sites, 11 polysaccharide binding sites, 17 interaction sites with other proteins, 8 peptide binding sites, 3 for lipids, 8 for metal ions, and 24 small ligand-binding sites. Seven sites were found to bind multiple ligands. The size of the binding sites, expressed in terms of the number of residues per site copy, varies from 4 to 49 residues, with an average of 17 residues. The fraction of the total number of residues contained in individual binding sites ranges  between 1.5% and 31.3% (with an average of ~10%). But most binding sites (67/74) have a small relative size of less than 2% of the total number of residues in the protein.
The average amino acid composition found in binding sites is illustrated in Fig. 1. In comparison to the amino acid composition of the complete proteins of our set ( Fig.  1), binding sites are enriched in aromatic residues (W, Y), H, R, and N, and to a lesser extent in M and D. On the other hand these sites are depleted in P, K, E and the aliphatic residues (A, V, L, I). The lower occurrence of lysine residues in binding sites, as compared with other polar residues, thus follows the trend observed previously for protein-protein interfaces [54] and catalytic sites [55]. The enrichment in aromatic and polar residues such as H, T, R, and N most probably reflects the hydrogen bonding potential of the corresponding side chains, with the aromatics mostly occuring in sugar-binding sites. The frequent occurrence of Arg residues has previously been reported in interfaces of protein complexes [56].
Binding sites have been reported to often occur in large clefts at the protein surface [18]. This is particularly frequent for sites that bind small ligands as it enables the protein to surround such ligands almost completely in order to minimize contact with the solvent [57]. To investigate the extent to which this was also the case for the binding sites identified here, we checked whether the sites were located in one of the 4 largest clefts identified by SURFNET [18], following a set of recent rules used in conjunction with this program [58]. We consider a binding site to be part of one of these clefts if at least 75% of its residues are cleft residues as well. Residues lining SURFNET clefts are defined with the MASK program (provided with SURFNET) using default values [59]. Inspection of Table  1 reveals that out of the 74 binding sites in our dataset, 45 are located in such clefts, including the majority of the sites that bind small ligands (17/24) and polysaccharides (9/11). The frequency is lower for other ligand types, particularly for peptide and protein binding sites (5/17). As a consequence of their preferred location in clefts, residues belonging to functional sites generally have lower solvent accessibilities (25.2 ± 25.5%) than non-functional surface residues (40.3 ± 25.8%), with about 25% of residues in these sites being completely buried. It should also be noted that the largest cleft identified by SURFNET is usually much larger than any of the functional sites defined here (which comprise 17 residues on average), with the largest clefts containing at least 20 residues and often as many as several hundreds (see [60]). The probability that the overlap of such large clefts with the much smaller functional sites might occur by chance can therefore be quite high (see discussion below on the predictive power of destabilizing regions).

Destabilizing regions Salient features
Destabilizing regions are defined as groups of spatially neighbouring residues whose contribution to the free energy of the native state is unfavorable (destabilizing). Such groups are identified by first computing the contributions of individual residues to the protein folding free energy, selecting the residues with the most unfavorable contributions and delimiting regions in the protein that contain a high density of such residues, as described in Methods. Table 2 summarizes the salient features of the destabilizing regions identified in the 63 proteins of our dataset. They include the number of residues in each region, the fraction of the total number of protein residues found in the region, and whether the region maps into a large cleft. Further details for each site, including the identity of individual residues are given at [61].
In homo-multimeric assemblies the described destabilizing regions represent the smallest of the equivalent destabilizing regions identified in different subunits, and can be considered as the common core of these regions. A similar procedure was applied to define the common core of intersecting regions between binding sites and destabilizing regions, in multimeric proteins (see below).
A total of 121 destabilizing regions are detected in the dataset, but none are found in 11 proteins (pdb ids 1utx, 1gv2, 1eao, 1e7l, 1vyi, 1w9s, 1upq, 1w53, 1tgr, 1r29, 1sif). The number of destabilizing regions per protein ranges between zero in these 11 cases and 6 in one protein. Their size varies from 4 to 33 residues, averaging around 10 residues, and most destabilizing regions (101/ 121) contain less than 15 residues. Residues in a single destabilizing region represent between <1% and ~15% of the total number of residues in the protein, with an average of ~4%. Less than half (59/121) of the destabilizing regions map into one of the 4 largest clefts in the protein.
The average amino acid composition of destabilizing regions is illustrated in Fig. 1, alongside of the composition in binding sites and in the full proteins of our dataset. Relative to the amino acid composition of the full protein, these regions are highly enriched in S and D, and to a lesser extent in E, R, T and H, and are largely depleted in the aliphatic residues (A, V, I, L), as well as in K, W, G, M and C. The amino acid composition of the destabilizing regions thus displays some similarities to the composition of the binding sites (low representation of lysines and aliphatic residues) but also differs from it, most notably by the lower content of aromatics, and much higher content of S, T, D and E. Overall, polar and charged residues account for more than 75% of the residues in the destabilizing regions.

Origins of the unfavorable energetic contributions
To gain insight into the origins of the unfavorable energy contributions of the so-called destabilizing residues, the contribution of individual residues to the folding free energy of the protein ΔG f is decomposed into individual terms as follows: Where ΔG vdw , ΔG elec and ΔG solvation are respectively the differences in Van der Waals, electrostatic and solvation free energies between the folded state and the reference state for a given residue (see Methods). Figure 2 shows the average values and standard deviations for the different terms in Eq. 1 and the total free energy difference, computed for the 20 amino acid types, of all the residues of our dataset ( Fig. 2a) and of the residues identified as destabilizing by our analysis (Fig. 2b) (see Methods).
The polar residues in our dataset generally exhibit an unfavorable contribution to the solvation free energy difference ( Fig. 2a) because their polar groups are often partially buried in the folded protein. However, with those groups often engaged in hydrogen bonds with neighbouring residues in the protein, the unfavorable effect of desolvation tends to be at least partially compensated by more favorable electrostatic and Van der Waals interactions made in the folded protein than in water. For K, N, Q and Y residues these compensatory effects roughly balance each other, yielding a net contribution to the folding free energy that is near zero (Fig. 2a). In R, D, E, H, S and Amino acid composition in proteins of our dataset, functional sites and destabilizing regions Figure 1 Amino acid composition in proteins of our dataset, functional sites and destabilizing regions. Mean proportions of residue types for all residues in the dataset, binding sites residues and destabilizing regions residues. Residues are sorted with increasing hydrophobicity according to Kyte-Doolittle scale [76].  Average Van der Waals, electrostatics and solvation contributions, and total free energy difference, for each residue type Figure 2 Average Van der Waals, electrostatics and solvation contributions, and total free energy difference, for each residue type. Average values of the Van der Waals, electrostatics and solvation terms, and of the total free energy difference, for each residue type, computed over (a) all residues in the dataset, and (b) destabilizing residues. Standard deviations are indicated as error bars. Residues are sorted with increasing hydrophobicity according to Kyte-Doolittle scale [76].
T the unfavorable contribution due to desolvation is less effectively compensated by other terms, leading to a net destabilizing contribution. It is thus not surprising that these residues are significantly over-represented in destabilizing regions. For the hydrophobic residues desolvation generally results in a near zero or favorable contribution, whereas the contribution from Van der Waals interactions tends to be stabilizing, especially for F, Y and W, reflecting their tight packing in the folded state [62]. The values calculated for the electrostatic contribution are close to zero for most residue types, except for arginines where they seem to be greatly stabilizing.
For destabilizing residues (Fig. 2b) the contribution from desolvation is in general more destabilizing for all residues types, and the favorable contributions from the Van der Waals and electrostatic terms is in general weaker, leading to a net destabilizing effect overall. These trends are most salient for polar residues, which represent the major fraction of the destabilizing residues identified in our dataset. When hydrophobic residues are identified as destabilizing this is often due to the solvation and Van der Waals terms being unfavorable. In comparison to the hydrophobic residues in the full dataset whose contribution to the Van der Waals term is in general quite stabilizing, the absence of favorable Van der Waals contributions in destabilizing hydrophobic residues is particularly striking (Fig. 2a,b).

Relation between destabilizing regions and known binding sites
In total, there are more destabilizing regions (121) than known binding sites (74), but destabilizing regions tend to be smaller, with ~10 residues on average vs. 17 in functional sites. This results in similar average numbers of residues in destabilizing regions and known binding sites, per protein (19 and 20 residues, respectively). The average fraction of protein residues located in binding sites (10%) is however much larger than that located in destabilizing regions (~4%) because several analyzed proteins with known binding sites do not contain destabilizing regions.
We also see that known binding sites are more frequently located in large clefts than destabilizing regions (45/74 vs. 59/121).

Intersection of functional sites and destabilizing regions
Detected destabilizing regions and known binding sites show overlap in 45 out of the total of 63 proteins in our dataset. Residues shared by a known binding site and a destabilizing region constitute what we call here the Intersection Region (IR). Table 3 lists the details of the overlaps, including the number of residues in the known binding site, in the destabilizing region, and in the IR. In the case of mul-timeric proteins, the listed numbers were computed considering all the subunits. In total, 60 IR's are identified and more than one IR's are found in 16  Can destabilizing regions be used to predict functional sites?
Using the results on the overlap between destabilizing regions and known functional sites listed in Table 3, we now evaluate the extent to which destabilizing regions, as identified here, can be used to predict functional sites in a protein structure in absence of prior knowledge.
To that end the sensitivity S and positive predictive value (PPV) of the destabilizing regions were computed. For a given protein, S obs is the fraction of residues in known binding sites that are also part of destabilizing regions; and PPV obs is the fraction of residues in destabilizing regions that is intersecting with known binding sites (see Methods). Since our procedure was not trained on our dataset but on a different group of 7 proteins, this dataset can be used to compute these quantities. The average values of S obs and PPV obs in the dataset are respectively 25.7% and 27.0%. This means that, on average, about one fourth of the residues in a given binding site are also part of destabilizing regions, and that a little more than one fourth of the residues in destabilizing regions are also part of binding sites. The observed S and PPV values for each protein in which destabilizing regions were identified are listed in Table 3.
Although this overlap is modest and would not allow to identify functional sites in a quantitative fashion, it is statistically significant. The number of overlapping residues expected by chance for each binding site-destabilizing region pair is computed using the hypergeometric distribution (see Methods). These expected numbers of intersecting residues were used to calculate the expected sensitivity (S exp ) and PPV (PPV exp ) values for each protein.
The average S exp and PPV exp are 7.9% and 9.3%, respectively. A one-tailed Wilcoxon signed-rank test [63] was . i 1gus appears twice here because it has 2 distinct binding sites for small ligands. The same observation applies to 1e6l and 1uns that have 2 distinct binding sites for different proteins. j Average sensitivity (and standard deviation) for the given ligand type. k Average PPV (and standard deviation) for the given ligand type. then applied to the full set of observed and expected sensitivity and PPV values, and indicates that S obs are significantly larger than S exp (p = 3.305e -08) and that PPV obs are significantly larger than PPV exp (p = 4.091e -07).
Taking a P-value threshold of 0.05 in considering an overlap as statistically significant, yielded statistically significant overlaps in 40 of the 52 proteins in which at least one destabilizing region has been identified. In other words, if destabilizing regions are identified in a protein, there is 77% (40/52) probability that at least one of these regions will display a significant overlap with at least one known binding site. In the following we describe how these significant overlaps are distributed amongst the functional sites associated with different types of bound ligands. Figure 3 illustrates the comparison of binding sites with destabilizing regions as detected using our procedure. Results are shown for proteins that bind different types of ligands (polysaccharides, small ligands, nucleic acids and proteins).

Overlap with destabilizing regions as a function of ligand types
The extent of overlap between destabilizing regions and functional sites varies significantly with the type of ligand binding to these sites (Fisher exact test [63]: contingency  table is Table 4, p = 0.005). This is not too surprising given that the shape and composition of the functional site is in principle optimized to fit the type of ligand that binds to it, and that occurrence of destabilizing regions depends on shape and amino acid composition. The average sensitivity is highest (43.7%) for sites that bind polysaccharides and lowest (8%) for nucleic acid binding sites, whereas the average PPV ranges from 36.2% for protein binding sites to 10% for nucleic-acid binding sites (see Table 3).
Destabilizing regions identified here are reasonable predictors of binding sites for polysaccharides and small ligands, but very poor predictors of sites involved in nucleic acid binding, with the prediction performance for other types of sites being of intermediate reliability (see Tables  3 and 4). The better overlap with sites that bind small ligands and polysaccharides can be explained by the fact that many small ligands and polysaccharides bind to clefts enriched in polar and/or charged residues [57]. The polar residues in these sites therefore tend to be more buried than average, thereby providing a destabilizing contribution to the folding free energy in absence of the ligand [29]. The same applies to metal-binding sites located in deep clefts, with some exceptions however, as in phospholipase C ([PDB:1qmd]) where the zinc-binding site undergoes very large conformational change upon binding [64]. In internalin A ([PDB:lo6v]), CheY ([PDB:le6l) and endonuclease VII ([PDB:le7l), the metal ion binds in shallow clefts or flat surfaces where the residues have freedom to move to adapt to the absence of the ion.
In contrast, nucleic acids generally bind to larger regions with flat or convex surface shape. As a result, the binding site residues can be well solvated in absence of the bound ligand. Furthermore, nucleic acid binding regions usually include a sizable fraction of aliphatic and aromatic residues [65], which are poorly represented in the destabilizing regions identified here.
The overlap with lipid-binding sites, which are often located in deep clefts rich in hydrophobic residues, is in general rather limited. Significant overlap is however observed with sites in phospholipase C ([PDB:1qmd]) and crustacyanin C1 ([PDB:1obq]). These proteins bind polar heads carrying phospholipids, and the corresponding sites bury these heads inside polar clefts on the protein surface. It is these polar clefts that tend to overlap with the destabilizing regions identified in these proteins.
Protein and peptide binding sites are very diverse, and so is their overlap with destabilizing regions. Some functional sites or part of such sites are located in disordered regions and cannot be identified by our method (see Methods).

Discussion and conclusions
The basic assumption in this work has been that functional sites in proteins are very likely to contain residues that contribute unfavorably to the stability of the native conformation, due to evolutionary selection pressure for optimizing functional efficiency or specificity. This idea has been formulated previously by several authors [25,32,34,38,41], and illustrated in several proteins systems [25][26][27][28][29][30]. More recently, links have been established between functional sites and unfavorable solvation effects [66] or electrostatic interactions [39]. So far however, the relation between protein residues providing destabilizing contributions and functional sites has not been systematically investigated with rigorous statistical backing.
The present study attempted to fill this gap. It described a procedure for identifying regions in protein structures, containing residues that contribute unfavorably to the thermodynamic stability of the folded state. This stability was assessed from the experimentally determined atomic coordinates on the basis of a classical empirical energy function and standard parameters available in the CHARMM package, augmented with a surface area dependent solvation term. Contributions of individual residues were computed using a thermodynamic cycle that incorporates a simplified model for the unfolded state. Clusters of the most destabilizing residues were identified and extended to include their immediate spatial  [77] and rendered with Raster3D [78].

Examples of known binding sites and destabilizing regions identified in 3 proteins
neighbours, yielding the so-called destabilizing regions defined in this study.
Applying our procedure to a set of 63 high resolution protein crystal structures with well annotated ligand binding sites, but representing the apo-form of the protein, we were able to measure the overlap between these annotated sites and the identified destabilizing regions, assess its statistical significance and evaluate the effectiveness of using destabilizing regions for the prediction of ligand binding sites in proteins.
Although our results show that only about 25% of the residues in destabilizing regions, as defined here, belong to ligand binding sites and vice versa, this overlap is well above what would be expected by chance (~8%). We find furthermore, that when destabilizing regions are detected, they display statistically significant overlap with at least one known binding site in 77% of proteins examined here.
Another important finding of our analysis is that the extent of overlap between destabilizing regions and binding sites is highly dependent on the type of ligand bound to these sites. More extensive overlap is observed with binding sites for small ligands and polysaccharides whereas the overlap with nucleic acid binding sites is extremely poor. These differences are rationalized by the observation that the binding sites for small ligands and polysaccharides occur mostly in clefts lined with polar residues. Those become partially desolvated, as a result, leading to unfavorable contributions. In contrast, the nucleic acid-binding sites often consist of convex surfaces that are particularly rich in positively charged and polar residues. The latter are hence optimally solvated in absence of the bound nucleotides and therefore provide a favorable energetic contribution. But the relationship between destabilizing regions and functional sites cannot be reduced to geometric features or amino acid composition. For instance, polysaccharide-binding sites are rich in residue types that are rare in destabilizing regions but overlap well with the latter. Likewise, some clefts are not detected as destabilizing (e.g. in sphericase) whereas flat and convex regions are occasionally detected as such.
There is little doubt that the occurrence of ligand binding sites in clefts often makes physical and chemical sense.
Using the definition of clefts in a protein structure to predict functional sites is however far more challenging. SUR-FNET is a program that identifies clefts in protein structures. It was claimed by the authors that the largest cleft identified by SURFNET contains the protein binding site in a large majority of cases [18]. Applying SURFNET to our 63 structures and checking the overlap of the largest cleft identified by this procedure with the functional sites defined in our dataset, shows that SURFNET is 'better' at predicting functional site (average sensitivity of 46% compared to 26% with our approach). However SURFNET clefts are usually much larger than the functional sites (see [60]), and therefore tend to include these entirely in addition to including a large number of 'false positive' residues. The PPV of the SURFNET method is consequently much lower (15%) compared to our method (27%). For the same reason, the overlap noted here with destabilizing regions may often not be statistically significant (large size residue patches have a higher probability to overlap with another patch by chance). It was recently shown [58] that the poor specificity of SURFNET could be improved by using it in combination with the conservation-based method CONSURF [6]. Similarly, complementing SURF-NET with functional site prediction approaches not based on conservation, like the one presented here, may prove useful for cases where not enough homologues are available.
The energetic criteria used here to define destabilizing regions would also need improvements. These criteria currently rely on standard force fields and approaches, that suffer from many well documented limitations. The representation of electrostatic and solvation effects is far from optimal, although some of us have recently demonstrated that the addition of the simple surface area terms to the CHARMM potential, as done here, is superior to many of the more sophisticated continuum electrostatic models [67]. More importantly, our analysis completely neglects polypeptide chain entropy contributions to the free energy in both the folded and unfolded states, and our model for the unfolded state is extremely crude. Due in part to these limitations we chose not to consider residues with very high temperature factors in our calculations, as the atomic coordinates of those residues are likely to be inaccurate. However, it is well known that flexible regions often tend to be involved in recognition. It is therefore not surprising that our choice not to consider residues with high temperature factors resulted in the elimination of several destabilizing regions that show significant overlap with functional sites (e.g. the TcR-binding site of staphylococcal enterotoxin C2 ([PDB:1uns]).
A further factor that most certainly influenced the results of our analysis is the incomplete knowledge that we currently have of the functional sites of proteins on the one hand, and the lack of consistent annotations for the known sites on the other. Even one of the best characterized proteins, such as hen egg white lysozyme, features a myriad of binding and functional sites that have not been annotated and archived in databases. The so-called 'moonlighting' proteins where new binding sites and activities are discovered long after their first function was characterised, illustrate well this point [68]. To compensate at least in part for this shortcoming, the biologically relevant binding sites in the 63 proteins used in our analysis were manually annotated using all available literature evidence. The annotated binding sites are freely available at [53]. Clearly though, much more work is needed in order to produce both more consistent definitions of functional sites and to proceed with their annotations.
Despite the current limitations in using energetic criteria to identify functional sites in proteins, we believe that methods such as those presented here and future improved versions, will play an increasingly important role. Indeed the fact that they do not rely on information on related proteins, as most other methods presently do [6,23], should make them particularly useful for assigning function to proteins with no known relatives, of which a growing number is being currently discovered (metagenomics projects [69]).
Since functional site prediction methods that use information on sequences and structures of related proteins may also suffer from limitations due to the lack of sequence data, an approach in which such methods are combined with energetic criteria should help improve performance, as previously suggested [23]. Of particular interest are methods that detect spatial clusters of conserved residues, which were shown to greatly improve the performance of functional site prediction [6][7][8][9]. Further improvements may be obtained if those methods are combined with the search of spatial clusters of destabilizing residues, as done in this study, instead of considering only individual destabilizing residues [24].

Contributions of individual residues to protein stability
The contribution of residue i to the folding free energy of the protein is computed as the difference (ΔΔG i ) between the folding free energy in presence and absence of the considered amino acid side chain in position i, using the thermodynamic cycle shown in Figure 4, as follows: with representing the folding free energy of the protein in the presence of all the amino acids including that at position i, and (BB) representing the folding free energy of the entire protein in absence of the sidechain at position i. ΔΔG i hence takes into account the total free energy cost of desolvating in part or in whole the amino acid itself, as well as the cost of the partial desolvation of neighbouring residues and the vacuum interaction terms of the considered residue with all surrounding atoms. is computed as previously described [51]: Where G folded is the protein free energy in the folded state and G reference the free energy in a reference state, which is used as a model for the protein unfolded state. The free energy of the folded state is then expressed as an effective energy, which is the sum of the following terms [51]: E conformation is the classical conformational energy computed using the CHARMM 22 force field [70] which is expressed as a sum of pairwise contributions, and uses a full atom representation. G solvation represents the solvation free energy, computed using an empirical atomic solvation model [71] (see references [36,51] for further details). In these calculations the electrostatic term is computed using a dielectric constant of 8 and a switching function operating between 6-7 Å The free energy of the reference state G reference is calculated as the sum of the free energy contributions of isolated amino acids: Where i are the isolated amino acids, modelled by a standard dipeptide unit with N-acetyl-N'-methylamide backbone, and the sum is performed over the sequence of the protein. As for the folded state, G reference is expressed as a sum of two terms: where and are the contributions from conformational and solvation energies, respectively. Calculation of the two energy terms in Eq. 6 involves computing the Boltzmann averages of the conformational and solvation energies over all possible side chain conforma-tions of amino acid i. The same force-field is used as for the folded state calculations.
Prior to computing the energies in Eqs 4 and 6, hydrogen positions are added using the HBUILD command in charmm [72]. Histidine protonation is assigned on the basis of distance to neighbouring residues. The resulting structures are relaxed by applying 50 steps of steepest descent energy minimisation. The biologically meaningful quaternary structures are obtained with PQS [73], and manually verified with information from the PDB file and from the literature whenever available.

Defining clusters of destabilizing residues
The contribution of individual residues to the protein folding free energy is computed as detailed above and the values are ranked in ascending order starting with the most unfavorable contribution. Destabilizing residues are defined as the 28% residues with largest unfavorable contributions to stability in each protein, over the total ranked list of residues. A subset of "highly destabilizing" residues is defined as the 5% of the residues with the most destabilizing contributions. The precise values for these Destabilizing regions detection procedure Figure 5 Destabilizing regions detection procedure. (a) clustering of highly destabilizing residues (red) that are less than 9.0 Å apart. (b) addition of destabilizing residues (orange) that are within 6.0 Å of a destabilizing residue already present in a destabilizing region. (c) Final result. Only the destabilizing regions larger than 4 residues are considered. In (a) and (b) are represented residues (cyan), destabilizing or not, which are enclosed in a sphere centered on a pair of destabilizing residues and therefore added to the destabilizing region (see text for more details).
Thermodynamic cycle for calculating the contribution of a side-chain to the protein folding free energy Figure 4 Thermodynamic cycle for calculating the contribution of a side-chain to the protein folding free energy. ΔG folding is the folding free energy of the protein in the presence of all amino acids including the one at position i.
(BB) is the folding free energy of the protein in the absence of the side chain at position i. ΔG w-solv (SC) is the free energy cost of introducing the side chain of residue i into the water solvent. ΔG fp (SC) is the free energy cost of introducing the same side chain into the folded protein structure. ΔG fp (SC) includes the energy of interaction of the side chain with the surrounding residues in the protein structure, as well as the cost of burying the atoms of both the side chain and the surrounding protein structure.
ΔG i folding two thresholds were derived, as described in section on parameter fitting (below).
Destabilizing regions are detected using a 2-steps distance-based spatial clustering procedure illustrated at Figure 5 with pdb entry 1c7k. Each highly destabilizing residue is the seed of a destabilizing region. The highly destabilizing residues are clustered if they are less than 9.0 Å apart. Then, any destabilizing residue is added to a cluster if it is within 6.0 Å of a destabilizing residue already present in the cluster. In addition, when destabilizing residues are paired, a sphere, whose center is the center-ofmass of the pair and whose diameter is 75% of the distance between the 2 residues, is considered; any residue enclosed in that sphere, whether destabilizing or not, is added to the corresponding destabilizing region. Finally, destabilizing regions with less than 4 residues are discarded. This represents the size of the smallest binding site in the dataset.
The centre-of-mass of the residue side chain, excluding C β , is used in the calculation of all distances, with the exception of Gly and Ala, where the C α and C β coordinates are used, respectively. This approach is inspired by a method used to identify clusters of conserved residues [9].
The software DESITE for identifying destabilizing regions from the atomic coordinates is available upon request from the authors. A license to the CHARMM package is required.

Parameter adjustments
Our procedure has a total of 4 adjustable parameters.
These are the fractions of residues with unfavorable contributions to stability used to define the destabilizing and highly destabilizing residues, and the distance thresholds used to group highly destabilizing residues and destabilizing residues into the same destabilizing region.
The values of these parameters were obtained as follows. We first defined "reasonable" ranges of values according to the physical meaning of the parameters (e.g. distance parameters cannot be too large or the destabilizing regions would consist of the entire protein). We used the following "acceptable" ranges: the proportion of destabilizing residues is varied between 6 and 30%, whereas that of highly destabilizing residues is varied between 1 and 10%, and the distance for grouping 2 highly destabilizing residues is varied between 8 and 12 Å whereas that for adding a destabilizing residue to a cluster is varied between 5 and 10 Å. Values were changed in intervals of 1 (Å or %) within these ranges. Destabilizing regions obtained with all possible values combinations (with logical restrictions, i. e. the proportion of highly destabilizing residues must be smaller than that of destabilizing resi-dues) were compared with known binding sites in 7 proteins selected as described in the section on Protein datasets. The pdb identifiers of the apo-structures of these proteins used for parametrisation are 1bn6, 1c5h, 1e5m, 1glo, 1hl4, 1ogh and 1ojx. We selected the combination of values that yielded the best overall prediction accuracy with regard to the known functional sites in these proteins.

Filtering criteria
Not all identified clusters of destabilizing residues were considered for further analysis. Positions of atoms having high temperature factors (B factors) are considered as inaccurate, and could yield destabilizing contributions due their inaccuracies. Regions where more than half of the destabilizing residues either have a high average B factor or are located within 5 Å of such residues are therefore not analyzed. The average B factor of a residue is considered as "high" if it is larger than the average B factors taken over all residues in the protein plus two standard deviations. Regions where the majority of the residues have alternate conformations are also discarded because they may yield destabilizing contributions to be due to incompatible combinations of the alternate conformations. In homo multimeric proteins a destabilizing region is discarded if it is not detected in all subunits.

Protein datasets
From the February 2005 release of the PDB [44], we selected the subset of x-ray structures released after November 3rd 1999, with a resolution better than 2.4 Å, a R-value better than 0.20, no residues with missing coordinates except at the termini, and a SITE record in the pdb file. To eliminate structures with bound ligands (potential holo-forms), we filtered out entries with nucleic-acid chains or small ligands (HET-groups), and those that were neither protein monomers or homomultimers. Applying these drastic filters, and removing redundancy, using PIS-CES [74] with a 25% sequence identity cutoff, resulted in a set of only 7 structures. Those were used as our learning set to derive the values of the 4 adjustable parameters as described above.
To build our test set of apo-structures, we relaxed the above-mentioned filter on HET-groups to accept structures with HET-groups of 5 atoms or less, but verifying in all cases that these were not the biologically relevant ligands. To guarantee we could compare the predictions calculated on these apo-structures with the true functional sites for all proteins in the dataset, we excluded the proteins for which we did not find information on functionally important residues from any of the sources described in the section "definition of known binding sites" (see below).
Due to the SITE record filter, this dataset contained mostly enzymes. In order to include non-enzymes as well, we expanded the set by releasing the SITE record filter, while allowing only non-enzymatic proteins to be added. These structures were also filtered to remove entries with ligands, but included those for which a holo-structure, another structure of the same protein with its biologically relevant ligand, was also available (this was done using the RELATED record of the pdb file). The holo-structures were used to aid the functional site definition.
Redundancy was removed with a 25% sequence identity cutoff [74]. The final test set contains 63 proteins, comprising 35 enzymatic and 28 non-enzymatic proteins. In contrast, the learning set contains 7 proteins, all of which are enzymes.

Definition of known binding sites
Known binding sites are defined on the basis of structural and biochemical information. When one or several holostructures of the protein are available, we define a residue as ligand binding when at least two of its atoms are within a 6 Å distance from a biologically relevant ligand in the (ensemble of) holo-structures, ignoring hydrogens. To this the contents of the SITE record is added, manually checking that it contains biologically relevant information. The binding site definition is complemented by site-directed mutagenesis and chemical modification data, whenever relevant for the function of the protein, obtained by manually searching the available literature. If the structure of a protein in complex with a biologically relevant ligand is not available, the functional site is defined from literature only. We consider information from close homologs when there is evidence that the function is conserved. Out of the 63 proteins in the validation dataset, 49 have a binding site based on structural information only, i.e., where literature search did not add any residue, 10 have a binding site derived from a combination of literature searches and presence of close homolog complexes in the pdb, and for 7 proteins the binding site definition is based on literature search only. For these 7 proteins the known binding site does not form a continuous surface patch.
Protein-bound metal ions can have no other function than to stabilise the protein structure, or they can be directly implicated in the molecular function, as is the case for 8 proteins in the dataset. Metal-binding sites are considered only if the metal is known to be important for function, and if in addition it is absent from the apo-structure used in the energy calculations.
The descriptions of the known binding sites, with the literature citations are available at [53].

Evaluating the overlap between destabilizing regions and functional sites
To evaluate the overlap between destabilizing regions and known binding sites, the residues in each site (and region) are compared and the number of residues in common is computed. The number of overlapping residues expected by chance for each binding site -destabilizing region pair is computed using the hypergeometric distribution and from it a statistical significance value (P-value) is computed for the observed overlap, as implemented in the software Compare-Classes of the RSA-tools package [75]. When multiple destabilizing regions are detected in a given protein, they are grouped together as a single one for computing the P-value. For a destabilizing region and a functional site containing a and b residues, respectively, the probability of finding exactly c common residues between them is where is the binomial coefficient. The probability of observing at least c residues in common by chance is given by The quality of the overlap is also measured with the Sensitivity and Positive Predictive Value (PPV) : Where TP (true positives) is the number of residues correctly predicted as part of the binding site; FP (false positives) is the number of residues incorrectly predicted as part of the binding site, and FN (false negatives) is the number of residues incorrectly predicted as not part of the binding site.
Sensitivity is the proportion of residues in a known binding site that are found in the predicted sites (i.e. the destabilizing regions), and the PPV is the proportion of predicted sites residues that are part of a known binding site.

Authors' contributions
All authors participated in the design of the study. BHD performed the analysis and drafted the manuscript. MFL Sensitivity TP TP FN = + ( ) (9) PPV TP TP FP = + ( ) (10) and SJW provided feedback throughout the project, and SJW directed the work. All authors read and approved the final manuscript.