ResBoost: characterizing and predicting catalytic residues in enzymes

Background.
Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed.

Results.
We propose ResBoost, a new computational method to learn characteristics of catalytic residues. The method effectively selects and combines rules of thumb into a simple, easily interpretable logical expression that can be used for prediction. We formally define the rules of thumb that are often used to narrow the list of candidate residues, including residue evolutionary conservation, 3D clustering, solvent accessibility, and hydrophilicity. ResBoost builds on two methods from machine learning, the AdaBoost algorithm and Alternating Decision Trees, and provides precise control over the inherent trade-off between sensitivity and specificity. We evaluated ResBoost using cross-validation on a dataset of 100 enzymes from the hand-curated Catalytic Site Atlas (CSA).

Conclusion.
ResBoost achieved 85% sensitivity for a 9.8% false positive rate and 73% sensitivity for a 5.7% false positive rate. ResBoost reduces the number of false positives by up to 56% compared to the use of evolutionary conservation scoring alone. We also illustrate the ability of ResBoost to identify recently validated catalytic residues not listed in the CSA.

ConSurf, like ET, uses ideas from evolution to identify residues of functional importance. Based on the Rate4Site tool, ConSurf estimates the rate of evolution of each residue of the protein from the sequence and phylogenetic information, and then maps these rates onto the molecular surface of the protein to help identify patches that may be functionally important [1,2]. We obtained ConSurf scores from version 3.0 [2] by specifying the PDB ID and chain and using all the default settings (including ConSurf's pre-computed multiple sequence alignments based on MUSCLE [3] and trees based on the neighbor joining algorithm [4]). If a protein chain had fewer than the required 5 unique PSI-BLAST hits, we changed the default ConSurf settings to use UniProt instead of the standard Swiss-Prot (this was required for only 3 of the 100 enzymes in our dataset). For consistency across enzymes, we normalized the ConSurf scores within each enzyme so that the highest scoring residue is 1 and the lowest scoring residue is 0.
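The per-enzyme normalization can be sketched as follows. The text fixes only the endpoints (highest score 1, lowest score 0); the linear rescaling between them, the function name `normalize_scores`, and the handling of a constant score vector are assumptions made for illustration, and the input values are made up.

```python
from typing import List, Sequence


def normalize_scores(scores: Sequence[float]) -> List[float]:
    """Linearly rescale one enzyme's scores so that min -> 0 and max -> 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: if every residue has the same score, rescaling is
        # undefined, so all residues are mapped to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Made-up raw scores for the residues of a single enzyme chain.
raw = [-1.2, 0.3, 2.8, 0.8]
normalized = normalize_scores(raw)
```

Normalizing each enzyme separately keeps a single threshold comparable across enzymes whose raw score ranges differ.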

Solvent accessibility.
Catalytic residues must be at least somewhat solvent accessible in order to perform their biochemical function. We obtain solvent accessibility scores using DSSP [5], which is available from the PDB [6]. DSSP provides the surface area $a_i$ that is in contact with the solvent for each residue $x_i$. Given a threshold $A$, the solvent accessibility threshold classifier classifies a residue $x_i$ as TRUE if $a_i \geq A$ and FALSE otherwise.
The lack of solvent accessibility as measured by DSSP does not imply that a residue cannot be catalytic. Due to the complexity of enzyme interactions and the limitations of DSSP, some residues that are labeled as not solvent accessible in a static solved protein structure may in fact contact atoms in the solvent. This was shown to be the case for some catalytic residues in the CSA [7].
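A minimal sketch of the solvent accessibility threshold classifier described above, assuming the per-residue surface areas have already been read from DSSP output. The function name, the example areas, and the threshold value are hypothetical.

```python
from typing import Callable, List


def accessibility_classifier(threshold: float) -> Callable[[float], bool]:
    """Return a base classifier that is TRUE when the area is >= threshold."""
    def classify(area: float) -> bool:
        return area >= threshold
    return classify


# Hypothetical solvent-exposed surface areas, one per residue.
areas: List[float] = [0.0, 12.0, 45.5, 130.0]

# Hypothetical threshold A; the text does not give a specific value here.
classify = accessibility_classifier(threshold=12.0)
labels = [classify(a) for a in areas]
```

Because the comparison is inclusive, a residue whose area equals the threshold is classified as TRUE.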

Secondary structure.
Catalytic residues have been observed in all secondary structures of enzymes. However, the proportion of residues that are catalytic differs between secondary structure types. In particular, residues on alpha helices are somewhat less likely to be catalytic than residues on turns, loops, and coils [7]. Using DSSP [5], we classified each residue as being in an alpha helix, a beta sheet, or coil/other. We then defined one base classifier for each secondary structure type that classifies a residue as TRUE if the residue is in that secondary structure and FALSE otherwise.
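The three secondary structure base classifiers can be sketched as below. The reduction of DSSP codes to three classes is an assumption: here only `H` counts as alpha helix and only `E` as beta sheet, with every other code falling into coil/other. The text does not specify the mapping, and all names in the sketch are hypothetical.

```python
from typing import Callable, Dict

STRUCTURE_TYPES = ("helix", "sheet", "other")


def simplify_dssp(code: str) -> str:
    """Reduce a one-letter DSSP code to helix, sheet, or other (assumed mapping)."""
    if code == "H":  # alpha helix
        return "helix"
    if code == "E":  # extended beta strand
        return "sheet"
    return "other"   # turns, bends, remaining codes, and unassigned residues


def structure_classifier(target: str) -> Callable[[str], bool]:
    """Return a base classifier that is TRUE for residues in the target class."""
    def classify(dssp_code: str) -> bool:
        return simplify_dssp(dssp_code) == target
    return classify


# One base classifier per secondary structure type.
classifiers: Dict[str, Callable[[str], bool]] = {
    name: structure_classifier(name) for name in STRUCTURE_TYPES
}

# Made-up DSSP string for six consecutive residues.
dssp_codes = "HHTEE-"
is_helix = [classifiers["helix"](c) for c in dssp_codes]
is_sheet = [classifiers["sheet"](c) for c in dssp_codes]
is_other = [classifiers["other"](c) for c in dssp_codes]
```

Each residue is TRUE for exactly one of the three classifiers, so the boosting step can weigh the structure types independently.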

Catalytic propensity.
Bartlett et al. measured the frequency of each amino acid type for all protein residues in the CSA and compared this with the frequency of each amino acid type among the catalytic residues in the database [7]. The frequencies provided quantitative support for an intuition that many biologists already had: nonpolar amino acids such as alanine and valine are rarely catalytic while polar amino acids such as histidine and glutamine are often catalytic.
We considered two types of catalytic propensity, side-chain and main-chain. For each type, we built a table of catalytic propensities [7] and assigned each residue $x_i$ a catalytic propensity value $c_i$ based on its amino acid. Given a threshold $C$, each catalytic propensity threshold classifier classifies a residue $x_i$ as TRUE if $c_i \geq C$ and FALSE otherwise.
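The table lookup followed by thresholding can be sketched as follows. The propensity values are placeholders invented for illustration and are not the values from Bartlett et al. [7]; the function name, the threshold, and the default of 0 for amino acids missing from the table are also assumptions.

```python
from typing import Callable, Dict

# PLACEHOLDER values only: not the published catalytic propensities.
PLACEHOLDER_PROPENSITY: Dict[str, float] = {
    "H": 0.9, "D": 0.7, "S": 0.4, "A": 0.1, "V": 0.05,
}


def propensity_classifier(
    table: Dict[str, float], threshold: float
) -> Callable[[str], bool]:
    """Return a base classifier that is TRUE when propensity >= threshold."""
    def classify(residue: str) -> bool:
        # Assumption: amino acids absent from the table get propensity 0.
        return table.get(residue, 0.0) >= threshold
    return classify


# Hypothetical threshold C applied to a made-up five-residue sequence.
classify = propensity_classifier(PLACEHOLDER_PROPENSITY, threshold=0.4)
sequence = "HADVS"
labels = [classify(r) for r in sequence]
```

A second classifier of the same form, built from a separate table, would cover the other propensity type (side-chain or main-chain).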

Residue charge.
As in Bartlett et al. [7], we classified residues of type H, R, K, E, and D as charged. We defined a base classifier for charge that classifies a residue as TRUE if the residue is charged and FALSE otherwise.

Residue polarity.
As in Bartlett et al. [7], we classified residues of type Q, T, S, N, C, Y, and W as polar. We defined a base classifier for polarity that classifies a residue as TRUE if the residue is polar and FALSE otherwise.
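The charge and polarity base classifiers both reduce to set membership on the one-letter amino acid code. The sketch below uses the two residue sets given in the text; the function names are hypothetical.

```python
# Residue sets as listed in the text, following Bartlett et al. [7].
CHARGED = frozenset("HRKED")
POLAR = frozenset("QTSNCYW")


def is_charged(residue: str) -> bool:
    """Charge base classifier: TRUE for H, R, K, E, and D."""
    return residue in CHARGED


def is_polar(residue: str) -> bool:
    """Polarity base classifier: TRUE for Q, T, S, N, C, Y, and W."""
    return residue in POLAR


# Made-up four-residue sequence.
sequence = "HSAK"
charged_labels = [is_charged(r) for r in sequence]
polar_labels = [is_polar(r) for r in sequence]
```

The two sets are disjoint, so under these definitions no residue is labeled both charged and polar.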