How accurate and statistically robust are catalytic site predictions based on closeness centrality?
© Chea and Livesay. 2007
Received: 11 December 2006
Accepted: 11 May 2007
Published: 11 May 2007
We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices.
We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate from consideration residues with physicochemical properties unlikely to be compatible with catalytic requirements. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined.
Our results confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
The accurate and robust prediction of protein functional sites from sequence and/or structure remains an open problem in bioinformatics. Despite the limitations of current methodologies, several sequence and structure-based approaches have recently become popular. Most of these approaches rely on an underlying multiple sequence alignment and attempt to uncover some type of feature conservation therein (e.g. residues that are conserved across the alignment [4–6]). Arguably, evolutionary tracing has become the most widely used method for computational prediction of protein functional sites. The evolutionary trace (ET) approach begins with an alignment and corresponding phylogeny. The method searches for all alignment positions that recapitulate the overall phylogeny. While ET is fundamentally a sequence-based scheme, the standard application of the approach uses structural clusters of trace residues to identify functional regions [8–10]. Several other related methods that rely on an underlying alignment plus representative structure have proven useful as well [11–14]. Conversely, we have introduced a phylogenetic motif-based method that is similar in spirit to ET, although it is specifically designed to rely solely on sequence information [15–17].
The literature also contains a host of functional site prediction strategies that are explicitly designed to not rely on a phylogeny. These approaches are useful when too few sequences are available to generate a representative description of familial diversity. While their theoretical foundations vary considerably, most rely solely on structure or a structure + alignment combination. For example, Gutteridge et al. recently developed a neural network approach to predict catalytic sites. Catalytic sites are defined by residues directly involved in the enzyme-mediated reaction mechanism, which generally constitute a subset of all functional residues. The neural network input of Gutteridge et al. includes both structural and alignment descriptors, and is able to correctly predict the active site in >69% of the cases examined. The ability to rigorously benchmark the approach is based on comprehensive databasing and exhaustive manual curation of catalytic residues from the literature by the same group. This tour de force has led to the Catalytic Site Atlas (CSA), which contains approximately 600 different proteins with experimentally validated catalytic residues.
Other common catalytic site prediction methods are based on Poisson-Boltzmann continuum electrostatic theory. Elcock has observed that functional residues tend to have increased electrostatic strain energy, meaning that stabilization occurs on mutation. While the approach utilizes sophisticated Poisson-Boltzmann continuum theory, the underlying rationale is based on straightforward evolutionary arguments. The naïve description of protein evolution is that nature solely optimizes structural stability at each residue. However, catalytic and other important residues have functional constraints imposed upon them, meaning that while mutation might be stabilizing, it can occur at the expense of functional proficiency. The detangling of stability and functional evolutionary pressures is examined more thoroughly by Cheng et al. using all-atom protein design. Analogous to the electrostatic strain energy approach, the THEMATICS approach uses Poisson-Boltzmann-based pKa calculations to look for residue titration curves that do not follow Henderson-Hasselbalch behavior. The method looks for titration curves of partially charged residues that are flat over a wide pH range. Similarly, we have demonstrated that a large pKa shift from the null model (aqueous) value can be indicative of catalytic residues [26, 27]. However, the prediction accuracy of this approach is lessened because many structurally important residues (e.g. residues involved in a salt bridge) also have significant pKa shifts.
Network models have also been used with success in predicting protein functional and/or catalytic residues. Instead of representing protein structures as a Cartesian collection of atoms, network models recast protein structures as topological graphs [28–31]. The most common of these methods are based on protein structure contact maps, where each vertex of the graph represents an α-carbon and edges connect vertices within some distance cutoff (generally 6–9 Å). Once the graph is complete, a variety of topological metrics can be used to predict functional residues from it, including centrality [32, 33], valency and sub-graph conservation. Despite growing consensus concerning the utility of these methods, a robust assessment of their prediction accuracy remains to be completed. Amitai et al., Thibert et al. and del Sol et al. examine the ability of residue centrality to predict catalytic and/or functional sites within datasets of 178, 128 and 46 proteins, respectively. The results from these studies are encouraging. Moreover, they show that combining centrality with other metrics improves predictive power. For example, Amitai et al. demonstrate that combining centrality with solvent accessibility substantially improves accuracy, whereas both Amitai et al. and Thibert et al. demonstrate that including residue conservation improves results.
In this report, we investigate the accuracy and statistical significance of closeness centrality (CC) functional residue predictions. CC has previously been shown to be the best of several different network centrality scores (e.g. valency and betweenness) [32, 33]. Primarily, our investigation is based on SCOP superfamily-filtered protein chains from the CSA, which represent 283 unique SCOP superfamilies. Based on observed accuracies, CC is demonstrated to be a viable prediction scheme. Our results are in line with previous investigations, but are more significant due to dataset size and composition since we control for structural redundancy. A second distinction of this work is that instead of focusing on the entire range of true to false positive rates, as done by previous investigations, we concentrate on the very best CC scores. By focusing only on the top five scoring residues, we are able to evaluate the ability of the model to provide a reasonable number of experimentally testable predictions. In all cases, our predictions correspond to false positive rates below 1.6%. The performance of the method is improved substantially by considering only residues that are not completely inaccessible to solvent. We further demonstrate that filtering the predictions based solely on amino acid identity improves predictive power even more than filtering by solvent accessibility.
Closeness centrality for residue i is the reciprocal of the average shortest path between vertex i and all other vertices (Eq. 1):

$$CC_i = \frac{N_p - 1}{\sum_{j \neq i} L_{ij}}$$

where Np is the total number of vertices in the graph and Lij is the shortest path (geodesic distance) between vertices i and j. The shortest path is simply the minimum of all possible paths between residues i and j. As normally done in protein structure networks, edges are not weighted, making the shortest path simply an integer count of the number of edges separating i and j. It should be noted that Np (a constant within each protein) has no effect on our observed results since we are only using CC to rank the residues, meaning the inverse of the shortest path sum solely establishes which residues are ultimately predicted. Nevertheless, we employ CC here to be consistent with previous investigations.
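The calculation described above can be illustrated with a short sketch. It assumes α-carbon coordinates are already available as (x, y, z) tuples, takes CC as (Np − 1) divided by the sum of shortest paths, and uses a default cutoff of 8.5 Å, which is only an illustrative choice within the 6–9 Å range mentioned earlier. The function names are hypothetical and this is not the code used in this report.

```python
from collections import deque
from math import dist

def contact_graph(ca_coords, cutoff=8.5):
    """Adjacency lists: residues i and j share an edge if their C-alphas lie within cutoff (in Å)."""
    n = len(ca_coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def closeness_centrality(adj):
    """CC_i = (Np - 1) / sum_j L_ij, with L_ij counted in edges by breadth-first search."""
    n = len(adj)
    scores = []
    for source in range(n):
        hops = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        total = sum(hops.values())
        # Assumption: the graph is connected; unreachable vertices are simply ignored.
        scores.append((n - 1) / total if total else 0.0)
    return scores

# Toy example: five pseudo-residues on a line, 3.8 Å apart; a 4 Å cutoff links neighbors only.
coords = [(3.8 * k, 0.0, 0.0) for k in range(5)]
cc = closeness_centrality(contact_graph(coords, cutoff=4.0))
ranked = sorted(range(len(cc)), key=lambda i: cc[i], reverse=True)
```

In the toy chain the middle vertex ranks first, mirroring the observation that the most central positions tend to lie in the interior of the structure.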
Comparison of accessibility to catalytic residue distributions.
As stated above, several investigations have examined the prediction accuracy of global centrality metrics; however, none of the previous investigations are on the scale of this report. Nor have any rigorously controlled for structural redundancy as we do here. Of the previous reports, the largest dataset investigated is 178 proteins, which (unlike ours) contained redundant structural folds. Previous investigations use Receiver Operating Characteristic (ROC) plots to examine the balance between true and false positive rates over the entire relationship continuum. A false positive rate greater than 9% is commonly considered; Thibert et al. routinely consider false positive rates ~20%. In this report, we only consider the top N predictions, where N equals 1–5. From a pure prediction point of view, one wants to simultaneously balance sensitivity and specificity. However, when considering experimental realities, we believe our approach has more relevance because it is less likely to result in huge numbers of false positives that are intractable to test within the lab. The corresponding false positive rate of our predictions is always below 1.6%. Over the entire curve, our ROC plots are virtually superimposable on those within Thibert et al. (results not shown). Unfortunately, a direct comparison to the results within Amitai et al. is impossible since they only provide ROC plots for an integrated sequence conservation/centrality approach. An example ROC plot is provided in Additional file 1.
Table 2. Evaluation of catalytic site predictions. Columns: per PDB accuracy; TP & FP rate; 1 correct per PDB; 1 correct expected. Panels: (a) raw CC values (no filter); (b) solvent accessibility filter; (c) residue identity filter; (d) combination filter (solvent accessibility + residue identity).
In Eq. 2, n is equal to the number of predictions put forth, k (in the first iteration of the sum) is equal to the number of correct predictions, and p represents the random (null) probability. Each step of the sum is calculated from the binomial distribution (binomdist) function within Microsoft Excel. Notice that the p-values decrease monotonically (see Additional file 2), despite the fact that the relative accuracies are not monotonic. In fact, as demonstrated below, relative accuracies generally decrease as a function of Tnp. Nevertheless, p-values indicate that the results become more statistically significant at larger Tnp values. This apparent contradiction highlights the true meaning of a p-value. A p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true; it is not the accuracy of the method. The smaller the p-value, the more significant the observed results. However, statistical significance is intimately related to the number of observations: the more observations of a given difference between an observed and null probability, the more significant it is. The number of predictions put forth at each level increases substantially, whereas the accuracy is only slightly diminished, which is why p-values decrease monotonically as a function of Tnp.
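The sum described here can be sketched as an upper-tail binomial probability. The example assumes the sum runs from k to n over terms C(n, i) p^i (1 − p)^(n − i), which is inferred from the description of n, k and p rather than copied from Eq. 2, and it requires 0 < p < 1. Terms are evaluated in log space so that large n does not overflow. The counts are hypothetical and are not values from this report.

```python
from math import exp, lgamma, log

def log_binomial_pmf(n, i, p):
    """Natural log of C(n, i) * p**i * (1 - p)**(n - i), valid for 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return log_choose + i * log(p) + (n - i) * log(1 - p)

def binomial_tail_pvalue(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k correct out of n predictions under the null."""
    return sum(exp(log_binomial_pmf(n, i, p)) for i in range(k, n + 1))

null_p = 0.009  # random expectation quoted in the text (0.9%)

# Hypothetical counts: the same 6% accuracy, observed over 283 and then 1415 predictions.
p_few = binomial_tail_pvalue(n=283, k=17, p=null_p)
p_many = binomial_tail_pvalue(n=1415, k=85, p=null_p)
```

With accuracy held fixed, the larger number of predictions gives the smaller p-value, which is the behavior discussed in the paragraph above.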
The improvement is plotted in Fig. 3b in order to normalize the observed percentages by the random expectation. Improvement is defined as the ratio of observed accuracy to random expectation. The random expectation is simply calculated as the percentage of catalytic to all sites within the dataset, meaning each site has an equal chance of being catalytic. While not overwhelming, the observed accuracies (~6%) are substantially greater than the null model (0.9%). The average improvement over Tnp = 1–5 is 6.4-fold (standard deviation = 0.4). The false positive range in Fig. 3a–b is 0.4–1.6%. The false positive rate is calculated as the number of incorrect predictions divided by the total number of noncatalytic residues. True positive rates (number correct divided by the total number of catalytic residues) range from 2.1–11.0%.
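The quantities defined in this paragraph can be written out explicitly. The sketch below assumes that accuracy means the number of correct predictions divided by the number of predictions put forth; the function name and the counts are hypothetical and chosen only to give round numbers.

```python
def prediction_rates(n_correct, n_predicted, n_catalytic, n_residues):
    """Accuracy, random expectation, improvement, and true/false positive rates from raw counts."""
    n_incorrect = n_predicted - n_correct
    n_noncatalytic = n_residues - n_catalytic
    accuracy = n_correct / n_predicted              # assumed definition: correct / predictions
    random_expectation = n_catalytic / n_residues   # every site equally likely to be catalytic
    return {
        "accuracy": accuracy,
        "random_expectation": random_expectation,
        "improvement": accuracy / random_expectation,
        "true_positive_rate": n_correct / n_catalytic,
        "false_positive_rate": n_incorrect / n_noncatalytic,
    }

# Hypothetical counts, not taken from the dataset in this report.
rates = prediction_rates(n_correct=6, n_predicted=100, n_catalytic=90, n_residues=10000)
```

Because noncatalytic residues vastly outnumber catalytic ones, a handful of top-ranked predictions per protein keeps the false positive rate small even when most predictions are incorrect.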
While the circles in Fig. 3a correspond to overall accuracies, the squares describe the number of proteins with at least one correct prediction. The near linear increase is trivially expected since the number of proteins with at least one correct prediction should increase with the total number of predictions. However, after normalizing for the random expectation in Fig. 3b, the improvement indicates that the rate of new proteins with at least one correct prediction generally decreases as a function of Tnp. Here, the random expectation describes the percentage of proteins with at least one correct prediction, again assuming that all sites are equally probable to be catalytic.
Throughout this report, we use citrate synthase as an example to discuss the context of the CC results. Citrate synthase is chosen because it nicely demonstrates how the two filters discussed below improve prediction accuracy. Moreover, citrate synthase is an important enzyme in aerobic metabolism; it regulates the pace of the Krebs cycle. The enzyme catalyzes the condensation between the two acetyl carbons from acetyl-CoA and oxaloacetate to form citrate. The reaction is energetically driven by the strongly exothermic hydrolysis of the thioester bond within acetyl-CoA. None of the predictions at Tnp = 5 correspond to catalytic residues. While we are only narrowly using catalytic residues to benchmark the approach, this lack of sensitivity should not be interpreted as a complete failure to provide useful information. Similar to the examples shown in Fig. 1, the five most central residues (Tyr185, Ala186, Phe333, Met335 and Gly336, using 1AJ8 numbering) are all buried deep within the core of the protein; in fact, four are completely inaccessible to solvent. Despite their location within the core, Tyr185 and Phe333 are both clearly important as they structurally contact the catalytic Asp312. Moreover, Phe333 also contacts the citrate substrate. While all non-protein (HETERO) groups have been stripped from our inputs to make this large-scale analysis tractable, it is evocative that the model is picking residues directly interacting with the substrate, even if they are not catalytic per se. Below it is demonstrated that filtering CC predictions by residue accessibility and/or residue identity substantially improves citrate synthase catalytic residue prediction accuracy.
Straightforward physical intuition suggests that the most buried residues within the protein are likely to have the highest CC values. Fig. 1 clearly demonstrates this expectation to be correct. However, conventional wisdom also states that most catalytic residues are (at least partially) exposed to solvent. For example, it is very common to find catalytic residues at the bottom of an active site cleft where they are partially obscured from solvent. This is because some exposure to solvent is important for recognition by the incoming substrate. Moreover, water molecules are frequently utilized along the reaction coordinate. As such, it makes sense to exclude residues that are completely inaccessible to solvent from further consideration.
As a first step toward improving CC catalytic residue predictions using solvent accessibility, we begin by asking the question, "Are the solvent accessibility distributions of catalytic and noncatalytic residues significantly different?" Additional file 3 clearly shows that the two distributions are very similar. This result justifies the approach because it demonstrates that CC does not simply recapitulate solvent accessibility. In other words, CC provides information orthogonal to accessibility. This point is further demonstrated in Additional file 4, which plots accessibility vs. CC for catalytic and noncatalytic residues. Similar to the value reported within Amitai et al., the overall correlation between solvent accessibility and CC is low (R = -0.28). Finally, we use mutual information (MI) to quantify the amount of (in)dependence between the two metrics. The MI between solvent accessibility and CC is 0.011; a value of zero indicates complete independence. Consequently, it makes physical sense to combine the two metrics. This would not be the case if closeness centrality simply reflected solvent accessibility.
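For readers unfamiliar with MI, the following sketch shows one common way to estimate it between two per-residue quantities. The text does not state how the values were binned or which logarithm base was used, so the equal-width bins and base-2 logarithm here are assumptions, and the function name is hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys, n_bins=10):
    """Estimate MI (in bits) between two equal-length numeric sequences after equal-width binning."""
    def bin_indices(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are identical
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    bx, by = bin_indices(xs), bin_indices(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum over occupied cells of p(a, b) * log2( p(a, b) / (p(a) * p(b)) ).
    return sum((c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in joint.items())

# Two toy cases: a variable compared with itself, and two independent binary variables.
dependent = mutual_information([0, 1, 2, 3] * 25, [0, 1, 2, 3] * 25, n_bins=4)
independent = mutual_information([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25, n_bins=2)
```

An MI near zero, as in the independent toy case, is what supports treating accessibility and CC as complementary sources of information.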
We introduce the solvent accessibility threshold, Tsa, to filter out residues with low solvent accessibilities. All residues with solvent accessibility < Tsa are a priori excluded as catalytic residue predictions. Additional file 5 shows two example plots of how accessibility filtering improves prediction accuracy. In all cases, any amount of accessibility filtering significantly increases the prediction accuracies. In the Tnp = 2 example, the maximal relative accuracy occurs at Tsa = 8 Å², which corresponds to a prediction accuracy of 13.1%. The associated false and true positive rates are 0.6% and 9.8%, respectively. When Tnp = 5, the maximal accuracy (10.4%) occurs at Tsa = 9 Å². The corresponding false positive rate is 1.4%, and the true positive rate is 18.8%. One might argue that the performance improvement shown here is simply a matter of opening a free parameter with no transferability. In order to test parameter transferability, the parsed dataset was randomly divided into two halves, and the same analysis was performed on each. The resulting ideal thresholds are very close (± 1.0 Å²) to each other and to the values for the whole dataset. This result confirms the transferability of the identified Tsa values. Using a fixed Tsa = 9.0 Å², which is the most common best value observed, Table 2b tabulates the accuracy of the approach at each Tnp. In all cases, the values are greater than the corresponding unfiltered results. Once more, the values from the collapsed and per protein datasets are similar, especially when considering the standard deviation within the per protein values as an error estimate.
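The filtering step can be sketched as follows, assuming each residue is represented by a small record holding its CC score and solvent accessibility in Å². The record layout, the function name and all numerical values are invented for illustration.

```python
def top_predictions(residues, n_predictions=5, min_accessibility=None):
    """Return the n highest-CC residues, skipping any with accessibility below the threshold."""
    candidates = [
        r for r in residues
        if min_accessibility is None or r["accessibility"] >= min_accessibility
    ]
    return sorted(candidates, key=lambda r: r["cc"], reverse=True)[:n_predictions]

# Hypothetical residues: the two most central ones are buried.
residues = [
    {"id": "res_a", "cc": 0.45, "accessibility": 0.0},
    {"id": "res_b", "cc": 0.43, "accessibility": 2.0},
    {"id": "res_c", "cc": 0.40, "accessibility": 12.0},
    {"id": "res_d", "cc": 0.38, "accessibility": 34.0},
    {"id": "res_e", "cc": 0.30, "accessibility": 9.0},
]

unfiltered = [r["id"] for r in top_predictions(residues, n_predictions=2)]
filtered = [r["id"] for r in top_predictions(residues, n_predictions=2, min_accessibility=9.0)]
```

Here n_predictions plays the role of Tnp and min_accessibility the role of Tsa; the filter replaces the buried top scorers with the most central residues that remain at least partially exposed.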
Fig. 4b plots the percentage of PDBs with 0–5 correct predictions (Tsa = 8 Å²), which further demonstrates that the solvent accessibility filter improves accuracy. Compared to the unfiltered predictions, there are fewer proteins with zero correct predictions, and more with one or two correct. In the parsed dataset, the improvement for one, two, three and four correct is 12.4%, 3.5%, 0.4% and 0.7%, respectively. In all cases, the p-value for the accessibility-filtered predictions is lower than the corresponding unfiltered results (Additional file 2). In fact, in spite of a global reduction in the number of predictions, the p-value of the accessibility-filtered results is lowered by 33 to 73 orders of magnitude.
As before, we briefly discuss the context of the predictions within the citrate synthase example. Here, the improvement within the catalytic residue predictions is stark. Citrate synthase has three catalytic residues annotated within the CSA. These residues (His223, His262 and Asp312) are structurally proximal to each other and reside within the active site cleft. Each directly interacts with a carboxyl group of the enzyme's citrate substrate. Recall that the predictions based solely on CC are inaccessible to solvent. On the other hand, all three of the enzyme's catalytic residues are partially exposed to solvent in both the functional dimer and the constituent monomers on which our predictions are based. The monomer exposure of His223, His262 and Asp312 is 34, 52 and 10 Å², respectively. The accessibilities of His223 and His262 within the dimer are slightly reduced, whereas the Asp312 value is unaffected. Based solely on CC (i.e. no filtering), the network model fails to predict any of the catalytic residues; in fact, they only rank 27th, 43rd and 172nd (of 371 residues). Nevertheless, after filtering out all residues with solvent accessibilities less than 9 Å², His223 and Asp312 are correctly predicted to be catalytic.
As suggested above, sites other than the catalytic residues can also be critical to function [38–40]. Additionally, it is possible that sites not annotated within the CSA might also be catalytic, or at the very least, directly related to functional efficiency. In fact, Russell et al. define ten additional active site residues as being critical to function. In spite of this more liberal definition, none of the remaining three accessibility-filtered predictions (Glu189, Lys219 and Glu228) correspond to sites within the expanded benchmark. Nevertheless, these residues are clearly important, as they are structurally proximal to both catalytic sites. This result is trivially expected due to their sequence proximity to His218; however, the fact that CC, which considers each vertex without regard to primary structure, identifies them is promising.
While we explicitly avoid alignment and phylogeny data here, it might be possible to improve prediction accuracy by simply filtering out residues that are unlikely to be catalytic based on their innate physicochemical properties. For example, in the neural network-based prediction approach of Gutteridge et al., it is demonstrated that the single most important element of the input is whether or not the residue being considered is histidine. The second most important element is residue conservation, which is followed closely by whether or not the residue in question is lysine, cysteine, aspartate, glutamate or arginine (in that order). These sequence-based input elements are all more important than a variety of commonsense structural characteristics (e.g. depth, solvent accessibility, cleft information and secondary structure). Consequently, we implement a simple filter based on residue identity here. Any residue that is not histidine, lysine, cysteine, aspartate, glutamate or arginine is excluded from further consideration. We have tried other combinations of residue exclusion, but this provides the best overall results. A comparison of per residue CC values for catalytic and noncatalytic residues is provided in Additional file 6.
The accuracy of the residue identity filtered predictions ranges from 16.5 to 22.4%, which is a substantial improvement over the random expectation of 0.9% (Table 2c). Predicting catalytic residues by residue identity alone provides a second baseline for comparison. In this approach, a prediction is put forth each time one of the six residue types listed above occurs. Using only residue identity results in an accuracy of 2.1%, which is only slightly better than the random expectation of 0.9%. Moreover, it is substantially less than the residue identity filtered results, meaning CC substantially improves predictive power over residue identity alone. As before, the per protein accuracy range is significantly less (11.3 to 14.3%) than the collapsed results. Nevertheless, the main result that the method significantly improves upon the solvent accessibility filtered predictions is clearly conserved.
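Both the residue identity filter and the identity-only baseline can be expressed in a few lines. The sketch below assumes three-letter residue codes and a boolean catalytic annotation per residue; the record layout, function names and values are hypothetical.

```python
# The six residue types retained by the filter, as three-letter codes.
CANDIDATE_TYPES = frozenset({"HIS", "LYS", "CYS", "ASP", "GLU", "ARG"})

def identity_filtered_top(residues, n_predictions=5):
    """Highest-CC residues among the six retained residue types."""
    kept = [r for r in residues if r["type"] in CANDIDATE_TYPES]
    return sorted(kept, key=lambda r: r["cc"], reverse=True)[:n_predictions]

def identity_only_accuracy(residues):
    """Baseline without CC: predict every residue of a retained type, then score the accuracy."""
    predicted = [r for r in residues if r["type"] in CANDIDATE_TYPES]
    if not predicted:
        return 0.0
    return sum(r["catalytic"] for r in predicted) / len(predicted)

# Hypothetical residues, invented for illustration.
residues = [
    {"id": 1, "type": "PHE", "cc": 0.46, "catalytic": False},
    {"id": 2, "type": "ALA", "cc": 0.44, "catalytic": False},
    {"id": 3, "type": "HIS", "cc": 0.41, "catalytic": True},
    {"id": 4, "type": "GLU", "cc": 0.39, "catalytic": False},
    {"id": 5, "type": "ASP", "cc": 0.35, "catalytic": True},
    {"id": 6, "type": "LYS", "cc": 0.20, "catalytic": False},
]

top2 = [r["id"] for r in identity_filtered_top(residues, n_predictions=2)]
baseline = identity_only_accuracy(residues)
```

The toy numbers carry no statistical meaning; the point is only that the baseline predicts every residue of a retained type, whereas the filtered scheme still ranks by CC and reports a fixed number of predictions.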
The residue identity filter decides whether or not to consider a particular residue type based on an a priori scheme. This is equivalent to saying that the six residues that "pass through" the residue identity filter are equally probable. However, Additional file 6 clearly indicates that this is not reality. As such, it is natural to assume that some sort of fuzzy logic scheme that allows residues to be in the considered or excluded set based on the observed catalytic residue propensities should improve model accuracy. A large number of weighting schemes were tried. For example, three possibilities (from several different considerations) include: (i.) weighting all twenty residues exactly proportional to their catalytic propensity; (ii.) weighting the six from above as equally probable, but scaling the others; and (iii.) weighting the six from above with exclusion of the remainder. However, no statistically significant improvement over what is reported in Table 2c is found. In the first two examples, the fuzzy model actually does worse since catalytic residues make up such a tiny fraction of the total number of residues, meaning that any relaxed filtering criterion allows many more noncatalytic (vs. catalytic) residues to be considered; consequently, specificity is lost. Conversely, the best of the trials within the third scheme is statistically indistinguishable from Table 2c.
Combining both filters results in slight improvement over the residue identity filter at Tnp = 1–2 (Fig. 5b). At Tnp = 4 and 5, the combination slightly underperforms the residue identity filter, yet the values are still significantly better than the accessibility-filtered results. (At Tnp = 3 the results are virtually identical to the residue identity filtered predictions.) The likely explanation for this result is that the filters eliminate similar information. For example, it is trivially expected, due to their propensity to be within the core, that residues eliminated by the accessibility filter will be nonpolar amino acids. Likewise, the residue identity filter always eliminates nonpolar residues from further consideration. As done above with the residue identity filtered results, a baseline without CC is considered. In this instance, any time one of the six considered residue types occurs with a solvent accessibility below 9 Å², a prediction is put forth. This scheme results in an accuracy of 2.5%, slightly better than the 2.1% of residue identity alone, yet nowhere near the accuracies of the combination-filtered CC scores.
Dataset used in comparison of ligated and unligated pairs.
Columns: protein (PDB pair); ligated vs. unligated; class (where recovered).
4-oxalocrotonate tautomerase (1BJP, 4OTB); 1, 2, 0; α + β
Ribonuclease A (1RBN, 1RSM); 0, 2, 2; α + β
Xylanase II (1BVV, 1XNB); 1, 1, 0
Trypsin (1A0J, 1UTK); 1, 2, 0
Aminopeptidase (1IGB, 1AMP); 1, 0, 0
Phospholipase C (1AOD, 2PLC); 3, 0, 2
Deacetoxycephalosporin C synthase (1W2N, 1W28); 0, 1, 0
Chorismate mutase (3CSM, 2CSM); 2, 2, 0
Alginate lyase A1-III (1HV6, 1QAZ); 2, 0, 1
tRNA-guanine transglycosylase (1R5Y, 1PUD); 1, 0, 0
Nitric oxide synthase oxygenase (1M9R, 3NOS); 2, 2, 0; α + β
Luciferase (1BA3, 1LCI); 3, 1, 0
Class I alpha-1,2-mannosidase (1G6I, 1DL2); 1, 3, 0
Correlation (Correl vs. RMSD)
This report investigates the ability of CC to predict enzyme catalytic residues from topological descriptions of protein structure. While the most central residues generally correspond to positions within the core, the predictions are substantially better than the random expectation. This result is maintained whether one averages over the collapsed or per protein datasets. Filtering the predictions by solvent accessibility and/or residue identity improves the results considerably. Overall, these results are comparable to those from previous reports [32, 33], but have better statistics due to database size and composition. Additionally, we carefully examine the effect of input structure on our predictions. Pairwise comparisons between ligated and unligated structures reveal no clear trend regarding which input is a better choice. Similarly, slight structural perturbations of four protein examples via MD simulation have no observed effect on the overall conclusions.
Three different datasets extracted from the manually annotated CSA entries are examined here. The first, which contains 568 PDB files, represents a dataset randomly culled such that no two sequences have greater than 80% sequence identity. The second and third datasets use structural information to randomly distill the entries to nonredundant SCOP families (423 proteins) and superfamilies (283 proteins). In each dataset, a single chain per protein structure is included; however, our analysis of all chains demonstrates that the overall accuracies are generally robust to chain differences (results not shown). All figures shown herein are based on the dataset parsed by SCOP superfamily. However, results for the other two datasets are always similar. This point is typified by Fig. 3 and Fig. 4a, which include data for all three.
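The structural culling step amounts to keeping one randomly chosen entry per SCOP family or superfamily. The sketch below illustrates the idea under the assumption that each entry already carries its classification label; the identifiers, labels and function name are placeholders, not real CSA or SCOP records.

```python
import random
from collections import defaultdict

def cull_by_group(entries, key, seed=0):
    """Keep one randomly chosen entry per group (for example, per SCOP superfamily)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[key]].append(entry)
    rng = random.Random(seed)  # fixed seed so the random selection is reproducible
    return [rng.choice(members) for _, members in sorted(groups.items())]

# Placeholder entries; the identifiers and labels are invented.
entries = [
    {"pdb": "xxx1", "superfamily": "sf_A"},
    {"pdb": "xxx2", "superfamily": "sf_A"},
    {"pdb": "xxx3", "superfamily": "sf_B"},
    {"pdb": "xxx4", "superfamily": "sf_C"},
    {"pdb": "xxx5", "superfamily": "sf_C"},
]

nonredundant = cull_by_group(entries, key="superfamily")
```

Calling the same function with a family-level key would produce the intermediate dataset, and the fixed seed makes the random choice repeatable.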
We test the ability of solvent accessibility to improve prediction accuracy by filtering out the most buried residues. Solvent accessibility is calculated using DSSP, which is an extremely fast approach. DSSP calculated solvent accessibilities range from 0 to >250 Å². No percent or relative accessibility corrections, which are commonly employed to normalize values by sidechain surface area and to remove backbone considerations, are implemented within DSSP. Nevertheless, the lack of these corrections is not critical here as we are simply trying to identify the residues most excluded from solvent. These corrections are more important when quantifying solvent exposure because the maximal accessibility of a large residue (e.g. lysine) is so much greater than that of a small residue (e.g. alanine). Conversely, in our problem, if both residues are maximally buried, the accessibility (with or without the correction) is simply zero in each case.
Molecular dynamics simulations are employed to generate an ensemble of slightly perturbed structures. The protocol used here is the same as we reported previously in our analysis of sensitivity within calculated pKa values. Canonical ensemble (fixed NVT) in vacuo molecular dynamics simulations, as implemented in the Molecular Operating Environment (Chemical Computing Group, Montreal, Quebec, Canada), are used to generate the ensemble of conformers. In each example, the timescale of the simulations is 1 ns, and the timestep is 0.001 ps. Structure sampling occurs every 500 ps. It is obvious that this in vacuo simulation protocol is unacceptable to determine realistic aqueous phase dynamics. However, it is adequate for the aims of this work since the simulation is simply used to generate a conformational distribution.
CSA: Catalytic Site Atlas
PDF: Probability density function
ROC: Receiver operating characteristic
Andrei Istomin is thanked for proofreading the manuscript. Anthony Fodor is thanked for reading an early draft of this paper and providing several valuable suggestions. The reviewers are also thanked for a number of helpful suggestions. Swati Pande is thanked for constructing the ROC plots. James Torrance, from Janet Thornton's research group, is thanked for assistance with the Catalytic Site Atlas. This work is supported by a Joint Ventures Grant to DRL from the California State Program for Education and Research in Biotechnology. DRL is supported, in part, by NIH R01 GM073082. EC is supported by a Howard Hughes Medical Institute undergraduate fellowship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.