Functional site prediction selects correct protein models
© Chelliah and Taylor; licensee BioMed Central Ltd. 2008
Published: 13 February 2008
The prediction of protein structure can be facilitated by the use of constraints based on a knowledge of functional sites. Without this information it is still possible to predict which residues are likely to be part of a functional site and this information can be used to select model structures from a variety of alternatives that would correspond to a functional protein.
Using a large collection of protein-like decoy models, a score was devised that selected those with predicted functional site residues that formed a cluster. When tested on a variety of small α/β/α type proteins, including enzymes and non-enzymes, those that corresponded to the native fold were ranked highly. This performance held also for a selection of larger α/β/α proteins that played no part in the development of the method.
The use of predicted site positions provides a useful filter to discriminate native-like protein models from non-native models. The method can be applied to any collection of models and should provide a useful aid to all modelling methods from ab initio to homology based approaches.
The prediction of protein structure from purely sequence data has posed a challenge over many years. With the increasing numbers of known structures, many recent methods have turned to the use of structure-based sequence alignment (threading) [1, 2] or fragment assembly , including various hybrid combinations . Although some of these methods are referred to as ab initio, they all rely on having a database of known structures and are better classed as de novo to distinguish them from a pure physico-chemical approach.
Following some of the earliest attempts at protein structure prediction [5, 6], it became clear that the use of external biochemical constraints on residue proximity could provide a very powerful filter on the permitted structures, whether these were simple pairwise positions  or whole motifs . Biochemically important residues are typically found in close proximity and are also highly conserved. With a view to using such information to constrain predictions, attempts were made to predict active site residues from a multiple sequence alignment . This approach relies on finding residues that are conserved for no apparent structural reason and some recent methods also combine this with the requirement to form a cluster in space when the protein structure (or a model) is known . (See ref. , for a review).
In this work we use the method of Chelliah et al., (2004)  to predict residues that are likely to be located in an active (or binding) site and to evaluate their proximity in the context of a model for the protein. We do not address the generation of the models but rather take a series of 'decoy' models constructed by a de novo protein structure prediction method (See Methods section for details). Unlike some collections of decoy models, the ones we use were generated from abstract secondary structure lattice frameworks (Taylor, 2002) which has the advantage that we know, by definition, whether the model has the native fold. Rather than use an ambiguous Root Mean Square Deviation (RMSD) based measure, we can then evaluate our model scores by a true/false criterion.
We begin our study using a small sample of five α/β/α proteins for which a large number of decoys had been previously generated. These proteins, with length under 150 residues, are a mix of both topological and functional types, including enzymes and non-enzymes. We then develop a score termed "Fold Score" based on the proximity of predicted active (or other) site residues to rank the different folds on their functional potential. Without change, the method is then applied to a variety of other proteins of the same structural class but ranging in size up to almost 200 residues in length.
Results and Discussion
Training-set of five proteins
Example decoy fold distribution for 3chy. Number of fold-types, strand and helix order in the fold (HI() denotes the helix order in layer I, SII() the strand order in layer II and HIII() the helix order in layer III), Number of models, Number of clusters and scores of the best model in each fold-type is detailed in this table. In the second column '-' denotes the change in the direction of the secondary structure element when compared to the native. F1 and F7 are correct folds and are in bold type.
Strand and helix order
No. of models in each fold type in 200 Models
No. of cluster with ≤2 Å RMSD; ≥60% PID cut-off
Score of the best Model
HI (1,5);SII(2,1,3,4,5);HIII (2,3,4)
The "summary score" for each plot can then be plotted against a measure of deviation from the native structure.
Correct and incorrect folds in top and bottom 25 ranked models. The correct folds in top 25 ranked models and the wrong folds in the bottom 25 ranked models for the five proteins are tabulated in order to show the strength of the method.
Correct in top 25 ranked models(best)
Incorrect in bottom 25 ranked models (worst)
22 (top 4)
13 (low 4)
14 (top 4)
22 (low 11)
24 (top 22)
25 (low 25)
23 (low 16)
18 (top 7)
15 (low 7)
Independent test-set of larger proteins
RMSD values for larger proteins. For each of the proteins in the larger set, the RMSD of the best model in the top four ranked fold-types is tabulated (along with the fold-type in parentheses). Where this corresponds to the native fold, the value is in bold type.
For the smallest protein in this set, 1v9w (130 residues) the fourth ranked fold corresponds to the native with a RMSD value of 6.3 Å and the third rank model has two β-strands swapped between adjacent positions at the edge of the sheet. For the slightly larger 1rlj (135 residues) the second ranked fold corresponds to the native with a good RMSD value of 4.9 Å. At almost 160 residues, 1kjn was almost as good across all four top ranked models (around 5 Å RMSD) and the first ranked fold corresponds to the native. For the above three proteins, the correct fold was selected. The remaining four proteins has no native-like fold in the 200 decoy models taken from the de novo protein structure prediction method and so, the top ranked models were checked for them being a closer variant to the native structure. 1vq1 was considerably larger (178 residues) but for this size, the RMSD values around 7 or 8 Å on the third and fourth ranked models were acceptable. The main error in the best model was found to be caused by two adjacent β strands lying in swapped positions on the edge of the sheet. Indeed, neglecting strand swaps, all four models otherwise correspond to the native fold.
The remaining three proteins were close in size at 186/7 residues. At this size the two single figure RMSD values of 8.9 for 1uxo and 9.8 for 1t57 constitute good approximations to the native fold. 1t57 had a pair of swapped β-strand positions in the middle of the sheet and 1uxoA had a minor helix displaced to the opposite face of the sheet.
The largest protein (1vk2) had a best RMSD value over 10 Å which does not correspond to a native-like fold. The models with RMSD values 14.7 and 14.5 are not too far from the native structures. The model with RMSD 14.7, had 2 pairs of adjacent swapped β-strands in the middle of the sheet and a pair of helix swaps between the opposite faces of the sheet. The model with RMSD 14.5, had 2 pairs of adjacent swapped β-strands (one in the middle of the sheet, the other in the C-terminal end of the sheet) and a pair of helix swaps between the opposite faces of the sheet. The bigger RMSD in these models are due to longer loops and helix displacments between layers.
We have shown that consideration of the requirement of proteins to form a functional site, either enzymic or binding, can be used to select the correct protein fold from a large number of well constructed decoy models. Our method uses only sequence data to do this in combination with the model structures and involves no information derived from the known structures. A test set of five small β/α type proteins were used to determine the best formulation of the CRESCENDO method, with a clear choice being indicated that the use of a reduced number of residue environments was best. This difference from the application of the method to native protein structures arises because we are applying the method to rough Cα models from which the constructed side-chain orientations are unreliable.
This protocol was then applied to a set of larger proteins with the correct fold, or a close variant being selected in six of the seven proteins. The type of error most commonly seen in this test set was the swapping of adjacent strand positions in a β-sheet. With hindsight, this is not an unexpected error, since these strands still have the same orientation and any functional residues that they, or their flanking loops, carry will remain in close proximity and be scored equally well by our method. Not only will our evaluation method be blind to such variation but if elements of structure carry no functional sites then they will be equally unconstrained. The method is more sensitive to discrimination between models that have secondary structure elements oriented in the opposite direction (provided they carry the identified functional residues). For example, Chemotaxis Y protein (3chy), has the C-terminal helix (fifth) involved in binding. The fold-type F8 (Table 1), has this helix in the opposite direction, which is the lowest ranked fold for this protein. Similarly, Thioredoxin has the fourth strand involved in binding, but is in the opposite direction in one of the fold-types and ranked last. For the smaller proteins almost all loops on a face will be involved in a binding site but for the larger proteins, there is a greater chance that some loops will be unconstrained. Despite these fundamental limitations, if the number of allowed topologies can be reduced, even to single figures, then more detailed modelling methods can be applied to reconstruct the geometry of the binding site. If the nature of the substrate, or just what is bound in the site, is known then some stereo-specific constraints may provide further selection criteria.
Decoy model generation
Decoy models were generated from the so-called "Periodic Table" classification of protein structures . These are secondary structure lattice models derived from the combinatorial enumeration of possible folds over layers of secondary structure. The lattice (or 'stick') models are converted to Cα models using a threading method and finally refined using fragments drawn from native structures . When the models have the correct native fold, the RMSD against the native structure ranges from 4–6 Å in the length range of 100–150 residues (respectively). For folds that differ from the native, the RMSD ranges up to random (typically 15–20 Å in this range). Although the decoy models with the non-native fold have high RMSD values, this does not mean they are disordered. Rather they have properties (secondary structure content and packing) that are close to native proteins. In addition, some variants have only slight changes from the native fold, such as two adjacent strands in swapped position, that is often undetectable using a conventional RMSD based measure.
Chemotaxis Y protein (3chy)
This protein contains the common flavodoxin fold with 5 strands and 5 helices (2 on the I layer and 3 on the III layer) (Figure 1). The strand order in layer II is 21345, helix order in layer I is 15 and helix order in layer III is 234, which would hereafter be denoted as the following. HI(1,5);SII(2,1,3,4,5);HIII(2,3,4), where HI() denotes the helix order in layer I, SII() the strand order in layer II and HIII() the helix order in layer III. In the decoy models, eight distinct folds were found (designated F1...F8), of which F1 and F7 correspond to the native (Table 1).
Glycerol-3-phosphate cytidylyltransferase (1cozA)
This protein contains 5 strands and 5 helices (3 on the I layer and 2 on the III layer) (Figure 1). The strand and helix order is HI(2,1,5);SII(3,2,1,4,5);HIII(3,4). In the 200 decoy models, there were 11 different folds (F1...F11). F3 and F6 are similar to native. In the 1cozA structure the C-terminal helix packs off the sheet and this helix is a part of the active site. In fold-type F3, this helix packs on the sheet. In F6 this helix is packed off the sheet like the native crystal structure. In F5 the C-terminal helix is predicted as strand and is packed in the β-sheet.
This protein contains 5 strands and 4 helices (2 on the I layer and 2 on the III layer) (Figure 1). The strand and helix order is HI(2,4);SII(1,3,2,-4,5);HIII(1,3). (The negative number indicates a reverse strand direction.) In the 200 decoy models, there were 10 different folds (F1...F10). F1 is similar to native.
This protein contains 5 strands and 5 helices (2 on the I layer and 3 on the III layer) (Figure 1). The strand and helix order is HI(1,5);SII(2,1,3,4,5);HIII(2,3,4). There were 10 different folds, among which F3 and F6 are considered as correct, although an extra strand is predicted and packed on the sheet instead of the 2nd helix which is in layer 3. The top scoring model F4 has this extra strand and strand swap between 1st, 4th and 5th strands.
Lumazine synthase (1dioA)
This protein contains 4 strands and 4 helices (2 on the I layer and 2 on the III layer) (Figure 1). The strand and helix order is HI(1,4);SII(2,1,3,4);HIII(2,3). In the 200 decoy models, there were 16 different folds (F1...F16). F1 is similar to native.
The method was tested with seven more 3 layer α/β/α proteins (1v9w, 1rlj, 1kjnA, 1vq1A, 1uxoA, 1t57A and 1vk2A). The topology diagram for the seven proteins are shown in Figure 1. The number of different topologies (fold-types) in the 200 decoy models are 29 for 1v9w, 15 for 1rlj, 14 for 1kjnA, 12 for 1vq1A, 14 for 1uxoA, 12 for 1t57A and 16 for 1vk2A.
Clustering of models
models were taken from the models generated during the final step of the protein de novo structure prediction method  (see subsection 1 of Methods). Since the models are Cα models, main-chain atoms were built and side-chains were built using SCWRL  and refined using MODELLER [16, 17]. These models consists of a variety of different folds and were classified based on their fold-types (Figure 2). Pairwise superposition between models of same fold-type was made using the program SAP  and the models with ≤2 Å RMSD and ≥60% PID on structural superposition were clustered together. Clustering between models of same type is needed, since the functional site prediction (when looking at the residue proximity of the predicted functional site residues) differs between models of same type due to 1) difference in loop conformation, 2) β strand or helix shift even by a single residue. So, even correct folds might have poor models (based on the functional site prediction). Models within each cluster were superimposed using MODELLER to get the superimposed coordinates. Average Cβ coordinates of the residues of the models of each cluster was used to find the pairwise distance between residues.
Functional site prediction
The functional site prediction method CRESCENDO  was used to predict the critical residues important for binding, of the models. The environment-specific substitution tables  reflect the pattern of amino-acid substitutions in a particular local environment, (usually defined by local secondary structure = 4 (α helix, β strand, +phi angle and coil), solvent accessibility = 2 (accessible, inaccessible) and side-chain hydrogen bonding = 8 (side-chain:side-chain = 2; side-chain:main-chain CO group = 2; side-chain:main-chain NH group = 2), which in combination gives 64 environments i.e. 4 × 2 × 8 = 64). Restraints arising from the binding of substrates, cofactors, subunits and other molecules are not taken into account while deriving the environment-specific substitution tables. Thus, the substitution patterns of the functionally important residues are not well-predicted by the environment-specific substitution tables. So, comparison of the substitution patterns derived from the environment-specific substitution tables with the amino acid substitutions that occur during evolution in family of proteins should identify the functionally important residues (which is implemented in CRESCENDO), since they will be more conserved than that predicted from the substitution tables. The divergent score, as defined by Yona and Levitt , quantifies the overall difference or divergence between the observed and predicted substitution probabilities at each alignment position. The homologous sequences for the model structure is collected as described by Chelliah et al . The method distinguishes residues that are conserved for a functional reason from those that are conserved for structural reason. Though the side-chains for each model were built and refined using the above mentioned programs, the hydrogen bond and accessibility information might not be accurate. Because of this, we derived the secondary structure using STICKS  and solvent accessibility using SACAO (K. Lin, unpublished) with the Cα models and used only 6 (3 × 2) environments. (secondary structure = 3 (α helix, β strand and coil); solvent accessibility = 2 (accessible, inaccessible).
Assuming that the functional residues in the correct models form clusters, and that they might be scattered in the incorrect models, a score termed the "Fold score" was calculated for each model based on the proximity of the functionally important residues. To find how the functional residues are packed in the model, pairwise distances and the product of CRESCENDO scores between each pair of residues (that are at least 8 residues apart in the linear sequence) are calculated. The resulting plot is a measure of how well the predicted site residues have co-localised in the model and will be refered to throughout as "proximity plots".
As many "proximity plots" are generated for each decoy set, we devised a summary measure of how well clustered the site residues were. The percentage of pairs of residues that are within 12 Å distance was calculated for the top 40 pairs (based on the product of CRESCENDO scores) in steps of 5, and the percentage scores were added in each step to get the final "Fold score" for that fold. The sum of the percentage of clustered residues up to a sample size of 40 was taken since, beyond this the plots generally decayed. This single statistic can then be compared to a measure of structural quality. For this we used the percentage sequence identity (PID) of the structural superposition  of the model with the native structure divided by the RMSD plus 5. The 5 is just a (non-linear) scaling factor that does not alter the rank of the points. PID/RMSD on a log scale would be much the same. These plots will be referred to as the "summary plots".
Specificity and Sensitivity were calculated according to the "Fold Score". Specificity is defined as the fraction of significant hits (i.e. hits with Fold scores above a threshold) being correct. Sensitivity is defined as the fraction of possible correct hits being significant.
Both authors are supported by the Medical Research Council (UK).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 1, 2008: Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S1.
- Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358: 86–89. 10.1038/358086a0View ArticlePubMedGoogle Scholar
- Taylor WR, Lin K, Klose D, Fraternali F, Jonassen I: Dynamic domain threading. Proteins 2006, 64: 601–614. 10.1002/prot.20915View ArticlePubMedGoogle Scholar
- Bradley P, Misura KMS, Baker D: Toward High-Resolution de Novo Structure Prediction for Small Proteins. Science 2005, 309: 1868–1871. 10.1126/science.1113801View ArticlePubMedGoogle Scholar
- Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biology 2007, 5: 17. 10.1186/1741-7007-5-17PubMed CentralView ArticlePubMedGoogle Scholar
- Cohen FE, Richmond TJ, Richards FM: Protein Folding: Evaluation of some simple rules for the assembly of helices into tertiary structures with myoglobin as as example. J Mol Biol 1979., 132:Google Scholar
- Cohen FE, Sternberg MJE, Taylor WR: Analysis and prediction of protein β -sheet structures by a combinatorial approach. Nature 1980, 285: 378–382. 10.1038/285378a0View ArticlePubMedGoogle Scholar
- Cohen F, Sternberg M: On the use of chemically derived distance constraints in the prediction of protein structure with myoglobin as an example. J Mol Biol 1980, 137: 9–22. 10.1016/0022-2836(80)90154-0View ArticlePubMedGoogle Scholar
- Taylor WR: Protein fold refinement: building models from idealised folds using motif constraints and multiple sequence data. Prot Engng 1993, 6: 593–604. 10.1093/protein/6.6.593View ArticleGoogle Scholar
- Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJE: Prediction of Protein Secondary Structure and Active Sites using the Alignment of Homologous Sequences. J Mol Biol 1987, 195: 957–961. 10.1016/0022-2836(87)90501-8View ArticlePubMedGoogle Scholar
- Chelliah V, Chen L, Blundell TL, Lovell SC: Distinguishing Structural and Functional Restraints in Evolution in Order to Identify Interaction Sites. J Mol Biol 2004, 342: 1487–1504. 10.1016/j.jmb.2004.08.022View ArticlePubMedGoogle Scholar
- Jones S, Thornton JM: Searching for functional sites in protein structures. Curr Opin Chem Biol 2004, 8: 3–7. 10.1016/j.cbpa.2003.11.001View ArticlePubMedGoogle Scholar
- Taylor WR: A Periodic Table for Protein Structure. Nature 2002, 416: 657–660. 10.1038/416657aView ArticlePubMedGoogle Scholar
- Jonassen I, Klose D, Taylor WR: Protein Model Refinement Using Structural Fragment Tessellation. Comput Biol Chem 2006, 30: 360–366. 10.1016/j.compbiolchem.2006.08.002View ArticlePubMedGoogle Scholar
- Taylor WR, Bartlett GJ, Chelliah V, Klose D, Lin K, Sheldon T, Jonassen I: Prediction of Protein Structure from Ideal Forms. Proteins 2008, in press.Google Scholar
- Bower MJ, Cohen FE, Dunbrack RL Jr: Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J Mol Biol 1997, 267: 1268–1282. 10.1006/jmbi.1997.0926View ArticlePubMedGoogle Scholar
- Šali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234: 779–815. 10.1006/jmbi.1993.1626View ArticlePubMedGoogle Scholar
- Eswar N, Mari-Renom MA, Webb B, Madhusudhan MS, Eramian D, Shen M, Pieper U, Šali A: Comparative Protein Structure Modeling With MODELLER. In Current Protocols in Bioinformatics. John Wiley & Sons, Inc; 2000:5.6.1–5.6.30.Google Scholar
- Taylor WR: Protein structure alignment using iterated double dynamic programming. Prot Sci 1999, 8: 654–665.View ArticleGoogle Scholar
- Overington J, Donnelly D, Johnson MS, Šali A, Blundell TL: Environment-specific amino-acid substitution tables – tertiary templates and prediction of protein folds. Protein Science 1992, 1: 216–226.PubMed CentralView ArticlePubMedGoogle Scholar
- Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002, 315: 1257–1275. 10.1006/jmbi.2001.5293View ArticlePubMedGoogle Scholar
- Taylor WR: Defining linear segments in protein structure. J Mol Biol 2001, 310: 1135–1150. 10.1006/jmbi.2001.4817View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.