A resource for benchmarking the usefulness of protein structure models
© Carbajo and Tramontano; licensee BioMed Central Ltd. 2012
Received: 19 February 2012
Accepted: 16 July 2012
Published: 2 August 2012
Increasingly, biologists and biochemists use computational tools to design experiments that probe the function of proteins and/or to engineer them for a variety of purposes. The most effective strategies rely on knowledge of the three-dimensional structure of the protein of interest. However, an experimental structure is often not available, and models of varying quality are used instead. The relationship between the quality of a model and its appropriate use is not easy to derive in general, and so far it has been analyzed in detail only for specific applications.
This paper describes a database and related software tools that allow a given structure-based method to be tested on models of a protein representing different levels of accuracy. Comparing the results of a computational experiment on the experimental structure with those obtained on a set of its decoy models allows developers and users to assess the threshold of accuracy required to perform the task effectively.
The ModelDB server automatically builds decoy models of different accuracy for a given protein of known structure and provides a set of useful tools for their analysis. Pre-computed data for a non-redundant set of deposited protein structures are available for analysis and download in the ModelDB database.
Project name: A resource for benchmarking the usefulness of protein structure models. Project home page: http://bl210.caspur.it/MODEL-DB/MODEL-DB_web/MODindex.php.
The function of a protein is brought about by its three-dimensional structure, knowledge of which can be instrumental to several applications, ranging from function assignment to the prediction of its mode of interaction with its molecular partners to the interpretation and design of re-engineering experiments.
The obvious limiting step in exploiting the power of protein structures in many of these applications is that only a relatively small number of experimentally determined protein structures is available in the PDB, compared to the number of known protein sequences. This limitation can be partially overcome by the use of methods for inferring the structure of proteins from their sequences.
At present, several methods are available to this end. The most reliable remains comparative modeling, which is based on the observation that evolutionarily related proteins have similar structures. The structure of one member of a protein family (the template) can therefore be used as a starting model for the others, provided that the evolutionary relationship can be detected at the sequence level. Because functionally relevant regions are better preserved in evolution, the method has the advantage of producing better results for the biologically relevant parts of the target protein.
Comparative modeling has an additional advantage over other protein structure prediction methods: there is a known and well-studied relationship between the divergence between the sequences of two homologous proteins, indicative of their evolutionary distance, and the structural changes between the backbone atoms of their core. This implies that, when a single template is used, it is possible to estimate beforehand the error affecting the model by measuring the percentage of identity between its sequence and that of the target protein.
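The percentage of identity mentioned above can be computed directly from a target-template alignment. The following is a minimal sketch, not the procedure used by the server: the function name and the choice of denominator (columns in which both sequences have a residue) are assumptions, and other conventions for the denominator exist.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where both sequences have a residue.

    Both arguments are rows of the same pairwise alignment, so they must have
    equal length. The denominator convention is an assumption of this sketch.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Four identical residues out of five aligned columns
identity = percent_identity("ACDE-F", "ACDQGF")
```

A lower value suggests a more distant template and, for single-template models, a larger expected deviation from the experimental structure.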
The relationship has been validated several times, for example using the results of blind tests in the CASP (Critical Assessment of Methods of Protein Structure Prediction) series of initiatives. The results of these experiments also repeatedly showed that the quality of models can be substantially improved when the whole set of sequences of the protein family is taken into account. This affects several steps of the modeling procedure, from the detection of the template to the quality of the alignment to the use of different regions from different templates in the final model. However, in this case and when multiple templates are used, the relationship between sequence and structure divergence is more difficult to estimate.
CASP has also tested the ability of independent methods to estimate the quality of models, demonstrating that they can achieve significant accuracy in selecting the best model among a set of diverse predictions for the same protein, while methods for assigning an estimated quality to single models still lag behind.
It is obvious that the quality of a protein model dictates how effective it is for subsequent studies. However, a precise and general relationship between the quality of a model and its usefulness for a specific application has so far eluded the efforts of the community. The aim of the server described here is to allow developers of structure-based methods to quickly test how well their method performs when models of different quality are used instead of experimental structures.
It has been shown that high-resolution models 1-2 Å RMSD away from the native counterpart can provide relevant functional information, such as the inference of enzyme reaction mechanisms and the interpretation of disease-causing mutations. In some cases they have been shown to be useful in ligand-docking studies, as an aid to experimental structure determination [9, 10] and in drug design [11, 12]. As accuracy drops, the range of applications narrows.
We reasoned that the most straightforward way to help solve the issue is to provide method developers with a curated and annotated set of models at different levels of accuracy for each known protein structure, so that they can rapidly test the level of accuracy required for a model to be used in place of an experimental protein structure.
We describe here how we obtained models at different levels of quality for each of the proteins of known structure and introduce a tool, ModelDB, which allows easy access to them and to relevant information about the modeled proteins.
The user can select entries on the basis of several annotations (structural and functional domains, Gene Ontology terms and Enzyme Commission numbers) extracted directly from publicly available databases.
The server also includes a tool to build a homology model of a protein of unknown structure and to compare the model with the template(s) used to build it.
The initial dataset included proteins solved by X-ray crystallography, alone or in complex with other molecules, as available on January 3rd 2011. It was filtered with PISCES so that no pair shares more than 50% sequence identity, and we excluded structures with only Cα atoms, with a resolution worse than 2 Å, with a sequence length outside the range 20–10000 residues, or with an R-factor higher than 0.3. We were left with a total of 8,609 PDB chains. We could detect suitable templates, and therefore build comparative models, for 7,166 of them. Of these, 2,999 have an EC number (72,648 in the whole PDB), 2,452 of them a complete 4-digit EC number (63,474 in the whole PDB); 5,106 bind to ligands (97,388 in the whole PDB), 3,742 of them to more than one (70,780 in the whole PDB), and 1,199 different ligands are found binding to this subset of modeled chains (9,891 in the whole PDB); 2,261 have at least one annotated catalytic site (51,437 in the whole PDB), 982 of them more than one (23,686 in the whole PDB); and 4,546 are annotated in Swiss-Prot (125,966 in the whole PDB).
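The per-chain quality criteria listed above can be expressed as a simple predicate. This is only an illustrative sketch: the `ChainRecord` type and its field names are hypothetical, and the 50% redundancy filter, which the paper performs with PISCES, is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical summary of one PDB chain (field names are assumptions)."""
    pdb_chain: str
    resolution: float  # in Å; lower is better
    r_factor: float
    length: int        # number of residues
    ca_only: bool      # True if only Cα atoms are present


def passes_quality_filters(chain: ChainRecord) -> bool:
    """Apply the thresholds stated in the text to a single chain."""
    return (
        not chain.ca_only
        and chain.resolution <= 2.0        # exclude resolution worse than 2 Å
        and 20 <= chain.length <= 10000    # exclude lengths outside 20-10000
        and chain.r_factor <= 0.3          # exclude R-factor higher than 0.3
    )


chains = [
    ChainRecord("1abcA", resolution=1.8, r_factor=0.19, length=250, ca_only=False),
    ChainRecord("2xyzB", resolution=2.5, r_factor=0.22, length=310, ca_only=False),
    ChainRecord("3defA", resolution=1.5, r_factor=0.18, length=12, ca_only=False),
]
kept = [c.pdb_chain for c in chains if passes_quality_filters(c)]
```

The identifiers in the example list are made up for illustration and do not refer to actual entries in the dataset.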
The sequence of each PDB chain is used as the query in HHsearch to search for templates in the 70% non-redundant PDB database. All alignments of the target to a single template with at least 80% sequence coverage and an E-value of at most 10⁻¹ were used as input for Modeller to produce a single-template model, containing all non-hydrogen atoms, for each of the selected sequences.
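The template selection step can be pictured as a filter over search hits. The sketch below assumes that hits have already been parsed into simple records; the `Hit` type is hypothetical and is not part of HHsearch or Modeller. It also assumes that coverage means the fraction of target residues aligned and that the E-value cutoff reads as 10^-1, that is 0.1.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical parsed search hit (not an HHsearch data structure)."""
    template: str
    e_value: float
    coverage: float  # fraction of target residues aligned, 0.0 to 1.0 (assumed definition)


def select_hits(hits: List[Hit], min_coverage: float = 0.80,
                max_e_value: float = 0.1) -> List[Hit]:
    """Keep hits that satisfy both the coverage and the E-value cutoffs."""
    return [h for h in hits
            if h.coverage >= min_coverage and h.e_value <= max_e_value]


hits = [
    Hit("tmplA", e_value=1e-30, coverage=0.95),
    Hit("tmplB", e_value=0.5, coverage=0.90),    # E-value too high
    Hit("tmplC", e_value=1e-5, coverage=0.60),   # coverage too low
]
selected = [h.template for h in select_hits(hits)]
```

Each selected hit would then give rise to one single-template model, so a target with many acceptable templates yields a decoy set spanning a range of accuracies.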
Each model of each target protein was compared to the corresponding experimental structure using LGA (Local–global Alignment). We record the GDT-TS and RMSD values and allow sorting the models according to these parameters (see later) as well as to HHsearch probability, E-value, score and coverage.
Each PDB structure in the input list (as well as its corresponding models) is annotated whenever possible at the residue level, using the CREDO database, which collects protein-ligand interactions, the Catalytic Site Atlas (CSA) [21, 22], which includes information about enzyme catalytic sites, and Swiss-Prot.
Experimental structures and their models can be visualized and colored at the residue level according to solvent accessibility and secondary structure as computed by DSSP, cavity occurrences and average depth as defined by Speedfill, and protrusion and burial indexes obtained via PSAIA.
This visualization is obtained using an in-house Perl program named mappON. A stand-alone version of mappON is also available to visualize, on user-provided structures, the parameters described above as well as the disorder probability (computed using DisEMBL) and the evolutionary residue conservation and variability (retrieved from ConSurf-DB). The tool is accessible via the ModelDB site.
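To make the two measures concrete, the sketch below computes RMSD and a GDT-TS-like score from matched Cα coordinates that are assumed to be already superimposed. It is a simplification and not a reimplementation of LGA: LGA searches over many superpositions to maximize the number of residues under each cutoff, whereas this version evaluates a single fixed superposition. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_distances(native: Sequence[Coord], model: Sequence[Coord]) -> List[float]:
    """Distances between corresponding Cα atoms of pre-superimposed structures."""
    if len(native) != len(model):
        raise ValueError("coordinate lists must have the same length")
    return [math.dist(a, b) for a, b in zip(native, model)]


def rmsd(distances: Sequence[float]) -> float:
    """Root-mean-square deviation over the matched atoms."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def gdt_ts(distances: Sequence[float],
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Average percentage of residues within 1, 2, 4 and 8 Å (single superposition)."""
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.5), (1.0, 0.0, 1.5), (2.0, 0.0, 3.0), (3.0, 0.0, 10.0)]
d = ca_distances(native, model)
score = gdt_ts(d)       # 56.25 for this toy example
deviation = rmsd(d)
```

In the toy example a single badly placed residue dominates the RMSD while lowering the GDT-TS-like score only moderately, which is one reason both values are useful when sorting models.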
Subsets of proteins can be selected on the basis of their functional and structural domains, GO annotation and Enzyme Commission Numbers.
The user can build a homology model of a protein of unknown structure using Modeller on the basis of templates identified using HHPred, and compare its structure with those of the templates used to build it, taking advantage of all the described visualization tools.
Models for the 7,166 proteins obtained as described in Methods and annotated at the residue level with information about secondary structure, solvent accessibility, cavity occurrence, average depth and protrusion and burial indexes are stored in the ModelDB relational database.
The ModelDB database can be accessed via the publicly available ModelDB web server. Both can also be downloaded and installed locally.
The ModelDB pipeline was used to build the pre-calculated model sets stored in the database and can be used to build models for user-provided structures. In this case, input sequences are subjected to the same procedure described in Methods for building the database.
The ModelDB web interface (http://bl210.caspur.it/MODEL-DB/MODEL-DB_web/MODindex.php) is conceived to be as user-friendly as possible and has several features. A user can either specify a PDB code or upload a protein structure of interest. In both cases the chain needs to be specified (by default the first chain present in the structure is analyzed).
If the input protein is not present in the database or the user has changed the default modeling parameters, the modeling program is launched. This is followed by a BLAST search with stringent parameters (90% coverage and an E-value of 10⁻⁴) against PDB and Swiss-Prot, to retrieve information and functional annotations for the protein entry or for a very close homologue.
Upon completion of these steps, the output page described next is shown. If the entry is already stored in the database, this page is displayed directly.
Structures and surfaces can be colored according to different coloring schemes (Figure 3D). Collapsible boxes provide functional annotation (Figure 3E). Functional residues, as well as the distance in Å between corresponding Cα atoms of the experimental and modeled structures, can be visualized in the structure(s). Finally, the models of a given protein can be downloaded as a zip file.
We show here examples of how the ModelDB server can be used to identify the level of accuracy required for simple structure-based computations.
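One way to turn such a comparison into an accuracy threshold is sketched below. It assumes that, for each decoy model, the user has recorded its GDT-TS and the deviation of the method's result from the result obtained on the experimental structure; the function name, the input format and the notion of a tolerance are assumptions of this sketch rather than features of the server.

```python
from typing import Iterable, Optional, Tuple


def accuracy_threshold(results: Iterable[Tuple[float, float]],
                       tolerance: float) -> Optional[float]:
    """Lowest GDT-TS such that every model at or above it stays within tolerance.

    `results` holds (gdt_ts, deviation_from_native_result) pairs, one per model.
    Returns None if even the most accurate model exceeds the tolerance.
    """
    threshold = None
    # Walk from the most to the least accurate model; for equal GDT-TS values
    # the worst deviation is examined first so that ties cannot hide a failure.
    for gdt, deviation in sorted(results, key=lambda r: (-r[0], -r[1])):
        if deviation > tolerance:
            break
        threshold = gdt
    return threshold


# (GDT-TS, deviation of the computed property from the native value)
results = [(95.0, 0.1), (80.0, 0.3), (60.0, 1.2), (70.0, 0.4), (40.0, 0.2)]
threshold = accuracy_threshold(results, tolerance=0.5)
```

In this toy input the model at GDT-TS 40 happens to agree with the native result, but it is not counted because a more accurate model (GDT-TS 60) already fails. The sketch therefore reports the conservative value of 70.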
The last question we asked is whether the relative position of residues forming an active site, which are well conserved throughout evolution, can be reliably measured using models. This is relevant because in many cases the presence of specific residues at a given distance from each other is a very good sign of an active site.
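A check of this kind can be sketched as a comparison of inter-residue distances between the experimental structure and a model. The code below is an illustration under stated assumptions: each active-site residue is represented by a single point (for example its Cα atom), residues are keyed by number, and the function names are hypothetical. Because only internal distances are compared, no superposition of the two structures is needed.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: Dict[int, Coord]) -> Dict[Tuple[int, int], float]:
    """Distance for every pair of residues, keyed by (residue_i, residue_j)."""
    return {(a, b): math.dist(coords[a], coords[b])
            for a, b in combinations(sorted(coords), 2)}


def max_distance_deviation(native: Dict[int, Coord], model: Dict[int, Coord]) -> float:
    """Largest absolute change in any inter-residue distance (in Å)."""
    shared = set(native) & set(model)  # only residues present in both structures
    nat = pairwise_distances({r: native[r] for r in shared})
    mod = pairwise_distances({r: model[r] for r in shared})
    return max((abs(nat[p] - mod[p]) for p in nat), default=0.0)


# Toy coordinates for three residues of a hypothetical catalytic site
native_site = {57: (0.0, 0.0, 0.0), 102: (3.0, 0.0, 0.0), 195: (0.0, 4.0, 0.0)}
model_site = {57: (0.0, 0.0, 0.0), 102: (3.5, 0.0, 0.0), 195: (0.0, 4.0, 0.0)}
worst = max_distance_deviation(native_site, model_site)
```

Running this over models of decreasing accuracy shows at which point the site geometry stops being reproduced within a chosen tolerance.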
Since the gap between known protein sequences and structures continues to increase, researchers need to make use of protein structural models more routinely. Models usually contain structural inaccuracies that vary in number and severity, but they can still provide important insights into a protein's role. There is no general rule that relates the accuracy of a model to its usefulness for different applications, so the tolerance to model quality needs to be tested for each specific structure-based method. ModelDB, the tool introduced here, serves this purpose by rapidly generating decoy sets for the proteins of interest. These decoys are intended to be used to test structure-based methods and to decide to what extent each method can be applied to computed protein structure models. The tool allows the user to establish the quality threshold at which interpretable results, analogous to those that would be obtained with native structures, can be produced.
The project has involved the implementation of a pipeline divided into programs that work together but can also be used independently, either on-line or locally when larger calculations are required. The ModelDB modeling pipeline takes a protein structure as input to generate single-template decoy models; it makes use of an in-house program named mappON to visualize the structures and the models colored according to different descriptors (solvent accessibilities, cavity occurrences, etc.). The on-line versions of both ModelDB and mappON query a relational database that contains not only pre-calculated decoy models but also functional annotations extracted from different sources.
ModelDB contains decoy models created for a significant subset of the PDB, thereby covering a significant portion of the protein structural space compared to other resources; this portion will increase as new decoy sets are built and stored in the database. Individual decoy sets are expected to cover wider quality ranges in new releases as more structures are deposited in the PDB. Last but not least, ModelDB also provides a visualization window where any decoy in a set, colored according to different descriptors, can be loaded, inspected and compared with its native counterpart.
This work was supported by Award number KUK-I1-012-43 made by King Abdullah University of Science and Technology (KAUST).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.