A quality metric for homology modeling: the H-factor
© di Luccio and Koehl; licensee BioMed Central Ltd. 2011
Received: 3 August 2010
Accepted: 4 February 2011
Published: 4 February 2011
The analysis of protein structures provides fundamental insight into most biochemical functions and consequently into the cause and possible treatment of diseases. As the structures of most known proteins cannot be solved experimentally, because of technical or sometimes simply time constraints, in silico protein structure prediction is expected to step in and generate a more complete picture of the protein structure universe. Molecular modeling of protein structures is a fast-growing field and a tremendous amount of work has been done since the publication of the very first model. The growth of modeling techniques, and more specifically of those that rely on the existing experimental knowledge of protein structures, is intimately linked to the development of high-resolution experimental techniques such as NMR, X-ray crystallography and electron microscopy. This strong connection between experimental and in silico methods is however not devoid of criticisms and concerns among modelers as well as among experimentalists.
In this paper, we focus on homology modeling and, more specifically, we review how it is perceived by the structural biology community and what can be done to impress on experimentalists that it can be a valuable resource to them. We review the common practices and provide a set of guidelines for building better models. For that purpose, we introduce the H-factor, a new indicator for assessing the quality of homology models, mimicking the R-factor in X-ray crystallography. The method for computing the H-factor is fully described and validated on a series of test cases.
We have developed a web service for computing the H-factor for models of a protein structure. This service is freely accessible at http://koehllab.genomecenter.ucdavis.edu/toolkit/h-factor.
Since 1958, when Kendrew et al. reported the first atomic-level resolution of a protein structure (myoglobin), the structural biology field has expanded dramatically with the development of new tools and methods to gain access to the atomic details of a protein or a nucleic acid. This opened a completely new world of knowledge and understanding to the scientific community, as the analysis of protein structures provides fundamental insight into most biochemical functions and consequently into the cause and treatment of diseases. Structural biology is now recognized as a fundamental step in our quest to understand life at the molecular level.
Finding the structures of all proteins is currently a bottleneck for genomics studies. In this matter, the Protein Structure Initiative (PSI) aims at the determination of approximately 100,000 three-dimensional (3D) structures in 10 years. However, the protein sequence databank (UniProt/TrEMBL) is growing at a much faster rate, with more than 10 million sequences available to date (March 2010). At the same time point, the Protein Data Bank (PDB) includes 64,100 structures, out of which only approximately 4300 are "unique" at chain level (i.e. once we remove "redundant" proteins whose sequences have more than 95% sequence identity with another protein in the PDB). It should be noted that these structures only represent a biased sample of the protein universe. For example, the PDB includes only 220 unique membrane proteins, which is very few since membrane proteins constitute around 20-30% of most proteomes. Notably, the human genome has ~21,000 protein-encoding genes for a proteome of ~1,000,000 proteins once the complexity induced by alternative splicing events is taken into account. In addition, due to experimental limitations, the vast majority of the solved structures are below the 50 kDa threshold, which excludes numerous larger proteins. Large proteins however represent a significant fraction of the proteins present in an organism; for instance, proteins found in the yeast Saccharomyces cerevisiae have five hundred amino acid residues on average and their lengths can reach two thousand eight hundred residues. The structures of these large proteins, as well as of even larger assemblies, can be solved by electron microscopy at a somewhat low resolution. While this field is expanding very fast and a growing number of structures solved at atomic-level resolution have been reported [5, 6], its impact with respect to the size of the protein sequence databank remains limited.
Many more protein structures have been solved by either X-ray crystallography or nuclear magnetic resonance (NMR). It remains however that most proteins are out-of-reach because of technical difficulties. There is clearly a huge gap between the world of known structures and the universe of known protein sequences. Structural genomic projects are unable to keep up with newly discovered genes.
One way to work around this problem is to use computational methods to predict protein structures. In silico protein structure prediction techniques can be divided into two categories: the ab initio folding methods and homology modeling. In this paper we focus on the latter. We note that both approaches have been shown to yield astounding results, as shown in the successive CASP contests. However, they do require caution: while predicting the structure of a protein is an intellectual challenge that requires solving many practical issues, it is often considered as an art in essence.
The growth rate of structures deposited in the PDB has been slowing down since 2004, along with the number of new superfamilies or folds discovered [8, 9]. One possible explanation is that many proteins still evade the structural biology pipelines at this time because of the technical difficulties described above. In 1992 Chothia hypothesized that the number of protein folds in nature is probably finite and around 1,000 [9, 10]. The latest analysis of the PDB and of the structural classification of proteins (SCOP) showed that we have not yet reached a plateau (currently estimated to be around 1,500). The current rate at which proteins are added to the PDB is far too slow to keep pace with the number of new protein sequences discovered every year. The situation is however not so negative. There is a definite hope that the current content of the PDB will allow us to predict reliably the correct scaffold of more than 70% of the whole proteome using in silico methods. This is the rationale for using homology modeling to complement experimental techniques.
Biologists unanimously consider X-ray crystallography as the prime source of structural information on proteins and the "gold standard" in terms of accuracy: they base their confidence on its long list of published successes. The vast majority of structures deposited in the PDB were determined by X-ray crystallography and 14 Nobel prizes in Chemistry and Medicine have been awarded to crystallographers (for recent reviews, see Kleywegt and Jones, 2002; Wlodawer et al., 2007; Brown and Ramaswamy, 2007; Ilari and Savino, 2008). Homology modeling on its own however is not devoid of successes. The very first published homology model, in 1969, was the small protein α-lactalbumin, which was modeled on the basis of the structure of hen egg white lysozyme as a template, with the two proteins sharing 39% sequence identity. When the structure of α-lactalbumin was later solved by X-ray crystallography, the model turned out to be essentially correct. Since then, homology modeling has continuously extended its field of applications, including designing mutants to test hypotheses about protein functions, identifying active sites, drug design, protein-protein docking, facilitating molecular replacement in X-ray structure determination, and refining models based on NMR constraints (for recent reviews of homology modeling applications, see [23–25]). Despite these successes, homology modeling is not yet a well-established alternative or complement to experimental structural biology. It remains the focus of many criticisms, often coming from the structural biologists themselves, as they often consider a protein model to be unreliable because it is not based on experimental data. The question arises as to what needs to be done to give homology modeling its credentials.
In this paper, we focus on homology modeling and more specifically, we review how the structural biology community perceives it and what can be done to impress on experimentalists that it can be a valuable resource to them. It is organized as follows. The next section reviews the differences and similarities between homology modeling and high-resolution experimental structural biology. In particular, we illustrate the steps in the homology modeling procedure that are putative sources of errors. Our goal is to provide a useful step-by-step handbook for non-specialists in order to help build better models.
The following section introduces the H-factor, a new indicator for assessing the quality of homology models. The H-factor is designed to check how well a family of homology models reflects the data that were used to generate those models, in the spirit of the R-factors in X-ray crystallography. The results section that follows validates the H-factor on a series of test cases. We conclude the paper with a discussion on what remains to be done to make homology modeling a prime technique for the biologists.
Homology Modeling versus Experimental Structural Biology
This section reviews briefly the quality of protein structure models obtained either using high-resolution experimental techniques or homology modeling. Our hope is to identify common good practices as well as safeguards from which we can derive a validation tool for the latter. We start with the concept of a structural model and its meaning in the two communities of experimental and computational structural biologists. We then highlight the pros and cons of X-ray crystallography and NMR spectroscopy. An overview of the different steps involved in homology modeling follows, with emphasis on sources of errors and how they can be checked. Ultimately our goal is not to rank these methods but rather we hope to show that they all provide valuable and often complementary information, as long as the proper safeguards are applied.
What is a model?
The meaning of the word "model" is ambiguous in the structural biology community. A model for a protein structure can be obtained either by X-ray crystallography, by NMR spectroscopy, by electron microscopy, by computational methods or by combinations of all or some of these techniques. With experimental techniques, the atomic coordinates are refined against experimental structural restraints and constraints. Eventually the final model is called a "structure" when the refinement statistics converge toward acceptable canonical values. Note that often this final model is subjected to refinement using simulation techniques such as constrained molecular mechanics or molecular dynamics simulations. Even though these simulations are constrained with the experimental data, the subsequent "structure" cannot be considered to be fully independent of modeling. On the other hand, an in silico model is generated with no or very limited experimental constraints: it depends obviously on the hypotheses included in the modeling process, on the force field used in the simulations as well as on the quality of the scientific computing tools that were used during the modeling steps. While the quality of an "experimental" model can be assessed directly against the experimental data, the quality of an "in silico" model is more subjective and ultimately defined through the usefulness of the model: this is most probably the source of the mistrust towards modeling in general.
X-ray crystallography: source of errors and quality metrics
A number of factors contribute to the quality of an X-ray structure. The first factor relates to the intrinsic properties of the crystal and its diffraction capabilities, which are mostly evaluated in terms of resolution. The quality metrics used in X-ray crystallography fall into three categories: 1) metrics that measure the quality of the raw data, 2) metrics that measure the agreement of the refined structure with the data and 3) metrics that validate the model alone for ideal stereochemistry, rotamers and bad clashes, as implemented in What Check or Molprobity for instance. The first category lies upstream of the structure building process, as the measures it includes evaluate the quality of the experimental diffraction data. The Rsym indicator for example measures the average spread of individual measurements with respect to their symmetry-equivalent measurements. A good dataset will have an Rsym smaller than five percent. In addition, the quality of a dataset is also assessed by its signal-to-noise ratio <I/σ(I)> and its data-collection completeness for a given space group. Unfortunately, only a few crystals diffract to atomic resolution (under 1 Å) with ideal quality metrics. Most of the crystal-based structures have therefore been solved with good to average raw data quality. Although building a structure can be semi-automatic, with automated tools available for chain tracing, side-chain building, ligand building and water detection, it is still refined by experimentalists using their subjective interpretations of the data. It is common for example to find areas of poorly defined electron density due to disordered regions. The experimentalist interprets these data to the best of her knowledge, but this is unfortunately a common source of errors. The second set of quality metrics assesses the agreement of the structure with the experimental data. This set includes the R-factor and the "free" R-factor (R-free).
The R-free is analogous to the R-factor but uses a small subset of the data that have been flagged-out and not taken into account during any refinement process . Its purpose is to monitor the progress of refinement and to check that the R factor is not being artificially reduced by the introduction of too many parameters. As such, it provides an unbiased indicator of the errors in the structure and prevents any over-refinement and over-interpretation of the data. Both factors along with Rsym and Rmerge can be seen as indicators of the errors inherent in the refined model and in the experimental data.
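The relationship between the R-factor and R-free can be made concrete with a minimal sketch. It assumes the conventional definition R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs| and further assumes that the calculated structure factor amplitudes have already been scaled to the observed ones. The function names and the `free_flags` argument are illustrative and do not come from any refinement package.

```python
def r_factor(f_obs, f_calc):
    """Conventional R: sum of | |Fobs| - |Fcalc| | divided by sum of |Fobs|.

    Assumes f_calc is already on the same scale as f_obs.
    """
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, free_flags):
    """Split reflections into a working set and a flagged-out test set.

    free_flags[i] is True for reflections that were excluded from refinement;
    R-free is simply the R-factor computed on that excluded subset.
    """
    work = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if not free]
    test = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor([o for o, _ in work], [c for _, c in work])
    r_free = r_factor([o for o, _ in test], [c for _, c in test])
    return r_work, r_free
```

A model that has been over-fitted tends to show an R-free noticeably higher than the R-factor of the working set, which is why the two values are reported together.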
The quality of protein crystal structures has been reviewed several times over the last fifteen years and it is striking to notice that, despite constant improvement in technology and validation tools, it has not improved overall. The quality spectrum of X-ray structures remains broad. The increase of automation through structural genomics pipelines did not help raise the bar in that matter, as human intuition and reasoning are taken out of the process [17, 18]. Interestingly, X-ray structures published in high-impact general science journals are usually the worst offenders in terms of quality and errors. This is explained by the experimental difficulties associated with solving novel high-profile structures and the rush to publish in a competitive environment.
X-ray crystallography is not immune to errors and mistakes, both honest and dishonest. Unfortunately, over the years we have seen gross mistakes in various structures leading to the retraction of several high-impact papers in leading journals because of a lack of quality control during the structure building pipeline [27, 28]. A recent review by G. J. Kleywegt highlights the need for, and the proper use of, validation methods in structural biology in general and in X-ray crystallography in particular. The author also emphasizes the use of validation methods early on in the project pipeline in order to minimize the number of erroneous high-profile structures that can hinder the progress of science for years to come. In light of a growing number of structure falsifications, and to prevent both dishonest and honest mistakes in structure determination, the curators of the PDB have implemented over the years new sets of validation procedures for the deposition process.
NMR spectroscopy: source of errors and quality metrics
Although structure determination by NMR spectroscopy methods is very different from X-ray crystallography, it shares similarities with the latter in terms of sources of errors. Instead of using an X-ray beam diffracting around electrons in a crystal, NMR spectroscopy is performed in solution and uses the magnetic properties of the nuclei with odd spins (mostly 1H, 13C and 15N). As the molecule of interest is placed in a strong magnetic field, each of these nuclei is characterized by a unique resonance frequency, i.e. the frequency at which it will absorb energy. This frequency depends on the local magnetic field that combines the external field and the local environment: it is referred to as the chemical shift. NMR experiments are designed to monitor the behavior of these nuclei as the system is perturbed from equilibrium and each experiment usually isolates one property, such as through-bond connectivities (COSY and TOCSY experiments) or spatial proximity that allows for energy transfer (NOESY experiments). In the specific case of proteins, the number of nuclei involved can be large, leading to crowding of the spectra: this is usually overcome by using multidimensional experiments (mostly 2D, but also 3D and 4D). The typical protocol for protein structure determination by NMR proceeds as follows. First, the chemical shifts observed on multidimensional spectra are assigned to their specific atoms (nuclei) in the protein (the assignment process). Second, through-bond and through-space coupling effects (i.e. J-coupling and Nuclear Overhauser effects, respectively) observed on these spectra are quantified and converted into angle and distance restraints. Most of these restraints correspond to ranges of possible values instead of a precise constraint. Third, a molecular modeling technique is used to generate a set of models for the protein structure that satisfy these experimental restraints as well as standard stereochemistry.
For a more detailed presentation of the application of NMR to protein structure determination, we refer the reader to [31–34]. Analogously to X-ray methods, the quality of NMR measurements affects the quality of the structures. The precision of a set of models for a protein structure determined by NMR is evaluated as the root-mean-square (RMS) difference between each model and a "mean" structure, defined geometrically as the mean of all the models (note that the stereochemistry of this mean model is usually not correct). The quality of each model is evaluated by the number of violations observed in the model compared to the experimental restraints. A high-quality NMR structure refers to a set of high-quality models with no violations that are tightly bundled around their mean, i.e. with a small RMS. Note that in addition to these NMR-specific quality measures, Garrett and Clore introduced an R-factor and a free R-factor for the refinement of NMR structures based on residual dipolar coupling, a long-range NMR measure obtained on proteins that have been partially oriented in dilute liquid crystals. As in X-ray crystallography, the quality spectrum of NMR structures is broad, with errors and mistakes reported that are inherent to a human-based determination process.
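The precision measure described above, the RMS difference between each model and the geometric mean of the ensemble, can be sketched as follows. This is a minimal illustration that assumes the models have already been superimposed and contain the same atoms in the same order; the function names are illustrative.

```python
import math


def mean_structure(models):
    """Geometric mean of an ensemble: average each coordinate over all models.

    Each model is a list of (x, y, z) tuples; models are assumed superimposed.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    return [
        tuple(sum(m[a][k] for m in models) / n_models for k in range(3))
        for a in range(n_atoms)
    ]


def rms_difference(reference, model):
    """Root-mean-square distance between equivalent atoms of two structures."""
    squared = sum(
        (p - q) ** 2
        for atom_ref, atom_mod in zip(reference, model)
        for p, q in zip(atom_ref, atom_mod)
    )
    return math.sqrt(squared / len(reference))


def ensemble_precision(models):
    """Return (average RMS to the mean structure, list of per-model RMS values)."""
    mean = mean_structure(models)
    per_model = [rms_difference(mean, m) for m in models]
    return sum(per_model) / len(per_model), per_model
```

A tightly bundled ensemble gives a small average value; as noted in the text, the mean structure itself is only a geometric reference and usually has incorrect stereochemistry.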
Homology modeling: source of errors and quality metrics
The general strategy developed for homology modeling proceeds through a canonical seven-step procedure (Figure 1): (1) Identify the template proteins that share structural similarity with the target; (2) Align the target sequence with the template sequences; (3) Build a single framework of spatially aligned template structures and assimilate the target protein backbone with this framework; (4) Build the missing backbone elements (loops) not represented in the template framework; (5) Build the target side chains; (6) Refine the model in order to minimize unrealistic contacts and strains; and (7) Evaluate the final refined model for physical tenability. To date numerous homology-modeling programs such as MODELLER, SegMod/ENCAD, Swiss-model, 3D-Jigsaw, BUILDER and Nest have been developed and many of them have been embedded into homology-modeling servers to ease the burden of generating models. Online portals such as the Protein Structure Initiative (PSI) model portal or the Swiss-Model Repository bring to the community a large database of models. The PSI model portal currently provides 8.2 million comparative protein models for 3.1 million distinct UniProt entries. Every model comes with relevant validation data. However, those models are generated automatically, without any human intervention, which might render them inaccurate in the absence of extra validation steps. In the following, we review each step of the homology modeling process and highlight potential sources of errors.
* Steps 1 and 2: Template selection and sequence alignment
The selection of template(s) is undoubtedly a critical step in modeling. It was long assumed that two proteins whose sequences share at least 40% identity have similar structures. If such a template exists, it is easily detected by any sequence alignment technique. Homology modeling under such conditions is then expected to generate models whose accuracy is close to that of an experimental structure. We know however that this is not always true. Roessler et al. recently reported the discovery of two native Cro proteins sharing 40% sequence identity but with different folds. Moreover, Alexander et al. were able to design two proteins with 88% sequence identity but having different structures and functions. Conversely, it is not uncommon for proteins, especially enzymes carrying the same function across the tree of life, to share a somewhat low sequence identity while being structurally similar, with a Cα r.m.s.d. of ~1.5 Å. All these observations indicate that the selection of a template is far from being a trivial task and that extreme caution should be applied.
The situation is even more difficult if there is no significant sequence similarity between the sequence of the target protein and any of the known protein structures. This is one of the current challenges of the post-genomic era that is tackled by fold-recognition methods, namely to identify a suitable template for homology modeling [46–48]. Unlike sequence-only comparison, fold-recognition techniques take advantage of the information made available by 3D structures. Despite a steady development over the years, as illustrated through the successive CASP experiments, fold-recognition techniques still have a number of limitations. They are however the key to extending the domain of application of homology modeling methods.
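Because the thresholds quoted above (40%, 88%) depend on how identity is counted, it can help to make the computation explicit. The sketch below assumes one common convention, in which identity is the fraction of identical residues among alignment columns where neither sequence has a gap; other conventions use the full alignment length or the shorter sequence length as denominator. The function name and the gap character are illustrative.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over columns where both sequences have a residue.

    Both arguments are rows of the same pairwise alignment, so they must
    have the same length. Columns containing a gap in either row are skipped.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (x, y)
        for x, y in zip(aligned_target, aligned_template)
        if x != gap and y != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)
```

Reporting the convention used alongside the percentage makes template choices easier to compare across studies.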
* Step 4: Loop building
The loop-building step is another key component in homology modeling. Loops participate in many biological events and functional aspects such as enzyme active sites, ligand-receptor interactions, and antigen-antibody recognition, among others. However, due to the flexible nature of loops, it is often difficult to predict their conformation. There are two main approaches to tackle the problem of loop modeling: methods that use databases of loop conformations and ab initio methods. In the database approach, a library of protein fragments whose size corresponds to the size of the loop to be modeled is scanned for fragments whose end-to-end distance matches the corresponding distance in the framework. The library is derived from the known protein structures in the PDB. This method has proved to be accurate when the loop is relatively short. Fidelis et al. have shown that loops of a maximum of seven residues can be modeled with confidence based on known structures. When the database method is combined with a restrained energy minimization, it extends the confidence of loop building up to nine residues. Beyond the nine-residue threshold, ab initio methods have to step in, mostly because for these longer loops the fragment library provides a poor sampling of the accessible conformational space. The ab initio loop prediction approach relies on a conformational search guided by a scoring function. The accuracy of ab initio loop modeling currently remains low, especially when dealing with very long loops.
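The database approach described above can be illustrated with a minimal sketch of its first filtering step: keeping only fragments of the right length whose end-to-end distance is close to the gap that the loop must bridge. The library layout (a dictionary of Cα coordinate lists), the tolerance value and the function names are assumptions made for illustration; real loop-building programs apply further geometric and energetic filters.

```python
import math


def end_to_end_distance(fragment):
    """Distance between the first and last C-alpha atoms of a fragment."""
    return math.dist(fragment[0], fragment[-1])


def candidate_loops(library, anchor_start, anchor_end, loop_length, tolerance=0.5):
    """Select library fragments compatible with the two framework anchors.

    library: dict mapping a fragment name to a list of (x, y, z) C-alpha
    coordinates. tolerance (in Angstrom) is an arbitrary illustrative value.
    Returns fragment names sorted by how closely they match the gap.
    """
    gap = math.dist(anchor_start, anchor_end)
    hits = []
    for name, fragment in library.items():
        if len(fragment) != loop_length:
            continue  # only fragments of the same size as the loop
        deviation = abs(end_to_end_distance(fragment) - gap)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return [name for _, name in sorted(hits)]
```

As loops get longer, fewer fragments pass such a filter and those that do sample the accessible conformations poorly, which is the limitation noted in the text.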
* Step 5: The side-chains positioning problem
* Step 6: Refinement of the final model
In a review written in 1999 on the CASP3 experiment, Koehl and Levitt noted that most models submitted in the homology modeling category were not refined, as previous CASP meetings had shown that refinement did not improve the models. Sadly for computational biologists, the situation has not improved and it remains difficult to generate a model closer to the native structure than the template used to build it. Energy refinement, originally introduced by Levitt and Lifson, forms the basis of current methodologies for protein structure refinement against experimental data. Without experimental restraints, refinement by energy minimization generally moves the protein structure away from its X-ray structure. Some recent studies have shown that this negative trend can be reversed through the inclusion of evolutionarily derived distance constraints, through the combination of sophisticated sampling techniques based on replica exchange molecular dynamics and statistical potentials, through the addition of a carefully designed, differentiable smooth statistical potential, or by careful consideration of the solvent effects. While these studies are definitely a source of hope, much work remains to be done as far as refinement is concerned.
A general framework for model assessment: R-factors and equivalent
The wide availability of homology modeling software packages, as well as the development of web interfaces that automate the use of these packages has resulted in better access to and a broader usage of homology modeling. While this is definitely commendable and homology modeling should be even more advertised, there are risks that this will lead to errors because of the difficulties in evaluating the correctness of the models these techniques generate. This is primarily due to the lack of cross validation indicators such as the R-factor and R-free in X-ray crystallography . In addition to the stereochemistry assessment of the structure and the good correlation between the R-factor and the R-free values, the quality of an X-ray structure can be evaluated based on the thermal motion value of atoms described by the Debye-Waller factor or B-factor. The B-factor allows for the identification of zones of large mobility or error like disordered loops. When multiple monomers populate an asymmetric unit of a crystal, the crystallographer will choose to focus on the analysis of the monomer with the lowest average B-factor since the likelihood of errors is lower. Unfortunately, such criteria do not apply to models thus rendering the identification of zones of uncertainties a non-trivial task.
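The B-factor criterion mentioned above, choosing the monomer with the lowest average B-factor, can be sketched directly from the fixed-column layout of PDB ATOM records (chain identifier in column 22, temperature factor in columns 61-66). The function names are illustrative, and the sketch ignores HETATM records, alternate locations and multi-model files.

```python
def mean_b_factor_by_chain(pdb_lines):
    """Average B-factor of the ATOM records of each chain.

    Relies on the fixed-column PDB format: chain ID at column 22
    (index 21) and B-factor in columns 61-66 (slice 60:66).
    """
    sums, counts = {}, {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]
        b_factor = float(line[60:66])
        sums[chain] = sums.get(chain, 0.0) + b_factor
        counts[chain] = counts.get(chain, 0) + 1
    return {chain: sums[chain] / counts[chain] for chain in sums}


def most_ordered_chain(pdb_lines):
    """Chain with the lowest average B-factor."""
    means = mean_b_factor_by_chain(pdb_lines)
    return min(means, key=means.get)
```

No comparable per-atom quantity comes with a homology model, which is why locating its uncertain regions requires other indicators.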
With respect to homology modeling, the main step is to thoroughly validate the model and always provide all the relevant details about the protocol used. This will give the user all the necessary data to judge the quality of the model. The quality assessment of models has been the focus of numerous studies and various algorithms have been reported over the years. In this matter, tremendous efforts are being made to produce the best triage procedures or scoring functions among models, as seen in the latest CASP meetings. These scoring functions are based on statistical potentials, local side-chain and backbone interactions, residue environments, packing estimates, solvation energy, hydrogen bonding, and geometric properties. In addition, it is essential that the quality of stereochemistry be kept high. The stereochemistry can be assessed by commonly used programs such as Procheck or WhatIf [78, 79].
Ultimately, the validation of models comes from experiments such as site-directed mutagenesis, circular dichroism, cross-linking, mass spectrometry, fluorescence-based thermal shift, light scattering, molecular FRET or electron microscopy. Such experimental data can be translated into constraints/restraints and introduced in the modeling protocols, thus improving the accuracy of models. One can also identify fast and cheap experimental procedures that can help test homology models. The easiest way is to crosscheck models with experimental structures. For instance with enzymes, it is possible to verify the location of important catalytic residues in the active site by comparison with homologous family members. Most importantly however, a model needs to be checked manually in the same way an NMR or an X-ray structure is processed.
Despite all these methods, the homology modeling community still lacks a simple indicator which gives an unambiguous feedback on how the final model, or family of models, reflects the data that were used in the modeling process, similar to the couple R-factor/R-free for X-ray crystallography. The next section introduces such an indicator, namely the H-factor.
Computing the H-factor
In score (1), the sum is computed over all positions in the sequence alignment between the target and the template, N is the length of the sequence alignment, p is the secondary structure prediction of the target at position i (values for p are 'H' for helix, 'S' for strand, and 'C' otherwise), c(i) is the confidence factor reported by psipred (integer value, from 1 to 10) for the secondary structure prediction at position i, and s is the secondary structure type observed at position i in the template structure, as reported by stride. The offset coefficients a and b are set to 1.3 and 0.9, respectively, to ensure that score (1) has values between 0 and 10.
In score (2), N is the length of the sequence alignment.
In score (3), n is the number of models. The offset coefficients a and b are chosen such that average RMS values of 0.1 and 7 Å correspond to scores of 1 and 10, respectively; the corresponding values are a = 1.3 and b = 0.87.
In score (4), m is the number of functional domains identified in the target sequence, MA_d is the structural fragment extracted from the average structure MA corresponding to the domain d, n is the number of domains homologous to domain d found in PDB structures, and D_{d,i} is the i-th possible structure of the domain homologous to d. The offset coefficients a and b have been chosen such that average RMS values of 0.1 and 7 Å correspond to scores of 1 and 10, respectively; the corresponding values are a = 1.3 and b = 0.87. In most cases this keeps score (4) between 0 and 10. Note that if this procedure does not find an equivalent domain for a fragment, the fragment is ignored; if no domains are found for any fragment, score (4) is ignored.
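The calibration stated for the RMS-based scores can be checked numerically. The sketch below assumes that the average RMS is mapped to a score by a linear function, score = a × RMS + b. This form is an inference from the two calibration points given in the text (0.1 Å → 1 and 7 Å → 10), which reproduce a ≈ 1.3 and b ≈ 0.87; it is not taken from the equations of the paper, and the function names are illustrative.

```python
def calibrate_linear(x_low=0.1, y_low=1.0, x_high=7.0, y_high=10.0):
    """Solve y = a * x + b through the two calibration points from the text."""
    a = (y_high - y_low) / (x_high - x_low)
    b = y_low - a * x_low
    return a, b


def rms_to_score(average_rms, a=1.3, b=0.87):
    """Map an average RMS (in Angstrom) to a score, assuming a linear form.

    The defaults are the rounded coefficients quoted in the text; the linear
    mapping itself is an assumption made for this illustration.
    """
    return a * average_rms + b
```

With the rounded coefficients, an average RMS of 0.1 Å gives a score of 1.0 and 7 Å gives about 9.97. Since the mapping is not clipped, an average RMS above 7 Å would give a score above 10, which is consistent with the remark that the score is only usually within the 0 to 10 range.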
The H-factor computation is accessible online at http://koehllab.genomecenter.ucdavis.edu/toolkit/h-factor with a simplified operating manual (Cf. additional files). The source code is available upon request.
Testing of the H-factor on CASP targets
Table 1. Test set based on CASP targets.
Test case (CASP ID) | Template(s) (resolution (Å)) | % sequence identity between template and target | Type of sequence alignment | Loop segments of 3 or more residues
80-86; 163-166; 179-182; 225-228; 271-275
2DCN (2.25); 1RKD (1.84); 1V1A (2.10); 1VM7 (2.15); 2AFB (2.05); 2FV7 (2.10)
33-37; 75-78; 100-103
33-37; 75-78; 100-103
Results and discussion
The H-factor: detailed analysis on three CASP targets
Table 2. Comparing the H-factor with cRMS, DOPE and QMEAN scores to assess models generated for CASP7 targets.
CASP7 target | Scoring function (a) | cRMS (Å) (b) | % ID (c)
When we deliberately introduce a shift at position 31 in the alignment between the sequence of T0295 and the sequence of its template 1ZQ9 (see Figure 2), the corresponding models generated by MODELLER show structural diversity in the loop region near the shift (i.e. near Thr31). Score (3) captures this structural diversity within the set of models and raises the H-factor from 19% to 21% (see Table 2). Score (2), however, could not detect a single-position shift in the alignment. The H-factor is therefore capable of detecting backbone deviations due to modeling errors, in the same way as the R-factor does.
The CASP7 target T0375 is a more difficult modeling case. It is a human ketohexokinase, and the rigid-body domain closure of sugar kinases is known to be large, which adds complexity to the modeling process. Although several sugar kinase structures have been solved, the search for templates for T0375 identified only six distinct remote templates. Moreover, all six templates are needed to obtain complete sequence coverage of T0375 within one single framework with MODELLER. In addition, the template sequences have low similarity with the target sequence. This is detected by the scoring function (2), which returns a value of 8.6 (out of 10) (Table 2). Note that score (2) is not a direct measurement of the quality of the sequence alignment. It is designed to quantify the difference between the two sets of sequences: if this difference is small, the model is expected to be good, while if the difference is large, the sequence alignment most probably belongs to the twilight zone and the models should then be considered with caution. The overall H-factor for the models generated for T0375 is 41%. This mid-range value indicates that caution should be used when interpreting or using these models. Indeed, the average cRMS between these models and the actual structure of T0375 (available in the PDB in the file 2HLZ) is 3 Å, reflecting only medium-resolution agreement.
The CASP7 target T0287 is the most difficult test case we have considered. In fact, it would not be considered a homology-modeling target by many, despite the fact that a (remote) structural homologue is available in the PDB. We decided to include it in our study to test whether the H-factor still provides useful information when applied to a difficult test case. T0287 corresponds to CaGS, a protein from Helicobacter pylori whose function is unknown. A database search over all sequences of proteins whose structure is known identifies a unique template, 1V55, with a low sequence identity (16%, see Table 1). 1V55 is a cytochrome c oxidase and it is not clear that T0287 and 1V55 are homologues. We built 20 models for T0287, using 1V55 as a template and the out-of-the-box alignment between the sequences of T0287 and 1V55 generated by ClustalW. As mentioned above, we did not try to optimize the alignment or the modeling itself, as our interest is to see whether the H-factor is able to assess the quality of the models we generated. In this specific case, all four scores reported high values (7.0, 8.1, 7.5 and 7.5 for scores (1), (2), (3) and (4), respectively). The overall H-factor is 75%, a value that should raise concerns about the quality of these models. As for target T0375, this is confirmed by the experiment: the average cRMS between these models and the actual structure of T0287 (PDB code 2G3V) is 5.8 Å, indicating that the models are poor approximations of the native structure.
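The H-factor reported for T0287 can be recovered from the four scores quoted above. The short sketch below assumes that the H-factor is the unweighted mean of the four 0-10 scores expressed as a percentage; this assumption is consistent with the values given for T0287 but is not a restatement of the formal definition, and the function name is hypothetical.

```python
def h_factor_percent(scores):
    """Unweighted mean of 0-10 scores, expressed as a percentage.

    Assumption for illustration only: all scores carry equal weight.
    """
    if not scores:
        raise ValueError("at least one score is required")
    return 10.0 * sum(scores) / len(scores)

# Scores (1) to (4) reported in the text for the T0287 models.
t0287_scores = [7.0, 8.1, 7.5, 7.5]
print(f"H-factor for T0287: {h_factor_percent(t0287_scores):.0f}%")
```

With these inputs the mean score is 7.525, which rounds to the 75% quoted for T0287.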
H-factor: Detecting bad models
Table 3. Comparison of the H-factor with cRMS, DOPE and QMEAN scores to assess models generated for CASP7 targets from the free-modeling category.
Columns: CASP 7 target | cRMS (Å) (a)
H-factor: characterizing good models
Table 4. H-factor applied to NMR structures with 20 models or more.
Columns: Scoring function (a) | cRMS (Å) (b)
Relationship between the H-factor and cRMS
The cRMS by itself is a reasonable quality indicator as long as its value remains low (say below 2 Å). It should be noted that it is an average value computed over the whole structure. As such, it is very sensitive to large structural fluctuations, for example in disordered loops, which can lead to large cRMS values even if the conserved domains are structurally very similar. It is well known, for example, that caution should be applied when using cRMS to assess the quality of a structural alignment. cRMS is implemented in score (3) of the H-factor to evaluate the heterogeneity within a set of models, as this heterogeneity is directly related to both the choice of the template and the quality of the sequence alignment. However, score (3) loses accuracy for cRMS values larger than 2 Å, which are not uncommon when a remote template is used. cRMS is also implemented in score (4) to quantify the modelling quality of specific individual domains by comparing them with the corresponding domains in the PDB. This is a domain-based cRMS that does not take into account potential long loops between domains, making it more reliable. Taken together, scores (3) and (4) alleviate most of the limitations of cRMS while retaining its major properties. The H-factor is therefore expected to be more reliable than cRMS alone for judging the accuracy of a wide range of models, as seen in Table 3.
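The sensitivity of a whole-structure RMS to a few flexible residues can be illustrated with a toy calculation. The per-residue deviations below are hypothetical values chosen for illustration (a 90-residue core deviating by 1 Å and a 10-residue loop deviating by 10 Å); they are not taken from any of the test cases.

```python
import math

def rms(deviations):
    """Root mean square of per-residue deviations (in Å)."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

# Hypothetical per-residue deviations after superposition.
core = [1.0] * 90   # well-conserved domain
loop = [10.0] * 10  # disordered loop

print(f"core only:       {rms(core):.2f} Å")
print(f"whole structure: {rms(core + loop):.2f} Å")
```

In this toy case only 10% of the residues are misplaced, yet the whole-structure value rises well above the 2 Å range in which cRMS is a reasonable indicator, while the value restricted to the core remains at 1 Å. A domain-based measure such as score (4) is intended to avoid this effect.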
Comparing the H-factor with ProSA, DOPE and QMEAN
The statistical potential Discrete Optimized Protein Energy (DOPE) is another measure of model quality that was introduced in MODELLER-8. DOPE is a statistical potential with an improved reference state that accounts for the compact shape of native protein structures. The DOPE score is designed such that large, negative scores are usually indicators of good models. In their original study, Shen and Sali found that the accuracy of DOPE in assessing a homology model improves as the accuracy of the models improves. We observe a similar behaviour for targets T0295 and T0295* (Table 2). These two targets correspond to the same protein and it is therefore possible to compare the DOPE scores of their models. The model generated for T0295*, based on an incorrect alignment, has a much less favorable (less negative) DOPE score (-24317) than the model generated with the correct alignment (T0295; -33940). Note that we cannot compare DOPE scores for proteins of different sizes, as these scores are not normalized. DOPE scores are therefore relative, and designed to pick a "good" model among poorer models. DOPE scores do not directly assess the quality of the model that is picked, i.e. whether it is likely to be similar to the actual structure. The H-factor is a better indicator in that respect.
QMEAN, which stands for Qualitative Model Energy ANalysis, is a composite scoring function for homology models that describes the major geometrical aspects of protein structures (including a torsion angle potential over three consecutive amino acids, a secondary structure-specific residue-based statistical potential, and a solvation potential for the burial of residues) as well as the agreement between the predicted and calculated secondary structure and solvent accessibility, respectively. As such, it includes a term similar to score (1) of the H-factor, as well as terms that assess different properties such as residue accessibility. The QMEANnorm score is a normalised version of the QMEAN score in which all terms are divided by the number of interactions/residues in order to avoid a size bias of the score. QMEANnorm scores vary between 0 and 1, with larger scores expected to correspond to better models. Unlike the DOPE score, both the H-factor and QMEANnorm scores allow for the comparison of proteins of different sizes. The QMEAN score is as effective as ProSA or DOPE for detecting errors in a model that result from errors in the sequence alignment between the template and target protein: T0295* has a QMEANnorm score of 0.196 while the score for T0295 is 0.735 (Table 2). Interestingly, T0295* (0.196) has a less favorable QMEANnorm score than the erroneous model generated for the CASP target T0287 (0.285) (see Table 3). We have observed, however, that the QMEANnorm score is prone to fail: some of the erroneous models generated for the CASP target T0307 have QMEAN scores of 0.4 to 0.6, i.e. they are evaluated to be almost as correct as the positive control T0295 (0.735). Unlike ProSA and QMEAN, the H-factor did detect that these models were to be considered with caution. Because it analyzes a set of models, we believe that the H-factor score is more robust as an absolute measure of the quality of a model. It lacks, however, the ability to discriminate among a set of models generated for the same target; ProSA and DOPE are better potentials for this specific task.
These results emphasize the essential differences in the nature of the ProSA, DOPE, QMEANnorm and H-factor scores. ProSA, DOPE and QMEAN check the quality of a model, independently of the context in which it was generated. The H-factor, on the other hand, checks the quality of a set of models with respect to a context that includes, for example, the sequence alignment assessed by score (2). Modelers should exploit these differences to extend their assessment of the models they generate. We believe that ProSA, DOPE, QMEAN and H-factor analyses are all needed to provide a better overview of the quality of models derived by homology modeling.
Current limitations and originalities of the H-factor
The H-factor is not a panacea, and does not provide a universal solution to the problem of assessing the quality of a model generated by homology modeling. First, the H-factor has some technical limitations. Our current implementation does not take into account multiple templates, but rather only one single framework. The structural components included in the H-factor (i.e. scores (3) and (4)) are based on the backbone of the models, and do not take into account side chains and possible errors in their modeling. Second, the scoring function (3) of the H-factor measures the heterogeneity of a set of models generated with the same input. This means that the H-factor cannot be computed on a single model. In homology modeling the heterogeneity of models can be seen as a quality indicator, and building only one single model is not recommended. As with NMR structures, where one of the models can be chosen for analysis, the best model in homology modeling is chosen based, for instance, on the MODELLER energy function. Third, the H-factor does not include any external information. For example, if some biological data are available, such as the knowledge of the residues involved in the active site, or standard biophysical data such as melting temperature, or secondary structure content derived from circular dichroism, these data are currently not included in the H-factor analysis.
The R-factor is a measure of the agreement between the crystallographic model and the experimental X-ray diffraction data. Despite the lack of 'experimental' data to compare with, the modelling community has been searching for a similar indicator for homology modeling. Both QMEAN and the H-factor are designed to be 'absolute' indicators that assess the quality of homology models in a way that mimics the R-factor in X-ray crystallography. Both QMEAN and the H-factor provide an easy-to-use estimate of the quality of models based on scoring functions assessing various aspects of the modelling process as well as the model itself.
In vivo, macromolecular structures oscillate between numerous conformers, some more than others. While X-ray structures correspond to snapshots of a limited number of conformers, NMR structures tend to describe flexibility more accurately. Indeed, NMR "structures" are usually provided as a family of conformers that are meant to sample the conformational space accessible to the molecule of interest. In homology modeling, on the other hand, the heterogeneity of models is a quality indicator. A good set of models will have a cRMS very close to their framework. Moreover, if errors are made in the template choice or in the sequence alignment, then the models will be heterogeneous. The scoring function (3) is designed to quantify this assertion. It also means that the H-factor cannot be computed on a single model.
One original feature of the H-factor is the scoring function (4). It has been designed to evaluate the biological relevance of the models by comparing the model conformations of all the functional domains in the protein considered with their existing siblings deposited in the Protein Data Bank.
We acknowledge that there is room for improvement. Nevertheless, the H-factor we have introduced here is a first step towards validating homology models for biologists, in addition to existing methods, as illustrated by the examples shown above.
Homology modeling is slowly building up a record of success and can help structural biologists in many respects. Models can serve as a bootstrap structure for both NMR and X-ray crystallography and thus help save a huge amount of time. In X-ray crystallography, for instance, many derivative datasets are often needed to solve the phase problem. Alternatively, an accurate bootstrap would be extremely handy for molecular replacement. The same applies to NMR. Protein modeling is also crucial for fitting low-resolution electron microscopy maps or building accurate models using structural restraints gathered with small-angle X-ray scattering (SAXS) experiments. Models can be used at different levels of detail according to their accuracy. In the absence of experimental structures, they serve as starting points for modeling experiments, such as molecular dynamics studies, docking experiments and structure-based drug design. For instance, models of membrane proteins such as G-protein-coupled receptors (GPCRs) are extensively used, as few structures are available for this protein family.
In this study, we proposed a modeling etiquette that will hopefully help make good use of models. We introduced the H-factor, a new indicator that assesses the quality of models generated by homology modeling, mimicking the R-factor in X-ray crystallography. The H-factor is able to detect backbone anomalies as well as give feedback on the biological relevance of models. The H-factor evaluates the quality of a protein model within the context in which it is modelled, and we believe it is an essential tool that needs to be used in addition to the other validation tools available.
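The notion of heterogeneity within a set of models can be made concrete with a small sketch. It is not the definition of score (3); it only illustrates the idea under explicit assumptions: each model is a list of backbone (x, y, z) coordinates, all models have the same length and are already superposed in a common frame, and heterogeneity is measured as the mean coordinate RMS between each model and the average structure. All function names are hypothetical.

```python
import math

def average_structure(models):
    """Average coordinates over a set of pre-superposed models."""
    n_models = len(models)
    n_atoms = len(models[0])
    return [
        tuple(sum(m[i][k] for m in models) / n_models for k in range(3))
        for i in range(n_atoms)
    ]

def coordinate_rms(a, b):
    """Coordinate RMS (Å) between two structures; no superposition is done."""
    squared = sum(
        (p - q) ** 2 for atom_a, atom_b in zip(a, b) for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(a))

def heterogeneity(models):
    """Mean RMS between each model and the average structure."""
    avg = average_structure(models)
    return sum(coordinate_rms(m, avg) for m in models) / len(models)

# Toy example: two single-atom "models" that are 2 Å apart.
toy_models = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
print(heterogeneity(toy_models))
```

A set of identical models gives a heterogeneity of zero under this measure, and the value grows as the models diverge, which is the behaviour that would be expected when the template choice or the alignment is wrong. At least two models are needed for the measure to carry any information, in line with the remark that the H-factor cannot be computed on a single model.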
To search for protein structures using any of the accession numbers mentioned in this article, please follow this link (http://www.rcsb.org/pdb/home/home.do).
The National Institutes of Health (NIH), the National Research Foundation of Korea and the Kyungpook National University supported the research presented in this paper. We thank Mrs. Marie Vallet for technical assistance and for setting up the H-factor web tool and Dr. Xinwei Shi for technical assistance and helpful discussions. We thank the anonymous reviewers for their insightful comments that have helped improve our manuscript.
- Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC: A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature 1958, 181: 662–666. 10.1038/181662a0
- Keating AE: A rational route to probing membrane proteins. Genome Biol 2007, 8: 214. 10.1186/gb-2007-8-5-214
- Jensen ON: Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr Opin Chem Biol 2004, 8: 33–41. 10.1016/j.cbpa.2003.12.009
- Warringer J, Blomberg A: Evolutionary constraints on yeast protein size. BMC Evol Biol 2006, 6: 61. 10.1186/1471-2148-6-61
- Zhang X, Settembre E, Xu C, Dormitzer PR, Bellamy R, Harrison SC, Grigorieff N: Near-atomic resolution using electron cryomicroscopy and single-particle reconstruction. Proc Natl Acad Sci USA 2008, 105: 1867–1872. 10.1073/pnas.0711623105
- Yu X, Jin L, Zhou ZH: 3.88 Å structure of cytoplasmic polyhedrosis virus by cryo-electron microscopy. Nature 2008, 453: 415–419. 10.1038/nature06893
- Moult J, Pedersen JT, Judson R, Fidelis K: A large-scale experiment to assess protein structure prediction methods. Proteins 1995, 23: ii-v. 10.1002/prot.340230303
- Levitt M: Growth of novel protein structural data. Proc Natl Acad Sci USA 2007, 104: 3183–3188. 10.1073/pnas.0611678104
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.
- Chothia C: Proteins. One thousand families for the molecular biologist. Nature 1992, 357: 543–544. 10.1038/357543a0
- Levitt M: Nature of the protein universe. Proc Natl Acad Sci USA 2009, 106: 11079–11084. 10.1073/pnas.0905029106
- Cozzetto D, Tramontano A: Relationship between multiple sequence alignments and quality of protein comparative models. Proteins 2005, 58: 151–157. 10.1002/prot.20284
- Tramontano A, Morea V: Assessment of homology-based predictions in CASP5. Proteins 2003, 53(Suppl 6):352–368. 10.1002/prot.10543
- Tress M, Tai CH, Wang G, Ezkurdia I, Lopez G, Valencia A, Lee B, Dunbrack RL Jr: Domain definition and target classification for CASP6. Proteins 2005, 61(Suppl 7):8–18. 10.1002/prot.20717
- Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A: Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins 2005, 61(Suppl 7):27–45. 10.1002/prot.20720
- Wlodawer A, Minor W, Dauter Z, Jaskolski M: Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J 2008, 275: 1–21. 10.1111/j.1742-4658.2008.06444.x
- Kleywegt GJ, Jones TA: Homo crystallographicus--quo vadis? Structure 2002, 10: 465–472. 10.1016/S0969-2126(02)00743-8
- Brown EN, Ramaswamy S: Quality of protein crystal structures. Acta Crystallogr D Biol Crystallogr 2007, 63: 941–950. 10.1107/S0907444907033847
- Ilari A, Savino C: Protein structure determination by x-ray crystallography. Methods Mol Biol 2008, 452: 63–87.
- Browne WJ, North AC, Phillips DC, Brew K, Vanaman TC, Hill RL: A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol 1969, 42: 65–86. 10.1016/0022-2836(69)90487-2
- Acharya KR, Stuart DI, Walker NP, Lewis M, Phillips DC: Refined structure of baboon alpha-lactalbumin at 1.7 Å resolution. Comparison with C-type lysozyme. J Mol Biol 1989, 208: 99–127. 10.1016/0022-2836(89)90091-0
- Acharya KR, Stuart DI, Phillips DC, Scheraga HA: A critical evaluation of the predicted and X-ray structures of alpha-lactalbumin. J Protein Chem 1990, 9: 549–563. 10.1007/BF01025008
- Baker D, Sali A: Protein structure prediction and structural genomics. Science 2001, 294: 93–96. 10.1126/science.1065659
- Eswar N, John B, Mirkovic N, Fiser A, Ilyin VA, Pieper U, Stuart AC, Marti-Renom MA, Madhusudhan MS, Yerkovich B, Sali A: Tools for comparative protein structure modeling and analysis. Nucleic Acids Res 2003, 31: 3375–3380. 10.1093/nar/gkg543
- Koh IY, Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Eswar N, Grana O, Pazos F, Valencia A, Sali A, Rost B: EVA: Evaluation of protein structure prediction servers. Nucleic Acids Res 2003, 31: 3311–3315. 10.1093/nar/gkg619
- Brunger AT: Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 1992, 355: 472–475. 10.1038/355472a0
- Matthews BW: Five retracted structure reports: Inverted or incorrect? Protein Sci 2007, 16: 1013–1016. 10.1110/ps.072888607
- Hanson MA, Stevens RC: Retraction: Cocrystal structure of synaptobrevin-II bound to botulinum neurotoxin type B at 2.0 Å resolution. Nat Struct Mol Biol 2009, 16: 795. 10.1038/nsmb0709-795
- Kleywegt GJ: On vital aid: the why, what and how of validation. Acta Crystallogr D Biol Crystallogr 2009, 65: 134–139. 10.1107/S090744490900081X
- Yang H, Guranovic V, Dutta S, Feng Z, Berman HM, Westbrook JD: Automated and accurate deposition of structures solved by X-ray diffraction to the Protein Data Bank. Acta Crystallogr D Biol Crystallogr 2004, 60: 1833–1839. 10.1107/S0907444904019419
- Wuthrich K: Protein structure determination in solution by NMR spectroscopy. J Biol Chem 1990, 265: 22059–22062.
- Grzesiek S, Sass HJ: From biomolecular structure to functional understanding: new NMR developments narrow the gap. Curr Opin Struct Biol 2009, 19: 585–595. 10.1016/j.sbi.2009.07.015
- Wuthrich K: NMR studies of structure and function of biological macromolecules (Nobel Lecture). J Biomol NMR 2003, 27: 13–39. 10.1023/A:1024733922459
- Wuthrich K: NMR in biological research: peptides and proteins. North-Holland Publishing Co., Amsterdam; 1976.
- Saccenti E, Rosato A: The war of tools: how can NMR spectroscopists detect errors in their structures? J Biomol NMR 2008, 40: 251–261. 10.1007/s10858-008-9228-4
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993, 234: 779–815. 10.1006/jmbi.1993.1626
- Levitt M: Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 1992, 226: 507–533. 10.1016/0022-2836(92)90964-L
- Schwede T, Kopp J, Guex N, Peitsch MC: SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Res 2003, 31: 3381–3385. 10.1093/nar/gkg520
- Bates PA, Kelley LA, MacCallum RM, Sternberg MJ: Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins 2001, (Suppl 5):39–46. 10.1002/prot.1168
- Koehl P, Delarue M: A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modelling. Nat Struct Biol 1995, 2: 163–170. 10.1038/nsb0295-163
- Petrey D, Xiang Z, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, et al.: Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 2003, 53(Suppl 6):430–435. 10.1002/prot.10550
- Kopp J, Schwede T: The SWISS-MODEL Repository of annotated three-dimensional protein structure homology models. Nucleic Acids Res 2004, 32: D230–234. 10.1093/nar/gkh008
- Roessler CG, Hall BM, Anderson WJ, Ingram WM, Roberts SA, Montfort WR, Cordes MH: Transitive homology-guided structural studies lead to discovery of Cro proteins with 40% sequence identity but different folds. Proc Natl Acad Sci USA 2008, 105: 2343–2348. 10.1073/pnas.0711589105
- Alexander PA, He Y, Chen Y, Orban J, Bryan PN: The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci USA 2007, 104: 11963–11968. 10.1073/pnas.0700922104
- di Luccio E, Wilson DK: Comprehensive X-ray Structural Studies of the Quinolinate Phosphoribosyl Transferase (BNA6) from Saccharomyces cerevisiae. Biochemistry 2008, 47: 4039–4050. 10.1021/bi7020475
- Torda AE: Perspectives in protein-fold recognition. Curr Opin Struct Biol 1997, 7: 200–205. 10.1016/S0959-440X(97)80026-7
- Friedberg I, Jaroszewski L, Ye Y, Godzik A: The interplay of fold recognition and experimental structure determination in structural genomics. Curr Opin Struct Biol 2004, 14: 307–312. 10.1016/j.sbi.2004.04.005
- Buchete NV, Straub JE, Thirumalai D: Development of novel statistical potentials for protein fold recognition. Curr Opin Struct Biol 2004, 14: 225–232. 10.1016/j.sbi.2004.03.002
- Venclovas C: Comparative modeling in CASP5: progress is evident, but alignment errors remain a significant hindrance. Proteins 2003, 53(Suppl 6):380–388. 10.1002/prot.10591
- Dunbrack RL Jr: Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006, 16: 374–384. 10.1016/j.sbi.2006.05.006
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5: 113. 10.1186/1471-2105-5-113
- Subramanian AR, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol 2008, 3: 6. 10.1186/1748-7188-3-6
- Fidelis K, Stern PS, Bacon D, Moult J: Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng 1994, 7: 953–960. 10.1093/protein/7.8.953
- van Vlijmen HW, Karplus M: PDB-based protein loop prediction: parameters for selection and methods for optimization. J Mol Biol 1997, 267: 975–1001. 10.1006/jmbi.1996.0857
- Olson MA, Feig M, Brooks CL: Prediction of protein loop conformations using multiscale modeling methods with physical energy scoring functions. J Comput Chem 2008, 29: 820–831. 10.1002/jcc.20827
- Ponder JW, Richards FM: Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J Mol Biol 1987, 193: 775–791. 10.1016/0022-2836(87)90358-5
- Lovell SC, Word JM, Richardson JS, Richardson DC: The penultimate rotamer library. Proteins 2000, 40: 389–408. 10.1002/1097-0134(20000815)40:3<389::AID-PROT50>3.0.CO;2-2
- Dunbrack RL Jr, Karplus M: Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nat Struct Biol 1994, 1: 334–340. 10.1038/nsb0594-334
- Dunbrack RL Jr, Karplus M: Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol 1993, 230: 543–574. 10.1006/jmbi.1993.1170
- Vasquez M: Modeling side-chain conformation. Curr Opin Struct Biol 1996, 6: 217–221. 10.1016/S0959-440X(96)80077-7
- Koehl P, Delarue M: Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol 1994, 239: 249–275. 10.1006/jmbi.1994.1366
- Ohlendorf DH: Accuracy of refined protein structures. II. Comparison of four independently refined models of human interleukin 1beta. Acta Crystallogr D Biol Crystallogr 1994, 50: 808–812. 10.1107/S0907444994002659
- Moult J: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 2005, 15: 285–289. 10.1016/j.sbi.2005.05.011
- Koehl P, Levitt M: Structure-based conformational preferences of amino acids. Proc Natl Acad Sci USA 1999, 96: 12524–12529. 10.1073/pnas.96.22.12524
- Levitt M, Lifson S: Refinement of protein conformations using a macromolecular energy minimization procedure. J Mol Biol 1969, 46: 269–279. 10.1016/0022-2836(69)90421-5
- Misura KM, Chivian D, Rohl CA, Kim DE, Baker D: Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc Natl Acad Sci USA 2006, 103: 5361–5366. 10.1073/pnas.0509355103
- Zhu J, Fan H, Periole X, Honig B, Mark AE: Refining homology models by combining replica-exchange molecular dynamics and statistical potentials. Proteins 2008, 72: 1171–1188. 10.1002/prot.22005
- Summa CM, Levitt M: Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci USA 2007, 104: 3177–3182. 10.1073/pnas.0611593104
- Chopra G, Summa CM, Levitt M: Solvent dramatically affects protein structure refinement. Proc Natl Acad Sci USA 2008, 105: 20239–20244. 10.1073/pnas.0810818105
- Brunger AT: Free R value: cross-validation in crystallography. Methods Enzymol 1997, 277: 366–396.
- Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction-Round VII. Proteins 2007, 69(Suppl 8):3–9. 10.1002/prot.21767
- Sippl MJ: Knowledge-based potentials for proteins. Curr Opin Struct Biol 1995, 5: 229–235. 10.1016/0959-440X(95)80081-6
- Fang Q, Shortle D: A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins 2005, 60: 90–96. 10.1002/prot.20482
- Summa CM, Levitt M, Degrado WF: An atomic environment potential for use in protein structure prediction. J Mol Biol 2005, 352: 986–1001. 10.1016/j.jmb.2005.07.054View ArticlePubMedGoogle Scholar
- Berglund A, Head RD, Welsh EA, Marshall GR: ProVal: a protein-scoring function for the selection of native and near-native folds. Proteins 2004, 54: 289–302. 10.1002/prot.10523View ArticlePubMedGoogle Scholar
- Wallner B, Elofsson A: Can correct protein models be identified? Protein Sci 2003, 12: 1073–1086. 10.1110/ps.0236803PubMed CentralView ArticlePubMedGoogle Scholar
- Lovell SC, Davis IW, Arendall WB, de Bakker PI, Word JM, Prisant MG, Richardson JS, Richardson DC: Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins 2003, 50: 437–450. 10.1002/prot.10286View ArticlePubMedGoogle Scholar
- Laskowski RA, MacArthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 1993, 26: 283–291. 10.1107/S0021889892009944View ArticleGoogle Scholar
- Vriend G: WHAT IF: a molecular modeling and drug design program. J Mol Graph 1990, 8: 52–56, 29. 10.1016/0263-7855(90)80070-VView ArticlePubMedGoogle Scholar
- McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics 2000, 16: 404–405. 10.1093/bioinformatics/16.4.404View ArticlePubMedGoogle Scholar
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091View ArticlePubMedGoogle Scholar
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23: 566–579. 10.1002/prot.340230412View ArticlePubMedGoogle Scholar
- Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 2000, 16: 776–785. 10.1093/bioinformatics/16.9.776View ArticlePubMedGoogle Scholar
- Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R: Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 1998, 26: 320–322. 10.1093/nar/26.1.320PubMed CentralView ArticlePubMedGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755View ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al.: The Pfam protein families database. Nucleic Acids Res 2004, (32 Database):D138–141. 10.1093/nar/gkh121
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673
- Venclovas C, Zemla A, Fidelis K, Moult J: Assessment of progress over the CASP experiments. Proteins 2003, 53(Suppl 6):585–595. 10.1002/prot.10530
- Di Luccio E, Petschacher B, Voegtli J, Chou HT, Stahlberg H, Nidetzky B, Wilson DK: Structural and kinetic studies of induced fit in xylulose kinase from Escherichia coli. J Mol Biol 2007, 365: 783–798. 10.1016/j.jmb.2006.10.068
- Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 1997, 18: 2714–2723. 10.1002/elps.1150181505
- Kolinski A: Protein modeling and structure prediction with a reduced representation. Acta Biochim Pol 2004, 51: 349–371.
- Koehl P, Delarue M: Mean-field minimization methods for biological macromolecules. Curr Opin Struct Biol 1996, 6: 222–226. 10.1016/S0959-440X(96)80078-9
- Wiederstein M, Sippl MJ: ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res 2007, 35: W407–410. 10.1093/nar/gkm290
- Shen MY, Sali A: Statistical potential for assessment and prediction of protein structures. Protein Sci 2006, 15: 2507–2524. 10.1110/ps.062416606
- Benkert P, Tosatto SC, Schomburg D: QMEAN: A comprehensive scoring function for model quality assessment. Proteins 2008, 71: 261–277. 10.1002/prot.21715
- Paiva AC, Oliveira L, Horn F, Bywater RP, Vriend G: Modeling GPCRs. Ernst Schering Found Symp Proc 2006, 23–47.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.