Improved model quality assessment using ProQ2
© Ray et al; licensee BioMed Central Ltd. 2012
Received: 11 May 2012
Accepted: 7 September 2012
Published: 10 September 2012
Skip to main content
© Ray et al; licensee BioMed Central Ltd. 2012
Received: 11 May 2012
Accepted: 7 September 2012
Published: 10 September 2012
Employing methods to assess the quality of modeled protein structures is now standard practice in bioinformatics. In a broad sense, the techniques can be divided into methods relying on consensus prediction on the one hand, and single-model methods on the other. Consensus methods frequently perform very well when there is a clear consensus, but this is not always the case. In particular, they frequently fail in selecting the best possible model in the hard cases (lacking consensus) or in the easy cases where models are very similar. In contrast, single-model methods do not suffer from these drawbacks and could potentially be applied on any protein of interest to assess quality or as a scoring function for sampling-based refinement.
Here, we present a new single-model method, ProQ2, based on ideas from its predecessor, ProQ. ProQ2 is a model quality assessment algorithm that uses support vector machines to predict local as well as global quality of protein models. Improved performance is obtained by combining previously used features with updated structural and predicted features. The most important contribution can be attributed to the use of profile weighting of the residue specific features and the use features averaged over the whole model even though the prediction is still local.
ProQ2 is significantly better than its predecessors at detecting high quality models, improving the sum of Z-scores for the selected first-ranked models by 20% and 32% compared to the second-best single-model method in CASP8 and CASP9, respectively. The absolute quality assessment of the models at both local and global level is also improved. The Pearson’s correlation between the correct and local predicted score is improved from 0.59 to 0.70 on CASP8 and from 0.62 to 0.68 on CASP9; for global score to the correct GDT_TS from 0.75 to 0.80 and from 0.77 to 0.80 again compared to the second-best single methods in CASP8 and CASP9, respectively. ProQ2 is available at http://proq2.wallnerlab.org.
Modeling of protein structure is a central challenge in structural bioinformatics, and holds the promise not only to identify classes of structure, but also to provide detailed information about the specific structure and biological function of molecules. This is critically important to guide and understand experimental studies: It enables prediction of binding, simulation, and design for a huge set of proteins whose structures have not yet been determined experimentally (or cannot be obtained), and it is a central part of contemporary drug development.
The accuracy of protein structure prediction has increased tremendously over the last decade, and today it is frequently possible to build models with 2-3Å resolution even when there are only distantly related templates available. However, as protein structure prediction has matured and become common in applications, the biggest challenge is typically not the overall average accuracy of a prediction method, but rather how accurate a specific model of a specific protein is. –Is it worth spending months of additional human work, modeling and simulation time on this model? Ranking or scoring of models has long been used to select the best predictions in methods, but this challenge means there is also a direct need for absolute quality prediction, e.g. the probability of a certain region of the protein being within 3Å of a correct structure.
One of the most common prediction approaches in use today is to produce many alternative models, either from different alignments and templates [1–4] or by sampling different regions of the conformational space . Given this set of models, some kind of scoring function is then used to rank the different models based on their structural properties. Ideally, this scoring function should correlate perfectly with the distance from the native structure. In practice, while they have improved, ranking methods are still not able to consistently place the best models at the top. In fact, it is often the case that models of higher or even much higher quality than the one selected are already available in the set of predictions, but simply missed [6, 7]. In other words, many prediction methods are able to generate quite good models, but we are not yet able to identify them as such! In principle, there are three classes of functions to score protein models. The first of them is single-model methods that only use information from the actual model, such as evolutionary information [8–10], residue environment compatibility , statistical potentials from physics  or knowledge-based ones [13, 14], or combinations of different structural features [15–19]. The second class is consensus methods that primarily use consensus of multiple models  or template alignments  for a given sequence to pick the most probable model. Finally, there are also hybrid methods that combine the single-model and consensus approaches to achieve improved performance [21–24]. Of the above methods, it is only the single-model methods that can be used for conformational sampling and as a guide for refinement since they are strict functions of the atomic positions in the model. On the other hand, in terms of accuracy the consensus and hybrid methods outperform the single methods, in particular in benchmarks such as CASP  with access to many alternative models for all different targets. The success of the consensus methods in CASP has resulted in an almost complete lack of development of new true single-model methods. As a consequence only 5 out of 22 methods submitting predictions to both the global and local categories in the model quality assessment part of the latest CASP were actual true single-model methods . By true, we mean methods that can be used for conformational sampling and that do not use any template information in the scoring of models.
Scoring of models can be performed at different levels, either locally (i.e., per residue) or globally to reflect the overall properties of a model. Traditionally, the focus of most scoring functions has been to discriminate between globally incorrect and approximately correct models, which works reasonably well e.g. for picking the model that provides the best average structure for a complete protein (which is highly relevant e.g. for CASP). In contrast, only a handful of methods focus on predicting the individual correctness of different parts of a protein model [9, 11, 23, 26], but this is gradually changing with the introduction of a separate category for local quality assessment in CASP. In fact, we believe that local quality prediction might even be more useful than global prediction. First, it is relatively easy to produce a global score from the local, making global scoring a special case of the local one. Second, a local score can be used as a guide for further local improvement and refinement of a model. Third, even without refinement local quality estimates are useful for biologist as it provides confidence measures for different parts of protein models.
In this study, we present the development of the next generation of the ProQ quality prediction algorithm, and how we have been able to improve local quality prediction quite significantly through better use of evolutionary information and combination of locally and globally predicted structural features. ProQ was one of the first methods that utilized protein models, in contrast to native structures, to derive and combine different types of features that better recognize correct models . This was later extended to local prediction in ProQres  and to membrane proteins . We have reworked the method from scratch by using a support vector machine (SVM) for prediction, and it has been trained on a large set of structural models from CASP7. In addition to evolutionary data, there are several new model features used to improve performance significantly, e.g. predicted surface area and a much-improved description of predicted secondary structure. We also show that including features averaged over the entire model, e.g. overall agreement with secondary structure and predicted surfaces, improves local prediction performance too.
The aim of this study was to develop an improved version of ProQ that predict local as well as global structural model correctness. The main idea is to calculate scalar features from each protein model based on properties that can be derived from its sequence (e.g. conservation, predicted secondary structure, and exposure) or 3D coordinates (e.g. atom-atom contacts, residue–residue contacts, and secondary structure) and use these features to predict model correctness (see Methods for a description of the features new to this study). To achieve a localized prediction, the environment around each residue is described by calculating the features for a sliding window centered around the residue of interest. For features involving spatial contacts, residues or atoms outside the window that are in spatial proximity of those in the window are included as well. After the local prediction is performed, global prediction is achieved by summing the local predictions and normalize by the target sequence length to enable comparisons between proteins. Thus, the global score is a number in the range [0,1]. The local prediction is the local S-score, as defined in the Methods section.
From the earlier studies, we expect optimal performance by combining different types of input features [15, 17, 18]. To get an idea of which features contribute most to the performance, support vector machines (SVMs) were trained using five-fold cross–validation on individual input features as well as in combination of different feature types. After trying different SVM kernels (including linear, radial basis function and polynomial ones), we chose the linear kernel function for its performance, speed and simplicity.
Pearson’s correlation coefficient for different input features
Retrained ProQ (Base)
Residue + Profile Weighting
Surface + Profile Weighting
Base + Global Surface Area Prediction
Base + Global Secondary Structure Pred.
Base + Profile Weighting
Base + Local Surface Area Prediction
Base + Local Secondary Structure Pred.
Base + Information per position (Conservation)
All Combined (ProQ2)
The largest performance increase in local prediction accuracy is actually obtained by including global features describing the agreement with predicted and actual secondary structure, and predicted and actual residue surface area calculated as an average over the whole model. Even though these features are not providing any localized information, they increase the correlation between local predicted and true quality significantly over the baseline (+0.10 to 0.65). The performance increase is about the same for predicted secondary structure and predicted surface area. The use of global features, i.e. features calculated over the whole model to predict local quality, is not as strange as it first might seem. The global features reveal whether the model is overall accurate, an obvious prerequisite for the local quality to be accurate. For instance, from the local perspective a model might appear correct, i.e. favorable interactions and good local agreement with secondary structure prediction, but a worse global agreement could affect the the accuracy in the first region too. Both predicted secondary structure and predicted surface area are also among the local features that result in a slight performance increase (+0.03 to 0.58).
The second-largest performance increase is obtained by profile weighting (+0.07 to 0.62). This is actually not a new feature, but rather a re-weighting of the residue-residue contact and surface area features used in the original version of ProQ, which is here based on multiple sequence alignment of homologous sequences. This re-weighting improves the performance of residue-residue contacts and surface area based predictors to equal degree (Table 1).
Finally, a small increase is also observed by adding the information per position from the PSSM, a measure of local sequence conservation. This is despite the fact that this type of information in principle should have been captured by the feature describing correspondence between calculated surface exposure and the one predicted from sequence conservation.
It has been shown many times, both in CASP [25, 28, 29] and elsewhere [1, 2, 21], that consensus methods are superior MQAPs compared to stand-alone or single methods not using consensus or template information, at least in terms of correlation. However, a major drawback with consensus methods is that they perform optimally in the fold recognition regime, but tend to do worse in free modeling where consensus is lacking or for easy comparative modeling targets, where consensus is all there is. Even though the correlation can be quite high, they often fail in selecting the best possible model.
Overall for CASP7 targets, the combination selects models that are of 1.4% and 1.8% higher quality compared to ProQ2 and Pcons respectively, while maintaining a good correlation. The bootstrap support values calculated according to , with repeated random selection with return, are higher than 0.95, which demonstrates that GDT1 for the combination is higher in more than 95% of the cases.
Description of the methods included in the benchmark
Support Vector Machine trained to predict S-score
Potential of mean force, top-ranked single MQAP in CASP8 and CASP9 
Neural network trained on the output from primary MQAPs 
Neural network trained on CA-CA interactions 
Correlates conservation and solvent accessibility, only global 
Top-ranked single MQAP in CASP8, only global .
QMEAN-weighted GDT_TS averaging, top-ranked consensus method MQAP in CASP8 and CASP9 .
Linear combination of ProQ2 and Pcons scores, 0.2ProQ2+0.8Pcons
Local model quality benchmark on the CASP8/CASP9 data sets
〈R target 〉
〈R model 〉
Statistical significance test for local quality prediction
In all of the above tests the performance of the reference consensus method, QMEANclust, is of course better than any of the single-model methods, but the performance gap is now significantly smaller.
Benchmark of global model quality
Statistical significance test for global quality prediction
The model selection as measured by the sum of GDT_TS for the first ranked model is also significantly improved, from 74.0/44.7 for the best single-model methods in CASP8 (MULTICOM-CMFR) and CASP9 (QMEAN) to 75.2/47.0 for ProQ2 on the two data sets. To put the numbers in perspective, the performance is now closer to the selection of the reference consensus method QMEANclust (75.8/48.6) than to the selection of the previous best performing single-model methods. In addition, when ProQ2 is combined with our consensus method Pcons, the model selection is improved further to 76.9/48.7. This performance is similar to the best performing consensus methods and clearly better than Pcons alone (75.9/48.3), which demonstrates the added value of ProQ2.
ProQ2 has lower number of predictions below average (Z<0) and larger number of predictions with Z-score greater than 2 compared to the other single-model methods. Only for a few targets (3/203 on CASP8 and CASP9) does the model selected by ProQ2 have a Z<-0.5, while both QMEAN and ProQ have more than 15 models with Z-scores in that range. This is in sharp contrast to the consensus methods that never select models with Z<-0.5 and on CASP9 none of the consensus methods select models with Z<0, demonstrating that consensus methods seldom select models worse than average. This is one of the clear advantages over non-consensus methods. However, at the other end of the spectrum, the ability of the consensus method to select models with high Z-score is quite far from optimal, as demonstrated by the Z-score distribution for Pcons and QMEANclust on the CASP8 data set in Figure 4A. In fact, all single-models methods select more models with Z>2 than either Pcons or QMEANclust, indicating that combining the results from single methods with consensus could potentially improve the results at the high end. This is also exactly what is observed for ProQ2+Pcons, were the number of models with Z>2 increases significantly from 8 for Pcons to 15 for ProQ2+Pcons on the combined CASP8/CASP9 set (bootstrap value:>0.92).
The aim of this study was to improve both local and global single-model quality assessment. This was done by training support vector machines to predict the local quality measure, S-score, with a combination of evolutionary and multiple sequence alignment information combined with other structural features on data from CASP7. The final version, ProQ2, was compared to the top-performing single-model quality assessment groups in CASP9 as well as to its predecessor, ProQ, on the complete CASP9 data set.
We show that ProQ2 is superior to ProQ in all aspects. The correlation between predicted and true quality is improved from 0.52/0.49 (CASP8/CASP9) to 0.70/0.68 on the local level and from 0.67/0.68 to 0.80/0.80 on the global level. In addition, the selection of high-quality models is also dramatically improved. ProQ2 is significantly better than the top-performing single-model quality assessment groups in both CASP8 and CASP9. The improvement in correlation is larger on the local level, but still significant on the global level. The largest improvement is however in the selection of high-quality models with a sum of Z-score improvement from 83.7/52.1 on CASP8/CASP9 for the second-ranked single model to 100.4/68.6 for ProQ2. Finally, we also show that ProQ2 combined with the consensus predictor Pcons can improve the selection even further.
The performance from any machine learning approach is ultimately limited by the quality of the underlying data used for training and testing. For the development of ProQ2 we used data from CASP7  for training and data from CASP8  and CASP9  for testing and benchmarking. Each residue in each protein model generates one input item, so using the complete CASP7 data set would have resulted in far too much training data. Instead, we selected ten representative models from each target randomly, resulting a final set of 163,934 residues from 874 models of 102 target structures.
The testing and benchmarking data from CASP8 and CASP9 included all models and the top-ranked model quality assessment programs from the MQAP category in CASP. ProQ2 participated officially in CASP9 so those predictions were truly on unseen data. To make the present comparison fair, we had to filter out all models that did not have a prediction for all targets included in the benchmark. This resulted in a final benchmarking set consisting of 3,620,156 residues from 18,693 models from 122 targets for CASP8 and 4,676,538 residues from 23,635 models from 81 targets and 21,589 models from 81 targets for the local and global quality benchmark for CASP9, respectively. The reason for the different number of models is that we wanted to include as many global single-model methods as possible (there were fewer methods making local predictions), thereby making the union of a common set smaller than for the local.
SVM training was performed using five-fold cross-validation on the CASP7 data set. The SVM light  V6.01 implementation of support vector machine regression was used with a linear kernel function (other kernels were tried but showed no increased performance). The trade-off between training error and margin was optimized (the -c parameter) and the epsilon for the width-of-loss function in regression tube was optimized for all cross-validation sets at the same time.
Support vector machines were trained using structural features describing the local environment around each residue in the protein models combined with other features predicted from sequence such as secondary structure, surface exposure and conservation.
Combinations of the following features were used: atom–atom and residue–residue contacts, surface accessibility, predicted secondary structure, predicted surface exposure, and evolutionary information calculated over a sequence window. Many of the features, in particular the structural ones, are similar to those used in our earlier studies [9, 15]. For consistency we include a short description of them here.
This features describes the distribution of atom–atom contacts in the protein model. Atoms were grouped into 13 different atom types based on chemical properties, see Wallner et al. (2003). Two atoms were defined to be in contact if the distance between them was within 4Å. The 4Å cutoff was chosen by trying different cutoffs in the range 3Å–7Å. Contacts between atoms from positions adjacent in sequence were ignored. Finally, the number of contacts from each group was normalized by dividing with the total number of contacts within the window.
This feature describes the distribution of residue–residue contacts. Residues were grouped in six different groups: (1) Arg, Lys; (2) Asp, Glu; (3) His, Phe, Trp, Tyr; (4) Asn, Gln, Ser, Thr; (5) Ala, Ile, Leu, Met, Val, Cys; (6) Gly, Pro [15, 33]. Two residues were defined to be in contact if the distance between the Cα–atoms or any of the atoms belonging to the side chain of the two residues were within 6Å and if the residues were more than five residues apart in sequence. Many different cutoffs in the range 3Å–12Å were tested and 6Å showed the best performance. Finally, the number of contacts for each residue group was normalized with the total number of contacts within the window.
This features describes the exposure distribution for the same residue grouping as used for the residue–residue contacts. The surface accessibility was calculated using NACCESS . The relative exposure of the side chains for each residue group was used. The exposure data was grouped into one of the four groups <25%, 25%–50%, 50%–75% and >75% exposed and finally normalized by the number of residues within the window.
The predicted probability from PSIPRED for the secondary structure of the central residue in the sequence window.
Correspondence between predicted and actual secondary structure over a 21-residue window.
Secondary structure assigned by STRIDE, binary encoded into three classes over a 5-residue window.
We also tried secondary structure content for the three classes over a sequence window but that did not show any improved prediction performance.
Correspondence between predicted and actual burial/exposure class over a 21-residue window.
Actual surface area over a 13-residue window. This feature complements the surface features that describe the exposure pattern for different residues used earlier.
This describes the evolutionary history of a given sequence window. Sequence profiles were derived using three iterations of PSI-BLAST  against UniRef90, release 2010_3  with a 10−3 E–value cutoff for inclusion (-h) and all other parameters at default settings. The sequence profile output from PSI-BLAST contains both the position-specific scoring matrix (PSSM) and the ’information per position’ (IPP), a position-specific measure of conservation calculated from the PSSM. The PSSM was used for weighting of the structural features (see “Profile weighting” section below). The IPP was used directly as an input feature with an optimized window size of 3.
This is actually not an individual feature but rather a re-weighting of all residue-based features, i.e the residue-residue contacts and residue-specific exposure patterns, according to the occurrence in the sequence profile. For instance, if a position in the sequence profile contains 40% alanine and 60% serine, contacts to this position are weighted by 40% as contacts alanine and by 60% as contacts to serine. This effectively increases the amount of training examples and should also make the final predictor less sensitive to small sequence changes, since data is extracted from multiple sequence alignments among homologous sequences.
Global correspondence between predicted and actual secondary structure.
Global correspondence between predicted and actual residue burial/exposure.
where R is Pearson’s correlation and z a normally distributed variable with variance s 2=1/(n−3), with n being the number of observations. Two correlation coefficients R 1 and R 2 can be converted into the corresponding z 1and z 2. The P-value associated with |z 1−z 2| calculated from the normal distribution is an estimate of the likelihood that the difference between R 1 and R 2 is significant. Fisher’s transformation is used both for P-value estimates, as well as for constructing confidence intervals.
For benchmarking global quality, an additional measure called GDT1 was also used. This measure is simply the sum of the GDT_TS score for the highest-ranked model by the different methods for each target, and it is commonly used in CASP. In contrast to the correlation coefficient this measure only considers the model selection and not whether the actual predictions are in good agreement. The GDT1 was also converted into a Z-score for each target by subtracting the average GDT1 and dividing by the standard deviation.
This work was supported by grants 2010-5107 and 2010-491 from the Swedish Research Council and the Swedish Foundation for Strategic Research to EL, and from the Carl Trygger Foundation as well as a Marie Curie Fellowship to BW. Supercomputing Resources were provided by grant 020/11-40 from the Swedish National Infrastructure for Computing to EL.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.