MetaMQAP: A meta-server for the quality assessment of protein models
© Pawlowski et al. 2008
Received: 18 March 2008
Accepted: 29 September 2008
Published: 29 September 2008
Skip to main content
© Pawlowski et al. 2008
Received: 18 March 2008
Accepted: 29 September 2008
Published: 29 September 2008
Computational models of protein structure are usually inaccurate and exhibit significant deviations from the true structure. The utility of models depends on the degree of these deviations. A number of predictive methods have been developed to discriminate between the globally incorrect and approximately correct models. However, only a few methods predict correctness of different parts of computational models. Several Model Quality Assessment Programs (MQAPs) have been developed to detect local inaccuracies in unrefined crystallographic models, but it is not known if they are useful for computational models, which usually exhibit different and much more severe errors.
The ability to identify local errors in models was tested for eight MQAPs: VERIFY3D, PROSA, BALA, ANOLEA, PROVE, TUNE, REFINER, PROQRES on 8251 models from the CASP-5 and CASP-6 experiments, by calculating the Spearman's rank correlation coefficients between per-residue scores of these methods and local deviations between C-alpha atoms in the models vs. experimental structures. As a reference, we calculated the value of correlation between the local deviations and trivial features that can be calculated for each residue directly from the models, i.e. solvent accessibility, depth in the structure, and the number of local and non-local neighbours. We found that absolute correlations of scores returned by the MQAPs and local deviations were poor for all methods. In addition, scores of PROQRES and several other MQAPs strongly correlate with 'trivial' features. Therefore, we developed MetaMQAP, a meta-predictor based on a multivariate regression model, which uses scores of the above-mentioned methods, but in which trivial parameters are controlled. MetaMQAP predicts the absolute deviation (in Ångströms) of individual C-alpha atoms between the model and the unknown true structure as well as global deviations (expressed as root mean square deviation and GDT_TS scores). Local model accuracy predicted by MetaMQAP shows an impressive correlation coefficient of 0.7 with true deviations from native structures, a significant improvement over all constituent primary MQAP scores. The global MetaMQAP score is correlated with model GDT_TS on the level of 0.89.
Finally, we compared our method with the MQAPs that scored best in the 7th edition of CASP, using CASP7 server models (not included in the MetaMQAP training set) as the test data. In our benchmark, MetaMQAP is outperformed only by PCONS6 and method QA_556 – methods that require comparison of multiple alternative models and score each of them depending on its similarity to other models. MetaMQAP is however the best among methods capable of evaluating just single models.
We implemented the MetaMQAP as a web server available for free use by all academic users at the URL https://genesilico.pl/toolkit/
Evaluation of model accuracy is an essential step in protein structure prediction. The existing methods for quality assessment of protein models (MQAPs) are usually based either on a physical effective energy which can be obtained from fundamental analysis of particle forces or on an empirical pseudo energy derived from known protein structures (review: . So far, most of the development of MQAPs was focused on the global evaluation of protein structure and most of the existing methods were optimized to discriminate between globally correct and incorrect 'decoy' structures rather than to detect correct and incorrect fragments [2, 3]. Even for MQAPs that are capable of generating independent evaluations for each amino acid in the protein structure, it is usually recommended that a score is averaged over a long stretch of residues (e.g. 21 amino acids in the case of VERIFY3D ). Systematic assessment experiments, e.g. Critical Assessment of techniques for protein Structure Prediction (CASP) and LiveBench demonstrated that models with a correct fold can be confidently recognized, especially by the fold-recognition meta-servers [5, 6]. However, comparative models, especially those based on remotely related templates, often exhibit local inaccuracies that are difficult to identify by a global evaluation, in particular misthreadings of short regions (5–10 residues) corresponding to shifted alignments within individual secondary structure elements [7, 8].
In CASP5, we proposed that inaccuracies due to local alignment shifts can be identified and corrected by identification of variable conformations in alternative homology models, comparison of their VERIFY3D scores averaged over only 5 neighbouring residues, and construction of hybrid models comprising the best-scoring fragments . Our method (termed the "FRankenstein's monster approach") turned out to consistently produce very accurate models, especially if regions with initially poor scores were systematically varied to generate additional models for evaluation . However, detailed inspection of cases where we failed to identify the most native-like local conformation based on the VERIFY3D score revealed a considerable variation of scores even among models with similar structural features. Therefore, we decided to carry out a systematic evaluation of the capability of VERIFY3D and several other popular MQAPs, including the recently published method PROQRES , to identify the best method for prediction of local accuracy of protein models. However, as the work progressed, we realized that none of the MQAPs we analyzed was sufficiently accurate and robust and that they exhibited very different strengths, and weaknesses. This in turn prompted us to develop a new "meta-predictor" specifically optimized to detect local errors.
We downloaded all models generated within the framework of the Critical Assessment of techniques for protein Structure Prediction (CASP) rounds 5 and 6, for cases classified as 'template-based modeling', i.e. 'comparative modeling' and 'fold recognition' [12, 13]. In these cases a large fraction of models have a correct fold and exhibit widely varying degree of global and local similarity to the native structure, with some completely wrong models (of incorrect folds). To create the model database we used only models that covered at least 90% residues of the target sequence and did not exhibit any internal deletions (i.e. missing residues were allowed only at the termini). If the CASP target was a multidomain protein, it was split into individual domains, which were then regarded as separate models. Ultimately, we collected 8251 models for 84 CASP5&6 targets. Then these models where superimposed onto their experimentally solved counterparts using LGA , routinely used in CASP assessment. For our datasets, the average root mean square deviation (RMSD) on C-α atoms between the models and the templates is 2.00 Å and the average GDT_TS score is 59.
Many of CASP models are 'non-physical' in the sense that they often exhibit steric clashes, non-standard bond lengths and angles, improper stereochemistry or they lack parts of residues (e.g. residues may be reduced to just C-α atoms). Thus, we 'idealized' our CASP5&6 model dataset to minimize the most severe local errors by simply running MODELLER  with default options, using the original model as a template to derive spatial restraints to build a refined full-atom model. We want to emphasize that such procedure can lead to false positives in the case of bad regions of a model and false negatives in the case of excellent refined models.
The aim of our analysis was to develop a method that would be able to accurately estimate the deviation of C-α atoms with respect to the corresponding residues in the native structures without any knowledge of the native structure. Despite we introduced 'idealization', we intended to make predictions for the original models. Thus, we trained MetaMQAPII with the deviations between original models (not the idealized models) to the native structures, even though the other component of training was the MQAP score for the idealized models.
In the last part of this article we compare MetaMQAP with CASP7 winners in the MQAP category. Thus, from the CASP7 website we downloaded both the server models and Quality Assessment predictions done by winners of the MQAP category. The accuracy of server models calculated by the LGA method was taken from the CASP7 website . CASP7 server models have been processed in the same way as the training CASP5&6 data, e.g. they have been 'idealized' with MODELLER and scored with MetaMQAP. As with the CASP5&6+ dataset, deviations between the models and the experimental structures were calculated for models before 'idealization'.
We applied wide range of statistical tools: such as Pearson and Spearman's rang correlation, ROC curve analysis, t-test, multivariable regression, and cluster analysis. All statistical analyses were done using STATISTICA 7 software (StatSoft, Inc. Tulsa, OK, USA).
For the evaluation of protein models from the training dataset and for the development of the MetaMQAP we used 8 primary MQAP methods: VERIFY3D , PROSA2003 , PROVE , ANOLEA , BALA-SNAPP , TUNE , REFINER , and PROQRES . VERIFY 3D evaluates the environment of each residue in a model with respect to the expected environment as found in the high resolution X-ray structures. It operates on a '3D-1D profile' of a protein structure, which includes the statistical preferences for the following criteria: the area of the residue that is buried, the fraction of side-chain area that is covered by polar atoms (oxygen and nitrogen), and the local secondary structure [4, 23]. In our own experience, VERIFY3D is rather permissive (i.e. detects only relatively major errors, usually related to unusual contacts resulting from misalignments, e.g. burial of charged groups in a hydrophobic core. On the other hand, VERIFY3D often fails to detect errors such as non-physical bond lengths or angles or some steric clashes (e.g. threading of a distorted aliphatic side chain through a distorted aromatic ring could be regarded as 'protein-like' by this method). PROSA 2003 relies on empirical energy potentials derived from pairwise interactions observed in high-resolution protein structures . In our own experience, PROSA 2003 is very strict compared to VERIFY3D, i.e. it often detects even very minor errors, such as distorted geometry of hydrogen-bonded residues, and therefore may be more useful for the evaluation of nearly-native homology models than for the fold-recognition models that are plagued by local errors. ANOLEA is also based on a distance-dependent empirical potential. It evaluates the non-local environment (NLE) of each heavy atom in the model. The NLE is defined as the set of all heavy atoms within the distance of 7 Å that belong to amino acids farther than 11 residues in the analyzed polypeptide. Owing to the focus on non-local contacts, ANOLEA is able to identify some errors that remain undetected both by VERIFY3D and PROSA . PROVE analyzes the packing in protein models by evaluating the regularity of the atom volume, defined by the atom's radius and the planes separating it from other atoms . BALA-SNAPP evaluates the structure by means of a four-body statistical potential, applied to tetrahedral quadruplets or spatially neighbouring residues . TUNE uses a neural network to predict local quality of residue from both a local and non local contact of residues in the model . REFINER is based on a statistical potential, which includes terms such as: contacts potential, long distance potential, hydrogen bonds and burial pseudo energy . Finally, PROQRES is the only method in this set, which has been developed specifically to predict local errors in crude protein models. This method applies a neural network to estimate local structure from: atom-atom contacts, residue-residue contacts, secondary structure context, and solvent accessibility .
In the final comparison, we analyzed the results of "blind" assessment done for the CASP7 dataset by 6 methods: QA556 – LEE (unpublished), QA704 – QA-ModFOLD [25, 26], a method based on the nFOLD protocol , QA633 – PROQ, QA692 – ProQlocal , QA634 – PCONS6 a new variant of PCONS , QA713 – CircleQA (for more information see CASP7 abstracts website )
The evaluation of the capability of MQAPs to predict the local accuracy of protein models requires a carefully prepared dataset. We aimed to identify the most native-like segments in a set of high quality models, in particular those generated by comparative modeling and fold-recognition methods. Therefore, rather than analyzing popular sets of decoys with a clear majority of globally incorrect versions of various protein structures [2, 30], we decided to use models of all CASP5&6 targets in the CM and FR(H) categories (corresponding to the 'template-based modeling' category in CASP7). We used only models that covered at least 90% residues of the experimentally solved structure and exhibited no missing internal residues (i.e. deletions were allowed only for the termini) (see Methods). Models from the CM and FR(H) categories are usually based on templates with the same fold as the native structure, and the major reasons of their deviation from the native structure are alignment shifts (misthreading) and/or structural divergence between the target and the used template. These models are 'relatively good' only with respect to the correct position of the backbone atoms in the protein core, but they often contain various errors, such as steric clashes between the side chains, missing side chains, unmodeled residues corresponding to insertions or terminal extensions, and discontinuities in the place of deletions. Such models may be considered native-like in terms of C-α atoms, but non-physical in details. Unfortunately, most MQAP methods were optimized for the structures of crystallographic quality, and all 'non-physical' details contribute to their scores in unpredictable ways – either as very serious errors (e.g. steric clashes in ANOLEA) or as artificially positive elements (e.g. some clashes in VERIFY3D). In addition, CASP models are generated by different modeling protocols which exhibit various peculiarities with respect to inclusion or omission of atoms. Variants include C-α atoms, backbone and C-β atoms, all heavy atoms, or all atoms including hydrogens, or different combinations of the above (i.e. in one model some residues may be complete and others may lack different types of atoms). Obviously, it is very difficult to compare the accuracy of residues modeled at such a different level of precision, even if the aim is to assess the accuracy of C-α coordinates only. Moreover, most MQAPs require complete models, without chain breaks or missing atoms, but often also without any hydrogen atoms. Our CASP7 results (see below) clearly demonstrate that utilization of 'crude' CASP models leads to decreased performance of MQAPs, compared to the 'idealized' variants of the same models.
Taking into account the above-mentioned considerations, we constructed an 'idealized' set of models (hereafter referred to as the CASP5&6+ dataset) using MODELLER , which minimizes the violation of stereo-chemical constraints as well as restraints derived from the template and yields the canonical set of atoms for each residue. The restraints were derived from the original CASP models, instead of templates. This procedure reconstructed a heavy-atom representation for all residues except the omitted terminal residues and optimized the bond lengths, angles and packing. On the other hand, the backbone structure of such 'idealized' models maintained the conformation nearly identical to the starting structures (see Methods). We envisage only one situation that can lead to a false significant improvement in an idealized model: If the original model contains big gaps (>>3.6Å) between C-α atoms of adjacent residues (obviously an error, as this should not occur in real proteins), MODELLER will attempt to seal the gap, causing conformational change in the neighboring regions and bringing it closer to what may be a native-like conformation (if the flanking regions have correct conformation). However, such cases are quite rare in practice.
The idealized models were used only to compute MQAP scores, while the deviation between the modeled and experimentally observed positions of C-α atoms (used as a measure of the local quality of the model) was calculated for the original, unmodified models. Additional file 1 shows the distribution of residue deviation in our set of 'original' models.
All models in the CASP5&6+ dataset were evaluated with 8 popular MQAP methods that we found to be available for download and local installation: VERIFY3D, PROSA2003, PROVE, ANOLEA, BALA-SNAPP, TUNE, REFINER, PROQRES (see Methods). VERIFY3D, ANOLEA and REFINER report series of "raw" scores for individual residues. PROSA reports the composite score and its two components. ANOLEA and BALA report an additional score corresponding to the number of contacts/neighbours of each residue. TUNE and PROQRES report only a single score for each residue. We also analyzed the correlation between residue deviation with local residue features such as: solvent accessibility calculated using NACCESS , residue depth calculated using MSMS  as well as with the agreement between secondary structure predicted with PSI-PRED  and calculated from the model using DSSP . In addition, we studied the accuracy of a trivial score (calculated directly from the model), based on residue depth in the structure size of the protein and type of an amino acid. TrivialScore divides the residues into 2000 classes based on the 10 bins of model size (number of residues in the model), 10 bins of residue depth in the structure (ResDepth) and 20 bins defined by each amino acid. The predicted TrivialScore values directly correspond to the average residue deviation of residues grouped in a given class. TrivialScore should be regarded as a baseline MQAP, which predicts that on the average residues in the protein core are modeled well, and residues on the surface are modeled poorly.
According to our benchmark (Figure 2), the scores reported by PROQRES exhibit best correlation with the real residue deviation (R = -0.50). This result was expected, since PROQRES is the only method in our test set that has been developed specifically to predict local quality of theoretical models, such as those from CASP. However, our analysis also revealed that PROQRES score performs only slightly better than the global model accuracy (PROQ score – which is identical for all residues in a given model). Apart from PROQ and PROQRES, the correlations of primary MQAP scores with the local accuracy of the model appear very poor – only a few scores exhibit a correlation coefficient above 0.25. The scores that correlate best with the local model accuracy, are BALA, VERIFY3D score averaged in a window of 5 residues (VERIFY3Dw5), ANOLEAw5, and REFINER. It is noteworthy that the 'smoothened' VERIFY3Dw5 score is a much better predictor of local model accuracy than the corresponding "raw" score (VERIFY3D). The same is observed for the ANOLEA and its 'smoothened' variant ANOLEAw5. Another interesting observation is that the composite score reported by PROSA (PROSA) comprises a relatively well-performing component score describing atom-atom interactions (PROSApair) and a much poorer component score describing atom-solvent interactions (PROSAsurf). We also observed that scores reported by different methods are poorly correlated with each other, which provides a stimulus to develop a method that combines strengths of different methods and eliminates the individual weaknesses.
Summarizing, we observed that the local quality of the model can be predicted to a large extent by consideration of trivial features such as amino acid accessibility and depth in the structure. In other words, as a rule of a thumb, residues buried in the core are on the average predicted more accurately than residues on the surface. Application of sophisticated algorithms implemented in MQAPs provides only a minor added value. To our knowledge this important fact has escaped the attention of developers and users of MQAP. Nonetheless, different MQAP scores are surprisingly poorly correlated with each other, which suggest that these added values might be different in each case. Thus, there is a hope that combination of different MQAPs together with informed analysis of trivial features may provide a much stronger predictor of local model quality than any of these approaches alone.
Combination of different sub-optimal predictive methods into 'meta-predictors' has been very successful in structural bioinformatics, in particular in protein and RNA secondary structure prediction [35, 36], membrane protein prediction , protein fold-recognition , and identification of protein domains . Although sometimes even simple averaging shows improvement over the results obtained by the primary methods, there are machine learning techniques that allow more intelligent combination of the best features that different methods have to offer. Here, we used a multivariate linear regression model to develop an MQAP meta-predictor (MetaMQAP). We used the residue deviation as the dependent variable, while MQAPs and residue features were used as predictors.
Description of scores returned by the primary MQAP methods as well as other local features analyzed in this work.
3D-1D profile score for a single residue
VERIFY3D score averaged over a 5 residue window
pair energy (atom-atom interactions)
surface energy (atom-solvent interactions)
combination of PROSApair and PROSAsurf
distance – dependent empirical potential. It evaluates the non-local environment (NLE) of each heavy atom in the model
ANOLEA score averaged over a 5 residue window
average relative volume for all atoms of a residue
mean of a four-body statistical potential, applied to tetrahedral quadruplets or spatially neighbouring residues
pseudoenergy of local contacts
pseudoenergy of long-distance contacts
pseudoenergy of H-H bond interaction
pseudoenergy of burial
weighted sum of all REFINER pseudoenergies
score based on neural network that predict local quality of residue from both a local and non local contact of residues in the mode
fraction of non-polar residues in area of given residue (ENVIRONMENT)
burial of the residue (ENVIRONMENT)
number of residues within the distance of 10 Å in space and within 8 residues in the sequence
number of residues within the distance of 10 Å in space and more distant than 8 residues in the sequence.
the distance between the C-α atom of a residue and the closest geometrically plausible position of a water molecule on the surface of the protein
It is a neural network based predictor that based on a number of structural features predicts the quality of a protein model. ProQ is optimized to find correct models in contrast to other methods which are optimized to find native structures
score based on neural network which estimate local structure from: atom-atom contacts, residue-residue contacts, secondary structure context, and solvent accessibility
absolute value of solvent accessibility for all atom off a residue (according to NACCESS)
proportion of absolute solvent accessibility of a given residue to the solvent accessibility of the same type of residue (X) in a model tripeptyde Ala-X-Ala (according to NACCESS)
probability of a loop conformation in secondary structure predicted by PSIPRED
probability of a helical conformation in secondary structure predicted by PSIPRED
probability of an extended conformation in secondary structure predicted by PSIPRED
agreement between secondary structure predicted by PSIPRED and secondary structure observed in the model (calculated by DSSP)
It must be emphasized that the linear regression statistical models described above can predict rank (1–100) of deviation instead of the deviation in Ångströms, however the rank can be easily transformed into distance. The predicted rank of deviation were assigned to residue deviation according to average residue deviation of residues of the same rang as predicted in training data
In order to train and test the MetaMQAP the dataset of all CASP5&6+ targets was randomly divided into training data (70 targets – 7032 models total) and test data (14 targets – 1219 models total). The training data was used to train the MetaMQAP method (i.e. to calculate the regression models). The test data (called CASP5&6+ test) was used to evaluate MetaMQAP and compare it with the 'primary' MQAPs.
Local (per residue) deviation for the best and worst residues according to different MQAPs.
10 % highest quality residues
10 % lowest quality residues
Area under the ROC curve
The advantage of MetaMQAP is even greater in a context of standard deviation (Table 2). For MetaMQAP, the 10% top-scoring residues exhibit deviation of 1.13 Å +/- std.dev. 0.82, while for PROQRES, its 10% top-scoring residues exhibit deviation of 1.42 Å +- std.dev. 2.08 Å. In comparison, the 10 % truly best residues exhibit deviation of 0.44 Å +- std.dev.0.25.Å, which indicates that MetaMQAP really pushes the prediction close to the limit. These data clearly show, that MetaMQAP is more confident in detecting residues of highest quality than PROQRES or any other method. A similar situation is observed for the residues predicted to be of the lowest quality. The mean deviation of the residues marked as 10% worst by MetaMQAP is 20.39 Å and is significantly higher than mean deviation of 10 % worst residues selected by each of remaining methods, with a deviation of PROQRES (t-test, p < 0.05) at the level of 16.45 Å. It is noteworthy that 10% truly worst residues have avg. deviation of 31.3 Å, which shows a room for further improvement even in the case of our method (avg. deviation 20.39 Å). In addition we also present the ROC curve analysis (cutoff 3Å) which confirms the performance of MetaMQAP and overall ranking of other MQAPs.
The study was made for all residues in CASP5&6+ test dataset as well as separately for a few group of residues whose variants in our data set exhibit mean deviations 0–2Å, 2–4Å, 4–8 Å and > 8 Å. Intuitively, it should be easier to rank '4–8Å' sets of residues that include both very accurate and very inaccurate variants than sets '0–2Å' or '>8Å' of residues. In the 0–2 Å range, most variants have similar deviations from the native structure in all models and are typically all very close to the native structure. On the other hand, the set of residues with deviation at least 8 Å usually come from models with globally low quality, so their relative local quality becomes very difficult to predict by all MQAPs. Interestingly, PROSA estimates better rank of variant accuracy than PROSApair in spite of the fact that PROSApair is better correlated with absolute residue deviation than PROSA (see Figure 2). Another useful information for modelers is that PROSA has the same accuracy as PROQRES for the most accurately modeled variants (in the 0–2 Å range). Despite the fact that MetaMQAP was trained to predict absolute residue deviation, here again it outperforms all other MQAPs in ranking different variants of a given residue. The advantage of MetaMQAP over any another method is statistically significant in the case of each studied group of residues (t-test, p < 0.05)
A prototype of MetaMQAP was one of the methods participating in the Model Quality Assessment contest held within the "7th Critical Assessment of protein Structure Prediction methods" (CASP7). The models submitted by all servers in tertiary structure prediction category were scored by all participating MQAP's, and the assessments were compared with actual model accuracy. Predicted GDT_TS was taken as the global reference score, while single residue deviations from the native structure were taken as the local reference scores .
Almost all CASP7 benchmarks have shown that our prototype of MetaMQAP exhibited rather poor accuracy. The phenomenon was observed both for global and local model accuracy. We conclude that it was caused mainly because of two reasons:
r – correlation coefficient, n – number of scored models
The equations show that the Z-score is proportional to the square root of the number of predictions submitted by an MQAP method. Our method was penalized for providing predictions of only a half of server models, which were full atom models and didn't contain any steric clashes. MetaMQAP (CASP7 group number 038) submitted evaluations for 12259 models while the winning Pcons6 (group 634) scored 23858 server models. Thus, even if the Pearson's correlation between i.e. global model score and GDT_TS would be nearly the same for the two methods, the Z-score is much higher for Pcons6. We conclude that the MQAP scoring system used in CASP7 had an unexpected bias and in our opinion does not necessarily show the discrimination value of the methods tested.
Second, MetaMQAP was trained on CASP5&6+ database containing models idealized by Modeller. In the CASP7 experiment we made a very serious mistake and used MetaMQAP to evaluate models without any idealization. After CASP7 experiment we realized that all models evaluated by our method would be evaluated more accurately, if they were first idealized by Modeller, exactly as we have done for our training set. Such behaviour is now default for MetaMQAP.
Below we present the benchmark of the current version of MetaMQAP on the same set of CASP7 server models that was used in real CASP7 MQAP benchmark. All of these server models were scored by our method after idealization by MODELLER. For both global and local deviations in the benchmark, we use data deposited on CASP7 website (and calculated before idealization) instead of the ones calculated by ourselves.
There are two important questions for which answers are sought. First, how accurate is a given model (in the sense of its GDT_TS score), without considering any alternative models? Second, if there are number of alternative models, how do they compare to each other according to accuracy?
According to the CASP7 Quality Assessment (QA), the four methods that are best in predicting GDT_TS are: QA_556, QA_704, QA_633, and QA_692, while four winners in predicting the rank of alternative models are: QA_556, QA_634, QA_713, and QA_633 [41, 42]. To our best knowledge, the QA_556 method is so far unpublished and we could not find any web server, thus it was impossible for us to replicate its results. According to CASP7 abstracts , QA_556 is based on global optimalization of a scoring function. A set of models for given target sequence is clustered together and the clusters are evaluated by MODELLER scoring function  and/or DFIRE energy . Methods QA_633 (PROQ), QA_634 (PCONS6), and QA_692 (PROQLOCAL) were developed by Elofsson and coworkers . PROQLOCAL combines score of PROQRES and PROQCONF .
The PCONS method analyzes the set of protein models and looks for recurring three-dimensional structural patterns . Based on this analysis, each model is assigned a score reflecting how often its three-dimensional structural patterns are present in a whole ensemble of submitted models. It is based on the assumption that recurring patterns are more likely to be correct than patterns than occur only once or in few models. PCONS returns one score reflecting the predicted global quality and scores for all individual residues reflecting their predicted local quality. Unfortunately, the latest version of PCONS -PCONS6 does not allow to submit a single model to score it. It can be used only to score set of models created by the Pcons.net modeling server .
The QA_704 method is called QA_ModFold. It combines the scores of MODCHECK, ProQ-LG, ProQ-LX and ModSSEA using a neural network to predict global model accuracy . The method is available as web server .
The QA_713 method, also called Circle-QA, predicts model quality in two steps. First, it predicts target difficulty to assign weights for two scoring functions used in a second step. These functions are: 1) agreement between secondary structure predicted for the target sequence and observed in the model, and 2) combined score of burial area, fraction of polar area, and secondary structure. Unfortunately we were not able to find a web server of the method.
Only 3 methods of QA_556, QA_692, and QA_699 belong to both of presented groups of winners (prediction of global model accuracy, AND ranking alternative variants of models for a given target). In a following benchmark we compare MetaMQAP with these 3 methods. We decided to focus our analysis on models that were scored by all the compared MQAPs. Fortunately, almost all of these methods were able to score almost all servers models, so for such a benchmark we used 20486 of 24339 CASP7 servers' models. Additional file 2 presents a number of predictions for each of the considered MQAP methods.
Additional file 3 shows the accuracy of MQAPs as a function of target difficulty. The targets were divided into two different groups: "TBM" (template-based modeling) and "TBM/FM or FM" (template-based modeling or free modeling). We decided to combine TBM/FM and FM targets, because the distribution of GDT_TS score for models submitted for these targets are similar, given that we consider only single domain targets. As expected, all MQAP methods work better for TBM targets. The three most accurate methods in the case are QA_556, MetaMQAP and QA_704. All of these methods exhibit degraded performance for more difficult targets (TBM/FM & FM). Still, our MetaMQAP is also among three best methods, while QA_633 and QA_692 fail in predicting GDT_TS for hardest targets. Surprisingly, for most difficult targets, the size of a model is the best predictor of a model GDT_TS. Actually, it is better than any MQAP score: the correlation with coefficient of -0.6 indicates that the longer a target sequence the poorer are the models. On the other hand, for the TBM group, the size of a target is only weakly correlated with GDT_TS (with a coefficient of 0.1).
Our findings also indicate, that the number of missing residues can greatly influence the MQAP accuracy. Additional file 4 presents the correlation of global MQAP scores with GDT_TS with respect to completeness, measured as the ratio of a number of residues in the model to the total number of residues in the experimentally solved target structure. All methods except QA_556 exhibit significantly decreased accuracy for models which have less than 80% of residues. Therefore we strongly recommend using MetaMQAP only for models, in which no more than 20% of residues are missing.
One of the two most important characteristics on an MQAP is the ability to discriminate between better and worse alternative models for a given target sequence. Figure 5B shows the mean Spearman's rank correlation between global model accuracy measured by GDT_TS and the MQAPs' global score. We use Spearman's rank correlation instead of Pearson correlation, because it is more appropriate for ranking alternative models. The Spearman's rank correlation was averaged over all CASP7 targets. We observe that the correlation is similar both when we consider all CASP7 models and when only the models evaluated by all MQAP methods. The correlation of MetaMQAP global score with the rank of model accuracy is 0.65, and it is significantly lower that in the case of method QA_556 (0.78) or QA_634 (0.76). Both of these methods used clustering of large sets of alternative models and then assigned higher scores to models that were members of big clusters. We conclude that availability of numerous alternative models and their clustering based on geometric similarity is a method of choice for ranking of models. This criterion has been implemented earlier for selection of best decoys both from de novo folding  and fold-recognition analyses .
Additional file 5 presents the ranking abilities of MQAPs relative to the target difficulty. Exactly as before, we grouped single domain CASP7 targets into two classes: 1) TBM and 2) FM and TBM/FM. For easy targets (TBM) the highest accuracy is achieved by QA_556, but the effectiveness of the method significantly decreases for difficult targets (TBM/FM and FM). For the group of hard targets, Pcons6 performs best, because its accuracy is quite insensitive to target difficulty.
MQAPs ability to detect most accurate model in set of alternative models (analysis performed on a set of all CASP7 server models).
Percentile of GDT_TS
average GDT_TS of top-ranked models
Best server model
Additional file 6 presents the Pearson's correlation between local MQAP scores and true residue deviation as a function of target difficulty. For TBM targets MetaMQAP is the 2nd best method, with the correlation of 0.55. For more difficult targets the accuracy of all methods decreases significantly, but the effect is lowest in the case of PCONS6. Such a result was expected, because for very poor models statistical potentials used by e.g. MetaMQAP or ProQlocal become less accurate and only the information how common a given fragment is in set of alternative models seems to be an effective indicator of model quality.
There is also a clear correlation between MQAP accuracy and residue difficulty (Figure 8B). It can be measured as an average error between instances of a residue in all different models with respect to native structure. The residues were split into four residue difficulty bins: the "easiest" residues are those with a mean deviation in models of less than 2 Å, the "easy but not easiest" have mean deviation between 2–4 Å, and so on. The most difficult residues have mean deviation above 8 Å. For all residue difficulty classes, MetaMQAP shows accuracy comparable to PROQlocal. The accuracy of all methods drops down significantly when the residue mean deviation is > 8 Å.
We have developed a new method for the quality assessment of protein models, which uses the results of eight other MQAP methods (VERIFY3D, PROSA, BALA, ANOLEA, PROVE, PROQRES, REFINER, and TUNE) and a wide range of local residue features to predict the local deviation of residues in the model from their counterpart in the (unknown) native structure. To our knowledge, this is the first publicly accessible method that attempts to predict the absolute deviation (in Å) for the individual residues in the model in a manner completely independent of used modeling protocol and without any additional alternative models. The development of such methods was recently strongly encouraged in the course of discussions of members of the protein structure prediction community on the FORCASP website . Among similar methods, the ModFold server v1.1  can predict local residue deviations (in the ModFoldClust mode), but only for multiple models.
When our method is compared with the winners of MQAP category in last 7th edition of the CASP experiment , our method is outperformed by a few methods that use to their evaluation a large set of alternative models. However, thus far none of these methods has been made available as a public web server. Moreover, these methods require that a potential user provides a large set of models, whose size and diversity would be comparable to all models submitted by all CASP modelers for a given target. It is very unlikely that such an effort would be possible outside the CASP experiment itself. Therefore, MetaMQAP has a potential to become a tool of choice for researchers interested in evaluating just a single model of their target protein, or a few individual models, without necessity to provide a multiplicity of state-of-art models with many different modeling methods.
We hope that our server will be useful for many researchers, perhaps not only computational modelers, but also experimental structural biologists willing to identify confident parts of low-resolution models, e.g. structures generated by automatic tracing of crystallographic electron density maps or preliminary models from an NMR experiment. One possible area of immediate practical application for our method could be the identification of regions of comparative models that deviate less than ~3 Å from the native structure and could be used to solve the crystal structures directly from the diffraction data, using the molecular replacement approach [52–54]. Another possible application is to ensure that the overall quality of the model is sufficient for the intended application and to predicting regions of lower quality for further refinement.
It should be remembered that MetaMQAP is a 'meta-server', and critically depends on the 'primary' methods for the assessment of the local quality of models. The training protocol described in this work can be extended in the future to include additional models (of different quality, similar to QA_634) and additional primary MQAP methods (and structural feature analyses) – both general ones as those that specialize in detection of very minor or very large errors, or deviations of some particular type.
Published analyses describing the use of MetaMQAP prior to publication of this article.
Bifunctional tRNA methyltransferase and oxidoreductase
Nuclease/ATPase subunit of Type I restriction-modification system
DNA modification enzyme
MiaA, MiaB, MiaE
Enzymes involved in the ms2io6A biosynthesis pathway: a P-loop NTPase, a Radical SAM enzyme, and a diiron carboxylate oxidase
We are grateful to the developers of primary MQAPs, in particular those used in this work: VERIFY3D, PROSA, BALA, ANOLEA, PROVE, TUNE, REFINER, and PROQ, for making their methods available. Our method can stand only 'on their shoulders', and without them the development of MetaMQAP would not be possible. We also thank our colleagues, in particular Marcin Feder, Michal Pietal, and Maciej Fijalkowski for critical comments and useful suggestions and for their help with the installation of MetaMQAP as a publicly available server. This work was supported by the Polish Ministry of Science (grant N301 10632/3600). M.P. and J.M.B. were additionally supported by the NIH (grant 1R01 GM081680-01 to Andrzej Kloczkowski)
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.