Evaluation of 3D-Jury on CASP7 models

Background: 3D-Jury, the structure prediction consensus method publicly available in the Meta Server, was evaluated using models gathered in the 7th round of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7). 3D-Jury is an automated expert process that generates protein structure meta-predictions from sets of models obtained from partner servers.
Results: The performance of 3D-Jury was analysed in three respects. First, we examined the correlation between the 3D-Jury score and a model quality measure: the number of correctly predicted residues. The 3D-Jury score was shown to correlate significantly with the number of correctly predicted residues; the correlation is good enough to be used for prediction. 3D-Jury was also found to improve upon the competing servers' choice of the best structure model in most cases. The value of the 3D-Jury score as a generic reliability measure was also examined. We found that the 3D-Jury score separates bad models from good models better than the reliability score of the original server in 27 cases and falls short of it in only 5 cases out of a total of 38. We report the release of a new Meta Server feature: instant 3D-Jury scoring of uploaded user models.
Conclusion: The 3D-Jury score continues to be a good indicator of structural model quality. It also provides a generic reliability score, especially important for models that were not assigned one by the original server. Individual structure modellers can also benefit from the 3D-Jury scoring system by testing their models with the new instant scoring feature available in the Meta Server.


Background
The number of protein structure prediction servers has increased over the past years [1]. The use of many different methods to predict the structure of a protein is now state-of-the-art in protein structure prediction [2]. However, the number of available servers, taken together with the number of models returned, exceeds what a human researcher is likely to scan. Fortunately, structure prediction meta-servers address this problem: they gather models from various other servers and employ automated processes successfully applied by human experts in order to deliver a correct prediction [1]. Since existing structure prediction servers are constantly upgraded while new servers appear, it is necessary to re-evaluate the fitness of the aforementioned expert processes.
The latest, 7th round of the Critical Assessment of Techniques for Protein Structure Prediction [3] has provided us with a fair number of structure prediction server models. With the help of the Structure Prediction Meta Server [4], we have evaluated the servers returning these models using the same protocols as in previous Livebench experiments [5]; results are available at [6].
Standard evaluation methods take into account the first (top ranked) model of the prediction servers. The Meta Server assigns a new reliability score to each model using 3D-Jury [7]. This score can be used to re-rank the models and thus affect the evaluation results. The aim of the present work was to verify the continued applicability of this model ranking method, focusing on the version available on-line. We were interested in answering the following three questions: Can we use 3D-Jury to estimate model quality? Does 3D-Jury select a model more accurate than the choice of the generating server? Could the 3D-Jury score be used as a generic model reliability score?

3D-Jury score correlates with the number of correctly predicted residues
The correlation of the 3D-Jury score (Jscore) with model quality is of fundamental importance to the operation of the Meta Server. Therefore we first examined the correlation of the 3D-Jury score returned by the default on-line version of 3D-Jury, 3J1,A (see Methods: 3D-Jury operating modes), with the number of correctly predicted residues.
3D-Jury scores correlate with the number of correctly predicted residues: the correlation coefficient is 0.95. A linear model (LM1) is presented in Figure 1. The residual error, 20.15, is low enough to enable meaningful estimation of the number of correctly positioned residues.
A better model (LM2) can be obtained by fitting to the [30, 100) 3D-Jury score range only. This range represents difficult targets. Figure 2 shows the linear model obtained. The residual error is 13.37, offering narrower, better prediction intervals for the number of correctly positioned residues.
As an example of the use of LM2, let us assume that our model has a 3D-Jury score of 44.5. We can expect to have 13 to 82 well-positioned residues in this model at the 99% confidence level, and 21 to 74 at the 95% confidence level. For a score of 59 the 99% prediction interval for the number of correct residues is 26-94; the 95% prediction interval is narrower: 34-86.
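The worked intervals above can be reproduced approximately with a few lines of Python. This is a minimal sketch under stated assumptions, not the fitted LM2 itself: the slope and intercept are not quoted in the text, so the line used here is back-derived from the midpoints of the two example intervals, and the interval half-width is approximated as a normal quantile times the reported residual error of 13.37, ignoring the (presumably small) uncertainty of the fitted coefficients. The function and constant names are illustrative.

```python
from statistics import NormalDist

RESIDUAL_SE = 13.37  # residual error reported for LM2

# Midpoints of the two worked 99% intervals quoted in the text
# (score 44.5 -> 13..82, score 59 -> 26..94). The line through them is an
# illustrative stand-in for LM2, NOT the authors' fitted coefficients.
X1, Y1 = 44.5, (13 + 82) / 2
X2, Y2 = 59.0, (26 + 94) / 2
SLOPE = (Y2 - Y1) / (X2 - X1)
INTERCEPT = Y1 - SLOPE * X1


def prediction_interval(jscore, level=0.95):
    """Approximate interval for the number of correctly predicted residues."""
    if not 30 <= jscore < 100:
        raise ValueError("LM2 was fitted on the [30, 100) score range only")
    centre = INTERCEPT + SLOPE * jscore
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    half_width = z * RESIDUAL_SE
    return centre - half_width, centre + half_width


low, high = prediction_interval(44.5, level=0.99)
print(round(low), round(high))
```

Under these assumptions the rounded bounds agree with the four intervals quoted above; for any other score the output should be read as an estimate only, since the true LM2 coefficients may differ slightly.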
A key to which residues are likely to be well positioned is provided on the model-centred 3D-Jury page, accessible by selecting a model in the Model column of the main 3D-Jury page. There, residues that are likely to be correctly positioned have a grey background at the corresponding positions of most of the other aligned models, forming a grey column.

3D-Jury improves overall server prediction results
We examined whether 3D-Jury could improve overall server performance by selecting a better model when multiple models are returned by a prediction server. We tested four operating modes of 3D-Jury […] [9] and 3dpro [10], we can see that 3D-Jury 3J1,A improves the prediction for more targets, but its overall performance is slightly worse than that of the original servers. The reason is that the more numerous better choices of 3J1,A did not gain enough to offset the MaxSub score it lost on its bad choices. In the case of inub [11] and BasD [12] the situation is reversed: 3J1,A improved fewer targets, but the net improvement is positive. For many servers the improvement (or worsening) of the targets is marginal (e.g. phyre-2 = 0.6%). Nevertheless, even in these cases there is room for a 4-5% improvement (Table 1, column Q%, values in parentheses). Moreover, it appears that for at least 14 targets every server fails to pick the best model.

3D-Jury scores as generic model reliability scores
In order to assess the advantage of using 3D-Jury scores as generic reliability scores we conducted a receiver operating characteristic (ROC) analysis adapted for CASP and Livebench [5] evaluation. The analysis shows how well a reliability score separates good models from bad ones, in terms of the average number of good models seen before encountering 1 to 11 bad models. We compared the 3D-Jury scores returned by the on-line version 3J1,A to the reliability scores of the original servers, when available. Results are shown in Table 2. By this measure the 3D-Jury score exceeds the original server score in 27 cases and falls short of it in only 5 cases out of the 38 analysed. The five exceptions are pmodeller6 [9], pcons6 [2], ffas03 [13], inub [11] and shub [11].
The J0 scores listed in Table 2 indicate the lowest 3D-Jury score seen before a bad model was encountered from the indicated server. In other words, no bad model with a score above J0 was seen in the test model set of the server. J0 scores are of practical value: they can be used as server-specific score thresholds, since a score above J0 is likely to indicate a good model.
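The definition of the server-specific threshold can be made concrete with a short sketch. It assumes each model is given as a hypothetical (Jscore, MaxSub score) pair and that, when a good and a bad model share the same score, the bad one is counted first; neither the data layout nor the tie rule comes from the original analysis.

```python
def j0_threshold(models):
    """Lowest Jscore seen before the first bad model of one server.

    `models` is an iterable of (jscore, maxsub) pairs. Returns None when
    the highest scoring model is already bad (MaxSub score of 0).
    """
    # Highest score first; among equal scores bad models come first, which
    # is a conservative tie rule assumed for this sketch.
    ranked = sorted(models, key=lambda m: (-m[0], m[1] > 0))
    j0 = None
    for jscore, maxsub in ranked:
        if maxsub <= 0:  # bad model reached: stop scanning
            break
        j0 = jscore
    return j0


example = [(120, 5.2), (80, 3.1), (55, 0.0), (60, 1.0), (40, 2.0)]
print(j0_threshold(example))
```

In this made-up example the first bad model has a score of 55, so the threshold is the score of the last good model ranked above it.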

3D-Jury scoring of user models
In order to encourage model selection and refinement using 3D-Jury, we introduced a new feature: instant 3D-Jury scoring of user models. This feature, available for any completed job by selecting the job in the Queue and uploading a model, enables the user to score a set of models and obtain a ranking based on the 3D-Jury score. Popup hints and an on-line tutorial [14], available from the job page, offer help with this new feature.

Conclusion
In this report we present the evaluation of 3D-Jury [7] on models gathered in CASP7. We found good correlation between the 3D-Jury score and a model quality measure: the number of correctly predicted residues. This correlation can be used to predict important model features such as the number of correctly positioned residues. Using Figure 2, 3D-Jury scores can be translated to the estimated number of correctly predicted residues. We plan to upgrade the on-line 3D-Jury to provide the 90%, 95% and 99% prediction intervals for the number of correctly predicted residues automatically.
3D-Jury, in general, also appears to boost server predictions by identifying better models. Our results show that 3D-Jury performs best when all models of all servers are used to calculate the Jscore. This option, however, is not feasible in the Meta Server since many of the servers participating in CASP7 are not currently available on-line. Nevertheless, 3J1,A, the on-line default, is a reasonable choice. We found that 3D-Jury scores can be used as generic reliability scores, an especially important feature for models that are not provided with such values. We have also extracted server-specific 3D-Jury score thresholds to help identify reliable models. We report the release of a new Meta Server feature: instant 3D-Jury scoring of uploaded user models.
3D-Jury remains a valuable tool in the hands of protein structure modellers. Its ability to pinpoint the best server models is supported by the results of our analysis.

Figure 1. Correlation of 3D-Jury score with the number of correctly predicted Cα atoms, that is, the number of Cα atoms predicted within 3.5 Å from their respective locations in the crystal structure.
Servers that predicted fewer than two targets and/or returned only one model for each target were excluded from the server model ranking tests (reported in Table 1). The resulting set contains 25,215 models for 85 targets from 59 servers, an average of about 5 models per server per target.
Models with Jscore = 0 were excluded from all correlation and regression analyses.
Server reliability scores (Rscore) that anti-correlate with model quality were multiplied by -1.

Model quality measures
The MaxSub [8] score (MaxS) and the number of correctly predicted residues (defined below) were used to measure the quality of models. MaxSub returns a score between 0.0 (incorrect prediction) and 1.0 (perfect prediction). In this study the score was multiplied by 10.0, as is customary on the 3D-Jury web pages [20]. We say that models with MaxS > 0 are good, while models with MaxS = 0 are bad.
The number of correctly predicted residues is the number of Cα atoms that are predicted within 3.5 Å from their respective locations in the solved structure, as reported by the MaxSub tool [8] operating on the Cα atoms of the structures compared.

3D-Jury model scoring
The 3D-Jury score of a model M is calculated by first comparing M to a set of other models available to the system for the same target. The way these other models are selected is a tunable parameter of 3D-Jury. M is compared to each selected model, and a pairwise similarity score (S_{M,i}, for pair i) is assigned that equals the number of respective Cα atoms that are within 3.5 Å of each other after optimal superposition of the structures represented by their Cα atoms. MaxSub [8] is used to carry out this step. If a pairwise similarity score falls below a certain cutoff value, it is set to zero. The 3D-Jury score (Jscore) of model M is the sum of its pairwise similarity scores divided by the number of these scores (n) plus 1 [7]:
$$\mathrm{Jscore}(M) = \frac{\sum_{i=1}^{n} S_{M,i}}{n + 1}$$

3D-Jury parameters
3D-Jury offers three tunable parameters: the list of servers to draw models from for pairwise score calculation; the method of server model selection (applicable in case of multiple available models): first model, most similar one (in terms of S_{M,i}), or all models; and the pairwise similarity score cutoff [7]. In this analysis we used the publicly available BasD [12], ffas03 [13], inub [11], mgenthreader [16], ORFeus-2 [17], pdbblast [18] and 3D-PSSM [19] as default servers and a constant similarity cutoff of 40 in order to simulate regular on-line use of the service.

Figure 2. A linear model (LM2) was fitted to the 3D-Jury score vs. the number of correctly predicted residues of 6,710 models. The residual standard error is 13.37. The 95% confidence interval as well as prediction intervals for 90%, 95% and 99% confidence levels are indicated on the figure. The vertical and horizontal histograms show the distributions of the number of correctly predicted residues and of the 3D-Jury scores, respectively. The 30 to 100 3D-Jury score range was chosen to represent difficult targets. Legend: number of correctly predicted residues: the number of Cα atoms predicted within 3.5 Å from their respective locations in the crystal structure; Jscore: 3J1,A score; solid green line: prediction of linear model LM2; blue longdash lines: confidence interval at 95% confidence level; blue dashed lines: prediction interval at 90% confidence level; blue dotdash lines: prediction interval at 95% confidence level; blue dotted lines: prediction interval at 99% confidence level; x: slope; the colour bar is a key to the approximate density of models.
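The scoring rule described under 3D-Jury model scoring can be sketched in a few lines. The sketch assumes the pairwise similarity scores have already been computed (in the study this step is done by MaxSub) and that n counts every compared model, including those whose score was set to zero by the cutoff; the text does not state the latter explicitly. The function name is illustrative.

```python
def jury_score(pairwise_scores, cutoff=40):
    """3D-Jury score of one model from its pairwise similarity scores.

    `pairwise_scores` holds one value per compared model: the number of
    C-alpha atoms within 3.5 Angstrom of each other after superposition.
    Scores below `cutoff` are set to zero before averaging.
    """
    kept = [s if s >= cutoff else 0 for s in pairwise_scores]
    # Assumption: n is the number of all compared models, zeroed or not.
    return sum(kept) / (len(kept) + 1)


print(jury_score([100, 50, 30]))  # the score of 30 falls below the cutoff
```

Because of the extra 1 in the denominator, the score of a model is always lower than its mean pairwise similarity, and a model with no comparison partners scores 0.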

3D-Jury operating modes
The four operating modes of 3D-Jury used in this report are: 3J1,A, which uses one model of each default server (a mode typical for on-line predictions); 3Ja,A, all models of the default servers; 3J1,C, one model of all servers; and 3Ja,C, all models of all servers.

Receiver operating characteristic (ROC) analysis
We performed a ROC analysis adapted for CASP and Livebench [5] model evaluation for each server. Server models were ordered by the original reliability score (Rscore, when available), or the 3D-Jury score (Jscore). The highest scoring models for each target were collected into separate sets M_R and M_J, corresponding to the Rscore or Jscore used for ordering. Models in both sets were ordered by their respective scores. Good models (MaxS > 0) were labelled positive, bad models (MaxS = 0) were labelled negative. Using Rscore or Jscore as the discrimination threshold, we plotted the number of true positives (tp) versus the number of false positives (fp) on the [0-10] fp range. This was to take into account the absolute number of targets predicted by the servers, focusing on the hardest targets. We used the number of true positives averaged over the [0-10] false positive range as a quality measure for the reliability scores, with higher values indicating better reliability scores.
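The quality measure described above, the number of true positives averaged over the 0 to 10 false positive range, can be sketched as follows. The input layout (one hypothetical (score, MaxSub score) pair per target for a given server), the tie rule and the handling of servers with fewer than 11 bad models are assumptions of this sketch, not details taken from the study.

```python
def mean_true_positives(models, max_fp=10):
    """Average number of true positives over fp = 0..max_fp.

    `models` is a list of (score, maxsub) pairs, one top-scoring model per
    target, where `score` is either the Rscore or the Jscore.
    """
    # Highest score first; among equal scores bad models are counted first
    # (a conservative tie rule assumed here).
    ranked = sorted(models, key=lambda m: (-m[0], m[1] > 0))
    tp_at_fp = []  # tp_at_fp[k]: good models seen before bad model k + 1
    tp = 0
    for _, maxsub in ranked:
        if maxsub > 0:
            tp += 1
        else:
            tp_at_fp.append(tp)
            if len(tp_at_fp) > max_fp:
                break
    # Assumption: with fewer than max_fp + 1 bad models, the curve is
    # extended flat at the final number of true positives.
    while len(tp_at_fp) <= max_fp:
        tp_at_fp.append(tp)
    return sum(tp_at_fp) / len(tp_at_fp)


ranked_example = [(9, 1.0), (8, 2.0), (7, 0.0), (6, 1.5), (5, 0.0)]
print(mean_true_positives(ranked_example, max_fp=1))
```

Computing this value once with the Rscore and once with the Jscore for the same server gives the pair of numbers that the comparison in Table 2 is based on.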

Statistics and figures
Reported correlation coefficients are significant at the 95% confidence level.
Statistics and figures were prepared using R [21].