In this study we describe a novel method, FunFOLD, for the prediction of ligand binding site residues, which shows a significant improvement over all of the true server methods that were tested at CASP8, as well as over the predictions from one of the top manual groups (FN202). In addition, the method was tested on the CASP9 set and was found to be competitive with the top server groups and statistically inseparable in performance from most of the top manual groups. Whilst a prototype version of a server was tested in the CASP9 function prediction category (IntFOLD-FN - FN425), we have since improved the reliability of the FunFOLD server, and the current automated implementation more closely resembles the performance of our manual function predictions (McGuffin - FN094).
The performance of all methods was measured using the standard Matthews Correlation Coefficient (MCC), on both the CASP8 and CASP9 function (FN) targets. The mean MCC Z-scores for FunFOLD were higher than the mean MCC Z-scores for all other methods tested at CASP8 (Table 1). The FunFOLD predictions were an improvement upon those made by the Lee manual group (+1% MCC), the Lee server group (+4%) and the Sternberg group (+13%), which were the top groups tested during CASP8. The improvement over the Sternberg group on the CASP8 data set is statistically significant at the 99% level according to the Wilcoxon signed-rank test. The FunFOLD method is also competitive with all methods tested at CASP9, showing no statistically significant difference from the top ranking server methods, I-TASSER-FUNCTION and firestar (Tables 4, 5, 6 and 7). However, a statistically significant improvement was seen over the 3DLigandSite methods that were tested at CASP9, at the 99% level according to the Wilcoxon signed-rank test.
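For reference, the MCC used throughout this assessment can be computed directly from per-residue confusion counts. The following is a minimal sketch with hypothetical residue sets and target length, not data from the benchmark:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from per-residue confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical target: compare predicted vs observed binding residue sets
predicted = {10, 11, 45, 46}
observed = {10, 11, 46, 47}
n_residues = 100  # hypothetical target length
tp = len(predicted & observed)               # 3 correctly predicted residues
fp = len(predicted - observed)               # 1 over-prediction
fn = len(observed - predicted)               # 1 missed residue
tn = n_residues - len(predicted | observed)  # 95 remaining residues
print(round(mcc(tp, tn, fp, fn), 3))  # → 0.74
```

Because the MCC treats every incorrectly predicted residue equally, a near-miss adjacent to the binding pocket and a residue on the far side of the protein incur the same penalty, a point returned to below in the discussion of the BDT score.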
Intuitively, the quality of the starting 3D model used for the prediction of the ligand binding site will have an effect on the accuracy of the results. Thus, for the CASP8 analysis in this paper we used the ModFOLDclust2 model quality assessment method to select high quality input models. However, we also tested FunFOLD using alternative 3D models that were submitted by the top function prediction groups and found that the improvement in MCC scores was maintained: the Lee manual method was improved upon by +2%, the Lee server by +3% and the Sternberg group by +10%, which was again shown to be statistically significant. In addition, when native structures were used, as might be expected, the improvement in performance was maintained.
On the CASP9 data set we tested starting models from both our IntFOLD-TS server and from the better performing Zhang-Server; however, again no significant difference in performance was seen. Furthermore, whilst using the native structures improved performance marginally in some cases, in the majority of cases the increases were not significant. According to these benchmarks, the choice of starting model has little influence on the results. Therefore, where there has been a significant improvement over other groups, it must have arisen from the FunFOLD algorithm itself, rather than from improved initial model selection.
One of the top ligand binding site prediction servers tested at CASP9 was the I-TASSER-FUNCTION server method (Zhang group), which also relies on 3D model-to-template superposition for binding site residue prediction. However, in addition to global model-to-template superposition, the I-TASSER-FUNCTION method also carries out local alignment of the proposed binding site region, in order to improve local superposition. In light of the results shown in Figures 8 and 9, local superposition scores, such as those used in I-TASSER-FUNCTION, could be adopted to help improve future versions of FunFOLD. However, in our benchmarking on the CASP9 set we could measure no significant increase in performance of the I-TASSER-FUNCTION method over FunFOLD.
At the time of writing, the I-TASSER-FUNCTION server is not publicly available. Therefore, arguably the top ranking publicly available server tested at CASP9 was firestar, which predicts residue conservation in target sequences based on PSI-BLAST alignments to the large catalogue of sites from PDB structures contained in FireDB. However, the firestar method is not currently available as a standalone program and, again, we could measure no significant performance gain over the FunFOLD method.
The FunFOLD method is clearly also competitive with the top manual groups that were tested in CASP8; however, no manual intervention is required for our approach. Furthermore, the method significantly outperforms each of the true server methods tested at CASP8. Whilst the Lee server group (FN293) was mostly automated, the group was counted in the extended deadline category and the authors reported a small amount of human intervention; hence, in this study we have considered group FN293 in CASP8 as a non-server group.
The FunFOLD method also significantly outperformed one of the top manual groups (FN202). Since CASP8, the authors have developed a fully automated, publicly available server, called 3DLigandSite, variations of which participated in CASP9. The authors reported that the predictions from the 3DLigandSite server were comparable in performance to their manual predictions at CASP8; therefore, in this study we can consider predictions from group FN202 to be the gold standard for fully automated ligand binding residue prediction on the CASP8 data set.
The standalone FunFOLD software uses similar input data to the version of the FINDSITE software that is currently available to download. Both programs require a 3D model and a list of templates; however, the methods differ in the output produced: the FunFOLD method outputs binding residue predictions in CASP FN format, whereas the FINDSITE method outputs a list of putative locations for the centre of each binding pocket, i.e. locations in 3D space rather than binding site residues. Thus, the FunFOLD software cannot be directly compared with the FINDSITE software, as they produce different outputs. Furthermore, the FINDSITE dataset cannot be directly used in our analysis, as the location of the binding site residues for each template is not defined and would have to be predicted, adding potential for errors in the comparison of methods. However, the latest FINDSITE-DBDT method did compete in CASP9, although to our knowledge the server is not publicly available. The current implementation of FunFOLD, the prototype version of the server (FN425) and our manual prediction group (FN094) performed statistically significantly better than the FINDSITE-DBDT method on the CASP9 data set.
The FunFOLD method uses a similar procedure to that carried out by the most successful prediction groups participating in CASP: the 3D input model is superposed onto structurally similar ligand-containing PDB files and the putative binding residues are then determined. However, the FunFOLD method uses a novel, fully automated approach both for identifying clusters of ligands and for determining putative binding site residues. The novel ligand residue voting method used in FunFOLD reduces the rate of over-prediction, which appears to be one of the main problems with many structure based approaches.
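The actual FunFOLD voting scheme is described in the Methods; purely as an illustration of the voting principle, with a hypothetical threshold and made-up contact data, residues contacted by the superposed template ligands in a cluster can each cast a vote, and only residues supported by a sufficient fraction of ligands are retained:

```python
from collections import Counter

def vote_binding_residues(contacts_per_ligand, threshold=0.5):
    """contacts_per_ligand: one set of model residue numbers per superposed
    template ligand in the cluster, listing the residues that ligand contacts.
    A residue is predicted if at least `threshold` of the ligands vote for it.
    (Illustrative only; the real FunFOLD voting rules differ.)"""
    n = len(contacts_per_ligand)
    votes = Counter(r for contacts in contacts_per_ligand for r in contacts)
    return sorted(r for r, v in votes.items() if v / n >= threshold)

# Three superposed ligands agree on residues 12 and 40;
# residues 7 and 41 are each supported by only one ligand
ligand_contacts = [{12, 40, 41}, {12, 40}, {7, 12, 40}]
print(vote_binding_residues(ligand_contacts))  # → [12, 40]
```

Requiring agreement across the ligands in a cluster is what suppresses the spurious residues that a single poorly superposed template would otherwise contribute, which is the over-prediction problem noted above.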
There are several caveats to consider when benchmarking methods for the prediction of protein ligand binding residues. Firstly, uncertainties can arise if there are several ligand binding sites within a protein, either predicted or observed. For this analysis we only considered the binding residues that were defined by the CASP assessors, and for each target only one binding site was defined. Secondly, the inherent flexibility of proteins may make it difficult to determine which residues are actually in contact with the ligand. This is further exacerbated if the binding site is located in a disordered region of a protein. Thirdly, the distance cut-off for a residue in contact with a ligand is defined as the sum of the van der Waals radii plus 0.5 Å, but this definition is subjective. Finally, there may be ambiguity about whether the ligand bound to the solved structure is the protein's ideal ligand; thus, a ligand used in a prediction may not necessarily be incorrect. Indeed, one binding site may bind more than one ligand.
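The contact definition above can be made concrete. The sketch below uses approximate van der Waals radii and hypothetical coordinates; it shows only the distance test itself, not the full residue-level bookkeeping:

```python
import math

# Approximate van der Waals radii in Å (illustrative values)
VDW = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}

def in_contact(protein_atom, ligand_atom, tolerance=0.5):
    """Atoms are (element, x, y, z) tuples. Two atoms are in contact if
    their distance is within the sum of the van der Waals radii plus the
    tolerance (0.5 Å), following the cut-off discussed above."""
    (e1, *p1), (e2, *p2) = protein_atom, ligand_atom
    return math.dist(p1, p2) <= VDW[e1] + VDW[e2] + tolerance

# A carbon 3.5 Å from a ligand oxygen: 3.5 <= 1.7 + 1.52 + 0.5 = 3.72
print(in_contact(("C", 0.0, 0.0, 0.0), ("O", 3.5, 0.0, 0.0)))  # → True
```

A residue would then be labelled a binding residue if any of its atoms passes this test against any ligand atom; shifting the 0.5 Å tolerance moves borderline residues in or out of the observed set, which is exactly the subjectivity noted above.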
Each of these issues creates difficulties for the fair assessment of methods, as defining a list of observed binding site residues may be subjective. Some of these issues were addressed by the exclusion of "neutral residues" in the CASP8 analysis, and these have also been excluded in this analysis of the CASP8 data (the CASP8 assessors defined neutral residues as those which would potentially bind to an alternative ligand, but which were not observed binding to the alternative ligand within the solved 3D structure). In CASP9 the assessors used two classifications for binding sites: partial and extended. According to the official analysis, there was no significant difference in the assessment whether methods were analysed using the partial or the extended binding site definitions. However, the use of the MCC statistic for assessment compounds some of these issues: the prediction of a binding residue that is defined as incorrect, but which is nevertheless close to the observed binding pocket, obtains the same score as a random incorrect prediction.
Therefore, we recently proposed a novel scoring method, the Binding-site Distance Test (BDT) score, which addresses some of the shortcomings of using MCC scores whilst maintaining the advantages. Predicted residues that are close to the observed residues will obtain a higher BDT score than more distant predictions. The BDT score was used by the CASP9 assessors, in addition to the MCC score, to investigate whether it caused a significant difference in the rankings of the methods, but no significant changes in the grouping of the top methods were reported. In this analysis, the list of top groups identified using BDT scoring is again roughly in agreement with that obtained using MCC scoring; however, the ranking of some of the less accurate methods does appear to change. Using the mean BDT score, we also see higher scores for FunFOLD compared with all methods tested on CASP8 on equivalent subsets of data, and the difference is again significant for all but the top two manual groups (Figure 2, Figure 6 and Table 1). When the CASP9 predictions are analysed using mean BDT scores, the FunFOLD method is ranked below the top two server methods, I-TASSER-FUNCTION and firestar; however, again the difference in performance is not significant.
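The exact BDT formula is given in the cited work; the sketch below is NOT that formula, only a simplified illustration of the underlying principle of distance-weighted credit, using a hypothetical decay constant and made-up coordinates. A near-miss earns partial credit rather than the flat penalty the MCC assigns:

```python
import math

def distance_weighted_score(predicted, observed, coords, d0=4.0):
    """Simplified illustration of distance-weighted scoring (not the
    published BDT formula). Each predicted residue earns credit that decays
    with its distance to the nearest observed binding residue; d0 is a
    hypothetical decay constant. coords maps residue number -> (x, y, z)."""
    if not predicted or not observed:
        return 0.0
    total = 0.0
    for p in predicted:
        d = min(math.dist(coords[p], coords[o]) for o in observed)
        total += 1.0 / (1.0 + (d / d0) ** 2)  # 1.0 for exact hits, -> 0 far away
    return total / len(predicted)

coords = {1: (0, 0, 0), 2: (3, 0, 0), 3: (20, 0, 0)}
exact = distance_weighted_score({1}, {1}, coords)  # exact hit
near = distance_weighted_score({2}, {1}, coords)   # near-miss, 3 Å away
far = distance_weighted_score({3}, {1}, coords)    # distant false positive
print(exact > near > far)  # → True
```

Under any scheme of this shape, a residue adjacent to the observed pocket scores between an exact hit and a random false positive, which is the behaviour that distinguishes the BDT score from the MCC.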
An obvious way of improving future versions of the FunFOLD software would be to optimize the voting threshold for the inclusion of predicted residues. Furthermore, predicting the ligand binding site residues on multiple models and then pooling the results may help to increase accuracy. The specific physicochemical properties of the residues, such as charge and polarity, could also be studied, and residues exhibiting properties more favorable for binding to a ligand could be weighted more heavily in the residue voting process. The enzyme's functional family could be predicted, and residues binding ligands that occur more often within that family could then be weighted more heavily in the voting process. In addition to undertaking global model-to-template superpositions, local superposition of the binding site regions could also be carried out to increase accuracy; as previously mentioned, this was carried out by one of the top server groups at CASP9 (I-TASSER-FUNCTION - FN339).
A general function prediction quality assessment tool could also be developed in order to weight predictions or provide probability scores for individual residues. Features of the quality assessment might include: the types of ligands within the cluster, with clusters containing a large number of similar ligands receiving a higher score; the distance of the superposed ligands from the centroid ligand within the cluster, with clusters containing perfectly superposed ligands receiving higher scores; the global and local model quality scores of the starting 3D model, which could be factored into the analysis using our ModFOLD methods [28, 37], with residues in poorly modeled regions down-weighted; and the probability of bound residues occurring in disordered regions, which could be considered by integrating our DISOclust results, with residues in regions of high disorder receiving an appropriate weighting. In future, each of these features could be integrated into an automated quality assessment tool in order to produce more appropriate confidence scores for ranking binding residue predictions.
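One simple way the model quality and disorder features above could be combined into a per-residue confidence is sketched below. The weights, function name and linear form are all hypothetical assumptions for illustration; a real tool would calibrate them against benchmark data:

```python
def residue_confidence(base_vote, model_quality, disorder_prob,
                       w_quality=0.5, w_disorder=0.5):
    """Illustrative combination of per-residue features into a confidence
    score (hypothetical weights and linear form). base_vote is the residue's
    vote score, model_quality a per-residue quality estimate in [0, 1], and
    disorder_prob a disorder probability in [0, 1]. Residues in poorly
    modeled or disordered regions are down-weighted, as suggested above."""
    penalty = w_quality * (1.0 - model_quality) + w_disorder * disorder_prob
    return max(0.0, base_vote * (1.0 - penalty))

# A well-modeled, ordered residue keeps most of its vote score...
print(round(residue_confidence(0.9, model_quality=0.8, disorder_prob=0.1), 3))
# ...while a poorly modeled, disordered residue is heavily penalised
print(round(residue_confidence(0.9, model_quality=0.3, disorder_prob=0.7), 3))
```

Ranking predictions by such a score, rather than by raw votes alone, would allow a fixed confidence cut-off to be applied consistently across targets of varying model quality.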