FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins

Background The accurate prediction of ligand binding residues from amino acid sequences is important for the automated functional annotation of novel proteins. In the previous two CASP experiments, the most successful methods in the function prediction category were those which used structural superpositions of 3D models and related templates with bound ligands in order to identify putative contacting residues. However, whilst most of this prediction process can be automated, visual inspection and manual adjustments of parameters, such as the distance thresholds used for each target, have often been required to prevent over prediction. Here we describe a novel method FunFOLD, which uses an automatic approach for cluster identification and residue selection. The software provided can easily be integrated into existing fold recognition servers, requiring only a 3D model and list of templates as inputs. A simple web interface is also provided allowing access to non-expert users. The method has been benchmarked against the top servers and manual prediction groups tested at both CASP8 and CASP9. Results The FunFOLD method shows a significant improvement over the best available servers and is shown to be competitive with the top manual prediction groups that were tested at CASP8. The FunFOLD method is also competitive with both the top server and manual methods tested at CASP9. When tested using common subsets of targets, the predictions from FunFOLD are shown to achieve a significantly higher mean Matthews Correlation Coefficient (MCC) scores and Binding-site Distance Test (BDT) scores than all server methods that were tested at CASP8. Testing on the CASP9 set showed no statistically significant separation in performance between FunFOLD and the other top server groups tested. Conclusions The FunFOLD software is freely available as both a standalone package and a prediction server, providing competitive ligand binding site residue predictions for expert and non-expert users alike. The software provides a new fully automated approach for structure based function prediction using 3D models of proteins.


Background
Because of a protein's essential cellular role, it is important to fully understand its structure and interactions. Predicting the location of the binding site and the ligand binding residues, is a necessary step towards elucidating how a protein functions [1]. The determination of the ligand binding site residues in a protein is also important, because substrate specificity of an enzyme is determined by the fine details of the binding site residues, such as side chain orientation and physiochemical properties [2].
In the CASP6 experiment, a function prediction category was included for the first time, where groups were required to predict Enzyme Commission numbers (EC) and Gene Ontology (GO) terms [20]. As a result of the difficulty in assessing these terms, the CASP7 assessors made a decision to modify the function prediction category [21]. Consequently in CASP8, the function prediction was included in a different format, with ligand binding site residues assessed, as many CASP targets were found to crystallize with biologically relevant ligands [1].
In CASP8, the top methods in the function prediction category were the methods by the Lee group [3] and the Sternberg group [22]. The Lee group had two main methods: a manual method (FN407) and an extended deadline server method (FN293). The Lee extended deadline server method relied on manual configuration of distance cutoffs for each target and, at the time of writing, a server based on this method is not publicly available. The Sternberg group was also registered as a manual prediction group (FN202), however since CASP8, the group has produced a publicly available server, 3D LigandSite, which is reported to be similar in performance to their manual prediction method from CASP8 [15].
The Lee group's manual and server methods from CASP8 only differed in the top 3D models used for prediction. The methods were both based on 3D superposition of structurally similar proteins that contain ligands. Residues were considered to be in contact with the ligands if the distance between them was less than 0.5Å + the Van der Waals radii, and the cut-offs for including residues in predictions were manually altered depending on the target [3].
The Sternberg group's manual predictions in CASP8, and their 3D LigandSite [15] server, used a very similar method [22] to the Lee group for ligand binding site prediction. The Sternberg group used the 3D-Jury method [23] in order to select from amongst the CASP8 server models followed by structural superposition of models and related templates. In addition, residue conservation was considered using ConFunc [7].
However, for the 3D LigandSite [15] server method, Wass et al. used the top 3D models from the Phyre server [24] and MAMMOTH [25] for structural superposition of similar templates with bound ligands. The residue conservation scoring was found to cause significant over prediction of residues [22] and so data from ConFunc is not directly included in the current 3D LigandSite server predictions [15]. To the best of our knowledge the 3D LigandSite server is the best publicly available fully automated method for prediction of ligand binding residues that has been independently benchmarked on the CASP8 data set and it produces comparable results to that of group FN202 [15]. Hence in this study we firstly compare our novel approach, FunFOLD, against all the groups at CASP8, paying particular attention to the comparison with the predictions from CASP8 group FN202.
In CASP9, the top few ranked methods in the function prediction category were by the Zhang group (FN096, FN339) and the firestar group (FN035, FN315). The I-TASSER-FUNCTION and Zhang human methods are again based on model-to-template superposition, to predict the ligand binding site residues. However, at the time of writing no publicly available server has been produced that implements the I-TASSER-FUNCTION method. The firestar (FN315) server, which is publicly available, also utilized sequence conservation from PSI-BLAST [26] profile alignments in order to predict ligand binding site residues [4].
According to the official CASP9 assessment [27], it was difficult to find statistically significant separations between the top 11 or so groups in the function prediction category, which included - Zhang  The FunFOLD method is similar in concept to other groups' methods, such as the Lee group and the Sternberg group methods, which use protein structure superposition of distantly related templates to a modelled protein to identify ligand binding sites. However, the FunFOLD algorithm uses a novel automated method for ligand clustering and identification of binding residues. We also investigate the use of ModFOLDclust2 [28] to select different starting models from amongst the pool of alternative server models and gauge the affect on accuracy.

FunFOLD
The FunFOLD method for predicting ligand binding site residues is based on the concept that, ligand containing templates from the PDB with the same folds (according to TMalign [29]) as the 3D model of the target protein under analysis, may contain similar binding sites. The FunFOLD standalone method takes as its input a 3D model of the protein under analysis and a list of template PDB IDs, which can be obtained from the templates used to build the model. The prototype version of the Fun-FOLD server method, which was developed during the CASP9 prediction season (listed as the group "IntFOLD-FN" -FN425 (server)), queried the FireDB [30] in order to identify if ligands in each template PDB file were biologically relevant. This method worked well, but it relied on the FireDB database, which has been updated since CASP8. Although it was unlikely that the identified biological ligands within templates would have changed since CASP8 (Mike Tress pers. comm.), there was a small chance the list of relevant ligands in FireDB could have been influenced by new related PDB entries since CASP8. In addition, during the CASP9 prediction season our prototype version of the server often had connectivity problems with the database. In order to become independent of querying the FireDB database, a list of biologically relevant ligands was obtained from the Sternberg group, which was originally used by the 3D LigandSite server [15]. Using this list allows the FunFOLD standalone software and the current version of the FunFOLD server to be more reliable, independent on external servers and suitable for benchmarking on both of the CASP datasets.
The FunFOLD algorithm used the TM-align method [29] to superpose each of the template structures containing relevant ligands onto the 3D protein model. Each model-to-template superposition was saved if the resulting TM-score ≥0.4 (TM-scores~0.4-0.6 have been shown to mark the transition phase of significantly related folds [31]). A simple PyMOL script allowed each of the superposition files to be used in order to orientate the original PDB structures with bound ligands correctly relative to the model. The resulting PyMOL superposition of all templates and models was then saved and parsed to leave only the coordinates of the model and relevant ligands.
Ligands were then assigned to clusters using an agglomerative hierarchical clustering algorithm that identified each continuous mass of contacting ligands, thereby indicating putative binding pockets. Ligands were considered to be part of a cluster if any of their atoms were in contact with the continuous mass. Thus, the linkage criteria for clustering were determined by contacts between ligands, which were defined as ≤ the Van der Waals radius of an atom plus 0.5 Ångströms. Once each continuous mass of contacting ligands was identified, the cluster with the largest number of ligands was selected as the location of the most likely binding pocket. The distances between all atoms within the mass of ligands and all atoms within the 3D protein model were then calculated. Again, residues were determined to be in contact with the ligand if the distance between a ligand atom and a residue atom was ≤ the Van der Waals radius plus 0.5 Ångströms.
In order to determine which residues were most likely to bind to the predicted ligand, a "residue voting" procedure was carried out. For a residue to be included in a prediction it must have had at least one contact with 2 or more ligands and at least 25% of the ligands in the cluster. This cut-off was determined during the CASP9 prediction season whilst the method was in development and was based on comparisons of the Fun-FOLD server output with that obtained from state-ofthe-art servers, such as 3D LigandSite, that were publicly available at the time. However, no optimization of this cut-off was carried out prior to testing of the FunFOLD software on the CASP8 and CASP9 data sets. The residue voting system used in FunFOLD can be illustrated in Figure 1. In Figure 1 we can see a cluster that contains a continuous mass of metals (1 ZN, 6 FEs, 1 NI and 1 MG), 1 GDP and 3 PO 4 s. The residues that are only contacting the GDP ligand are not considered as potential ligand binding residues, as the residues only receive 1/13 votes each (7.69%). Whilst the residues that are in contact with GDP and one PO 4 molecule receive 2 votes, they only receive a 15% share of the vote (2/13) and so they are also excluded from the prediction. The final predicted ligand binding residues are therefore: HIS29 with 7/13 votes (53.85%), HIS58 with 11/13 votes (84.62%), ASP59 with 12/13 votes (92.31%) and ASP122 with 5/13 votes (38.46%); all are above the 25% threshold required to be included in the prediction.

Accuracy Benchmarking
The FunFOLD method was benchmarked using only the information concerning templates and models for each target that could be obtained from the CASP8 and Figure 1 The FunFOLD residue voting system. Ribbon diagram of CASP8 target T0470 (PDBID 3djb) which illustrates how the voting system works in FunFOLD. The green sticks represented predicted ligand binding residues (which are the same as the observed ligand binding residues) and the white spheres and sticks represent predicted ligands. CASP9 server predictions. Thus all of the information used was available to predictors during both CASP prediction seasons. The FN prediction files for the 27 targets analysed for function prediction in CASP8 [1], the 30 targets analysed for function prediction in CASP9, and all associated 3D server models were downloaded from the CASP website [32].
For each CASP8 target, the ModFOLDclust2 method [28] was used to select the top 3D model from the server models submitted at CASP8. However, the top model from the IntFOLD-TS server (TS275) [33] was used for testing performance on each of the CASP9 targets. The IntFOLD-TS models were used so that the method is the same as the current FunFOLD server implementation. The top model for each target was then used as the starting model for predicting ligand binding residues. The parent records from each server model were examined in order to construct a list of template PDB IDs for each target that was available at the time of each CASP prediction season. The resulting template list was then filtered using FASTA [34] to ensure it was 70% non-redundant according to pairwise sequence identity. This type of filtering is in line with that carried out during the construction of the non-redundant fold libraries used by many fold recognition servers, such as IntFOLD-TS. Finally, a maximum of 40 templates were used in our analysis for efficiency.
The FunFOLD prediction results were compared against all of the function prediction groups participating in CASP8 and CASP9 using the MCC scores [35] as an indicator of performance. An analysis of the statistical significance between the differences in mean scores was also carried out, similar to that of the official CASP assessments [1,27]. In addition, a new metric, the Binding-site Distance Test (BDT) score [36] was used with the d 0 threshold set to 1Ǻ, in order to stringently assess the accuracy of predictions. The BDT score ranges between 0 (random) and 1 (perfect) and relates to the actual 3D distance between the predicted residues and the observed residues. The BDT score appropriately penalizes both under and over predictions, whilst also considering the 3D distance of predicted residues from the observed binding site. Thus, predictions close to the binding site score higher than more distant predictions [36].
The top function prediction methods in CASP8 were methods by the Lee group (FN407 & FN293) [3] and the Sternberg group (FN202) [22]. In order to more stringently compare the performance of the FunFOLD method against that of the Lee and Sternberg methods, for each prediction we also used the top 3D models that the Lee and the Sternberg groups submitted for their CASP8 server and manual predictions (LEE-S_TS1, LEE_TS1 and Phyre-de-novo_TS1). This analysis was also carried out on the CASP9 data set, using models from both the IntFOLD-TS server (FN275) and the Zhang-server (FN428) methods. This allowed us to gauge how much of the difference in performance was due to the initial model selection. Finally, the analysis was repeated using native structures for each CASP target in order to evaluate performance using "perfect models".

Accuracy Benchmarking
The results of an assessment of binding site predictions, similar to the official CASP8 function prediction assessment carried out by Lopez et al. (2009) [1], are shown in Figure 2, 3, 4, 5, 6 and 7 and in all tables. The Matthews Correlation Coefficient (MCC) and the Binding-site Distance Test (BDT) method are used to measure prediction success and the resulting scores achieved by the different groups are compared with those from the FunFOLD method. The FunFOLD method is shown to outperform all other methods tested at CASP8 according to both the mean per-target MCC scores and BDT scores ( Figure 2). The FunFOLD method is also shown to be competitive with the methods tested at CASP9. The FunFOLD method outperformed the 3DLigandSite methods (FN017, FN415, FN057, FN072), but did not outperform the top server methods, I-TASSER-FUNCTION (FN339) and firestar (FN315), according to mean per-target MCC and BDT scores for both partial and extended binding site analysis ( Figure 3 and Figure 4). (The CASP9 assessors defined partial binding site residues as binding site residues from CASP9 targets that contain ligands, with extended binding sites referring to CASP9 targets that bound ligands that were not the target's natural ligand, thus, the natural ligand was docked into the binding site and the binding site residues predicted.).
The mean MCC Z-scores and BDT Z-scores were also calculated for each group in order to normalize scores across targets and the results are shown in Figure 5   group has made predictions for different numbers of targets. Therefore, a direct comparison of methods from CASP8 and CASP9 on common subsets of targets must be carried out along with an analysis of statistical significance of the differences in performance. The results in Table 1 show that the FunFOLD method does achieve higher mean scores and mean Z-scores than every other CASP8 method, when common subsets of predictions are directly compared.
The difference in mean MCC performance is >28% for each of the true server methods tested at CASP8. In addition, the FunFOLD method shows a 13% improvement over group FN202's CASP8 predictions, a 5% improvement over group FN293's CASP8 predictions  and a 1% improvement over FN407's CASP8 predictions. The improvement is statistically significant for all CASP8 groups tested, except the methods by the Lee group -FN407 and FN293. A similar increase in performance is shown if the BDT score is used for the assessment (Table 1). Tables 2 and 3 show the performance of methods on the subsets of targets; those containing only metal ligands ( Table 2) and those containing non-metal ligands (Table 3). On the CASP8 data set the Fun-FOLD method performs slightly better on the subset of targets containing metal ligands than those  containing non-metals. The FunFOLD method does improve upon the majority of other methods when tested on the non-metal subset; however there is no significant difference compared with the top four methods (Table 3).
In order to produce the FunFOLD results shown in Table 1, 2 and 3, the ModFOLDclust2 quality assessment program [28] was used to select a 3D model for each target and this model was then used as an input for the FunFOLD executable. However, the results in Table 4 are Table 1 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by each CASP8 function prediction group The analysis is based on common subsets of all CASP8 function prediction targets, with a minimum of 10 predictions in common. N, size of the common subset used in the comparison; MCC, Matthews Correlation Coefficient; BDT, the Binding Site Distance Test Score [36].P-value, the p-value for the paired Wilcoxon signed rank sum test using the raw scores; P-value (Z-score), the p-value for the paired Wilcoxon signed rank sum test using the Z-scores; 1 -p-value, 1 minus the pvalue for the paired Wilcoxon signed rank sum test using the raw scores; 1-p-value (Z-score), 1 minus the p-value for the paired Wilcoxon signed rank sum test using the Z-scores. The tables are sorted by the Mean MCC Z-score for groups with the best CASP8 groups at the top. The highest mean scores, mean Z-scores, significant p-values and significant 1-p-values are indicated in bold. Server groups are underlined (FN293 and FN105 were extended deadline server groups; no publicly available server could be found for FN293 at the time of writing; group FN202 now have a publicly available server which is comparable to their CASP8 performance [15]). Table 2 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by each CASP8 function prediction group -as in Table 1   also shown in order to compare the performance of Fun-FOLD if the TS1 3D models submitted by each CASP8 comparison group are used instead. An improvement in MCC score of around 10% is shown compared to group FN202's predictions, which again is shown to be statistically significant. The improvement over group FN293's predictions is~3%, whilst the improvement over group FN407 is increased to~2%. Again, using the BDT score shows that a significant increase over group FN202 is maintained along with an increase in mean scores compared with those obtained by groups FN407 and FN293. In Table 5, the coordinates from the native structures are used in order to determine the difference in performance compared with the top 3 groups. Using native structures instead of models, the FunFOLD method shows a significant improvement over both groups FN293 and FN202. In order to produce the FunFOLD results shown in Tables 6, 7, 8, 9, 10 and 11, the top IntFOLD-TS model for each target was used as an input for the FunFOLD executable. This was done so that the method matched the current implementation of the FunFOLD server. The results indicate that there is no statistically significant difference between the top server methods tested at CASP9 and the FunFOLD method, according to p-values of the raw and z-scores, on the partial (Table 6, 7 and 8) and extended binding sites (Tables 9, 10 and 11). Again, the CASP9 dataset is divided into subsets of targets containing metal and non-metal ligands and the results are shown in Tables 7, 8, 10 and 11. However, there are no notable differences compared with using the full data (Tables 6 and 9).The results actually show fewer significant differences between methods, which is expected due to the reduced sizes of the subsets (Tables 7, 8, 10 and 11).
In line with the analysis performed using the CASP8 dataset, we also tested the FunFOLD method using the CASP9 TS1 3D models obtained from the Zhang-Server, but again we saw no significant difference in performance by using a different starting model (results not shown). The results in Tables 12 and 13 compare the FunFOLD performance using CASP9 native structures with partial and extended binding site definitions respectively. In Table 12, no improvement is seen from using native structures, whilst Table 13 shows that a slight performance increase in mean scores can be achieved, however this is not significant. Table 3 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by each CASP8 function prediction group -as in Table 1 except for CASP8 targets containing non-metal ligands or metal and non-metal ligands in the same binding site

Examples of predictions
If the FunFOLD method is provided with a good quality model and a list of templates that are structurally similar containing biologically relevant ligands, then highly accurate predictions can be achieved ( Figure 8A and 8B; Figure 9 A and 9B). Figure 8A and 8B represents accurate predictions for the CASP8 target T0407 (PDBID 3e38) with an MCC score of 0.941 and BDT score of 0.888. For comparison, the prediction by group FN202 was also very accurate with an MCC = 0.902 and BDT = 0.827, however the MCC and BDT scores for group FN407 were only 0.579 and 0.563, respectively. Group FN293 did not make a prediction for target T0407.
Analysing the prediction for T0407 in more detail, the FunFOLD method correctly predicted the binding site as being a metal binding site and the observed zinc ligands to be in the binding pocket. However, the method also under predicted one residue -HIS157. This under prediction occurred because the aromatic ring of the histidine residue was orientated away from the binding site in the model, and therefore HIS157 received a low vote of 1/13 (7.69%). Figure 8C and 8D illustrate where FunFOLD did not succeed in predicting all of the correct residues within the binding site of CASP8 target T0483 producing many false negatives, as well as incorrectly predicting other residues which are not officially considered to be involved in binding to the ligand (false positives). Despite these errors the FunFOLD prediction for T0483 achieved an MCC = 0.616 and a BDT = 0.503, which were higher scores than those produced by   Table 6 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by the top CASP9 function prediction groups for the partial binding sites analysis The analysis is based on common subsets of all CASP9 targets, with a minimum of 10 predictions in common. N, size of the common subset used in the comparison; MCC, Matthews Correlation Coefficient; BDT, the Binding Site Distance Test Score [36].P-value, the p-value for the paired Wilcoxon signed rank sum test using the raw scores; P-value (Z-score), the p-value for the paired Wilcoxon signed rank sum test using the Z-scores; 1 -p-value, 1 minus the p-value for the paired Wilcoxon signed rank sum test using the raw scores; 1-p-value (Z-score), 1 minus the p-value for the paired Wilcoxon signed rank sum test using the Zscores. The table is sorted by the Mean MCC Z-score for groups with the best CASP9 groups at the top. The highest mean scores, mean Z-scores, significant pvalues and significant 1-p-values are indicated in bold. Server groups are underlined. According to the ModFOLDclust2 [28] predictions, the global model quality score for the model used in the prediction for target T0483 is~0.6 (scores >0.4 indicate the fold has a good probability of being correct). However, the local model quality around the binding site is comparatively bad. Thus, FunFOLD did not predict the observed residues 33, 110, 111, 114, 116, 159 and 162. Residues 33, 110 and 159 simply did not receive enough ligand votes to be considered; with residue 33 falling Table 7 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by the top CASP9 function prediction groups -as in Table 6 Table 8 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by the top CASP9 function prediction groups -as in Table 6 except for CASP9 targets containing non-metal ligands or metal and non-metal ligands in the same binding site  The analysis is based on common subsets of all CASP9 targets, with a minimum of 10 predictions in common. N, size of the common subset used in the comparison; MCC, Matthews Correlation Coefficient; BDT, the Binding Site Distance Test Score [36].P-value, the p-value for the paired Wilcoxon signed rank sum test using the raw scores; P-value (Z-score), the p-value for the paired Wilcoxon signed rank sum test using the Zscores; 1 -p-value, 1 minus the p-value for the paired Wilcoxon signed rank sum test using the raw scores; 1-p-value (Z-score), 1 minus the p-value for the paired Wilcoxon signed rank sum test using the Z-scores. The table is sorted by the Mean MCC Z-score for groups with the best CASP9 groups at the top. The highest mean scores, mean Z-scores, significant p-values and significant 1-p-values are indicated in bold. Server groups are underlined. Table 10 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by the top CASP9 function prediction groups -as in Table 9 except for CASP9 targets containing only metal ligands, with a minimum of 6 predictions in common  When a more detailed analysis of the predictions for T0635 was carried out, the FunFOLD method correctly predicted the binding site as being a metal binding site, the correct location of the binding site and the three binding site residues. However, the method over-predicted one residue THR69. This over prediction occurred because the residue was in contact with three ligands (SO 4 -2 and CL-1), which were not well superposed onto the ligand cluster receiving a vote of 3/9 (33.33%). Figure 9C and 9D, showing CASP9 target T0582 (PDB ID 3o14), again represents a case where FunFOLD did not succeed in predicting all of the binding site residues and incorrectly predicted other residues that were not part of the official binding site. Despite these errors FunFOLD achieved an MCC = 0.395 and BDT = 0.348. The Fun-FOLD method correctly predicted residues 60 and 64, but over-predicted residues 59, 115, 116, and 117 and failed to predict binding site residues 58 and 94. Residue 59 received 2/8 votes, residues 115 and 116 received 8/8 votes and residue 117 received 7/8 votes. The under- Table 11 The mean MCC and BDT raw scores and Z-scores obtained by FunFOLD are compared with those obtained by the top CASP9 function prediction groups -as in Table 9 except for CASP9 targets containing non-metal ligands or metal and non-metal ligands in the same binding site  Table 12 What is the effect on FunFOLD performance if the native structures are used for each target? A repeat comparison of FunFOLD performance versus the top 3 groups for at CASP9 using partial binding site definitions. predicted residues 58 received 1/9 votes, while residue 94 was not predicted to be in contact with the ligand cluster. According to ModFOLDclust2 [28] predictions, the global model quality score for the model used in the prediction for CASP9 target T0582 = 0.466, which is a reasonable global model quality score. However yet again, the local model quality around the binding site for this model is comparatively bad. This resulted in both over and under-predictions, with residues 115, 116 and 117 being poorly modelled.

Discussion
In this study we describe a novel method, FunFOLD, for the prediction of ligand binding site residues, which shows a significant improvement over all of the true server methods that were tested at CASP8, as well as the predictions from one of the top manual groups -FN202. In addition, the method was tested on the CASP9 set and was found to be competitive with the top server groups and statistically inseparable in performance from most of the top manual groups. Whilst a prototype version of a server was tested in the CASP9 function prediction category (IntFOLD-FN -FN425), we have since improved the reliability of the FunFOLD server and the current automated implementation more closely resembles the performance of our manual function predictions (McGuffin -FN094).
The performance of all methods was measured using the standard Matthews Correlation Coefficient (MCC) [35], on both the CASP8 and CASP9 function (FN) targets. The mean MCC Z-scores for FunFOLD, were shown to be higher than the mean MCC Z-scores for all other methods tested at CASP8 ( Table 1). The Fun-FOLD predictions were an improvement upon those made by the Lee manual group (+1% MCC), the Lee server group (+4%) and the Sternberg group (+13%), which were the top groups tested during CASP8. The improvement over the Sternberg group on the CASP8 data set is statistically significant at the 99% level according to the Wilcoxon Signed rank sum test. The FunFOLD method is also shown to be competitive with all methods tested at CASP9 and shows no statistically significant difference with the top ranking server methods: I-TASSER-FUNCTION and firestar [4] (Tables 4, 5, 6 and 7). However, a statistically significant improvement was seen over the 3DLigandSite methods that were tested at CASP9, at the 99% level according to the Wilcoxon Signed rank sum test.
Intuitively, the quality of the starting 3D model that is used for the prediction of the ligand binding site will have an effect on the accuracy of results. Thus, in this paper for the CASP8 analysis we used ModFOLDclust2 model quality assessment method [28] to select high quality input models. However, we also tested FunFOLD using alternative 3D models that were submitted by the top function prediction groups and found that the improvement in MCC scores was maintained -the Lee manual method was improved upon by +2%, Lee server by +3% and the Sternberg group by +10%, which was again shown to be statistically significant. In addition, when native structures were used, as might be expected, the improvement in performance was maintained.
On the CASP9 data set we tested starting models from both our IntFOLD-TS server and from the better performing Zhang-Server, however again no significant difference in performance was seen. Furthermore, whilst using the native structures improved performance marginally in some cases, in the majority of cases the increases were not significant. According to these benchmarks, the results indicate that selecting an alternative starting model does not have much of an influence. Therefore, where there has been a significant improvement over other groups, this must have arisen from the FunFOLD algorithm itself, rather than from improved initial model selection.
One of the top ligand binding site prediction servers tested at CASP9 was the I-TASSER-FUNCTION server method (Zhang group), which also relies on 3D modelto-template superposition, for binding site residue prediction. However, in addition to global model to template superposition, the I-TASSER-FUNCTION method also carries out local alignment of the proposed binding site region, in order to improve local superposition. In light of the results shown in Figures 8 and 9, local superposition scores, such as those used in I-TASSER-FUNCTION, could be adopted which may help to improve future versions of FunFOLD. However, in our benchmarking on the CASP9 set we could measure no significant increase in performance of the I-TASSER-FUNCTION method over FunFOLD.
At the time of writing the I-TASSER-FUNCTION server is currently not publicly available. Therefore, arguably the top ranking publicly available server tested at CASP9 was firestar [4], which predicts residue conservation in target sequences based on PSI-BLAST alignments to the large catalogue of sites in PDB structures contained in the FireDB [30]. However, the firestar method is not currently available as a standalone program and again we could measure no significant performance gain over the FunFOLD method.
The FunFOLD method is clearly also competitive with the top manual groups that were tested in CASP8, however no manual intervention is required for our approach. Furthermore, the method significantly outperforms each of the true server methods tested at CASP8. Whilst the Lee server group (FN293) was mostly automated, the group was counted in the extended deadline category and the authors reported a small amount of human intervention [3], hence in this study we have considered group FN293 in CASP8 as a non-server group.
The FunFOLD method also significantly outperformed one of the top manual groups (FN202) and since CASP8, the authors have developed a fully automated publicly available server, called 3DLigandSite [15], variations of which participated in CASP9. The authors reported that the predictions from the 3DLigandSite server were comparable in performance to their manual predictions at CASP8 [15], therefore in this study we can consider predictions from group FN202 to be the gold standard for fully automated ligand binding residue prediction for testing on the CASP8 data set.
The standalone FunFOLD software uses similar input data to the version of the FINDSITE [11] software that is currently available to download. For both programs, a 3D model and a list of templates is required, however, the methods differ in the output produced; the Fun-FOLD method outputs binding residue predictions in CASP FN format, where as the FINDSITE method outputs a list of putative locations for the centre of each binding pocket i.e. locations in 3D space rather than binding site residues. Thus, the FunFOLD software cannot be directly compared to the FINDSITE software as they both produce different output. Furthermore, the FINDSITE dataset [11] cannot be directly used in our analysis, as the location of the binding site residues for each template is not defined and would have to be predicted, adding potential for errors in methods comparison. However, the latest FINDSITE-DBDT method did compete in CASP9, but to our knowledge the server is not publicly available. The current implementation of FunFOLD, the prototype version of the server (FN425) and our manual prediction group (FN094) performed statistically significantly better than the FINDSITE-DBDT method on the CASP9 data set [27].
The FunFOLD method uses a similar procedure to that carried out by the most successful prediction groups participating CASP; the 3D input model is superposed onto structurally similar ligand containing PDB files and the putative binding residues are then determined. However, the FunFOLD method uses a novel, fully automated approach for both identifying clusters of ligands and determining putative binding site residues. The novel ligand residue voting method used in FunFOLD reduces the rate of over predictions, which appears to be one of the main problems with many structure based approaches.
There are several caveats to consider when benchmarking methods for the prediction of protein ligand binding residues. Firstly, uncertainties can arise if there are several ligand binding sites within a protein either predicted or observed. For this analysis we only considered the binding residues that were defined by the CASP assessors and for each target only one binding site was defined. Secondly, the inherent flexibility of proteins may make it difficult to determine which residues are actually in contact with the ligand. This is further exacerbated if the binding site is located in a disordered region of a protein. Thirdly, the definition of the distance cut-off for a residue that is in contact with a ligand is the Van der Waals radii + 0.5 Å, but this definition is subjective. Finally, there may be ambiguity about whether the ligand bound to the solved structure is the protein's ideal ligand. Thus a ligand used in a prediction may not necessarily be incorrect. Indeed one binding site may bind more than one ligand.
Each of these issues creates difficulties for the fair assessment of methods as defining a list of observed binding site residues may be subjective. Some of these issues were addressed by the exclusion of "neutral residues" in the CASP 8 analysis [21], which have also been excluded in this analysis for the CASP8 data (the CASP8 assessors defined neutral residues as those which would potentially bind to an alternative ligand, but which were not observed binding to the alternative ligand within the solved 3D structure). In CASP9 the assessors used two classifications for binding sites -partial and extended. From the official analysis there was not a significant difference in assessment, if methods were analyzed using either partial or extended binding site definitions. However, the use of the MCC statistic for assessment does compound some of these issues and the prediction of a binding residue that is defined as incorrect, but which is nevertheless close to the observed binding pocket will therefore obtain the same score as a random incorrect prediction.
Therefore we recently proposed a novel scoring method -the Binding-site Distance Test (BDT) score, which addresses some of the shortcomings of using MCC scores, whilst maintaining the advantages [36]. Predicted residues that are close to the observed residues will obtain a higher BDT score than more distant predictions. The BDT score was used by the CASP9 assessors, in addition to the MCC score, to investigate if it caused a significant difference in the rankings of the methods but no significant changes in the grouping of top methods were reported [27]. In this analysis, the list of top groups identified using BDT scoring is again roughly in agreement with the top groups obtained using the MCC scoring; however the ranking of some of the less accurate methods does appear to change. Using the mean BDT score, we also see higher scores for FunFOLD compared with all methods tested on CASP8 on equivalent subsets of data and the difference is again significant for all but the top two manual groups (Figure 2, Figure 6 and Table 1). When the CASP9 predictions are analysed using the mean BDT scores the FunFOLD method is ranked below the top two server methods I-TASSER-FUNCTION and firestar [4], however again the difference in performance is not significant.
An obvious way of improving future versions of the FunFOLD software would be to optimize the voting threshold for the inclusion of predicted residues. Furthermore, predicting the ligand binding site residues on multiple models and then pooling the results may help to increase accuracy. The specific physiochemical properties of the residues could be studied, such as charge and polarity, and residues exhibiting more favorable physiochemical properties for binding to a ligand could be weighted more heavily in the residue voting process. A prediction of the enzymes functional family could be performed, with the prediction then used to weight residues that bind to ligands that occur more often within these families, more heavily in the voting process. In addition to undertaking global model-totemplate superpositions, local superposition of the binding site regions could also be carried out to increase accuracy. As previously mentioned this was carried out by one of the top server groups at CASP9 (I-TASSER-FUNCTION -FN339).
A general function prediction quality assessment tool could also be developed in order to weight predictions or provide probabilities scores for individual residues. Features of the quality assessment might include: the type of ligands within the cluster, with clusters containing a large number of similar ligands receiving a higher score; the distance of the superposed ligands from the centroid ligand within the cluster, with clusters containing ligands that are superposed perfectly receiving higher scores; the global and local model quality scores of the starting 3D model could also be factored into the analysis using our Mod-FOLD methods [28,37], with residues in poorly modeled regions down weighted; the probability of bound residues occurring in disordered regions could also be considered by integrating our DISOclust results [38], with residues in regions of high disorder receiving an appropriate weighting. In future, each of these features could be integrated into an automated quality assessment tool in order to produce more appropriate confidence scores which could be used for ranking binding residue predictions.

Conclusion
The FunFOLD software implements a competitive method for the prediction of protein binding site residues, which can also be used to determine the putative ligands interacting with a protein. The method is available as a standalone program, which can be used to predict binding residues and ligands based on user supplied 3D models and template lists. We also provide access to FunFOLD via a simple web server, which only requires users to supply an amino acid sequence.