Improving the prediction of protein binding sites by combining heterogeneous data and Voronoi diagrams
© Segura et al; licensee BioMed Central Ltd. 2011
Received: 10 May 2011
Accepted: 23 August 2011
Published: 23 August 2011
Protein binding site prediction by computational means can yield valuable information that complements and guides experimental approaches to determine the structure of protein complexes. Predictions become even more relevant and timely given the current resolution of protein interaction maps, where there is a very large and still expanding gap between the available information on: (i) which proteins interact and (ii) how proteins interact. Proteins interact through exposed residues that present differential physicochemical properties, and these can be exploited to identify protein interfaces.
Here we present VORFFIP, a novel method for protein binding site prediction. The method makes use of a broad set of heterogeneous data and a definition of residue environment based on Voronoi Diagrams, which are integrated by a two-step Random Forest ensemble classifier. Four sets of residue features (structural, energy terms, sequence conservation, and crystallographic B-factors) used in different combinations together with three definitions of residue environment (Voronoi Diagrams, sequence sliding window, and Euclidean distance) have been analyzed in order to maximize the performance of the method.
The integration of different forms of information, such as structural features, energy terms, evolutionary conservation and crystallographic B-factors, improves the performance of binding site prediction. Including the information of neighbouring residues also improves the prediction of protein interfaces. Among the different approaches that can be used to define the environment of exposed residues, Voronoi Diagrams provide the most accurate description. Finally, VORFFIP compares favourably to other methods reported in the recent literature.
The experimental characterization of the structure of protein complexes by X-ray crystallography, Nuclear Magnetic Resonance (NMR) or Electron Microscopy (EM) cannot keep pace with the ever-expanding volume of interactome data. Moreover, weak or transient interactions are very difficult to crystallize, NMR has clear limitations with regard to the size of the protein complexes that are tractable, and EM often does not provide adequate resolution. Computational tools, such as protein binding site predictions and protein docking, offer alternatives to describe protein interactions by providing theoretical structural models of protein complexes (e.g. ). Indeed, computational and experimental methodologies are complementary rather than mutually exclusive; for example protein binding site predictions can guide mutational analyses aimed at charting protein interfaces.
Residues located at protein interfaces present distinct physicochemical properties. Hydrophobic residues predominate in permanent complexes, although charged residues often form part of interfaces [2–5]. Interface residues also have both higher solvent accessibilities [2, 6] and lower crystallographic B-factors  than those seen in exposed residues not involved in protein interfaces. Other studies have shown that interface residues are evolutionarily conserved [8–10] although this has been questioned in several reports [11, 12]. Finally, interface residues are less prone to sample alternative side-chain rotamers to minimize entropic cost upon complex formation [13–15].
The features described above can be used individually or in combination to predict protein interfaces (see  for a recent review). Methods used to predict protein binding sites include those based on patch analysis and those based on Neural Networks, including methods developed by Fariselli et al., Ofran and Rost, and Porollo and Meller. The latter includes the neighbourhood or environment of residues as input data, defining the environment as the residues enclosed within a Euclidean distance threshold, which results in more accurate predictions. Neuvirth et al. proposed a method that utilizes secondary structure, hydrophobicity and experimental B-factors among other structural features. A support vector machine integrating six structural and chemical features was proposed by Bradford et al. and later refined using Bayesian Networks. A parametric score function based on sequence conservation and structural information has also been proposed. More recently, Sikić et al. proposed a Random Forest ensemble classifier to predict interface residues using a 9-residue sliding window that includes sequence and structural information.
Despite these existing methods, the accurate generic prediction of protein interfaces remains unresolved. The lack of a clear understanding of protein-protein interactions hinders the development of more accurate methods, and thus new approaches and ideas are needed. Also, as new experimental data emerge, new prediction algorithms can be devised that outperform their predecessors, thus providing better tools for the scientific community. Here, we describe a novel, structure-based, computational method: Voronoi Random Forest Feedback Interface Predictor (VORFFIP). VORFFIP is a two-step Random Forest (RF) ensemble classifier that integrates a set of input variables accounting for structural features, energetic terms, evolutionary conservation, and crystallographic B-factors. In addition, VORFFIP uses Voronoi Diagrams (VDs) to define the local environment of exposed residues; this provides a more accurate description of the effect of the neighbourhood than can be provided by either the sliding window or Euclidean distance approaches. VDs have been used in a number of applications to study atomic packing of proteins, define protein interfaces, identify residue-residue contacts, define pockets in proteins or define molecular surfaces. However, VDs have not been used in the context of protein binding site predictions to define the environment of exposed residues.
The performance of VORFFIP has been comprehensively assessed using different combinations of input data and environment definitions, identifying the combinations that lead to the best performance. Finally, VORFFIP outperformed other prediction methods under similar benchmarking conditions.
Results and Discussion
The aim of this work was to identify which features, environment descriptors and their combinations yielded the best results when predicting protein binding sites. This would highlight the best approach to distinguish exposed residues that are likely to be part of a protein interface from those that are not. To that end, a comprehensive study using VORFFIP was performed evaluating the results against widely used performance indicators. Unless specifically noted, all the results were obtained in a 5-fold cross validation test using the B100 dataset. The B100 dataset was derived from Benchmark 3.0  after discarding antigen-antibody complexes (see additional file 1, Material and Methods section for more information). Protein complexes in the B100 dataset have two representatives: the bound and unbound conformations. The bound conformation was only used to define interfaces (i.e. defining the residues located at protein interfaces); however, training and predictions were performed on the unbound conformations. In this manner, it was ensured that no information from the bound conformation was used during the training and prediction.
One-step vs. two-step RF
Results show that the performance of VORFFIP improves when the second-step RF is included. The ROC curve obtained with the second-step RF showed higher sensitivity at every false-positive rate (Additional file 1, Figure S1) and the difference in AUC values was statistically significant (p-value < 0.01). Both ROC curves were derived using structure, energy, conservation and B-factors together with VDs to account for the neighbourhood. However, the same behaviour was observed when using individual sets or combinations of features, such as structure and conservation, and other environment descriptors, such as the sliding window (data not shown). In terms of precision (P), recall (R), F1-score and Matthews correlation coefficient (MCC), the second-step RF also produced better results (first-step RF vs. second-step RF): R, 0.50 vs. 0.56; P, 0.36 vs. 0.45; MCC, 0.34 vs. 0.42; F1-score, 0.41 vs. 0.49. Thus, the second-step RF and score-derived metrics such as $es_i$ (9) corrected false positives and recovered missed hits, improving the performance of VORFFIP. Unless otherwise noted, the two-step RF was used as the default predictor.
Improving predictive power by combining heterogeneous data and using Voronoi Diagrams
AUC values for different combinations of features and environment definitions
A similar trend, in which the combination of all features and VDs gave the best scores, was observed with other performance indicators such as MCC, R, P and F1-score (Additional file 1, Table S3). MCC scores are of special interest because of the imbalance between positive and negative cases: exposed residues that do not belong to an interface far outnumber those that do. Both MCC and F1-scores improved when all the sources of information were combined and VDs were used to account for the environment, resulting in better and more balanced predictions. For the sake of completeness, Euclidean distance cut-offs from 5 to 20 Å (in 5 Å steps) were tested. The optimal performance was achieved with cut-offs between 10 and 15 Å, in agreement with a previous observation. However, VORFFIP's predictions were still more accurate when VDs were used to account for the local environment (Additional file 1, Table S4).
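The two threshold-based environment definitions that VDs are compared against can be illustrated with a minimal Python sketch. It is not the VORFFIP implementation: the function names are hypothetical, and it assumes each residue is represented by a single point (for example a C-alpha atom or a side-chain centroid), a choice that is not specified in the text above. The default window of 9 residues follows the sliding window described for Sikić's method, and the cut-offs follow the 5 to 20 Å range tested here.

```python
import math

def window_neighbours(i, n_residues, window=9):
    """Sequence neighbours of residue i inside a centred sliding window."""
    half = window // 2
    lo, hi = max(0, i - half), min(n_residues - 1, i + half)
    return [j for j in range(lo, hi + 1) if j != i]

def distance_neighbours(i, coords, cutoff=10.0):
    """Spatial neighbours of residue i within `cutoff` angstroms.

    Assumption: coords[k] is one representative (x, y, z) point per residue.
    """
    return [j for j, xyz in enumerate(coords)
            if j != i and math.dist(coords[i], xyz) <= cutoff]

# Hypothetical toy chain: five residues on the x axis, 4 angstroms apart.
coords = [(4.0 * k, 0.0, 0.0) for k in range(5)]
for cutoff in (5.0, 10.0, 15.0, 20.0):
    print(cutoff, distance_neighbours(2, coords, cutoff))
```

Both definitions depend on an arbitrary parameter (window size or cut-off), which is the limitation that the VD-based definition is meant to avoid.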
Effect of the environment descriptors
Comparing VORFFIP with previous studies
The algorithm was compared against three recently published methods: SPPIDER, WHISCY and the method developed by Sikić et al. In each case, VORFFIP was trained and tested following the same procedure described in the previous studies and using the same datasets. Also, the definition of interface residues was the same as described in the original publications (see Additional file 1, Datasets section, for more information).
Comparing SPPIDER and VORFFIP
Comparing WHISCY, WHISCYMATE and VORFFIP
Finally, Sikić's method relies on a 9-residue sliding window that includes sequence, secondary structure and several structural features as input variables to an RF classifier. Sikić's method was benchmarked using the O333 dataset in a 3-fold cross validation test. Additional file 1, Figure S3 shows a precision versus recall plot, similar to the one reported in the original publication. As shown, VORFFIP achieved a higher precision at any recall rate (except for the first-step RF at recall rates lower than 0.3).
In this work we present VORFFIP, a novel computational tool for the prediction of protein binding sites. Several studies of protein complexes with known crystal structures have shown that residues at interfaces present unique properties (see Introduction). These properties, which provide information that is specific to structural features, energy terms, evolutionary conservation and crystallographic B-factors of individual residues, have predictive power. However, combining this range of individual features by means of an RF ensemble classifier clearly improved prediction; the combination of information is more powerful than the individual pieces of information. Moreover, the second-step RF further enhanced the performance of the method. All statistical measures used to gauge performance improved from the first-step to the second-step RF; thus, incorporating the score values obtained by the first-step RF led to better predictions, probably because protein binding sites are formed from contiguous surface patches.
Accounting for the environment of residues also enhanced the accuracy of the prediction. Although this observation is not new, the use of VDs in the framework of protein binding site prediction is novel. VDs not only provide a better approach to define protein interfaces (as shown by Cazals et al.) but also a sharper and more accurate definition of the local environment of exposed residues, as shown by the results presented here. VORFFIP and VDs delivered the best predictions in comparison to other approaches to define the local environment of residues, such as Euclidean distances (spheres) or a sliding window. Moreover, there are clear advantages in using VDs: no cut-offs (distances or window sizes) are required and, given the nature of VDs, it is easy to implement a weighting system based on the number of contacts (see Methods section). Thus, VDs offer a more natural and rational approach for defining the structural environment of residues.
Significant differences were observed between the precision and recall values in the SPPIDER and WHISCY tests. While SPPIDER was trained and tested using a set of protein complexes, i.e. proteins in bound conformation, WHISCY used protein complexes from Benchmark set version 1.0 and version 2.0. Benchmark sets have two representations for each protein complex, unbound and bound; predictions were performed only on the unbound version to ensure no bound information was used during prediction. It was found that crystallographic B-factors were very good predictors on the SPPIDER dataset whereas their performance decreased markedly when using the WHISCY dataset. This observation highlights the need for reliable datasets, such as the Benchmark series, to properly and fairly benchmark computational methods.
In summary, this paper describes a new computational tool for the prediction of binding sites. VORFFIP is a two-step RF ensemble classifier that relies on a set of input variables that accounts for several aspects of residue and environment-based information. VORFFIP compared favourably against other reported methods. VORFFIP is accessible at http://www.bioinsilico.org/VORFFIP.
Dataset and definition of protein interfaces
Five datasets of protein complexes, termed O333, S435, S149, W025 and B100, were used: B100 for benchmarking, and O333, S435, S149 and W025 to compare with previous methods. In the case of the O333, S435, S149 and W025 datasets, different definitions of protein interfaces were used depending on the description in the original publication. Full details are given in additional file 1, Material and Methods section. Briefly, the O333 set corresponds to that compiled by Ofran and Rost and used by Sikić et al. S435 and S149 correspond to the two sets derived by Porollo and Meller that were used to train and test SPPIDER. The dataset W025 corresponds to both Benchmark 1.0 and 2.0 sets and was used to benchmark WHISCY. Dataset B100 corresponds to Benchmark 3.0 after discarding antigen-antibody complexes and was used as an independent set to benchmark VORFFIP under different conditions such as input data and environment definitions. Datasets can be downloaded from http://www.bioinsilico.org/VORFFIP/datasets.html.
Defining the environment of exposed residues: Voronoi Diagrams (VDs)
Thus, the advantages of using VDs to define the environment are twofold: (i) thresholds are not required, as contacts are based on visibility between residues rather than Euclidean distances or a sliding window; and (ii) a weighting factor (2) can be defined based on the number of contacts between residues.
VORFFIP prediction algorithm
The VORFFIP algorithm consists of two consecutive RF ensemble classifiers, named first-step and second-step RFs. In the first-step RF, residue and environment-based features are calculated and used as input variables. The scores yielded by the first-step RF are then decomposed into a number of new input variables that, together with the previously calculated features, are passed to the second-step RF to calculate the final scores (see Figure 1 for an overview and schematic representation of the method). The randomForest package implemented in R (http://www.r-project.org/) was used to train and compute decision trees.
First-step Random Forest
The input variables for the first-step RF include residue and environment-based information.
where K is an index set of all the features listed in additional file 1, Material and Methods section.
The neighbouring residues were identified using VDs as described previously. Three metrics were devised to account for the environment: the EF vector (5), the Contact Description Vector (CDV) (6); and the Environment Description Matrix (EDM) (7).
where $a_j$ is a neighbouring residue of type $type_l$ (e.g. Ala, Cys, etc.).
and where $a_r$ is a residue of type $type_l$, $a_s$ is a residue of type $type_k$, and $N_{rs}$ is the number of contacts between residues $a_r$ and $a_s$.
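As an illustration of how type-based environment descriptors of this kind could be assembled, the following Python sketch builds a 20-element vector of neighbour counts by residue type and a 20 x 20 matrix of contact counts by residue-type pair. Equations (6) and (7) are not reproduced in this text, so the exact formulas are assumptions: the function names, the data layout, the use of raw counts and the symmetric filling of the matrix are all hypothetical choices made for the example.

```python
from collections import Counter

# The 20 standard amino acids, three-letter codes in alphabetical order.
AMINO_ACIDS = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS",
               "ILE", "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP",
               "TYR", "VAL"]
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def contact_description_vector(neighbour_types):
    """Count the neighbouring residues a_j of each type (assumed form of CDV)."""
    counts = Counter(neighbour_types)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]

def environment_description_matrix(types, contacts):
    """Sum contacts N_rs by residue-type pair (assumed form of EDM).

    types:    dict mapping residue id -> three-letter residue type
    contacts: dict mapping an (r, s) pair, listed once, -> N_rs
    Assumption: the matrix is filled symmetrically.
    """
    edm = [[0] * len(AMINO_ACIDS) for _ in AMINO_ACIDS]
    for (r, s), n_rs in contacts.items():
        l, k = INDEX[types[r]], INDEX[types[s]]
        edm[l][k] += n_rs
        if l != k:
            edm[k][l] += n_rs
    return edm

# Hypothetical environment: three residues and two contacting pairs.
types = {1: "ALA", 2: "LEU", 3: "ALA"}
contacts = {(1, 2): 3, (1, 3): 2}
cdv = contact_description_vector(types.values())
edm = environment_description_matrix(types, contacts)
```

Fixed-length descriptors of this form give every residue the same number of input variables regardless of how many neighbours it has, which is what a classifier such as an RF requires.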
Second-step Random Forest
As a result of the first-step RF, a score is assigned to each residue. The second-step RF makes use of this information in the form of score values ($s_i$), environmental scores ($es_i$) (9), the contact score vector (CSV) (10), and maximum-minimum score (Mms) values (11), which are added to the variables listed above to output a final score (Figure 2).
where $s_j$ is the score assigned by the first-step RF to residue $a_j$, normalized by $c_{ij}$ (2).
where $s_j$ is the score assigned by the first-step RF to the neighbouring residue $a_j$.
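The feedback from the first-step scores into the second-step input can be sketched as follows. Since equations (2), (9) and (11) are not reproduced in this text, the sketch rests on stated assumptions: the weighting factor $c_{ij}$ is taken to be the fraction of the contacts of residue i that involve neighbour j, $es_i$ is taken to be the resulting weighted sum of neighbour scores, and Mms is taken to be the maximum and minimum neighbour score. The function names are hypothetical and the CSV is omitted.

```python
def environment_score(i, scores, contacts):
    """Contact-weighted sum of first-step scores over the neighbours of i.

    contacts[i] maps each neighbour j to the number of contacts N_ij.
    Assumption: c_ij = N_ij / sum_k N_ik.
    """
    total = sum(contacts[i].values())
    if total == 0:
        return 0.0
    return sum(n_ij / total * scores[j] for j, n_ij in contacts[i].items())

def max_min_scores(i, scores, contacts):
    """Highest and lowest first-step score among the neighbours of i."""
    neighbour_scores = [scores[j] for j in contacts[i]]
    if not neighbour_scores:
        return 0.0, 0.0
    return max(neighbour_scores), min(neighbour_scores)

def second_step_features(i, first_step_features, scores, contacts):
    """Append the score-derived variables to the first-step feature vector."""
    highest, lowest = max_min_scores(i, scores, contacts)
    return list(first_step_features) + [
        scores[i], environment_score(i, scores, contacts), highest, lowest]

# Hypothetical first-step scores and Voronoi contact counts for three residues.
scores = {0: 0.2, 1: 0.8, 2: 0.4}
contacts = {0: {1: 3, 2: 1}, 1: {0: 3}, 2: {0: 1}}
features = second_step_features(0, [1.0], scores, contacts)
```

In this toy case residue 0 has a low score of its own but high-scoring neighbours, so the environment score is larger than its own score; variables of this kind are how a second classifier can recover residues missed in the first step.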
Assessing the performance of the method
Five widely used statistical measures were used to evaluate the performance of the method: Recall (R), Precision (P), the Matthews Correlation Coefficient (MCC), Q2 quartile, and the F1 score. The statistical analysis of ROC curves was performed using the StAR program. Further information is given in additional file 1, Material and Methods section.
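For reference, four of these measures can be computed from the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) using their standard definitions, as in the Python sketch below. The function name and the example counts are hypothetical, and Q2 is omitted because its definition is given only in additional file 1.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 score and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"P": precision, "R": recall, "F1": f1, "MCC": mcc}

# Hypothetical imbalanced case: 100 interface residues among 1000 exposed ones.
metrics = binary_metrics(tp=50, fp=50, tn=850, fn=50)
```

Because MCC uses all four counts, it is less sensitive than precision or recall alone to the imbalance between interface and non-interface residues.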
NFF thanks Dr. Gendra for critical reading and insightful comments on the manuscript, and Ms Martina and Ms Daniela G Fernandez for continuing inspiration and motivation. This work was supported by the Research Councils United Kingdom (RCUK) Academic Fellow scheme (to NFF) and an internal scholarship awarded by the Leeds Institute of Molecular Medicine (to JM).
References
1. Prasad NK, Vindal V, Kumar V, Kabra A, Phogat N, Kumar M: Structural and docking studies of Leucaena leucocephala Cinnamoyl CoA reductase. J Mol Model 2010, 17: 533–541.
2. Jones S, Thornton JM: Principles of protein-protein interactions. Proc Natl Acad Sci USA 1996, 93: 13.
3. Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285: 2177.
4. Larsen TA, Olson AJ, Goodsell DS: Morphology of protein-protein interfaces. Structure 1998, 6: 421–427. 10.1016/S0969-2126(98)00044-6
5. Glaser F, Steinberg DM, Vakser IA, Ben-Tal N: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins 2001, 43: 89. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H
6. Chen H, Zhou H-X: Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. Proteins 2005, 61: 21–35. 10.1002/prot.20514
7. Jones S, Thornton JM: Protein-protein interactions: a review of protein dimer structures. Prog Biophys Mol Biol 1995, 63: 31–65. 10.1016/0079-6107(94)00008-W
8. Wang B, Chen P, Huang D-S, Li J-j, Lok T-M, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2006, 580: 380–384. 10.1016/j.febslet.2005.11.081
9. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257: 342.
10. Yan C, Dobbs D, Honavar V: A two-stage classifier for identification of protein-protein interface residues. Bioinformatics 2004, 20(Suppl 1):i371–378. 10.1093/bioinformatics/bth920
11. Grishin NV, Phillips MA: The subunit interfaces of oligomeric enzymes are conserved to a similar extent to the overall protein sequences. Protein Sci 1994, 3: 2455–2458. 10.1002/pro.5560031231
12. Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES: Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 2004, 13: 190–202. 10.1110/ps.03323604
13. Liang S, Zhang C, Liu S, Zhou Y: Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 2006, 34: 3698–3707. 10.1093/nar/gkl454
14. Cole C, Warwicker J: Side-chain conformational entropy at protein-protein interfaces. Protein Sci 2002, 11: 2860–2870.
15. Fleishman SJ, Khare SD, Koga N, Baker D: Restricted sidechain plasticity in the structures of native proteins and complexes. Protein Sci 2011, 20: 753–757. 10.1002/pro.604
16. Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML: Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009, 10: 233–246.
17. Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272: 133.
18. Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein-protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 2002, 269: 1356–1361. 10.1046/j.1432-1033.2002.02767.x
19. Ofran Y, Rost B: Predicted protein-protein interaction sites from local sequence information. FEBS Lett 2003, 544: 236–239. 10.1016/S0014-5793(03)00456-3
20. Porollo A, Meller J: Prediction-based fingerprints of protein-protein interactions. Proteins 2007, 66: 630–645.
21. Neuvirth H, Raz R, Schreiber G: ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol 2004, 338: 181–199. 10.1016/j.jmb.2004.02.040
22. Bradford JR, Westhead DR: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 2005, 21: 1487–1494. 10.1093/bioinformatics/bti242
23. Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR: Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol 2006, 362: 365–386. 10.1016/j.jmb.2006.07.028
24. de Vries SJ, van Dijk ADJ, Bonvin AMJJ: WHISCY: what information does surface conservation yield? Application to data-driven docking. Proteins 2006, 63: 479–489. 10.1002/prot.20842
25. Sikic M, Tomic S, Vlahovicek K: Prediction of protein-protein interaction sites in sequences and 3D structures by random forests. PLoS Comput Biol 2009, 5: e1000278. 10.1371/journal.pcbi.1000278
26. Tsai J, Gerstein M: Calculations of protein volumes: sensitivity analysis and parameter database. Bioinformatics 2002, 18: 985–995. 10.1093/bioinformatics/18.7.985
27. Cazals F, Proust F, Bahadur RP, Janin J: Revisiting the Voronoi description of protein-protein interfaces. Protein Sci 2006, 15: 2082–2092. 10.1110/ps.062245906
28. Dupuis F, Sadoc JF, Jullien R, Angelov B, Mornon JP: Voro3D: 3D Voronoi tessellations applied to protein structures. Bioinformatics 2005, 21: 1715–1716. 10.1093/bioinformatics/bth365
29. Edelsbrunner H, Facello M, Liang J: On the definition and the construction of pockets in macromolecules. Discrete Appl Math 1998, 88: 83–102. 10.1016/S0166-218X(98)00067-5
30. Liang J, Edelsbrunner H, Fu P, Sudhakar PV, Subramaniam S: Analytical shape computation of macromolecules: I. Molecular area and volume through alpha shape. Proteins 1998, 33: 1–17.
31. Hwang H, Pierce B, Mintseris J, Janin J, Weng Z: Protein-protein docking benchmark version 3.0. Proteins 2008, 73: 705–709. 10.1002/prot.22106
32. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C: SCOP: a structural classification of proteins database. Nucleic Acids Res 2000, 28: 257–259. 10.1093/nar/28.1.257
33. Chen R, Mintseris J, Janin J, Weng Z: A protein-protein docking benchmark. Proteins 2003, 52: 88–91. 10.1002/prot.10390
34. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-Protein Docking Benchmark 2.0: an update. Proteins 2005, 60: 214–216. 10.1002/prot.20560
35. Ofran Y, Rost B: Analysing six types of protein-protein interfaces. J Mol Biol 2003, 325: 377–387. 10.1016/S0022-2836(02)01223-8
36. Barber CB, Dobkin DP, Huhdanpaa H: The Quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software 1996, 22: 469–483. 10.1145/235815.235821
37. Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002, 2: 18–22.
38. Vergara IA, Norambuena T, Ferrada E, Slater AW, Melo F: StAR: a simple tool for the statistical comparison of ROC curves. BMC Bioinformatics 2008, 9: 265. 10.1186/1471-2105-9-265
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.