Optimised amino acid specific weighting factors for unbound protein docking
© Heuser and Schomburg; licensee BioMed Central Ltd. 2006
Received: 09 January 2006
Accepted: 14 July 2006
Published: 14 July 2006
One of the most challenging aspects of protein-protein docking is the inclusion of flexibility into the docking procedure. We developed a postfilter where the grid-representation of proteins for docking is extended by an optimised weighting factor for each amino acid.
For up to 86% of the evaluated complexes a near-native structure was within the top 5% of the ranked prediction output. The weighting factors obtained by the optimisation procedure correlate to a certain extent with the flexibility of the amino acids, their hydrophobicity and with their propensity to be in the interface.
Use of the optimised amino acid specific parameters yields a strong increase of near-native structures on the first ranks of the prediction.
Protein-protein interactions and complex formation play a central role in a broad range of biological processes, including hormone-receptor binding, protease inhibition, antibody-antigen interaction and signal transduction. As structural genomics projects proceed, we are confronted with an increasing number of structurally known proteins that are functionally uncharacterised. Identifying how two proteins interact will be particularly important for elucidating functions and designing inhibitors. Although they predict around 50 percent false positive interactions, high throughput interaction discovery methods, such as the yeast two hybrid system, suggest thousands of protein-protein interactions and therefore also imply that a large fraction of all proteins interact with other proteins.
Since many biological interactions occur in transient complexes whose structures often cannot be determined experimentally, it is important to develop computational docking methods which can predict the structure of complexes with sufficient accuracy.
Docking algorithms are developed to predict in which orientation two proteins are likely to bind under natural conditions. They can be split into a sampling step followed by a scoring step. In the first step a collection of putative structural complexes is generated by scanning the full conformational space. Afterwards the putative complexes are ranked according to scoring functions based on geometrical and chemical complementarity.
During the docking procedure the two grids representing the proteins are moved with respect to each other in a specified number of rotations, and the geometric correlation for all translations is calculated in Fourier space within one step. The geometric complementarity of the proteins is evaluated by summing up the products of the values of the overlapping cells. In most approaches the surface cells of the proteins are assigned a value of one. Therefore, the more surface cells are in contact with surface cells from the other protein, the higher the geometric score. The interior of the larger protein is assigned a negative value (in our docking program: -6). This results in a 'punishing' negative value as soon as overlaps with interior cells of the first protein are observed. The interior cells of the second protein are assigned a value of one, leading to an asymmetrical treatment of both proteins, which 'softens' the surfaces slightly.
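The scoring rule described above can be illustrated by a direct summation on a small grid. The following is a minimal sketch and not the implementation of our docking program: the 2D grids, the toy shapes and the function name `geometric_score` are assumptions made for illustration; only the cell values (surface = 1, interior of the larger protein = -6, interior of the smaller protein = 1) are taken from the text. A Fourier-based program obtains these sums for all translations in one step, whereas the sketch evaluates a single translation at a time.

```python
# Toy 2D illustration of the grid-based geometric score (assumed layout).
# Cell values follow the text: 0 = empty, 1 = surface,
# -6 = interior of the larger protein, 1 = interior of the smaller protein.

def geometric_score(receptor, ligand, shift_row, shift_col):
    """Sum the products of overlapping cells for one translation of the ligand."""
    score = 0
    for r, row in enumerate(ligand):
        for c, lig_value in enumerate(row):
            rr, cc = r + shift_row, c + shift_col
            # Cells that fall outside the receptor grid contribute nothing.
            if 0 <= rr < len(receptor) and 0 <= cc < len(receptor[0]):
                score += receptor[rr][cc] * lig_value
    return score

# Receptor: surface cells (1) around a single interior cell (-6).
receptor = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, -6, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# Ligand: a 2x2 block whose cells all carry the value 1.
ligand = [
    [1, 1],
    [1, 1],
]

surface_contact = geometric_score(receptor, ligand, 0, 0)  # touches one surface cell
deep_overlap = geometric_score(receptor, ligand, 1, 1)     # covers the interior cell
print(surface_contact, deep_overlap)
```

In this toy case the translation that merely touches the surface scores higher than the one that penetrates the interior, which is the intended effect of the negative interior value.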
Initially the FFT-based docking methods were developed for bound docking. In bound docking the complex structure is split into its subunits and the docking algorithm predicts the complex structure from these subunits. In most cases bound docking gives good results, i.e. a near-native complex structure is on the first rank of the prediction output. However, if the complex structure is to be predicted from the 3D-structures determined in an unbound state, most docking procedures do find a near-native solution but not within the first ranks. This can be explained by the conformational differences between the complex and the unbound structures. Therefore the development of unbound docking methods, which are able to predict the near-native complex even if conformational changes take place, is the most challenging current task in the field of protein-protein docking.
Each protein-protein interaction depends on the amino acids involved in the interaction. Several attempts to evaluate the importance of the 20 amino acids for protein-protein docking have been published [14–18]. Different properties of the amino acids such as hydrophobicity, interface propensity, electrostatic properties, flexibility and others were tested for their relevance in docking. Two different approaches were pursued. On the one hand, these properties were used to detect the interface region of proteins before the docking procedure [16, 19–26]; on the other hand, the differences between the amino acids were used to identify the near-native structure of a complex among all those structures showing a high geometrical correlation.
However, some of the properties lead to contradictory ratings of the amino acids. For example, methionine has a high propensity to be in the interface of a complex, which would assign an important role to that amino acid. At the same time methionine has a large side chain, which might cause clashes in rigid body docking even for the near-native complex, and these clashes should not be 'punished'.
The flexibility of the amino acid side chains is the main reason for the unsatisfactory results of unbound docking. In the past, attempts were made to truncate or collapse very flexible side chains like arginine, lysine, asparagine, glutamine and methionine by assigning low numerical values to the cells representing their side chains [10, 28, 29]. Other approaches to treat the flexibility of side-chains include docking with different copies of the unbound subunits, or the usage of rotamer libraries in the refinement step.
Since it is nearly impossible to decide which property and which scale is the best one in each single case, we optimised amino acid specific weighting factors for rigid body unbound-unbound protein-protein docking. To this end the grid representation of the proteins was extended. The new representation takes into account the amino acids which are represented by the cells. The value assigned to each cell is composed of a value for surface or interior of the protein and a weighting factor that depends on the amino acid (Figure 1D–E). These weighting factors were specifically optimised for three different classes of complexes following the classification of the dataset: enzyme-inhibitor/substrate, antibody-antigen and others.
There are two possible methods to make use of this kind of representation of the proteins. On the one hand these values can be assigned to the cells before the calculation of the geometric correlation. Thereby an improvement of prediction accuracy might be achieved without an increase of the required computation time, but the chosen parameters must be capable of differentiating between clashing complex structures and those which have primarily surface-surface contacts. On the other hand this protein representation can be used to rerank the structures suggested by the calculation of the geometric correlation. This publication is focussed on the second approach.
The aim of our current work is to find new and independent criteria for the reranking of proposed complex structures, which describe other properties of near-native complex structures than our previously published postfilter. An integration of the different approaches described in the literature requires a larger programming effort but is expected to give further improvement.
Amino acid specific weighting factors. Average optimised parameters for all amino acids obtained from the 5 cross validation runs for three different classes of protein complexes.
There are two groups of amino acids where the obtained parameters are comparable between all three complex classes. On the one hand there is a group consisting of aromatic and hydrophobic amino acids (TRP, TYR, PHE, ILE, VAL) which got high weighting factors for all three complex classes. On the other hand very low parameters were obtained for the amino acids with long and flexible side chains (ARG, GLN, GLU, LYS) and for amino acids having a very low propensity to be in the interface (ASP, SER, PRO).
For the other amino acids the optimised values differ for the three complex classes. While ALA, ASN and HIS play a major role in finding near-native structures for antibody-antigen complexes, MET in particular got a high value for enzyme-inhibitor complexes. For LEU and THR, the optimisation yielded medium values for enzyme-inhibitor complexes and for other complexes, but a weighting factor of about zero was obtained for the antibody-antigen complexes.
The optimised parameters are specific for the complex class they have been optimised for. The percentage of near-native structures within the top 10% of the reranked output is about 10–15% higher when the parameters obtained from the same complex class are applied as compared to the application of the other parameters.
The values optimised for the interior of the receptor (I1) differ considerably between the three classes. For enzyme-inhibitor complexes the value is 8.34, for antibody-antigen complexes -3.45 and for the other complexes 0.15.
With non-optimised parameters only 32% of the evaluated enzyme-inhibitor complexes had a solution with an RMS below 5Å within the first 100 ranks. After the optimisation 68% of the complexes have a good solution within the top 100. For the antibody-antigen complexes the percentage of complexes for which one near-native structure can be found within the top 5% of the ranked predictions increased from 40% to 68% and for the other complexes from 31% to 50%. For the enzyme-inhibitor complexes the percentage of all near-native complexes within the top 5% increased from 13% to 34%, for the antibody-antigen complexes from 8% to 59% and for the other complexes from 6% to 26%.
Ranks of near-native structures using the optimised weighting factors (first structure with RMS<5Å in the ranking; best structure with RMS<5Å in the ranking).
Although the method is not able to reach the ultimate goal of protein-protein docking, i.e. to unambiguously give the near-native geometries the highest scores, the increase of near-native structures using the optimised amino acid specific grid values is quite spectacular, especially when the strict measure of 2.5Å RMS from the native geometry is used as the criterion for a correct structure. More than 90% of all near-native structures for the enzyme-inhibitor complexes are found within the top 10% of the ranked output after rescoring with the optimised grid values. For all three complex classes the number of near-native complex structures (RMSiCα<5Å) within the top 100 ranks increased by a factor of 3–5.
In comparison to our previously published results, this filter performs better than the conservation based filter in those cases where the conservation based filter was not able to detect the near-native structures reliably (antibody-antigen, several of the enzyme-inhibitor complexes). However, the method described here cannot reach the results which we obtained with the domain based post-filter. On the other hand the method described here does not depend on the availability of additional information like homologous complexes.
The optimised parameters comply partially with known properties of amino acids in protein complex interfaces. Amino acids for which the optimisation produced very low weighting factors are likely to produce clashes in unbound docking, especially for the near-native structures. They would be misleading for the docking of unbound proteins. The lowest values (~0) are assigned to flexible polar amino acids such as ARG, ASP, GLN, GLU or LYS, which also have a very low interface propensity. The high values of the aromatic residues can be explained by their ability to form π-stacks with their ring systems and by their high propensity to be in interface regions, together with the rigidity of the aromatic ring system.
The weighting factors for the enzyme-inhibitor complexes correspond to the known interface propensities, to the number of freely rotatable bonds in the side chain and to the hydrophobicity of the amino acids. The very high value for methionine can be explained by its high propensity to be in the interface and the general rareness of MET on protein surfaces, so that MET gives a strong hint towards the near-native interface region.
The antibody-antigen factors still comply with the above mentioned properties, but to a smaller extent than the factors for the enzyme-inhibitor complexes. Higher values obtained for ASN and TYR correspond to the described higher importance of hydrogen bonds in antibody-antigen complexes.
The set of "other" complexes is rather heterogeneous, so that it is hard to interpret the optimised values. For these complexes the highest values were also obtained for the apolar, hydrophobic and rigid residues, while for those with long flexible side chains the optimisation ended up with values near zero.
Since the number of freely rotatable bonds is correlated with the obtained weighting factors for enzyme-inhibitor complexes, the method presented here is a computationally rather cheap way to take the flexibility of amino acid side-chains on the surface of proteins at least partially into account, which is one of the major problems of unbound protein docking methods. The method does not attempt to reproduce the correct conformation of the side chains in the bound state, but it partially ignores the misleading aspect of 'wrong' side chain conformations, which are likely to produce steric clashes even for the near-native conformations in rigid body docking. The correct conformation must be obtained by other methods like energy minimisation procedures.
The method works equally well for complexes classified as "easy" and "medium difficult" [30, 31]. For some bound cases we evaluated, the method is also able to enrich the number of near-native structures on the top ranks. This underlines the ability of the filter to identify near-native solutions independently of the flexibility by ignoring clashes of near-native structures.
At first sight the positive value for interior cells of the receptor in the enzyme-inhibitor complexes is surprising, since in the common Fourier docking procedure these cells are assigned a negative value. However, it has to be kept in mind that the values were optimised for usage as a postfilter, not for the docking process as such. The structures for which the geometrical correlation was recalculated are only those obtained by a standard Fourier docking process and therefore have only minor clashes with the interior. All potential complex structures in which both proteins overlap were excluded in the first calculation of the geometric correlation, where -6 was used as the value for the interior cells.
The number of antibody-antigen and of other complexes (4 each) used for validation is quite low due to the lack of available literature values. However, the enrichments achieved for the complexes which were also used for the optimisation give reason for optimism that the obtained weighting factors will work for the prediction of new antibody-antigen and other complexes as they do for the enzyme-inhibitor complexes.
One of the major advantages of the described method compared to methods that treat side-chain flexibility explicitly is that it is computationally much faster.
The method presented here should work with any rigid body docking system which is based on a grid representation of the proteins. There are, of course, other criteria – like electrostatic interactions, size or hydrophobicity of the interface – that can be used for ranking docking candidates but the work presented here is focused on a rather simple optimisation of the geometric ranking which integrates the different roles of the different amino acids in the interface. The weighting obtained by the optimisation process is partially correlated with different structural properties of the amino acids and takes therefore several aspects into account.
As a next step in this work we will try to find atom-specific weighting factors and combine the method described here with our previously published postfilters and with a comprehensive scoring function (publication in preparation).
The docking tool
In a first step the geometric correlation for 83 unbound protein-protein complexes from the Z-DOCK 2 benchmark set was calculated using our docking tool ckordo [8, 12]. Ckordo is an FFT-based docking program that includes further docking criteria such as hydrophobicity and electrostatics. For this work we used only the geometric correlation calculated in Fourier space. The geometric correlation was calculated with a rotation increment of 15° and a maximum cell size of 1.5Å. For each rotation the five structures with the highest geometrical correlation were considered. Since the number of near-native structures is very low compared to the number of incorrect ones, additional solutions with low RMS-values were produced by running ckordo for 1000 randomly chosen angles in the range from -10° to +10° around the correct rotation. From this run all proposed complex structures with RMSiCα<5Å were selected. This resulted in up to 4700 additional near-native solutions for each complex, which were added to the 22,000 structures calculated with rotational steps of 15°. If the ratio of near-native solutions to wrong ones is too low, the optimisation procedure is not able to find the optimal parameters.
For the optimisation process ckordo was modified so that for each proposed structure the number of contacts of each amino acid, separated into surface and interior cells, was calculated. This results in one 20 × 20 matrix per structure for each possible contact type (surface_protein1 × protein2, interior_protein1 × protein2). Furthermore, for each proposed structure the root mean square deviation of the Cα atoms in the interface (RMSiCα) was calculated in comparison to the unbound proteins fitted on the complex. For the RMSD calculation a Cα atom was defined to be part of the interface if at least one atom of the other protein was within a distance of 10Å. The fitting of the unbound proteins on the complex was done with CE.
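The interface definition and the RMSD measure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the coordinates are invented toy values, the function names `interface_ca_indices` and `rmsd` are hypothetical, and the superposition of the unbound proteins onto the complex (done with CE in this work) is assumed to have been carried out beforehand.

```python
import math

def interface_ca_indices(ca_coords, partner_atoms, cutoff=10.0):
    """Indices of C-alpha atoms with at least one partner atom within `cutoff` Å."""
    selected = []
    for i, ca in enumerate(ca_coords):
        if any(math.dist(ca, atom) <= cutoff for atom in partner_atoms):
            selected.append(i)
    return selected

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Toy data (hypothetical coordinates in Å, already superposed).
reference_ca = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
partner_atoms = [(0.0, 8.0, 0.0)]
predicted_ca = [(0.0, 3.0, 0.0), (4.0, 3.0, 0.0), (30.0, 3.0, 0.0)]

# Only the first two C-alpha atoms lie within 10 Å of the partner atom.
interface = interface_ca_indices(reference_ca, partner_atoms)
rms_i_ca = rmsd([reference_ca[i] for i in interface],
                [predicted_ca[i] for i in interface])
print(interface, rms_i_ca)
```

Restricting the RMSD to interface atoms keeps the measure sensitive to the binding geometry rather than to parts of the proteins far away from the contact region.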
The contact matrices and the RMS-values for the complexes from the Z-DOCK 2 benchmark were used for the optimisation procedure. For each complex class the optimisation was performed independently. We evaluated whether the optimised parameters yield better results when only the structures proposed by the docking tool are used, or when the additionally generated near-native complex structures are included as well; the latter was the case. The optimisation was done using the nonlinear minimisation function nlm() from the R environment for statistical computing. The formula used so far for the calculation of the geometric correlation is F1. For the optimisation procedure the formula was extended by the weighting factors (F2).
The optimisation itself is a minimisation of the quadratic error between an objective function and the scores obtained. In a preliminary step several models such as logarithmic and linear scales were evaluated as objective function using a small subset of the examples. Finally the best results were obtained for the function shown in Figure 7. All near-native structures (RMSiCα: 0–5Å) were assigned a 100 times
$$\mathrm{geoscore} = \sum (P_1 \times P_2)$$
Formula 1: Formula for the calculation of the geometric score. The values in the overlapping grid cells of protein 1 (P1) and protein 2 (P2) are multiplied and summed up.
$$\mathrm{weighted\_geo\_score} = \sum \left( (W_{AA} \times P_1) \times (W_{AA} \times P_2) \right)$$
Formula 2: Formula for the calculation of the geometric score after inclusion of the amino acid specific weighting factors. (P1 = cells of protein 1, P2 = cells of protein 2, WAA = weighting factor for the amino acid represented by the respective cell)
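Formula 2 can be written out for a short list of overlapping cell pairs. The sketch below only illustrates the arithmetic: the data layout (each cell as a pair of grid value and amino acid), the function name and the weights are assumptions chosen for the example, and the weights are placeholders rather than the optimised values reported here.

```python
# Sketch of Formula 2 on a list of overlapping cell pairs (assumed data layout).
# Each cell is (grid_value, amino_acid); the weights are placeholders,
# not the optimised values from this work.

def weighted_geo_score(cell_pairs, weights):
    """Sum (W_aa1 * P1) * (W_aa2 * P2) over all overlapping cell pairs."""
    total = 0.0
    for (p1, aa1), (p2, aa2) in cell_pairs:
        total += (weights[aa1] * p1) * (weights[aa2] * p2)
    return total

hypothetical_weights = {"TRP": 2.0, "LYS": 0.0, "VAL": 1.5}
uniform_weights = {"TRP": 1.0, "LYS": 1.0, "VAL": 1.0}

overlaps = [
    ((1, "TRP"), (1, "VAL")),  # surface-surface contact between rigid residues
    ((1, "LYS"), (1, "TRP")),  # contact involving a flexible side chain
]

unweighted = weighted_geo_score(overlaps, uniform_weights)    # reduces to Formula 1
weighted = weighted_geo_score(overlaps, hypothetical_weights)
print(unweighted, weighted)
```

With all weights set to one the expression reduces to Formula 1; a weight near zero removes the contribution of a contact, which is how contacts of flexible side chains can be ignored.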
We evaluated whether prediction quality improves when the values for surface and for interior are included in the optimisation procedure. The best results were achieved when only the value for the interior of the larger protein, usually the receptor, was included in the optimisation and the value for the other protein was kept fixed at 1.
The nonlinear minimisation function of the R package uses a Newton-type algorithm [34, 35]. This method finds a minimum of a function using numerically computed derivatives. The default parameters were used as convergence criteria.
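The idea of fitting weighting factors by minimising a squared error with numerically computed derivatives can be sketched on toy data. Several points are assumptions and should not be read as the procedure used in this work: plain gradient descent with central differences is substituted for the Newton-type algorithm of nlm(), the score is computed from hypothetical contact counts, only two amino acid types are used, and the target values are invented.

```python
# Toy least-squares fit of amino acid weights (illustrative only).
# Gradient descent with central differences stands in for R's nlm().

def score_from_contacts(contacts, weights):
    """Weighted score from contact counts: sum of count * W_a * W_b."""
    return sum(count * weights[a] * weights[b] for (a, b), count in contacts.items())

def squared_error(weight_vector, names, structures):
    """Sum of squared differences between target values and weighted scores."""
    weights = dict(zip(names, weight_vector))
    return sum((target - score_from_contacts(contacts, weights)) ** 2
               for contacts, target in structures)

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def minimise(f, start, step=0.002, iterations=2000):
    """Fixed-step gradient descent; step size chosen for this toy problem."""
    x = list(start)
    for _ in range(iterations):
        g = numerical_gradient(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

names = ["TRP", "LYS"]
# (contact counts, hypothetical target value); a high target marks a near-native structure.
structures = [
    ({("TRP", "TRP"): 2, ("TRP", "LYS"): 1}, 8.0),
    ({("LYS", "LYS"): 3, ("TRP", "LYS"): 1}, 0.0),
]

def loss(vector):
    return squared_error(vector, names, structures)

initial_loss = loss([1.0, 1.0])
fitted = minimise(loss, [1.0, 1.0])
final_loss = loss(fitted)
print(initial_loss, final_loss, fitted)
```

A Newton-type method additionally uses curvature information and typically needs far fewer iterations; the fixed-step descent is used here only because it is short and easy to follow.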
Due to hardware limits it was impossible to use all available structures for the optimisation, so subsets had to be chosen. To show that different subsets lead to similar results, a 5-fold cross validation procedure was performed. For this purpose the complexes from each class were randomly assigned to 5 groups. The optimisation was run 5 times, each time leaving out one of the groups and optimising with the remaining four. The final results were calculated using the average value of the five optimisations.
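The cross validation scheme described above can be outlined in a few lines. This is a sketch under assumptions: the function names, the fixed random seed and the placeholder `dummy_optimise` are hypothetical, and the placeholder merely returns the size of the training set instead of running a real optimisation.

```python
import random

def make_groups(complex_ids, n_groups=5, seed=0):
    """Randomly distribute the complexes over `n_groups` groups."""
    shuffled = list(complex_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

def cross_validate(complex_ids, optimise, n_groups=5, seed=0):
    """Run `optimise` once per left-out group and average the parameter vectors."""
    groups = make_groups(complex_ids, n_groups, seed)
    runs = []
    for left_out in range(n_groups):
        training = [c for g, group in enumerate(groups) if g != left_out
                    for c in group]
        runs.append(optimise(training))
    n_params = len(runs[0])
    return [sum(run[k] for run in runs) / len(runs) for k in range(n_params)]

def dummy_optimise(training):
    """Placeholder for the real optimisation: returns one pseudo-parameter."""
    return [float(len(training))]

complex_ids = ["complex_%d" % i for i in range(10)]  # hypothetical identifiers
groups = make_groups(complex_ids)
averaged = cross_validate(complex_ids, dummy_optimise)
print(groups, averaged)
```

Each complex is left out exactly once, so the spread of the five parameter sets gives an impression of how strongly the result depends on the chosen subset.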
Furthermore the effect of the obtained parameters was evaluated on 17 enzyme-inhibitor, 4 antibody-antigen and 4 other complexes from the literature, which were not part of the training. The docking procedure for these test cases was run with a rotation increment of 12°, leading to 43080 potential structures for each complex. The evaluation was done with respect to the number of complexes which have a near-native solution within the top ranks and to the number of near-native structures on the first ranks.
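The first evaluation criterion, the fraction of complexes with at least one near-native structure within the top ranks, can be computed as sketched below. The function names and the toy RMS lists are assumptions for illustration; each list is taken to be ordered by rank, best score first.

```python
def has_hit_in_top(ranked_rms, percent=5, cutoff=5.0):
    """True if a structure within the top `percent` of ranks has RMS below `cutoff`."""
    n_top = max(1, len(ranked_rms) * percent // 100)
    return any(rms < cutoff for rms in ranked_rms[:n_top])

def success_rate(complexes, percent=5, cutoff=5.0):
    """Fraction of complexes with a near-native structure within the top ranks."""
    hits = sum(has_hit_in_top(rms, percent, cutoff) for rms in complexes.values())
    return hits / len(complexes)

# Hypothetical RMS values (Å) of 40 ranked structures per complex.
complexes = {
    "complex_a": [3.0, 12.0] + [20.0] * 38,        # near-native structure on rank 1
    "complex_b": [15.0, 14.0, 2.0] + [20.0] * 37,  # near-native structure on rank 3
}

rate_top5 = success_rate(complexes, percent=5)    # top 5% of 40 structures = 2 ranks
rate_top10 = success_rate(complexes, percent=10)  # top 10% of 40 structures = 4 ranks
print(rate_top5, rate_top10)
```

The integer percentage avoids rounding problems when the number of top ranks is derived from the length of the list.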
The authors are grateful for the financial support from the German Federal Ministry for Education and Research (BMBF).
1. Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins 2001, 42(1):108–124. 10.1002/1097-0134(20010101)42:1<108::AID-PROT110>3.0.CO;2-O
2. Caffrey P: Conserved amino acid residues correlating with ketoreductase stereospecificity in modular polyketide synthases. Chembiochem 2003, 4(7):654–657. 10.1002/cbic.200300581
3. Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1(5):349–356. 10.1074/mcp.M100037-MCP200
4. Aloy P, Russell RB: The third dimension for protein interactions and complexes. Trends Biochem Sci 2002, 27(12):633–638. 10.1016/S0968-0004(02)02204-1
5. Vajda S, Camacho CJ: Protein-protein docking: is the glass half-full or half-empty? Trends Biotechnol 2004, 22(3):110–116. 10.1016/j.tibtech.2004.01.006
6. Halperin I, Ma B, Wolfson H, Nussinov R: Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 2002, 47(4):409–443. 10.1002/prot.10115
7. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA: Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci U S A 1992, 89(6):2195–2199. 10.1073/pnas.89.6.2195
8. Meyer M, Wilson P, Schomburg D: Hydrogen bonding and molecular surface shape complementarity as a basis for protein docking. J Mol Biol 1996, 264(1):199–210. 10.1006/jmbi.1996.0634
9. Ackermann F, Herrmann G, Posch S, Sagerer G: Estimation and filtering of potential protein-protein docking positions. Bioinformatics 1998, 14(2):196–205. 10.1093/bioinformatics/14.2.196
10. Chen R, Weng Z: Docking unbound proteins using shape complementarity, desolvation, and electrostatics. Proteins 2002, 47(3):281–294. 10.1002/prot.10092
11. Gabb HA, Jackson RM, Sternberg MJ: Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol 1997, 272(1):106–120. 10.1006/jmbi.1997.1203
12. Zimmermann O: Untersuchungen zur Vorhersage der nativen Orientierung von Protein-Komplexen mit Fourier-Korrelationsmethoden. In Institute for Biochemistry. Cologne, University of Cologne; 2002.
13. Eisenstein M: Introducing a 4th dimension to protein-protein docking. Structure (Camb) 2004, 12(12):2095–2096. 10.1016/j.str.2004.11.001
14. Jones S, Thornton JM: Principles of protein-protein interactions. Proc Natl Acad Sci U S A 1996, 93(1):13–20. 10.1073/pnas.93.1.13
15. Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272(1):121–132. 10.1006/jmbi.1997.1234
16. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257(2):342–358. 10.1006/jmbi.1996.0167
17. Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285(5):2177–2198. 10.1006/jmbi.1998.2439
18. Chakrabarti P, Janin J: Dissecting protein-protein recognition sites. Proteins 2002, 47(3):334–343. 10.1002/prot.10085
19. Jones S, Thornton JM: Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 1997, 272(1):133–143. 10.1006/jmbi.1997.1233
20. Fernandez-Recio J, Totrov M, Skorodumov C, Abagyan R: Optimal docking area: a new method for predicting protein-protein interaction sites. Proteins 2005, 58(1):134–143. 10.1002/prot.20285
21. Pupko T, Bell RE, Mayrose I, Glaser F, Ben-Tal N: Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18 Suppl 1: S71–7.
22. Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 2001, 307(5):1487–1502. 10.1006/jmbi.2001.4540
23. Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein--protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 2002, 269(5):1356–1361. 10.1046/j.1432-1033.2002.02767.x
24. Zhou HX, Shan Y: Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins 2001, 44(3):336–343. 10.1002/prot.1099
25. Koike A, Takagi T: Prediction of protein-protein interaction sites using support vector machines. Protein Eng Des Sel 2004, 17(2):165–173. 10.1093/protein/gzh020
26. Bordner AJ, Abagyan R: Statistical analysis and prediction of protein-protein interfaces. Proteins 2005.
27. Heuser P, Bau D, Benkert P, Schomburg D: Refinement of unbound protein docking studies using biological knowledge. Proteins 2005, 61(4):1059–1067. 10.1002/prot.20634
28. Palma PN, Krippahl L, Wampler JE, Moura JJ: BiGGER: a new (soft) docking algorithm for predicting protein interactions. Proteins 2000, 39(4):372–384. 10.1002/(SICI)1097-0134(20000601)39:4<372::AID-PROT100>3.0.CO;2-Q
29. Heifetz A, Eisenstein M: Effect of local shape modifications of molecular surfaces on rigid-body protein-protein docking. Protein Eng 2003, 16(3):179–185. 10.1093/proeng/gzg021
30. Zacharias M: Protein-protein docking with a reduced protein model accounting for side-chain flexibility. Protein Sci 2003, 12(6):1271–1282. 10.1110/ps.0239303
31. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-Protein Docking Benchmark 2.0: an update. Proteins 2005, 60(2):214–216. 10.1002/prot.20560
32. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739–747. 10.1093/protein/11.9.739
33. R Development Core Team: R: A language and environment for statistical computing. 2005.
34. Dennis JE, Schnabel RB: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ 1983.
35. Schnabel RB, Koontz JE, Weiss BE: A modular system of algorithms for unconstrained minimization. ACM Trans Math Software 1985, 11: 419–440. 10.1145/6187.6192
36. Heuser P, Martin O: UUPPDD - Unbound Unbound Protein Protein Docking Dataset. [http://biotool.uni-koeln.de:8080/uuppdd/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.