Optimised amino acid specific weighting factors for unbound protein docking

Background One of the most challenging aspects of protein-protein docking is the inclusion of flexibility into the docking procedure. We developed a postfilter where the grid-representation of proteins for docking is extended by an optimised weighting factor for each amino acid. Results For up to 86% of the evaluated complexes a near-native structure was within the top 5% of the ranked prediction output. The weighting factors obtained by the optimisation procedure correlate to a certain extent with the flexibility of the amino acids, their hydrophobicity and with their propensity to be in the interface. Conclusion Use of the optimised amino acid specific parameters yields a strong increase of near-native structures on the first ranks of the prediction.


Background
Protein-protein interactions and complex formation play a central role in a broad range of biological processes, including hormone-receptor binding, protease inhibition, antibody-antigen interaction and signal transduction [1]. As structural genomics projects proceed, we are confronted with an increasing number of structurally known proteins that are functionally uncharacterised. To identify how two proteins are interacting will be particularly important for elucidating functions and designing inhibitors [2]. Although predicting around 50 percent false positive interactions [3], high throughput interaction discovery methods, such as the yeast two hybrid system, suggest thousands of protein-protein interactions and therefore also imply that a large fraction of all proteins interact with other proteins [4].
Since many biological interactions occur in transient complexes whose structures often cannot be determined experimentally, it is important to develop computational docking methods which can predict the structure of complexes with a proper accuracy [5].
Docking algorithms are developed to predict in which orientation two proteins are likely to bind under natural conditions. They can be split in a sampling step followed by a scoring step. A collection of putative structural complexes is generated by scanning the full conformational space in the first step. Afterwards the putative complexes are ranked according to scoring functions based on geometrical and chemical complementarity.
For the scanning of the conformational space for geometrical complementarity different methods are used (for a general introduction and an overview over the different docking methods see Halperin 2002 [6]). One of the most widely spread docking methods is based on Fast Fourier Transformations (FFT). The usage of FFT was introduced into docking by Katchalsky-Katzir in 1992 [7]. One important aspect of the docking procedure is the representation of the proteins. Most FFT based methods use a grid representation for the proteins [7][8][9][10][11]. Therefore each protein is mapped on a 3D grid, and the cells of the grid get different values assigned, representing the surface or the interior of the proteins ( Figure 1A-C). Further grids or complex numbers can be used to represent specific properties which are thought to play a crucial role in protein interactions like hydrophobicity or electrostatics [8,[10][11][12].
During the docking procedure the two grids representing the proteins are moved with respect to each other in a specified number of rotations and the geometric correlation for all translations is calculated in Fourier space within one step. The geometric complementarity of the proteins is evaluated by summing up the products of the values of the overlapping cells. In most approaches the surface cells of the proteins are assigned a value of one. Therefore the more surface cells are in contact with surface cells from the other protein the higher is the geometric score. The interior of the larger protein is assigned a negative value (in our docking program: -6). This results in a 'punishing' negative value as soon as overlaps with interior cells of the first protein are observed. The interior cells of the second protein are assigned a value of one leading to an asymmetrical treatment of both proteins, which 'softens' the surfaces slightly [13].
In the beginning the FFT-based docking methods were developed for bound docking. In bound docking the complex structure is split in its subunits and the docking algorithm predicts the complex structure from these subunits. For most cases bound docking gives good results, i.e. for most cases a near-native complex structure is on the first rank of the prediction output. However, if the complex structure is to be predicted from the 3D-structures determined in an unbound state, most docking procedures do find a nearnative solution but not within the first ranks. This can be explained by the conformational differences between the complex and the unbound structures. Therefore the development of unbound docking methods, which are able to predict the near-native complex even if conformational changes take place, is the most challenging current task in the field of protein-protein docking.
Each protein-protein interaction depends on the amino acids involved in the interaction. Several attempts to evaluate the importance of the 20 amino acids for proteinprotein docking were published [14][15][16][17][18]. Different properties of the amino acids like hydrophobicity, interface propensity, electrostatic properties, flexibility and others were tested for their relevance in docking. Two different approaches were done. On the one hand it was attempted to use these properties to detect the interface region of proteins before the docking procedure [16,[19][20][21][22][23][24][25][26] and on the other hand the differences of the amino acids were used to identify the near-native structure of a complex from all those structures showing a high geometrical correlation [27].
However, some of the properties lead to controversial ratings of the amino acids. For example for methionine there is a high propensity to be in the interface of a complex [18], which would lead to an assignment of an important role for that amino acid, but at the same time methionine has a large side chain which might cause clashes in rigid body docking even for the near-native complex which should not be 'punished'.
The flexibility of the amino acid side chains is the main reason for the unsatisfactory results of unbound docking. In the past it was tried to truncate or collapse very flexible side chains like arginine, lysine, asparagine, glutamine and methionine by assigning low numerical values to the cells representing their side chains [10,28,29]. Other approaches to treat the flexibility of side-chains include docking with different copies of the unbound subunits [30], or the usage of rotamer libraries in the refinement step.
Since it is nearly impossible to decide which property and which scale is the best one in each single case, we optimised amino acid specific weighting factors for rigid body The proteins are mapped on a grid (A); In the common rep-resentation there is a differentiation between cells represent-ing the surface (S) and the interior (I) (B), which results in the distribution of values as shown in C Figure 1 The proteins are mapped on a grid (A); In the common representation there is a differentiation between cells representing the surface (S) and the interior (I) (B), which results in the distribution of values as shown in C. Since each cell also represents the underlying amino acid (D), in the new representation (E) as it is used here each value is composed of an numerical value for surface and interior (S/I) and a weighting factor for each amino acid (W AA ).
unbound-unbound protein-protein docking. Therefore the grid representation of the proteins was extended. The new representation takes the amino acids into account which are represented by the cells. The values assigned to each cell, are composed of a value for surface or interior of the protein and a weighting factor in dependence of the amino acid ( Figure 1D-E). These weighting factors were specifically optimised for three different classes of complexes following the classification of the dataset [31]: enzyme-inhibitor/substrate, antibody-antigen and others.
There are two different possible methods to make use of this kind of representation of the proteins. On the one hand these values can be assigned to the cells before the calculation of the geometric correlation. Thereby an improvement of prediction accuracy might be achieved without an extension of the required computation time, but the chosen parameters must be capable of differentiating between clashing complex structures and such which have primarily surface-surface contacts. On the other hand this protein representation can be used to rerank the structures suggested by the calculation of the geometric correlation. This publication is focussed on the second approach.
The aim of our current work is to find new and independent criteria for the reranking of proposed complex structures, which describe other properties of near-native complex structures as our previously published [27] postfilter. An integration of the different approaches described in the literature requires a larger programming exercise but is expecting to give further improvement.

Obtained parameter
The weighting factor for each amino acid and the estimated value for all cells of the interior of the receptor obtained by the optimisation are shown for the three different complex classes in Table 1.
There are two groups of amino acids where the obtained parameters are comparable between all three complex classes. On the one hand there is a group consisting of aromatic and hydrophobic amino acids (TRP, TYR, PHE, ILE, VAL) which got high weighting factors for all three complex classes. On the other hand very low parameter were obtained for the amino acids with long and flexible side chains (ARG, GLN, GLU, LYS) and to such amino acids having a very low propensity to be in the interface (ASP, SER, PRO) [18].
For the other amino acids the optimised values differ for the three complex classes. While ALA, ASN and HIS play a major role in finding near-native structures for antibodyantigen complexes, especially MET got a high value for enzyme-inhibitor complexes. For LEU and THR, the optimisation yielded medium values for enzyme-inhibitor complexes and for other complexes but a weighting factor of about null was obtained for the antibody-antigen complexes.
The optimised parameters are specific for the complex class they have been optimised for. The percentage of near-native structures within the top 10% of the reranked output is about 10-15% higher when the parameters obtained from the same complex class are applied as compared to the application of the other parameters.
The values optimised for the interior of the receptor (I1) seriously differ for the three classes. For enzyme inhibitor complexes the value is 8.34, for antibody-antigen complexes -3.45 and for the other complexes 0.15.

Validation
For each complex the geometric correlation score was recalculated using the optimised parameter. For most of the complexes a massive enrichment of low RMS-solutions on the lower ranks was achieved. In figure 2 this enrichment is shown for [PDB:1CGI], [PDB:1BVN] and for [PDB:1TMQ]. It is shown how many percent of the dataset must be evaluated until which percent value of all near-native (RMSiCα<5Å) solutions are found. The red line shows the result obtained by the recalculation using the optimised parameter and the black line shows the results before the optimisation.
In figure 3 the distribution of the calculated scores is shown. In 3A the calculated geometric correlation compared to the RMS as it is calculated without the optimised amino acid dependent weighting factors is shown, whilst part B shows the same distribution after the optimisation.
The overall improvement of prediction quality is shown in figure 4. It is shown for how many percent of all tested complexes from the ZDOCK2.0 benchmark a good solution (RMSiCα<5Å) was found within the top ranks.
With non-optimised parameters only 32% of the evaluated enzyme-inhibitor complexes had a solution with an RMS below 5Å within the first 100 ranks. After the optimisation 68% of the complexes do have a good solution within the top 100. For the antibody-antigen complexes the percentage of complexes for which one near-native structure can be found within the top 5% of the ranked predictions increased from 40% to 68% and for the other complexes from 31% to 50%. For the enzyme-inhibitor complexes the percentage of all near-native complexes within the top 5% increased from 13% to 34%, for the antibody-antigen complexes from 8% to 59% and for the other complexes from 6% to 26%.
The application of the optimised weighting factors is especially well suited to recognize those solutions with a very low RMS. Figure 5 shows the enrichment of near-native structures on the lower ranks for structures with RMSiCα<5Å and RMSiCα<2.5Å. For the enzyme-inhibitor complexes more than 90% of all solutions with RMSiCα<2.5Å can be found within the top 10%.
For validation of the optimisation the obtained weighting factors were applied to 17 enzyme-inhibitor, to 4 antibody-antigen and to 4 other complexes from the literature, which are not part of the ZDOCK2 benchmark. Figure 6 shows the improvement achieved for these testcases and table 2 gives a detailed overview over the effect of the reranking on the position of the first and the best near-native structure for each complex.

Discussion
Although the method is not able to reach the ultimate goal of protein-protein docking, i.e. unambiguously give the near-native geometries the highest scores the increase of near-native structures using the optimised amino acid specific grid values is quite spectacular, especially when the strict measure of 2.5Å RMS from the native geometry is used to be classified as correct. More than 90% of all near-native structures for the enzyme-inhibitor complexes are found within the 10% of the ranked output after rescoring with the optimised grid values. For all three complex-classes the number of near-native complex structures (RMSiCα<5Å) within the top 100 ranks increased by a factor of 3-5.
In comparison to our previously published results [27], this filter performs better than the conservation based filter for those cases, where the conservation based filter was not able to detect the near-native structures reliably (antibody-antigen, several of the enzyme-inhibitor complexes). However, the method described here can not reach the results, which we derived from the domain based post-filter. On the other hand the method described here does not depend on the availability of additional information like homologous complexes.
The optimised parameters comply partially with known properties of amino acids in protein complex interfaces. Amino acids where the optimisation produced very low weighting factors are likely to produce clashes in unbound docking especially for the near-native structures. They would be misleading for the docking of unbound proteins. The lowest values (~0) are assigned to the flexible polar amino acids as ARG, ASP, GLN, GLU or LYS, which also have a very low interface propensity [18]. The high values of the aromatic residues are explainable by their ability to form π-stacks with their ring-systems and by their high propensity to be in interface regions, together with the rigidity of the aromatic ring system.
The weighting factors for the enzyme-inhibitor complexes correspond to the known interface propensities [18], to the number of freely rotatable bonds in the side chain and to the hydrophobicity of the amino acids. The very high value for methionine can be explained by the high propensity to be in the interface and the general rareness of MET on protein surfaces, so that MET gives a strong hint towards the near-native interface region.
The antibody-antigen factors still comply with the above mentioned properties, but to a smaller extend as the factors for the enzyme-inhibitor complexes. Higher values obtained for ASN and TYR correspond to the described higher importance of hydrogen bonds in antibody-antigen complexes.
The set of "other" complexes is rather heterogeneous, so that it is hard to interpretate the optimised values. For Since the number of freely rotatable bonds is correlated with the obtained weighting factors for enzyme-inhibitor complexes, the method presented here is a computational rather cheap way to take the flexibility of amino acid sidechains on the surface of proteins at least partially into account, which is one of the major problems of unbound protein docking methods. The method does not attempt to reproduce the correct conformation of the side chains in the bound state, but it partially ignores the misleading aspect of 'wrong' side chain conformations, which are likely to produce steric clashes even for the near-native  The method works equally well for complexes classified as "easy" and "medium difficult" [30,31] and also for some bound-cases we evaluated the method is able to enrich the number of near-native structures on the top ranks. This underlines the ability of the filter to identify near-native solutions independently from the flexibility by ignoring clashes of near-native structures.

Comparison of the distribution of calculated geometric correlations for a protease/inhibitor example (2SIC) (A) without and (B) with optimised parameter
On the first view the positive value for interior cells of the receptor from the enzyme-inhibitor complexes is surprising since in the common Fourier docking procedure these cells get a negative value assigned. However, it has to be kept in mind that the values were optimised for usage as a postfilter, not for the docking process as such. The structures where the geometrical correlation was recalculated are only those structures obtained by a standard Fourier docking process and therefore have only minor clashes with the interior. All potential complex structures where both proteins are overlapping were excluded in the first calculation of the geometric correlation where -6 was used as a value for the interior cells.
The number of antibody-antigen and of other complexes (4 each) used for validation is quite low due to the lack of available literature values. But the enrichments achieved for the complexes which were also used for the optimisation, gives reason to optimism that the obtained weighting factors will work for the prediction of new antibodyantigen and other complexes as they do for the enzymeinhibitor complexes.
One of the major advantages of the described method compared to methods that treat side-chain flexibility explicitly consists of the fact that it is computationally much faster.

Conclusion
The method presented here should work with any rigid body docking system which is based on a grid representation of the proteins. There are, of course, other criterialike electrostatic interactions, size or hydrophobicity of the interface -that can be used for ranking docking candidates but the work presented here is focused on a rather simple optimisation of the geometric ranking which integrates the different roles of the different amino acids in the interface. The weighting obtained by the optimisation process is partially correlated with different structural properties of the amino acids and takes therefore several aspects into account.
As a next step in this work we will try to find atom-specific weighting factors and combine the method described here with our previously published postfilters and with a comprehensive scoring function (publication in preparation).

The docking tool
In a first step the geometric correlation for 83 unbound protein-protein complexes from the Z-Dock 2 benchmark set [31] was calculated using our docking tool ckordo [8,12]. Ckordo is a FFT based docking program including further docking arguments such as hydrophobicity and electrostatics. For this work we used only the geometric correlation calculated in Fourier space. The geometric correlation was calculated with a rotation increment of 15°a nd a maximum cell size of 1.5Å. For each rotation the five structures with the highest geometrical correlation were considered. Since the number of near-native structures is very low compared to the number of incorrect ones additional solutions with low RMS-values were produced by running ckordo for 1000 randomly chosen angles in the range from -10° to +10° around the correct rotation. From this run all proposed complex structures with RMSiCα<5Å were selected. This resulted in up to Enrichment of near-native structures on the lower ranks for enzyme-inhibitor complexes  RMSiCa<2.5A Improvement of prediction quality (testcases from literature) For the optimisation process ckordo was modified so that for each proposed structure the number of contacts of each amino acid with respect to being surface or interior was calculated. This results in one 20 × 20 matrix for each structure for each possible contact type (surface_protein1 × protein2, interior_protein1 × protein2). Furthermore for each proposed structure the root mean square deviation of the Cα atoms in the interface (RMSiCα) was calculated, in comparison to the unbound proteins fitted on the complex. For the RMSD calculation the Cαs were defined to be part of the interface if at least one atom of the other protein was within a distance of 10Å. The fitting of the unbound proteins on the complex was done with CE [32].

Optimisation
The contact-matrices and the RMS-values for the complexes from the Z-DOCK 2 Benchmark [31] were used for the optimisation procedure. For each complex class the optimisation was performed independently. It was evaluated if the optimised parameters yield better results if only those structures proposed by the docking tool or those structures and additionally generated near-native complex structures lead to a better result (where the latter was the case). The optimisation was done using the nonlinear minimisation method (nlm()) from the R-package for statistical computing [33]. The formula which is used for the calculation of the geometric correlation until now equals The optimisation itself is a minimisation of the quadratic error between an objective function and the scores obtained. In a preliminary step several models such as logarithmic and linear scales were evaluated as objective function using a small subset of the examples. Finally the best results were obtained for the function shown in Figure 7. All near-native structures (RMSiCα: 0-5Å) were assigned a 100 times  It was evaluated if there is an improvement in prediction quality when the values for surface and for interior were included in the optimisation procedure. It turned out that the best results were achived, when only the value for the interior of the larger protein -usually the receptor -is included in the optimisation and the value for the other protein was kept fixed at the value of 1.
The nonlinear minimisation function of the R-package [33] uses a Newton-type algorithm [34,35]. This method allows finding a minimum of a function by numerical computation of the derivatives. As a convergence criteria for the optimisation the default parameter were used.

Validation
Due to hardware limits it was impossible to use all available structures for the optimisation, so that subsets had to be chosen. To prove that several different subsets lead to similar results a 5-fold cross validation procedure was performed. Therefore the different complexes from each class were grouped randomly in 5 groups. The optimisation was run 5 times each time leaving out one of the groups and optimising with the remaining four. The final results were calculated using the average value of the five optimisations.
Furthermore the effect of the obtained parameters was evaluated on 17 enzyme-inhibitor, 4 antibody-antigen and 4 other complexes from literature [36], which were not part of the training. The docking procedure for these testcases was run with a rotation increment of 12° leading to 43080 potential structures for each complex. The evaluation was done with respect to the number of complexes which do have a near-native solution within the top ranks and to the number of near-native structures on the first ranks.

Authors' contributions
The described results are obtained by a combined effort of the two authors where PH did the actual computer work including programming and integration of the method whereas the method was developed and results were discussed in frequent discussions between the two authors. The original idea was from DS.