DECK: Distance and environment-dependent, coarse-grained, knowledge-based potentials for protein-protein docking
© Liu and Vakser; licensee BioMed Central Ltd. 2011
Received: 18 March 2011
Accepted: 11 July 2011
Published: 11 July 2011
Computational approaches to protein-protein docking typically include scoring aimed at improving the rank of the near-native structure relative to the false-positive matches. Knowledge-based potentials improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. An essential element of knowledge-based potentials is defining the reference state for an optimal description of the residue-residue (or atom-atom) pairs in the non-interaction state.
The study presents a new Distance- and Environment-dependent, Coarse-grained, Knowledge-based (DECK) potential for scoring of protein-protein docking predictions. Training sets of protein-protein matches were generated based on bound and unbound forms of proteins taken from the DOCKGROUND resource. Each residue was represented by a pseudo-atom in the geometric center of the side chain. To capture the long-range and the multi-body interactions, residues in different secondary structure elements at protein-protein interfaces were considered as different residue types. Five reference states for the potentials were defined and tested. The optimal reference state was selected and the cutoff effect on the distance-dependent potentials investigated. The potentials were validated on the docking decoys sets, showing better performance than the existing potentials used in scoring of protein-protein docking results.
A novel residue-based statistical potential for protein-protein docking was developed and validated on docking decoy sets. The results show that the scoring function DECK can successfully identify near-native protein-protein matches and thus is useful in protein docking. In addition to the practical application of the potentials, the study provides insights into the relative utility of the reference states, the scope of the distance dependence, and the coarse-graining of the potentials.
Protein-protein interactions are a key element of life processes. Thus better understanding of these interactions, coupled with our ability to model them, is essential for the fundamental knowledge of their biology and the multitude of biomedical applications.
Computational approaches to structural determination of protein-protein complexes (protein-protein docking) typically involve two steps: the global, often low-resolution, search within a computationally feasible timeframe to detect a set of matches that includes at least one near-native structure (scan stage), and the local refinement of the matches from the scan stage that may involve more computationally expensive protocols. Such refinement often includes scoring aimed at improving the rank of the near-native structure relative to the false-positive matches.
Knowledge-based potentials [1, 2], physics-based potentials, and hybrid potentials [4–6] have been shown to perform successfully in protein-protein docking benchmark tests. However, the limited ranking ability of the current scoring functions in CAPRI suggests that much work still has to be done.
In structure prediction of individual proteins, the knowledge-based scoring functions gained significant popularity [8–10]. It has been shown that knowledge-based pairwise atomic potentials perform better than the physics-based potentials in the near-native structure refinement.
An essential element of knowledge-based potentials is defining the reference state for the optimal description of residue-residue (or atom-atom) pairs in the non-interaction state. For protein-protein interactions, there are generally three methods of defining the non-interaction state. The first is based on large-distance cutoffs (e.g., DFIRE, DCOMPLEX with the DFIRE-based potential, DOPE, and volume correction [15, 16]), the second is based on random mixing of residue or atom types (e.g., KBP and DBD-Hunter), and the third is based on false-positive matches/decoys (e.g., RAPDF, PIPER, and DARS). Our approach utilizes reference states based on protein-protein decoys. It was shown that long-range cooperative interactions play an important role in protein-protein association. However, they are difficult to model with contact or physics-based potentials. On the other hand, coarse-grained distance-dependent potentials are a simple way to capture the long-range residue-residue interactions. In this paper we present a new Distance- and Environment-dependent, Coarse-grained, Knowledge-based (DECK) potential for scoring of protein-protein docking predictions.
A test was also performed on the RosettaDock unbound docking decoy set from the Gray lab. The set includes 54 complexes. Each complex has the top 200 structures from the global search based on unbound structures with rebuilt side chains. This decoy set represents another important facet of protein docking. The ZDOCK3.0+ZRANK set has the rigid-body docking output, which typically contains a large number of matches for further structural refinement. The RosettaDock set contains the structures with optimized side-chain conformations, representing an expected output of a flexible structure refinement. Such a refinement is computationally expensive and thus yields a significantly smaller number of matches, which are meant to be structurally more accurate than the rigid-body docking output.
DECK 1 and 2, and Contact Potential were tested and compared with RosettaDock, DCOMPLEX and ZRANK score values. The RosettaDock score values were obtained from the file in the decoy set. The scores of DCOMPLEX and ZRANK were computed locally. With a hit defined as a match with ligand RMSD <5 Å, 28 of 54 complexes had at least one hit. The results are shown in Figure 3B. If the hit was redefined as a match with ligand RMSD <10 Å, 37 of 54 complexes in the decoy set had at least one hit. Figure 3C shows the results according to this definition. As the results indicate, in both cases, DECK 2 outperformed other potentials across all top N predictions.
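The top-N evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names and toy data are hypothetical, and it assumes that a lower score means a better-ranked match (energy-like scoring).

```python
def has_hit_in_top_n(scores, rmsds, n, rmsd_cutoff=5.0):
    """True if any of the n best-scoring matches is a hit.

    Assumption: lower score = better rank (energy-like score).
    A hit is a match with ligand RMSD strictly below rmsd_cutoff.
    """
    ranked = sorted(zip(scores, rmsds), key=lambda pair: pair[0])
    return any(rmsd < rmsd_cutoff for _, rmsd in ranked[:n])


def success_rate(complexes, n, rmsd_cutoff=5.0):
    """Fraction of complexes with at least one hit among the top n matches."""
    successes = sum(
        has_hit_in_top_n(scores, rmsds, n, rmsd_cutoff)
        for scores, rmsds in complexes
    )
    return successes / len(complexes)


# Toy data: (scores, ligand RMSDs in angstroms) for two hypothetical complexes.
toy_complexes = [
    ([-3.2, -1.0, -2.5], [12.0, 4.1, 8.3]),
    ([-0.5, -4.0, -1.1], [3.0, 20.0, 15.0]),
]

rate_top1_strict = success_rate(toy_complexes, n=1, rmsd_cutoff=5.0)
rate_top2_relaxed = success_rate(toy_complexes, n=2, rmsd_cutoff=10.0)
rate_top3_strict = success_rate(toy_complexes, n=3, rmsd_cutoff=5.0)
```

Changing `rmsd_cutoff` from 5.0 to 10.0 corresponds to the relaxed hit definition used for the second comparison.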
The scoring procedure implementing DECK is available from the authors upon request (email@example.com).
The knowledge-based potentials improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. The distance dependence of these potentials is supposed to provide a more accurate description of protein-protein interactions by taking into account the structural and physicochemical aspects of the interacting proteins within a broader scope than the immediate contact across the interface. The coarse-graining of the potentials makes them less sensitive to the structural inaccuracies of the proteins, which are unavoidable for unbound X-ray and potentially modeled proteins, especially in high-throughput applications to large interaction networks.
Five reference states for the coarse-grained, distance-dependent, knowledge-based potentials were used in this study. Similar reference states in earlier studies focused on protein structure prediction and protein folding [19, 26, 27]. We applied a similar form of the potential to protein-protein docking, redefining the reference states based on the non-native matches (docking decoys). The larger number of non-native matches models random protein-protein binding with reasonable accuracy. The long-range interactions were accounted for by incorporating the structural environment of the interacting residues. Docking decoys were used as a reference state earlier in the DARS potentials. However, our method differs in three key points. The first is the detailed form of the potential. DARS is based on the mole fraction potential, uniform reference state, and atomic contact potentials (the random crystal reference state: the atom pairs are randomly exchanged). In our method, reference states 1 and 2 also include the mole fraction terms. However, they also incorporate the probability of finding residue types at a certain distance. The second point is the way the observed and the expected probabilities of residue pairs are calculated. The observed probability in DARS is based on the native structure. In our study, the observed probability based on the native structure made the results worse when tested on GRAMM-X decoys (data not shown). The main reason was the limited number of nonredundant protein-protein interfaces. Therefore, in our approach the near-native matches were used instead of the native complexes. The DARS approach used the 20,000 best-scoring matches (shape complementarity only) for calculating the reference probabilities. We used ~160,000 best-scoring matches without the near-native hits for calculating the expected probability in each case. The third point is the resolution. Our method is coarse-grained.
Because in this work we do not integrate our potential in the FFT search, a direct comparison of the results is difficult. However, both studies show that the reference states based on decoys perform better than the ones based on mole fraction terms. Overall, the results show that the scoring function DECK can successfully identify near-native protein-protein matches and thus is useful in protein docking.
Scoring of predicted protein-protein matches is important for identification of near-native structures in a pool of models. Knowledge-based scoring schemes improve modeling of protein complexes by taking advantage of the rapidly increasing amount of experimentally derived information on protein-protein association. A choice of the reference state for the description of non-interacting residue or atom pairs is an essential element of the knowledge-based potentials. The study presents a new potential for scoring of protein-protein docking predictions. Training sets of protein-protein matches were generated based on the bound and unbound proteins from the DOCKGROUND resource. Each residue was represented by a pseudo-atom in the geometric center of the side chain. To capture the long-range and the multi-body interactions, residues in different secondary structure elements at protein-protein interfaces were considered as different residue types. Five reference states for the potentials were defined and tested. The optimal reference state was selected and the cutoff effect on the distance-dependent potentials investigated. The potentials were validated on the docking decoys sets, showing better performance than the existing potentials used in scoring of protein-protein docking results. The study also provides insights into the relative utility of the reference states, the scope of the distance dependence and the coarse-graining of the potentials.
The bound and the unbound complexes for the training sets were taken from the DOCKGROUND resource [22, 29, 30] (http://dockground.bioinformatics.ku.edu). The bound complexes were from the representative bound set and the bound part of the docking benchmark. The unbound complexes were from the docking benchmark. For all the complexes, the docking decoys were generated by the GRAMM-X scan (with no scoring and refinement). A match with RMSD of the ligand backbone atoms <5 Å was defined as near-native, comparable with the CAPRI evaluation criteria. With 160,000 matches per complex, 358 bound complexes from the representative set, and 71 bound complexes and 50 unbound complexes from the docking benchmark set, had at least one near-native prediction. Two training sets were compiled: Training Set 1 (408 complexes), including 358 bound complexes from the representative set and 50 unbound complexes from the docking benchmark, and Training Set 2 (429 complexes), including 358 bound complexes from the representative set and 71 bound complexes from the docking benchmark. It is well known that existing protein-protein docking procedures perform differently on bound and unbound structures. Thus, it is interesting to see the difference between the knowledge-based potentials derived from the bound and from the unbound docking, especially with the potentials tested on the unbound docking decoys.
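The near-native criterion above can be illustrated with a short sketch. It assumes that the decoy and the native complex are already in the same frame (receptors superimposed), so that the ligand RMSD is computed directly over matching ligand backbone atoms without further fitting; the function names are hypothetical.

```python
import math


def ligand_rmsd(decoy_coords, native_coords):
    """RMSD between two equally ordered lists of (x, y, z) backbone atoms.

    Assumption: both structures share one frame, so no superposition is done.
    """
    if not decoy_coords or len(decoy_coords) != len(native_coords):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2
        for a, b in zip(decoy_coords, native_coords)
    )
    return math.sqrt(squared / len(decoy_coords))


def is_near_native(decoy_coords, native_coords, cutoff=5.0):
    """Near-native match: ligand backbone RMSD strictly below the cutoff (5 angstroms)."""
    return ligand_rmsd(decoy_coords, native_coords) < cutoff


# Toy example: a two-atom "ligand" shifted rigidly by 3 and by 5 angstroms.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted_3 = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
shifted_5 = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
```

Matches that pass `is_near_native` would feed the observed statistics, and the remaining decoys the expected statistics.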
Knowledge-based energy functions
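The potential for residue pair (i, j) at distance d is written here in the standard inverse Boltzmann form, $E(i,j,d) = -RT \ln\left[\pi^{obs}(i,j,d) / \pi^{exp}(i,j,d)\right]$ (1), where $\pi^{obs}(i,j,d)$ and $\pi^{exp}(i,j,d)$ are the observed and the expected probabilities of the residue pair (i, j) at distance d, respectively, and RT is set to 1.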
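As a minimal sketch of how such a potential turns probabilities into a score, the code below assumes the standard inverse Boltzmann (log-odds) relation with RT = 1. The function names, the dictionary layout of the potential, and the choice to give unseen pairs zero energy are illustrative assumptions, not details taken from the DECK implementation.

```python
import math


def pair_energy(p_obs, p_exp, rt=1.0):
    """Energy of a residue pair from observed and expected probabilities.

    Assumes the inverse Boltzmann form E = -RT * ln(p_obs / p_exp).
    """
    if p_obs <= 0.0 or p_exp <= 0.0:
        raise ValueError("probabilities must be positive")
    return -rt * math.log(p_obs / p_exp)


def score_match(interface_pairs, potential):
    """Sum pair energies over (type_i, type_j, distance_bin) keys.

    Assumption: pairs absent from the potential contribute zero.
    """
    return sum(potential.get(key, 0.0) for key in interface_pairs)


# A pair seen twice as often as expected gets a favorable (negative) energy.
favorable = pair_energy(0.2, 0.1)
neutral = pair_energy(0.1, 0.1)

# Hypothetical two-entry potential and a three-pair interface.
toy_potential = {(0, 5, 3): favorable, (1, 2, 7): 0.4}
toy_score = score_match([(0, 5, 3), (1, 2, 7), (9, 9, 9)], toy_potential)
```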
The interaction distance was divided into 21 bins. Comparison with the contact potential (Figure 3) suggests that the larger number of bins enhances the performance of the potential. At the same time, increasing the number of bins beyond 21 would contradict the coarse-grained, residue-based nature of the potential.
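A distance-to-bin mapping of this kind can be sketched as below. Only the number of bins (21) comes from the text. The equal-width binning and the example cutoff of 21.0 Å are assumptions made purely for illustration, since the bin boundaries are not given here.

```python
N_BINS = 21  # number of distance bins stated in the text


def distance_bin(distance, cutoff, n_bins=N_BINS):
    """Map a distance to a bin index in [0, n_bins), or None beyond the cutoff.

    Assumption: bins have equal width cutoff / n_bins.
    """
    if distance < 0.0:
        raise ValueError("distance must be non-negative")
    if distance >= cutoff:
        return None
    width = cutoff / n_bins
    # min() guards against floating-point rounding at the upper edge.
    return min(int(distance / width), n_bins - 1)


# Illustrative cutoff only (hypothetical value): 21.0 angstroms gives 1-angstrom bins.
EXAMPLE_CUTOFF = 21.0
first_bin = distance_bin(0.0, EXAMPLE_CUTOFF)
middle_bin = distance_bin(5.5, EXAMPLE_CUTOFF)
last_bin = distance_bin(20.9, EXAMPLE_CUTOFF)
beyond = distance_bin(21.0, EXAMPLE_CUTOFF)
```

With a single bin (`n_bins=1`) the same function reduces to a contact potential, which is the comparison the paragraph above draws.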
Five reference states from the existing methodologies were defined. Each residue was represented by a pseudo-atom in the geometric center of the side chain (for GLY, the geometric center of the main chain). The distance between residues i and j was defined as the distance between their pseudo-atoms. Atomic environment potential was used to model multi-body interaction from pairwise contact potentials. To capture the long-range and the multi-body interactions, residues in different secondary structure environments [39, 40] (helix, strand, and coil) at protein-protein interfaces were considered as different residue types. The total number of such types was 60 (20 amino acids in three secondary structure states). The secondary structure state was calculated by DSSP. The eight DSSP secondary structure states are usually placed in three groups: helix (G, H and I), strand (E and B) and loop (all others). In our study, all states other than H and E were designated as O. So the three secondary structure states were H, E and O.
All residue-residue pairs were from protein-protein interfaces of the near-native matches or non-near-native decoys. A residue was assigned to the interface if its centroid was within 30 Å of any residue centroid of the other docking partner. Different methods of calculating the probabilities in observed and expected states lead to different potentials. In the following part, we will discuss five different methods used to define the reference state.
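The 60 residue types can be enumerated as in the sketch below. The reduction of DSSP codes to H, E and O follows the text; the three-letter residue names, the ordering of the type indices, and the function names are assumptions for illustration.

```python
AMINO_ACIDS = (
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
)
SS_STATES = ("H", "E", "O")


def reduce_dssp(dssp_code):
    """Keep H and E; map every other DSSP state (G, I, B, T, S, blank) to O."""
    return dssp_code if dssp_code in ("H", "E") else "O"


# Index for each (amino acid, reduced state) pair; the ordering is arbitrary.
RESIDUE_TYPES = {
    (aa, ss): 3 * aa_idx + ss_idx
    for aa_idx, aa in enumerate(AMINO_ACIDS)
    for ss_idx, ss in enumerate(SS_STATES)
}


def residue_type(amino_acid, dssp_code):
    """Return the environment-dependent residue type index (0..59)."""
    return RESIDUE_TYPES[(amino_acid.upper(), reduce_dssp(dssp_code))]
```

Note that under this scheme a 3-10 helix (G) or an isolated bridge (B) is treated as O, unlike the conventional three-group reduction mentioned above.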
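The pseudo-atom and interface definitions can be sketched as follows. This is an illustrative brute-force version with hypothetical function names; "within 30 Å" is interpreted here as a distance less than or equal to the cutoff, which is an assumption.

```python
import math


def centroid(coords):
    """Geometric center of a non-empty list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(atom[k] for atom in coords) / n for k in range(3))


def interface_indices(own_centroids, partner_centroids, cutoff=30.0):
    """Indices of residues whose centroid is within cutoff of any partner centroid."""
    return [
        idx
        for idx, own in enumerate(own_centroids)
        if any(math.dist(own, other) <= cutoff for other in partner_centroids)
    ]


# Toy centroids (angstroms) for a two-residue receptor and a two-residue ligand.
receptor = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
ligand = [(10.0, 0.0, 0.0), (200.0, 0.0, 0.0)]
receptor_interface = interface_indices(receptor, ligand)
ligand_interface = interface_indices(ligand, receptor)
side_chain_center = centroid([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

For GLY, `centroid` would be called on the main-chain atoms instead of the side-chain atoms, as stated in the text.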
Reference state 1
where $n_p$ is the total number of complexes; $n_m$ is the total number of near-native matches in each complex; the number of residue types (NRT) is 60; and $g_{p,m}(i,j,d)$ is the total number of i, j pairs at distance d in near-native structure m of complex p.
The observed probability of residue pair (i, j) is estimated from the near-native matches. The expected probability of residue pair (i, j) was calculated in the same way as the observed probability, using all decoys instead of the near-native matches in Eqs. (2-5).
Reference state 2
where d is the distance between residues i and j.
where $N^{obs}(i,j,d)$, $N^{obs}(d)$ and the mole fraction $\chi_i$ are calculated according to Eqs. (3-5).
Reference state 3
Reference state 4
where d is the distance between residues i and j.
Reference state 5
The observed probability $\pi^{obs}(i,j,d)$ and the expected probability $\pi^{exp}(i,j,d)$ of residue pair (i, j) were calculated according to Eq. (11) from the near-native matches and the non-native decoys, respectively. The only difference between the two probabilities is the set of structures used for the statistics: the near-native matches for the former and the non-native decoys for the latter.
SL is an associate professor at the Department of Physics at Huazhong University of Science and Technology, and IAV is the director of the Center for Bioinformatics and professor of Bioinformatics and Molecular Biosciences at The University of Kansas.
The authors thank Anatoly Ruvinsky for helpful comments and suggestions. The authors are grateful to Yaoqi Zhou for providing the DCOMPLEX program. The study was supported by NIH grant R01 GM074255.
1. Mintseris J, Pierce B, Wiehe K, Anderson R, Chen R, Weng Z: Integrating statistical pair potentials into protein complex prediction. Proteins 2007, 69: 511–520. doi:10.1002/prot.21502
2. Chuang GY, Kozakov D, Brenke R, Comeau SR, Vajda S: DARS (Decoys As the Reference State) potentials for protein-protein docking. Biophys J 2008, 95: 4217–4227. doi:10.1529/biophysj.108.135814
3. May A, Zacharias M: Energy minimization in low-frequency normal modes to efficiently allow for global flexibility during systematic protein-protein docking. Proteins 2008, 70: 794–809.
4. Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D: Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 2003, 331: 281–299. doi:10.1016/S0022-2836(03)00670-3
5. Pierce B, Weng Z: ZRANK: Reranking protein docking predictions with an optimized energy function. Proteins 2007, 67: 1078–1086. doi:10.1002/prot.21373
6. Andrusier N, Nussinov R, Wolfson HJ: FireDock: Fast interaction refinement in molecular docking. Proteins 2007, 69: 139–159. doi:10.1002/prot.21495
7. Lensink MF, Wodak SJ: Docking and scoring protein interactions: CAPRI 2009. Proteins 2010, 78: 3073–3084.
8. Simons KT, Kooperberg C, Huang ES, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol 1997, 268: 209–225. doi:10.1006/jmbi.1997.0959
9. Zhang Y, Kolinski A, Skolnick J: TOUCHSTONE II: A new approach to ab initio protein structure prediction. Biophys J 2003, 85: 1145–1164. doi:10.1016/S0006-3495(03)74551-2
10. Zhang Y: Progress and challenges in protein structure prediction. Curr Opin Struct Biol 2008, 18: 342–348. doi:10.1016/j.sbi.2008.02.004
11. Summa CM, Levitt M: Near-native structure refinement using in vacuo energy minimization. Proc Natl Acad Sci USA 2007, 104: 3177–3182. doi:10.1073/pnas.0611593104
12. Zhou H, Zhou Y: Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci 2002, 11: 2714–2726.
13. Liu S, Zhang C, Zhou H, Zhou Y: A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins 2004, 56: 93–101. doi:10.1002/prot.20019
14. Shen MY, Sali A: Statistical potential for assessment and prediction of protein structures. Protein Sci 2006, 15: 2507–2524. doi:10.1110/ps.062416606
15. Xu B, Yang Y, Liang H, Zhou Y: An all-atom knowledge-based energy function for protein-DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins 2009, 76: 718–730. doi:10.1002/prot.22384
16. Su Y, Zhou A, Xia X, Li W, Sun Z: Quantitative prediction of protein-protein binding affinity with a potential of mean force considering volume correction. Protein Sci 2009, 18: 2550–2558. doi:10.1002/pro.257
17. Lu H, Skolnick J: A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 2001, 44: 223–232. doi:10.1002/prot.1087
18. Gao M, Skolnick J: DBD-Hunter: A knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res 2008, 36: 3978–3992. doi:10.1093/nar/gkn332
19. Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275: 895–916. doi:10.1006/jmbi.1997.1479
20. Kozakov D, Brenke R, Comeau SR, Vajda S: PIPER: An FFT-based protein docking program with pairwise potentials. Proteins 2006, 65: 392–406. doi:10.1002/prot.21117
21. Moza B, Buonpane RA, Zhu P, Herfst CA, Rahman AK, McCormick JK, Kranz DM, Sundberg EJ: Long-range cooperative binding effects in a T cell receptor variable domain. Proc Natl Acad Sci USA 2006, 103: 9867–9872. doi:10.1073/pnas.0600220103
22. Liu S, Gao Y, Vakser IA: DOCKGROUND protein-protein docking decoy set. Bioinformatics 2008, 24: 2634–2635. doi:10.1093/bioinformatics/btn497
23. Ruvinsky AM, Vakser IA: Interaction cutoff effect on ruggedness of protein-protein energy landscape. Proteins 2008, 70: 1498–1505.
24. Huang SY, Zou X: An iterative knowledge-based scoring function for protein-protein recognition. Proteins 2008, 72: 557–579. doi:10.1002/prot.21949
25. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-protein docking benchmark 2.0: An update. Proteins 2005, 60: 214–216. doi:10.1002/prot.20560
26. Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 1985, 18: 534–552. doi:10.1021/ma00145a039
27. Sippl MJ: Calculation of the conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 1990, 213: 859–883. doi:10.1016/S0022-2836(05)80269-4
28. Zhang C, Vasmatzis G, Cornette JL, DeLisi C: Determination of atomic desolvation energies from the structures of crystallized proteins. J Mol Biol 1997, 267: 707–726. doi:10.1006/jmbi.1996.0859
29. Douguet D, Chen HC, Tovchigrechko A, Vakser IA: DOCKGROUND resource for studying protein-protein interfaces. Bioinformatics 2006, 22: 2612–2618. doi:10.1093/bioinformatics/btl447
30. Gao Y, Douguet D, Tovchigrechko A, Vakser IA: DOCKGROUND system of databases for protein recognition studies: Unbound structures for docking. Proteins 2007, 69: 845–851. doi:10.1002/prot.21714
31. Tovchigrechko A, Vakser IA: GRAMM-X public web server for protein-protein docking. Nucleic Acids Res 2006, 34: W310-W314. doi:10.1093/nar/gkl206
32. Tanaka S, Scheraga HA: Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 1976, 9: 945–950. doi:10.1021/ma60054a013
33. Melo F, Feytmans E: Novel knowledge-based mean force potential at atomic level. J Mol Biol 1997, 267: 207–222. doi:10.1006/jmbi.1996.0868
34. Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11: 430–448.
35. Jiang L, Gao Y, Mao F, Liu Z, Lai L: Potential of mean force for protein-protein interaction studies. Proteins 2002, 46: 190–196. doi:10.1002/prot.10031
36. Fang Q, Shortle D: Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm. J Mol Biol 2006, 359: 1456–1467. doi:10.1016/j.jmb.2006.04.033
37. Wu Y, Lu M, Chen M, Li J, Ma J: OPUS-Ca: A knowledge-based potential function requiring only C-alpha positions. Protein Sci 2007, 16: 1449–1463. doi:10.1110/ps.072796107
38. Summa CM, Levitt M, Degrado WF: An atomic environment potential for use in protein structure prediction. J Mol Biol 2005, 352: 986–1001. doi:10.1016/j.jmb.2005.07.054
39. Zhang C, Kim SH: Environment-dependent residue contact energies for proteins. Proc Natl Acad Sci USA 2000, 97: 2550–2555. doi:10.1073/pnas.040573597
40. Fang Q, Shortle D: A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins 2005, 60: 90–96. doi:10.1002/prot.20482
41. Kabsch W, Sander C: Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637. doi:10.1002/bip.360221211
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.