Simulated unbound structures for benchmarking of protein docking in the Dockground resource
© Kirys et al. 2015
Received: 23 March 2015
Accepted: 10 July 2015
Published: 31 July 2015
Proteins play an important role in biological processes in living organisms. Many protein functions are based on interaction with other proteins. The structural information is important for adequate description of these interactions. Sets of protein structures determined in both bound and unbound states are essential for benchmarking of the docking procedures. However, the number of such proteins in PDB is relatively small. A radical expansion of such sets is possible if the unbound structures are computationally simulated.
The Dockground public resource provides data to improve our understanding of protein–protein interactions and to assist in the development of better tools for structural modeling of protein complexes, such as docking algorithms and scoring functions. A large set of simulated unbound protein structures was generated from the bound structures. The modeling protocol was based on 1 ns Langevin dynamics simulation. The simulated structures were validated on the ensemble of experimentally determined unbound and bound structures. The set is intended for large scale benchmarking of docking algorithms and scoring functions.
A radical expansion of the unbound protein docking benchmark set was achieved by simulating the unbound structures. The simulated unbound structures were selected according to criteria from systematic comparison of experimentally determined bound and unbound structures. The set is publicly available at http://dockground.compbio.ku.edu.
Keywords: Protein interactions; Protein docking; Molecular recognition; Conformational analysis
Proteins play an important role in biological processes in living organisms. Many protein functions are based on interaction with other proteins. Structural information is essential for an adequate description of these interactions. Protein interaction is characterized by structural and physicochemical recognition factors [1–3], and by conformational changes upon binding. Computational approaches to the structural modeling of protein interactions are important, given the limitations of experimental techniques. Significant progress in the computational prediction of protein-protein complexes (protein-protein docking) has been reflected in the community-wide assessment. The original steric complementarity-based algorithms paved the way to knowledge-based approaches [1, 3, 6, 7], including those based on similarity to existing co-crystallized complexes, low-resolution (coarse-grained) techniques, and proteome-wide applications.
The docking algorithms are generally based on the concept of structure complementarity, observed in experimentally determined complexes. Thus, most docking procedures perform better when the bound (co-crystallized) protein structures are used, assuring the perfect match between the structures. Such bound docking allows one to neglect the internal degrees of freedom (structural flexibility), providing for an effective search of the six-dimensional rigid-body space of the external coordinates. However, in the real-case scenario, the bound structures of the participating proteins are unknown, and one has to rely on the unbound (e.g. crystallized separately) proteins. Because of the huge number of potentially relevant internal degrees of freedom, the problem of unbound docking is far from being solved.
The rigid-body docking of unbound proteins results in structural mismatches at the putative interfaces. Thus, one approach to unbound docking lowers the resolution of the structures, alleviating the difference between unbound and bound structures, and decreasing the structural overlap. The downside of such an approach is the lower (low-resolution) precision of the predicted structure of the complex. An alternative paradigm is to use sophisticated scoring schemes to evaluate a large number (e.g. hundreds of thousands) of high-resolution rigid-body predictions, in anticipation that the scoring would capture the native interface containing structural mismatches (thus having high energy). Docking approaches that explicitly search the internal coordinates are being developed [10, 11]. However, their success in unbound docking is still limited. Template-based docking approaches (structure or sequence-based) are generally based on backbone alignment (followed by repacking of the side chains for the final prediction). Thus, in principle, they should not depend on the bound/unbound difference in the side-chain conformations. However, the difference in the backbone may affect the performance of the procedure.
For the development of docking techniques applicable to unbound proteins, it is essential to learn the experimentally determined difference between bound and unbound states. A number of proteins have structures experimentally determined in both unbound and bound states [4, 12]. In most proteins (71 % of complexes) the conformational change upon binding is < 2 Å all-atom RMSD. A significant number of complexes with larger RMSD have a domain shift, where conformational changes within the domains themselves are small. The other cases of large RMSD involve interface loops, which change conformation significantly upon binding. Thus, our ability to adequately address conformational changes in docking is important.
The utility of the unbound docking approaches is tested in the CAPRI blind experiment. To provide consistent sets for validation of docking and scoring procedures, benchmark sets of protein-protein complexes were compiled [13, 14]. However, the number of known representative protein pairs with experimentally determined structures in both bound and unbound states is relatively small (e.g., 176 in Weng’s Benchmark 4.0). At the same time, the number of co-crystallized complexes is much larger. A key feature of the Dockground resource is flexibility, which allows users to build datasets according to their own requirements. Such datasets can involve thousands of complexes and thus can be used for truly large-scale benchmarking of docking methodologies. Simulating the unbound structure from the bound one provides such an opportunity. Our earlier set of simulated unbound structures, based on an older version of Dockground, was generated by changing the side-chain conformations according to the rotamer library. In the current paper we describe a much larger set obtained by Langevin dynamics simulation and based on a systematic analysis of the experimentally determined bound/unbound structural differences. The set is a valuable resource for benchmarking docking procedures and for the development of docking methodologies.
Protein complexes were selected from the Bound part of Dockground with the following criteria: mean area buried ≥ 500 Å², inclusion of alternative binding modes, homo/hetero n-mers, and oligomers, and a redundancy cutoff of 97 %. The resulting set contained 1918 protein-protein complexes. The program Profix from the Jackal package (http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software) was used to build the disordered residues and missing atoms.
It was expected that dynamic simulation of separate bound protein structures, without the interacting partner, would relax interface side-chain conformations constrained by the interacting partner and thus approximate the unbound form of the protein. To speed up the calculations, we chose the Langevin Dynamics (LD) simulation in CHARMM (CHARMM22 force field), with electrostatics by Generalized Born approximation, performed on each bound structure without its interacting partner. Prior to LD simulation, the initial structures from PDB files were minimized (by 50 steps of steepest descent minimization followed by 500 steps of adopted basis Newton–Raphson minimization).
In the simulation, the backbone atoms of helices and strands were constrained with a force constant of 5 kcal/mol, the temperature was set to 309.6 K, the bond lengths were fixed using SHAKE with a tolerance of 1.0E-8, the friction coefficient (fbeta) was 5.0, and the simulation time was 1 ns, with 100 snapshots saved.
Protein structures for which the simulation crashed were removed. The simulation yielded 3205 protein structures. Both proteins were simulated in 1530 complexes, and only one partner was simulated in 145 complexes (2 × 1530 + 145 = 3205). For each protein, the structure with the largest all-atom RMSD from the bound structure was designated as the simulated unbound structure. Of the simulated proteins, 1245 were from non-obligate and 1960 from obligate complexes, according to the NOXClass procedure.
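The selection rule described above (keep the structure farthest from the bound one) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the candidate structures are the saved snapshots, that all coordinates are already superimposed on the bound structure, and that atoms are listed in the same order; the function names `rmsd` and `pick_simulated_unbound` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """All-atom RMSD between two equally ordered lists of (x, y, z) tuples.

    Assumption: the two structures are already superimposed, so no
    optimal rotation/translation is computed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def pick_simulated_unbound(bound, snapshots):
    """Return (index, rmsd) of the snapshot with the largest RMSD from `bound`."""
    values = [rmsd(bound, snapshot) for snapshot in snapshots]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]


# Toy data (hypothetical): a two-atom "protein" and three snapshots.
bound = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
snapshots = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to the bound structure
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],  # every atom displaced by 1 Å
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],  # every atom displaced by 2 Å
]
index, value = pick_simulated_unbound(bound, snapshots)
```

In practice the snapshots would first be superimposed on the bound structure (for example with a least-squares fit) before the RMSD is computed; that step is omitted here to keep the sketch short.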
Comparison with experimentally determined structures
Generation of the set
To validate the simulated unbound structures, an ensemble of unbound and bound experimentally determined structures from PDB was selected for six proteins: ovomucoid, pancreatic trypsin inhibitor, chemotaxis protein CheY, ubiquitin, RNase A, and lysozyme C (for details see ). The extent of the bound-to-unbound change and the similarity between bound and unbound ensembles were calculated in terms of all-atom full-structure RMSD and all-atom interface RMSD. Interface residues were defined as those losing > 1 Å² of their surface upon binding.
The number of unbound structures in the ensemble ranged from 27 to 394 per protein. The difference between bound and unbound structures ranged from 0.7 to 7.3 Å in full-structure RMSD, and from 0.3 to 11.7 Å in interface RMSD. The mean RMSD between bound and unbound structures was 1.9 Å (for both the full structure and the interface).
For consistency, as an option for users who would like to utilize structures of the same origin, we simulated the unbound structures in cases where the X-ray unbound structure is known. The Dockground selection of monomers, with sequence identity between bound and unbound structures ≥ 97 %, and no ligands at the unbound interfaces, yielded 172 unbound/bound proteins. Among them, a single unbound structure was available for 79 proteins, with the others having multiple unbound structures. The average RMSD between bound and unbound structures was 1.2 Å (0.3 – 3.9 Å range) for the full structure, and 1.5 Å (0.3 – 5.0 Å range) for the interface. The relatively small RMSD between bound and unbound structures could partially be explained by the fact that some proteins designated as monomers (and thus treated as unbound) are crystallized as homodimers. If proteins with bound/unbound RMSD ≤ 1 Å (likely not true unbound cases) are excluded, the average RMSD is 1.4 Å for the full structure, and 1.8 Å for the interface, similar to the difference between the bound and the simulated unbound structures.
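The interface definition above reduces to a per-residue comparison of surface area in the isolated protein and in the complex. The Python sketch below shows one way to express it, assuming that "surface" means solvent-accessible surface area and that per-residue areas have already been computed by an external tool; the function name, the `(chain, residue number)` keys, and the numbers are hypothetical.

```python
def interface_residues(area_free, area_complex, cutoff=1.0):
    """Residues losing more than `cutoff` Å² of surface upon binding.

    area_free    -- per-residue surface area of the isolated protein (Å²)
    area_complex -- per-residue surface area of the same protein in the complex (Å²)
    Residues missing from `area_complex` are treated as unchanged.
    """
    return sorted(
        residue
        for residue, free in area_free.items()
        if free - area_complex.get(residue, free) > cutoff
    )


# Hypothetical per-residue areas keyed by (chain ID, residue number).
area_free = {("A", 10): 45.2, ("A", 11): 30.0, ("A", 12): 12.5}
area_complex = {("A", 10): 5.0, ("A", 11): 29.5, ("A", 12): 12.5}

iface = interface_residues(area_free, area_complex)
```

Here only residue 10 loses more than 1 Å² (residue 11 loses 0.5 Å², residue 12 loses nothing), so it is the only one included in the interface RMSD calculation.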
Eglin C also belongs to the potato chymotrypsin inhibitor family and has a flexible binding loop. Comparison of the loop conformations of bound, unbound, and modeled structures (Fig. 4) shows that some loop conformations in the unbound NMR ensemble are close to the bound conformation, whereas other conformations are similar to the simulated unbound structure. While this supports the conformational selection mechanism upon binding for eglin C, suggested by molecular simulations of a serine protease inhibitor, it also confirms the validity of the simulation protocol.
To further expand the pool of structures with similar bound/unbound differences (see above), obligate complexes were included as an option. Although they would not have an unbound structure in vivo or in vitro, the algorithms that distinguish between obligate and non-obligate complexes have limited reliability. The option to exclude such complexes is implemented in the user interface of the Dockground resource. An example of such a complex is the nerve growth factor protein (Fig. 5), for which the conformational change upon simulation is confirmed by experimental evidence. This protein has structural flexibility in the loop regions, reflected in our simulation, and this structural malleability might be important in binding.
Availability of the set
The resulting set of 3184 PDB-formatted files is available on the Dockground site (http://dockground.compbio.ku.edu) on the “Unbound -> Build Database” page, and as a “Quick Download” link. Users can download either the entire set or any combination of the available subsets. In addition to selecting obligate and/or non-obligate complexes, the interface allows users to download structures for which simulated unbound structures were generated for both monomers in the complex or for only one. Users can also include simulated unbound structures for which a corresponding X-ray unbound structure exists in the Dockground unbound docking benchmark 3.0. The names of the files start with the PDB code of the initial bound structure, followed by _u1 or _u2 for the first and second chain in the initial complex, respectively. Chain IDs and residue numbering were kept as in the original PDB files.
The Dockground public resource provides data to improve our understanding of protein–protein interactions and to assist in the development of docking algorithms and scoring functions. Sets of protein structures determined in both bound and unbound states are essential for benchmarking docking procedures. However, the number of such proteins in PDB is relatively small. A radical expansion of such sets is possible if the unbound structures are computationally simulated. Such a set of simulated unbound protein structures was generated for the Dockground resource. The modeling protocol was based on 1 ns Langevin dynamics simulation. The simulated unbound structures were selected according to criteria from a systematic comparison of experimentally determined bound and unbound structures. The set is publicly available at http://dockground.compbio.ku.edu.
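The file naming convention makes it straightforward to pair the downloaded files back into complexes. The Python sketch below is only an illustration. It assumes that `_u1` or `_u2` directly follows a four-character PDB code and that files carry a `.pdb` extension, neither of which is stated explicitly; the example file names and the helper names `parse_name` and `group_by_complex` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: four-character PDB code, then _u1 or _u2 (partner index).
_NAME = re.compile(r"^(?P<pdb>[0-9][A-Za-z0-9]{3})_u(?P<partner>[12])")


def parse_name(filename):
    """Return (pdb_code, partner_index) parsed from a simulated-unbound file name."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group("pdb").lower(), int(match.group("partner"))


def group_by_complex(filenames):
    """Map each PDB code to the set of partner indices (1 and/or 2) present."""
    partners = defaultdict(set)
    for name in filenames:
        pdb_code, partner = parse_name(name)
        partners[pdb_code].add(partner)
    return dict(partners)


# Hypothetical file names, not taken from the actual set.
files = ["1abc_u1.pdb", "1abc_u2.pdb", "2xyz_u2.pdb"]
groups = group_by_complex(files)

# Complexes with both partners simulated versus only one.
both = sorted(code for code, idx in groups.items() if idx == {1, 2})
single = sorted(code for code, idx in groups.items() if len(idx) == 1)
```

Grouping by PDB code in this way separates complexes where both monomers were simulated from those where only one was, mirroring the two subsets offered by the download interface.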
The authors are grateful to Huan Rui and Sunhwan Jo for help with CHARMM simulations. This study was supported by NIH grant R01GM074255 and NSF grant DBI1262621.
- Vakser IA. Protein-protein docking: From interaction to interactome. Biophys J. 2014;107:1785–93.
- Vakser IA. Low-resolution structural modeling of protein interactome. Curr Opin Struct Biol. 2013;23:198–205.
- Sudha G, Nussinov R, Srinivasan N. An overview of recent advances in structural bioinformatics of protein-protein interactions and a guide to their principles. Prog Bioph Mol Biol. 2014;116:141–50.
- Ruvinsky AM, Kirys T, Tuzikov AV, Vakser IA. Side-chain conformational changes upon protein-protein association. J Mol Biol. 2011;408:356–65.
- Lensink MF, Wodak SJ. Docking, scoring, and affinity prediction in CAPRI. Proteins. 2013;81:2082–95.
- Szilagyi A, Zhang Y. Template-based structure modeling of protein–protein interactions. Curr Opin Struct Biol. 2014;24:10–23.
- Rodrigues JPGLM, Bonvin AMJJ. Integrative computational modeling of protein interactions. FEBS J. 2014;281:1988–2003.
- Vakser IA. Protein docking for low-resolution structures. Protein Eng. 1995;8:371–7.
- Vajda S, Hall DR, Kozakov D. Sampling and scoring: A marriage made in heaven. Proteins. 2013;81:1874–84.
- Mirzaei H, Beglov D, Paschalidis IC, Vajda S, Vakili P, Kozakov D. Rigid body energy minimization on manifolds for molecular docking. J Chem Theory Comput. 2012;8:4374–80.
- Wang C, Bradley P, Baker D. Protein–protein docking with backbone flexibility. J Mol Biol. 2007;373:503–19.
- Ruvinsky AM, Kirys T, Tuzikov AV, Vakser IA. Ensemble-based characterization of unbound and bound states on protein energy landscape. Protein Sci. 2013;22:734–44.
- Gao Y, Douguet D, Tovchigrechko A, Vakser IA. DOCKGROUND system of databases for protein recognition studies: Unbound structures for docking. Proteins. 2007;69:845–51.
- Hwang H, Vreven T, Janin J, Weng Z. Protein–protein docking benchmark version 4.0. Proteins. 2010;78:3111–4.
- Douguet D, Chen HC, Tovchigrechko A, Vakser IA. DOCKGROUND resource for studying protein-protein interfaces. Bioinformatics. 2006;22:2612–8.
- Canutescu AA, Shelenkov AA, Dunbrack RL. A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci. 2003;12:2001–14.
- Zhu H, Domingues FS, Sommer I, Lengauer T. NOXclass: Prediction of protein-protein interaction types. BMC Bioinformatics. 2006;7:27.
- Karplus M, McCammon JA. Molecular dynamics simulations of biomolecules. Nature Struct Biol. 2002;9:646–52.
- Meinhold L, Smith JC, Kitao A, Zewail AH. Picosecond fluctuating protein energy landscape mapped by pressure–temperature molecular dynamics simulation. Proc Natl Acad Sci U S A. 2007;104:17261–5.
- Kohn JE, Afonine PV, Ruscio JZ, Adams PD, Head-Gordon T. Evidence of functional protein dynamics from X-Ray crystallographic ensembles. PLoS Comp Biol. 2010;6:e1000911.
- Ludvigsen S, Shen HY, Kjaer M, Madsen JC, Poulsen FM. Refinement of the three-dimensional solution structure of barley serine proteinase inhibitor 2 and comparison with the structures in crystals. J Mol Biol. 1991;222:621–35.
- Dauter Z, Betzel C, Genov N, Pipon N, Wilson KS. Complex between the subtilisin from a mesophilic bacterium and the leech inhibitor eglin-C. Acta Cryst B. 1991;47(Pt 5):707–30.
- Gaspari Z, Varnai P, Szappanos B, Perczel A. Reconciling the lock-and-key and dynamic views of canonical serine protease inhibitor action. FEBS Lett. 2010;584:203–6.
- Nooren IMA, Thornton JM. Structural characterisation and functional significance of transient protein–protein interactions. J Mol Biol. 2003;325:991–1018.
- Holland DR, Cousens LS, Meng W, Matthews BW. Nerve growth factor in different crystal forms displays structural flexibility and reveals zinc binding sites. J Mol Biol. 1994;239:385–400.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.