An effective approach for generating a three-Cys2His2 zinc-finger-DNA complex model by docking
© Chou et al. 2010
Received: 21 December 2009
Accepted: 18 June 2010
Published: 18 June 2010
Skip to main content
© Chou et al. 2010
Received: 21 December 2009
Accepted: 18 June 2010
Published: 18 June 2010
Determination of protein-DNA complex structures with both NMR and X-ray crystallography remains challenging in many cases. High Ambiguity-Driven DOCKing (HADDOCK) is an information-driven docking program that has been used to successfully model many protein-DNA complexes. However, a protein-DNA complex model whereby the protein wraps around DNA has not been reported. Defining the ambiguous interaction restraints for the classical three-Cys2His2 zinc-finger proteins that wrap around DNA is critical because of the complicated binding geometry. In this study, we generated a Zif268-DNA complex model using three different sets of ambiguous interaction restraints (AIRs) to study the effect of the geometric distribution on the docking and used this approach to generate a newly reported Sp1-DNA complex model.
The complex models we generated on the basis of two AIRs with a good geometric distribution in each domain are reasonable in terms of the number of models with wrap-around conformation, interface root mean square deviation, AIR energy and fraction native contacts. We derived the modeling approach for generating a three-Cys2His2 zinc-finger-DNA complex model according to the results of docking studies using the Zif268-DNA and other three crystal complex structures. Furthermore, the Sp1-DNA complex model was calculated with this approach, and the interactions between Sp1 and DNA are in good agreement with those previously reported.
Our docking data demonstrate that two AIRs with a reasonable geometric distribution in each of the three-Cys2His2 zinc-finger domains are sufficient to generate an accurate complex model with protein wrapping around DNA. This approach is efficient for generating a zinc-finger protein-DNA complex model for unknown complex structures in which the protein wraps around DNA. We provide a flowchart showing the detailed procedures of this approach.
Determining the structure of protein-DNA complexes and elucidating the details that govern their interaction is essential to better understand many biological processes. In many instances, limitations in crystallization and difficulties in obtaining the intermolecular nuclear Overhauser effects by NMR experiments are obstacles to determining the structure of protein-DNA complexes . Homology modeling is an alternative approach to obtain a protein-DNA complex model. Programs such as TFmodeller can model the complex according to homologous complex structure . The major limitation of this approach is that high conservation of interface residues between the target and template is required for generating a good homology complex model. The high conservation of interface residues may not be possible in many cases; for example, in the zinc finger protein family, the DNA recognition residues and the interacting DNA are not well conserved. Thus, the prediction of the detailed interaction for the entire zinc-finger protein-DNA complex based on the homologous complex structure may not be effective. Hence, other approaches are required to obtain good complex models.
Few structurally based approaches to understand and predict the specificity and binding affinity of the zinc-finger protein-DNA interactions have been reported [3–5]. The applicability of these structurally based approaches will significantly increase with the availability of zinc-finger protein-DNA complex models. One study  used homology models to predict the binding affinities and specificities of protein-DNA complexes, including zinc-finger-DNA complexes. However, the homology modeling complexes are limited by sensitivity to protein and DNA backbone orientation , which may affect the prediction of the detailed interaction between the protein and DNA.
Biomolecular docking is an alternative approach to modeling zinc finger protein-DNA complexes. However, the inherent flexibility of DNA and the scarcity of information about the precise surfaces of DNA involved in interactions with associated proteins represent two major hurdles in computational docking . High Ambiguity-Driven biomolecular DOCKing (HADDOCK)  is an information-driven program that successfully addresses the global and local DNA flexibility in modeling protein-DNA complexes. The information on interfaces is derived from biochemical and/or biophysical experiments and introduced as ambiguous interaction restraints (AIRs)  to drive the protein-DNA docking. Although several studies have successfully used HADDOCK in generating protein-DNA complex models [11–15], none have analyzed the proteins that wrap around the DNA, such as the three-Cys2His2 zinc-finger-DNA complex. In this study, we focused on modeling the entire three-Cys2His2 zinc-finger-DNA complex by use of the HADDOCK program.
For protein-DNA complexes, two structural factors determine binding geometries: the tight fitting between DNA and protein surfaces and the matching of the residue and base positions . Several challenges must be factored into generating a model of the three-Cys2His2 zinc-finger-DNA complex with the HADDOCK program, including the number and position of AIRs and the combination of active residues and bases of AIRs in rigid body docking. However, the combination of active residues and bases of AIRs in the multiple DNA binding domains results in more complexity. In this study, we focused on the number and position of AIRs and simplified the combination of active residues by defining the AIRs in a pairwise manner between amino acids and bases. This approach mainly limits the combinational search, and, hence, the overall geometric distribution of AIRs between domains depends on the number and position of AIRs in the interface.
Here, we used the Zif268-DNA complex structure  as a reference system for docking. From the interaction information for this complex structure, three different AIR sets were derived and used for docking calculation. The docking result for each AIR set was evaluated for the total number of wrap-around conformations, interface RMSD (iRMSD), buried surface area (BSA), and fraction native contacts (Fnat) of the modeled complex. We found that the third AIR set was sufficient to generate good complex models for Zif268-DNA, and the same method was then used to model other zinc-finger protein-DNA models, such as YY1 , WT1  and Aart , by using only two AIRs in each domain, that is, the third AIR set. Thus, the three-Cys2His2 zinc-finger-DNA complex models could be successfully generated by using only two AIRs in each domain and the HADDOCK program.
We then extended this method to model the unknown Sp1-DNA complex structure. The human transcription factor Sp1 is considered a ubiquitous factor that regulates the expression of different genes responsible for various cellular processes [21–23]. The C-terminal DNA binding domain of Sp1, referred to as Sp1 hereafter, consists of three consecutive Cys2His2 zinc fingers that bind to GC-rich recognition elements present in a number of cellular and viral promoters. To date, the structure and computational model of Sp1-DNA have not been reported. However, Oka et al  reported the binding mode and proposed detailed interactions between Sp1 and DNA on the basis of similarity of their Sp1 NMR structure with the Zif268 protein structure. The reported binding mode is in good agreement with results of other experiments such as ethylation interference analysis , methylation interference analysis and mutation study . In this study, we built the homology structure of Sp1 and then used the reported interactions to derive two AIRs in each finger domain to generate the Sp1-DNA complex model. The interactions observed on the best Sp1-DNA complex model are in good agreement with those previously reported , which further reveals that the approach we developed is indeed an efficient way for generating a zinc-finger protein-DNA complex model in which the protein wraps around DNA.
First, we give a brief overview of the data-driven docking for generating a three-Cys2His2 zinc-finger-DNA complex. Using the X-ray crystal structure of the classical three-Cys2His2 zinc-finger Zif268-DNA complex as a reference, we obtained detailed information on hydrogen bonds and van der Waals contacts between Zif268 and DNA . From this information, we evaluated three different AIR sets for generating complex models using the HADDOCK program.
The first set was derived from the complete interface information on hydrogen bonds and van der Waals contacts, and the second set was derived from information on sequence-specific hydrogen bonds. In many cases, only limited experimental data for the interface interaction are available, so it was necessary to study the effect of fewer AIRs for docking. Therefore, for the third AIR set, we aimed to find the minimum AIRs needed for successful docking. We first used one AIR derived from an N-terminal residue of α-helix and its interacting base in each domain for docking calculation because the N-terminal α-helix is known to fit into the major groove of the DNA in the Zif268-DNA complex . However, use of one AIR in each domain can generate only a few wrap-around models. Apparently, one AIR in each domain is not enough to cover the interface of the complex. To represent the entire surface of each α-helix in the interface, we thus used two AIRs in each domain, one in the N-terminus and the other in or near the C-terminus of the α-helix. The detailed selection of the two AIRs in each domain to generate an efficient zinc-finger protein complex model is described in the section Docking Procedure.
After the three different AIR sets were derived, the docking calculations were performed, and the generated complex models were analyzed in terms of wrap-around conformation, localization of AIRs in true and false complex models, and energy of AIR (EAIR) distribution. Finally, the top 10 structures were selected on the basis of HADDOCK score and analyzed on the basis of iRMSD, Einter, BSA and Fnat. The same docking procedures were used for other test cases, such as YY1, WT1 and Aart, to confirm whether this approach can be used to model other zinc-finger-DNA complexes. Furthermore, the same approach was used to model the previously unreported Sp1-DNA complex.
The 10 best Zif268-DNA complex models for each AIR set were selected on the basis of HADDOCK score. Standard deviations are shown as subscripts.
Einter e (kcal mol-1)
Our main focus in this work was to assess the effect of various AIR sets in obtaining good complex models. Although the geometric distribution analysis provided valuable information for the different AIR sets, it could not give a complete understanding of whether the derived AIRs are matched or not in the complex models. Instead, EAIR analysis of the complex models is more precise and shows the suitability of the AIRs for docking. In brief, if the distance between the AIRs is large, the EAIR value is high and indicates that the AIRs do not satisfy the distance criteria that lead to mismatched AIRs, as well as a non-wrap-around complex model. So the EAIR in each complex model is a good indicator of the suitability of AIR sets for generating a complex model. To understand the EAIR distribution in the final 200 complex models assessed, we produced a plot of the HADDOCK score as a function of EAIR.
Data for the 10 best complex models for other test cases, such as YY1, WT1 and Aart, generated with two AIRs in each domain. Standard deviations are shown as subscripts.
Einter e (kcal mol-1)
The above-mentioned complex models were all generated on the basis of structures of the bound zinc finger proteins derived from known crystal complex structures. One may wonder if the approach is also applied when the free form structure or the homology structure is used as the starting structure. It is therefore worthwhile to check them. However, the linker regions of the free Cys2His2 zinc finger proteins are highly flexible so that 3 D structure of the free form structure of Zif268 as well as other three-Cys2His2 zinc finger proteins is not available. We therefore used the homology modeled structure as an initial structure to perform docking calculation. Since the structural alignment of the bound Zif268 protein with other bound zinc-finger proteins has RMSDs of 1.413 Å, 0.745 Å, and 0.992 Å for YY1, Aart, and WT1, respectively, and the sequence identities among these proteins are varied, in the range of 63% (Zif268-WT1) to ~ 41% (Zif268-YY1). To obtain a detailed analysis, three homology model structures for each protein were generated. For example, three homology modeled structures of Zif268 were generated using the bound-WT1, AART and YY1 structure as an individual template, respectively. In total, 12 homology modeled structures were made. For each case, the AIRs were obtained by using the procedure described in the following paragraph and then docking was performed. The 10 best complex models in each case were analyzed and the results are shown in Additional file 1-Table S1. The iRMSD and Fnat for the 10 best complex modes in each case are within the range of 1.86-2.86 Å and 0.54-0.77. These results are acceptable and comparable to those based on the bound form docking, demonstrating that the homology modeled structure can also be applied as a starting structure to generate a three-Cys2His2 zinc finger-DNA complex model using our approach.
The first step, which is the most important in generating a complex model, is the selection of two AIRs in each domain. Two AIRs, one in the N-terminus and another in or near the C-terminus of the α-helix in each domain, should be selected on the basis of the available experimental data or bioinformatics prediction. Of note, only a few residues in the N and C-termini of the α-helix in each domain interact with DNA. If the user has this complete information, then the selection of AIRs has few combinations. Each AIR set can give a different result, so identifying the suitable AIR set that can generate a complex model is necessary. The following steps are used to identify the suitable AIRs to generate a complex model.
The second step is the analysis of the geometric distribution of the AIRs. From modeling the Zif268-DNA complex and other test cases, we found that two AIRs in each domain with a reasonable geometric distribution can generate a complex model. So the geometric distribution analysis is a prescreening procedure to filter the few combinations of AIRs with improper distribution. The improper distribution is mainly caused by some AIRs located in only one side of the DNA. The projection view of the AIRs is used to analyze this distribution. For analysis of the unknown case that does not have a complex structure, a homology-modeled protein structure is necessary. The homology-modeled structure can be superimposed on its published homologous structure. This superimposition can reveal the DNA axis, which can be used as a reference to analyze the AIR distribution. However, a few AIR sets can show similar spatial orientation in the projection view. Thus, the only way to identify the best AIR set is by calculating docking with all these sets individually. Each AIR set can give different results, because the AIR is an atom-to-atom restraint; analyzing this information by only the projection view is difficult, so the following step is necessary to identify the best AIR set.
The third step is the analysis of the wrap-around conformation and EAIR. This analysis will help determine the suitability of the AIRs for generating a complex model. Each AIR set can give different numbers of wrap-around conformation models. Among the AIR sets, the one that can generate more wrap-around conformations and the occurrence of a single major population of complex models with low AIRs energy in the EAIR analysis reveals the AIR set that is the best for generating the complex model. In case of few numbers of wrap-around models and only a few models in the population with low EAIR values, the user should go back to the first step to choose another AIR pair for docking.
The final step is the analysis of the 10 best complex models. After successful docking, the 10 best complex models are selected on the basis of the HADDOCK score, and these models are analyzed for iRMSD and Fnat with respect to the reference structure only if the reference structure is available. For the unknown cases that do not have a complex structure, analysis of Einter, RMSD (from lowest energy minimum models) and qualitative comparison with other experimental data can help to validate the model.
Our study revealed that two AIRs in each domain is the minimum information required to efficiently generate a good complex model; however, to identify the best AIRs that can provide a complex model, a few rounds of docking are needed. We used these procedures to model the previously unreported Sp1-DNA complex.
For finger 2, residues Arg580, Gly583 and Arg586 form hydrogen bonds with bases GUA7, CYT6 and GUA5, respectively, in the primary strand of the DNA, whereas Asp582 contacts CYT17, Gln585 contacts CYT16 and Ser581 contacts the sugar phosphate backbone of CYT16 in the complementary strand. These observations are consistent with the reported interactions  (Additional file 1-Figure S1). However, ethylation interference analysis  revealed that Arg565 interacts with the phosphate between GUA3 and GUA4, but we did not observe this interaction in our model. For finger 3, residues Arg608 and Lys614 form hydrogen bonds with GUA5 and GUA2, respectively. His611 and Asp610 form only nonbond contacts with bases GUA3 and CYT20, whereas in the reported interaction  (Additional file 1-Figure S1), these two form hydrogen bonds with GUA3 and CYT20. Ethylation interference analysis  revealed that Lys595 interacts with the phosphate between GUA9 and GUA10. However, we did not observe this interaction in our model. As compared with the reported interactions , one new interaction was observed between Phe597 and CYT18 in our complex model. For finger 1, residues Lys550 and His553 form hydrogen bonds with two bases each, GUA9 and GUA10, and GUA8 and GUA9, respectively, in the reported interactions  (Additional file 1-Figure S1). However, in our model, we observed only the His553 interaction, and Lys550 formed only a nonbond contact with GUA9. Our model is consistent with that from methylation interference analysis  suggesting that Lys550 interacts with GUA9/10. Apart from this finding, all other backbone contacts are consistent with reported interactions. Overall, our complex model is almost consistent with the reported model interactions, so the model generated by our approach is acceptable. As well, much less information was used to generate this complex model.
Data for the 10 best structures of the Sp1-DNA complex models generated with pairwise and non-pairwise AIR sets. Standard deviations are shown as subscript.
Einter d (kcal mol-1)
Formulating optimal AIRs in each domain to successfully model a three-Cys2His2 zinc-finger-DNA complex by use of HADDOCK requires only a limited amount of interaction information. Although all restraints in the three different AIR sets were derived on the basis of the real interactions observed in the crystal structure, the quality of docking results varies. The results for different AIR sets showed that the unequal distribution in one domain largely affects the other two domains in three-Cys2His2 zinc-finger domains during docking. Therefore, balancing the AIRs in each domain is necessary, as is the overall interface. Analysis of the geometric distribution of AIRs, wrap-around conformation, EAIR versus HADDOCK score, iRMSD, and Fnat revealed that two AIRs for each domain, with a reasonable geometric distribution, is sufficient to successfully generate a complex model. By comparison to the reference structure, we are confident that the complex model of Zif268-DNA, as well as those for other test cases, generated with HADDOCK is acceptable and reliable. We also generated the Sp1-DNA complex model for the first time using this approach. Most of the interactions in this model are consistent with the reported interactions. The approach we describe to model the three-Cys2His2 zinc-finger Sp1-DNA can be easily applied to model other similar three-Cys2His2 zinc-finger proteins with complex structures unknown to date.
Zinc-finger proteins are the largest family of nucleic acid binding proteins in eukaryotes , but only a small number of the three-Cys2His2 zinc-finger protein-DNA complex structures have been studied. Because obtaining all the interface contacts from experiments is tedious and difficult, using fewer AIRs with a reasonable geometric distribution to generate zinc-finger protein-DNA complex models in which the protein wraps around DNA is greatly beneficial and can facilitate computational studies to better understand the zinc-finger protein-DNA interactions. As well, this approach further demonstrates the versatility of using HADDOCK for computational modeling.
The coordinate file of the Zif268-DNA complex was obtained from the RCSB Protein Data Bank  (PDB code: 1ZAA), and the coordinate of the bound Zif268 was separated and used as the starting structure. The DNA in this complex has overhanging bases (Additional file 1-Figure S2), and during canonical B-DNA construction, it was converted to paired bases by including the complementary bases by use of the nucleic acid modeling module in Discovery studio 2.0 (Accelrys). Similarly, the consensus DNA sequence of Sp1 binding (5'-AGGGGCGGGGCC-3') was built. The two constructed DNAs were assigned as a single chain identifier and renumbered. Atom and residue names were matched to the topallhdg5.3.pro  and dna-rna_allatom.top topology file naming for direct use in HADDOCK. The homology model of Sp1 was constructed by use of the Modeller module in Discovery studio 2.0 (Accelrys). The structures from PDB (1alf, 1mey and 1jk1)  were chosen as templates for modeling.
where N atoms indicates all atoms of a given residue and N res is the sum of active and passive residues for a given molecule. The AIRs are incorporated as an additional energy term in the HADDOCK score. If the residues and bases for each AIR are far away, then the effective distance for each restraint increases and the EAIR is also increased. For DNA binding proteins possessing multiple domains, the overall EAIR will be greatly affected, even if a single AIR is unable to satisfy the distance criteria.
In general, the AIR setup is created with all possible combinations of active and passive residues. This setup allows the HADDOCK program to search all the possible configurations around the defined residues. However this default AIR setup may not be suitable for proteins with multiple domains. For example, for the three-Cys2His2 zinc finger, the use of AIRs allows for the residues of zf1 to combine with DNA bases that interact with zinc fingers 2 and 3. The same kinds of combinations are generated for zinc finger 2 and 3 domains. Obviously, these kinds of combinations may not allow the protein to find suitable configurations in the interface region, which results in a protein that may not wrap around DNA. So in our approach, we defined the AIRs for local regions for each zinc-finger domain and its corresponding interacting region in DNA. Then we summed all the AIRs in the three domains as a single input for docking.
Three AIR sets used for docking Zif268 with DNA
Zinc finger 1
Zinc finger 2
Zinc finger 3
R18, S19, D20, E21, R24, I28, H25
R46, D48, H49, H53, T56
H74, S75, D76, E77, K79, R80
G10, T13, T13, G8, G8, G6, G7
G7, C17, G6, G4, C3
G4, C18, A20, G2, C19, G2
R46, D48, H49
G7, C17, G6
To simplify the analysis of the geometric distribution of the three AIR sets in Zif268-DNA, the following considerations were applied. As the protein wraps around the DNA along the major groove, the DNA helix axis was considered the reference axis for the geometric distribution of AIRs. Because the residues and bases in AIRs are extremely close in proximity, for clarity, we considered only the geometric distribution of the residues in the AIRs with reference to the DNA helix axis. The geometric distribution of the AIRs in 3-D space is difficult to represent, so we simplified this into a 2-D representation with reference to the DNA helix axis without losing distribution information of the AIRs. For the unknown complex of Sp1, we superimposed the homology protein structure on the Zif268-DNA crystal structure and then obtained the DNA helix axis and used that axis as a reference for geometric distribution analysis of AIRs in Sp1.
The docking procedure consisted of three stages: rigid-body docking, semi-flexible refinement and final refinement in explicit solvent. During the rigid-body docking, 1000 complex models were generated for each set of AIRs. The best 20% complex models were selected on the basis of HADDOCK score defined as a weighted sum of intermolecular electrostatic, van der Waals contacts, desolvation, EAIR and BSA term . These models were used for further refinement in the semi-flexible refinement stage consisting of three parts: rigid-body torsion angle dynamics, semi-flexible simulated annealing stage and final semi-flexible simulated annealing stage. The final stage of the docking protocol is gentle water refinement. The effects of global and local flexibility of the DNA during docking have been reported ; thus, the default option was used to define the flexible regions of DNA. Also, default HADDOCK parameters were used, except for the random deletion of a fraction of the restraint option, which was set as false for all docking calculations. Additional restraints to maintain base planarity and Watson-Crick bonds were introduced for the DNA.
For each docking, the wrap-around orientation of the complex models was analyzed by use of the Pymol program . The final 200 structures were analyzed according to EAIR versus HADDOCK score. The 10 best complex models were then selected on the basis of HADDOCK score. The iRMSD values of the complex interface were calculated by the McLachlan algorithm  as implemented in the Profit program (Martin, A.C.R., http://www.bioinf.org.uk/software/profit/). All heavy atoms were used to calculate the iRMSD of the complex interface. Intermolecular contacts were evaluated with a 5 Å cut-off value . The Fnat was defined as the number of native intermolecular contacts on a nucleotide-residue basis (hydrogen bonded and non-bonded) identified in a docking solution, divided by the total number of contacts in the reference structure. Both BSA and Einter were analyzed for the 10 best complex models for each AIR set.
We would like to thank the Academia Sinica and the National Science Council, Taiwan for supporting this work. We also thank Laura Smales for copyedit the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.