Evolutionary conservation of DNA-contact residues in DNA-binding domains
- Yao-Lin Chang†1,
- Huai-Kuang Tsai†2,
- Cheng-Yan Kao1, 3,
- Yung-Chian Chen4,
- Yuh-Jyh Hu5 and
- Jinn-Moon Yang4, 6, 7
© Chang et al; licensee BioMed Central Ltd. 2008
Published: 28 May 2008
DNA-binding proteins are of utmost importance to gene regulation. The identification of DNA-binding domains is useful for understanding the regulation mechanisms of DNA-binding proteins. In this study, we propose a method to determine whether a domain or a protein has DNA-binding capability by considering the evolutionary conservation of DNA-binding residues.
Our method achieves high precision and recall for 66 families of DNA-binding domains, with a false positive rate less than 5% for 250 non-DNA-binding proteins. In addition, experimental results show that our method is able to identify the different DNA-binding behaviors of proteins in the same SCOP family based on the use of evolutionary conservation of DNA-contact residues.
This study shows the conservation of DNA-contact residues in DNA-binding domains. We conclude that the members in the same subfamily bind DNA specifically and the members in different subfamilies often recognize different DNA targets. Additionally, we observe the co-evolution of DNA-contact residues and interacting DNA base-pairs.
DNA-binding proteins play a key role in many genetic activities of living organisms, such as transcription, recombination, DNA replication and repair. One or more domains of these proteins interact with DNA and provide the specificity for direct and indirect readout of DNA. Identifying DNA-binding domains is therefore important for understanding these regulation mechanisms.
Recently, a rapidly increasing number of protein-DNA complexes solved by X-ray crystallography and nuclear magnetic resonance (NMR) has enabled the use of structure-based approaches for identifying DNA-binding proteins. Most structural DNA-binding domains can be categorized into several classes according to their structures or binding types [2–4]. However, some DNA-binding domains cannot be well categorized, and for some DNA-binding domains structural information is unavailable [3, 5]. Several studies used various computational approaches to predict potential DNA-binding proteins from structural features of protein-DNA complexes, such as the overall charges, electric moments, and shape of binding sites [6–12]. Since the charge and conformational complementarities of binding sites are essential for protein-DNA binding, these features provide a reasonable basis for identifying DNA-binding proteins. Another trend is to consider the degree of conservation of residues [13–15]. Luscombe and Thornton studied 21 families of DNA-binding proteins and showed that amino acids interacting with the DNA are better conserved than those not interacting with DNA. Stawiski et al. found that electrostatic patches of DNA-binding proteins have a higher percentage of aromatic and positive residues. Based on the general properties of the 20 amino acids, they also showed that residues of the patch are conserved at the property level.
In this paper, we propose a structure-based threading method that identifies DNA-binding domains by considering the evolutionary conservation of DNA-contact residues. We use BLOSUM62, an evolution-based scoring matrix for amino acid substitutions, to measure the degree of conservation of binding residues. Our method achieves high precision and recall for 66 families of DNA-binding domains, with a false positive rate of less than 5% for 250 non-DNA-binding proteins.
Given a query domain, our method identifies similar DNA-binding structures or homologous protein sequences in the template library. To evaluate the performance of our method, for each DNA-contact domain (D) in the template library we generated corresponding positive and negative sets. The positive set contains the domains similar to domain D based on SCOP, while the domains in the negative set are not similar to D. By applying our method to these two sets, we found that the scores of the domains in the positive set are significantly higher than those of the domains in the negative set. We further determined a threshold that achieves high precision and recall. Using this threshold, we applied our method to 66 known SCOP families of DNA-binding domains and 250 non-DNA-binding proteins to examine its performance.
Positive and negative set for each contact domain
We collected DNA-binding contact domains from the SCOP database; the details are described in the Methods. To remove redundant contact domains, domains with highly similar sequences (identity > 90%) were grouped using the NCBI software BLASTCLUST. In each group, the domain with the maximal number of contact residues was chosen as the representative domain of the group. For a representative domain R, the protein domains in the same SCOP family were considered members of R according to SCOP95 (members whose similarity is greater than 95% are excluded). Each member of R was aligned to R using CE. We define a residue of R as misaligned if it is aligned to a gap. A family member was discarded if more than 20% of the contact residues of R are misaligned between R and this member. Family members that satisfy the above criteria were placed in the positive set. If there were fewer than five members in the positive set of R, the entire family of R was discarded. We finally obtained 66 representative domains with corresponding positive sets. For each R, we artificially generated 1000 domains as the negative set: each artificial domain copies the residues of R, and the residue type of each contact residue of R is then randomly mutated.
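The filtering and negative-set steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy sequence and the contact positions are hypothetical, and the sketch assumes that a "mutated" contact residue is always replaced by a different residue type, which the text does not state explicitly.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def make_negative_domain(sequence, contact_positions, rng):
    """Copy R and replace every contact residue with a random, different type.

    Assumption: a mutation never keeps the original residue type.
    """
    residues = list(sequence)
    for pos in contact_positions:
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)

def make_negative_set(sequence, contact_positions, size=1000, seed=0):
    """Generate `size` artificial domains for one representative domain R."""
    rng = random.Random(seed)
    return [make_negative_domain(sequence, contact_positions, rng)
            for _ in range(size)]

def keep_member(aligned_contacts, max_misaligned_fraction=0.20):
    """Positive-set filter for one family member.

    `aligned_contacts` holds, for each contact residue of R, the residue of
    the member aligned to it, or None when it is aligned to a gap.
    The member is kept unless more than 20% of the contacts are misaligned.
    """
    misaligned = sum(1 for aa in aligned_contacts if aa is None)
    return misaligned / len(aligned_contacts) <= max_misaligned_fraction

# Toy example: an 8-residue "domain" with contact residues at indices 1, 2, 6.
negatives = make_negative_set("MKRTAYNQ", [1, 2, 6], size=1000, seed=1)
```

Only the contact positions change in each artificial domain; all other residues are copied from R unchanged.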
Determining the threshold of similar DNA-binding function of a contact domain
For each representative domain R, each member in the positive and negative sets was scored by the method we developed. Ideally, the scores of domains in the positive set should be on average significantly higher than those of the negative set. We used the Kolmogorov-Smirnov (KS) test to examine the above criterion. The KS test is a nonparametric test to determine if two distributions differ significantly. According to our results, the scores are significantly different for the positive set and the negative set in most domains (97% of 66 sets have a p value less than 0.05).
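As an illustration of the comparison described above, the sketch below computes a two-sample KS statistic and an approximate p value in plain Python. The text does not say which KS implementation was used; the p value here relies on the standard asymptotic series with a small-sample correction, and the score lists are made-up numbers, not results from the study.

```python
import math
from bisect import bisect_right

def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov test.

    Returns (D, p) where D is the largest gap between the two empirical
    distribution functions and p is an asymptotic approximation.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / n   # fraction of x values <= v
        fy = bisect_right(ys, v) / m   # fraction of y values <= v
        d = max(d, abs(fx - fy))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # small-sample correction (approximation)
    if lam < 0.1:
        return d, 1.0                  # distributions are indistinguishable
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical scores for one representative domain R.
positive_scores = [30, 28, 35, 32, 31]
negative_scores = [-10, -5, 0, -8, -12]
d_stat, p_value = ks_two_sample(positive_scores, negative_scores)
significant = p_value < 0.05
```

For real use, a tested statistics library would be preferable to this hand-written approximation, especially for small sets where an exact p value matters.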
To investigate the variation of contact residues of DNA-binding domains in the same SCOP family, we compared the bound DNA sequences of two DNA-binding domains by aligning their double-stranded sequences to each other. 1B8I-A binds two DNA strands (i.e. PDB entries 1B8I-C and 1B8I-D) and 1O4X-A1 binds another two DNA strands (PDB entries 1O4X-C and 1O4X-D). First we generated four pairwise alignments: 1B8I-C and 1O4X-C; 1B8I-C and 1O4X-D; 1B8I-D and 1O4X-C; and 1B8I-D and 1O4X-D. We do not allow any gap insertion when aligning a pair of DNA sequences. The alignments are obtained by sliding the two sequences against each other until the best match is found. The alignment with the maximum number of identical aligned pairs is chosen; here this is the alignment between 1B8I-C and 1O4X-C (Figure 4C). We then adjust the alignment of the other pair of DNA strands (i.e. 1B8I-D and 1O4X-D) according to this best alignment (1B8I-C and 1O4X-C).
Figures 4B and 4C show that, over the whole DNA sequences, the number of identical nucleotides between 1B8I-C and 1PUF-E (10) and between 1B8I-D and 1PUF-D (10) is much higher than between 1B8I-C and 1O4X-C (6) and between 1B8I-D and 1O4X-D (5). At the same time, 11 identical contact nucleotides are obtained from the alignments of 1B8I-C with 1PUF-E and 1B8I-D with 1PUF-D, but only two identical contact nucleotides are obtained from the alignments of 1B8I-C with 1O4X-C and 1B8I-D with 1O4X-D (contact nucleotides are the nucleotides that interact with contact residues of the protein). With respect to 1B8I-A, 1PUF-A and 1O4X-A1 differ not only in the DNA sequences they bind but also in their DNA-binding sites. These results show that members of the same SCOP family may have different DNA-binding modes and that our method is able to detect the different protein-DNA interactions based on the evolutionary conservation of DNA-contact residues.
The contact residues of DNA-binding domains are useful for discriminating DNA-binding domains from non-DNA-binding domains in a novel protein sequence. Our method, which considers the evolutionary conservation of DNA-binding residues, achieves high precision and recall for 66 families of DNA-binding domains, with a false positive rate of less than 5% for 250 non-DNA-binding proteins. In addition, our method is able to identify the different DNA-binding behaviors of proteins in the same SCOP family based on the evolutionary conservation of DNA-contact residues. We also showed that mutations of the contact residues of DNA-binding domains can change the bound DNA sequences, which suggests that DNA-contact residues and their DNA-binding bases change together.
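The gap-free sliding alignment described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation; the function name and the example strands are hypothetical and are not the sequences from 1B8I or 1O4X.

```python
def best_ungapped_alignment(a, b):
    """Slide sequence b along sequence a without inserting gaps.

    Returns (offset, matches): `offset` is the index in `a` that b[0] is
    aligned to (it may be negative when b overhangs the start of a), and
    `matches` is the number of identical aligned pairs at that offset.
    Ties are resolved in favour of the smallest offset.
    """
    best_offset, best_matches = 0, -1
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, base in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == base
        )
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches

# Made-up strands for illustration only.
offset, matches = best_ungapped_alignment("GGATTTATCC", "TTTAT")
```

To choose among the four strand pairings, the same function would be applied to each pairing and the one with the most identical pairs retained; the second strand pair is then placed at the offset implied by that best alignment.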
We first collected protein-DNA complexes from the PDB; each complex had to contain at least one protein chain and a double-stranded DNA. As in Luscombe et al., a complex was excluded if its DNA is single-stranded or if the DNA is shorter than 4 bases. For each protein-DNA complex, we then identified the contact residues and contact domains of the protein. Contact residues, i.e. residues with a heavy atom within 4.5 Å of any heavy atom of the bound DNA, are considered the core parts of the contact domain in a complex. For each protein-DNA complex, we identified its DNA-contact domains according to the contact residues and the domain definitions of the SCOP database. Each domain must have more than 5 contact residues, and the protein must have more than 50 residues, to ensure that the contact between the protein and the DNA is reasonably extensive. Finally, 230 DNA-binding contact domains were identified and collected in the template library.
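The distance criterion above can be illustrated with a small sketch. The atom representation (tuples of residue identifier, element symbol and coordinates) and the function names are assumptions made for this example; a real pipeline would read the atoms from PDB files.

```python
import math

CONTACT_CUTOFF = 4.5  # Å, heavy-atom distance cutoff stated in the text

def contact_residues(protein_atoms, dna_atoms, cutoff=CONTACT_CUTOFF):
    """Return the ids of residues with a heavy atom within `cutoff` of DNA.

    protein_atoms: iterable of (residue_id, element, (x, y, z))
    dna_atoms:     iterable of (element, (x, y, z))
    Hydrogen atoms are ignored on both sides.
    """
    dna_heavy = [xyz for element, xyz in dna_atoms if element != "H"]
    contacts = set()
    for res_id, element, xyz in protein_atoms:
        if element == "H" or res_id in contacts:
            continue
        if any(math.dist(xyz, d) <= cutoff for d in dna_heavy):
            contacts.add(res_id)
    return contacts

def is_contact_domain(contacts, n_residues, min_contacts=5, min_length=50):
    """Apply the size filters: more than 5 contacts and more than 50 residues."""
    return len(contacts) > min_contacts and n_residues > min_length

# Toy coordinates, not taken from any PDB entry.
protein = [(10, "N", (0.0, 0.0, 0.0)), (10, "H", (0.0, 0.0, 1.0)),
           (11, "C", (10.0, 0.0, 0.0)), (12, "O", (3.0, 0.0, 0.0))]
dna = [("P", (0.0, 0.0, 4.0)), ("H", (10.0, 0.0, 1.0))]
found = contact_residues(protein, dna)
```

The all-pairs distance check is quadratic in the number of atoms; for whole complexes a spatial index would be faster, but the simple loop keeps the criterion easy to read.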
Homologous proteins searching
For a given protein sequence or structure M, we search the template library for homologous DNA-binding proteins using alignment tools. If M is a 3D structure, we use a structure alignment tool (i.e. CE) to align M to all contact domains. CE returns a Z score for each alignment, representing the structural similarity of the two aligned structures. DNA-binding proteins are considered homologous to query M if their CE Z scores exceed 3.7, based on CE's statistical model. On the other hand, if M is a protein sequence, we use a sequence alignment tool (i.e. FASTA [29–31]) to search the template library. Here, a DNA-binding protein is considered homologous to M if the sequence identity exceeds 25%, according to observations of previous studies [32–37].
where CR is the set of the contact residues between D and M; d_i and m_i denote the corresponding i-th contact residue of D and M, respectively. Here, the score of a misaligned residue is -4, which is the smallest value in the BLOSUM62 matrix.
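The two homology thresholds can be expressed as a simple filter. The sketch below assumes the alignment results are already available as dictionaries; the field names and template identifiers are hypothetical, and no call to CE or FASTA is made.

```python
CE_Z_THRESHOLD = 3.7        # structure queries: CE Z score must exceed this
IDENTITY_THRESHOLD = 25.0   # sequence queries: percent identity must exceed this

def select_homologs(hits, query_is_structure):
    """Return the templates considered homologous to the query.

    hits: list of dicts with a 'template' key plus either 'z_score'
    (structure alignment) or 'identity' in percent (sequence alignment).
    """
    if query_is_structure:
        return [h["template"] for h in hits if h["z_score"] > CE_Z_THRESHOLD]
    return [h["template"] for h in hits if h["identity"] > IDENTITY_THRESHOLD]

# Hypothetical alignment results for one query.
structure_hits = [{"template": "T1", "z_score": 5.2},
                  {"template": "T2", "z_score": 3.7},
                  {"template": "T3", "z_score": 2.9}]
homologs = select_homologs(structure_hits, query_is_structure=True)
```

Both comparisons are strict, following the word "exceed" in the text, so a template scoring exactly 3.7 is not selected.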
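A minimal sketch of such a contact-residue score is given below. The scoring equation itself is not reproduced in this text, so the sketch assumes the simplest form consistent with the definitions: a plain sum of BLOSUM62 entries over the contact residues in CR, with -4 for a misaligned residue. Only a few BLOSUM62 entries are included as an excerpt, and the function names are hypothetical.

```python
MISALIGNED_SCORE = -4  # stated penalty for a contact residue aligned to a gap

# Small excerpt of BLOSUM62 (symmetric); a real run needs the full matrix.
BLOSUM62_EXCERPT = {
    ("R", "R"): 5, ("K", "K"): 5, ("N", "N"): 6,
    ("K", "R"): 2, ("A", "R"): -1,
}

def blosum62(a, b):
    """Look up a substitution score in the excerpt, in either order."""
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    if key not in BLOSUM62_EXCERPT:
        raise KeyError(f"pair {a}/{b} is not in this excerpt")
    return BLOSUM62_EXCERPT[key]

def contact_score(pairs):
    """Sum substitution scores over the aligned contact residues.

    pairs: list of (d_i, m_i), where m_i is None when the contact residue
    d_i of the template domain D is aligned to a gap in the query M.
    Assumption: the score is an unnormalized sum over CR.
    """
    total = 0
    for d_i, m_i in pairs:
        total += MISALIGNED_SCORE if m_i is None else blosum62(d_i, m_i)
    return total

# Four contact residues: one identical, one conservative substitution,
# one misaligned, one non-conservative substitution.
score = contact_score([("R", "R"), ("K", "R"), ("N", None), ("R", "A")])
```

Under this assumed form, the example evaluates to 5 + 2 - 4 - 1 = 2; a query whose contact positions conserve the template residues scores higher than one with gaps or unfavorable substitutions.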
Authors' contributions
YLC and HKT carried out the design of scoring functions and data set preparation, participated in experimental designs and drafted the manuscript. CYK provided the design of this study. YCC and YJH provided the domain knowledge and useful comments. JMY provided the original idea, participated in the design and coordination of this study and helped to draft the manuscript. All authors read and approved the final manuscript.
J.-M. Yang was supported by the National Science Council, with partial support from the ATU plan of the MOE. The authors are grateful for the hardware and software support of the Structural Bioinformatics Core Facility at National Chiao Tung University.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.
- Michael Gromiha M, Siebers JG, Selvaraj S, Kono H, Sarai A: Intermolecular and intramolecular readout mechanisms in protein-DNA recognition. J Mol Biol 2004, 337(2):285–294. 10.1016/j.jmb.2004.01.033
- Vinson CR, Sigler PB, McKnight SL: Scissors-grip model for DNA recognition by a family of leucine zipper proteins. Science 1989, 246(4932):911–916. 10.1126/science.2683088
- Harrison SC: A structural taxonomy of DNA-binding domains. Nature 1991, 353(6346):715–719. 10.1038/353715a0
- Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of protein-DNA complexes. Genome Biol 2000, 1(1):REVIEWS001. 10.1186/gb-2000-1-1-reviews001
- Johnson PF, McKnight SL: Eukaryotic transcriptional regulatory proteins. Annu Rev Biochem 1989, 58: 799–839. 10.1146/annurev.bi.58.070189.004055
- Ahmad S, Sarai A: Moment-based prediction of DNA-binding proteins. J Mol Biol 2004, 341(1):65–71. 10.1016/j.jmb.2004.05.058
- Ahmad S, Gromiha MM, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004, 20(4):477–486. 10.1093/bioinformatics/btg432
- Tsuchiya Y, Kinoshita K, Nakamura H: Structure-based prediction of DNA-binding sites on proteins using the empirical preference of electrostatic potential and the shape of molecular surfaces. Proteins 2004, 55(4):885–894. 10.1002/prot.20111
- Bhardwaj N, Langlois RE, Zhao G, Lu H: Kernel-based machine learning protocol for predicting DNA-binding proteins. Nucleic Acids Res 2005, 33(20):6486–6493. 10.1093/nar/gki949
- Bhardwaj N, Lu H: Residue-level prediction of DNA-binding sites and its application on DNA-binding protein predictions. FEBS Lett 2007, 581(5):1058–1066. 10.1016/j.febslet.2007.01.086
- Yu X, Cao J, Cai Y, Shi T, Li Y: Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol 2006, 240(2):175–184. 10.1016/j.jtbi.2005.09.018
- Szilagyi A, Skolnick J: Efficient prediction of nucleic acid binding function from low-resolution protein structures. J Mol Biol 2006, 358(3):922–933. 10.1016/j.jmb.2006.02.053
- Ahmad S, Sarai A: PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 2005, 6: 33. 10.1186/1471-2105-6-33
- Kuznetsov IB, Gou Z, Li R, Hwang S: Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006, 64(1):19–27. 10.1002/prot.20977
- Tjong H, Zhou HX: DISPLAR: an accurate method for predicting DNA-binding sites on protein surfaces. Nucleic Acids Res 2007, 35(5):1465–1477. 10.1093/nar/gkm008
- Luscombe NM, Thornton JM: Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. J Mol Biol 2002, 320(5):991–1009. 10.1016/S0022-2836(02)00571-5
- Stawiski EW, Gregoret LM, Mandel-Gutfreund Y: Annotating nucleic acid-binding function based on protein structure. J Mol Biol 2003, 326(4):1065–1079. 10.1016/S0022-2836(03)00031-7
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915
- Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci 1994, 3(3):522–524.
- Passner JM, Ryoo HD, Shen L, Mann RS, Aggarwal AK: Structure of a DNA-bound Ultrabithorax-Extradenticle homeodomain complex. Nature 1999, 397(6721):714–719. 10.1038/17833
- LaRonde-LeBlanc NA, Wolberger C: Structure of HoxA9 and Pbx1 bound to DNA: Hox hexapeptide and DNA recognition anterior to posterior. Genes Dev 2003, 17(16):2060–2072. 10.1101/gad.1103303
- Dutnall RN, Tafrov ST, Sternglanz R, Ramakrishnan V: Structure of the histone acetyltransferase Hat1: a paradigm for the GCN5-related N-acetyltransferase superfamily. Cell 1998, 94(4):427–438. 10.1016/S0092-8674(00)81584-6
- Williams DC Jr, Cai M, Clore GM: Molecular basis for synergistic transcriptional activation by Oct1 and Sox2 revealed from the solution structure of the 42-kDa Oct1.Sox2.Hoxb1-DNA ternary transcription factor complex. J Biol Chem 2004, 279(2):1449–1457. 10.1074/jbc.M309790200
- Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: a multiple structural alignment algorithm. Proteins 2006, 64(3):559–574. 10.1002/prot.20921
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540. 10.1006/jmbi.1995.0159
- Luscombe NM, Laskowski RA, Thornton JM: Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res 2001, 29(13):2860–2874. 10.1093/nar/29.13.2860
- Morozov AV, Havranek JJ, Baker D, Siggia ED: Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res 2005, 33(18):5781–5798. 10.1093/nar/gki875
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233(1):123–138. 10.1006/jmbi.1993.1489
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444
- Pearson WR: Effective protein sequence comparison. Methods Enzymol 1996, 266: 227–258.
- Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132: 185–219.
- Smith TF: The art of matchmaking: sequence alignment methods and their structural implications. Structure 1999, 7(1):R7-R12. 10.1016/S0969-2126(99)80003-3
- Skolnick J, Fetrow JS: From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol 2000, 18(1):34–39. 10.1016/S0167-7799(99)01398-0
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/0022-2836(70)90057-4
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. 10.1016/0022-2836(81)90087-5
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14(10):846–856. 10.1093/bioinformatics/14.10.846
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.