Frequent occurrence of recognition site-like sequences in the restriction endonucleases
© Biro and Biro; licensee BioMed Central Ltd. 2004
Received: 10 September 2003
Accepted: 16 March 2004
Published: 16 March 2004
There are two different theories about the development of the genetic code. Woese suggested that it developed in connection with the amino acid repertoire, while Crick argued that any connection between codons and amino acids is only the result of an "accident". This question is fundamental to understanding the nature of specific protein-nucleic acid interactions.
The nature of the specific protein-nucleic acid interaction between restriction endonucleases (RE) and their recognition sequences (RS) was studied by bioinformatics methods. It was found that the frequency of 5–6 residue long RS-like oligonucleotides is unexpectedly high in the nucleic acid sequence of the corresponding RE (p < 0.05 and p < 0.001 respectively, n = 7). There is an extensive conservation of these RS-like sequences in RE isoschizomers. A review of the seven available crystallographic studies showed that the amino acids coded by codons that are subsets of recognition sequences were often located close to the RS itself, and in many cases they were directly adjacent to the codon-like triplets in the RS.
Fifty-five examples of this codon-amino acid co-localization were found and analyzed, representing 41.5% of the total of 132 amino acids located within 8 Å of the C1' atoms in the DNA. The average distance between the closest atoms in the codons and amino acids is 5.5 +/- 0.2 Å (mean +/- S.E.M., n = 55), while the distance between the nitrogen and oxygen atoms of the co-localized molecules is significantly shorter (3.4 +/- 0.2 Å, p < 0.001, n = 15) when positively charged amino acids are involved. This indicates that an interaction between the nucleic acids and amino acids might occur.
We interpret these results in favor of Woese and suggest that the genetic code is "rational" and that there is a stereospecific relationship between the codons and the amino acids.
The nature of specific protein-nucleic acid interactions is not well understood. The interaction between transcription factors and promoters has been studied most extensively. However, that system is very complex and, so far, no simple, general conclusion has been drawn from these studies. The interaction of restriction enzymes (REs) with their recognition sequences (RSs) is also highly specific. Furthermore, the protein-binding site is often short (5–7 nucleotides) and simple (palindromic, i.e. the sense and anti-sense strands read identically). The RE-RS system has also been studied extensively because of its great biotechnological importance. These circumstances make the RE family an interesting case study of specific nucleic acid-protein interactions.
Our previous study convinced us that the codon translation table is not random and that there is a common periodicity in the codon structure and the physico-chemical properties of the amino acids. We interpreted those results in favor of Woese, who argued that the genetic code developed in close connection with the amino acid repertoire and that this close biochemical connection is fundamental to specific protein-nucleic acid interactions. This consideration led us to ask whether the genetic code could somehow be the bridge between a nucleic acid sequence (here the RS) and the amino acid sequence (here the RE) that specifically recognizes it.
Restriction enzyme data were collected from REBASE, GenBank, SwissProt and the Protein Data Bank (PDB). Nucleic acid and protein sequences were aligned and compared to each other using ClustalW, and the similarities were visualized using Jalview. In some cases the nucleic acid sequences were overlappingly translated into virtual protein-like sequences. We found that overlappingly translated sequences (OTS) are especially useful for detecting and visualizing short sequence similarities (in contrast to regularly translated proteins, unpublished) because they retain all the information present in the nucleic acid sequences, while the regular, non-overlapping translation loses as much as two thirds of the information because of codon redundancy.
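The overlapping translation described above can be illustrated with a short sketch. It assumes that an OTS is obtained by translating the triplet that starts at every nucleotide position (a step of 1 instead of 3) with the standard genetic code; the function names `overlapping_translate` and `regular_translate` are hypothetical and are not taken from the authors' software.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def overlapping_translate(dna: str) -> str:
    """Translate the triplet starting at every position (assumed OTS definition)."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(len(dna) - 2))


def regular_translate(dna: str, frame: int = 0) -> str:
    """Conventional non-overlapping translation in one reading frame."""
    dna = dna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )


if __name__ == "__main__":
    site = "GAATTC"  # illustrative hexamer only
    print(overlapping_translate(site))  # ENIF
    print(regular_translate(site))      # EF
```

Under this assumption the OTS simply interleaves the three reading frames, so every third letter of the OTS reproduces one conventional translation. That is the sense in which it is frame independent.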
This study was limited to those restriction endonucleases whose recognition sequence was unambiguous and for which sequence and structure data of the DNA-enzyme complex were publicly available. One thousand residue long repeating sequences were constructed from the RSs. These artificial repeats were compared to the RE nucleic acid sequences using ClustalW to find RS-like oligonucleotides. This method found most, but not all, RS-like sequences. In some cases it was necessary to complement this approach by searching with text search tools and counting 4–8 nucleotide long, RS-like oligonucleotides.
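The repeat construction and the text-search counting step can be sketched as follows. This is a simplified illustration: it counts only exact sub-words of the RS repeat, whereas the criterion actually used for "RS-like" may be broader. The function names and the short test sequence are hypothetical.

```python
def rs_repeat(rs: str, length: int = 1000) -> str:
    """Build an artificial repeat of the recognition sequence."""
    copies = length // len(rs) + 1
    return (rs.upper() * copies)[:length]


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def count_rs_like(coding_seq: str, rs: str, k_min: int = 4, k_max: int = 8) -> dict:
    """Count every k-mer of the RS repeat (k_min..k_max) in a coding sequence."""
    repeat = rs_repeat(rs)
    seq = coding_seq.upper()
    hits = {}
    for k in range(k_min, k_max + 1):
        # All distinct k-mers that occur anywhere in the repeat.
        words = {repeat[i:i + k] for i in range(len(repeat) - k + 1)}
        for word in sorted(words):
            n = count_overlapping(seq, word)
            if n:
                hits[word] = n
    return hits


if __name__ == "__main__":
    toy_gene = "ATGGAATTCAAGAATTGG"  # hypothetical sequence, not a real RE gene
    for word, n in count_rs_like(toy_gene, "GAATTC").items():
        print(f"{word}-{n}")
```

The output uses the same "oligonucleotide-copy number" notation as the table of RS-like oligonucleotides.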
Therefore we preferred to use protein sequences instead of nucleic acid sequences. The RS nucleic acids were overlappingly translated and the protein-like OTS sequences were aligned with the usual (non-OTS) protein sequences of the REs using ClustalW. This approach effectively filtered the single nucleotide similarities. The OTS sequence of the RS-repeats provided a frame independent, protein-like representation of the recognition sites and the similarity to the RE sequence indicated the presence of RS-like sequences in the REs. Neither the nucleic acid nor the OTS alignment found all RS-like residues and the two approaches gave slightly different results (different nucleotides in the last wobble positions are often interpreted to code for the same amino acid).
Restriction enzymes with known crystallographic structure
Recognition Sequence (RS)
RS-like Oligonucleotides (-number of copies)
GGCCTA-1, GCCTA-1, TAGG-1, GGAT-2, AGGC-1, GATC-1
AGATC-1, TAGAT-1, AGAT-5, GATC-2, CTAG-2
CGAATT-1, AGCTTA-1, TTCGA-2, AAGCT-3, AATTC-2, AATT-8, TTAA-6, TAAG-3, CTTA-3, TCGA-3, GAAT-4, AAGC-3, TTCG-1
ATATCG-1, ATATC-1, GATAT-3, TATAG-1, ATAT-11, TATC-2, GATA-2, TATA-4, CTAT-2
GGCGCCGG-1, GCCGCGG-1, CGGCGC-2, GCCGCG-1, CGCGC-2, GCCGG-1, CGCGG-2, GCGGC-2, GGCCG-2, CGGCC-1, CGGC-3, CCGG-5, GGCG-3, CCGC-4, GCCG-3, GGCC-1, CGCG-4
CCGGCG-1, GCCGC-4, GGCG-1, GCCG-3, CGCC-2, CCGC-1, GCGG-2, CGGC-2
The first 2 bars of the figure (marked by *) require additional explanation; this also gives an example of our calculations. We found one 7-residue and one 8-residue long RS-like sequence in NaeI, which is 954 residues long. The expected values (E) are 954/4^7 = 0.058 and 954/4^8 = 0.014 respectively, whereas the found value F = 1 in each case. Thus the F/E ratios are 17.2 and 71.4 respectively, indicating that these findings are significant, although it was not possible to use Student's t-test on these single values.
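The calculation above can be reproduced with a few lines of code. The sketch assumes, as the worked example implies, that the four bases are equally frequent and that the expected count of one specific k-residue word is the sequence length divided by 4^k. The function names are hypothetical.

```python
def expected_count(seq_len: int, k: int) -> float:
    """Expected occurrences of one specific k-mer, assuming equal base frequencies."""
    return seq_len / 4 ** k


def found_to_expected(found: int, seq_len: int, k: int) -> float:
    """Ratio of found (F) to expected (E) occurrences."""
    return found / expected_count(seq_len, k)


if __name__ == "__main__":
    for k in (7, 8):
        e = expected_count(954, k)
        ratio = found_to_expected(1, 954, k)
        print(f"k={k}: E={e:.4f}, F/E={ratio:.1f}")
```

Without intermediate rounding the ratios are about 17.2 and 68.7. The value of 71.4 quoted for the 8-residue sequence corresponds to dividing by E rounded to 0.014.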
Codon-amino acid co-location in 7 REs: The proportion of the involved codon residues: 41.5 +/- 3.6; 20.5 +/- 3.1; 2.2 +/- 0.4; 5 +/- 0.4; 18.2 +/- 2.8; 18.7 +/- 3.1; p < 0.001
Codon-amino acid co-location: The shortest atomic distances (Å), mean +/- S.E.M. (n); significance levels: p < 0.05, p < 0.01, p < 0.001
Specific DNA-protein interactions are very important in the regulatory network of the genome. The exact rules of these interactions are not well understood. The known forms of DNA (the double helix) are closed, inverted structures where the molecular information is not directly exposed on the surface. However, the major groove is rich in chemical information. The edges of each base pair are exposed in the major and minor grooves, creating a pattern of hydrogen bond donors (D) and acceptors (A) and of van der Waals surfaces (methyl group, M; nonpolar hydrogen, H) that identifies the base pair. There is a unique and logical link between (1) the nucleotide sequence of the DNA that specifically interacts with a protein; (2) the pattern of D, A, M, H properties in the grooves of that DNA sequence; (3) the physicochemical properties of the protein that interacts with it; and (4) the DNA coding the amino acids of that protein. The question is whether this unique and logical link is the genetic code itself.
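The D/A/M/H description of the major groove can be made concrete with a small lookup table. The sketch below uses the patterns commonly given in textbooks for the four base-pair orientations, keyed by the base on the strand being read; the table and the function names are illustrative assumptions and are not part of the analysis reported here.

```python
# Major-groove patterns read across each base pair (textbook convention):
# A = hydrogen bond acceptor, D = donor, M = methyl group, H = nonpolar hydrogen.
MAJOR_GROOVE = {"A": "ADAM", "T": "MADA", "G": "AADH", "C": "HDAA"}


def major_groove_pattern(seq: str) -> list:
    """Return the major-groove pattern for each base pair of a sequence."""
    return [MAJOR_GROOVE[base] for base in seq.upper()]


def is_symmetric(seq: str) -> bool:
    """True if the groove pattern is identical when read from the opposite strand."""
    pattern = major_groove_pattern(seq)
    mirrored = [p[::-1] for p in reversed(pattern)]
    return pattern == mirrored


if __name__ == "__main__":
    print(major_groove_pattern("GAATTC"))
    print(is_symmetric("GAATTC"))
```

A site whose two strands read identically presents the same chemical pattern from either side, which is one reason such short sites are convenient models of specific recognition.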
This question was already formulated in the 1960s, and there are basically two distinct opinions. Francis Crick could not see any logical connection between the structure of the genetic code and the physicochemical properties of the amino acids, and he regarded the code as just a "frozen accident". On the other side, Woese advocated the theory of the coevolution of proteins and nucleic acids and argued for a specific stereochemical connection between the amino acids and their codons. We succeeded in constructing a "Common Periodic Table of Codons and Amino Acids" and so became followers of Woese.
The REs are known to interact very specifically with their RSs. We tried to find RS-like oligonucleotides in the coding sequences of the REs. The RSs are usually simple, short sequences, and it is not possible to find 3–6 residue long sequences by using conventional sequence similarity searching methods such as BLAST or FASTA. However, an unconventional method, the multiple sequence alignment of overlappingly translated sequences, seems to be useful for finding and visualizing short sequence similarities. The method is rapid and informative. A disadvantage is that no methods have been developed for exact statistical evaluation of the results. We were able to confirm that PstI isoschizomers are rather similar to each other and contain conserved sequences (as expected). However, we also made the new observation that many of these conserved sequences are short and RS-like. Even if some of the shortest (3–4 nucleotide long) RS-like sequences could easily have been found by chance, their conservation indicates biological significance.
This indication was further strengthened by our second study of seven REs with known 3D structures. A statistically significant overrepresentation of 5–8 residue long RS-like sequences was found in the coding sequences of these enzymes. Conservation and overrepresentation do not automatically confirm a biological role; however, they are a strong argument for one. Codons for alanine, glycine, valine and aspartate have a relatively high frequency in the acceptor stems of their respective tRNAs. The tRNAs with complementary anticodons also had some kind of complementarity in their acceptor stems. Such relationships could support the hypothesis that one or more anticodon nucleotides were historically related to an acceptor stem nucleotide needed for aminoacylation, i.e. they are signs of codon-amino acid co-evolution.
Even more convincing evidence is, of course, the visualization of a stereochemical relationship between a particular codon and its amino acid. The increasing amount of freely available crystallographic data, including structures of DNA-protein complexes, might give us this type of evidence, and we show here an example of it. We were able to find 55 examples where an amino acid was co-located with its own codon in such a way that it might indicate a stereospecific interaction.
In the case of the positively charged amino acids, atoms with opposite partial charges (N, O) were involved and they were close enough to each other to interact. The hydrophobic amino acids were probably too far from their own codon-like triplet in the RS to interact directly with the dsDNA. However, the DNA-RE structure changes during the enzyme reaction and different parts of the protein might be involved during the process.
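The distance measures discussed here (the closest atoms of a codon-like triplet and an amino acid, and the closest nitrogen-oxygen pair) can be sketched as follows. The sketch assumes that atoms are available as (element, coordinates) pairs, for example after parsing a PDB entry; the data layout, the function names and the coordinates in the example are hypothetical.

```python
import math
from statistics import mean, stdev


def closest_distance(group_a, group_b, elements_a=None, elements_b=None):
    """Shortest distance (in the units of the coordinates) between two atom groups.

    Each atom is an (element, (x, y, z)) pair. The optional element sets
    restrict the search, e.g. to nitrogen atoms in one group and oxygen
    atoms in the other.
    """
    distances = [
        math.dist(xyz_a, xyz_b)
        for elem_a, xyz_a in group_a
        for elem_b, xyz_b in group_b
        if (elements_a is None or elem_a in elements_a)
        and (elements_b is None or elem_b in elements_b)
    ]
    return min(distances)


def mean_sem(values):
    """Mean and standard error of the mean of a list of distances."""
    return mean(values), stdev(values) / math.sqrt(len(values))


if __name__ == "__main__":
    # Made-up coordinates in Å, for illustration only.
    amino_acid = [("N", (0.0, 0.0, 0.0)), ("C", (1.0, 0.0, 0.0))]
    triplet = [("O", (0.0, 0.0, 3.0)), ("C", (1.0, 0.0, 2.0))]
    print(closest_distance(amino_acid, triplet))                # any atom pair
    print(closest_distance(amino_acid, triplet, {"N"}, {"O"}))  # N to O only
    print(mean_sem([3.0, 3.5, 4.0]))
```

Applied to each co-located codon-amino acid pair, the first function gives the per-pair shortest distance and the second summarizes those distances as mean +/- S.E.M.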
The number of examples in the presented sample is impressive: 41.5% perfect matches between amino acids and codon-like sequences within a distance of 8 Å. However, 55 examples are still too few to categorize the types of interaction and draw a general conclusion.
We are aware of some early model studies [19, 20] indicating a stereochemical relationship between coding triplets and amino acids, as well as of the error in that model building. We do not want to repeat that mistake while searching for a fallen apple close to its tree.
Our previous research, the construction of a Common Periodic Table of Codons and Amino Acids, already indicated that amino acids and nucleic acids developed in close connection with each other. The present results provide additional evidence to strengthen Woese's hypothesis that there is a stereochemical connection between amino acids and their codons. They also show that current bioinformatics methods and existing databases provide realistic conditions for studying this question.
The author is grateful to Dr Clare Sansom (Birkbeck College, London) for her helpful comments and suggestions regarding the preparation of the manuscript.
- Gill G: Regulation of the initiation of eukaryotic transcription. Essays Biochem 2001, 37: 33–43.
- Biro JC, Benyo B, Sansom C, Slavecz A, Fordos G, Micsik T, Benyo Z: A common periodic table of codons and amino acids. Biochem Biophys Res Com 2003, 306: 408–415. 10.1016/S0006-291X(03)00974-4
- Woese CR: The Molecular Basis for Gene Expression in: The Genetic Code. Harper & Row, New York 1967, Chapters 6–7: 156–160.
- Roberts RJ, Vincze T, Posfai J, Macelis D: REBASE – restriction enzymes and methylases. Nucleic Acids Research 2003, 31: 418–420. 10.1093/nar/gkg069
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2003, 31: 23–7. 10.1093/nar/gkg057
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan L, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095
- Berman HM, Westbrook JZ, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Research 2000, 28: 235–242. 10.1093/nar/28.1.235
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–80.
- Clamp M: Jalview. 1999. [http://www.ebi.ac.uk/~michele/jalview/]
- Biro JC: Overlapping translation of nucleic acids for bioinformatics applications. Med Hypotheses 2003, 60: 654–659. 10.1016/S0306-9877(03)00008-2
- Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis 1997, 18: 2714–2723.
- Student's t-test [http://www.physics.csbsju.edu/stats/t-test_NROW_form.html]
- Biro JC: Speculation about alternative DNA structures. Med Hypotheses 2003, 61: 86–97. 10.1016/S0306-9877(03)00123-3
- Watson JD, Baker TA, Bell SP, Gann A, Levine M, Losick R: The structure of DNA and RNA. In Molecular biology of the gene, 5th Edition. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York 2004, chapter 6: 1–33. (preprint in 2003)
- Crick FHC: The origin of the genetic code. J Mol Biol 1968, 38: 367–379.
- Schimmel P: Origin of genetic code: A needle in the haystack of tRNA sequences. Proc Natl Acad Sci USA 1996, 93: 4521–4522. 10.1073/pnas.93.10.4521
- Moller W, Janssen GM: Statistical evidence for remnants of the primordial code in the acceptor stem of prokaryotic transfer RNA. J Mol Evol 1992, 34: 471–477.
- Rodin S, Rodin A, Ohno S: The presence of codon-anticodon pairs in the acceptor stem of tRNAs. Proc Natl Acad Sci USA 1996, 93: 4537–4542. 10.1073/pnas.93.10.4537
- Pelc SR, Welton MGE: Stereochemical relationship between coding triplets and amino-acids. Nature 1966, 209: 868–870.
- Welton MGE, Pelc SR: Specificity of the Stereochemical relationship between ribonucleic acid-triplets and amino-acids. Nature 1966, 209: 870–872.
- Crick FHC: An Error in Model Building. Nature 1967, 213: 798.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.