APTE: identification of indirect read-out A-DNA promoter elements in genomes
© Whitley et al.; licensee BioMed Central Ltd. 2014
Received: 5 March 2014
Accepted: 20 August 2014
Published: 26 August 2014
Transcriptional regulation is normally based on the recognition by a transcription factor of a defined base sequence in a process of direct read-out. However, the nucleic acid secondary and tertiary structure can also act as a recognition site for the transcription factor in a process known as indirect read-out, although this is much less understood. We have previously identified such a transcriptional control mechanism in early Xenopus development where the interaction of the transcription factor ilf3 and the gata2 promoter requires the presence of both an unusual A-form DNA structure and a CCAAT sequence. Rapid identification of such promoters elsewhere in the Xenopus and other genomes would provide insight into a less studied area of gene regulation, although currently there are few tools to analyse genomes in such ways.
In this paper we report the implementation of a novel bioinformatics approach that has identified 86 such putative promoters in the Xenopus genome. We have shown that five of these sites are A-form in solution and bind to transcription factors, and we have fully validated one of these newly identified promoters as interacting with the ilf3-containing complex CBTF. This interaction regulates the transcription of a previously uncharacterised downstream gene that is active in early development.
A Perl program (APTE) has located a number of potential A-form DNA promoter elements in the Xenopus genome. Five of these putative targets have been experimentally validated as A-form and as targets for specific DNA binding proteins; one has also been shown to interact with the A-form binding transcription factor ilf3. APTE is available from http://www.port.ac.uk/research/cmd/software/ under the terms of the GNU General Public License.
Keywords: A-form DNA, Transcription, Xenopus, Promoter analysis
Transcription is the major level at which cellular protein concentration is regulated in response to environmental and developmental cues. Transcriptional control is mediated by the interaction of transcription factors and DNA elements. These elements are normally one instance of a set of similar sequences (or motifs) that the transcription factor ‘reads’ in a process known as direct read-out. There are some cases, however, where the transcription factor recognises not the sequence per se but the structure that the DNA adopts as a consequence of both sequence and conditions. The disruption of the DNA from the standard B-form conformation acts as a recognition site for the transcription factor in a process known as indirect read-out. This is well established in prokaryotes [1–3] but less recognised in eukaryotic cells, although an indirect read-out mechanism has been suggested for a selection of eukaryotic gene promoters [4–6]. Given the size of vertebrate genomes it is highly likely that some regions consist of sequences forming non-canonical structures and that some of these are regulatory. Indeed, local DNA topography has been shown to correlate better than sequence with functional non-coding regions of the human genome.
The canonical double-stranded DNA structure is B-form, a right-handed helix with 3.4 Å between base pairs and a base tilt of 6 degrees to the helix axis. However, DNA can exist in a number of other conformations, the major types being A-form, Z-form and tetraplex, all of which have been implicated in gene regulation [8–10]. A-form is the canonical dsRNA structure, also a right-handed helix but with only 2.6 Å between base pairs and a 20-degree base tilt; the sugar in A-form is in the C3′-endo conformation, in contrast to the C2′-endo conformation observed for B-form. These differences lead to A-form helices being ‘shorter and fatter’, possessing major and minor grooves of similar width, with the major groove deepened with respect to the B-form structure. Although DNA is usually in the canonical B-form, it can be induced into A-form by dehydration, and certain DNA sequences can naturally adopt an A-form helix under physiological conditions. These A-form elements can then be specifically recognised by DNA binding proteins.
The interaction of the Xenopus CCAAT box transcription factor (CBTF) complex and the promoter of the developmentally important gata2 gene is an example of a transcriptional regulatory mechanism involving A-form DNA. We have previously shown that this mechanism is based on an interaction requiring both DNA base specific (direct read-out) and DNA structure specific (indirect read-out) interactions [6, 8]. The CBTF complex is composed of approximately eight sub-units, of which the ilf3 protein is currently the only published component; however, this subunit is critical for CBTF activity. Ilf3 is found in the nucleus when the gata2 gene, a developmentally regulated gene involved in blood formation, is transcribed. A number of biochemical experiments have also confirmed ilf3 as a regulator of gata2 transcription, including chromatin immunoprecipitation (ChIP) identifying ilf3 at the gata2 promoter during active transcription of this gene. Therefore the CBTF complex and its interactions are of interest from both developmental and transcriptionally mechanistic viewpoints.
Ilf3 contains two double stranded RNA binding domains (dsRBDs) and these domains are required for transcriptional activation in vivo and DNA binding in vitro. The RNA binding activity of ilf3, and other dsRBD containing proteins, has been well characterised; indeed, ilf3 was first identified through its interaction with RNA. Crystal and NMR structures of a dsRBD alone exist, as does a crystal structure of the protein-RNA complex. Alongside saturation mutagenesis studies, these structural studies have revealed that the domains recognise the A-form helical structure of double stranded RNA, although far less is known about their interaction with DNA. We have previously shown that Xenopus ilf3 contributes to the activity of CBTF as a transcriptional activator by its interaction with structure-specific DNA sequences. Specifically, the dsRBDs of ilf3 are capable of interacting not only with A-form RNA but also with non-canonical A-form DNA, such as that at the gata2 promoter.
Here we report the development and validation of a bioinformatics tool for the analysis of genomic data to identify other potential promoters that utilise an A-form DNA structural component; in particular, those that are responsive to the transcription factor ilf3.
Results and discussion
Predicted promoter elements
Frequency of A-DNA promoter sequences in Xenopus tropicalis 4.2 genome (apelen ≥ 10, motifgap ≤ 20, motifs for combined promoter sequences: CCAAT, GGGCGG, AGATA and TGATA)
A-form promoter sequences (APS)
Combined promoter sequences (CPS)
Total number of genes in genome
Genes with APS within 500 bp upstream of TSS: 586 (3.18% of genes)
Genes with CPS within 500 bp upstream of TSS: 86 (0.47% of genes)
Frequency of motifs in combined promoter sequences (CPS) in Xenopus tropicalis 4.2 genome (apelen ≥ 10, motifgap ≤ 20)
Genes with motif within 500 bp upstream of TSS
Total number of motifs in genome
Motifs within 500 bp upstream of TSS (including multiples)
Motifs in CPS
Motifs in CPS within 500 bp upstream of TSS
Gene IDs and names of the genes immediately downstream of the 86 putative A-form promoter elements identified in the JGI 4.2 genome assembly. The associated promoter motif sequence for each hit is shown alongside.
Selection and validation of a predicted promoter
We have previously identified and characterised a promoter element that requires an unusual A-form DNA structure in conjunction with a known promoter sequence motif. This combination of direct and indirect read-out mechanisms drives early embryonic expression of the gata2 gene in Xenopus and is responsive to the ilf3-containing transcription factor complex CBTF. However, the question of the prevalence of this type of regulatory mechanism in genomes remained. To address this we implemented a Perl program to investigate its occurrence and used it to search the 4.2 version of the Xenopus genome. From the 86 hits obtained we selected five to test both for actual A-form structure and as specific targets for DNA binding proteins. All five of the selected targets were experimentally validated as A-form and as protein binding sites. One of these five, which contains a CCAAT motif as does the previously identified gata2 promoter, was selected for further validation. This element is the putative promoter for the gdi3 gene and was shown by supershift to be a target for the known gata2 transcription factor ilf3. Expression of gdi3 begins shortly after that of gata2, and gdi3 transcription is also responsive to ilf3 fusion proteins in vivo. Taken together, this is strong evidence that the element identified by the program is a critical component of the gdi3 promoter.
Identification of the promoter elements required the A-forming potential of each base triplet of a given sequence to be calculated in a moving window along the genome using the method of Basham et al. In the overwhelming majority of hits the APS consists of a consecutive sequence of Cs or Gs, with the first or second position in a block of Cs occasionally replaced by a T. Only five cases were observed where this pattern does not hold, all involving repeated blocks of ATGC. However, it should be noted that APE values do not exist for 14 of the 64 possible triplets, which are effectively ignored by the present algorithm. The reliability of the method would no doubt be increased if these non-determined values were assigned. Despite this, apte provides a powerful tool for the potential identification of A-form regulatory elements in whole genomes. A major problem in eukaryotic transcriptional studies is that transcription factor binding sites occur with high frequency, and this leads to many ‘false positive’ identifications of promoter elements by search programs. By also considering DNA structure, the reliability of such search programs could potentially be significantly enhanced. For instance, there are 25,253 CCAAT sequences (counting multiples per gene) within 500 bp of a TSS in the 4.2 genome and 54,703 APS sequences anywhere in the genome. However, only 36 occur in conjunction, a far more manageable number to screen.
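The moving-window step described above can be illustrated with a minimal Python sketch. It is not the apte source code (which is written in Perl), and it rests on several assumptions: that an APS is a run of at least apelen consecutive overlapping triplets with negative APE values, that triplets without an APE value simply end a run, and that the scores in TOY_APE are hypothetical placeholders rather than the published values of Basham et al.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder scores, NOT the published Basham et al. values.
# A negative score is assumed to mean the triplet favours A-form.
TOY_APE: Dict[str, float] = {
    "CCC": -1.0, "GGG": -1.0, "CCG": -0.5, "CGG": -0.5,
    "GCC": -0.5, "GGC": -0.5, "AAA": 1.0, "TTT": 1.0,
}


def find_aps(seq: str, ape: Dict[str, float], apelen: int = 10) -> List[Tuple[int, int]]:
    """Return 0-based half-open (start, end) spans covered by runs of at
    least `apelen` consecutive overlapping triplets with negative scores.

    Assumption: a triplet with no score in `ape` ends the current run.
    """
    seq = seq.upper()
    spans: List[Tuple[int, int]] = []
    run_start = None  # index of the first triplet in the current run
    n_triplets = max(len(seq) - 2, 0)
    # Iterate one step past the last triplet so a trailing run is flushed.
    for i in range(n_triplets + 1):
        score = ape.get(seq[i:i + 3]) if i < n_triplets else None
        if score is not None and score < 0:
            if run_start is None:
                run_start = i
            continue
        if run_start is not None and i - run_start >= apelen:
            # The last triplet of the run starts at i - 1 and ends at i + 2.
            spans.append((run_start, i + 2))
        run_start = None
    return spans


demo = "AAA" + "C" * 12 + "TTT"
print(find_aps(demo, TOY_APE, apelen=10))
```

With the toy table, the twelve Cs in the demo sequence give exactly ten consecutive CCC triplets, so the run just meets the default apelen threshold; raising apelen to 11 yields no hit.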
Previous work on indirect read-out mechanisms involved in DNA recognition has largely been limited to in vitro experiments. Our validation of gdi3 as being regulated by such a mechanism is at least partially in vivo. Within eukaryotic genomes DNA is chromatinised, with the interactions between the histones and the DNA providing not only packaging but also regulatory functions. It is unclear how non-B-form DNA structures affect chromatinisation; possibly they chromatinise less well and therefore form bare regions at promoters. However, the fact that we have identified a gene that is regulated in vivo by an A-form binding protein suggests that these structures persist within the chromatin environment.
Although our results mainly reflect the identification of genes responsive to the ilf3 transcription factor, other A-form DNA binding proteins may also recognise these elements. Importantly, the ability to look at whole genome assemblies means that it is now possible to study the role of these A-form elements within gene regulatory networks.
Algorithm and implementation
The algorithm is implemented as a Perl program named apte (A-form promoter transcription elements), which provides both a command-line interface and a Perl/Tk graphical interface. The program reads genomic sequence data from General Feature Format (GFF) Version 3 files (http://www.sequenceontology.org/gff3.shtml) and from Ensembl MySQL databases (http://www.ensembl.org/info/data/ftp/index.html). GFF input files should contain a list of genes to be searched and the DNA sequence in FASTA format. Access to Ensembl databases is provided through the Ensembl Perl API (http://www.ensembl.org/info/docs/api/index.html) which is a prerequisite for the program.
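As an illustration of the kind of input described above, the following Python sketch reads gene features and an embedded sequence from GFF3 text. It is not how apte itself parses its input. It assumes the sequence is supplied after the standard GFF3 `##FASTA` directive and that only features of type `gene` are of interest; the `Gene` tuple and `parse_gff3` function are hypothetical names introduced here.

```python
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    seqid: str
    start: int    # 1-based inclusive, as in GFF3
    end: int      # 1-based inclusive
    strand: str
    gene_id: str


def parse_gff3(text: str) -> Tuple[List[Gene], Dict[str, str]]:
    """Collect 'gene' features and the sequences of the ##FASTA section."""
    genes: List[Gene] = []
    chunks: Dict[str, List[str]] = {}
    in_fasta = False
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "##FASTA":
            in_fasta = True
            continue
        if in_fasta:
            if line.startswith(">"):
                name = line[1:].split()[0]   # keep the ID, drop the description
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line.upper())
            continue
        if line.startswith("#"):
            continue                          # other directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds semicolon-separated tag=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        genes.append(Gene(cols[0], int(cols[3]), int(cols[4]), cols[6],
                          attrs.get("ID", "")))
    seqs = {k: "".join(v) for k, v in chunks.items()}
    return genes, seqs


example = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "test", "gene", "5", "10", ".", "+", ".", "ID=g1;Name=demo"]),
    "##FASTA",
    ">chr1 toy sequence",
    "acgt",
    "ACGT",
])
print(parse_gff3(example))
```

GFF3 coordinates are 1-based and inclusive, so they would need converting before being compared with 0-based positions from a sequence scan.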
The main input parameters for apte are: motif, the promoter motif sequence; apelen, the minimum number of negative APE values in the APS; motifgap, the maximum number of bases between the APS and the motif; and genegap, the size of the region preceding the TSS to be searched. The default values adopted for the parameters are motif = CCAAT, apelen = 10, motifgap = 20 and genegap = 500. Searches can cover an entire genome or be limited to a specific gene or sequence region. Searches can also be made solely for A-DNA promoter sequences or promoter motifs. Results are output as a tab-separated table with a row for each combined sequence found, listing the APS and motif positions and summary details of the corresponding gene. Options are provided to write the results in GFF format, or as BED or WIG format files which may be uploaded to the Ensembl genome browser for display as custom tracks. The BED files indicate the location of the APS, the motif and the sign of the APE values over the search region. The WIG files plot the APE scores over the search region.
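How the motif, motifgap and genegap parameters might combine can be sketched in Python as follows. This is an illustrative reading of the parameter descriptions above, not the apte implementation. It assumes a forward-strand gene, requires both the APS and the motif to lie entirely within the genegap bases preceding the TSS, and measures motifgap between the nearest edges of the two elements. All function names are hypothetical, and the BED lines use only the first four standard columns.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 0-based, half-open [start, end)


def find_motif(seq: str, motif: str = "CCAAT") -> List[Span]:
    """All (possibly overlapping) occurrences of `motif` on the given strand."""
    hits: List[Span] = []
    i = seq.find(motif)
    while i != -1:
        hits.append((i, i + len(motif)))
        i = seq.find(motif, i + 1)
    return hits


def gap(a: Span, b: Span) -> int:
    """Number of bases separating two spans; 0 if they touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def combined_hits(aps: List[Span], motifs: List[Span], tss: int,
                  motifgap: int = 20, genegap: int = 500) -> List[Tuple[Span, Span]]:
    """Pair each APS with each motif that lies within `motifgap` bases of it,
    keeping only pairs located in the `genegap` bases upstream of `tss`."""
    lo = max(0, tss - genegap)

    def in_region(s: Span) -> bool:
        return lo <= s[0] and s[1] <= tss

    return [(a, m)
            for a in aps if in_region(a)
            for m in motifs if in_region(m) and gap(a, m) <= motifgap]


def to_bed(seqid: str, hits: List[Tuple[Span, Span]]) -> List[str]:
    """Format hits as 4-column BED lines (BED is 0-based, half-open)."""
    lines: List[str] = []
    for a, m in hits:
        lines.append(f"{seqid}\t{a[0]}\t{a[1]}\tAPS")
        lines.append(f"{seqid}\t{m[0]}\t{m[1]}\tmotif")
    return lines


seq = "A" * 100 + "CCAAT" + "A" * 10 + "C" * 12 + "A" * 50
aps = [(115, 127)]  # e.g. the run of Cs, as found by an APS scan
hits = combined_hits(aps, find_motif(seq), tss=150)
print("\n".join(to_bed("chr1", hits)))
```

A reverse-strand gene would need the upstream region taken on the other side of the TSS and the motif searched as its reverse complement; that case is omitted here for brevity.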
Microinjection and RT-PCR
Xenopus embryos were collected at time points during early developmental stages according to Nieuwkoop and RNA was extracted for RT-PCR analysis using the method of Steinbach and Rupp. The samples were amplified to the linear phase of the amplification, with the ODC gene used as an internal control; all primer sequences are available in the supplemental information. Synthetic mRNA was prepared as previously described and injected into both cells of two-cell stage embryos.
An Applied Photophysics Pi* 180 instrument was flushed with oxygen-free nitrogen gas for all CD experiments. Cell pathlengths of 1 mm and 4 mm were used to obtain far and near ultraviolet data respectively. Each duplex was dissolved in 100 mM KF, 5 mM NaPO4 buffer, pH 7.6, at room temperature and stored on ice. Concentrations were determined by UV measurements at 260 nm coupled with snake-venom phosphodiesterase time course digestions to correct for hypochromic difference. The samples were run at 20 ± 0.1 °C using a Melcor Peltier Thermoelectric Temperature Control Unit. Data were collected every 1 nm over the wavelength range 183 nm to 360 nm using adaptive sampling in conjunction with signal averaging in all cases. The instrument wavelength accuracy was ±0.1 nm, determined from the xenon lines, and the ellipticity was calibrated with camphorsulphonic acid at 290.5 nm.
Electrophoretic mobility shift assay (EMSA)
DNA oligonucleotides (Invitrogen) were annealed to form duplexes and end-labeled by T4 polynucleotide kinase (NEB) using [γ-33P]ATP. The proteins were incubated with the nucleic acid probe for 15 minutes on ice in EMSA buffer in the presence of 500 ng poly dI-dC. Either wild-type or mutant non-labeled competitor was added at a 50-fold excess to two of the reactions, while a third reaction was incubated with anti-ilf3 antibody to allow identification of the specific DNA-protein complex. After incubation the DNA and DNA-protein complexes were separated on a 4% native polyacrylamide gel in 0.25× TBE. The gels were dried and visualized using a phosphorimager (Fuji).
We would like to thank Dr Colin Sharpe for discussion concerning experimental procedure, and Mr Benjamin Marconnet (IUT Belfort) for contributions to the apte program. This work was supported by the Institute of Biomedical and Biomolecular Science, University of Portsmouth.
1. Mauro SA, Pawlowski D, Koudelka GB: The role of the minor groove substituents in indirect readout of DNA sequence by 434 repressor. J Biol Chem. 2003, 278: 12955-12960. doi:10.1074/jbc.M212667200.
2. Chen S, Gunasekera A, Zhang X, Kunkel TA, Ebright RH, Berman HM: Indirect readout of DNA sequence at the primary-kink site in the CAP-DNA complex: alteration of DNA binding specificity through alteration of DNA kinking. J Mol Biol. 2001, 314: 75-82. doi:10.1006/jmbi.2001.5090.
3. McGeehan JE, Streeter SD, Thresh SJ, Ball N, Ravelli RB, Kneale GG: Structural analysis of the genetic switch that regulates the expression of restriction-modification genes. Nucleic Acids Res. 2008, 36: 4778-4787. doi:10.1093/nar/gkn448.
4. Fairall L, Martin S, Rhodes D: The DNA binding site of the Xenopus transcription factor IIIA has a non-B-form structure. EMBO J. 1989, 8: 1809-1817.
5. Borden KL: The activating transcription factor region within the E2A promoter exists in a novel conformation. Biochemistry. 1993, 32: 6506-6514. doi:10.1021/bi00077a003.
6. Llewellyn KJ, Cary PD, McClellan JA, Guille MJ, Scarlett GP: A-form DNA structure is a determinant of transcript levels from the Xenopus gata2 promoter in embryos. Biochim Biophys Acta. 2009, 1789: 675-680. doi:10.1016/j.bbagrm.2009.07.007.
7. Parker SC, Hansen L, Abaan HO, Tullius TD, Margulies EH: Local DNA topography correlates with functional noncoding regions of the human genome. Science. 2009, 324: 389-392. doi:10.1126/science.1169050.
8. Scarlett GP, Elgar SJ, Cary PD, Noble AM, Orford RL, Kneale GG, Guille MJ: Intact RNA-binding domains are necessary for structure-specific DNA binding and transcription control by CBTF122 during Xenopus development. J Biol Chem. 2004, 279: 52447-52455. doi:10.1074/jbc.M406107200.
9. Champ PC, Maurice S, Vargason JM, Camp T, Ho PS: Distributions of Z-DNA and nuclear factor I in human chromosome 22: a model for coupled transcriptional regulation. Nucleic Acids Res. 2004, 32: 6501-6510. doi:10.1093/nar/gkh988.
10. Brooks TA, Kendrick S, Hurley L: Making sense of G-quadruplex and i-motif functions in oncogene promoters. FEBS J. 2010, 277: 3459-3469. doi:10.1111/j.1742-4658.2010.07759.x.
11. Basham B, Schroth GP, Ho PS: An A-DNA triplet code: thermodynamic rules for predicting A- and B-DNA. Proc Natl Acad Sci U S A. 1995, 92: 6464-6468. doi:10.1073/pnas.92.14.6464.
12. Cazanove O, Batut J, Scarlett G, Mumford K, Elgar S, Thresh S, Neant I, Moreau M, Guille M: Methylation of Xilf3 by Xprmt1b alters its DNA, but not RNA, binding activity. Biochemistry. 2008, 47: 8350-8357. doi:10.1021/bi7008486.
13. Bass BL, Hurst SR, Singer JD: Binding-properties of newly identified Xenopus proteins containing dsRNA-binding motifs. Curr Biol. 1994, 4: 301-314. doi:10.1016/S0960-9822(00)00069-5.
14. Bycroft M, Grunert S, Murzin AG, Procter M, St Johnston D: NMR solution structure of a double stranded RNA-binding domain from Drosophila staufen protein reveals homology to the N-terminal domain of ribosomal protein S5. EMBO J. 1995, 14: 4385-4391.
15. Ramos A, Grünert S, Adams J, Micklem DR, Proctor MR, Freund S, Bycroft M, St Johnston D, Varani G: RNA recognition by a Staufen double-stranded RNA-binding domain. EMBO J. 2000, 19: 997-1009. doi:10.1093/emboj/19.5.997.
16. Ohkuma Y, Horikoshi M, Roeder RG, Desplan C: Binding site dependent direct activation and repression of in vitro transcription by Drosophila homeodomain proteins. Cell. 1990, 61: 475-484. doi:10.1016/0092-8674(90)90529-N.
17. Nieuwkoop PD, Faber J: Normal Table of Xenopus laevis (Daudin). 1967, Amsterdam: North Holland Publishing Co.
18. Rupp R, Steinbach O: Quantitative Analysis of mRNA Levels in Xenopus Embryos by Reverse Transcriptase - Polymerase Chain Reaction (RT-PCR). Molecular Methods in Developmental Biology: Xenopus and Zebrafish, Vol. 127. Edited by: Guille M. 1998, New Jersey: Humana Press, 41-56.
19. Orford R, Guille M: Bandshift Analysis using Crude Oocyte and Embryo Extracts from Xenopus Laevis. Molecular Methods in Developmental Biology: Xenopus and Zebrafish, Vol. 127. Edited by: Guille M. 1999, New Jersey: Humana Press, 175-187.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.