TOBFAC: the database of tobacco transcription factors
© Rushton et al; licensee BioMed Central Ltd. 2008
Received: 27 September 2007
Accepted: 25 January 2008
Published: 25 January 2008
Regulation of gene expression at the level of transcription is a major control point in many biological processes. Transcription factors (TFs) can activate and/or repress the transcriptional rate of target genes, and vascular plant genomes devote approximately 7% of their coding capacity to TFs. Global analysis of TFs has only been performed for three complete higher plant genomes: Arabidopsis (Arabidopsis thaliana), poplar (Populus trichocarpa) and rice (Oryza sativa). Presently, no large-scale analysis of TFs has been made from a member of the Solanaceae, one of the most important families of vascular plants. To fill this void, we have analysed tobacco (Nicotiana tabacum) TFs using a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering of the tobacco genome. An analytical pipeline was developed to isolate TF sequences from the GSR dataset. This involved multiple (typically 10–15) independent searches with different versions of the TF family-defining domain(s) (normally the DNA-binding domain), followed by assembly into contigs and verification. Our analysis revealed that tobacco contains a minimum of 2,513 TFs representing all of the 64 well-characterised plant TF families. The number of TFs in tobacco is higher than previously reported for Arabidopsis and rice.
TOBFAC is an expandable knowledgebase of tobacco TFs with data currently available for a minimum of 2,513 TFs from 64 gene families. TOBFAC integrates available sequence information, phylogenetic analysis, and EST data with published reports on tobacco TF function. The database provides a major resource for the study of gene expression in tobacco and the Solanaceae and helps to fill a current gap in studies of TF families across the plant kingdom. TOBFAC is publicly accessible at http://compsysbio.achs.virginia.edu/tobfac/.
Tobacco [Nicotiana tabacum L.] is a member of the agriculturally important Solanaceae and is one of the most studied higher plant species. This is both because of its economic importance and because it is a convenient plant system for research. Tobacco can be easily transformed and has a relatively short generation time. A system of reduced complexity, the tobacco Bright Yellow-2 (BY-2) cell line, is also available; this cell line is fast growing, responds to a variety of plant hormones and can be stably transformed. BY-2 cells are an excellent experimental system for studies of gene expression and secondary metabolism. The one missing piece in the puzzle is the genome sequence of tobacco.
The large genome size of tobacco (approximately 4.5 Gb) makes the goal of sequencing the tobacco genome difficult. Fortunately, there are now a number of methods that can deliver sequence information on the vast majority of genes in a species without the need to sequence and assemble the entire genome. One of these techniques is methylation filtration (MF), which preferentially clones the hypomethylated fraction of the genome, effectively reducing the size of the genome to be sequenced. MF has already been successfully applied in maize, sorghum and cowpea [2–5]. The development of MF followed studies of genome architecture that revealed that repetitive elements tend to form clusters within plant genomes that become heavily methylated (hypermethylated), leaving stretches of less-methylated (hypomethylated), low-copy gene-rich space scattered in islands throughout the genome [6, 7].
The Tobacco Genome Initiative (TGI) has obtained sequence from an estimated minimum of 90% of tobacco gene space (cultivar Hicks Broadleaf) using MF technology. We have used a dataset of 1,159,022 gene-space sequence reads (GSRs) generated by the TGI as the basis for identifying most members of 64 well-characterized transcription factor (TF) families. Our dataset is estimated to represent a minimum of 2,513 genes, and TOBFAC has been designed not only to be a repository for these sequences but also to be a major resource for all data concerning tobacco TFs.
Since the publication of the first version of the Arabidopsis genome sequence, finding the complete set of known TFs in plant genomes has become an attainable goal. However, only three higher plant genome sequences are currently available: Arabidopsis (Arabidopsis thaliana), poplar (Populus trichocarpa) and rice (Oryza sativa L.). A number of databases have been constructed that integrate all predicted TFs from these three genomes. These include the Plant Transcription Factors databases [10, 11], PlnTFDB and PlanTAPDB. In addition to these, there are a number of similar databases devoted to other species, but these are based on EST data. The species contained in these EST-based datasets include maize, wheat, barley, sorghum, sugarcane, cotton, soybean, potato, tomato, apple, orange, grape, sunflower, lotus, loblolly pine and white spruce. However, none of these is devoted to tobacco and none approaches the level of coverage and total of over 2,500 TFs that we have documented from tobacco. The Nicotiana tabacum genome is the result of an interspecific hybridization event between Nicotiana sylvestris and Nicotiana tomentosiformis, resulting in an allopolyploid genome with a basic chromosome number of x = 24. Based on a high-density microsatellite-based genetic map, there is clear evidence of several large translocation events between the ancestral chromosomes but no evidence of large duplication events. The extent of loss of chromosomal segments is difficult to assess in the absence of diploid maps for the ancestral species. It is therefore difficult to predict the total number of tobacco TFs, but it is likely to be over 3,000. After the databases that cover Arabidopsis, rice and poplar, TOBFAC is currently the most extensive species-specific higher plant TF database. It will be a major resource for the study of gene expression in tobacco and the Solanaceae and will also facilitate studies of TF families across the plant kingdom.
Construction and Content
Source dataset and knowledgebase construction
A total of 1,159,022 GSRs obtained by methylation filtering were downloaded from the TGI (March 6th 2006). The GSRs were imported into a PostgreSQL database with a schema very similar to that used for the Cowpea Genomics Initiative. Scripting was performed in Perl using Perl DBI/DBD for PostgreSQL database connectivity. The Perl modules HTML::Template and HTML::FillInForm were used to facilitate a Model-View-Controller software architecture, which separates the display layer (HTML) from the bioinformatics logic layer (Perl code). In addition to sequence data, the database stores BLAST results as well as some ancillary metadata. The TOBFAC web pages are conceptual views of the PostgreSQL relational database. Some web pages combine records from multiple tables and link to static files outside the database.
Identification and classification of tobacco TFs
Determination of the minimum number of genes in each TF family
It is estimated that at least 90% of tobacco TF genes have been tagged by the MF technology, but not all of these genes are present as complete gene sequences. A major task was therefore to calculate the minimum number of genes present in each tobacco TF family based on the GSRs. This was done by counting the independent sequences that contain a given portion of the conserved domain (say, amino acids 20–30). Because each independent sequence covering the same portion must come from a different gene, the largest number of copies found for any part of the domain represents the minimum number of genes of that family in tobacco. For larger gene families, the genes were first assigned to known subfamilies and the calculation was performed for each subfamily. The final number for the gene family was then the sum over all the subfamilies.
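The counting rule described above can be illustrated with a minimal sketch. It assumes that each independent sequence has already been reduced to the range of domain positions it covers; the function names, the tuple representation and the per-position depth count are illustrative assumptions, not the code used to build TOBFAC.

```python
from collections import Counter


def minimum_gene_count(domain_spans):
    """Return the largest number of independent sequences covering any one
    position of the conserved domain.

    domain_spans: list of (start, end) tuples, inclusive amino-acid positions
    of the domain covered by each independent sequence (assumed input format).
    """
    depth = Counter()
    for start, end in domain_spans:
        for position in range(start, end + 1):
            depth[position] += 1
    # The deepest position gives the minimum number of distinct genes.
    return max(depth.values(), default=0)


def family_minimum(subfamilies):
    """Sum the per-subfamily minima, as described for larger gene families.

    subfamilies: dict mapping a subfamily name to its list of spans.
    """
    return sum(minimum_gene_count(spans) for spans in subfamilies.values())


# Hypothetical example: three sequences overlap at positions 25-30.
spans = [(1, 30), (20, 60), (25, 40), (50, 60)]
print(minimum_gene_count(spans))
```

Counting depth per position is one simple way to find "the largest number of copies of any part of the domain"; a window-based count (for example, over amino acids 20–30) would follow the same idea.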
Phylogenetic analysis of TF families
Comparison of TOBFAC sequences and EST sequences
The main page for TOBFAC
TOBFAC is one of the most extensive species-specific higher plant TF databases, with only Arabidopsis, rice and poplar having more complete TF analysis to date. The data will provide a major resource for the study of gene expression in tobacco and the Solanaceae and will also facilitate studies of TF families across the plant kingdom. The current version will be updated regularly to improve the dataset of tobacco TFs. The main page for TOBFAC is the main web interface for all information contained in the database (see Figure 2). It contains introductory information about both tobacco and the GSR project conducted by the TGI. Links at the top of the main page enable downloading of both the complete set of TOBFAC sequences and the complete set of sequences that were used in the searches to find them. It is also possible to download the complete list of predicted tobacco genes and the Genbank entries for their individual constituent GSRs. There are also links to the homology search page, the published tobacco TF page and the list of publications on tobacco TFs. The published tobacco TF page contains a list of published tobacco TF genes sorted by gene family, with hotlinks to both Genbank and PubMed entries. The list of publications on tobacco TFs is reached via a hotlink on the main page and contains publications, again sorted by gene family, with hotlinks to PubMed entries and/or PDF files for all publications. A list of all 64 available TF gene families is present on the main page together with the minimum number of tobacco genes in each family. The TF family names link to the individual TF family pages.
The individual TF family pages
Each TF family has its own dedicated page (see Figure 3) that contains all sequences from this group. A button at the top of the page links to the complete list of sequences in the gene family. Other buttons allow downloading of the phylogenetic tree file, either with or without bootstrap values, and a JPEG of the phylogenetic tree. A short introduction describes the family and there is also a link to the family's Pfam accession. All predicted sequences are presented, together with the parts of the gene that each contains. An explanation of how the minimum number of genes in this gene family was calculated from this total number of sequences is also present. Below the introduction, each individual gene is listed together with links to the Genbank entries for every individual GSR that makes up this gene. The gene name links to the individual page for the gene itself. There is also a list of papers that relate to this family in tobacco, complete with links to their PubMed entries. At the bottom of each TF family page is a list of all published tobacco genes that belong to this family, together with links to the Genbank entries for these genes. This wealth of tobacco-related features should provide the user with most of the available information on this gene family in tobacco together with several different ways to utilise the TOBFAC sequences.
The individual gene pages
Each TF family page lists all of the genes in that family in the TOBFAC database. Each gene name links to a page dedicated to that gene, which contains a wealth of information designed to aid research into that gene. At the top of the page is the DNA sequence of the gene in FASTA format, followed by a six-frame translation of the sequence. Below that is a link to the Pfam accession page. If a gene has already been published, this is stated together with the name that it was published under and links to PubMed and Genbank entries for the gene. This enables integration of all available data on TOBFAC genes that have already been published. Additional data on the predicted gene follows, with results from searches of the EST data and Genbank. The top hits are presented, together with their definition lines and links to the database entries for the sequences. Below the searches are links to the Genbank entries of all of the individual GSRs that make up the predicted genes. Finally, the individual gene pages also contain the complete lists for the gene family of both relevant papers and published genes. Each individual gene page therefore integrates a large amount of data from multiple sources, contains multiple functionalities and provides a one-stop-shop environment for research into each of the 2,882 TOBFAC TF sequences.
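For readers unfamiliar with the six-frame translation shown on each gene page, the following is a minimal sketch of the general technique. It assumes the standard genetic code and unambiguous A/C/G/T input; the function names and frame labels are illustrative and do not describe how TOBFAC itself generates its translations.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical toy sequence; '*' marks a stop codon.
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

All six frames are shown because a GSR may lie on either strand and may begin at any codon position, so the frame containing the conserved domain is not known in advance.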
The homology search page
The homology search page is reached via a link on the main page and is a local installation of the NCBI BLAST program that allows searching against the TOBFAC sequences. It can also be used to search the 46,546 tobacco EST sequences from the ESTobacco project or the tobacco unigenes from Genbank. The integration of these datasets facilitates a "one-stop-shop" approach to the study of tobacco TFs. We have also performed BLAST searches with the TOBFAC sequences against the ESTobacco EST sequences. A total of 557 of the TOBFAC tobacco TFs are supported by EST sequences (see Figure 5). A list of TOBFAC genes that are supported by ESTs is available via a link on the main page.
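As an illustration of how EST support could be tallied from such searches, the sketch below reads BLAST tabular output (the 12-column tab-delimited layout: query, subject, percent identity, alignment length, and so on) and collects the query sequences with at least one qualifying EST hit. The identity and length cut-offs, the function name and the sequence identifiers are assumptions made for this example; the criteria actually used for TOBFAC are not stated here.

```python
def est_supported_genes(tabular_lines, min_identity=95.0, min_length=100):
    """Return the set of query IDs with at least one EST hit passing the cut-offs.

    tabular_lines: iterable of tab-delimited BLAST tabular lines, where
    column 1 is the query ID, column 3 the percent identity and column 4
    the alignment length. The thresholds are illustrative assumptions.
    """
    supported = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        identity = float(fields[2])
        length = int(fields[3])
        if identity >= min_identity and length >= min_length:
            supported.add(query_id)
    return supported


# Hypothetical hits: gene_A passes, gene_B fails the identity cut-off.
example = [
    "gene_A\test_1\t98.50\t250\t3\t0\t1\t250\t10\t259\t1e-120\t440",
    "gene_B\test_2\t82.00\t300\t54\t0\t1\t300\t5\t304\t1e-60\t240",
]
print(sorted(est_supported_genes(example)))
```

A high identity cut-off is chosen in this sketch because the aim is to match a predicted gene to transcripts of the same gene rather than to related family members; a different threshold may be appropriate when cultivars differ.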
The published tobacco TF page
TOBFAC aims to contain information on all known tobacco TF genes. This includes not only information from the TGI GSR project and EST sequences but also a list of published tobacco TF genes. The published tobacco TF page contains a list of published tobacco TF genes sorted by gene family and with hotlinks to both Genbank and PubMed entries. This list of genes facilitates the integration of data from multiple sources. Some care should be exercised, however, when comparing published sequences with GSR sequences because different ecotypes have been used in many cases and also published sequences have been obtained from the BY-2 cell line. For some of the larger gene families, we have tried to integrate published and GSR sequences, and although this has often been successful, this was not always the case. For this reason, future versions of the TOBFAC dataset will primarily be centered on GSR data and corresponding EST sequences.
The list of publications on tobacco TFs
TOBFAC is designed not only to be a database of all known tobacco TF sequences, but also a knowledge base of all tobacco TF-related material. A list of publications on tobacco TFs is reached via a hotlink on the main page and contains publications sorted by gene family and with hotlinks to PubMed entries and/or PDF files. Over half of the tobacco TF families have no published literature concerning them and this illustrates how the information contained in TOBFAC will facilitate novel areas of research in tobacco.
Discussion and Conclusion
Genome-related public databases are invaluable to the scientific community, and transcriptional regulation of gene expression is a major control point in many biological processes. Databases that contain information on complete or near-complete complements of TFs are a major resource for the plant research community, but these do not currently include a member of the Solanaceae, one of the most important families of vascular plants. TOBFAC fills this void and contains the majority of tobacco TFs (a minimum of 2,513) from 64 gene families together with extensive data on each gene family. TOBFAC integrates these TF sequences with published tobacco TFs, scientific literature on tobacco TFs, phylogenetic trees and EST data. The aim is to make TOBFAC the database of choice for transcription-related projects in tobacco and an important resource for projects in other members of the Solanaceae.
The availability of all, or nearly all, members of TF families from several species will facilitate the study of their biological functions, phylogenetic relationships, and the evolution of their DNA-binding domains. The dataset contained within TOBFAC has already been used for phylogenetic analysis of the ERF, WRKY, NAC, homeodomain, bZIP, bHLH, R2R3MYB, CONSTANS, ZIM, Dof and MADS box genes. These phylogenies have been successfully used both to predict gene function and to find novel subfamilies of TF genes (Rushton, P.J., manuscript in preparation).
Gene space sequences have been published for maize, sorghum and cowpea [2, 3, 5] and the near future will see this number increase. The computational pipeline described in this article (see Figure 1) can be applied to any new GSR dataset and modified to include new families of TFs as they are discovered. We believe it to be a robust method that increases the chances of isolating TF gene sequences from GSR datasets, especially for short or incomplete sequences and divergent members of gene families. This is in contrast to other approaches that use stringent filtering processes to avoid false positives but can thereby fail to identify up to 20% of TF genes.
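The idea of running several independent searches and pooling their results, rather than relying on one stringently filtered search, can be sketched as follows. The data layout, the permissive E-value cut-off and all names are assumptions made for illustration; this is not the pipeline code, and candidates retained this way would still require the assembly and verification steps described in the article.

```python
def pool_hits(search_results, max_evalue=10.0):
    """Pool read IDs found by any of several independent domain searches.

    search_results: dict mapping a query name (one version of the
    family-defining domain) to a list of (read_id, evalue) tuples.
    Returns a dict mapping each retained read ID to the set of queries
    that found it. The cut-off is a deliberately permissive assumption.
    """
    pooled = {}
    for query_name, hits in search_results.items():
        for read_id, evalue in hits:
            if evalue <= max_evalue:
                pooled.setdefault(read_id, set()).add(query_name)
    return pooled


# Hypothetical results from two versions of the same domain.
results = {
    "domain_v1": [("read1", 1e-20), ("read2", 5.0)],
    "domain_v2": [("read2", 1e-3), ("read3", 50.0)],
}
for read_id, queries in sorted(pool_hits(results).items()):
    print(read_id, sorted(queries))
```

Recording which queries found each read makes it easy to flag reads detected by only one domain version, which are the divergent or partial candidates most likely to be missed by a single search and most in need of manual verification.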
Future plans and releases
The current version of TOBFAC is based on 1,159,022 GSRs obtained from the TGI website. As additional GSRs, ESTs and BAC sequences become available, we will integrate these new sequences into TOBFAC and update both TF family data and phylogenetic analyses. Microarray-based transcription profiling is currently being performed for tobacco and expression data for TFs will also be included in future versions of TOBFAC.
Availability and requirements
Project name: TOBFAC: The database of tobacco transcription factors.
Project home page: http://compsysbio.achs.virginia.edu/tobfac/.
Operating system: Platform independent.
Programming language: Perl.
Other requirements: None.
Licence: None required.
Any restrictions to use by non-academics: None.
We wish to thank Shengcheng Han, Hongbo Zhang and Brett Castrodale for their input during the development of the database. This work was partially supported by funding from Philip Morris USA (Richmond, VA).
- Geelen DNV, Inze DG: A bright future for the bright yellow-2 cell culture. Plant Physiology 2001, 127(4):1375–1379. 10.1104/pp.127.4.1375
- Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR: Maize genome sequencing by methylation filtration. Science 2003, 302(5653):2115–2117. 10.1126/science.1091265
- Rabinowicz PD, Schutz K, Dedhia N, Yordan C, Parnell LD, Stein L, McCombie WR, Martienssen RA: Differential methylation of genes and retrotransposons facilitates shotgun sequencing of the maize genome. Nature Genetics 1999, 23(3):305–308. 10.1038/15479
- Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, SanMiguel P, Lakey N, Bedell J, Yuan Y, Budiman MA, Resnick A, Van Aken S, Utterback T, Riedmuller S, Williams M, Feldblyum T, Schubert K, Beachy R, Fraser CM, Quackenbush J: Enrichment of gene-coding sequences in maize by genome filtration. Science 2003, 302(5653):2118–2120. 10.1126/science.1090047
- Chen X, Laudeman T, Rushton P, Spraggins T, Timko M: CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences. BMC Bioinformatics 2007, 8(1):129. 10.1186/1471-2105-8-129
- Bedell JA, Budiman MA, Nunberg A, Citek RW, Robbins D, Jones J, Flick E, Rohlfing T, Fries J, Bradford K, McMenamy J, Smith M, Holeman H, Roe BA, Wiley G, Korf IF, Rabinowicz PD, Lakey N, McCombie WR, Jeddeloh JA, Martienssen RA: Sorghum genome sequencing by methylation filtration. PLoS Biology 2005, 3(1):103–115. 10.1371/journal.pbio.0030013
- Bennetzen JL, Schrick K, Springer PS, Brown WE, Sanmiguel P: Active Maize Genes Are Unmodified and Flanked by Diverse Classes of Modified, Highly Repetitive DNA. Genome 1994, 37(4):565–576.
- Tobacco Genome Initiative [http://www.tobaccogenome.org/]
- Riechmann JL, Heard J, Martin G, Reuber L, Jiang CZ, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, Creelman R, Pilgrim M, Broun P, Zhang JZ, Ghandehari D, Sherman BK, Yu CL: Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 2000, 290(5499):2105–2110. 10.1126/science.290.5499.2105
- Gao G, Zhong Y, Guo A, Zhu Q, Tang W, Zheng W, Gu X, Wei L, Luo J: DRTF: a database of rice transcription factors. Bioinformatics 2006, 22(10):1286–1287. 10.1093/bioinformatics/btl107
- Guo AY, He K, Liu D, Bai SN, Gu XC, Wei LP, Luo JC: DATF: a database of Arabidopsis transcription factors. Bioinformatics 2005, 21(10):2568–2569. 10.1093/bioinformatics/bti334
- Riano-Pachon DM, Ruzicic S, Dreyer I, Mueller-Roeber B: PlnTFDB: an integrative plant transcription factor database. BMC Bioinformatics 2007, 8.
- Richardt S, Lang D, Reski R, Frank W, Rensing SA: PlanTAPDB, a Phylogeny-Based Resource of Plant Transcription-Associated Proteins. Plant Physiol 2007, 143(4):1452–1466. 10.1104/pp.107.095760
- Plant Transcription Factor Databases [http://planttfdb.cbi.pku.edu.cn/]
- National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov/]
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Mol Biol Evol 2007, 24(8):1596–1599. 10.1093/molbev/msm092
- European Sequencing of Tobacco Project [http://www.estobacco.info/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.