PlnTFDB: an integrative plant transcription factor database
© Riaño-Pachón et al. 2007
Received: 22 December 2006
Accepted: 07 February 2007
Published: 07 February 2007
Transcription factors (TFs) are key regulatory proteins that enhance or repress the transcriptional rate of their target genes by binding to specific promoter regions (i.e. cis-acting elements) upon activation or de-activation of upstream signaling cascades. TFs thus constitute master control elements of dynamic transcriptional networks. They have fundamental roles in almost all biological processes (development, growth and response to environmental factors), and they are assumed to play important roles in the evolution of species. In plants, TFs have been employed to manipulate various types of metabolic, developmental and stress response pathways. Cross-species comparison and identification of regulatory modules, and hence of TFs, is expected to become increasingly important for the rational design of new plant biomass. Up to now, however, no computational repository has been available that provides access to the largely complete sets of transcription factors of sequenced plant genomes.
PlnTFDB is an integrative plant transcription factor database that provides a web interface to access large (close to complete) sets of transcription factors of several plant species, currently encompassing Arabidopsis thaliana (thale cress), Populus trichocarpa (poplar), Oryza sativa (rice), Chlamydomonas reinhardtii and Ostreococcus tauri. It also provides an access point to its daughter databases of a species-centered representation of transcription factors (OstreoTFDB, ChlamyTFDB, ArabTFDB, PoplarTFDB and RiceTFDB). Information including protein sequences, coding regions, genomic sequences, expressed sequence tags (ESTs), domain architecture and scientific literature is provided for each family.
We have created lists of putatively complete sets of transcription factors and other transcriptional regulators for five plant genomes. They are publicly available through http://plntfdb.bio.uni-potsdam.de. Further data will be included in the future when the sequences of other plant genomes become available.
Transcription factors (TFs) are proteins (trans-acting factors) that regulate gene expression levels by binding to specific DNA sequences (cis-acting elements) in the promoters of target genes, thereby enhancing or repressing their transcriptional rates. The identification and functional characterization of TFs is essential for the reconstruction of transcriptional regulatory networks, which govern major cellular pathways in the response to biotic (e.g. response against pathogens or symbiotic relationships) and abiotic (e.g. light, cold, salt content) stimuli, and intrinsic developmental processes (e.g. growth of organs). Two global types of TFs can be distinguished: basal or general, and regulatory or specific TFs. Basal TFs belong to the minimal set of proteins required for the initiation of transcription (e.g. TATA-box binding protein). Together with RNA polymerase they form the basal transcription apparatus, representing the core of each transcriptional process. In contrast, regulatory TFs bind proximal or distal (up or downstream) of the basal transcription apparatus and act either as constitutive or inducible factors. These proteins influence the initiation of transcription by contacting members of the basal apparatus. Regulatory TFs exert gene-specific and/or tissue-specific functions and influence the transcriptional levels of their target genes in response to different stimuli. In the following when using the term TF, we refer to regulatory TFs.
The large diversity of TFs and of the cis-acting elements they bind to is the source of an enormous combinatorial complexity, which allows fine-tuning of gene expression control and gives rise to a huge spectrum of developmental and physiological phenotypes. Therefore, it is not surprising that manipulating the expression of TFs often results in drastic phenotypic changes in the organism. This makes them extremely interesting candidates for biotechnological approaches (e.g. ). It is widely acknowledged that the evolution of regulatory networks is an important factor in the development of evolutionary novelties, and consequently in shaping biological diversity. A deep understanding of transcription factors and their regulatory networks would also improve our understanding of organism diversity [2, 3].
The cataloguing of eukaryotic transcription factors started more than a decade ago and has resulted, for example, in the generation of TRANSFAC®, a database of cis-acting elements and trans-acting factors. However, TRANSFAC® includes A. thaliana as the only plant species that is extensively represented. Other plant species are covered to a lesser extent (e.g. Zea mays, Nicotiana tabacum, Lycopersicon esculentum). Additionally, other TF databases focusing on single plant species are available (for A. thaliana [5–7], or O. sativa). Kummerfeld and Teichmann have created a server for the prediction of TFs in organisms with sequenced genomes. To date, however, none of the currently available databases provides a uniform platform to review plant TF families across several species, encompassing descriptions of each TF family, links to the appropriate literature, and cross-references between the databases by means of orthologous relationships.
Today, nuclear genome sequences are available for several hundred organisms, and the sequencing of many more is currently underway. This provides a huge opportunity for making comparisons along different evolutionary branches of the tree of life for various kinds of genes. In this study we have focused on plants and transcription factors. We have predicted the putatively complete sets of transcription factors in five plant species, i.e. the vascular plants Arabidopsis thaliana, Populus trichocarpa and Oryza sativa, and the algae Chlamydomonas reinhardtii and Ostreococcus tauri, and made the data available through a uniform web resource. Currently, various other plant genomes are being sequenced, including genomes from crops and experimental model species (see ). PlnTFDB@Uni-Potsdam provides an easily usable platform for the incorporation of new TF sequences from these and additional plant species.
Sequence data for A. thaliana were downloaded from TAIR [16, 17], annotation release version 6.0; for P. trichocarpa from JGI/DOE, annotation release version 1.1; for O. sativa from TIGR, annotation release version 4.0; for C. reinhardtii from JGI/DOE, annotation release version 3.1; and for O. tauri from the University of Ghent, annotation release version August 2006.
Transcription factors can be identified and grouped into different families according to their domain architecture, mainly taking into account their DNA-binding domains, as described by Riechmann et al. for A. thaliana. We have extended this approach by including new TF families and applied it in a systematic manner to other plant species.
Therefore, in a first step, we used the current literature to compile the list of all domains that are known to occur in TFs and that are generally employed to classify proteins as transcriptional regulators. The list was established from available PFAM profile Hidden Markov Models (HMMs) (v20.0). Additionally, we generated new models for further TF families, as indicated below.
One set of nodes (blue squares) represents protein families (i.e. transcription factors, solid color, or other transcriptional regulators, shaded) and the other set of nodes (yellow circles) represents protein domains. The edges indicate the connections between protein domains and families. A continuous edge represents a required relationship, i.e. the indicated domain must be present in a protein to be assigned to the respective TF family. A discontinuous edge represents a forbidden relationship, i.e. the definition of such a family excludes the presence of the given domain. Rules were implemented in a PERL script as "IF . . . THEN" statements ('Classifier' in Fig. 1).
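The required/forbidden relationships described above can be illustrated with a minimal sketch. It is written in Python rather than the authors' PERL, and the family names, domain names and rules (`FAMILY_A`, `domain_1`, and so on) are hypothetical placeholders, not the rules actually used in the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    """One classification rule: domains that must and must not be present."""
    family: str
    required: frozenset
    forbidden: frozenset = frozenset()


# Hypothetical rules for illustration only; they are not the published rule set.
RULES = [
    FamilyRule("FAMILY_A", frozenset({"domain_1"})),
    FamilyRule("FAMILY_B", frozenset({"domain_2"}), frozenset({"domain_3"})),
    FamilyRule("FAMILY_C", frozenset({"domain_2", "domain_3"})),
]


def classify(domains, rules=RULES):
    """Return the families whose required domains are all present
    and whose forbidden domains are all absent."""
    present = set(domains)
    return [
        rule.family
        for rule in rules
        if rule.required <= present and not (rule.forbidden & present)
    ]


if __name__ == "__main__":
    print(classify(["domain_2"]))              # matches FAMILY_B only
    print(classify(["domain_2", "domain_3"]))  # domain_3 excludes FAMILY_B
```

Each rule is a direct counterpart of an "IF . . . THEN" statement: a continuous edge becomes a member of `required`, a discontinuous edge a member of `forbidden`.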
The general pipeline we have developed for the identification and classification of TFs is shown in Fig. 1. Typically, the process starts with retrieving the complete set of predicted proteins for a given species, followed by a profile-HMM search with all available PFAM HMMs (v20.0) and the models that we have generated for further TF families. The search is carried out using the software package HMMER (v2.3.2). All significant HMM hits are kept. For the PFAM models, only those hits with a bit-score larger than the gathering score reported for the HMM were considered significant. For our own HMMs, hits with an e-value smaller than 10^-3 and a bit-score satisfying a threshold that differed for each HMM were considered significant. From this set of significant HMM hits, we discarded all proteins that contained domains having DNA-related activity but not generally regarded as being parts of transcriptional regulators (e.g. transposase-related domains). Thereby, we eliminated potential false positives right at the beginning. Finally, we applied the PERL script implementing the set of established rules for the identification and classification of TFs to the remaining set of proteins ('Classifier' in Fig. 1). The script produces as output a list of proteins that belong to the different classes of transcriptional regulators and their classification into the identified families.
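The significance filter and the removal of proteins carrying excluded domains might be sketched as follows. This is an illustration in Python, not the authors' code: the `Hit` record, the function names, and all model names and cutoffs in the usage example are hypothetical, and the sketch assumes that a hit to an in-house HMM must reach at least its per-model bit-score cutoff.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One profile-HMM hit (hypothetical record layout)."""
    protein: str
    model: str
    bit_score: float
    e_value: float


E_VALUE_CUTOFF = 1e-3  # e-value cutoff stated in the text for the in-house HMMs


def is_significant(hit, pfam_gathering, custom_bit_cutoffs):
    """PFAM models: bit-score above the gathering score.
    In-house models: e-value below 1e-3 and bit-score at least the
    model-specific cutoff (the 'at least' is an assumption)."""
    if hit.model in pfam_gathering:
        return hit.bit_score > pfam_gathering[hit.model]
    return (hit.e_value < E_VALUE_CUTOFF
            and hit.bit_score >= custom_bit_cutoffs[hit.model])


def candidate_proteins(hits, pfam_gathering, custom_bit_cutoffs, excluded_models):
    """Map each protein to its significant domains, then drop every protein
    that carries an excluded (e.g. transposase-related) domain."""
    domains = {}
    for hit in hits:
        if is_significant(hit, pfam_gathering, custom_bit_cutoffs):
            domains.setdefault(hit.protein, set()).add(hit.model)
    return {p: d for p, d in domains.items() if not (d & set(excluded_models))}


if __name__ == "__main__":
    hits = [
        Hit("p1", "PF_A", 30.0, 1e-8),
        Hit("p3", "custom_X", 15.0, 1e-5),
        Hit("p3", "PF_T", 50.0, 1e-20),  # excluded domain: p3 is discarded
    ]
    print(candidate_proteins(hits, {"PF_A": 25.0, "PF_T": 20.0},
                             {"custom_X": 10.0}, {"PF_T"}))
```

The resulting mapping of proteins to domain sets is the kind of input a rule-based classifier would then consume.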
For 31 out of 68 families the presence of a single domain was sufficient to assign membership (two out of the 31 families belong to the category of other transcriptional regulators). The remaining families were characterized by combinations of different domains. In this way we were able to classify transcription factors into 58 families plus 10 families for other types of transcriptional regulators, such as chromatin remodeling factors.
Table: Number of TFs per species (columns: total number of proteins; percentage of TFs).
For the families Alfin-like, CCAAT-Dr1, CCAAT-HAP3, CCAAT-HAP5, DBP, G2-like, GRF, HRT, LUG, NOZZLE, SAP, Trihelix, ULT and VOZ, no appropriate models were found in the PFAM (v20.0) database. Consequently, we created our own profile-HMMs based either on published multiple sequence alignments or on alignments we created from the outputs of PSI-BLAST searches run against the NCBI protein database. The alignments used to build the HMMs are available through our web interfaces.
The basic information in each species-specific database is structured in two sets of tables. One set (right side of the TF table) contains, in several tables, the information about the TF family: literature references, family description and domains relevant for their classification. The field relating the information in these tables is the family_id. The second set (left side of the TF table) contains five tables with the information related to the TFs themselves: sequences, domains present, domain alignments, expressed sequence tags (ESTs) and orthologs. The main field here is the cds_id, which unequivocally identifies every TF. One additional table, the TF table, relates the two sets of tables. This table has both keys, i.e. cds_id and family_id, and contains the information about the classification of the transcription factors into families. PlnTFDB itself consists of a single table with the following fields: coding sequence identifier, locus identifier, transcription factor family, md5sum of the protein sequence, description of the protein sequence and species name. The field md5sum_pep contains the md5sum of the protein sequence, a string of 32 hexadecimal digits that unequivocally identifies each protein sequence in the database.
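As an illustration of how a value like md5sum_pep can be derived, the following Python sketch computes the MD5 digest of a protein sequence with the standard `hashlib` module. The normalization step (removing whitespace and upper-casing) is an assumption made here so that formatting differences do not change the digest; the source does not state how sequences were prepared before hashing, and the example sequence is made up.

```python
import hashlib


def md5sum_pep(protein_sequence: str) -> str:
    """Return the MD5 digest of a protein sequence as 32 hexadecimal digits.

    Assumption: whitespace is stripped and letters are upper-cased before
    hashing, so that line wrapping does not alter the identifier.
    """
    normalized = "".join(protein_sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Made-up sequence, shown once on one line and once wrapped.
    print(md5sum_pep("MGRSPCCEKAHTNKGAWTKEED"))
    print(md5sum_pep("MGRSPCCEKAH\nTNKGAWTKEED"))  # same digest as above
```

Because identical sequences yield identical digests, such a field gives a compact key for recognizing the same protein sequence across tables or species-specific databases.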
The information provided in the species-specific web databases is linked through the gene identifiers or domain names to different external resources, when available and appropriate: TAIR, TIGR's rice genome annotation, JGI/DOE's poplar and C. reinhardtii genome annotations, the University of Ghent's O. tauri genome annotation, AthaMap, PlantGDB, Gramene, INPARANOID, SIMAP and PFAM. Additional external links to other databases and computational tools will continually be included.
To evaluate the confidence in our lists of putatively complete sets of transcription factors, we compared our predictions to published data sets from detailed phylogenetic single-family analyses in A. thaliana. The published analyses were thus taken as the gold standard. We measured the sensitivity and the positive predictive value (PPV) of our approach in a similar fashion to Iida et al. (the term 'specificity' used by Iida et al. is in fact the PPV; see [32, 33]).
The sensitivity is defined as:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
where TP is the number of true positives, i.e. the number of TFs listed in our database that are also found in the gold standard, and FN is the number of false negatives, so that TP + FN is equivalent to the total number of TFs in the gold standard.
The PPV is defined as:
$$\text{PPV} = \frac{TP}{TP + FP}$$
with the same notation as before, and FP being the number of false positives. Thus, TP + FP is equivalent to the total number of TFs listed in our database.
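The two measures can be computed directly from a predicted set and a gold-standard set of gene identifiers. The Python sketch below is illustrative only; the function name and the identifiers in the usage example are hypothetical and do not come from the database.

```python
def sensitivity_and_ppv(predicted, gold):
    """Return (sensitivity, PPV) for a predicted set against a gold standard.

    TP: identifiers in both sets; FN: only in the gold standard;
    FP: only in the prediction. Raises ZeroDivisionError if either
    set is empty, since the corresponding measure is then undefined.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    fn = len(gold - predicted)
    fp = len(predicted - gold)
    return tp / (tp + fn), tp / (tp + fp)


if __name__ == "__main__":
    gold = {"g1", "g2", "g3", "g4"}             # hypothetical gold standard
    predicted = {"g1", "g2", "g3", "g5", "g6"}  # hypothetical predictions
    sens, ppv = sensitivity_and_ppv(predicted, gold)
    print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

In this toy case three of the four gold-standard genes are recovered (sensitivity 3/4), and three of the five predictions are confirmed (PPV 3/5).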
                      146/146 = 1.00    146/147 = 0.99
                      21/22 = 0.95      21/23 = 0.91
                      28/28 = 1.00      28/29 = 0.97
                      122/132 = 0.92    122/154 = 0.80
                      68/70 = 0.97      68/74 = 0.92
                      35/36 = 0.97      35/36 = 0.97
                      29/29 = 1.00      29/29 = 1.00
                      32/33 = 0.97      32/33 = 0.97
                      99/104 = 0.95     99/108 = 0.92
MYB + MYB-related     184/209 = 0.88    184/198 = 0.93
                      100/101 = 0.99    100/100 = 1.00
                      71/72 = 0.99      71/72 = 0.99
The computational identification and classification of TFs is a very dynamic process that relies on the available computational models and tools, which in turn rely on the accumulated biological knowledge. This is reflected in the calculated sensitivity and PPV values. As more experimental data become available over time, further improvements in the HMMs are expected, which will help to narrow the remaining gaps between the gold standards and the data reported in the database.
Users can start their data mining either by browsing by species, i.e. selecting one species and looking at all TF families found in that genome, or by browsing by families, i.e. selecting one family and looking at the species in which this TF family is present. In either case the number of proteins found is shown (see Fig. 4A). When a TF family of interest is located (e.g. the Alfin-like family in rice), a click on the name of the family leads the user to the appropriate species-centered database showing detailed information for that family (see Fig. 4B), where detailed information for each of the protein members can be accessed (e.g. LOC_Os01g66420.1; Fig. 4C). From there the user can navigate to any of the other species for which orthologs have been found. Alternatively, the user can use a preferred protein sequence to search the whole set of TFs in PlnTFDB@Uni-Potsdam, or the species-centered databases, using BLAST.
The availability of all members of a family in several species will facilitate the study of their biological functions, phylogenetic relationships, and the evolution of the DNA-binding domains. For example, Yang et al. employed the sequences available in RiceTFDB, which is part of PlnTFDB@Uni-Potsdam, to perform an evolutionary study of DOF TFs from three different species, i.e. Arabidopsis, poplar and rice. Information extracted from our database is currently being used to establish an oligonucleotide-based microarray representing all predicted rice transcription factors (Christophe Perin, CIRAD, Montpellier, personal communication). In our own experiments we recently used the TF sequences listed in RiceTFDB to establish a large-scale quantitative real-time polymerase chain reaction (PCR) platform allowing us to test the expression of more than 2,500 rice TF genes in high throughput (manuscript in preparation). Using this platform we discovered rice TF genes responding to salt and/or drought stress, including, among others, the genes LOC_Os04g45810 (HB TF) and LOC_Os01g68370.3 (ABI3VP1 TF). Notably, the orthologous Arabidopsis genes, i.e. At2g46680.1 and At3g24650, respectively, are known to be affected by salt/drought stress [35, 36].
The number of sequenced and annotated plant genomes is rapidly increasing. The computational pipeline described in this article will be applied to new plant genomes as soon as they become available and the new information will be added to future releases of PlnTFDB@Uni-Potsdam. Upcoming versions of the database will also include additional structural data about the domains employed for the identification and classification of TFs, and detailed information about the hierarchical family classification of DNA-binding domains [4, 37, 38].
We are currently extending the TF discovery pipeline towards large EST collections. The next release of PlnTFDB@Uni-Potsdam will include such information and will classify TFs from plant species whose genomes have not yet been sequenced but for which large EST collections are available.
We constructed PlnTFDB@Uni-Potsdam, the first database of its kind that provides a centralized, putatively complete list of transcription factors and other transcriptional regulators from several plant species. Its daughter databases (OstreoTFDB, ChlamyTFDB, ArabTFDB, PoplarTFDB and RiceTFDB) provide detailed information for individual members of each TF family, including orthologs present in the other species. The latest version of PlnTFDB (v1.0) contains 7597 different protein sequences, grouped into a total of 58 different TF families and 10 additional transcriptional regulator families. The web interface provides access from different starting points: a gene ID, a protein sequence or a TF family.
All databases can be freely accessed through the WWW using any modern web browser.
BLAST: Basic Local Alignment Search Tool.
bp: Base pair.
JGI/DOE: Joint Genome Institute/Department of Energy.
NCBI: National Center for Biotechnology Information.
TAIR: The Arabidopsis Information Resource.
TIGR: The Institute for Genomic Research.
This work was financially supported by the Interdisciplinary Center 'Advanced Protein Technologies' of the University of Potsdam, coordinated by Dr. Babette Regierer, and the German Federal Ministry of Education and Research. The authors are grateful to Camila Caldana and Masood Soltaninajafabadi (Max-Planck Institute of Molecular Plant Physiology, Potsdam) for providing data about salt and drought stress regulated rice genes identified through quantitative RT-PCR, to Dr. Judith Lucia Gomez Porras and Luiz Gustavo Guedes Correa (University of Potsdam) for helpful comments on an outline version of this manuscript, to the student workers Cindy Ast and Zvonimir Marelja for their assistance during the set-up phase of this project, and to the anonymous reviewers for their valuable comments that helped to improve the article. Bernd Mueller-Roeber thanks the Fonds der Chemischen Industrie for funding (No. 0164389).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.