The recent explosion of "knowledge production" due to the completion of the human genome sequence and to the availability of high-throughput technologies (such as microarray, ChIP-PET and ChIP-on-chip) has generated a critical need for bioinformatic instruments able to manage the huge amount of biological data produced and to facilitate their retrieval and analysis. Moreover, the integration of data from different resources, such as in silico analyses, experimental data and biological databases, is crucial to exploit the large amount of information available and to focus on the discovery of the functional role of genes and the regulation of their expression.
The p53FamTaG database is a comprehensive and unique resource for the genome-wide search of human p53, p63 and p73 direct target genes. It combines the in silico prediction of their p53 REs with the transcriptional profiles of these target genes in isogenic cell lines over-expressing different members of the p53 family. The dissection of the transcriptional targets of the p53 gene family members, which recognise the same RE, is a challenge in cancer research. p53 is mutated in over 50% of all cancers, and in the remaining cases its pathway may also be affected. The involvement of p73 and p63 in tumour development is much less well established. Multiple isoforms of p63 and p73 have been characterised, and emerging evidence suggests that some of the roles played by the TAp63 and TAp73 isoforms overlap those of p53, whereas their ΔN variants have an opposite effect or even an oncogenic role in cancer progression.
To identify putative direct target genes of the p53 gene family, we performed an in silico genome-wide search for p53 REs in specific regions of human genes (promoter region, introns, 5'UTRs), using criteria defined on the structure of 109 REs from 83 experimentally demonstrated human target genes. Through this in silico search we selected 18110 human genes containing REs as potential direct target genes of the p53 family.
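As an illustration of what such a pattern-based search involves, the following is a minimal sketch only. It does not reproduce the search criteria used for p53FamTaG, which were derived from the 109 known REs. It assumes the textbook p53 consensus instead: two RRRCWWGYYY decamers separated by a spacer of 0-13 bp. The function name `find_candidate_res`, the dictionary keys and the toy sequence are hypothetical.

```python
import re

# Assumption: textbook p53 consensus half-site RRRCWWGYYY
# (R = A/G, W = A/T, Y = C/T), not the criteria used in this work.
IUPAC = {"R": "[AG]", "W": "[AT]", "Y": "[CT]", "C": "C", "G": "G"}
HALF_SITE = "".join(IUPAC[base] for base in "RRRCWWGYYY")
MAX_SPACER = 13  # assumption: 0-13 bp between the two decamers

# The lookahead lets overlapping candidates be reported; the lazy spacer means
# each start position yields at most one hit, the one with the shortest spacer.
RE_PATTERN = re.compile(
    rf"(?=({HALF_SITE})([ACGT]{{0,{MAX_SPACER}}}?)({HALF_SITE}))"
)


def find_candidate_res(sequence):
    """Return candidate REs as dicts with 0-based start, decamers and spacer."""
    sequence = sequence.upper()
    hits = []
    for match in RE_PATTERN.finditer(sequence):
        left, spacer, right = match.groups()
        hits.append({
            "start": match.start(),
            "left_decamer": left,
            "spacer": spacer,
            "right_decamer": right,
            "length": len(left) + len(spacer) + len(right),
        })
    return hits


if __name__ == "__main__":
    toy_sequence = "TTTAGACATGCCTAGACATGCCTGGG"  # hypothetical input
    for hit in find_candidate_res(toy_sequence):
        print(hit)
```

A real genome-wide search would also scan the reverse strand, tolerate a limited number of mismatches, and restrict the scan to the annotated promoter, intron and 5'UTR coordinates of each gene. None of that is attempted here.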
We complemented and validated the in silico results by studying the expression profile of these potential direct target genes using the microarray approach. For this purpose we generated stably transfected cell lines in which the expression constructs of p53, TAp63α, ΔNp63α, TAp73α, TAp73β and the mutant p53R175H are integrated in the same genomic locus. This allowed us to examine, in the same cellular context, the effects of the overexpression of p53 family members on the expression profile of the direct target genes detected in silico.
In recent years, distinct studies have aimed to identify p53-, p63- and p73-regulated genes using DNA microarray approaches. However, the three proteins have not so far been analysed in the same study. Considering the heterogeneity of the experimental and genetic backgrounds used (cellular stressors, cell lines, etc.), it is very difficult, if not impossible, to compare the expression profiles of those independent studies in order to identify common and non-common target genes and to determine whether those genes are direct or indirect targets [15–18]. A further drawback is that these results exist only as simple, poorly annotated flat files and are not collected in relational databases publicly available through the web.
The p53FamTaG database was designed to store the information from the in silico and microarray approaches, with links between the two data sets and to the most widely used databases world-wide. Through a user-friendly graphical interface, this complex information can be queried in a few seconds. For each gene containing an RE, the database provides the gene name (HUGO), the alias name, the ENSEMBL stable gene ID and RefSeq ID, the chromosome, the RE structure (decamers, spacers, length, sequence), the RE chromosomal position and gene region localization (promoter, 5'UTR, intron) and the microarray results. Moreover, the database provides a hyperlink to PubMed for the experimentally demonstrated target genes.
One particularly noteworthy feature of the database is the possibility of exporting the sequences of the REs, together with their full annotation, in FASTA format, which is not possible from any other public resource. The availability of the RE sequences of potential target genes that appear to be up- or down-regulated in our microarray experiments can guide experimental approaches (such as PCR amplification of REs and cloning for luciferase, EMSA and chromatin immunoprecipitation assays) aimed at demonstrating the binding of a p53 family member to the RE (manuscript in preparation). Furthermore, these results may lead to refining the specific RE for each p53 family member and, finally, to identifying common transcription factor binding site frameworks by applying algorithms.
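To make the FASTA export concrete, here is a minimal sketch of how one RE record might be serialised. It is not the database's actual export code: the record keys, the header layout and the placeholder identifiers are assumptions chosen for illustration, loosely following the fields the database provides for each RE.

```python
def re_record_to_fasta(record, width=60):
    """Format one RE record as a FASTA entry (record keys are hypothetical)."""
    header = ">{gene}|{ensembl_id}|chr{chrom}:{start}-{end}|{region}".format(**record)
    sequence = record["sequence"].upper()
    # Wrap the sequence into fixed-width lines, as is conventional for FASTA.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def export_fasta(records, width=60):
    """Concatenate the FASTA entries of several RE records."""
    return "".join(re_record_to_fasta(record, width) for record in records)


# Hypothetical record: all values are placeholders, not database content.
example_record = {
    "gene": "GENE_A",
    "ensembl_id": "ENSG_PLACEHOLDER",
    "chrom": "1",
    "start": 1000,
    "end": 1019,
    "region": "promoter",
    "sequence": "AGACATGCCTAGACATGCCT",
}

if __name__ == "__main__":
    print(export_fasta([example_record]), end="")
```

Keeping the gene identifier, chromosomal position and gene region in the header line means that a sequence exported this way can still be traced back to its annotation when it is used, for instance, to design primers for RE amplification.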
An additional significant feature of the p53FamTaG database is the annotation of 83 experimentally demonstrated direct target genes, integrated with the microarray data produced in our laboratory. These target genes have often been validated for only one member or isoform of the p53 family. Our data set now allows the user to observe such a target gene under the overexpression of the three p53 family members in identical experimental conditions and to understand the involvement of each member in the modulation of this target gene.
A map of 542 human p53 high-confidence binding loci, obtained by the ChIP (chromatin immunoprecipitation)-PET (paired-end ditag) approach, has recently been published. The PET sequences are derived from about 66,000 individual p53 ChIP fragment sequences obtained from human HCT116 colorectal cancer cells treated with 5-fluorouracil for 6 h, conditions known to activate p53 expression. The gene name, the sequence and the chromosomal localization of the 542 binding-loci PET clusters are available in the UCSC database. However, the REs within these loci are not indicated, so the data are not directly suitable for further studies. Out of the 542 loci, only 381 corresponded to known genes, with the others referring to sequences lacking gene names, to cDNA clones or to hypothetical proteins. We queried our p53FamTaG database using the list of these 381 gene names and found that 341 genes are present in our database, showing that our purely in silico approach found most of these experimentally selected REs. Moreover, for 205 of these target genes our database reports the microarray results, so that for these genes, so far studied only for p53, the transcriptional effect of p53R175H, TAp63α, ΔNp63α, TAp73α and TAp73β under our experimental conditions is also available.
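The comparison described above amounts to intersecting two lists of gene names and reporting the size of the overlap. The sketch below shows that step under simplifying assumptions: the gene symbols are toy placeholders, matching is by case-insensitive symbol only (aliases are ignored), and the function names are hypothetical. Only the counts in `REPORTED` are taken from the text.

```python
def overlap(query_genes, database_genes):
    """Case-insensitive intersection of two collections of gene symbols."""
    database = {gene.strip().upper() for gene in database_genes}
    query = {gene.strip().upper() for gene in query_genes}
    return sorted(query & database)


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Toy placeholder symbols, not real database content.
chip_pet_genes = ["GeneA", "GeneB", "GeneC"]
database_genes = ["GENEA", "GENEC", "GENED"]
matched = overlap(chip_pet_genes, database_genes)

# Counts as reported in the text.
REPORTED = {"loci": 542, "named_genes": 381, "in_database": 341, "with_microarray": 205}

if __name__ == "__main__":
    print("toy overlap:", matched)
    print("named genes found in database (%):",
          percent(REPORTED["in_database"], REPORTED["named_genes"]))
    print("found genes with microarray results (%):",
          percent(REPORTED["with_microarray"], REPORTED["in_database"]))
```

With the reported counts, 341 of 381 named genes corresponds to about 89.5%, and 205 of those 341 genes (about 60.1%) also have microarray results.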
Because p53FamTaG is able to integrate, retrieve and display this valuable information, we also included the ChIP-PET data through a link to the UCSC database.
It should be considered that high-throughput technology (microarray, ChIP-PET, ChIP-on-chip) has the advantage of producing and analysing large quantities of biological data, and that it relies on robust experimental methodology (which also has a substantial cost) and on statistical analysis. However, each experiment remains tied to specific conditions (type of inducers, cellular stressors, cell line) and represents an initial framework for further experimental validation focused on particular cellular conditions. In particular, in gene expression studies, only some of the regulatory elements spread throughout the genome (the REs of the p53 family members in our case) are involved under any specific condition, depending on binding dynamics or on the tissue specificity of the transcriptional network. For example, out of the 83 direct target genes that have been experimentally demonstrated and are annotated in our database, only 15 were identified in the global map study that reported 542 p53 binding sites using the ChIP-PET approach.
With the global in silico search, the p53FamTaG database is able to collect all this information and put it into a holistic picture, creating new knowledge that could not be obtained from a single data set. Therefore, p53FamTaG represents a powerful resource and may be considered a basic repository collating existing information for researchers in this field. The database content is reported in Table 1. The validity of the in silico data we obtained by applying DNAfan, and in particular the consistency of the criteria we established for the genome-wide RE search (stringency and complexity of the search pattern), were supported by the positive selection of 83 experimentally demonstrated p53 target genes and of 341 ChIP-PET high-confidence binding loci reported in the literature. Moreover, the majority of the published target genes present in p53FamTaG show a statistically significant change of expression in at least one sample under our experimental conditions. These results strongly validate the complementarity of the in silico search and the microarray experiments.
The important features of the structure of the database are: 1) its modular design, so that, with a view to data integration, it can easily be extended to host additional data, such as new experimental data (results from microarray or real-time PCR approaches) and data from the literature, making the new data available in the context of the target sequence analysis; 2) a design that can be mirrored, for example, for the identification and collection of REs in other organisms or for the in silico identification of other transcription factor binding sites.