p53 tetramer
The tumor protein 53 is one of the experimentally and theoretically most extensively studied transcription factors, which makes it a very suitable system for validation of new methods for binding mode prediction. Here, the PWMs of its dimeric and tetrameric forms were calculated. The p53 tetramer response elements have two half-site palindromes with consensus sequence RRRCWWGYYY summing up to a 20 bp DNA sequence. Sequence logos of some of the p53 tetramer PWMs published in the last years together with the one reported here are shown in Figure 2. The third logo from the top was obtained using the DDNA2 server [16] that estimates the protein-DNA binding energy adopting a newly published version of the DFIRE potential [17]. In order to compare these five PWMs we estimated their relative entropies using the following formula:
where pi, jis the probability of observing element i at position j in the PWM. Not surprisingly, both experimental PWMs have very similar relative entropies (3.22 from Ref. [18] and 3.99 for the PWM from Ref. [19]). From the theoretically predicted ones, the PWM reported here has relative entropy of 8.69 that lies closer to the experiment than the one calculated with the modified DFIRE potential (10.87). The p53 TRANSFAC matrix has been compiled from 17 SELEX binding sites and has a corresponding lower entropy (1.72) than the other two experimental ones.
There are few experimental studies that focused extensively on measuring the p53-DNA binding affinity and from these we selected the two largest independent data sets, published in Ref. [18] and Ref. [20]. We calculated the TF-DNA binding energy scores of the altogether 51 oligonucleotide sequences published in Ref. [18] taking as a reference sequence number 600 in Table one from the corresponding work. Figure 3 shows the TF-DNA energy scores calculated by us plotted against the experimental lnK
d
shifts of the particular oligonucleotides with respect to the consensus sequence. There is a linear correlation between the experimental affinities and the calculated free binding energies (p value = 1.0E-9, R2 = 0.54), the small deviations from the regression line observed in the upper right part of the figure are most probably related to larger experimental errors observed at weaker TF-DNA binding. Another experimental work [20] investigated the binding of p53 to mouse DDB2 genes. Having taken the consensus sequences and corresponding dissociation constants provided in their paper we found a good agreement between experiment and theory, as suggested by the p value of the fit (2.0E-3) and the regression coefficient R2 equal to 0.65. The oligonucleotide sequences, experimental lnK
d
shifts and the calculated by us binding energy scores for these two experimental sets can be found in Additional File 2.
p53, p63 and p73 dimers
The p63 and p73 tumor proteins belong to the p53 transcription factor family, reaching 63% sequence identity to p53 in the DNA binding domain. Correspondingly, p53, p63 and p73 have overlapping and distinct functions - p53 regulates the stress response to suppress tumors; p63 is essential for ectoderm development; and p73 might regulate both stress response and development.
To the best of our knowledge, crystallographic structures of the DBDs of p63 and p73 complexed with DNA have not been resolved yet, however, solution structures of the C-terminal domains containing the corresponding DBDs are available (PDB entries 1RG6 and 1COK)). As these three proteins belong to the same family and show high DBD sequence similarity, we used homology modeling in order to reconstruct the complete p63 and p73 DBDs using the p53 DBD as a template. We compared the performance of the web servers ESyPred3D [21], Geno3D [22], 3D-JIGSAW [23], Phyre [24], and SwissModel [25] in reconstructing the p63 and p73 DBDs using the p53 DBD. Further, considering the high sequence similarity of p63 and p73 to p53 and possibly the same binding mode in most of the cases, we aligned the modeled structures of p63 and p73 to the crystallographic structure of the p53 DNA binding domain in its homodimeric form (PDB code 2GEQ). Both p63 and p73 DBDs were modeled using as a template the p53 structure with PDB accession number 1GZH whose chain A had 55% sequence similarity to both p63 and p73. Although chain D from a p53-DNA complex (PDB entry 2AC0) showed higher similarity to p73 (62%), it included a smaller number of residues, therefore the former template was preferred.
The p63 and p73 DBDs were iteratively aligned to the p53 domain using the Pymol software package [26]. We chose the p63 and p73 DBD structures modeled with the SwissModel [25] server as they showed the smallest RMSD from the p53 domain - the alignment of the common Cαatoms of p63 to those in the corresponding p53 domain showed an average RMSD of 0.45Å, for p73 it was about 0.44Å. The other web servers (ESyPred3D, Geno3D, 3D-JIGSAW, and Phyre) allowed also for modeling of chain flexibility in the protein structures using rotamer libraries, which led to very large RMSD values (up to 3Å) superposing the new DBDs to the p53 DBD. However, these servers provided structures that could be effectively used in protocols sensitive to chain flexibility, such as molecular dynamics studies and docking.
Finally, the aligned complexes were locally relaxed with AmberTools [27] in order to remove possible sterical clashes. The homodimeric p53, p63, and p73 PWMs were calculated using the scoring scheme presented above. The corresponding sequence logos of the three PWMs are shown in Figure 4 together with the logo obtained from the TRANSFAC matrix with entry V$P53_02. For comparison we calculated also the PWMs of the possible tetrameric complexes (not shown). In those calculations the p53 tetramer structure with PDB entry 2AC0 was used as a template, the results were well comparable to those obtained with the dimer complexes.
Transcriptional targets of the p53 family
In order to quantify how similar were the newly computed homodimeric p53, p63, and p73 PWMs we decided to evaluate their performance on upstream sequences of genes known to be regulated by these factors. In TRANSFAC there are only few annotated p53 response elements mapped to promoters, which was not enough for comprehensive statistics. Hence, we selected a set of 85 genes which are known to be upregulated by p53 and/or p73. All promoters of a given gene were included in the scan with Match (altogether 126 promoters). We used three promoter windows of different length in accordance to the genomic coordinates provided in TRANSPRO. The smallest promoter window spanned [-900, 100] bp from the transcription start site (··TSS··), the other two had coordinates of [-1900, 100] bp and [-4900, 100] bp relative to the TSS. There were no overlapping promoters in the promoter window [-900,100] bp, 2 partially overlapping promoters were found in the [-1900,100] bp window, and 22 in the largest [-4900, 100] bp promoter window. However, we do not how the potential binding sites are distributed in the promoters and how many they could be, therefore we assumed that each promoter contained at least one site and scanned the whole length of all promoters.
Matrix cutoffs were calibrated to give an empirical prediction rate of (less than or equal to) 1 site in 10 K residues (p value of 1.0E-4). In order to take properties of real promoter sequences, such as repeats, TFBSs, or CpG-rich regions, into account, we randomly sampled a set of 126 human promoter sequences and selected a score threshold that satisfied the desired site frequency based on Match predictions. This procedure was repeated several times which lead to insignificant changes in the matrix cutoffs. Then the promoter sequences were scanned with Match using the matrix cutoffs calibrated on the background sets. Figure 5 shows the fraction of promoters in which only potential p53-responsive elements have been predicted with Match, those with p53 and p63 hits, with p53 and p73 hits, and with hits of all three PWMs, respectively. In contrast, p63- or p73-only promoters seem to be relatively rare as Match returned only five promoters with p63 and p73 but no p53 hits in the promoter window [-1900, 100] bp (not shown in the Figure). In those five promoters (belonging to the DHRS3, FEN1, JAG2, SERPINA1, and TRIM32 genes) the two p63 and p73 binding sites reported in DHS3 overlapped, two out of four hits overlapped in FEN1, from the altogether five hits in JAG2 only two overlapped, and the two p63 and p73 hits in SERPINA did not overlap. TRIM32 was reported as a p63-only promoter (having two p63 hits) at matrix cutoff corresponding to p value of 1.0E-4.
Restricting the promoter window to [-900, 100] bp in respect to the gene TSS produced many p53-only hits (results not shown) but also many non p53-hits in particular promoter sequences. p63-only sites were reported in altogether eight promoters belonging to eight genes (CTNN31, GDF15, IL4R, KLHL21, MDM2, RAD17, SERPINA1, TRIM32). The p63 sites lay 715 bp, 10 bp, 474 bp, 725 bp, 415 bp, 289 bp, 42 bp, and 702 bp before the particular TSS, respectively. For these particular 8 promoters, it means that the p63 sites lie closer to the TSS (could be found in the [-900,100] bp window), but potential p53-binding sites were not found in this small promoter window. Here, from the 8 p63-only promoters found in the [-900, 100] bp window, only two (SERPINA1 and TRIM32) did not contain potential p53 hits when extending the promoter window to [-1900, 100] bp. At the small promoter window [-900, 100] bp Match returned p63 and p73 hits in altogether 7 promoter sequences belonging to the CDC42RP2, DHRS3, FEN1, GRK5, JAG2, p53AIP1, and TP53 genes. The reported p63 and p73 sites overlapped completely in the CDC42RP2, DHRS3, GRK5, and TP53 promoters and partially in the other three promoters. In one of the JAG2 promoters the p63 site lay 782 bp away from the p73 site, in FEN1 and p53AIP1 the separation amounted to 170 bp and 361 bp, respectively. On the other hand, in the third JAG2 promoter the suggested p63 and p73 sites lay next to each other. In some of the cases the reported p63 and p73 sites overlapped, however, several p63-only promoters were detected. None of the p53 hits returned by Match overlapped with p63 or p73 ones.
In summary, the results obtained with Match suggest that the three new structure-based matrices return unique hits. Particular p63 sites reported in promoters of the CAMLG, COL5A2, PPL, and SYNE2 genes lay about 50 bp or less from the proposed p53 binding sites. Similar to these results, the p73 sites reported in promoters of the GDF15, FXR1, and TSPAN1 genes lay between 17 and 42 bp away from p53 sites.
Concluding this section, the results obtained with the three newly computed PWMs suggest that the p53 transcriptional activity could be regulated by the other two family members. The promoter sequences used here are available in Additional File 3 together with the corresponding list of TSS genomic coordinates (NCBI 36/human genome build 18).
NF-κ B
The good correlation between experimental affinity data and predicted TF-DNA free binding energies shown in Figure 3 and Additional File 2 encouraged us to calculate PWMs of transcription factors from other structural classes. We further chose the NF-κ B family for which experimentally measured dissociation constants are available to which we compared the binding energy scores computed here. We calculated PWMs for three members of the NF-κ B family: the p50 homodimer, the p50RelB and the p50p65 heterodimers. Sequence logos of these three PWMs together with a general logo from the corresponding TRANSFAC entry V$NFKAPPAB_01 are shown in Figure 6. The relative entropies of these members of the NF-κ B family calculated with Eq.4 are as follows: 2.65 for p50p50, 2.38 for p50p65, 2.84 for p50RelB, and 2.29 for the corresponding TRANSFAC PWM. The three newly computed matrices presented here differ slightly from each other, but in general they are in a good qualitative and quantitative agreement with the general sequence logo from TRANSFAC. Comparing the matrices on the basis of relative entropy we would suggest that the p50p65 PWM computed from structure is the one closest to the experimental one. The left parts of the p50RelB and p50p65 PWMs are somehow similar to each other, as guanine is preferred at the fourth position in the corresponding sequence logos, which is not the case for the p50 homodimer and the TRANSFAC PWM. On the other hand, the first two logos show similar CCC probabilities in their right parts. We also recalculated the p50p65 PWMs using the DDNA2 server, its PWM (shown in Additional File 1) had higher relative entropy (3.28) comparing to the experimental PWM (2.29) and the other NF-κ B matrices we obtained.
We also addressed the quality of the structure-based matrices by scanning human promoters containing known, experimentally verified NF-κ B binding sites. We selected 69 human genes for which it is known that they are regulated by NF-κ B. The full sequences (each of length 11 kb) of the altogether 124 promoters belonging to these genes were extracted from TRANSPRO and are available in Additional File 4. Experimentally confirmed NF-κ B binding sites are found in only 31 out of the 124 promoter sequences belonging to 25 genes. All relevant experimentally verified sites (also from other TFs) in these 124 promoters are listed in Additional File 5, 58 of them are NF-κ B response elements. As TRANSFAC provided no matrix compiled particularly for the RelB protein we excluded the newly computed p50RelB PWM from the scan. Consequently, we compared the performance of the new p50 homodimer and the p50p65 heterodimer PWMs to two of the newest matrices of the TRANSFAC library, V$P50P50_Q3 and V$P50RELAP65_Q5_01, which were built from homodimeric and heterodimeric response elements, respectively. Score cutoffs were calibrated on randomly sampled background promoter sequences in the same way as for the p53 family members. We worked again with matrix precision corresponding to a p value of 1.0E-4. We assumed that each promoter contained at least one site and scanned the whole promoter length.
A matrix scan with Match was performed on 31 promoter sequences containing 58 experimentally confirmed binding sites. Here, the alternative promoters of four genes (CCL5, IFNG, IL2, and MMP9) partially overlapped. The newly computed p50p50 and p50p65 PWMs detected 30 and 25 sites, respectively. The TRANSFAC p50p50 PWM recovered 25 sites, whereas 26 response elements were reported by the TRANSFAC p50p65 PWM. Binding site locations of the p50p50 motifs differed in 8 sites, of which three sites were only recovered by the TRANSFAC p50p50 PWM. In the comparison of p50p65 motifs, the structure-based matrix reported 5 sites missed by the TRANSFAC PWM, however 6 matches of the TRANSFAC matrix were not detected by our PWM. Comparing the two groups, the newly computed p50p50 PWM discovered 7 sites that the p50p65 PWM did not find, but missed 2 which were covered by the p50p65 PWM. The p50p50 TRANSFAC PWM discovered 4 sites different from those reported by the TRANSFAC p50p65 PWM, but missed 5 sites detected by the latter PWM.
In summary, the PWMs computed from crystal structures performed better than the TRANSFAC ones recovering altogether 55 sites, while both TRANSFAC PWMs yielded 51 hits. However, there are two factors that influence these results - for many of the experimental binding sites it is not known with which NF-κ B-family member (p50, p52, p65, RelB) they interact. Here, we focused on two very specific cases, namely p50 homodimer and p50p65 heterodimer, and omitted p52 and RelB. Second, we did not compare the PWMs presented here to all 6 other NF-κ B-related matrices annotated in TRANSFAC that could probably cover all 58 response elements found in the 31 promoters. The comparison shown here aimed at evaluating the quality of the matrices presented in this work.
GABP and ERα
As representatives from other structural TF classes we chose the GABP heterodimer and the ERα nuclear hormone receptor whose family members are widely expressed in various eukaryotic genomes. The helix-turn-helix GA-binding protein (GABP) is unique among the ETS-family transcription factors as it functions as a heterodimer composed of an α and a β subunit. The α subunit, encoded by the GABPA gene, contains the ETS DNA-binding domain, while the β subunit, encoded by an unrelated gene, GABPB2, contains the transcriptional activation domain as well as four ankyrin repeats necessary for dimerization with the DNA-binding subunit. For the GABP matrix computation we used its heterodimeric complex, while for ERα its homodimer complexed with DNA was used. Figure 7 shows the newly obtained GABP matrix logo together with the corresponding logo from TRANSFAC (from entry V$GABP_B). The intensities of the first three nucleotides in the newly computed PWM compare well to those found in the corresponding TRANSFAC logo although the intensive guanine signal seen in the latter PWM vanishes in the atomistically modeled matrix. Nevertheless, the GG-repeat as well as adenine at the seventh position in the new matrix are also quantitatively well captured comparing to the corresponding PWM from TRANSFAC.
The ERα estrogen receptor belongs to the family of Cys4 zinc finger proteins. Figure 8 shows the sequence logo of the homodimer and a half-site (monomer) logo from TRANSFAC entry V$ER_Q6_02. The GG-signal seems to be well reproduced by the statistical potential used here, and the AGGTCA consensus suggested by TRANSFAC is better reproduced in the right part of the PWM computed in this work. Thymine seems to compete with guanine in the fifth position of the structure-based matrix, while the TRANSFAC logo exhibits thymine dominance at that position. The all-atom matrix computation used here allows for discrepancies between two homodimeric binding sites as shown in Figure 8. Experimentally, it is more difficult to capture such structural details as most of the experiments use half-sites in the binding affinity measurements [18] and the PWMs derived from such size also have to be symmetric. Probably the strongest advantage of the method presented here over the experimental ones is the opportunity to compute PWMs from complicated TF structures like homo- and heterodimers, or tetramers, as in the experiment this is often difficult to account for.