The p53HMM algorithm: using profile hidden markov models to detect p53-responsive genes
© Riley et al; licensee BioMed Central Ltd. 2009
Received: 25 January 2008
Accepted: 20 April 2009
Published: 20 April 2009
A computational method (called p53HMM) is presented that utilizes Profile Hidden Markov Models (PHMMs) to estimate the relative binding affinities of putative p53 response elements (REs), both p53 single-sites and cluster-sites. These models incorporate a novel "Corresponded Baum-Welch" training algorithm that provides increased predictive power by exploiting the redundancy of information found in the repeated, palindromic p53-binding motif. The predictive accuracy of these new models are compared against other predictive models, including position specific score matrices (PSSMs, or weight matrices). We also present a new dynamic acceptance threshold, dependent upon a putative binding site's distance from the Transcription Start Site (TSS) and its estimated binding affinity. This new criteria for classifying putative p53-binding sites increases predictive accuracy by reducing the false positive rate.
Training a Profile Hidden Markov Model with corresponding positions matching a combined-palindromic p53-binding motif creates the best p53-RE predictive model. The p53HMM algorithm is available on-line: http://tools.csb.ias.edu
Using Profile Hidden Markov Models with training methods that exploit the redundant information of the homotetramer p53 binding site provides better predictive models than weight matrices (PSSMs). These methods may also boost performance when applied to other transcription factor binding sites.
The degeneracy of the p53-RE
In the influential paper "Definition of a Consensus binding Site for p53", by El-Deiry et al., 7 of the 20 DNA target sites (35%) used to form the head-to-head (HH) p53 consensus sequence had at least one nucleotide insertion or deletion relative to the discovered 20 bp consensus after proper alignment (see Figure 1) . Alignments of the roughly 160 experimentally-validated p53 binding sites to date also show that approximately 30% of presently known sites have at least one nucleotide insertion or deletion relative to the consensus matrix . Discovery of p53 binding sites with such degeneracy cannot be reliably made with a PSSM approach, since prevalent insertions and deletions in the consensus sequence misalign the PSSM reading frame, and lead to improper scoring. Therefore, PSSM binding site discovery algorithms inherently mis-score at least 30% of the known p53 binding sites.
PHMMs can model nucleotide insertions and deletions
Using PHMMs to estimate binding affinities
The dynamic programming forward and backward algorithms are used to calculate the probabilities P hmm (x) and P hmm (j, b). These two probabilities are calculated by summing up the probability of observing the sequence x, and the base b at position j, for all the paths through the linear PHMM, respectively. The dynamic programming Viterbi algorithm is used to find the best alignment of the candidate site x to the binding-site motif modeled by the PHMM. The best (optimal) alignment of the sequence x is obtained by finding the path through the PHMM that gives the highest log-odds score for the sequence . In the case of transcription factor binding sites, the log-odds score of this optimal path (also called the Viterbi score) is commonly used to provide adequate approximations to the probabilities P hmm (x) and P hmm (j, b) [see Additional file 1 for details]. When using the Viterbi score for the probability P hmm (x) we are assuming that there is generally only one major set of binding interactions between specific nucleotides and amino acids for a given protein-DNA complex, and that all other possible binding locations in the response element can be ignored.
Training a PHMM with validated binding sites
Before a PHMM can be used to estimate the relative binding affinity for any putative binding site, the PHMM must be trained to properly model a functional binding site of interest. When training a PHMM for a particular motif, the goal is to choose the parameters of the model in order to maximize the likelihood of the sequences in the training set, without over-fitting. Again, under ideal conditions the log-odds score (log-likelihood ratio) G s (x) to be maximized for the collection of binding sites in the training set is proportional to the estimated binding free energy -ΔG(x) of these binding sites. When the state paths for the training sequences are not known, no known closed form solution exists for the parameter estimations . The Baum-Welch algorithm is the most commonly used iterative Expectation Maximization (EM) method to train the parameters of the model. The Baum-Welch algorithm always climbs the gradient (to increase the combined scores of the training set) and uses the optimized dynamic programming forward and backward algorithms .
Results and discussion
A novel training method that boosts predictive power
The Corresponded Baum-Welch algorithm
In order to include the prior knowledge of the structural motif (or in an attempt to discover it), a novel "Corresponded Baum-Welch" algorithm is proposed to enforce or learn the optimal correspondence between expectations of parameters for corresponding positions after each iteration of the Baum-Welch algorithm (see Methods). For example, assume that we have prior knowledge that a transcription factor protein binds to the DNA in homodimer form, where each monomer interacts with 5 DNA base pairs. Then a corresponding palindromic motif for the nucleotide positions would be: 1 2 3 4 5 5 4 3 2 1, while a reverse-complement palindromic motif would be: 1 2 3 4 5 (where ã has the complement nucleotide emission distribution of a). All the emission distributions for each of the five sets of synonymous positions would be made corresponding, as well as all the transition probabilities between synonymous positions. In this example, if all the parameters between synonymous positions were fully corresponding (tied), then the parameter search space would be roughly cut in half. The level of correspondence between the parameters for synonymous positions can be given a priori, or trained for if the training set is sufficiently large. One optimal level of correspondence, c, can be calculated for the whole motif (for all the corresponding positions), or a separate one can be found for each set of corresponding positions. (See Methods for details.)
Comparing the different p53 corresponding (structural) motifs
Since the 20 bp-tetrameric p53 binding site has a repeated and nested palindromic structure, different correspondence motifs can be constructed to train the PHMM models, and cross validation can be used to compare their predictive properties. The motifs that are compared are: the repeat or T-coupled motif (1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10), the (reverse-complement) palindromic or H-coupled motif (1 2 3 4 5 6 7 8 9 10 ), the independently (reverse-complement) palindromic or Q-coupled motif (1 2 3 4 5 6 7 8 9 10 , the repeated, fully-palindromic or combined-palindromic motif (1 2 3 4 5 1 2 3 4 5 ), and the completely un-tied motif with no correspondence between any positions (see Figure 3) . We perform 1000 iterations of ten-fold random-split cross validation on each model to gain statistics on their predictive accuracy. The positive set contains 160 experimentally validated p53 binding sites from , and the negative set contains 40 bp random samples from the mononucleotide content of the training set. Then we utilize Receiver Operating Characteristic (ROC) curves in order to compare the predictive power of the classifiers in an unbiased, threshold-independent (non-parametric) manner. This is achieved by calculating the true positive and false positive rates for all possible threshold values for each model. The summary statistic for comparing the ROC curves is the AUC (Area Under the Curve). AUC values lie somewhere between 1.0 and 0.5 (where an AUC of 1.0 would correspond to a perfect classifier, and an AUC of 0.5 would correspond to a classifier that is no better than random coin flipping.)
Training Insert-State Emissions
A major consideration when training Profile Hidden Markov Models (PHMMs) is which parameters to train for at each position, and which parameters to fix at each position to the over-all average. The more non-fixed parameters that must be trained for at each position in the motif, the more data that is needed to properly train the model. Ideally, a sufficiently large training set is available to be able to train for all the parameters in the PHMM at each position. Unfortunately, in the case of transcription factor binding sites, this is rarely the case. Typically, when using PHMMs to model DNA binding sites, both the insert probabilities and insert state nucleotide emissions probabilities are set to the binding site averages, since there are rarely enough examples of these rare occurrence events at a particular position to train those parameters for that position alone . By corresponding (fully or partially tying) positions and in effect increasing the training data for each position, it may be possible to train the insertion-state emissions distributions for these corresponding positions. This could possibly boost predictive power of the models, if the p53 protein is selective as to which nucleotides can be inserted into the motif at certain positions without compromising the binding affinity of the site. A common example of such selective sequence insertions can be found in functional protein families, whereby hydrophobic or hydrophilic amino acid insertions may be tolerated at certain positions, provided that the insertions are present either in the core or at the surface of the protein, respectively, after folding. Notice that fixing the insertion-state emission distributions at every position to the amino-acid average for the whole sequence would be very inappropriate in this example.
The final results
Validation of the p53HMM algorithm
The new p53HMM algorithm was used to screen for putative p53 binding sites in the endosomal compartment genes, which led to the discovery of a functional p53 site and a new p53-regulated gene, CHMP4C . The putative p53RE sequence AAACAAGCCC agtagcagcagctgctcc GAGCTTGCCC was predicted in the promoter region (-497 to -460 bp) of the CHMP4C gene. The data from the chromatin immunoprecipitation and the luciferase reporter assays showed that p53 protein can bind to this sequence and induce CHMP4C gene expression. Additionally, analysis by p53HMM found an alternative putative p53 binding site in the LIF gene that corresponds to a 6 bp upstream shift of the downstream half-site relative to the recently published putative site in intron 1 . The p53HMM algorithm predicted the site GGACATGTCGGGACA-GCTC, which matches the consensus RRRCWWGYYYRRRCWWGYYY perfectly except for the low-conserved position 10 and the gap ("-", deletion) at position 16. A PSSM approach predicted the shifted site GGACATGTCGggacagCTCCCAGCTC, which is the best "gap-less" p53 site in the region conferring p53 regulation, but it still matches the consensus very poorly with five mismatches (the putative spacer sequence is in lowercase) . A few genes in the dataset of 160 functional p53 binding sites have a deletion relative to the consensus exactly between the well-conserved C and G as seen above, including the genes: EGFR, TYRP1, EEF1A1, HSP90AB1, and BAI1. This discovery of an alternative p53 binding site that better matches known functional sites, by modeling for observed insertions and deletions, highlights some of the advantages of the new p53HMM algorithm.
Special considerations for the p53HMM algorithm
Although the spacer within a p53 RE has been shown to greatly affect the binding affinity for p53 protein, the ability to properly quantify this effect for all possible spacers of lengths 0–21 base pairs has been elusive. Therefore like previous algorithms, we have chosen to initially ignore the spacers of the training set and putative REs . We are able to ignore arbitrary-length spacers by inserting a no-cost Free Insertion Module (FIM) between the two half-sites of the single-site PHMM [16, 17]. Similarly, we can ignore spacers with lengths between 1 and N base pairs by inserting a no-cost Finite Emission Module (FEM-N) between the two half-sites (see Figure 2). A prior p53 RE search algorithm (p53MH) was based upon a PSSM approach and a novel filtering matrix . Unfortunately, the tables were not symmetric and the filtering table over-fit the available data at the time. The combined result was that the p53MH method completely rejects 58 of the 160 experimentally validated sites to date (receiving a score of 0 out of 100, where 100 represented the maximum relative binding affinity). Additionally, some sites received very high scores approaching 100, while the reverse-complement received a score of 0, and vice-versa. Due to these observations, we have purposely designed the p53HMM algorithm to be symmetric, so as to give identical scores for putative sites and their reverse complements. Secondly, we chose to abandon the filtering matrix to avoid over-fitting the available data. A feature that we preserved from p53MH is the normalizing of scores by the highest possible affinity for the motif (×100), so that the highest possible normalized score is 100.
Modeling dependencies between positions
PSSMs assume that all nucleotide positions within the motif contribute independently to the binding affinity of the binding site, which has been shown experimentally to not always be the case . Recent research has focused on modeling dependencies between positions in protein-DNA binding sites [18, 19]. Typically Tree Bayesian Networks and Mixtures of trees have been used to attempt to model these dependencies between positions, which have been shown through cross validation to increase the predictive power of these models . Our PHMM models do not attempt to model dependencies between the positions, however they can be extended to do so by using higher-order Profile Hidden Markov Models. Unfortunately, the ability to train for positional dependencies, and boost predictive power, is dependent upon the sampling size of the training set and requires larger training sets to train the extra parameters.
A novel p53 cluster-site algorithm
Normalized Experimental Affinity of Cluster-sites
Number of Half-sites
Relative Binding Affinity
Theoretical Affinity Approximations
# of Full-sites with spacers ≤ 14 bp
# of Full-sites with spacers ≤ 24 bp
# of Full-sites with any size spacer
These p53 cluster-site scores are attained through a two step process. The first step uses the cluster-site model which contains a generalized p53 half-site PHMM and a back-transition that limits any spacer between two half-sites to no more than 14 bp (see part e of Figure 2). The dynamic programming Viterbi algorithm is used to find the highest scoring p53 half-sites in the sequence (that are separated by no more then 14 bp). The second step then parses the state-path generated from step 1 and generates viable p53 full-sites with any spacers removed, while conserving the property that the half-sites in the cluster-site were not separated by more than 14 bp. Now we use the more flexible p53 single-site model to score these viable full-sites using the Viterbi algorithm (see part d of Figure 2). We maintain a running sum of the log-odds scores of the candidate full-sites that are above a certain threshold. The log-odds score threshold and spacer-length limit (14 bp) are chosen so as to best fit the experimental data (see Figure 7).
Additionally, this p53 cluster-site model follows statistical mechanics, in that the overall binding affinity for the complete RE is proportional to the probability of any p53 protein binding to any of the allowed motifs found in the cluster-site. (See Methods for more details.)
Dynamic acceptance thresholds as a function of the distance from the TSS
Profile Hidden Markov Models (PHMMs) can boost predictive power over weight matrices (PSSMs) when the binding motif is highly degenerative and tolerates insertions and/or deletions at various positions. The increase in predictive power for the p53-binding motif can be seen in Figures 4 and 5. When the RE has a known repeated and/or palindromic motif, this prior knowledge can be used to correspond parameters in the model to exploit the redundancy in the information in the motif. We propose a novel "Corresponded Baum-Welch" training algorithm that significantly boosts the predictive power of the p53-RE model, as seen in Figures 4 and 5. When the motif is not known, all possible motifs for the given size can be sampled and cross-validation techniques leveraged to infer the correct motif that maximizes predictive power. For example, Figure 5 reveals that the maximally predictive p53-binding motif corresponds the four quarter-sites in a combined-palindromic structure.
Our algorithms demonstrate the best predictive capability to date in classifying putative p53 binding sites. One algorithm uses a novel "Corresponded Baum-Welch" training method that exploits the repeated, palindromic structure of the p53 motif to train for allowed insertions and deletions relative to the consensus. The second algorithm properly models the relative increase in binding affinity for p53 cluster-sites (REs with ≥ 3 adjacent half-sites) by using a two step process that scores all viable full-sites in the cluster-site while restricting the spacer-length to 14 bp. This new cluster-site algorithm best matches the experimental data (see Figure 7).
Functional low-affinity p53-sites only exist near the TSS. Therefore the binding affinity threshold for accepting a putative site should be dependent on the putative site's distance from the TSS. By this method, putative sites with relatively low calculated binding affinities that are near the TSS may be accepted, while those sites with equal scores but more distant from the TSS will be rejected. A dynamic threshold, as a function of the distance from the TSS, can greatly reduce the false positive rate when searching for putative p53-sites in genes.
The Corresponded Baum-Welch algorithm
The Corresponded Baum-Welch algorithm will converge at (local) optimum emission and transition probabilities and correspondence factors that maximize the likelihood of observing the training set with possible pseudo-counts. Please see the Additional file 1 for further details.
The p53 cluster-site algorithm
The spacer-length upper bound and the affinity-score lower bound were fit to best match the experimental results. In the case for p53-binding sites, the best fit is a spacer-length of no more than 14 bp and a log-odds score of at least 27.5 (see Figure 7).
The p53HMM implementation
We thank Michael Krasnitz, Amar Drawid, Anirvan Sengupta, Sean Eddy, Jiri Vanicek, and Raúl Rabadán for helpful discussions.
- Levine AJ: p53, the cellular gatekeeper for growth and division. Cell 1997, 88(3):323–331.View ArticlePubMedGoogle Scholar
- Funk WD, Pak DT, Karas RH, Wright WE, Shay JW: A transcriptionally active DNA-binding site for human p53 protein complexes. Mol Cell Biol 1992, 12(6):2866–2871.PubMed CentralView ArticlePubMedGoogle Scholar
- el Deiry WS, Kern SE, Pietenpol JA, Kinzler KW, Vogelstein B: Definition of a consensus binding site for p53. Nat Genet 1992, 1: 45–49.View ArticlePubMedGoogle Scholar
- Riley T, Sontag E, Chen P, Levine A: Transcriptional control of human p53-regulated genes. Nat Rev Mol Cell Biol 2008, 9(5):402–412.View ArticlePubMedGoogle Scholar
- Krogh A, Brown M, Mian IS, Sjölander K, Haussler D: Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol 1994, 235(5):1501–1531.View ArticlePubMedGoogle Scholar
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–763.View ArticlePubMedGoogle Scholar
- Stormo G, Fields D: Specificity, free energy and information content in protein-DNA interactions. Trends in Biochemical Sciences 1998, 23(5):109–113.View ArticlePubMedGoogle Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. 1st edition. Cambridge University Press; 1998.View ArticleGoogle Scholar
- Djordjevic M, Sengupta AM, Shraiman BI: A Biophysical Approach to Transcription Factor Binding Site Discovery. Genome Res 2003, 13(11):2381–2390.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee J, Kim J, Kim J: Data-driven design of hmm topology for on-line handwriting recognition.2000. [http://citeseer.ist.psu.edu/lee00datadriven.html]Google Scholar
- Ma B, Pan Y, Zheng J, Levine AJ, Nussinov R: Sequence analysis of p53 response-elements suggests multiple binding modes of the p53 tetramer to DNA targets. Nucleic Acids Res 2007, 35(9):2986–3001.PubMed CentralView ArticlePubMedGoogle Scholar
- Marinescu VD, Kohane IS, Riva A: MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes. BMC Bioinformatics 2005, 6: 79.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu X, Riley T, Levine AJ: The regulation of the endosomal compartment by p53 the tumor suppressor gene. FEBS Journal 2009, 276(8):2201–2212.View ArticlePubMedGoogle Scholar
- Hu W, Feng Z, Teresky AK, Levine AJ: p53 regulates maternal reproduction through LIF. Nature 2007, 450(7170):721–724.View ArticlePubMedGoogle Scholar
- Hoh J, Jin S, Parrado T, Edington J, Levine AJ, Ott J: The p53MH algorithm and its application in detecting p53-responsive genes. Proc Natl Acad Sci USA 2002, 99(13):8467–8472.PubMed CentralView ArticlePubMedGoogle Scholar
- Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 1996, 12(2):95–107.PubMedGoogle Scholar
- Barrett C, Hughey R, Karplus K: Scoring hidden Markov models. Comput Appl Biosci 1997, 13(2):191–199.PubMedGoogle Scholar
- Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in protein-DNA binding sites. In RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology. New York, NY, USA: ACM Press; 2003:28–37.View ArticleGoogle Scholar
- Zhou Q, Liu JS: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 2004, 20(6):909–916.View ArticlePubMedGoogle Scholar
- Tan T, Chu G: p53 Binds and activates the xeroderma pigmentosum DDB2 gene in humans but not mice. Mol Cell Biol 2002, 22(10):3247–3254.PubMed CentralView ArticlePubMedGoogle Scholar
- Contente A, Dittmer A, Koch MC, Roth J, Dobbelstein M: A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat Genet 2002, 30(3):315–320.View ArticlePubMedGoogle Scholar
- Bourdon JC, Deguin-Chambon V, Lelong JC, Dessen P, May P, Debuire B, May E: Further characterisation of the p53 responsive element-identification of new candidate genes for trans-activation by p53. Oncogene 1997, 14: 85–94.View ArticlePubMedGoogle Scholar
- Kern SE, Pietenpol JA, Thiagalingam S, Seymour A, Kinzler KW, Vogelstein B: Oncogenic forms of p53 inhibit p53-regulated gene expression. Science 1992, 256(5058):827–830.View ArticlePubMedGoogle Scholar
- Holland RCG, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M, Schreiber MJ: BioJava: an open-source framework for bioinformatics. Bioinformatics 2008, 24(18):2096–2097.PubMed CentralView ArticlePubMedGoogle Scholar
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18(20):6097–6100.PubMed CentralView ArticlePubMedGoogle Scholar
- Schuster-Böckler B, Schultz J, Rahmann S: HMM Logos for visualization of protein families. BMC Bioinformatics 2004, 5: 7.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.