Computing DNA duplex instability profiles efficiently with a two-state model: trends of promoters and binding sites
© Kantorovitz et al. 2010
Received: 11 May 2010
Accepted: 21 December 2010
Published: 21 December 2010
Skip to main content
© Kantorovitz et al. 2010
Received: 11 May 2010
Accepted: 21 December 2010
Published: 21 December 2010
DNA instability profiles have been used recently for predicting the transcriptional start site and the location of core promoters, and to gain insight into promoter action. It was also shown that the use of these profiles can significantly improve the performance of motif finding programs.
In this work we introduce a new method for computing DNA instability profiles. The model that we use is a modified Ising-type model and it is implemented via statistical mechanics. Our linear time algorithm computes the profile of a 10,000 base-pair long sequence in less than one second. The method we use also allows the computation of the probability that several consecutive bases are unpaired simultaneously. This is a feature that is not available in other linear-time algorithms. We use the model to compare the thermodynamic trends of promoter sequences of several genomes. In addition, we report results that associate the location of local extrema in the instability profiles with the presence of core promoter elements at these locations and with the location of the transcription start sites (TSS). We also analyzed the instability scores of binding sites of several human core promoter elements. We show that the instability scores of functional binding sites of a given core promoter element are significantly different than the scores of sites with the same motif occurring outside the functional range (relative to the TSS).
The time efficiency of the algorithm and its genome-wide applications makes this work of broad interest to scientists interested in transcriptional regulation, motif discovery, and comparative genomics.
DNA duplex instability is manifested as the ease at denaturating the DNA double strand, i.e., as the partial melting and unfolding of double stranded DNA. The study of DNA duplex instability has been a fascinating subject for many reasons: its importance for techniques such as PCR, sequencing by hybridization, antigene targeting, and for understanding replication, mutation, repair, and transcription, see  and references therein.
With respect to understanding transcription initiation, in the very recent past, there has been increased evidence that duplex instability, as well as other physiochemical properties, reveal specific signatures of TSS and core promoter elements. In this context, there have been several comprehensive analyses of genomes such as that of the Plasmodium falciparum , yeast , human, and other animals [4–7]. To a smaller scale, structural properties of DNA have also been used to predict DNA function in viral sequences [8, 9].
It has also been shown that the DNA duplex instability profiles can be used to aid motif discovery in yeast . The instability profiles, computed with the on-line tool WebSIDD , were used to derive informative positional priors and incorporated into a motif finding algorithm. As a result, the performance of the motif finding program improved significantly.
The need for an efficient method to compute the profiles was stressed in , since the on-line tool WebSIDD could not be used to efficiently compute profiles of sequences that were several thousands base pairs long. The algorithms used for computing DNA instability profiles for the above applications [2–9] either have non-linear time complexity (such as the algorithms based on the Peyrard-Bishop-Dauxois (PBD) model [11, 12] and WebSIDD, based on the Benham model [13, 14]) or are linear time approximations to a non-linear-time model (e.g., ).
Some progress in this direction has been made. Recently, in  the Zimm-Brag model was used for a genome wide comparison between coding domains and thermodynamically stable regions. In some organisms the corellation between coding domains and thermodynamic stability allowed the identification of putative exons or genes. The authors state that the algorithm is linear in the length of the sequence. Also, using the Poland-Scheraga model in  another algorithm for DNA melting calculations was reported with time complexity less than quadratic.
In this work we modified an Ising-type model [17–19] that identifies as major contributions to DNA stability the hydrogen bonds between the complementary bases and the nearest-neighbor stacking interaction. One of the advantages of this model is that it can be implemented efficiently, since the time complexity of the algorithm is linear in the length of the sequence.
Another feature of the method we use is that it directly computes the probability of bubble formation of any size k. Our operational definition of a bubble is that of a strand separation, or DNA 'opening' spanning several base pairs. Here, a bubble of size k means that at least k base pairs are open.
Studies have suggested that the ability of the DNA to form a 'transcriptional bubble' at the transcriptional start site is essential to initiate transcription. Using the PBD model, in , it was argued that thermodynamic instability profiles are able to identify the location of TSS. In  it was demonstrated that bubble size is important, in the sense that when the simulated bubble size equals the transcriptional bubble size, the highest peak in the instability profile appears at the TSS. Most previous algorithms (in particular, [3, 4, 6]), only compute the opening probabilities of one base-pair at a time and use averaging techniques to measure the propensity of the DNA to form a bubble of size k >1 at a specific location. This averaging process is not equivalent to computing the opening probability of the whole window of size k.
Using the Ising-type model, we computed the DNA instability profiles for the human promoter regions in Database of Transcriptional Start Sites (DBTSS). We show that these profiles provide an insight into core promoter elements such as the downstream promoter element (DPE), transcription factor II recognition element (BRE), initiator (Inr) and GC box. We show that there is an association between the location of local extrema in the instability profiles and the presence of core promoter elements at these locations. We present evidence that BRE and DPE prefer stability, whereas the TATA box and the Inr prefer instability.
Finally, we examine the applications of the DNA duplex instability profiles to motif discovery. Our findings raise a concern that the "one size fits all" approach to transcription factors used in , may not be appropriate.
Most of the approaches for the computation of DNA instability profiles use models that are coarse-grained, in the sense that they take into account only the major contributions to DNA stability.
The Peyrard-Bishop-Dauxois model [11, 12] assumes that the hydrogen bonds and stacking interaction are the main contributions to DNA stability. Like the modified Ising-type model, it does not take into account explicitly the three-dimensional structure of the double helix and neglects torsional effects. The main difference with the approach we use in this work is that the variable that describes the stretching of the bonds is continuous rather than discrete. The computational complexity of this model is non-linear in the length of the sequence. By direct integration an algorithm was devised in  that reduces the complexity of function evaluations to being linear in the length of the sequence and quadratic in the number of grid points used in the integration.
The Benham model [10, 13, 14] uses the free energy needed to separate the two strands and destroy the helical structure as a measure of instability. This model predicts the location and extend of destabilization given the DNA sequence and imposed super-helical stress and is discrete: the base pair is assumed to be either separated or not. The Benham model has non-linear time complexity.
In  the human genomic melting map was obtained based on the Poland model , which uses recurrence relations to calculate the probabilities of transition of the double helix from the helical to the coil state, rather than considering the state of the hydrogen bonds and stacking interactions. The approach we use here is more general since it allows for the study of localized openings that are precursors to melting, instead of considering only the complete melting of a DNA sequence. The algorithm used in  is a linear time approximation to the non-linear-time Poland model.
In  an approach to predict promoters in whole-genome sequences was used, with the aid of large-scale structural properties of DNA, such as GC content, stabilizing energy of Z-DNA, DNA denaturation values, protein induced deformability, and duplex free energy. First, structural profiles are calculated by converting the nucleotide sequence into a numerical profile, by replacing each di-or trinucleotide with its corresponding structural value. Next, the values are averaged over a window of size 400. The approach we use is different in the sense that no averaging is taking place, rather we calculate conditional probabilities of having k base pairs in the open state and loss of stacking interactions. The cooperative and long range effects in the Ising type model used here, are due to that fact that in the calculation of the probabilities, the entire sequence is taken into account in the evaluation of the partition function (normalizing constant so the probabilities at a given base pair add up to one). In this sense, our approach is an effective smoothing and no averaging within the same sequence takes place.
Other approaches to the study of DNA denaturation include the examination of the breathing dynamics from a probabilistic point of view  and . In  the authors develop a master equation, which together with a Gillepsie algorithm, generates sequence-specific stochastic time series of partially melted regions in DNA. In  the dynamics and thermodynamics of twist-induced denaturation was studied in a long, random sequence of DNA, using large deviation theory, scaling arguments, and Monte Carlo simulations.
Our results are based on the calculation of the thermal equilibrium statistical properties of dsDNA using a modified version of the model introduced in . The model was proposed as a tool to study the thermal fluctuations that lead to the infrequent events of the Watson-Crick base-pair opening, also referred to as DNA breathing. This fluctuational base-pair opening implies the disruption of hydrogen bonds between the complementary bases and the loss of stacking interactions between adjacent base-pairs by the flipping of the base pair out of the helical stack.
Like other models that are designed to predict the propensity of DNA to breathe (such as ), this model takes into account two major contributions to DNA stability: the lateral pairing between the complementary bases and the stacking interactions of the pairs with both immediate neighbors along the helical axis. The model in  also introduced a novel term accounting for the unfavorable positioning of the exposed base, which proceeds through the formation of a highly constrained small loop, and was described as the ring factor. In this work, we neglect the ring factor, since quantitatively it was found to be an adjustable parameter and in our simulations it had the effect of mainly translating vertically the opening propensity profiles - the plot of the propensity to open of base-pair n vs. n.
This Ising-type model distinguishes two states of base-pairs, the open state in which the hydrogen bonds are broken and the bases are flipped out of the stack, and the closed state in which the opposite is true. The instability profiles are obtained by calculating the probability P k (n) for k consecutive base-pairs to be open at the same time, starting at base pair n. The parameter k is called the bubble-size. In the original version of the model  only the case k = 1 was considered. In this work, we generalized the model to be able to calculate the propensity P k (n) for k ≥ 1. Our choice for k in this work varies from k = 1 to k = 9. The need to consider more than one value of k, stems from the fact that new features of the opening profiles emerge with different values. For example, k = 1 gives an implementation of the original model introduced in , but it tends to be noisy (see examples in [Additional file 1]).
It is important to note that the way in which the opening probabilities are calculated by our method for a bubble of size k >1 is fundamentally different from considering the probabilities that each individual base-pair is open and then averaging over a window of size k. Our method computes the probability that all k base-pairs in the window of size k are open simultaneously.
The inhomogeneity of the sequence is taken care of by 2 sets of parameters for the hydrogen bonds and 10 parameters for the stacking interaction of the adjacent bases. There are no free parameters in this approach. The thermodynamic parameter values used in our simulations are the ones reported in [18, 19] and were determined experimentally .
The approach we use provides an efficient method for a genome-wide scan: the time complexity of our algorithm is linear in the length of the sequence. It took less than one second to compute the profile of a 10,000 bp long sequence, while it took more than two hours for WebSIDD .
"Shape recognition" of DNA is a major determinant in protein-DNA interactions . Examination of DNA-protein complex structures has revealed that transcription factor (TF) binding sites can exhibit characteristic structural signatures, e.g. in terms of deformability , bending, groove width, or the presence of kinked bases . These properties may in some instances be correlated to thermodynamic stability and the presented here characteristic profiles for the various promoter elements may reflect conformational properties of the corresponding DNA-protein complexes. In the case of the TATA box, the relationship is easy to see. The TATA box binding protein (TBP) unwinds and bends the DNA double helix almost at 90 degree angle to achieve specific binding , suggesting that sequences that are resistant to such deformation would not bind TBP well. For the Inr element, it has been proposed that a propensity for strand separation assists in the formation of the "transcriptional bubble" , the exposed single strand DNA required by RNA polymerases to initiate transcription. Moreover, YY1 transcription factor, which recognizes Inr motifs such as CCATTT, makes specific contacts with one strand only [30, 31], raising the possibility that its binding also assists in formation of the transcriptional bubble. Generally however, DNA conformational properties are determined by a complex interplay of hydrogen bonding, base stacking energies, hydration, counterions, and steric effects well past the predictive ability of a simple thermodynamic stability model.
Core promoter elements, their consensus sequences and functional window
-33 to -23
-5 to +6
+23 to +33
-42 to -32
-170 to -5
We observed that for the TATA box and Inr, the scores of the functional sites were higher, on average, than the scores of random sites. On the other hand, for BRE, DPE and GC box, the scores of the functional sites were lower, on average, than the scores of random sites. Figure 5 shows that for TATA box and Inr, the graph of the ecdf of the functional sites lies below the ecdf of the random sites, while for BRE, DPE and GC box, the ecdf of the functional sites lies above the ecdf of the random sites. This suggests that functional sites of TATA box and Inr "prefer" less stability but BRE, DPE and GC box "prefer" more stability, when compared to random sites.
In , Gordân et al. incorporated the DNA instability profiles into a motif finding algorithm based on the following observation regarding high-confidence transcription factors binding sites ("functional TFBS") in yeast. They noticed that the the distributions of the instability scores were significantly different for the high-confidence TFBS compared to random sites. This information was then used to derive informative positional priors.
Gordân et al. also observed that, when their set of high-confidence yeast TFBS was compared with random sites, it had, in general, lower instability scores. They hypothesized that TFBS occur preferentially in regions with high DNA duplex stability.
Our findings for individual core promoter elements in human suggest that, compared to random sites, TFs with AT-rich motifs prefer instability while GC-rich motifs prefer stability. This is consistent with our results on the human core promoters signature. We hypothesize that the set of Yeast motifs used in  was GC-rich, therefore skewing the results when compared to random sites and improving the overall performance of the motif discovery tool on the GC-rich motif data set. This relationship between the GC content and stability preference is supported by the following results on shuffled motifs.
In this context, a shuffled motif is a biologically meaningless motif created from the original core promoter motif by shuffling the regular expression for the motif. For example, for the Inr motif YYANWYY, a shuffled motif can be ANYYYYW. In this case we considered the instability scores at all sites in the sequences where ANYYYYW occurred.
These results show that, on average, the AT-rich shuffled motifs scored higher (more instability) than random sites, while the GC-rich shuffled motifs scored lower than random sites.
It is not surprising that (per motif) the two distributions compared were significantly different. The instability score of a word at a site depends in large on the content of the word. Therefore it is expected that scores at sites of one selected set of words and scores of sites of a different set of words (or random words) will have different distributions. But it is important to note how the GC content of the motif effect the results. These results suggest that binding sites for different TFs have different instability profiles when compared to random sites, even when the GC content of the random sites is similar to the GC content of the TFBS. Therefore, in order to capture biologically significant features, one should be careful when combining instability scores of binding sites from different TFs.
We have introduced a linear time algorithm for computing DNA duplex instability profiles. The algorithm has the feature that it can compute the probability of formation of localized openings of any size k. Our analysis has shown that when studying the signatures of functional sites, bubble size matters. Specifically, considering the case of one base pair open, which corresponds to case k = 1, in some instances fails to identify the signatures. With our method, one can easily perform the calculation with several bubble sizes and be able to differentiate the signatures.
Our study has shown that core promoters with GC-rich motifs prefer stability, while those with AT-rich motif prefer instability. We have also shown that the DNA instability scores can differentiate functional binding sites from non-functional binding sites. We have demonstrated that a fast algorithm for the calculation of instability profiles can be a powerful tool in the investigation of entire genomes, with potential applications to motif discovery.
Stacking and base-pairing parameters
For all the computations in this paper, the temperature parameter, T , was set to 37C and the ring factor parameter, ξ, to 1.
Promoter sequences were obtained from the DBTSS website, version 6 . Only sequences with NM ids were considered, and redundancies were dealt with by choosing one representative at random. For human, the total number of sequences considered was 15,194, for mouse 15,337 and for zebrafish 5,343.
where n S is position n (relative to TSS) in sequence S and P k(n S ) is the opening propensity of k base pairs being open, starting at position n S .
Five core promoters were considered: the transcription factor II recognition element (BRE), the downstream promoter element (DPE), initiator (Inr), the TATA box, and the GC box. The sequence motifs and functional windows are shown in Table 1.
A sequence was classified as containing a given core promoter if the motif for that core promoter had a match inside the appropriate functional window. A match here is an exact match of the regular expressions given in Table 1 on the positive strand.
Number of functional sites and non-functional sites per motif
For each motif site we assigned an average score as follows. For k smaller than the motif length, we took the average opening probabilities of the k-windows that are contained in the site. For k greater than the motif length, we averaged the scores of the k-windows that contained the site. The distributions of these scores of the functional sites was compared to the distribution of the scores of the non-functional sites using two sample Kolmogorov-Smirnov test.
For each functional site we picked at random 10 sites of equal length from the same promoter region. For each site we assigned an average score as was done for the functional sites.
MRK was partially supported by the NSF grant DBI-0835718. ZR acknowledges support by the NSF through grant DMS-0708421. This research was funded in part by the NIH (grant # GM073911 to AU).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.