DNA duplex instability is manifested as the ease at denaturating the DNA double strand, i.e., as the partial melting and unfolding of double stranded DNA. The study of DNA duplex instability has been a fascinating subject for many reasons: its importance for techniques such as PCR, sequencing by hybridization, antigene targeting, and for understanding replication, mutation, repair, and transcription, see [1] and references therein.

With respect to understanding transcription initiation, in the very recent past, there has been increased evidence that duplex instability, as well as other physiochemical properties, reveal specific signatures of TSS and core promoter elements. In this context, there have been several comprehensive analyses of genomes such as that of the Plasmodium falciparum [2], yeast [3], human, and other animals [4–7]. To a smaller scale, structural properties of DNA have also been used to predict DNA function in viral sequences [8, 9].

It has also been shown that the DNA duplex instability profiles can be used to aid motif discovery in yeast [3]. The instability profiles, computed with the on-line tool WebSIDD [10], were used to derive informative positional priors and incorporated into a motif finding algorithm. As a result, the performance of the motif finding program improved significantly.

The need for an efficient method to compute the profiles was stressed in [3], since the on-line tool WebSIDD could not be used to efficiently compute profiles of sequences that were several thousands base pairs long. The algorithms used for computing DNA instability profiles for the above applications [2–9] either have non-linear time complexity (such as the algorithms based on the Peyrard-Bishop-Dauxois (PBD) model [11, 12] and WebSIDD, based on the Benham model [13, 14]) or are linear time approximations to a non-linear-time model (e.g., [6]).

Some progress in this direction has been made. Recently, in [15] the Zimm-Brag model was used for a genome wide comparison between coding domains and thermodynamically stable regions. In some organisms the corellation between coding domains and thermodynamic stability allowed the identification of putative exons or genes. The authors state that the algorithm is linear in the length of the sequence. Also, using the Poland-Scheraga model in [16] another algorithm for DNA melting calculations was reported with time complexity less than quadratic.

In this work we modified an Ising-type model [17–19] that identifies as major contributions to DNA stability the hydrogen bonds between the complementary bases and the nearest-neighbor stacking interaction. One of the advantages of this model is that it can be implemented efficiently, since the time complexity of the algorithm is linear in the length of the sequence.

Another feature of the method we use is that it directly computes the probability of bubble formation of any size *k*. Our operational definition of a bubble is that of a strand separation, or DNA 'opening' spanning several base pairs. Here, a bubble of size *k* means that at least *k* base pairs are open.

Studies have suggested that the ability of the DNA to form a 'transcriptional bubble' at the transcriptional start site is essential to initiate transcription. Using the PBD model, in [20], it was argued that thermodynamic instability profiles are able to identify the location of TSS. In [21] it was demonstrated that bubble size is important, in the sense that when the simulated bubble size equals the transcriptional bubble size, the highest peak in the instability profile appears at the TSS. Most previous algorithms (in particular, [3, 4, 6]), only compute the opening probabilities of one base-pair at a time and use averaging techniques to measure the propensity of the DNA to form a bubble of size *k >*1 at a specific location. This averaging process is not equivalent to computing the opening probability of the whole window of size *k*.

Using the Ising-type model, we computed the DNA instability profiles for the human promoter regions in Database of Transcriptional Start Sites (DBTSS). We show that these profiles provide an insight into core promoter elements such as the downstream promoter element (DPE), transcription factor II recognition element (BRE), initiator (Inr) and GC box. We show that there is an association between the location of local extrema in the instability profiles and the presence of core promoter elements at these locations. We present evidence that BRE and DPE prefer stability, whereas the TATA box and the Inr prefer instability.

Finally, we examine the applications of the DNA duplex instability profiles to motif discovery. Our findings raise a concern that the "one size fits all" approach to transcription factors used in [3], may not be appropriate.

### Related approaches

Most of the approaches for the computation of DNA instability profiles use models that are coarse-grained, in the sense that they take into account only the major contributions to DNA stability.

The Peyrard-Bishop-Dauxois model [11, 12] assumes that the hydrogen bonds and stacking interaction are the main contributions to DNA stability. Like the modified Ising-type model, it does not take into account explicitly the three-dimensional structure of the double helix and neglects torsional effects. The main difference with the approach we use in this work is that the variable that describes the stretching of the bonds is continuous rather than discrete. The computational complexity of this model is non-linear in the length of the sequence. By direct integration an algorithm was devised in [22] that reduces the complexity of function evaluations to being linear in the length of the sequence and quadratic in the number of grid points used in the integration.

The Benham model [10, 13, 14] uses the free energy needed to separate the two strands and destroy the helical structure as a measure of instability. This model predicts the location and extend of destabilization given the DNA sequence and imposed super-helical stress and is discrete: the base pair is assumed to be either separated or not. The Benham model has non-linear time complexity.

In [6] the human genomic melting map was obtained based on the Poland model [23], which uses recurrence relations to calculate the probabilities of transition of the double helix from the helical to the coil state, rather than considering the state of the hydrogen bonds and stacking interactions. The approach we use here is more general since it allows for the study of localized openings that are precursors to melting, instead of considering only the complete melting of a DNA sequence. The algorithm used in [6] is a linear time approximation to the non-linear-time Poland model.

In [4] an approach to predict promoters in whole-genome sequences was used, with the aid of large-scale structural properties of DNA, such as GC content, stabilizing energy of Z-DNA, DNA denaturation values, protein induced deformability, and duplex free energy. First, structural profiles are calculated by converting the nucleotide sequence into a numerical profile, by replacing each di-or trinucleotide with its corresponding structural value. Next, the values are averaged over a window of size 400. The approach we use is different in the sense that no averaging is taking place, rather we calculate conditional probabilities of having *k* base pairs in the open state and loss of stacking interactions. The cooperative and long range effects in the Ising type model used here, are due to that fact that in the calculation of the probabilities, the entire sequence is taken into account in the evaluation of the partition function (normalizing constant so the probabilities at a given base pair add up to one). In this sense, our approach is an effective smoothing and no averaging within the same sequence takes place.

Other approaches to the study of DNA denaturation include the examination of the breathing dynamics from a probabilistic point of view [24] and [25]. In [24] the authors develop a master equation, which together with a Gillepsie algorithm, generates sequence-specific stochastic time series of partially melted regions in DNA. In [25] the dynamics and thermodynamics of twist-induced denaturation was studied in a long, random sequence of DNA, using large deviation theory, scaling arguments, and Monte Carlo simulations.