Reliable scaling of position weight matrices for binding strength comparisons between transcription factors
- Xiaoyan Ma^{1, 2},
- Daphne Ezer^{1, 2},
- Carmen Navarro^{2, 3} and
- Boris Adryan^{1, 2}Email author
Received: 17 March 2015
Accepted: 8 July 2015
Published: 20 August 2015
Abstract
Background
Scoring DNA sequences against Position Weight Matrices (PWMs) is a widely adopted method to identify putative transcription factor binding sites. While common bioinformatics tools produce scores that can reflect the binding strength between a specific transcription factor and the DNA, these scores are not directly comparable between different transcription factors. Other methods, including p-value associated approaches (Touzet H, Varré J-S. Efficient and accurate p-value computation for position weight matrices. Algorithms Mol Biol. 2007;2(1510.1186):1748–7188), provide more rigorous ways to identify potential binding sites, but their results are difficult to interpret in terms of binding energy, which is essential for the modeling of transcription factor binding dynamics and enhancer activities.
Results
Here, we provide two different ways to find the scaling parameter λ that allows us to infer binding energy from a PWM score. The first approach uses a PWM and background genomic sequence as input to estimate λ for a specific transcription factor, which we applied to show that λ distributions for different transcription factor families correspond with their DNA binding properties. Our second method can reliably convert λ between different PWMs of the same transcription factor, which allows us to directly compare PWMs that were generated by different approaches.
Conclusion
These two approaches provide computationally efficient ways to scale PWM scores and estimate the strength of transcription factor binding sites in quantitative studies of binding dynamics. Their results are consistent with each other and previous reports in most of cases.
Keywords
Background
Sequence-specific transcription factors (TFs) are key elements in the regulation of gene expression. Their binding preferences to DNA have been studied extensively in vitro, in vivo and using computational methods. In vitro methods such as protein binding microarray(PBM) [1], high-throughput SELEX measurements [2] and DNase I-seq [3] have provided fundamental insight into the specificity of TF binding. The systematic compilation of DNA sequences from such experiments (and along with them catalogues such as TRANSFAC [4] or JASPAR [5]) have long suggested that TFs do not just bind to one DNA motif, but can bind to a repertoire of similar sequences. Stacks of such sequences give rise to alignment matrices, in which each column represents the absolute count of A, C, G and T nucleotide occurrences per position along the length of the motif. We use “motif” in this manuscript in reference to the PWM motif for a specific TF. Work by Berg et al. [6] introduced a derivative of the alignment or position frequency matrix (PFM), the position weight matrix (PWM, sometimes also noted as PSSM for position-specific scoring matrices), which takes the log likelihood of observing nucleotides taking their overall frequency into account. Berg et al. [7] later showed that the score obtained by comparing the PWM against a DNA sequence is proportional to the binding energy between this TF and the DNA. In most cases the actual binding energy between the protein and DNA is not known, and the proportionality is scaled with a factor commonly termed λ. Berg et al. originally introduced λ to relate the population of base-pair choices to binding free energy [6]. It is analogy to the inverse temperature factor in statistical physics to describe the energy distribution and also to serve as a factor in tuning the number of potential binding sites in order to satisfy the constraints on overall energy distribution.
There is no well-characterized and easily computable way to determine the TF binding energy for specific DNA sequences and to compare binding site strength between different types of TFs at large scale. This is problematic when scanning the genome with a library of PWMs, as scoring functions treat each PWM independently, and the absolute score associated with a “good match” to the PWM of one transcription factor might be associated with a mismatch for another factor. A more sophisticated application of binding site strength estimation is, for example, modeling the relationship between enhancer occupancy and gene expression [8, 9]. The experimental PBM approach [1] allows the estimation of the relative binding strength of a protein to “naked” DNA in vitro, but the data availability is restricted to a limited number of TFs due to high cost of the technology. In addition, PBMs are also not suitable for TFs with longer motifs, as their accuracy will decrease with the length of the DNA probe [1]. Therefore, PWM-based approaches are used to computationally estimate TF binding affinity to a specific sequence [8, 10].
In the majority of bioinformatic studies, the scaling factor λ is unknown and PWM scores are used at face value as measure of affinity. For example, in our own work [11] we used the PWM score without scaling to compare binding site strength across different TFs in E. coli, which might lead to a bias due to the absolute differences between the highest and lowest PWM scores across all TFs of interest. One approach is to scale the PWM score by a p-value for each specific score threshold [12]. This method provides a good way to define putative binding sites by choosing a proper statistical threshold, but it is difficult to correlate these p-values with binding energy estimation, as is required for quantitative studies of enhancer activity [8, 9]. Other work has tried to assess the range of λ on the basis of fitting calculated affinity landscapes to ChIP-seq profiles [13, 14]. However, ChIP data is intrinsically noisy and the height of a ChIP peak may not accurately represent the real binding affinity, undermining the stability and accuracy of λ obtained from these methods. In Roider et al. [13], the estimated λ for the same TF in different conditions diverged greatly in nearly one third of TFs. Furthermore, there is a wide band of possible λ values that optimize the correlation. Aforementioned fitting methods are further reliant on chromatin accessibility data acquired under the same growing conditions or development stages, which is sometimes not available for specific TFs.
We propose a simple approximation to estimate the scaling parameter λ based on existing PWMs, average maximum mismatch energy tolerance estimated by high-throughput binding energy measurements [15] and the distribution of PWM scores across the genome of a specific organism. This method is independent of genome-wide binding and accessibility data. Furthermore, in the cases where there are potentially inconsistent PWMs for a particular TF (e.g. derived on the basis of individual binding sites vs. derived from high-throughput efforts), we provide a method to convert the known λ for one PWM of the same TF into another suitable value for a new PWM. This method is based on a computational model of the facilitated diffusion of TFs on the DNA that our group established earlier [16]. We calculate sequence-specific residence times of TFs at the DNA, which is correlated with affinity. We can therefore derive λ for different PWMs of the same TF on the basis of the consistency of simulated residence time. These two strategies (a) calculating λ to scale PWM scores based on the mismatch energy theory using a simple equation and (b) converting the scaling parameter λ between different PWMs of the same TF on the basis of simulated residence time of facilitated diffusion provide simple but useful estimations of binding energy across different TFs using properly scaled PWM scores.
Methods
PWMs of TFs for yeast, fly and vertebrates
Position frequency matrices (PFM) used to construct PWMs were downloaded from the JASPAR database (JASPAR-CORE-2014 non-redundant PFM) [5]. Additional sources of PFMs such as those contained in the BioConductor package PWMEnrich.Dmelanogaster.background [17] were used as a source of different matrices for the same TFs. PFMs constructed with less than 30 reference sequences of validated binding sites were removed, as we deemed those insufficient descriptions of binding preference. Given that typical TF binding sites span at least six base pairs, we removed any motifs less than 6 base pairs in length.
where μ is chosen to be 1 [19] and we also show that the choice of the pseudo-count μ does not have significant influence on our results (Additional file 1: Figure S6); n _{ x,k } is the frequency of certain nucleotide x in a specific position k of the motif.
Simple equation to calculate λ
where S _{ j } is the PWM score at position j in the genome, τ _{0} is the average residence time calculated as in [16]. This equation is widely used in simulations of TF binding kinetics [20].
where <E _{ maxMismatch }> stands for the average maximum mismatch energy tolerance, which is chosen to be 6 bits as is discussed below from the study of Maerkl et al. [15]; I f _{ i } represents the information content of a specific PWM and <I f> stands for the average information content corresponding to the average maximum mismatch energy [15], which is 13.2 bits. We reason that if the information content is a good indication of how specific a TF is, the energy drop measured in bits between strong and weak binding sites (S _{ m a x,i }−S _{ t o p0.1 %,i })/λ _{ i } should have some relationship with the binding specificity of a particular TF. The more specific a TF is, the more significant the energy drop can be. Given limited data in binding energy measurement, we assume that the relationship is simply linear.
We chose an average mismatch energy tolerance of 6 bits based on the study by Maerkl et al. 2007 [15]. They showed by mechanical trapping of molecular interactions a significant decline in binding energy by at most 2 to 3 nucleotide mismatches, and each mismatch nucleotide contributes 2 bits in mismatch energy. Even if more mutations are introduced, the binding energy does not drop further since it has already reached the background non-specific binding energy level.
This experiment was applied only to TFs belonging to the bHLH family. In the absence of more comprehensive data, we must assume that all TFs share this value; although if more general TF in-vitro binding energy measurement results become available, we suggest adjusting the specific top score threshold and corresponding average mismatch energy bits accordingly. Another report featuring TFs from different families including: p53, Max, Glucocorticoid Receptor [22] also provides additional support for 6 bits as average mismatch energy tolerance level since TFs from different families in their study have similar binding kinetics.
In order to control for PWM motif length, in the analysis of λ value comparison across different species and TF families, each λ value was transformed into a Z-score. Specifically, PWM motifs were grouped by motif length, with each group having more than 50 PWM motifs (The groups were: 7-8 bp, 9-10 bp, 11-12 bp, 13-15 bp, >=16 bp), and the λ values were normalized by the mean and standard deviation within each of these groups (Additional file 3: Table S3 lists the mean and standard deviation value for each group, Additional file 4: Figure S3 depicts the distribution of λ at different motif lengths with color coded points that represent different species).
Estimating λ of a new PWM matrix for the same TF based on the residence time landscape of the facilitated diffusion model
Sometimes there may be more than one PWM available for a specific TF. For instance, different TF motif databases (such as JASPAR [5], SwissRegulon [23], FlyFactorSurvey [24], and HOCOMOCO [25]) may have different versions of PWM motifs for the same TF. In order to directly compare the TF binding energy when using two alternative versions of a PWM, it is important to have a way of scaling the results by λ. λ can be adjusted using the formalism introduced in the previous sections. As a compute-efficient alternative, we developed a more optimal strategy for estimating λ, which does not require the assumption that the PWM information content influences the energy mismatch tolerance. Instead, we base our strategy on the estimation of the sequence specific residence time of a particular TF, which is a biologically meaningful quantity and can be correlated with in vitro sequence-dependent sliding measurement of TFs [10]. For the same TF, the distribution of the sequence-specific residence time calculated by Eq. 4 should be as consistent as possible, even when using slightly different PWMs if an appropriate λ is chosen for scaling. Based on this, given a known λ for one PWM, we are able to find another suitable λ for the new PWM.
Note that the stronger the PWM score, the more likely it is that the sequence is bound by a TF and that the residence time of a TF is a biologically meaningful quantity, but there is a much greater number of weak and medium strength binding sites than there are strong sites in the genome. Therefore, if we scored each potential binding site equally, the background of weak and medium strength binding sites would have a greater affect on the estimated λ than the strong binding sites. Therefore, we compare residence times across different quantiles on a logarithmic binding strength scale so that the strongest binding sites have the most influence on our λ estimates.
Specifically, in the following analysis, we take the −l o g _{10} of the cumulative distribution of PWM scores and select all binding sites with values greater than 3.0 (recall that this corresponds to the 0.1 % percent of binding sites, which were chosen as the lower boundary of weak binding sites). We divide these top-scoring binding sites into bins every 0.1 log-quantile and calculate the average residence time for each of these bins. Our strategy identifies the λ that would produce the most similar residence times for each of these log-quantiles. Assuming that for the first PWM we already have an estimate of λ, by either binding profile fitting or other methods, we can use Eq. 4 to calculate the residence time for each binding strength log-quantile, as described above. In the following analysis of this paper, we borrow the values obtained from Eq. 6 as pre-calculated λ for proof-of-principle, since there are very few well-characterized λ values from profile fitting. Note that τ _{0} is calculated via the strategy described in Zabet et al. [16] from all intergenetic regions in the genome, which has a different value for each unique PWM.
where q represents each quantile in the quantile series, τ _{ q,λ } is the residence time in a specific quantile of a particular λ, τ _{ q,r e f } is the residence time in the same quantile of the known λ of the reference PWM matrix. The λ derived by minimizing the mean square error or minimizing the value of the above formula show good consistency with adjusted R ^{2} of 0.9644 (p= 6.3·10^{−9}). Thus, there should not be significant bias using either of these two methods.
The R scripts for both converting λ between two PWM matrices and estimating λ using Eq. 6 are provided in the following link: https://github.com/XyMa/PWM_scale.
Results
Estimating scaling parameter λ for binding site affinity across different species and TF families based on Eq. 6
The λ parameter is the critical link between PWM score, the estimated binding energy and TF residence time. Estimating TF binding site affinity by comparing PWM scores at face value can lead to a large bias, especially when this includes comparisons between many types of TFs, because several properties of the PWM itself can influence the PWM score. For example, the information content of the PWMs is positively correlated to the maximum possible PWM score, as is shown in Additional file 5: Figure S1 with an R ^{2} value of 0.597. Thus, the absolute value of PWM scores cannot be compared directly across different TFs as an indicator of binding site strength. Proper scaling of PWM score is needed in order to compare binding site affinity across different types of TFs. Based on the methods proposed by Berg et al. [7], the TF binding energy for a specific binding site can be computed by Eq. 3 using the estimated λ.
Maximum, minimum and the mean values of λ in 3 groups of organisms
S. cerevisiae | D. melanogaster | Vertebrates | |
---|---|---|---|
maximum | 2.83 | 2.72 | 2.82 |
minimum | 0.26 | 0.35 | 0.25 |
mean | 1.25 | 1.40 | 1.73 |
Since λ is the denominator to the PWM score differences between one binding site and the consensus sequence in Eq. 3, a larger λ indicates lower mismatch energy when Δ S _{ j } is the same. Thus, with the same possible mismatch energy range, if λ is larger, the PWM score can have a greater range from the consensus sequence to the potentially weakest binding site, which indicates that, as suggested by Pabo et al. [26], the binding motif for the TF family has higher flexibility. This is consistent with the fact that the TFs in the zinc-finger super-family, including the nuclear receptor and β- β- α zinc-finger families, are less constrained to a particular motif than HTH super family. Additionally, cross species comparison of λ indicates that from yeast to vertebrate, more flexible TF motifs are used, which is consistent with the result from Itzkovitz et al. [27] that organisms which appeared more recently in evolution tend to use more TFs with motifs of higher flexibility.
Comparison of λ values estimated with Eq. 6 to λ values derived from fitting ChIP-seq data
Converting λ between different PWM matrices of the same TF
In many cases there are two PWMs available for the same TF, and one of these PWMs might already have a reliable estimate of λ from any number of experimental or computational approaches [14]. In such circumstances, we provide a strategy to estimate the unknown λ associated with the alternative PWM. It would be possible to calculate the unknown λ from Eq. 6, but this does not incorporate the additional data available (i.e. the known λ). Our alternative strategy not only incorporates this data, but also loosens the assumption in Eq. 6 that the maximum mismatch energy for DNA binding is proportional to information content.
Discussion
TF binding site strength estimation using PWM-based methods is essential for modelling TF-DNA interaction in functional genomics; but a proper scaling parameter is needed when using the PWM score to estimate TF binding energy. Therefore, we provide two independent methods for estimating the scaling parameter λ in different conditions. The simple Eq. 6 is widely applicable, since it only requires a PWM as input, which is easy to implement compared to methods using fitting to ChIP-seq [13, 14]. Our second method converts a λ specific to one PWM into λ for a different PWM of the same TF. It is based on the definition of sequence-specific residence time from the facilitated diffusion model of TFs on DNA [16]. This method is particularly useful for converting a previously estimated λ into the one associated with a more up-to-date or otherwise alternative PWMs.
These two methods are consistent with one another (Additional file 11: Figure S4) and with previously established methods. For instance, Eq. 6 can also provide very similar results compared with the estimated λ from ChIP-seq data fitting [13, 14]. Although our estimates of λ are mostly consistent with those estimated by Zabet [14] and Roider [13], it is not possible to robustly compare our λ estimates to experimentally derived values at large scale, as this data is simply unavailable. Having more such data would also enable us to adjust currently fixed parameters in our equation for different TF families, such as the top-scoring threshold, instead of assuming a uniform value across all TFs. The consistent value range of λ in different organisms calculated by this method provides additional support for the applicability of this simple equation. Moreover, the estimated distribution of λ values for different TF families make sense in the light of motif choice for each of the TF families [28]. For example, TFs in the zinc-finger TF super-family, including nuclear receptor zinc-finger and β- β- α zinc-finger families, have more flexible binding motifs, which can suit a wider range of possible binding sites than the helix-turn-helix super-family, which has a more restricted motif consensus sequence [26]. In contrast, some TF families belonging to the same super-family and sharing similar binding domain properties can have a strong similarity in λ distribution, e.g. homeobox family and forkhead family which both belong to the helix-turn-helix super-family. The two TF families that show the highest average z-score of λ values (namely, basic leucine-zipper and helix-loop-helix families) tend to form homodimers and heterodimers, though some TFs in other TF families also tend to dimerise e.g. some members in homeobox family. If PWM motifs for either monomers or dimers are available, the corresponding λ scores can be roughly estimated following the same procedure using Eq. 6, or we can further use the second method mentioned before to convert λ values between different PWMs by the keeping residence time consistent. However, our method only considers TF-DNA interaction, ignoring the effects of TF-TF interactions that could stabilize TF binding.
There are some points that should be noted when using the simple equation method: first, it cannot be applied to very short TF motifs that are less than 6 base pairs in length. Since this method depends on calculating the difference between the PWM score of the 0.1 % highest scoring matches and the maximum score, if the motif is only 5 base pairs in length, the number of possible choices for sequence combination of 5 base pairs is only 1024, then the top 0.1 % highest scoring matches is very likely to be nearly equal to the maximum score. However, most eukaryotic motifs are more than 6 base pairs long. Eukaryotic TFs on average cover 15 bp of DNA with a core motif length of 8-15 bp [8]. Thus, this limitation should not be a problem in the majority of cases. However, if a higher threshold e.g. top 1 ·10^{−5} is applied with certain adjustment for average mismatch bits in the denominator, it requires the PWM motif to be at least 10 bp long, which will limit the applicability of this simple method. The default cut-off threshold for binding sites is the top 0.1 % of the highest scoring matches, but varying the threshold up to the top 0.001 % does not significantly influence the rank of λ (Additional file 2: Figure S5). Note in Eq. 6, the average mismatch energy bit score in the denominator is the one corresponding to the certain top PWM matches threshold, which means if a new threshold is adopted, the average mismatch energy bit score should be updated accordingly, but given very limited binding energy measurement data, it is difficult to select specific values for each corresponding binding site strength level. Thus, we simply compared the rank correlation of λ, which is not affected by the linear scaling factor of average mismatch energy bits. Although we estimate λ by top scoring genomic sequences, it will not substantially affect the analysis if this is done on random sequences with the same GC content, since given the size of the genome, local binding site patterns will not have much influence on the general distribution of binding site strength. Additional file 1: Figure S6 shows that the number of unique k-mers passing the 0.1 % top scoring matches threshold in genomic sequences correlates well with that in random sequences of the same GC content.
Another assumption in this method is that the mismatch energy tolerance range measured in bits is proportional to the information content of the PWM. This assumption can deal with the bias from the differences in information content of most PWMs; however, it might not hold for PWMs with extremely high information content. For example, the yeast transcription factor IXR1 has an information content of 47 bits according to the PFM from JASPAR [5], which is substantially larger than the average information content of 13.2 bits. In that case, the binding energy will probably be overestimated, which leads to a lower λ, but these cases are very rare and only 7 PWMs in our analysis (less than 1.5 %) have information content greater than 20. Further, we note that the experiment by Maerkl et al. [15] was applied only to TFs belonging to the bHLH family. In the absence of any alternative data, we simply assume that this value is scaled by the information content of the PWM; although if more in-vitro binding energy measurements should become available in the future, we suggest adjusting the specific top score threshold and corresponding average mismatch energy bits accordingly.
There are two limitations of this method, which can potentially lead to some biases between different organisms and different TF families. One limitation is related to the calculation of mismatch energy tolerance in different groups of TF families. We apply a single cut-off threshold of the top 0.1 % highest scoring matches for weak binding sites suggested by Wunderlich et al. [21], but it could be possible that for different TF families, different thresholds should be used due to variations in their DNA binding domains. However, it is difficult to choose specific thresholds for every TF family based on the currently available data. Further, from the definition of information content of the PWM, it sums up information content gain from each nucleotide [20]. It implies that longer motifs including more flanking base pairs will have higher information content compared to the shorter ones with only core motifs, which is an artefact of computation. However, there is no satisfactory way to deal with this problem. One possible solution is using the information content per nucleotide instead of the total information content, but this may be problematic as the information content contributed by flanking sequences constitutes only a very small fraction compared to core motifs. Thus, if dividing total information content by the length of the motif, the dilution of information content could lead to even larger biases. Therefore, instead, in our analysis of comparing λ value distribution across different organisms and TF families, we control for motif length by normalizing it to the mean in each motif length bin. Another potential solution is trying to define a core motif from one PWM, but this requires detailed knowledge about the TF of interest. Additionally, λ will not be a reliable measure of the biochemical stickiness of the TF to the DNA if the PWM itself is not an accurate representation of TF binding. A PWM assumes that each nucleotide position independently contributes to TF binding affinity, which may not be the case [29, 30]. For instance, a study by Storm et al. [31] used both a single nucleotide model and a di-nucleotide model to fit the binding energy measurements [15]. Although they found that the di-nucleotide model provides a better fit to the experimental data, the single nucleotide model could also perform well when non-specific binding energy was taken into account. In addition, the composition of the position frequency matrix of the PWM may contain biases due to the difficulties of attaining an unbiased validated binding site set. Nevertheless, λ can give us insights about DNA binding properties of TFs.
Also, it should be pointed out that residence time in this paper refers to an estimate based on biophysical models [10, 16]. However, other papers report inconsistent scales of residence time according to different experimental approaches. For example, the residence time estimations obtained by Competition-ChIP methods [32] do not share the same order of magnitude compared to the residence times measured by FRAP or single molecular tracking [22, 33, 34], which can probably be an artifact of experimental methods or alternatively, the range of residence time truly varies greatly across different TFs [35]. Because the experimentally determined values are not comparable to each other, we simply adopt bioinformatics-based approaches to compute residence time. Since our method converts λ between different PWMs of the same TF under the concept of residence time, it avoids fitting inconsistent experimental observations and potential variations in DNA-binding kinetics for different TFs.
Although in many cases PWMs are not optimal representations of binding motifs, they have become almost universally adopted to identify potential TF binding sites. It is important to remember that the value of a PWM score is not directly correlated to the binding energy, but rather depends on the scaling parameter λ. Previously, researchers either assumed that λ has similar values across different PWMs or estimated it through computationally intensive binding profile fitting methods [13, 14]. There are several alternative ways to identify potential binding sites based on the p-value of the PWM score [12]. Other studies provide tools to combine more local information e.g. DNA sequence conservation and epigenetic marks with PWMs to identify potential binding sites with higher confidence [36]. These methods are useful in defining potential binding sites, but their results are difficult to interpret in terms of TF binding energy which is widely used in modeling TF binding dynamics and enhancer activity [8]. Here we provide two simple strategies for estimating λ, which will let us more clearly link PWM scores with the energetics of TF binding.
Conclusion
Using PWMs as representations of binding motifs have become widely adopted to identify potential TF binding sites. It is important to remember that the value of a PWM score is not directly correlated to the binding energy, but rather depends on the scaling parameter λ. Previously, researchers either assumed that λ has similar values across different PWMs or estimated it through computationally intensive binding profile fitting methods [13, 14]. There are several alternative ways to identify potential binding sites based on the p-value of the PWM score [12]. Other studies provide tools to combine more local information e.g. DNA sequence conservation and epigenetic marks with PWMs to identify potential binding sites with higher confidence [36]. These methods are useful in defining potential binding sites, but their results are difficult to interpret in terms of TF binding energy, which is widely used in modeling TF binding dynamics and enhancer activity [8]. Here we provide two simple strategies for estimating λ, which will let us more clearly link PWM scores with the energetics of TF binding. One approach is to simply apply Eq. 6 to estimate λ only based on the given PWM and genome background sequences of a specific organism. It provides results consistent with those estimated by Zabet [14] and Roider [13] in most cases, though there is a small number of exceptions. Further, λ value distribution for different TF families from this method are consistent with DNA binding properties of TF families [26, 28], which further supports the applicability of this simple method. Another approach converts λ between two PWMs of the same TF based on the consistency of residence times. It is useful when we get alternative versions of PWMs from different databases and want to estimate binding site strength in a consistent manner. Both of the approaches we developed are much compute-efficient than previous methods of TF binding profile fitting [13, 14].
Declarations
Acknowledgements
We thank Nicolae Radu Zabet for helpful conversations, Robert Foy and Jamie McGinn for editing the manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature Genetics. 2004; 36(12):1331–39.View ArticlePubMedPubMed CentralGoogle Scholar
- Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P. High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites. Nature Biotechnology. 2002; 20(8):831–5.View ArticlePubMedGoogle Scholar
- Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012; 489(7414):83–90.View ArticlePubMedPubMed CentralGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, et al. TRANSFAC and its module transcompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006; 34(Database issue):D108–10.View ArticlePubMedGoogle Scholar
- Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2013; 42(Database issue):D142–7.PubMedPubMed CentralGoogle Scholar
- Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987; 193:723–50.View ArticlePubMedGoogle Scholar
- Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins. Trends in Biochemical Sciences. 1988; 13(6):207–11.View ArticlePubMedGoogle Scholar
- Kim AR, Martinez C, Ionides J, Ramos AF, Ludwig MZ, Ogawa N, et al. Rearrangements of 2.5 kilobases of noncoding DNA from the drosophila even-skipped locus define predictive rules of genomic cis-regulatory logic. PLoS Genetics. 2013; 9(2):1003243.View ArticleGoogle Scholar
- Giorgetti L, Siggers T, Tiana G, Caprara G, Notarbartolo S, Corona T, et al. Noncooperative interactions between transcription factors and clustered dna binding sites enable graded transcriptional responses to environmental inputs. Molecular Cell. 2010; 37(3):418–28.View ArticlePubMedGoogle Scholar
- Leith JS, Tafvizi A, Huang F, Uspal WE, Doyle PS, Fersht AR, et al. Sequence-dependent sliding kinetics of p53. Proceedings of the National Academy of Sciences. 2012; 109(41):16552–57.View ArticleGoogle Scholar
- Ezer D, Zabet NR, Adryan B. Physical constraints determine the logic of bacterial promoter architectures. Nucleic Acids Research. 2014:078.Google Scholar
- Touzet H, Varré J-S. Efficient and accurate p-value computation for position weight matrices. Algorithms Mol Biol. 2007; 2(1510.1186):1748–7188.Google Scholar
- Roider HG, Kanhere A, Manke T, Vingron M. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics. 2007; 23(2):134–41.View ArticlePubMedGoogle Scholar
- Zabet NR, Adryan B. Estimating binding properties of transcription factors from genome-wide binding profiles. Nucleic Acids Res. 2015; 43(1):84–94.View ArticlePubMedGoogle Scholar
- Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007; 315(5809):233–7.View ArticlePubMedGoogle Scholar
- Zabet NR, Adryan B. A comprehensive computational model of facilitated diffusion in prokaryotes. Bioinformatics. 2012; 28(11):1517–24.View ArticlePubMedPubMed CentralGoogle Scholar
- Stojnic R, Diez D. PWMEnrich: PWM Enrichment Analysis. 2014. R Package Version 4.2.0.Google Scholar
- Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16(1):16–23.View ArticlePubMedGoogle Scholar
- Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics. 2004; 5(4):276–87.View ArticlePubMedGoogle Scholar
- Stormo GD, Zhao Y. Determining the specificity of protein–DNA interactions. Nature Reviews Genetics. 2010; 11(11):751–60.View ArticlePubMedGoogle Scholar
- Wunderlich Z, Mirny LA. Different gene regulation strategies revealed by analysis of binding motifs. Trends in Genetics. 2009; 25(10):434–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Mueller F, Wach P, McNally JG. Evidence for a common mode of transcription factor interaction with chromatin as revealed by improved quantitative fluorescence recovery after photobleaching. Biophysical journal. 2008; 94(8):3323–39.View ArticlePubMedPubMed CentralGoogle Scholar
- Pachkov M, Balwierz PJ, Arnold P, Ozonov E, van Nimwegen E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic acids research. 2013; 41(D1):214–20.View ArticleGoogle Scholar
- Zhu LJ, Christensen RG, Kazemian M, Hull CJ, Enuameh MS, Basciotta MD, et al. FlyFactorSurvey: a database of drosophila transcription factor binding specificities determined using the bacterial one-hybrid system. Nucleic Acids Research. 2011; 39(suppl 1):111–7.View ArticleGoogle Scholar
- Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, et al. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic acids research. 2013; 41(D1):195–202.View ArticleGoogle Scholar
- Pabo CO, Sauer RT. Transcription factors: structural families and principles of DNA recognition. Annual Review of Biochemistry. 1992; 61(1):1053–95.View ArticlePubMedGoogle Scholar
- Itzkovitz S, Tlusty T, Alon U. Coding limits on the number of transcription factors. BMC Genomics. 2006; 7(1):239.View ArticlePubMedPubMed CentralGoogle Scholar
- Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biol. 2000; 1(1):1–37.View ArticleGoogle Scholar
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. DNA-binding specificities of human transcription factors. Cell. 2013; 152(1):327–39.View ArticlePubMedGoogle Scholar
- Bulyk ML, Johnson PL, Church GM. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Research. 2002; 30(5):1255–61.View ArticlePubMedPubMed CentralGoogle Scholar
- Stormo GD, Zhao Y. Putting numbers on the network connections. BioEssays. 2007; 29(8):717–21.View ArticlePubMedGoogle Scholar
- Lickwar CR, Mueller F, Hanlon SE, McNally JG, Lieb JD. Genome-wide protein-DNA binding dynamics suggest a molecular clutch for transcription factor function. Nature. 2012; 484(7393):251–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Mueller F, Stasevich TJ, Mazza D, McNally JG. Quantifying transcription factor kinetics: At work or at play?Critical reviews in biochemistry and molecular biology. 2013; 48(5):492–514.View ArticlePubMedGoogle Scholar
- Chen J, Zhang Z, Li L, Chen BC, Revyakin A, Hajj B, et al. Single-molecule dynamics of enhanceosome assembly in embryonic stem cells. Cell. 2014; 156(6):1274–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Sung MH, Guertin MJ, Baek S, Hager GL. DNase footprint signatures are dictated by factor dynamics and DNA sequence. Molecular Cell. 2014; 56(2):275–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome research. 2011; 21(3):447–55.View ArticlePubMedPubMed CentralGoogle Scholar