Reliable scaling of position weight matrices for binding strength comparisons between transcription factors

Background Scoring DNA sequences against Position Weight Matrices (PWMs) is a widely adopted method to identify putative transcription factor binding sites. While common bioinformatics tools produce scores that can reflect the binding strength between a specific transcription factor and the DNA, these scores are not directly comparable between different transcription factors. Other methods, including p-value associated approaches (Touzet H, Varré J-S. Efficient and accurate p-value computation for position weight matrices. Algorithms Mol Biol. 2007;2(1510.1186):1748–7188), provide more rigorous ways to identify potential binding sites, but their results are difficult to interpret in terms of binding energy, which is essential for the modeling of transcription factor binding dynamics and enhancer activities. Results Here, we provide two different ways to find the scaling parameter λ that allows us to infer binding energy from a PWM score. The first approach uses a PWM and background genomic sequence as input to estimate λ for a specific transcription factor, which we applied to show that λ distributions for different transcription factor families correspond with their DNA binding properties. Our second method can reliably convert λ between different PWMs of the same transcription factor, which allows us to directly compare PWMs that were generated by different approaches. Conclusion These two approaches provide computationally efficient ways to scale PWM scores and estimate the strength of transcription factor binding sites in quantitative studies of binding dynamics. Their results are consistent with each other and previous reports in most of cases. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0666-1) contains supplementary material, which is available to authorized users.


Introduction
Sequence-specific transcription factors (TFs) are key elements in the regulation of gene expression. Their binding preferences to DNA have been studied extensively in vitro, in vivo and using computational methods. In vitro methods such as DNase I footprinting [1], protein binding microarray(PBM) [2] and high-throughput SE-LEX measurements [3] have provided fundamental insight into the specificity of TF binding. The systematic compilation of DNA sequences from such experiments (and along with them catalogues such as TRANSFAC [4] or JASPAR [5]) have long suggested that TFs do not just bind to one DNA motif, but can bind to a repertoire of similar sequences. Stacks of such sequences give rise to alignment matrices, in which each column represents the absolute count of A, C, G and T nucleotide occurrences per position along the length of the motif. Work by Berg et al. [6] introduced a derivative of the alignment or position frequency matrix (PFM), the position weight matrix (PWM), which takes the log likelihood of observing nucleotides taking their overall frequency into account. Berg et al. [7] later showed that the score obtained by comparing the PWM against a DNA sequence is proportional to the binding energy between this TF and the DNA. In most cases the actual binding energy between the protein and DNA is not known, and the proportionality is scaled with a factor commonly termed λ.
There is no well-characterized and easily computable way to determine the TF binding energy to specific DNA sequences and to compare binding site strength between different types of TFs at large scale. This is problematic when scanning the genome with a library of PWMs, as scoring functions treat each PWM independently, and the absolute score associated with a "good match" to the PWM of one transcription factor might be associated with a mismatch for another factor. A more sophisticated application of binding site strength estimation is, for example, modeling the relationship between enhancer occupancy and gene expression [8,9]. The experimental PBM approach [2] allows the estimation of the relative binding strength of a protein to "naked" DNA in vitro, but the data availability is restricted to a limited number of TFs due to high cost of the technology. In addition, PBMs are also not suitable for TFs with longer motifs, as their accuracy will decrease with the length of the DNA probe [2]. Therefore, PWM-based approaches are used to computationally estimate TF binding affinity to a specific sequence [8,10].
In the majority of bioinformatic studies, the scaling factor λ is unknown and PWM scores are used at face value as measure of affinity. For example, in our own work [11] we used the PWM score without scaling to compare binding site strength across different TFs in E. coli, which might lead to a bias due to the absolute differences between the highest and lowest PWM scores across all TFs of interest. Other work has tried to assess the range of λ on the basis of fitting calculated affinity landscapes to ChIP-seq profiles [12,13]. However, ChIP data is intrinsically noisy and the height of a ChIP peak may not accurately represent the real binding affinity, undermining the stability and accuracy of λ obtained from these methods. In Roider et.al [12], the estimated λ for the same TF in different conditions diverged greatly in nearly one third of TFs. Furthermore, there is a wide band of possible λ values that optimize the correlation. Aforementioned fitting methods are further reliant on chromatin accessibility data acquired under the same experimental conditions, which is often not available for specific conditions and TFs. We propose a simple approximation to estimate the scaling parameter λ based on existing PWM matrices, average maximum mismatch energy tolerance estimated by high-throughput binding energy measurements [14] and the distribution of PWM scores across the genome of a specific organism. This method is independent of genome-wide binding and accessibility data. Furthermore, in the cases where there are potentially inconsistent PWMs for a particular TF (e.g. derived on the basis of individual binding sites vs. derived from high-throughput efforts), we provide a method to convert the known λ for one PWM matrix of the same TF into another suitable value for a new PWM matrix. This method is based on a computational model of the facilitated diffusion of TFs on the DNA that our group established earlier [15]. We calculate sequence-specific residence times of TFs at the DNA, which is correlated with affinity. We can therefore derive λ for different PWMs of the same TF on the basis of the consistency of simulated residence time. These two strategies (a) calculating λ to scale PWM scores based on the mismatch energy theory using a simple equation and (b) converting the scaling parameter λ between different PWM matrices of the same TF on the basis of simulated residence time of facilitated diffusion provide simple but useful estimations of binding energy across different TFs using properly scaled PWM scores.

PWM matrices for TFs for yeast, fly and vertebrates
Position frequency matrices (PFM) used to construct PWM matrices were downloaded from the JASPAR database (JASPAR-CORE-2014 non-redundant PFM) [5]. Additional sources of PFM such as those contained in the BioConductor package PWMEnrich.Dmelanogaster.background [16] were used as a source of different matrices for the same TFs. PFMs constructed with less than 30 reference sequences of validated binding sites were removed, as we deemed those insufficient descriptions of binding preference. Given that typical TF binding sites span at least six base pairs, we removed any motifs less than 6 base pairs in length.
A bioinformatics approach was used to derive PWM scores [17] as follows: where j is the DNA position for the PWM score calculation, L is the length of the motif and k represents k th nucleotide in the PWM motif. In addition, if there is a specific nucleotide in position (j+k) on the DNA, f j+k is the frequency of this nucleotide in the whole genome of a specific organism. We used a pseudo-count µ to adjust the frequency of nucleotides and obtain v j,k to avoid zero frequency as follows [18] v j,k = n j,k + f j+k · µ x n x,k + µ where µ is chosen to be 1 [18] and we also show that the choice of the pseudo-count µ does not have significant influence on our results (Supplementary Figure 6); n x,k is the frequency of certain nucleotide x in a specific position k of the motif. Simple equation to calculate λ λ is the scaling factor that allows for direct comparison of different PWMs in terms of binding energy to DNA. The binding energy of a TF to the DNA at a specific position can be expressed as: where E is the binding energy, S j is the PWM score and E 0 is a scaling parameter. This is useful in a variety of contexts, such as comparing the binding strength of different TFs. In addition, the expected amount of time that the TF is bound to a particular DNA sequence can be estimated as: where S j is the PWM score at position j in the genome, τ 0 is the average residence time calculated as in [15]. This equation is widely used in simulations of TF binding kinetics [19].
Given the utility of the λ for estimating binding strength and occupancy time, it is very important to have a simple strategy for estimating it. We derive our equation based on the following core assumptions: 1. The top 0.1% of the highest scoring matches of the PWM to intergenic regions are considered to be possible TF binding sites, as suggested by [20]. Their genome-wide study of different eukaryotic TFs revealed an average of 1 binding site in every 1-5 thousand base pairs of intergenic sequence. This top 0.1% score threshold has also been similarly adopted in other studies [10]. 2. The maximum mismatch energy between the consensus binding motif and specific DNA sequences is proportional to the information content of the PWM matrix of the TF.
Based on the mismatch energy theory for estimating TF binding strength [7], the mismatch energy at a particular binding site j of TF species i in the genome can be expressed as: where S i,j stands for the PWM score at position j, S max,i is for the maximum PWM score of TF species i and λ, i is the scaling parameter we want to estimate. The lower boundary of potential binding sites is approximated by the top 0.1% of PWM scores following the same reason as mentioned before and corresponds to the maximum mismatch energy tolerance level as follows where E maxM ismatch,i stands for maximum mismatch energy tolerance for TF species i, thus, λ i can be calculated using: Because different transcription factors have different DNA binding domains, the maximum mismatch energy range can vary from one TF to another. Since there is only data available for 4 individual TFs using microfluidic platform-based binding energy measurements [14], we estimated the maximum mismatch energy for other TFs by using the available data as the average rate and assuming that the mismatch energy tolerance is proportional to the information content of the PWM matrix as follows: where If i represents the information content of a specific PWM matrix, < If > stands for the average information content corresponding to the average maximum mismatch energy measured by [14], which is 13.2 bits. We chose an average mismatch energy tolerance of 6 bits based on the study by [14]. They showed by mechanical trapping of molecular interactions a significant decline in binding energy by at most 2 to 3 nucleotide mismatches, and each mismatch nucleotide contributes 2 bits in mismatch energy. Even if more mutations are introduced, the binding energy does not drop further since it has already reached the background non-specific binding energy level. Another report featuring TFs from different families including: p53, Max, Glucocorticoid Receptor [21] also provides additional support for 6 bits as average mismatch energy tolerance level.
Estimating λ of a new PWM matrix for the same TF based on the residence time landscape of the facilitated diffusion model Sometimes there may be more than one PWM available for a specific TF. In order to directly compare the TF binding energy when using two alternative versions of a PWM, it is important to have a way of scaling the results by λ. λ can be adjusted using the formalism introduced in the previous sections. As a compute-efficient alternative, we developed a more optimal strategy for estimating λ, which does not require the assumption that the PWM information content influences the energy mismatch tolerance. Instead, we base our strategy on the estimation of the sequence specific residence time of a particular TF, which is a biological meaningful quantity and can be correlated with in vitro sequence-dependent sliding measurement of TFs [10]. For the same TF, the sequence-specific residence time distribution calculated by Equation 4 should be as consistent as possible, even when using slightly different PWMs, if an appropriate λ is chosen for scaling. Based on this, given a known λ for one PWM, we are able to find another suitable λ for the new PWM.
Note that the stronger the PWM score, the more likely it is that the sequence is bound by a TF and that the TF's residence time is a biologically meaningful quantity, but there is a much greater number of weak and medium strength binding sites than strong sites in the genome. Therefore, if we scored each potential binding site equally, the background of weak and medium-strength binding sites would have a greater affect on the estimated λ than the strong binding sites. Therefore we compare residence times across different quantiles on a logarithmic binding strength scale, so that the strongest binding sites have the most influence on our λ estimates. Specifically, in the following analysis, we take the log 10 of the cumulative distribution of PWM scores and select all binding sites with values greater than 3.0 (recall that this corresponds to the 0.1% percent of binding sites, which were chosen as the lower boundary of weak binding sites). We divide these top-scoring binding sites into bins every 0.1 log-quantile and calculate the average residence time for each of these bins.
Our strategy identifies the λ that would produce the most similar residence times for each of these log-quantiles. Assuming that for the first PWM, we already have an estimate of λ by either binding profile fitting or other methods, we can use Equation 4 to calculate the residence time for each binding strength log-quantile, as described above. In the following analysis of this paper, since there are very few well-characterized λ values from profile fitting, for proof-of-principle, we borrow the values obtained from Equation 6 as pre-calculated λ. Note that τ 0 is calculated via the strategy described in [15] from all intergenetic regions in the genome, which has a different value for each unique PWM. Now for the second PWM, we can vary λ between the potential values of 0.1 and 3, which was shown to be a possible λ range [12], and calculate the corresponding residence times at each log-quantile level. We can now compare the reference residence times from the first PWM with the residence times for the second PWM across each binding site strength level, and for each value of λ. The λ that minimizes the mean square error between two sets of calculated residence times is chosen as the suitable λ value for the second PWM matrix. Since outliers can have a big influence on the mean square error, we calculated the sum of the absolute differences for the natural logarithm of residence times between the two PWM matrices for these quartile bins (Equation 8) to make a comparison with the method that uses mean square error.
where q represents each quantile in the quantile series, τ q,λ is the residence time in a specific quantile of a particular λ, τ q,ref is the residence time in the same quantile of the known λ of the reference PWM matrix. The λ derived by these two methods show good consistency with adjusted coefficient of determination of 0.9644. Thus, there should not be significant bias using either of these two methods.

Results
Estimating scaling parameter λ for binding site affinity across different species and TF families based on Equation 6 The λ parameter is the critical link between PWM score, the estimated binding energy and TF residence time. Estimating TF binding site affinity by comparing PWM scores at face value can lead to a large bias, especially when this includes comparisons between many types of TFs, because several properties of the PWM matrix itself can influence the PWM score. For example, the information content of the PWM matrices is positively correlated to the maximum possible PWM score, as is shown in Figure 1 with an R 2 value of 0.597. Thus, the absolute value of PWM scores cannot be compared directly across different TFs as an indicator of binding site strength. Therefore, proper scaling of PWM score is needed in order to compare binding site affinity across different types of TFs. Based on the methods proposed by Berg et al. [7], the TF binding energy for a specific binding site can be computed by Equation 3 using the estimated λ.
λ calculated by this method are all within the range suggested by [12], which are listed in Table 1 for different organisms. The values for vertebrate species refer to all available vertebrate TFs obtained from the non-redundant PFM JASPAR database. The upper and lower bound of λ across all organisms are quite similar, in the range of 0.25 to 2.83. This indicates that all eukaryotic TFs, no matter which organisms they belong to, all share energetically similar DNA binding mechanisms, since λ can be interpreted as a metric for the chemical property of stickiness between the TF molecule and DNA. To demonstrate the biological application of this parameter, Figure 2 shows an example of the D. melanogaster Even-skipped stripe 1 enhancer with the comparison between PWM score and the affinity estimation using λ scaling. The usefulness of λ estimates becomes apparent when comparing the first two binding sites indicated by blue arrows in this locus; the second binding site has a higher PWM score, but its binding strength is lower than the first binding site once the λ scaling factor is taken into account. Similar situations also appear in the overlapping binding site of Bicoid and Kruppel indicated by the third arrow. Thus, only comparing the raw value of PWM score [11] may lead to false interpretations of binding site importance. When comparing binding site strength for different TFs, we need to use λ to adjust the PWM score.
Next, we calculated λ for each TF in S. cerevisiae, D. melanogaster and available vertebrate TFs in JASPAR [5], which are listed in Supplemental Table 1. Figure  3 shows the overall λ distribution in each group of organisms, showing that the mean values of the λ distribution shift from low to high from S. cerevisiae to D. melanogaster, and then to the vertebrates. The difference in average information content alone does not fully explain this discrepancy (Supplement Figure 1). Interestingly, there is also a shift of mean information content from low to high values from S. cerevisiae to vertebrates, but since the information content is the denominator in Equation 6, it cannot account as a direct cause of the observed trend in the change of λ distribution. Instead, the range of PWM scores that fall into the top 0.1% represented in the numerator of Equation 6 mainly contribute to the differences of the λ distribution.
Furthermore, the comparison of λ across different TF families according to the classification in JASPAR [5] is illustrated in Figure 4. Zinc-finger nuclear receptor family, the basic leucine-zipper family and helix-loop-helix family are the three families with highest average values of λ compared with other groups with t-test p-values equal to 8.3 · 10 −5 ,1.4 · 10 −4 and 6.1 · 10 −4 respectively. Homeobox and forkhead TF family, both of which belong to the helix-turn-helix(HTH) TF super family, appear to have lower average λ compared with the former three families and no difference is detected between these two using t-test (p-value=0.98). The β-βα zinc-finger family, the largest TF family in higher eukaryotes, shows relatively high λ compared to the HTH super-family including the homeobox and forkhead sub-families with t-test p-value of 0.13, and it also has a wide range of λ owing to the great diversity of binding motifs [22]. The high mobility group family shows no significant differences from HTH super-family with t-test p-value of 0.47.
Since λ is the denominator to the PWM score differences between one binding site and the consensus sequence in Equation 3, a larger λ indicates lower mismatch energy when ∆S j is the same. Thus, with the same possible mismatch energy range, if λ is larger, the PWM score can have a greater range from the consensus sequence to the potentially weakest binding site, which indicates the binding motif for the TF family has higher flexibility as suggested by [23]. This is consistent with the fact that the TFs in the zinc-finger super-family including the nuclear receptor and β-β-α zinc-finger families are less constrained to a particular motif than HTH super family. Additionally, cross species comparison of λ indicates that from yeast to vertebrate, more flexible TF motifs are used, which is consistent with the result from [22] that organisms which appeared more recently in evolution tend to use more TFs with motifs of higher flexibility.
Converting λ between different PWM matrices of the same TF In many cases there are two PWMs available for the same TF, and one of these PWMs might already have a reliable estimate of λ, from any number of experimental or computational approaches [13]. In such circumstances, we provide a strategy to estimate the unknown λ associated with the alternative PWM matrix. It would be possible to calculate the unknown λ from Equation 6, but this does not incorporate the additional data available (i.e. the known λ). Our alternative strategy not only incorporates this data, but also loosens the assumption in Equation 6 that the maximum mismatch energy for DNA binding is proportional to information content.
The procedure to compute a suitable λ is based on the concept of sequence-specific residence time (Equation 4), as illustrated in Figure 5. Initially, a well-characterized λ is computed or measured for the first PWM of a particular TF, and then we use this value to derive a λ that is appropriate for the second PWM of the same TF. As part of the calculation of the λ for the second PWM, Figure 5C shows a heatmap of the estimated residence times for a TF named lame duck (lmd) in a particular binding strength quantile, at different values of λ (ranging from 0.1 to 3.0 as suggested by both [12] and the range of estimated λ using Equation 6 across different organisms). Both PWMs for the TF come from FlyFactorSurvey database [24],but they are derived from different reports with motif logos shown in Figure 5B. Blank regions in the heatmap indicate the choice of λ would generate a residence time outside the range of pre-calculated possible residence times using the first PWM and the existing λ value, implying that these λ values for the second PWM are unsuitable. As shown in the heatmap, blank regions often appear in very low values of λ, while if λ is too large, the possible residence time range from weak to strong binding sites is often very restricted, which means high affinity sites cannot be distinguished from low affinity sites efficiently. λ values with residence times all within the reference range can be further selected, as specified in Methods. Figure  5D-F compares the residence time values between two different PWMs, at different values of λ for the second PWM. We see that the λ in Figure 5D and 5F would not allow for consistent residence times between the two PWMs, but Figure 5E does provide consistent results. Therefore, the λ adopted in Figure 5E is picked up as the suitable value for the second PWM matrix. More examples of residence time heatmaps for converting λ between different PWMs are shown in Supplementary  Figure 3.
In order to evaluate the consistency of λ estimation between the above method and using Equation 6, we use the examples of 20 D.melanogaster TFs with more than 1 version of PWMs available from different experiments. These PWMs are obtained from the BioConductor R package PWMEnrich.Dmelanogaster.background [16] and their labels are listed in Supplementary Table 2. Since there are only few λ available from binding profile fitting, just for the purpose of illustration, the reference values of λ were pre-calculated from Equation 6 instead. New λ values for PWMs obtained from other experiments are computed using both methods and they show good consistency with each other with adjusted R 2 equals to 0.88 (Supplementary Figure  4). Converting λ between these two PWMs in the opposite direction also show similar results (data not shown). It indicates that both methods provide consistent estimates of λ, even though they have different core assumptions.

Discussion
TF binding site strength estimation using PWM-based methods is essential for modelling TF-DNA interaction in functional genomics; but a proper scaling parameter is needed when using the PWM score to estimate TF binding energy. Therefore, we provide two independent methods for estimating the scaling parameter λ in different conditions. The simple Equation 6 is widely applicable, since it only requires a PWM matrix as input, which is easy to implement compared to methods using fitting to ChIP-seq [12,13]. Our second method converts a λ specific to one PWM into λ for a different PWM of the same TF. It is based on the definition of sequence-specific residence time from the facilitated diffusion model of TFs on DNA [15]. This method is particularly useful for converting a previously estimated λ into the one associated with a more up-to-date or otherwise alternative PWM matrix.
These two methods are consistent with one another (Supplementary Figure 4) and with previously established methods. For instance, Equation 6 can also provide very similar results compared with the estimated λ from ChIP-seq data fitting for the 5 D. melanogaster TFs in [13], which is shown in Supplementary Figure 5. The consistent value range of λ in different organisms calculated by this method provides additional support for the applicability of this simple equation. Moreover, the estimated distribution of λ values for different TF families make sense in the light of motif choice for each TF families [25]. For example, TFs in the zinc-finger TF super-family including nuclear receptor zinc-finger and β-β-α zinc-finger families have more flexible binding motifs, which can suit a wider range of possible binding sites than the helix-turn-helix super-family, which has a more restricted motif consensus sequence [26]. Inside the same super-family, because of differentiated DNA binding domains and functions, nuclear receptor zinc-finger and β-β-α zinc-finger families also show significant differences of λ distribution, with t-test p-value of 0.006. However, some TF families belong to the same super-family and also share similar binding domain properties can have strong similarity in λ distribution, e.g. homeobox family and forkhead family which both belong to the helix-turn-helix super-family.
There are some points that should be noted when using the simple equation method: first, it cannot be applied to very short TF motifs that are less than 6 base pairs in length. Since this method depends on calculating the difference between the top 0.1% of PWM scores and the maximum score, if the motif is only 5 base pairs in length, the number of possible choices for sequence combination of 5 base pairs is only 1024, then the top 0.1% of PWM scores is the top score. However, most eukaryotic motifs are more than 6 base pairs long. Eukaryotic TFs on average cover 15 bp of DNA with a core motif length of 8-15 bp [8]. Thus, this limitation should not be a problem in the majority of cases. Another assumption in this method is that the mismatch energy tolerance range measured in bits is proportional to the information content of the PWM matrix. This assumption can deal with the bias from the differences in information content of most PWM matrices; however, it might not hold for PWM matrices with extremely high information content. For example, the yeast transcription factor IXR1 has an information content of 47 bits according to the PFM matrix from JASPAR [5], which is substantially larger than the average information content of 13.2 bits. In that case, the binding energy will probably be overestimated, which leads to a lower λ, but these cases are very rare and only 7 PWM matrices in our analysis (less than 1.5%) have information content greater than 20. The mean of information content and the mean of estimated λ for each TF families are listed in Supplementary Table 3, and they show weak positive correlation with p-value of 0.056 (full distribution of information content across different organisms and in different TF families are shown in Supplementary  Figure 3 and 4). Interestingly, the information content is actually the denominator in Equation 7 to calculate λ which means the information content does not affect the results of λ directly. It is the mismatch energy tolerance differences between different TF families that contributes to the variation in λ distribution.
There are two limitations of this method which can potentially lead to some biases between different organisms and different TF families. One limitation is related to the calculation of mismatch energy tolerance in different groups of TF families. We apply a single cut-off threshold of top 0.1% PWM score for weak binding sites suggested by [20], but it could be possible for different TF families, different threshold should be used due to variations in their DNA binding domains. However, it is difficult to choose specific thresholds for every TF family based on the currently available data. Another limitation is the assumption that mismatch energy tolerance of a TF is proportional to the information content of the PWM matrix. It is possible that such relationship is not linear but more complicated, which is difficult to verify. Further, from the definition of information content of the PWM matrix, it sums up information content gain from each nucleotide [19], which implies longer motifs including more flanking base-pairs will have slightly higher information content compared to the shorter ones just with core motifs, which is an artefact of computation. However, there is no satisfactory way to deal with this problem. One possible solution is using the information content per nucleotide instead of the total information content, but this may be even more detrimental as the information content contributed by flanking sequences constitutes only a very small fraction compared to core motifs. Thus, if dividing total information content by the length of the motif, the dilution of information content can lead to even bigger biases. Another potential solution is trying to define a core motif from one PWM matrix, but this requires detailed knowledge about the TF of interest. Additionally, λ will not be a reliable measure of biochemical stickiness of the TF to the DNA if the PWM itself is not an accurate representation of TF binding. A PWM assumes that each nucleotide position independently contributes to TF binding affinity, which may not be the case [27,28]. In addition, the composition of the position frequency matrix of the PWM may contain biases due to the difficulties of attaining an unbiased validated binding site set. Nevertheless, λ can give us insights about DNA binding properties of TFs.
Also, it should be pointed out that residence time in this paper refers to an estimate based on the biophysical model proposed in [10,15]. However, other papers report inconsistent scales of residence time according to different experimental approaches. For example, the residence time estimations obtained by Competition-ChIP methods [29] do not share the same order of magnitude compared to the residence times measured by FRAP or single molecular tracking [30, 21, 31], which can probably be an artefact of experimental methods or alternatively, the range of residence time truly varies greatly across different TFs [32]. Because the experimentally determined values are not comparable to each other, we simply adopt bioinformatics-based approaches to compute residence time. Since our method converts λ between different PWM matrices of the same TF under the concept of residence time, it avoids to fit inconsistent experimental observations and potential variations in DNA-binding kinetics for different TFs.
Although in many cases PWMs are not optimal representations of binding motifs, they have become almost universally adopted to identify potential TF binding sites. However, it is important to remember that the value of a PWM score is not directly correlated to the binding energy, but rather depends on the scaling parameter λ. Previously, researchers either assumed that λ has similar values across different PWMs or estimated it through computational intensive binding profile fitting methods [12,13]. Here we provide two simple strategies for estimating λ, which will let us more clearly link PWM scores with the energetics of TF binding.  The relationship between maximum PWM score and information content of PWM matrices. Individual dots represents each PWM matrix generated from the non-redundant PFM JASPAR-CORE database [5] after the filtering procedures specified in the Methods section. There is a strong positive correlation between the information content of the PWM and the maximum possible PWM score that could be generated by that PWM, with an R 2 value of 0.597. Figure 2: A comparison between PWM score and binding site strength in the D. melanogaster even-skipped stripe 1 enhancer. The even-skipped stripe 1 enhancer on chromosome 2R is dense with binding sites. We compare the raw PWM scores (circles) and the λ-scaled binding strength (height of the bars) for each of these binding sites, colour-coded by the type of TF. Based on raw PWM scores, one might assume that the Caudal site indicated by the first blue arrow would have a lower binding strength than the Kruppel site indicated by the second blue arrow; however, once the binding strength is scaled by λ using Equation 5, it becomes evident that the opposite is the more likely scenario. The third arrow points to a location where a Kruppel and a Bicoid binding site overlap. Here, the λ adjusted binding strength estimates would suggest that Bicoid binding site is stronger, while a raw PWM score would suggest the opposite. These results illustrate how using raw PWM scores may result in biased interpretation of the relative binding strength of TFs.   Figure 4: λ distribution comparison across major TF families. BBA-ZF represents the λ distribution for β-β-α zinc-finger family; NR is zinc-finger nuclear receptor family; L-zipper stands for the basic leucine-zipper family; HLH is helix-loop-helix family; Homeo is homeobox family; FK is fork-head family and HMG is high mobility group family. For each group, λ was calculated by Equation 6. BBA-ZF represents the λ distribution for β-β-α zinc-finger family; NR is zinc-finger nuclear receptor family; L-zipper stands for the basic Leucine-zipper family; HLH is helix-loop-helix family; Homeo is homeobox family; FK is forkhead family and HMG is high mobility group family.  [16]. Each column of the heatmaps represents a specific λ value and each row represents a specific binding site strength level.  Figure 6. Comparison of λ values calculated by using different pseudo-count values in PWM matrices. Subfigure A shows the comparison between the λ values obtained by using PWM matrices with pseudocounts of 1 and 3 (the adjusted R 2 is 0.973), while subfigure B compares pseudocounts of 1 and 0.3 (the adjusted R 2 is 0.978). Each dot represents a TF from 100 randomly chosen vertebrate TFs in JASPAR database [5].