A reexamination of information theory-based methods for DNA-binding site identification
© Erill and O'Neill. 2009
Received: 14 October 2008
Accepted: 11 February 2009
Published: 11 February 2009
Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.
Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.
We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
Even though much progress has been made since the first genomic sequences became available, the identification of transcription factor (TF) binding sites in genomic sequences remains a considerable challenge in bioinformatics. In recent years, this problem has been aggravated by the ever-increasing pace of genome sequencing, the realization that junk DNA was a considerable misnomer, and by the need to reconcile inferences from high-throughput assays with the underlying genome sequence. New high-throughput technologies, like ChIP-chip and ChIP-Seq [1, 2], can contribute significantly to reducing the search space involved in the identification of some TF-binding sites, but theoretical models of binding sites are still required to gain insight into their function and mechanism, and to tackle the general problem of binding site identification in the absence of high-throughput experimental data.
Over the years, the quest for identifying TF-binding sites has taken two natural and complementary approaches, relying either implicitly or explicitly on experimental data. On the one hand, de novo motif discovery methods like MEME, consensus-building, Dyad-Analysis or Gibbs sampling [3–6] use implicit experimental data to uncover overrepresented candidate TF-binding sites in the promoter regions of a set of genes that are known to be co-expressed or co-regulated. On the other hand, different binding site search methods have also been developed to exploit explicit data on the sequence and location of known TF-binding sites [7–10]. In binding site search, data is provided by collections of aligned known sites often referred to as motifs or prototype groups. This work deals with binding site search methods and, in particular, with those relying on the application of information theory to DNA sequences.
HI(l) = ln[(f c (l) + 1/N) / (f o (l) + 1/N)]

where f c (l) corresponds to the frequency of the consensus base at position l of the prototype group, f o (l) is the frequency of the base observed at position l of the site and N is the number of sequences in the prototype group (1/N acting as a small sample correction to avoid zero frequencies). If one assumes positional independence, a global HI for the whole site can be computed by summing HI(l) over all site positions.
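The heterology index just described can be sketched in code. This is an illustrative implementation only: the count-based prototype representation, the variable names and the use of natural logarithms (as in Berg & von Hippel's original formulation) are assumptions, not the authors' code.

```python
import math

def heterology_index(prototype, site):
    """Heterology index of a candidate site against a prototype group.

    prototype: one dict of base counts per motif position, built from N
    aligned known sites (all positions share the same N). The 1/N term
    is the small sample correction mentioned in the text.
    """
    N = sum(prototype[0].values())
    hi = 0.0
    for position, base in zip(prototype, site):
        f_cons = max(position.values()) / N   # consensus base frequency
        f_obs = position.get(base, 0) / N     # observed base frequency
        hi += math.log((f_cons + 1 / N) / (f_obs + 1 / N))
    return hi

# A consensus-matching site has zero heterology; mismatches raise HI.
proto = [{"A": 8, "C": 1, "G": 1, "T": 0}, {"A": 0, "C": 9, "G": 1, "T": 0}]
assert heterology_index(proto, "AC") == 0.0
assert heterology_index(proto, "TT") > 0.0
```

Note that lower HI means a better match, so HI is used as a penalty rather than a score when ranking candidate sites.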
H(X) = -Σ i=1..N p(x i ) · log2 p(x i )

where N is the number of possible values (x i ) the variable X can take.
The expression for Shannon entropy is very similar to the Boltzmann-Gibbs entropy in thermodynamics, but they are quite different in substance [18, 19]. As expected, Shannon entropy (entropy henceforth) is maximal when all possible states of X are equiprobable and independent, since in this situation our uncertainty about which state we will observe is the greatest and thus the amount of information the variable conveys is also maximized. The reduction in uncertainty (or information gain) that takes place during communication over a noisy channel is known as mutual information and is expressed in terms of the difference in uncertainty over the original message (X) before and after we receive a version of it (Y): I(X;Y) = H(X) - H(X|Y)
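As a minimal sketch of the entropy definition above (plain Python; the example distributions are illustrative):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uncertainty (and hence the information conveyed) is maximal when all
# states are equiprobable:
h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits
h_skewed = shannon_entropy([0.7, 0.1, 0.1, 0.1])
assert h_skewed < h_uniform
```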
H before = -Σ S f(S) · log2 f(S)

where S corresponds to each of the four possible DNA bases and f(S) represents its relative frequency in the genome sequence.
H after (l) = -Σ S p(S l ) · log2 p(S l )

where p(S l ) is the frequency of each base S l at position l in the prototype group. The positional information content is then given by R sequence (l) = H before - H after (l).
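A sketch of R sequence as the sum of per-position information, assuming the entropy definitions above (the frequencies are illustrative; on an equiprobable background H before is 2 bits):

```python
import math

def entropy(freqs):
    """H = -sum f * log2 f over the base frequencies in a dict."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

def r_sequence(motif, background):
    """Sum of per-position H_before - H_after(l), in bits (sketch)."""
    h_before = entropy(background)
    return sum(h_before - entropy(pos) for pos in motif)

bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
motif = [
    {"A": 1.0},                                     # fully conserved: 2 bits
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # random: 0 bits
]
assert abs(r_sequence(motif, bg) - 2.0) < 1e-9
```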
In the case of a TF-binding site, this equates with the specificity of the site recognition process. By definition, mutual information has a maximum at H(X), corresponding to the case of a noise-free channel (i.e. perfect site recognition; H(X|Y) = 0), and a minimum of 0 when X and Y become independent (H(X|Y) = H(X)). For a given protein, the specificity of the site recognition process is a constant defined by H(X|Y). Therefore, mutual information is maximal whenever H(X) is maximized, which in the case of genomic sequences corresponds to an equiprobable base distribution.
RE(l) = Σ S p(S l ) · log2 [p(S l ) / f(S l )]

As in R sequence , positional independence may be assumed in order to generate a global RE value for the whole site by summing up RE(l) for all positions. The Kullback-Leibler divergence is also measured in bits, allowing direct comparison with R frequency . Following their initial introduction by Schneider et al., both R sequence and RE have been used by different authors as a measure of the information content in binding motifs [22, 23].
Rewritten in this form, cross-entropy can be conceptualized as a weighted version of the a priori entropy (H before ). For each of the four possible bases in a motif position (S l ), the a priori entropy is now weighted up or down depending on the ratio between the motif and background frequencies for that particular base, p(S l )/f(S l ). In this manner, if a base is, for instance, underrepresented in the genome but conserved in the motif, its contribution to the a priori entropy will become higher and, consequently, RE(l) will increase. Conversely, a conserved base that is overrepresented in the genome will contribute less. As a consequence, in a skewed background RE is larger for motifs relying on underrepresented bases, agreeing with R frequency predictions, in which "rarer" sites require additional information in order to be found.
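The effect described above can be reproduced with a direct implementation of the Kullback-Leibler divergence (a sketch; the skewed background composition is illustrative):

```python
import math

def relative_entropy(pos_freqs, background):
    """RE(l) = sum p(S) * log2(p(S)/f(S)): Kullback-Leibler divergence
    between a motif position and the genomic background, in bits."""
    return sum(p * math.log2(p / background[base])
               for base, p in pos_freqs.items() if p > 0)

at_rich = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}   # illustrative skew
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# A conserved base that is rare in the genome (G here) contributes more RE
# on the skewed background; an overrepresented one (A) contributes less.
assert relative_entropy({"G": 1.0}, at_rich) > relative_entropy({"G": 1.0}, uniform)
assert relative_entropy({"A": 1.0}, at_rich) < relative_entropy({"A": 1.0}, uniform)
```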
Apart from the aforementioned Heterology Index of Berg & von Hippel, which serves directly as a search function, several other methods have been proposed over the years to search for TF-binding sites based on the availability of a prototype group of experimentally validated binding sites. Even though some of them were proposed before the introduction of the information theory framework, they can all be derived from the expressions for R sequence and RE seen above.
R i (l) = H before + log2 p(S i, l )

where p(S i, l ) is the frequency of occurrence in the prototype group of the base S observed at position l of the query sequence i. As in the case of R sequence , positional independence is assumed and the score for the full sequence i is the sum of R i (l) over all its positions.
I seq (l) = log2 [p(S i, l ) / f(S i, l )]

This expression explicitly takes into account the background genomic frequencies f(S i, l ) and again assumes positional independence to obtain an additive score for the full site. In this work we label this term I seq to avoid confusion with the relative entropy (RE) term from which it derives.
A fundamental problem of both I seq and R i is that they discard information on the relative importance of each position within the motif. This is clearly illustrated by a simple example. Suppose that for a given position a of a motif we have prototype frequencies pa(A) = 0.6, pa(C) = 0.4, pa(T) = 0.0 and pa(G) = 0.0. If we observe a C in our query sequence, then R i (a) = H before + log2(0.4). It is easy to see, however, that if position b of the motif has prototype frequencies pb(A) = 0.2, pb(C) = 0.4, pb(T) = 0.2 and pb(G) = 0.2 and we again observe a C in the query sequence, R i (b) = H before + log2(0.4). That is, R i is assigning the same score to a C observed in a relatively well conserved position (a) and to a C observed in a nearly random one (b). This result is counterintuitive in the sense that we would expect a match in a conserved position to be more significant than a match in a poorly conserved one. O'Neill pointed out this problem and suggested two alternative methods to take into account the importance, or weight, of each position in the prototype group [27, 28].
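The example can be checked directly (a sketch; the per-position score is taken here as H before plus log2 of the observed base's prototype frequency, on an equiprobable background):

```python
import math

H_BEFORE = 2.0  # a priori entropy of an equiprobable genome, in bits

def r_i_position(prototype_freqs, observed_base):
    """Per-position R_i score: it depends only on the frequency of the
    observed base in the prototype, not on how conserved the position is."""
    return H_BEFORE + math.log2(prototype_freqs[observed_base])

# The two positions from the example: a is well conserved, b nearly random.
p_a = {"A": 0.6, "C": 0.4, "T": 0.0, "G": 0.0}
p_b = {"A": 0.2, "C": 0.4, "T": 0.2, "G": 0.2}

# A 'C' (frequency 0.4 in both) receives exactly the same score at each.
assert r_i_position(p_a, "C") == r_i_position(p_b, "C")
```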
A query sequence concordant with the prototype will yield a positive value because R + sequence (l) will be improved by the addition, whereas a query sequence discordant with the prototype will result in a negative value.
Historically, there has been substantial dissension over the appropriate definition of information content (R sequence or RE) [20, 22, 23], the suitability of the positional independence assumption [29, 30] and the relative efficiency of the abovementioned methods and later variants [9, 22, 28] for locating TF-binding sites in both equiprobable and skewed genomic backgrounds. Unfortunately, at the time most of these methods were developed there was not enough experimental data to test their shortcomings and advantages in a real biological setting and, even in relatively recent studies, most search efficiency results have been presented on randomly generated backgrounds. In this work we make use of newly available data on experimentally validated binding sites across different species to assess the limits of the different search methods, to gauge the suitability of alternative definitions of information content and to expose the drawbacks of benchmarking on random sequence. The results reported here point to substantial misconceptions in the derivation of information theory methods, leading us to propose a complete reformulation of the concept of information content in binding sites. Consequently, they have deep implications for the understanding of binding site evolution and for the assessment of binding site sequence function in the search and recognition processes.
By using the ratio between consensus and observed frequencies as a multiplicative factor on a stripped-down version of R i , FitomHI ensures that the intuitive relationship derived by Berg & von Hippel is explicitly taken into account when scoring candidate sites. As can be seen in Figure 1, however, the FitomHI index does not outperform other methods (such as R i ), suggesting that the hypothesis behind the Berg & von Hippel scheme might have been misguided.
The relatively poor performance of the FitomHI index highlights another obvious but nonetheless important observation regarding the results of Figure 1. As can be readily seen, methods that do not take into account the importance of each position in the prototype group (i.e. non-weighted methods: R i , HI, I seq ) consistently outperform those that do integrate this factor (weighted methods: R sequence · BvH, R ' sequence ), with the proposed FitomHI index falling somewhat in between. As in the case of FitomHI, this is at first glance an unexpected and counterintuitive result, since weighted methods have been shown previously to perform well in searching and to excel at ranking TF-binding sites according to their experimental binding affinity. Moreover, both the notion of positional weighting and of a ratio between consensus and observed bases are intuitively appealing.
The reason why weighted methods perform poorly in search mode when compared to non-weighted ones is, nonetheless, relatively straightforward. By down-weighting poorly conserved positions, weighted methods concentrate their scoring on a smaller number of conserved positions, thereby increasing the chances that "correct" bases might appear by chance at those positions during a genome-wide search and thus leading to a larger number of false positives. Conversely, non-weighted methods assign the same importance to all motif positions, lowering the odds that false positives may arise by chance. In this context, FitomHI can be seen as a crude weighted method, since it is taking into consideration part of the information profile through its explicit use of the consensus-to-observed frequency ratio.
The superiority of non-weighted methods over weighted ones in binding site searches raises important questions regarding site recognition by proteins. To a certain extent, the problem of ranking binding sites can be equated with binding affinity, while the search problem ostensibly equates with the protein's ability to locate its binding sites. Traditionally, it has been assumed that binding site affinity and binding site location are intrinsically linked at the protein level and, thus, models developed for one problem have been applied to the other without much consideration. However, the disparity in performance between weighted and non-weighted methods on the search problem suggests that this may not be a good practice. The intuitive concepts behind weighted methods and the Berg & von Hippel approach were initially introduced to deal with the ranking problem and thus they may not apply as well to the related search problem. Furthermore, the main difference between the two kinds of methods (i.e. positional weighting) points to a mechanistic difference between these two modes of action of DNA-binding proteins.
The fact that non-weighted methods outperform weighted ones in genome-wide searches suggests that information lying in poorly conserved motif positions is being used actively by the protein to discern true binding sites against the genomic background. As mentioned above, the equal appraisal of all site positions by non-weighted methods has the net effect of reducing the number of possible false positives. However, given the nature of protein-DNA interactions, it is unlikely that such discrimination is achieved by specific recognition on all motif positions. Instead, the uniform use of all site positions in non-weighted methods seems to be taking into account secondary information (e.g. AT-richness) residing in poorly conserved positions that can be of relevance to the protein in order to make non-specific contacts or as a requirement for optimal curvature or bendability. In contrast, the better performance of weighted methods in ranking sites according to their binding affinity indicates that conserved motif positions are the main players in determining the strength of a site. In agreement with this, the mean difference in search efficiency between weighted and non-weighted methods decreases (from 15.1% for Fis to 0.3% for LexA) as motif conservation increases, suggesting that there is an increasing dependence on secondary information sources for proteins targeting less conserved sites, as would be required for these sites to remain functional.
The resulting disparity between weighted and non-weighted methods is not the only clue pointing towards the use of additional information in the process of site location. At R sequence = 10.09 bits, CRP is substantially underspecified to cover its 210 experimentally validated sites, since R frequency predicts that at least 14 bits should be necessary to specifically locate 210 sites on the E. coli genome. This implies that, on average, 28% of the information required to specify true CRP sites is not present as positional information in R sequence . In fact, the estimated number of sites for CRP based on the equivalence between R sequence and R frequency is about 4,300, but even on a 1/30 true- to false-positive ratio (i.e. accepting ~6,100 false positives) the best search method is only able to retrieve 80.7% of the true sites (data not shown). This means that nearly 20% of true CRP sites are left unaccounted for when using information theory-based methods for locating them. Moreover, the set of non-located true sites displays very low R sequence (6.17 bits), suggesting again that other sources of information should be exploited to improve these predictions; the protein could not function were it actually faced with the challenge of 100,000 pseudo-sites as this low information level suggests. Experimental results have already hinted at the existence of several complementary sources of information for site location, such as curvature [33–36], pre-recruitment or cooperative binding [37–39]. As formulated originally, information theory-based methods cannot take into account this additional information, but they provide a robust theoretical foundation to develop more complex methods that incorporate it explicitly. In fact, several higher order models based on information theory that include contextual information have already been proposed [40–42].
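The arithmetic behind these figures can be sketched as follows (assuming a round 4.6 Mbp figure for the E. coli genome; R frequency is computed here as the bits needed to single out a given number of sites among all genomic positions):

```python
import math

G = 4_600_000  # approximate E. coli genome length in bp (assumed round figure)

def r_frequency(n_sites, genome_size=G):
    """Bits needed to single out n_sites positions in a genome (sketch)."""
    return math.log2(genome_size / n_sites)

# Specifying 210 CRP sites requires more than 14 bits, while the CRP
# motif carries only ~10.09 bits of R_sequence:
assert r_frequency(210) > 14

# Conversely, a motif of R bits is expected to match about G * 2**-R
# genomic positions; for CRP this gives roughly 4,300 sites:
expected_sites = G * 2 ** -10.09
assert 4000 < expected_sites < 4500
```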
RE' (l) = RE- (l) · (RE+ (l) - RE- (l))
The motifs for both CRP and Fur TF-binding sites are manifestly AT-rich and, as expected, binding sites for both proteins become more or less apparent in, respectively, GC- or AT-skewed backgrounds. In accordance with this fact, RE-based methods (i.e. I seq and RE · BvH), which have been devised to take into account explicitly the deviation of sites from the background skew, consistently outperform R sequence -based methods on skewed backgrounds, although there are noticeable differences depending on the background skew and the motif searched. In GC-rich backgrounds, both AT-rich sites are relatively easy to locate. Thus, the downplaying of the few G/C positions carried out by the RE non-weighted method (I seq ) is not a strong advantage over its R sequence counterpart (R i ). This does not hold true for weighted methods, which discard a large proportion of the AT-rich sites by focusing on conserved positions, allowing RE · BvH to clearly outperform R sequence · BvH when looking for CRP. On the other hand, searches on AT-rich backgrounds yield a completely different picture. By playing down the dominant A/T positions in the motifs and emphasizing the scant G/C ones, RE substantially alters the shape of the information profile. As a consequence, RE-based methods are able to separate Fur and CRP sites from the AT-rich background much more efficiently than R sequence -based methods, and this applies both to weighted and non-weighted methods.
As in the case of equiprobable backgrounds, these results stress again the need to validate search methods against real genomic data in order to derive meaningful results. Moreover, they also point out that the rationale for the derivation of RE and its resulting indices (I seq , RE · BvH) may be partly flawed. Schneider et al. proposed RE as a way to extend the equivalence between R sequence and R frequency in equiprobable backgrounds to skewed genomes. This line of reasoning has later been utilized by Stormo and coworkers [25, 26]. A main flaw in their argument stems from the fact that R frequency was derived from the expected frequency of occurrence of sites in a uniform background. As noted above, however, in a real skewed genome the occurrence of anti-skew stretches can be far from uniform, and the net effect of this biased distribution is to make the use of RE meaningless or even counterproductive, as in the case of Fur sites in P. aeruginosa.
R sequence is free from the artifacts created by deviations in oligonucleotide distribution and similar factors involved in the search problem in that it is a measure only of positional information. As such, it is a more reliable indicator of motif positional information content than RE. Therefore, R sequence -derived methods (e.g. R i ) should be expected, on average, to perform better than RE-based ones. It should be pointed out here that an often implied argument for the use of RE over R sequence , the advent of negative information content in skewed genomes, is based on a misconception. Computing R sequence for a collection of E. coli sites against a skewed background may indeed generate negative R sequence values, but this perplexing result is an artifact of the transplantation of the E. coli motif onto a skewed background rather than a fault in the formulation of R sequence . On a skewed genome, the a priori entropy (H before ) is reduced because of the background skew. To obtain negative values for R sequence , nucleotide frequencies in some positions of the binding motif must be close to equiprobability. In this case, the a posteriori entropy (H after ) will be greater than the a priori one, leading to a negative R sequence value. This is indeed the case of most non-conserved positions in many E. coli motifs when evaluated on a skewed background. However, it is easy to see that, for real binding sites evolving in a skewed genome, positions that are not important for binding will remain at background genomic frequencies (instead of being actively selected towards equiprobability), thus leading to positive or, at the most, zero values for R sequence .
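The artifact can be reproduced numerically (a sketch; the skewed background composition is illustrative):

```python
import math

def entropy(freqs):
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

skewed_bg = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # illustrative skew
h_before = entropy(skewed_bg)
assert h_before < 2.0  # skew lowers the a priori entropy

# Transplanting a near-equiprobable motif position onto this background
# yields H_after > H_before, i.e. a negative per-position R_sequence:
equiprobable_pos = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
assert h_before - entropy(equiprobable_pos) < 0

# A position left at background frequencies (as expected for unimportant
# positions evolving in this genome) contributes exactly zero:
assert h_before - entropy(skewed_bg) == 0
```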
The failure of RE-based methods to outperform R sequence -based ones in real genomes casts serious doubts on the validity of this approach and its main underlying assumption, the equivalence between R frequency and R sequence . However, the inadequacy of RE to deal with non-uniform n-mer frequencies is not the only result pointing to the breakdown of the equality between R frequency and R sequence . A main corollary of the hypothesis for deriving RE is the assumption that when a genome drifts towards skew in its base composition, DNA-binding proteins should evolve to recognize binding sites with an anti-skew composition, thus maximizing the efficiency of binding site location at a lesser cost in overall base conservation. This line of reasoning was explicitly developed by Schneider et al. They noted that sites with anti-skew composition would eventually lose positional information (R sequence ) in the course of evolution, since selective pressure towards site conservation would be reduced because the site would be over-specified and therefore easier to locate. In other words, integrating anti-skew composition in a measure of positional information, as RE does, would compensate for the loss of standard positional information (R sequence ). The results shown in Figure 4 and Figure 5, however, suggest that this is not generally the case.
On the one hand, the P. aeruginosa Fur protein seems to control a regulon of about the same size (32 known sites) as that of E. coli Fur (51 known sites) and its motif positional information content (R sequence = 13.69 bits) is similar to that of E. coli Fur (R sequence = 14.31 bits). The ratio between these R sequence values is in accordance with previous estimates for Fur information content based on optimized alignments for a smaller number of sites (20 sites and 18.6 bits for P. aeruginosa Fur; 24 sites and 19.6 bits for E. coli Fur). Sitting in the 66.56% GC P. aeruginosa genome, however, the Fur motif has a markedly anti-skew composition (70.72% AT). The resulting RE value of 20.72 bits should allow Fur to target specifically as few as two sites in the whole genome. Thus, if the main hypothesis behind RE were true, P. aeruginosa Fur could have discarded a substantial part of its positional information (R sequence ) by relying on anti-skew composition. Instead, P. aeruginosa Fur maintains a R sequence value similar to that of E. coli Fur despite the loss in genomic entropy (H before ) due to genomic skew (i.e. P. aeruginosa Fur sites are more conserved than E. coli Fur sites). On the other hand, in the 38.1% GC genome of H. influenzae, CRP sites are strongly conserved (R sequence = 17.83 bits) in comparison to those of E. coli CRP (10.09 bits). A plausible explanation for this effect could be an error due to small sample size (there are 45 described CRP sites in H. influenzae for 210 in E. coli), but using the small sample correction proposed by Schneider et al. on H. influenzae CRP sites only decreases H. influenzae CRP R sequence to 16.51 bits. Moreover, iterated sub-sampling of the 210 E. coli sites into 45-site prototype groups does not yield enough deviation to explain the observed 7 bit increase either (Figure 6b). In fact, the maximum R sequence value for any of the 10,000 sampled groups is still 4 bits away (13.89 bits) from the H. influenzae profile.
As mentioned above, a corollary of RE is the prediction that DNA-binding motifs should tend to evolve against the skew in order to profit from easier location. However, the H. influenzae CRP protein is not relying substantially on anti-skew composition to detect its sites. Instead, efficient location of these AT-rich sites in the AT-rich background of H. influenzae seems to be based entirely on increased R sequence . If anything, both H. influenzae CRP and P. aeruginosa Fur sites seem to have adapted towards the skew, not against it. The P. aeruginosa Fur motif is 70.72% AT (for 74.71% AT in E. coli), while the H. influenzae CRP motif is 69.89% AT (for 64.68% AT in E. coli). In summary, both motifs have moved towards the skew, and both have become more conserved.
Since its introduction in 1986, the assumption of equality between R frequency and R sequence that lies at the core of RE has been considered a de facto axiom of information theory applied to binding sites and has shaped the way we think about binding site search, specificity and evolution. However, the results presented above stand in open contradiction with the predictions made by this hypothesis and thus compel us to reconsider its validity and applicability.
In 2000, Schneider showed by means of a genetic algorithm that, given some constraints, R sequence would evolve towards R frequency . Later on, Kim et al. applied a more formal mathematical treatment to the same problem and concluded that deviations between R sequence and R frequency are constrained to a very small range. An important assumption in both analyses is the use of an on-off switch model for the transcription factor (i.e. sites are either recognized or not according to a threshold). Even though some studies suggest that the transition from sites to non-sites is relatively sharp for some transcription factors, the use of an on-off model is still a strong assumption, since it is well known that transcription factors present a varied range of binding affinities for the binding sites they recognize [48–51]. Therefore, the use of a "black/white" approach centers the ensuing analysis exclusively on the problem of how the protein identifies its target sites in the genome, disregarding completely any functional requirements of the protein for differentially regulating its different binding sites. It is worth noting here that a final implicit constraint in both analyses is the assumption of an equiprobable background. This is important because, as will be shown, it is only in such a context that R sequence equates to a significant degree with search and, therefore, with R frequency .
As outlined in the introduction, R sequence and R frequency measure subtly different things. R sequence is associated with the uncertainty of the recognition process, while R frequency measures the uncertainty in terms of distinguishing a sequence from the genomic background. Therefore, R frequency is intrinsically linked to the search problem, but R sequence is only partly related to it. In an equiprobable background, where the equality between R sequence and R frequency was first postulated, R sequence is substantially related to the search problem. This is because location of binding sites by the protein proceeds by Brownian diffusion, with the protein contacting DNA in a random manner. Contacts between DNA and protein can be non-specific (totally electrostatic) with non-sites or specific for sites according to the protein profile (true sites and pseudo-sites). Thus, search efficiency improves as the affinity of the protein for its true sites increases (i.e. R sequence increases), since this implies that fewer genomic positions will qualify as pseudo-sites for the protein. Therefore, the protein will spend less time engaged in specific binding with pseudo-sites during its random walk and the average time to locate its true sites will be significantly reduced. In this setting, R sequence can indeed approximate R frequency to a substantial degree. This is an unfortunate coincidence, because it tricks us into assuming that ranking and searching are equivalent problems for the protein. In doing so, we thus disregard any constraints on R sequence imposed by the ranking problem.
By means of a little thought experiment it can be shown that ranking and searching are in fact separate processes operating simultaneously on TF-binding sites. One can easily envision a 22 bp motif for a transcription factor with 5 totally conserved and 17 equiprobable positions. Such a motif would have an R sequence value of 10 bits, roughly the same amount as E. coli CRP. The transcription factor recognizing such a motif would still be able to locate its binding sites with relatively high efficiency on an equiprobable background, but it would have no way of grading its response among them. It would, effectively, have become an on/off switch. Since it is known that this is not the way many transcription factors operate with regard to their sites, one must acknowledge that there are at least two separate processes (affinity ranking and site search) contributing to R sequence . As a matter of fact, it is the way in which these two factors are integrated into R sequence that will determine the degree of equivalence between R sequence and R frequency .
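The thought experiment can be verified directly (a sketch of the hypothetical motif on an equiprobable background):

```python
import math

H_BEFORE = 2.0  # a priori entropy of an equiprobable background, in bits

def position_entropy(freqs):
    return -sum(f * math.log2(f) for f in freqs if f > 0)

# A hypothetical 22 bp motif: 5 fully conserved positions plus 17
# equiprobable ones.
motif = [[1.0]] * 5 + [[0.25, 0.25, 0.25, 0.25]] * 17
r_seq = sum(H_BEFORE - position_entropy(pos) for pos in motif)

# R_sequence comes to 10 bits (5 positions x 2 bits), close to E. coli
# CRP, yet the factor could only act as an on/off switch: all its sites
# are identical at every informative position.
assert abs(r_seq - 10.0) < 1e-9
```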
For any given motif size, the values of R sequence at each position yield two obvious limits with regard to the possible range of an affinity ranking function. On the one hand, in a motif with fully conserved positions (maximum motif R sequence ) all sites are identical and cannot be differentially regulated, even though they can be located in the genome with the highest efficiency. On the other hand, a motif in which all positions are equiprobable (minimum R sequence ) makes it impossible either to distinguish among sites or to discern them against an equiprobable background. Obviously, as in the case of the thought experiment described above, there are many combinations of both situations that also yield minima for ranking range while providing different degrees of search efficiency and R sequence values. Between these extremes, however, there lies a wide range of combinations providing different R sequence values and affinity ranges. It must be noted, however, that R sequence does not provide direct information on the transcription factor operating range. Instead, this information can be found by examining the distribution of affinity values for each binding site of the prototype group.
If one assumes an equivalent protein concentration for the different E. coli transcription factors shown in Figure 8b, it can be seen that transcription factors targeting motifs with low R sequence values, like Fis, present a strong dispersion in their effective binding affinity, since all but the best sites become rapidly indiscernible from the background. For higher R sequence values, transcription factors can exploit their potential affinity range in different manners, aiming at reaching a balance among site conservation (R sequence ), the desired effective affinity range and a viable concentration of transcription factor. Based on the results shown in Figure 8, it can be argued that transcription factors covering a large number of sites, like CRP, sacrifice part of their linear affinity range in order to effectively cover the vast majority of their sites without incurring a large cost in conservation (R sequence ) and protein concentration. On the other hand, transcription factors for more specific responses like Fur, which target a lower number of sites, can maintain higher R sequence values and higher linear ranges. This would allow such transcription factors to operate strongly on a number of sites while maintaining a much looser control on the rest of the regulon. A biological rationale for this mode of operation can be the necessity to strongly activate/repress several genes important for the specific response while applying more relaxed regulation to less specific genes.
Without prior knowledge of the specific functional requirements for a given transcription factor, it is difficult to predict the evolutionary pathway it will follow to reach equilibrium between R sequence, protein concentration and its effective affinity range. The SOS response repressor LexA, however, poses an interesting case, since part of its functional requirements is well known. LexA targets around 30 palindromic binding sites in E. coli and its binding motif has an R sequence value of 20.27 bits. This has long defied interpretation by the standard information theory approach, because R frequency calculations indicate that the LexA binding motif ought to contain 17.39 bits (R frequency). Instead, the observed R sequence suggests that LexA is over-specified to the point of efficiently targeting as few as 4 sites in the E. coli genome. Schneider and Stormo suggested that specific binding of other proteins to T7 promoters might account for extreme over-specification in these sites, but no such cross-interaction has ever been described for E. coli LexA-binding sites. Most probably, the reason for the over-specification of LexA lies in its regulation of the sulA gene, encoding a cell division inhibitor that causes lethality in lexA-defective mutants, and of several DNA damage-inducible error-prone polymerases and DNA helicases that can substantially hamper viability if unregulated [55–58]. As has been postulated previously, the negative effects of these genes require that they be kept under very tight repression in normal circumstances [48, 56, 58, 59]. Figure 8b shows that LexA enforces efficient repression of key genes by combining a relatively high protein number (1300 molecules per cell) with an unexpectedly high degree of site conservation (R sequence). This allows LexA to operate effectively in its original linear range, as the search process contributes little to the effective affinity range.
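The over-specification argument compares the observed R sequence against R frequency, the minimum information needed to single out the sites in the genome. A hedged sketch of both quantities follows; the genome size and the single-strand convention are our assumptions, so the decimals differ slightly from the published 17.39-bit figure:

```python
from math import log2

GENOME_SIZE = 4_600_000  # approximate E. coli K-12 genome length (assumption)

def r_frequency(genome_size, n_sites):
    """Bits needed to single out n_sites positions among genome_size
    candidates (Schneider's R_frequency)."""
    return log2(genome_size / n_sites)

def discernible_sites(genome_size, r_seq):
    """Number of positions a motif carrying r_seq bits can single out."""
    return genome_size / 2 ** r_seq

print(r_frequency(GENOME_SIZE, 30))           # ~17.2 bits for ~30 LexA sites
print(discernible_sites(GENOME_SIZE, 20.27))  # ~3.6, i.e. as few as 4 sites
```

The excess of the observed 20.27 bits over R frequency is what the compound-measure interpretation attributes to binding affinity requirements rather than to search.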
By maintaining a high ratio (~1/6) with the concentration of inducer (RecA), the system is able to guarantee a fast response time, which is also known to be a requirement of the SOS response [60, 61].
Here we assessed the efficiency of the R i search method operating on a collection of weakened CRP sites. The weakened collection (R sequence = 7.27 bits) was derived from the highly asymmetrical CRP motif by substituting, for each site, the strong half-sequence with a mirror copy of the weak half-sequence (Figure 9 inset). Despite a 31% reduction in information content on an already under-specified motif, and the arbitrary introduction of artificial symmetry, the search results for the mirrored CRP collection are quite close (5.5% difference) to those obtained with the original, asymmetric CRP collection (Figure 9). The fact that similar results are obtained when the same mirroring procedure is applied to other palindromic motifs, like Fnr (data not shown), leads us to suggest that the excess information observed in the strong dyad of asymmetrical palindromic profiles may be used primarily for binding affinity, while search operates mainly on the remaining symmetrical information.
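The mirroring procedure can be illustrated as follows: for each site, the strong half-sequence is replaced by the reverse complement of the weak half, yielding an artificial perfect palindrome. The sketch assumes even-length sites with the strong dyad in the first half; in practice the strong half would be identified from the per-position information profile:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mirror_weak_half(site):
    """Replace the first (strong) half of a site with a mirror copy
    (reverse complement) of its second (weak) half."""
    half = len(site) // 2
    weak = site[half:]
    return weak.translate(COMPLEMENT)[::-1] + weak

mirrored = mirror_weak_half("GGGGTTTT")
print(mirrored)  # AAAATTTT
# The result equals its own reverse complement, i.e. a perfect palindrome:
print(mirrored == mirrored.translate(COMPLEMENT)[::-1])  # True
```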
Given the number of factors governing the equilibrium among R sequence, protein concentration, binding site number and effective affinity range, it is difficult to accurately assess the theory outlined above in the case of skewed genomes. Nonetheless, certain broad predictions can still be made for the evolution of transcription factors trapped in a genome drifting towards skew. As discussed above, based on the equality between R frequency and R sequence, RE predicts that anti-skew motifs in a skewed genome are prone to lose some positional information, because the search problem is greatly simplified in the skewed genome. As the P. aeruginosa Fur case illustrates, however, this does not seem to be the case. In fact, the evidence suggests that, for the transcription factor to fulfill an equivalent function in the skewed genome, its anti-skew motif must retain or even increase its positional information content. This is because affinity ranking must now operate against a background in which the search process does not contribute significantly to the effective affinity range (Figure 8b). Transcription factors may partly adapt to this situation by lowering their copy number in the cell (and thus increasing the relevance of the search process in determining effective affinity), but it is difficult to see how they might shed positional information, as this would only further limit their operational range.
Transcription factors targeting pro-skew sites face the opposite problem. In this case, the search problem becomes a fundamental limiting factor, and high R sequence values are required to distinguish sites from the genomic background. Nonetheless, the minimum R sequence value required for efficient location of sites does not per se guarantee a desirable effective affinity range. Thus, additional information content may still be required to provide an effective affinity range. In particular, one must take into account that, owing to the pro-skew composition of sites, any increase in the linear affinity range will result in a very large increase in effective affinity range, as search requirements will very rapidly disrupt the original affinity scope. This suggests that a considerable amount of additional information will be required to maintain an adequate effective affinity range. Although they do not constitute solid proof, the search results for Fur shown in Figure 3 certainly support this hypothesis. These results suggest that 14 bits of information in a genome with 33% GC content ought to be enough to provide a search efficiency roughly similar to that of CRP in E. coli. Nevertheless, H. influenzae CRP displays 17.83 bits, indicating that additional information is being used to provide it with an adequate operating range.
From a broader perspective, the fact that R sequence values do not decrease for P. aeruginosa Fur and H. influenzae CRP is in agreement with a peculiarity of R sequence that has puzzled researchers for decades. As outlined in the introduction, in skewed genomes R sequence decreases regardless of the direction of the motif skew. The reason is that the background genomic entropy (H before) decreases, leaving less information available for encoding recognition. Schneider et al. argued that, by going against the skew, a transcription factor might exploit search efficiency and benefit from the skew, but in terms of information theory this would be akin to a free-lunch proposition: motifs could become more informative in a less informative setting. As we have shown above, however, a skew in genome composition introduces a net reduction in information that affects the search and ranking problems to different extents. Thus, search might be facilitated by the genomic skew, but at the cost of hampering the effective affinity range. In order to compensate for the overall loss in information content and maintain comparable functionality, transcription factors in skewed genomes are forced to increase, or at least maintain, the positional information content of their motifs.
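The net reduction in available information can be made concrete by computing the background entropy H before from mononucleotide frequencies; the skewed frequencies below are illustrative:

```python
from math import log2

def h_before(freqs):
    """Background genomic entropy (bits per position) from
    mononucleotide frequencies (A, C, G, T)."""
    return -sum(p * log2(p) for p in freqs if p > 0)

# An equiprobable genome offers the full 2 bits per position...
print(h_before([0.25, 0.25, 0.25, 0.25]))  # 2.0
# ...whereas a skewed, 33% GC genome offers measurably less (~1.92 bits),
# leaving less information available for encoding recognition.
print(h_before([0.335, 0.165, 0.165, 0.335]))
```

Any entropy lost to skew is subtracted from what both the search and ranking processes can draw on, which is the net-reduction argument made above.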
In contrast to the conventional viewpoint, anti-skew sites trapped in a genome drifting towards skew would benefit from moving towards the skew rather than remaining against it. Clearly, if P. aeruginosa Fur moved towards the skew, the search problem would gain relevance, yielding a larger effective affinity range for the same positional information. In addition, it can be argued that positional information would be less expensive to maintain for a motif more attuned to the genomic skew. Indeed, P. aeruginosa Fur seems to be drifting towards the skew, but its drift appears to be remarkably slow. A possible explanation is that migration towards the skew implies a co-evolutionary process between a transcription factor and its binding sites that may not be easy to accomplish without temporary loss of functionality. Given this constraint, maintaining or increasing positional information content against the skew may be a much simpler pathway, and may thus act as a powerful attractor in the evolutionary landscape faced by transcription factors trapped in genomes drifting towards skew.
Summary of relative method performances.
The results presented above have several important implications for our understanding of binding site search, information and evolution. On the search problem, we conclude that non-weighted R sequence-based methods should be used preferentially, as they embody fewer assumptions and are thus less prone to misfire on real biological data. Conversely, weighted R sequence-based methods seem better suited to ranking sites by affinity. Relative entropy and similar heuristic corrections for skewed composition should be avoided, since they are based on the misguided hypothesis that search and differential regulation are equivalent problems for the protein. In contrast, we propose that information content as defined by R sequence is a compound measure that incorporates requirements from both the search and regulation processes. This revised paradigm suggests that binding sites will tend to drift towards the genomic skew, not against it, and to increase their conservation to circumvent the global loss of information content in skewed genomes.
Complete genome sequences for E. coli str. K-12 substr. MG1655 [NC_000913], P. aeruginosa PAO1 [NC_002516], H. influenzae Rd KW20 [NC_000907], Colwellia psychrerythraea 34H [NC_003910], Salinibacter ruber DSM 13855 [NC_007677], Thiobacillus denitrificans ATCC 25259 [NC_007404], Enterococcus faecalis V583 [NC_004668], Anaplasma marginale str. St. Maries [NC_004842] and Nitrosococcus oceani ATCC 19707 [NC_007484] were downloaded from the Entrez database at NCBI http://www.ncbi.nlm.nih.gov/Entrez/.
Collections of binding sites (Table S2, Additional file 1) for E. coli Fis, CRP and Fur sites, and for P. aeruginosa Fur sites were downloaded from the Prodoric database . The collection of E. coli LexA binding sites was obtained from . H. influenzae CRP binding sites were provided by Rosie Redfield .
Searches for binding sites using the different methods described herein were conducted entirely with Fitom, a program to locate binding sites in genomic sequences. Fitom offers several modes of operation, allowing the user to choose among a variety of search methods, threshold adjustments, report styles and background entropy calculations. For the purposes of this work, all searches were carried out using the computed background entropy of the full genome sequence and no small-sample correction in the estimation of R sequence and RE.
Random backgrounds with different skews were generated with RandSeq, a simple program written in C++ to generate random sequences based on a naïve Bernoulli model of mononucleotide frequencies. To simulate search processes on random backgrounds, binding sites from experimentally validated collections (Table S2, Additional file 1) were inserted at known positions in the randomly generated sequences. Frequencies for 20-mers in real and artificial genomes were computed with NmerFreq. Executable programs, user manuals and source code are available for download at http://research.umbc.edu/~erill.
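The background-generation and site-planting procedure amounts to independent per-position draws followed by in-place substitution; a minimal sketch under that assumption (function names are ours, not RandSeq's):

```python
import random

def random_genome(length, freqs, seed=None):
    """Random sequence under a naive Bernoulli mononucleotide model:
    each position is an independent draw from the (A, C, G, T) frequencies."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", weights=freqs, k=length))

def plant_sites(background, sites, positions):
    """Overwrite the background with binding sites at known positions,
    so simulated searches know exactly where the positives lie."""
    seq = list(background)
    for site, pos in zip(sites, positions):
        seq[pos:pos + len(site)] = site
    return "".join(seq)

# e.g. a 33% GC background with one hypothetical 22-bp site planted at 5000
genome = random_genome(10_000, [0.335, 0.165, 0.165, 0.335], seed=42)
genome = plant_sites(genome, ["TTGACTGTATATACTCACAGCA"], [5_000])
```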
All reported ROC curves correspond to simulated search processes, either on real genomic sequence or on artificially generated sequence. In a simulated search process, a collection of experimentally validated binding sites is provided to Fitom and the program scans both strands of the target sequence. Experimentally validated binding sites present in the target sequence are considered positives. All other sites in the target sequence are considered negatives. For a given threshold θ, sensitivity is computed as the ratio between true positives (positives reported as positives by Fitom according to θ) and positives. Likewise, specificity is computed as the ratio between true negatives (negatives reported as negatives by Fitom according to θ) and negatives. ROC curves of simulated search processes on artificially generated sequence show the mean and standard deviation of three independent experiments (on three different randomly generated sequences).
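The sensitivity and specificity computation described above can be sketched as follows; this is an illustrative reimplementation with made-up scores, not Fitom's actual code:

```python
def roc_point(scores, is_positive, theta):
    """Sensitivity and specificity at threshold theta.

    scores: the score assigned to every scanned position;
    is_positive: whether that position holds a validated site."""
    tp = sum(1 for s, p in zip(scores, is_positive) if p and s >= theta)
    tn = sum(1 for s, p in zip(scores, is_positive) if not p and s < theta)
    positives = sum(is_positive)
    negatives = len(is_positive) - positives
    return tp / positives, tn / negatives

scores      = [9.1, 8.7, 3.2, 7.9, 1.0]
is_positive = [True, True, False, False, False]
print(roc_point(scores, is_positive, theta=8.0))  # (1.0, 1.0)
```

Sweeping theta over the full score range yields one (sensitivity, specificity) pair per threshold, which together trace the reported ROC curves.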
The authors wish to thank Andrew Cameron and Rosie Redfield for kindly providing the sequences of CRP sites of H. influenzae. This work was supported partly by UMBC Special Research Assistantship/Initiative Support program. The authors would like to thank the reviewers of this manuscript for their insightful comments and suggestions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.