An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance
 Ana C Casimiro^{1},
 Susana Vinga^{1, 2},
 Ana T Freitas^{1} and
 Arlindo L Oliveira^{1}Email author
https://doi.org/10.1186/14712105989
© Casimiro et al; licensee BioMed Central Ltd. 2008
Received: 17 August 2007
Accepted: 07 February 2008
Published: 07 February 2008
Abstract
Background
Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints of the positions where these patterns occur preferentially.
Results
We propose to integrate position uniformity tests and overrepresentation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (ChiSquare, KolmogorovSmirnov and a ChiSquare bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed better in this dataset, we proceeded to study the positional distribution of several well known cisregulatory elements, in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several Dicotyledons plants). The results show that position conservation is relevant for the transcriptional machinery.
Conclusion
We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that nonuniformity is a good indicator of biological relevance and can be used to complement overrepresentation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
Background
The computational analysis of DNA sequences represents a major endeavor in the postgenomic era. The increasing number of wholegenome sequencing projects has provided an enormous amount of information which leads to the need of new tools and string processing algorithms to analyze and classify the obtained sequences [1].
In this regard, the study of short functional DNA segments, such as transcriptional factor binding sites, has emerged as an important effort to understand key control mechanisms. For example, it is now known that the presence of certain sequences of motifs in promoter regions determines the effective regulation of gene transcription, a central feature of gene regulatory networks.
DNA motifs can be represented in a number of different ways. Position specific scoring matrices (PSSMs) and consensi (oligonucleotide sequences) are amongst the most commonly used. However, several other more sophisticated methods have been proposed to represent motifs, some of them able to take into account statistical or deterministic dependencies between positions [2]. Our approach is independent of the way motifs are modeled, since it requires only the list of occurrences of motifs, something that can be obtained from any motif representation.
Motif finding is the problem of discovering motifs, that may correspond to transcription factor binding sites, without any prior knowledge of their characteristics. These motifs can be found by analyzing regulatory regions taken from genes of the same organism or from related genes of different organisms. Many approaches have been proposed and one can find an impressive collection of published articles describing algorithms to address the problem.
Currently available methods can roughly be classified in two main classes: probabilistic [3, 4] and combinatorial [5, 6]. This classification covers most, although not all, popular motif finders currently available.
The major drawback with these algorithms is their inability to discriminate the biologically relevant extracted motifs from the potentially numerous false hits. Probabilistic motif finders also have problems when the motifs are highly degenerated. The problem of determining what portion of the output corresponds to a biologically significant result has been addressed mostly through the use of statistical techniques and biological reasoning, and it is a challenge in its own right. In this regard, the correct assessment of which of those observations may have occurred just by chance is a mandatory step in the process of identifying biologically meaningful features.
This is the main rationale for the construction of stochastic models that can provide estimates for the expected number of occurrences of a given sequence. These models are based on some assumed distribution for the sequence of bases, such as the one defined by a Markov chain [7], and are then used to compute the expected number of occurrences, under the null hypothesis, H_{0}, that assumes that the sequence is randomly generated in accordance with the assumed distribution. Sequences that are overrepresented, in a statistically significant way, are considered as potentially significant, as they are highly unlikely to have been generated by chance. This is usually done by determining a pvalue for each extracted motif that assesses its relative expectedness/randomness under the specific predefined model, based on the expected vs. observed number of occurrences. This step makes for an efficient filtering of the output without loosing a significant amount of information and to the correct assessment of the motifs that are under or overexpressed. However, this approach is not very selective, and it is hard to apply to small motifs (that occur very frequently by chance) or to other motifs that are not overrepresented, but that, nonetheless, are biologically significant.
One possible way around this limitation is to look at other characteristics of the motifs, such as the positional distribution in the regions under analysis. The idea of analyzing the positional distribution of motif occurrences was developed recently [8, 9], suggesting that several motifs occurring in natural sequences have strong positional preferences. For example, it is well known that in prokaryote promoter regions, the TATAbox occurs near position 10 (before the beginning of transcription) and the TTGACA motif usually occurs near position 35. In eukaryotes, the bestcharacterized core promoter elements consist of a TATAbox located approximately 30 nucleotides upstream from the start site, an initiator element located at the transcription start site, and a downstream promoter element (DPE) located approximately 30 nucleotides downstream from the transcription start.
The functionality of genome regions is intrinsically related with their ability to fold in tridimensional structures, which clearly indicates that positional bias should be incorporated in the models and analyzed from a statistical point of view. A number of recent studies have focused on this property, both in terms of absolute [10–12] and relative [13, 14] positioning of the different motifs. These studies, however, address the positioning of specific, well identified, motifs, but none of them, to our knowledge, presented complete quantitative results and a comprehensive analysis of this feature, that can be used to actually distinguish between relevant and nonrelevant motifs. In particular, there is no clear proposal of which is the best test to identify and classify the motifs in terms of their positional distribution along the genome.
In this article we propose to integrate position uniformity tests and overrepresentation tests based on Markov models to improve the posterior classification of the motifs and better assess their biological significance.
For the position uniformity tests, the input data corresponds to vectors of motif positions in the input sequences where the motif appears. Specifically, we compute the position of each occurrence of each motif, relative to the translation start site, and build a list of these positions. This list will be analyzed for uniformity using a statistical test. We started this analysis by first comparing different statistical methodologies commonly used to test uniformity, namely the ChiSquare goodnessoffit test and the KolmogorovSmirnov (KS) test [8]. Since the motifs may appear in a small number of positions, a bootstrap [15] version of the ChiSquare test, that can cope with small sample sizes, was also evaluated. These tests were first validated on artificially generated data to better analyze the results, assessing their sensitivity and specificity as well as the Receiver Operating Characteristic (ROC) curves. Based on these results, the bootstrap ChiSquare test was chosen as it proved to be the most powerful test for small sample sizes.
The bootstrap ChiSquare test was then used in the study of the positional distribution of motifs in the promoter sequences of different organisms, namely bakers yeast (Saccharomyces cerevisiae), human (Homo sapiens), fruit fly (Drosophila melanogaster), Escherichia coli and several Dicotyledons plants. Motifs that appeared at least in a given fraction of the sequences (the quorum) were extracted, from these data sets, using a motif finder. For each extracted motif, a pvalue was calculated, under the null hypothesis, that assumes a uniform distribution for the motif positions.
The combination of overrepresentation tests, that take into account the specific sequence of bases, and position uniformity tests, that consider the positional distribution of the motifs, has the potential to represent a more powerful technique for motif classification, than the ones currently used. This methodology is easily adapted to other computational biology applications where position is thought to be biologically relevant and where the nonuniformity of observed motifs may provide strong indication of biological relevance (e.g., [10]).
Results and discussion
In this section we present the results of applying the proposed methodology to both artificially generated (synthetic) and real data sets.
The artificial data sets were first used to compare the three analyzed uniformity tests, the ChiSquare goodnessoffit test, the KolmogorovSmirnov test and the ChiSquare bootstrap test. The uniformity test that obtained the best results was then applied to classify the motifs extracted, by a motif finder, from a number of real data sets, from different organisms. In this section we present only the results obtained for the Saccharomyces cerevisiae data sets, but results for other organisms are also available [see Additional file 1].
Comparison of Position Uniformity Tests
Artificial data sets generated from uniform and beta distributions were used in this comparison (see section Methods). Specifically, we generated a number of positions, normalized to the unit interval, that emulate, what, in the real data, corresponds to the positions of motif occurrences. For each of these distributions, the sensitivity and the specificity of the three uniformity tests was assessed. Specifically, we used these tests to evaluate the probability that H_{0} is true, i.e., that the given samples originated in a uniform distribution.
The results show that all tests present the expected sensitivity, as imposed by the level of significance (in this case, it represents the probability that a uniform sample will be classified as nonuniform). They correctly identified the majority, around 95%, of the uniform instances of different lengths when a level of α = 5% is chosen.
Instances drawn from Beta distributions were in general also correctly identified by the three tests. The more asymmetric the distribution is, the more powerful the KS test becomes. For distributions with lower variance and higher symmetry the bootstrap version of ChiSquare becomes the more powerful test, as shown in Figure 2(a).
The power of the tests increases as the samples become larger, as would be expected. From these results it is possible to see that the KolmogorovSmirnov test is less powerful than the other two tests. Both ChiSquare goodnessoffit and the ChiSquare bootstrap tests present a similar behavior.
For small sample sizes, the determination of the optimal number of bins in the Chisquare test was achieved by testing different rules. One of the most common is to guarantee a minimum number of five expected counts per bin, which leads to nbins = ⌊x/5⌋, where x is the sample size. However, when this value is between 10 and 15 it is advisable to use three classes. In effect, it would be virtually impossible to distinguish a symmetric distribution using two bins defined by the edges (0, 0.5, 1). When using three bins there is a significant increase of specificity values, which strengthens the application of Chisquare with a modified rule for the number of bins. Furthermore, this assessment and further testing on simulated data led to the application of a modified rule for determining the number of bins, described by nbins = ⌊x/5⌋ + 1, for sample sizes up to 40. This was shown to increase the specificity. The graphs of Figure 2 show the specificity of the test for different Beta distribution functions as a function of sample size, using a combination of both rules: from 10 to 40 one extra bin was applied to improve the classification, from 40 to 100 the usual 5perbin formula was employed.
These results also show that there is a real advantage in using bootstrap methodologies. When the size of the sample is small, (N < 30), the results show that the ChiSquare bootstrap test exhibits the best ability to detect nonuniformity. Although the difference between the bootstrap version and the standard test appears to be marginal, one should use the simulation results to fix the desired sensitivity. In fact, only by using the empirical distribution of the test statistic, is it possible to correctly adjust the level of significance for small samples. This is because the distribution in this case markedly deviates from the ChiSquare standard distribution.
These results therefore indicate that the ChiSquare test with bootstrap achieves the best performance and should therefore be used to assess the uniformity of motif positions in real data, specially when the number of occurrences is small.
Analysis of the Saccharomyces cerevisiae data sets
 1.
A motif finder algorithm, RISO [5], was used to identify motifs that occurred in a given fraction (the quorum) of the sequences.
 2.
Motifs were ordered in accordance to the computed statistical significance of the deviation of the number of occurrences observed vs. expected. Motifs were considered overrepresented in a statistically significant way if the pvalue was smaller than 10^{3}.
 3.
A pvalue for the likelihood of a uniform distribution of each motif was obtained, using the ChiSquare bootstrap test, using a significance level, α, equal to 0.05.
With this analysis, it is possible to define four motif classes, according to their classification in the overrepresentation and uniformity tests. Motifs that are both overrepresented and nonuniformly distributed are very likely to have some biological function. Our analysis will be centered on these motifs, although, at times, we will also look at motifs that are not seen as strongly overrepresented but are distributed in a nonuniform way in the promoter regions.
Core promoter elements
For the analysis of the positional distribution of several core promoter elements, a data set containing 5864 promoter sequences from Saccharomyces cerevisiae was considered. This data set includes a significant fraction of the gene promoters for this organism. For eukaryotic species there exists a set of well characterized core promoter elements, that include the TATAbox and the GCbox. The documented consensi for these sites are:

TATAbox[16]:$\begin{array}{cccccccc}\text{T}& \text{A}& \text{T}& \text{A}& (\text{A}/\text{T})& \text{A}& (\text{A}/\text{T})& (\text{A}/\text{G})\end{array}$

GCbox: [17]$\begin{array}{llllllll}\text{g}\hfill & \text{G}\hfill & \text{G}\hfill & \text{G}\hfill & \text{C}\hfill & \text{G}\hfill & \text{G}\hfill & \text{g}\hfill \\ \text{t}\hfill & \text{a}\hfill & \text{t}\hfill & \text{a}\hfill & \text{t}\hfill & \text{a}\hfill \\ \text{a}\hfill & \text{a}\hfill & \text{t}\hfill \end{array}$
Distribution of motifs according to uniformity and statistical significance in the global S. cerevisiae data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  254  2002  2256 
Overrepresented  30  638  668 
Total  284  2640  2924 
From the nonuniform group, 638 were classified as overrepresented in a statistically significant way. The first ranked motifs in this subgroup correspond to motifs that match previously described consensi. For example, motifs like TATAAA, TATATA, TATATAA, TATATAT, and TATAAAA match with the TATAbox; GGGTA, and GGGCG fit the GCbox profile.
In Figure 4(b) it is also possible to observe the positional distribution of a set of motifs that corresponds to the GCbox. The pvalues obtained for the uniformity test for these elements are much lower than 0.05, suggesting nonuniformity and stressing that these motifs have a positional preference.
However, only the TATAbox has been classified as overrepresented in a statistically significant way. This is a case where the application of a nonuniformity test would identify biologically significant motifs that would go unnoticed by the overrepresentation tests.
The results obtained for this data set show that most of the extracted motifs are classified as nonuniformly distributed. This result was somehow expected since the motif finder was instructed to extract motifs with a minimum quorum of 20% in a large data set of 5856 promoter sequences. This means that each motif must be present in at least 1172 sequences. Some of the motifs extracted correspond to known transcription factors binding sites and a large number corresponds to motifs that are important in the context of the chromatin regulation. Most of these motifs are known to have a positional preference. In this example, we did not perform a detailed analysis of all the motifs extracted, since they are too numerous. The objective was to identify well characterized motifs that have positional preference and that may not be overrepresented in a statistically significant way.
Aft2p
This data set contains the promoter regions of 46 genes that are documented to be regulated by the transcription factor Aft2p. The documented consensus for this TF are the motifs: CGCACCC, GGCACCC, TGCACCC and YKCACCCR.
Distribution of motifs according to uniformity and statistical significance in the Aft2p data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  988  110  1098 
Overrepresented  15  10  25 
Total  1003  120  1123 
From a total of 1123 motifs extracted, only 120 are considered to be nonuniform. A set of 10 out of these 120 are classified as overrepresented. This small group includes the motif CACCC, which corresponds to the core conserved motif described for this TF binding site. The pvalue for the uniformity test for this motif was 1.75e  4. The complete consensus that corresponds to this TF binding site was not reported, since this it is only present in a small number of sequences in this data set.
Dal80p
This data set includes 26 promoter sequences of genes documented to be regulated by the transcription factor Dal80p. The biological consensus described consists of two identical motifs, GATAAG, separated by 15 to 20 base pairs.
Distribution of motifs according to uniformity and 77 statistical significance in the Dal80p data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  407  64  471 
Overrepresented  7  4  11 
Total  414  68  482 
From a total of 482 motifs, only 68 are distributed nonuniformly, and only 4 of these are overrepresented in a statistically significant way. In this group of 4 motifs, it is possible to find the GATAAG motif, which is precisely the motif that corresponds to the biological consensus described. The three other motifs that complete this group are the motifs CTTATC, TATATA and AAGAAA. The first one is the complement of the GATAAG motif, the second one is a TATAbox that, as discussed before, has a positional preference and the last one is a motif already identified in yeast promoters as being one of the premRNA 3'endprocessing signals.
Figure 5(b) shows the distribution of the GATAAG motif in the promoter region. This motif obtained a pvalue of 0.04 in the uniformity test.
Gln3p
This data set consists of 19 promoter sequences of genes regulated by the transcription factor Gln3p. The reported consensi for this TF binding site are the motifs GATAAG and GATTAG.
Distribution of motifs according to uniformity and statistical significance in the Gln3p data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  559  161  720 
Overrepresented  12  14  26 
Total  571  175  746 
Figure 5(c) shows the positional distribution of the GATAAG motif. The pvalue obtained in the uniformity test was 0.02, which represents a strong indication of nonuniform distribution.
Met4p
Distribution of motifs according to uniformity and statistical significance in the Met4p data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  83  15  98 
Overrepresented  17  15  32 
Total  100  30  130 
Figure 5(d) shows the positional distribution of the motif CACGTG in the promoter region of the genes in the data set.
Gat1p
Distribution of motifs according to uniformity and statistical significance in the Gat1p data set.
Motifs  Uniform  Non uniform  Total 

Not overrepresented  682  75  757 
Overrepresented  5  3  8 
Total  687  78  765 
Figure 5(e) shows the positional distribution of the motif CTTATC in the promoter regions of the genes considered.
Datasets for other organisms have also been analyzed, and confirm the obtained results are consisten with the ones presented in this section [see Additional file 1].
Conclusion
Given these results, we propose that the integration of position uniformity tests and overrepresentation tests can be used to improve the accuracy of the classification of motifs found by combinatorial motif finders.
The study we performed on the positional distribution of several well known cisregulatory elements, in the promoter sequences of different organisms, has shown that position conservation is a significant characteristic of many biologically significant motifs. In particular, the results show that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that nonuniformity is a good indicator of biological relevance.
In a number of instances where overrepresentation tests do not provide conclusive results, nonuniformity tests can be used to flag the relevant motifs. This effect is most evident for small motifs, that are expected to appear a large number of times, and for small datasets, where the overrepresentation tests are not powerful enough to single out motifs that are biologically significant, but are only somewhat overrepresented.
In particular, it is clear from the results that motifs that pass both tests are very strong candidates for further analysis. This is an important point, specially for users of combinatorial motif finders, because it is sometimes hard to filter the relevant motifs from the large number of sequences identified by these methods. In fact, currently used overrepresentation tests are likely either to miss important motifs that are not long enough, or to flag as significant motifs that are overrepresented only by chance.
Methods
Motif finding
Motifs were extracted using the motiffinding algorithm RISO [5]. This combinatorial method extracts motifs consisting of plain nucleotide sequences or sequences over a degenerate alphabet, explicitly enumerating all possible patterns. This approach has proved to be effective and efficient when the appropriate parameters for the extraction are known. In our case, since the motifs of interest were known, the parameters were selected to include the desired motifs, taking into account some degeneration of the patterns.
It is noteworthy that the application of the proposed approach is independent of the method used to model and extract the motifs. It could be used, for instance, on the list of occurrences reported by a probabilistic algorithm. Except for efficiency reasons, it could also be applied to test all possible oligonucleotide sequences without performing any searching and prefiltering provided by the RISO combinatorial algorithm. This step is required only to reduce the number of tests to be performed and study only a subset of motifs that have particular properties. This can be used to specify features such as minimum quorum, length and degeneration level, guaranteeing the retrieval of all the motifs that fulfill them. Since this preselection could introduce a bias in the analysis, we subsequently studied motifs for which the biological relevance is widely reported.
Overrepresentation tests
The statistical significance of overrepresentation of the extracted motifs was assessed using a model for the sequences based on a firstorder Markov chain [22]. The probability of occurrence of a motif in each sequence is used to estimate the probability of the motif occurring in at least k sequences, by computing the distribution of a sum of Bernoulli variables, each one of them taking the value 1 in accordance with the computed probability for that sequence. This leads to a binomial distribution, if all sequences have equal length, or to a sum of unequal parameter Bernoulli random variables, if the lengths of the sequences are different.
The statistical significance of the observed number of occurrences for each motif is reported as a pvalue, that is subsequently used to sort the output, starting with the most overrepresented patterns. In this test, only the sequence of bases that constitutes the motif is considered and not the positions where the motif occurred in the sequences.
Uniformity tests
The vector of occurrences p to be tested consists of the positions (p_{1}, p_{2}, ..., p_{ n }) where the target motif appeared (n times) in a set of k sequences {S_{1}, S_{2}, ..., S_{ k }}. The null hypothesis H_{0} is that the underlying distribution F of these positions is the discrete uniform F ~ Unif{1, ..., L}, where L is the (common) length of each sequence S_{ j }.
In order to standardize the results across different sequences, the simulations were performed using a continuous uniform distribution Unif(0, 1). This corresponds to using vectors of the absolute position normalized over the total sequence length. This is equivalent to the discrete version if the continuous values are rounded or truncated to a finite number of classes. The results can be generally interpreted as relative positions to the beginning of transcription.
Uniformity tests abound in the literature. Some of the most widely used, and that will be applied in this study, are the KolmogorovSmirnov (KS) test and the ChiSquared (CS) test.
where n is the sample size and F is the distribution function associated with the theoretical hypothesis. The null hypothesis is rejected if D is greater than a critical value which is dependent on the level of significance considered and the sample size.
where O_{ i }is the observed frequency and E_{ i }is the expected (theoretical) frequency asserted by the null hypothesis. The test statistic X^{2} follows approximately a chisquare distribution with (k  c + 1) degrees of freedom, ${\chi}_{kc+1}^{2}$, where k is the number of classes and c is the number of estimated parameters.
The CS tests were conducted by fixing the number of classes or bins for each sample in order to find, for each particular distribution, the most discriminative parameter set. Cochran'rule was used as the starting point to define the number of classes of equally spaced edges, by forcing a minimum number of five expected observations per bin. This corresponds to employing, for each sample size N, c = ⌊N/5⌋ classes. Other values were also tested to assess if this rule was adequate to this specific setting, specially for small samples where the chisquare approximation might not be applicable.
Bootstrap analysis
Since the approximation to the Chisquare distribution described above is not accurate and might not be appropriate for small sample sizes, a bootstrap version of the Chisquare test was also implemented and analyzed. Parametric bootstrap methodologies are adequate to deal with few observations [15] and goodnessoffit tests can be easily adapted to this methodology.
where I is the indicator or characteristic function that counts the number of replicates whose t_{ i }values exceeded the original t_{ obs }.
H_{0} is accepted or rejected according to the significance level, α, established. For a fixed significance value, α, uniformity can be rejected if pvalue ≤ α. The smaller the pvalue, the more unlikely is that the sample came from a uniform distribution. Intuitively, if most of the B samples taken from a uniform distribution have a smaller value of the statistic than the original sample, then it is not likely that our initial sample comes from a uniform distribution.
Since the described bootstrap procedure should be done every time a uniformity test is run for a given sample, a presimulation was performed in order to calculate the bootstrap critical values, thus avoiding unnecessary simulations. Chisquare tests with varied sample sizes and number of classes were performed on uniformly generated samples and the results saved for future comparison. This procedure was conducted on 200000 replicates.
Sensitivity, specificity and ROC curves
where Positives are the samples from the Uniform distribution and Negatives are samples generated from an alternative distribution. In this study the Beta distribution with several parameters Beta(α, β) was considered, given its flexibility and properties. This distribution is defined on the (0, 1) interval, and if (α, β) = (1, 1), it reduces to the uniform distribution. For parameters (α, α) the function has a bellshape and is symmetric and centered on 0.5. For distinct values of α and β it becomes asymmetric. This distribution exhibits a wide spectrum of behaviors and can therefore be used to model the positional distribution of motifs in sequences.
There is a close relationship between the previous definitions and hypothesis testing when the null hypothesis H_{0} is the uniform distribution of the positions. In this case, the level of significance α, defined as the probability of rejecting H_{0}, when it is true, is related with the sensitivity of the test and equal to 1  sensitivity. The power of the test, the probability of rejecting H_{0} when it is false, is equivalent to the specificity defined as above.
The level of significance used in the tests was 0.05, unless otherwise noted. The sensitivity and specificity of a test depend on the level of significance considered. For this reason Receiver Operating Characteristic curves (ROC curves) are also presented. ROC curves are graphical plots of the sensitivity versus 1  specificity for several thresholds values, in this case the test statistics critical cutoff value. They allow the comparison between tests for different levels of significance.
Datasets
 1.
Uniform continuous distribution on (0, 1) or, equivalently, Beta(1, 1)
 2.
Beta(α, β) distribution with parameters (α, β) equal to (3, 3), (15, 15) and (3, 1.5), reflecting different degrees of asymmetry.
The three tests referred (KS, CS and CSboot) were then applied to each of the samples in order to obtain a mean rejection rate across all the replicates. These mean values estimate the sensitivity and specificity of the test of uniformity, or, equivalently its level of significance and power.
The advantage of using synthetic data sets is the ability to specify exactly the distributions of the motif positions. In this way, we can safely compare the tests under analysis and analyze and tune the parameters involved. However, these artificial samples might still be very different from real motif positions, since the distributions considered are probably not accurate enough to model the actual motif positions in promoter regions.
Data sets for the analysis of the positional distribution of well characterized core promoter elements.
Data sets for for the analysis of the positional distribution of specific transcription factors binding sites in S. cerevisiae.
Transcription Factor  Seq. number  Average bp  Consensus 

Aft2p  46  1000  CGCACCC 
GGCACCC  
TGCACCC  
YKCACCCR  
Dal80p  26  1000  GATAAGN{15,20}GATAAG 
Gln3p  56  1000  GATAAG 
GATTAG  
Met4p  36  1000  TCACGTG 
Gat1p  19  1000  GATAAG 
For all the real datasets, we used the promoter regions defined in each one of the databases from where the sequences were retrieved. For example, for the S. cerevisiae datasets the promoter region corresponds to 1000 nucleotides upstream of the translation start site.
Declarations
Acknowledgements
This work was supported by FCT and the POSI and POCTI programs (projects POSI/SRI/47778/2002, POSI/EIA/57398/2004 and POCTI/BIO/56838/2004).
The authors would like to thank Prof. Lisete Sousa (FCUL, Portugal) and Prof. Ana Pires Parente (ISTUTL, Portugal) for useful suggestions and comments that greatly improved the manuscript.
Authors’ Affiliations
References
 Gusfield D: Algorithms on Strings, Trees, and Sequences. Cambridge University Press; 1999.Google Scholar
 Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in proteinDNA binding sites. Proceedings of the seventh Annual International Conference on computational molecular biology 2003, 28–37.Google Scholar
 Schug J, Overton GC: Modeling transcription factor binding sites with Gibbs Sampling and Minimum Description Length encoding. Proceedings of the International Conference on Intelligent Systems for Molecular Biology 1997, 5: 268–271.Google Scholar
 Bailey T, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology 1994, 28–36.Google Scholar
 Carvalho AM, Freitas AT, Oliveira AL, Sagot MF: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE Transactions on Computational Biology and Bioinformatics 2006, 3(2):126–140.View ArticlePubMedGoogle Scholar
 Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. Journal of Computational Biology 2000, 7(3/4):345–360.View ArticlePubMedGoogle Scholar
 Robin S, Rodolphe F, Schbath S: DNA words and models. Cambridge University Press, NY; 2005.Google Scholar
 Li N, Tompa M: Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology 2006, 1: 8.PubMed CentralView ArticlePubMedGoogle Scholar
 Hudges JD, Estep PW, Tavazoie S, Church GM: Computational identification of cisregulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae . Journal of Molecular Biology 2000, 296(5):1205–1214.View ArticleGoogle Scholar
 Yagil G: Binary DNA tracts as transcription control elements. Journal of Biomolecular Structure and Dynamics 2007, 24(6):735–736.Google Scholar
 Segal E, FondufeMittendorf Y, Chen LY, Thastrom A, Field Y, Moore IK, Wang JPZ, Widom J: A genomic code for nucleosome positioning. Nature 2006, 442(7104):772–778.PubMed CentralView ArticlePubMedGoogle Scholar
 Narlikar L, Gordan R, Ohler U, Hartemink AJ: Informative priors based on transcription factor structural class improve de novo motif discovery. Bioinformatics 2006, 22(14):e384–92.View ArticlePubMedGoogle Scholar
 Erives A, Levine M: Coordinate enhancers share common organizational features in the Drosophila genome. Proc Natl Acad Sci U S A. 2004, 101(11):3851–3856.PubMed CentralView ArticlePubMedGoogle Scholar
 Makeev VJ, Lifanov AP, Nazina AG, Papatsenko DA: Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Research 2003, 31(20):6016–6026.PubMed CentralView ArticlePubMedGoogle Scholar
 Davison AC, Hinkley DV: Bootstrap methods and their application. Cambridge University Press; 1997.View ArticleGoogle Scholar
 Basehoar AD, Zanton SJ, Pugh BF: Identification and distinct regulation of yeast TATA boxcontaining genes. Cell 2004, 116(5):699–709.View ArticlePubMedGoogle Scholar
 Bucher P: Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 1990, 212(4):563–78.View ArticlePubMedGoogle Scholar
 Butler JEF, Kadonaga JT: The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev 2002, 16(20):2583–2592.View ArticlePubMedGoogle Scholar
 Singer VL, Wobbe CR, Struhl K: A wide variety of DNA sequences can functionally replace a yeast TATA element for transcriptional activation. Genes & Development 1990, 4(4):636–645.View ArticleGoogle Scholar
 Blinder D, Magasanik B: Recognition of nitrogenresponsive upstream activation sequences of Saccharomyces cerevisiae by the product of the GLN3 gene . J Bacteriol 1995, 177(14):4190–4193.PubMed CentralPubMedGoogle Scholar
 Coffman JA, Rai R, Loprete DM, Cunninham T, Svetlov V, Cooper TG: Cross regulation of four GATA factors that control nitrogen catabolic gene expression in Saccharomyces cerevisiae . J Bacteriol 1997, 179(11):3416–3429.PubMed CentralPubMedGoogle Scholar
 Robin S, Daudin JJ, Richard H, Sagot MF, Schbath S: Occurrence Probability of Structured Motifs in Random Sequences. Journal of Computational Biology 2002, 9(6):761–774.View ArticlePubMedGoogle Scholar
 Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, Mira NP, Alenquer M, Freitas AT, Oliveira AL, SáCorreia I: The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Research 2006, (34 Database):D446D451.Google Scholar
 Schmid C, Rouaida DP, Praz V, Bucher P: EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Research 2006, (34 Database):D82D85.Google Scholar
 Kutach AK, Kadonaga JT: The Downstream Promoter Element DPE Appears To Be as Widely Used as the TATA Box in Drosophila Core Promoters. Mol Cell Biol 2000, 20(13):4754–4764.PubMed CentralView ArticlePubMedGoogle Scholar
 Huerta AM, Salgado H, Thieffry D, ColladoVides J: RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucleic Acids Research 1998, 26: 55–59.PubMed CentralView ArticlePubMedGoogle Scholar
 Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV: PlantProm: a database of plant promoter sequences. Nucleic Acids Research 2003, 31: 114–117.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.