Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test
BMC Bioinformatics volume 6, Article number: 109 (2005)
This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information.
We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se . Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA.
We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.
The transcription rate of genes is dictated primarily by interactions between DNA-binding transcription factors. Comparatively short sequences (several hundred to several thousand base pairs, depending on thespecies) upstream or downstream of the transcription start site often play a major role in the regulation of gene expression. Specific sites within such regions are recognized by regulatory proteins (transcription factors), which act upon binding as transcriptional repressors or activators, controlling the rate of transcription. The identification of regulatory regions, which are generally composed of dense clusters of target transcription factor binding sites, forms an essential step in understanding the regulatory interactions that govern the spatial and temporal expression of individual genes (see for example [1, 2]) and genetic regulatory networks, (see for example ).
Ultimately, this task is accomplished experimentally using techniques such as empirical deletion analysis, direct binding measurements, and co-precipitation of protein-DNA complexes. However, experimental verification is expensive and time consuming. Therefore, to address the growing volumes of available genomic sequence, a number of algorithms that identify putative cis-regulatory modules and transcription factor binding sites using evolutionary comparisons, whole-genome data, and known descriptions of transcription factor binding sites, have been successfully developed. Regulatory regions of higher eukaryotes can be subdivided into proximal regulatory units – promoters – which are located close to and upstream of the gene, and distal transcription regulatory units called enhancers or cis-regulatory modules. These may be located far upstream or downstream of the target gene, and are much more difficult to recognise. In our work we focus on recognition of enhancers.
Methods for recognising regulatory DNA may be divided into the following approaches:
Recognition of regulatory DNA regions based on description of known transcription factor binding sites (TFBS). This approach exploits the clustering of known, cooperatively-acting transcription factors (TFs). Extracting clustered recognition motifs is one of the most reliable techniques, but is limited to the recognition of similarly regulated cis-regulatory regions. Among the most popular representatives of search by known TFBS are [4–9].
Recognition of regulatory DNA based on phylogenetic foot-printing [10–14]. Methods of this type assume that regulatory regions are highly conserved in cross-genomic comparison, and conserved segments can be extracted from evolutionary related genomes. Performance of phylogenetic foot-printing depends on the evolutionary distance between given species and on the conservation level of individual genes. This is an actively progressing area, as more and more sequenced genomes appear. However, such an approach offers little information as to the specific function of the conserved sequences. Furthermore, it is still an open question as to how many genomes are sufficient for reliable extraction of regulatory regions.
Methods based on the difference of local nucleotide composition between regulatory and non regulatory DNA [15–18]. It is assumed that this difference is due to presence of multiple transcription signals, such as binding motifs for TFs in regulatory regions. The works [15–17] are based on constructing a global interpolated Markov model, applied to promoter recognition only.
In our method, we assume that the abundance of regulatory motifs within regulatory regions leaves a distinct "signature" in nucleotide composition, and that it is possible to capture this "signature" statistically. More specifically, we hypothesize that it takes the form of an over-representation of "similar words" (which are not simple repeats).
The approach of looking for over-occurrence of words has also been widely used in motif discovery, but this is not our aim here. This over-representation of similar words should appear as outliers in the right tail of the distribution of similar word lists of variable length. The "fluffy tail test", proposed in this paper, is designed to identify such outliers and is a useful technique when data from multiple genes and genomes are lacking. It may also be used as a complementary tool when such data are available.
In this section, we first present our new statistical 'fluffy tail' test for measuring the overrepresentation of similar words, and then show its performance on experimentally verified sequence data.
To demonstrate the power of our test, we need a positive, experimentally verified, training set of regulatory sequence data, and also negative training sets of non- regulatory sequence data. We use three test beds. The positive training set is a collection of 60 experimentally verified functional Drosophila melanogaster regulatory regions . This set consists of cis-regulatory modules located far from gene coding sequences and transcription start sites. It contains many binding sites (and site clusters), best known of which are bicoid, hunchback, Kruppel, knirps and caudal, – the sites involved in the regulation of developmental genes. The total size of the positive training set comprises about 68 Kb of sequence data, and contains 58 clusters of the same type of TFBS (homotypic). The two negative training sets are: (i) 60 randomly picked Drosophila exons, and (ii) 60 randomly picked Drosophila non-coding, non-regulatory DNA sequences: we excluded exons and regions of length 1 KB upstream and downstream of genes, using the Ensembl Genome Browser . Each training set contains 68 Kb of sequences in total.
Estimation of distributions of similar words
To construct the distribution of similar words, we first need to specify the length of words under consideration. We try to mimic the TF core, which is the less variable part of a binding motif. Because the core of TFBSs is relatively short (around 3–5 bp) we considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]). Thus, for each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length. Thus, each word is the "seed" of a list of similar words. Next, the number of (non-disjoint) lists containing n words is counted, where n = 1,2,3....
(See Methods section for further details). As an example, thehistogram of the distribution of similar 5-mer words is plotted in Figure 1. In this histogram, the Y axis represents the number of lists containing 1,2, ..., n words and the X axis shows the number n of similar words in the list.
From this plot it can be seen that most lists contain 10 to 40 words, but there are outliers: some very large lists form a long, "fluffy" tail. We call a list having the largest size the maximal similar word list (MSWL). If the original sequence is characterized by the presence of an unusually high number of over-represented words, we expect it to contain more long lists in comparison to a random sequence.
To sample such a random distribution we shuffled the given sequence of original data 50 times. For each randomisation we assessed the frequency distribution of similar words. Figure 2 shows a typical example of the distribution of similar words for one of the randomly shuffled sequences of the same (knirps) cis-regulatory module as in Figure 1. Compared with the distribution of the original data (Figure1), the randomised sequence in Figure 2 lacks a heavy, "fluffy" right tail. Figure 3 shows the difference between original and randomised similar word distributions in cumulative form. The difference between the two curves reflects the fluffy right tail of the original data.
In Figure 4, ten randomised sequences are plotted as dotted contours together with the histogram of the original regulatory knirps data (solid). The cumulative histogram for original (solid) and randomised (dotted) sequences is shown in Figure 4 (right). All dotted tails are shorter than the solid one, indicating the statistical significance of the solid tail.
Definition of the fluffiness coefficient F
To measure how strong the distribution of similar words of regulatory regions deviate from randomness, we introduce a "fluffiness" coefficient F:
w here L max,originalis the number of words in the maximal similar word list (MSWL) in the original sequence, and σ r are the mean and standard deviation of the MSWL size in each of r shuffled sequences. Here we call the sequence "random" if it is obtained from original sequence by shuffling it, preserving its single nucleotide composition. We will omit the subscript r for F r later in the paper for simplicity.
One can regard F as measuring the difference between signal and noise, where the signal is taken from the original sequence, and the noise from the randomised sequences with the same composition and length. Thus, the fluffiness coefficient is normalised for the length and base composition of the sequence, because we compare each original sequence only with respect to shuffled sequences of the same length and composition. Thus one can compare the fluffiness F for sequences of different base composition and length.
Results for regulatory regions
Figure 5 shows the distribution of fluffiness coefficient F for regulatory, coding and non-coding non-regulatory (NCNR) DNA. In each sequence we generated r = 50 shuffled versions, in calculating F. One can see that F = 2 distinguishes regulatory DNA from other types of DNA. Thus, we use the value F = 2 as a threshold. A sequence with F>2 we declare to have a "fluffy" tail. Moreover, we found that for each regulatory region having F>2, all the randomised sequences had a shorter tail. This value F = 2 is sufficiently robust: if we vary our threshold a little around F = 2, we still get a fair separation.
Our choice of r = 50 shuffled versions for each sequence allows us to obtain reliable estimates for the fluffiness coefficient F and make the computational time reasonable. Table 1 shows that F is somewhat unstable for smaller r for the knirps regulatory region. However, for each choice of r, F clearly exceeds the threshold value 2, in this example. See Supplementary Materials for more detailed descriptions [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].
Using the methodology described above, we found that 51 out of 60 regulatory regions (85%) analysed in our positive training set exhibit the significant "fluffy-tail" pattern (see Table 2). The non-detection of the remaining "non-fluffy" regulatory regions could perhaps be partly due to the limited power of experimental deletion analyses to correctly distinguish the boundaries of the cis-regulatory modules.
We calculated the distribution of F for our two negative and one positive training sets. The separation of regulatory DNA from coding and non-coding, non-regulatory DNA on the basis of fluffiness was quantified by estimating the distribution of the F coefficients. A Kruskal-Wallis test showed that these regions differ significantly in the magnitude of the fluffiness coefficient (H = 132.81, N = 180, df = 2, p = 0.00001), with exons and non-coding non-regulatory DNA having much lower F-values than regulatory regions (See Fig. 6).
We now turn to examine the location of similar words in the MSWL for a given sequence.
When the start positions of each of the words in the MSWL are plotted, they tend to be fairly uniformly scattered along the length of the sequence, as illustrated in Figure 7.
We now examine the relationship between the MSWL and predicted TFBS sites. We found significant enrichment of most MSWLs with the occurrences of TFBS in databases: when submitted to the Transfac and Jaspar TFBS databases, the "seed" words for MSWLs exhibited 10–20 fold enrichment with putative TFBS in comparison with all 5-mer words within the given regulatory region: thus, for the most part, these "seed" words turned out to be instances of known TFBS (results not shown here).
Results for exons
We repeated the fluffy tail test for randomly picked Drosophila exons, and found that the distribution of over-represented words of the original sequences did not differ statistically from those of their randomised versions (See Table 2). Note the absence of a "fluffy tail" in Figure 8 (left) and the lack of distinction in the cumulative distribution (Figure 8 right).
Thus we have established a statistical difference between exons and regulatory DNA. Next we compare regulatory DNA with non-coding non-regulatory DNA.
Results for non-coding, presumed non-regulatory DNA
The similar words distribution for non-coding non-regulatory DNA typically shows two patterns: (1) without significant tails, as for exons and (2) with significant tails (Figure 9) but in this case – and in contrast to the regulatory sequences – the spatial locations of over-represented words are typically clustered (Figure 11c).
To deal with this, we developed a measure of spatial clustering of similar words. We say that two words w 1 and w 2 belong to the same cluster, if their genomic start positions s 1 and s 2 satisfy |s 1 - s 2| ≤ m·k , where m is the word length, and k is a constant. We examined the following choices for k: 1; 1.5; 2; 2.5; 3.
The size of a cluster is defined as the number of words in the cluster. For each MSWL we computed the coefficient of variation (CV) in cluster sizes, where CV is the standard deviation divided by the mean cluster size. We used analysis of variance to test for difference in coefficients of variance among four types of functional DNA: exons, non-fluffy NCNR, fluffy NCNR and regulatory regions. The assumptions for ANOVA (homogeneity of variance (CV), no correlation between means and standard deviations of the samples) were satisfied. The results show a strongly significant difference between the four types: see Figure 10. Thus we can use the cluster size CV to distinguish fluffy NCNR from regulatory DNA. CVs for fluffy NCNR are almost always more than 1, for k from 1 to 3; and significantly different from CVs for regulatory DNA.
We found that large clusters of adjacent over-represented words in fluffy NCNR DNA disappear after repeat-masking , thus revealing their identity as non-perfect simple repeats (Figure 11: compare panels a,b,c with d,e,f).
For details about spatial clustering and illustration of coefficient of variation robustness to choice of k and m, see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].
Our method allows us to distinguish regulatory DNA from other non-regulatory DNA. In effect, our method aggregates many small signals contained in the region, and makes an internal comparison with background, represented by shuffled sequences.
We would like to extend the application of our method to larger sets of experimentally verified regulatory regions, from Drosophila or any other species. Unfortunately, few experimentally (not computationally!) verified sets are available. We managed to extended our positive training set a little, including a few experimentally verified regulatory regions from human, chicken, sea urchin, fruit fly and yeast (see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), but it is still not a lot.
We would also like to explore the correlation between the genomic positions of words in MSWL (most abundant words), and positions of known regulatory elements. This may allow us to utilise our method as a kind of motif discovery algorithm. Unfortunately, again, the lack of reliably annotated regulatory regions with regulatory elements makes this step difficult.
Phylogenetic foot-printing is an important and rapidly developing branch of motif discovery methodology. It would be very interesting to compare genomic positions of words in MSWL with conserved sequences from phylogenetic foot-printing analyses. This would reveal whether such words are conserved, and therefore of functional significance.
In a similar vein, we would like to compare the results of fluffiness analysis results across multiple species. We could then answer the question whether cross-species conserved regions have "fluffy" regulatory region properties, and thus infer their putative function.
We are keen to compare results of our fluffy-tail-analysis with the results of recognition methods based on description of known TFBS, such as in the works  and . These authors  also analysed developmental genes of Drosophila melanogaster containing approximately the same clusters of transcription factors.
The work  is closely related to our study. However, it is likely that their method is unable to distinguish non-perfect simple tandem repeat sequences from truly regulatory DNA. We have implemented their method as far as we can understand it, and found out that their separation of positive (cis-regulatory modules) and negative (coding and non-coding non-regulatory DNA) training sets due to local words frequency seems to be less clear than our separation due to "fluffiness" coefficient F (see Figure 6).
There might be possible other regulatory mechanisms apart from TFBS binding. It may be in some specific cases that the 3D local structure of DNA in the nucleus (chromatin) is the principal factor of gene expression and modulating regulatory modules play little or no role . Thus one of the next steps in our work will be the incorporation of nucleosome position information.
We present a novel statistical approach that allows regulatory DNA to be distinguished from coding and non-coding non-regulatory regions according to its "fluffiness" values. This method is based on the presence of unusually high number of short runs of over-represented scattered words in the given DNA sequence.
The performance of the method on experimentally verified sequence data shows that the method allows us to predict whether a sequence may be regulatory.
Description of fluffy tail test
The fluffy tail test essentially consists of the comparison of similar word distributions for the original sequence and for a number of shuffled versions of the original sequences. These shuffled sequences clearly have the same length and single nucleotide composition as the original one.
To construct a similar words distribution one can perform the following two steps:
(1). First, obtain the distribution of similar words for a given DNA stretch (as described in detail below under "Distribution of similar words"). Then randomise the original sequence many times, and obtain a distribution of similar words for each shuffled sequence. These randomised sequences represent the null model (or the background model). The distributions of similar words obtained for the randomised sequences are compared with the corresponding distribution for the original sequence. If there are no statistical differences, we conclude that the sequence probably is an exon (related results are in ) or a homogeneous non-coding non-regulatory region.
However, if the given sequence does contain many similar words, these will show up in its distribution as a longer right tail that may even have a second mode. Such "fluffy" tails are seldom found in the distributions of the shuffled sequences, therefore suggesting the sequence is not exonic or homogeneous non-coding, non-regulatory DNA.
(2). To rule out "fluffy" tails due to non perfect simple tandem repeats, we check whether a) the similar words are spatially clustered and b) if the tails disappear after repeat-masking the sequence (using the on-line tool available at ) then repeating procedure (1).
Distribution of similar words
We considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]). Thus, for each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length. Each word is the "seed" for a list of similar words.
As an example, consider a stretch of DNA :
accgg gtgtaaaccgacctg atacccg gtcgcccgg ttttaac...
The first "seed" 5-word 'accgg' forms the following list of similar words:
accgg, accga, acctg, acccg, cccgg,
which we have underscored in the above sequence.
The second 5-word 'ccggg' forms another list of similar words:
ccggg, ccggt, ccggt
etc. The first 5-word has the longest list of similar words here. The lists may intersect: e.g. the list for the 'accga'- seed word contains some words from the 'accgg'-seed word list.
Yuh C, Bolouri H, Davidson EH: Genomic cis-regulatory logic: functional analysis and computational model of a sea urchin gene control system. Science 1998, 279: 1896–902. 10.1126/science.279.5358.1896
Yuh C, Bolouri H, Davidson EH: Cis-regulatory logic in the endo 16 gene: switching from a specification to a differentiation mode of control. Development 2001, 128: 617–29.
Davidson EH: Genomic Regulatory Systems. Academic Press; 2001.
Berman B, Nibu Y, Pfeiffer B, Tomancak B, Celniker S, Rubin G, Levine M, Eisen M: Exploiting TFBS clustering to identify CRM involved in pattern formation in Drosophila genome. PNAS 2002, 99(2):757–62. 10.1073/pnas.231608898
Wagner A: A computational genomics approach to the identification of gene networks. Nucleic Acids Research 1997, 25(1):3594–604. 10.1093/nar/25.18.3594
Markstein M, Markstein P, Markstein V, Levine MS: Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 2002, 99(2):763–68. 10.1073/pnas.012591199
Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional lists of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19(Suppl 1):I169-I176. 10.1093/bioinformatics/btg1021
Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory lists in Drosophila. Genome Res 2003, 13(4):579–88. 10.1101/gr.668403
Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 2002, 3(1):30–8. 10.1186/1471-2105-3-30
Duret L, Bucher P: Searching for regulatory elements in human non coding sequences. Curr Opin Struct Biol 1997, 7: 399–406. 10.1016/S0959-440X(97)80058-9
Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Bio 2002, 2: 11–23.
Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res 2003, 13: 73–80. 10.1101/gr.762503
Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2002, 299: 1391–4. 10.1126/science.1081331
Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, Connor OMJ, Schwartz S, Miller W, Chiaromonte F: Distinguishing regulatory DNA from neutral sites. Genome Res 2003, 13: 64–72. 10.1101/gr.817703
Ohler U, Harbeck S, Niemann H, Noth E, Reese MG: Interpolated Markov chains for eukaryotic promoter recognition. Bioinformatics 1999, 15: 362–9. 10.1093/bioinformatics/15.5.362
Ohler U: Promoter prediction on a genomic scale-the Adh experience. Genome Res 2000, 10: 539–42. 10.1101/gr.10.4.539
Ohler U, Niemann H, Liao G, Rubin GM: Joint modelling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 2001, 17: S199–206.
Nazina A, Papatsenko D: Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics 2003, 4: 65–78. 10.1186/1471-2105-4-65
Ensembl Genome Browser[http://www.ensembl.org/]
Audit B, Vaillant C, Arneodo A, d'Aubenton-Carafa Y, Thermes C: Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J Mol Biol 2002, 316: 903–18. 10.1006/jmbi.2001.5363
Orlov Y, Potapov V: Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Research 2004, 32: W628-W633. on-line.
We would like to acknowledge Yvonne Edwards, Tanya Vavouri, Adam Woolfe, Krys Kelly, Gayle McEwen, Greg Elgar, Carlo Berzuini, Tom Nye, Lorenz Wernisch, and Kenneth Evans for valuable discussions and support.
WRG contributed to development of methodology, KW did numerical comparison with other related methods, RtB statistically processed the data, IA contributed to development of methodology, collected the data and wrote the software. All authors read and approved the final manuscript
Electronic supplementary material
Additional File 2: Contains Supplementary Table1 with results of Fluffy-tail test and Coefficients of Variation for some more experimentally verified regulatory regions for other than Fruit fly species. (DOC 564 KB)
Additional File 6: Shows some examples for consistence of fluffiness for different word length in the histogram form (DOC 334 KB)
Additional File 7: Consistent fluffiness and coefficient of variation for spatial cluster size for some example sequences. (DOC 29 KB)
Additional File 8: Contains the Figures showing fluffiness and spatial clustering of similar words for NCNR 3L4 region. (DOC 793 KB)
Additional File 9: Contains the Figures showing fluffiness and spatial clustering of similar words for NCNR repeat-masked 3L4 region. (DOC 792 KB)
Additional File 10: Contains the Figures showing fluffiness and spatial clustering of similar words for knirps regulatory region (DOC 758 KB)
Additional File 11: Contains the Figures showing fluffiness and spatial clustering of similar words for abdominantA regulatory region. (DOC 930 KB)
Additional File 12: Contains the Figures showing fluffiness and spatial clustering of similar words for internal exon 2r4. (DOC 1012 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Abnizova, I., te Boekhorst, R., Walter, K. et al. Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC Bioinformatics 6, 109 (2005). https://doi.org/10.1186/1471-2105-6-109