Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test
© Abnizova et al; licensee BioMed Central Ltd. 2005
Received: 17 December 2004
Accepted: 27 April 2005
Published: 27 April 2005
This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information.
We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se . Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA.
We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.
The transcription rate of genes is dictated primarily by interactions between DNA-binding transcription factors. Comparatively short sequences (several hundred to several thousand base pairs, depending on thespecies) upstream or downstream of the transcription start site often play a major role in the regulation of gene expression. Specific sites within such regions are recognized by regulatory proteins (transcription factors), which act upon binding as transcriptional repressors or activators, controlling the rate of transcription. The identification of regulatory regions, which are generally composed of dense clusters of target transcription factor binding sites, forms an essential step in understanding the regulatory interactions that govern the spatial and temporal expression of individual genes (see for example [1, 2]) and genetic regulatory networks, (see for example ).
Ultimately, this task is accomplished experimentally using techniques such as empirical deletion analysis, direct binding measurements, and co-precipitation of protein-DNA complexes. However, experimental verification is expensive and time consuming. Therefore, to address the growing volumes of available genomic sequence, a number of algorithms that identify putative cis-regulatory modules and transcription factor binding sites using evolutionary comparisons, whole-genome data, and known descriptions of transcription factor binding sites, have been successfully developed. Regulatory regions of higher eukaryotes can be subdivided into proximal regulatory units – promoters – which are located close to and upstream of the gene, and distal transcription regulatory units called enhancers or cis-regulatory modules. These may be located far upstream or downstream of the target gene, and are much more difficult to recognise. In our work we focus on recognition of enhancers.
Recognition of regulatory DNA regions based on description of known transcription factor binding sites (TFBS). This approach exploits the clustering of known, cooperatively-acting transcription factors (TFs). Extracting clustered recognition motifs is one of the most reliable techniques, but is limited to the recognition of similarly regulated cis-regulatory regions. Among the most popular representatives of search by known TFBS are [4–9].
Recognition of regulatory DNA based on phylogenetic foot-printing [10–14]. Methods of this type assume that regulatory regions are highly conserved in cross-genomic comparison, and conserved segments can be extracted from evolutionary related genomes. Performance of phylogenetic foot-printing depends on the evolutionary distance between given species and on the conservation level of individual genes. This is an actively progressing area, as more and more sequenced genomes appear. However, such an approach offers little information as to the specific function of the conserved sequences. Furthermore, it is still an open question as to how many genomes are sufficient for reliable extraction of regulatory regions.
Methods based on the difference of local nucleotide composition between regulatory and non regulatory DNA [15–18]. It is assumed that this difference is due to presence of multiple transcription signals, such as binding motifs for TFs in regulatory regions. The works [15–17] are based on constructing a global interpolated Markov model, applied to promoter recognition only.
In our method, we assume that the abundance of regulatory motifs within regulatory regions leaves a distinct "signature" in nucleotide composition, and that it is possible to capture this "signature" statistically. More specifically, we hypothesize that it takes the form of an over-representation of "similar words" (which are not simple repeats).
The approach of looking for over-occurrence of words has also been widely used in motif discovery, but this is not our aim here. This over-representation of similar words should appear as outliers in the right tail of the distribution of similar word lists of variable length. The "fluffy tail test", proposed in this paper, is designed to identify such outliers and is a useful technique when data from multiple genes and genomes are lacking. It may also be used as a complementary tool when such data are available.
In this section, we first present our new statistical 'fluffy tail' test for measuring the overrepresentation of similar words, and then show its performance on experimentally verified sequence data.
To demonstrate the power of our test, we need a positive, experimentally verified, training set of regulatory sequence data, and also negative training sets of non- regulatory sequence data. We use three test beds. The positive training set is a collection of 60 experimentally verified functional Drosophila melanogaster regulatory regions . This set consists of cis-regulatory modules located far from gene coding sequences and transcription start sites. It contains many binding sites (and site clusters), best known of which are bicoid, hunchback, Kruppel, knirps and caudal, – the sites involved in the regulation of developmental genes. The total size of the positive training set comprises about 68 Kb of sequence data, and contains 58 clusters of the same type of TFBS (homotypic). The two negative training sets are: (i) 60 randomly picked Drosophila exons, and (ii) 60 randomly picked Drosophila non-coding, non-regulatory DNA sequences: we excluded exons and regions of length 1 KB upstream and downstream of genes, using the Ensembl Genome Browser . Each training set contains 68 Kb of sequences in total.
Estimation of distributions of similar words
To construct the distribution of similar words, we first need to specify the length of words under consideration. We try to mimic the TF core, which is the less variable part of a binding motif. Because the core of TFBSs is relatively short (around 3–5 bp) we considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]). Thus, for each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length. Thus, each word is the "seed" of a list of similar words. Next, the number of (non-disjoint) lists containing n words is counted, where n = 1,2,3....
From this plot it can be seen that most lists contain 10 to 40 words, but there are outliers: some very large lists form a long, "fluffy" tail. We call a list having the largest size the maximal similar word list (MSWL). If the original sequence is characterized by the presence of an unusually high number of over-represented words, we expect it to contain more long lists in comparison to a random sequence.
Definition of the fluffiness coefficient F
To measure how strong the distribution of similar words of regulatory regions deviate from randomness, we introduce a "fluffiness" coefficient F:
w here L max,originalis the number of words in the maximal similar word list (MSWL) in the original sequence, and σ r are the mean and standard deviation of the MSWL size in each of r shuffled sequences. Here we call the sequence "random" if it is obtained from original sequence by shuffling it, preserving its single nucleotide composition. We will omit the subscript r for F r later in the paper for simplicity.
One can regard F as measuring the difference between signal and noise, where the signal is taken from the original sequence, and the noise from the randomised sequences with the same composition and length. Thus, the fluffiness coefficient is normalised for the length and base composition of the sequence, because we compare each original sequence only with respect to shuffled sequences of the same length and composition. Thus one can compare the fluffiness F for sequences of different base composition and length.
Results for regulatory regions
Sensitivity of F to choice of r, the number of randomisations, for the knirps regulatory region.
"Fluffiness" predictions for three types of functional region, showing the number of fluffy (F>2) sequences, the number of non-fluffy (F<2) sequences and corresponding positive and negative prediction rates, for each type of the region.
Fluffy tails (F>2)
No fluffy tails (F<2)
Non-coding presumed non- regulatory
We now turn to examine the location of similar words in the MSWL for a given sequence.
We now examine the relationship between the MSWL and predicted TFBS sites. We found significant enrichment of most MSWLs with the occurrences of TFBS in databases: when submitted to the Transfac and Jaspar TFBS databases, the "seed" words for MSWLs exhibited 10–20 fold enrichment with putative TFBS in comparison with all 5-mer words within the given regulatory region: thus, for the most part, these "seed" words turned out to be instances of known TFBS (results not shown here).
Results for exons
Thus we have established a statistical difference between exons and regulatory DNA. Next we compare regulatory DNA with non-coding non-regulatory DNA.
Results for non-coding, presumed non-regulatory DNA
To deal with this, we developed a measure of spatial clustering of similar words. We say that two words w 1 and w 2 belong to the same cluster, if their genomic start positions s 1 and s 2 satisfy |s 1 - s 2| ≤ m·k , where m is the word length, and k is a constant. We examined the following choices for k: 1; 1.5; 2; 2.5; 3.
We found that large clusters of adjacent over-represented words in fluffy NCNR DNA disappear after repeat-masking , thus revealing their identity as non-perfect simple repeats (Figure 11: compare panels a,b,c with d,e,f).
For details about spatial clustering and illustration of coefficient of variation robustness to choice of k and m, see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].
Our method allows us to distinguish regulatory DNA from other non-regulatory DNA. In effect, our method aggregates many small signals contained in the region, and makes an internal comparison with background, represented by shuffled sequences.
We would like to extend the application of our method to larger sets of experimentally verified regulatory regions, from Drosophila or any other species. Unfortunately, few experimentally (not computationally!) verified sets are available. We managed to extended our positive training set a little, including a few experimentally verified regulatory regions from human, chicken, sea urchin, fruit fly and yeast (see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), but it is still not a lot.
We would also like to explore the correlation between the genomic positions of words in MSWL (most abundant words), and positions of known regulatory elements. This may allow us to utilise our method as a kind of motif discovery algorithm. Unfortunately, again, the lack of reliably annotated regulatory regions with regulatory elements makes this step difficult.
Phylogenetic foot-printing is an important and rapidly developing branch of motif discovery methodology. It would be very interesting to compare genomic positions of words in MSWL with conserved sequences from phylogenetic foot-printing analyses. This would reveal whether such words are conserved, and therefore of functional significance.
In a similar vein, we would like to compare the results of fluffiness analysis results across multiple species. We could then answer the question whether cross-species conserved regions have "fluffy" regulatory region properties, and thus infer their putative function.
We are keen to compare results of our fluffy-tail-analysis with the results of recognition methods based on description of known TFBS, such as in the works  and . These authors  also analysed developmental genes of Drosophila melanogaster containing approximately the same clusters of transcription factors.
The work  is closely related to our study. However, it is likely that their method is unable to distinguish non-perfect simple tandem repeat sequences from truly regulatory DNA. We have implemented their method as far as we can understand it, and found out that their separation of positive (cis-regulatory modules) and negative (coding and non-coding non-regulatory DNA) training sets due to local words frequency seems to be less clear than our separation due to "fluffiness" coefficient F (see Figure 6).
There might be possible other regulatory mechanisms apart from TFBS binding. It may be in some specific cases that the 3D local structure of DNA in the nucleus (chromatin) is the principal factor of gene expression and modulating regulatory modules play little or no role . Thus one of the next steps in our work will be the incorporation of nucleosome position information.
We present a novel statistical approach that allows regulatory DNA to be distinguished from coding and non-coding non-regulatory regions according to its "fluffiness" values. This method is based on the presence of unusually high number of short runs of over-represented scattered words in the given DNA sequence.
The performance of the method on experimentally verified sequence data shows that the method allows us to predict whether a sequence may be regulatory.
Description of fluffy tail test
The fluffy tail test essentially consists of the comparison of similar word distributions for the original sequence and for a number of shuffled versions of the original sequences. These shuffled sequences clearly have the same length and single nucleotide composition as the original one.
To construct a similar words distribution one can perform the following two steps:
(1). First, obtain the distribution of similar words for a given DNA stretch (as described in detail below under "Distribution of similar words"). Then randomise the original sequence many times, and obtain a distribution of similar words for each shuffled sequence. These randomised sequences represent the null model (or the background model). The distributions of similar words obtained for the randomised sequences are compared with the corresponding distribution for the original sequence. If there are no statistical differences, we conclude that the sequence probably is an exon (related results are in ) or a homogeneous non-coding non-regulatory region.
However, if the given sequence does contain many similar words, these will show up in its distribution as a longer right tail that may even have a second mode. Such "fluffy" tails are seldom found in the distributions of the shuffled sequences, therefore suggesting the sequence is not exonic or homogeneous non-coding, non-regulatory DNA.
(2). To rule out "fluffy" tails due to non perfect simple tandem repeats, we check whether a) the similar words are spatially clustered and b) if the tails disappear after repeat-masking the sequence (using the on-line tool available at ) then repeating procedure (1).
Distribution of similar words
We considered 5-mer words, allowing for 1 mismatch. However, our results also hold for words of length 4 through 12, allowing for 1 through 4 mismatches (see Supplementary materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]). Thus, for each 5-mer word in each of the 180 sequences (60 sequences in each training set) we computed the number n of similar words of the same length. Each word is the "seed" for a list of similar words.
As an example, consider a stretch of DNA :
accgg gtgtaaaccgacctg atacccg gtcgcccgg ttttaac...
The first "seed" 5-word 'accgg' forms the following list of similar words:
accgg, accga, acctg, acccg, cccgg,
which we have underscored in the above sequence.
The second 5-word 'ccggg' forms another list of similar words:
ccggg, ccggt, ccggt
etc. The first 5-word has the longest list of similar words here. The lists may intersect: e.g. the list for the 'accga'- seed word contains some words from the 'accgg'-seed word list.
We would like to acknowledge Yvonne Edwards, Tanya Vavouri, Adam Woolfe, Krys Kelly, Gayle McEwen, Greg Elgar, Carlo Berzuini, Tom Nye, Lorenz Wernisch, and Kenneth Evans for valuable discussions and support.
- Yuh C, Bolouri H, Davidson EH: Genomic cis-regulatory logic: functional analysis and computational model of a sea urchin gene control system. Science 1998, 279: 1896–902. 10.1126/science.279.5358.1896View ArticlePubMed
- Yuh C, Bolouri H, Davidson EH: Cis-regulatory logic in the endo 16 gene: switching from a specification to a differentiation mode of control. Development 2001, 128: 617–29.PubMed
- Davidson EH: Genomic Regulatory Systems. Academic Press; 2001.
- Berman B, Nibu Y, Pfeiffer B, Tomancak B, Celniker S, Rubin G, Levine M, Eisen M: Exploiting TFBS clustering to identify CRM involved in pattern formation in Drosophila genome. PNAS 2002, 99(2):757–62. 10.1073/pnas.231608898PubMed CentralView ArticlePubMed
- Wagner A: A computational genomics approach to the identification of gene networks. Nucleic Acids Research 1997, 25(1):3594–604. 10.1093/nar/25.18.3594PubMed CentralView ArticlePubMed
- Markstein M, Markstein P, Markstein V, Levine MS: Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 2002, 99(2):763–68. 10.1073/pnas.012591199PubMed CentralView ArticlePubMed
- Johansson O, Alkema W, Wasserman WW, Lagergren J: Identification of functional lists of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 2003, 19(Suppl 1):I169-I176. 10.1093/bioinformatics/btg1021View ArticlePubMed
- Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory lists in Drosophila. Genome Res 2003, 13(4):579–88. 10.1101/gr.668403PubMed CentralView ArticlePubMed
- Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 2002, 3(1):30–8. 10.1186/1471-2105-3-30PubMed CentralView ArticlePubMed
- Duret L, Bucher P: Searching for regulatory elements in human non coding sequences. Curr Opin Struct Biol 1997, 7: 399–406. 10.1016/S0959-440X(97)80058-9View ArticlePubMed
- Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Bio 2002, 2: 11–23.
- Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res 2003, 13: 73–80. 10.1101/gr.762503PubMed CentralView ArticlePubMed
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2002, 299: 1391–4. 10.1126/science.1081331View Article
- Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, Connor OMJ, Schwartz S, Miller W, Chiaromonte F: Distinguishing regulatory DNA from neutral sites. Genome Res 2003, 13: 64–72. 10.1101/gr.817703PubMed CentralView ArticlePubMed
- Ohler U, Harbeck S, Niemann H, Noth E, Reese MG: Interpolated Markov chains for eukaryotic promoter recognition. Bioinformatics 1999, 15: 362–9. 10.1093/bioinformatics/15.5.362View ArticlePubMed
- Ohler U: Promoter prediction on a genomic scale-the Adh experience. Genome Res 2000, 10: 539–42. 10.1101/gr.10.4.539PubMed CentralView ArticlePubMed
- Ohler U, Niemann H, Liao G, Rubin GM: Joint modelling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 2001, 17: S199–206.View ArticlePubMed
- Nazina A, Papatsenko D: Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics 2003, 4: 65–78. 10.1186/1471-2105-4-65PubMed CentralView ArticlePubMed
- Ensembl Genome Browser[http://www.ensembl.org/]
- Audit B, Vaillant C, Arneodo A, d'Aubenton-Carafa Y, Thermes C: Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J Mol Biol 2002, 316: 903–18. 10.1006/jmbi.2001.5363View ArticlePubMed
- Orlov Y, Potapov V: Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Research 2004, 32: W628-W633. on-line.PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.