Our method distinguishes regulatory DNA from non-regulatory DNA. In effect, it aggregates many small signals contained in a region and compares them internally with a background represented by shuffled sequences.
We would like to extend the application of our method to larger sets of experimentally verified regulatory regions, from Drosophila or any other species. Unfortunately, few experimentally (not computationally!) verified sets are available. We managed to extend our positive training set slightly by including a few experimentally verified regulatory regions from human, chicken, sea urchin, fruit fly and yeast (see Supplementary Materials [see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), but it remains small.
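The internal comparison with a shuffled background can be illustrated with a minimal sketch. It is not the definition of the coefficient F used in this work: it assumes exact k-mer counts, a simple mononucleotide shuffle and a z-score as the summary statistic. The names `max_word_count` and `shuffle_z_score`, and the defaults `k=5` and `n_shuffles=200`, are hypothetical choices made for illustration only.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def max_word_count(seq, k):
    """Count of the most frequent exact k-mer in seq (0 if seq is shorter than k)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(counts.values()) if counts else 0


def shuffle_z_score(seq, k=5, n_shuffles=200, seed=0):
    """Z-score of the top k-mer count against shuffled copies of the same sequence.

    Assumption: the background is a plain letter shuffle, which preserves
    base composition but destroys word structure.
    """
    rng = random.Random(seed)
    observed = max_word_count(seq, k)
    letters = list(seq)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        background.append(max_word_count("".join(letters), k))
    mu = mean(background)
    sigma = pstdev(background)
    # With no spread in the background the score is undefined.
    return (observed - mu) / sigma if sigma > 0 else float("nan")
```

A large positive score would indicate that the most abundant word occurs more often than base composition alone predicts. As the discussion of tandem repeats in this section suggests, such a score alone does not establish that a sequence is regulatory.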
We would also like to explore the correlation between the genomic positions of words in MSWL (most abundant words) and the positions of known regulatory elements. This may allow us to use our method as a kind of motif discovery algorithm. Unfortunately, the scarcity of regulatory regions with reliably annotated regulatory elements again makes this step difficult.
Phylogenetic foot-printing is an important and rapidly developing branch of motif discovery methodology. It would be very interesting to compare the genomic positions of words in MSWL with conserved sequences from phylogenetic foot-printing analyses. This would reveal whether such words are conserved, and therefore likely to be of functional significance.
In a similar vein, we would like to compare the results of fluffiness analysis across multiple species. We could then ask whether regions conserved across species have the "fluffy" properties of regulatory regions, and thus infer their putative function.
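The positional comparisons proposed here, of word occurrences against annotated regulatory elements or conserved blocks, could be sketched as follows. This is only an illustration under stated assumptions: exact word matches, 0-based half-open `(start, end)` intervals, and the hypothetical helper names `find_word_positions` and `fraction_inside`. The sequence and interval in the usage lines are invented toy data, not results.

```python
def find_word_positions(seq, word):
    """Return 0-based start positions of all exact, possibly overlapping, occurrences."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def fraction_inside(positions, word_len, elements):
    """Fraction of occurrences lying fully within any half-open (start, end) interval."""
    if not positions:
        return 0.0
    hits = sum(
        any(s <= p and p + word_len <= e for s, e in elements)
        for p in positions
    )
    return hits / len(positions)


# Toy usage: one annotated element covering positions 0..7.
seq = "AAGATAAGGGGATAACC"
word = "GATAA"
positions = find_word_positions(seq, word)
inside = fraction_inside(positions, len(word), [(0, 8)])
```

In practice this fraction would need to be compared with the value expected for randomly placed words of the same length before any conclusion about conservation or function is drawn.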
We are keen to compare the results of our fluffy-tail analysis with those of recognition methods based on descriptions of known TFBS, such as in the works  and . These authors  also analysed developmental genes of Drosophila melanogaster containing approximately the same clusters of transcription factors.
The work  is closely related to our study. However, their method is probably unable to distinguish imperfect simple tandem repeats from truly regulatory DNA. We implemented their method as far as we could understand it, and found that its separation of the positive (cis-regulatory modules) and negative (coding and non-coding non-regulatory DNA) training sets by local word frequency appears less clear than our separation by the "fluffiness" coefficient F (see Figure 6).
There may be regulatory mechanisms other than transcription factor binding at TFBS. In some specific cases, the local 3D structure of DNA in the nucleus (chromatin) may be the principal factor governing gene expression, while regulatory modules play little or no role . Thus one of the next steps in our work will be to incorporate nucleosome position information.