Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency
© Nazina and Papatsenko; licensee BioMed Central Ltd. 2003
Received: 19 July 2003
Accepted: 22 December 2003
Published: 22 December 2003
Transcription regulatory regions in higher eukaryotes are often represented by cis-regulatory modules (CRM) and are responsible for the formation of specific spatial and temporal gene expression patterns. These extended, ~1 KB, regions are found far from coding sequences and cannot be extracted from genome on the basis of their relative position to the coding regions.
To explore the feasibility of CRM extraction from a genome, we generated an original training set, containing annotated sequence data for most of the known developmental CRMs from Drosophila. Based on this set of experimental data, we developed a strategy for statistical extraction of cis-regulatory modules from the genome, using exhaustive analysis of local word frequency (LWF). To assess the performance of our analysis, we measured the correlation between predictions generated by the LWF algorithm and the distribution of conserved non-coding regions in a number of Drosophila developmental genes.
In most of the cases tested, we observed high correlation (up to 0.6–0.8, measured on the entire gene locus) between the two independent techniques. We discuss computational strategies available for extraction of Drosophila CRMs and possible extensions of these methods.
Recognition of transcription regulatory sequences is one of the most important and challenging problems in modern computational biology. In the case of higher eukaryotes, there are proximal transcription regulatory units, located close to 5' ends of coding sequences and called 'proximal promoters', and distant transcription regulatory units, located further upstream or downstream of the gene and called 'enhancers' or 'cis-regulatory modules' (CRMs). It is clear that identification of a 'proximal' transcriptional unit can be based on its relative position to the coding sequence and the presence of specific transcriptional signals such as TATA box, CAAT box, transcription start site consensus (TSS) and, perhaps, other specific signals (such as downstream promoter elements, DPE). Typical CRMs (or enhancers) possess no such specific features; therefore their annotation in genome is much more difficult.
Currently existing methods dedicated to the recognition of transcription regulatory regions can be subdivided into three main categories: (i) search by signal, (ii) search by content and (iii) phylogenetic footprinting [1–4]. Modern 'search by signal' techniques are based on identification of known transcriptional patterns in DNA sequences, such as clustered binding motifs for known transcriptional regulators [5–10]. Extraction of clustered recognition motifs is among the most reliable current techniques, but it is limited to recognition of similarly regulated cis-regulatory modules in a genome.
Another strategy of CRM extraction from genome is phylogenetic footprinting. Methods of this class assume that regulatory regions contain highly conserved segments and they can be extracted by means of sequence comparison from evolutionary related genomes [11–18]. Performance of the phylogenetic footprinting greatly depends on the evolutionary distance between chosen species and on the conservation level of particular genes from these organisms. Phylogenetic footprinting have become especially important in recent days as more than one genome represents the sequence data for most of the main model organisms. However, it is not clear yet whether phylogenetic footprinting alone is sufficient for precise and exhaustive mapping of CRMs and how many related genomes it will require to achieve this goal. Non-coding conserved regions might also include not only promoter and enhancer regions, but also other functional sequence classes, such as origins of replication, matrix-attached regions etc, so an independent method of CRM extraction might be necessary.
'Search by content' (ab initio) methods are often based on the difference in the local base composition and in the local word composition between the regulatory and non-regulatory DNA [10, 19–21]. It is assumed that the difference is caused by presence of transcriptional signals, such as binding motifs for transcriptional regulators in the regulatory DNA. For example, the presence of multiple copies of the same binding site may change local frequency of short words in promoter regions. This idea was explored by analysis of most frequent hexamers (differential hexamer frequency) , other short words and motifs [23, 24] in regulatory sequences. More recent implementations of the 'search by content' strategy take into account base interdependence in transcription regulatory regions and exploit interpolated Markov chains  as well as local word frequency . General-purpose techniques based on the 'search by content' are of great interest as they provide an independent line of evidence for the recognition of transcription regulatory sequences in genome.
Recognition of regulatory sequences using 'search by content' is still a difficult problem due to the presence of distinct signals in different regulatory sequences as well as high divergence of these signals themselves. In this respect, each particular promoter as well as a given large training set may never contain even a small fraction of all specific words. Beside the binding motifs for transcription factors, the regulatory DNA may also possess patterns with specific physical properties [2, 21] or other functional patterns, such as nucleosome positioning signals [26, 27].
The described diversity of promoters and transcription regulatory signals suggests that methods based on local word composition/local word frequency, regardless of the words themselves, might be more suitable for the CRM and promoter recognition. In this work we describe an exhaustive local word frequency analysis and apply this new technique to recognition of cis-regulatory modules of Drosophila developmental genes. We based our recognition strategy on the assumption that biological signals in any regulatory region possess similar redundant properties, independent on a particular gene, promoter or CRM. According to this assumption, words corresponding to binding motifs for transcription factors might possess one level of redundancy, specific words involved, for instance, in promoter bending might possess another etc.
To describe all these various specific patterns in the context of one DNA sequence segment (resolution window), we represent exhaustive statistical analysis of word frequency, where the local frequency of each word in the DNA segment is taken into consideration, regardless of the word itself. For instance, two DNA segments having similar local word frequencies would produce similar scores, even if the words comprising the two DNA segments were different. In this respect, the proposed strategy aims to identify sequence segments containing specific word distributions rather than sequence segments containing specific words or arrays of specific words. In this sense our new method resembles promoter recognition techniques based on local assessment of most frequent words (hexamers) , but it is superior as it takes into consideration the full range of word frequency and scores even unique words.
Typically, the efficiency of a promoter recognition algorithm is evaluated with the help of an annotated promoter database, such as EPD . The database contains proximal promoter elements, and is very helpful for evaluation of methods based on recognition of specific words, such as proximal transcriptional signals represented by the TATA box, transcription start site (TSS) and others. However, a vast majority of eukaryotic transcription regulatory regions located far upstream or far downstream of TSS contain no proximal signals. Therefore, to test the performance of the proposed new technique we developed a unique training set of sequence data, based on cis-regulatory modules (CRMs) of Drosophila developmental genes, transcription regulatory regions from higher eukaryotes, located far from gene coding sequences and TSS.
The developmental genes of Drosophila comprise spatio-temporal cascade (network) of regulatory interactions, responsible for early pattern formation of the developing fly embryo [29–32]. This model system has several advantages that make it unique in computational sequence analysis: (i) The majority of the genes encode transcription factors that are connected into the network of direct transcriptional interactions. (ii) For most of the developmental genes (Bicoid, Hunchback, Krüppel, etc) large amount of experimental data is available at the genetic, biochemical and evolutionary (comparison between species) levels, including positions of their cis-regulatory modules and description of binding motifs for upstream regulators.
We compared the results of our word frequency analysis of Drosophila melanogaster developmental genes with a reference data, generated by phylogenetic footprinting (percent identity profiles, PIP) of the same genes using recently sequenced genome of Drosophila pseudoobscura. This comparison has revealed a striking agreement between predictions generated by the two independent methods.
Construction of interactive CRM annotation
Recognition of regulatory sequences using 'search by content' approach requires representative training set of sequence data, ideally, a set of functional transcription regulatory regions. For this reason we assembled and annotated all published experimental data for the early developmental enhancers of Drosophila. These data include three major types of information: (i) sequence data for experimentally tested CRM regions, (ii) binding motifs for known transcriptional regulators, and (iii) known regulatory interactions between the early developmental genes.
In comparison with existing dedicated databases, such as GeNet [33, 34], our compilation is focused on available deletion and footprinting data and it is convenient for: (i) fast navigation and retrieving of CRMs and (ii) fast retrieving of the binding motifs involved in a given regulatory interaction. All annotated data are publicly accessible from New York University web site http://homepages.nyu.edu/~dap5.
Construction of positive and negative training sets
We considered it irrelevant to construct positive training set from the entire CRM collection directly as the positions of the CRM boundaries identified by empirical deletion analysis are quite arbitrary. In most of the classical deletion studies the boundaries of a deletion fragment were defined according to the presence of convenient restriction sites. This fact, along with the limit on a possible number of deletion combinations (number of transgenic constructs) resulted in a lack of precise resolution of the deletion analysis technique. In many cases the identified by the deletion technique minimal regulatory elements (such as MSE – minimal stripe enhancers) still provide correct spatial distribution of expression patterns, but the rescued patterns contain only a fraction of the endogeneous gene expression levels.
Despite the fact that we used only five transcription factors to evaluate sequences for a positive training set, the resulting training sequences contained many other binding motifs (yet unknown), present in CRMs. Thus, the five chosen motifs served us as markers only, indicating positions in the sequences that belong to cis-regulatory modules. The total size of the described positive training set comprised more than 68 Kb of sequence data and contained 58 homotypic clusters in 33 non-overlapping contigs. Sequences of the positive training set, including identified homotypic clusters in the gene locus regions are available from our database: http://homepages.nyu.edu/~dap5/PCL/pseudoobscura/train_plus_contigs.zip.
We also generated several negative training sets, one containing random samples (50 Kb each sample) from genome of Drosophila melanogaster, one containing random samples from Drosophila CDS collection  and one containing non-coding sequences only. Each negative training set combined >2 Mb of sequence data (>1.5% of the Drosophila genome).
Multiresolution analysis of word frequency
A partition of the spectrum into non-overlapping frequency channels may be set differently and highly depends on other parameters such as the size of the resolution window and the size of the word. In the current work we considered only short words (2–4 bases), assuming that a functional transcriptional signal (for instance, transcription factor binding site), must have a conserved core, which is often only a small fraction of the functional word . Another advantage of short words is that they have much greater local frequencies, which facilitates statistical analysis and allows construction of spectra with larger number of wide non-overlapping frequency channels. For the same reason, wider windows (w, l) are preferred, but the upper size of the window is limited by the desired resolution of the method. In the case of our model system, Drosophila CRMs, we considered equal detection and resolution windows (w = l) that are close to the size of the minimal known CRM sequences and vary within the range of 0.5–1 Kb.
Statistical discrimination of sequences with distinct word frequency
Based on the feature score values S, one can built a statistical model describing DNA sequence segments belonging to a specific functional class, for example, to a CRM or to a coding/non-coding sequence. To generate the statistical model for a functionally related class of sequences (for positive or negative training sets) we built distribution E of the feature score S in each frequency channel, where E s is the fraction of all windows N in a training set, having the same score S. If N s is the number of windows with the score S, then:
E s = N s / N (1)
Given two functional sequence classes ω1 and ω2 (positive and negative training sets) and a feature score value S j , from a test sequence, we now solve the recognition (classification) problem for each frequency channel independently, using simple log-likelihood ratio test :
Thus, in each frequency channel j, each position of the test sequence i obtains a log-likelihood score L ji , which displays chances that the i-th window of the test sequence belongs to a positive (ω1) or a negative (ω2) training set. In this work, we consider only the simplest case, where the contribution of each frequency channel into the recognition is equal (equal weights). Therefore, we calculated the resulting score R i for the i-th window as a sum of scores L ji for all frequency channels. On practice, we used more sophisticated formula (see below) to calculate and correct the score R i , but even the simplest approximation produced relevant results.
Extraction of Drosophila CRMs using local word frequency
To assess the quality of our word frequency classification algorithm on the same system of early Drosophila genes, we adopted an additional reference set of data that independently marks positions of transcription regulatory modules in the analyzed sequences.
To construct this independent reference dataset we took advantage of the existing tools and strategies [13, 16] and generated conservation profiles (percent identity profiles, PIPs) for wide genomic regions of 16 early developmental gene loci, representing the majority of genes from our database. We retrieved from the recently published genome of D. pseudoobscura (Human Genome Sequencing Center at Baylor College of Medicine) all contigs corresponding to the 16 selected genomic regions of D. melanogaster, and produced the conservation profiles using available standard procedures. The assembled sequences for the developmental genes of D. pseudoobscura along with the graphical data for alignments and numerical data for conservation profiles are publicly available from New York University web site http://homepages.nyu.edu/~dap5/PCL/pseudoobscura/pseudoobscura.htm.
Correlation between the LWF and the conservation profiles. The table shows values of Pearson Association Coefficient (cc) , measured on the entire gene loci sequences (gene names are in the first column). The best correlation (fourth column) was observed between the statistical profiles (Λ) obtained from word frequency analysis and the conservation profiles (PIP). Lower correlation between ether the statistical (Λ/deletion) or conservation (PIP/deletion) profiles with the deletion data supports low resolution of deletion analysis (see results, 'construction of positive and negative training sets'). Jackknife test results are shown (in red typeface) for selected loci that contributed the largest fraction of sequences to the positive training set. The corresponding profiles are shown in Figure 7 [see Additional file 1].
Cross validation of cis-regulatory modules using different computational techniques
Exhaustive mapping of the exact CRM distribution in developmental genes of Drosophila is an extremely important biological task. In its turn, evaluation of novel regulatory sequences in this particular system leads to generation of better training sets for further genome-wide CRM recognition and facilitates computational mapping of binding motifs and other transcriptional signals comprising CRMs.
In this work we compared outputs of two independent computational techniques, the word frequency analysis and the phylogenetic footprinting and have shown that both methods generate highly correlated predictions. Most surprising is that the correlation between the two independent approaches is much higher than the correlation of either one with the deletion analysis data (annotated CRMs). In some cases lower correlation between positions of predicted and the known CRMs (deletion analysis) is explained by the conservative design of our formal tests (see Table 1 and Figure 6). We assumed that in each locus (i) all CRMs are known and (ii) their boundaries are mapped precisely. In reality a vast fraction of the sequences comprising the considered loci (16–25 Kb) were never tested for CRM activity and the CRM boundaries represent an interpretation of empirical biological tests that are often difficult to formalize. At the same time word frequency algorithm produced very high positive scores for many of the best known 'classical' developmental CRMs, such as eve stripe 2, eve stripe3+7, eve stripe 4+6, kr CD1, ftz zebra and many others (see Figure 5). Correct recognition of these 'classical' elements, analyzed by many independent experimental groups, strongly supports relevance of our word frequency analysis.
The good agreement between the two independent techniques (word frequency and phylogenetic footprinting) and the successful recognition of the most known CRMs demonstrates the power of our strategy in extraction of CRMs from developmental genes (see Figure 2). It is important that using positive training set based on clusters of early developmental transcription factors we also extracted some of the late (expressed at later stages of fly development) CRM elements that were not among the training sequences. This suggests that our learning method accounts for general features (such as word distributions) inherent to regulatory DNA, rather than for particular motifs and words, specific to particular promoters/CRMs.
It is early to say what strategy will finally dominate in recognition of transcription regulatory regions. Apparently, at the current stage of this field, cross-validation using several independent techniques seems to be the most appropriate solution.
Multiresolution analysis of transcription regulatory sequences
To analyze the functional properties of transcription regions, we assign frequency value F to each word of a sequence segment. This procedure results in a signal, where every position of DNA contributes one sample. At the following step we partition the frequency spectrum into frequency bands (frequency channels) containing words with similar frequency properties and analyze each frequency band (channel) independently. We take into account the full spectrum of word frequencies in our detection window; therefore for a chosen window size our assessment is exhaustive. Major advantage of this multiresolution analysis is its ability to produce highly informative pattern models, which minimize loss of information and facilitate recognition of very uncertain DNA patterns. In our case, for instance, a combination of non-overlapping frequency channels collects maximal information about the local word frequency in a sequence segment. All this information contributes to the final likelihood score. Direct extraction of words with defined frequencies (single frequency channels) may also have its specific field of application, different from promoter recognition. For example, one can filter out undesirable signals (noise) by combining or subtracting different frequency channels. This preprocessing may facilitate further extraction of biological signals such as binding motifs or nucleosome positioning signals using other techniques.
In this work we select word frequency F as our main feature, however there is a good number of various transformation functions that were already implemented for the analysis of biological sequences [38–40]. Most of these functions are based on analysis of local base composition and local base frequency. It is interesting to emphasize here, that if we accept size of the word equal to 1, then the described word frequency analysis becomes 'local base frequency analysis'.
For many types of signals, such as speech and images, most of the information is localized in certain resolution (scaling) levels. In the case of biological sequences the resolution level or size of the resolution window can be selected equal to the size of known functional regions, such as cis-regulatory modules, analyzed in this work. Nevertheless, in many cases it is not known if the selected resolution window is optimal. In our recent work  we evaluated biological signal (clustered binding motifs) in a broad scaling range and developed a procedure that allows establishing the size of the optimal resolution window. That application also might be considered as a multiresolution analysis, but relatively to the scaling range (resolution window l in our case).
Applications based on the multiresolution analysis are relatively new in the field of promoter study and they still require careful feature selection and thorough biological interpretation of obtained patterns. The last type of problems, however, can be solved in the context of well-known biological systems such as Drosophila developmental enhancers (CRMs), where accumulated biological information and annotated sequence data create very comfortable environment for such exploration.
Noise suppression and error correction
To construct our final likelihood score, we adopted two types of error correction. First correction accounts for chances that the observed in a single frequency channel value of S possesses an error. One can see that according to formula (2), the log-likelihood score is equal to zero in the point where E(ω1|S) = E(ω2|S). This point is marked by the vertical red line in Figure 4B. However, this value S0 is much closer to mathematical expectation of E for the negative training set, than for the positive one. One way to account for this error is to weight cumulative E-values of both training sets for S <S0 and S >S0:
This formula takes more simple form if the distributions are known , which is not exactly our case (see results).
Our second correction is intended to fix possible domination of one frequency channel over the others in some cases. Indeed, if the observed in a test sequence value of S in any frequency channel is very far from the math expectation of a positive or a negative training set (E(S) = 0), than the log-likelihood score of this channel takes very high values, and it predominates over the score of other channels. As a result, the final decision is maid according to this channel only (sum of scores in all channels). To cope with this problem, we add a multiplier, suppressing possible high likelihood values of a j-th frequency channel:
Λ j = L j * v(L j , M, σ) (4)
This multiplier ν is a normal distribution of the likelihood score L with expectation M = 0 and standard deviation σ. For very high positive or very high negative values of L, the value of the multiplier approaches zero, suppressing possible dramatic difference between channels. Behavior of the corrected likelihood score Λ j at the different values of parameter σ is given in Figure 4B.
We thank Konstantin Severinov and Nikolaus Rajewsky for their critical remarks and help with the manuscript preparation. This work was supported by grants from National Institutes of Health GM51946 to Stephen Small and GM064864 to Claude Desplan. The CRM annotation and related resources are available from New York University Web site: http://homepages.nyu.edu/~dap5/
- Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition. Genome Res 1997, 7: 861–78.PubMedGoogle Scholar
- Ohler U, Niemann H: Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet 2001, 17: 56–60. 10.1016/S0168-9525(00)02174-0View ArticlePubMedGoogle Scholar
- Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278: 167–81. 10.1006/jmbi.1998.1700View ArticlePubMedGoogle Scholar
- Yuh CH, Brown CT, Livi CB, Rowen L, Clarke PJ, Davidson EH: Patchy Interspecific Sequence Similarities Efficiently Identify Positive cis-Regulatory Elements in the Sea Urchin. Dev Biol 2002, 246: 148–61. 10.1006/dbio.2002.0618View ArticlePubMedGoogle Scholar
- Hehl R, Wingender E: Database-assisted promoter analysis. Trends Plant Sci 2001, 6: 251–5. 10.1016/S1360-1385(01)01954-9View ArticlePubMedGoogle Scholar
- Klingenhoff A, Frech K, Werner T: Regulatory modules shared within gene classes as well as across gene classes can be detected by the same in silico approach. In Silico Biol 2002, 2: S17–26.PubMedGoogle Scholar
- Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A 2002, 99: 757–62. 10.1073/pnas.231608898PubMed CentralView ArticlePubMedGoogle Scholar
- Markstein M, Markstein P, Markstein V, Levine MS: Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A 2002, 99: 763–8. 10.1073/pnas.012591199PubMed CentralView ArticlePubMedGoogle Scholar
- Lifanov AP, Makeev VJ, Nazina AG, Papatsenko DA: Homotypic regulatory clusters in Drosophila. Genome Res 2003, 13: 579–88. 10.1101/gr.668403PubMed CentralView ArticlePubMedGoogle Scholar
- Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 2002, 3: 30. 10.1186/1471-2105-3-30PubMed CentralView ArticlePubMedGoogle Scholar
- Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res 2002, 12: 739–48. 10.1101/gr.6902PubMed CentralView ArticlePubMedGoogle Scholar
- Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol 2002, 9: 211–23. 10.1089/10665270252935421View ArticlePubMedGoogle Scholar
- Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res 2000, 10: 577–86. 10.1101/gr.10.4.577PubMed CentralView ArticlePubMedGoogle Scholar
- Elnitski L, Riemer C, Petrykowska H, Florea L, Schwartz S, Miller W, Hardison R: PipTools: A Computational Toolkit to Annotate and Analyze Pairwise Comparisons of Genomic Sequences. Genomics 2002, 80: 681–90. 10.1006/geno.2002.7018View ArticlePubMedGoogle Scholar
- Rajewsky N, Socci ND, Zapotocky M, Siggia ED: The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons. Genome Res 2002, 12: 298–308. 10.1101/gr.207502. Article published online before print in January 2002PubMed CentralView ArticlePubMedGoogle Scholar
- Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res 2003, 13: 73–80. 10.1101/gr.762503PubMed CentralView ArticlePubMedGoogle Scholar
- Bergman CM, Pfeiffer BD, Rincon-Limas DE, Hoskins RA, Gnirke A, Mungall CJ, Wang AM, Kronmiller B, Pacleb J, Park S, Stapleton M, Wan K, George RA, de Jong PJ, Botas J, Rubin GM, Celniker SE: Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol 2002, 3: RESEARCH0086. 10.1186/gb-2002-3-12-research0086PubMed CentralView ArticlePubMedGoogle Scholar
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003, 299: 1391–4. 10.1126/science.1081331View ArticlePubMedGoogle Scholar
- Ohler U, Harbeck S, Niemann H, Noth E, Reese MG: Interpolated markov chains for eukaryotic promoter recognition. Bioinformatics 1999, 15: 362–9. 10.1093/bioinformatics/15.5.362View ArticlePubMedGoogle Scholar
- Ohler U: Promoter prediction on a genomic scale – the Adh experience. Genome Res 2000, 10: 539–42. 10.1101/gr.10.4.539PubMed CentralView ArticlePubMedGoogle Scholar
- Ohler U, Niemann H, Liao G, Rubin GM: Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 2001, 17: S199–206.View ArticlePubMedGoogle Scholar
- Hutchinson GB: The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comput Appl Biosci 1996, 12: 391–8.PubMedGoogle Scholar
- Lewis EB, Knafels JD, Mathog DR, Celniker SE: Sequence analysis of the cis-regulatory regions of the bithorax complex of Drosophila. Proc Natl Acad Sci U S A 1995, 92: 8403–7.PubMed CentralView ArticlePubMedGoogle Scholar
- Scherf M, Klingenhoff A, Werner T: Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J Mol Biol 2000, 297: 599–606. 10.1006/jmbi.2000.3589View ArticlePubMedGoogle Scholar
- Levitsky VG, Katokhin AV: Recognition of eukaryotic promoters using a genetic algorithm based on iterative discriminant analysis. In Silico Biol 2003, 3: 8.Google Scholar
- Ioshikhes I, Trifonov EN, Zhang MQ: Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc Natl Acad Sci U S A 1999, 96: 2891–5. 10.1073/pnas.96.6.2891PubMed CentralView ArticlePubMedGoogle Scholar
- Levitsky VG, Podkolodnaya OA, Kolchanov NA, Podkolodny NL: Nucleosome formation potential of eukaryotic DNA: calculation and promoters analysis. Bioinformatics 2001, 17: 998–1010. 10.1093/bioinformatics/17.11.998View ArticlePubMedGoogle Scholar
- Perier RC, Junier T, Bonnard C, Bucher P: The Eukaryotic Promoter Database (EPD): recent developments. Nucleic Acids Res 1999, 27: 307–9. 10.1093/nar/27.1.307PubMed CentralView ArticlePubMedGoogle Scholar
- Kassis JA: Spatial and temporal control elements of the Drosophila engrailed gene. Genes Dev 1990, 4: 433–43.View ArticlePubMedGoogle Scholar
- Nasiadka A, Krause HM: Kinetic analysis of segmentation gene interactions in Drosophila embryos. Development 1999, 126: 1515–26.PubMedGoogle Scholar
- Holloway DM, Reinitz J, Spirov A, Vanario-Alonso CE: Sharp borders from fuzzy gradients. Trends Genet 2002, 18: 385–7. 10.1016/S0168-9525(02)02724-5View ArticlePubMedGoogle Scholar
- Andrioli LP, Vasisht V, Theodosopoulou E, Oberstein A, Small S: Anterior repression of a Drosophila stripe enhancer requires three position-specific mechanisms. Development 2002, 129: 4931–40.PubMedGoogle Scholar
- Kolpakov FA, Ananko EA, Kolesov GB, Kolchanov NA: GeneNet: a gene network database and its automated visualization. Bioinformatics 1998, 14: 529–37. 10.1093/bioinformatics/14.6.529View ArticlePubMedGoogle Scholar
- Serov VN, Spirov AV, Samsonova MG: Graphical interface to the genetic network database GeNet. Bioinformatics 1998, 14: 546–7. 10.1093/bioinformatics/14.6.546View ArticlePubMedGoogle Scholar
- Rubin GM, Lewis EB: A brief history of Drosophila's contributions to genome research. Science 2000, 287: 2216–8. 10.1126/science.287.5461.2216View ArticlePubMedGoogle Scholar
- Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 1995, 23: 4878–84.PubMed CentralView ArticlePubMedGoogle Scholar
- Theodorides S, Koutroumbas K: Pattern recognition London: Academic Press 1998.Google Scholar
- Arneodo A, Bacry E, Graves PV, Muzy JF: Characterizing long-range correlations in DNA sequences from wavelet analysis. Physical Review Letters 1995, 74: 3293–3296. 10.1103/PhysRevLett.74.3293View ArticlePubMedGoogle Scholar
- Dodin G, Vandergheynst P, Levoir P, Cordier C, Marcourt L: Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences. J Theor Biol 2000, 206: 323–6. 10.1006/jtbi.2000.2127View ArticlePubMedGoogle Scholar
- Audit B, Vaillant C, Arneodo A, d'Aubenton-Carafa Y, Thermes C: Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J Mol Biol 2002, 316: 903–18. 10.1006/jmbi.2001.5363View ArticlePubMedGoogle Scholar
- Waterman MS: Introduction to Computational Biology Chapmen & Hall 1995.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.