Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency

Background Transcription regulatory regions in higher eukaryotes are often represented by cis-regulatory modules (CRM) and are responsible for the formation of specific spatial and temporal gene expression patterns. These extended, ~1 KB, regions are found far from coding sequences and cannot be extracted from genome on the basis of their relative position to the coding regions. Results To explore the feasibility of CRM extraction from a genome, we generated an original training set, containing annotated sequence data for most of the known developmental CRMs from Drosophila. Based on this set of experimental data, we developed a strategy for statistical extraction of cis-regulatory modules from the genome, using exhaustive analysis of local word frequency (LWF). To assess the performance of our analysis, we measured the correlation between predictions generated by the LWF algorithm and the distribution of conserved non-coding regions in a number of Drosophila developmental genes. Conclusions In most of the cases tested, we observed high correlation (up to 0.6–0.8, measured on the entire gene locus) between the two independent techniques. We discuss computational strategies available for extraction of Drosophila CRMs and possible extensions of these methods.


Results:
To explore the feasibility of CRM extraction from a genome, we generated an original training set, containing annotated sequence data for most of the known developmental CRMs from Drosophila. Based on this set of experimental data, we developed a strategy for statistical extraction of cis-regulatory modules from the genome, using exhaustive analysis of local word frequency (LWF). To assess the performance of our analysis, we measured the correlation between predictions generated by the LWF algorithm and the distribution of conserved non-coding regions in a number of Drosophila developmental genes.

Conclusions:
In most of the cases tested, we observed high correlation (up to 0.6-0.8, measured on the entire gene locus) between the two independent techniques. We discuss computational strategies available for extraction of Drosophila CRMs and possible extensions of these methods.

Background
Recognition of transcription regulatory sequences is one of the most important and challenging problems in modern computational biology. In the case of higher eukaryotes, there are proximal transcription regulatory units, located close to 5' ends of coding sequences and called 'proximal promoters', and distant transcription regulatory units, located further upstream or downstream of the gene and called 'enhancers' or 'cis-regulatory modules' (CRMs). It is clear that identification of a 'proximal' transcriptional unit can be based on its relative position to the coding sequence and the presence of specific transcriptional signals such as TATA box, CAAT box, transcription start site consensus (TSS) and, perhaps, other specific signals (such as downstream promoter elements, DPE). Typical CRMs (or enhancers) possess no such specific features; therefore their annotation in genome is much more difficult.
Currently existing methods dedicated to the recognition of transcription regulatory regions can be subdivided into three main categories: (i) search by signal, (ii) search by content and (iii) phylogenetic footprinting [1][2][3][4]. Modern 'search by signal' techniques are based on identification of known transcriptional patterns in DNA sequences, such as clustered binding motifs for known transcriptional regulators [5][6][7][8][9][10]. Extraction of clustered recognition motifs is among the most reliable current techniques, but it is limited to recognition of similarly regulated cis-regulatory modules in a genome.
Another strategy of CRM extraction from genome is phylogenetic footprinting. Methods of this class assume that regulatory regions contain highly conserved segments and they can be extracted by means of sequence comparison from evolutionary related genomes [11][12][13][14][15][16][17][18]. Performance of the phylogenetic footprinting greatly depends on the evolutionary distance between chosen species and on the conservation level of particular genes from these organisms. Phylogenetic footprinting have become especially important in recent days as more than one genome represents the sequence data for most of the main model organisms. However, it is not clear yet whether phylogenetic footprinting alone is sufficient for precise and exhaustive mapping of CRMs and how many related genomes it will require to achieve this goal. Non-coding conserved regions might also include not only promoter and enhancer regions, but also other functional sequence classes, such as origins of replication, matrix-attached regions etc, so an independent method of CRM extraction might be necessary.
'Search by content' (ab initio) methods are often based on the difference in the local base composition and in the local word composition between the regulatory and nonregulatory DNA [10,[19][20][21]. It is assumed that the difference is caused by presence of transcriptional signals, such as binding motifs for transcriptional regulators in the regulatory DNA. For example, the presence of multiple copies of the same binding site may change local frequency of short words in promoter regions. This idea was explored by analysis of most frequent hexamers (differential hexamer frequency) [22], other short words and motifs [23,24] in regulatory sequences. More recent implementations of the 'search by content' strategy take into account base interdependence in transcription regulatory regions and exploit interpolated Markov chains [19] as well as local word frequency [25]. General-purpose techniques based on the 'search by content' are of great interest as they provide an independent line of evidence for the recognition of transcription regulatory sequences in genome.
Recognition of regulatory sequences using 'search by content' is still a difficult problem due to the presence of distinct signals in different regulatory sequences as well as high divergence of these signals themselves. In this respect, each particular promoter as well as a given large training set may never contain even a small fraction of all specific words. Beside the binding motifs for transcription factors, the regulatory DNA may also possess patterns with specific physical properties [2,21] or other functional patterns, such as nucleosome positioning signals [26,27].
The described diversity of promoters and transcription regulatory signals suggests that methods based on local word composition/local word frequency, regardless of the words themselves, might be more suitable for the CRM and promoter recognition. In this work we describe an exhaustive local word frequency analysis and apply this new technique to recognition of cis-regulatory modules of Drosophila developmental genes. We based our recognition strategy on the assumption that biological signals in any regulatory region possess similar redundant properties, independent on a particular gene, promoter or CRM. According to this assumption, words corresponding to binding motifs for transcription factors might possess one level of redundancy, specific words involved, for instance, in promoter bending might possess another etc.
To describe all these various specific patterns in the context of one DNA sequence segment (resolution window), we represent exhaustive statistical analysis of word frequency, where the local frequency of each word in the DNA segment is taken into consideration, regardless of the word itself. For instance, two DNA segments having similar local word frequencies would produce similar scores, even if the words comprising the two DNA segments were different. In this respect, the proposed strategy aims to identify sequence segments containing specific word distributions rather than sequence segments containing specific words or arrays of specific words. In this sense our new method resembles promoter recognition techniques based on local assessment of most frequent words (hexamers) [22], but it is superior as it takes into consideration the full range of word frequency and scores even unique words.
Typically, the efficiency of a promoter recognition algorithm is evaluated with the help of an annotated promoter database, such as EPD [28]. The database contains proximal promoter elements, and is very helpful for evaluation of methods based on recognition of specific words, such as proximal transcriptional signals represented by the TATA box, transcription start site (TSS) and others. However, a vast majority of eukaryotic transcription regulatory regions located far upstream or far downstream of TSS contain no proximal signals. Therefore, to test the performance of the proposed new technique we developed a unique training set of sequence data, based on cis-regulatory modules (CRMs) of Drosophila developmental genes, transcription regulatory regions from higher eukaryotes, located far from gene coding sequences and TSS.
The developmental genes of Drosophila comprise spatiotemporal cascade (network) of regulatory interactions, responsible for early pattern formation of the developing fly embryo [29][30][31][32]. This model system has several advantages that make it unique in computational sequence analysis: (i) The majority of the genes encode transcription factors that are connected into the network of direct transcriptional interactions. (ii) For most of the developmental genes (Bicoid, Hunchback, Krüppel, etc) large amount of experimental data is available at the genetic, biochemical and evolutionary (comparison between species) levels, including positions of their cis-regulatory modules and description of binding motifs for upstream regulators.
We compared the results of our word frequency analysis of Drosophila melanogaster developmental genes with a reference data, generated by phylogenetic footprinting (percent identity profiles, PIP) of the same genes using recently sequenced genome of Drosophila pseudoobscura. This comparison has revealed a striking agreement between predictions generated by the two independent methods.

Construction of interactive CRM annotation
Recognition of regulatory sequences using 'search by content' approach requires representative training set of sequence data, ideally, a set of functional transcription regulatory regions. For this reason we assembled and annotated all published experimental data for the early developmental enhancers of Drosophila. These data include three major types of information: (i) sequence data for experimentally tested CRM regions, (ii) binding motifs for known transcriptional regulators, and (iii) known regulatory interactions between the early developmental genes.
To assemble the CRM sequence data, we retrieved published deletion analysis data for more than 60 CRMs from 20 different developmental genes. To navigate through this sequence collection, we created an interactive database containing two major structural levels (see Figure 1). The top level comprises a list of genes, and references to the corresponding sources in the literature. The bottom level consists of an interactive map, describing the position of all known functional elements (coding sequences and enhancers) in each gene locus (loci ranged in size from 16 to 120 Kb). The exact location of each functional element is presented in a table below the map, along with a description of any known regulatory interactions mediated by the element. The bottom level also contains an annotated text of the locus sequence with highlighted functional regions. In addition, we aligned footprint data for 28 binding motifs, representing the majority of the maternal, gap, pair rule, and some segment polarity genes. For each alignment, we established the optimal motif width, and outlined a well-defined core formed by positions with high information content. This data is also available from our interactive database.
In comparison with existing dedicated databases, such as GeNet [33,34], our compilation is focused on available deletion and footprinting data and it is convenient for: (i) fast navigation and retrieving of CRMs and (ii) fast retrieving of the binding motifs involved in a given regulatory interaction. All annotated data are publicly accessible from New York University web site http:// homepages.nyu.edu/~dap5.

Construction of positive and negative training sets
We considered it irrelevant to construct positive training set from the entire CRM collection directly as the positions of the CRM boundaries identified by empirical deletion analysis are quite arbitrary. In most of the classical deletion studies the boundaries of a deletion fragment were defined according to the presence of convenient restriction sites. This fact, along with the limit on a possible number of deletion combinations (number of transgenic constructs) resulted in a lack of precise resolution of the deletion analysis technique. In many cases the identified by the deletion technique minimal regulatory elements (such as MSE -minimal stripe enhancers) still provide correct spatial distribution of expression patterns, but the rescued patterns contain only a fraction of the endogeneous gene expression levels.
To minimize the effect of possible errors caused by the insufficient resolution of the deletion technique and to develop a formal principle of the positive training set construction, we defined CRM boundaries based on position of clusters of binding sites for transcription factors, involved in the CRM regulation (see Figure 2). In our previous work [9], we have demonstrated that positions of binding site clusters for these regulators correlate well with the positions of the CRMs, identified previously using deletion analysis. In a number of related studies [7,8,10] it has also been demonstrated that the binding motif clusters can be effectively used for mapping CRMs in the genome. Therefore, we constructed our positive training set from sequences containing the most significant clusters of binding sites for five transcription factors (Bicoid, Hunchback, Krüppel, Knirps and Caudal), involved in the regulation of many genes in our CRM database. The vast majority of these clusters overlaps the deletion analysis data, as it has been demonstrated earlier [9].
Despite the fact that we used only five transcription factors to evaluate sequences for a positive training set, the resulting training sequences contained many other binding motifs (yet unknown), present in CRMs. Thus, the five chosen motifs served us as markers only, indicating positions in the sequences that belong to cis-regulatory modules. The total size of the described positive training set comprised more than 68 Kb of sequence data and contained 58 homotypic clusters in 33 non-overlapping contigs. Sequences of the positive training set, including identified homotypic clusters in the gene locus regions are available from our database: http://homepages.nyu.edu/ dap5/PCL/pseudoobscura/train_plus_contigs.zip.
We also generated several negative training sets, one containing random samples (50 Kb each sample) from genome of Drosophila melanogaster, one containing random samples from Drosophila CDS collection [35] and one containing non-coding sequences only. Each negative training set combined >2 Mb of sequence data (>1.5% of the Drosophila genome).

Multiresolution analysis of word frequency
One can describe frequency F of a word in ith position of a DNA sequence through a number of matches found for this word inside a w-window (detection window), centered on the word (See Figure 3, 3A). Presence of binding motifs and other transcriptional signals affects the word frequency, but, in most cases, it is not known which functional signal corresponds to what frequency level. Therefore, consideration of only words with a given frequency level, for instance 'most frequent' words (high F), may result in a loss of functional information. To collect the information exhaustively, we take into consideration local frequency F of each word (or each position) in a Interactive collection of CRMs sequence segment. Then, we group (sort) all words in a sequence segment according to their frequency F into separate frequency channels, where the feature score S j (i) in a frequency channel j represents the total number of words (positions) in a window l (resolution window) having a frequency from a fixed range of the frequency value: j ≤ F < (j + n), (j, n ∈ N) (See Figure 3, 3B). For instance, if j = 1, n = 0, the feature score S j (i) of the resulting frequency channel represents the total number of unique words (having F = 1) in a resolution window l. In contrast, high values of j describe frequency channels that account for highly repetitive words. High values of n define 'wide' frequency channels, combining words with distant frequency values. Combination of all non-overlapping frequency channels in the range of w >F > 0 covers full spectrum of word frequency detected in the window w. Independent consideration of the feature score S in each frequency channel represents multiresolution analysis of our main feature, word frequency.
A partition of the spectrum into non-overlapping frequency channels may be set differently and highly depends on other parameters such as the size of the resolution window and the size of the word. In the current work we considered only short words (2-4 bases), assuming that a functional transcriptional signal (for instance, transcription factor binding site), must have a conserved core, which is often only a small fraction of the functional word [36]. Another advantage of short words is that they have much greater local frequencies, which facilitates statistical analysis and allows construction of spectra with larger number of wide non-overlapping frequency channels. For the same reason, wider windows (w, l) are preferred, but the upper size of the window is limited by the desired resolution of the method. In the case of our model system, Drosophila CRMs, we considered equal detection and resolution windows (w = l) that are close to the size of the minimal known CRM sequences and vary within the range of 0.5-1 Kb.

Statistical discrimination of sequences with distinct word frequency
Based on the feature score values S, one can built a statistical model describing DNA sequence segments belonging to a specific functional class, for example, to a CRM or to a coding/non-coding sequence. To generate the statistical model for a functionally related class of sequences (for positive or negative training sets) we built distribution E of the feature score S in each frequency channel, where E s is the fraction of all windows N in a training set, having the same score S. If N s is the number of windows with the score S, then: Mathematical expectations M(S) and distributions of the feature score E(S) obtained for positive and negative training sets are given in Figure 4A, 4B. In some training sets we obtained distributions that are very close to Poisson in several frequency channels. However, that was not common, especially in the case of frequency channels accounting for unique or very frequent words. For this reason, we have used the E-values, obtained directly from the distributions that were smoothed to reduce statistical noise.
Given two functional sequence classes ω 1 and ω 2 (positive and negative training sets) and a feature score value S j , from a test sequence, we now solve the recognition (classification) problem for each frequency channel independently, using simple log-likelihood ratio test [37]: Thus, in each frequency channel j, each position of the test sequence i obtains a log-likelihood score L ji , which displays chances that the i-th window of the test sequence belongs to a positive (ω 1 ) or a negative (ω 2 ) training set.
In this work, we consider only the simplest case, where the contribution of each frequency channel into the recognition is equal (equal weights). Therefore, we calculated the resulting score R i for the i-th window as a sum of scores L ji for all frequency channels. On practice, we used more sophisticated formula (see below) to calculate and correct the score R i , but even the simplest approximation produced relevant results.

Extraction of Drosophila CRMs using local word frequency
To assess the quality of our word frequency classification algorithm on the same system of early Drosophila genes, we adopted an additional reference set of data that independently marks positions of transcription regulatory modules in the analyzed sequences.
To construct this independent reference dataset we took advantage of the existing tools and strategies [13,16] and generated conservation profiles (percent identity profiles, PIPs) for wide genomic regions of 16 early developmental gene loci, representing the majority of genes from our database. We retrieved from the recently published genome of D. pseudoobscura (Human Genome Sequencing Center at Baylor College of Medicine) all contigs corresponding to the 16 selected genomic regions of D. melanogaster, and produced the conservation profiles using available standard procedures. The assembled sequences for the developmental genes of D. pseudoobscura along with the graphical data for alignments and numerical data for conservation profiles are publicly available from New York University web site http://homepages.nyu.edu/ dap5/PCL/pseudoobscura/pseudoobscura.htm.
We measured correlation (Pearson Association Coefficient) between the likelihood profile obtained from the Local word frequency algorithm Extraction of CRM sequences using local word frequency algorithm  Table 1). Notice that we put percent of sequence identity to zero in exons (yellow bars in the functional map), thus penalizing the correlation value if the word frequency algorithm produced positive score in these regions (see B).
word frequency analysis and the conservation profile, both constructed independently for each of the 16 developmental gene loci from our annotation. We also measured correlation of the likelihood profile and the conservational profile with the positions of annotated CRMs (deletion data). Some of the described results are shown in Figure 5; correlation values for the loci of 16 developmental genes are given in Table 1. One can see that the both independently constructed profiles correlated with each other very well in almost all cases. This important finding suggests that genomic DNA segments possessing word frequency of regulatory clusters (positive training set) correspond to highly conserved regions.
We performed the described word frequency analysis using three different negative training sets: random samples (50 Kb sample size, total of 2 Mb sequence data) from genome of Drosophila, random samples from Drosophila cDNA collection and genomic samples containing non-coding DNA. Genomic samples and non-coding sequences resulted in better correlation of the likelihood profile with the conservational profile and with the deletion data (annotated CRMs) than coding sequences. Figure 7 [see additional file 1] shows comparison of performance of LWF algorithm, trained on different datasets. We also compared prediction accuracy of LWF algorithm with prediction accuracy of phylogenetic footprinting. Results of a benchmarking test shown in Figure  6 demonstrated better performance of LWF in extraction of known CRM regions (deletion data). A combination of the two independent approaches allowed further improvement of the prediction accuracy (see Figure 6, 6C).

Cross validation of cis-regulatory modules using different computational techniques
Exhaustive mapping of the exact CRM distribution in developmental genes of Drosophila is an extremely important biological task. In its turn, evaluation of novel regulatory sequences in this particular system leads to generation of better training sets for further genome-wide CRM recognition and facilitates computational mapping of binding motifs and other transcriptional signals comprising CRMs.
In this work we compared outputs of two independent computational techniques, the word frequency analysis and the phylogenetic footprinting and have shown that both methods generate highly correlated predictions. Most surprising is that the correlation between the two independent approaches is much higher than the correlation of either one with the deletion analysis data (annotated CRMs). In some cases lower correlation between positions of predicted and the known CRMs (deletion analysis) is explained by the conservative design of our formal tests (see Table 1 and Figure 6). We assumed that in each locus (i) all CRMs are known and (ii) their Table 1  Prediction accuracy on a wide genomic region boundaries are mapped precisely. In reality a vast fraction of the sequences comprising the considered loci (16)(17)(18)(19)(20)(21)(22)(23)(24)(25) were never tested for CRM activity and the CRM boundaries represent an interpretation of empirical biological tests that are often difficult to formalize. At the same time word frequency algorithm produced very high positive scores for many of the best known 'classical' developmental CRMs, such as eve stripe 2, eve stripe3+7, eve stripe 4+6, kr CD1, ftz zebra and many others (see Figure 5). Correct recognition of these 'classical' elements, analyzed by many independent experimental groups, strongly supports relevance of our word frequency analysis.

: Correlation between the LWF and the conservation profiles. The table shows values of Pearson Association Coefficient (cc) [41], measured on the entire gene loci sequences (gene names are in the first column). The best correlation (fourth column) was observed between the statistical profiles (Λ) obtained from word frequency analysis and the conservation profiles (PIP). Lower correlation between ether the statistical (Λ/deletion) or conservation (PIP/deletion) profiles with the deletion data supports low resolution of deletion analysis (see results
The good agreement between the two independent techniques (word frequency and phylogenetic footprinting) and the successful recognition of the most known CRMs demonstrates the power of our strategy in extraction of CRMs from developmental genes (see Figure 2). It is important that using positive training set based on clusters of early developmental transcription factors we also extracted some of the late (expressed at later stages of fly development) CRM elements that were not among the training sequences. This suggests that our learning method accounts for general features (such as word distributions) inherent to regulatory DNA, rather than for particular motifs and words, specific to particular promoters/ CRMs.
It is early to say what strategy will finally dominate in recognition of transcription regulatory regions. Apparently, at the current stage of this field, cross-validation using several independent techniques seems to be the most appropriate solution.

Multiresolution analysis of transcription regulatory sequences
To analyze the functional properties of transcription regions, we assign frequency value F to each word of a sequence segment. This procedure results in a signal, where every position of DNA contributes one sample. At the following step we partition the frequency spectrum into frequency bands (frequency channels) containing words with similar frequency properties and analyze each frequency band (channel) independently. We take into account the full spectrum of word frequencies in our detection window; therefore for a chosen window size our assessment is exhaustive. Major advantage of this multiresolution analysis is its ability to produce highly informative pattern models, which minimize loss of information and facilitate recognition of very uncertain DNA patterns. In our case, for instance, a combination of non-overlapping frequency channels collects maximal information about the local word frequency in a sequence segment. All this information contributes to the final likelihood score. Direct extraction of words with defined frequencies (single frequency channels) may also have its specific field of application, different from promoter recognition. For example, one can filter out undesirable signals (noise) by combining or subtracting different frequency channels. This preprocessing may facilitate further extraction of biological signals such as binding motifs or nucleosome positioning signals using other techniques.
In this work we select word frequency F as our main feature, however there is a good number of various transformation functions that were already implemented for the analysis of biological sequences [38][39][40]. Most of these functions are based on analysis of local base composition and local base frequency. It is interesting to emphasize here, that if we accept size of the word equal to 1, then the described word frequency analysis becomes 'local base frequency analysis'.
For many types of signals, such as speech and images, most of the information is localized in certain resolution (scaling) levels. In the case of biological sequences the resolution level or size of the resolution window can be selected equal to the size of known functional regions, such as cis-regulatory modules, analyzed in this work. Nevertheless, in many cases it is not known if the selected resolution window is optimal. In our recent work [9] we evaluated biological signal (clustered binding motifs) in a broad scaling range and developed a procedure that allows establishing the size of the optimal resolution window. That application also might be considered as a multiresolution analysis, but relatively to the scaling range (resolution window l in our case).
Applications based on the multiresolution analysis are relatively new in the field of promoter study and they still require careful feature selection and thorough biological interpretation of obtained patterns. The last type of problems, however, can be solved in the context of well-known biological systems such as Drosophila developmental enhancers (CRMs), where accumulated biological information and annotated sequence data create very comfortable environment for such exploration.

Noise suppression and error correction
To construct our final likelihood score, we adopted two types of error correction. First correction accounts for chances that the observed in a single frequency channel value of S possesses an error. One can see that according to formula (2), the log-likelihood score is equal to zero in the point where E(ω 1 |S) = E(ω 2 |S). This point is marked by the vertical red line in Figure 4B. However, this value S 0 is much closer to mathematical expectation of E for the negative training set, than for the positive one. One way to account for this error is to weight cumulative E-values of both training sets for S <S 0 and S >S 0 : This formula takes more simple form if the distributions are known [37], which is not exactly our case (see results).
Our second correction is intended to fix possible domination of one frequency channel over the others in some cases. Indeed, if the observed in a test sequence value of S in any frequency channel is very far from the math expectation of a positive or a negative training set (E(S) = 0), than the log-likelihood score of this channel takes very high values, and it predominates over the score of other channels. As a result, the final decision is maid according to this channel only (sum of scores in all channels). To cope with this problem, we add a multiplier, suppressing possible high likelihood values of a j-th frequency channel: This multiplier ν is a normal distribution of the likelihood score L with expectation M = 0 and standard deviation σ. For very high positive or very high negative values of L, the value of the multiplier approaches zero, suppressing possible dramatic difference between channels. Behavior of the corrected likelihood score Λ j at the different values of parameter σ is given in Figure 4B.