BLISS: biding site level identification of shared signal-modules in DNA regulatory sequences

Meng, Hailong; Banerjee, Arunava; Zhou, Lei

doi:10.1186/1471-2105-7-287

Methodology article
Open access
Published: 07 June 2006

BLISS: biding site level identification of shared signal-modules in DNA regulatory sequences

Hailong Meng^1,2,
Arunava Banerjee¹ &
Lei Zhou²

BMC Bioinformatics volume 7, Article number: 287 (2006) Cite this article

4001 Accesses
4 Citations
Metrics details

Abstract

Background

Regulatory modules are segments of the DNA that control particular aspects of gene expression. Their identification is therefore of great importance to the field of molecular genetics. Each module is composed of a distinct set of binding sites for specific transcription factors. Since experimental identification of regulatory modules is an arduous process, accurate computational techniques that supplement this process can be very beneficial. Functional modules are under selective pressure to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity.

Results

In this study, we present a novel approach that detects conserved regulatory modules via comparisons at the binding site level. The technique compares the binding site profiles of orthologs and identifies those segments that have similar (not necessarily identical) profiles. The similarity measure is based on the inner product of transformed profiles, which takes into consideration the p values of binding sites as well as the potential shift of binding site positions. We tested this approach on simulated sequence pairs as well as real world examples. In both cases our technique was able to identify regulatory modules which could not to be identified using sequence-similarity based approaches such as rVista 2.0 and Blast.

Conclusion

The results of our experiments demonstrate that, for sequences with little overall similarity at the DNA sequence level, it is still possible to identify conserved regulatory modules based solely on binding site profiles.

Background

Transcription is the fundamental biological process in which a selected region of DNA is transcribed into RNA by a molecular machinery, a core component of which is RNA polymerase. For most protein-coding genes, transcription is the intermediate step via which the information coded in their DNA is "expressed" into functioning proteins. For others, such as RNA genes, the product of the transcription itself may have biological function. Even though each cell has the complete set of genes in its chromosomal DNA, only a portion of the genes are transcribed (expressed) in any particular cell depending on tissue/cell type, developmental stage, etc [1]. The transcriptome, that is all of the genes that are selectively transcribed in a cell, determines the function and morphology of the cell. In addition, the level (i.e., rate) of transcription is often regulated in response to intra-cellular and extra-cellular environmental factors to achieve cellular homeostasis. Normal transcriptional regulation, i.e., the right genes being transcribed at the right times, in the right cell, and at the appropriate rates, is central to almost all physiological processes. Abnormal regulation of transcription often results in disruption of development and/or pathological states. For example, ectopic (i.e., abnormally high) expression of oncogenes leads to hyperplasia and cancer.

A basic element of transcriptional regulation is the interaction of transcription factors (trans factors) and their corresponding t ranscription f actor b inding s ites (TFBSs, also referred to as cis factors) on the DNA. Transcriptional regulation of a gene (e.g. restricted transcription in a particular cell type, or elevated transcription, in response to UV light) is often mediated through the functional/physical interactions among multiple transcription factors, each recruited to the proximity of the DNA in part by their selective affinity to their corresponding binding sites. For example, the even-skipped(eve) gene is transiently expressed in 7 alternative stripes on the longitudinal axis in the developing Drosophila melanogaster embryo at the blastoderm stage. Each of the seven stripes is regulated by a distinct set of transcription factors binding to their corresponding binding sites located in a DNA segment flanking the even-skipped gene. The most well investigated of these segments is the stripe 2 regulatory region, which has identified binding sites for 5 different transcription factors in a 700 bp (base pair)-1 kb (kilo base pair) DNA region in front of the transcription initiation site of the eve gene. Evolutionary comparison of this transcription regulatory module in different Drosophila species has revealed that most of the binding sites are highly conserved and functional, even though the underlying DNA sequence has undergone considerable change [2].

A useful analogy to understanding the composition of DNA regulatory modules is to consider DNA as a sequence of "Letters" and individual binding sites as "Words". Then, a functional module of closely associated binding sites can be conceived of as the "Sentence" command embedded in the DNA sequence that guides transcription. The problems associated with identifying the "Sentence" commands in a DNA sequence are two fold. First, the binding sites are degenerate in nature, that is, the same "Word" may have different letters in certain positions. Second, the "Words" are themselves interspersed by varying lengths of meaningless "Letters".

One approach to identifying DNA regulatory modules is through cross-genome comparison. The assumption underlying this approach is that DNA sequences encoding functional TFBSs are under selective pressure to be conserved during evolution, whereas non-functional DNA mutate/change more rapidly. Thus, if DNA sequences flanking orthologs in two related species were to be compared for sequence-level similarity, DNA regulatory modules would appear as conserved "islands" in a sea of otherwise not-conserved DNA sequences. Approaches in this category include rVista2.0, ConSite, PhyME, TOUCAN, CREME, TraFAC, etc [3–10]. For instances, based on the sequence level conservation between human and mouse, Cora et al. predicted functional TFBSs that are statistically over-represented and share the same specific Gene Ontology (GO) terms [9]. This kind of cross-genome comparison approaches has successfully led to the discovery of regulatory modules that were subsequently verified by functional characterization [11].

The disadvantage of the sequence-based approach is that it is dependent on the overall conservation of the DNA region harbouring the regulatory module. As we mentioned earlier, TFBS sequences are degenerate in nature and are interspersed by non-functional sequences which are not under conservation pressure. Depending on the ratio of functional versus non-functional base-pairs in the DNA region, it is entirely possible that the overall sequence level conservation of the region be indistinguishable from the background level, while the actual TFBSs making up the functional module still be conserved. In other words, it is possible that despite the conservation of the "Sentence" command at the binding site level, the overall conservation of the DNA backbone at the sequence level be minimal or non-detectable. This situation is aggravated if the participating binding sites are highly degenerate (i.e., tolerate many variations at the sequence level) and the spaces between the binding sites are long. In fact, it has been observed by researchers in many instances that while the regulatory region has no detectable overall similarity, the transcription regulatory function is preserved [12]. Sequence-based approaches, or approaches requiring filtration of sequences based on DNA level similarity, are not helpful for identifying the responsible TFBSs in this scenario.

Since the conservation pressure is at the binding site level, i.e., the sequence must be able to maintain binding affinity to the transcription factor(s), it makes biological sense to perform comparisons at the binding site level rather than at the sequence level. This, however, is currently hindered by two factors. The first limitation is the effectiveness of identifying transcriptional binding sites in a given DNA sequence. The set of TFBSs for a given TF can be quantitatively represented using a frequency matrix that describes the binding specificity of the TF at each of its positions. The quality of the matrix used to identify potential TFBSs is determined by the number and quality of known binding site sequences used to construct the matrix. As a result of the development of new technologies such as Chip [13] and ChIP-chip [14], it is anticipated that binding site instances will be identified at a unprecedented rate which will undoubtedly greatly enhance both the quality as well as the coverage of binding site matrices in the near future [15].

The second limitation is that we currently do not understand the grammar governing how binding sites (Words) make up the regulatory modules (Sentences). Based on our understanding of transcriptional regulation, such a grammar should have at least three components: (1) the composition of the binding sites, (2) the sequence of the binding sites, and (3) the spaces between/among the binding sites. Currently, the number of regulatory modules that have been thoroughly characterized is far fewer than what is required to decode this grammar.

A major obstacle for biologists working on transcriptional regulation is to locate and identify potential TFBSs responsible for a particular regulatory module, especially in sequences that do not have significant conserved islands. In this paper, we describe a novel methodology for binding site level identification of conserved regulatory modules in such sequences.

Results

Simulating sequence pairs harbouring a conserved module of binding sites

Since the number of well-studied regulatory modules is currently rather sparse, we chose to simulate sequence sets (pairs) representing the domain of our interest, i.e., conserved binding site patterns in a pair of sequences which nonetheless have little or no similarity at the sequence level. In many cases, experimental investigation in a model organism has narrowed down the location of the regulatory module(s) for a particular gene to a relatively short region (e.g. within 1 kb), whereas for the ortholog in a less-studied organism (reference organism), information about the localization of the module is absent (except that it is generally in the proximity of the gene). In view of this, in the first (current) stage of the development of BLISS, we considered the identification of a conserved module present in both a short sequence of about 100–500 bps, representing the model organism, and a longer sequence (5–6 kb), representing the reference organism. Although this simplification limits the applicability of the current methodology, it does highlight the promise of our approach.

For each sequence pair, the backbones for both the short sequence and the long sequence were generated randomly and thus had no sequence similarity. A hypothetical module involving binding sites for 4–8 distinct transcription factors was first introduced into the short sequence. The binding site sequences were randomly selected from the instances recorded in the TRANSFAC 9.1 database [15]. The rules governing the formation of the hypothetical module were as follows:

1.) A module contains binding sites for 4–8 distinct transcription factors.

2.) For each transcription factor, there may be more than one binding site in the module.

3.) The distance between consecutive binding sites is "d_i", where in 65% of the cases d_ilies within 5–20 bps, in 22% of the cases d_ilies within 21–50 bps and in the remaining 13% of the cases d_ilies within 51–60 bps (Figure 1).

The range of values for d_iwas based on background knowledge as well as a statistical analysis of the distances between pairs of TFBSs in TransCompel database by Qiu et al. which revealed the above distribution of distances between pairs of TFBSs [16].

The hypothetical module was first simulated according to the above rules in the short sequence. Subsequently, a "conserved" module was formulated and inserted into the longer sequence at a random location. The rules governing the formation of the "conserved" module were as follows:

1.) It is comprised of TFBSs that correspond to the same transcription factors as present in the hypothetical module.

2.) The sequence for each TFBS is randomly chosen from the recorded instances in TRANSFAC 9.1, with the caveat that it cannot be the same instance(s) that was (were) used to construct the hypothetical module in the short sequence.

3.) The respective order of TFBSs is the same as in the hypothetical module in the short sequence.

4.) The distance between consecutive binding sites in the conserved module is d_j; d_jis a function of d_iin that d_jlies in the range (d_i- Δd, d_i+ Δd) (Figure 1).

Δd is the "perturbation factor" – the variation of distance between corresponding binding sites in the hypothetical module and the conserved module. In this study, we used Δd = 4 (See Discussion). A total of 10,000 pairs of sequences were generated according to above rules, and were used to test and evaluate various algorithms.

Identifying a conserved module by comparison at the binding site level

As stated above, the objective of our methodology is to identify conserved regulatory modules within highly divergent sequences. The sequence pairs in our simulated data-set had little overall sequence similarity. Of the 10,000 pairs, 73.32% have no similarity detectable by BLAST analysis (E = 0.01, Blast2seq). This indicated that the conservation of binding sites in the hypothetical module and the conserved module was not sufficient to allow detection at the sequence level. Of those that did have a significant match, the output alignments were shorter than 30 bps, which was far shorter than the length of the inserted module.

M_score

To identify the conserved module at the binding site level, we first generated the potential TFBS profiles for each of the simulated sequence pairs. A matrix scoring method similar to that used in Match [17] was implemented (see Methods), which allowed us to score each sequence against the frequency TFBS matrices recorded in TRANSFAC 9.1 (M _score, Figure 2a). When a cut-off value of 0.75 (on M _score) was applied, on average there were about 3000 identified potential TFBSs in every thousand base pairs of simulated sequences. This is similar to what we observed when using genomic DNA sequences randomly extracted from model organism databases (data not shown).

To identify the hypothetical module embedded in the sequences, we tried several different algorithms that compared the binding site profiles of the short and the long sequences. Of those tested, a scoring method (using inner-products) based on a statistical evaluation of binding site matches after a Gaussian smoothing of the binding site profiles gave reliable and promising results.

P_score

The matrix score (M _score), by virtue of its definition (f.1), ranges from 0 to 1 for all TFBS matrices. Thus it does not differentiate short and relatively simple matrices that match DNA sequences with a high frequency from those long and stringent matrices that match DNA sequences only rarely. For example, the binding site for En (I$ED_06) has 7 positions, and on average there are 320 matches (with M _score > 0.75) on any 10,000 bp random sequence. In contrast, the binding site for Bel-1 (V$BEL1_B) has 13 positions, and the average number of matches with M _score > 0.75 in a 10,000 bp random sequence is 3. It is clear that a match involving the binding site V$BEL1_B is far more significant than a match with I$ED_06. To differentiate this, we introduced the p value of the M _score, which was estimated by calculating the fraction of randomly generated sequences that have scores equal to or higher than that M _score. We then calculated the P_score (see Methods) as the product of -log (p value of M _score > cut_off) and the M _score (Figure 2b).

G_score

To account for the variation in the distances between/among binding sites, we performed a Gaussian smoothing of the P_score (see Methods). Through empirical testing (data not shown), we found that a variance of σ² = 9 gave the best performance. We denote the Gaussian smoothed score profile as the G _score profile of the sequence (Figure 2d).

BLISS_score

G _score profiles were generated for both the short and the long sequences. To identify a maximum match at the binding site profile level, the short G _score profile was slid along the long G _score profile. At each position, the match between the short profile and its corresponding region of equal length (length of the window) in the long profile was evaluated using an inner-product as the BLISS _score (see Methods).

Note that since the "length of the window" appears in the denominator, the BLISS _score is independent of the length of the short profile (or the length of the window). Figure 3a shows the distribution of BLISS_scores as the shorter G _score profile was slid along the longer G _score profile. The peak of the BLISS _score indicates the maximum match. In this case, the abrupt surge in the BLISS _score is due to the match between the embedded hypothetical module in the short sequence and the conserved module in the long sequence. When this methodology was tested on all of the 10,000 simulated sequence pairs, about 80% of the highest peaks for each pair contained the correct match between the embedded hypothetical module and the conserved module.

Distribution and statistical evaluation of the BLISS_score

To be able to evaluate the match at the binding site profile level, we analyzed the distribution of BLISS_scores using the simulated sequence pairs. For each pair of sequences, BLISS_scores were calculated at each position as the short profile slid along the longer profile. The peak matches (corresponding to the peaks in the score profile) between each pair of sequences were evaluated to see whether it aligned the embedded modules. If the match did align the modules, it was designated a "true" match. All other BLISS_scores were considered as background.

Figure 3b shows the distribution of the background and the "true" match BLISS_scores for the 10,000 simulated pairs of sequences. This distribution varies slightly depending upon the cut_off threshold set for M _score (Figure 3b&c). This is not surprising, since a lower cut_off threshold will lead to more identified potential binding sites and thus a slightly higher background score.

The distributions allow us to evaluate any particular BLISS _score. It is informative in helping set a threshold for reporting significant matches at the binding site level. Given a BLISS_score x, the distributions allow us to decide whether that BLISS _score corresponds to a true alignment of modules or whether it corresponds to the module aligning with a random DNA segment. Let C1 denote the event where the modules embedded in the short and the long sequences are aligned, and C2 denote the event where either module is aligned with a random DNA segment. Based on Bayes formula, the posterior probabilities can be calculated as follows:

\begin{matrix} p (C 1 | x) = \frac{p (x | C 1) p (C 1)}{p (x | C 1) p (C 1) + p (x | C 2) p (C 2)} \\ p (C 2 | x) = \frac{p (x | C 2) p (C 2)}{p (x | C 1) p (C 1) + p (x | C 2) p (C 2)} \end{matrix}

Where p(C1|x) is the conditional probability of C1 given a BLISS_score x and p(C2|x) is the conditional probability of C2 given a BLISS_score x; p(C1) is the prior probability of C1 and p(C2) is the prior probability of C2; p(x|C1) is the conditional probability of observing x given C1 and p(x|C2) is the conditional probability of observing x given C2. p(x|C1) and p(x|C2) can be read off directly from the distributions generated.

It is difficult to find the means to calculate the prior probabilities p(C1) and p(C2). In this study, we assumed p(C1) = p(C2), although we suspect that p(C1) might be smaller than p(C2). This assumption allowed us to calculate the posterior probabilities to evaluate a BLISS_score x. In practice, we decided that x was a significant matching score if p(C2|x) was less than a threshold of, e.g. 0.01 or 0.001.

Identifying a conserved regulatory module in distantly related species

To test the efficacy of the BLISS methodology in real sequences, we undertook the task of identifying the Even-skipped (eve) stripe 2 enhancer (S2E) in distantly related Drosophila species. Even-skipped, an important development regulatory gene in Drosophila melanogaster (D.mela), is specifically expressed in seven transverse stripes in the embryo during the blastoderm stage. The stripe 2 enhancer is the best studied and includes TFBSs for five TFs, bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel (Kr), and sloppy-paired (slp) [18–20]. Unfortunately, TRANSFAC 9.1 has matrices for only three of the five TFs, i.e., Hb, Kr, and Bcd. Our search was therefore limited in the sense that some of the participating TFBSs could not be predicted and used for the match comparison.

Previous experimental investigations have shown that S2E is evolutionarily conserved among D. yakuba (D.yaku), D. erecta, and D. pseudoobscura (D.pseu) [2, 12, 21]. All of these species are in the same subgenus (Sophophora) as D.mela, with D.pseu having the most distant relationship with D.mela (separated at about 40 million years ago). BLISS did identify the eve S2E modules among these four species. In particular, a significant peak was reported by BLISS when we searched the S2E module extracted from D. pseu against the 14 kb D. mela genomic sequence flanking the eve coding region.

In contrast, no detailed information has been published about potential S2E in more distantly related species, such as D. mojavensis (D.moja) or D. virilis (D.viri), both from a separate subgenus (Drosophila) separated from D.mela at about 60 million years ago [22]. To identify S2E in these two distantly related species, we extracted the 14 kb genomic sequence flanking the eve coding region from D.moja and D.viri genomic sequences. Blast analysis using the D.mela or D.pseu sequence harbouring the eve S2E module did not identify any significant match longer than 41 bp (using bl2seq with default gap penalty values). Using the BLISS methodology however, a significant peak in the BLISS _score was observed (Figure 4a, p(C2|x) < 0.05). Verification of this match indicated that it contained the TFBSs composing the S2E module. Similar results were obtained when corresponding sequence pairs involving 1.) D.mela and D.moja, and 2.) D.mela and D.viri, were analyzed. In contrast, no significant match was identified in the reverse-complemented sequence (Figure 4b) or in other 14 kb sequences unrelated to eve, indicating the specificity of the search.

A detailed inspection of the make up of the S2E modules in distantly related species showed that S2E can be viewed as a complex module made of element modules. To make an analogy, S2E is a complex sentence that has several "clauses" (Figure 5). The evolution of the whole module indicates that the distances between some TFBSs have changed dramatically. However, the distances among the TFBSs within corresponding "clauses" have remained relatively stable. For example, in Clause 1 the distances among participating TFBSs have remained constant over the long evolutionary period. Specifically, the distance between the first bcd (overlapping with the first kr) and the second bcd is invariably 20 in all of the four species. In addition, the distances among TFBSs in Clause 3 have also remained relatively stable, i.e., within the variation we have factored into our simulation.

Since our methodology is really based on the assumption of limited distance variations between TFBSs, it should be much more sensitive at identifying individual "Clauses" or simple modules. When the corresponding TFBS profile covering Clause 1 or Clause 3 were used to search against the genome sequence from D.moja, very significant peaks in BLISS _score were observed (Figure 6a&b, p(C2|x) < 0.001 for both). The peaks corresponded to the match of Clause 1 and Clause 3 on the D.moja sequence, respectively. BLAST analysis using the sequences covering Clause 1 or Clause 3 searched against the D.moja genomic sequence failed to identify significant matches that spanned the whole module. rVista 2.0 did predict Clause 1 because it succeeded in detecting the DNA similarity between the sequence covering Clause 1 and the D.moja sequence. However, rVista 2.0 failed to identify Clause 3 since no similarity was detected between the sequence covering Clause 3 and the D.moja genomic sequence.

Implementation of BLISS as a web-based service

The BLISS methodology has been implemented as a web-based tool for the research community. The web application embodies the Gaussian Smoothing Method for identifying cis-regulatory modules at the binding site level, and outputs all potential TFBSs in the predicted module. The module finding process consists of several steps:

To begin, the user inputs two DNA sequences. For example, a short sequence from a model organism that harbours a regulatory module, and a longer sequence surrounding the ortholog of a different species. An M _score threshold of 0.75 or 0.8 is then chosen by the user for the generation of the TFBS profiles for both sequences. Next, a plot of BLISS_scores comparing successive alignments of the short profile against the long profile is returned to the user. On the very same page, the distributions described earlier (Figure 3b&c) are displayed so that the user may choose a BLISS _score threshold. Once the BLISS _score threshold is chosen, BLISS outputs all of the matches with a BLISS _score higher than that threshold (limited up to 5). For each match, a table of contributing TFBSs are listed based on the product of the p-values of the matching TFBSs on both sequences (Figure 7). Alternatively, it can be listed based on their numeric contribution to the BLISS _score, or by the location of the TFBSs.

Currently, the limits for the short and long input sequences are set at 1200 bps and 15 k bps, respectively.

Discussion

In this study, we have presented a first step towards identifying regulatory modules via comparisons at the binding site level. The advantage of such an approach is that it allows the detection of conserved regulatory modules in highly divergent sequences, as we have demonstrated both with simulated sequences as well as with real world examples. This method is thus complementary to many existing methods that are based on sequence similarity comparisons [23] or use sequence similarity for pre-analysis selection [4, 5, 7, 24]. It should also be complementary to applications such as MEME and CompareProspector, which are widely used for the identification of conserved sequence motifs (binding sites) in the regulatory region of co-expressed genes [25, 26].

There are limitations to our approach. Some of the major limitations, such as the coverage and quality of the TFBS matrices, are expected to improve rapidly in the near future as new high throughput techniques are applied to identify binding sites in genome scale. Our current algorithm is developed based on the assumption that the inter-TFBS distance variation is within a +/-4 base pair range. This allows the identification of modules/clauses with relatively small inter-TFBS distance variation, such as the individual clauses in the S2E module. It will likely miss modules/clauses that have much larger distance variations between TFBSs. In the case of S2E, the identification of the module was based on the fact that the third clause had low inter-TFBS distance variations, which was sufficient to generate a significant BLISS _score (figure 4a). As indicated by Ludwig et al, if S2E as a whole were to be considered, many inter-TFBS distances have changed dramatically during evolution [12]. However, a closer look at the distribution of TFBSs in S2E in the two distantly related species also indicated that the S2E module may be sub-divided into clauses (Figure 5). While the inter-clause distances have varied dramatically, the inter-TFBS distances within each clause have remained largely stable (Figure 5). This is very possibly a reflection of the spacing restriction on important transcription factor interactions.

In addition to the S2E module, we also tested our methodology on other regulatory modules such as the DME (Distal Muscle Enhancer) module in front of the paramyosin gene [27]. Using a 200 bp sequence harbouring the DME in D.mela, we were able to detect the corresponding module in D.viri (data not shown). Currently, the number of well characterized, evolutionarily conserved modules is limited. The goal of BLISS is to facilitate the discovery of multiple TFBSs modules by identifying the conserved pattern of the TFBSs. We also applied BLISS to a regulatory region that is responsible for mediating UV induced expression of hac-1 [28]. There is no existing information on the composition of the UV-responsive module in this region, which has very low sequence level conservation between the corresponding segments in D.mela and D.pseu. Yet genetic experiments have indicated that the responsiveness is highly conserved. The potential module identified by BLISS is currently being tested experimentally.

BLISS, with some adaptation, can potentially be used to identify the conserved regulatory modules in co-expressed genes. Another advantage of BLISS is that the methodology can also be applied to identify patterns that involve not only TFBSs, but also other sequence features such as complex response elements [29], insulator sequences, CpG islands, etc. Functionally, these sequence features (their related modifications and binding partners) interact with transcription factors. However, these features, such as CpG islands, cannot be detected by simple sequence similarity based searches.

Conclusion

In this study, we addressed the feasibility of identifying conserved regulatory modules at the binding site level. Our results indicate that it is feasible to identify conserved regulatory modules in simulated random sequences harbouring a regulatory module made of 4–8 distinct binding sites. Using real sequences, we demonstrated that our approach outperforms regular sequence level comparisons when the orthologous DNA sequences are highly diverged. In addition, the BLISS program outputs directly the candidate binding sites that are shared between the two regulatory sequences, which can greatly facilitate the evaluation of the candidate module as well as the design of the experimental verification strategy by biomedical scientists. Future development of the project will include identifying better algorithms for complex modules and modules with higher inter-TFBS distance variations.

Methods

Generating simulated sequences

10000 simulated sequence pairs were generated for developing the methodology. Each set included a short DNA sequence (100–500 bps) harboring a hypothetical TFBS module and a long DNA sequence (5–6 kb) harboring a conserved TFBS module. First, the hypothetical TFBS module was generated in the following manner: 4–8 binding sites were randomly chosen from TRANSFAC 9.1 [15] database and then random DNA subsequences were inserted between them. Qiu et al. [16] analyzed all the entries of composite elements in the TransCompel database (version 3.0) and they found that about 87% of the composite elements are within 50 bp distance and about 65% are within 20 bp distance of one another. We therefore chose lengths of DNA subsequences inserted between binding sites based on this result. Next, we created the conserved TFBS module, which included binding sites for the same sets of transcription factors in the same order as in the shorter sequence. However, the binding site sequences had to be different and they were randomly chosen from TRANSFAC 9.1. Furthermore, compared to the hypothetical module, distances between binding sites were set to vary slightly and we allowed each binding site to shift up to 4 bps either to the right or to the left. Finally, we inserted the conserved module into a 5000 bp randomly generated DNA sequence to generate the longer sequence.

Binding site identification

Potential TFBSs both in the short DNA sequence (including the hypothetical module) and the long DNA sequence (including the conserved module) were searched based on frequency matrices collected by TRANSFAC 9.1. Because TFBSs may be detectable on either the forward or the backward strand, we searched both strands of sequences. The M _score profile for each sequence is a M*L matrix, where M is twice of the number of matrices applied and L is the length of the sequence. The top half of the M _score matrix is the score profile for the forward strand and the bottom half is that for the complementary strand. The M _score of the i th TFBS at position j of the sequence was calculated by first aligning the frequency matrix for the i th TFBS with the sequence at position j and then computing:

f.1. M_score:

\begin{matrix} M_s c o r e [i, j] = \frac{S c o r e [i, j] - S c o r e_{M i n} [i, j]}{S c o r e_{M a x} [i, j] - S c o r e_{M i n} [i, j]} \\ S c o r e [i, j] = \sum_{k = 0}^{K - 1} I (k) f_{k, n_{j + k}} \\ S c o r e_{M i n} [i] = \sum_{k = 0}^{K - 1} I (k) f_{k}^{\min} \\ S c o r e_{M a x} [i] = \sum_{k = 0}^{K - 1} I (k) f_{k}^{\max} \\ I (k) = \sum_{N \in {A, T, C, G}} f_{k, N} \ln (4 f_{k, N}) \end{matrix}

K is the length of the TFBS. n_j+k∈ {A,T,C,G} is the nucleotide occurring in the sequence at position j+k. $f_{k, n_{j + k}}$ is the frequency of nucleotide n_j+kat position k in the frequency matrix (of the i th TF). $f_{k}^{\min}$ is the lowest frequency and $f_{k}^{\max}$ is highest frequency across all nucleotides at position k in the frequency matrix (of the i th TF). I(k) is the information vector for the frequency matrix, which reflects the degree of conservation at position k of the matrix. Finally, M_score [i,j] is the normalized Score [i,j]. Stormo [30–32] observed that logarithms of the base frequencies ought to be proportional to the binding energy and the information vector reflects this average binding energy between the transcription factor and the binding site.

A score cutoff at 0.75 was applied to the M _score profiles of both the short and the long sequence as follows:

M _score [i, j] = M _score [i, j] if M _score [i, j] ≥ 0.75

M _score [i, j] = 0 if M _score [i, j] < 0.75

p value for M_score

To calculate the p value, a background model is required. Here, we chose the background model to be a random DNA sequence where each position is drawn independently. The ratio among A, T, C, and G is 30% : 30% : 20 % : 20%. For each frequency matrix, 300 million subsequences were sampled from the background model, and the M _score of each subsequence was calculated to build the M _score distribution. The p value of a M _score for each TFBS was estimated by calculating the fraction of samples that had scores equal to or higher than that M _score. Then, the P_score profiles were calculated as follows:

f.2 P_score:

P _score[i, j] = C_i* M _score[i, j]

C_iis the negative natural logarithm of the p value of M _score > = 0.75 for the i th TFBS.

Gaussian smoothing

To account for the change in the distances between/among binding sites, a Gaussian smoothing was applied to the P_score profiles with a variance of 9. Formally, the G _score profiles were calculated as follows:

f.3 G_score:

G_s c o r e [i, j] = \sum_{k = - 7}^{7} P_s c o r e [i, j + k] * \frac{\frac{1}{σ \sqrt{2 π}} e^{- k^{2} / 2 σ^{2}}}{\sum_{i = - 7}^{7} \frac{1}{σ \sqrt{2 π}} e^{- i^{2} / 2 σ^{2}}}

where σ = 3 and k ranges from -7 to 7. In effect, a P_score can spread 7 positions to both the right and the left due to the Gaussian smoothing. Smoothed P_scores beyond 7 positions were ignored due to their small values.

Searching the conserved module in the long sequence

To identify a maximum match at the binding site profile level, the short G _score profile was slid along the long G _score profile. BLISS _score at position n is the matching score between the short profile and its corresponding region of equal length (length of the short sequence) in the long profile at position n:

f.4 BLISS_score:

\begin{array}{l} B L I S S_s c o r e [n] = \\ (\sum_{i = 0}^{M - 1} \sum_{j = 0}^{L - 1} G_s c o r e [i, j] * G_s c o r e 2 [i, j + n]) / L e n g t h O f S h o r t S e q u e n c e \end{array}

where G _score1 is the G _score profile for the short sequence and G _score2 is the G _score profile for the long sequence; L is the length of the short sequence; n is the current location where the short sequence is aligned to the long sequence.

Large scale search of the simulated sequences, statistical analysis

We used 10000 simulated sequence pairs generated by the above method to calculate two BLISS _score distributions. The first is the BLISS _score distribution when the hypothetical module in the short sequence is aligned with the conserved module in the long sequence. The second is the BLISS _score distribution when the module is aligned with a non-module segment of the longer sequence. For each pair of sequences, BLISS_scores were calculated at each position as the short profile slid along the longer profile. The peak matches (corresponding to the peaks in the score profile) between each pair of sequences were evaluated to see whether it aligned the embedded modules. If the match did include the alignment of the modules, it was designated a "true" match, and this BLISS _score was used to calculate the distribution for the modules matching. All of the other BLISS_scores were used to calculate the distribution for the module matching with the background sequence.

Searching for the eve2 module in D. virilis and D. mojavanis sequences

The GenBank [33] accession numbers for the S2E sequences are AF042712(D. pseudoobscura) and AF042709(D. melanogaster). We used BLISS to search these two enhancers in 13 kb D. virilis and 14 kb D. mojavanis sequences, in which S2E is hypothesized to be located, but the specific location is unknown.

Ludwig et al. indicated that distances between TFBSs in two clauses in S2E (region 134–275 and region 484–684 for D. melanogaster, region 196–376 and region 692–866 for D. pseudoobscura) were substantially conserved. We removed those regions and searched for the modules in 13 kb D. virilis and 14 kb D. mojavanis sequences using BLISS.

Website construction

BLISS was implemented using HTML/JSP/JavaBean and is supported by an Apache Tomcat 5.5 server. It is publicly available at: http://gene1.ufscc.ufl.edu:8080/blissWeb/index.html. The M _score profiles of TFBSs were calculated based on the frequency matrix library collected by TRANSFAC 9.1. BLISS used DISLIN [34], a plotting library for displaying data, to draw the match score plot in run time.

References

Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW: Gene expression profiles in normal and cancer cells. Science 1997, 276: 1268–1272. 10.1126/science.276.5316.1268
Article CAS PubMed Google Scholar
Ludwig MZ, Patel NH, Kreitman M: Functional analysis of eve stripe 2 enhancer evolution in Drosophila: rules governing conservation and change. Development 1998, 125: 949–958.
CAS PubMed Google Scholar
Sharan R, Ovcharenko I, Ben-Hur A, Karp RM: CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 2003, 19 Suppl 1: i283–91. 10.1093/bioinformatics/btg1039
Article PubMed Google Scholar
Sandelin A, Wasserman WW, Lenhard B: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res 2004, 32: W249–52. 10.1093/nar/gkh012
Article PubMed Central CAS PubMed Google Scholar
Loots GG, Ovcharenko I: rVISTA 2.0: evolutionary analysis of transcription factor binding sites. Nucleic Acids Res 2004, 32: W217–21. 10.1093/nar/gkh095
Article PubMed Central CAS PubMed Google Scholar
Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5: 170. 10.1186/1471-2105-5-170
Article PubMed Central PubMed Google Scholar
Aerts S, Van Loo P, Thijs G, Mayer H, de Martin R, Moreau Y, De Moor B: TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis. Nucleic Acids Res 2005, 33: W393–6. 10.1093/nar/gki354
Article PubMed Central CAS PubMed Google Scholar
Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W: MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res 2003, 31: 3518–3524. 10.1093/nar/gkg579
Article PubMed Central CAS PubMed Google Scholar
Cora D, Di Cunto F, Provero P, Silengo L, Caselle M: Computational identification of transcription factor binding sites by functional analysis of sets of genes sharing overrepresented upstream motifs. BMC Bioinformatics 2004, 5: 57. 10.1186/1471-2105-5-57
Article PubMed Central PubMed Google Scholar
Venkatesh B, Yap WH: Comparative genomics using fugu: a tool for the identification of conserved vertebrate cis-regulatory elements. Bioessays 2005, 27: 100–107. 10.1002/bies.20134
Article CAS PubMed Google Scholar
Santini S, Boore JL, Meyer A: Evolutionary conservation of regulatory elements in vertebrate Hox gene clusters. Genome Res 2003, 13: 1111–1122. 10.1101/gr.700503
Article PubMed Central CAS PubMed Google Scholar
Ludwig MZ, Palsson A, Alekseeva E, Bergman CM, Nathan J, Kreitman M: Functional evolution of a cis-regulatory module. PLoS Biol 2005, 3: e93. 10.1371/journal.pbio.0030093
Article PubMed Central PubMed Google Scholar
Yan Y, Chen H, Costa M: Chromatin Immunoprecipitation Assays. In Epigenetics Protocols. Edited by: Tollefsbol TO. Totowa, NJ, Humana Press; 2004.
Google Scholar
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298: 799–804. 10.1126/science.1075090
Article CAS PubMed Google Scholar
Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31: 374–378. 10.1093/nar/gkg108
Article PubMed Central CAS PubMed Google Scholar
Qiu P, Ding W, Jiang Y, Greene JR, Wang L: Computational analysis of composite regulatory elements. Mamm Genome 2002, 13: 327–332. 10.1007/s00335-001-2141-8
Article PubMed Google Scholar
Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003, 31: 3576–3579. 10.1093/nar/gkg585
Article PubMed Central CAS PubMed Google Scholar
Small S, Blair A, Levine M: Regulation of even-skipped stripe 2 in the Drosophila embryo. Embo J 1992, 11: 4047–4057.
PubMed Central CAS PubMed Google Scholar
Arnosti DN, Barolo S, Levine M, Small S: The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development 1996, 122: 205–214.
CAS PubMed Google Scholar
Andrioli LP, Vasisht V, Theodosopoulou E, Oberstein A, Small S: Anterior repression of a Drosophila stripe enhancer requires three position-specific mechanisms. Development 2002, 129: 4931–4940.
CAS PubMed Google Scholar
Ludwig MZ: Functional evolution of noncoding DNA. Curr Opin Genet Dev 2002, 12: 634–639. 10.1016/S0959-437X(02)00355-6
Article CAS PubMed Google Scholar
Powell JR: Progress and prospects in evolutionary biology: The Drosophila model. Oxford, Oxford University Press; 1997.
Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
Article CAS PubMed Google Scholar
Sharan R, Ben-Hur A, Loots GG, Ovcharenko I: CREME: Cis-Regulatory Module Explorer for the human genome. Nucleic Acids Res 2004, 32: W253–6.
Article PubMed Central CAS PubMed Google Scholar
Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res 2004, 14: 451–458. 10.1101/gr.1327604
Article PubMed Central CAS PubMed Google Scholar
Liu Y, Wei L, Batzoglou S, Brutlag DL, Liu JS, Liu XS: A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Res 2004, 32: W204–7.
Article PubMed Central CAS PubMed Google Scholar
Marco-Ferreres R, Vivar J, Arredondo JJ, Portillo F, Cervera M: Co-operation between enhancers modulates quantitative expression from the Drosophila Paramyosin/miniparamyosin gene in different muscle types. Mech Dev 2005, 122: 681–694. 10.1016/j.mod.2004.12.005
Article CAS PubMed Google Scholar
Zhou L, Steller H: Distinct pathways mediate UV-induced apoptosis in Drosophila embryos. Dev Cell 2003, 4: 599–605. 10.1016/S1534-5807(03)00085-6
Article CAS PubMed Google Scholar
Ringrose L, Rehmsmeier M, Dura JM, Paro R: Genome-wide prediction of Polycomb/Trithorax response elements in Drosophila melanogaster. Dev Cell 2003, 5: 759–771. 10.1016/S1534-5807(03)00337-X
Article CAS PubMed Google Scholar
Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16: 16–23. 10.1093/bioinformatics/16.1.16
Article CAS PubMed Google Scholar
Stormo GD: Information content and free energy in DNA--protein interactions. J Theor Biol 1998, 195: 135–137. 10.1006/jtbi.1998.0785
Article CAS PubMed Google Scholar
Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 1998, 23: 109–113. 10.1016/S0968-0004(98)01187-6
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors are grateful for comments on the manuscript from Dr. Alberto Riva. This work was supported in part by NIH CA095542 to LZ.

Author information

Authors and Affiliations

Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, FL, 32611, USA
Hailong Meng & Arunava Banerjee
Department of Molecular Genetics and Microbiology, UF Shands Cancer Center, College of Medicine, University of Florida, Gainesville, FL, 32610, USA
Hailong Meng & Lei Zhou

Authors

Hailong Meng
View author publications
You can also search for this author in PubMed Google Scholar
Arunava Banerjee
View author publications
You can also search for this author in PubMed Google Scholar
Lei Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Lei Zhou.

Additional information

Authors' contributions

H. Meng is the principal undertaker of this project. A. Banerjee provided guidance to HM in development of the algorithms and worked on the manuscript. L. Zhou initiated the project and provided general guidance to HM. HM and LZ drafted the manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Meng, H., Banerjee, A. & Zhou, L. BLISS: biding site level identification of shared signal-modules in DNA regulatory sequences. BMC Bioinformatics 7, 287 (2006). https://doi.org/10.1186/1471-2105-7-287

Download citation

Received: 08 February 2006
Accepted: 07 June 2006
Published: 07 June 2006
DOI: https://doi.org/10.1186/1471-2105-7-287

BLISS: biding site level identification of shared signal-modules in DNA regulatory sequences

Abstract

Background

Results

Conclusion

Background

Results

Simulating sequence pairs harbouring a conserved module of binding sites

Identifying a conserved module by comparison at the binding site level

M_score

P_score

G_score

BLISS_score

Distribution and statistical evaluation of the BLISS_score

Identifying a conserved regulatory module in distantly related species

Implementation of BLISS as a web-based service

Discussion

Conclusion

Methods

Generating simulated sequences

Binding site identification

p value for M_score

Gaussian smoothing

Searching the conserved module in the long sequence

Large scale search of the simulated sequences, statistical analysis

Searching for the eve2 module in D. virilis and D. mojavanis sequences

Website construction

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us