- Methodology article
- Open Access
A specialized learner for inferring structured cis-regulatory modules
BMC Bioinformatics volume 7, Article number: 528 (2006)
The process of transcription is controlled by systems of transcription factors, which bind to specific patterns of binding sites in the transcriptional control regions of genes, called cis-regulatory modules (CRMs). We present an expressive and easily comprehensible CRM representation which is capable of capturing several aspects of a CRM's structure and distinguishing between DNA sequences which do or do not contain it. We also present a learning algorithm tailored for this domain, and a novel method to avoid overfitting by controlling the expressivity of the model.
We are able to find statistically significant CRMs more often then a current state-of-the-art approach on the same data sets. We also show experimentally that each aspect of our expressive CRM model space makes a positive contribution to the learned models on yeast and fly data.
Structural aspects are an important part of CRMs, both in terms of interpreting them biologically and learning them accurately. Source code for our algorithm is available at: http://www.cs.wisc.edu/~noto/crm
Eukaryotic transcription is controlled by multiple factors, which may need to bind to DNA in a specific arrangement in a gene's transcriptional control region. This type of regulation system is called a cis-regulatory module (CRM). The DNA motifs (specific patterns of nucleotides) to which these factors bind are often unknown, and may appear anywhere in a large region in the neighborhood of a gene. This region typically extends several thousand base pairs upstream of the transcription start site, and may also include DNA between the transcription start site and the start codon, and within introns of the transcribed gene. It is often the case that a set of genes are transcribed or expressed together under certain conditions, but the mechanisms underlying this co-expression are unknown. We would like a method that can aid in verifying that these genes are indeed transcribed by a common mechanism, and, more importantly, to explain this fact by finding the CRM which promotes transcription. The task we consider here is to learn such a CRM model from data. Specifically, given a set of sequences which are thought to contain a common cis-regulatory module and a set of sequences which are assumed not to, we wish to produce a description in terms of DNA binding sites which is true of the former set but not of the latter.
Several previous methods have characterized CRMs as a probabilistic over-representation of certain motifs within a window of predetermined size [1–4]. However, the models used are unable to represent very much about the physical arrangement among relevant binding sites. Given that a module consists of binding sites corresponding to multiple interacting transcription factors, we hypothesize that the relative locations of binding sites in real CRMs are an important consideration that the aforementioned models are unable to adequately represent. Consider the example in Figure 1. Suppose that a DNA sequence will be transcribed if, and only if, it contains transcription factor binding site motif 1 followed closely by either motif 2 or motif 3 near the start of transcription on the sense (template) strand. A CRM model that represents this situation must distinguish between DNA sequences that have it (a, b, and c) and those that do not because the motifs are out of order (sequence d), too far apart (sequence e), too far upstream of the start of transcription (sequence f), or bind to the wrong DNA strand (sequence g).
A few of these relationships can be represented by other previous methods. Sinha, et al.  point out that specific motifs may have strand preference, and their models do include a preference for a motif to directly follow another, but they do not describe arbitrary ordering constraints (e.g. a CRM includes all of motifs A, B, and C, and motif A must be most upstream), nor do they include strand preference into the CRM model itself. The models of Keleş et al.  are able to represent logical relationships (as in the example in Figure 1, motif 1 must be followed by motif 2 or motif 3), but they do not represent the ordering or proximity of the motifs. The models of Segal, et al., Aerts, et al., and Zhou, et al. model proximity between two motifs by whether or not they occur within a fixed-size window, but the motifs are otherwise independent and can occur anywhere in a transcription control region. The approach of Beer and Tavazoie  does capture motif orientation, and the relative order and distance between pairs of motifs. However, we argue that our models are more comprehensible than the probabilistic models of Beer and Tavazoie and that the increase in feature space that goes along with such a variety of relationships between possible binding sites requires that a learner take special steps to avoid overfitting.
We present a model representation which is able to describe logical relationships between binding sites, explicit upper-bounds on the distance between binding sites and between the CRM and the start of transcription (which may be known or estimated), the relative order (upstream or downstream) between any pair of binding sites, the DNA strand on which a binding site must appear, and a set of motifs which must not appear (e.g. the binding sites of repressors) in the CRM.
In order to make learning possible in such an expressive model space, we have developed a specialized learner which has two important distinctions: First, the search process is specifically tailored for the context of cis-regulatory modules. Second, although the expressivity capable of capturing all these physical aspects of a CRM is a major strength of our approach, only a few of these aspects may actually be needed to describe a given CRM. Therefore, there is a risk of overfitting due to this high-variance model space. For this reason, we have developed an expressivity selection method in which each aspect of the model space must be statistically justified by the data.
Results and Discussion
Figure 2 illustrates an example of our CRM representation. We divide the expressivity of this representation into six distinct parts, which we will refer to as structural aspects:
Multiple binding sites (Fig. 2a). This is the basic structure of our representation.
A multiplicity of motifs per binding site (Fig. 2b). For each binding site, there is a set of motifs sufficient to represent it (i.e. connected with the logical or operator). We allow a binding site to be associated with multiple motifs because it may be the case that multiple transcription factors (with different binding motifs) may play the same role in a CRM, or a single transcription factor may have multiple or varying binding motifs.
Distance constraints (Fig. 2c,d). These specify a maximum distance (in base pairs) between the motifs that satisfy any two binding sites, or between the CRM and the transcription start site.
Order constraints (Fig. 2e). These specify that the motif that satisfies a particular binding site must be upstream of another binding site.
Repressor sites (Fig. 2g). These specify that a particular motif (or any member of a disjunction of motifs) must not appear in the CRM, and its effective location can be constrained to be between a particular pair of binding sites, upstream of the CRM, or between the CRM and the transcription start site (in the case in Fig. 2, a single motif that must not appear upstream of the CRM).
Learning a Model
Our learning algorithm learns a CRM model from positive and negative example sequences, a set of potential binding site motifs, and an evaluation function.
Positive examples are those believed to contain a shared CRM (i.e. a set of particular binding sites and structural aspects). These may be, for instance, the set of promoter sequences from a set of genes which are co-expressed under certain conditions and suspected to be co-regulated.
Negative examples are those believed not to contain the target CRM, although they certainly may contain other arrangements of motifs. The purpose of these sequences is to make the learned CRM model discriminative-so that it captures something specific to the given set of positive examples instead of something that is trivially or generally true of promoter sequences. The regulation of these negative examples may be related to the positive set in some interesting way (e.g. they are co-expressed under some other conditions), or they may simply be promoter sequences believed not to be regulated along with the positive examples.
The set of potential binding sites is specified by indicating the location of each occurrence in the positive and negative example sequences. These potential binding sites may come from a set of known or postulated transcription factor binding sites (e.g. from a database) or they may come from a standard motif-finding algorithm, such as MEME .
Pseudocode illustrating our learning algorithm is given in Table 1.
Given sets of positive and negative DNA sequences, a set of potential binding site motifs, and an evaluation function, our algorithm searches through the space of possible models in an attempt to optimize the score given by the evaluation function. The ideal model would be satisfied by all the positive examples and none of the negative examples, so the evaluation function should be some measure of how well a given CRM model distinguishes between the positive and negative examples.
The search process is a best-first beam search  that starts with the null solution (an unconstrained model with zero binding sites) and searches in phases, modifying the best available solution and keeping only a queue of the best K models. Once the queue becomes empty, the best K solutions found are carried over to the initial queue for the next phase. In each phase, we apply a subset of the following operators:
A new binding site is added.
For a given binding site, a new motif is added.
The distance from the CRM to the transcription start site is constrained (to the best distance smaller than the current distance, according to the data and the scoring metric).
For a given pair of binding sites, the distance between them is constrained.
For a given pair of binding sites, their relative order is constrained.
For a given binding site, a strand constraint is imposed.
A repressor motif is added between the CRM and the transcription start site
A repressor motif is added upstream of the CRM
A repressor motif is added between a pair of binding sites.
There are user-defined limits on the maximum number of binding sites, motifs that can represent a binding site, and repressor motifs in a set.
Many of these slight changes to a solution will not affect its score (e.g. if a motif that does not appear in any sequence were added to the list of motifs for a particular binding site, the model would match exactly the same sequences). For this reason, we insist on some statistical difference between the set of sequences predicted by any of these changes. We use a χ2 test to decide whether we can reject the null hypothesis that two sets of sequence predictions by two different models come from the same distribution. It is not necessary to insist on near certainty when selecting the test's level of confidence; we mean only to avoid filling up our queue with multiple copies of essentially the same solution. If the test indicates that they come from different distributions, we add the new solution to the queue. Otherwise, we discard it.
In our experiments, we use two phases for the TRAIN procedure, making most of the above changes during the first phase, but adding repressor motifs in the second phase (because we argue these repressors can only be correctly added within the context of a CRM structure which has already been developed).
Controlling the Expressivity of a Model
Since the model space is expressive enough to represent many aspects of a CRM, we must address the potential for overfitting. We first identify the CRM model space appropriate for the data, and then search through this space for the correct CRM. To do this, we hold aside a tuning set of training sequences and select our expressivity by comparing the results of training a model first including, then leaving out an entire aspect of our CRM model design. We keep the more expressive model space if and only if the results with the aspect in question show both an improvement and a statistically significant difference. That is, we use an aspect of CRM expressivity if and only if doing so is statistically justified by the data. This way, we select only the expressiveness required by a specific CRM, and we can then retrain the model by searching through the appropriate model space. Pseudocode illustrating this procedure is shown in Table 2. Note that Select-Train is the main procedure which calls the entire Train procedure as a subroutine.
Since the inclusion of one model aspect may depend on another (e.g. distance constraints are only effective once the affected binding sites are identified), we do backward selection instead of forward. That is, we start with the full set of CRM structural aspects, and then remove some as is appropriate as opposed to starting with an empty set and adding to it. The list of model space restrictions is:
Reduce the maximum number of binding sites by one.
Reduce the maximum number of motifs (disjuncts) per binding site by one.
Disallow distance constraints.
Disallow order constraints.
Disallow strand constraints.
Reduce the maximum number of motifs in a set of repressor motifs by one.
Unless leaving the aspect in the model space produces a statistically significant improvement as determined by a χ2 test, we remove it. If more than one restriction in the list above is being considered, we make the restriction that gives us the best tuning set score. That is, we only make one model space restriction at a time. This process is repeated on the more restricted model space until no more restrictions should be considered (all structural aspects are statistically justified). This approach is similar to backward feature selection [9, 10]. However, we are not deciding on whether or not to include specific features (e.g. what is the distance between motif A and motif B in each DNA sequence), but rather we are deciding on whether or not to include entire aspects.
We test our approach on several data sets, summarized in Table 3. Three of these data sets have been used in previous studies of computational CRM finding. the Gasch et al. data set, however, is novel. In each case, we obtain upstream/promoter sequences from the University of California Santa Cruz Genome Browser  and perform cross-validation to evaluate our algorithms. We obtain a set of candidate motifs from running MEME  on the positive examples (not including any test sequences held-aside for evaluation) and from running MEME on on upstream/promoter regions randomly sampled from the appropriate organism. For the fly data set, we also evaluate our approach when it is provided with a set of known motifs [3, 12].
These motifs are described by position weight matrices (PWMs). We compare the likelihood of each PWM generating a subsequence in our data sets to the likelihood of the sequence being generated by a 5th-order Markov chain which is trained on the promoter regions of an entire genome. We consider a motif to be present if the ratio exceeds a threshold.
For our algorithm's scoring metric, we wish to measure how well the model predicts all, and only, the sequences which contain the target CRM. Precision is the frequency with which positive predictions are true positives, not false positives: . Recall is the frequency with which the correct sequences are predicted as positive and are not false negatives: . We use F1 as our scoring metric, which is the harmonic average of precision and recall: .
We set the maximum number of binding sites to three, the maximum number of motifs per binding site to three and the maximum number of repressor motifs in a set to one. We evaluate our model by using cross-fold-validation: We hold aside some data, train on the remainder, and then evaluate our trained models' predictions on the held-aside data. We predict that the held-aside sequence is a positive example if and only if it contains our hypothesized CRM. This process is repeated with different examples held aside for evaluation, and the results from each fold are summed together.
For each data set, we calculate an F1 score, (the same statistic as our algorithm's scoring metric), and use Fisher's exact test  to calculate a p-value. If our positive predictions (true positives plus false positives) were made simply by randomly sampling without replacement from the data set, this p-value would be the probability of an F1 score of or higher. If this p-value is sufficiently low (less than 0.01, following Segal and Sharan ), we consider our CRM for this data set to be significant.
Our results are shown in Table 4. We find a significant CRM in 17 of the 25 Lee et al. data sets (Table 3). In their similar experiments, Segal et al. found significant CRMs in only 12 of the 25 data sets (Note that the p-value calculations of Segal et al. are not identical to ours; as their CRM model makes probabilistic predictions, they are able to calculate a p-value using a classification margin ). Many of the motifs included in our CRM models correspond to known binding site motifs for the proteins thought to bind to the promoter regions in these data sets. However, we do not focus on recovered motifs, because our approach does not define these (they are found by MEME), it only selects them from a set of candidates. We find significant CRMs in the three Gasch et al. data sets, which suggests that our method can be used to find novel CRMs corresponding to genes clustered by expression analysis.
We find a significant CRM in the Sinha et al. fly data set as well. For this data set, using motifs suggested by MEME, we find three true positives and three false positives. Although this result is statistically significant, we hypothesize that the reason we are unable to recover more of the positive examples is because the training set size is too small for MEME to find good candidate motifs. To test this, we use the PWMs from Rajewsky, et al. and Sinha, et al. [3, 12] and locate positions where these motifs are most likely to occur. In this case, we recover seven of eight positive examples. Note that we do not compare our results to those of Sinha et al. because we use this data set to evaluate predictive accuracy on held-aside data, whereas they do not.
We wish to determine whether or not the inclusion of structural aspects increases the accuracy of our models. We do this by comparing the results of our approach to those obtained when we limit the set of aspects given to the Train function. We do this in two ways: First, we measure this accuracy by the F1 score of our predictions on held-aside data, and compare these scores to those obtained by a restricted version of our algorithm, for which the only aspect given to the Train function is multiple binding sites. This experiment is designed to compare against the model space of several previous methods in which a CRM model is characterized simply by the presence of a set of motifs anywhere in an input sequence. We refer to this as the "bag-of-motifs" approach. Second, we compare the F1 scores of our approach to those of running our algorithm with a single structural aspect left out of the set given to the Train function, for each aspect/dataset pair. This is designed to determine whether each structural aspect by itself makes a positive contribution to the learned models. We refer to these experiments as "lesion tests."
These comparisons are illustrated in Fig. 3. Note that sometimes the inclusion of a structural aspect can lead to overfitting (a point slightly above the diagonal line), but often it is essential (a point well below the line). Indeed, considering all data sets, the F1 score is more often higher with all aspects included than it is when any single structural aspect is removed. On the 25 yeast data sets from Lee et al. (Table 3a), the bag-of-motifs approach is often about as accurate as our approach. One exception is shown in Fig. 4. Here, our algorithm discovers that the order of binding sites is important. Compare the test set F1 score of our approach (0.500) to that of the bag-of-motifs approach (0.205). On the other data sets, our approach scores much higher than the bag-of-motifs approach. For instance, Figure 5 shows the hypothesis CRM model for the data set, rESR_RPcluster, which involves distance and strand constraints. The bag-of-motifs hypothesis (not shown) also includes two copies of the same motif, but without structural constraints, the model accepts eight additional true positives, and 265 additional false positives. Using our approach on the Sinha et al.-Fly data set, we find three true positives and three false positives (compared to two true positives and 34 false positives using the bag-of-motifs approach). Using PWMs from the literature, we recover seven of eight positive examples, with 10 false negatives (compared to six true positives and 16 false negatives using the bag-of-motifs approach).
One of the primary steps in gene regulation is transcription, and the ability to learn CRMs directly from data will be a crucial part of understanding how transcription is controlled. Our experiments, as well as those of Beer and Tavazoie  indicate that transcription is controlled not only by the presence of binding sites, but also by relationships between their locations. Our models represent a step forward in this area because these aspects are represented in a model which is easy to inspect and understand, and our results show that each of them contributes to the identification of significant CRMs in real biological data. With this increase in expressiveness, there is inevitably a risk of overfitting. We use data to identify the appropriate CRM aspects during the process of training our models. We believe that our novel approach of model space selection is an important and necessary step to facilitate the move toward more expressive models.
Availability and Requirements
The source code for our algorithm is available at: http://www.cs.wisc.edu/~noto/crm.
Aerts S, Loo PV, Thijs G, Moreau Y, Moor BD: Computational Detection of cis-regulatory modules. Bioinformatics 2003, 19(2):5–14.
Segal E, Sharan R: A Discriminative Model for Identifying Spatial cis-Regulatory Modules. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB). San Diego, California, USA: ACM Press; 2004:141–149.
Sinha S, van Nimegen E, Siggia ED: A Probabilistic Method to Detect Regulatory Modules. Bioinformatics 2003, 19: 292–301.
Zhou Q, Wong WH: CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proceedings of the National Academy of Sciences 2004, 101(33):12114–12119.
Keleş S, van der Lann MJ, Vulpe C: Regulatory motif finding by logic regression. Bioinformatics 2004, 20(16):2799–2811.
Beer MA, Tavazoie S: Predicting Gene Expression from Sequence. Cell 2004, 117: 185–198.
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press; 1994:28–36.
Mitchell TM: Machine Learning. New York: McGraw-Hill; 1997.
Devijver PA, Kittler J: Pattern Recognition: A Statistical Approach. Prentice-Hall International; 1982.
Miller AJ: Subset Selection in Regression. Chapman and Hall; 1990.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Research 2004, 32: D493-D496.
Rajewsky N, Vergassola M, Gaul U, Siggia E: Computational detection of genomic cis regulatory modules. BMC Bioinformatics 2002. [http://citeseer.ist.psu.edu/rajewsky02computational.html]
Agresti A: A Survey of Exact Inference for Contingency Tables. Statistical Science 1992, 7: 131–177.
Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Research 2004, 14: 1188–1190.
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne J, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science 2002, 298: 799–804.
Gasch AP, Spellman P, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000, 11(12):4241–57.
This research was supported in part by NIH/NLM training grant 5T15LM005359, NSF grant IIS-0093016, and NIH/NLM grant R01-LM07050-01.
The authors would like to thank Louis Oliphant and Joseph Bockhorst for helpful comments on earlier drafts of this paper.
KN wrote the software and carried out the computational experiments. MC and KN designed the algorithm and experiments. All authors read and approved the final manuscript.
About this article
Cite this article
Noto, K., Craven, M. A specialized learner for inferring structured cis-regulatory modules. BMC Bioinformatics 7, 528 (2006). https://doi.org/10.1186/1471-2105-7-528
- Transcription Start Site
- Model Space
- Structural Aspect
- Candidate Motif
- Strand Preference