A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs
© Seitzer et al.; licensee BioMed Central Ltd. 2012
Received: 7 February 2012
Accepted: 1 November 2012
Published: 27 November 2012
Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research.
We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature.
Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesized 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/.
Keywords: Motif; Monte Carlo; ChIP-seq; ChIP-chip; Comparative genomics; MEME; STAMP; TFB
The problem of discovering functional DNA or protein subsequences (motifs) in biological sequence data has driven the development of numerous motif-finding tools. Two approaches dominate motif-finding algorithms: enumerative word-based methods and probabilistic sequence models that optimize model parameters by applying expectation-maximization techniques or Bayesian inference. These ideas were first applied in a handful of tools developed more than twenty years ago[2–6]. Most subsequently developed motif-finders recapitulate these approaches with subtle variations, and report incremental improvements in motif detection. Motif searches have become more sophisticated in terms of the patterns they can recover: where we could once only discover identical sequence matches, we may now discover gapped, palindromic, and degenerate subsequences[7–11]. The incorporation of meaningful prior information into the motif search may also improve motif detection: motif searches now incorporate background sequence models, phylogenetic information, experimental data, and other large-scale prior information[12–18].
Here we present MotifCatcher, an integrated framework for motif discovery. Our framework utilizes random sampling of the input data set to generate a candidate list of motifs. We extend previous random sampling approaches by developing and applying 3 different random sampling-based candidate motif-generating protocols. We examine the effects of each approach in turn, and obtain the best results with a novel position-specific iterative process. We create a linkage tree of these candidate motifs by integrating with existing motif tree-generating software, which allows for 6 possible motif comparison and evaluation metrics. The user may segment the linkage tree at various thresholds in order to condense individual motif leaves into a single, aggregate motif. Generation and parsing of a linkage tree of motifs has been used previously; in that case, however, the condensation of a set of motifs into a single motif focused on similarities and differences among the individual candidate motifs: an average motif was computed based on an analysis of the output motifs, without consideration of the subsequences from which they were created. The transfer of aligned subsequences to motif representation incurs a loss of information: a motif describes only the frequency of individual sequence elements at particular positions in a set of aligned subsequences, discarding the actual subsequences from which these sequence elements arise. Our motif condensation approach is based on the frequency of occurrence of individual subsequences in the output motifs, which allows us to retain dependent positional relationships between sequence elements in each subsequence string, and to explicitly filter out subsequences erroneously included in the set of output motifs.
For motif-finding to operate smoothly, sequence data sets should be assembled with as high a ratio of motif signal sequence to background sequence as possible. We accomplish this task by randomly sampling the whole data set, and searching for motifs in these random subsets, with the expectation that certain subsets will have a high ratio of motif signal to background sequence. Sequence entries are incorporated in sequence data sets because they are suspected to contain a significant motif, but this need not always be the case. For example, ChIP-chip (Chromatin Immunoprecipitation with hybridization by microarray) and ChIP-seq (Chromatin Immunoprecipitation with massively parallel sequencing) experiments generate sequence data sets reflecting DNA binding events of a particular protein of interest, but not all reported binding events necessarily involve the presence of a unique motif (this may arise as a consequence of indirect protein-protein interactions accompanying protein-DNA associations). Sequence data sets with non-motif-containing sequence entries also arise from non-experimental analyses, such as the comparison of upstream regions of one or more putatively co-regulated genes across a set of related organisms. MotifCatcher’s emphasis on random sampling allows it to find significant motifs in data sets containing a very large number of non-motif-containing entries (Additional file 1: Figure S1 and Additional file 2). In this regard, MotifCatcher offers an advantage over integrated motif discovery frameworks that only aggregate the output of many different motif-finding tools[20, 21], because these motif-finding tools may fail to find motifs in data sets containing many non-motif-containing sequence entries. It is worth mentioning that a number of tools have been developed to discover motifs exclusively in ChIP data[30, 31].
MotifCatcher is appropriately applied to ChIP data; however, the generality of its algorithm makes it applicable to any sequence data set, and especially effective in discovering motifs in sets of sequences where a large number of sequence entries do not contain a motif.
To evaluate the performance of MotifCatcher, we explored several published and unpublished ChIP-chip and ChIP-seq data sets. Our investigations incorporated organisms from all three domains of life and highlighted biological phenomena with associated motifs spanning a large range in length, degeneracy, and prevalence among input sequence entries. We developed two implementations of the MotifCatcher approach, each employing a different mainstream motif-finding program (expectation maximization via the MEME Suite[5, 27], and Gibbs sampling via a Gibbs recursive sampler[4, 32]). In every investigation, we compared the performance of the MotifCatcher extension of the motif finder to the performance of the motif finder alone. Additionally, we evaluate MotifCatcher’s output motif tree to yield meaningful biological insights.
Theory and Motivation
If an input set of sequence entries is corrupted with non-motif-containing sequence entries, then from the standpoint of motif discovery these sequences do not belong in the data set; ideally, we would like to remove them prior to carrying out a motif search. However, without knowing the identity of the motif or the sequence entries that contain it, we cannot remove these sequences directly. This problem has been addressed previously, in the MEME ZOOPS (zero or one occurrence per sequence) algorithm. In the MEME algorithm, sequence data is thought of as the product of a finite mixture model with unknown parameters. Using an expectation maximization algorithm, the values of the parameters are estimated based on the observed sequence data. Using the ZOOPS generalization, sequence entries are each assigned a prior probability that they contain a motif. The posterior probabilities are determined during the expectation maximization process. Low posterior probabilities may therefore effectively remove sequence entries from the data set. This protocol is very effective when the true motif stands out from the background; however, it is more challenging to retrieve the motif in cases where the motif is obscured by a large amount of noisy background sequence. Instead of estimating parameters based on a collective input body of sequences, we explore the more extreme possibility of excluding these sequences entirely: effectively, assigning to a large set of sequences a prior probability of zero, even before any parameter estimation of the data set has occurred.
In order to discover which sequences contain motifs and which do not, we use a random sampling approach combined with motif searches to determine a set of candidate “related subsets” R. Certain R will not yield significant motifs at all, and are discarded. Among the R that yield a significant motif, certain highly similar motifs will tend to be re-discovered from many different random seeds. Candidate significant motifs are hierarchically clustered and joined by a linkage tree, in which frequently re-occurring motifs naturally cluster into highly dense branches. In our analyses, we discovered that convergence to a common point (similar motif) from many different starting states (random subsets of the data set) is often an indication that the motif is meaningful. Our emphasis on convergence to a common end point from many random starting states bears much similarity to a classic Markov chain Monte Carlo algorithm.
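The expectation that some random subsets are enriched for motif-containing entries can be made concrete with a hypergeometric calculation. The following is a minimal sketch, not part of the MotifCatcher package: the function names and the example numbers (100 entries, 20 of them motif-containing, seeds of 10 entries, 200 seeds) are assumptions chosen purely for illustration.

```python
from math import comb


def prob_enriched_seed(n_total, n_motif, seed_size, min_hits):
    """Probability that a random seed of `seed_size` entries, drawn without
    replacement from `n_total` entries of which `n_motif` contain the motif,
    includes at least `min_hits` motif-containing entries (hypergeometric tail)."""
    total = comb(n_total, seed_size)
    favourable = sum(
        comb(n_motif, j) * comb(n_total - n_motif, seed_size - j)
        for j in range(min_hits, min(seed_size, n_motif) + 1)
    )
    return favourable / total


def prob_any_enriched(p_single, n_seeds):
    """Probability that at least one of `n_seeds` independent seeds is enriched."""
    return 1.0 - (1.0 - p_single) ** n_seeds


# Hypothetical example values (not taken from the analyses in this paper).
p = prob_enriched_seed(n_total=100, n_motif=20, seed_size=10, min_hits=5)
print(f"single seed enriched: {p:.4f}")
print(f"at least one of 200 seeds enriched: {prob_any_enriched(p, 200):.4f}")
```

Even when a single seed is unlikely to be enriched, drawing many seeds makes it likely that at least a few are, which is the property the random sampling approach relies on.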
Monte Carlo framework
From an input data set Y of N sequence entries, n random seed subsets Si are extracted, where each seed subset Si is a subset of Y. The number of seeds, n, is a user-specified value, and should be selected according to the size of Y and the expected number of sequence entries in Y thought to contain a subsequence instance of a significant motif. Three alternative schemes are available to create a library of related subsets R from the set of seed subsets, applied to each Si:
(1) MEME ZOOPS MC (MotifCatcher) search: A MEME ZOOPS search is applied to Si, and sequence entries that contain subsequences included in the construction of the MEME ZOOPS-produced motif comprise Ri.
(2) Single MAST MC (MotifCatcher) search: A MEME ZOOPS search is applied to Si, and Y is scanned for the MEME ZOOPS-produced motif using MAST. All sequence entries in Y that contain a significant subsequence match to the preliminary motif comprise Ri.
(3) Iterative MEME/MAST MC (MotifCatcher) search: A MEME ZOOPS search is applied to Si, producing a motif M. Using MAST, Y is scanned for the motif M. All sequence entries in Y that contain a significant match are collected. These sequences constitute the modified seed Si'. MEME ZOOPS is then applied to the modified seed Si' to produce a modified motif M’, which is scanned over Y with MAST as before. This iterative search continues until convergence: a MEME search of a modified seed Simod produces a motif Mmod, and a MAST search of Mmod over Y finds subsequence instances of Mmod in (and only in) Simod. The sequence entries in Simod comprise Ri.
Typically, the iterative search (scheme 3) is to be preferred, as it will tend to converge upon meaningful motifs more often than the non-iterative and single-scan approaches (schemes 1 and 2).
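The control flow of the iterative scheme (3) can be sketched as follows. This is an illustrative outline under stated assumptions, not the MotifCatcher implementation: `find_motif` and `scan` are hypothetical callables standing in for a MEME ZOOPS search and a MAST scan of the full input set (called Y here), and the toy k-mer versions exist only so that the loop can be run without either tool installed.

```python
import random
from collections import Counter


def draw_seeds(n_entries, n_seeds, seed_size, rng):
    """Draw n_seeds random seed subsets (as index sets) from n_entries entries."""
    return [frozenset(rng.sample(range(n_entries), seed_size)) for _ in range(n_seeds)]


def iterative_related_subset(seed, find_motif, scan, max_rounds=50):
    """Refine a seed until a scan of Y with the motif found in the current
    subset hits exactly that subset. Returns (related subset, motif)."""
    current = seed
    motif = find_motif(current)
    for _ in range(max_rounds):          # max_rounds is a safety cap (assumption)
        hits = scan(motif)               # entries of Y with a significant match
        if hits == current:              # converged: matches in, and only in, the subset
            break
        if not hits:                     # nothing matched: no related subset
            return frozenset(), motif
        current = hits
        motif = find_motif(current)
    return current, motif


def make_toy_tools(sequences, k):
    """Toy stand-ins: the 'motif' is the k-mer shared by the most sequences."""
    def find_motif(subset):
        counts = Counter()
        for i in subset:
            s = sequences[i]
            counts.update({s[j:j + k] for j in range(len(s) - k + 1)})
        # sorted() makes tie-breaking deterministic (lexicographically first).
        return max(sorted(counts), key=lambda kmer: counts[kmer])

    def scan(motif):
        return frozenset(i for i, s in enumerate(sequences) if motif in s)

    return find_motif, scan


toy_Y = ["GGTACTGCC", "CCTACTGAA", "ATTACTGGT", "CACACACAC", "GTGTGTGTG"]
find_motif, scan = make_toy_tools(toy_Y, k=5)
related, motif = iterative_related_subset(frozenset({0, 1}), find_motif, scan)
print(sorted(related), motif)
```

In this toy run the seed {0, 1} grows to include entry 2, which shares the same 5-mer, and then stops changing. Schemes 1 and 2 correspond to stopping after the first `find_motif` call and after the first `scan` call, respectively.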
Motif tree construction
A branching diagram (motif tree) is constructed comparing the relative similarity of the motifs associated with each of the related subsets Ri in R. Some of the Ri-associated motifs may not be statistically significant. Thus, all Ri whose associated motif has a high E-value are excluded from further analysis (typically, the E-value cutoff should be no larger than 0.01). As R may be very large, all Ri except a small subset with the lowest Ri-associated motif E-values may be excluded. The STAMP platform (Similarity, Tree-building, and Alignment of DNA Motifs and Profiles)[23, 24] is utilized to organize the remaining Ri into a distance tree according to similarities of their Ri-associated motifs. The pairwise distance between two motifs is computed in a column-by-column fashion using one of several available statistical metrics that can be selected by the user. After a pairwise distance has been computed between all Ri-associated motifs, the distance tree is assembled using either an unweighted pair group method with arithmetic mean (UPGMA) or self-organizing tree algorithm (SOTA), according to the user’s preference.
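To illustrate the idea of a column-by-column distance followed by UPGMA linkage, here is a small self-contained sketch. It is not STAMP: it assumes motifs of equal width that are already aligned, uses a plain Euclidean column distance as a stand-in for the metrics STAMP offers, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def column_distance(col_a, col_b):
    """Euclidean distance between two frequency columns (A, C, G, T)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(col_a, col_b)))


def motif_distance(m1, m2):
    """Mean column distance; assumes equal-width, pre-aligned motifs."""
    return sum(column_distance(a, b) for a, b in zip(m1, m2)) / len(m1)


def upgma_clusters(motifs, threshold):
    """Average-linkage (UPGMA) agglomeration, stopped once the closest pair of
    clusters is farther apart than `threshold` (i.e. the tree cut at that height)."""
    pair = {
        (i, j): motif_distance(motifs[i], motifs[j])
        for i, j in combinations(range(len(motifs)), 2)
    }

    def average(c1, c2):
        return sum(pair[min(i, j), max(i, j)] for i in c1 for j in c2) / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(motifs))]
    while len(clusters) > 1:
        (a, b), dist = min(
            (((x, y), average(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Three toy 2-column motifs: the first two are near-identical, the third differs.
toy_motifs = [
    [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
    [(0.9, 0.1, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.0)],
]
families = upgma_clusters(toy_motifs, threshold=0.2)
print(families)
```

A threshold expressed as a fraction of the maximum pairwise dissimilarity could be obtained by multiplying the largest pairwise distance by that fraction before calling `upgma_clusters`.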
Organization and evaluation
The Ri represented in the motif tree will naturally cluster into groups according to the similarity of their Ri-associated motifs. Each motif family Fk is a collection of Ri (grouped by similarity among their Ri-associated motifs), and each Ri is a collection of sequence entries taken from the whole input set. The motif tree is therefore defined by a set of m non-intersecting motif families. The division of the motif tree into a set of non-intersecting motif families requires that a clustering threshold be imposed upon the Ri represented in the tree. The set of motif families is determined based entirely on the similarity of Ri-associated motifs, without regard to the sequence entries from which subsequences are drawn to create these Ri-associated motifs.
The motif family Fk can be described by a singular characteristic motif, which we refer to as a “familial profile” (FP). Among the collection of Ri that forms Fk, some sequence entries will be re-discovered more frequently than others. An FP is generated for each Fk according to a user-selected FP frequency threshold. To compute the FP, sequence entries Yj that re-occur among the related subsets in a motif family with a frequency greater than or equal to the FP frequency threshold are collected, and a motif is generated from these sequence entries using a MEME OOPS (one occurrence per sequence) model.
To compare the occurrence of different motif families over the set of sequence entries, a “MotifMap” matrix is created. This matrix compares the input sequence entries (rows) to significant FPs (columns). When an instance of the familial profile FPk, derived from family Fk, is found in a particular input sequence entry Yj, the coordinates that this instance spans within the sequence entry are noted at position (row j, column k) in the matrix.
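The selection of sequence entries for a familial profile amounts to a frequency count over the related subsets in a family. The sketch below shows only that selection step, with hypothetical names; the subsequent motif construction (a MEME OOPS search in the text above) is not reproduced.

```python
from collections import Counter


def familial_profile_entries(family, fp_threshold):
    """Return the sequence-entry ids that occur in at least a fraction
    `fp_threshold` of the related subsets making up one motif family.

    `family` is a list of related subsets, each an iterable of entry ids."""
    counts = Counter(entry for subset in family for entry in set(subset))
    n_subsets = len(family)
    return sorted(e for e, c in counts.items() if c / n_subsets >= fp_threshold)


# Toy family of three related subsets (entry ids are arbitrary integers).
toy_family = [{1, 2, 3}, {1, 2}, {1, 4}]
selected = familial_profile_entries(toy_family, fp_threshold=0.6)
print(selected)
```

Entries that appear in only a minority of the related subsets (here 3 and 4) are dropped, which is how subsequences erroneously included in individual output motifs are filtered out before the profile is built.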
Tests on biological data
LexA binding in E. coli
The absence of clearly identifiable motifs in large subsets of ChIP-chip and ChIP-seq data is a common occurrence. This can result from experimental error or protein binding to degenerate or motif-free sites. To test MotifCatcher’s ability to discover a motif in a data set with a large number of motif-free sites, we reanalyzed ChIP-chip data collected by Wade et al., who identified 49 binding sites for the well-studied regulatory protein LexA in E. coli MG1655. Twenty-five of the binding sites were consistent with previous experimentally determined targets (termed type I sites) and 24 were novel sites. Wade et al. scanned the novel binding sites for a significant match to the canonical TACTG(TA)10CAGTA LexA motif. Five of these sites were found to have a canonical LexA motif (and thus termed type II sites), and 19 were determined not to have a canonical LexA motif (termed type III sites). We sought to test whether MotifCatcher could improve upon the previous interpretation of this data set. In our reanalysis of the ChIP-chip data, we assembled a sequence data set based on the 49 LexA binding sites reported by Wade et al., and an additional sequence data set with random genomic sites substituted for the type III sites. A motif search was carried out on both data sets with MEME ZOOPS, recursive Gibbs sampling, MEME ZOOPS MC, single MAST MC, and iterative MEME/MAST MC.
Table: Suggested degenerate LexA binding sites for unconventional LexA targets. Columns: Binding site coordinates; Binding site sequence. Surviving entry: CTG->ATG x2, 2del.
Evidence for alternative Cse4p binding mechanisms in S. cerevisiae
We were interested to test MotifCatcher’s ability to characterize general binding profiles for proteins whose DNA binding signatures are longer and have fewer highly conserved elements than classical transcription factor binding site motifs. Cse4p is a histone H3 variant protein essential for kinetochore function in S. cerevisiae. The protein is known to localize to the centromeric (CEN) site on each chromosome. The protein is recognized at these CEN sites by a ~125-bp stretch of three sequence elements (designated CDEI, CDEII, and CDEIII). A detailed analysis of the biophysical interactions of CEN-binding proteins suggests that Cse4p genetically interacts with only the CDEI and CDEII sites, with the majority of direct protein-DNA interaction occurring with the 74–86 bp A/T-rich CDE II segment. A recent ChIP-seq experiment examining DNA-binding of Cse4p demonstrated that in addition to the 16 CEN binding sites, Cse4p also bound at 142 additional sites.
Lefrançois et al. reported a high correlation of the 142 novel Cse4p binding site target genes with the 100 highest ranked PolII targets (as determined by ChIP-seq), finding 49 sites in common. This led to the hypothesis that the 142 novel binding sites were the result of transient localization of Cse4p to regions of high histone turnover. Our iterative MEME/MAST MC analysis segregated the data set into three non-intersecting groups: Family 1 (66/158 sites, all novel sites), Family 2 (47/158 sites, 16 CEN sites and 31 novel sites), and sites that belonged to neither group (45/158 sites, all novel). Of the 49 novel sites reported by Lefrançois et al. that associated with highly ranked PolII targets, 37/49 (76%) belonged to Family 1, 1/49 (2%) belonged to Family 2, and 11/49 (22%) belonged to neither group. The distribution of PolII target sites among unconventional sites in motif families 1, 2, or neither group could not be explained by a uniform distribution (χ² test, α = 0.001).
Our analysis suggests that sites included in Family 1, but not Family 2, may be consistent with Lefrançois et al.'s hypothesis of transient localization of Cse4p to regions of high histone turnover. While the motif defining Family 1 does not contain any highly conserved features (Figure 5), it does demonstrate a general sequence profile distinct from the 61.86% A/T content of the whole S. cerevisiae genome. The near absence of Family 2 sites from the list of the top 100 PolII sites suggests that this hypothesis is not appropriate for the sites in this family. However, the second motif family was defined by an A/T-rich motif similar to the canonical CDEII motif. All sixteen canonical CEN sites were found in this subset. Considering the similarity of the motif defining the second motif family to the canonical motif, we hypothesize that sites in this group describe sites where Cse4p interacts with DNA directly, in a manner similar to Cse4p binding to the CDEII region of canonical CEN sites.
Coordinating the motif families, familial profiles, and branching diagram outputs generated by MotifCatcher, we conclude that only a fraction (66/142) of the novel Cse4p binding sites previously thought to localize to sites of high histone turnover actually do so. Another fraction (31/142) of the novel sites show a sequence profile similar to the canonical CDEII motif; binding at these sites therefore likely occurs directly to the DNA, following the binding mechanism associated with the CDEII region at centromeric sites.
Discovery of closely related motif variants for a family of homologous transcription factors: general transcription factors TfbB, TfbD, and TfbG in Halobacterium sp. NRC-1
One of the main mechanisms for the evolution of gene regulatory networks is the duplication and subsequent divergence of transcription factors. Often this process happens multiple times and leads to the accumulation of families of transcription factors functioning simultaneously in the cell. Depending on the degree of divergence, multiple homologous transcription factors may target both overlapping sets of promoters and unique promoter sets. The resultant partitioning of gene regulatory networks by this mechanism is thought to be at the basis of some important functional properties of gene regulatory networks and could inform re-engineering of biological networks. At the root of this problem is the ability to distinguish functional differences between often-similar binding sites of homologous transcription factors.
In Archaea, two general transcription factors are thought to be necessary and sufficient for initiating basal transcription: homologs of eukaryotic (1) TFIIB (referred to in archaea as TFBs) and (2) TATA-binding proteins (TBPs). These general transcription factors are present in multiple copies in several archaea. For example, the genome of the archaeon Halobacterium sp. NRC-1 encodes 6 different TBP proteins (tbpA-F) and 7 different TFB proteins (tfbA-G), so Hb. NRC-1 could use up to 42 different TBP-TFB complexes. Evidence for at least 7 of these has been observed. In the same study, Facciotti et al. observed that the sets of genes bound by each of the TFB homologs in Hb. NRC-1 appear to be partially overlapping while still including distinct functional groupings.
We hypothesized that, given the overlap in promoter binding sites for many of the TFBs assayed by Facciotti et al., a binding motif common to all TFBs must be present and likely includes a signature similar to the TFIIB recognition element (BRE) described earlier[45, 46]. We also proposed that direct protein-DNA interaction could play a role in discriminating between these different types of TFB binding sites. However, previous efforts to simply identify a TFB motif from ChIP-chip data using the MEME suite had been unsuccessful. Testing this hypothesis requires that (1) there are subtle differences in the binding sites between these different TFB proteins, (2) an experimental protocol exists that can confidently reduce the search space for motif-finding algorithms, and (3) a sensitive tool for detecting and classifying these differences can be applied. We investigated our binding hypothesis by using a new ChIP-seq protocol for Hb. NRC-1, which provides greater spatial resolution than the previously adopted ChIP-chip protocol, together with MotifCatcher, to help discover a general TFB motif and, if possible, distinguish subtle differences between the TfbB, TfbD, and TfbG binding sites.
For each of the 7 TFB-binding ChIP-seq peak lists, five groups of sequence data sets were constructed: (1) 60-bp, (2) 100-bp, and (3) 200-bp stretches of sequence centered about the genomic location of each peak center; (4) 60-bp stretches of sequence centered around a site displaced 60 nt upstream of the center of each ChIP-seq peak; and (5) 60-bp stretches of sequence centered about each peak center after the members of the 7 TFB-binding ChIP-seq peak lists had been randomly shuffled into 7 new groups. In addition to these five groups, we created a final data set containing 126 random 60-bp stretches of sequence taken from around the Hb. NRC-1 genome.
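The peak-centered and displaced windows described above reduce to simple coordinate arithmetic. The helper below is a hypothetical sketch: it assumes 0-based, half-open coordinates on a single linear replicon, treats "upstream" as a negative offset on the forward strand, and clips at the replicon ends; none of these conventions are stated in the text.

```python
def peak_window(center, width, replicon_length, offset=0):
    """Half-open [start, end) window of `width` bp centred on center + offset.

    Windows that would run off either end of the replicon are clipped, so they
    may be shifted (at the start) or shortened (at the end)."""
    mid = center + offset
    start = max(0, mid - width // 2)
    end = min(replicon_length, start + width)
    return start, end


# Windows for one hypothetical peak centre at position 1000 on a 1 Mb replicon.
peak, length = 1000, 1_000_000
centred = {w: peak_window(peak, w, length) for w in (60, 100, 200)}
displaced = peak_window(peak, 60, length, offset=-60)   # 60 nt upstream (assumed sign)
print(centred, displaced)
```

The displaced window does not overlap the 60-bp centred window, which is what makes it usable as a nearby control region.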
Table: Ability to discover TFB motif from ChIP-seq data sets. Columns: Iterative MEME/MAST MC; Iterative Gibbs/MAST MC; Iterative Gibbs/MAST MC + Ideal.
The MotifCatcher algorithm frames an existing pattern detection tool (motif finder) in a Monte Carlo simulation framework, and organizes significant candidate output motifs produced by that tool into a branching diagram that may be further processed. It is not in itself a motif-finder, but rather a generic strategy to extend the sensitivity and utility of existing motif-finders. Here, we demonstrated that application of two motif-finders supplemented with our MotifCatcher extension algorithm resulted in an increase in performance compared to the motif finder alone (LexA binding in E. coli and TFB binding in Hb. NRC-1). We demonstrated this result using two widely used motif finders that rely on different motif-finding strategies (the MEME Suite and recursive Gibbs sampler).
Beyond simple increases in detection performance, our novel approach of organizing candidate motifs in a tree diagram highlighted an exciting feature of MotifCatcher. In our analysis of ChIP-seq-derived Cse4p binding in yeast, we discovered that the input sequence entries naturally segregated into two non-intersecting groups, entirely according to the discovered motifs. We were able to support the conclusion that this segregation is appropriate by discovering a very similar segregation of the data set based on correlation with PolII binding sites. From our motif-finding analyses related to homologous TFBs in Hb. NRC-1, we were able to repeatedly discover a statistically significant motif present in all of the data sets. Comparing this motif with an archaeal promoter sequence used in a DNase footprinting experiment, we were able to map significant elements of our motif to the well-characterized TATA box, BRE, and PPE motifs. A comparison of the distance of elements in our motif to annotated and experimentally determined transcript start sites was in precise agreement with the distances described from the DNase footprinting experiment. Significant differences were discovered in the BRE element as it was recovered from the different TFB data sets. We hypothesize that these differences confer specificity for binding among various TFB proteins. Our results suggest an experimental investigation, which could further elucidate crucial details of the mechanisms of TFB binding in Hb. NRC-1.
In general, MotifCatcher may be used to suggest ways that a single data set might be more appropriately segregated into several smaller data sets (as demonstrated in both the S. cerevisiae and Hb. sp. NRC-1 analyses). The MotifMap utility, which maps discovered motifs to the input sequence entries themselves, coupled with statistical measures to evaluate significant co-occurrences and co-localizations of significant motifs, increases the power of this feature. In addition to improving motif detection sensitivity, MotifCatcher allows one to better organize and categorize biological sequence datasets based on discovered motifs.
All desired specifications regarding the nature of the sought motif (minimum width, maximum width, the option to check for motif instances on the reverse complement strand, the option to force the motif to be palindromic in nature, etc.) are user-input. These specifications are applied to the motif-finding search, which in this implementation is accomplished by the MEME ZOOPS (Zero or One Occurrence Per Sequence) model. For motif searches, a background model can either be supplied by the user or built from the input data set (with order appropriate for the total number of characters in that data set).
In the MotifCatcher software package, a graphical user interface (GUI) allows the user to navigate the consequences of segmenting a motif tree at various clustering thresholds. The clustering threshold varies according to the topology of the tree, but as a general rule, the clustering threshold should be quite stringent (only highly similar Ri-associated motifs are grouped together). This preference is incorporated into the MotifCatcher software default settings.
The MotifCatcher software platform, in its current widely released implementation, coordinates with (1) the MEME Suite (v. 4.5.0) and (2) the STAMP platform (v. 1.1). Both programs must be installed and configured correctly prior to MotifCatcher installation. MotifCatcher is implemented in MATLAB and, beyond the standard MATLAB toolboxes, relies on MATLAB’s commercially available Bioinformatics Toolbox.
The MotifCatcher software is freely available at the Facciotti lab website (http://www.bme.ucdavis.edu/facciotti/resources_data/software/).
Preparation of biological data
The MEME Suite version 4.5.0 (including the MAST program, version 4.5.0) was used for all MEME searches and MC iterative MEME/MAST searches, and the Gibbs recursive sampler version 3.1 was used for all Gibbs searches and MC iterative Gibbs/MAST searches. For all motif searches, motifs could be discovered on the forward or reverse strand, with no preference to discover palindromic motifs. A 3rd-order background model was generated from all sequences in the input data set and applied to all MEME and MotifCatcher MEME searches (unless otherwise mentioned). Sequence entries were set to have 0 or 1 instances of a motif. For all MotifCatcher searches, whenever MAST was incorporated in a related subset determination protocol, it was always used with its default similarity threshold value of 10. All Ri with an associated motif E-value greater than 0.001 were excluded from the construction of the motif tree and from further analyses. Motif trees were built by comparing the ALLR (average log-likelihood ratio) between Ri-associated motifs, and a distance tree was constructed using UPGMA linkage. All Ri in the motif tree were segregated into families using a clustering threshold of 5% of the maximum dissimilarity discovered among the whole set of all Ri-associated motifs. All motif logos were created using WebLogo version 3.1.
LexA binding in E. coli
Relevant sequence data sets were created by extracting 1000-bp regions centered at the ChIP-chip peaks reported by Wade et al. The original data set consisted of the original 1000-mers from the ChIP-chip experiment (type I, II, and III sites). The random substituted data set was composed of the original type I and type II 1000-mers together with 19 non-overlapping 1000-mers from random genome sites. All motif searches utilized the standard set of motif-finding parameters, where motifs could be anywhere from 10 to 30 nucleotides in length. The number of seeds created for each MC motif search varied to generate comparable motif trees for each related subset determination option (500, 200, and 50 seeds, respectively, for the MEME ZOOPS MC search, single MAST MC search, and iterative MEME/MAST MC search). Aside from the disparate number of seeds, all MC motif searches were identical.
Evidence for alternative Cse4p binding mechanisms in S. cerevisiae
ChIP-seq enrichment regions were taken from supplemental data by Lefrançois et al. Accounting for 3 replicate experiments, there were 158 sequence entries, with 16 large peaks associated with the CEN regions, and 142 smaller peaks corresponding to novel Cse4p binding sites. An iterative MC MEME/MAST search was undertaken with 50 random seeds containing 8 sequence entries, searching for motifs of 125 bp in length (the length of the canonical CDEI-CDEII-CDEIII motif). A 3rd-order background model was generated from the whole genome (strain S288c) and applied to all motif searches.
Nearby open reading frames for the 100 highest-scoring polII sites, based on the number of excess reads (sample reads – input reads) as determined from the first replicate experiment performed by Lefrançois et al., were extracted and compared to the closest open reading frame to each Cse4p binding site (as determined by the ChIP-seq experiment). The MotifCatcher-derived family was determined for each of the 49 sites (1/49 from Family 1, 37/49 from Family 2, and 11/49 in neither group). Based on this breakdown, the null hypothesis that the 49 sites were randomly distributed over the 3 MotifCatcher-derived families was rejected at a significance level of α = 0.001.
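The statistical test behind this rejection is not named in the text. One plausible reading, sketched below as an assumption, is a chi-square goodness-of-fit test of the observed counts (1, 37, 11) against equal expected counts in the three groups. With three categories there are 2 degrees of freedom, for which the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_uniform(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


observed = [1, 37, 11]  # Family 1, Family 2, neither (counts from the text)
stat = chi_square_uniform(observed)

# 3 categories -> 2 degrees of freedom, where P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}")
```

Under this assumed test the p-value falls far below 0.001, which is consistent with the rejection reported above.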
Discovery of closely related motif variants for a family of homologous transcription factors: general transcription factors TfbB, TfbD, and TfbG in Halobacterium sp. NRC-1
In the determination of genomic sites accessible to one or more TFB proteins, peak centers within 30 nt of each other were considered to be the same site (in accordance with the resolution of the ChIP-seq experiment). In all motif searches, motifs could be anywhere from 6 to 30 nt in length. For iterative MC MEME/MAST runs, 10% of the sequence entries in the whole input data set were selected for each random seed; for iterative MC Gibbs sampler/MAST runs, seeds always consisted of 20 randomly selected sequence entries.
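The two bookkeeping steps described above, collapsing nearby peak centers and drawing random seeds, can be sketched as follows. The function names `merge_peaks` and `draw_seed` are hypothetical, and the chained merging rule (a peak joins a site if it lies within 30 nt of the previous peak in that site) is one possible interpretation of "within 30 nt of each other".

```python
import random


def merge_peaks(centers, max_gap=30):
    """Collapse peak centers lying within `max_gap` nt of the previous one."""
    sites = []
    for c in sorted(centers):
        if sites and c - sites[-1][-1] <= max_gap:
            sites[-1].append(c)
        else:
            sites.append([c])
    return sites


def draw_seed(entries, fraction=None, size=None, rng=random):
    """Pick a random seed: a fraction of the entries, or a fixed number."""
    if (fraction is None) == (size is None):
        raise ValueError("give exactly one of fraction or size")
    k = size if size is not None else max(1, round(fraction * len(entries)))
    return rng.sample(entries, k)


sites = merge_peaks([400, 120, 100, 145])
entries = [f"seq{i}" for i in range(100)]
meme_seed = draw_seed(entries, fraction=0.10)   # 10% of entries per seed
gibbs_seed = draw_seed(entries, size=20)        # fixed 20 entries per seed
```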
To evaluate the results of motif finding applied to the TFB sequence data sets, all statistically significant output motifs were manually inspected. Only motifs containing sequence elements resembling the TATA box, BRE, and PPE sequence motifs at the appropriate spacing were considered TFB motif matches. For the multiple alignments of representative TFB motifs, a sequence data set of 100-nt regions centered at all discovered peaks was constructed and scanned with MAST for the putative TFB motif, with a p-value threshold of 0.01 and an E-value threshold of 100. Sites that showed multiple matches, or matches far from the center of the sequence, were discarded. All remaining matches were aligned, with 9 additional bp added at the 3’ end in the alignment. Logos were constructed using WebLogo (version 3.1). 233/417 (53%) of TFB binding sites were used in the alignment: specifically, 51/102 (50%) from TfbB; 47/96 (49%) from TfbD; 51/126 (41%) from TfbG; 22/43 (51%) from TfbBD; 20/40 (50%) from TfbBG; 19/26 (73%) from TfbDG; and 23/38 (61%) from TfbBDG.
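The hit-filtering rule (discard sequences with multiple matches or with a match far from the center) can be sketched as below. This is an illustration under stated assumptions: the input is a list of already-parsed hits rather than real MAST output, the function name `central_single_hits` is hypothetical, and `max_offset` is an invented cutoff because the text does not state how far from the center a match was allowed to be.

```python
from collections import defaultdict


def central_single_hits(hits, seq_length=100, max_offset=20):
    """Keep sequences with exactly one hit lying near the sequence center.

    hits: iterable of (sequence_id, start, end) with 0-based, end-exclusive
    coordinates. `max_offset` is a hypothetical cutoff on the distance
    between the hit midpoint and the sequence midpoint.
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in hits:
        by_seq[seq_id].append((start, end))
    kept = {}
    center = seq_length / 2
    for seq_id, spans in by_seq.items():
        if len(spans) != 1:
            continue  # multiple matches: discard the sequence
        start, end = spans[0]
        if abs((start + end) / 2 - center) <= max_offset:
            kept[seq_id] = (start, end)
    return kept


toy_hits = [("a", 40, 60), ("b", 0, 20), ("c", 30, 50), ("c", 60, 80)]
kept = central_single_hits(toy_hits)
```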
This work was funded by NSF grant #EF0949453 and startup funds to MTF.
- GTF: General transcription factor
- ChIP-chip: Chromatin immunoprecipitation with hybridization by microarray
- ChIP-seq: Chromatin immunoprecipitation with massively parallel sequencing
- MEME: Multiple expectation maximization for motif elicitation
- MAST: Motif alignment and search tool
- ZOOPS: Zero or one occurrence per sequence
- OOPS: One occurrence per sequence
- STAMP: Similarity, tree-building, and alignment of DNA motifs and profiles
- UPGMA: Unweighted pair group method with arithmetic mean
- SOTA: Self-organizing tree algorithm
- Hb. NRC-1: Halobacterium sp. NRC-1
- TFB: Transcription factor IIB homologue
- BRE: TFB recognition element
- PPE: Proximal promoter element
Acknowledgements are in order to Mike Starr, Aaron Darling, Russell Neches, and Erin Lynch for fruitful intellectual discussions, and to the Facciotti lab as a whole.
- Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics 2007, 8(Suppl 7):S21. 10.1186/1471-2105-8-S7-S21
- Lawrence CE, Reilly AA: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7: 41–51. 10.1002/prot.340070105
- Hertz GZ, Hartzell GW, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in the Biosciences (CABIOS) 1990, 6: 81–92.
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):201–214.
- Bailey TL, Elkan CP: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings International Conference on Intelligent Systems for Molecular Biology (ISMB) 1994, 2: 28–36.
- Helden JV, Andre B: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281: 827–842. 10.1006/jmbi.1998.1947
- Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proceedings International Conference on Intelligent Systems for Molecular Biology (ISMB) 1995, 3: 21–29.
- Tompa M: An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. Proceedings International Conference on Intelligent Systems for Molecular Biology (ISMB) 1999, 7: 262–271.
- Liu X: Pac Symp Biocomput 2001, 6: 127–138.
- Thijs G, Marchal K, Moreau Y: A Gibbs Sampling Method to Detect Overrepresented Motifs in the Upstream Regions of Coexpressed Genes. J Comput Biol 2002, 9: 447–464. 10.1089/10665270252935566
- Carlson JM, Chakravarty A, DeZiel CE, Gross RH: SCOPE: a web server for practical de novo motif discovery. Nucleic Acids Res 2007, 35: W259-W264. 10.1093/nar/gkm310
- Wang T, Stormo GD: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 2003, 19: 2369–2380. 10.1093/bioinformatics/btg329
- Berezikov E, Guryev V, Plasterk RH, Cuppen E: CONREAL: conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res 2004, 14: 170–178.
- Prakash A, Blanchette M, Sinha S, Tompa M: Motif discovery in heterogeneous sequence data. Pac Symp Biocomput 2004, 359: 348–359.
- Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5: 170. 10.1186/1471-2105-5-170
- Moses A, Chiang D, Eisen M: Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac Symp Biocomput 2004, 359: 324–35.
- Wang T, Stormo GD: Identifying the conserved network of cis-regulatory sites of a eukaryotic genome. Proc Natl Acad Sci USA 2005, 102: 17400–17405. 10.1073/pnas.0505147102
- Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol 2005, 1: e67. 10.1371/journal.pcbi.0010067
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431: 99. 10.1038/nature02800
- Hu J, Yang YD, Kihara D: EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics 2006, 7: 342. 10.1186/1471-2105-7-342
- Habib N, Kaplan T, Margalit H, Friedman N: A Novel Bayesian DNA Motif Comparison Method for Clustering and Retrieval. PLoS Comput Biol 2008, 4: e1000010. 10.1371/journal.pcbi.1000010
- Sandelin A, Wasserman WW: Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol 2004, 338: 207–215. 10.1016/j.jmb.2004.02.048
- Mahony S, Auron PE, Benos PV: DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol 2007, 3: e61. 10.1371/journal.pcbi.0030061
- Mahony S, Benos PV: STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res 2007, 35: W253-W258. 10.1093/nar/gkm272
- Piipari M, Down T, Hubbard TJ: Metamotifs: a generative model for building families of nucleotide position weight matrices. BMC Bioinformatics 2010, 11: 348. 10.1186/1471-2105-11-348
- Thomas-Chollier M, Sand O, Turatsinze JV, Janky R, Defrance M, Vervisch E, Brohée S, et al.: RSAT: regulatory sequence analysis tools. Nucleic Acids Res 2008, 36: W119-W127. 10.1093/nar/gkn304
- Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 2006, 34: W369-W373. 10.1093/nar/gkl198
- MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006, 7: 113. 10.1186/1471-2105-7-113
- Novichkov PS, Rodionov D, Stavrovskaya ED, Novichkova ES, Kazakov AE, Gelfand MS, Arkin AP, et al.: RegPredict: an integrated system for regulon inference in prokaryotes by comparative genomics approach. Nucleic Acids Res 2010, 38: W299-W307. 10.1093/nar/gkq531
- Kulakovskiy IV, Boeva V, Favorov AV, Makeev VJ: Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 2010, 26: 2622–2623. 10.1093/bioinformatics/btq488
- Hu M, Yu J, Taylor JMG, Chinnaiyan AM, Qin ZS: On the detection and refinement of transcription factor binding sites using ChIP-Seq data. Nucleic Acids Res 2010, 38: 2154–2167. 10.1093/nar/gkp1180
- Thompson W: Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res 2003, 31: 3580–3585. 10.1093/nar/gkg608
- Karandikar RL: On the Markov Chain Monte Carlo (MCMC) method. Sadhana 2006, 1: 20–104.
- Bailey TL, Gribskov M: Combining evidence using p-values: application to sequence homology searches. Bioinformatics 1998, 14: 48–54. 10.1093/bioinformatics/14.1.48
- Wade JT, Reppas NB, Church GM, Struhl K: Genomic analysis of LexA binding reveals the permissive nature of the Escherichia coli genome and identifies unconventional target sites. Genes Dev 2005, 19: 2619–2630. 10.1101/gad.1355605
- Walker GC: Mutagenesis and inducible responses to deoxyribonucleic acid damage in Escherichia coli. Microbiol Rev 1984, 48: 60–93.
- Roth FP, Hughes J, Estep P: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998, 16: 939–945. 10.1038/nbt1098-939
- Nagarajan N, Ng P, Keich U: Refining Motif Finders With E-Value Calculations. RECOMB on Regulatory Genomics 2006, 73.
- Meluh PB, Yang P, Glowczewski L, Koshland D, Smith MM: Cse4p Is a Component of the Core Centromere of Saccharomyces cerevisiae. Cell 1998, 94: 607–613. 10.1016/S0092-8674(00)81602-5
- Hegemann JH, Fleig UN: The Centromere of Budding Yeast. Bioessays 1993, 15: 451–460.
- Keith KC, Fitzgerald-Hayes M: CSE4 Genetically Interacts With the Saccharomyces cerevisiae Centromere DNA Elements CDE I and CDE II but Not CDE III: Implications for the Path of the Centromere DNA Around a Cse4p Variant Nucleosome. Genetics 2000, 156: 973–981.
- Lefrançois P, Euskirchen GM, Auerbach RK, Rozowsky J, Gibson T, Yellman CM, Gerstein M, et al.: Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing. BMC Genomics 2009, 10: 37. 10.1186/1471-2164-10-37
- Teichmann S, Babu MM: Gene regulatory network growth by duplication. Nat Genet 2004, 36: 492–496. 10.1038/ng1340
- Facciotti MT, Reiss DJ, Pan M, Kaur A, Vuthoori M, Bonneau R, Shannon P, et al.: General transcription factor specified global gene regulation in archaea. Proc Natl Acad Sci USA 2007, 104: 4630–4635. 10.1073/pnas.0611663104
- Littlefield O, Korkhin Y, Sigler PB: The structural basis for the oriented assembly of a TBP/TFB/promoter complex. Proc Natl Acad Sci USA 1999, 96: 13668–13673. 10.1073/pnas.96.24.13668
- Lagrange T, Kapanidis AN, Tang H, Reinberg D, Ebright RH: New core promoter element in RNA polymerase II-dependent transcription: sequence-specific DNA binding by transcription factor IIB. Genes Dev 1998, 12: 34–44. 10.1101/gad.12.1.34
- Wilbanks EG, Larsen DJ, Neches RY, Yao AI, Wu C-Y, Kjolby RS, Facciotti MT: A workflow for genome-wide mapping of archaeal transcription factors with ChIP-seq. Nucleic Acids Res 2012, 40: e74. 10.1093/nar/gks063
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26: 1351–1359. 10.1038/nbt.1508
- Renfrow MB, Naryshkin N, Lewis LM, Chen HT, Ebright RH, Scott R: Transcription factor B contacts promoter DNA near the transcription start site of the archaeal transcription initiation complex. J Biol Chem 2004, 279: 2825–2831.
- Hain J, Reiter WD, Hüdepohl U, Zillig W: Elements of an archaeal promoter defined by mutational analysis. Nucleic Acids Res 1992, 20: 5423–5428. 10.1093/nar/20.20.5423
- Bell SD, Jackson SP: The role of transcription factor B in transcription initiation and promoter clearance in the archaeon Sulfolobus acidocaldarius. J Biol Chem 2000, 275: 12934–12940. 10.1074/jbc.275.17.12934
- Nrc H, Res G, Microbiol M, Res DNA, Tbp M, Tbps A, Biol M: MicroCorrespondence. Mol Microbiol 2000, 36: 1999–2000.
- Torarinsson E, Klenk HP, Garrett R: Divergent transcriptional and translational signals in Archaea. Environ Microbiol 2005, 7: 47–54. 10.1111/j.1462-2920.2004.00674.x
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14: 1188–1190. 10.1101/gr.849004
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.