The value of position-specific priors in motif discovery using MEME
- Timothy L Bailey^{1}Email author,
- Mikael Bodén^{1},
- Tom Whitington^{1} and
- Philip Machanick^{1}
https://doi.org/10.1186/1471-2105-11-179
© Bailey et al; licensee BioMed Central Ltd. 2010
Received: 10 February 2010
Accepted: 9 April 2010
Published: 9 April 2010
Abstract
Background
Position-specific priors have been shown to be a flexible and elegant way to extend the power of Gibbs sampler-based motif discovery algorithms. Information of many types–including sequence conservation, nucleosome positioning, and negative examples–can be converted into a prior over the location of motif sites, which then guides the sequence motif discovery algorithm. This approach has been shown to confer many of the benefits of conservation-based and discriminative motif discovery approaches on Gibbs sampler-based motif discovery methods, but has not previously been studied with methods based on expectation maximization (EM).
Results
We extend the popular EM-based MEME algorithm to utilize position-specific priors and demonstrate their effectiveness for discovering transcription factor (TF) motifs in yeast and mouse DNA sequences. Utilizing a discriminative, conservation-based prior dramatically improves MEME's ability to discover motifs in 156 yeast TF ChIP-chip datasets, more than doubling the number of datasets where it finds the correct motif. On these datasets, MEME using the prior has a higher success rate than eight other conservation-based motif discovery approaches. We also show that the same type of prior improves the accuracy of motifs discovered by MEME in mouse TF ChIP-seq data, and that the motifs tend to be of slightly higher quality those found by a Gibbs sampling algorithm using the same prior.
Conclusions
We conclude that using position-specific priors can substantially increase the power of EM-based motif discovery algorithms such as MEME algorithm.
Keywords
Background
Short, linear sequence motifs in protein or nucleic acid sequences are of considerable interest to biologists. This type of sequence pattern is often indicative of important biological sequence signals such as transcription factor binding sites (TFBSs) or splice junctions in nucleotide sequences, and of sumoylation sites and stabilization domains in proteins. Consequently, there has been long and continuing interest in developing software algorithms that can automatically discover functional sequence motifs in sets of biopolymer sequences suspected to harbor one or more common sequence signals.
Biological sequence motifs are often quite subtle, and discovering them in a set of sequences is often impossible since real motifs may be indistinguishable from random artifacts. This has encouraged the development of specialized motif discovery algorithms that can effectively utilize information in addition to the sequences themselves.
One successful approach for improving motif discovery using auxiliary data has been to incorporate evolutionary conservation information into the discovery process [1, 2]. This approach typically augments the set of input sequences with one or more phylogenetic relatives of each of the original sequences. Such motif discovery algorithms are designed to emphasize motifs that are conserved across related species, on the assumption that such motifs are more likely to be functional.
Another fruitful approach has been to utilize biological information to select a "negative" set of sequences, and to modify the search process to seek motifs that are relatively over-represented in the "positive" sequences. This second approach can be also viewed as discovery of discriminative motifs [3, 4]. Using negative sequences has the effect of steering the motif discovery process away from sequence patterns that are due to effects unrelated to the particular type of motif being sought. This is desirable when searching for binding site in genomic sequences due to the extremely non-random nature of genomic DNA.
A third approach for improving motif discovery has been to seek motifs whose presence in sequences is correlated with some biological signal such as mRNA level. These approaches typically use a regression model, and look for motifs that minimize the residual error between a biological signal associated with each input sequence and a motif-based mathematical model of the signal [5, 6].
Recently, a general approach has been proposed that allows the incorporation of almost any type of auxiliary information into the class of motif discovery algorithms based on Gibbs sampling [7]. The additional information is converted into a measure of the likelihood that a motif starts at each position in each sequence in the input. This is measure is referred to as a "position-specific prior" (PSP). Gibbs sampling algorithms optimize a Bayesian sequence model, and the PSP approach allows the (summarized) auxiliary information to bias the optimization toward real motifs.
The PSP approach has several advantages. Firstly, it can directly and simultaneously incorporate multiple types of auxiliary data into motif discovery. Secondly, it cleanly separates the problem model optimization from any issues arising from trying to incorporate heterogenous data into the biological sequence model. Thirdly, the PSP methodology can sometimes avoid the severe increase in computational complexity suffered by many of previous approaches to incorporating auxiliary information into motif discovery.
The PSP approach has shown great promise in extending the power of Gibbs sampling-based motif discovery algorithms [7]. For example, a "discriminative conservation" ( ) prior has been shown to be extremely effective for discovering TFBS motifs in yeast sequences when used with a Gibbs sampling algorithm [8]. Incorporation of nucleosome positioning and transcription factor structural class information into a PSP has also proved useful in the discovery of yeast TFBS motifs [9]. However, the benefits of PSPs to EM-based algorithms (such as MEME) has yet to be studied.
In this paper, we describe extending MEME to enable it to use position-specific priors. Like Gibbs sampling-based algorithms, the popular MEME motif discovery algorithm [10] uses a Bayesian probabilistic model in the search for motifs. To allow comparison with previous work, we study the affect of using the PSP with MEME. This PSP combines evidence of evolutionary conservation with the ability of a motif to discriminate between sequences binding the TF and those that do not.
To explore the benefits of using PSPs with MEME, we focus on discovering TFBS motifs in chromatin immunoprecipitation (ChIP) data for yeast and mouse transcription factors. We show that using the PSP greatly improves MEME's ability to discover motifs in an extremely well-studied example of 156 sequence sets derived from TF ChIP-chip (ChIP followed by microrarray) experiments in yeast. In fact, using this PSP, MEME discovers the correct TF motif in more of the yeast ChIP-chip datasets than six other algorithms that use conservation information, including the Gibbs sample using the same PSP. We further show that using the PSP, MEME discovers more accurate motifs in mouse ChIP-seq (ChIP followed by sequencing) data [11].
Methods
We describe the enhancements to MEME required for reading in and utilizing a file containing a position-specific prior corresponding to the input DNA or protein sequences. We cover how PSP information is utilized during each of MEME's three major phases. We also describe how MEME converts a prior on motifs of a width w_{0} to a prior on motifs of width w ≠ w_{0} in order to allow MEME to discover motifs of a width different than that for which the prior was derived. (Further implementation details are given in Additional file 1.)
Incorporating position-specific priors into MEME
The basic task of biological sequence motif discovery is, given a set of DNA or protein sequences, to determine which positions in the sequences are motif occurrences (sites). MEME does this using a statistical sequence model that it creates based on certain hints provided by the user about the number of sites expected in each sequence and the width of the motif sites. The parameters of the model are referred to here collectively as ϕ. MEME discovers motifs by optimizing the statistical parameters of the model using the Expectation Maximization (EM) algorithm. The statistical parameters of the model include a position-specific probability matrix (PSPM) representation of the motif, referred to here as θ. The PSPM specifies the probability of each possible letter (amino acid or nucleotide) at each of the w positions in the motif. A motif is a model of aligned words of a specific length k (k-mers), each from a different sequence, representing the likelihood of a given letter appearing at each position.
To discover a motif, MEME proceeds in three phases. In Phase I, MEME determines good starting points for the EM algorithm. Since MEME automatically determines the width of the motif and the number of sites, this first phase actually selects starting points for various combinations of motif width and number-of-sites. In Phase II, MEME runs the EM optimization algorithm from each starting point determined in the first phase in order to produce a candidate PSPM representation of the motif. In Phase III, MEME scores each of the candidate motif PSPMs found by EM. To do this, it uses the candidate motif PSPM to predict motif sites, and calculates the statistical significance of the relative entropy of the predicted sites. MEME outputs the candidate motif with the highest statistical significance. Once the motif has been selected and output, MEME "probabilistically erases" the motif's sites, and begins the process again at Phase I in order to discover further motifs.
We describe below how we have enhanced MEME to utilize PSP information in each of its three phases. In what follows, we assume that MEME has been provided with a set of n sequences X = {X_{1}, ..., X_{ n }}. For notational convenience we assume, without loss of generality, that all input sequences have the same length, L.
Overview of position-specific priors
For convenience, we define the special value P_{i,0}to be the prior probability there is no motif site anywhere in sequence X_{ i }. To complete our definition of what a PSP is, we add the assumption that a PSP is tied to a particular motif width, w_{0}. Therefore, the meaning of P_{i,j}is the prior probability of any motif of width w_{0} having a site at position j in sequence X_{ i }. (We discuss later how MEME derives PSPs of different widths from a fixed-width PSP given in its input.)
MEME only allows sites that fit completely within a sequence, so we require that the last w - 1 positions in a sequence have P_{i,j}= 0. MEME can require every sequence to have one site (OOPS sequence model) or it can allow sequence to have zero or one sites (ZOOPS sequence model). Clearly, this implies (based on our definition of P_{i,0}, above) that P_{i,0}= 0 in the OOPS sequence model. For the ZOOPS model, we allow P_{i,0}to have any value in the range [0, ..., 1]. MEME has one more model–the ANR model–that allows any number of motif sites any sequence. We have not yet implemented PSPs for this model.
MEME searches for motifs in the protein or DNA sequences given in its input. However, MEME can also search for DNA motifs that may have sites on either strand. In that case, we index the sites on the opposite strand from -L to -1 and we then define Z_{i,j}and P_{i,j}for j ∈ [-L, ..., L]. In order for the P_{i,j}to define a probability distribution, they must all lie in the the range [0, ..., 1] and, for the OOPS and ZOOPS sequence models they must sum to 1 for i = 1, ..., n, where n is the number of sequences in the input to MEME. For all sequence models, the sum over site position, j, runs from 0 to L (rather than -L to L) in the protein and single-stranded DNA cases. Note that we define P_{i,j}= 0 for all values of j where a motif would not fit entirely within the sequence.
Our implementation of PSPs in MEME has one additional constraint. When we are considering motifs that may occur on either DNA strand (the strand given in the input sequences or its reverse complement), we require that the PSP be symmetrical. That is, we require that P_{i,j}= P_{i,-j}for all sequences X_{ i }and sequence positions j. This restriction seems reasonable to us, since the prior probability of any DNA motif in a set sequences by definition is the same as that of its reverse complement motif.
Providing position-specific priors to MEME
MEME can now read PSPs in a format described in Additional file 1. When a PSP file is not provided, MEME assumes, as before, a uniform prior over motif site positions. PSPs can be generated using Hartemink software [8] followed by conversion to MEME's PSP format as described in Additional file 1. The MEME PSP format requires that the set of prior values, {P_{i,j}} for i = 1, ..., N and j = 0, ..., L, be specified, and that they obey all the constraints described above. For any sequences for which priors are not supplied in the PSP file, priors are calculated as uniform priors. The MEME PSP format includes the width, w_{0}, of the motifs for which the prior is designed. If MEME is run in double-stranded mode on DNA, the symmetry restriction allows us to generate the PSP for the second strand automatically.
Renormalizing position-specific priors for motifs of different widths
As mentioned above, a PSP is tied to a particular motif width for which it is derived. When MEME is considering motifs of width w, different from w_{0}, the one specified in the PSP input file, it renormalizes the PSP. The renormalization attempts, in a heuristic fashion, to extend the information captured by the PSP about motifs of width w_{0} to a PSP suitable for motifs of width w. Renormalization also insures that the new PSP obeys all of the constraints described above. In particular, when w > w_{0}, there are fewer legal positions for motif sites in a given sequence, so the constraint that the P_{i,j}sum to 1 would be violated without renormalization.
for i [1, ..., n] and j ∈ [1, ..., L - w + 1]. To keep computation costs reasonable, and because the value of information contained in a prior on shorter motifs decreases as the width of the longer motif increases, we constrain c ≤ w_{0} in Eqn. 1.
For motifs that are shorter than those for which the PSP was designed (w < w_{0}), MEME does not renormalize the input PSP. In this case, it simply uses the input PSP as though it were designed for the shorter width motifs, setting PSP_{(w)}= PSP. This has the implication that some potential motif sites at the ends of sequences will be ignored when searching for shorter motifs, since their P_{i,j}values will remain zero even though they are legal starting positions for the shorter motif. For example, if the PSP width is 8 and MEME is searching for motifs of width 7, the last possible position for a motif in each sequence will have P_{i,j}= 0. This seems more sensible than setting the value of P_{i,j}based, say, on the value of p_{i,j-1}since the width-8 PSP contains no explicit information on the prior probability of a site starting at position j. This is because removing the first letter of the word starting at position j - 1 might result in a word with a much lower prior probability. In any case, we expect that useful priors will tend to be relatively short (6 to 10) in relationship to the lengths of the sequences containing the motifs. In what follows, we always assume that the PSP has been normalized to the current motif width being considered by MEME, so we drop the width notation from PSP_{(w)}and P_{i,j,w}.
MEME Phase I: Finding Starting Points
To find starting points for EM, MEME converts each subsequence of the data into a "starting" PSPM and calculates a score for it using an algorithm that approximates one step of EM followed by the scoring phase. Creation of the starting PSPM from a subsequence has been previously described [10]. Each such PSPM, θ_{ M }is then used to calculate the probability under the motif model of every potential site in the input sequence, Pr(site|θ_{ M }). Previously, for the OOPS and ZOOPS models, the single site with the highest likelihood from each sequence was determined. For the OOPS model, these sites were then assigned a score. For the ZOOPS model, these sites were sorted in decreasing order by their likelihoods, and the top t sites for successively larger values of t were scored.
To incorporate PSPs into this phase of MEME, sites are now sorted by a value proportional to their posterior probabilities, Pr(site|θ_{ M })Pr(θ_{ M }), where Pr(θ_{ M }) is the prior probability of the potential site being a real site, as specified by the PSP. That is, if the site starts at position j in sequence X_{ i }, then Pr(θ_{ M }) = P_{i,j}. We found this approach was not sufficient to insure that the best starting points for EM were found, but that incorporating the PSP into scoring the sets of sites with the highest posterior probabilities helped significantly (data not shown). Consequently, the prior probability of each site is now used by MEME when it scores the predicted sites, as described in the next paragraph.
The final score for a potential starting point is a weighted version of the log likelihood ratio (LLR) of its set of predicted sites. The LLR of a set of sites is normally computed by aligning the sites, counting the number of times each letter occurs in each column of the aligned sites, and normalizing the counts to frequencies. To calculate the weighted LLR, MEME scales the individual priors independently in each sequence so that the largest of P_{i,j}in each sequence is 1. These scaled priors are then used as weights on the counts of the numbers of letters in each column of the motif.
We refer to Eqn. 2 as the "weighted LLR" of the set of sites when the weights on the sites are not all equal to 1.
For each potential starting PSPM, MEME computes the LLR using Eqn. 2 on different numbers of predicted sites, t. MEME does this by considering only the t predicted sites with the largest posterior probabilities for successively larger values of t. For the OOPS model, the only value of t tried is the number of input sequences, t = n.
MEME repeats this entire process for successively larger values of w. For each combination of t and w, MEME runs EM using the potential PSPM that has the largest weighted log likelihood ratio. EM is described in the next section.
MEME Phase II: Expectation Maximization
Definition of terms used in describing the MEME algorithm
n | number of input sequences |
---|---|
L | length of input sequences |
X = {X_{1}, ..., X_{ n }} | the set of n input sequences |
w | width of a MEME motif |
m = L - w + 1 | number of positions for a site |
γ | probability of a site in any sequence |
θ | PSPM model of motif; |
P = {P_{i,j}} | position-specific prior (PSP) |
w _{0} | width for which input PSP is defined |
Z = {Z_{i,j}} | missing information variables for i ∈ [1, n],j ∈ [-L, L] |
Z ^{(t)} | expectation of Z at EM iteration t |
prior probability given PSP & model | |
ϕ ^{(t)} | model parameters at EM iteration t |
ϕ = {θ, γ, P} | all sequence model parameters |
For OOPS and ZOOPS models, the parameters of the sequence model are ϕ = {θ, γ, P}. EM re-estimates the PSPM, θ, but holds fixed the PSP, P. The additional parameter, γ, represents the probability that a randomly chosen sequence in the dataset contains a motif site. This is always equal to 1 for the OOPS model, and is estimated by EM for the ZOOPS model.
where m = L - w + 1 is the number of places a motif site will fit in a sequence.
for i ∈ [1, ..., n] and j ∈ [0, ..., m]. When searching both DNA strands, the sum in the denominator in Eqn. 4 goes from -m to m, and we define for j ∈ [-m, ..., 0, ...,m].
The M-step of the EM algorithm in MEME is unchanged. See Bailey and Elkan [10] for more details on how the terms in Eqn. 4 and Eqn. 5 are computed.
MEME Phase III: Scoring the Motifs
The scoring phase of the MEME algorithm assigns scores to the motifs discovered by EM. The criterion is based on the statistical significance of the log-likelihood ratio (Eqn. 2) of the most likely sites for the motif in the sequence dataset. Unlike the starting point phase (Phase I), the scoring phase of MEME computes the unweighted LLR, even when using non-uniform positional priors. This choice was motivated by tests which showed that the weighted LLR performed no better, so we chose to keep this part of the MEME algorithm unchanged (data not shown). Although the scoring phase of MEME was not changed as a result of incorporating PSPs, it has not been documented previously, so we describe it briefly here.
The significance measure used to rank motifs takes into account the LLR of the motif, its width and the number of sites it contains. The sites of a candidate motif are those with the largest final values of Z^{(t)}. For the OOPS model, MEME scores the motif consisting of these sites. With the ZOOPS model, MEME sorts the sites by decreasing Z_{i,j}value, and scores each prefix of the sorted list.
MEME scores a motif consisting of a set of sites as follows. The LLR of each column of the aligned sites is computed, and the p-value of the column-LLR is computed based on the background Markov model using the dynamic programming method of Hertz and Stormo [12]. These p-values are then multiplied together and the p-value of the resulting product is computed as described in Bailey and Gribskov [13]. (Computing this column-LLR based p-value requires far less time than directly computing the p-value of the total LLR of the motif.) To make the scores of various motif widths and numbers of sites compatible, MEME multiplies the p-value of the motif by the number of possible ways to select positions for the given number of sites in the set of sequences, X. This final score is referred to as the E-value of the motif.
Measuring the Benefits of using PSPs
To evaluate the benefit of using PSPs in motif discovery, we search for motifs in sets of sequences predicted to bind different TFs in yeast and in mouse. The yeast data is from 156 ChIP-chip experiments each measuring the binding of a single TF [14]. The mouse data is from 13 ChIP-seq experiments measuring binding of a TF [11]. The yeast TF data has been used extensively as a test case for evaluating motif discovery algorithms, so using it allows us to easily compare MEME with PSPs to a large number of other algorithms. Since ChIP-seq data is inherently of a higher quality than ChIP-chip data, the mouse TF data allows us to measure the benefit of using PSPs on a slightly easier motif discovery task. The mouse data covers 13 TFs–Nanog, Oct4, Sox2, Smad1, E2f1, Tcfcp2l1, Ctcf, Zfx, Stat3, Klf4, Esrrb, c-Myc and n-Myc.
We measure accuracy of MEME both with and without the use of PSPs. The PSP we use is the discriminative conservation prior ( ), which has previously been shown to be very effective for discovering TF binding site motifs in the yeast dataset [8]. The prior is based on the degree to which the 8-mer starting at position X_{i,j}is conserved in the input sequence set X and a set of orthologous sequences from other species, compared with a negative set of sequences and their orthologs. For comparison, we measure the accuracy of the PRIORITY motif discovery algorithm using the PSP (PRIORITY- ). We also compare with previously published results on the yeast dataset.
On the yeast data, we use the PSP as reported by Gordân et al[8]. This prior is based on intergenic regions from S. cerevisiae that have a ChIP-chip fluorescence p-value ≤ 0.001 and the orthologous regions from the six related organisms S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castelli, and S. kluyveri. The negative sequences are all S. cerevisiae intergenic regions with a p-value ≥ 0.5 and their orthologs.
We create our own PSP for each of the 13 mouse datasets. For each dataset, the positive sequences are 200 base pair (bp) regions centered on the ChIP-seq peaks reported by Chen et al. [11]. We use the mafFrags program to obtain orthologous sequences for sixteen additional species from the multiz17way alignment [15]. We obtain negative sequences and their orthologs for constructing the mouse PSP by extracting 100 bp regions on either side of each positive sequence. We use the -PSP creation scripts provided by the Hartemink group [8] to create the mouse PSPs from the positive and negative sequence and ortholog sets. (More detail including the list of other species is in Additional file 1.)
where f_{ a, i }and g_{ a, i }are the probabilities of base a at position i in the two motifs. We use the same known PSPMs as used by previous researchers [8], and the same criterion for successful motif discovery–scaled Euclidean distance < 0.25. Since the reported motif may be of a different length or on the opposite DNA strand from the known motif, we actually compute the minimum value of D for all possible alignments of the reported motif (or its reverse complement) with the known motif, with the minimum overlap the length of the shorter motif.
Our evaluation of motifs discovered in the yeast ChIP-chip datasets utilizes a human-curated set of motifs that represents the consensus predictions of many motif discovery algorithms on those datasets. Such a "gold standard" set of motifs does not exist for the 13 mouse ChIP-seq datasets. Consequently, we take a different approach to measuring the accuracy of motifs discovered in those datasets.
With the mouse ChIP-seq datasets, our underlying measure of motif quality is the amount of correlation between a motif-based binding affinity score and a ChIP-based binding score. We believe that a high correlation between an in vivo measure of TF affinity and a motif-based in silico measure is indicative of an accurate TF binding motif. (We describe our two binding affinity scores and the correlation measure we use in the next paragraph.) For each ChIP-seq dataset, we measure this correlation in a cross-validation setting, discovering motifs on randomly chosen sets of positive and negative sequences, and computing the correlation measure on held-out sequences. To compare algorithms, we compare our correlation-based quality measure between motifs found on the same sample of sequences.
The details of our evaluation of motifs in the mouse ChIP-seq data are as follows. Our ChIP-based estimate of binding by the ChIP-ed TF at a genomic location is the "peak score" reported by Chen et al.[11], and is the normalized count of the number of sequence tags overlapping the peak's genomic location. This is our best direct evidence that the TF was bound in the neighborhood of the peak. Each positive sequence is assigned the peak score of the peak it contains. Our motif-based measure of binding by the ChIP-ed TF is for each positive sequence is its "Average Motif Affinity" (AMA) [16] score. The AMA score is justified as a measure of TF binding affinity on theoretical grounds [17], and it has been used for motif discovery [5] where it showed strong correlation with gene expression, and for motif enrichment analysis [18, 19] where it showed strong correlation with TF binding. Because the AMA score estimates the average binding affinity of a region of DNA, it captures contributions from multiple sites in a given region. Our motif quality measure is the Spearman correlation coefficient (CC) between the ranks of the held-out positive sequences sorted by their AMA and peak scores, respectively. We use a rank-based statistic because it is less sensitive than a correlation between the original values to dissimilarities in the distributions being compared [20]. To compare pairs of algorithms, we use each algorithm to learn one motif in each of 50 random samples consisting of 100 positive sequences and 200 negative sequences. We then apply the sign test to the quality measures of pairs of motifs learned on the same input set to decide if one algorithm discovers significantly better motifs.
All yeast runs use a third-order Markov background model, for consistency with reported PRIORITY results. We let MEME search for a single motif of width from 7 to 12, with sites on either DNA strand. For all mouse runs, MEME and PRIORITY- use a fifth order background Markov model, computed from the negative set, and search for motifs of widths 8 to 20. To compute AMA, we use the AMA program, which is part of the MEME suite of programs [21], using the same background model as we use in motif discovery. In all cases, we use PRIORITY 2.0.0 with its default settings, except for changing to a fifth-order background model for the mouse runs. We test MEME with both the OOPS and ZOOPS models, with and without the PSP, with sites on either strand.
Results and Discussion
Improved motif discovery using MEME with PSPs in yeast TF ChIP-chip datasets
Our evaluation of the effect of adding PSPs to MEME starts with measuring improvement in finding TF motifs in yeast ChIP-chip datasets. We run MEME using the OOPS and ZOOPS models with and without the prior on each of 156 ChIP-chip datasets, and compare the single reported motif PSPM to the known PSPM for the TF pulled down in the ChIP-chip experiment. Success is defined as scaled Euclidean distance < 0.25 between the reported PSPM and the known PSPM for the TF. Note that, to insure our results are directly comparable to the results reported by Gordân et al.[8], we use the script provided by them to compute the scaled Euclidean distance, which reports a distance of 1.0 (the maximum) if the found motif does not contain a region of width six with average information content at least 1 bit.)
Performance of motif discovery algorithms on yeast TF ChIP-chip datasets.
Algorithm | Description | Successes | Proportion of Successes |
---|---|---|---|
PhyloCon | local alignment of conserved regions | 19 | 12% |
PhyME | alignment-based; uses EM | 21 | 13% |
MEME_c | MEME run with non-conserved bases masked | 49 | 31% |
PhyloGibbs | similar to PhyME but uses Gibbs sampling | 54 | 35% |
Kellis et al. | alignment-based | 56 | 36% |
Converge | alignment-based; uses EM | 66 | 42% |
Gibbs sampler with conservation-based priors | 69 | 44% | |
Gibbs sampler with discriminative conservation-based priors | 76 | 49% | |
MEME: OOPS | MEME with OOPS model | 36 | 23% |
MEME: ZOOPS | MEME with ZOOPS model | 39 | 25% |
73 | 47% | ||
81 | 52% | ||
Gibbs sampler with discriminative conservation-based priors | 69 (3) | 44% |
The accuracy of motif discovery by several other algorithms using these same yeast TF ChIP-chip datasets and success metric has been reported previously [8, 22], allowing us to compare our current results more broadly. As seen in Table 2, the success rate of ZOOPS- (81 motifs found) is substantially higher than a number of conservation-based EM or Gibbs sampler motif discovery algorithms (PhyloCon [22], PhyME [23], PhyloGibbs [1], Converge [22], PRIORITY-C [8]).
The ZOOPS- approach also performs at least as well on the yeast datasets as the Gibbs sampler-based algorithm PRIORITY, when PRIORITY is given the same PSP as MEME. The developers of the PRIORITY algorithm (and of the PSP concept) reported a success rate of 76 out of 156 on the yeast datasets (result shown in Table 2 above the horizontal line). However, since Gibbs sampling algorithms are stochastic–their outputs vary from run to run–we wished to place error bars on PRIORITY- 's success rate. We therefore ran PRIORITY- ten times on each yeast dataset. The success rate varied from 65 to 74 correct motifs, with an average success rate of 69 (sd = 3), as shown in the last line of Table 2. The fact that we did not observe any PRIORITY- run with as high a success rate as previously reported [8] may be a result of the stochastic nature of the algorithm, or may be due to us using a more recent version of PRIORITY (Version 2.0.0).
Improved motif discovery using MEME with PSPs in mouse TF ChIP-seq datasets
As an additional check on the value of using PSPs with MEME, we measure the improvement in TF motif discovery on 13 mouse TF ChIP-seq datasets. Our evaluation of mouse data is intended to demonstrate that the results generalize to a data set of different properties–a higher eukaryote, with sequence data derived from a different technology. We measure the correlation between the ChIP-seq peak score ranks of the sequences, and the AMA score ranks assigned using the discovered motif. To insure that this measurement is unbiased, we measure the correlation using held-out sequences, which are not used in discovering the motif. We compare pairs of motif discovery algorithms by sampling from all the sequence data (positive and negative), and applying a paired significance test (sign test) to the pairs of correlation scores.
Performance of motif discovery algorithms on mouse TF ChIP-seq datasets.
TF | W | L | T | p -value | W | L | T | p -value | W | L | T | p -value | W | L | T | p -value |
Nanog | × | 3.4e-01 | × | 1.3e-03 | × | 7.7e-03 | × | 2.1e-09 | ||||||||
Oct4 | × | 1.0e-01 | × | 3.4e-01 | × | 5.8e-07 | × | 4.5e-14 | ||||||||
Sox2 | × | 1.6e-01 | × | 1.6e-02 | × | 4.4e-01 | × | 1.3e-03 | ||||||||
Smad1 | × | 1.0e-01 | × | 1.6e-01 | × | 2.4e-01 | × | 1.6e-01 | ||||||||
E2f1 | × | 2.4e-01 | × | 4.5e-05 | × | 4.5e-05 | × | 4.4e-01 | ||||||||
Tcfcp2l1 | × | 1.0e-01 | × | 7.7e-03 | × | 1.6e-02 | × | 1.9e-11 | ||||||||
Ctcf | × | 4.4e-01 | × | 2.4e-01 | × | 4.4e-01 | × | 8.9e-16 | ||||||||
Zfx | × | 1.0e-01 | × | 1.3e-03 | × | 1.6e-01 | × | 2.2e-10 | ||||||||
Stat3 | × | 3.3e-03 | × | 4.4e-01 | × | 6.0e-02 | × | 1.6e-01 | ||||||||
Klf4 | × | 1.6e-01 | × | 6.0e-02 | × | 1.0e-01 | × | 1.6e-01 | ||||||||
Esrrb | × | 6.0e-02 | × | 6.0e-02 | × | 3.3e-03 | × | 4.5e-14 | ||||||||
c-Myc | × | 3.3e-02 | × | 3.3e-03 | × | 2.4e-01 | × | 4.5e-05 | ||||||||
n-Myc | × | 1.5e-04 | × | 4.5e-05 | × | 1.6e-02 | × | 1.6e-08 | ||||||||
Total | 3 | 0 | 10 | 4 | 3 | 6 | 4 | 2 | 7 | 6 | 3 | 4 |
To answer the question of whether using the prior with the OOPS or ZOOPS model is more appropriate with the mouse ChIP-seq data, Table 3 shows the relative accuracy of OOPS- versus ZOOPS- . According to the sign test, OOPS- finds significantly more accurate motifs in more datasets (4 vs. 2) than ZOOPS- . Although the sample size is small (13 datasets), it seems reasonable to conclude that the OOPS- approach will produce better motifs on average with ChIP-seq data.
A direct comparison of the accuracy of motifs found in the mouse datasets by OOPS- and PRIORITY- indicates that MEME with using the prior and the OOPS model has a slight edge. According to the sign test, OOPS- produces significantly better motifs for 6 of the 13 mouse ChIP-seq datasets, compared with PRIORITY- . On 3 of the datasets, PRIORITY- produces more accurate motifs. This result is in agreement with our results using the yeast ChIP-chip datasets, where MEME using the prior and the ZOOPS model was (slightly) more successful than PRIORITY- . As we expect, the ZOOPS model works better for ChIP-chip data, while the OOPS model works better for ChIP-seq for the examples we present here.
Conclusions
Position specific priors are an elegant and flexible way to utilize prior information from heterogeneous sources to improve the discovery of sequence motifs. In addition to allowing information from multiple sources to be incorporated into a Bayesian motif discovery framework, positional priors can even incorporate information from negative examples (so-called "discriminative" priors). Furthermore, using PSPs does not increase the running time of the underlying motif discovery algorithm. This flexibility has the potential to extend the range of applications and sensitivity of motif discovery algorithms that can utilize PSPs. Although we only study DNA datasets, our modifications to MEME are not DNA-specific. MEME is freely available for academic use and downloading at http://meme.nbcr.net.
PSPs had previously been shown to be of benefit when used with a Gibbs sampling motif discovery algorithm. We have shown here that they can also be of great benefit to MEME, which is based on EM and a heuristic search for starting points. We focused on using a prior that combines evolutionary information gleaned from orthologous sequences with positively and negatively labeled sequences in a discriminative prior (the "discriminative conservation", prior). Using this PSP on well-studied sequence datasets from 156 yeast TF ChIP-chip experiments improves the performance of MEME dramatically–more than doubling the number of datasets where MEME identifies the correct TF binding motif as its first prediction. Furthermore, using the prior allows MEME to achieve prediction accuracies that are superior to a large number of motif discovery algorithms, without increasing its running time.
We also confirm the benefit of PSPs to MEME when applied to TF motif discovery in ChIP-seq data from a higher eukaryotic species (mouse). To increase the independence of this second test, we used a novel way to measure the accuracy of the discovered motifs that obviates the need to rely on a set of known motifs (a "gold standard"). Although the observed benefits were somewhat small, they were substantial enough to indicate that constructing a prior and utilizing it with MEME is worthwhile even for higher eukaryotic ChIP-seq derived data.
In follow-up work, we plan to investigate PSPs designed specifically for ChIP-seq data. One approach might be to create a PSP that encodes the increased prior probability of the primary motif being located near the center of the ChIP-seq peak. We also plan to investigate PSPs designed for motif discovery in protein sequences. For protein motifs, PSPs based on spaced triples rather than the k-mers used here for DNA PSPs might be more appropriate, given the larger protein alphabet. We also intend to implement PSPs for use with MEME's ANR model, which allows multiple repeats of a motif within a single sequence. We don't foresee any major difficulties in incorporating PSPs into the ANR model but have focused on the OOPS and ZOOPS models in this work in order to facilitate direct comparison with previous work by others on PSPs.
Declarations
Authors’ Affiliations
References
- Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: a gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol 2005, 1(7):e67. 10.1371/journal.pcbi.0010067View ArticlePubMedPubMed CentralGoogle Scholar
- Fang F, Blanchette M: FootPrinter3: phylogenetic footprinting in partially alignable sequences. Nucleic Acids Res 2006, (34 Web Server):W617-W620. 10.1093/nar/gkl123Google Scholar
- Redhead E, Bailey T: Discriminative motif discovery in DNA and protein sequences using the DEME Algorithm. BMC Bioinformatics 2007, 8: 385. 10.1186/1471-2105-8-385View ArticlePubMedPubMed CentralGoogle Scholar
- Barash Y, Bejerano G, Friedman N: A simple hyper-geometric approach for discovering putative transcription factor binding sites. In Algorithms in Bioinformatics: Proc. First International Workshop, no. 2149 in LNCS Edited by: Gascuel O, Moret BME. 2001, 278–293.View ArticleGoogle Scholar
- Foat BC, Houshmandi SS, Olivas WM, Bussemaker HJ: Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. Proc Natl Acad Sci USA 2005, 102(49):17675–17680. 10.1073/pnas.0503803102View ArticlePubMedPubMed CentralGoogle Scholar
- Bauer DC, Bailey TL: Studying the functional conservation of cis-regulatory modules and their transcriptional output. BMC Bioinformatics 2008, 9: 220. 10.1186/1471-2105-9-220View ArticlePubMedPubMed CentralGoogle Scholar
- Narlikar L, Gôrdan R, Hartemink AJ: Nucleosome Occupancy Information Improves de novo Motif Discovery. 11th Annual International Conference on Computational Biology, RECOMB 2007, San Francisco 2007, 107–121.Google Scholar
- Gordân R, Narlikar L, Hartemink AJ: A Fast, Alignment-Free, Conservation-Based Method for Transcription Factor Binding Site Discovery. In 12th Annual International Conference on Computational Biology, RECOMB 2008. Edited by: Vingron M, Wong L. Springer-Verlag; 2008:98–111.Google Scholar
- Narlikar L, Gôrdan R, Hartemink AJ: A nucleosome-guided map of transcription factor binding sites in yeast. Plos Computational Biology 2007, 3(11):e215. 10.1371/journal.pcbi.0030215View ArticlePubMedPubMed CentralGoogle Scholar
- Bailey TL, Elkan C: The value of prior knowledge in discovering motifs with MEME. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, United Kingdom, July 16–19, 1995 1995, 3: 21–29.Google Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008, 133(6):1106–1117. 10.1016/j.cell.2008.04.043View ArticlePubMedGoogle Scholar
- Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(7–8):563–577. 10.1093/bioinformatics/15.7.563View ArticlePubMedGoogle Scholar
- Bailey TL, Gribskov M: Combining evidence using p -values: application to sequence homology searches. Bioinformatics 1998, 14: 48–54. 10.1093/bioinformatics/14.1.48View ArticlePubMedGoogle Scholar
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. 10.1038/nature02800View ArticlePubMedPubMed CentralGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996–1006.View ArticlePubMedPubMed CentralGoogle Scholar
- Buske FA, Bodén M, Bauer DC, Bailey TL: Assigning roles to DNA regulatory motifs using comparative genomics. Bioinformatics 2010, 26: 860–866. 10.1093/bioinformatics/btq049View ArticlePubMedPubMed CentralGoogle Scholar
- Stormo GD: Information content and free energy in DNA-protein interactions. J Theor Biol 1998, 195: 135–137. 10.1006/jtbi.1998.0785View ArticlePubMedGoogle Scholar
- Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z: Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res 2004, 32(4):1372–1381. 10.1093/nar/gkh299View ArticlePubMedPubMed CentralGoogle Scholar
- McLeay RC, Bailey TL: Motif Enrichment Analysis: A unified framework and method evaluation. BMC Bioinformatics 2010, in press.Google Scholar
- Kruskal WH, Wallis WA: Use of Ranks in One-Criterion Variance Analysis. J American Statistical Association 1952, 47(260):583–621. 10.2307/2280779View ArticleGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME Suite: tools for motif discovery and searching. Nucleic Acids Res 2009, (37 Web Server):W202-W208. 10.1093/nar/gkp335Google Scholar
- MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 2006, 7: 113. 10.1186/1471-2105-7-113View ArticlePubMedPubMed CentralGoogle Scholar
- Sinha S, Blanchette M, Tompa M: PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 2004, 5: 170. 10.1186/1471-2105-5-170View ArticlePubMedPubMed CentralGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel AE, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, (34 Database):D108-D110. 10.1093/nar/gkj143Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.