Towards a theoretical understanding of false positives in DNA motif finding

Background Detection of false-positive motifs is one of the main causes of low performance in de novo DNA motif-finding methods. Despite the substantial algorithm development effort in this area, recent comprehensive benchmark studies revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios. Results Using large-deviations theory, we derive a remarkably simple relationship that describes the dependence of false positives on dataset size for the one-occurrence per sequence motif-finding problem. As expected, we predict that false-positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. Interestingly, we find that the false-positive strength depends more strongly on the number of sequences in the dataset than it does on the sequence length, but that the dependence on the number of sequences diminishes, after which adding more sequences does not reduce the false-positive rate significantly. We compare our theoretical predictions by applying four popular motif-finding algorithms that solve the one-occurrence-per-sequence problem (MEME, the Gibbs Sampler, Weeder, and GIMSAN) to simulated data that contain no motifs. We find that the dependence of false positives detected by these softwares on the motif-finding parameters is similar to that predicted by our formula. Conclusions We quantify the relationship between the sequence search space and motif-finding false-positives. Based on the simple formula we derive, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice. Our results provide a theoretical advance in an important problem in computational biology.


Introduction
Because binding of sequence specific transcription factors to their recognition sites in noncoding DNA is an important step in the control of gene expression, the development of computational methods to identify transcription factor binding motifs in non-coding DNA has received much attention in computational biology.The low information content of transcription factor binding motifs implies difficulty for computational analyses.For example, given a known binding motif, identification of bona fide examples is always plagued by false positives -the so-called Futility Theorem [4].
An even more challenging computational problem is the de novo identification of transcription factor binding motifs (so-called motif-finding), for which there are many available tools (for tutorials on different methods see [5,6] and references therein).Despite the substantial algorithm development effort in this area, recent comprehensive benchmark studies [1][2][3] revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios, where known transcription factor binding sites have been planted in test sequence sets.One of the major problems is that DNA motif-finders can identify seemingly strong candidate motifs, even when randomly chosen sequences are provided as the input.This has led to simulation-based approaches to identify the bona fide motifs where the motif-finding algorithm is repeated several times on random data and the p-value is computed as the fraction of motifs with better scores than the motif identified in real data [7].While feasible for expert computational biologists, this approach requires significant computational resources, and is not practical for most biological users.
We argue that part of low performance in motif finding algorithms is due to the statistical nature of large sequence datasets: when the dataset is large enough, any structure can occur by chance.We formalize this idea using information theory, to obtain a remarkably simple analytical relationship between the size of the sequence search space and the strength of the false-positive motifs.Interestingly, our analysis shows that for biologically realistic dataset sizes and motif strengths, false positives as strong as real transcription factor binding sites are quite likely to arise.This represents an extension of the "Futility Theorem" [4] to the de novo motif-finding problem.Results

Results
Motif-finders are expected to find strong signals in random DNA sequences: We represent patterns in DNA sequence families (called motifs) as probability matrices, where each column specifies the distribution of the DNA letters.The underlying idea here is to quantify the probability of observing motifs of a certain strength in a set of random sequences using large-deviations theory.Suppose that the set of all motifs, X, is generated according to a random nucleotide background distribution g (for instance, g can be the genome-wide distribution of nucleotides).It is expected that all nucleotides in motifs will have frequencies close to those in g.Therefore, motifs that have a distribution significantly different from g (i.e. the false-positives in our case) are considered as the rare events that are far from expectation.We use the large-deviations theory, in particular Sanovs theorem [8] to measure the probability of these rare events.We, then, derive expected size of the set X above which the observation of strong motifs becomes likely to be due to chance.
Let a DNA motif with W columns have a distribution or probability matrix f (see Fig. 1 and Methods for definition of motif finding problem parameters).The difference between the distribution of the motif, f , and the background distribution, g, is measured using the Kullback-Leibler (KL) divergence [8], also known as the biological information content [9] [10], defined as in the following: where f jk is the relative frequency of base k in column j of the motif, and g k is the background distribution of base k (e.g. the genome-wide distribution of nucleotide bases).Throughout the text we use the strength of a motif and its information content, interchangeably to refer to D(f, g) and Iseq.
Our main theoretical result is as follows.Consider the "one-occurrence-per-sequence" motif-finding model where each of n sequences is assumed to have exactly one occurrence of a motif of width W .The expected sequence length, L, in order to observe at least one motif with a probability matrix (PM) diverged from the background, g, by at least D(f, g) is given by: where |A| is the cardinality of the set A, e.g.|A| = 4 for DNA sequences.According to this theorem, if the length of DNA sequences is approximately (or larger than) L, false-positive motifs with information content D(f, g) can occur by chance.Please see Appendix A for the proof of this theorem.
Figure 2 show the expected length of sequence, L as a function of motif information content, D(f, g), for DNA sequences with typical motif-finding parameters and W = 5, W = 10 and W = 15, respectively.Each graph illustrates L, at which false-positive motifs with strength D(f, g) are expected to occur by chance.
The dependency of false-positives on n is stronger compared to the dependency on L. As an example, for motifs with W = 10 (Fig. 2b), a threefold increase of n (while keeping L constant) reduces D(f, g) by the same amount as if L were increased by 2 orders of magnitude (while keeping n unchanged).However, the dependency of false-positives on n decreases with n and reaches a plateau for larger n suggesting that in order to reduce the false-positive rate only a sufficient number of sequences in the dataset is necessary (Fig. 3).
Finally, the false-positive information content, D(f, g), is approximately linear in W in the range of interest (Fig. 4).Therefore, given a motif-strength of interest, detecting real motifs with smaller width is easier and less prone to false-positives.
MEME performance is consistent with the theoretical expectations: To confirm our theoretical results, we conducted a set of experiments using the MEME software [11,12], because it implements the one-occurrence-per-sequence set up that we have treated theoretically (see Methods for detail of the experiment setup).We ran MEME on a set of randomly generated sequences and asked MEME to report the most significant motif.The detected motifs are therefore false-positives.The results from MEME are plotted in Fig. 2 and are consistently following the theoretical predictions.
Simple rules of thumb for DNA motif-finding: The theoretical predictions provide sequence lengths above which observation of motifs with given strengths or less are most probably due to chance than any biological reason.Therefore, to reduce the false-positive strength in experimental design, it is generally desired to move towards weaker motifs (using Eq. 2 or using the curves in Fig. 2).We have the following rules of thumb for this purpose: (1) As it is intuitively expected, it is generally preferred to use shorter sequences (when it is biologically plausible) to avoid unnecessary false-positives.(2) Adding more sequences to the dataset reduces the false-positive rate considerably (e.g. using 30 sequences compared to 10 reduces the false-positive motif strengths by more than 6 bits (%25) for W = 10, see Fig. 3).This effect, however, diminished for larger n (e.g.increasing n from 30 to 50 has only 2 bits reduction in motif strengths, see Fig. 3).This suggests that in order to reduce false-positive rate in "one-occurrence-per-sequence" motif finding, only a "sufficient" number of sequences is needed.(3) The dependency of false-positives (the strength of false-positive motifs) on L is weaker than dependency on n.Therefore, using many sequences (but not too many) is generally preferred to using shorter sequences.(4) Given n sequences of length L and a width W for potential motifs, Eq. 2 gives expected strength of false-positive motifs.Detected motifs that do not greatly exceed this expected strength should be doubted, while motifs that are stronger than the expected value are most probably not false-positives.(5) Given a certain strength of interest, detection of motifs with smaller width is less prone to false-positives and therefore easier.
Examples of applications: In using the theoretical results in Eq. 2 or the graphs in Fig. 2, it is generally desired to move towards weaker motifs (towards the left on the graphs).To illustrate this we chose the ZFP423 and the TATA-box motifs from the Jaspar database [13] with D(f, g) = 17.93 and D(f, g) = 10.20,respectively.We show that is it difficult to detect ZFP423 in sequences of length 1000, but it can be detected in shorter sequences (Fig. 5).
Similarly, we show that it is very difficult to detect the TATA-box using 20 sequences, but it is possible if this is increased to 30 or if the motif is trimmed to include only the core positions (Fig. 5).

Discussion
Application to protein sequences: The theoretical analysis here can be applied directly for motif-finding in sequences of different alphabets.In particular, the proposed equations can be used for protein sequences by replacing |A| = 4 with |A| = 20 corresponding to 20 aminoacid residues.It is easy to verify in Eq. 2, that by this modification, i.e. changing |A| = 4 to |A| = 20, the expected length, L, increases exponentially.This suggests that, under equivalent settings, the false-positive rate in the protein motif finding is exponentially lower than in the DNA motif finding.
Extension to other motif-finding models: The proposed method here assumes the "oneoccurrence-per-sequence" model in motif finding (similar to the OOPS model in MEME [12]).However, the analysis is extendable to other models by appropriately redefining the space of all motifs in the dataset.See Appendix B for extension of Eq. 2 to the cases where each sequence can carry either zero or one motif (similar to ZOOPS model in MEME [12]).
A simple formula for computing the p-value: For a motif with a given PM f , the p-value is defined as the probability of observing stronger motifs assuming that the sequences are generated according to a background distribution.There are different approaches for accurately computing the p-value [10,[14][15][16][17].While these approaches provide sophisticated methods that precisely compute the p-value, they tend to be complicated to implement.Here, however, as a side-product of our main results, we provide a simple equation that conservatively approximates the p-value.
Specifically, given n sequences, the p-value of a motif f with width W is no more than: Please see the Appendix A for the detail of derivation of this equation.

Motif finding problem:
The motif-finding problem considered here assumed the "oneoccurrence-per-sequence" model.It is assumed that there are n sequences of length L in the data set (see Fig. 1 for the definition of different parameters).The motifs are assumed to have W columns with a probability matrix denoted by f .The motifs PM represents the relative frequency of symbols (e.g.DNA bases) in each column of the motif.We measure the strength of a motif by the divergence of its PM from a uniform background distribution g.We use the Kullback-Leilber (KL) divergence, also referred to as biological information content [9,10], denoted by D(f, g) = Iseq(f, g) (see Eq. 1).
Correction of information content bias due to the sampling error: The theoretical result in Eq. 2 is accurate for relatively large n.However, in practical application, where the number of sequences is relatively small, e.g.n < 15 for DNA sequences, a sampling error in computing f causes a bias in the information content D(f, g).We account for this bias by subtracting an approximate term suggested in [9] from the information content used in Eq. 2 as follows: where ln is the natural logarithm.The contribution of sampling error vanishes as n increases.

Simulations:
In each experiment, we generated a set of n sequences with length L drawn from a uniform background distribution g = [0.25 0.25 0.25 0.25].We then ran the MEME using OOPS model (only one motif per sequence) and restricted MEME to generate only one motif (the most significant) with width W .We repeated the experiment for different number of sequences (n = {10, 20, 30}), different motif width (W = {5, 10, 15}), and different sequence lengths (L = {50, 100, 500, 1000, 5000}).We repeated each experiment for 50 Monte-Carlo runs resulting in 50 data points for each experiment.
For each detected motif, we computed the information content or divergence, D(f, g), using the PMs reported by MEME.Since the input to MEME is a set of random sequences, all detected motifs are supposed to be false-positives.We then compared the false-positives detected by MEME with the theoretical predictions.Each motif detected by MEME is depicted on figures by a star (*).

Acknowledgment
The first author would like to acknowledge useful discussions and comments by Alex Nguyen Ba that enhanced the presented results as well as the manuscript significantly.This research is supported by Canadian Institute for Health Research grant #202372 and an infrastructure grant from the Canadian Foundation for Innovation to AMM.The results are for three different number of sequences, n = {10, 20, 30}, in the dataset.Each set of experiments are repeated for 50 Monte-Carlo runs (so there are 50 stars (*) for each set of experiments).The range of information content is chosen between 0 and 40 bits corresponding to what we found for motifs in the Jaspar database [13] (See Supplementary Fig. 6 that shows the frequency of motifs with respect to their corresponding information content).For any given n, decreasing L reduces the strength of false-positive motifs.Alternatively, for a fixed L, adding more sequences (increasing n) reduces the false-positive strength.The dependency of motif strength on n is stronger compared to the dependency on L. For instance, that for motifs with W = 10 in (b), a threefold increase of n (while keeping L constant) reduces D(f, g) by the same amount if L is increased by 2 orders of magnitudes (while keeping n unchanged).Two real motifs are used to show the application of the theoretical predictions (here motif width is W = 15).Motifs as strong as ZFP423 [13] in n = 10 sequences of length L = 1000 will be buried in false-positives.Therefore, in order to avoid such false-positive motifs, one can reduce L (along Arrow-2) or preferably add more sequences (along Arrow-1) to the dataset.Similarly, it would be very difficult to identify a motif such as the TATA-box motif in a set of 20 sequences with length L = 100 due to falsepositives.Since using shorter sequences is unlikely, one can increase the number of sequences to n = 30 (along Arrow-3) to avoid false-positives that have the same strength as the TATA-box.It is interesting to know how strong the false-positive motifs are for motifs with information content equal to the TATA-box but with a width W = 5 (this is equivalent to trimming all but the core bases of the TATA-box).Fig. 4 shows that this is equivalent to moving along the theoretical curve from W = 15 to W = 5 which reduces the false-positive strength enough to detect this motif.no. of sequences -n mo-f informa-on content -D(f,g) bits Note that each Y i is a row vector of L DNA bases or amino-acid residues.In presenting our analysis we consider DNA sequences with alphabets A = {A, T, G, C}.The alphabet size, denoted by |A| in this case is equal to |A| = 4.However, the theoretical results are directly applicable to any other alphabet size, including |A| = 20 for protein sequences.
Motif finding algorithms seek to find a set of over or under-represented short subsequences in Y .To prepare the dataset for motif finding, we slide a window of length W on each Y i shifting by one base at a time to obtain (L − W + 1) subsequences of length W .We then arrange n number of such subsequences, one for each Y i , to form a motif X as in the following: Each X is a potential motif.This arrangement is based on the "one-occurrence-per-sequence" model in motif finding where each sequence Y i contributes one and only one subsequence to motif.
We denote by X the set of all motifs X.The size of this set is equal to |X | = (L − W + 1) n .The search for statistically significant motifs, in essence, involves finding X ∈ X that is distributed differently from a background distribution g (e.g. the distribution of DNA bases genome-wide that is commonly considered to be Uniform).To do so, we represent the motif X by a probability matrix f defined as: where, e.g.f jT denotes the relative frequency of the symbol T in the j th column of the subalignment X.The PM f represents the empirical distribution of DNA bases at each column of X.For sequences of different alphabet, e.g. protein sequences, the PM is defined with 20 rows corresponding to the number of amino-acid residues.
Motifs represent the abundance of a particular set of similarly composed short sequences in the set Y ; a property that is commonly associated with biological importance [10] [1] [6].To quantify the biological importance of a motif we use information content measure [9] [6] that is defined as the divergence of the PM of a motif from a background distribution.Specifically, for a motif with a PM f , the divergence from a background distribution g is defined as the

Figure legends: Figure 1 .Figure 2 .
Figure legends: Figure 1.DNA motif finding problem parameters.In this example, n = 5 sequences of length L = 80 are used to detect a motif of width W = 15.Corresponding probability matrix, f , is also shown that represents the relative frequency of nucleotides in each column of the motif.Note that each sequence has only one occurrence of the motif (hence one-occurrenceper-sequence (OOPS) model) Figure 2. Theoretical results compared to MEME simulations.Theoretical predic-

Figure 3 .
Figure3.False-positive information content versus the number of sequences.The dependency on false-positives strengths diminishes with increasing n and reaches a plateau suggesting that it is not necessary to use too many sequences to maintain an acceptable level of false-positives.In this figures, the sequence length is fixed to L = 1000.Simulation results from MEME, shown by blue (*).There are 50 simulation results (50 stars) for each value of n.Simulations are done for n = {10, 20, 30, 50, 100}.

Figure 4 .
Figure 4. False-positive information content versus the motif width.False-positive motifs information content, D(f, g), is shown with respect to the motif width for a fixed L and n.For the range of motif widths of our interest (5 to 20), the information content is approximately linear in W .Given a motif-strength of interest, detecting real motifs with smaller width is easier and less prone to false-positives (i.e. for a given motif strength, shorted motifs rarer).In this figure, the sequence length and the number of sequences are fixed to L = 1000 and n = 30, respectively.The theoretical predictions are shown by solid line.The experimental results from MEME are shown by (*).There are 50 repeated results for each W = {5, 10, 15}.

Figure 5 .
Figure 5. Examples of applications.Two real motifs are used to show the application of the theoretical predictions (here motif width is W = 15).Motifs as strong as ZFP423[13] in n = 10 sequences of length L = 1000 will be buried in false-positives.Therefore, in order to avoid such false-positive motifs, one can reduce L (along Arrow-2) or preferably add more sequences (along Arrow-1) to the dataset.Similarly, it would be very difficult to identify a motif such as the TATA-box motif in a set of 20 sequences with length L = 100 due to falsepositives.Since using shorter sequences is unlikely, one can increase the number of sequences to n = 30 (along Arrow-3) to avoid false-positives that have the same strength as the TATA-box.It is interesting to know how strong the false-positive motifs are for motifs with information content equal to the TATA-box but with a width W = 5 (this is equivalent to trimming all but the core bases of the TATA-box).Fig.4shows that this is equivalent to moving along the theoretical curve from W = 15 to W = 5 which reduces the false-positive strength enough to detect this motif.