Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count p-values in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required.

Results

We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with a special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops.

Conclusion

The exact binomial test is particularly adapted for small counts. For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test and very simple to use.

Background

Detecting motifs with a significantly unexpected frequency in DNA sequences has become a very common task in genome analysis. It is generally addressed to propose candidate functional motifs based on their statistical properties [1–3]. Lots of statistical methods have been developed to that purpose (see the recent surveys by [4] or [5] and references therein) and satisfactory solutions exist now to find exceptional motifs thanks to p-value calculations.

More recently, a new related question has arisen in the literature concerning the comparison of motif exceptionalities in two sequences. One wants for instance to compare particular sets of genes [6], upstream regions of CDSs versus whole chromosome [7], structural domains [8], CDSs versus intergenic regions, conserved regions versus strain-specific regions of bacterial genomes [9], or chromosomes from the same species [10]. Chromosomes from different species can also be compared from a comparative genomics point of view. In all these works, one would like to know if a given motif is significantly more exceptional in one sequence compared to another one. This criterion is usually used to identify motifs which are specific from some regions or expected to be more frequent in some particular parts of the genome. Transcription factor binding sites, for instance, are expected to be more frequent in upstream regions than along the whole genome.

Surprisingly, no rigorous statistical method has been proposed yet to decide if a given motif, exact or not, is significantly more exceptional in one sequence compared to a second one. Of course, two p-values can be calculated separately on each sequence to know if the motif is exceptional in these sequences but the difficult point is how to compare these two p-values from a statistical point of view. It is indeed not sufficient to make the difference or the ratio to know if the two p-values are significantly different; One needs a statistical test.

In this paper, we propose two statistical tests to compare the motif count exceptionalities in two independent sequences. In the Results Section, we first present the underlying model for motif occurrences and the null hypothesis to test, namely the motif is similarly exceptional in both sequences. Then we derive an exact binomial test and an asymptotic likelihood ratio test adapted for frequent motifs. Usage conditions and power of both tests are described in the Discussion Section, together with a more refined model for occurrences of overlapping words and the associated tests. An illustration of the method is finally given; We compare the octamer exceptionalities in two sets of regions (backbone/loops) from the Escherichia coli K12 leading strands. These two sets correspond to the mosaic structure of E. coli's genome when comparing the two strains K12 and O157:H7: the backbone represents the common regions whereas the loops are specific to the K12 strain. As a toy example all along this paper, we will treat in detail the case of the palindromic octamer cagcgctg which occurs respectively 30 times in the loops (758434 bps long) and 113 times in the backbone (3 882 513 bps long).

Results

Poisson model

In sequence i, the motif count N_{
i
}is supposed to have a Poisson distribution with mean (and variance) λ_{
i
}. This distribution has been shown to fit correctly theoretical (in Markovian sequences, for example) as well as observed count distributions of non-overlapping words [11]; A non-overlapping word is a word such that two occurrences of itself can not overlap in a sequence.

The mean count λ_{
i
}in sequence i must account for three parameters: (i) the length ℓ_{
i
}of the sequence, (ii) the composition of the sequence, (iii) the possible exceptionality of the motif in the sequence.

Expected intensity

The composition of the sequence can be accounted for via the probability μ_{
i
}for the motif to occur at any position in the sequence under a simple model. The most popular models are Markov chain models which can fit the frequencies in mono-, di-, tri-nucleotides, etc. Indeed, the Markov chain model of order m (denoted by Mm) takes the (m + 1)-mer composition into account. Under such models, the occurrence probability μ_{
i
}of a h-letter motif w = w_{1}w_{2} ... w_{
h
}on the {a, c, g, t} alphabet can be expressed in terms of counts of its subwords of length m and m + 1 [5]. For instance, here are the expression of μ_{
i
}in models M0, M1 and M(h - 2) which fit respectively the composition in bases, in dinucleotides and in oligonucleotides of length h - 1:

If one does not want to account for the sequence composition (this case will be referred to as model M00), then μ_{
i
}simply depends on the motif, hence μ_{1} = μ_{2} = (1/4)^{h}.

The choice of the Markov chain model depends on the sequence composition one wants to fit. For instance, model M2 is often used for coding DNA sequences to take the codon bias into account. Higher the model order, better the fit, but usually the model order is bounded either by h - 2 or because the sequence is too small (the number of parameters to be estimated increases exponentially with the order).

Table 1 gives the expected counts ℓ_{
i
}μ_{
i
}for the motif cagcgctg in the E. coli loops/backbone sequences. Since N_{1} = 30 and N_{2} = 113, we see that this motif is highly over-represented in both sequences under models M00, M0 and M1. However, under the richest possible model (M6), it is over-represented in sequence 1 (loops) but under-represented in sequence 2 (backbone).

Exceptionality coefficient

When the motif is not exceptional with respect to the considered model, the mean count λ_{
i
}is simply ℓ_{
i
}μ_{
i
}. For exceptional motifs, i.e. motifs with an observed count N_{
i
}far from its expectation ℓ_{
i
}μ_{
i
}, under a given model, the mean count λ_{
i
}should reflect this exceptionality.

We therefore introduce an exceptionality coefficient k_{
i
}which allows λ_{
i
}to be greater (or smaller) than the expected value:

λ_{
i
}: = k_{
i
}ℓ_{
i
}μ_{
i
}.

In the following, parameters ℓ_{
i
}and μ_{
i
}will be supposed to be known a priori: they can be considered as two correction terms. The inference will only be made on k_{
i
}.

Hypothesis testing

Comparing the (potential) exceptionality of a motif in two sequences is equivalent to test the null hypothesis H_{0} = {k_{1} = k_{2}}.

We emphasize that the respective values of k_{1} and k_{2} can be larger than one (unexpectedly frequent motif), smaller than one (unexpectedly rare motif) or close to one (motif with expected count). These values do not matter: our only concern is to know if they are significantly different or not.

Exact binomial test

We first propose an exact test based on a general property of the Poisson distribution. If N_{1} and N_{2} are two independent Poisson counts with respective means λ_{1} and λ_{2}, the distribution of N_{1} given their sum N_{+} : = N_{1} + N_{2} is binomial [12]: N_{1} ~ \mathcal{B} (N_{+}, π) with

because k_{1} = k_{2}. In absence of correction (M00 model) for the sequence composition (i.e. μ_{1} = μ_{2}), we have π_{0} = ℓ_{1}/(ℓ_{1} + ℓ_{2}). If furthermore the two sequences have the same length, we get π_{0} = 1/2.

Moreover, the proportion π and then the expectation of N_{1}, increases as the ratio k_{1}/k_{2} increases. Therefore, the p-value for the one-sided alternative H_{1} = {k_{1} > k_{2}} is p_{
B
}= Pr {\mathcal{B} (n_{+}, π_{0}) ≥ n_{1}}, i.e.

where n_{+} and n_{1} are the observed values of N_{+} and N_{1}.

Table 2 gives the probability π_{0} and the p-value p_{
B
}for the motif cagcgctg in E. coli. At level 5%, the null hypothesis is accepted under models M00 and M6 meaning that the motif is similarly exceptional in both sequences with respect to their length and/or 7-mer composition. However, {k_{1} = k_{2}} is rejected at level 5% against {k_{1} > k_{2}} under models M0 and M1; since cagcgctg is over-represented in both sequences, it means that it is significantly more exceptionally over-represented in sequence 1 (loops) with respect to the base and/or dinucleotide compositions of both sequences.

Likelihood ratio test (LRT)

Another test statistic based on the comparison of the likelihood of the data under the H_{0} and the alternative hypothesis H_{1} = {k_{1} ≠ k_{2}} can be derived. This statistic is known as the Likelihood Ratio Test (see [13], vol. IV). In our model (see the Methods Section), it is defined as

where π_{0} is defined in (1). Under the null hypothesis, its asymptotic distribution is a chi-square distribution with one degree of freedom.

This test is two-sided, because, under H_{
1
}, parameters k_{1} and k_{2} are estimated independently (in particular, without the constraint k_{1} > k_{2}). The exact distribution of LRT could be calculated via permutation techniques but the computation time would be tremendeous for large counts. We will then calculate the following asymptotic p-value:

where n_{2} is the observed value of N_{2} and χ^{2} ~ χ^{2} (1).

Table 3 gives the LRT statistic and the associated p-value for the motif cagcgctg in E. coli. Remember that the LRT is two-sided, so p_{
L
}have to be divided by two when compared to the one-sided binomial p-value p_{
B
}. We see that the significances obtained with the LRT are different from the ones obtained with the exact binomial test, but the qualitative conclusions are the same.

Chi-square test

Another standard asymptotic test is the chi-square test where the counts N_{
i
}are compared to their expected values \widehat{N}_{
i
}under H_{0} given the total count N_{+}:

where \widehat{N}_{1} = π_{0}N_{+} and \widehat{N}_{2} = (1 - π_{0})N_{+}. Under the null hypothesis, X^{2} has also an asymptotic chi-square distribution with one degree of freedom. It is also an intrinsically two-sided test. Further analyzes (including simulations) not presented here (see [14]) show that this test performs very similarly to the LRT in every situations. Note that the chi-square test is the same as the score test [13].

Discussion

LRT distribution

The chi-square distribution of the LRT statistic is only asymptotic, so the actual level may be different from the nominal one (typically α = 5%). To measure this difference, we have calculated this actual level for different values of π_{0} and N_{+}. Since LRT is a function of N_{1}, the actual level can be derived from the exact distribution of N_{1} given N_{+} which is binomial (see Results Section).

Figure 1 compares both levels (actual and nominal). Since the counts are discrete, the actual level can never be exactly α leading to oscillations in the plot. We see that the nominal level is only reached with N_{+}≃ 1000 for π_{0} = 0.5 and even later for π_{0} = 0.95 (or π_{0} = 0.05). It means that the chi-square approximation of the LRT statistics is only valid for motifs with many total occurrences.

Regarding the motif cagcgctg, because π_{0} is about 15% (cf. Table 2), the picture is close to the right plot of Figure 1; In fact, with a total count of 143, the actual level is respectively 0.095%, 1.1%, 5.1% and 12.5% for a nominal level α equal to 0.1%, 1%, 5% and 10%.

LRT as a contrast measure

The LRT statistic can still be used as a contrast measure, i.e. a measure of the difference, between the two exceptionalities. For large values of N_{+} it is much faster and easier to compute than the binomial p-value. We will see in the illustration below that the two quantities are strongly correlated.

Decidability limits for the binomial test

Because the binomial test is exact, the actual and nominal levels are equal. The significance can then always be determined. It would be maximal when N_{1} = N_{+} (i.e. N_{2} = 0) and the corresponding p-value p_{
B
}would be equal to {\pi}_{0}^{{N}_{+}}. Therefore, if this minimal p-value is greater than the desired level α (typically 5%), no significance conclusion can be made. This happens when {\pi}_{0}^{{N}_{+}}α, i.e. when N_{+} ≥ ln (α)/ln(π_{0}).

Figure 2 gives this critical value of N_{+} for various values of π_{0} and α. We see, for instance, that for π_{0} = 0.7 and N_{+} = 10, one may get significant results at a level greater than 5% but not at a level smaller than 1%.

Power

An important property for a statistical test is its ability to detect departure from the null hypothesis. This ability is measured by the power of the test which is the probability to exceed the significance threshold (defined under H_{0}) when the true parameter satisfies H_{1}. In our case, the parameter of interest is

which is equal to π_{0} when k_{1} = k_{2}. So the departure from H_{0} will be measured by the ratio k_{1}/k_{2} when it differs from 1.

Exact binomial

Figure 3 presents the power of the exact binomial test when k_{1}/k_{2} increases. As expected, the power increases with N_{+}. Moreover, it decreases when π_{0} increases i.e. when the expected ratio ℓ_{1}μ_{1}/(ℓ_{2}μ_{2}) increases. It means that, when the motif is already expected to be much more frequent in sequence 1 than in sequence 2, it is more difficult to detect that its exceptionality in the first sequence is also higher.

The motif cagcgctg occurs N_{+} = 143 times in the whole genome. In the different models considered in Table 2, probability π_{0} is between 11.6% and 16.4%. The power of the binomial test in this case can therefore be read in Figure 3, in the two top plots between the black and red solid lines. We see that a ratio k_{1}/k_{2} = 2 can be detected with probability greater than 90%, while a ratio of 1.5 will be detected with a bit more than 50% probability.

LRT

The same analysis can be made for the LRT tests. However, this only makes sense for sufficiently large N_{+}, to guaranty the validity of the chi-square distribution.

Case of overlapping words

Compound Poisson model

The distribution of overlapping word occurrences is better modeled by a compound Poisson process (see [15]) in the following way:

The word occurs in clumps distributed according to a Poisson process. The number of clumps C_{
i
}in sequence i is hence a random Poisson variable with mean denoted by \tilde{\lambda}_{
i
}.

The size V_{
ic
}of the c-th clump (in sequence i) is random with geometric distribution:

Pr{V_{
ic
}= v} = {a}_{i}^{v-1} (1 - a_{
i
}).

The clump sizes are supposed to be independent. Parameter a_{
i
}is the overlapping probability of the motif and can be calculated under various Markovian models (see [5]).

In this setting, the count N_{
i
}is hence the sum of the sizes of C_{
i
}clumps and has the Polya-Aeppli (or geometric Poisson) distribution (see [12]). We have (see [5]) \tilde{\lambda}_{
i
}= (1 - a_{
i
}) λ_{
i
}. In the case of a non-overlapping word, we have C_{
i
}= N_{
i
}, a_{
i
}= 0 and λ_{
i
}= λ_{
i
}. For overlapping words, the mean clump size is equal to 1/(1 - a_{
i
}) and increases with a_{
i
}.

Tests

An overlapping word can occur with an exceptional frequency (i) because of an exceptional number of clumps or (ii) because of exceptional sizes of clumps. Then comparing the exceptionalities of an overlapping word in two sequences leads to compare the number of clumps C_{1} with C_{2}, and/or the sizes V_{1c}'s with V_{2c}'s.

Comparison of the number of clumps

In this compound Poisson model, the number of clumps in each sequence is Poisson distributed. The comparison of the counts C_{1} and C_{2} is then exactly equivalent to the comparison of the counts N_{1} and N_{2} studied in the Results Section, replacing λ_{
i
}by \tilde{\lambda}_{
i
}and μ_{
i
}by \tilde{\mu}_{
i
}:= (1 - a_{
i
}) μ_{
i
}.

Exact test for the overlapping probability under M00

The question is now to test the null hypothesis H_{0} = {a_{1} = a_{2}}. This comparison is made conditionally to the observed counts N_{1} and N_{2}. It only makes sense if the motif occurs at least once in each sequence, i.e. if N_{1}, N_{2}, C_{1} and C_{2} are all larger than (or equal to) 1. In this case, the first occurrence necessarily corresponds to the first clump and the C_{
i
}- 1 last clumps have to be chosen among the other N_{
i
}- 1 motif occurrences. Since a motif occurrence (except the first one) corresponds to a clump occurrence with probability 1 - a_{
i
}, the number of clumps (except the first one) has a binomial distribution:

C_{
i
}- 1 ~ \mathcal{B} (N_{
i
}- 1, 1 - a_{
i
}) (2)

which means that the expected number of clumps decreases when the overlapping probability increases.

Following the same strategy as for the non-overlapping case, we base our test on the distribution of C_{1} given the total clump count C_{+} = C_{1} + C_{2}. Under H_{0}, (C_{1} - 1) has an hyper-geometric distribution \mathscr{H} (N_{+} - 2, N_{1} - 1, C_{+} - 2) (see [12], Eq. (3.23)):

The overlapping probability a_{1} is then significantly greater than a_{2} if the probability Pr{C_{1} ≤ c_{1}|N_{1}, N_{2}, C_{+}} is smaller than a given level α.

Exact test in the general case

The previous test does not account for the composition of the sequences. The overlapping probabilities a_{1} and a_{2} can be expected to be different, according to some null model. In this case, the true overlapping probability in sequence i is b_{
i
}= h_{
i
}a_{
i
}, where h_{
i
}is an exceptionality coefficient (analogous to k_{
i
}for the mean count). The problem is then to test H_{0} = {h_{1} = h_{2}}. Such a test is proposed in Appendix: it involves the generalized negative hyper-geometric distribution.

Asymptotic tests

As for the counts N and C, asymptotic tests such as likelihood ratio, chi-square or score tests can be derived to compare exceptionalities in terms of overlaps. These tests are not presented here to avoid further statistical developments but also because the small overlapping probabilities generally observed make them rarely relevant.

Illustration

Materials

Comparing complete genomes of strains of single bacterial species allows to determine highly conserved regions (so-called backbone) and numerous strain-specific DNA segments (so-called loops) for each strain. These mosaic structures help to understand the evolution of bacterial genomes. Indeed, the backbone probably corresponds to the common ancestral strain and is under vertical pressure whereas the loops may be associated with mobile elements or strain-specific pathogenicity. Such backbone/loops segmentation has been systematically performed [9] and store in the public MOSAIC database [16]. We have extracted from this database the E. coli K-12 specific loops (sequence 1) and the backbone (sequence 2) obtained from the pairwise alignment of the complete genomes of E. coli K-12 laboratory strain and the enterohemorrhagic E. coli O157:H7 strain. As an illustration, we have compared the exceptionalities of all the 65536 octamers in the backbone versus in the loops. Such comparison will point out octamers which do not have the same constraint, with respect to their frequency, on the loops versus on the backbone.

Exact binomial test

Figure 4 presents the significance of the binomial test for all octamers in the backbone/loops comparison. The limits between the different significance levels are clear under M00 because the probability π_{0} is the same for all octamers, while they are fuzzy under M1 because π_{0} depends on the octamer composition. In this case, same counts (N_{1}, N_{2}) may result in different p_{
B
}values. The distribution of the p-value p_{
B
}is summarized in Table 4. The 10 motifs with smallest p-values, i.e. with an exceptionality coefficient significantly higher in the loops than in the backbone, are listed in the top of Table 5. Multiple testing problems arise when we compare the exceptionalities of the 65 536 octamers simultaneously. Table 6 gives the number of significant octamers and the corresponding threshold when adjusting for a False Discovery Rate (FDR, [17]) of 1%. For example, under model M1 only 154 octamers are significantly more exceptional in the loops. These octamers have all a p-value p_{
B
}smaller than 2.2 10^{-5}.

Symmetrically, to find the motifs with an exceptionality coefficient significantly higher in the backbone than in the loops, we have to test H_{0} versus {{H}^{\prime}}_{1} = {k_{2} > k_{1}} using the p-value {{p}^{\prime}}_{B} defined as {{p}^{\prime}}_{B} = Pr{\mathcal{B} (n_{+}, π_{0}) ≤ n_{1}}. The 10 most significant motifs for this test are given at the bottom of Table 5. When adjusting for a False Discovery Rate of 1%, only 14 octamers under model M1 are significantly more exceptional in the backbone than in the loops. These octamers have all a p-value {{p}^{\prime}}_{B} smaller than 1.8 10^{-6}. Note that under model M6, no octamer is significant after multiple testing adjustment.

According to the p_{
B
}list, the motif cagcgctg has rank 1 115 among 65 536 under the M1 model. Note that the well known Chi motif (gctggtgg) which is the most overrepresented octamer in E. coli genome has a {{p}^{\prime}}_{B} value of 5.1 10^{-5} (rank 1 281) under the same model; It means that k_{backbone} is significantly higher than k_{loops} but due to multiple testing Chi is not among the significant octamers.

LRT versus binomial

We now compare the results provided by the two tests: binomial and LRT. Because the former is one-sided and the latter is two-sided, we use a signed version LRT^{s}of the LRT statistic which is equal to LRT when N_{1} is greater than expected (N_{1} ≥ π_{0}N_{+}) and to – LRT otherwise (N_{1} <π_{0}N_{+}). To make the graph more readable, we also transform the p-value p_{
B
}into a Gaussian score S_{
B
}∈ℝ:

S_{
B
}= Φ^{-1} (1 - p_{
B
})

where Φ is the cumulative distribution function of the standard Gaussian distribution. High positive values of S_{
B
}correspond to motifs with an exceptionality coefficient in sequence 1 significantly higher than in sequence 2, while high negative values of S_{
B
}correspond to motifs having an exceptionality coefficient in sequence 1 significantly lower than in sequence 2.

We see in Figure 5 that the two statistics give very similar results for all the octamers in the backbone/loops comparison. Table 7 gives the Spearman and Kendall correlation coefficients between the two statistics for different models. Recall that Spearman's coefficient is the correlation between the ranks, while Kendall's one is the proportion of concordant pairs between the two rankings. This confirms that the LRT statistics is a reliable exceptionality comparison score, although the associated p-value is questionable for small counts.

Note that the naive comparison between the two p-values simply associated with the exceptionality of each motif in each sequence does not provide the same sets of significant octamers (see Figure 6). Such p-values have been calculated using the Poisson approximation of the number of clumps.

Test for overlaps

Very few motifs have significant differences in their clumps sizes. Table 8 presents the results for the 4 motifs having a p-value smaller than 10%. For all of them, no overlap is observed in the backbone (C_{2} = N_{2} means that all clumps are of size 1 while few are observed in the loops (C_{1} <N_{1}). The probability a is the overlapping probability under model M00.

Conclusion

We have proposed two complementary statistical tests to compare the exceptionalities of motif counts in two sequences. The binomial test is exact and particularly of interest for small counts (from a computational point of view). For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test. The LRT statistics is simple to calculate and can be directly interpreted as a contrast measure between the exceptionalities; its p-value can be derived from the chi-square distribution. Both tests will be implemented in the R'MES software already devoted to exceptional motifs [18].

The likelihood ratio test can be generalized to more than two sequences. Suppose we want to compare I sequences S_{1}, S_{2},..., S_{
I
}. In each of them, we assume that the count N_{
i
}has a Poisson distribution with parameter λ_{
i
}= k_{
i
}ℓ_{
i
}μ_{
i
}and we want to test H_{0} = {k_{1} = k_{2} = ⋯ = k_{
I
}} versus H_{1} = {At least one k_{
i
}differs from the others}. The LRT statistics is

Under H_{0}, LRT has an asymptotic chi-square distribution with (I - 1) degrees of freedom. The Chi-square test can be generalized as well.

Under the Poisson model, both tests can be easily used for degenerated motifs or more generally for sets of motifs. Let denote by \mathcal{W} a set of motifs; The count N_{
i
}(respectively the occurrence probability μ_{
i
}) will be the sum of the counts (resp. occurrence probability) of w for all motifs w from \mathcal{W}. However, the generalization is much more involved for the compound Poisson model because of the possible overlaps between motifs from the set; In particular, the overlapping probability a_{
i
}becomes a matrix [19].

We emphasize that these tests are valid only for independent sequences. They can not be used to detect skewed oligomers because the leading strand is not independent from the lagging strand [20]. This particular question requires the development of another rigorous statistical method; this is an ongoing work.

Finally, note that the exceptionality comparison of word counts in sequences is actually equivalent to the differential analysis of SAGE expression data [21]. Indeed, in the SAGE technology, the expression level of a given gene is measured by a number of associated tags and the problem is to compare the number of tags between two conditions. In such problem, no correction has to be done except for the total number of tags and our test statistics under model M00 are adapted.

Methods

Likelihood ratio test

The model presented in the Results Section can be rephrased as two Poisson processes with respective intensity k_{
i
}u_{
i
}(i = 1,2). To calculate the likelihood, we need to estimate the exceptionality coefficients k_{1} and k_{2}. Under the alternative hypothesis, their respective maximum likelihood estimates (MLE) are \widehat{k}_{1} = N_{1}/(ℓ_{1}μ_{1}) and \widehat{k}_{2} = N_{2}/(ℓ_{2}μ_{2}). Assuming that the two sequences are independent, the log-likelihood of the two processes is

The LRT is defined as twice the difference between \mathcal{L}_{1} and \mathcal{L}_{0}: LRT = 2(\mathcal{L}_{1} - \mathcal{L}_{0}). The result follows after standard algebraic manipulations.

Appendix

Exact hyper-geometric test

Conditional distribution of the number of clumps

The conditional distribution of C_{
i
}- 1 given in (2) can be modified as

N_{
i
}- C_{
i
}~ \mathcal{B} (N_{
i
}- 1, b_{
i
})

where b_{
i
}= h_{
i
}a_{
i
}is the true overlapping probability. This version is preferable, since the exceptionality coefficient h_{
i
}directly appears here as a multiplicative constant. The conditional distribution of the difference N_{
i
}- C_{
i
}given the clump counts C_{1} and C_{2} and the total count N_{+} is a generalized negative hyper-geometric distribution (see [12] p. 264 for the classical version and p. 270 for the generalization):

where A is the constant such that the sum over all n_{1} between C_{1} and N_{+} is equal to one.

Test

Under H_{0} = {h_{1} = h_{2}}, the term b_{1}/b_{2} can be replaced by a_{1}/a_{2}. The overlapping probability b_{1} is significantly greater than b_{2} if N_{1} is significantly large, i.e. if Pr{N_{1} ≥ n_{1}|C_{1}, C_{2}, N_{+}} is small. The power of this test can also be studied: under H_{0}, b_{1}/b_{2} equals a_{1}/a_{2}, while under the alternative hypothesis, it is equal to (h_{1}/h_{2}) (a_{1}/a_{2}). The power of the test is then a function of h_{1}/h_{2}.

References

van Helden J, André B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281: 827–842. 10.1006/jmbi.1998.1947

El Karoui M, Biaudet V, Schbath S, Gruss A: Characteristics of Chi distribution on several bacterial genomes. Research in Microbiology 1999, 150: 579–587. 10.1016/S0923-2508(99)00132-1

Bigot S, Saleh O, Lesterlin C, Pages C, El Karoui M, Dennis C, Grigoriev M, Allemand JF, Barre FX, Cornet F: KOPS: DNA motifs that control E. coli chromosome segregation by orienting the FtsK translocase. EMBO J 2005, 24: 3770–3780. 10.1038/sj.emboj.7600835

Robin S, Rodolphe F, Schbath S: DNA Words and Models. Cambridge University Press; 2005. [English version of ADN, mots et modéles, BELIN 2003]. [English version of ADN, mots et modéles, BELIN 2003].

Davidsen T, Rodland E, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes. Nucleic Acids Research 2004, 32: 1050–1058. 10.1093/nar/gkh255

Chiapello H, Bourgait I, Sourivong F, Heuclin G, Jacquemard A, Petit MA, El Karoui M: Systematic determination of the MOSAIC structure – backbone versus strain specific loops – of bacterial genomes. BMC Bioinformatics 2005, 6: 171. 10.1186/1471-2105-6-171

McNeil J, Smith K, Hall L, Lawrence J: Word frequency analysis reveals enrichment of dinucleotide repeats on the human X chromosome and [GATA]n in the X escape region. Genome Research 2006, 16: 477–484. 10.1101/gr.4627606

Vandewalle V: Etude de motifs dans les séquences d'ADN : comparaison d'exceptionnalités. In Master's thesis. Institut National Agronomique Paris-Grignon; 2005.

Roquain E, Schbath S: Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain. Adv Appl Prob 2007., 39:

We thank Meriem El Karoui and Marie-Agnès Petit for helpful discussions. We also thank the referees for their remarks. This work has been supported by the French Action Concertée Incitative IMPBio.

Author information

Authors and Affiliations

INA PG/ENGREF/INRA, UMR518 Unité Mathématiques et Informatique Appliquées, 75005, Paris, France

Stéphane Robin & Vincent Vandewalle

INRA, UR1077 Unité Mathématique, Informatique et Génome, 78350, Jouy-en-Josas, France

SR and SS developed the statistical methodology, analyzed the examples and wrote the paper. VV studied the usage conditions. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.