Statistical tests to compare motif count exceptionalities
 Stéphane Robin^{1}Email author,
 Sophie Schbath^{2}Email author and
 Vincent Vandewalle^{1}
DOI: 10.1186/14712105884
© Robin et al; licensee BioMed Central Ltd. 2007
Received: 15 September 2006
Accepted: 08 March 2007
Published: 08 March 2007
Abstract
Background
Finding over or underrepresented motifs in biological sequences is now a common task in genomics. Thanks to pvalue calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Just comparing the motif count pvalues in each sequence is indeed not sufficient to decide if this motif is significantly more exceptional in one sequence compared to the other one. A statistical test is required.
Results
We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with a special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K12 backbone versus variable strainspecific loops.
Conclusion
The exact binomial test is particularly adapted for small counts. For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
Background
Detecting motifs with a significantly unexpected frequency in DNA sequences has become a very common task in genome analysis. It is generally addressed to propose candidate functional motifs based on their statistical properties [1–3]. Lots of statistical methods have been developed to that purpose (see the recent surveys by [4] or [5] and references therein) and satisfactory solutions exist now to find exceptional motifs thanks to pvalue calculations.
More recently, a new related question has arisen in the literature concerning the comparison of motif exceptionalities in two sequences. One wants for instance to compare particular sets of genes [6], upstream regions of CDSs versus whole chromosome [7], structural domains [8], CDSs versus intergenic regions, conserved regions versus strainspecific regions of bacterial genomes [9], or chromosomes from the same species [10]. Chromosomes from different species can also be compared from a comparative genomics point of view. In all these works, one would like to know if a given motif is significantly more exceptional in one sequence compared to another one. This criterion is usually used to identify motifs which are specific from some regions or expected to be more frequent in some particular parts of the genome. Transcription factor binding sites, for instance, are expected to be more frequent in upstream regions than along the whole genome.
Surprisingly, no rigorous statistical method has been proposed yet to decide if a given motif, exact or not, is significantly more exceptional in one sequence compared to a second one. Of course, two pvalues can be calculated separately on each sequence to know if the motif is exceptional in these sequences but the difficult point is how to compare these two pvalues from a statistical point of view. It is indeed not sufficient to make the difference or the ratio to know if the two pvalues are significantly different; One needs a statistical test.
In this paper, we propose two statistical tests to compare the motif count exceptionalities in two independent sequences. In the Results Section, we first present the underlying model for motif occurrences and the null hypothesis to test, namely the motif is similarly exceptional in both sequences. Then we derive an exact binomial test and an asymptotic likelihood ratio test adapted for frequent motifs. Usage conditions and power of both tests are described in the Discussion Section, together with a more refined model for occurrences of overlapping words and the associated tests. An illustration of the method is finally given; We compare the octamer exceptionalities in two sets of regions (backbone/loops) from the Escherichia coli K12 leading strands. These two sets correspond to the mosaic structure of E. coli's genome when comparing the two strains K12 and O157:H7: the backbone represents the common regions whereas the loops are specific to the K12 strain. As a toy example all along this paper, we will treat in detail the case of the palindromic octamer cagcgctg which occurs respectively 30 times in the loops (758434 bps long) and 113 times in the backbone (3 882 513 bps long).
Results
Poisson model
In sequence i, the motif count N_{ i }is supposed to have a Poisson distribution with mean (and variance) λ_{ i }. This distribution has been shown to fit correctly theoretical (in Markovian sequences, for example) as well as observed count distributions of nonoverlapping words [11]; A nonoverlapping word is a word such that two occurrences of itself can not overlap in a sequence.
The mean count λ_{ i }in sequence i must account for three parameters: (i) the length ℓ_{ i }of the sequence, (ii) the composition of the sequence, (iii) the possible exceptionality of the motif in the sequence.
Expected intensity
The composition of the sequence can be accounted for via the probability μ_{ i }for the motif to occur at any position in the sequence under a simple model. The most popular models are Markov chain models which can fit the frequencies in mono, di, trinucleotides, etc. Indeed, the Markov chain model of order m (denoted by Mm) takes the (m + 1)mer composition into account. Under such models, the occurrence probability μ_{ i }of a hletter motif w = w_{1} w_{2} ... w_{ h }on the {a, c, g, t} alphabet can be expressed in terms of counts of its subwords of length m and m + 1 [5]. For instance, here are the expression of μ_{ i }in models M0, M1 and M(h  2) which fit respectively the composition in bases, in dinucleotides and in oligonucleotides of length h  1:
where N_{ i }(·) denotes the count in sequence i.
If one does not want to account for the sequence composition (this case will be referred to as model M00), then μ_{ i }simply depends on the motif, hence μ_{1} = μ_{2} = (1/4)^{ h }.
The choice of the Markov chain model depends on the sequence composition one wants to fit. For instance, model M2 is often used for coding DNA sequences to take the codon bias into account. Higher the model order, better the fit, but usually the model order is bounded either by h  2 or because the sequence is too small (the number of parameters to be estimated increases exponentially with the order).
Expected count for cagcgctg in the loops (1) and in the backbone (2) of E. coli leading strands under different models.
Model  M00  M0  M1  M6  Count 

ℓ_{1} μ_{1}  11.6  9.4  13.9  24.8  n_{1} = 30 
ℓ_{2} μ_{2}  59.2  66.0  106.2  126.1  n_{2} = 113 
Exceptionality coefficient
When the motif is not exceptional with respect to the considered model, the mean count λ_{ i }is simply ℓ_{ i }μ_{ i }. For exceptional motifs, i.e. motifs with an observed count N_{ i }far from its expectation ℓ_{ i }μ_{ i }, under a given model, the mean count λ_{ i }should reflect this exceptionality.
We therefore introduce an exceptionality coefficient k_{ i }which allows λ_{ i }to be greater (or smaller) than the expected value:
λ_{ i }: = k_{ i }ℓ_{ i }μ_{ i }.
In the following, parameters ℓ_{ i }and μ_{ i }will be supposed to be known a priori: they can be considered as two correction terms. The inference will only be made on k_{ i }.
Hypothesis testing
Comparing the (potential) exceptionality of a motif in two sequences is equivalent to test the null hypothesis H_{0} = {k_{1} = k_{2}}.
We emphasize that the respective values of k_{1} and k_{2} can be larger than one (unexpectedly frequent motif), smaller than one (unexpectedly rare motif) or close to one (motif with expected count). These values do not matter: our only concern is to know if they are significantly different or not.
Exact binomial test
We first propose an exact test based on a general property of the Poisson distribution. If N_{1} and N_{2} are two independent Poisson counts with respective means λ_{1} and λ_{2}, the distribution of N_{1} given their sum N_{+} : = N_{1} + N_{2} is binomial [12]: N_{1} ~ $\mathcal{B}$ (N_{+}, π) with
Under H_{0}, we have π = π_{0} with
because k_{1} = k_{2}. In absence of correction (M00 model) for the sequence composition (i.e. μ_{1} = μ_{2}), we have π_{0} = ℓ_{1}/(ℓ_{1} + ℓ_{2}). If furthermore the two sequences have the same length, we get π_{0} = 1/2.
Moreover, the proportion π and then the expectation of N_{1}, increases as the ratio k_{1}/k_{2} increases. Therefore, the pvalue for the onesided alternative H_{1} = {k_{1} > k_{2}} is p_{ B }= Pr {$\mathcal{B}$ (n_{+}, π_{0}) ≥ n_{1}}, i.e.
where n_{+} and n_{1} are the observed values of N_{+} and N_{1}.
Probability π_{0} and pvalue p_{ B }under different models for cagcgctg in the E. coli loops/backbone comparison.
Model  M00  M0  M1  M6 

π_{0} (%)  16.3  12.4  11.6  16.4 
p _{ B }  8.6 10^{2}  2.7 10^{3}  9.1 10^{4}  9.1 10^{2} 
Likelihood ratio test (LRT)
Another test statistic based on the comparison of the likelihood of the data under the H_{0} and the alternative hypothesis H_{1} = {k_{1} ≠ k_{2}} can be derived. This statistic is known as the Likelihood Ratio Test (see [13], vol. IV). In our model (see the Methods Section), it is defined as
where π_{0} is defined in (1). Under the null hypothesis, its asymptotic distribution is a chisquare distribution with one degree of freedom.
This test is twosided, because, under H_{ 1 }, parameters k_{1} and k_{2} are estimated independently (in particular, without the constraint k_{1} > k_{2}). The exact distribution of LRT could be calculated via permutation techniques but the computation time would be tremendeous for large counts. We will then calculate the following asymptotic pvalue:
where n_{2} is the observed value of N_{2} and χ^{2} ~ χ^{2} (1).
LRT statistic and associated pvalue p_{ L }under different models for cagcgctg in the E. coli loops/backbone comparison.
Model  M00  M0  M1  M6 

LRT  2.1  8.2  10.2  2.0 
p _{ L }  1.5 10^{1}  4.2 10^{3}  1.4 10^{3}  1.6 10^{1} 
Chisquare test
Another standard asymptotic test is the chisquare test where the counts N_{ i }are compared to their expected values $\widehat{N}$_{ i }under H_{0} given the total count N_{+}:
where $\widehat{N}$_{1} = π_{0} N_{+} and $\widehat{N}$_{2} = (1  π_{0})N_{+}. Under the null hypothesis, X^{2} has also an asymptotic chisquare distribution with one degree of freedom. It is also an intrinsically twosided test. Further analyzes (including simulations) not presented here (see [14]) show that this test performs very similarly to the LRT in every situations. Note that the chisquare test is the same as the score test [13].
Discussion
LRT distribution
The chisquare distribution of the LRT statistic is only asymptotic, so the actual level may be different from the nominal one (typically α = 5%). To measure this difference, we have calculated this actual level for different values of π_{0} and N_{+}. Since LRT is a function of N_{1}, the actual level can be derived from the exact distribution of N_{1} given N_{+} which is binomial (see Results Section).
Regarding the motif cagcgctg, because π_{0} is about 15% (cf. Table 2), the picture is close to the right plot of Figure 1; In fact, with a total count of 143, the actual level is respectively 0.095%, 1.1%, 5.1% and 12.5% for a nominal level α equal to 0.1%, 1%, 5% and 10%.
LRT as a contrast measure
The LRT statistic can still be used as a contrast measure, i.e. a measure of the difference, between the two exceptionalities. For large values of N_{+} it is much faster and easier to compute than the binomial pvalue. We will see in the illustration below that the two quantities are strongly correlated.
Decidability limits for the binomial test
Because the binomial test is exact, the actual and nominal levels are equal. The significance can then always be determined. It would be maximal when N_{1} = N_{+} (i.e. N_{2} = 0) and the corresponding pvalue p_{ B }would be equal to ${\pi}_{0}^{{N}_{+}}$. Therefore, if this minimal pvalue is greater than the desired level α (typically 5%), no significance conclusion can be made. This happens when ${\pi}_{0}^{{N}_{+}}$α, i.e. when N_{+} ≥ ln (α)/ln(π_{0}).
Power
An important property for a statistical test is its ability to detect departure from the null hypothesis. This ability is measured by the power of the test which is the probability to exceed the significance threshold (defined under H_{0}) when the true parameter satisfies H_{1}. In our case, the parameter of interest is
which is equal to π_{0} when k_{1} = k_{2}. So the departure from H_{0} will be measured by the ratio k_{1}/k_{2} when it differs from 1.
Exact binomial
The motif cagcgctg occurs N_{+} = 143 times in the whole genome. In the different models considered in Table 2, probability π_{0} is between 11.6% and 16.4%. The power of the binomial test in this case can therefore be read in Figure 3, in the two top plots between the black and red solid lines. We see that a ratio k_{1}/k_{2} = 2 can be detected with probability greater than 90%, while a ratio of 1.5 will be detected with a bit more than 50% probability.
LRT
The same analysis can be made for the LRT tests. However, this only makes sense for sufficiently large N_{+}, to guaranty the validity of the chisquare distribution.
Case of overlapping words
Compound Poisson model
The distribution of overlapping word occurrences is better modeled by a compound Poisson process (see [15]) in the following way:

The word occurs in clumps distributed according to a Poisson process. The number of clumps C_{ i }in sequence i is hence a random Poisson variable with mean denoted by $\tilde{\lambda}$_{ i }.

The size V_{ ic }of the cth clump (in sequence i) is random with geometric distribution:
Pr{V_{ ic }= v} = ${a}_{i}^{v1}$ (1  a_{ i }).
The clump sizes are supposed to be independent. Parameter a_{ i }is the overlapping probability of the motif and can be calculated under various Markovian models (see [5]).
In this setting, the count N_{ i }is hence the sum of the sizes of C_{ i }clumps and has the PolyaAeppli (or geometric Poisson) distribution (see [12]). We have (see [5]) $\tilde{\lambda}$_{ i }= (1  a_{ i }) λ_{ i }. In the case of a nonoverlapping word, we have C_{ i }= N_{ i }, a_{ i }= 0 and λ_{ i }= λ_{ i }. For overlapping words, the mean clump size is equal to 1/(1  a_{ i }) and increases with a_{ i }.
Tests
An overlapping word can occur with an exceptional frequency (i) because of an exceptional number of clumps or (ii) because of exceptional sizes of clumps. Then comparing the exceptionalities of an overlapping word in two sequences leads to compare the number of clumps C_{1} with C_{2}, and/or the sizes V_{1c}'s with V_{2c}'s.
Comparison of the number of clumps
In this compound Poisson model, the number of clumps in each sequence is Poisson distributed. The comparison of the counts C_{1} and C_{2} is then exactly equivalent to the comparison of the counts N_{1} and N_{2} studied in the Results Section, replacing λ_{ i }by $\tilde{\lambda}$_{ i }and μ_{ i }by $\tilde{\mu}$_{ i }:= (1  a_{ i }) μ_{ i }.
Exact test for the overlapping probability under M00
The question is now to test the null hypothesis H_{0} = {a_{1} = a_{2}}. This comparison is made conditionally to the observed counts N_{1} and N_{2}. It only makes sense if the motif occurs at least once in each sequence, i.e. if N_{1}, N_{2}, C_{1} and C_{2} are all larger than (or equal to) 1. In this case, the first occurrence necessarily corresponds to the first clump and the C_{ i } 1 last clumps have to be chosen among the other N_{ i } 1 motif occurrences. Since a motif occurrence (except the first one) corresponds to a clump occurrence with probability 1  a_{ i }, the number of clumps (except the first one) has a binomial distribution:
C_{ i } 1 ~ $\mathcal{B}$ (N_{ i } 1, 1  a_{ i }) (2)
which means that the expected number of clumps decreases when the overlapping probability increases.
Following the same strategy as for the nonoverlapping case, we base our test on the distribution of C_{1} given the total clump count C_{+} = C_{1} + C_{2}. Under H_{0}, (C_{1}  1) has an hypergeometric distribution $\mathscr{H}$ (N_{+}  2, N_{1}  1, C_{+}  2) (see [12], Eq. (3.23)):
The overlapping probability a_{1} is then significantly greater than a_{2} if the probability Pr{C_{1} ≤ c_{1}N_{1}, N_{2}, C_{+}} is smaller than a given level α.
Exact test in the general case
The previous test does not account for the composition of the sequences. The overlapping probabilities a_{1} and a_{2} can be expected to be different, according to some null model. In this case, the true overlapping probability in sequence i is b_{ i }= h_{ i }a_{ i }, where h_{ i }is an exceptionality coefficient (analogous to k_{ i }for the mean count). The problem is then to test H_{0} = {h_{1} = h_{2}}. Such a test is proposed in Appendix: it involves the generalized negative hypergeometric distribution.
Asymptotic tests
As for the counts N and C, asymptotic tests such as likelihood ratio, chisquare or score tests can be derived to compare exceptionalities in terms of overlaps. These tests are not presented here to avoid further statistical developments but also because the small overlapping probabilities generally observed make them rarely relevant.
Illustration
Materials
Comparing complete genomes of strains of single bacterial species allows to determine highly conserved regions (socalled backbone) and numerous strainspecific DNA segments (socalled loops) for each strain. These mosaic structures help to understand the evolution of bacterial genomes. Indeed, the backbone probably corresponds to the common ancestral strain and is under vertical pressure whereas the loops may be associated with mobile elements or strainspecific pathogenicity. Such backbone/loops segmentation has been systematically performed [9] and store in the public MOSAIC database [16]. We have extracted from this database the E. coli K12 specific loops (sequence 1) and the backbone (sequence 2) obtained from the pairwise alignment of the complete genomes of E. coli K12 laboratory strain and the enterohemorrhagic E. coli O157:H7 strain. As an illustration, we have compared the exceptionalities of all the 65536 octamers in the backbone versus in the loops. Such comparison will point out octamers which do not have the same constraint, with respect to their frequency, on the loops versus on the backbone.
Exact binomial test
Number of significantly unbalanced octamers under different models and for different thresholds.
Model:  M00  M0  M1  M6  

p _{ B }  < 10^{4}  277  126  83  37  
10^{4} ≤  p _{ B }  < 10^{3}  519  303  247  4 
10^{3} ≤  p _{ B }  < 10^{2}  1758  1330  1143  104 
10^{2} ≤  p _{ B }  62982  63777  64063  65391 
Top: 10 motifs with smallest pvalue p_{ B }(k_{loops} > k_{backbone}) for model M00, M0, M1 and M6. * indicates overlapping words. Bottom: 10 motifs with smallest pvalue ${{p}^{\prime}}_{B}$ (k_{backbone} > k_{loops}).
M00  M0  M1  M6  

cggataag  1.2 10^{19}  cggataag  3.9 10^{20}  cggataag  2.7 10^{18}  gggataaa  2.4 10^{4} 
ggataagg*  8.6 10^{16}  ccgcatcc*  2.0 10^{16}  taaggcgt*  9.1 10^{15}  tcgaccaa  3.0 10^{4} 
taaggcgt*  4.6 10^{15}  ggataagg*  3.0 10^{16}  ccgcatcc*  4.0 10^{14}  agttttta*  4.5 10^{4} 
gataaggc  1.2 10^{14}  tgtaggcc  1.1 10^{15}  acgccgca*  4.0 10^{14}  aagtgata*  5.3 10^{4} 
taataaaa  1.9 10^{14}  tcaggcct*  2.9 10^{15}  ataaggcg  3.2 10^{13}  gatagcgc  8.1 10^{4} 
ataaggcg  5.6 10^{14}  taaggcgt*  2.9 10^{15}  gccgcatc  1.0 10^{12}  gggtcagg*  1.5 10^{3} 
ctgataag  1.2 10^{13}  gataaggc  4.9 10^{15}  gataaggc  2.2 10^{12}  agccgaga*  1.7 10^{3} 
tgtaggcc  4.0 10^{13}  ggcctaca  1.1 10^{14}  gttccccg*  4.0 10^{12}  gaggttac  1.7 10^{3} 
cttatccg  5.5 10^{13}  ccggccta  1.2 10^{14}  cgcatccg*  4.4 10^{12}  cagagtcc*  1.8 10^{3} 
ccttatcc*  6.0 10^{13}  aggcctac  1.4 10^{14}  tgtaggcc  4.7 10^{12}  ccctggcc*  2.0 10^{3} 
ggcgctgg*  < 10^{20}  ctggaaga  6.8 10^{10}  ctggaaga  1.2 10^{10}  tcggttac  4.9 10^{4} 
gcgctgga  2.5 10^{14}  cgatgaag  2.9 10^{9}  atctggtg  3.3 10^{8}  ggttgatg*  5.4 10^{4} 
cggcgctg  3.0 10^{13}  gaagtgct  7.2 10^{9}  gaagtgct  4.6 10^{8}  gcgcatcc  6.8 10^{4} 
tggcgctg*  5.8 10^{12}  tgaaactg*  4.0 10^{8}  ggcgctgg*  5.2 10^{8}  taggccgc  8.5 10^{4} 
gcgctggt  7.2 10^{12}  atctggtg  4.9 10^{8}  cgatgaag  6.6 10^{8}  aagcttcg  1.1 10^{3} 
cgctggtg  8.9 10^{12}  gcgctgga  8.0 10^{8}  tatctggt*  1.1 10^{7}  cgatgaag  1.1 10^{3} 
cgcgctgg  1.0 10^{10}  cggtaaag  1.1 10^{7}  cggtaaag  1.4 10^{7}  cggataaa  1.2 10^{3} 
gctggcga  1.3 10^{10}  ggttgatg*  1.4 10^{7}  ggttgatg*  2.0 10^{7}  ggggggac  1.4 10^{3} 
tggcgcag  1.7 10^{10}  gtgctgga  1.6 10^{7}  gtgctgga  2.5 10^{7}  caggcgtt  1.6 10^{3} 
ctggaaga  3.1 10^{10}  aattgtcg  2.1 10^{7}  tgggcttc  5.6 10^{7}  acgccttc  1.8 10^{3} 
Top: numbers of octamers significantly more exceptional in the loops when adjusting for a False Discovery Rate of 1% and associated thresholds for the pvalue p_{ B }for different models. Bottom: idem for octamers significantly more exceptional in the backbone.
Model  M00  M0  M1  M6 

Nb. of significant octamers  677  257  154  0 
Threshold for p_{ B }  1.0 10^{4}  3.9 10^{5}  2.2 10^{5}  _ 
Nb. of significant octamers  159  23  14  0 
Threshold for ${{p}^{\prime}}_{B}$  2.4 10^{5}  3.4 10^{6}  1.8 10^{6}   
Symmetrically, to find the motifs with an exceptionality coefficient significantly higher in the backbone than in the loops, we have to test H_{0} versus ${{H}^{\prime}}_{1}$ = {k_{2} > k_{1}} using the pvalue ${{p}^{\prime}}_{B}$ defined as ${{p}^{\prime}}_{B}$ = Pr{$\mathcal{B}$ (n_{+}, π_{0}) ≤ n_{1}}. The 10 most significant motifs for this test are given at the bottom of Table 5. When adjusting for a False Discovery Rate of 1%, only 14 octamers under model M1 are significantly more exceptional in the backbone than in the loops. These octamers have all a pvalue ${{p}^{\prime}}_{B}$ smaller than 1.8 10^{6}. Note that under model M6, no octamer is significant after multiple testing adjustment.
According to the p_{ B }list, the motif cagcgctg has rank 1 115 among 65 536 under the M1 model. Note that the well known Chi motif (gctggtgg) which is the most overrepresented octamer in E. coli genome has a ${{p}^{\prime}}_{B}$ value of 5.1 10^{5} (rank 1 281) under the same model; It means that k_{backbone} is significantly higher than k_{loops} but due to multiple testing Chi is not among the significant octamers.
LRT versus binomial
We now compare the results provided by the two tests: binomial and LRT. Because the former is onesided and the latter is twosided, we use a signed version LRT^{ s }of the LRT statistic which is equal to LRT when N_{1} is greater than expected (N_{1} ≥ π_{0} N_{+}) and to – LRT otherwise (N_{1} <π_{0} N_{+}). To make the graph more readable, we also transform the pvalue p_{ B }into a Gaussian score S_{ B }∈ ℝ:
S_{ B }= Φ^{1} (1  p_{ B })
where Φ is the cumulative distribution function of the standard Gaussian distribution. High positive values of S_{ B }correspond to motifs with an exceptionality coefficient in sequence 1 significantly higher than in sequence 2, while high negative values of S_{ B }correspond to motifs having an exceptionality coefficient in sequence 1 significantly lower than in sequence 2.
Spearman and Kendall correlation coefficients between LRT^{ s }and S_{ B }for different models.
Model  M00  M0  M1  M6 

Spearman (%)  99.7  99.7  99.7  99.3 
Kendall (%)  96.0  95.6  95.5  93.3 
Test for overlaps
Octamers with significant differences in terms of overlaps in the backbone/loops comparison.
Motif  Loops  backbone  p  a  

C _{1}  N _{1}  C _{2}  N _{2}  (%)  (%)  
accactac  7  9  44  44  2.20  0.02 
tattatta  38  41  69  69  4.83  1.56 
tcggggtc  2  3  24  24  8.00  0.02 
cgcgccgc  27  28  246  246  9.93  0.10 
Conclusion
We have proposed two complementary statistical tests to compare the exceptionalities of motif counts in two sequences. The binomial test is exact and particularly of interest for small counts (from a computational point of view). For large counts, we advise to use the likelihood ratio test which is asymptotic but strongly correlated with the exact binomial test. The LRT statistics is simple to calculate and can be directly interpreted as a contrast measure between the exceptionalities; its pvalue can be derived from the chisquare distribution. Both tests will be implemented in the R'MES software already devoted to exceptional motifs [18].
The likelihood ratio test can be generalized to more than two sequences. Suppose we want to compare I sequences S_{1}, S_{2},..., S_{ I }. In each of them, we assume that the count N_{ i }has a Poisson distribution with parameter λ_{ i }= k_{ i }ℓ_{ i }μ_{ i }and we want to test H_{0} = {k_{1} = k_{2} = ⋯ = k_{ I }} versus H_{1} = {At least one k_{ i }differs from the others}. The LRT statistics is
Under H_{0}, LRT has an asymptotic chisquare distribution with (I  1) degrees of freedom. The Chisquare test can be generalized as well.
Under the Poisson model, both tests can be easily used for degenerated motifs or more generally for sets of motifs. Let denote by $\mathcal{W}$ a set of motifs; The count N_{ i }(respectively the occurrence probability μ_{ i }) will be the sum of the counts (resp. occurrence probability) of w for all motifs w from $\mathcal{W}$. However, the generalization is much more involved for the compound Poisson model because of the possible overlaps between motifs from the set; In particular, the overlapping probability a_{ i }becomes a matrix [19].
We emphasize that these tests are valid only for independent sequences. They can not be used to detect skewed oligomers because the leading strand is not independent from the lagging strand [20]. This particular question requires the development of another rigorous statistical method; this is an ongoing work.
Finally, note that the exceptionality comparison of word counts in sequences is actually equivalent to the differential analysis of SAGE expression data [21]. Indeed, in the SAGE technology, the expression level of a given gene is measured by a number of associated tags and the problem is to compare the number of tags between two conditions. In such problem, no correction has to be done except for the total number of tags and our test statistics under model M00 are adapted.
Methods
Likelihood ratio test
The model presented in the Results Section can be rephrased as two Poisson processes with respective intensity k_{ i }u_{ i }(i = 1,2). To calculate the likelihood, we need to estimate the exceptionality coefficients k_{1} and k_{2}. Under the alternative hypothesis, their respective maximum likelihood estimates (MLE) are $\widehat{k}$_{1} = N_{1}/(ℓ_{1} μ_{1}) and $\widehat{k}$_{2} = N_{2}/(ℓ_{2} μ_{2}). Assuming that the two sequences are independent, the loglikelihood of the two processes is
Under the null hypothesis, the common MLE of k_{1} and k_{2} is $\widehat{k}$ = (N_{1} + N_{2})/(ℓ_{1} μ_{1} + ℓ_{2} μ_{2}) and the loglikelihood is
The LRT is defined as twice the difference between $\mathcal{L}$_{1} and $\mathcal{L}$_{0}: LRT = 2($\mathcal{L}$_{1}  $\mathcal{L}$_{0}). The result follows after standard algebraic manipulations.
Appendix
Exact hypergeometric test
Conditional distribution of the number of clumps
The conditional distribution of C_{ i } 1 given in (2) can be modified as
N_{ i } C_{ i }~ $\mathcal{B}$ (N_{ i } 1, b_{ i })
where b_{ i }= h_{ i }a_{ i }is the true overlapping probability. This version is preferable, since the exceptionality coefficient h_{ i }directly appears here as a multiplicative constant. The conditional distribution of the difference N_{ i } C_{ i }given the clump counts C_{1} and C_{2} and the total count N_{+} is a generalized negative hypergeometric distribution (see [12] p. 264 for the classical version and p. 270 for the generalization):
where A is the constant such that the sum over all n_{1} between C_{1} and N_{+} is equal to one.
Test
Under H_{0} = {h_{1} = h_{2}}, the term b_{1}/b_{2} can be replaced by a_{1}/a_{2}. The overlapping probability b_{1} is significantly greater than b_{2} if N_{1} is significantly large, i.e. if Pr{N_{1} ≥ n_{1}C_{1}, C_{2}, N_{+}} is small. The power of this test can also be studied: under H_{0}, b_{1}/b_{2} equals a_{1}/a_{2}, while under the alternative hypothesis, it is equal to (h_{1}/h_{2}) (a_{1}/a_{2}). The power of the test is then a function of h_{1}/h_{2}.
Declarations
Acknowledgements
We thank Meriem El Karoui and MarieAgnès Petit for helpful discussions. We also thank the referees for their remarks. This work has been supported by the French Action Concertée Incitative IMPBio.
Authors’ Affiliations
References
 van Helden J, André B, ColladoVides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998, 281: 827–842. 10.1006/jmbi.1998.1947View ArticlePubMedGoogle Scholar
 El Karoui M, Biaudet V, Schbath S, Gruss A: Characteristics of Chi distribution on several bacterial genomes. Research in Microbiology 1999, 150: 579–587. 10.1016/S09232508(99)001321View ArticlePubMedGoogle Scholar
 Bigot S, Saleh O, Lesterlin C, Pages C, El Karoui M, Dennis C, Grigoriev M, Allemand JF, Barre FX, Cornet F: KOPS: DNA motifs that control E. coli chromosome segregation by orienting the FtsK translocase. EMBO J 2005, 24: 3770–3780. 10.1038/sj.emboj.7600835PubMed CentralView ArticlePubMedGoogle Scholar
 Lothaire M: Applied Combinatorics on Words, Volume 105 of Encyclopedia of Mathematics and its Applications. Cambridge University Press; 2005.View ArticleGoogle Scholar
 Robin S, Rodolphe F, Schbath S: DNA Words and Models. Cambridge University Press; 2005. [English version of ADN, mots et modéles, BELIN 2003]. [English version of ADN, mots et modéles, BELIN 2003].Google Scholar
 Davidsen T, Rodland E, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes. Nucleic Acids Research 2004, 32: 1050–1058. 10.1093/nar/gkh255PubMed CentralView ArticlePubMedGoogle Scholar
 Touzain F, Schbath S, DebledRennesson I, Aigle B, Leblond , Kucherov G: SIGffRid: Searching for transcription factor binding sites in bacterial genomes using comparative approach and biologically driven statistics. 2006. [Preprint. Preliminary version in JOBIM 2005 proceedings, 417426].Google Scholar
 Valens M, Penaud S, Rossignol M, Cornet F, Boccard F: Macrodomain organization of the Escherichia coli chromosome. EMBO J 2004, 23: 4330–4341. 10.1038/sj.emboj.7600434PubMed CentralView ArticlePubMedGoogle Scholar
 Chiapello H, Bourgait I, Sourivong F, Heuclin G, Jacquemard A, Petit MA, El Karoui M: Systematic determination of the MOSAIC structure – backbone versus strain specific loops – of bacterial genomes. BMC Bioinformatics 2005, 6: 171. 10.1186/147121056171PubMed CentralView ArticlePubMedGoogle Scholar
 McNeil J, Smith K, Hall L, Lawrence J: Word frequency analysis reveals enrichment of dinucleotide repeats on the human X chromosome and [GATA]n in the X escape region. Genome Research 2006, 16: 477–484. 10.1101/gr.4627606PubMed CentralView ArticlePubMedGoogle Scholar
 Schbath S: Compound Poisson approximation of word counts in DNA sequences. ESAIM: Probability and Statistics 1995, 1: 1–16.View ArticleGoogle Scholar
 Johnson NL, Kotz S, Kemp AW: Univariate Discrete Distributions. Wiley: NewYork; 1992.Google Scholar
 Armitage P, Colton T, (Eds): Encyclopedia of Biostatistics. Wiley; 1998.Google Scholar
 Vandewalle V: Etude de motifs dans les séquences d'ADN : comparaison d'exceptionnalités. In Master's thesis. Institut National Agronomique ParisGrignon; 2005.Google Scholar
 Robin S: A compound Poisson model for words occurrences in DNA sequences. J R Statist Soc C 2002, 51: 437–451. 10.1111/14679876.00279View ArticleGoogle Scholar
 MOSAIC[http://genome.jouy.inra.fr/mosaic/]
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerfull approach to multiple testing. JRSS B 1995, 57: 289–300.Google Scholar
 Hoebeke M, Schbath S: R'MES: Finding Exceptional Motifs. User guide, version 3 2006. [http://genome.jouy.inra.fr/ssb/rmes/]Google Scholar
 Roquain E, Schbath S: Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain. Adv Appl Prob 2007., 39:Google Scholar
 Salzberg S, Salzberg A, Kerlavage A, Tomb JF: Skewed Oligomers and Origins of Replication. Gene 1998, 217: 57–67. 10.1016/S03781119(98)003746View ArticlePubMedGoogle Scholar
 Audic S, Claverie JM: The significance of digital gene expression profiles. Genome Research 1997, 7: 986–995.PubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.