Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics.
We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13–31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.
Motif discovery and characterization are important for the study of gene evolution, duplication, transcription sites, and protein identification , as well as of genetic diseases caused by unstable repeat expansion [2, 3].
Several tools have been developed for de novo motif discovery [4,5,6]—including discriminative regular expression motif elicitation (DREME), hypergeometric optimization of motif enrichment (HOMER), multiple expectation maximizations for motif elicitation (MEME), the memetic framework for motif discovery (MFMD), peak-motifs, prosampler, regulatory sequence analysis tools (RSAT), Trawler Web, and Weeder—either generic or specialized, e.g. for ChIP-seq data [7,8,9,10,11,12,13,14,15].
Assessing the statistical significance of motif enrichment is a fundamental and challenging step of motif discovery, and can severely hamper downstream analytics. Kiesel et al.  pointed out that p values “Small enrichment factors can occur frequently in practice simply due to an imperfect background model that slightly underestimates the expected frequency of occurrence”. In addition, p values are crucial not only in the discovery phases, but also in motif comparison and motif-motif similarity studies . The classical definition of the motif enrichment problem (in terms of differences among motifs occurrences within background genome contents) has been proven to be NP-hard . The p value calculation is not straightforward, and requires making assumptions on a background model of base frequencies and co-occurrence in order to derive a distribution of motif occurrences in reference genomes . Several formulae—approximated and exact—and algorithms for estimating motif count distributions have been devised and implemented [20,21,22,23,24,25,26,27,28]. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large data sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging [19, 25]. Thus, methods for p value estimation can be a bottleneck in large-scale projects. HOMER, Weeder and Peak-motifs do not report motif statistical significance, MEME uses an approximation approach (very conservative), later improved by DREME and the new simple, thorough, rapid, enriched motif elicitation (STREME) [10, 15], and MFMD uses information content score and complexity scores .
A software that provides a comprehensive occurrence and probability estimation is the bioinformatics toolkit for Motif Statistics and Discovery (MoSDi) by Marschall , written in Java, featuring models based on the approximated compound Poisson and nth level Markov order, as well as (quasi-)exact combinatorial formulae to reduce computational complexity (https://bitbucket.org/tobiasmarschall/mosdi). Another tool is motifcounter , an R-Bioconductor library implementing existing methods [27, 32], as well as an improvement on the compound Poisson model. One limitation of these programs is that calculation of occurrence distribution—even using the fast compound Poisson—becomes impractical with longer motifs (10+) and longer reference genomes (millions of bases), besides large motif datasets.
Prosperi et al.  provided an exact formula for counting the distribution of strings that do not overlap with themselves (i.e. non-clumpable), coupled with a mathematical demonstration of its validity, under both Bernoullian and Markovian assumptions. The calculation of the formula was exponential in the genome length by the length of the motif, but the authors demonstrated that it could be calculated efficiently within an arbitrary tolerance level.
This software article describes “motif_prob”, a count distribution tool suitable for long motifs and long reference genomes, implementing the exact method by Prosperi et al.  with the efficient error-bound algorithm. In addition to the relevance of this software piece for large-scale processing, another motivation for our work is that the majority of probability distribution or p value calculators, even the most recent ones, use heuristics. To our knowledge, the formula by Prosperi et al. is still among the most efficient for exact calculation. The proposed motif_prob implementation thus makes exact quantification suitable with large scale projects, and posits to substitute currently employed heuristics. We compare motif_prob with other tools in terms of run time and precision, showing that its exact algorithm is several orders of magnitude faster even than the approximated methods, and finally we describe use cases for long motifs in bacteria.
The exact formula by Prosperi et al. , for the calculation of the frequency distribution j of a string of length m within a text of length n (m < n) over alphabet k, under the Markovian model, is
where P(S) = P(a1) · P(a2 | a1) · … · P(am−1 | am), P(S0,n) = P(S0,n−1) − P(S) · P(S0,n−m), S0,n = S0,n−1 · k − P(S) · km · S0,n−m, d1 … dj+1 are the lengths of the j + 1 segments where the j strings divide the text of length n in exact configurations with d1 + ··· + dj+1 = n − mj, and
Formula (1) has a complexity of O(nj), which becomes quickly intractable. However, by defining R = P(S0,n+1)/P(S0,n) as a constant, Prosperi et al. show that for any positive (arbitrarily small) number ε, there is an index ηε such that for every η > ηε then
By using this approximation, the summation of the original formula can be reduced to a single step, and calculations can be stopped when the ratio P(S0,n)/P(S0,n−1) reaches a desired level of tolerance ε. Specifically, after plugging the iterative approximation (3) in (1), we obtain the final formula
We note that P(j, m, n) is the same irrespective of the position of the nucleotides in a query string, e.g. AACCC and CCCAA have the same probability. This property permits to extrapolate a probability for clumpable strings by permutation, e.g. ACCA into CCAA, although the value is not guaranteed given possible overlap. Another way is to replace the first or the last character with another one that has the same frequency. All details on the derivation of the exact formula and the proof for its progressive approximation, along with comparison against other state-of-art algorithms, can be found in the original work by Prosperi et al. .
Two different implementations are produced: one in Perl and another in C++. Both programs take the same input and parameters, namely: (1) a query string or multiple strings to be analyzed; (2) the length of the reference genome; and (3) the nucleotide frequencies of the genome. In alternative to the genome length and nucleotide frequencies, a FASTA file containing the genome string can be passed as input to the program. The output file reports—for each motif—the count distribution and other summary information including a flag for clumped strings, string probability, and statistics on the precision and tolerance levels.
Since the computational complexity of the formula is exponential, motif occurrences are calculated at increasing counts until the occurrence probability becomes lower than given a tolerance level ε, or the upper limit of counts j is reached. We also control estimates at each iteration in order to avoid issues with floating point operations when frequency/length ratios diverge, and to handle relatively ill-posed configurations. Given the motif m and genome g lengths, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general each case where (1 − P(S))(m−m+1) > (1 − ε). This is equal to (n − m + 1)∙log(1 − P(S)) > log(1 − ε), which implies n > m − 1 + log(1 − ε)/log(1 − P(S)). In the source code, we have set ε to 10−7 and j to 500. Further, we implement the calculation of the expected number of strings and the motif’s (stationary) occurrence probability at any text position, according to Robin et al. .
Figure 1 provides a flowchart of the data processing pipeline, showing the required input specifications, the method’s internal parameters, and the output fields.
An example of the occurrence distribution for motif query sequences of length 6, calculated on a randomly generated genome of 20,000 bases, varying the nucleotide frequencies, is illustrated in Fig. 2. The difference between the equiprobable base and the more general case is evident and demonstrates how the background distribution affects the p value calculation (see real-world use case after the benchmarks).
Table 1 shows run time benchmarks on different motif length and motif set size configurations, executed on a laptop machine with Intel(R) Core(TM) i9-10885H CPU @ 2.4 GHz, 32 GB RAM. Both the Perl and the C++ programs exhibit run times several orders of magnitude smaller than MoSDi, even when the latter is executed with the fast compound Poisson approximation. We set a maximum processing time of 30 min for datasets up to 400,000 motifs, and MoSDi can process them only with smaller values of k and the approximated model, while the exact model is not feasible for most of datasets. The C++ implementation is the fastest, and the expected run time increase due to higher motif lengths is well compensated by the implementation setup.
In terms of precision, we compare the exact probability values yielded by our program with both the compound Poisson and the exact estimates of MoSDi (allowing it to switch automatically to standard/doubling algorithms to improve run time). As previously described, usually the largest errors appears near the probability mass points . For all motif lengths combinations of 4 bases, over a 10,000 bases reference genome, on average the peak probability values of MoSDi and motif_prob exact differ by two orders of magnitude, e.g. if the peak probability is in the range of 10−2 then the observed absolute difference is 10−4. The difference with the compound Poisson approximation is larger, on average double than the exact, but the relative ratio it is still one-two orders of magnitude smaller than the actual values. The difference becomes smaller as the sequence lengths increase.
We further test the concordance among MoSDi and motif_prob using a real motif dataset, the library of DNA-binding site matrices for Escherichia coli (https://arep.med.harvard.edu/ecoli_matrices/), which contains 802 motifs from 67 housekeeping genes for a median motif length of 26 (interquartile range, IQR 20–29). We consider motifs length within 20 bases to be able to estimate non-near-zero probabilities on the genome length of Escherichia coli. The final set includes 230 motifs with a median length 16 (IQR 15–18). The median (IQR) difference between MoSDi and motif_prob exact overall is 2.6·10−8 (2.2·10−8–5.0·10−8), while for all probabilities where the center of mass is not zero (median 0.18), it is 3.8·10–8 (3.3·10−9–2.6·10−7). Once again, the differences with the approximated estimation are larger but of the same level of magnitude. Figure 3 illustrates the absolute difference in probability between motif_prob and MoSDi (exact/compound Poisson) as well as the relative magnitude difference, expressed as the log10(Probmotif_prob/abs(Probmotif_prob − ProbMoSDi)), which well highlights how the difference between the two exact methods (and the compound Poisson too, although larger) is negligible with respect to the actual probability estimates.
As a final use case, we investigate the distribution of frequencies of antimicrobial resistance gene signatures found in bacteria under different GC content. Drug resistance mechanisms in bacteria involve acquisition of genes, often via mobile genetic elements, and in some cases changes within core housekeeping genes. A number of algorithms use k-mers, i.e. motifs of fixed k length, to classify antimicrobial resistance , as they can be handled efficiently through ad hoc data structures suitable to process high-throughput data. But assessing the importance of a k-mer with respect to their frequency in drug resistance genes is not straightforward; one issue is that bacteria and genes can have very different GC content . When the GC content varies, the probability distributions of motif occurrence can change over a broad range (given also the underlying, individual A, C, G, and T content), and thus the p values of over- or under-representation. To show how the quantification can have large variance, we analyze k-mers from antimicrobial resistance genes collected in the MEGARes 2.0 database . MEGARes contains 7868 genes, with an average gene length of 1030.29 nucleotide bases, 57 different antibiotic resistance classes, and 220 distinct resistance mechanisms.
From MEGARes, we select all the 3911 genes conferring resistance to beta-lactamase; we then identify all 13-mers, for a total of 453,308 motifs (50% GC content). In Table 2, we show how the count probability distribution of the 13-mers in MEGARes’ beta-lactamase genes changes among bacterial species present in the human microbiome of respiratory tract , where we select uniformly 18 species on the basis of their GC content. The median probability of finding the aforementioned 13-mers at least once varies between 93 and 99%, and even species with a similar GC content can show different medians and interquartile ranges, such as Stomatobaculum longum (55% GC content, median p = 97%) and Kluyvera intermedia (52% GC content, median p = 93%). This variability is due to: the individual nucleotide content, which can differ even when the GC content is the same, and it directly affects the distribution (see also Fig. 2); the genome length; and the nucleotide content of the query motifs.
The motif_prob software is a multi-platform, open source, efficient solution for calculating exact frequency distributions of (long) motif occurrences in reference genomes using high-throughput data. We showed how our code estimates are consistent with other, slower, exact calculations, and how the run times of our code (both Perl and C++) are competitive even with the non-exact compound Poisson approximation. Specifically, motif_prob is 50–1000× faster than MoSDi exact and 60–120× faster than MoSDi compound Poisson.
The current implementation is limited to non-clumpable strings, although it extrapolates a probability for clumpable strings by permutation. As future development of our work we foresee to develop an exact formula for clumpable strings and to extend the approach to generalize over motifs that can include nucleotide changes, insertions or deletions.
In conclusion, our tool can be effectively used in conjunction with motif discovery suites that process high-throughput data, allowing them to compute exact count distributions and associated p values without loss of run time performance, instead of relying on to approximations.
Luu P-L, Schöler HR, Araúzo-Bravo MJ. Disclosing the crosstalk among DNA methylation, transcription factors, and histone marks in human pluripotent cells through discovery of DNA methylation motifs. Genome Res. 2013;23(12):2013–29.
Luu PL, Schöler HR, Araúzo-Bravo MJ. Disclosing the crosstalk among DNA methylation, transcription factors, and histone marks in human pluripotent cells through discovery of DNA methylation motifs. Genome Res. 2013;23:2013–29.
Pavesi G, Mereghetti P, Mauri G, Pesole G. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004;32(Web Server issue):W199–203.
Marschall T, Rahmann S. Speeding up exact motif discovery by bounding the expected clump size. In: Moulton V, Singh M, editors. Algorithms in bioinformatics. Lecture notes in computer science. Berlin: Springer; 2010. p. 337–49.
Pape UJ, Rahmann S, Sun F, Vingron M. Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J Comput Biol J Comput Mol Cell Biol. 2008;15(6):547–64.
Doster E, Lakin SM, Dean CJ, Wolfe C, Young JG, Boucher C, et al. MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Res. 2020;48:D561–9.
We thank Luciano Prosperi, MSc, and Roberto Di Castro, MSc, for aiding with the implementation and code maintenance.
This work has been supported by the National Institutes of Health (NIH)—National Institute of Allergy and Infectious Diseases (NIAID) Grants No. R01AI141810 and R01AI145552, and by the National Science Foundation (NSF) Grant No. 2013998. The funding body did not have roles in the design of the study and collection, analysis, interpretation of data, and in writing the manuscript.
Authors and Affiliations
Data Intelligence Systems Lab, Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
Mattia Prosperi & Simone Marini
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Dr. Christina Boucher is Associate Editor of BMC Bioinformatics. All other authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.