Skip to main content

Fast and exact quantification of motif occurrences in biological sequences

Abstract

Background

Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and posit to substitute currently employed heuristics.

Results

We implement motif_prob in both Perl and C+ + languages, using an efficient error-bound iterative process for the exact formula, providing comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision, run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million of motifs (13–31 bases) over genome lengths of 5 million bases within the minute on a regular laptop, and the run times for both the Perl and C+ + code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the code sources are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.

Conclusions

The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.

Background

Motif discovery and characterization are important for the study of gene evolution, duplication, transcription sites, and protein identification [1], as well as of genetic diseases caused by unstable repeat expansion [2, 3].

Several tools have been developed for de novo motif discovery [4,5,6]—including discriminative regular expression motif elicitation (DREME), hypergeometric optimization of motif enrichment (HOMER), multiple expectation maximizations for motif elicitation (MEME), the memetic framework for motif discovery (MFMD), peak-motifs, prosampler, regulatory sequence analysis tools (RSAT), Trawler Web, and Weeder—either generic or specialized, e.g. for ChIP-seq data [7,8,9,10,11,12,13,14,15].

Assessing the statistical significance of motif enrichment is a fundamental and challenging step of motif discovery, and can severely hamper downstream analytics. Kiesel et al. [16] pointed out that p values “Small enrichment factors can occur frequently in practice simply due to an imperfect background model that slightly underestimates the expected frequency of occurrence”. In addition, p values are crucial not only in the discovery phases, but also in motif comparison and motif-motif similarity studies [17]. The classical definition of the motif enrichment problem (in terms of differences among motifs occurrences within background genome contents) has been proven to be NP-hard [18]. The p value calculation is not straightforward, and requires making assumptions on a background model of base frequencies and co-occurrence in order to derive a distribution of motif occurrences in reference genomes [19]. Several formulae—approximated and exact—and algorithms for estimating motif count distributions have been devised and implemented [20,21,22,23,24,25,26,27,28]. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to be used on large data sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging [19, 25]. Thus, methods for p value estimation can be a bottleneck in large-scale projects. HOMER, Weeder and Peak-motifs do not report motif statistical significance, MEME uses an approximation approach (very conservative), later improved by DREME and the new simple, thorough, rapid, enriched motif elicitation (STREME) [10, 15], and MFMD uses information content score and complexity scores [29].

A software that provides a comprehensive occurrence and probability estimation is the bioinformatics toolkit for Motif Statistics and Discovery (MoSDi) by Marschall [30], written in Java, featuring models based on the approximated compound Poisson and nth level Markov order, as well as (quasi-)exact combinatorial formulae to reduce computational complexity (https://bitbucket.org/tobiasmarschall/mosdi). Another tool is motifcounter [31], an R-Bioconductor library implementing existing methods [27, 32], as well as an improvement on the compound Poisson model. One limitation of these programs is that calculation of occurrence distribution—even using the fast compound Poisson—becomes impractical with longer motifs (10+) and longer reference genomes (millions of bases), besides large motif datasets.

Prosperi et al. [28] provided an exact formula for counting the distribution of strings that do not overlap with themselves (i.e. non-clumpable), coupled with a mathematical demonstration of its validity, under both Bernoullian and Markovian assumptions. The calculation of the formula was exponential in the genome length by the length of the motif, but the authors demonstrated that it could be calculated efficiently within an arbitrary tolerance level.

This software article describes “motif_prob”, a count distribution tool suitable for long motifs and long reference genomes, implementing the exact method by Prosperi et al. [28] with the efficient error-bound algorithm. In addition to the relevance of this software piece for large-scale processing, another motivation for our work is that the majority of probability distribution or p value calculators, even the most recent ones, use heuristics. To our knowledge, the formula by Prosperi et al. is still among the most efficient for exact calculation. The proposed motif_prob implementation thus makes exact quantification suitable with large scale projects, and posits to substitute currently employed heuristics. We compare motif_prob with other tools in terms of run time and precision, showing that its exact algorithm is several orders of magnitude faster even than the approximated methods, and finally we describe use cases for long motifs in bacteria.

Methods

Theoretical formulation

The exact formula by Prosperi et al. [28], for the calculation of the frequency distribution j of a string of length m within a text of length n (m < n) over alphabet k, under the Markovian model, is

$$\begin{array}{*{20}c} {P\left( {j,m,n} \right) = P\left( S \right)^{j} \mathop \sum \limits_{z = 1}^{{\left| {C_{n,m,j} } \right|}} \mathop \prod \limits_{y = 1}^{j + 1} P\left( {S_{{0,d_{yz} }} } \right),} \\ \end{array}$$
(1)

where P(S) = P(a1) · P(a2 | a1) · … · P(am1 | am), P(S0,n) = P(S0,n1) − P(S) · P(S0,nm), S0,n = S0,n1 · k − P(S) · km · S0,nm, d1 … dj+1 are the lengths of the j + 1 segments where the j strings divide the text of length n in exact configurations with d1 + ··· + dj+1 = n − mj, and

$$\begin{array}{*{20}c} {\left| {C_{n,m,j} } \right| = \left( {\begin{array}{*{20}c} {n + j\left( {1 - m} \right)} \\ {n - mj} \\ \end{array} } \right).} \\ \end{array}$$
(2)

Formula (1) has a complexity of O(nj), which becomes quickly intractable. However, by defining R = P(S0,n+1)/P(S0,n) as a constant, Prosperi et al. show that for any positive (arbitrarily small) number ε, there is an index ηε such that for every η > ηε then

$$\begin{array}{*{20}c} {P\left( {S_{0,n + x} } \right) \sim P\left( {S_{0,n} } \right) \cdot R^{x} . } \\ \end{array}$$
(3)

By using this approximation, the summation of the original formula can be reduced to a single step, and calculations can be stopped when the ratio P(S0,n)/P(S0,n−1) reaches a desired level of tolerance ε. Specifically, after plugging the iterative approximation (3) in (1), we obtain the final formula

$$\begin{array}{*{20}c} {P\left( {j,m,n} \right) \sim P\left( S \right)^{j} \cdot R^{{n - mj - n_{\varepsilon } \left( {j + 1} \right)}} \cdot P\left( {S_{{\left( {0,n_{\varepsilon } } \right)}} } \right)^{j + 1} \cdot \left( {\begin{array}{*{20}c} {n + j\left( {1 - m} \right)} \\ {n - mj} \\ \end{array} } \right).} \\ \end{array}$$
(4)

We note that P(j, m, n) is the same irrespective of the position of the nucleotides in a query string, e.g. AACCC and CCCAA have the same probability. This property permits to extrapolate a probability for clumpable strings by permutation, e.g. ACCA into CCAA, although the value is not guaranteed given possible overlap. Another way is to replace the first or the last character with another one that has the same frequency. All details on the derivation of the exact formula and the proof for its progressive approximation, along with comparison against other state-of-art algorithms, can be found in the original work by Prosperi et al. [28].

Implementation

Two different implementations are produced: one in Perl and another in C++. Both programs take the same input and parameters, namely: (1) a query string or multiple strings to be analyzed; (2) the length of the reference genome; and (3) the nucleotide frequencies of the genome. In alternative to the genome length and nucleotide frequencies, a FASTA file containing the genome string can be passed as input to the program. The output file reports—for each motif—the count distribution and other summary information including a flag for clumped strings, string probability, and statistics on the precision and tolerance levels.

Since the computational complexity of the formula is exponential, motif occurrences are calculated at increasing counts until the occurrence probability becomes lower than given a tolerance level ε, or the upper limit of counts j is reached. We also control estimates at each iteration in order to avoid issues with floating point operations when frequency/length ratios diverge, and to handle relatively ill-posed configurations. Given the motif m and genome g lengths, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general each case where (1 − P(S))(m−m+1) > (1 − ε). This is equal to (n − m + 1)∙log(1 − P(S)) > log(1 − ε), which implies n > m − 1 + log(1 − ε)/log(1 − P(S)). In the source code, we have set ε to 10−7 and j to 500. Further, we implement the calculation of the expected number of strings and the motif’s (stationary) occurrence probability at any text position, according to Robin et al. [33].

Figure 1 provides a flowchart of the data processing pipeline, showing the required input specifications, the method’s internal parameters, and the output fields.

Fig. 1
figure1

Flowchart of the data processing pipeline for motif_prob, with input/output specifications and program parameters

The source code, documentation, sample datasets, and executable files are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.

Results

An example of the occurrence distribution for motif query sequences of length 6, calculated on a randomly generated genome of 20,000 bases, varying the nucleotide frequencies, is illustrated in Fig. 2. The difference between the equiprobable base and the more general case is evident and demonstrates how the background distribution affects the p value calculation (see real-world use case after the benchmarks).

Fig. 2
figure2

Application output for motif sequences of length 6 over a genome of length 20,000, and different nucleotide frequencies

Table 1 shows run time benchmarks on different motif length and motif set size configurations, executed on a laptop machine with Intel(R) Core(TM) i9-10885H CPU @ 2.4 GHz, 32 GB RAM. Both the Perl and the C++ programs exhibit run times several orders of magnitude smaller than MoSDi, even when the latter is executed with the fast compound Poisson approximation. We set a maximum processing time of 30 min for datasets up to 400,000 motifs, and MoSDi can process them only with smaller values of k and the approximated model, while the exact model is not feasible for most of datasets. The C++ implementation is the fastest, and the expected run time increase due to higher motif lengths is well compensated by the implementation setup.

Table 1 Run time (mm:ss) of the Perl and C++ programs compared to MoSDi (exact and approximated using compound Poisson) for calculating the occurrence distribution for s motif query sequences of length m (13–31) over a reference genome of 5 million bases

In terms of precision, we compare the exact probability values yielded by our program with both the compound Poisson and the exact estimates of MoSDi (allowing it to switch automatically to standard/doubling algorithms to improve run time). As previously described, usually the largest errors appears near the probability mass points [28]. For all motif lengths combinations of 4 bases, over a 10,000 bases reference genome, on average the peak probability values of MoSDi and motif_prob exact differ by two orders of magnitude, e.g. if the peak probability is in the range of 10−2 then the observed absolute difference is 10−4. The difference with the compound Poisson approximation is larger, on average double than the exact, but the relative ratio it is still one-two orders of magnitude smaller than the actual values. The difference becomes smaller as the sequence lengths increase.

We further test the concordance among MoSDi and motif_prob using a real motif dataset, the library of DNA-binding site matrices for Escherichia coli (https://arep.med.harvard.edu/ecoli_matrices/), which contains 802 motifs from 67 housekeeping genes for a median motif length of 26 (interquartile range, IQR 20–29). We consider motifs length within 20 bases to be able to estimate non-near-zero probabilities on the genome length of Escherichia coli. The final set includes 230 motifs with a median length 16 (IQR 15–18). The median (IQR) difference between MoSDi and motif_prob exact overall is 2.6·10−8 (2.2·10−8–5.0·10−8), while for all probabilities where the center of mass is not zero (median 0.18), it is 3.8·10–8 (3.3·10−9–2.6·10−7). Once again, the differences with the approximated estimation are larger but of the same level of magnitude. Figure 3 illustrates the absolute difference in probability between motif_prob and MoSDi (exact/compound Poisson) as well as the relative magnitude difference, expressed as the log10(Probmotif_prob/abs(Probmotif_prob − ProbMoSDi)), which well highlights how the difference between the two exact methods (and the compound Poisson too, although larger) is negligible with respect to the actual probability estimates.

Fig. 3
figure3

Comparison between motif_prob and MoSDi (exact and approximated with compound Poisson) in terms of concordance of probability estimates. Panel A shows the absolute difference in peak probability values, stratified by motif length and probability mass value, while panel B shows the relative magnitude difference by motif length

As a final use case, we investigate the distribution of frequencies of antimicrobial resistance gene signatures found in bacteria under different GC content. Drug resistance mechanisms in bacteria involve acquisition of genes, often via mobile genetic elements, and in some cases changes within core housekeeping genes. A number of algorithms use k-mers, i.e. motifs of fixed k length, to classify antimicrobial resistance [34], as they can be handled efficiently through ad hoc data structures suitable to process high-throughput data. But assessing the importance of a k-mer with respect to their frequency in drug resistance genes is not straightforward; one issue is that bacteria and genes can have very different GC content [35]. When the GC content varies, the probability distributions of motif occurrence can change over a broad range (given also the underlying, individual A, C, G, and T content), and thus the p values of over- or under-representation. To show how the quantification can have large variance, we analyze k-mers from antimicrobial resistance genes collected in the MEGARes 2.0 database [36]. MEGARes contains 7868 genes, with an average gene length of 1030.29 nucleotide bases, 57 different antibiotic resistance classes, and 220 distinct resistance mechanisms.

From MEGARes, we select all the 3911 genes conferring resistance to beta-lactamase; we then identify all 13-mers, for a total of 453,308 motifs (50% GC content). In Table 2, we show how the count probability distribution of the 13-mers in MEGARes’ beta-lactamase genes changes among bacterial species present in the human microbiome of respiratory tract [37], where we select uniformly 18 species on the basis of their GC content. The median probability of finding the aforementioned 13-mers at least once varies between 93 and 99%, and even species with a similar GC content can show different medians and interquartile ranges, such as Stomatobaculum longum (55% GC content, median p = 97%) and Kluyvera intermedia (52% GC content, median p = 93%). This variability is due to: the individual nucleotide content, which can differ even when the GC content is the same, and it directly affects the distribution (see also Fig. 2); the genome length; and the nucleotide content of the query motifs.

Table 2 Median (interquartile range, IQR) probability of finding at least once 13-mer motifs (top-frequent among beta-lactamase resistance genes) in the MEGARes database over different bacterial species characterized by heterogeneous GC content

Conclusion

The motif_prob software is a multi-platform, open source, efficient solution for calculating exact frequency distributions of (long) motif occurrences in reference genomes using high-throughput data. We showed how our code estimates are consistent with other, slower, exact calculations, and how the run times of our code (both Perl and C++) are competitive even with the non-exact compound Poisson approximation. Specifically, motif_prob is 50–1000× faster than MoSDi exact and 60–120× faster than MoSDi compound Poisson.

The current implementation is limited to non-clumpable strings, although it extrapolates a probability for clumpable strings by permutation. As future development of our work we foresee to develop an exact formula for clumpable strings and to extend the approach to generalize over motifs that can include nucleotide changes, insertions or deletions.

In conclusion, our tool can be effectively used in conjunction with motif discovery suites that process high-throughput data, allowing them to compute exact count distributions and associated p values without loss of run time performance, instead of relying on to approximations.

Availability and requirements

Project name: motif_prob

Project home page: https://github.com/DataIntellSystLab/motif_prob

Operating system(s): Multi-platform (UNIX/Linux/Mac, Windows)

Programming language: Perl, C++

Other requirements: None

License: MIT

Any restrictions to use by non-academics: Permissible under the terms of the MIT license.

Availability of data and materials

The datasets generated and/or analysed during the current study are available in the motif_prob GitHub repository (https://github.com/DataIntellSystLab/motif_prob), the MEGARes database (https://megares.meglab.org/) and the website for DNA-binding site matrices for Escherichia coli (https://arep.med.harvard.edu/ecoli_matrices/).

Abbreviations

DREME:

Discriminative regular expression motif elicitation

HOMER:

Hypergeometric optimization of motif enrichment

MEME:

Multiple expectation maximizations for motif elicitation

MFMD:

Memetic framework for motif discovery

RSAT:

Regulatory sequence analysis tools

STREME:

Simple, thorough, rapid, enriched motif elicitation

MoSDi:

Motif statistics and discovery

mm:ss:

Minutes:seconds

IQR:

Interquartile range

References

  1. 1.

    Luu P-L, Schöler HR, Araúzo-Bravo MJ. Disclosing the crosstalk among DNA methylation, transcription factors, and histone marks in human pluripotent cells through discovery of DNA methylation motifs. Genome Res. 2013;23(12):2013–29.

    CAS  Article  Google Scholar 

  2. 2.

    Gatchel JR, Zoghbi HY. Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet. 2005;6:743–55.

    CAS  Article  Google Scholar 

  3. 3.

    Luu PL, Schöler HR, Araúzo-Bravo MJ. Disclosing the crosstalk among DNA methylation, transcription factors, and histone marks in human pluripotent cells through discovery of DNA methylation motifs. Genome Res. 2013;23:2013–29.

    CAS  Article  Google Scholar 

  4. 4.

    Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23(1):137–44.

    CAS  Article  Google Scholar 

  5. 5.

    Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci. 2018;1(466):25–43.

    Article  Google Scholar 

  6. 6.

    Hashim FA, Mabrouk MS, Al-Atabany W. Review of different sequence motif finding algorithms. Avicenna J Med Biotechnol. 2019;11(2):130–48.

    PubMed  PubMed Central  Google Scholar 

  7. 7.

    Pavesi G, Mereghetti P, Mauri G, Pesole G. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 2004;32(Web Server issue):W199–203.

    CAS  Article  Google Scholar 

  8. 8.

    Ettwiller L, Paten B, Ramialison M, Birney E, Wittbrodt J. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nat Methods. 2007;4(7):563–5.

    CAS  Article  Google Scholar 

  9. 9.

    Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(Web Server issue):W202–8.

    CAS  Article  Google Scholar 

  10. 10.

    Bailey TL. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011;27(12):1653–9.

    CAS  Article  Google Scholar 

  11. 11.

    Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012;40(4):e31–e31.

    CAS  Article  Google Scholar 

  12. 12.

    Dang LT, Tondl M, Chiu MHH, Revote J, Paten B, Tano V, et al. TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets. BMC Genomics. 2018;19(1):238.

    Article  Google Scholar 

  13. 13.

    Caldonazzo Garbelini JM, Kashiwabara AY, Sanches DS. Sequence motif finder using memetic algorithm. BMC Bioinform. 2018;19(1):4.

    Article  Google Scholar 

  14. 14.

    Li Y, Ni P, Zhang S, Li G, Su Z. ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Berger B, editor. Bioinformatics. 2019;35(22):4632–9.

    CAS  Article  Google Scholar 

  15. 15.

    Bailey TL. STREME: accurate and versatile sequence motif discovery. bioRxiv. 2020;2020.11.23.394619.

  16. 16.

    Kiesel A, Roth C, Ge W, Wess M, Meier M, Söding J. The BaMM web server for de-novo motif discovery and regulatory sequence analysis. Nucleic Acids Res. 2018;46(W1):W215–20.

    CAS  Article  Google Scholar 

  17. 17.

    Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8(2):R24.

    Article  Google Scholar 

  18. 18.

    Finding similar regions in many strings|Proceedings of the thirty-first annual ACM symposium on Theory of Computing [Internet]. [cited 2021 May 28]. https://doi.org/10.1145/301250.301376.

  19. 19.

    Zhang J, Jiang B, Li M, Tromp J, Zhang X, Zhang MQ. Computing exact p values for DNA motifs. Bioinformatics. 2007;23(5):531–7.

    CAS  Article  Google Scholar 

  20. 20.

    Gentleman JF, Mullin RC. The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics. 1989;45(1):35–52.

    CAS  Article  Google Scholar 

  21. 21.

    Régnier M. A unified approach to word occurrence probabilities. Discrete Appl Math. 2000;104(1):259–80.

    Article  Google Scholar 

  22. 22.

    Nicodème P, Salvy B, Flajolet P. Motif statistics. Theor Comput Sci. 2002;287(2):593–617.

    Article  Google Scholar 

  23. 23.

    Robin S, Daudin J-J, Richard H, Sagot M-F, Schbath S. Occurrence probability of structured motifs in random sequences. J Comput Biol J Comput Mol Cell Biol. 2002;9(6):761–73.

    CAS  Article  Google Scholar 

  24. 24.

    Rivals E, Rahmann S. Combinatorics of periods in strings. J Comb Theory Ser A. 2003;104(1):95–113.

    Article  Google Scholar 

  25. 25.

    Bejerano G, Friedman N, Tishby N. Efficient exact p-value computation for small sample, sparse, and surprising categorical data. J Comput Biol J Comput Mol Cell Biol. 2004;11(5):867–86.

    CAS  Article  Google Scholar 

  26. 26.

    Lladser ME, Betterton MD, Knight R. Multiple pattern matching: a Markov chain approach. J Math Biol. 2008;56(1):51–92.

    PubMed  Google Scholar 

  27. 27.

    Marschall T, Rahmann S. Efficient exact motif discovery. Bioinformatics. 2009;25(12):i356–64.

    CAS  Article  Google Scholar 

  28. 28.

    Prosperi MCF, Prosperi L, Gray RR, Salemi M. On counting the frequency distribution of string motifs in molecular sequences. Int J Biomath. 2012;5:1250055.

    Article  Google Scholar 

  29. 29.

    Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB, Onyia JE, et al. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. 2004;32(13):3826–35.

    CAS  Article  Google Scholar 

  30. 30.

    Marschall T, Rahmann S. Speeding up exact motif discovery by bounding the expected clump size. In: Moulton V, Singh M, editors. Algorithms in bioinformatics. Lecture notes in computer science. Berlin: Springer; 2010. p. 337–49.

    Chapter  Google Scholar 

  31. 31.

    Kopp W. motifcounter: R package for analysing TFBSs in DNA sequences [Internet]. Bioconductor version: Release (3.12); 2021 [cited 2021 Mar 17]. https://bioconductor.org/packages/motifcounter/.

  32. 32.

    Pape UJ, Rahmann S, Sun F, Vingron M. Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J Comput Biol J Comput Mol Cell Biol. 2008;15(6):547–64.

    CAS  Article  Google Scholar 

  33. 33.

    DNA, Words and Models: Statistics of Exceptional Words by S. Robin, F. Rodolphe, S. Schbath | 9780521847292 | Hardcover | Barnes & Noble® [Internet]. [cited 2021 Mar 17]. https://www.barnesandnoble.com/w/dna-words-and-models-s-robin/1110953123.

  34. 34.

    Clausen PTLC, Zankari E, Aarestrup FM, Lund O. Benchmarking of methods for identification of antimicrobial resistance genes in bacterial whole genome data. J Antimicrob Chemother. 2016;71:2484–8.

    CAS  Article  Google Scholar 

  35. 35.

    Hildebrand F, Meyer A, Eyre-Walker A. Evidence of selection upon genomic GC-content in bacteria. PLoS Genet. 2010;6:e1001107.

    Article  Google Scholar 

  36. 36.

    Doster E, Lakin SM, Dean CJ, Wolfe C, Young JG, Boucher C, et al. MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Res. 2020;48:D561–9.

    CAS  Article  Google Scholar 

  37. 37.

    Ibironke O, McGuinness LR, Lu S-E, Wang Y, Hussain S, Weisel CP, et al. Species-level evaluation of the human respiratory microbiome. GigaScience. 2020;9:giaa038. https://doi.org/10.1093/gigascience/giaa038.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

We thank Luciano Prosperi, MSc, and Roberto Di Castro, MSc, for aiding with the implementation and code maintenance.

Funding

This work has been supported by the National Institutes of Health (NIH)—National Institute of Allergy and Infectious Diseases (NIAID) Grants No. R01AI141810 and R01AI145552, and by the National Science Foundation (NSF) Grant No. 2013998. The funding body did not have roles in the design of the study and collection, analysis, interpretation of data, and in writing the manuscript.

Author information

Affiliations

Authors

Contributions

MP conceived idea, wrote paper and code; SM wrote code and set up web repository; CB reviewed algorithm, wrote the paper. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Mattia Prosperi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Dr. Christina Boucher is Associate Editor of BMC Bioinformatics. All other authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Prosperi, M., Marini, S. & Boucher, C. Fast and exact quantification of motif occurrences in biological sequences. BMC Bioinformatics 22, 445 (2021). https://doi.org/10.1186/s12859-021-04355-6

Download citation

Keywords

  • Bioinformatics
  • Motifs
  • Probability distribution
  • Markov model