Theoretical formulation
The exact formula by Prosperi et al. [28], for the calculation of the frequency distribution j of a string of length m within a text of length n (m < n) over alphabet k, under the Markovian model, is
$$\begin{array}{*{20}c} {P\left( {j,m,n} \right) = P\left( S \right)^{j} \mathop \sum \limits_{z = 1}^{{\left| {C_{n,m,j} } \right|}} \mathop \prod \limits_{y = 1}^{j + 1} P\left( {S_{{0,d_{yz} }} } \right),} \\ \end{array}$$
(1)
where P(S) = P(a1) · P(a2 | a1) · … · P(am−1 | am), P(S0,n) = P(S0,n−1) − P(S) · P(S0,n−m), S0,n = S0,n−1 · k − P(S) · km · S0,n−m, d1 … dj+1 are the lengths of the j + 1 segments where the j strings divide the text of length n in exact configurations with d1 + ··· + dj+1 = n − mj, and
$$\begin{array}{*{20}c} {\left| {C_{n,m,j} } \right| = \left( {\begin{array}{*{20}c} {n + j\left( {1 - m} \right)} \\ {n - mj} \\ \end{array} } \right).} \\ \end{array}$$
(2)
Formula (1) has a complexity of O(nj), which becomes quickly intractable. However, by defining R = P(S0,n+1)/P(S0,n) as a constant, Prosperi et al. show that for any positive (arbitrarily small) number ε, there is an index ηε such that for every η > ηε then
$$\begin{array}{*{20}c} {P\left( {S_{0,n + x} } \right) \sim P\left( {S_{0,n} } \right) \cdot R^{x} . } \\ \end{array}$$
(3)
By using this approximation, the summation of the original formula can be reduced to a single step, and calculations can be stopped when the ratio P(S0,n)/P(S0,n−1) reaches a desired level of tolerance ε. Specifically, after plugging the iterative approximation (3) in (1), we obtain the final formula
$$\begin{array}{*{20}c} {P\left( {j,m,n} \right) \sim P\left( S \right)^{j} \cdot R^{{n - mj - n_{\varepsilon } \left( {j + 1} \right)}} \cdot P\left( {S_{{\left( {0,n_{\varepsilon } } \right)}} } \right)^{j + 1} \cdot \left( {\begin{array}{*{20}c} {n + j\left( {1 - m} \right)} \\ {n - mj} \\ \end{array} } \right).} \\ \end{array}$$
(4)
We note that P(j, m, n) is the same irrespective of the position of the nucleotides in a query string, e.g. AACCC and CCCAA have the same probability. This property permits to extrapolate a probability for clumpable strings by permutation, e.g. ACCA into CCAA, although the value is not guaranteed given possible overlap. Another way is to replace the first or the last character with another one that has the same frequency. All details on the derivation of the exact formula and the proof for its progressive approximation, along with comparison against other state-of-art algorithms, can be found in the original work by Prosperi et al. [28].
Implementation
Two different implementations are produced: one in Perl and another in C++. Both programs take the same input and parameters, namely: (1) a query string or multiple strings to be analyzed; (2) the length of the reference genome; and (3) the nucleotide frequencies of the genome. In alternative to the genome length and nucleotide frequencies, a FASTA file containing the genome string can be passed as input to the program. The output file reports—for each motif—the count distribution and other summary information including a flag for clumped strings, string probability, and statistics on the precision and tolerance levels.
Since the computational complexity of the formula is exponential, motif occurrences are calculated at increasing counts until the occurrence probability becomes lower than given a tolerance level ε, or the upper limit of counts j is reached. We also control estimates at each iteration in order to avoid issues with floating point operations when frequency/length ratios diverge, and to handle relatively ill-posed configurations. Given the motif m and genome g lengths, one can set a tolerance level ε such that P(0, m, n) > (1 − ε), and in general each case where (1 − P(S))(m−m+1) > (1 − ε). This is equal to (n − m + 1)∙log(1 − P(S)) > log(1 − ε), which implies n > m − 1 + log(1 − ε)/log(1 − P(S)). In the source code, we have set ε to 10−7 and j to 500. Further, we implement the calculation of the expected number of strings and the motif’s (stationary) occurrence probability at any text position, according to Robin et al. [33].
Figure 1 provides a flowchart of the data processing pipeline, showing the required input specifications, the method’s internal parameters, and the output fields.
The source code, documentation, sample datasets, and executable files are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.