Notation | Explanation |
---|---|
$\lvert x \rvert$ | The length of a string x or the size of a set x. |
Σ | The DNA alphabet, Σ = {A, C, G, T}. |
l-mer | An l-length string over Σ. |
s[i] | The ith character in the string s. |
s[i..j] | A substring of the string s from the ith position to the jth position. |
s∙s’ | The concatenation of two strings s and s’. |
$x \in_l s$ | The string x is an l-length substring of the string s. In other words, x is an l-mer in the string s. |
$x \in_l D$ | The string x is an l-length substring of some sequence in the set D. In other words, there exists s ∈ D such that $x \in_l s$. |
$D = \{s_1, s_2, \ldots, s_t\}$, t, n, q, l, d | Notations for the input. D is the input DNA sequence set, where each sequence $s_i$ is an n-length string over Σ; $t = \lvert D \rvert$; $n = \lvert s_i \rvert$ for $1 \le i \le t$; q is the proportion of the sequences in D that contain motif instances; l is the motif length; and d is the maximum number of mismatches between a motif and its instance. |
D’, t’, q’ | Notations for the output. D’ is a sample sequence set selected from D, i.e., D’ ⊂ D; $t' = \lvert D' \rvert$; q’ is the proportion of the sequences in D’ that contain motif instances. |
$count_k(x)$ | The count (number of occurrences) of a string x in D with up to k mismatches, as given by (4). |
count(x) | The count (number of occurrences) of a string x in D. |
$d_H(y, x)$ | The Hamming distance between two strings y and x of equal length. |
$B_k(x)$ | The set of k-neighbors of a string x, i.e., the set of strings with Hamming distance no more than k from x: $B_k(x) = \{y : y \in \Sigma^{\lvert x \rvert},\ d_H(y, x) \le k\}$. |
stn(y) | The integer obtained by conversion from a string y over Σ. The characters A, C, G and T are converted to the binary numbers 00, 01, 10 and 11, respectively. Because of the need to compute $count_k(y)$, y is first reversed and then converted to an integer. For example, if y = AC, then y is reversed to CA and converted to the binary number 0100, i.e., the decimal number 4. |
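
Several of the definitions in the table can be illustrated with a short Python sketch. The function names `stn`, `hamming`, `neighbors` and `count_k` are chosen here only to mirror the notation; they are not taken from the authors' implementation. Since equation (4) is not reproduced in this table, `count_k` assumes the plain reading of the definition: the number of l-length windows over all sequences in D whose Hamming distance from x is at most k.

```python
from itertools import combinations, product

SIGMA = "ACGT"
# Two-bit codes from the table: A -> 00, C -> 01, G -> 10, T -> 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def stn(y):
    """Reverse y, then read it as a binary number with 2 bits per character."""
    value = 0
    for ch in reversed(y):
        value = (value << 2) | CODE[ch]
    return value


def hamming(y, x):
    """d_H(y, x): number of mismatching positions of two equal-length strings."""
    if len(y) != len(x):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(y, x))


def neighbors(x, k):
    """B_k(x): all strings over SIGMA within Hamming distance k of x."""
    result = set()
    for m in range(min(k, len(x)) + 1):
        # Choose exactly m positions to change, then every possible substitution.
        for positions in combinations(range(len(x)), m):
            alternatives = [[c for c in SIGMA if c != x[p]] for p in positions]
            for subs in product(*alternatives):
                y = list(x)
                for p, c in zip(positions, subs):
                    y[p] = c
                result.add("".join(y))
    return result


def count_k(x, D, k=0):
    """Assumed reading of count_k(x): windows of D within k mismatches of x."""
    l = len(x)
    return sum(
        1
        for s in D
        for i in range(len(s) - l + 1)
        if hamming(s[i:i + l], x) <= k
    )
```

With the table's own example, `stn("AC")` evaluates to 4. Calling `count_k` with `k=0` corresponds to count(x), the number of exact occurrences.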