Table 1 Notations used in this paper

From: SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets

Notation

Explanation

|x|

The length of a string or the size of a set.

Σ

The DNA alphabet, Σ = {A, C, G, T}.

l-mer

An l-length string over Σ.

s[i]

The ith character in the string s.

s[i..j]

A substring of the string s from the ith position to the jth position.

ss’

The concatenation of two strings s and s’.

x ∈_l s

The string x is an l-length substring of the string s. In other words, x is an l-mer in the string s.

x ∈_l D

The string x is an l-length substring of the sequence set D. In other words, there exists s ∈ D such that x is an l-mer in s.

D = {s_1, s_2, …, s_t}, t, n, q, l, d

Notations for the input. D is the input DNA sequence set, where each sequence s_i is an n-length string over Σ; t = |D|; n = |s_i| for 1 ≤ i ≤ t; q is the proportion of the input sequences in D that contain motif instances; l is the motif length; and d is the maximum number of mismatches between a motif and its instance.

D’, t’, q’

Notations for the output. D’ is a sample sequence set selected from D, i.e., D’ ⊆ D; t’ = |D’|; q’ is the proportion of the sequences in D’ that contain motif instances.

count_k(x)

The count (number of occurrences) of a string x in D with up to k mismatches, as defined in (4).

count(x)

The count (number of occurrences) of a string x in D.

d_H(y, x)

The Hamming distance between two strings y and x of equal length.

B_k(x)

The set of k-neighbors of a string x, i.e., the set of strings with Hamming distance no more than k from x. B_k(x) = {y : y ∈ Σ^|x|, d_H(y, x) ≤ k}.

stn(y)

The integer obtained by conversion from a string y over Σ. The characters A, C, G and T are converted to the binary numbers 00, 01, 10 and 11, respectively. Because of the need to compute count_k(y), y is first reversed and then converted to an integer. For example, if y = AC, then y is converted to the binary number 0100, i.e., the decimal number 4.
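
Several of the notations above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names hamming, k_neighbors, count_k and stn are chosen here only for illustration, and count_k follows a naive reading of the verbal definition in the table (it counts l-mer positions in D within Hamming distance k of x) rather than reproducing formula (4), which is not shown in this table.

```python
from itertools import combinations, product

SIGMA = "ACGT"                                  # the DNA alphabet
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # 2-bit encoding from the table


def hamming(y, x):
    """d_H(y, x): number of mismatching positions of two equal-length strings."""
    if len(y) != len(x):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(y, x))


def k_neighbors(x, k):
    """B_k(x): all strings over SIGMA of length |x| with d_H(y, x) <= k."""
    result = set()
    for m in range(min(k, len(x)) + 1):              # exactly m substituted positions
        for positions in combinations(range(len(x)), m):
            # at each chosen position, try the three other characters
            alternatives = [[c for c in SIGMA if c != x[p]] for p in positions]
            for subs in product(*alternatives):
                chars = list(x)
                for p, c in zip(positions, subs):
                    chars[p] = c
                result.add("".join(chars))
    return result


def count_k(x, D, k):
    """Assumed reading of count_k(x): l-mer positions in D within distance k of x."""
    l = len(x)
    return sum(
        hamming(s[i:i + l], x) <= k
        for s in D
        for i in range(len(s) - l + 1)
    )


def stn(y):
    """Reverse y, then pack two bits per character into an integer."""
    value = 0
    for ch in reversed(y):
        value = (value << 2) | CODE[ch]
    return value


print(stn("AC"))                   # 4, matching the example in the table
print(len(k_neighbors("AC", 1)))   # 7 = 1 + 2 * 3
```

With k = 0, count_k reduces to the exact-match count(x) from the table. The explicit enumeration in k_neighbors is only practical for small k and short strings, since the size of B_k(x) grows quickly with both.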