Efficient motif finding algorithms for large-alphabet inputs
© Pavlovic and Kuksa. 2010
Published: 26 October 2010
Skip to main content
© Pavlovic and Kuksa. 2010
Published: 26 October 2010
We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.
The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families.
Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.
Explicit mapping (voting) algorithms proposed in  use an indicator array V of the maximum size |Σ| k to find motifs through voting. Each length-k substring observed in the input has at most one vote for each input sequence and gives this vote to all of its V(k, m) neighbors. The substrings that occur in every input string will receive N votes and will be included in the output motif set M. The algorithm takes O(k m +1|Σ| m nN) time and requires at least O(k m +1|Σ| m nN) space. The large space requirement of the algorithm restricts its usage to small values of k and m, as well as to small alphabet size |Σ|.
One of the most efficient exact algorithms for motif search, the mismatch tree (MITRA) algorithm , uses efficient trie traversal to find a set of motifs in the input strings. Under a trie-based computation framework [5, 11], the list of k-long contiguous substrings (k-mers) extracted from given strings is traversed in a depth-first search manner with branches corresponding to all possible symbol substitutions from alphabet Σ. Each leaf node at depth k corresponds to a particular k-mer feature (either exact or inexact instance of the observed exact string features) and will contain a list of matching features from each string. The leaf nodes corresponding to motifs will contain instances from all (or almost all) strings. The complexity of the trie-based traversal algorithm for motif finding is O(k m +1|Σ|m nN). Note that the algorithm essentially explores the neighborhood of all O(nN) k-mers in the input.
Another class of efficient algorithms is based on sorting and enumeration . The PMSP algorithm enumerates all possible neighboring k-mers for the first string s 1 and outputs k-mers that occur in every string with Hamming distance at most m, similar to the Voting algorithm . The PMSprune algorithm  employs a more efficient search strategy to traverse the candidate space and is an improvement, in the expected case, over the PMSP. We note that explicit enumeration is employed by all above-mentioned algorithms.
While the exact algorithms focus on retrieving all possible motif patterns, an important issue of estimating significance of the found motif patterns can be addressed with existing techniques as used in, for instance, non-exhaustive algorithms based on stochastic optimization (e.g., MEME ).
In contrast to existing exact exhaustive algorithms, we approach the problem of motif finding by performing an efficient search over patterns with wildcards. As a consequence, the proposed method’s complexity becomes independent of the alphabet size.
Algorithm 1 Motif search algorithm
1. Use multiple rounds of counting sort to iterate over input strings and construct a set of potential motif instances I, k-mers that are at Hamming distance of at most 2m from each string (Algorithm 2).
2. Construct candidate set C by building stem sets H(a, b) for k-mer pairs in I (Algorithm 3)
3. Prune all stems from C that do not satisfy motif property using rounds of counting sort (Algorithm 2, Section ‘Pruning using selection’)
4. Output remaining stems as motifs.
This algorithm uses as its main sub-algorithm (in step 2) a procedure that finds the intersection of k-mer neighborhoods for any pair of the k-mers a, b. This intersection finding algorithm is described in Section ‘Motif generation’. We describe selection and pruning steps (steps 1 and 3) in Section ‘Selection algorithm’.
The overall complexity of the algorithm is where H is the maximum size of H(a, b), and I is the size of I, the number of k-mers used to construct the candidate set C. The important fact that makes our algorithm efficient in practice is that typically I ≪ min(nN, |Σ| k ) and H ≪ V(k, m), particularly for large alphabets. We demonstrate this in our experimental results and provide an expected-size analysis in Section ‘Selection algorithm’.
Algorithm 2 Selection algorithm
Input: set of k-mers with associated sequence index, distance parameter d
Output: set of k-mers at distance d from each input string
1. Pick d positions and remove from the k-mers symbols at the corresponding positions to obtain a set of (k − d)-mers.
2. Use counting sort to order (lexicographically) the resulting set of (k – d)-mers.
3. Scan the sorted list to create the list of all sequences in which k-mers appear.
4. Output the k-mers that appear in every input string.
where p k ,2 m is the probability that two randomly selected k-mers are at distance of at most 2m. For instance, for a set of N = 20 protein sequences (sampled from alphabet |Σ| = 20) of length n = 600 the expected number of potential motifs of length k = 13, m = 4 by chance is about 8, with p 13,8 = 2.9 10−4. Given t implanted motif instances, the average number of k-mers that will be selected from nN input samples, or the expected size of I, is
E[I] = t + nN(1 − (1 − p k ,2 m ) t ) + E[I B ].
Since t and p are typically small, for small pn, E[I] ≪ nN, the number of k-mers in the input. In the protein example above the expected size of I is about 1 + 3 + 8=12 for t = 1, which is orders of magnitude smaller than nN = 12000, signifying the importance of creating I first. This is empirically demonstrated in Section ‘Results and Discussion’.
The sorting approach of Algorithm 2 is also used to select patterns satisfying the motif property from the candidates C (Step 3 in main Algorithm 1). The pruning step is based on verifying the motif property (i.e. whether given patterns match all input sequences with up to m mismatches) and can be accomplished using rounds of counting sort.
In what follows, we describe an efficient algorithm that finds the set of stems that represent the set of k-mers shared by a pair of k-mers a and b. This process is used to create set C from potential instances I, which is subsequently pruned to yield the true motif instances.
The number of k-mers in the common neighborhood of any two particular k-mers a and b assumes a fixed set values depending on the Hamming distance d(a, b) between k-mers , for given values of |Σ|, k, and m. We want to represent the shared k-mers in this intersection using a set of stems, patterns with wildcards. However, the number of stems will not depend on the alphabet size |Σ|.
To find all stems shared by k-mers a and b, consider two sets of positions: mismatch region in which a and b disagree and match region in which a and b agree. We consider two cases depending on the number of mismatch positions (i.e. Hamming distance between a, b). In the first case, the distance d(a, b) is at most m, the maximum number of mismatches allowed. In the second case, the distance d(a, b) exceeds m. When d(a, b) ≤ m, wildcard characters can appear both inside and outside of the mismatch region. When d(a, b) >m, wildcard characters can appear only inside the mismatch regions. Consider for example, the case of d(a, b) = 0 and m = 1. In this case, the set of stems is the set of patterns with 1 wildcard at each of the possible k positions (with the remaining positions as in a) plus one stem with 0 wildcards. When m = 2, and d(a, b) = 1, the set of stems will include patterns with 0 or 1 wildcard in k − d positions and 0 or 1 wildcards in the remaining d = 1 positions. For example, for the pair (tgt, tgc) the corresponding patterns with wildcards are tg?, t??, ?g?, t?c, and ?gc, where ? denotes a wildcard.
Algorithm 3 Stem generation (independent of the alphabet size |Σ|)
Input: pair of k-mers a, b
Output: set of stems (patterns with wildcards) shared by a and b
if if d(a, b) ≤ m then
Set stem = a
Set i = 0 … d positions in the mismatch region of the stem as in b
Place j 1 = 0 … d − i wildcards inside the mismatch region
Place j 2 = 0 … m − max(d − i, j 1 + i) wildcards outside the mismatch region
if d(a, b) >m then
Set stem = a
Fix i = d − m … m positions in the mismatch region of the current stem as in b
Place j = 0 … m − i wild-cards in the remaining d − i positions in the mismatch region
Output resulting stems (patterns with wildcards)
The complexity of the selection step 1 for constructing I is and does not depend on the alphabet size |Σ|. Steps 2 and and 3 have the complexity and again do not depend on |Σ|. As a consequence, the three-step procedure gives us an efficient, alphabet-independent motif search algorithm that outputs all motifs embedded in the input S. Our experiments will next demonstrate that this allows efficient exploitation of sparsity of typical solutions—we explore only a small portion of the motif space by focusing (using Algorithm 2) only on the support samples that are potential instances of the motifs. This results in significant reductions in running times, especially for large-alphabet inputs, i.e. the cases difficult for the current exact motif finding algorithms.
Our proposed framework can be used to reduce search complexity for other exact search-based motif finding algorithms. Existing exhaustive algorithms typically (e.g. [5, 8, 10]) use the entire input (i.e. all the k-mers in the input) and find motif by essentially exploring neighborhoods of every k-mer in the input. Their search complexity can be improved by using a reduced set of k-mers instead of all input samples. This reduced set of k-mers can be obtained using our linear time selection algorithm (Algorithm 2, Section ‘Selection algorithm’). Using reduced set of k-mers, the actual search complexity after the selection step becomes sublinear in the input size (since the number of selected k-mers I = |I| is much smaller than input length O(nN)). For instance, the search complexity of the trie-based algorithms (e.g., ) can be reduced to instead of O(knNV(k, m)), where V(k, m) is O(k m |Σ| m ). This will lead to a more efficient search especially for large-alphabet since a possibly large input (O(knN)) is replaced with a smaller set I of k-mers that match with up to 2m mismatches every string in the input.
We evaluate our algorithms on synthetic benchmark motif finding tasks and real data sets. We first test our algorithms on the planted motif problem commonly used as a benchmark for evaluation of the motif finding algorithms [2, 5, 10]. We then illustrate our method on several DNA and protein sequence data sets.
A planted motif problem  is the task of finding motifs and their instances in a set of sequences with variants of the consensus string (motif) implanted with up to m mismatches in every string. This task represents a well-defined subtle motif discovery problem. Instances of this problem with large number of mutations m are known to be challenging for most of the motif finding algorithms.
We follow the standard setting used in previous studies [2, 5, 10] and generate N = 20 random strings of length n = 600 using iid, uniformly distributed symbols from an alphabet of size |Σ|. We then embed a copy (with up to m substitutions at random positions) of a motif at a random location in every string. The task is then to identify motifs hidden in the input.
Running time comparison on the challenging instances of the planted motif problem (DNA, |Σ| = 4, N = 20 sequences of length n = 600). Problem instances are denoted by (k,m, |Σ|), where k is the length of the motif implanted with m mismatches.
Running time, in seconds, on large-|Σ| inputs. (k, m) instances denote implanted motifs of length k with up to m substitutions.
> 1 month
We use several data sets with experimentally confirmed TF binding sites: CRP, FNR, and LexA. The CRP data set contains 18 DNA sequences of length 105 with one or two CRP-binding sites [2, 14]. The FNR and LexA data sets are obtained from RegulonDB  database and contain 30 and 91 sequences known to have sites of length 14 and 20 bases. The task is to identify the sequence motif corresponding to the binding sites and the positions of sites within sequences.
For CRP, we use relatively long k-mers of length k = 18, with a large number of allowed mismatches (m = 7) from a given set of 18 DNA sequences (|Σ| = 4). For FNR and LexA data sets, we set motif length to k = 14 and k = 16 bases, with the maximum number of mismatches set to m = 4 and m = 6, respectively.
For FNR and LexA motifs, our algorithm correctly finds consensus patterns TTGATnnnnATCAA and CTGTnnnnnnnnnCAG, in line with the validated transcription factor binding sites, with the performance coefficients  of 83.69 and 90.38.
We also apply our algorithm to finding subtle sequence motifs on several protein sequence datasets, a challenging task due to the increased alphabet size (|Σ| = 20) coupled with large k and m.
Zinc metallopeptidase motif. In this experiment, 10 relatively long (average length is 800) human zinc metallopeptidase sequences used to test motif finding. Identification of subtle motifs in this case is made even more challenging by the length of the sequences. We use 11 residues long k-mer with m = 5 mismatches and find sequence motifs with the instance majority VAAHELGHS[GL]G in 9 out of 10 sequences that correspond to previously confirmed locations. We note the large number of mismatches (m = 5) was critical to motif identification.
We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining potentially characterized by large alphabets, such as text mining.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.