Volume 11 Supplement 8
Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB)
Efficient motif finding algorithms for large-alphabet inputs
- Pavel P Kuksa^{1} and
- Vladimir Pavlovic^{1}Email author
DOI: 10.1186/1471-2105-11-S8-S1
© Pavlovic and Kuksa; licensee BioMed Central Ltd. 2010
Published: 26 October 2010
Abstract
Background
We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.
Results
The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families.
Conclusions
Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.
Background
Explicit mapping (voting) algorithms proposed in [8] use an indicator array V of the maximum size |Σ|^{ k } to find motifs through voting. Each length-k substring observed in the input has at most one vote for each input sequence and gives this vote to all of its V(k, m) neighbors. The substrings that occur in every input string will receive N votes and will be included in the output motif set M. The algorithm takes O(k^{ m }^{+1}|Σ|^{ m }nN) time and requires at least O(k^{ m }^{+1}|Σ|^{ m }nN) space. The large space requirement of the algorithm restricts its usage to small values of k and m, as well as to small alphabet size |Σ|.
One of the most efficient exact algorithms for motif search, the mismatch tree (MITRA) algorithm [5], uses efficient trie traversal to find a set of motifs in the input strings. Under a trie-based computation framework [5, 11], the list of k-long contiguous substrings (k-mers) extracted from given strings is traversed in a depth-first search manner with branches corresponding to all possible symbol substitutions from alphabet Σ. Each leaf node at depth k corresponds to a particular k-mer feature (either exact or inexact instance of the observed exact string features) and will contain a list of matching features from each string. The leaf nodes corresponding to motifs will contain instances from all (or almost all) strings. The complexity of the trie-based traversal algorithm for motif finding is O(k^{ m }^{+1}|Σ|^{m}nN). Note that the algorithm essentially explores the neighborhood of all O(nN) k-mers in the input.
Another class of efficient algorithms is based on sorting and enumeration [10]. The PMSP algorithm enumerates all possible neighboring k-mers for the first string s_{1} and outputs k-mers that occur in every string with Hamming distance at most m, similar to the Voting algorithm [8]. The PMSprune algorithm [10] employs a more efficient search strategy to traverse the candidate space and is an improvement, in the expected case, over the PMSP. We note that explicit enumeration is employed by all above-mentioned algorithms.
While the exact algorithms focus on retrieving all possible motif patterns, an important issue of estimating significance of the found motif patterns can be addressed with existing techniques as used in, for instance, non-exhaustive algorithms based on stochastic optimization (e.g., MEME [12]).
In contrast to existing exact exhaustive algorithms, we approach the problem of motif finding by performing an efficient search over patterns with wildcards. As a consequence, the proposed method’s complexity becomes independent of the alphabet size.
Methods
Combinatorial algorithm for motif search
Algorithm 1 Motif search algorithm
This algorithm uses as its main sub-algorithm (in step 2) a procedure that finds the intersection of k-mer neighborhoods for any pair of the k-mers a, b. This intersection finding algorithm is described in Section ‘Motif generation’. We describe selection and pruning steps (steps 1 and 3) in Section ‘Selection algorithm’.
The overall complexity of the algorithm is where H is the maximum size of H(a, b), and I is the size of I, the number of k-mers used to construct the candidate set C. The important fact that makes our algorithm efficient in practice is that typically I ≪ min(nN, |Σ|^{ k }) and H ≪ V(k, m), particularly for large alphabets. We demonstrate this in our experimental results and provide an expected-size analysis in Section ‘Selection algorithm’.
Selection algorithm
Algorithm 2 Selection algorithm
Input: set of k-mers with associated sequence index, distance parameter d |
---|
Output: set of k-mers at distance d from each input string |
1. Pick d positions and remove from the k-mers symbols at the corresponding positions to obtain a set of (k − d)-mers. |
2. Use counting sort to order (lexicographically) the resulting set of (k – d)-mers. |
3. Scan the sorted list to create the list of all sequences in which k-mers appear. |
4. Output the k-mers that appear in every input string. |
where p_{ k }_{,2}_{ m } is the probability that two randomly selected k-mers are at distance of at most 2m. For instance, for a set of N = 20 protein sequences (sampled from alphabet |Σ| = 20) of length n = 600 the expected number of potential motifs of length k = 13, m = 4 by chance is about 8, with p_{13,8} = 2.9 10^{−4}. Given t implanted motif instances, the average number of k-mers that will be selected from nN input samples, or the expected size of I, is
E[I] = t + nN(1 − (1 − p_{ k }_{,2}_{ m })^{ t }) + E[I_{ B }].
Since t and p are typically small, for small pn, E[I] ≪ nN, the number of k-mers in the input. In the protein example above the expected size of I is about 1 + 3 + 8=12 for t = 1, which is orders of magnitude smaller than nN = 12000, signifying the importance of creating I first. This is empirically demonstrated in Section ‘Results and Discussion’.
Pruning using selection
The sorting approach of Algorithm 2 is also used to select patterns satisfying the motif property from the candidates C (Step 3 in main Algorithm 1). The pruning step is based on verifying the motif property (i.e. whether given patterns match all input sequences with up to m mismatches) and can be accomplished using rounds of counting sort.
Motif generation
In what follows, we describe an efficient algorithm that finds the set of stems that represent the set of k-mers shared by a pair of k-mers a and b. This process is used to create set C from potential instances I, which is subsequently pruned to yield the true motif instances.
The number of k-mers in the common neighborhood of any two particular k-mers a and b assumes a fixed set values depending on the Hamming distance d(a, b) between k-mers [13], for given values of |Σ|, k, and m. We want to represent the shared k-mers in this intersection using a set of stems, patterns with wildcards. However, the number of stems will not depend on the alphabet size |Σ|.
To find all stems shared by k-mers a and b, consider two sets of positions: mismatch region in which a and b disagree and match region in which a and b agree. We consider two cases depending on the number of mismatch positions (i.e. Hamming distance between a, b). In the first case, the distance d(a, b) is at most m, the maximum number of mismatches allowed. In the second case, the distance d(a, b) exceeds m. When d(a, b) ≤ m, wildcard characters can appear both inside and outside of the mismatch region. When d(a, b) >m, wildcard characters can appear only inside the mismatch regions. Consider for example, the case of d(a, b) = 0 and m = 1. In this case, the set of stems is the set of patterns with 1 wildcard at each of the possible k positions (with the remaining positions as in a) plus one stem with 0 wildcards. When m = 2, and d(a, b) = 1, the set of stems will include patterns with 0 or 1 wildcard in k − d positions and 0 or 1 wildcards in the remaining d = 1 positions. For example, for the pair (tgt, tgc) the corresponding patterns with wildcards are tg?, t??, ?g?, t?c, and ?gc, where ? denotes a wildcard.
Algorithm 3 Stem generation (independent of the alphabet size |Σ|)
Input: pair of k-mers a, b |
---|
Output: set of stems (patterns with wildcards) shared by a and b |
if if d(a, b) ≤ m then |
Set stem = a |
Set i = 0 … d positions in the mismatch region of the stem as in b |
Place j_{1} = 0 … d − i wildcards inside the mismatch region |
Place j_{2} = 0 … m − max(d − i, j_{1} + i) wildcards outside the mismatch region |
end if |
if d(a, b) >m then |
Set stem = a |
Fix i = d − m … m positions in the mismatch region of the current stem as in b |
Place j = 0 … m − i wild-cards in the remaining d − i positions in the mismatch region |
end if |
Output resulting stems (patterns with wildcards) |
Algorithm analysis
The complexity of the selection step 1 for constructing I is and does not depend on the alphabet size |Σ|. Steps 2 and and 3 have the complexity and again do not depend on |Σ|. As a consequence, the three-step procedure gives us an efficient, alphabet-independent motif search algorithm that outputs all motifs embedded in the input S. Our experiments will next demonstrate that this allows efficient exploitation of sparsity of typical solutions—we explore only a small portion of the motif space by focusing (using Algorithm 2) only on the support samples that are potential instances of the motifs. This results in significant reductions in running times, especially for large-alphabet inputs, i.e. the cases difficult for the current exact motif finding algorithms.
Extensions
Our proposed framework can be used to reduce search complexity for other exact search-based motif finding algorithms. Existing exhaustive algorithms typically (e.g. [5, 8, 10]) use the entire input (i.e. all the k-mers in the input) and find motif by essentially exploring neighborhoods of every k-mer in the input. Their search complexity can be improved by using a reduced set of k-mers instead of all input samples. This reduced set of k-mers can be obtained using our linear time selection algorithm (Algorithm 2, Section ‘Selection algorithm’). Using reduced set of k-mers, the actual search complexity after the selection step becomes sublinear in the input size (since the number of selected k-mers I = |I| is much smaller than input length O(nN)). For instance, the search complexity of the trie-based algorithms (e.g., [5]) can be reduced to instead of O(knNV(k, m)), where V(k, m) is O(k^{ m }|Σ|^{ m }). This will lead to a more efficient search especially for large-alphabet since a possibly large input (O(knN)) is replaced with a smaller set I of k-mers that match with up to 2m mismatches every string in the input.
Results and discussion
We evaluate our algorithms on synthetic benchmark motif finding tasks and real data sets. We first test our algorithms on the planted motif problem commonly used as a benchmark for evaluation of the motif finding algorithms [2, 5, 10]. We then illustrate our method on several DNA and protein sequence data sets.
Planted motif problem
A planted motif problem [2] is the task of finding motifs and their instances in a set of sequences with variants of the consensus string (motif) implanted with up to m mismatches in every string. This task represents a well-defined subtle motif discovery problem. Instances of this problem with large number of mutations m are known to be challenging for most of the motif finding algorithms.
We follow the standard setting used in previous studies [2, 5, 10] and generate N = 20 random strings of length n = 600 using iid, uniformly distributed symbols from an alphabet of size |Σ|. We then embed a copy (with up to m substitutions at random positions) of a motif at a random location in every string. The task is then to identify motifs hidden in the input.
Running time comparison on the challenging instances of the planted motif problem (DNA, |Σ| = 4, N = 20 sequences of length n = 600). Problem instances are denoted by (k,m, |Σ|), where k is the length of the motif implanted with m mismatches.
Running time, in seconds, on large-|Σ| inputs. (k, m) instances denote implanted motifs of length k with up to m substitutions.
|Σ| | (9,2) | (11,3) | (13,4) | (15,5) | ||||
---|---|---|---|---|---|---|---|---|
MITRA | Stemming | MITRA | Stemming | MITRA | Stemming | MITRA | Stemming | |
20 | 8.39 | 0.637 | 1032.17 | 1.07 | 28905 | 5.247 | n/a | 12.31 |
50 | 89.82 | 0.633 | 12295.73 | 0.963 | 685015 | 2.244 | n/a | 11.92 |
100 | 265.94 | 0.645 | n/a | 0.967 | > 1 month | 2.227 | n/a | 11.86 |
Identifying TF binding sites
We use several data sets with experimentally confirmed TF binding sites: CRP, FNR, and LexA. The CRP data set contains 18 DNA sequences of length 105 with one or two CRP-binding sites [2, 14]. The FNR and LexA data sets are obtained from RegulonDB [15] database and contain 30 and 91 sequences known to have sites of length 14 and 20 bases. The task is to identify the sequence motif corresponding to the binding sites and the positions of sites within sequences.
For CRP, we use relatively long k-mers of length k = 18, with a large number of allowed mismatches (m = 7) from a given set of 18 DNA sequences (|Σ| = 4). For FNR and LexA data sets, we set motif length to k = 14 and k = 16 bases, with the maximum number of mismatches set to m = 4 and m = 6, respectively.
For FNR and LexA motifs, our algorithm correctly finds consensus patterns TTGATnnnnATCAA and CTGTnnnnnnnnnCAG, in line with the validated transcription factor binding sites, with the performance coefficients [2] of 83.69 and 90.38.
Protein motif finding
We also apply our algorithm to finding subtle sequence motifs on several protein sequence datasets, a challenging task due to the increased alphabet size (|Σ| = 20) coupled with large k and m.
Zinc metallopeptidase motif. In this experiment, 10 relatively long (average length is 800) human zinc metallopeptidase sequences used to test motif finding. Identification of subtle motifs in this case is made even more challenging by the length of the sequences. We use 11 residues long k-mer with m = 5 mismatches and find sequence motifs with the instance majority VAAHELGHS[GL]G in 9 out of 10 sequences that correspond to previously confirmed locations. We note the large number of mismatches (m = 5) was critical to motif identification.
Conclusions
We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining potentially characterized by large alphabets, such as text mining.
Authors contributions
All authors contributed equally to this work.
Declarations
Acknowledgments
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
Authors’ Affiliations
References
- Xing EP, Jordan MI, Karp RM, Russell S: A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. In In Proc. of Advances in Neural Information Processing Systems. MIT Press; 2003:200–3.Google Scholar
- Pevzner PA, Sze SH: Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. Volume 0. AAAI Press; 2000:269–278.
- Fellous JM, Tiesinga PHE, Thomas PJ, Sejnowski TJ: Discovering Spike Patterns in Neuronal Responses. J. Neurosci. 2004, 24(12):2989–3001. [http://www.jneurosci.org/cgi/content/abstract/24/12/2989] 10.1523/JNEUROSCI.4649-03.2004PubMed CentralView ArticlePubMedGoogle Scholar
- Jojic N, Jojic V, Frey B, Meek C, Heckerman D: Using “epitomes” to model genetic diversity: Rational design of HIV vaccine cocktails. In Advances in Neural Information Processing Systems 18. Edited by: Weiss Y, Schölkopf B, Platt J. Cambridge, MA: MIT Press; 2006:587–594.Google Scholar
- Eskin E, Pevzner PA: Finding composite regulatory patterns in DNA sequences. Bioinformatics 2002, 18(suppl1):S354–363. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/18/suppl_1/S354]View ArticlePubMedGoogle Scholar
- Pisanti N, Carvalho AM, Marsan L, Sagot MF: RISOTTO: Fast Extraction of Motifs with Mismatches. In LATIN 2006, 757–768.Google Scholar
- Tompa M, Li N, Bailey T, Church G, Moor BD: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23: 137–144. 10.1038/nbt1053View ArticlePubMedGoogle Scholar
- Chin FYL, Leung HCM: Voting algorithms for discovering long motifs. APBC 2005, 261–271. [http://www.comp.nus.edu.sg/~wongls/psZ/apbc2005/camera-ready/212.pdf]Google Scholar
- Sagot MF: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In LATIN ’98: Proceedings of the Third Latin American Symposium on Theoretical Informatics. London, UK: Springer-Verlag; 1998:374–390. full_textView ArticleGoogle Scholar
- Davila J, Balla S, Rajasekaran S: Fast and Practical Algorithms for Planted (l, d) Motif Search. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007, 4(4):544–552. 10.1109/TCBB.2007.70241View ArticlePubMedGoogle Scholar
- Leslie C, Kuang R: Fast String Kernels using Inexact Matching for Protein Sequences. J. Mach. Learn. Res. 2004, 5: 1435–1455. [http://jmlr.csail.mit.edu/papers/volume5/leslie04a/leslie04a.pdf]Google Scholar
- Bailey TL, Elkan C: Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Mach. Learn. 1995, 21(1–2):51–80. 10.1007/BF00993379View ArticleGoogle Scholar
- Kuksa P, Huang PH, Pavlovic V: Scalable Algorithms for String Kernels with Inexact Matching. In NIPS 2008, 881–888.Google Scholar
- Stormo GD, Hartzell GW: Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences of the United States of America 1989, 86(4):1183–1187. [http://www.pnas.org/content/86/4/1183.abstract] 10.1073/pnas.86.4.1183PubMed CentralView ArticlePubMedGoogle Scholar
- RegulonDB.[http://regulondb.ccg.unam.mx/]
- Lawrence C, Reilly A: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 1990, 7: 41–51. 10.1002/prot.340070105View ArticlePubMedGoogle Scholar
- Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A, Wootton J: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(5131):208–214. [http://www.sciencemag.org/cgi/content/abstract/262/5131/208] 10.1126/science.8211139View ArticlePubMedGoogle Scholar
- Kister AE, Fokas AS, Papatheodorou TS, Gelfand IM: Strict rules determine arrangements of strands in sandwich proteins. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(11):4107–4110. [http://www.pnas.org/content/103/11/4107.abstract] 10.1073/pnas.0510747103PubMed CentralView ArticlePubMedGoogle Scholar
- Super-Secondary Structure Database.[http://binfs.umdnj.edu/sssdb/]
- Evans PA, Smith AD: Toward Optimal Motif Enumeration. In WADS 2003, 47–58.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.