- Research
- Open Access
- Published:

# Efficient motif finding algorithms for large-alphabet inputs

*BMC Bioinformatics*
**volume 11**, Article number: S1 (2010)

## Abstract

### Background

We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.

### Results

The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families.

### Conclusions

Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.

## Background

Finding motifs or repeated patterns in data is of wide scientific interest [1–4] with many applications in genomic and proteomic analysis. The motif search problem abstracts many important problems in analysis of sequence data, where motifs are, for instance, biologically important patterns. For example, elucidating motifs in DNA sequences is a critical first step in understanding biological processes as basic as the RNA transcription. There, the motifs can be used to identify promoters, the regions in DNA that facilitate the transcription. Finding motifs can be equally crucial for analyzing interactions between viruses and cells or identification of disease-linked patterns. Discovery of motifs in music sequences, text, or time series data is a fundamental, general means of summarizing, mining and understanding large volumes of data. For the purpose of this study, motifs are (short) patterns that occur in an exact or *approximate* form in all or most of the strings in a data set. Consider a set of input strings *S* of size *N* = |*S*| constructed from an alphabet Σ. The solution for the (*k*, *m,* Σ, *N*)-motif finding problem (Figure 1) is the set *M* of *k*-mers (substrings of length *k*), *M* ⊆ Σ^{k}, such that each motif *a* ∈ *M*, |*a*| = *k*, is at minimum Hamming distance of at most *m* from all (or almost all) strings *s* ∈ *S*.

In this work, we focus on a deterministic, exhaustive approach to motif search. Exhaustive motif finding approaches are guaranteed to report all instances of motifs in a set of sequences, but are faced by the exponential complexity of such search. As a consequence, the problem quickly becomes intractable for even moderately long motifs and small alphabets. We present a new deterministic algorithm for finding common patterns with the search complexity that scales well with the size of the alphabet. Compared to existing algorithms in this class (e.g. [5, 6]) that have strong dependency on the alphabet size and work with small-alphabet input, our algorithms significantly improve search efficiency in the important case of large-alphabet inputs (e.g. protein alphabet, extended DNA alphabet, etc.) and inputs of large length. As we show in the experiments, using both synthetic and real data, our algorithms are orders-of-magnitude faster than existing state-of-the-art deterministic search algorithms, especially on large-alphabet inputs (e.g., protein sequences). This result extends applicability of the exact motif search algorithms to more complex problems requiring analysis of biological sequence data modeled as strings over large alphabets. The problem of motif discovery has been tackled extensively over the past two decades [7]. Within the class of exhaustive methods, a number of approaches have been proposed, including graph methods (WINNOWER) [2], explicit trie traversal (MITRA) [5], explicit mapping (Voting algorithms) [8], suffix trees [6, 9], sorting and enumeration [10], etc. Existing exhaustive algorithms use *explicit* exploration of the motif space and require time proportional to the size of the *neighborhood* of a *k*-mer, i.e. the number of *k*-mer sequences at Hamming distance of at most *m* from it. This size, depends on the alphabet size, and can lead to high computational complexity and running times, as shown in Table 1.

Explicit mapping (voting) algorithms proposed in [8] use an indicator array *V* of the maximum size |Σ|^{k} to find motifs through *voting.* Each length-*k* substring observed in the input has at most one vote for each input sequence and gives this vote to all of its *V*(*k*, *m*) neighbors. The substrings that occur in every input string will receive *N* votes and will be included in the output motif set *M*. The algorithm takes *O*(*k*^{m}^{+1}|Σ|^{m}*nN*) time and requires at least *O*(*k*^{m}^{+1}|Σ|^{m}*nN*) space. The large space requirement of the algorithm restricts its usage to small values of *k* and *m*, as well as to small alphabet size |Σ|.

One of the most efficient exact algorithms for motif search, the mismatch tree (MITRA) algorithm [5], uses efficient trie traversal to find a set of motifs in the input strings. Under a trie-based computation framework [5, 11], the list of *k*-long contiguous substrings (*k*-mers) extracted from given strings is traversed in a *depth-first* search manner with branches corresponding to all possible symbol substitutions from alphabet Σ. Each leaf node at depth *k* corresponds to a particular *k*-mer feature (either exact or inexact instance of the observed exact string features) and will contain a list of matching features from each string. The leaf nodes corresponding to motifs will contain instances from all (or almost all) strings. The complexity of the trie-based traversal algorithm for motif finding is *O*(*k*^{m}^{+1}|Σ|^{m}*nN*). Note that the algorithm essentially explores the neighborhood of all *O*(*nN*) *k*-mers in the input.

Another class of efficient algorithms is based on sorting and enumeration [10]. The PMSP algorithm enumerates all possible neighboring *k*-mers for the first string *s*_{1} and outputs *k*-mers that occur in every string with Hamming distance at most m, similar to the Voting algorithm [8]. The PMSprune algorithm [10] employs a more efficient search strategy to traverse the candidate space and is an improvement, in the expected case, over the PMSP. We note that *explicit enumeration* is employed by all above-mentioned algorithms.

While the exact algorithms focus on retrieving all possible motif patterns, an important issue of estimating significance of the found motif patterns can be addressed with existing techniques as used in, for instance, non-exhaustive algorithms based on stochastic optimization (e.g., MEME [12]).

In contrast to existing exact exhaustive algorithms, we approach the problem of motif finding by performing an efficient search over patterns with wildcards. As a consequence, the proposed method’s complexity becomes independent of the alphabet size.

## Methods

### Combinatorial algorithm for motif search

In this section, we develop an efficient combinatorial algorithm for motif finding with the search complexity independent of the size of the alphabet |Σ|. The algorithm begins by finding a set of candidate motifs, followed by the construction of the intersections of those candidates’ neighborhoods, the sequences that are at most m symbols apart from each candidate pair. In a crucial departure from other approaches, this set is efficiently represented using *stems,* or patterns with wildcards. The number of the stems does not depend on the alphabet size and is a function of the motif length (*k*), the number of mismatches (*m*) and the Hamming distance between *k*-mers. Patterns common to all (or almost all) input strings are then found by pruning the stems that do not satisfy the motif property (i.e., do not occur in all input strings). The main idea of our approach is to construct a candidate set *C* which includes all motifs *M* plus some non-motifs, i.e. *M* ⊆ *C*, and then efficiently select true motifs from the candidate set. Given *C*, the complexity of motif finding is then proportional to its size: the motifs can be extracted from *C* by checking each candidate against the motif property, a task we accomplish using rounds of counting sort in Algorithm 2. To generate *C*, we collect the sets of stems which characterize the common neighbors of the pairs of *k*-mers (*a, b*) in the input. We call these sets the *stem sets, H*(*a, b*). Finding each *H*(*a*, *b*) is independent of the alphabet size and is accomplished in Algorithm 3. To further reduce the complexity, we construct the stem sets only for potential motif instances *I*, those *k*-mers that are at Hamming distance of at most 2*m* from every input string. We find *I* using rounds of counting sort (Algorithm 2). We outline our motif search algorithm below:

This algorithm uses as its main sub-algorithm (in step 2) a procedure that finds the intersection of *k*-mer neighborhoods for any pair of the *k*-mers *a*, *b*. This intersection finding algorithm is described in Section ‘Motif generation’. We describe selection and pruning steps (steps 1 and 3) in Section ‘Selection algorithm’.

The overall complexity of the algorithm is where *H* is the maximum size of *H*(*a*, *b*), and *I* is the size of *I*, the number of *k*-mers used to construct the candidate set *C*. The important fact that makes our algorithm efficient in practice is that typically *I* ≪ min(*nN*, |Σ|^{k}) and *H* ≪ *V*(*k*, *m*), particularly for large alphabets. We demonstrate this in our experimental results and provide an expected-size analysis in Section ‘Selection algorithm’.

### Selection algorithm

A necessary condition for a group of *k*-mers to have a shared, common neighbor (motif) is that the Hamming distance between any pair of patterns cannot exceed 2*m*. We will use this condition to select *k*-mers from input that are potential motif instances and place them in set *I*. A particular *k*-mer *a* in the input is a potential motif instance if it is at the minimum Hamming distance at most 2*m* from each of the input strings. All other *k*-mers that violate the above condition cannot be instances of a motif and can be discarded. To select the valid *k*-mers, we use multiple rounds of count sort by removing iteratively 2*m* out of *k* positions and sorting the resulting set of (*k* − 2*m*)-mers. A *k*-mer is deemed a potential motif instance if it matched at least one *k*-mer from each of the other strings in at least one of the sorting rounds. The purpose of sorting is to group same *k*-mers together. Using a simple linear scan over the sorted list of all input *k*-mers, we can find the set of potential motifs and construct *I*. This algorithm is outlined in Algorithm 2. As we will see in the experiments (Section ‘Results and Discussion’), the selection step

significantly reduces the number of *k*-mer instances considered by the algorithm and improves search efficiency. The number of selected *k*-mers, i.e. the size of *I*, is small, especially for large-alphabet inputs. This can be seen from the expected case analysis. For this purpose we assume that sequences are generated from a background process with few motifs implanted in the background-generated sequences. Assuming an iid background model with equiprobable symbols, the expected number of *k*-mers in the input of *N* strings of length n that match each of the *N* strings with up to 2*m* mismatches by chance is

where *p*_{
k
}_{,2}_{
m
} is the probability that two randomly selected *k*-mers are at distance of at most 2*m*. For instance, for a set of *N* = 20 protein sequences (sampled from alphabet |Σ| = 20) of length *n* = 600 the expected number of potential motifs of length *k* = 13, *m* = 4 by chance is about 8, with *p*_{13,8} = 2.9 10^{−4}. Given *t* implanted motif instances, the average number of *k*-mers that will be selected from *nN* input samples, or the expected size of *I*, is

*E*[*I*] = *t* + *nN*(1 − (1 − *p*_{
k
}_{,2}_{
m
})^{t}) + *E*[*I*_{
B
}].

Since *t* and *p* are typically small, for small *pn*, *E*[*I*] ≪ *nN*, the number of *k*-mers in the input. In the protein example above the expected size of *I* is about 1 + 3 + 8=12 for *t* = 1, which is orders of magnitude smaller than *nN* = 12000, signifying the importance of creating *I* first. This is empirically demonstrated in Section ‘Results and Discussion’.

#### Pruning using selection

The sorting approach of Algorithm 2 is also used to select patterns satisfying the motif property from the candidates *C* (Step 3 in main Algorithm 1). The pruning step is based on verifying the motif property (i.e. whether given patterns match all input sequences with up to m mismatches) and can be accomplished using rounds of counting sort.

### Motif generation

In what follows, we describe an efficient algorithm that finds the set of *stems* that represent the set of *k*-mers shared by a pair of *k*-mers *a* and *b*. This process is used to create set *C* from potential instances *I*, which is subsequently pruned to yield the true motif instances.

The number of *k*-mers in the common neighborhood of any two particular *k*-mers *a* and *b* assumes a fixed set values depending on the Hamming distance *d*(*a*, *b*) between *k*-mers [13], for given values of |Σ|, *k*, and *m*. We want to represent the shared *k*-mers in this intersection using a set of *stems,* patterns with wildcards. However, the number of stems will not depend on the alphabet size |Σ|.

To find all stems shared by *k*-mers *a* and *b*, consider two sets of positions: *mismatch region* in which *a* and *b* disagree and *match region* in which *a* and *b* agree. We consider two cases depending on the number of mismatch positions (i.e. Hamming distance between *a*, *b*). In the first case, the distance *d*(*a*, *b*) is at most *m*, the maximum number of mismatches allowed. In the second case, the distance *d*(*a*, *b*) exceeds *m*. When *d*(*a*, *b*) ≤ *m*, wildcard characters can appear both inside and outside of the mismatch region. When *d*(*a*, *b*) >*m*, wildcard characters can appear only inside the mismatch regions. Consider for example, the case of *d*(*a*, *b*) = 0 and *m* = 1. In this case, the set of stems is the set of patterns with 1 wildcard at each of the possible *k* positions (with the remaining positions as in *a*) plus one stem with 0 wildcards. When *m* = 2, and *d*(*a*, *b*) = 1, the set of stems will include patterns with 0 or 1 wildcard in *k* − *d* positions and 0 or 1 wildcards in the remaining *d* = 1 positions. For example, for the pair (*tgt*, *tgc*) the corresponding patterns with wildcards are *tg*?, *t*??, ?*g*?, *t*?*c*, and ?*gc*, where ? denotes a wildcard.

We outline our algorithm for finding set of stems for the *k*-mer neighborhood intersection in Algorithm 3. The number of stems generated by the algorithm is

The number of stems describing all the explicit *k*-mers shared between *a*, *b* does not depend on the alphabet size. The complexity of the stemming algorithm is proportional to the number of stems generated. The maximum number of stems *H* is for typical values of *m* <*k*/2. We use Algorithm 3 for every pair of *k*-mers in *I* (step 2) to construct *C* as outlined in the main algorithm.

### Algorithm analysis

The complexity of the selection step 1 for constructing *I* is and does not depend on the alphabet size |Σ|. Steps 2 and and 3 have the complexity and again do not depend on |Σ|. As a consequence, the three-step procedure gives us an efficient, alphabet-independent motif search algorithm that outputs all motifs embedded in the input *S*. Our experiments will next demonstrate that this allows efficient exploitation of sparsity of typical solutions—we explore only a small portion of the motif space by focusing (using Algorithm 2) only on the support samples that are potential instances of the motifs. This results in significant reductions in running times, especially for large-alphabet inputs, i.e. the cases difficult for the current exact motif finding algorithms.

### Extensions

Our proposed framework can be used to reduce search complexity for other exact search-based motif finding algorithms. Existing exhaustive algorithms typically (e.g. [5, 8, 10]) use the entire input (i.e. all the *k*-mers in the input) and find motif by essentially exploring neighborhoods of every *k*-mer in the input. Their search complexity can be improved by using a *reduced* set of *k*-mers instead of all input samples. This reduced set of *k*-mers can be obtained using our linear time selection algorithm (Algorithm 2, Section ‘Selection algorithm’). Using reduced set of *k*-mers, the actual search complexity after the selection step becomes sublinear in the input size (since the number of selected *k*-mers *I* = |*I*| is much smaller than input length *O*(*nN*)). For instance, the search complexity of the trie-based algorithms (e.g., [5]) can be reduced to instead of *O*(*knNV*(*k*, *m*)), where *V*(*k*, *m*) is *O*(*k*^{m}|Σ|^{m}). This will lead to a more efficient search especially for large-alphabet since a possibly large input (*O*(*knN*)) is replaced with a smaller set *I* of *k*-mers that match with up to 2*m* mismatches every string in the input.

## Results and discussion

We evaluate our algorithms on synthetic benchmark motif finding tasks and real data sets. We first test our algorithms on the planted motif problem commonly used as a benchmark for evaluation of the motif finding algorithms [2, 5, 10]. We then illustrate our method on several DNA and protein sequence data sets.

### Planted motif problem

A planted motif problem [2] is the task of finding motifs and their instances in a set of sequences with variants of the consensus string (motif) implanted with up to *m* mismatches in every string. This task represents a well-defined subtle motif discovery problem. Instances of this problem with large number of mutations *m* are known to be challenging for most of the motif finding algorithms.

We follow the standard setting used in previous studies [2, 5, 10] and generate *N* = 20 random strings of length *n* = 600 using iid, uniformly distributed symbols from an alphabet of size |Σ|. We then embed a copy (with up to *m* substitutions at random positions) of a motif at a random location in every string. The task is then to identify motifs hidden in the input.

In Table 2, we compare the running time of our algorithms with state-of-the-art motif finding algorithms on several challenging instances of the planted motif problem. We give the running time comparison for large-alphabet (|Σ| = 20 − 100) instances in Table 3. As we can see from the results in Table 2 and Table 3, our algorithms show significant reduction in running times compared to state-of-the-art methods, especially for large-|Σ| inputs (Table 3). For large alphabets and large *k*,*m* trie traversal takes substantial amount of time and results in these cases are not reported. In Figure 2, we show the running time ratio (logarithmic scale) between the mismatch trie traversal (MITRA) algorithm and our algorithm as a function of the alphabet set size. The running time is measured on (13,4) instances of the planted motif problem. For relatively small alphabet of size 20 our algorithm is about 10^{4} times faster than the mismatch trie. The difference in running time increases with the size of the alphabet. Large alphabets can, for instance, arise when encoding the 3D protein structure, a necessity in cases when sequences share little similarity at primary level.

Figure 3 shows efficiency of the selection (step 1 in the algorithm) as a ratio between the input size and the number of the selected samples (*k*-mers) |*I*|. We observe that across different input sizes selection reduces the number of samples by a factor of about 10^{3}. The observed number of selected samples *I* = |*I*| agrees with the theoretical estimates (Section ‘Selection algorithm’) (e.g., for |Σ| = 50, n=5000, N=20, we expect about 52 *k*-mers to be selected, and the observed size of *I* is 103 *k*-mers).

### Identifying TF binding sites

We use several data sets with experimentally confirmed TF binding sites: CRP, FNR, and LexA. The CRP data set contains 18 DNA sequences of length 105 with one or two CRP-binding sites [2, 14]. The FNR and LexA data sets are obtained from RegulonDB [15] database and contain 30 and 91 sequences known to have sites of length 14 and 20 bases. The task is to identify the sequence motif corresponding to the binding sites and the positions of sites within sequences.

For CRP, we use relatively long *k*-mers of length *k* = 18, with a large number of allowed mismatches (*m* = 7) from a given set of 18 DNA sequences (|Σ| = 4). For FNR and LexA data sets, we set motif length to *k* = 14 and *k* = 16 bases, with the maximum number of mismatches set to *m* = 4 and *m* = 6, respectively.

Figure 4 illustrates motifs found by the algorithm on the CRP data set. In the figure, colors indicate the importance of positions as measured by the number of hits between the found motif patterns and the sequences, with blue horizontal lines denoting true (confirmed) locations of the binding sites. The set of discovered locations agrees with the set of experimentally confirmed primary positions. The discovered motif patterns correspond to instances of the reference consensus motif TGTGAnnnnnnTCACA [14, 16]. Because of large *k* and *m* we observe running time improvements similar to the benchmark planted motif problems: our algorithm takes about 6 minutes, while the mismatch trie traversal requires about 12 times as long (4489 seconds). Allowing a large number of mismatches (*m* = 7) in this case is critical for the motif prediction performance, because fewer mismatches do not lead to successful identification of the binding sites.

For FNR and LexA motifs, our algorithm correctly finds consensus patterns TTGATnnnnATCAA and CTGTnnnnnnnnnCAG, in line with the validated transcription factor binding sites, with the performance coefficients [2] of 83.69 and 90.38.

### Protein motif finding

We also apply our algorithm to finding subtle sequence motifs on several protein sequence datasets, a challenging task due to the increased alphabet size (|Σ| = 20) coupled with large *k* and *m*.

**Lipocalin motifs.** We first consider motifs in *lipocalins* which are topologically similar but have very diverse primary sequences. Using *k*-mer of length *k* = 15 with *m* = 7 mismatches, we identify motifs containing 15 residues with the instance majority FD[IKLW]S[AKNR]FAGTWYE[ILMV]AK (Figure 5), which agrees with the known reference motif [17]. Our algorithm takes about 5 minutes to complete this task, while the mismatch trie algorithm takes more than a day. As in the case of the DNA, a large number of mismatches is critical for finding motifs, while smaller values of *k*, *m* do not result in motif identification.

**Zinc metallopeptidase motif.** In this experiment, 10 relatively long (average length is 800) human zinc metallopeptidase sequences used to test motif finding. Identification of subtle motifs in this case is made even more challenging by the length of the sequences. We use 11 residues long *k*-mer with *m* = 5 mismatches and find sequence motifs with the instance majority VAAHELGHS[GL]G in 9 out of 10 sequences that correspond to previously confirmed locations. We note the large number of mismatches (*m* = 5) was critical to motif identification.

**Super-secondary structure sequence motifs.** We consider now two data sets of protein sequences with interesting 3D sandwich structure studied previously by biologists, for which existence of corresponding sequence motifs has been postulated [18]. Using Cadherin and Immunoglobin superfamilies as an example, our algorithm finds sequence patterns that correspond to the supersecondary structure (SSS) motifs [18, 19], i.e. arrangements of the secondary structure units (loops, strands). In particular, in Cadherin superfamily we find long motifs of length 20 (using *m* = 4 mismatches) corresponding to the secondary structure units *strand 1 - loop - strand 2* (VIPPISCPENE[KR]GPFPKNLV) and *strand 3 - loop - strand 4* (YSITGQGAD[KNQT]PPVGVFII) (3D SSS motif [19]). Figure 6 shows identified motif positions within the sequences. Our algorithm finds 36 potential motif instances (out of 330 samples) after the selection (step 1) and takes about 47 seconds (compared to about 600 seconds using the trie traversal). In Immunoglobin superfamily (C1 set domains), we find a sequence motif of length 19 SSVTLGCLVKGYFPEPVTV which corresponds to *strand 2-loop-strand 3* secondary structure units (2E SSS motif).

## Conclusions

We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining potentially characterized by large alphabets, such as text mining.

## Authors contributions

All authors contributed equally to this work.

## References

Xing EP, Jordan MI, Karp RM, Russell S: A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. In

*In Proc. of Advances in Neural Information Processing Systems*. MIT Press; 2003:200–3.Pevzner PA, Sze SH: Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. In

*Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology*.*Volume 0*. AAAI Press; 2000:269–278.Fellous JM, Tiesinga PHE, Thomas PJ, Sejnowski TJ: Discovering Spike Patterns in Neuronal Responses.

*J. Neurosci.*2004, 24(12):2989–3001. [http://www.jneurosci.org/cgi/content/abstract/24/12/2989] 10.1523/JNEUROSCI.4649-03.2004Jojic N, Jojic V, Frey B, Meek C, Heckerman D: Using “epitomes” to model genetic diversity: Rational design of HIV vaccine cocktails. In

*Advances in Neural Information Processing Systems 18*. Edited by: Weiss Y, Schölkopf B, Platt J. Cambridge, MA: MIT Press; 2006:587–594.Eskin E, Pevzner PA: Finding composite regulatory patterns in DNA sequences.

*Bioinformatics*2002, 18(suppl1):S354–363. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/18/suppl_1/S354]Pisanti N, Carvalho AM, Marsan L, Sagot MF: RISOTTO: Fast Extraction of Motifs with Mismatches.

*In LATIN*2006, 757–768.Tompa M, Li N, Bailey T, Church G, Moor BD: Assessing computational tools for the discovery of transcription factor binding sites.

*Nat Biotechnol*2005, 23: 137–144. 10.1038/nbt1053Chin FYL, Leung HCM: Voting algorithms for discovering long motifs.

*APBC*2005, 261–271. [http://www.comp.nus.edu.sg/~wongls/psZ/apbc2005/camera-ready/212.pdf]Sagot MF: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In

*LATIN ’98: Proceedings of the Third Latin American Symposium on Theoretical Informatics*. London, UK: Springer-Verlag; 1998:374–390. full_textDavila J, Balla S, Rajasekaran S: Fast and Practical Algorithms for Planted (l, d) Motif Search.

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*2007, 4(4):544–552. 10.1109/TCBB.2007.70241Leslie C, Kuang R: Fast String Kernels using Inexact Matching for Protein Sequences.

*J. Mach. Learn. Res.*2004, 5: 1435–1455. [http://jmlr.csail.mit.edu/papers/volume5/leslie04a/leslie04a.pdf]Bailey TL, Elkan C: Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization.

*Mach. Learn.*1995, 21(1–2):51–80. 10.1007/BF00993379Kuksa P, Huang PH, Pavlovic V: Scalable Algorithms for String Kernels with Inexact Matching.

*In NIPS*2008, 881–888.Stormo GD, Hartzell GW: Identifying protein-binding sites from unaligned DNA fragments.

*Proceedings of the National Academy of Sciences of the United States of America*1989, 86(4):1183–1187. [http://www.pnas.org/content/86/4/1183.abstract] 10.1073/pnas.86.4.1183RegulonDB.[http://regulondb.ccg.unam.mx/]

Lawrence C, Reilly A: An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences.

*Proteins*1990, 7: 41–51. 10.1002/prot.340070105Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A, Wootton J: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

*Science*1993, 262(5131):208–214. [http://www.sciencemag.org/cgi/content/abstract/262/5131/208] 10.1126/science.8211139Kister AE, Fokas AS, Papatheodorou TS, Gelfand IM: Strict rules determine arrangements of strands in sandwich proteins.

*Proceedings of the National Academy of Sciences of the United States of America*2006, 103(11):4107–4110. [http://www.pnas.org/content/103/11/4107.abstract] 10.1073/pnas.0510747103Super-Secondary Structure Database.[http://binfs.umdnj.edu/sssdb/]

Evans PA, Smith AD: Toward Optimal Motif Enumeration.

*In WADS*2003, 47–58.

## Acknowledgments

This article has been published as part of *BMC Bioinformatics* Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Kuksa, P.P., Pavlovic, V. Efficient motif finding algorithms for large-alphabet inputs.
*BMC Bioinformatics* **11**
(Suppl 8), S1 (2010). https://doi.org/10.1186/1471-2105-11-S8-S1

Published:

DOI: https://doi.org/10.1186/1471-2105-11-S8-S1

### Keywords

- Input String
- Alphabet Size
- Motif Finding Algorithm
- Large Alphabet
- Mismatch Region