SEARCHPATTOOL: a new method for mining the most specific frequent patterns for binding sites with application to prokaryotic DNA sequences
 Fathi Elloumi^{1}Email author and
 Martha Nason^{2}
DOI: 10.1186/147121058354
© Elloumi and Nason; licensee BioMed Central Ltd. 2007
Received: 26 January 2007
Accepted: 20 September 2007
Published: 20 September 2007
Abstract
Background
Computational methods to predict transcription factor binding sites (TFBS) based on exhaustive algorithms are guaranteed to find the best patterns but are often limited to short ones or impose some constraints on the pattern type. Many patterns for binding sites in prokaryotic species are not well characterized but are known to be large, between 16–30 base pairs (bp) and contain at least 2 conserved bases. The length of prokaryotic species promoters (about 400 bp) and our interest in studying a small set of genes that could be a cluster of coregulated genes from microarray experiments led to the development of a new exhaustive algorithm targeting these large patterns.
Results
We present Searchpattool, a new method to search for and select the most specific (conservative) frequent patterns. This method does not impose restrictions on the density or the structure of the pattern. The best patterns (motifs) are selected using several statistics, including a new application of a zscore based on the number of matching sequences. We compared Searchpattool against other well known algorithms on a Bacillus subtilis group of 14 input sequences and found that in our experiments Searchpattool always performed the best based on performance scores.
Conclusion
Searchpattool is a new method for pattern discovery relative to transcription factor binding sites for species or genes with short promoters. It outputs the most specific significant patterns and helps the biologist to choose the best candidates.
Background
The availability of complete genomic sequences has opened the door for computational methods to predict binding sites and understand gene regulation. Patternfinding algorithms can be divided into two groups [1]: local multiple sequence alignment algorithms and exhaustive algorithms. Alignment based algorithms (e.g. Gibbs sampling, expectation maximization) may converge to a local maximum without always finding the best patterns [2, 3]. Exhaustive algorithms are guaranteed to find the best patterns within certain constraints. Brazma formalized the problem of pattern discovery as a classification one [4]. Rigoutsos defined the problem as follows [5]: Given a database D (set of sequences), a set of events E (four nucleotides for DNA sequences) and an evaluation function F (that measures the degree of similarity between two events) the task is to determine interesting patterns of events which are contained in D. An interesting pattern is, for example, a frequent and statistically significant pattern. A frequent pattern is one that appears in a minimum number of records (sequences). This minimum number is called the threshold, or minimum support. A significant pattern is one that occurs too frequently to be attributed to chance alone, as judged by having a high statistical score. Frequent patterns can be classified in three categories [6]: all frequent patterns, the closed frequent patterns where all extensions have smaller support, and finally the maximal frequent patterns that are not contained in other patterns. Regular expressions are sometimes used to define the patterns. The pattern can contain the events of E (the fixed alphabet) and ambiguous characters (N, R, W...). The character N (or '.') is a wildcharacter that can represent any event. The density of a pattern is its number of nonwildcharacters.
Most exhaustive algorithms operate by enumerating the solution space. For example, after fixing a minimum support they search for a simple frequent pattern (singleton), extend it, and check if the extended pattern is frequent. The process is then repeated. As the lengths of the patterns increase the running time grows. Some programs use pruning techniques or impose constraints on the kind of patterns so the performance is improved. Well known exhaustive algorithms include Ymf, Weeder and Mitra [7–9]. Others, like Pratt, Teiresias and Splash are not dedicated to patterns relative to binding sites [10–12]. Some programs are based on mining sequential patterns [13], like the Wang program [14], or Tomms that uses a topdown pattern enumeration [6].
In this work we are looking for TFBS relative to prokaryotic species. A summary of the motifs in TF databases (such as RegulonDB and DBTBS) reveal that approximately 67% of the motifs are at least 15 bp [15, 16]. Based on the observations found in these databases and relevant papers, it seems that the motifs for prokaryotic species are likely to be large and contain a number of conserved bases. Based on a search in DBTBS and RegulonDB, we find that 77% of the motifs have at least 2 conserved bases for Bacillus subtilis. For monad patterns without flexible gaps, the corresponding percentage is 88%. 39% of the motifs begin and end with a conserved base. For E. coli, we find that 54% of the motifs have at least 2 conserved bases and 33% of the motifs begin and end with a conserved base.
In general the motifs do not follow a specific type. In DBTBS we find that about 29% of the motifs have few spacers (between one and seven) in the middle and 21% of the motifs have flexible gaps. Approximately 19% of motifs are palindromic.
Local multiple algorithms may be suitable for our application; however, they are not guaranteed to find optimal solutions. Currently available exhaustive algorithms impose some constraints in order to limit the search time and to have good performance: these did not fit our requirements because they are not capable of finding long motifs, as our context demands. For example, in Ymf a target pattern or motif is a string of length 6–8 over the alphabet {A,C,G,T,R,Y,S,W} with 0 to 11 character 'N's inserted in the center and a limited number of R, Y, S, W characters (Ex: CGGNNNNNNSCG). Weeder enumerates all patterns up to a maximum length (max 12) with a fixed number of substitutions (max 4) of the sites when compared to the motif. Mitra searches for contiguous strings (monads) with a fixed number of mismatches (substitutions) and a minimum value of occurrences. On synthetic and biological data Mitra succeeded in retrieving a monad of size 18 bp; however, when the motif is larger and the number of mismatches is higher it consumes more resources and may fail to retrieve the motif. This algorithm can also find dyads or composite patterns where a group of monad patterns occur near each other with spacers in the middle.
None of the available algorithms suited the demands of our context exactly. As we could not find an existing exhaustive search capable of finding long motifs, and knowing that we are interested in searching patterns for binding sites relative to a small set of genes (around 100 genes) believed to be coregulated or share a common pathway or a biologic function, with short promoter regions (between 400–1000 bp), we decided to develop a new exhaustive algorithm that looks for short and large motifs with at least 2 conserved bases. Our algorithm does not require that the user specify the exact length and structure of the motifs, except that they must begin and end with conserved bases. The new algorithm needs to be run just one time and will search for the most conservative (specific) motifs (with conserved bases) that are common to a minimum number of the input sequences.
The length of those patterns found by this new algorithm may vary from 2 to 30 bp or more. The format of the pattern is E (E ∪ '.')* E where E represent an event from the exact alphabet, the character '.' is the wildcharacter N, ∪ represents the union operator and * the repetition from 0 to n times. We assume that a pattern must begin and end with a conserved nucleotide. About 40% of the patterns for Bacillus subtilis fit strictly within this heuristic. While not every possible pattern fits this heuristic, our algorithm is intended to look for the most specific or conservative large frequent patterns that does. Searchpattool will retrieve known and unknown motifs that fit this format. If a desired motif contains at least 2 conserved bases that are not located at the first and last positions, Searchpattool is able to discover the core region that contained the conserved bases. In the case where the motif does not contains conserved bases at all, our algorithm will retrieve the most specific patterns that are shared by a smaller number of sequences. By specifying a minimum support value less than 100%, Searchpattool will output the most specific pattern with different support values.
The algorithm consists of 4 main steps: first, it finds all frequent patterns of different sizes that contain exactly 2 conserved nucleotides; second, it makes them more specific by replacing the wildcharacters with conserved nucleotides; third, it scores all frequent patterns based on summary statistics, and fourth it outputs the best userspecified number of patterns ordered by their statistical score. We follow the same statistical method (zscore) used by Van Helden to score the total number of occurrences for a motif and we introduce a new application of the zscore by applying it to each motif's support [17]. The support statistic is very interesting because it is based on the matching number of sequences.
The search for frequent patterns is done in the positive and negative strands of the input sequences that can be of different sizes. The algorithm provides, for each pattern, useful information including length, support, density, total number of occurrences, positions, zscore for the support and zscore for the total number of occurrences, sites, profiles (matrix of frequency), mean of information content and consensus. The algorithm compares the best patterns and provides information about their similarity and reverse complements. In this work we consider that a pattern and its reverse complement are different candidates and are scored independently.
The format of the pattern is similar to that used by Teiresias [11]; however, Searchpattool and Teiresias are different. Teiresias outputs all maximal (or closed) <L,W> frequent patterns where each subpattern with length W contains at least L residues or conserved bases. The user has to specify L, W and the value of minimum support.
In addition, we developed an associated tool to compute the pvalue associated with the zscore of the support for the best motifs by comparing to a similar search on randomly generated data. We can compute a pvalue for each ranked score by comparing it to a null distribution of similarly ranked scores from simulations. This additional step provides an additional safeguard against the risk of erroneously identifying patterns as common due only to the large number of patterns being considered, and is an improvement over other existing algorithms. Based on the pvalue of the zscore and the previous information about the pattern including its support value and its similarity with other patterns, the user can make inferences about the strength of evidence that their best patterns are truly more common than chance would suggest, and can choose the best candidate or candidates for further biologic experiments.
We assessed Searchpattool by comparing its accuracy to three widely used local multiple algorithms (Meme, Motif Sampler, Consensus) [3, 18, 19]; and one exhaustive algorithm(Mitra) [9]. Meme uses an expectationmaximization algorithm, Motif sampler is a variant of a Gibbs sampler algorithm and Consensus is a greedy algorithm. Exhaustive algorithms allow searching for monad or composite motifs. Our algorithm should be compared to those that look for monad motifs. Most of these algorithms limit the size of monad motifs to 12 bp. Mitra can retrieve large monad patterns. Due to these limitations we chose to compare our algorithm to only Mitra.
We also studied Searchpattool's runtime and its number of patterns by varying different parameters including the minimum support, the maximum length of the pattern and the number of sequences.
Results
Algorithm
In order to run Searchpattool, the user has to specify the input sequences, the minimum support, the maximum length of patterns, the background probabilities for the sequence's species and finally (optionally) the number of patterns to output. Searchpattool has four steps:
Step 1: Search all E ('.')* E frequent patterns
All frequent patterns for family A
Pattern  Matches(No seq, 1st position)  support  total 

AA  [1,7] [7,5]  2  2 
AG  [1,1] [2,3] [2,5] [4,4] [6,4] [7,1]  4  6 
AT  [3,6] [4,2] [5,5] [6,2]  2  4 
A.A  [2,3] [3,6] [4,2] [6,2]  3  4 
A.C  [1,1] [2,5] [7,1]  2  3 
A..G  [2,3] [4,2] [4,4] [6,2] [7,1]  4  5 
A...C  [1,1] [2,3] [4,4]  2  3 
A....A  [3,3] [7,1]  2  2 
Step 2: Deduce from all E ('.')* E frequent patterns the most specific ones
All most specific frequent patterns for family A
Pattern  Matches(No seq,1st position)  support  Total 

AA  [1,7] [7,5]  2  2 
AG  [1,1] [2,3] [2,5] [4,4] [6,4] [7,1]  4  6 
AT  [3,6] [4,2] [5,5] [6,2]  2  4 
A.A  [2,3] [3,6] [4,2] [6,2]  3  4 
ATA  [3,6] [4,2] [6,2]  2  3 
AGC  [1,1] [2,5] [7,1]  2  3 
A..G  [2,3] [4,2] [4,4] [6,2] [7,1]  4  5 
AG.G  [2,3] [4,4] [7,1]  3  3 
ATAG  [4,2] [6,2]  2  2 
A.AG  [2,3] [4,2] [6,2]  3  3 
AG..C  [1,1] [2,3] [4,4]  2  3 
AG.GC  [2,3] [4,4]  2  2 
A....A  [3,3] [7,1]  2  2 
For each family we assure that each pattern is made the most specific for its total number of occurrences but it can lead to a new more specific pattern with fewer occurrences or matching sequences. In our example, the pattern "A...C" becomes "AG..C" (with the same number of occurrences which is 3) and leads to a new pattern "AG.GC" (with only 2 occurrences).
Step 3: Scoring all frequent patterns
Given a set V of m sequences, a subset C (C ⊆ V) of size n, and a pattern P that occurs in 's' sequences from C and matches 'o' positions in C (including double strands), we can compute the probability of pattern P matching s or more sequences of C and the probability of P matching o or more positions in C. For each pattern we compute a zscore of the number of matching sequences or support (zssup) and a zscore of the total number of matching positions (zstot).
One issue with the latter score is the overlapping words. Pevzner defines an autocorrelation coefficient [20]. The presence of the degenerate symbol '.' increases the number of overlapping patterns. To avoid this problem we follow the formula of Van Helden and we count the number of occurrences of a motif while disregarding the overlapping positions [17]. For the former score we introduce a new formula (see Methods section).
Step 4: Output the patterns with best zscores
The patterns can be ordered according to their zscore of the support (zssup) or the zscore of the total number of occurrences (zstot). In this paper, since our experiments are done in prokaryotic species and many transcriptions factor binding sites (TFBS) are rare we use the zssup score as our ordering criterion. The patterns are ordered according to the Zscore of the support (zssup) and the best userspecified numbers of patterns (default value 40 patterns) are selected. For each selected pattern we extract its sites, compute its matrix of frequencies, derive its consensus following the rules adapted from Cavener and calculate its information content score [21]. Since we have patterns of different lengths we take the mean (average) of the information content (MIC). Finally the selected patterns are compared to detect the reverse complements and measure the degree of their similarity. For each pattern P we check if it covers or extends (overlap 100%) another pattern Q and measure their degree of similarity by computing the average site similarity score. It is the same score function defined by Burset and Guigo and used to assess the performance quality of a motif finding algorithm (see average site performance in Methods section) [22, 23]. A similarity score of 1 means that P covers totally Q. So P is an extension of Q. A zero score means that P does not cover Q at all. A score between 0 and 1 means that some sites of P cover some of Q. This scoring function is not symmetric.
A formal description of the algorithm is described in the Methods section.
Predicting the best pattern
We compute the probabilities that the higher zscores (zssup) can be reached by chance by the selection process. This is important because the motif selection process of our algorithm means that we cannot expect the distribution of the (for example) max(zssup) to have a standard normal distribution under the null hypothesis that there is no association between the sequences, so we cannot rely on the magnitude of the zssup from the chosen motifs to judge statistical significance (see methods section).
We suggest the pvalue score be used as a first criterion to select the candidates for the best patterns. If two patterns have the same pvalue one can choose the pattern with the highest support. Finally the user should check if the patterns are reverse complement and examine the similarity values between the patterns.
Implementation
We developed Searchpattool and other programs with Borland C++ under Windows XP. We run it on a PC Pentium 4 (3,2 GHz). The main programs are:

bestzssup.txt: contains the list of patterns with their length, density, support, zscores and list of occurrences.

Sitespositions.txt: contains the sites and the exact positions of the patterns

Sitesforlogos.txt: contains just the sites in order to display logos

Pattprofiles.txt: contains the frequency matrices of the patterns

Pattconsmic.txt: contains the list of patterns plus their consensus and their mean information content

Pattrevcom.txt: indicates for each pattern its reverse complement on the list of patterns

Pattsimilarity.txt: measures the degree of similarity between the listed patterns
SEARCH_BEST_RANDOM_SCORES: a program that computes the best zscores of the support (zssup) based on 1000 random samples. In order to compute the pvalues of the zscore of the support we wrote an R script that generates random samples according to the background probabilities of the species and the specific lengths of the input sequences. The user can run this script and then call this program with the same parameters used in the Searchpattool call including the number of outputs.
COMPUTE_PVALUE: a program that computes the pvalue of the best patterns (as judged by zssup) relative to the current patterns on our list by using the scores from the random data.
SELECT_BEST_ZS_SUP: a program that outputs the n best patterns ordered by their zscore of the support. A run of Searchpattool should precede it. This program is useful if the user wants to specify a number of outputs different from that used in the call of Searchpattool. It avoids searching again for all patterns with the same parameters.
SELECT_BEST_ZS_TOT: a program that outputs the best patterns ordered by their zscore of the total number of occurrences. A run of Searchpattool should precede it
Testing
Data experiments
Presentation of experiment datasets
TF  Pattern Length  No genes (sequences)  No known Sites  No checked sites  MinLen Sequence  MaxLen Sequence  AvrLen Sequence 

SigL  17  6  6  6  124  240  197.6 
comA  15  4  5  4  144  400  276.5 
Hrca  27  2  2  2  119  238  178.5 
zur  14  3  3  3  106  359  234.3 
mntr  19  2  4  2  280  330  305 
gltr  15  2  4  4  197  384  290.5 
glnr  17  3  6  3  110  400  229.66 
Spo0A  7  10  23  21  84  451  284.3 
RocR  15  2  7  4  277  291  284 
Fnr  16  5  6  6  186  246  211.6 
CodY  11  4  4  4  179  451  356 
Fur  20  20  23  21  82  451  231.65 
DegU  20  14  15  11  133  451  335.85 
TnrA  17  21  25  20  161  451  287.47 
More details about the input sequences and the known checked sites are given in the additional files [see Additional file 1]. We designed two tests to compare the accuracy of Searchpattool to that of well known algorithms on these 14 input sequences.
Test1: search for fixed length patterns
In the first test we compare Searchpattool to Meme, Motifsampler, Consensus and Mitra. We provide all algorithms with the exact known length of the target patterns. They share other common parameters including the input sequences, the search in the double strands and a report size of the best 40 patterns. All the algorithms were run using a background model of order 0 for Bacillus subtilis except Mitra that computes its own background model. Meme and Consensus are run twice varying each time the number of sites per sequence. Motif sampler is set to run 20 times and report each time the best 10 motifs. Mitra is set to run 5 times (varying the number of mismatches from 0 to 4) and reports each time the best 40 motifs. Searchpattool is run only one time. Since we are interested in searching for patterns for TFBS that are common to a set of genes believed to be a cluster of coregulated genes from microarray experiments, specifying a minimum support value equal to 100% may not be appropriate for this example. We chose to specify a minimum support value less than 100% for two reasons. First, the input sequences may not represent the complete promoters or may not contain the motif due to imprecision in experiments or in the clustering process. Second, we are interested in looking for the most conservative motifs which are not necessarily shared by all the sequences. In our experiments we set the minimum support value to 60%, because it seemed to be a reasonable minimum level within a cluster for our context.
Test1 setting parameters
Meme  Consensus  Motif sampler  Mitra  Searchpattool  

Run times  2  2  20  5  1 
Number of site per sequence  zoops anr  0n 1n  Max n sites  no  no 
Minimum number of occurrences  no  no  no  2–3  no 
Minimum support  no  no  no  no  60% 
Number of mismatches  no  no  no  0–4  no 
Pattern length input  L*  L  L  L  Max L 
Number of outputs  40  40  10 (x 20) Limited to 40  Best 40 patterns  Limited to 40 best zssup Ordered by pvalue + and support  
Pattern length output  L  L  L  L  2 to L Limited to L 
Different statistics have been suggested to assess the performance quality of a motif finding algorithm [23]. Since all the patterns have the same length, we chose to follow Pevzner and used their nucleotide level performance score ps (see Methods section) [25]. The score will be 1 if the program finds the known sites whereas it will be 0 if it fails to retrieve any known sites. For each algorithm we record only the best performance score.
Searchpattool results
The remaining seven TF known pattern don't follow Searchpattool's model, and as expected the algorithm cannot report all their known sites. However, Searchpattool succeeded in retrieving the most specific patterns that are similar to the correct ones for five; the exceptions are TF Cody and Degu, where it reported no results among the best 40 scored patterns. Figure 3 shows the main results for Searchpattool.
The zscore of the support (zssup) is a useful criterion to maximize in our experiments. However, we note that for patterns with many sites per sequence the zscore of the total number of occurrences (zstot) is higher than the zscore of the support (zssup) and that might affect the rank of the pattern. For example, when ordering the patterns by using the zscore of the total number of occurrences the rank of Gltr pattern becomes 225 (previously ranked 819 for the zscore of the support). Detailed results for Searchpattool test1 can be found in the additional files [see Additional file 2].
Comparison with other algorithms
Test2: search for maximum length patterns
Test2 setting parameters
Meme  Searchpattool  

Run times  2  1 
Number of site per sequence  zoops anr  No 
Minimum support  no  60% 
Pattern length input  Maximum T*  Maximum T 
Number of outputs  40  1000 best zssup Ordered by pvalue + and support  Limited to 40 
Searchpattool results
We checked if Searchpattool retrieved the known sites (with minimum overlap of 100%). We found that it succeeded in retrieving all known sites for four TF which are SigL, coma, HrcA and Mntr. It only retrieved some known sites for the rest of the patterns. The correct sites for Zur and Rocr TF are found but their pvalue rank is greater than 40. The correct length is reported for five TF which are SigL, Glnr, Spo0A, Fur and Tnra. In fact if the known sites of a pattern have some common extensions to the right or left, Searchpattool will report first the larger patterns that are also the most specific with better zscores. For instance, the predicted pattern for ComA TF has a length of 18 bp and is ranked 8, however, the pattern ranked 21 has the correct length which is 15. For ComA TF, Pattern 8 is an extension of pattern 21 with the same number of occurrences, has the best pvalue and so has been selected. We remark also that Searchpattool reports pattern candidates for all TF including Cody and Degu.
Comparison with other algorithms
We run the same 14 input sequences on Meme (zoops and anr). We compared the performance of Searchpattool and Meme based on the average site performance score and by varying the percentage of overlap. Like the first test, for each pattern we declare a program the "winner" if it has the highest performance scores. In order to take into account the rank of the pattern we compute the ratio of rank to averagesitescore.
Searchpattool runtime and number of patterns study
In order to study the scalability of Searchpattool with different numbers of sequences we generated 3 sets of 10, 20 and 100 sequences, respectively. Each sequence has 400 bp. We then ran Searchpattool with a maximum length of 24 and a minimum support value of 80%. We chose these values because they are suitable for a study of a set of linked genes in Bacillus subtilis. Our experiments (see Figure 8) showed that as the number of sequences increased the runtime increased. Interestingly, however, the number of patterns decreased. In fact, Searchpattool spent more time searching common patterns of large number of sequences, which reduced the chance of finding a large number of common patterns. Detailed results can be found in the additional files [see Additional file 4].
Discussion
The problem of discovering patterns for binding sites is complex and it is difficult to know a priori the kind of pattern to search for (organization, size or location). Searchpattool looks for patterns that contain at least 2 conserved bases with a conserved base at the beginning and the end. Its pattern model is derived from observations of several known patterns of transcription factors relative to prokaryotic species. Searchpattool is based on an exhaustive algorithm that searches for all frequent patterns of different sizes (from 2 to the specifiedmaximum length) with useful information like the support, density, length, zssup, zstot and list of positions. The user can view them and select the best ones according to personal criteria. We chose to select the best patterns initially according to the rank of the Zscore of the support (zssup). This score is very interesting for prokaryotic species where many motifs are rare, and it avoids the problem of overlapping motifs. For these best patterns we provide the user with additional information including their sites, matrices of frequencies, consensus, mean of information content and information about their reverse complements and mutual similarities.
Our scoring function allows extraction of the most specific patterns that are large and dense (the conservative patterns). These patterns will have the best zssup scores. By looking at the similarity matrix we can search for less specific similar ones. Finally we compute the pvalue of zssup of selected patterns to check if these patterns are statistically significant. We propose to select the ones which have the best pvalue and the highest support value. In cases where many patterns have the same rank (same pvalue and same support) the user can refine the selection by looking at information about reverse complement, similarity and information content.
Comparing the accuracy of Searchpattool to that of well known local multiple alignment algorithms Meme, Motif sampler and Consensus, our experiments on 14 input sequences have shown that Searchpattool performs very well based on performance scores. In fact, Searchpattool does better than the other algorithms based on the performance score every time when the length is not restricted. However due to its algorithm type, the rank of the patterns are not always the best, especially when compared with Meme which is a local multiple alignment algorithm. In order to take into account the rank, we calculated the ratio of rank to performancescore which measures the precision of each algorithm. Our study shows that Meme performs the best based on these ratios when the pattern length is fixed. If the length is not restricted, Searchpattool has the best rate when the overlap percentage is set to 100%. Searchpattool performs very well for sets of very small numbers of sequences (2 or 3), but Meme fails to retrieve the patterns for those cases. Comparing Searchpattool to the exhaustive algorithm Mitra, our experiments have shown that Searchpattool is more accurate and has better ranking. Searchpattool also can search larger patterns.
Our study shows that Searchpattool performs very well and outputs the most specific patterns that correspond to the known motifs or closer similar ones. However, there were some TF (like Spo0A, Fur or Tnra) where Meme, Consensus or Mitra did better. When the known motif conformed to its model, Searchpattool performed better, as expected. However, there are cases where the known motif cannot be reasonably captured by Searchpattool's motif model. Hence, we suggest using some other motif finding algorithms in conjunction to Searchpattool, for better accuracy.
Many factors affect the runtime and the output of Searchpattool, including the number of input sequences, the values for the minimum support and the maximum length of the target patterns. As the number of sequences increases the run time increases as well. In addition, Searchpattool outputs more patterns and consumes more time when the length is greater than 24 and the support is less than 80%. We note that for our testing we generated random samples with lengths fixed at 400 bp (the maximum upstream length for prokaryotic species) but we know from wellknown regulons that the upstream region can be as small as a dozen base pairs, so in practice we may often have smaller runtimes. On the other hand, if we use input sequences from regulon sets instead of random sets then the genes are more correlated and we may get a larger number of (frequent) patterns. For instance Searchpattool finds 46978 patterns for SigL TF (with minsup = 6 and maxlen = 17), whereas the corresponding maximum number of patterns found from the 1000 random samples is 40020.
For the future we are working on improving the precision of Searchpattool. We are thinking about extending it to allow some ambiguous characters in order to have more general patterns. We also plan to test Searchpattool on other species, and to implement higher orders of Hidden Markov Models.
In order to reduce the runtime we will work on a parallel version of Searchpattool. In fact we can easily process each family of pattern separately and this will improve the performance. We will continue improving Searchpattool's interface and integration of the other programs. We will develop a web application with the same interface for all tools.
Conclusion
We have presented a new method, Searchpattool, for TFBS pattern discovery based on an exhaustive algorithm. Searchpattool looks for the most specific or conservative patterns shared by a set of sequences. Our testing on Bacillus subtilis datasets shows that it performs very well and is efficient for small numbers of sequences. It is easy to use and provides rich and complete information about the best patterns. Either alone or as a complement to other algorithms, Searchpattool can be a powerful tool for discovering novel and important TFBS patterns common to a cluster of genes.
Methods
Algorithm Searchpattool (input_sequence, min_support, max_length, backgroundproba, #output)
Begin
 Read the n input sequence and make their reverse complement.
 Create the 4 arrays Tab_{i} that contains the positions of corresponding nucleotide in the 2n sequences.
 For i = 1 to 4 //each i references a nucleotide N_{i}
○ For j = 1 to 4 //step 1
■ Join (Tab_{i}, Tab_{j}, LFP_{ij}, N_{i}, N_{j}, min_support, max_length)
■ //create a LFP_{ij} that start with N_{i} and finish with N_{j}.
End for
○ Merge and sort by length all LFP_{ij} to LFP_{i}.
○ Morespecific(LFP_{i}, min_support) //step 2
○ Statistics(LFP_{i}, Results_{i}) //step 3
End for
 SortExtractinfo(All_results, #output) //step 4
End
Searchpattool begins by reading the n input sequence and making their reverse complement. Then it allocates 4 arrays (of 2n elements) relative to the 4 nucleotides. Each array (Tab_{i}) contains the positions of corresponding nucleotide in the 2n sequences (in the double strands).
The module 'Join' creates the lists of patterns of the form E (.)* E. For each nucleotide i it creates the list of all frequent patterns (LFP_{i}) that start with i and finish with all the 4 others with just wildcharacters in the middle The number of wildcharacter varies from 0 to maxlength2. The module 'Morespecific' makes a list of patterns of the form E (.)* E more specific by replacing wildcharacter with nucleotides. The format of the pattern will be E (E ∪ .)* E. The module 'Statistics' computes the Zscores of all patterns by family and store the results in text files. Finally the module 'SortExtractinfo' sorts the patterns according to the zscore of the support and selected the desired number of patterns with all useful information.
Computation of the zscores
For each pattern P we define these variables:
p = probability of P at any given position
n = number of sequences in the sequence set.
L_{j} = length of the j^{th} sequence
k = length of P
s = support of P over the n sequences
o = number of observed matching positions of P without overlapping positions
T = the number of possible matching positions of P
T = (2 * Σ_{j = 1, n} (L_{j} + 1  k))  o * (k1)
The count of T does not allow overlapping positions in the double strands. For each occurrence of P we exclude the next k1 positions from the count.
The expected occurrences of P (eo) is:
eo = p * T
The variance (vo)is:
vo = T * p * (1p).
The zscore for total number of occurrences of pattern P is:
Zstot = (oeo)/sqrt(vo).
The probability that there will be at least one occurrence of pattern P within the sequence 'j' (q_{j}) is [24]:
q_{j} = 1  (1p) T_{j} where T_{j} = 2 * (L_{j}k+1).
The expected number of matching sequences (es) is:
es = Σ_{j = 1,n} (q_{j}).
We can estimate the variance of the support (vs) under an independence assumption as
vs = Σ_{j = 1,n} ((1 q_{j}) q_{j})
The zscore of the support of pattern P is:
zssup = (ses)/sqrt(vs).
For this statistic, we do not assume it will have a Gaussian distribution, nor do we need the independence assumption to hold, since the significance of the score is computed via simulations from random data. For our experiments we assumed that the four nucleotides are independent so the background probability p of pattern P at a position is the product of the background probabilities of well conserved nucleotides contained in P.
Scoring by information content
We can score selected patterns according to their information content. Given a pattern of length k and its list of instances, the information content (IC) of this pattern is defined to be
IC = Σ_{j = 1,k}Σ_{c} p_{c,j} log_{2} p_{c,j}/b_{c}
Where p_{c,j} is the frequency with which character c (A,G,C or T) occurs in position j among the pattern occurrences and b_{c} is the background frequency of c.
Since we have patterns of different lengths we take the mean (average) of the information content (MIC) as a second score for significance
MIC = [Σ_{j = 1,k}Σ_{c} p_{c,j} log_{2} p_{c,j}/b_{c}]/k
The MIC score is useful for refining the selection of the best patterns.
Computing the pvalues of the best zscores
For each experiment, we select the 'h' highest scored motifs zssup_{i} (with 1<=i<=h) and for each zssup_{i} we compute a probability pvalue_{i} that the i^{th} maximum score chosen by our selection process would be at least zssup_{i} if the input sequences were random. To calculate this probability in an unbiased way we use the following approach: first, we generate random samples of the same length and in the same quantity as the input sequences using the background probabilities of the four nucleotides. For each random sample we run Searchpattool with the same parameters as the original experiment and select the best 'h' zssup* scores; denote these zssup_{1}*, zssup_{2}*, ... zssup_{h}*, ordered from largest to smallest. We repeat this process for 1000 different sets of randomly generated sequences. The end result is that we have a distribution of 1000 of each of the maximum zssup_{1}*, the secondtohighest zssup_{2}*, and so forth to the h^{th}highest. Each of our observed zssup_{i} is then compared to the distribution of the equivalent order statistics from the randomly generated samples. We compute the pvalue corresponding to the zssup_{i} as the proportion of zssup_{i}*>= zssup_{i}. This allows us to account for the selection process in computing the probability that a particular motif would have support as high as was observed by chance. Note that based on these 1000 randomly generated sequence sets the lowest possible pvalue for any motif is <0.001, corresponding to the case when none of the zssup_{i}* are greater or equal to the observed zssup_{i}.
Nucleotide level performance score
For each pattern we define TP,FN,FP as respectively the number of nucleotides positions in both known sites and predicted sites, the number of nucleotide positions in known sites but not in predicted sites and the number of nucleotide positions not in known sites but in predicted sites. The nucleotide level performance score (ps) is defined as TP/(TP+FN+FP).
Average site level performance score
For each pattern we define STP, SFN, SFP as respectively the number of known sites overlapped by predicted sites, the number of known sites not overlapped by predicted sites, and the number of predicted sites not overlapped by known sites. The site sensitivity score is defined as STP/(STP+SFN). The site positive predictive value score is defined as STP/(STP+SFP). The average site score is the average of the sensitivity score and positive predictive scores. We compute this score for a minimum overlap percentage taking respectively the values 100 (S100%), 75 (S75%), 50 (S50%) and 25 (S25%).
Availability and requirements
Searchpattool software is available as an additional file with the article [see Additional file 5]. It is hosted at bioinformatics.org and is also accessible via ftp: http://ftp.bioinformatics.org/pub/searchpattool
Operating system: Windows
Programming language: C++
Other requirements: none
License: GNU GPL
Any restrictions to use by nonacademics: none
Abbreviations
 TF:

transcription factor
 TFBS:

transcription factor binding sites
Declarations
Acknowledgements
We are very grateful to Dr Robert Hohman (RTB,NIAID,NIH) for all the support during this project. We would like to thank Dr Teresa Przytycka (NCBI,NLM,NIH) for early discussion of this project and for encouraging us, Dr Mary Ann Robinson (RTB,NIAID,NIH) for reviewing the manuscript and Dr Eleazar Eskin for providing Mitra software. This research was supported in part by an appointment to the Senior Fellow Program, National Institute of Allergy and Infectious Diseases. This program is administrated by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the National Institutes of Health.
Authors’ Affiliations
References
 Brejova B, DiMarco C, Vinar T, Hidalgo SR, Holguin G, Patten C: Finding Patterns in Biological Sequences. In Technical report 2000, CS2000–22. University of Waterloo. Canada;Google Scholar
 Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA Regulatory Motifs within Unaligned NonCoding Sequences Clustered by WholeGenome mRNA Quantitation. Nature Biotechnology 1998, 16(10):939–45.View ArticlePubMedGoogle Scholar
 Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biologye. AAAI Press, Menlo Park, California; 1994:28–36.Google Scholar
 Brazma A, Jonassen I, Eidhammer I, Gilbert D: Approaches to Automatic Discovery of Patterns in Biosequences. Journal of Computational Biology 1998, 5: 277–303.View ArticleGoogle Scholar
 Rigoutsos I, Floratos A, Parida L, Gao Y, Platt D: The emergence of pattern discovery techniques in computational biology. Metabolic Engineering 2000, 2(3):159–177.View ArticlePubMedGoogle Scholar
 Ester M, Zhang X: A TopDown Method for Mining MostSpecific Frequent Patterns in Biological Sequences. Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA April 22–24, 2004 April 22–24, 2004
 Sinha S, Tompa M: A Statistical Method for Finding Transcription Factor Binding Sites. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, San Diego, CA. AAAI Press, Menlo Park, CA; 2000:344–354.Google Scholar
 Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17(suppl 1):S207S214.View ArticlePubMedGoogle Scholar
 Eskin E, Pevzner PA: Finding Composite Regulatory Patterns in DNA Sequences. Bioinformatics (Supplement 1):S354–63. 2002 July 18. ISMB2002, Edmonton, Canada: August 3–7, 2002 2002 July 18. ISMB2002, Edmonton, Canada: August 3–7, 2002Google Scholar
 Jonassen I: Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci 1997, 13: 509–522.PubMedGoogle Scholar
 Rigoutsos I, Floratos A: Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics 1998, 14: 55–67.View ArticlePubMedGoogle Scholar
 Califano A: SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics 2000, 16(4):341–57.View ArticlePubMedGoogle Scholar
 Agrawal R, Srikant R: Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering (ICDE'95), Taipei, Taiwan 1995, 3–14.Google Scholar
 Wang K, Xu Y, Xu Yu J: Scalable Sequential Pattern Mining for Biological Sequences. Proceedings of the thirteenth ACM international conference on Information and knowledge management, CIKM 2004. Washington, DC 178–187.
 Salgado H, GamaCastro S, PeraltaGil M, DiazPeredo E, SanchezSolano F, SantosZavaleta A, MartinezFlores I, JimenezJacinto V, BonavidesMartinez C, SeguraSalazar J, MartinezAntonio A, ColladoVides J: RegulonDB (version 5.0): Escherichia coli K12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res (34 Database):D394–7. 2006 Jan 1 2006 Jan 1Google Scholar
 Makita Y, Nakao M, Ogasawara N, Nakai K: DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res 2004, 32: D75–77.PubMed CentralView ArticlePubMedGoogle Scholar
 Van Helden J, Rios AF, ColladoVides J: Discovering regulatory elements in noncoding sequences by analysis of spaced dyads. Nucleic Acids Research 2000, 28(8):1808–1818.View ArticlePubMedGoogle Scholar
 Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higherorder background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17: 1113–1122.View ArticlePubMedGoogle Scholar
 Hertz GZ, Stormo GD: Identification of consensus patterns in unaligned DNA and protein sequences: a largedeviation statistical basis for penalizing gaps. Proceedings of the Third International Conference on Bioinformatics and Genome Research. Singapore 1995, 201–216.Google Scholar
 Pevzner PA, Borodovsky Myu, Mironov AA: Linguistics of Nucleotide Sequences I: The significance of Deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J Biomol Struct Dyn 1989, 6: 1013–1026.View ArticlePubMedGoogle Scholar
 Cavener DR: Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res 1987, 15: 1353–1361.PubMed CentralView ArticlePubMedGoogle Scholar
 Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34: 353–367.View ArticlePubMedGoogle Scholar
 Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, Van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 2005, 23(1):137–144.View ArticlePubMedGoogle Scholar
 The RSAT web site[http://rsat.scmbb.ulb.ac.be/rsat/]
 Pevzner PA, Sze SH: Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on intelligent systems for Molecular Biology. Edited by: Altman R et al. AAAI Press, Menlo Park, CA; 2000:269–278.Google Scholar
 Sinha S, Tompa M: Performance Comparison of Algorithms for Finding Transcription Factor Binding Sites. Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering, Washington, D.C 2003, 214–220.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.