### Benchmark data sets

The benchmark data set is generated from RegulonDB[38], which stores motif information of *E. coli* K12. Since this benchmark set is essentially the same as the one used in our previous study[6], we only describe it briefly here.

Three types of data sets (Types A, B, and C) were prepared. Type A data sets are generated from the intergenic regions of the *E. coli* genome; the intergenic region sequences are taken from the KEGG database[39]. This set contains 62 motif groups and is called ECRDB62A. It has the following characteristics: the average number of sequences per motif group is 12; the average number of sites per sequence is 1.85; the average sequence length is 300 nt; and the average site width is 22.83.

Type B and C data sets include sequences with symmetric margins on both sides of known sites. The difference between sets B and C is that actual flanking sequences from the *E. coli* genome are used as margins in set B, whereas artificial sequences are used in set C.

Set B is termed ECRDB61B-X, where X denotes the margin size of the sequences. The margin length is varied from 20 to 800 nt, so that the scalability of algorithms with respect to input sequence length can be examined. Because margins are added on both sides of a site, the actual length of an input sequence is the site width plus twice the margin size. The number 61 denotes the number of motif groups available in this benchmark set.

When a large margin size is used, an extended sequence sometimes includes another site of the same motif group, so that two input sequences contain the same pair of sites of the same type. In such a case, one of the input sequences is removed; if this procedure reduces the number of input sequences to one, the entire motif group is removed from the benchmark data set. With large margins, an input sequence of a certain motif group may also contain a site of a different motif type. Since this situation occurs in real data, we generally kept these sequences in the data set; because sites of the target motif are still abundant, we expect an algorithm to be able to pick the target sites as one of its top K scoring motifs. We also observed that when the margin sizes are large (*e.g*. >500 nt), some parts of the sequences fall within coding regions. However, as shown in the previous study[6], these variations have no significant influence on the prediction accuracy.

In the ECRDB62A and ECRDB61B-X data sets, some input sequences carry multiple identical motifs (e.g. two AraC sites on one sequence). In those cases, the positions of both motifs are considered correct. The data set is cleaned so that a given motif position occurs only once in a motif group, and so that a motif group does not contain sites of a different motif multiple times, which would otherwise confuse the motif discovery algorithms.

In addition to the type A and B sets, we prepared the type C set, in which artificially shuffled sequences are used as margins on both sides of known sites. This data set is termed ECRDB61C-X, where X represents the length of the margin sequence. The same set of known sites is used as in the ECRDB61B data set. Three different margin sizes are used: 100, 200, and 400 nt. The margin sequences are artificially shuffled while preserving the dinucleotide frequency of the intergenic regions of the *E. coli* genome. The motivation for using the type C set is to evaluate the performance of the algorithms on sequences that are guaranteed to contain only one target regulatory site per sequence. The benchmark data sets are available at our web site[40].

### Ensemble motif discovery algorithm

The Ensemble Motif Discovery (EMD) algorithm represents a family of motif discovery algorithms that combine multiple results from individual component algorithms. In this study, we primarily used five component algorithms, which are described in the next section. A particular ensemble algorithm is uniquely identified by the set of component algorithms used, the number of times each component algorithm is run (R_{i}, where i denotes a particular component algorithm), the parameter set used to run each component algorithm (P_{i}), and the number of top scoring motifs considered from each run (T_{i}). Thus, an EMD algorithm can be specified as EMD(-CA_{i}-R_{i}-P_{i}-T_{i})_{M}, where CA_{i} denotes a particular component algorithm with the conditions R_{i}, P_{i}, T_{i} (i goes from 1 to *M*, where *M* is the number of different component algorithms combined).

More generally, P_{i} and T_{i} can be changed for each run of algorithm i, which can introduce further variations to the EMD algorithms. However, since it is not possible to explore every possible condition, some of the parameters are fixed. The number of motifs to be reported from individual runs is set to five for all runs of all the component algorithms (T_{i} = 5). The number of runs (R_{i}) is set to 20 for all component algorithms unless specified otherwise. For a single component algorithm, the same parameter set is used for all runs. For deterministic algorithms, however, the parameter set is changed for each run, because otherwise they would produce identical results. For a deterministic algorithm, ten different values are prepared for a tuning parameter (which we call a diversity parameter), one of which is selected randomly for each run. Since we do not change R_{i} and T_{i} across algorithms, and the parameter set P_{i} is the same for all runs of algorithm i, the notation of an EMD algorithm can be simplified to EMD(-CA_{i})_{M}. Specifically, we denote by EMD-X (X = 1~5) the set of all possible combinations of X component algorithms.
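As a small illustration of the EMD-X notation, the variants can be enumerated as combinations of component algorithms; the helper function below is hypothetical, though the algorithm names come from this study:

```python
from itertools import combinations

# The five component algorithms used in this study.
components = ["AlignACE", "MEME", "BioProspector", "MDScan", "MotifSampler"]

def emd_variants(x):
    """Enumerate every EMD-X variant: all combinations of x components."""
    return list(combinations(components, x))
```

For example, EMD-2 has C(5, 2) = 10 members, while EMD-5 has exactly one.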

### Algorithm of the EMD

Suppose an EMD algorithm combines *M* component algorithms, A_{i} (i = 1..M). The EMD algorithm predicts motifs in a set of *N* sequences, S_{i} (i = 1..N). Here a *motif* is defined as a set of local regions (*sites*) in the input sequences that are detected to be similar to each other. In other words, one or sometimes more *sites* are predicted in each input sequence, and all of the sites in an input sequence set form a *motif*. The EMD algorithm has five steps to generate its final prediction: collecting, grouping, voting, smoothing, and extracting. An overview of the algorithm is provided in Figure 1.

#### (1) Collecting

Each algorithm A_{i} runs *R* times against an input data set of *N* sequences and reports the top *K* scoring motifs for each run. For a motif, a component algorithm A_{i} usually detects one site per input sequence in a single run, resulting in a total of *R* × *K* predicted sites in a sequence. If an algorithm A_{i} reports more than *K* motifs, only the top *K* scoring motifs are considered. Conversely, if fewer than *K* motifs are predicted by an algorithm, all of the motifs are considered. Also, since an algorithm sometimes picks more than one site, or no site, in a sequence, the total number of predicted sites in a sequence can be more or less than *R* × *K*. Now, for each combination of an input sequence and a component algorithm, all the predicted sites are collected. Figure 1 illustrates the collecting phase of sites in input sequence number 1 from all the algorithms, A_{1} to A_{M}.
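The collecting step can be sketched as follows. Here `run_algorithm` is a hypothetical stand-in for one run of a component algorithm, returning `{sequence_id: [(score, start, width), ...]}` for its motifs; the data layout is an assumption for illustration, not the authors' code.

```python
def collect_sites(run_algorithm, r, k):
    """Run one component algorithm r times and pool all predicted sites
    per input sequence, keeping at most the sites of the top k motifs of
    each run (the per-run lists are assumed sorted by score)."""
    collected = {}
    for _ in range(r):
        for seq_id, sites in run_algorithm().items():
            collected.setdefault(seq_id, []).extend(sites[:k])
    return collected
```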

#### (2) Grouping

From the collecting phase above, for an input sequence S_{i} we have about *R* × *K* predicted sites from each algorithm A_{i}. In the grouping phase, first, all the sites predicted in an input sequence S_{i} by a given algorithm A_{i} are sorted by the algorithm's major statistical score. Then the predicted sites are divided into *K* groups by the sorted score, with each group having an equal number of predicted sites. Because an input sequence usually has *R* × *K* sites, most of the groups have *R* sites.

Then, the groups of the same score rank across the results from the *M* different algorithms are joined together. This results in *K* groups of predicted sites drawn from all predictions made by all algorithms.
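A sketch of the grouping step for one input sequence, under the simplifying assumption that each algorithm's site list divides evenly into the *K* groups (names and data layout are illustrative):

```python
def group_by_rank(per_algorithm_sites, k):
    """Sort each algorithm's predicted sites by score, split them into k
    equal score-rank groups, and merge groups of the same rank across
    algorithms. `per_algorithm_sites` maps an algorithm name to a list of
    (score, start, width) tuples for one sequence."""
    merged = [[] for _ in range(k)]
    for sites in per_algorithm_sites.values():
        ranked = sorted(sites, key=lambda s: s[0], reverse=True)
        size = max(1, len(ranked) // k)   # assume an even split into k groups
        for rank in range(k):
            chunk = ranked[rank * size:(rank + 1) * size]
            merged[rank].extend((start, width) for _, start, width in chunk)
    return merged
```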

The reason for employing this sorting step is to take into account the score assigned to predicted sites by each algorithm. We have observed from the prediction results of single component algorithms that the correct site position is frequently predicted within the top few score ranks. Therefore, a local region that is consistently predicted as the target site within the top scores in every run of an algorithm can be considered more reliable.

Basically, each of the *K* groups will finally produce a predicted site. Thus, in total, EMD outputs *K* predicted sites for an input sequence. But if the average number of predicted sites, *B*_{i}, for a given input sequence S_{i} is more than one, there is an option to output *B*_{i} sites for the sequence S_{i} from each group. The subsequent voting, smoothing, and extracting steps explain how EMD constructs the final predictions from the collected sites in each group.

#### (3) Voting

For each of the *K* predicted site groups in a sequence S_{i}, all the predicted sites in the group are placed on the sequence. Then, for each position p in the sequence, the number of times position p is covered by the predicted sites, i.e. the number of votes for position p, *V*_{p}, is counted. Figure 1 shows the number of votes *V*_{p} along the sequence position p for site group 1.
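The voting step amounts to coverage counting. A minimal sketch, with predicted sites represented as (start, width) pairs with 0-based starts (the representation is an assumption):

```python
def vote_counts(seq_len, sites):
    """For each sequence position p, count how many predicted sites in the
    group cover p, i.e. the vote V_p."""
    votes = [0] * seq_len
    for start, width in sites:
        # Each site casts one vote on every position it covers.
        for p in range(max(0, start), min(seq_len, start + width)):
            votes[p] += 1
    return votes
```

For two overlapping sites (2, 4) and (3, 4) on a 10 nt sequence, the votes peak where the sites agree.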

#### (4) Smoothing

The vote *V*_{p} along an input sequence is smoothed using a sliding window of width *W*_{s}, which is half of the specified motif width *W* (*W* is a user-specified parameter; its default value is 15, and *W*_{s} is then 8). The sliding window starts from the leftmost position of the sequence, and the sum of the votes in the window, *V'*_{p} = Σ_{j} *V*_{j} (the sum running over the *W*_{s} positions j in the window), is placed at the center position *p* of the window. In the case that *W*_{s} is even, the smoothed score *V'*_{q} is placed at the *q*-th position in the window, where *q* = (*W*_{s}/2)+1.
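The window sum described above can be sketched as follows. This is a toy illustration; boundary handling at the sequence ends is an assumption (here the outermost positions simply keep a zero score):

```python
def smooth_votes(votes, w_s):
    """Slide a window of width w_s over the votes and place each window sum
    at the window's center; for even w_s this is the (w_s/2 + 1)-th window
    position, i.e. the 0-based offset w_s // 2."""
    n = len(votes)
    smoothed = [0] * n
    for start in range(n - w_s + 1):
        smoothed[start + w_s // 2] = sum(votes[start:start + w_s])
    return smoothed
```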

#### (5) Extracting final prediction of sites

The final stage is to pick the top local peak (or the top *B*_{i} local peaks) in the smoothed voting curve. Then a window of length *W* is placed as the final prediction of the site, with the center of the window positioned at the peak.
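A sketch of the extraction step; tie-breaking between equal peaks and overlap handling are assumptions not specified in the text:

```python
def extract_sites(smoothed, w, n_sites=1):
    """Pick the top n_sites local peaks in the smoothed voting curve and
    centre a window of width w at each peak, returning (start, width)."""
    n = len(smoothed)
    # A local peak is a nonzero position at least as high as both neighbours.
    peaks = [p for p in range(n)
             if smoothed[p] > 0
             and (p == 0 or smoothed[p] >= smoothed[p - 1])
             and (p == n - 1 or smoothed[p] >= smoothed[p + 1])]
    peaks.sort(key=lambda p: smoothed[p], reverse=True)  # best peaks first
    return [(max(0, p - w // 2), w) for p in peaks[:n_sites]]
```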

The smoothing and extracting phases are aimed at deciding the final site prediction by majority vote. Although minority votes for different motifs may fail to be selected among the final predictions, this procedure is superior in picking up sites that are consistently predicted with a high score rank. Alternative ways of combining different predictions are discussed in the Discussion.

### Component algorithms

There are several factors to consider when developing an ensemble algorithm. First, we have to identify whether a component algorithm is stochastic or deterministic. In an ensemble algorithm, the component algorithms are usually run multiple times and all the results are combined. We need a way to control the proportion of predictions from different algorithms to avoid any bias. For deterministic component algorithms, such as MDScan[32] and MEME[33], multiple runs generate identical predictions, which would strongly bias the final combined result. To address this problem, we introduced the diversity parameter(s), defined as one or more parameters of an algorithm that one can tune to generate different predictions. In this study, only one diversity parameter is chosen for each of the deterministic algorithms, namely MEME and MDScan.

Five motif discovery programs, namely AlignACE[30], MEME[33], BioProspector[31], MDScan[32], and MotifSampler[34], are selected as the component algorithms for composing the ensemble algorithms. These algorithms use only DNA sequence information as input to identify regulatory motifs. They were selected because of their wide use and their availability for download from the internet, which allowed us to perform large-scale local runs. Below we describe the parameter setting of each algorithm. A random algorithm is also introduced to evaluate the statistical significance of the prediction accuracy of the ensemble algorithm.

One difficulty in testing the performance of an algorithm is setting optimal parameters. Here, most of the parameters for the component algorithms are set to default values, except those that can be easily estimated from general biological knowledge, as in our previous study[6]. For example, we chose 15 as the expected motif width for the component algorithms (except for MEME, which can adjust the motif width by itself), because 15 is approximately midway between the default value of the algorithms and the average motif width in ECRDB62A, which is 21. The reason we used default values for the remaining parameters is that it is infeasible to try all possible combinations of parameters of multiple component algorithms, and the default setting of an algorithm should work reasonably well in most cases, because it was chosen by the authors of the algorithm. The parameter set used for each component algorithm can be found at our web site[40]; it is the same parameter set used in the previous study[6].

### AlignACE

AlignACE[30] is a stochastic motif discovery program based on the widely adopted Gibbs sampling method[11]. Running parameters for AlignACE were set to the defaults, except that the background fractional GC content (gcback) was set to 0.5, as calculated from the whole *E. coli* genome. The expected motif width was set to 15. The major statistical score in AlignACE is the MAP score, where larger is better.

### BioProspector

BioProspector[31] is another variant of the Gibbs sampling algorithm, with fifteen parameters to fine-tune its prediction behavior. We used the default values for most of these parameters, except that the motif width was set to 15 and the number of top motifs to report was set to 5. The background frequency model was generated from the whole *E. coli* genome, using a third-order Markov model. The order of the background model, here and for MDScan, MotifSampler, and MEME below, was determined based on our previous benchmark study of these component algorithms[6]. BioProspector uses a maximum a posteriori (MAP) score to evaluate candidate motifs.

### MDScan

MDScan[32] is an enumerative, deterministic, greedy algorithm. Among its ten parameters, we specified only the following. The motif width was set to 15. The background frequency model was generated from the whole *E. coli* genome, using a third-order Markov model. MDScan uses a maximum a posteriori (MAP) score to evaluate candidate motifs. We chose the -s parameter <the number of top motifs to scan and refine> as the diversity parameter to generate different predictions across multiple runs; it ranged from 10 to 100 with a step size of 10.

### MEME

MEME (Multiple EM for Motif Elicitation)[33] is a deterministic algorithm based on the expectation maximization (EM) technique. It is the only algorithm in this evaluation that does not require a motif width parameter, because MEME can estimate the width by itself. MEME has 28 parameters. We set the maximum data set size in characters to one million, the maximum running time to 3600 CPU seconds, the maximum number of motifs to find to five, and the minimum number of sites for each motif to one. A third-order Markov model was used for the background frequency model. Default values were used for all the other parameters. We chose -maxw <maximum motif width> as the diversity parameter, ranging from 10 to 19 with a step size of 1.

### MotifSampler

MotifSampler[34] is another motif discovery program based on Gibbs sampling. MotifSampler has seven major parameters. We made the following adjustments to the default parameter values: we searched for five different motifs of width fifteen, and the number of repeated runs was set to five. The background frequency model was generated from the intergenic region sequences of the whole *E. coli* genome, using a third-order Markov model. We used the consensus score as the statistical measure of the quality of the predicted motifs.

### The multi-restart algorithm

One interesting question for ensemble algorithms is whether the performance improvement of the ensemble algorithm is due simply to a larger number of runs or to a synergistic effect of multiple runs of multiple algorithms. We developed a multi-restart (RS) algorithm and compared its performance against that of the ensemble algorithms. The basic idea of the RS algorithm is to run a given algorithm multiple times and use the highest-scoring predictions as the final results. The multi-restart algorithm works as follows:

1) Run the component algorithm *R* times, with each run reporting the top *K* motifs.

2) Collect all the predicted motifs and sort them by the major statistical score of the algorithm.

3) Report the top *K* motifs among all the sorted motifs as the final prediction.
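The three steps above can be sketched directly; `run_once` is a hypothetical callable returning a list of (score, motif) pairs for a single run of the component algorithm:

```python
def multi_restart(run_once, r, k):
    """Multi-restart (RS): run the component algorithm r times, pool the
    top-k motifs of every run, sort by score, and report the overall top k."""
    pooled = []
    for _ in range(r):
        pooled.extend(run_once()[:k])   # each run contributes its top k motifs
    pooled.sort(key=lambda sm: sm[0], reverse=True)
    return pooled[:k]
```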

### The random algorithm

In the random motif algorithm, a certain number of sites are randomly picked as site predictions. The number of sites to be picked is decided for each input sequence as follows: first, we conducted 10 runs each of AlignACE, BioProspector, and MotifSampler, and one run of MEME, to obtain the minimum (nSiteMin) and maximum (nSiteMax) numbers of predicted sites. Then, the number of sites to be predicted is randomly chosen between nSiteMin and nSiteMax. The random algorithm is run 1000 times, and the average performance is reported.
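A sketch of how such a random baseline could place its sites, assuming fixed-width sites placed uniformly at random (the site width and the exact placement rule are illustrative assumptions, not taken from the text):

```python
import random

def random_predictions(seq_len, width, n_site_min, n_site_max, rng=random):
    """Pick a random number of sites between the observed minimum and
    maximum, then place each at a uniformly random start position."""
    n_sites = rng.randint(n_site_min, n_site_max)   # inclusive bounds
    return [(rng.randrange(0, seq_len - width + 1), width)
            for _ in range(n_sites)]
```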

### Measure of prediction accuracy

We use performance criteria at two levels, nucleotide and site, to measure prediction accuracy. A more detailed description is given in the previous study[6]. The nucleotide level accuracy measures include the performance coefficient (nPC), sensitivity (nSn), and specificity (nSp). The site level accuracy measures include the site level performance coefficient (sPC), sensitivity (sSn), and specificity (sSp). As described above, an EMD algorithm reports K (or sometimes fewer) motifs for a given input data set. In this study, we evaluated the accuracy of the best prediction out of the K motifs[6].

### Nucleotide level accuracy

First, for each target site with overlapping predicted sites in an input sequence, we define the following values to calculate the accuracy metrics at the nucleotide level: nTP (true positive), the number of target site positions predicted as site positions; nTN (true negative), the number of non-site positions predicted as non-site positions; nFP (false positive), the number of non-site positions predicted as site positions; nFN (false negative), the number of target site positions predicted as non-site positions.

The nucleotide level performance coefficient (nPC), sensitivity (nSn), and specificity (nSp) for a pair of target/predicted sites are defined as:

nPC = nTP / (nTP + nFN + nFP), nSn = nTP / (nTP + nFN), nSp = nTN / (nTN + nFP)

In addition to the accuracy scores for target sites with overlapping predictions, we need to address the cases of target sites that do not overlap any predicted site, and predicted sites that do not overlap any target site. We define the number of non-overlapping target/predicted site pairs as the larger of MT and MP, where MT denotes the number of missed target sites and MP denotes the number of wrong predictions. The accuracy scores of these non-overlapping pairs are set to zero. This definition penalizes algorithms that report either too many or too few site predictions. Based on the scores defined for the site pairs, the accuracy scores of a motif discovery program are calculated as (shown for nPC; nSn and nSp are averaged in the same way):

nPC = (1/G) Σ_{g} (1/N_{g}) Σ_{s} (1/M_{s}) Σ_{m} nPC_{m}

where G is the number of motif groups, N_{g} is the number of sequences in motif group g, M_{s} is the number of target/predicted site pairs in sequence s, and nPC_{m} is the score of pair m.

Thus, the score is first averaged over all site pairs in a sequence, then over all sequences in a motif group, and finally over all motif groups. Note that we allow multiple sites on a sequence as target sites.
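The nucleotide-level scores for one overlapping target/predicted site pair can be computed as below. Sites are (start, width) pairs with 0-based starts, and the specificity uses nTN, following the definitions above; this is a sketch, not the authors' implementation:

```python
def nucleotide_scores(target, predicted, seq_len):
    """Compute (nPC, nSn, nSp) for one target/predicted site pair."""
    t = set(range(target[0], target[0] + target[1]))
    p = set(range(predicted[0], predicted[0] + predicted[1]))
    ntp = len(t & p)                       # site positions predicted as site
    nfn = len(t - p)                       # site positions missed
    nfp = len(p - t)                       # non-site positions predicted as site
    ntn = seq_len - ntp - nfn - nfp        # non-site positions left alone
    npc = ntp / (ntp + nfn + nfp) if (ntp + nfn + nfp) else 0.0
    nsn = ntp / (ntp + nfn) if (ntp + nfn) else 0.0
    nsp = ntn / (ntn + nfp) if (ntn + nfp) else 0.0
    return npc, nsn, nsp
```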

### Site level accuracy

The site level accuracy indicates whether predicted sites overlap true sites by at least one nucleotide position. We define sTP, sFP, and sFN as follows: sTP, the number of predicted sites that overlap a true site by at least one nt; sFP, the number of predicted sites that have no overlap with any true site; sFN, the number of true sites that have no overlap with any predicted site.

For each input sequence, we define the site level performance coefficient (sPC), sensitivity (sSn), and specificity (sSp) as follows:

sPC = sTP / (sTP + sFN + sFP), sSn = sTP / (sTP + sFN), sSp = sTP / (sTP + sFP)

The site level accuracy score for an input sequence set (a motif group) is the average of the scores over all its sequences. The site level accuracy score for the entire benchmark data set is the average of the scores over all input sequence sets.
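The site-level counts and scores for a single sequence can be sketched as follows (sites as (start, width) pairs with 0-based starts; an illustrative sketch only):

```python
def site_scores(true_sites, predicted_sites):
    """Compute (sPC, sSn, sSp) for one sequence, counting a predicted site
    as a true positive if it overlaps any true site by at least one nt."""
    def overlaps(a, b):
        # Half-open intervals [start, start + width) overlap by >= 1 position.
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    stp = sum(any(overlaps(p, t) for t in true_sites) for p in predicted_sites)
    sfp = len(predicted_sites) - stp
    sfn = sum(not any(overlaps(t, p) for p in predicted_sites)
              for t in true_sites)
    spc = stp / (stp + sfn + sfp) if (stp + sfn + sfp) else 0.0
    ssn = stp / (stp + sfn) if (stp + sfn) else 0.0
    ssp = stp / (stp + sfp) if (stp + sfp) else 0.0
    return spc, ssn, ssp
```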