A novel ensemble learning method for de novo computational identification of DNA binding sites

Background: Despite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought.

Results: We present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs. SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms, each aimed at the discovery of one of these classes of motifs. We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms. SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.

Conclusion: SCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance. By building on components that efficiently search for motifs without user-defined parameters, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users. A user-friendly web interface, Java source code and executables are available at .

c. SPACER: finding gapped and long degenerate motifs

SPACER extends the conceptual framework of BEAM and PRISM to bipartite and long, highly degenerate motifs [3]. SPACER first enumerates all motifs of the form A-S_n-B, where A and B are non-degenerate 3-mers and S_n is a spacer region of 2-11 bases. It then iteratively extends the A and B regions of the most significant motifs in a manner similar to BEAM. Next, SPACER uses PRISM to generalize the A and B cores of the highest scoring motifs. Finally, SPACER specializes the spacer region by taking all the non-degenerate putative binding sites in the regulon that match a given bipartite motif and iteratively merging them in a greedy fashion.
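To make the seeding step concrete, the Java sketch below enumerates bipartite seeds of the form A-S_n-B (non-degenerate 3-mer cores separated by a 2-11 base spacer) and counts their occurrences in a small set of sequences. This is an illustrative sketch only; the class and method names are hypothetical, and SPACER's scoring, extension and generalization steps are omitted.

import java.util.*;

// Illustrative enumeration of bipartite seeds of the form A-S_n-B,
// where A and B are non-degenerate 3-mers and S_n is a 2-11 base spacer.
// A sketch of the seeding step only, not SCOPE's implementation.
public class BipartiteSeedEnumeration {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    public static void main(String[] args) {
        List<String> upstream = Arrays.asList(
                "ACGATCCGGTTAGCATTCGGA",
                "TTCGGAAGCATCCGGTACGAT");
        Map<String, Integer> counts = new HashMap<>();
        for (String a : allKmers(3)) {
            for (String b : allKmers(3)) {
                for (int gap = 2; gap <= 11; gap++) {
                    int n = countOccurrences(upstream, a, b, gap);
                    if (n > 0) counts.put(a + "-N{" + gap + "}-" + b, n);
                }
            }
        }
        // In SPACER the most significant seeds would now be extended and generalized;
        // here we simply report raw occurrence counts.
        counts.forEach((motif, n) -> System.out.println(motif + "\t" + n));
    }

    // All non-degenerate k-mers over the DNA alphabet.
    static List<String> allKmers(int k) {
        List<String> result = new ArrayList<>(Collections.singletonList(""));
        for (int i = 0; i < k; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result)
                for (char base : BASES) next.add(prefix + base);
            result = next;
        }
        return result;
    }

    // Number of positions at which core A is followed, after exactly `gap` bases, by core B.
    static int countOccurrences(List<String> seqs, String a, String b, int gap) {
        int count = 0;
        for (String s : seqs) {
            int span = a.length() + gap + b.length();
            for (int i = 0; i + span <= s.length(); i++) {
                if (s.startsWith(a, i) && s.startsWith(b, i + a.length() + gap)) count++;
            }
        }
        return count;
    }
}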
On a test set containing 39 experimentally defined regulons (based on Zn(II)Cys6 bipartite fungal motifs and long, highly degenerate bacterial motifs), SPACER significantly outperformed a range of popular motif finding programs.

d. Filtering overlapping motifs
Substrings of bona fide binding sites yield trivially high scores under any statistical metric that correlates well with biological relevance. These artifactual motifs are filtered from the output of each of SCOPE's three constituent algorithms based on the degree to which their instances actually overlap in the regulon's sequences. Overlap is measured using the Phi score (the same measure used to define accuracy when one of the two motifs is the set of true binding sites), which measures the percent overlap between the bases of two motifs in a given regulon. When two motifs overlap by at least a fixed threshold, the lower scoring motif is discarded. The Phi score cutoffs were determined empirically from the synthetic datasets (Section S2): 0.01, 0.01 and 0.5 for BEAM, PRISM, and SPACER, respectively. SPACER's relatively high threshold reflects the long, highly degenerate nature of the motifs it reports.
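As an illustration of this filtering step, the sketch below discards any motif whose instances overlap a higher-scoring motif's instances by at least the threshold. The overlap measure used here (shared covered bases divided by the smaller motif's covered bases) is a simplified stand-in for the Phi score, and the class and field names are hypothetical.

import java.util.*;

// Sketch of the overlap filter: when two reported motifs cover largely the same
// bases in the regulon, the lower-scoring motif is discarded.
public class OverlapFilter {

    static class Motif {
        final String consensus;
        final double score;              // significance score (higher is better)
        final Set<Integer> coveredBases; // positions covered by this motif's instances
        Motif(String consensus, double score, Set<Integer> coveredBases) {
            this.consensus = consensus; this.score = score; this.coveredBases = coveredBases;
        }
    }

    // Keep a motif only if it does not overlap a higher-scoring motif by >= threshold.
    static List<Motif> filter(List<Motif> motifs, double threshold) {
        List<Motif> sorted = new ArrayList<>(motifs);
        sorted.sort(Comparator.comparingDouble((Motif m) -> m.score).reversed());
        List<Motif> kept = new ArrayList<>();
        for (Motif candidate : sorted) {
            boolean redundant = false;
            for (Motif better : kept) {
                if (overlap(candidate.coveredBases, better.coveredBases) >= threshold) {
                    redundant = true;
                    break;
                }
            }
            if (!redundant) kept.add(candidate);
        }
        return kept;
    }

    // Fraction of the smaller motif's covered bases that are shared with the other motif.
    static double overlap(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size() / (double) Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        Motif full = new Motif("TGACTCA", 12.0, new HashSet<>(Arrays.asList(10, 11, 12, 13, 14, 15, 16)));
        Motif sub  = new Motif("GACTC",    9.0, new HashSet<>(Arrays.asList(11, 12, 13, 14, 15)));
        filter(Arrays.asList(full, sub), 0.5).forEach(m -> System.out.println("kept: " + m.consensus));
    }
}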
The Phi score approach is an approximation to a more rigorous interpretation of the artifactual substring problem. If m_1 is the highest scoring motif, then the remaining sites can be considered independent only if we recompute their statistical significance conditioned on the observed distribution of m_1. Thus, the significance of all reported motifs must be recalculated by conditioning on the existence of those motifs that are more significant. This approach was described by Blanchette and Sinha [7] and implemented as the program Best Explanators, which is run as a post-processing step by YMF [5].
Although Best Explanators provides a rigorous framework in which artifactual, overlapping motifs can be identified, its computational requirements grow prohibitively burdensome for complex statistical metrics. The Phi score filter approximates this approach by assuming that the conditional probability p(m_i | m_j) differs from p(m_i) if and only if motifs m_i and m_j overlap in the regulon. Furthermore, it assumes that if m_i and m_j overlap, then p(m_i | m_j) will not be significant. While there exist transcription factors whose binding sites may overlap to some degree [8], such examples are rare and do not seem to justify the computational cost associated with the complex statistical metrics we employ. Our implementation of Best Explanators showed no observable difference from the Phi score method on the synthetic test sets.
While we consider the Phi score thresholds to be fixed, the command line executable exposes these parameters so that users may run Best Explanators if so desired.

e. Filtering repetitive motifs

Repetitive regions in upstream sequences (such as AT repeats) cause problems for most statistical metrics because they are highly autocorrelated. Since autocorrelation violates the independence assumption inherent in most statistical metrics, repetitive motifs tend to have artificially high scores. A common solution is to mask repetitive regions in the input sequences. To minimize the burden on the user, we incorporate this masking into the individual algorithms by removing repetitive motifs after each iteration of BEAM and of SPACER's non-degenerate motif search. Thus, we remove all motifs with a periodicity of 1 or 2 (e.g., poly-A and poly-AT motifs).
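A minimal Java sketch of this periodicity check is shown below; the class and method names are illustrative rather than SCOPE's actual API.

// Sketch of the repetitive-motif filter: non-degenerate motifs whose consensus
// repeats with a period of 1 or 2 (e.g., AAAAAA or ATATAT) are removed after each
// search iteration.
public class RepeatFilter {

    // True if every base equals the base `period` positions before it.
    static boolean hasPeriod(String motif, int period) {
        for (int i = period; i < motif.length(); i++) {
            if (motif.charAt(i) != motif.charAt(i - period)) return false;
        }
        return true;
    }

    // Motifs with periodicity 1 or 2 (poly-A, poly-AT, etc.) are treated as repeats.
    static boolean isRepetitive(String motif) {
        return hasPeriod(motif, 1) || hasPeriod(motif, 2);
    }

    public static void main(String[] args) {
        for (String m : new String[]{"AAAAAA", "ATATAT", "TGACTCA"}) {
            System.out.println(m + " repetitive? " + isRepetitive(m));
        }
    }
}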

S2. Validation on Synthetic Data Sets
Synthetic data sets were constructed using sets of randomly selected 800 bp upstream sequences from S. cerevisiae as background. For each data set, one or more cis-regulatory element(s) were selected at random from the TRANSFAC database [9], and planted into a collection of randomly selected upstream sequences. We constructed 120 synthetic datasets, which varied widely in number of genes in each dataset, number of cis-regulatory elements planted and number of noisy genes (upstream regions without planted motifs) present. All development and parameter optimization was performed using these synthetic regulons.
We assessed SCOPE's performance on the synthetic datasets using the over-representation metric. We measured performance on these datasets using Positive Predictive Value (PPV) and False Negative rates, calculated via textual comparison of the highest-scoring motif returned by SCOPE against the planted motif. In this evaluation, we considered two motifs to be textually equivalent if one is a subset of the other and is at least half the length of the other. That is, if m_1 is shorter than m_2, then m_1 and m_2 are equivalent if and only if m_1 is a substring of an instantiation of m_2, and the length of m_1 is at least half that of m_2.
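The sketch below implements this equivalence test for the simple case in which the shorter motif is non-degenerate and the longer motif is an IUPAC consensus; the class name and the handling of degenerate characters are assumptions made for illustration, not the exact code used in the study.

import java.util.*;

// Sketch of the textual-equivalence test: two motifs are equivalent if the shorter
// one is a substring of an instantiation of the longer one and is at least half its length.
public class TextualEquivalence {

    private static final Map<Character, String> IUPAC = new HashMap<>();
    static {
        IUPAC.put('A', "A"); IUPAC.put('C', "C"); IUPAC.put('G', "G"); IUPAC.put('T', "T");
        IUPAC.put('R', "AG"); IUPAC.put('Y', "CT"); IUPAC.put('S', "CG"); IUPAC.put('W', "AT");
        IUPAC.put('K', "GT"); IUPAC.put('M', "AC"); IUPAC.put('B', "CGT"); IUPAC.put('D', "AGT");
        IUPAC.put('H', "ACT"); IUPAC.put('V', "ACG"); IUPAC.put('N', "ACGT");
    }

    static boolean equivalent(String found, String planted) {
        String shorter = found.length() <= planted.length() ? found : planted;
        String longer  = found.length() <= planted.length() ? planted : found;
        if (2 * shorter.length() < longer.length()) return false;   // length criterion
        for (int offset = 0; offset + shorter.length() <= longer.length(); offset++) {
            boolean match = true;
            for (int i = 0; i < shorter.length() && match; i++) {
                // each base of the shorter motif must be allowed by the longer consensus
                match = IUPAC.get(longer.charAt(offset + i)).indexOf(shorter.charAt(i)) >= 0;
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(equivalent("ACGT", "ASST"));  // true: ACGT is an instantiation of ASST
        System.out.println(equivalent("TTTT", "ASST"));  // false: no instantiation of ASST matches
    }
}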
We tested SCOPE's sensitivity to a wide range of Sig scores by varying the number of times a motif was planted in each upstream region, as well as the number of upstream regions that did not contain any planted motifs (Figure 3 of the main paper). At Sig values greater than 15, corresponding to strongly over-represented cis-regulatory elements with low internal variation, SCOPE found 93% of all planted motifs. At Sig values between 0 and 15, corresponding to weakly over-represented cis-regulatory elements with low internal variation, SCOPE found 81% of all planted motifs. Finally, at Sig values at or below 0, SCOPE found 31% of all planted motifs. A Sig score of 0 indicates that we would expect one motif of that significance to exist in a random regulon.

a. Test Regulons
Previous studies have found that motif finders perform best on synthetic datasets generated under a model similar to that on which the motif finder is based [10]. Rather than bias our experiments in favor of SCOPE (or any other motif finder), we chose a large range of known regulons from several species.
We tested SCOPE's performance against experimentally determined regulons from S. cerevisiae, B. subtilis, E. coli, and D. melanogaster. Each regulon consisted of a set of upstream regions of genes that had been previously demonstrated to be bound by a given transcription factor based on direct biological evidence. We refer to each regulon by the name of the regulating transcription factor. The databases used for each species were as follows:

b. Measuring accuracy using the Phi score

As described in Methods, we measure a motif's accuracy by computing the Phi score between the instances of the motif in the upstream regions and the instances of the known binding sites in those upstream regions. Using a scoring metric like the Phi score that directly compares the predicted and published binding sites without regard to motif representation is advantageous for a number of reasons.
The obvious advantage is that comparisons between predicted motifs and published motifs can be made in the same manner for all motif finding programs. Additionally, comparing the list of predicted binding sites directly against the list of published binding sites sidesteps the problem of spurious binding sites. (Spurious binding sites arise because motif models are inherently imperfect: a motif model fitted to include all known binding sites will also describe a number of sequences that do not correspond to actual binding sites. For example, if the known binding sites are ACGT and AGCT, the consensus representation will be ASST, which describes ACGT, ACCT, AGCT and AGGT, two of which are spurious.)

Other metrics have previously been used for performance comparisons, such as mutual information [13] and textual overlap [14], for PWM and consensus representations respectively. Both metrics are limited by their assumption of equal lengths for all binding sites. Transcription factor binding sites differ widely in length, sometimes even for the same transcription factor, and mutual information and textual overlap are undefined in these cases.
Papers that use textual overlap as their main metric typically report high performance by this measure (see, for example, [14]), in contrast to the low accuracy reported here and in other papers that use Phi-based accuracy as a performance metric. We view this disconnect as an indication that textual overlap can be a misleading measure of similarity, as two motifs that look very similar can describe very different sets of binding sites. For example, consider the motif pair ATCAGTCG and ATTAGTCG. Although the motifs are almost identical upon visual inspection, their actual occurrences in the genome do not overlap at all. The textual overlap between this motif pair is 0.88, while the Phi score is 0. As a more general example, if consensus motif m_1 is a substring of m_2, it can be shown that the expected Phi score of m_1 and m_2 is close to 0.25^d, where d is the difference in length between the two motifs (assuming equal base frequencies). Thus, if m_1 is two bases shorter than m_2 but otherwise identical, visual inspection would suggest a reasonable match, while the expected Phi score would be approximately 0.06.
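The example below makes this concrete: it computes the positional (textual) overlap of the two 8-mers above and then shows that, in a toy sequence containing one exact occurrence of each, the bases covered by their occurrences do not overlap at all. The base-overlap computation is a simplified illustration rather than the paper's exact Phi calculation, and the class name is hypothetical.

import java.util.*;

// Textual similarity can be misleading: the two 8-mers below differ at a single
// position (textual overlap 7/8 = 0.88), yet because neither matches an occurrence
// of the other, the bases they cover in a sequence need not overlap at all.
public class TextualVsOccurrenceOverlap {

    public static void main(String[] args) {
        String m1 = "ATCAGTCG", m2 = "ATTAGTCG";
        System.out.printf("textual overlap: %.2f%n", textualOverlap(m1, m2));

        // a toy "upstream region" containing one exact occurrence of each motif
        String seq = "GGATCAGTCGGGGGATTAGTCGGG";
        Set<Integer> cov1 = coveredBases(seq, m1);
        Set<Integer> cov2 = coveredBases(seq, m2);
        Set<Integer> shared = new HashSet<>(cov1);
        shared.retainAll(cov2);
        System.out.println("bases covered by both motifs' occurrences: " + shared.size());
    }

    // Fraction of aligned positions at which two equal-length consensus strings agree.
    static double textualOverlap(String a, String b) {
        int matches = 0;
        for (int i = 0; i < a.length(); i++) if (a.charAt(i) == b.charAt(i)) matches++;
        return matches / (double) a.length();
    }

    // Positions in seq covered by exact occurrences of the motif.
    static Set<Integer> coveredBases(String seq, String motif) {
        Set<Integer> covered = new HashSet<>();
        for (int i = 0; i + motif.length() <= seq.length(); i++) {
            if (seq.regionMatches(i, motif, 0, motif.length())) {
                for (int j = 0; j < motif.length(); j++) covered.add(i + j);
            }
        }
        return covered;
    }
}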
Indeed, in a recent performance comparison, Hu et al. [14] found that low accuracy translated to relatively high "motif level success rates," a measure of the percentage of regulons for which some portion of the real binding sites were found.
We use Phi as our metric of choice because it reflects realistic end-use cases for computationally predicted binding sites, such as intensive interrogation using site-directed mutagenesis followed by gel-shift and reporter gene assays. Phi-based accuracy combines both the likelihood that a predicted binding site belongs to the set of bona fide binding sites (PPV) and the degree to which true binding sites for a given transcription factor are represented in the computational prediction (sensitivity).
The key disadvantage of the Phi score is that it is undefined when a program fails to return any motif. In this study, we remove such instances when computing average Phi, and define Phi to be 0 in such cases for the purposes of counting wins and clear wins.
c. Criteria for performance comparison

Ten other motif finding programs were run on the same dataset to provide a point of comparison for SCOPE's performance. These motif finders were chosen because of their popularity and availability in a web server format, and were intended to represent a diverse range of motif models (including consensus, k-mismatch and PWM) and search algorithms (including exhaustive, Gibbs sampling, expectation maximization, and several specialized search techniques). Of the programs, one (YMF) is specifically aimed at S. cerevisiae, two (BioProspector and MITRA) are aimed at the long, highly degenerate motifs common to prokaryotes, and the others are generalized motif finders.
All programs were run directly from their web site using default parameter settings. The only parameter we specified was species designation, where that option was available. (We were unable to assess the performance of Weeder on prokaryotes because Weeder does not supply a background model for these species).
YMF, whose web server was not available at the time of this study, was run as a downloadable executable. This program is notable in that it also uses a type of ensemble method: the downloadable script runs YMF numerous times under different parameter settings.

A consistent finding in previous performance comparisons was that all programs perform roughly the same, with no statistically significant differences. While Tompa et al. [15] found that Weeder slightly outperformed the other programs and Sinha and Tompa [10] found that YMF outperformed AlignAce and MEME, neither result was statistically significant. Likewise, Hu and colleagues [14] found no clear differences in their performance comparison on prokaryotes, and Harbison et al. [16] found no clear superiority among the 6 motif finders they ran on sequences identified by ChIP (chromatin immunoprecipitation) experiments for 172 transcription factors from S. cerevisiae. Consistent with these results, we found no differences between the ten other motif finding programs in this performance comparison (assessed using a one-way ANOVA, p = 0.13).
In contrast, when SCOPE was added to the comparison, we found a statistically significant difference between all eleven programs (as assessed by a one-way ANOVA, p = 0.00016). The accuracy score of all motif finders was found to be dependent on the identity of the regulon (as assessed by a two-way ANOVA without replication, p = 1.81×10^-13 for regulon).
We assessed the statistical significance of SCOPE's performance margin using permutation testing, a standard nonparametric statistical technique. Permutation testing was particularly appropriate, given the dependency of the accuracy score on the identity of the regulon, because this technique preserves the underlying structure of the data. We randomly permuted each row of Supplementary Table 2 (shuffling motif finder scores for a given regulon) and recomputed the averages for each column (each motif finder). We then computed average accuracies and the difference between the top two averages for 50,000 iterations (the cumulative distribution function for this statistic is shown in Supplementary Figure 2). The difference in accuracy between SCOPE and Weeder (the second-highest scoring program) was 0.065. For a difference of this magnitude on the dataset in question, the p-value from the permutation test was 2×10^-5. In comparison, without SCOPE, the permutation test p-value was 0.44 for the difference in accuracy between the next two motif finders (Weeder and RSAT, which differed by 0.008).
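For clarity, the Java sketch below implements the row-permutation procedure described above on a made-up toy accuracy matrix: scores within each regulon are shuffled across programs, column averages are recomputed, and the difference between the two highest averages is compared against the observed gap. The matrix values and class name are illustrative assumptions, not data from Supplementary Table 2.

import java.util.*;

// Sketch of the row-permutation test: p-value is the fraction of permutations whose
// top-two gap in column averages is at least as large as the observed gap.
public class RowPermutationTest {

    public static void main(String[] args) {
        double[][] accuracy = {            // rows = regulons, columns = programs (toy values)
            {0.40, 0.10, 0.15},
            {0.35, 0.20, 0.05},
            {0.25, 0.15, 0.20},
        };
        double observed = topTwoGap(columnMeans(accuracy));
        Random rng = new Random(42);
        int iterations = 50_000, atLeastAsLarge = 0;
        for (int it = 0; it < iterations; it++) {
            double[][] permuted = new double[accuracy.length][];
            for (int r = 0; r < accuracy.length; r++) {
                permuted[r] = accuracy[r].clone();
                shuffle(permuted[r], rng);       // permute scores within the regulon
            }
            if (topTwoGap(columnMeans(permuted)) >= observed) atLeastAsLarge++;
        }
        System.out.printf("observed gap %.3f, permutation p = %.4f%n",
                observed, atLeastAsLarge / (double) iterations);
    }

    static double[] columnMeans(double[][] m) {
        double[] means = new double[m[0].length];
        for (double[] row : m)
            for (int c = 0; c < row.length; c++) means[c] += row[c] / m.length;
        return means;
    }

    // Difference between the largest and second-largest column averages.
    static double topTwoGap(double[] means) {
        double[] sorted = means.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1] - sorted[sorted.length - 2];
    }

    // Fisher-Yates shuffle of a double array.
    static void shuffle(double[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }
    }
}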
Taken together with the results of the single-factor ANOVAs (with and without SCOPE), this analysis suggests that SCOPE maintains a unique, statistically significant performance margin over the other motif finders run on this dataset, under the specific conditions chosen for this test, and on the dataset in question.

Figure 1: Performance according to relative rank. For each regulon, the programs were ranked according to their accuracy. This plot shows the percentage of regulons for which a given program's accuracy places it in the Top 3, Middle 5 or Bottom 3 relative to the other 10 programs. Thus, for 63% of the regulons, SCOPE was among the best three programs, while it was among the worst three programs for 9% of the regulons. None of the other programs consistently rank in the top 3 relative to the other programs, consistent with our finding that the 10 other programs perform similarly to each other. Program abbreviations and references are defined in Table 1 in the main paper.

[Figure caption fragment: "... Table 1 in the main paper. Bars represent standard error."]

Wins:                       19   5   3   5   2   2  11   3   6  13   3
Scores ≥ 0.50:               8   4   7   6   3   2   5   3   0   5   4
Scores ≥ 0.33:              21  12  14  10   5   9  12   6   9  10  11
Scores ≥ 0.20:              39  24  22  22  14  12  23  14  18  24  13
Regulons returned:          78  70  78  78  78  44  78  77  77  78  41
Clear wins for SCOPE vs:     -  24  24  25  37  18  31  36  36  33  15
Clear losses for SCOPE vs:   -   8   6   6   3   1  13   5   4  13   3

Supplementary Table 2: Summary results for performance comparisons on all regulons. Each column corresponds to one program, with SCOPE in the first column. In situations where a program failed to return any motif for a regulon, the accuracy is undefined and was not used to calculate the averages. A "win" is a regulon for which a program has the highest accuracy, and that accuracy is at least 0.10. A "clear win (loss)" is a regulon for which SCOPE's accuracy was at least 0.10 higher (lower) than the other program's. Abbreviations and citations are defined in Table 1 in the main paper.