Sequence motif discovery has been applied to many types of patterns in DNA and amino acid sequences. For example, motif discovery has been used extensively to elucidate putative transcription factor binding sites (TFBSs) [1, 2] and to discover protein-protein interaction domains [3]. In most cases, motif discovery algorithms take as input only a set of sequences hypothesized to contain a biologically important sequence pattern, and search for patterns that are unlikely to occur by chance. Usually, the concept of "occurring by chance" is captured in some kind of probabilistic model of "random" sequences. Since motifs are usually short and can be highly variable sequence patterns [1], a challenging problem for motif discovery algorithms is to distinguish functional motifs from random patterns that are over-represented by chance.

Discriminative motif discovery attempts to find motifs that occur more frequently in one set of sequences compared to another set. This can help with the problem of distinguishing functional motifs from randomly occurring sequence patterns, because the negative set of sequences may be a better representation of "random" sequences than can easily be captured in a probabilistic model. For example, in order to discover the TFBS motif of a transcription factor (TF), the set of DNA probes from a ChIP-chip [4–6] or DIP-chip [7] experiment that do *not* bind to the TF can be used as the negative sequence set. The actual TFBSs may be more strongly over-represented in the DNA probes that do bind the TF when compared to a negative set of non-binding probe sequences than when compared to a random model of DNA.

Another natural application for discriminative motif discovery is in the search for differences in proteins that have evolved in different environments. Orthologs of a single bacterial protein can be divided into two sets according to whether the organism is a thermophile or a mesophile [8]. Using the thermophilic orthologs as the positive set and the mesophilic orthologs as the negative set, discriminative motif discovery can be used to find motifs that are indicative of a high-temperature environment. These motifs might differ only slightly from the corresponding sites in the mesophilic sequences [8], and may be embedded in much longer conserved domains that would be reported by a non-discriminative motif discovery algorithm.

Many algorithms have been designed to solve motif discovery problems. Most of these algorithms, including ALIGNACE [9], CONSENSUS [10], MEME [11], PATTERN-BRANCHING [12] and YMF [13], are not specifically designed for discriminative motif discovery. Some algorithms, such as WEEDER [14], do make use of a set of negative sequences in scoring candidate motifs. A few algorithms have been developed specifically for discriminative motif discovery, including ALSE [15], DIPS [16], DME [17] and SEEDSEARCH [18].

Both discriminative and non-discriminative motif discovery algorithms can be loosely grouped according to how they represent a motif. A motif may be represented as 1) a string (or regular expression), 2) a position-specific weight matrix (PWM), or 3) a collection of sites. String-based methods represent the motif as a sequence of letters, possibly allowing wildcards or "ambiguity characters" to represent variability in the motif. PATTERN-BRANCHING, WEEDER and YMF (non-discriminative) and SEEDSEARCH (discriminative) use a string representation. In contrast, a PWM specifies a score for each base/amino acid at each position of the motif, assuming independence between positions in the motif. When applied to DNA binding site motifs, PWMs have a strong theoretical basis relating their scores to free energy of binding [19–22]. PWMs are used by CONSENSUS and MEME (non-discriminative) and by ALSE, DIPS and DME (discriminative). The representation of a motif as a collection of sites is used by Gibbs sampling algorithms such as ALIGNACE and GLAM [23] (non-discriminative).
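As an illustration of the PWM representation, the following minimal sketch scores candidate sites with a log-odds matrix. The frequency values, the uniform background and the function names (`log_odds_pwm`, `score_site`, `best_match`) are hypothetical choices made for this example; they are not taken from any of the cited programs.

```python
import math

# Hypothetical position-specific frequencies for a width-4 DNA motif
# (one dict per motif position); the values are illustrative only.
FREQS = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
# Assumed uniform background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_pwm(freqs, background):
    """Convert per-position frequencies into log2-odds scores."""
    return [
        {base: math.log2(col[base] / background[base]) for base in col}
        for col in freqs
    ]


def score_site(pwm, site):
    """Sum the per-position scores; positions are treated as independent."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif width")
    return sum(col[base] for col, base in zip(pwm, site))


def best_match(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    if len(seq) < w:
        raise ValueError("sequence is shorter than the motif")
    return max((score_site(pwm, seq[i:i + w]), i) for i in range(len(seq) - w + 1))


pwm = log_odds_pwm(FREQS, BACKGROUND)
score, start = best_match(pwm, "GGTATACC")
```

Because each column contributes an independent additive term, a site's score is simply the sum of its per-position scores, which is the independence assumption described above.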

Several approaches have been applied to search for discriminative motifs. SEEDSEARCH [18] uses exhaustive enumeration in discrete string space to search for discriminative motifs. SEEDSEARCH counts the number of occurrences of a string, allowing a specified number of mismatches, then applies a hypergeometric significance test to discover patterns that are enriched in the positive set relative to the negative set. The enriched patterns are expanded to construct a set of PWMs, and an EM-like (expectation maximization [24]) heuristic is used to refine the model parameters.
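The counting-and-testing step can be sketched as follows. This is not the SEEDSEARCH implementation: it assumes, for simplicity, that each sequence is counted once if it contains at least one occurrence of the pattern within the allowed number of mismatches, and it uses a one-sided hypergeometric tail as the significance test. All function names are hypothetical.

```python
from math import comb


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def contains(seq, pattern, max_mismatches):
    """True if seq has a window within max_mismatches of pattern."""
    w = len(pattern)
    return any(
        hamming(seq[i:i + w], pattern) <= max_mismatches
        for i in range(len(seq) - w + 1)
    )


def hypergeom_tail(k, total, hits, draws):
    """P(X >= k) when drawing `draws` items without replacement from
    `total` items, of which `hits` are successes."""
    upper = min(hits, draws)
    numerator = sum(
        comb(hits, i) * comb(total - hits, draws - i) for i in range(k, upper + 1)
    )
    return numerator / comb(total, draws)


def enrichment_pvalue(pattern, positives, negatives, max_mismatches=1):
    """Probability of seeing at least this many positive sequences with a
    match if matching sequences were spread at random over both sets."""
    pos_hits = sum(contains(s, pattern, max_mismatches) for s in positives)
    neg_hits = sum(contains(s, pattern, max_mismatches) for s in negatives)
    total = len(positives) + len(negatives)
    return hypergeom_tail(pos_hits, total, pos_hits + neg_hits, len(positives))


positives = ["ACGTACGT", "TTACGTTT", "GGACGAGG"]
negatives = ["TTTTTTTT", "GGGGGGGG", "CCCCCCCC"]
p_value = enrichment_pvalue("ACGT", positives, negatives, max_mismatches=1)
```

In this toy example all three positive sequences and none of the negative sequences match, so the tail probability is 1/20.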

DME (**D**iscriminating **M**atrix **E**numerator) [17] discovers discriminative motifs using an exhaustive, enumerative search of a discrete PWM space. That is, given a finite set of possible PWM columns, DME constructs all possible matrices of a specified width. DME applies a likelihood function to score the relative over-representation of the motif in the positive set.

DIPS (**D**iscriminative **P**WM **S**earch) [16] applies a novel probabilistic score, the "*w*-score", to represent the number and strength of PWM matches in a sequence. A novel hill-climbing heuristic is used to maximise the difference between the mean *w*-scores of the positive and negative sequences. The ALSE (**AL**l **SE**quences) [15] algorithm takes a very similar approach, using an EM-like step to refine PWMs and a scoring function based on the hypergeometric distribution, which is frequently used for modelling over-representation.

Another approach to discriminative motif discovery was taken by Segal *et al*. [25] and extended by Sharan *et al*. [26]. They use a discriminative motif finder as a component of a larger system that integrates promoter primary structure, localization and expression data to predict gene expression from sequence motifs. Their algorithm, which we will refer to as Segal-Sharan, uses a two-step process. First, it discovers discriminatory string-based motifs using SEEDSEARCH. Then, it converts these to PWMs and uses conjugate gradient [27] to find PWMs that maximize a probabilistic scoring function.

The scoring function used by Segal-Sharan is based on two probabilistic sequence models, one for sequences containing a motif site, and one for sequences without a motif site. Sequences with a site are assumed to contain a single motif occurrence, and are modeled using the OOPS (One Occurrence Per Sequence) sequence model [11]. Motif occurrences are assumed to be distributed according to the PWM, treated as a position-specific frequency matrix. Sequences without a site are modelled using a 0-order Markov process. The overall data model of the Segal-Sharan algorithm has two flavors; one forcing all positive sequences to contain a site, and a variation, which we refer to here as the NOOPS (Noisy OOPS) model, allowing a fraction of the positive sequences to contain no motif occurrence [28]. In either case, negative sequences are assumed not to contain a motif site.
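Under the assumptions just described, the two sequence likelihoods could be written as follows. The notation is introduced here for illustration only: *L* is the sequence length, *w* the motif width, *p* the 0-order background distribution and *f* the position-specific frequency matrix. The uniform prior over the *L* − *w* + 1 possible site positions is an assumption of this sketch.

$$
\Pr(\mathbf{X} \mid \text{no site}) = \prod_{k=1}^{L} p_{x_k}
$$

$$
\Pr(\mathbf{X} \mid \text{site}, \theta) = \frac{1}{L - w + 1} \sum_{j=1}^{L-w+1} \left[ \prod_{k \notin \{j, \ldots, j+w-1\}} p_{x_k} \right] \prod_{m=1}^{w} f_{m,\, x_{j+m-1}}
$$

The second expression averages over every possible start position *j* of the single motif occurrence, with letters inside the site drawn from the motif columns and all other letters drawn from the background.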

The Segal-Sharan algorithm labels the input sequences as "1" (positive class) or "0" (negative class). The scoring function, *F*(*D*, *θ*), is the log conditional probability of the class labels given the sequences in the dataset, *D*, and the data model parameters, *θ*. The algorithm attempts to find the *θ* that maximizes *F*(*D*, *θ*), where

$$
F(D, \theta) = \sum_{\langle \mathbf{X}, C \rangle \in D} \log \Pr(C \mid \mathbf{X}, \theta). \qquad (1)
$$

Here *D* is the dataset of labeled sequences <**X**, *C* >, where *C* is the class label of the sequence **X**. The SEEDSEARCH algorithm is used to find string motifs, and these are converted to PWMs, which are used as initial estimates for *θ*. Conjugate gradient is then used to refine each initial *θ*.
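To make the scoring function concrete, the sketch below evaluates *F*(*D*, *θ*) for a toy dataset. It assumes the variant in which every positive sequence contains exactly one site, a uniform prior over site positions, and a class posterior obtained from Bayes' rule with a class prior `prior_pos`; these choices and all function names are assumptions of this example, not details confirmed by the text. The likelihood ratio is used so that the background terms cancel and long sequences do not underflow.

```python
import math


def site_likelihood_ratio(seq, pwm, background):
    """Pr(X | site) / Pr(X | no site) for a single-occurrence model with a
    uniform prior over start positions (an assumption of this sketch)."""
    w = len(pwm)
    n_starts = len(seq) - w + 1
    if n_starts < 1:
        raise ValueError("sequence is shorter than the motif")
    total = 0.0
    for j in range(n_starts):
        ratio = 1.0
        for m, column in enumerate(pwm):
            letter = seq[j + m]
            ratio *= column[letter] / background[letter]
        total += ratio
    return total / n_starts


def log_conditional(dataset, pwm, background, prior_pos=0.5):
    """F(D, theta): sum over labeled sequences of log Pr(C | X, theta)."""
    f = 0.0
    for seq, label in dataset:
        r = site_likelihood_ratio(seq, pwm, background)
        post_pos = prior_pos * r / (prior_pos * r + (1.0 - prior_pos))
        f += math.log(post_pos if label == 1 else 1.0 - post_pos)
    return f


BACKGROUND = {b: 0.25 for b in "ACGT"}


def column(favoured):
    """Hypothetical motif column putting 0.7 on one letter, 0.1 on the rest."""
    return {b: (0.7 if b == favoured else 0.1) for b in "ACGT"}


tata = [column(b) for b in "TATA"]          # informative motif
flat = [dict(BACKGROUND) for _ in range(4)]  # uninformative motif

dataset = [("GGTATACC", 1), ("GGGGCCCC", 0)]
f_tata = log_conditional(dataset, tata, BACKGROUND)
f_flat = log_conditional(dataset, flat, BACKGROUND)
```

A motif identical to the background gives a likelihood ratio of 1 for every sequence, so each sequence contributes log 0.5; a motif that matches the positive sequence but not the negative one yields a higher value of *F*.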

In this work, we have developed a discriminative motif discovery algorithm called DEME (**D**iscriminatively **E**nhanced **M**otif **E**licitation). DEME is based on the discriminative framework of Segal *et al*. [25]. However, we apply a novel combination of global and local search to learn the parameters of the motif model that maximise the discriminative objective function (refer to Eqn. 1).

Since a string-based approach has been shown to be effective for both synthetic and real motif discovery problems [1], the DEME global search algorithm searches in string space. In contrast to the hypergeometric approach of Segal *et al*. [25] and Sharan *et al*. [26], we use "substring search" [11] and "pattern branching" [12] to find good starting points for conjugate gradient. Substring search samples all substrings contained in the positive set and has been shown to work well for both DNA and protein motif discovery problems [11]. Pattern branching follows substring search to expand the search space by considering strings in the local neighbourhood of the sampled strings.
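The two-stage global search can be illustrated with a simplified sketch: enumerate the substrings of the positive set, then greedily walk through the Hamming-distance-1 neighbourhood of a seed string. The greedy stopping rule and the non-discriminative score used here (negated total distance to the positive sequences) are assumptions chosen to keep the example short; they are not DEME's actual objective, and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def substrings(seqs, w):
    """All distinct width-w substrings occurring in seqs."""
    return {s[i:i + w] for s in seqs for i in range(len(s) - w + 1)}


def neighbours(pattern, alphabet="ACGT"):
    """All strings at Hamming distance exactly 1 from pattern."""
    return {
        pattern[:i] + new + pattern[i + 1:]
        for i, old in enumerate(pattern)
        for new in alphabet
        if new != old
    }


def total_distance(pattern, seqs):
    """Sum over sequences of the distance to the closest window."""
    w = len(pattern)
    return sum(
        min(hamming(s[i:i + w], pattern) for i in range(len(s) - w + 1))
        for s in seqs
    )


def branch(seed, score, max_steps, alphabet="ACGT"):
    """Greedy walk: move to the best neighbour while the score improves."""
    best = seed
    for _ in range(max_steps):
        # sorted() makes tie-breaking deterministic.
        candidate = max(sorted(neighbours(best, alphabet)), key=score)
        if score(candidate) <= score(best):
            break
        best = candidate
    return best


positives = ["TTACGTTT", "GGACGTGG", "CCACGTCC"]
seeds = substrings(positives, 4)
refined = branch("ACGA", lambda p: -total_distance(p, positives), max_steps=3)
```

Starting from the seed `ACGA`, which is not itself a substring of any positive sequence, one branching step reaches `ACGT`, the only 4-mer shared by all three sequences.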

To improve the search using conjugate gradient, we reparameterize the objective function to ensure that all solutions are consistent with the underlying sequence models and to allow us to use Bayesian priors on the columns of the PWM model to prevent over-fitting.

Although intended as a general-purpose motif discovery algorithm, DEME includes refinements to make it more effective with protein sequences. To improve the search for protein motifs, DEME utilises prior knowledge of amino acid similarities to estimate the motif model parameters. That is, DEME uses the PAM120 substitution matrix and a Dirichlet mixture model to assign similar weights to amino acids with similar properties. In contrast, a simple Dirichlet prior is used for DNA sequences to prevent over-fitting. Naturally, other refinements can be imagined to improve performance in particular types of DNA or protein motif discovery problems.

A second contribution of this paper is a set of synthetic data problems for discriminative motif discovery. The synthetic problems are intended to simulate situations where algorithms such as DEME would be useful. The idea is derived from the so-called "standard challenge problem" introduced by Pevzner *et al*. [29] as a way of testing non-discriminative motif finders. The standard challenge problem specifies a synthetic DNA dataset consisting of 20 length-600 sequences, each containing an artificially generated motif occurrence. The motif is represented as a string of length 15, and each occurrence contains exactly four mismatches. We augment the standard challenge problem by including a set of negative sequences and define four synthetic discriminative motif discovery problems. These include problems where a variant of the motif is planted in the negative sequences; where the negative sequences contain a strong, "decoy" motif; and where the motif is underrepresented in the negative sequences. We also describe a synthetic problem where the planted motif is generated using a PWM model based on real transcription factor PWMs. We evaluate DEME using these synthetic problems, and complement this with an evaluation of its ability to discover transcription factor binding motifs in yeast. We also illustrate the usefulness of DEME for motif discovery problems in protein sequences.
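A dataset with the dimensions of the standard challenge problem, plus a plain negative set, could be generated along the following lines. This is a sketch under stated assumptions: bases are drawn uniformly, each site overwrites background letters at a uniformly chosen position, and the negative set is pure background. It does not reproduce the four discriminative problem variants, and the function names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def random_sequences(rng, n_seqs, length):
    """Sequences of uniformly drawn bases (assumed background model)."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n_seqs)]


def planted_motif_dataset(n_seqs=20, length=600, width=15, mismatches=4, seed=0):
    """Return (motif, positives, starts): each positive sequence contains one
    copy of the motif with exactly `mismatches` substituted positions."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(width))
    positives, starts = [], []
    for background in random_sequences(rng, n_seqs, length):
        site = list(motif)
        # Choose distinct positions and force a different letter at each one.
        for pos in rng.sample(range(width), mismatches):
            site[pos] = rng.choice([b for b in ALPHABET if b != site[pos]])
        start = rng.randrange(length - width + 1)
        # Overwrite background letters so the sequence length is unchanged.
        seq = background[:start] + "".join(site) + background[start + width:]
        positives.append(seq)
        starts.append(start)
    return motif, positives, starts


motif, positives, starts = planted_motif_dataset()
negatives = random_sequences(random.Random(1), 20, 600)
```

Keeping the planted start positions makes it straightforward to check afterwards whether a motif finder has recovered the true sites.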