### The evolutionary model

The evolutionary model makes the following assumptions: (i) Nucleotides in an aligned position are evolved from a common ancestor. (ii) The weight matrix applies to the common ancestor and to all descendants, a reasonable assumption given the propensity of DNA binding domains of proteins to evolve slower than cis-regulatory modules. (iii) All positions evolve independently, at equal rates, and the probability of fixation of a mutation *α* → *β* at position *k* is proportional to the weight matrix entry of *β* at that position. Suppose we are given a phylogenetic tree Ψ, with the species {*σ*
_{1}, *σ*
_{2}, .... *σ*
_{
K
}} at the leaves. Let the vector *ψ* = (*s*
_{1}, *s*
_{2}, ... *s*
_{
K
}), where *s*
_{
σ
}is the nucleotide from species *σ* in the (single position) alignment *ψ*. The term *Pr*
_{
e
}(*ψ*|*W*, *k*) denotes the probability of observing *ψ* at position *k* when sampling from weight matrix *W*. For each node *j* of the tree, except the root, let *μ*
_{
j
}be the probability of a base in the parent species of *j* having mutated (under neutral evolution) in species *j*. Also, let *ψ*
_{
j
}be the vector formed by elements of *ψ* that correspond to leaf nodes descended from node *j*. Let *C*(*j*) denote the set of children of node *j* and let *r* be the root of the tree. Then, we can write (using the model assumptions):

where *f*(*ψ*
_{
j
}, *α*) denotes the probability of observing *ψ*
_{
j
}given that the base at the parent of *j* is *α*. For a leaf node *σ*, this can be written as
, from the model assumptions. (*δ*
_{
ij
}= 1 if *i* = *j*, and 0 otherwise.) For an internal node *j* (except root *r*), the expression is :

For the special case where Ψ has a star topology, Equation 4 reduces to Equation 1.

### Training parameters in a HMM

Given a sequence *S* and a set of position weight matrices {*W*
_{
i
}}, the objective function to be maximized is *F*(*S*, *θ*) = log(*Pr*(*S*|*θ*)/*Pr*(*S*|*θ*
_{
b
})), where *Pr*(*S*|*θ*) is the probability of generating the sequence *S* using the parameters *θ*, and *θ*
_{
b
}represents the parameter values that only allow the background motif *W*
_{
b
}to be used by the HMM. The sequence *S* can be written as *ψ*
_{1}
*ψ*
_{2} ... *ψ*
_{
L
}, where each *ψ*
_{
i
}is either a single base or an alignment of orthologous bases at a single position. *θ* includes the weight matrices *W*
_{
i
}and their transition probabilities *p*
_{
i
}. Since *Pr*(*S*|*θ*
_{
b
}) depends only on *W*
_{
b
}, which is assumed constant, we shall outline how to maximize log *Pr*(*S*|*θ*), following the description in [23]. A parse of the sequence *S* in terms of the *W*
_{
i
}(i.e., the series of motifs chosen in the successive steps of the generative probabilistic process) is denoted by *T*.

We thus have

The maximization is iterative, with the *t*
^{
th
}iteration computing a model *θ*
^{
t + 1}that improves the objective function from the current model *θ*
^{
t
}. In classical E-M fashion, let us define a function *Q*(*θ*|*θ*
^{
t
}) as

It is easily shown that log *Pr*(*S*|*θ*) - log *Pr*(*S*|*θ*
^{
t
}) ≥ *Q*(*θ*|*θ*
^{
t
}) - *Q*(*θ*
^{
t
}|*θ*
^{
t
}). Thus, if we maximize *Q*(*θ*|*θ*
^{
t
}) over all *θ*, we shall always improve upon log *Pr*(*S*|*θ*
^{
t
}), or remain there if the local maximum has been reached. Let *A*
_{
i
}(*T*, *S*) be the number of times *W*
_{
i
}occurs in the parse *T* of *S*. Also, let *E*
_{
ikψ
}(*T*, *S*) denote the number of times that the alignment *ψ* is emitted (sampled) at the *k*
^{
th
}position of the matrix *W*
_{
i
}, in parse *T* of *S*. Let *l*
_{
i
}denote the length of *W*
_{
i
}. Then we have

which gives us

Note that the only the first term in this expression depends on *p*
_{
i
}, and only the second term depends on *W*
_{
i
}. Hence, we maximize each of these terms independently, with respect to the appropriate free parameters. We first maximize the term

Note that
is the average number of occurrences of *W*
_{
i
}in *S* over all parses *T*.

Thus the term to maximize is
, and this is maximized when

Next, we maximize the second term:

Again, note that
is the average number of times that the alignment *ψ* is sampled at the *k*
^{
th
}position of the matrix *W*
_{
i
}while generating *S*, the average being over all parses *T*. Thus, the term to maximize is
. We first note that in our case, there is a single weight matrix *W*
_{
m
}to be trained. Hence, we need to maximize *Q* with respect to *W*
_{
m
}. We can do this maximization with respect to each column *k* independently. Let *W*
_{
kβ
}denote the (*k*, *β*)^{
th
}entry of *W*
_{
m
}. Thus, for each *k* = 1 ... *l*, we need to maximize *Q* with respect to *W*
_{
kβ
}(*β* ∈ Σ), with the constraint Σ_{
β
}
*W*
_{
kβ
}= 1. Using Lagrangian multiplier *λ*, the objective function becomes *Q* + *λ* (Σ_{
β
}
*W*
_{
kβ
}- 1).

Transforming to log variables *u*
_{
β
}= log *W*
_{
kβ
}to ensure that the *W*
_{
kβ
}remain positive during optimization, we then have the following necessary conditions for optimality (in addition to the constraint
) :

We therefore have a system of five equations (including the constraint) in the variables *u*
_{
β
}(∀*β* ∈ Σ and *λ*. Denoting the vector of these five variables by **u**, we solve this system of equations using Newton's iterative method. Let us write the above system of equations as **F**(**u**) = **0**, where **F**(**u**) = [[*f*
_{
β
}], *f*
_{
λ
}], with *f*
_{
β
}being the left side of Equation 9, and
. Newton's method uses the update relation:

Δ**u** = -(**J**(**u**))^{-1}
**F**(**u**)

where Δ**u** is the change in **u** in the current iteration and **J** is the Jacobian matrix of **F**. The important terms in the computation of **F** and **J** are the first and second partial derivatives of log *Pr*
_{
e
}(*ψ*|*W*
_{
m, k
}) with respect to the *u*
_{
β
}variables. For this purpose, we need to compute *Pr*
_{
e
}(*ψ*|*W*
_{
m, k
}) and its first and second partial derivatives. Computation of *Pr*
_{
e
}(*ψ*|*W*
_{
m, k
}) uses the formulas 4 and 5. The partial derivatives can be computed recursively (over the tree Ψ) by using the chain rule of differentiation. These recursive computations are implemented in a bottom-up manner, so as to avoid redundant computations. Newton's method uses **F** and **J** to iteratively compute new values of **u**, until convergence. The Jacobian matrix **J** in our case is not positive definite, hence Newton's method is not guaranteed to converge. However, in practice, we found the method to always converge from a single initial seed. Upon convergence, the log variables *u*
_{
β
}are transformed back to *W*
_{
kβ
}= *e*
^{
uβ
}. The procedure is repeated for each *k* = 1 ... *l*, and *W*
_{
m
}is then updated with the new values. This update, along with that given by Equation 8, is used iteratively to improve *F*(*S*|*θ*) until the local maximum is reached, as indicated by a very small change in its value.

### Time complexity

The E-step computes
,
and
, for *k* = 1 … *l*, ∀ *ψ*. The Forward-Backward algorithm is run once, in *O*(*LKl*) time, where *L* is the total length of the input sequences, *K* is the number of species, and *l* is the length of the motif *W*
_{
m
}. (This time complexity assumes that nodes in the phylogenetic tree have a fixed maximum degree.) Thereafter,
,
are computed in *O*(*L*) time, and all the
are computed in one scan of the input, expending *O*(*Ll*) time.

The M-step runs Newton's method to solve a system of equations, once for each column of *W*
_{
m
}. Each run of Newton's method goes through a small number (3–5) of iterations. Each iteration computes the first and second partial derivatives of log *Pr*
_{
e
}(*ψ*|*W*
_{
m, k
}) Each of these derivatives can be computed in *O*(*K*) time, where *K* is the number of species (since |*ψ*| ≤ *K*) Hence, **F** and **J** can be computed in *O*(*LK*) time, where *L* is the total length of the sequence. Hence, Newton's method takes *O*(*LK*) time, and is run *l* times, for an overall time complexity of *O*(*LKl*) for the M-step.

Thus, the running time of (each E-M iteration in) PhyME scales linearly with the length of the sequences, the length of the motif desired, and the number of species.

### Implementation details

PhyME is implemented in C++ for Linux, and the source code will be made freely available at http://edsc.rockefeller.edu/cgi-bin/phyme/download.pl. The current implementation runs in a few minutes (on a workstation) for typical applications with total sequence length ~10000 bp, 2–4 species, and motif length of ~10.

PhyME uses the LAGAN alignment tool of Brudno *et al*. [21] for the alignment step. After alignment, the ungapped blocks extracted are required to be at least 10 bp long, and have at least 70% identity. PhyME is implemented to handle an arbitrary phylogenetic tree Ψ relating the input species.

The E-M algorithm is guaranteed to converge only to a local optimum. To address this problem, the motif-finding step is executed a fixed number of times, each time using a randomly chosen substring of the input sequence as the "seed" to initialize *W*
_{
m
}, and truncating the E-M procedure after a small number of iterations. The seed with greatest score *F*(*S*, *θ*) among these runs is then used to run the E-M to convergence and the trained motif is reported, along with all its instances with posterior probability above a certain threshold. To find more motifs, PhyME masks out the central base of each of these instances. Optionally, the user may specify nsites, the expected number of occurrences of each of the desired motifs. In such a case, PhyME turns off training of the parameter *p*, and uses a fixed value computed from nsites. Similarly, an option maxsites specifies the maximum number of occurrences expected.

PhyME considers occurrences on both strands by introducing a new weight matrix *W*
_{
r
}, and an associated transition probability *p*
_{
r
}, in the HMM parameters. The weight matrix is constrained to be the reverse complement of *W*
_{
m
}. The model has a fixed bias of planting the motif in one orientation versus the other, and this bias is trained from the data. PhyME also has the option of capturing local correlations in background nucleotide composition. To implement a *κ*
^{
th
}order Markov background, PhyME uses a special background weight matrix that is of length 1 but uses the knowledge of the previous *κ* bases generated to determine the emission probabilities of the next base.

### Performance score in experiments with synthetic data

We use the following score for measuring the performance of a motif-finding algorithm on synthetic data. Let *S* = {*S*
_{1}, *S*
_{2}, ... *S*
_{
n
}} be the set of *n* input sequences. For any motif *m*, let *I*
_{
mi
}be the set of positions in sequence *S*
_{
i
}that are occupied by an occurrence of *m*. We know the occurrences of the planted motif *m*
^{
k
}, and are evaluating the motif *m*
^{
r
}reported by an algorithm. The performance score Φ is defined as follows:

In other words, it is the number of positions, over all sequences, where occurrences of the known and reported motifs overlap, divided by the total number of positions at which the known *or* the reported motif occurs. Note that if the reported occurrences exactly concur with the known occurrences, the score is 1, and when the reported and known occurrences have no position in common, it is 0.

### Details of experiments with biological data sets

#### Yeast

The genes regulated by each transcription factor are listed in SCPD. For each such "regulon", the known sites and the known weight matrix were extracted from SCPD. Also, 800 bp long upstream sequences of the genes in each regulon were extracted (for *S. cerevisiae*) from the RSA-Tools web site [30]. Orthologous promoters in the other yeast species were obtained from Cliften *et al*. [11]. Let *η* be the number of known binding sites in *S. cerevisiae*. The input to the motif finding algorithm consisted of the sequences from *S. cerevisiae* and their orthologs from one or more of the other species, depending on *K*. (In addition to *S. cerevisiae*, we used *S. mikatae* for *K* = 2, *S. mikatae* and *S. kudriavzevii* for *K* = 3, and *S. mikatae*, *S. kudriavzevii* and *S. bayanus* for *K* = 4.) The length of the motif was also input to each program. Each algorithm was made to report 3 motifs, and for each motif, the top *η* reported occurrences in *S. cerevisiae* were examined. For each such occurrence, the logarithm of the probability of sampling it from the known weight matrix was computed, and a *z*-score of this logarithm was obtained. If the *z*-score was above 3, the occurrence was called a "match". To allow for slight offsets in the reported motif, each reported occurrence was padded with 3 bp of its context, on either side.

PhyME was run with the *maxsites* option set to *η*, and MEME was run with the same option set to *ηK*. We also experimented with running MEME with the *nsites* parameter set to *ηK*. OrthoMEME was run with a zeroth order Markov background, in the "zoops" mode, with expected number of sites between 0.8**η* ("minsites") and 1.2**η* ("maxsites"). PhyloGibbs was run with mutation probability 0.7 ("-G 0.3") for all species, and was asked to report three motifs (three "colors") each with 1.5 × *η* occurrences ("-I") initially. A 3^{
rd
}order Markov background ("-N 3") trained on the full complement of yeast promoters was used, as with PhyME and MEME. The "loose align" option ("-D 1") and the "stop after anneal" option ("-X") were used. These options were suggested by an author of PhyloGibbs (Rahul Siddharthan, personal communication). We experimented with a different value for the mutation probability ("-G 0.7"), with no improvement, except in the RAP1 regulon. EMnEM was run with default parameters, the motif length being input through the "-w" parameter. Phylogenetic trees were derived from each input promoter, using the fastDNAML software of Olsen *et al*. [31]. The alignments were done using the MLAGAN program of Brudno *et al*. [21]. In separate runs, we also tried non-default values of the parameters "-p" (relative rate of motif to background; default 0.5, also tried 0.25) and "-m" (evolutionary model; default Jukes-Cantor, also tried F81). The expected number of instances of each motif per sequence ("-e") was set to *η*/*n* and *η*/*n* + 1 in separate runs, where *n* is the number of input promoters. For each data set, and for each value of *K*, we took the best scoring choice of parameters. This was done to give some advantage to EMnEM, since we lacked expertise in choosing optimal parameter values.

#### Fly

The locations of cis-regulatory modules involved in body-patterning of the early embryo in *D. melanogaster* were obtained from [26], and their sequences were extracted from BDGP [32]. The evaluation procedure was identical to that in yeast, with the following difference. Since there is no complete list of verified sites in the enhancers, we first scanned the sequences (in *D. melanogaster*) with the known weight matrix, and counted matches, by the same measure as above. This count was the value of *η* used in the experiment. An extra complication in the fly data is caused by the fact that each enhancer typically contains sites for multiple transcription factors. We restricted our tests to the factors Kr and Bcd, because their weight matrices are of better quality than others. Moreover, for each enhancer, we chose to test with the transcription factor with most putative sites (matches to its weight matrix).

OrthoMEME was run as in the yeast data sets (see above), except that the "tcm" mode was used now. PhyloGibbs was also run as in the yeast data sets, except that we used a mutation probability of 0.5 ("-G 0.5"), and a 2^{
nd
}order Markov background ("-N 2"), trained on non-coding regions in fly. We also experimented with a higher value of the mutation probability, and tried specifying the initial number of occurrences per motif ("-I") differently, with no clear improvement. EMnEM was run with the Jukes-Cantor evolutionary model ("-m 0") and with the relative rate of motif to background ("-p") set to 0.5 and 0.25 in separate runs. The expected number of motifs was set to *η* and 1.5 × *η* in separate runs. The best performing choice of parameters was used for each data set.

#### Human

The genes comprising each regulon were obtained from TRANSFAC [33]. Mouse and rat orthology information for human genes was obtained from Homologene [34]. Human, mouse and rat promoters were obtained from the UCSC Genome Browser [35].