Volume 8 Supplement 6

## Otto Warburg International Summer School and Workshop on Networks and Regulation

# Finding regulatory elements and regulatory motifs: a general probabilistic framework

- Erik van Nimwegen
^{1}Email author

**8(Suppl 6)**:S4

https://doi.org/10.1186/1471-2105-8-S6-S4

© van Nimwegen; licensee BioMed Central Ltd. 2007

**Published: **27 September 2007

## Abstract

Over the last two decades a large number of algorithms has been developed for regulatory motif finding. Here we show how many of these algorithms, especially those that model binding specificities of regulatory factors with position specific weight matrices (WMs), naturally arise within a general Bayesian probabilistic framework. We discuss how WMs are constructed from sets of regulatory sites, how sites for a given WM can be discovered by scanning of large sequences, how to cluster WMs, and more generally how to cluster large sets of sites from different WMs into clusters. We discuss how 'regulatory modules', clusters of sites for subsets of WMs, can be found in large intergenic sequences, and we discuss different methods for *ab initio* motif finding, including expectation maximization (EM) algorithms, and motif sampling algorithms. Finally, we extensively discuss how module finding methods and *ab initio* motif finding methods can be extended to take phylogenetic relations between the input sequences into account, i.e. we show how motif finding and phylogenetic footprinting can be integrated in a rigorous probabilistic framework. The article is intended for readers with a solid background in applied mathematics, and preferably with some knowledge of general Bayesian probabilistic methods. The main purpose of the article is to elucidate that all these methods are not a disconnected set of individual algorithmic recipes, but that they are just different facets of a single integrated probabilistic theory.

## Keywords

## The weight matrix representation of regulatory sites

*l*base pairs long. For any length-

*l*sequence

*s*there will be a well-defined (but generally unknown) binding free-energy

*E*(

*s*) to the regulatory factor. A key assumption [1] that is introduced at this point is that the energy

*E*(

*s*) can be written as the sum of independent contributions

*E*

_{ i }(

*s*

_{ i }) from each of the bases

*s*

_{ i }in segment

*s*, i.e.

This assumption of course generally only holds to some extent. Large-scale *in vitro* studies have shown that the binding energies can deviate from this simple additivity assumption [2]. However, these deviations are typically small, and moreover they seem generally restricted to segments with low binding-energy [3]. At this point it is not yet clear to what extent and for what fraction of regulatory factors, the additivity assumption holds. Some researchers believe that, at least for some factors, functional binding sites deviate significantly from this assumption, and this may well be the case. However, it is this author's experience that in collections of experimentally determined binding sites there is little evidence of correlations between the nucleotides occurring at different positions which, as we will see below, supports the additivity assumption for functional binding sites.

^{ l }possible segments

*s*can be meaningfully divided into 'binding sites' and all other sequences. Since this is not

*a priori*clear at all, it is good to consider what this assumption entails. At a given concentration

*c*of the regulatory factor, the probability that a sequence segment

*s*will be bound by the factor is given by an expression of the following form [4, 5]

*β*= 1/(

*kT*) is the inverse temperature, and

*K*is a constant. (We here ignore the fact that the factor may bind at segments that overlap

*s*, which would prevent the factor from binding at

*s*. Below we will derive the general solution that takes this complication into account.) The expression (2) is an s-shaped function that goes from 0 to 1 as

*ce*

^{βE(s)}goes from much smaller than

*K*to much larger than

*K*. Therefore, at a given concentration

*c*one can naturally separate sequences

*s*into binders, i.e. those with

*E*(

*s*) > log(

*K/c*)/

*β*and non-binders with

*E*(

*s*) < log(

*K/c*)/

*β*. If the concentration of the (active) regulatory factor were to vary continuously between different cellular states, then the set of sites bound by the factor would also vary continuously and it would not make much sense to divide segments

*s*into binders and non-binders. However, if in physiological conditions the regulatory factor primarily switches between an 'off' state, i.e. low concentration

*c*

_{off}, and an 'on' state, i.e. high concentration

*c*

_{on}than there would be a well defined set of sites that are bound when the factor is 'on' and unbound when the factor is 'off', i.e. those with energies in the range

*P*(

*s*) that a randomly chosen binding site will have sequence

*s*, given only the constraint that the average energy of the sites is $\overline{E}$. The maximum entropy formalism [6], i.e. as applied in statistical mechanics, prescribes that distribution

*P*(

*s*) is given by

*s'*is over all length-

*l*sequences, and the sum over

*α*is over the four bases. The Langrangian multiplier

*λ*is chosen such that <

*E*> = ∑

_{ s }

*E*(

*s*)

*P*(

*s*) = $\overline{E}$. Note that this is the same functional form as the well-known Boltzmann distribution. To avoid confusion, note also that equations (2) and (4) are probability distributions over entirely different spaces. The former takes a fixed sequence segment

*s*and compares the probabilities of the bound and unbound states for this sequence segment, whereas the latter assigns the probabilities that a binding site will take on any of the 4

^{ l }possible sequences. In equation (4) the bases at different positions are independent, i.e. $P(s)={\displaystyle {\prod}_{i=1}^{l}{P}_{i}({s}_{i})}$ with

*position specific weight matrix*(WM)

*w*with components

*s*:

Finally, note that *P*(*s*|*w*) gives the probability that a given binding site will have sequence *s*, which should be carefully distinguished from the probability *P*(*w*|*s*) that a sequence segment *s* is a binding site for *w*. The latter cannot be calculated without specifying how likely *s* is to arise under alternative hypotheses as will be discussed in detail below.

Weight matrices are probably the most commonly used representation of regulatory sites and, as has just been shown, can be derived under the assumptions that the contribution to the binding energy from bases at different positions in the site are independent, and that functional binding sites are characterized by a given average binding energy. In this chapter we will focus on regulatory motif finding methods that use WMs. It should be noted, however, that in some circumstances regulatory sites can be adequately represented by either specific DNA words, i.e. when the regulatory factor recognizes essentially only a single sequence segment, or by regular expressions, and there is a substantial amount of work on motif finding in this context. There is also a moderate amount of work on more complex representations of regulatory sites, such as hidden Markov models that allow sites of varying length and correlations between bases at neighboring positions [2, 7].

## Finding WM matches

*w*that summarizes the binding specificity of a regulatory factor. One of the simplest applications is to 'scan' one or more sequences for 'matches' to this WM. Let

*s*denote some sequence of length

*L*, where

*L*is typically much larger than the length

*l*of the WM. We now want to infer if one or more sites for this WM occur in this sequence. Probabilistic inference always [6] takes the following general form

- 1.
Enumerate all possible hypotheses

*H*that could have accounted for the data*D*. - 2.
Assign prior probabilities

*P*(*H*) to each of these hypotheses. - 3.
Define a likelihood model that gives the probability

*P*(*D*|*H*) of producing the entire data*D*under each of the hypotheses*H*. - 4.The posterior probability
*P*(*H*|*D*)for each of the hypotheses is then given by Bayes' theorem:$P(H|D)=\frac{P(D|H)P(H)}{{\displaystyle {\sum}_{\tilde{H}}P(D|\tilde{H})P(\tilde{H})}}.$(8)

For example, assume that we have prior information that precisely one site for WM *w* occurs in *s* and that the other bases in *s* were drawn from a background model *b*. For simplicity we will assume that under this background model *b*, each letter has a probability *b*_{
α
}to be base *α*. In this situation all possible hypotheses are simply all possible locations *i* at which the binding site might start. If we have no information to suggest that the site is more likely to occur at some places than others we use an uniform prior *P*(*i*) = constant. The likelihood *P*(*D*|*i*) of the data, i.e. sequence *s*, given the corresponding hypothesis is given by the product of probabilities that the bases from 1 up to *i* derive from the background model, that the segment from *i* + 1 through *i* + *l* derives from the WM *w*, and that bases *i* + *l* + 1 through *L* again derive from the background model.

*P*(*D*|*i*) = *P*(*s*_{[0,i]}|*b*)*P*(*s*_{[i,l]}|*w*)*P*(*s*_{[i+l,L-i-l]}|*b*),

*s*

_{[i,l]}=

*s*

_{i+1}

*s*

_{i+2}...

*s*

_{i+l}is the length-

*l*segment in

*s*starting after position

*i*with

*P*(

*i*|

*D*) that the site occurs at

*i*is

*s*. Therefore, we generally want to consider the extended set of hypotheses that consists of all possible configurations of binding sites that can be assigned to sequence

*s*. Figure 2 shows a possible configuration containing 3 hypothesized binding sites.

*n*sites can be denoted by a vector

*i*= (

*i*

_{1},

*i*

_{2},...,

*i*

_{ n }) which denotes the positions at which the binding sites occur. The probability of the data given a configuration

*i*is now given by

where *B*_{
i
}is the set of background bases and *S*_{
i
}is the set of hypothesized sites in configuration *i*.

To assign prior probabilities *P*(*i*) to all possible configurations one generally assumes that the data *D* was produced through a stochastic process where at each step with probability (1 - *π*) a single background base is emitted, and with probability *π* a length-*l* binding site is emitted. Under this model the prior probability *P*(*i*) for a configuration *i* depends only on the number of sites *n*(*i*) that occurs in the configuration, and is given by

*P*(*i*) ∝ *π*^{n(i)}(1 - *π*)^{L-ln(i)}

*i*given the data becomes

where the sum in the denominator is over all possible binding site configurations *j*.

*L*, the sum in the denominator can be easily calculated using dynamic programming as follows. Let

*F*

_{ n }denote the sum of the likelihoods of all configurations up to position

*n*in

*s*. We have the recurrence relation

*F*

_{ L }, i.e.

*F*

_{ L }= ∑

_{ j }

*P*(

*D*|

*j*)

*P*(

*j*), which can be calculated in a time

*O*(

*L*) using the above recurrence relation. Similarly, we can move backward from the end of the sequence to have a recurrence relation for the sum of likelihoods of all configurations of positions

*n*through

*L*of

*s*:

*P*(

*i*|

*D*) for a particular configuration

*i*, we can also calculate the posterior probability that a site occurs at a given position, independent of the rest of the configuration. Let us denote by {

*n*} the set of all configurations that have a site at segment

*s*

_{[n,l]}.The posterior probability

*P*({

*n*}|

*D*) is given by the sum of posterior probabilities of all configurations in {

*n*}, i.e.

*F*

_{ n }and

*R*

_{ n }as follows:

where the numerator corresponds to the sum over all configurations that have a site at *s*_{[n,l]}.

Here it is useful to note that, formally speaking, the model that we have introduced is a hidden Markov model and that the expressions (16), (17), and (19) are essentially the same as the so-called forward-backward algorithms of hidden Markov model theory [8, 9]. Researchers with a background in statistical physics tend to think of *F*_{
L
}as a *partition sum* and the recurrence relations are essentially what is known as the *transfer matrix* technique.

Given the WM *w* for a regulatory factor we may use equation (19) to scan any sequence *s* for positions at which functional binding sites for the factor are likely to occur. The likelihood of success of this procedure critically depends on the density of true sites in the input sequence *s*. That is, even in a sequence generated entirely from the background model, segments that are indistinguishable from binding sites will occur by chance at a certain rate. For example, let's assume that such chance 'binding site lookalikes' occur once every 500 bps on average, and assume that we are looking for between 1 and 3 functional binding sites in an intergenic region of length 250 which stems from a bacterial genome. In this case we expect less than 1 binding site to occur by chance, and so we will likely be able to accurately determine the location of the 1 to 3 functional sites. In contrast, assume we are looking for 1 to 3 functional sites in the introns and upstream regions of a human gene, which together might contain as many as 100,000 bps of non-coding DNA. It is clear that in this case the functional sites will 'drown' in a sea of about 200 binding site lookalikes.

At this point the reader may ask how *the cell* distinguishes functional binding sites from mere 'lookalikes'. Comparing equation (2) with (6) and (7), we see that *P*_{bound}(*s*) can be written in terms of *P*(*s*|*w*) and *c*. In other words, two segments *s* and *s'* for which *P*(*s*|*w*) = *P*(*s'*|*w*) necessarily have *P*_{bound}(*s*) = *P*_{bound}(*s'*), and one may thus wonder why two segments that are equally likely to be bound by the regulatory factor are not equally functional. There are a number of reasons. First, in eukaryotic genomes DNA is wrapped up in chromatin and so different sites may have different accessibility to the regulatory factor. Second, binding of the regulator may by itself not guarantee functionality, i.e. a regulatory effect. A number of additional constraints typically have to be satisfied. The site may need to occur in the vicinity of specific other regulatory sites, e.g. to mediate interactions between different factors bound at the different sites. The site may need to occur at a particular distance from the basal promoter and in a particular orientation to be able to interact with the basal machinery, and other constraints currently not yet understood. When we want to look for functional sites in long sequences we thus generally have to use information that goes beyond the probabilities *P*(*s*|*w*) of the individual sequence segments given individual WMs *w*. One type of additional information that can be used is that in some cases functional binding sites are known to cluster on the genome. We now discuss approaches to incorporating this information.

### Finding clusters of binding sites: regulatory modules

It has been well-established that in higher eukaryotic organisms transcription regulation is often implemented through regulatory 'modules' in which multiple binding sites for multiple regulatory factors cluster together relatively tightly in intergenic regions [10]. In some cases one may even know the subsets of regulatory factors that tend to cooperate in regulatory modules for particular biological pathways. For example, a large body of work has identified the sets of transcription factors that are involved in segmentation of the early Drosophila embryo, e.g. see [11].

One approach to distinguishing functional binding sites from nonfunctional ones is to look for such regulatory modules. That is, the idea is to start with a set of WMs {*w*}, preferably from a set of regulatory factors that are believed to interact in regulatory modules, and to look for relatively short genomic segments in which there is a surprisingly high density of sites for the WMs from {*w*}. As far as this author is aware, this general idea was introduced around the same time by a number of groups [12–15]. The implementation we discuss here is most closely-related to the approaches of refs. [14, 15].

*w*(potentially of different lengths). We now assume that the data is produced through a stochastic process where at each step with probability

*π*

_{bg}a background base is generated, and with probability

*π*

_{ w }a WM segment from WM

*w*with length

*l*

_{ w }is generated. The priors of course satisfy the normalization

*π*

_{bg}+ ∑

_{ w }

*π*

_{ w }= 1. For notational simplicity we can consider the background

*b*to just be one of the WMs (with length

*l*= 1) in the set {

*w*}. In this more general model the recurrence relation for

*F*

_{ n }becomes

where the background *b* is now one of the WMs *w*.

The second thing to note is that the sum over all configurations *F*_{
L
}= ∑_{
c
}*P*(*D*|*c*)*P*(*c*|{*π*_{
w
}}) is formally the likelihood of the data *D* under our entire set of hypotheses *c*, that is, it is the probability to obtain the data under the assumed stochastic model. Note that in this expression we have indicated explicitly that this probability depends on the priors {*π*_{
w
}}. The quantity *F*_{
L
}thus summarizes how well the sequence can be explained in terms of the set of WMs in the model. The basic idea of the regulatory module detecting algorithms in [14, 15] is to identify putative regulatory modules with sequence segments that have a high value for the sum *F*_{
L
}of probabilities of all binding site configurations in the segment.

The procedure works as follows. One starts with an intergenic region upstream of a gene of interest in a higher eukaryotic genome. Such intergenic regions are typically quite large, i.e. from 10 Kbps in flies to over 100 Kbps in humans. One then slides a windows of length between 200 and 500 bps or so over this long intergenic region. For each window one then determines the set of priors {*π*_{
w
}} that maximize *F*_{
L
}for the sequence *σ* in the window, and calculates the value of *F*_{
L
}at this maximum. One also calculates the probability *P*(*σ*|*b*) for the sequence in the window deriving entirely from the background model. The ratio *X* = *F*_{
L
}/*P*(*σ*|*b*) then quantifies the 'score' for the window in question. Finally, the predicted regulatory models are all windows for which *X* is larger than some prespecified cut-off, and for which the score *X* is larger than the score for any other window overlapping it.

A key step in this procedure is maximizing *F*_{
L
}with respect to the prior {*π*_{
w
}}. Different regulatory modules may have different densities of sites and we thus want to allow for different priors {*π*_{
w
}} within different windows. Since we do not know the {*π*_{
w
}} for each segment, from the point of view of probability theory one should strictly speaking not maximize with respect to {*π*_{
w
}} but rather integrate over all possible priors {*π*_{
w
}}. However, the resulting expressions no longer allow for an effective dynamic programming solution and this would thus make the problem computationally intractable. However, if the function *F*_{
L
}has a sharp peak with respect to the {*π*_{
w
}} then the height of the maximum is representative for the value of the integral and one can thus think of the maximization of *F*_{
L
}with respect to the {*π*_{
w
}} as an approximation to doing the full integral.

*σ*is of length

*L*the set of equations specifying the maximum with respect to the {

*π*

_{ w }} are

*n*(

*w*)> is the expected number of binding sites for WM

*w*averaged over all configurations, each weighted by its probability. The last equation follows from the fact that, the prior

*P*(

*c*|{

*π*

_{ w }}) is given by

*n*(

*w*,

*c*) is the number of sites for

*w*in configuration

*c*. The derivative then becomes

*π*

_{ w }are normalized to sum to 1 we have

Typically this maximum is found through *expectation maximization* (EM). Starting from an initial guess of the {*π*_{
w
}} we calculate <*n*(*w*)> for all *w* and set a new set of priors {*π*_{
w
}} using equation (24). Under iteration this is guaranteed to lead to an optimum in *F*_{
L
}, although not necessarily the global optimum.

## Motif finding

Up to now we have assumed that we are in possession of the WMs *w* representing the sequence-specificities of the regulatory factors. However, unless one has experimental data that directly measures binding affinities of different sequence segments we generally do not possess such detailed information. Typically the best situation encountered is that we have a collection *S* of sequences that have been determined to be functional binding sites for the regulatory factor. So we now ask what we know about the WM *w* given such a set of sequences *S*, i.e. we aim to calculate *P*(*w*|*S*).

*w*will have sequence

*s*. This can be trivially extended to

*sets*of sequences. That is, the probability to obtain the set of

*n*length-

*l*sequences

*S*when sampling

*n*sequences from the WM

*w*is given by

where in the last equation we have defined ${n}_{\alpha}^{i}$(*S*) as the number of times the letter *α* occurs at position *i* in the sequences *S*. Thus, the probability to obtain sequences *S* when sampling from the WM *w* depends only on the counts ${n}_{\alpha}^{i}$(*S*).

*P*(

*w*|

*S*) for the WM given the set of sites

*S*is formally given by

*P*(

*w*) is the

*prior*probability that the WM is given by

*w*. The denominator is a normalizing constant, which does not depend on the WM (we discuss its meaning in a minute). The prior

*P*(

*w*) represents our prior information about the WM

*w*before we see any sites. As will become clear below, the computations are analytically most easily tractable if we use so-called Dirichlet priors that have the following general form

where *c*_{
i
}is a normalization constant for column *i*, and the ${\gamma}_{\alpha}^{i}$ are constants that determine the prior. Notice that for the particular choice ${\gamma}_{\alpha}^{i}$ = 1 we obtain a *uniform* prior that makes all WMs a priori equally likely, which can be argued to reflect a state of complete ignorance about the WM. In reality, however, we know that for most positions in the site, regulatory factors tend to have distinct preferences for certain bases. That is, we a priori know that a WM column *w*^{
i
}= (0.25, 0.25, 0.25, 0.25) is not very likely. To reflect this information we can choose ${\gamma}_{\alpha}^{i}$ < 1. This will put more weight on WM columns that are 'skewed', i.e. giving low probability to some bases and high probabilities to others. Sometimes we have even more pertinent information. It has, for example, been argued recently that groups of related TFs show the same pattern of highly and less skewed columns [16]. If we are inferring the WM of such a TF we can thus reflect that information by setting ${\gamma}_{\alpha}^{i}$ small for those positions *i* that are known to be highly skewed and ${\gamma}_{\alpha}^{i}$ ≈ 1 for columns that are known not to be very skewed (for example because TFs of that family do not touch the DNA at that position).

*C*is an overall normalization constant. Equation (28) shows why the ${\gamma}_{\alpha}^{i}$ are often called pseudocounts. Increasing ${\gamma}_{\alpha}^{i}$ by 1 has the same effect on the posterior

*P*(

*w*|

*S*) as adding 1 to the number of times ${n}_{\alpha}^{i}$(

*S*) that letter

*α*was observed at position

*i*. Put another way, the posterior

*P*(

*w*|

*S*) has exactly the same functional form as the prior

*P*(

*w*), i.e. both are of the form $\prod}_{\alpha}{({w}_{\alpha})}^{{x}_{\alpha}$ with

*x*

_{ α }the 'count' of base

*α*. Priors that have this property are called

*conjugate*priors. In this particular case it means that one may think of the posterior

*P*(

*w*|

*S*) as the prior for another problem with 'pseudocounts' ${\tilde{\gamma}}_{\alpha}^{i}={n}_{\alpha}^{i}(S)+{\gamma}_{\alpha}^{i}$. How to use the distribution

*P*(

*w*|

*S*) in practice? In order to estimate the WM one could for example determine the WM

*w*that maximizes

*P*(

*w*|

*S*). This maximum posterior probability WM has components

with *n*^{
i
}(*S*) = ∑_{
α
}${n}_{\alpha}^{i}$(*S*) and *γ*^{
i
}= ∑_{
α
}${\gamma}_{\alpha}^{i}$. Note that with a uniform prior ${\gamma}_{\alpha}^{i}$ = 1 the maximum occurs when the WM entries match the observed frequencies. This means, for example, that if a given base *α* is not observed at all at some position *i*, i.e. ${n}_{\alpha}^{i}$(*S*) = 0, we will assume that it is *impossible* for *α* to occur at position *i*. This is true even if the set *S* contains only very few sites.

*P*(

*w*|

*S*). To calculate these expectation values we have to integrate

*P*(

*w*|

*S*) over all possible WMs. That is, for each position

*i*the integral is over the simplex:

*P*(

*w*|

*S*)

*dw*= 1 we obtain

*x*+ 1) =

*x*Γ(

*x*)) we find for the expectation values

Note that in this estimate of the ${w}_{\alpha}^{i}$ no component gets probability zero if we use a prior with ${\gamma}_{\alpha}^{i}$ > 0 for all *i* and *α*.

*P*(

*s*|

*w*), i.e. the probability to obtain sequence

*s*when sampling from the WM. We now calculate an analogous expression

*P*(

*s*|

*S*) = ∫

*P*(

*s*|

*w*)

*P*(

*w*|

*S*)

*dw*, which is the probability to obtain sequence

*s*when sampling from the same WM as the one from which the set

*S*derived (without ever specifying precisely what this WM is, i.e. we integrate over all possible

*w*). Using again the general identity (31) we obtain

That is, we find that *P*(*s*|*S*) is precisely the probability that would be obtained from expression *P*(*s*|*w*) when using the expectation values <${w}_{\alpha}^{i}$> as an estimate for the WM *w*.

Up to now we assumed that we were given a set *S* of length-*l* sequences that were sampled from the WM. Except in cases where we have, for example DNase footprinting data that give the precise locations of the regulatory sites, such specific data are again generally rare. It is much more common that we have a set of *n* longer sequences that we know (or strongly suspect) to contain one (or more) regulatory site(s) each for a common regulatory factor. In this situation we simultaneously need to infer *where* in the sequences the sites occur and what the WM is from which they derive.

*D*that consists of

*n*length-

*L*sequences, and we know that each sequence contains precisely one binding site of length

*l*for a common regulatory factor. The set of hypotheses for this problem then corresponds to all combinations (

*w*,

*i*) of a WM

*w*and a vector

*i*= (

*i*

_{1},

*i*

_{2},...

*i*

_{ n }) that denotes the positions where the regulatory sites occur, i.e

*i*

_{1}is the position of the site in the first sequence,

*i*

_{2}the position of the site in the second sequence, etcetera. We now first calculate the probability

*P*(

*D*|

*w*,

*i*) of the data given (

*w*,

*i*). Let

*S*

_{ i }denote the set of

*n*length-

*l*segments that make up the hypothesized binding sites with positions

*i*and let

*B*

_{ i }denote all background nucleotides in the data

*D outside*of these segments. In analogy with equation (13) the probability

*P*(

*D*|

*w*,

*i*) is then given by

where the first product is over all nucleotides outside of the hypothesized binding sites, and the second product is over all hypothesized binding sites *s*.

*P*(

*D*|

*w*) of the data given the weight matrix only by summing over all possible binding site configurations

*i*:

where *P*(*i*) is a prior probability distribution over vectors of site assignments, and the sum is over all possible vectors. One then next searches the space of all possible WMs *w* for those with high *P*(*D*|*w*). In the second approach one calculates the probability *P*(*D*|*i*) of the data given the vector of site positions only by integrating over all possible weight matrices. Formally [6] this probability is given by

*P*(*D*|*i*) = ∫ *P*(*D*, *w*|*i*)*dw* = ∫ *P*(*D*|*w*, *i*)*P*(*w*)*dw*,

and next the set of all site positions *i* is searched for those with high *P*(*D*|*i*). We now discuss these approaches in turn.

### Maximizing *P*(*D*|*w*) through Expectation Maximization

*w*that maximizes the probability of the data

*P*(

*D*|

*w*). Note that, as we have seen in section "Finding WM matches", the sum over all possible site configurations

*i*can be easily performed through dynamic programming once the matrix

*w*is given. For the particular case we are considering, i.e. assuming precisely one site per sequence, the probability

*P*(

*D*|

*w*) is given by the product of the probabilities for the individual sequences

where *D*_{
m
}is the *m* th sequence, the product over *σ* is over all bases outside of the site (i.e. the background), and we have used the uniform prior *P*(*i*_{
m
}) = 1/(*L* - *l* + 1) over the binding binding site position *i*_{
m
}.

*w*that maximizes

*P*(

*D*|

*w*) we proceed analogously as we did for finding the set of priors {

*π*

_{ w }} in equations (21) through (24). For each column

*k*of the WM we have the four equations

where <${n}_{\alpha}^{k}$> is the number of times letter *α* is expected to occur at position *k* of the regulatory sites under posterior distribution *P*(*i*|*D*,*w*).

*s*(

*i*

_{ m }+

*k*) at position

*i*

_{ m }+

*k*of sequence

*m*is equal to

*α*, then the last derivative on the right simply divides

*P*(

*D*

_{ m }|

*w*,

*i*

_{ m }) by ${w}_{\alpha}^{k}$, and else the derivative is zero. We thus have

*s*(

*i*

_{ m }+

*k*) =

*α*and zero otherwise. We thus find

*α*occurs at position

*k*of the binding sites in

*D*

_{ m }under the posterior distribution

*P*(

*i*

_{ m }|

*D,w*). Summing over all sequences

*m*we thus obtain

*P*(

*D*|

*w*) the weight matrix components obey the equalities

As in section "Finding clusters of binding sites: regulatory modules" one can use EM to solve these equations. We start with a randomly chosen WM *w* and calculate <${n}_{\alpha}^{k}$> for that WM. We then update the WM components using equation (46) and repeat until the WM no longer changes. This procedure is guaranteed to converge to a local optimum of *P*(*D*|*w*).

Note that in the above we assumed just one site per sequence but it is easy to extend these derivations to arbitrary configurations, using the identities derived in section "Finding WM matches". Probably the first algorithm developed to find regulatory motifs in this way is the well-known MEME algorithm [17], and by now there are quite a number of algorithms that have been developed using this general idea, e.g. MDScan [18].

Once an optimal WM *w*_{*} is found it is straightforward, i.e. using equation (19), to calculate the posterior probabilities *P*(*i*_{
m
}|*D*, *w*_{*}) that a site occurs at position *i*_{
m
}in sequence *m* and this allows one to distinguish between high confidence and low confidence sites. Programs that use the EM approach to motif finding often report such probabilities. Note, however, that the posterior probabilities *P*(*i*_{
m
}|*D*, *w*_{*}) should not be confused with the posterior probabilities *P*(*i*_{
m
}|*D*) which give the posterior probability that a site occurs at *i*_{
m
}*independent* of what the WM *w* is (we derive an expression for this probability below). The latter quantifies how much evidence there is in the data *D* that a site occurs at *i*_{
m
}, whereas *P*(*i*_{
m
}|*D,w*_{*}) assumes in addition that the inferred WM *w*_{*} is correct. Since in many cases there is a reasonably high probability that *w*_{*} does not match precisely the WM from which the site derives, the probabilities *P*(*i*_{
m
}|*D, w*_{*}) will typically be significantly larger than *P*(*i*_{
m
}|*D*).

*π*

_{ w }} and the set of WMs {

*w*} that maximize the overall probability

*P*(

*D*|{

*w*}, {

*π*

_{ w }}) of the data. For each

*WM w*the expectation-maximization update equation of the WM components would take on the form

where <*n*(*w*)> is the expected total number of sites for WM *w* that occur in *D* and <${n}_{\alpha}^{k}$(*w*)> is the expected number of those sites that have a base *α* at position *k*. The problem with this approach is that EM will very often lead to a local rather than the global optimum (it roughly speaking moves uphill from the starting point to the nearest local optimum). So depending on the initial sets {*w*} and {*π*_{
w
}} the EM procedure may lead to very different optima and the higher the dimension of the search-space, the more serious this problem becomes. Therefore, in practice algorithms such as MEME do not search for multiple WMs simultaneously but rather find one WM at a time. In addition, programs like MEME will start from many different initial WMs *w* and perform EM for each of them, reporting the best optimum found in any of these EMs.

### Motif sampling

*P*(

*D*|

*i*). To calculate (37) we substitute (35) for the likelihood and first note that it can be separated in a part

*P*(

*B*

_{ i }|

*b*,

*i*) that depends only on the background, and a part

*P*(

*S*

_{ i }|

*i*) that is given by an integral, i.e.

*P*(

*D*|

*i*) =

*P*(

*B*

_{ i }|

*b,i*)

*P*(

*S*

_{ i }|

*i*) with

*n*

_{ α }(

*B*

_{ i }) is the number of times base

*α*occurs in the background

*B*

_{ i }, and

*P*(

*w*) we use a Dirichlet prior as in (27), and use the general identity (31) to calculate the integral, which results in

where ${n}_{\alpha}^{k}$(*S*_{
i
}) is the number of times base *α* occurs at position *k* of the sites in *S*_{
i
}, and the ${\gamma}_{\alpha}^{k}$ are again the pseudocounts of the Dirichlet prior. The most common situation is that we know little about the WM that can be expected and in such situations either a uniform prior ${\gamma}_{\alpha}^{k}$ = 1 or one that biases toward the corners of the simplex, e.g. ${\gamma}_{\alpha}^{k}$ = 0.5, are reasonable choices. However, as we mentioned in the discussion of equation (28), if we already have a set of known sites *S*_{known} for the motif, in which base *α* appears ${m}_{\alpha}^{k}$ times at position *k*, then the posterior probability for the WM has the same form as a prior with counts ${\gamma}_{\alpha}^{k}={m}_{\alpha}^{k}+1$. Using this posterior as a prior in equation (50) we can thus also calculate the probability of obtaining the sequence segments in *S*_{
i
}when sampling from the same WM as the WM from which the set *S*_{known} derived. That is, equation (50) easily allows for the incorporation of prior knowledge about the WM *w*.

#### The meaning of equation (50)

*P*(

*S*

_{ i }|

*i*) is a product of independent factors for each column

*k*. We thus focus on a single column only. In addition, we will assume a uniform prior over WMs, i.e. ${\gamma}_{\alpha}^{k}$ = 1. The expression for a single column then takes on the simpler form

where we used that Γ(*x* + 1) = *x*! for integer *x*. The second equality on the right is to clarify that *P*(*S*) can be written as the product of two factors. The first of these factors, 3!*n*!/(*n* + 3)!, is the inverse of the binomial coefficient $\left(\begin{array}{c}n+3\\ 3\end{array}\right)$. This binomial coefficient corresponds to the number of different sets of counts {*n*_{
α
}} that are possible. That is, it counts the number of vectors of integers (*n*_{
a
}, *n*_{
c
}, *n*_{
g
}, *n*_{
t
}) such that ∑_{
α
}*n*_{
α
}= *n*.

The second factor in equation (51), ∏_{
α
}*n*_{
α
}!/*n*!, is the inverse of the multinomial coefficient *n*!/(∏_{
α
}*n*_{
α
}!) which gives the number of different ways that *n* objects can be distributed over 4 boxes such that *n*_{
a
}objects are in the first box, *n*_{
c
}in the second, *n*_{
g
}in the third, and *n*_{
t
}in the fourth. Thus, the probability *P*(*S*) for a column of *n* bases is inversely proportional to the number of ways in which the counts {*n*_{
α
}} of this column can be realized. In summary, there are 4^{
n
}possible outcomes for the *n* bases in the column. The probability distribution *P*(*S*) assigns a probability to each of these that is precisely inversely proportional to the number of the 4^{
n
}outcomes that lead to the counts {*n*_{
α
}}. As a result, the total probability to obtain an outcome with counts {*n*_{
α
}} is *constant* for all $\left(\begin{array}{c}n+3\\ 3\end{array}\right)$ possible counts (because we have to sum *P*(*S*) over all possible outcomes that lead to the same set of counts).

*n*we can approximate the multinomial coefficient using Stirling's approximation to find

*H*({

*n*

_{ α }}) is the entropy of the distribution

*n*

_{ α }/

*n*:

Thus, the probability *P*(*S*) is largest for sets of sequences whose base distributions have lowest entropy.

#### Back to motif sampling

*P*(

*D*|

*i*) in terms of the counts ${n}_{\alpha}^{k}$(

*S*

_{ i }) and

*n*

_{ α }(

*B*

_{ i }). Finally, using a uniform prior over hypotheses

*i*, the posterior

*P*(

*i|D*) becomes simply

where the sum in the denominator is over all possible assignments *j* = (*j*_{1},..., *j*_{
n
}) for the positions of the binding sites.

*i*

_{*}that maximizes

*P*(

*i*|

*D*), or we would for each position

*i*

_{ k }calculate the posterior probability

*P*(

*i*

_{ k }|

*D*) that a site occurs at position

*i*

_{ k }in sequence

*k*, which is formally given by

*P*(

*i*|

*D*) is a complicated nonlinear function of the base counts ${n}_{\alpha}^{k}$(

*S*

_{ i }) and

*n*

_{ α }(

*B*

_{ i }) we cannot separate it easily into contributions from the different hypothesized sites in

*i*and there is generally no way to calculate sums like (55) other than explicitly summing over all (

*L*-

*l*+ 1)

^{n-1}states. To find site configurations

*i*with high

*P*(

*i*|

*D*) researchers have in general resorted to Markov chain Monte-Carlo techniques for sampling the distribution

*P*(

*i*|

*S*)[19]. The most commonly used way of sampling the distribution

*P*(

*i*|

*S*) is through so-called

*Gibbs sampling*[20] and consists of iterations of the following steps, which are illustrated in Fig. 4

- 1.
Randomly select one of the

*n*sequences with uniform probability. - 2.
If sequence number

*m*was selected, remove the segment*s*located at position*i*_{ m }from the set of sites*S*_{ i }of the current configuration. Denote this set of (*n*- 1) sequences as ${S}_{i}^{-}$ and the new configuration as*i*^{-}. - 3.
For every position

*i*_{ m }= 0 through*i*_{ m }=*L*-*l*denote the new configuration that results from placing the site at*i*_{ m }in sequence*m*as (*i*^{-},*i*_{ m }) and calculate*P*(*D*|*i*^{-},*i*_{ m }). - 4.Select a new configuration by sampling the position of the site in sequence
*m*according to the probability distribution$P({i}_{m}|D,{i}^{-})=\frac{P(D|{i}^{-},{i}_{m})}{{\displaystyle {\sum}_{{j}_{m}=0}^{L-l}P(D|{i}^{-},{j}_{m})}}.$(56)

*s*(

*i*

_{ m }+

*k*) is the base that occurs at position

*i*

_{ m }+

*k*in sequence

*m*. Note that this expression is precisely the ratio between the probability

*P*(

*S*[

*i*m,L]|Si) of the site at

*i*

_{ m }deriving from the same WM as the others in ${S}_{i}^{-}$, i.e. as in equation (34), and the probability $P({s}_{[{i}_{m},l]}|b)$ of this segment under the background, i.e.

By iterating these steps one can sample the entire distribution *P*(*i*|*D*) and, for example, estimate the posterior probability *P*(*i*_{
m
}|*D*) that a site occurs at position *i*_{
m
}in sequence *m*, i.e. by the fraction of time a site occurs at *i*_{
m
}during sampling. The probabilities *P*(*i*_{
m
}|*D*) rigorously quantify the evidence in *D* that a site occurs at position *i*_{
m
}. Thus, whenever *P*(*i*_{
m
}|*D*) is large we can be confident that a site does occur at *i*_{
m
}.

To make a single prediction for the set of regulatory sites in *D* one searches for the configuration *i*_{*} that maximizes *P*(*i*|*D*). In some approaches, e.g. [21], this is done simply by keeping track of the highest probability configuration that was observed during sampling. However, more accurate determination of the optimal configuration *i* can be obtained through *simulated annealing* [22]. One introduces a parameter *β* and instead of sampling from *P*(*i*|*D*) one samples from a probability distribution which is proportional to *P*(*i*|*D*)^{
β
}. At the start of the search *β* is set to a small number and then *β* is slowly increased with time. As *β* increases more weight will be put on configurations with high probability and eventually the sampler will 'freeze' into a state with locally optimal probability *P*(*i*|*D*). Provided the annealing is done slowly enough the optimum will correspond to the globally optimal state. This is for example the approach taken by the PhyloGibbs algorithm [23].

*i*

_{*}is found through simulated annealing one can of course use normal sampling, i.e. with

*β*= 1, to obtain the posterior probabilities of the sites in

*i*

_{*}. Given the optimal configuration

*i*

_{*}one can of course also report the expected WM given this configuration, which has components

*n*sequences we can of course also sample much more general configurations

*c*. Most generally, one could allow varying numbers of sites for multiple WMs. The top left panel in figure 5 shows such a general configuration with sites for 3 different motifs (red, blue, and green). If we assume the same kind of priors as we used in section "Finding clusters of binding sites: regulatory modules" then the prior probability for a particular configuration

*c*, which has

*n*(

*w*,

*c*) sites for WM

*w*and

*n*(

*b, c*) bases in background, is proportional to

*w*in configuration

*c*by

*S*

_{ w }and the set of background nucleotides as

*B*(

*c*) we obtain for the likelihood of the data given the configuration

*P*(

*S*

_{ w }) is given in complete analogy with (50) by

where *n*(*w*) is the total number of sites in group *S*_{
w
}and ${n}_{\alpha}^{k}$(*S*_{
w
}) is the number of times base *α* occurs at position *k* of the sites in *S*_{
w
}.

The posterior probability *P*(*c*|*D*) of a configuration is simply proportional to the product of (60) and (61), i.e. *P*(*c*|*D*) ∝ *P*(*D*|*c*)*P*(*c*|{*π*}). To sample from the posterior probability *P*(*c*|*D*) over all possible configurations we need a more extensive set of 'moves' then the one described in the Gibbs sampler above. This can be done in a number of ways [24]. One possibility is to pick a sequence at random, to remove all sites currently located in it, and to sample from all ways of putting a new set of sites in, see [25] for details. The set of moves implemented by the PhyloGibbs algorithm [23] is illustrated in Fig. 5. These moves are:

**1. Resampling a segment:** Pick a sequence *m* at random and a random position *i*_{
m
}in it. Check if there is a site overlapping the region from *i*_{
m
}+ 1 to *i*_{
m
}+ *l* in the current configuration *c*. If so, do nothing, i.e. move from *c* to *c*. If the region is free (or a site occurs precisely at *i*_{
m
}+ 1 through *i*_{
m
}+ *l*) calculate the probabilities *P*(*c'*|*D*) for all configurations *c'* that are obtained by putting a site for any of the WMs *w* at *i*_{
m
}, including putting no site at all or putting a site for a new motif. Finally, sample one of these configurations *c'* with probability proportional to *P*(*c'*|*D*).

**2. Moving a site:** Pick one of the sites occurring in *c* and remove it creating configuration *c*^{-}. Find all sequence segments *s* of length *l* in *c*^{-} that are not overlapping any site. Calculate the probability *P*(*c'*|*D*) for all configurations that can be obtained by placing a new site for the same WM at any of the free segments *s*. Sample one of these configurations *c'* in proportion to *P*(*c'*|*D*).

**3. Shifting a site group:** Pick one of the sets of sites *S*_{
w
}at random. Check how far the sites in *S*_{
w
}can be shifted to the left and right without colliding with other sites in the current configuration *c*. Denote these maximal shifts by *l*_{max} and *r*_{max}. For every shift *h* between *h* = *l*_{
max
}and *h* = *r*_{
max
}calculate the probability *P*(*c'*|*D*) of the configuration that would result if all sites in *S*_{
w
}were shifted by an amount *h*. Sample one of the configurations *c'* in proportion to probability *P*(*c'*|*D*).

One of the main advantages of the motif sampling approach over EM algorithms is that it is much less likely to get stuck in local optima. In particular, one can sample multiple motifs without becoming trapped in bad local optima. Another advantage is that one can obtain rigorous posterior probabilities for sites appearing at different positions which allows for a more reliable separation of trustworthy predictions from spurious ones (see [23]). As for the single motif, i.e. our discussion below equation (50), we can here also use 'informative' priors for each of the motifs. That is, if we have a set motifs for which known sites are available we can use the base counts in these sites as 'pseudocounts' ${\gamma}_{\alpha}^{k}$ of priors for corresponding motifs in the binding sites configurations. That is, apart from inferring multiple new motifs, we can use informative priors to discover new sites for known motifs at the same time. This can be especially useful when we are trying to find a new motif in a set of sequences that also contains sites for a number of known motifs. If we were to search this data for a single motif then it is quite likely that the search would return one of the known motifs. By searching for multiple motifs at the same time and using informative priors for each of the known motifs we can make sure that known sites will automatically associate with the known motifs, and that the remaining motifs are indeed new motifs. Finally, under the sampling approach one can use arbitrarily complicated priors *P*(*c*) on configurations, including priors that demand that certain combinations of sites occur at certain specified distances of each other, in particular orientations, etcetera. In the EM approach such complex priors would typically cause the dynamic programming solution to summing over all configurations to break down.

The main disadvantage of the motif sampling approach is of course speed. To obtain accurate statistics one needs to sample for a long time, and the time necessary grows with the product of the size of the data-set *D*, the total number of sites, and the number of motifs. In contrast, the dynamic programming approaches outlined in section "Finding WM matches" allow for efficient computation of sums over all possible configurations even for very large input data, allowing one to search very large sequences for matches to sets of WMs, which is computationally infeasible with motif sampling algorithms.

As mentioned already, motif sampling was introduced more than a decade ago [20]. Since then a significant number of algorithms has been developed including [21, 26–28], and probably many more. The PhyloGibbs algorithm [23] introduces several extensions such as simulated annealing to find the configuration with maximal probability, simultaneously sampling multiple motifs, and taking the phylogenetic relationships between the sequences into account (discussed below).

## Clustering sites and motifs

There are several situations in which we may want to cluster sets of binding sites. This demand for instance arises whenever we have obtained a set of sequence segments that are thought to each have regulatory function, without knowing the specific function of any of the segments. For example, several researchers have used so-called 'phylogenetic footprinting', the identification of short overly conserved segments in alignments of orthologous intergenic regions from related genomes, to gather large collections of putative regulatory sites [29–32]. It is reasonable to assume that most of these short segments contain a regulatory site for some regulatory factor, but we do not know which sites are sites for the same factor nor how many different regulatory factors are represented in data.

Formally, given a dataset *D* of sequence segments, we want to partition this dataset into subsets such that all segments within a subset contain a regulatory site for a common regulatory factor, and different subsets correspond to different regulatory factors. In addition, we want to multiply align all the segments within each subset. Thus, for this problem the set of hypotheses is all possible ways in which the set *D* can be partitioned into subsets, and all possible ways in which the sequences in each subset can be multiply aligned. Let us denote possible configurations by *C*. Each configuration *C* consists of a set of subsets *c* ∈ *C* that each consist of a collection of sequences from *D*. The union of these subsets *c* of course equals *D*. In addition *C* specifies, for each subset c, an alignment *S*_{
c
}of sequence segments that are taken from the sequences in *c*. For simplicity we will assume that all these sequence segments are of fixed length *l* in all subsets. That is, *C* specifies a partition of the sequences in *D* into subsets *c*, and it specifies where in each of the sequences the regulatory site of length *l* occurs, thereby specifying length-*l* alignments *S*_{
c
}for each subset *c*. We now want to calculate the probability *P*(*D*|*C*) of the data given a configuration *C*. We can generally separate *P*(*D*|*C*) into a contribution of the sites (those segments from the sequences in hypothesized regulatory sites) and the bases outside these segments that are scored according to a background model.

*P*(*D*|*C*)*= P*(*D*_{
sites
}|*C*)*P*(*D*_{
bg
}|*C*).

*P*(

*D*

_{bg}|

*C*) is constant, i.e. does not depend on

*C*and we just consider

*P*(

*D*

_{sites}|

*C*). This probability can be written as a product of independent contributions from each subset

where *S*_{
c
}is the alignment of sites in subset *c*. The probability *P*(*S*_{
c
}) is just the probability that all sequence segments in *S*_{
c
}derive from a common WM. The probability *P*(*S*_{
c
}) is simply given by replacing *S*_{
i
}with *S*_{
c
}in the right-hand side of equation (50).

we also need a prior *P*(*C*) over partitions. The simplest prior is of course to assign a uniform prior *P*(*C*) = constant. Note however that, a uniform prior over partitions may correspond to a very peaked prior with respect to the number of clusters. That is, given a dataset with 100 sequences there are astronomically more partitions of the data into, say, 30 subsets than there are partitions of the data into, say, 2 subsets. If one wants a uniform prior over the number of clusters one needs to assign a probability *P*(*C*) ∝ 1/${S}_{\left|C\right|}^{\left|D\right|}$, where |*D*| is the total number of sequences in *D*, |*C*| is the number of subsets in *C*, and ${S}_{\left|C\right|}^{\left|D\right|}$ is the number of possible partitions of |*D*| objects into |*C*| subsets, which is called a Stirling number of the second kind [33]. Note that with this prior a particular configuration *C* with, say, 2 subsets will have a much higher a priori probability than a configuration with, say, 30 subsets. That is, it is impossible to be a priori completely ignorant about partitions in general *and* about the number of subsets at the same time. Again there is no easy way to find the configuration *C* with maximal posterior probability. A fast procedure for determining a state *C* with high posterior probability is through hierarchical clustering. One starts out with each sequence in *D* forming a subset on its own. For every pair of sequences *s*, and *s'* one then calculates the probability of the configuration *C*(*s*, *s'*, *i*, *i'*) that is obtained when the subsets *s* and *s'* are joined into a cluster, putting the hypothesized sites at positions *i* and *i'* respectively. We then find the combination (*s*, *s'*, *i*, *i*) with maximal *P*(*C*(*s*, *s'*, *i*, *i'*)|*D*) and create the corresponding state *C*(*s*, *s'*, *i*, *i'*). This procedure is repeated, i.e. at each iteration two subsets are fused so as to maximize *P*(*C*|*D*). The iteration stops when there is no more subset merger that would increase *P*(*C*|*D*). The great disadvantage of this procedure is that it generally leads to highly suboptimal local optima in *P*(*C*|*D*).

- 1.
Select one of the sequences in

*D*at random and remove it from its current subset thereby creating a configuration*C*^{-}. - 2.
For each of the subsets in

*C*^{-}consider the configuration*C*(*c*,*i*) when the removed sequence*s*is put into subset*c*and the length-*l*site is*s*is started at position*i*. Also consider the configuration*C*(0) which is obtained by putting the sequence*s*in a subset of its own. Calculate*P*(*C*|*D*) for all these configurations and sample one of the configurations in proportion to these probabilities.

By repeating these two steps one can sample from the posterior distribution *P*(*C*|*D*) over all possible configurations. Through simulated annealing, i.e. sampling from *P*(*C*|*D*)^{
β
}and slowly increasing *β*, one can attempt to locate the configuration *C** which globally maximizes *P*(*C*|*D*). The PROCSE software [34] implements such a Markov chain Monte-Carlo scheme for simultaneously clustering and aligning sets of sequences that are thought to contain regulatory sites and it has been used to predict regulons in bacteria genome-wide. It has also been used to automatically curate sets of experimentally determined binding sites [35]. PROCSE first determines a 'reference configuration' *C** through simulated annealing and then performs another sampling run, i.e. with *β* = 1, to determine the posterior probabilities of the clusters that occur in the reference state *C**.

An almost identical procedure as just described can be used to cluster motifs or arbitrary combinations of motifs and sequences. Application of different motif finding algorithms to the same dataset, or application of the same algorithm to related datasets, often results in sets of inferred motifs that show clear commonalities. One is thus often interested in analyzing sets of motifs to identify which motifs are really different, and which motifs might represent a common underlying WM.

*w*, can be represented by counts ${n}_{\alpha}^{k}$ that represent the number of observations of base

*α*at position

*k*of the sites. So more generally, we will assume that when we are given 'a motif' this information can always be represented by a set of counts ${n}_{\alpha}^{k}$. For example, when we are given WM components ${w}_{\alpha}^{k}$ then we transform this into a set of counts ${n}_{\alpha}^{k}$ by specifying the pseudocounts ${\gamma}_{\alpha}^{k}$ of a prior, and the effective total number of observations

*n*on which the ${w}_{\alpha}^{k}$ are based:

Without loss of generality we can thus think of these counts ${n}_{\alpha}^{k}$ as deriving from an alignment *S* of sites for the motif. That is, we can generally specify our knowledge about a 'motif' by specifying an alignment *S* of sites drawn from the motif WM.

Such alignments *S* can be clustered and aligned with each other completely analogously to the procedure just described for single sequences. That is, one can think of an alignment *S* as a set of individual sequences *s* that have already been clustered and aligned with each other. When multiple such alignments *S* are mutually aligned and clustered into a larger alignment *S*_{
c
}then we calculate the probability *P*(*S*_{
c
}) that all sequences in *S*_{
c
}derive from a common WM exactly like we did before, i.e. equation (49). Thus, the only difference between clustering and aligning single sequences, and clustering and aligning 'motifs' is that motifs are represented by sets of multiply aligned sequences, and that these motif alignments are so to speak 'glued together' in that these sequences will never be repartitioned during the sampling. The PROCSE software also allows for such pre-clustered and pre-aligned sequences to be submitted as input. In this way arbitrary combinations of single sequences and motifs can be aligned and clustered simultaneously.

## Incorporating phylogeny

In all our approaches so far we have assumed that different sequences that contain binding sites can be considered independent samples from a WM *w*. In addition, the motif finding approaches that we discussed all presume that one is given sets of sequences that are likely to contain sites for common regulatory factors. In many cases researchers use independent biological evidence, such as expression data, to collect such sets of sequences that appear 'co-regulated' [27, 36]. Apart from expression data, more recently ChIP-on-chip techniques have been used to collect sets of sequences that appear to be bound by a common regulatory factor, see e.g. [37, 38].

Another possibility is to collect sets of orthologous intergenic regions from related species. It is often reasonable to assume that many of the regulatory sites occurring in the ancestor of these species have been maintained and are shared by all or most of the descendants. Therefore, orthologous intergenic sequences can generally be expected to contain sites for common regulatory factors. However, in contrast to sites in collections of upstream regions of genes from a single species, these sites cannot be considered *independent* samples from a common WM. That is, the orthologous sites are related evolutionary, and their sequences will therefore generally be more correlated than independent samples from a WM. Therefore, to correctly analyze orthologous intergenic regions we need to take the phylogenetic relationships of the species into account.

### Binding site evolution

Let us consider a single position in a regulatory site whose WM has components *w*_{
α
}at that position. We now want to calculate the probabilities *P*_{
αβ
}(*w*, *t*) that over an evolutionary time *t* this position in the site evolves from base *β* to base *α*.

There is a long history of such models for the evolution of amino acids, e.g. see [39, 40]. For our application to nucleotide evolution a general treatment of this problem was given by the model of Halpern and Bruno [41]. The rate *u*_{
αβ
}at which base *β* is substituted by base *α* during evolution is written as the product of an instantaneous rate of mutation *μ*_{
αβ
}from *β* to *α*, and the probability *f*_{
αβ
}that a mutation from *β* to *α* will be fixed in the population (which depends on selection), i.e.

*u*_{
αβ
}= *f*_{
αβ
}*μ*_{
αβ
}.

*P*(

*α*|

*β*,

*w*,

*t*) are the solution of the differential equations

*P*

_{ αβ }(

*w*,

*t*) become independent of time, i.e. memory of the start state is lost, and by the definition of the WM components the probabilities

*P*

_{ αβ }(

*w*,

*t*) limit to

*w*

_{ α }, i.e.

*μ*

_{ αβ }are given one can then solve [41] for the substitution rates

*u*

_{ αβ }that will lead to the limit distribution (69):

**U**through

with matrix **P**(*t*) having components *P*_{
αβ
}(*t*). Using the boundary condition *P*_{
αβ
}(0) = *δ*_{
αβ
}the solution is given by

*P*_{
αβ
}(*t*) = (*e*^{
U
t
})_{
αβ
}.

In this general model one thus solves for *P*_{
αβ
}(*w*, *t*) by first determining **U** using equation (70), and determining its eigenvalues and eigenvectors. However, note that in general the solution is a complicated function of the WM components *w*_{
α
}which is not easily amenable to further analysis.

*μ*

_{ αβ }=

*μ*, and that the probability of fixation

*f*

_{ αβ }depends only on the target base

*α*i.e.

*f*

_{ αβ }=

*w*

_{ α }. Under these assumption the differential equations become

This equation can be easily solved to give

*P*_{
αβ
}(*w*, *t*) = *δ*_{
αβ
}*e*^{-μt}+ *w*_{
α
}(1- *e*^{-μt}).

Note that *e*^{-μt}is the probability that no mutations have taken place during time *t*. We call this no-mutation-probability the *proximity* *q* = *e*^{-μt}between the ancestor and the descendant [23]. In terms of the proximity the solution becomes

*P*_{
αβ
}(*w*, *q*) = *δ*_{
αβ
}*q* + (1 - *q*)*w*_{
α
}.

This expression has a nice simple interpretation. With probability *q* no mutations have taken place in going from *β* to *α* and the bases are identical. With probability (1 - *q*) one or more mutations took place and the probability that one then ends up with base *α* is simply the WM component *w*_{
α
}.

### Probability of an orthologous set of bases

Assume that we have a set of orthologous intergenic regions and assume that we know the phylogenetic tree *T* that relates the species from which the regions derive. Consider now a set of orthologous bases *S* from these intergenic regions. That is, the bases in *S* have evolved from a common ancestor base in the common ancestor of the species according to the tree *T*. We now calculate the probability *P*(*S*|*T, w*) that, when evolving from a common ancestor under one of the evolutionary models just discussed, and according to the given phylogenetic tree *T*, the set of bases *S* will result at the leafs of the tree.

*S*only specifies the bases at the leafs of the tree

*T*, i.e. the bases at the internal nodes are unknown. If we also knew all the bases at the internal nodes we could calculate

*P*(

*S*|

*T*,

*w*) simply by multiplying the probabilities

*P*

_{ αβ }(

*w*,

*t*) for each branch, i.e.

*n*,

*s*

_{ n }is the base at node

*n*,

*a*(

*n*) is the ancestor of node

*n*, and

*t*

_{ n }is the length of the branch from

*a*(

*n*) to

*n*. This is illustrated in the left panel of Fig. 7.

*D*

_{ α }(

*n*,

*w*) the probability to observe all bases of

*S*that are descendants of node

*n*of the tree given that node

*n*has base

*α*. For nodes

*n*that are leafs, i.e. bases of

*S*, we of course have

*D*

_{ α }(

*n*,

*w*) =

*δ*

_{ αsn }. We can determine

*D*

_{ α }(

*n*,

*w*) for all nodes using the following recursion relation

*c*(

*n*) is the set of children of node

*n*, and

*t*

_{ m }is the length of the branch connecting

*m*to its parent

*n*. This basic recursion is illustrated in the middle panel of Fig. 7. Starting from the leafs we can use (78) to calculate

*D*

_{ α }(

*n*,

*w*) for all nodes up to the root of the tree. Finally the probability

*P*(

*S*|

*T, w*) for the whole tree is obtained by summing over the bases of the root node

*r*, noting that the prior probability that root

*r*has base

*α*is

*w*

_{ α }. This gives

In complete analogy we can calculate the probability *P*(*S*|*T*, *b*) of the column of bases *S* assuming that they evolved under a background model *b*. which is given by background probabilities *b*_{
α
}. To obtain *P*(*S*|*T, b*) we just replace *P*_{
αβ
}(*w*, *t*) with *P*_{
αβ
}(*b, t*) for each branch of the tree in equation (78) and replace *w*_{
α
}with *b*_{
α
}in (79). Finally, we can also easily accommodate cases in which the regulatory site has been maintained in some but not all species. That is, we can have some branches of the tree *T* evolve according to the background model *b* whereas other branches evolve according to the WM column *w*, simply by using *P*_{
αβ
}(*w*, *t*) for each branch evolving according to the WM, and using *P*_{
αβ
}(*b*, *t*) for each branch evolving according to the background. An example of such a more complicated 'selection pattern' is shown in the right panel of Fig. 7.

### Finding sites and modules in multiple alignments

To apply the probabilities *P*(*S*|*T*, *w*) and *P*(*S*|*T*, *b*) to a set of orthologous intergenic regions we of course first have to identify which sets of bases in these sequences form orthologous groups. That is, we have to produce a multiple alignment of the orthologous intergenic regions. Given a multiple alignment we can then assume that every column of the alignment corresponds to a set of orthologous bases. The problem of producing accurate multiple alignments of non-coding sequences is extremely challenging and is beyond the scope of this article. There are now a number of algorithms available that focus specifically on alignment of non-coding DNA [44–46], although our personal experience is that consistency based methods [47, 48] and evolutionary explicit progressive alignment [49] often outperform these methods significantly. From this point on we will assume that a global multiple alignment of the orthologous intergenic regions is given and that we can assume that vertically aligned bases in this alignment are orthologous.

We can use the probabilities *P*(*S*|*T*, *w*) and *P*(*S*|*T*, *b*) that we derived above to extend the formalism of sections "Finding WM matches" and "Finding clusters of binding sites: regulatory modules" to multiple alignments. The simplest way of doing this is to take one of the sequences in the multiple alignment as a *reference* sequence and to consider all binding site configurations for this reference sequence. This is often natural since in many cases we are really only interested in finding regulatory sites in one particular species and it is thus natural to take this species as a reference.

Let *s*_{[i,l]}denote a segment of length *l* in this reference sequence, and let *S*_{[i,l]}denote the corresponding block in the multiple alignment. To calculate the probability that a regulatory site occurs at *s*_{[i,l]}we will now calculate the probabilities of observing the alignment segment *S*_{[i,l]}under different assumptions for the selection that was operating at each branch of the tree *T* relating the species in the alignment. The simplest assumptions about the selection are that either all sequences in *S*_{[i,l]}evolved according to the background model, i.e. using expression *P*(*S*|*T*, *b*) for each column *S* in *S*_{[i,l]}, or that all sequences evolved according to WM *w*, i.e. using *P*(*S*|*T*, *w*) for each column *S* in *S*_{[i,l]}. Many algorithms [23, 50, 51] in fact restrict themselves to these two possibilities. However, there are many other possibilities. If there are *B* branches in the tree then there are in principle 2^{
B
}possible ways of assigning selection to the branches, i.e. either WM *w* or background *b* for each branch. Formally, to calculate the probability that a regulatory site occurs at *s*_{[i,l]}we would want to consider all 2^{B - 1}'selection patterns' *σ* for which *s*_{[i,l]}is under selection of the WM *w*. We would want to assign prior probabilities *P*(*σ*) to all 2^{
B
}possible selection patterns *σ*, and calculate the probabilities *P*(*S*_{[i,l]}|*T*, *σ*) for each. Finally, by summing *P*(*S*_{[i,l]}|*T*,*σ*)*P*(*σ*) over all selection patterns for which *s*_{[i,l]}is under selection of the WM *w* one would obtain the total probability of the data *S*_{[i,l]}under the assumption that a regulatory site occurs at *s*[*i*,*l*]. Unfortunately, there is no simple way of determining a reasonable distribution *P*(*σ*) and the sum would generally involve a large number of terms. This author is not aware of any algorithm that currently implements this general scheme.

*σ*

_{*}is chosen that best fits the alignment and the sequences in it. Note first that, since WMs have a fixed width, a site in the reference species can only occur in another species if the corresponding segment in that species is gaplessly aligned with the site in the reference species. Therefore, we first check which of the other sequences in

*S*

_{[i,l]}are gaplessly aligned with the reference sequence and which are not. For those sequences not gaplessly aligned with the reference we assign the background evolution model to the branches leading to these sequences. For each of the other sequences

*s*in

*S*

_{[i,l]}we calculate the probability

*P*(

*s*|

*w*) of the sequence under the WM

*w*, and the probability

*P*(

*s*|

*b*) of the sequence under the background model

*b*. Whenever

*P*(

*s*|

*w*) >

*P*(

*s*|

*b*) we assign the WM model to the branch leading to

*s*, and for all others we assign the background model. Finally, we assume that an internal node evolved according to the WM if any of its descendants do. This defines a unique selection pattern

*σ*for

*S*

_{[i,l]}and we calculate

*P*(

*S*|

*T*,

*w*) using this selection pattern. The procedure is illustrated in Fig. 8.

*P*(

*S*

_{[i,l]}|

*T*,

*b*) assuming all branches evolved according to background. Finally, if we assign a prior probability

*π*that a site occurs at

*s*

_{[i,l]}the posterior probability

*P*(site|

*S*

_{[i,l]}) that the reference species has a functional site at

*i*becomes

This is essentially the expression used by the MotEvo algorithm [35] to find regulatory sites. The MONKEY algorithm finds regulatory sites in a very similar manner. Instead of the simple evolutionary model (76) MONKEY uses the more general Halpern/Bruno model (70). However, MONKEY does not consider the possibility that the site is conserved in some but not all of the aligned species, i.e. it assumes that either all branches of the tree evolve according to the WM, or all branches evolve according to background.

*P*(

*s*

_{[i,l]}|

*w*) of a single sequence segment under the WM we instead calculate the probability

*P*(

*S*

_{[i,l]}|

*w*) of the ungapped alignment block at that location using the procedure just outlined. That is, for every segment

*s*

_{[i,l]}we find which other sequences are ungapped at the segment and choose which of these are evolving according to the WM based on the probabilities of the individual sequence segments under the WM. The generalization of equation (20) is then simply

Note that position *n* here always refers to the *n* th base in the reference sequence.

Finally, using this formalism we can of course also search for regulatory modules in multiple alignments in complete analogy with the equations in section "Finding clusters of binding sites: regulatory modules".

This procedure has been implemented for two-species alignments in the Stubb algorithm [42]. Applying the Stubb algorithm to predict developmental regulatory modules in *Drosophila* it was shown in [52] that using two-species alignments improves predictions of the locations of regulatory modules over the single species algorithms.

### Motif finding incorporating phylogeny

In section "Motif finding" we discussed two approaches to motif finding, one based on maximizing the probability *P*(*D*|*w*) of the data given a WM *w* using expectation maximization, and one using Markov chain Monte-Carlo sampling to find the site configuration *c* that maximizes the posterior *P*(*c*|*D*). These methods can also be extended in a straightforward way to multiple alignments and we now discuss these in turn.

#### Motif EM incorporating phylogeny

The PhyME algorithm implements an extension of the MEME algorithm to multiple alignments of orthologous intergenic sequences from related species. It uses a reference species and considers all configurations of binding sites that can be assigned to the reference species in the same way as discussed in the previous section, i.e. it uses equation (81) to calculate the overall likelihood *P*(*D*|*w*) of the alignment given the WM *w*. The evolutionary model that is used by PhyME to score ungapped alignment blocks *P*(*S*_{[i,l]}|*w*) is precisely the simplified model of equation (76). However, like MONKEY and in contrast to MotEvo, PhyME assumes that either all branches in the tree evolved according to the WM model, or that all evolved according to background.

*P*(

*D*|

*w*) with respect to the WM PhyMe needs to solve, for each column

*k*in the WM, the equations

Note that for the single sequence case, the derivative of *P*(*s*|*w*) with respect to the WM components ${w}_{\alpha}^{k}$, was very simple, i.e. see (43). In contrast,the derivative *dP*(*S*_{[i,l]}*T*, *w*)/*d*${w}_{\alpha}^{k}$ is a much more complicated function of the WM components ${w}_{\alpha}^{k}$ which needs to be calculated recursively just as *P*(*S*_{[i,l]}|*T*, *w*) itself. Here it becomes particularly advantageous that in the simplified model (76) the probability *P*_{
αβ
}(*w*,*t*) is such a simple function of the WM components. We do not discuss the mathematical details of solving (82) here except for mentioning the fact that it involves an iterative procedure similar to EM that leads to a local optimum in *P*(*D*|*w*).

#### Motif sampling incorporating phylogeny

We now discuss extending the motif sampling approach of section "Motif sampling" to alignments of phylogenetically related sequences. Remember that in the motif sampling approach, instead of summing over all possible binding site configurations to calculate the probability *P*(*D*|*w*) conditioned on the WM, we condition on a particular binding site configuration *c* and calculate the probability *P*(*D*|*c*) by integrating over all possible WMs *w*.

*c*of binding sites that can be assigned to the input data

*D*, and calculate the probability of the data

*P*(

*D*|

*c*) for each possible configuration. Whereas for single sequences the space of all possible configurations existed simply of all ways in which sets of non-overlapping windows can be assigned to the sequences, i.e. see Fig. 2, for multiple alignments the situation is a bit more complicated and illustrated in Fig. 9.

Above we assumed that a reference sequence *s* is given for each multiple alignment and that the set of binding ste configurations for the alignment is simply the set of all binding site configurations for the reference species. In the PhyloGibbs algorithm [23] there is no reference sequence and each sequence in the multiple alignment is treated the same. A site can be hypothesized to occur at any position of any of the sequences. By definition the algorithm assumes that, whenever a site occurs in one species, it will also occur in all other species that are gaplessly aligned with it at that location. That is, sites are automatically extended to all species that are mutually gaplessly aligned at that position, see Fig. 9. The algorithm makes sure to only allow configurations in which none of the sites overlap.

*P*(

*D*|

*c*) for every possible such configuration

*c*. This probability

*P*(

*D*|

*c*) is given by an equation essentially identical to equation (61). However, instead of single background bases

*σ*with probability

*b*

_{ σ }we will now have alignment columns

*S*with probability

*P*(

*S*|

*T*,

*b*) as calculated in section "Probability of an orthologous set of bases". The set of sequences

*S*

_{ w }assigned to a WM

*w*will now generally consist of several ungapped segments from the multiple alignments, i.e. alignment blocks, and possibly some single sequences as well, see Fig. 9. The probability

*P*(

*S*

_{ w }) will again be an integral over all possible WMs but the integrand in this case will be considerably more complicated. For simplicity let's focus on a single column from the set

*S*

_{ w }of sequence segments and alignment blocks. For simplicity assume that this column from

*S*

_{ w }contains two independent columns

*S*, and $\tilde{S}$ from the multiple alignments, see Fig. 9. The probability

*P*(

*S*

_{ w }) would then be formally given by

*T*is the phylogenetic tree of alignment column

*S*, $\tilde{T}$ the phylogenetic tree of alignment column $\tilde{S}$, and the expressions

*P*(

*S*|

*T*,

*w*) and

*P*($\tilde{S}$|$\tilde{T}$,

*w*) are given as in equations (78) and (79). To calculate the integral notice that, formally, the expression

*P*(

*S*|

*T*,

*w*) is a polynomial in the WM components of the following form

*c*

_{ k }depend on the branch lengths in the tree and the ${m}_{\alpha}^{k}$ are sets of integers. The expression

*P*($\tilde{S}$|$\tilde{T}$,

*w*) can of course also be written in this form. Denote its prefactors ${\tilde{c}}_{\tilde{k}}$, and its exponents ${\tilde{m}}_{\alpha}^{k}$. Using this the integral can be rewritten as

*P*(

*S*

_{ w }) in this way. However, the number of terms in the above sum grows exponentially both with the number of sequences in each alignment and, more importantly, with the number of alignments under the integral. That is, if the configuration

*c*contains 10 multiple alignment segments for WM

*w*, then even if there were only 10 terms for each alignment column

*P*(

*S*|

*T*,

*w*), there would still be 10

^{10}terms in total. In practice we thus have to resort to approximations of the above integral. The approach that is taken in the PhyloGibbs algorithm is to approximate the expression

*P*(

*S*|

*T*,

*w*) with a monomial for each alignment column, i.e.

*x*

_{ α }may be non-integer. The prefactor

*c*and the exponents

*x*

_{ α }are set such that the first moments of the approximation match those of

*P*(

*S*|

*T*,

*w*). That is, we demand that

*β*. As shown in [23] this fixes

*c*and the relative sizes of the

*x*

_{ α }but leaves ∑

_{ α }

*x*

_{ α }still free. The absolute magnitude of the

*x*

_{ α }we set so as to approximate the second moments, i.e. such that

*β*and

*γ*. With these approximations the integral for

*P*(

*S*

_{ w }) becomes simply

where *x* = ∑_{
α
}*x*_{
α
}and the variables with a tilde are those of the approximation to *P*($\tilde{S}$|$\tilde{T}$, *w*). The crucial point of this approximation procedure is that, at the start of the algorithm, we can determine these approximations, i.e. the values of the *x*_{
α
}, for every multiple alignment column *S* that occurs in the input data once and store the results. We thus replace the complex expression *P*(*S*|*T*, *w*) with the simple expression $c{\displaystyle {\prod}_{\alpha}{({w}_{\alpha})}^{{x}_{\alpha}}}$ for each alignment column *S*. After that, when we are sampling different configurations, the expression *P*(*S*_{
w
}) can be as efficiently calculated as for single sequences. That is, we can simply use equation (62), where ${n}_{\alpha}^{k}$(*S*_{
w
}) is now the sum over the *x*_{
α
}of all the alignment segments that occur in *S*_{
w
}.

For the prior over configurations *P*(*c*) PhyloGibbs uses the same priors (60) as for configurations over single sequences. PhyloGibbs uses Markov chain Monte-Carlo sampling to sample the space of all binding site configurations. The move-set employed when sampling binding site configurations in multiple alignments is essentially the same as the move-set for binding site configurations in single sequences illustrated in Fig. 5. The only difference is that 'sites' now typically extend over multiple aligned sequences, as illustrated in Fig. 9. Simulated annealing is used to find a configuration *c*_{*} that maximizes the posterior probability *P*(*c*|*D*). Finally, a further sampling run is used to calculate the posterior probabilities of the sites in configuration *c*_{*}. PhyloGibbs reports both the configuration *c*_{*} and the inferred WMs of the motifs in *c*_{*}, as well as posterior probabilities for all sites occurring in *c*_{*}. In [23] we demonstrate the performance of PhyloGibbs on synthetic data, on individual multiple alignments of orthologous intergenic regions from yeast, and on sets of multiple alignments of intergenic regions from yeast that are bound by a common regulatory factor [38]. These tests show that taking phylogeny into account significantly improves the performance in motif finding.

Finally, it is important to distinguish the motif finding methods that rigorously incorporate phylogeny by probabilistically modeling the evolution of binding sites, such as the PhyME and PhyloGibbs algorithms just discussed, from more *ad hoc* algorithms that use comparative genomic information in various ways in motif finding. This includes for example methods that simply identify significantly conserved sequence segments in multiple alignments, [30–32]. These conserved segments can then be post-processed to search for over-represented motifs. In other approaches, e.g. [29, 53], orthologous upstream regions are searched in the same way as set of upstream regions of co-regulated genes from a single species would be searched, i.e. ignoring the evolutionary relationships between the sequences. In other algorithms [54, 55] one only takes the topology of the phylogenetic tree into account and searches for length-*l* segments that occur in all orthologous sequences, such that the minimal number of mutations necessary to relate the length-*l* segments, i.e. the parsimony score, is under some prespecified cut-off. Another approach is to first search for significantly conserved segments in orthologous intergenic regions, and to then multiply align conserved segments from the upstream regions of co-regulated genes. This approach is taken by the PhyloCon algorithm [56] which, in spite of its name, ignores the phylogenetic relations between the species.

The biggest challenge for incorporating comparative genomic information in motif finding that is currently outstanding is the treatment of the multiple alignment. It is clear that errors in the multiple alignment can have very deleterious effects on the performance of algorithms such as PhyME, PhyloGibbs, and MotEvo. Ideally one would simultaneously search the space of all multiple alignments and all binding site configurations. However, this space is very large and it is currently unclear if and how it can be effectively searched, especially for large data-sets.

## Declarations

### Acknowledgements

EvN thanks Ionas Erb, Nacho Molina, and Mikhail Pachkov for useful comments.

This article has been published as part of *BMC Bioinformatics* Volume 8 Supplement 6, 2007: Otto Warburg International Summer School and Workshop on Networks and Regulation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S6

## Authors’ Affiliations

## References

- Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987, 193: 723-750. 10.1016/0022-2836(87)90354-8.View ArticlePubMedGoogle Scholar
- Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P: High-throughput SELEX-SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol. 2002, 20: 831-835.View ArticlePubMedGoogle Scholar
- Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an approximation is it?. Nucl acids res. 2002, 30 (20): 4442-4451. 10.1093/nar/gkf578.PubMed CentralView ArticlePubMedGoogle Scholar
- Djordjevic M, Sengupta AM, Shraiman BI: A Biophysical approach to Transcription Factor Binding Site Discovery. Genome Research. 2003, 13: 2381-2390. 10.1101/gr.1271603.PubMed CentralView ArticlePubMedGoogle Scholar
- Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, Phillips R: Transcriptional regulation by the numbers: models. Curr Opin Genet Dev. 2005, 15 (2): 116-124. 10.1016/j.gde.2005.02.007.PubMed CentralView ArticlePubMedGoogle Scholar
- Jaynes ET: Probability Theory: The Logic of Science. 2003, Cambridge University PressView ArticleGoogle Scholar
- Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in protein-DNA binding sites. RECOMB. 2003, 28-37.Google Scholar
- Rabiner LR: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.View ArticleGoogle Scholar
- Durbin R, Eddy S, Krogh G, Mitchison G: Biological Sequence Analysis. 1998, Cambridge University PressView ArticleGoogle Scholar
- Davidson EH: Genomic regulatory systems. 2001, San Diego: Academic PressGoogle Scholar
- Rivera-Pomar R, Jackle H: From gradients to stripes in Drosophila embryogenesis: filling in the gaps. Trends Genet. 1996, 12 (11): 478-483. 10.1016/0168-9525(96)10044-5.View ArticlePubMedGoogle Scholar
- Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics. 2001, 17 (10): 878-889. 10.1093/bioinformatics/17.10.878.View ArticlePubMedGoogle Scholar
- Berman BP, Nibu Y, Pfeifferdagger BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 2002, 99: 757-762. 10.1073/pnas.231608898.PubMed CentralView ArticlePubMedGoogle Scholar
- Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. 2002, 3 (30):Google Scholar
- Zavolan M, Rajewsky N, Socci ND, Gaasterland T: SMASHing regulatory sites in DNA by human-mouse sequence comparisons. Proc IEEE Conf on Comp Sys Bioinf. 2003Google Scholar
- Eisen MB: All motifs are NOT created equal: structural properties of transcription factor-DNA interactions and the inference of sequence specificity. Genome Biol. 2005, 6 (5): P7-10.1186/gb-2005-6-5-p7.View ArticleGoogle Scholar
- Bailey T, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for MolecularBiology. 1994, 2: 28-36.Google Scholar
- Liu XS, Brutlag DL, Liu JS: algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation experiments. Nat Biotechnol. 2002, 20: 835-839.View ArticlePubMedGoogle Scholar
- Liu JS: Monte Carlo Strategies in Scientific Computing. 2001, Springer-VerlagGoogle Scholar
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.View ArticlePubMedGoogle Scholar
- Thompson W, Rouchka EC, Lawrence CE: Gibbs Recursive Sampler: finding transcription factor binding sites. Nucl Acids res. 2003, 31 (13): 3580-3585. 10.1093/nar/gkg608.PubMed CentralView ArticlePubMedGoogle Scholar
- Kirkpatrick S, Jr CDG, Vecchi MP: Optimization by Simulated Annealing. Science. 1983, 220 (4598): 671-680. 10.1126/science.220.4598.671.View ArticlePubMedGoogle Scholar
- Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny. PLoS Comput Biol. 2005, 1 (7): e67-10.1371/journal.pcbi.0010067.PubMed CentralView ArticlePubMedGoogle Scholar
- Frenkel D, Smit B: Understanding Molecular Simulation: From Algorithms to Applications. 1996, Academic PressGoogle Scholar
- Liu JS, Neuwald AF, Lawrence CE: Markovian structures in biological sequence alignment. Journal of the American Statistical Association. 1999, 1-15. 10.2307/2669673.Google Scholar
- Roth FP, Hughes JD, Estep PW, Church CM: Finding DNA-regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945. 10.1038/nbt1098-939.View ArticlePubMedGoogle Scholar
- Liu X, Liu JS, Brutlag DL: Bioprospector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001, 127-138.Google Scholar
- Thijs G, Lescot M, Marchal K, Rombauts S, Moor BD, Rouzé P, Moreau Y: A higher order background model improves the detection of regulatory elements by Gibbs Sampling. Bioinformatics. 2001, 17 (12): 1113-1122. 10.1093/bioinformatics/17.12.1113.View ArticlePubMedGoogle Scholar
- McCue LA, Thompson W, Carmack CS, Ryan MP, Liu JS, Derbyshire V, Lawrence CE: Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl Acids Res. 2001, 29 (3): 774-782. 10.1093/nar/29.3.774.PubMed CentralView ArticlePubMedGoogle Scholar
- Rajewsky N, Socci ND, Zapotocky M, Siggia ED: The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons. Genome Res. 2002, 12: 298-308. 10.1101/gr.207502. Article published online before print in January 2002.PubMed CentralView ArticlePubMedGoogle Scholar
- Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003, 301: 71-76. 10.1126/science.1084337.View ArticlePubMedGoogle Scholar
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.View ArticlePubMedGoogle Scholar
- Abramowitz M, Stegun IA, Eds: Handbook of Mathematical Functions. With Formulas. Graphs, and Mathematical Tables. 1974, Dover PubnsGoogle Scholar
- van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA. 2002, 99: 7323-7328. 10.1073/pnas.112690399.PubMed CentralView ArticlePubMedGoogle Scholar
- Erb I, van Nimwegen E: Statistical Features of yeast's transcriptional regulatory code. IEE Proceedings Systems Biology ICCSB. 2006Google Scholar
- Hughes JD, Estep PW, Tavazoie S, Church CM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.View ArticlePubMedGoogle Scholar
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Jospeh Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tange JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcription regulatory neworks in Saccharomyces cerivisiae. Science. 2002, 799-804. 10.1126/science.1075090.Google Scholar
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, DK DKP, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.PubMed CentralView ArticlePubMedGoogle Scholar
- Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas of protein sequence and structure. 1978, 5: 345-352.Google Scholar
- Müller T, Spang P, Vingron M: Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method. Mol Biol Evol. 2002, 19: 8-13.View ArticlePubMedGoogle Scholar
- Halpern AL, Bruno WJ: Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998, 15 (7): 910-917.View ArticlePubMedGoogle Scholar
- Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics. 2003, 19 (suppl 1): i292-i301. 10.1093/bioinformatics/btg1040.View ArticlePubMedGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzogloue S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13 (4): 721-731. 10.1101/gr.926603.PubMed CentralView ArticlePubMedGoogle Scholar
- Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA. 1996, 93: 12098-12103. 10.1073/pnas.93.22.12098.PubMed CentralView ArticlePubMedGoogle Scholar
- Bray N, Pachter L: MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res. 2004, 14: 693-699. 10.1101/gr.1960404.PubMed CentralView ArticlePubMedGoogle Scholar
- Do C, Mahabhashyam M, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research. 2005, 15: 330-340. 10.1101/gr.2821705.PubMed CentralView ArticlePubMedGoogle Scholar
- Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.View ArticlePubMedGoogle Scholar
- Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 2005, 102: 10557-10562. 10.1073/pnas.0409137102.PubMed CentralView ArticlePubMedGoogle Scholar
- Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB: MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004, 5: R98-10.1186/gb-2004-5-12-r98.PubMed CentralView ArticlePubMedGoogle Scholar
- Sinha S, Blanchette M, Tompa M: PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004, 5: 170-10.1186/1471-2105-5-170.PubMed CentralView ArticlePubMedGoogle Scholar
- Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED: Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics. 2004, 5: 129-10.1186/1471-2105-5-129.PubMed CentralView ArticlePubMedGoogle Scholar
- McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 2002, 12: 1523-1532. 10.1101/gr.323602.PubMed CentralView ArticlePubMedGoogle Scholar
- Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol. 2002, 9 (2): 211-223. 10.1089/10665270252935421.View ArticlePubMedGoogle Scholar
- Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 2002, 12 (5): 739-748. 10.1101/gr.6902.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang T, Stormo G: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003, 19 (18): 2369-2380. 10.1093/bioinformatics/btg329.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.