Finding regulatory elements and regulatory motifs: a general probabilistic framework

van Nimwegen, Erik

doi:10.1186/1471-2105-8-S6-S4

Volume 8 Supplement 6

Otto Warburg International Summer School and Workshop on Networks and Regulation

Review
Open access
Published: 27 September 2007

Finding regulatory elements and regulatory motifs: a general probabilistic framework

Erik van Nimwegen¹

BMC Bioinformatics volume 8, Article number: S4 (2007) Cite this article

9056 Accesses
31 Citations
Metrics details

Abstract

Over the last two decades a large number of algorithms has been developed for regulatory motif finding. Here we show how many of these algorithms, especially those that model binding specificities of regulatory factors with position specific weight matrices (WMs), naturally arise within a general Bayesian probabilistic framework. We discuss how WMs are constructed from sets of regulatory sites, how sites for a given WM can be discovered by scanning of large sequences, how to cluster WMs, and more generally how to cluster large sets of sites from different WMs into clusters. We discuss how 'regulatory modules', clusters of sites for subsets of WMs, can be found in large intergenic sequences, and we discuss different methods for ab initio motif finding, including expectation maximization (EM) algorithms, and motif sampling algorithms. Finally, we extensively discuss how module finding methods and ab initio motif finding methods can be extended to take phylogenetic relations between the input sequences into account, i.e. we show how motif finding and phylogenetic footprinting can be integrated in a rigorous probabilistic framework. The article is intended for readers with a solid background in applied mathematics, and preferably with some knowledge of general Bayesian probabilistic methods. The main purpose of the article is to elucidate that all these methods are not a disconnected set of individual algorithmic recipes, but that they are just different facets of a single integrated probabilistic theory.

The weight matrix representation of regulatory sites

The first step in any algorithm for identifying regulatory sites in DNA or RNA is to decide on a mathematical representation of the binding sites. For definiteness, let us assume we are considering a DNA binding factor which, when bound to DNA, covers a DNA segment of l base pairs long. For any length-l sequence s there will be a well-defined (but generally unknown) binding free-energy E(s) to the regulatory factor. A key assumption [1] that is introduced at this point is that the energy E(s) can be written as the sum of independent contributions E_i(s_i) from each of the bases s_iin segment s, i.e.

E (s) = \sum_{i = 1}^{l} E_{i} (s_{i}) .

(1)

This assumption of course generally only holds to some extent. Large-scale in vitro studies have shown that the binding energies can deviate from this simple additivity assumption [2]. However, these deviations are typically small, and moreover they seem generally restricted to segments with low binding-energy [3]. At this point it is not yet clear to what extent and for what fraction of regulatory factors, the additivity assumption holds. Some researchers believe that, at least for some factors, functional binding sites deviate significantly from this assumption, and this may well be the case. However, it is this author's experience that in collections of experimentally determined binding sites there is little evidence of correlations between the nucleotides occurring at different positions which, as we will see below, supports the additivity assumption for functional binding sites.

The crucial assumption which underlies the whole idea of 'finding regulatory sites' is that the set of all 4^lpossible segments s can be meaningfully divided into 'binding sites' and all other sequences. Since this is not a priori clear at all, it is good to consider what this assumption entails. At a given concentration c of the regulatory factor, the probability that a sequence segment s will be bound by the factor is given by an expression of the following form [4, 5]

P_{bound} (s) = \frac{c e^{β E (s)}}{c e^{β E (s)} + K},

(2)

where β = 1/(kT) is the inverse temperature, and K is a constant. (We here ignore the fact that the factor may bind at segments that overlap s, which would prevent the factor from binding at s. Below we will derive the general solution that takes this complication into account.) The expression (2) is an s-shaped function that goes from 0 to 1 as ce^βE(s)goes from much smaller than K to much larger than K. Therefore, at a given concentration c one can naturally separate sequences s into binders, i.e. those with E(s) > log(K/c)/β and non-binders with E(s) < log(K/c)/β. If the concentration of the (active) regulatory factor were to vary continuously between different cellular states, then the set of sites bound by the factor would also vary continuously and it would not make much sense to divide segments s into binders and non-binders. However, if in physiological conditions the regulatory factor primarily switches between an 'off' state, i.e. low concentration c_off, and an 'on' state, i.e. high concentration c_on than there would be a well defined set of sites that are bound when the factor is 'on' and unbound when the factor is 'off', i.e. those with energies in the range

\frac{1}{β} \log (\frac{K}{c_{on}}) < E (s) < \frac{1}{β} \log (\frac{K}{c_{off}}) .

(3)

Therefore, this set of binding sites may be characterized by a typical energy that lies somewhere in the middle of this range.The assumption that is thus generally made [1] is that binding sites are characterized by an average binding energy $\bar{E}$ . We now want to derive the probability P(s) that a randomly chosen binding site will have sequence s, given only the constraint that the average energy of the sites is $\bar{E}$ . The maximum entropy formalism [6], i.e. as applied in statistical mechanics, prescribes that distribution P(s) is given by

P (s) = \frac{e^{λ E (s)}}{\sum_{s^{'}} e^{λ E (s^{'})}} = \prod_{i = 1}^{l} [\frac{e^{λ E_{i} (s_{i})}}{\sum_{α} e^{λ E_{i} (α)}}],

(4)

where the sum over s' is over all length-l sequences, and the sum over α is over the four bases. The Langrangian multiplier λ is chosen such that <E> = ∑_sE(s) P(s) = $\bar{E}$ . Note that this is the same functional form as the well-known Boltzmann distribution. To avoid confusion, note also that equations (2) and (4) are probability distributions over entirely different spaces. The former takes a fixed sequence segment s and compares the probabilities of the bound and unbound states for this sequence segment, whereas the latter assigns the probabilities that a binding site will take on any of the 4^lpossible sequences. In equation (4) the bases at different positions are independent, i.e. $P (s) = \prod_{i = 1}^{l} P_{i} (s_{i})$ with

P_{i} (s_{i}) = \frac{e^{λ E_{i} (s_{i})}}{\sum_{α} e^{λ E_{i} (α)}} .

(5)

This property allows us to define a position specific weight matrix (WM) w with components

w_{α}^{i} = \frac{w^{λ E_{i} (α)}}{\sum_{α^{'}} e^{λ E_{i} (α^{'})}} .

(6)

That is, we can represent regulatory sites by WMs, and find the following expression for the probability that a binding site has sequence s:

P (s | w) = \prod_{i = 1}^{l} w_{s_{i}}^{i} .

(7)

Finally, note that P(s|w) gives the probability that a given binding site will have sequence s, which should be carefully distinguished from the probability P(w|s) that a sequence segment s is a binding site for w. The latter cannot be calculated without specifying how likely s is to arise under alternative hypotheses as will be discussed in detail below.

Weight matrices are probably the most commonly used representation of regulatory sites and, as has just been shown, can be derived under the assumptions that the contribution to the binding energy from bases at different positions in the site are independent, and that functional binding sites are characterized by a given average binding energy. In this chapter we will focus on regulatory motif finding methods that use WMs. It should be noted, however, that in some circumstances regulatory sites can be adequately represented by either specific DNA words, i.e. when the regulatory factor recognizes essentially only a single sequence segment, or by regular expressions, and there is a substantial amount of work on motif finding in this context. There is also a moderate amount of work on more complex representations of regulatory sites, such as hidden Markov models that allow sites of varying length and correlations between bases at neighboring positions [2, 7].

Finding WM matches

Assume that we are in possession of a WM w that summarizes the binding specificity of a regulatory factor. One of the simplest applications is to 'scan' one or more sequences for 'matches' to this WM. Let s denote some sequence of length L, where L is typically much larger than the length l of the WM. We now want to infer if one or more sites for this WM occur in this sequence. Probabilistic inference always [6] takes the following general form

1.
Enumerate all possible hypotheses H that could have accounted for the data D.
2.
Assign prior probabilities P(H) to each of these hypotheses.
3.
Define a likelihood model that gives the probability P(D|H) of producing the entire data D under each of the hypotheses H.
4.
The posterior probability P(H|D)for each of the hypotheses is then given by Bayes' theorem:
$P (H | D) = \frac{P (D | H) P (H)}{\sum_{\tilde{H}} P (D | \tilde{H}) P (\tilde{H})} .$
(8)

For example, assume that we have prior information that precisely one site for WM w occurs in s and that the other bases in s were drawn from a background model b. For simplicity we will assume that under this background model b, each letter has a probability b_αto be base α. In this situation all possible hypotheses are simply all possible locations i at which the binding site might start. If we have no information to suggest that the site is more likely to occur at some places than others we use an uniform prior P(i) = constant. The likelihood P(D|i) of the data, i.e. sequence s, given the corresponding hypothesis is given by the product of probabilities that the bases from 1 up to i derive from the background model, that the segment from i + 1 through i + l derives from the WM w, and that bases i + l + 1 through L again derive from the background model.

The probability P(D|i), as illustrated in Fig. 1, is given by

P(D|i) = P(s_[0,i]|b)P(s_[i,l]|w)P(s_[i+l,L-i-l]|b),

where s_[i,l]= s_i+1s_i+2...s_i+lis the length-l segment in s starting after position i with

P (s_{[i, l]} | w) = \prod_{k = 1}^{l} w_{s_{i + k}}^{k},

(10)

and the background probabilities are given by

\begin{array}{r} P (s_{[0, i]} | b) = \prod_{k = 1}^{i} b_{s_{k}} \\ P (s_{[i + l, L - i - l]} | b) = \prod_{k = i + l + 1}^{L} b_{s_{k}} . \end{array}

(11)

With the uniform prior, the posterior probability P(i|D) that the site occurs at i is

P (i | D = \frac{P (D | i)}{\sum_{j = 0}^{L - l} P (D | j)} .

(12)

In general we of course do not know that there is precisely one site in s. Therefore, we generally want to consider the extended set of hypotheses that consists of all possible configurations of binding sites that can be assigned to sequence s. Figure 2 shows a possible configuration containing 3 hypothesized binding sites.

Generally, each possible configuration with n sites can be denoted by a vector i = (i₁, i₂,...,i_n) which denotes the positions at which the binding sites occur. The probability of the data given a configuration i is now given by

P (D | i) = [\prod_{σ \in B_{i}} b_{σ}] \prod_{s \in S_{i}} P (s | w),

(13)

where B_iis the set of background bases and S_iis the set of hypothesized sites in configuration i.

To assign prior probabilities P(i) to all possible configurations one generally assumes that the data D was produced through a stochastic process where at each step with probability (1 - π) a single background base is emitted, and with probability π a length-l binding site is emitted. Under this model the prior probability P(i) for a configuration i depends only on the number of sites n(i) that occurs in the configuration, and is given by

P(i) ∝ πⁿ⁽ⁱ⁾(1 - π)^L-ln(i)

Using this the posterior probability of configuration i given the data becomes

P (i | D) = \frac{P (D | i) π^{n (i)} {(1 - π)}^{L - l n (i)}}{\sum_{j} P (D | j) π^{n (j)} {(1 - π)}^{L - l n (j)}},

(15)

where the sum in the denominator is over all possible binding site configurations j.

Even though the total number of configurations grows faster than exponential with the sequence length L, the sum in the denominator can be easily calculated using dynamic programming as follows. Let F_ndenote the sum of the likelihoods of all configurations up to position n in s. We have the recurrence relation

(16)

as illustrated in Fig 3.

Notice that the sum over all configurations is just F_L, i.e. F_L= ∑_jP(D|j)P(j), which can be calculated in a time O(L) using the above recurrence relation. Similarly, we can move backward from the end of the sequence to have a recurrence relation for the sum of likelihoods of all configurations of positions n through L of s:

(17)

Finally, instead of calculating the posterior P(i|D) for a particular configuration i, we can also calculate the posterior probability that a site occurs at a given position, independent of the rest of the configuration. Let us denote by {n} the set of all configurations that have a site at segment s_[n,l].The posterior probability P({n}|D) is given by the sum of posterior probabilities of all configurations in {n}, i.e.

P ({n} | D) = \sum_{i \in {n}} P (i | D) .

(18)

It is easy to see that this sum can be expressed in terms of F_nand R_nas follows:

P ({n} | D) = \frac{F_{n} P (s_{[n, l]} | w) π R_{n + l + 1}}{F_{L}},

(19)

where the numerator corresponds to the sum over all configurations that have a site at s_[n,l].

Here it is useful to note that, formally speaking, the model that we have introduced is a hidden Markov model and that the expressions (16), (17), and (19) are essentially the same as the so-called forward-backward algorithms of hidden Markov model theory [8, 9]. Researchers with a background in statistical physics tend to think of F_Las a partition sum and the recurrence relations are essentially what is known as the transfer matrix technique.

Given the WM w for a regulatory factor we may use equation (19) to scan any sequence s for positions at which functional binding sites for the factor are likely to occur. The likelihood of success of this procedure critically depends on the density of true sites in the input sequence s. That is, even in a sequence generated entirely from the background model, segments that are indistinguishable from binding sites will occur by chance at a certain rate. For example, let's assume that such chance 'binding site lookalikes' occur once every 500 bps on average, and assume that we are looking for between 1 and 3 functional binding sites in an intergenic region of length 250 which stems from a bacterial genome. In this case we expect less than 1 binding site to occur by chance, and so we will likely be able to accurately determine the location of the 1 to 3 functional sites. In contrast, assume we are looking for 1 to 3 functional sites in the introns and upstream regions of a human gene, which together might contain as many as 100,000 bps of non-coding DNA. It is clear that in this case the functional sites will 'drown' in a sea of about 200 binding site lookalikes.

At this point the reader may ask how the cell distinguishes functional binding sites from mere 'lookalikes'. Comparing equation (2) with (6) and (7), we see that P_bound(s) can be written in terms of P(s|w) and c. In other words, two segments s and s' for which P(s|w) = P(s'|w) necessarily have P_bound(s) = P_bound(s'), and one may thus wonder why two segments that are equally likely to be bound by the regulatory factor are not equally functional. There are a number of reasons. First, in eukaryotic genomes DNA is wrapped up in chromatin and so different sites may have different accessibility to the regulatory factor. Second, binding of the regulator may by itself not guarantee functionality, i.e. a regulatory effect. A number of additional constraints typically have to be satisfied. The site may need to occur in the vicinity of specific other regulatory sites, e.g. to mediate interactions between different factors bound at the different sites. The site may need to occur at a particular distance from the basal promoter and in a particular orientation to be able to interact with the basal machinery, and other constraints currently not yet understood. When we want to look for functional sites in long sequences we thus generally have to use information that goes beyond the probabilities P(s|w) of the individual sequence segments given individual WMs w. One type of additional information that can be used is that in some cases functional binding sites are known to cluster on the genome. We now discuss approaches to incorporating this information.

Finding clusters of binding sites: regulatory modules

It has been well-established that in higher eukaryotic organisms transcription regulation is often implemented through regulatory 'modules' in which multiple binding sites for multiple regulatory factors cluster together relatively tightly in intergenic regions [10]. In some cases one may even know the subsets of regulatory factors that tend to cooperate in regulatory modules for particular biological pathways. For example, a large body of work has identified the sets of transcription factors that are involved in segmentation of the early Drosophila embryo, e.g. see [11].

One approach to distinguishing functional binding sites from nonfunctional ones is to look for such regulatory modules. That is, the idea is to start with a set of WMs {w}, preferably from a set of regulatory factors that are believed to interact in regulatory modules, and to look for relatively short genomic segments in which there is a surprisingly high density of sites for the WMs from {w}. As far as this author is aware, this general idea was introduced around the same time by a number of groups [12–15]. The implementation we discuss here is most closely-related to the approaches of refs. [14, 15].

The first thing to note is that the dynamic programming solution introduced in the previous section can be easily extended to multiple WMs w (potentially of different lengths). We now assume that the data is produced through a stochastic process where at each step with probability π_bg a background base is generated, and with probability π_wa WM segment from WM w with length l_wis generated. The priors of course satisfy the normalization π_bg + ∑_wπ_w= 1. For notational simplicity we can consider the background b to just be one of the WMs (with length l = 1) in the set {w}. In this more general model the recurrence relation for F_nbecomes

F_{n} = \sum_{w} F_{n - l_{w}} π_{w} P (s_{[n - l_{w}, l_{w}]} | w),

(20)

where the background b is now one of the WMs w.

The second thing to note is that the sum over all configurations F_L= ∑_cP(D|c)P(c|{π_w}) is formally the likelihood of the data D under our entire set of hypotheses c, that is, it is the probability to obtain the data under the assumed stochastic model. Note that in this expression we have indicated explicitly that this probability depends on the priors {π_w}. The quantity F_Lthus summarizes how well the sequence can be explained in terms of the set of WMs in the model. The basic idea of the regulatory module detecting algorithms in [14, 15] is to identify putative regulatory modules with sequence segments that have a high value for the sum F_Lof probabilities of all binding site configurations in the segment.

The procedure works as follows. One starts with an intergenic region upstream of a gene of interest in a higher eukaryotic genome. Such intergenic regions are typically quite large, i.e. from 10 Kbps in flies to over 100 Kbps in humans. One then slides a windows of length between 200 and 500 bps or so over this long intergenic region. For each window one then determines the set of priors {π_w} that maximize F_Lfor the sequence σ in the window, and calculates the value of F_Lat this maximum. One also calculates the probability P(σ|b) for the sequence in the window deriving entirely from the background model. The ratio X = F_L/P(σ|b) then quantifies the 'score' for the window in question. Finally, the predicted regulatory models are all windows for which X is larger than some prespecified cut-off, and for which the score X is larger than the score for any other window overlapping it.

A key step in this procedure is maximizing F_Lwith respect to the prior {π_w}. Different regulatory modules may have different densities of sites and we thus want to allow for different priors {π_w} within different windows. Since we do not know the {π_w} for each segment, from the point of view of probability theory one should strictly speaking not maximize with respect to {π_w} but rather integrate over all possible priors {π_w}. However, the resulting expressions no longer allow for an effective dynamic programming solution and this would thus make the problem computationally intractable. However, if the function F_Lhas a sharp peak with respect to the {π_w} then the height of the maximum is representative for the value of the integral and one can thus think of the maximization of F_Lwith respect to the {π_w} as an approximation to doing the full integral.

Assuming that segment σ is of length L the set of equations specifying the maximum with respect to the {π_w} are

c o n s t a n t = \frac{d \log (Z)}{d π_{w}} = \frac{d \log (F_{L})}{d π_{w}} = \frac{〈 n (w) 〉}{π_{w}},

(21)

where <n(w)> is the expected number of binding sites for WM w averaged over all configurations, each weighted by its probability. The last equation follows from the fact that, the prior P(c|{π_w}) is given by

P (c | {π_{w}}) = \prod_{w} {(π_{w})}^{n (w, c)},

(22)

where n(w, c) is the number of sites for w in configuration c. The derivative then becomes

\frac{d \sum_{c} P (σ | c) P (c | {π_{w}})}{d π_{w}} = \frac{1}{π_{w}} \sum_{c} n (w, c) P (σ | c) P (c | {π_{w}}) .

(23)

Thus, from (21) and that fact that the π_ware normalized to sum to 1 we have

π_{w} = \frac{〈 n (w) 〉}{\sum_{\tilde{w}} 〈 n (\tilde{w}) 〉} .

(24)

Typically this maximum is found through expectation maximization (EM). Starting from an initial guess of the {π_w} we calculate <n(w)> for all w and set a new set of priors {π_w} using equation (24). Under iteration this is guaranteed to lead to an optimum in F_L, although not necessarily the global optimum.

Motif finding

Up to now we have assumed that we are in possession of the WMs w representing the sequence-specificities of the regulatory factors. However, unless one has experimental data that directly measures binding affinities of different sequence segments we generally do not possess such detailed information. Typically the best situation encountered is that we have a collection S of sequences that have been determined to be functional binding sites for the regulatory factor. So we now ask what we know about the WM w given such a set of sequences S, i.e. we aim to calculate P(w|S).

Equation (7) gives the probability that a binding site for w will have sequence s. This can be trivially extended to sets of sequences. That is, the probability to obtain the set of n length-l sequences S when sampling n sequences from the WM w is given by

P (S | w) = \prod_{s \in S} P (s | w) = \prod_{i = 1}^{l} \prod_{α} {(w_{α}^{i})}^{n_{α}^{i} (S)},

(25)

where in the last equation we have defined $n_{α}^{i}$ (S) as the number of times the letter α occurs at position i in the sequences S. Thus, the probability to obtain sequences S when sampling from the WM w depends only on the counts $n_{α}^{i}$ (S).

Using Bayes' theorem the posterior probability P(w|S) for the WM given the set of sites S is formally given by

P (w | S) = \frac{P (S | w) P (w)}{P (S)} .

(26)

In this equation P(w) is the prior probability that the WM is given by w. The denominator is a normalizing constant, which does not depend on the WM (we discuss its meaning in a minute). The prior P(w) represents our prior information about the WM w before we see any sites. As will become clear below, the computations are analytically most easily tractable if we use so-called Dirichlet priors that have the following general form

P (w) = \prod_{i = 1}^{l} P (w_{i}) = \prod_{i = 1}^{l} c_{i} \prod_{α} {(w_{α}^{i})}^{γ_{α}^{i} - 1},

(27)

where c_iis a normalization constant for column i, and the $γ_{α}^{i}$ are constants that determine the prior. Notice that for the particular choice $γ_{α}^{i}$ = 1 we obtain a uniform prior that makes all WMs a priori equally likely, which can be argued to reflect a state of complete ignorance about the WM. In reality, however, we know that for most positions in the site, regulatory factors tend to have distinct preferences for certain bases. That is, we a priori know that a WM column wⁱ= (0.25, 0.25, 0.25, 0.25) is not very likely. To reflect this information we can choose $γ_{α}^{i}$ < 1. This will put more weight on WM columns that are 'skewed', i.e. giving low probability to some bases and high probabilities to others. Sometimes we have even more pertinent information. It has, for example, been argued recently that groups of related TFs show the same pattern of highly and less skewed columns [16]. If we are inferring the WM of such a TF we can thus reflect that information by setting $γ_{α}^{i}$ small for those positions i that are known to be highly skewed and $γ_{α}^{i}$ ≈ 1 for columns that are known not to be very skewed (for example because TFs of that family do not touch the DNA at that position).

With a Dirichlet prior of the form (27) equation (26) becomes

P (w | S) = C \prod_{i = 1}^{l} \prod_{α} {(w_{α}^{i})}^{n_{α}^{i} (S) + γ_{α}^{i} - 1},

(28)

where C is an overall normalization constant. Equation (28) shows why the $γ_{α}^{i}$ are often called pseudocounts. Increasing $γ_{α}^{i}$ by 1 has the same effect on the posterior P(w|S) as adding 1 to the number of times $n_{α}^{i}$ (S) that letter α was observed at position i. Put another way, the posterior P(w|S) has exactly the same functional form as the prior P(w), i.e. both are of the form $\prod_{α} {(w_{α})}^{x_{α}}$ with x_αthe 'count' of base α. Priors that have this property are called conjugate priors. In this particular case it means that one may think of the posterior P(w|S) as the prior for another problem with 'pseudocounts' ${\tilde{γ}}_{α}^{i} = n_{α}^{i} (S) + γ_{α}^{i}$ . How to use the distribution P(w|S) in practice? In order to estimate the WM one could for example determine the WM w that maximizes P(w|S). This maximum posterior probability WM has components

w_{α}^{i} = \frac{n_{α}^{i} (S) + γ_{α}^{i} - 1}{n^{i} (S) + γ^{i} - 4},

(29)

with nⁱ(S) = ∑_α $n_{α}^{i}$ (S) and γⁱ= ∑_α $γ_{α}^{i}$ . Note that with a uniform prior $γ_{α}^{i}$ = 1 the maximum occurs when the WM entries match the observed frequencies. This means, for example, that if a given base α is not observed at all at some position i, i.e. $n_{α}^{i}$ (S) = 0, we will assume that it is impossible for α to occur at position i. This is true even if the set S contains only very few sites.

Alternatively we may estimate the $w_{α}^{i}$ by their expected values under the distribution P(w|S). To calculate these expectation values we have to integrate P(w|S) over all possible WMs. That is, for each position i the integral is over the simplex:

\sum_{α} w_{α}^{i} = 1, w_{α}^{i} \geq 0 \forall α .

(30)

The solution to such integrals is given by the following general identity

\int {(w_{1})}^{x_{1} - 1} ... {(w_{n})}^{x_{n} - 1} d w_{1} ... d w_{n} = \frac{\prod_{i = 1}^{n} Γ (x_{i})}{Γ (\sum_{i = 1}^{n} x_{i})},

(31)

where the integral is over the simplex $\sum_{i = 1}^{n} w_{i} = 1$ . Using this identity we first find the normalization constant of equation (28). That is, by demanding that ∫P(w|S)dw = 1 we obtain

\frac{1}{C} = \prod_{i = 1}^{l} \frac{\prod_{α} Γ (n_{α}^{i} (S) + γ_{α}^{i})}{Γ (n^{i} (S) + γ^{i})},

(32)

and using this (plus the general identity Γ(x + 1) = x Γ(x)) we find for the expectation values

〈 w_{α}^{i} 〉 = \int w_{α}^{i} P (w | S) d w = \frac{n_{α}^{i} (S) + γ_{α}^{i}}{n^{i} (S) + γ^{i}} .

(33)

Note that in this estimate of the $w_{α}^{i}$ no component gets probability zero if we use a prior with $γ_{α}^{i}$ > 0 for all i and α.

In the previous section we repeatedly made use of the expression P(s|w), i.e. the probability to obtain sequence s when sampling from the WM. We now calculate an analogous expression P(s|S) = ∫ P(s|w)P(w|S)dw, which is the probability to obtain sequence s when sampling from the same WM as the one from which the set S derived (without ever specifying precisely what this WM is, i.e. we integrate over all possible w). Using again the general identity (31) we obtain

P (s | S) = \prod_{i = 1}^{l} \frac{n_{s_{i}}^{i} (S) + γ_{s_{i}}^{i}}{n^{i} (S) + γ^{i}} = \prod_{i = 1}^{l} 〈 w_{s_{i}}^{i} 〉 .

(34)

That is, we find that P(s|S) is precisely the probability that would be obtained from expression P(s|w) when using the expectation values < $w_{α}^{i}$ > as an estimate for the WM w.

Up to now we assumed that we were given a set S of length-l sequences that were sampled from the WM. Except in cases where we have, for example DNase footprinting data that give the precise locations of the regulatory sites, such specific data are again generally rare. It is much more common that we have a set of n longer sequences that we know (or strongly suspect) to contain one (or more) regulatory site(s) each for a common regulatory factor. In this situation we simultaneously need to infer where in the sequences the sites occur and what the WM is from which they derive.

To be explicit, let's assume we have a dataset D that consists of n length-L sequences, and we know that each sequence contains precisely one binding site of length l for a common regulatory factor. The set of hypotheses for this problem then corresponds to all combinations (w,i) of a WM w and a vector i = (i₁, i₂,... i_n) that denotes the positions where the regulatory sites occur, i.e i₁ is the position of the site in the first sequence, i₂ the position of the site in the second sequence, etcetera. We now first calculate the probability P(D|w, i) of the data given (w, i). Let S_idenote the set of n length-l segments that make up the hypothesized binding sites with positions i and let B_idenote all background nucleotides in the data D outside of these segments. In analogy with equation (13) the probability P(D|w, i) is then given by

P (D | w, i) = [\prod_{σ \in B_{i}} b_{σ}] \prod_{s \in S_{i}} P (s | w),

(35)

where the first product is over all nucleotides outside of the hypothesized binding sites, and the second product is over all hypothesized binding sites s.

At this point there are two possible approaches. In the first approach one calculates the probability P(D|w) of the data given the weight matrix only by summing over all possible binding site configurations i:

P (D | w) = \sum_{i} P (D | w, i) P (i),

(36)

where P(i) is a prior probability distribution over vectors of site assignments, and the sum is over all possible vectors. One then next searches the space of all possible WMs w for those with high P(D|w). In the second approach one calculates the probability P(D|i) of the data given the vector of site positions only by integrating over all possible weight matrices. Formally [6] this probability is given by

P(D|i) = ∫ P(D, w|i)dw = ∫ P(D|w, i)P(w)dw,

and next the set of all site positions i is searched for those with high P(D|i). We now discuss these approaches in turn.

Maximizing P(D|w) through Expectation Maximization

In the first approach one attempts to find the weight matrix w that maximizes the probability of the data P(D|w). Note that, as we have seen in section "Finding WM matches", the sum over all possible site configurations i can be easily performed through dynamic programming once the matrix w is given. For the particular case we are considering, i.e. assuming precisely one site per sequence, the probability P(D|w) is given by the product of the probabilities for the individual sequences

P (D | w) = \prod_{m = 1}^{n} P (D_{m} | w),

(38)

with

P (D_{m} | w) = \frac{1}{L - l + 1} \sum_{i_{m}} [P (s_{[i_{m}, l]} | w) \prod_{σ \notin s_{[i_{m}, l]}} b_{σ}],

(39)

where D_mis the m th sequence, the product over σ is over all bases outside of the site (i.e. the background), and we have used the uniform prior P(i_m) = 1/(L - l + 1) over the binding binding site position i_m.

To find the WM w that maximizes P(D|w) we proceed analogously as we did for finding the set of priors {π_w} in equations (21) through (24). For each column k of the WM we have the four equations

c o n s t a n t = \frac{d \log [P (D | w)]}{d w_{α}^{k}} = \frac{〈 n_{α}^{k} 〉}{w_{α}^{k}},

(40)

where < $n_{α}^{k}$ > is the number of times letter α is expected to occur at position k of the regulatory sites under posterior distribution P(i|D,w).

To derive the last equality, first note that derivative is a sum of independent terms

\frac{d \log [P (D | w)]}{d w_{α}^{k}} = \sum_{m = 1}^{n} \frac{d \log [P (D_{m} | w)]}{d w_{α}^{k}},

(41)

and that each term is again a sum of independent terms

\frac{d \log [P (D_{m} | w)]}{d w_{α}^{k}} = \frac{{(L - l + 1)}^{- 1}}{P (D_{m} | w)} \sum_{i_{m}} \frac{d P (D_{m} | w, i_{m})}{d w_{α}^{k}} .

(42)

Now if the base s(i_m+ k) at position i_m+ k of sequence m is equal to α, then the last derivative on the right simply divides P(D_m|w, i_m) by $w_{α}^{k}$ , and else the derivative is zero. We thus have

\frac{d P (D_{m} | w, i_{m})}{d w_{α}^{k}} = \frac{δ (s (i_{m} + k), α)}{w_{α}^{k}} P (D_{m} | w, i_{m}),

(43)

where the delta-function is one if s(i_m+ k) = α and zero otherwise. We thus find

\frac{d \log [P (D_{m} | w)]}{d w_{α}^{k}} = \frac{\sum_{i_{m}} δ (s (i_{m} + k), α) P (i_{m} | D, w)}{w_{α}^{k}} .

(44)

Note that the numerator of the right-hand side of this equation is just the expected number of times letter α occurs at position k of the binding sites in D_munder the posterior distribution P(i_m|D,w). Summing over all sequences m we thus obtain

\frac{d \log [P (D | w)]}{d w_{α}^{k}} = \frac{〈 n_{α}^{k} 〉}{w_{α}^{k}} .

(45)

Using the fact that the WM columns are normalized, we find that at the maximum of P(D|w) the weight matrix components obey the equalities

w_{α}^{k} = \frac{〈 n_{α}^{k} 〉}{n}

(46)

As in section "Finding clusters of binding sites: regulatory modules" one can use EM to solve these equations. We start with a randomly chosen WM w and calculate < $n_{α}^{k}$ > for that WM. We then update the WM components using equation (46) and repeat until the WM no longer changes. This procedure is guaranteed to converge to a local optimum of P(D|w).

Note that in the above we assumed just one site per sequence but it is easy to extend these derivations to arbitrary configurations, using the identities derived in section "Finding WM matches". Probably the first algorithm developed to find regulatory motifs in this way is the well-known MEME algorithm [17], and by now there are quite a number of algorithms that have been developed using this general idea, e.g. MDScan [18].

Once an optimal WM w_* is found it is straightforward, i.e. using equation (19), to calculate the posterior probabilities P(i_m|D, w_*) that a site occurs at position i_min sequence m and this allows one to distinguish between high confidence and low confidence sites. Programs that use the EM approach to motif finding often report such probabilities. Note, however, that the posterior probabilities P(i_m|D, w_*) should not be confused with the posterior probabilities P(i_m|D) which give the posterior probability that a site occurs at i_mindependent of what the WM w is (we derive an expression for this probability below). The latter quantifies how much evidence there is in the data D that a site occurs at i_m, whereas P(i_m|D,w_*) assumes in addition that the inferred WM w_* is correct. Since in many cases there is a reasonably high probability that w_* does not match precisely the WM from which the site derives, the probabilities P(i_m|D, w_*) will typically be significantly larger than P(i_m|D).

Finally, it would even be straightforward to extend the EM approach to multiple WMs using the expressions of section "Finding clusters of binding sites: regulatory modules". One could then, in principle, simultaneously find the set of priors {π_w} and the set of WMs {w} that maximize the overall probability P(D|{w}, {π_w}) of the data. For each WM w the expectation-maximization update equation of the WM components would take on the form

w_{α}^{k} = \frac{〈 n_{α}^{k} (w) 〉}{〈 n (w) 〉},

(47)

where <n(w)> is the expected total number of sites for WM w that occur in D and < $n_{α}^{k}$ (w)> is the expected number of those sites that have a base α at position k. The problem with this approach is that EM will very often lead to a local rather than the global optimum (it roughly speaking moves uphill from the starting point to the nearest local optimum). So depending on the initial sets {w} and {π_w} the EM procedure may lead to very different optima and the higher the dimension of the search-space, the more serious this problem becomes. Therefore, in practice algorithms such as MEME do not search for multiple WMs simultaneously but rather find one WM at a time. In addition, programs like MEME will start from many different initial WMs w and perform EM for each of them, reporting the best optimum found in any of these EMs.

Motif sampling

The second approach to motif finding focuses on the probability P(D|i). To calculate (37) we substitute (35) for the likelihood and first note that it can be separated in a part P(B_i|b, i) that depends only on the background, and a part P(S_i|i) that is given by an integral, i.e. P(D|i) = P(B_i|b,i)P(S_i|i) with

P (B_{i} | b, i) = \prod_{σ \in B_{i}} b_{σ} = \prod_{α} {(b_{α})}^{n_{α} (B_{i})},

(48)

where n_α(B_i) is the number of times base α occurs in the background B_i, and

P (S_{i} | i) = \int P (S_{i} | w, i) P (w) d w = \int P (w) \prod_{s \in S_{i}} P (s | w) d w .

(49)

For the prior P(w) we use a Dirichlet prior as in (27), and use the general identity (31) to calculate the integral, which results in

P (S_{i} | i) = \prod_{k = 1}^{l} [\frac{Γ (γ^{k})}{Γ (n + γ^{k})} \prod_{α} \frac{Γ (n_{α}^{k} (S_{i}) + γ_{α}^{k})}{Γ (γ_{α}^{k})}],

(50)

where $n_{α}^{k}$ (S_i) is the number of times base α occurs at position k of the sites in S_i, and the $γ_{α}^{k}$ are again the pseudocounts of the Dirichlet prior. The most common situation is that we know little about the WM that can be expected and in such situations either a uniform prior $γ_{α}^{k}$ = 1 or one that biases toward the corners of the simplex, e.g. $γ_{α}^{k}$ = 0.5, are reasonable choices. However, as we mentioned in the discussion of equation (28), if we already have a set of known sites S_known for the motif, in which base α appears $m_{α}^{k}$ times at position k, then the posterior probability for the WM has the same form as a prior with counts $γ_{α}^{k} = m_{α}^{k} + 1$ . Using this posterior as a prior in equation (50) we can thus also calculate the probability of obtaining the sequence segments in S_iwhen sampling from the same WM as the WM from which the set S_known derived. That is, equation (50) easily allows for the incorporation of prior knowledge about the WM w.

The meaning of equation (50)

Since the expression (50) is central in all motif sampling strategies we will divert here to discuss its meaning in a little more detail. First, note that P(S_i|i) is a product of independent factors for each column k. We thus focus on a single column only. In addition, we will assume a uniform prior over WMs, i.e. $γ_{α}^{k}$ = 1. The expression for a single column then takes on the simpler form

P (S) = \frac{3! \prod_{α} n_{α}!}{(n + 3)!} = \frac{3! n!}{(n + 3)!} \frac{\prod_{α} n_{α}!}{n!},

(51)

where we used that Γ(x + 1) = x! for integer x. The second equality on the right is to clarify that P(S) can be written as the product of two factors. The first of these factors, 3!n!/(n + 3)!, is the inverse of the binomial coefficient $(\begin{matrix} n + 3 \\ 3 \end{matrix})$ . This binomial coefficient corresponds to the number of different sets of counts {n_α} that are possible. That is, it counts the number of vectors of integers (n_a, n_c, n_g, n_t) such that ∑_αn_α= n.

The second factor in equation (51), ∏_αn_α!/n!, is the inverse of the multinomial coefficient n!/(∏_αn_α!) which gives the number of different ways that n objects can be distributed over 4 boxes such that n_aobjects are in the first box, n_cin the second, n_gin the third, and n_tin the fourth. Thus, the probability P(S) for a column of n bases is inversely proportional to the number of ways in which the counts {n_α} of this column can be realized. In summary, there are 4ⁿpossible outcomes for the n bases in the column. The probability distribution P(S) assigns a probability to each of these that is precisely inversely proportional to the number of the 4ⁿoutcomes that lead to the counts {n_α}. As a result, the total probability to obtain an outcome with counts {n_α} is constant for all $(\begin{matrix} n + 3 \\ 3 \end{matrix})$ possible counts (because we have to sum P(S) over all possible outcomes that lead to the same set of counts).

For large n we can approximate the multinomial coefficient using Stirling's approximation to find

\frac{\prod_{α} n_{α}!}{n!} \approx e^{- n H ({n_{α}})},

(52)

where H({n_α}) is the entropy of the distribution n_α/n:

H ({n_{α}}) = - \sum_{α} \frac{n_{α}}{n} \log (\frac{n_{α}}{n}) .

(53)

Thus, the probability P(S) is largest for sets of sequences whose base distributions have lowest entropy.

Back to motif sampling

We now return to our motif sampling calculations. Using (50) and (48) we obtain P(D|i) in terms of the counts $n_{α}^{k}$ (S_i) and n_α(B_i). Finally, using a uniform prior over hypotheses i, the posterior P(i|D) becomes simply

P (i | D) = \frac{P (D | i)}{\sum_{j} P (D | j)},

(54)

where the sum in the denominator is over all possible assignments j = (j₁,..., j_n) for the positions of the binding sites.

Ideally we would now either find the configuration of site positions i_* that maximizes P(i|D), or we would for each position i_kcalculate the posterior probability P(i_k|D) that a site occurs at position i_kin sequence k, which is formally given by

P (i_{k} | D) = \sum_{i_{1}, ..., i_{k - 1}, i_{k + 1}, ..., i_{n}} P (i | D) .

(55)

Unfortunately, since P(i|D) is a complicated nonlinear function of the base counts $n_{α}^{k}$ (S_i) and n_α(B_i) we cannot separate it easily into contributions from the different hypothesized sites in i and there is generally no way to calculate sums like (55) other than explicitly summing over all (L - l + 1)^n-1states. To find site configurations i with high P(i|D) researchers have in general resorted to Markov chain Monte-Carlo techniques for sampling the distribution P(i|S)[19]. The most commonly used way of sampling the distribution P(i|S) is through so-called Gibbs sampling [20] and consists of iterations of the following steps, which are illustrated in Fig. 4

1.
Randomly select one of the n sequences with uniform probability.
2.
If sequence number m was selected, remove the segment s located at position i_mfrom the set of sites S_iof the current configuration. Denote this set of (n - 1) sequences as $S_{i}^{-}$ and the new configuration as i^-.
3.
For every position i_m= 0 through i_m= L - l denote the new configuration that results from placing the site at i_min sequence m as (i^-, i_m) and calculate P(D|i^-, i_m).
4.
Select a new configuration by sampling the position of the site in sequence m according to the probability distribution
$P (i_{m} | D, i^{-}) = \frac{P (D | i^{-}, i_{m})}{\sum_{j_{m} = 0}^{L - l} P (D | i^{-}, j_{m})} .$
(56)

using (48) and (50) one finds that this probability is proportional to

P (i_{m} | D, i^{-}) \propto \prod_{k = 1}^{l} \frac{n_{s (i_{m} + k)}^{k} (S_{i}^{-}) + γ_{i}^{k}}{b_{s (i_{m} + k)} (n - 1 + γ^{k})},

(57)

where s(i_m+ k) is the base that occurs at position i_m+ k in sequence m. Note that this expression is precisely the ratio between the probability P(S[i m,L]|Si) of the site at i_mderiving from the same WM as the others in $S_{i}^{-}$ , i.e. as in equation (34), and the probability $P (s_{[i_{m}, l]} | b)$ of this segment under the background, i.e.

P (i_{m} | D, i^{-}) \propto \frac{P (s_{[i_{m}, l]} | S_{i}^{-})}{P (s_{[i_{m}, l]} | b)} .

(58)

By iterating these steps one can sample the entire distribution P(i|D) and, for example, estimate the posterior probability P(i_m|D) that a site occurs at position i_min sequence m, i.e. by the fraction of time a site occurs at i_mduring sampling. The probabilities P(i_m|D) rigorously quantify the evidence in D that a site occurs at position i_m. Thus, whenever P(i_m|D) is large we can be confident that a site does occur at i_m.

To make a single prediction for the set of regulatory sites in D one searches for the configuration i_* that maximizes P(i|D). In some approaches, e.g. [21], this is done simply by keeping track of the highest probability configuration that was observed during sampling. However, more accurate determination of the optimal configuration i can be obtained through simulated annealing [22]. One introduces a parameter β and instead of sampling from P(i|D) one samples from a probability distribution which is proportional to P(i|D)^β. At the start of the search β is set to a small number and then β is slowly increased with time. As β increases more weight will be put on configurations with high probability and eventually the sampler will 'freeze' into a state with locally optimal probability P(i|D). Provided the annealing is done slowly enough the optimum will correspond to the globally optimal state. This is for example the approach taken by the PhyloGibbs algorithm [23].

Once an optimal state i_* is found through simulated annealing one can of course use normal sampling, i.e. with β = 1, to obtain the posterior probabilities of the sites in i_*. Given the optimal configuration i_* one can of course also report the expected WM given this configuration, which has components

〈 w_{α}^{k} 〉 = \frac{n_{α}^{k} (S_{i_{*}}) + γ_{α}^{k}}{n + γ^{k}} .

(59)

Instead of assuming that there is precisely one site in each of the n sequences we can of course also sample much more general configurations c. Most generally, one could allow varying numbers of sites for multiple WMs. The top left panel in figure 5 shows such a general configuration with sites for 3 different motifs (red, blue, and green). If we assume the same kind of priors as we used in section "Finding clusters of binding sites: regulatory modules" then the prior probability for a particular configuration c, which has n(w, c) sites for WM w and n(b, c) bases in background, is proportional to

P (c | {π}) \propto [{(π_{b g})}^{n (b, c)}] \prod_{w} {(π_{w})}^{n (w, c)} .

(60)

If we denote the set of sites for WM w in configuration c by S_wand the set of background nucleotides as B(c) we obtain for the likelihood of the data given the configuration

P (D | c) = [\prod_{σ \in B (c)} b_{σ}] \prod_{w} P (S_{w}),

(61)

where for each group of sites the probability P(S_w) is given in complete analogy with (50) by

P (S_{w}) = \prod_{k = 1}^{l} [\frac{Γ (γ^{k})}{Γ (n (w) + γ^{k})} \prod_{α} \frac{Γ (n_{α}^{k} (s_{w}) + γ_{α}^{k})}{Γ (γ_{α}^{k})}],

(62)

where n(w) is the total number of sites in group S_wand $n_{α}^{k}$ (S_w) is the number of times base α occurs at position k of the sites in S_w.

The posterior probability P(c|D) of a configuration is simply proportional to the product of (60) and (61), i.e. P(c|D) ∝ P(D|c)P(c|{π}). To sample from the posterior probability P(c|D) over all possible configurations we need a more extensive set of 'moves' then the one described in the Gibbs sampler above. This can be done in a number of ways [24]. One possibility is to pick a sequence at random, to remove all sites currently located in it, and to sample from all ways of putting a new set of sites in, see [25] for details. The set of moves implemented by the PhyloGibbs algorithm [23] is illustrated in Fig. 5. These moves are:

1. Resampling a segment: Pick a sequence m at random and a random position i_min it. Check if there is a site overlapping the region from i_m+ 1 to i_m+ l in the current configuration c. If so, do nothing, i.e. move from c to c. If the region is free (or a site occurs precisely at i_m+ 1 through i_m+ l) calculate the probabilities P(c'|D) for all configurations c' that are obtained by putting a site for any of the WMs w at i_m, including putting no site at all or putting a site for a new motif. Finally, sample one of these configurations c' with probability proportional to P(c'|D).

2. Moving a site: Pick one of the sites occurring in c and remove it creating configuration c^-. Find all sequence segments s of length l in c^- that are not overlapping any site. Calculate the probability P(c'|D) for all configurations that can be obtained by placing a new site for the same WM at any of the free segments s. Sample one of these configurations c' in proportion to P(c'|D).

3. Shifting a site group: Pick one of the sets of sites S_wat random. Check how far the sites in S_wcan be shifted to the left and right without colliding with other sites in the current configuration c. Denote these maximal shifts by l_max and r_max. For every shift h between h = l_maxand h = r_maxcalculate the probability P(c'|D) of the configuration that would result if all sites in S_wwere shifted by an amount h. Sample one of the configurations c' in proportion to probability P(c'|D).

One of the main advantages of the motif sampling approach over EM algorithms is that it is much less likely to get stuck in local optima. In particular, one can sample multiple motifs without becoming trapped in bad local optima. Another advantage is that one can obtain rigorous posterior probabilities for sites appearing at different positions which allows for a more reliable separation of trustworthy predictions from spurious ones (see [23]). As for the single motif, i.e. our discussion below equation (50), we can here also use 'informative' priors for each of the motifs. That is, if we have a set motifs for which known sites are available we can use the base counts in these sites as 'pseudocounts' $γ_{α}^{k}$ of priors for corresponding motifs in the binding sites configurations. That is, apart from inferring multiple new motifs, we can use informative priors to discover new sites for known motifs at the same time. This can be especially useful when we are trying to find a new motif in a set of sequences that also contains sites for a number of known motifs. If we were to search this data for a single motif then it is quite likely that the search would return one of the known motifs. By searching for multiple motifs at the same time and using informative priors for each of the known motifs we can make sure that known sites will automatically associate with the known motifs, and that the remaining motifs are indeed new motifs. Finally, under the sampling approach one can use arbitrarily complicated priors P(c) on configurations, including priors that demand that certain combinations of sites occur at certain specified distances of each other, in particular orientations, etcetera. In the EM approach such complex priors would typically cause the dynamic programming solution to summing over all configurations to break down.

The main disadvantage of the motif sampling approach is of course speed. To obtain accurate statistics one needs to sample for a long time, and the time necessary grows with the product of the size of the data-set D, the total number of sites, and the number of motifs. In contrast, the dynamic programming approaches outlined in section "Finding WM matches" allow for efficient computation of sums over all possible configurations even for very large input data, allowing one to search very large sequences for matches to sets of WMs, which is computationally infeasible with motif sampling algorithms.

As mentioned already, motif sampling was introduced more than a decade ago [20]. Since then a significant number of algorithms has been developed including [21, 26–28], and probably many more. The PhyloGibbs algorithm [23] introduces several extensions such as simulated annealing to find the configuration with maximal probability, simultaneously sampling multiple motifs, and taking the phylogenetic relationships between the sequences into account (discussed below).

Clustering sites and motifs

There are several situations in which we may want to cluster sets of binding sites. This demand for instance arises whenever we have obtained a set of sequence segments that are thought to each have regulatory function, without knowing the specific function of any of the segments. For example, several researchers have used so-called 'phylogenetic footprinting', the identification of short overly conserved segments in alignments of orthologous intergenic regions from related genomes, to gather large collections of putative regulatory sites [29–32]. It is reasonable to assume that most of these short segments contain a regulatory site for some regulatory factor, but we do not know which sites are sites for the same factor nor how many different regulatory factors are represented in data.

Formally, given a dataset D of sequence segments, we want to partition this dataset into subsets such that all segments within a subset contain a regulatory site for a common regulatory factor, and different subsets correspond to different regulatory factors. In addition, we want to multiply align all the segments within each subset. Thus, for this problem the set of hypotheses is all possible ways in which the set D can be partitioned into subsets, and all possible ways in which the sequences in each subset can be multiply aligned. Let us denote possible configurations by C. Each configuration C consists of a set of subsets c ∈ C that each consist of a collection of sequences from D. The union of these subsets c of course equals D. In addition C specifies, for each subset c, an alignment S_cof sequence segments that are taken from the sequences in c. For simplicity we will assume that all these sequence segments are of fixed length l in all subsets. That is, C specifies a partition of the sequences in D into subsets c, and it specifies where in each of the sequences the regulatory site of length l occurs, thereby specifying length-l alignments S_cfor each subset c. We now want to calculate the probability P(D|C) of the data given a configuration C. We can generally separate P(D|C) into a contribution of the sites (those segments from the sequences in hypothesized regulatory sites) and the bases outside these segments that are scored according to a background model.

P(D|C)= P(D_sites|C)P(D_bg|C).

For simplicity we will use a background model that assigns a probability 1/4 to each base (extensions to more complex background models are straight forward). In that case the contribution P(D_bg|C) is constant, i.e. does not depend on C and we just consider P(D_sites|C). This probability can be written as a product of independent contributions from each subset

P (D_{s i t e s} | C) = \prod_{c \in C} P (S_{c}),

(64)

where S_cis the alignment of sites in subset c. The probability P(S_c) is just the probability that all sequence segments in S_cderive from a common WM. The probability P(S_c) is simply given by replacing S_iwith S_cin the right-hand side of equation (50).

To obtain the posterior probability

P (C | D) = \frac{P (D | C) P (C)}{\sum_{C^{'}} P (D | C^{'}) P (C^{'})}

(65)

we also need a prior P(C) over partitions. The simplest prior is of course to assign a uniform prior P(C) = constant. Note however that, a uniform prior over partitions may correspond to a very peaked prior with respect to the number of clusters. That is, given a dataset with 100 sequences there are astronomically more partitions of the data into, say, 30 subsets than there are partitions of the data into, say, 2 subsets. If one wants a uniform prior over the number of clusters one needs to assign a probability P(C) ∝ 1/ $S_{| C |}^{| D |}$ , where |D| is the total number of sequences in D, |C| is the number of subsets in C, and $S_{| C |}^{| D |}$ is the number of possible partitions of |D| objects into |C| subsets, which is called a Stirling number of the second kind [33]. Note that with this prior a particular configuration C with, say, 2 subsets will have a much higher a priori probability than a configuration with, say, 30 subsets. That is, it is impossible to be a priori completely ignorant about partitions in general and about the number of subsets at the same time. Again there is no easy way to find the configuration C with maximal posterior probability. A fast procedure for determining a state C with high posterior probability is through hierarchical clustering. One starts out with each sequence in D forming a subset on its own. For every pair of sequences s, and s' one then calculates the probability of the configuration C(s, s', i, i') that is obtained when the subsets s and s' are joined into a cluster, putting the hypothesized sites at positions i and i' respectively. We then find the combination (s, s', i, i) with maximal P(C(s, s', i, i')|D) and create the corresponding state C(s, s', i, i'). This procedure is repeated, i.e. at each iteration two subsets are fused so as to maximize P(C|D). The iteration stops when there is no more subset merger that would increase P(C|D). The great disadvantage of this procedure is that it generally leads to highly suboptimal local optima in P(C|D).

A better alternative is to use Markov chain Monte-Carlo sampling and simulated annealing. A simple and effective move-set is as follows

1.
Select one of the sequences in D at random and remove it from its current subset thereby creating a configuration C^-.
2.
For each of the subsets in C^- consider the configuration C(c, i) when the removed sequence s is put into subset c and the length-l site is s is started at position i. Also consider the configuration C(0) which is obtained by putting the sequence s in a subset of its own. Calculate P(C|D) for all these configurations and sample one of the configurations in proportion to these probabilities.

These steps are illustrated in Fig. 6.

By repeating these two steps one can sample from the posterior distribution P(C|D) over all possible configurations. Through simulated annealing, i.e. sampling from P(C|D)^βand slowly increasing β, one can attempt to locate the configuration C* which globally maximizes P(C|D). The PROCSE software [34] implements such a Markov chain Monte-Carlo scheme for simultaneously clustering and aligning sets of sequences that are thought to contain regulatory sites and it has been used to predict regulons in bacteria genome-wide. It has also been used to automatically curate sets of experimentally determined binding sites [35]. PROCSE first determines a 'reference configuration' C* through simulated annealing and then performs another sampling run, i.e. with β = 1, to determine the posterior probabilities of the clusters that occur in the reference state C*.

An almost identical procedure as just described can be used to cluster motifs or arbitrary combinations of motifs and sequences. Application of different motif finding algorithms to the same dataset, or application of the same algorithm to related datasets, often results in sets of inferred motifs that show clear commonalities. One is thus often interested in analyzing sets of motifs to identify which motifs are really different, and which motifs might represent a common underlying WM.

As we have seen in section "Finding WM matches" all our information about a motif, i.e. a WM w, can be represented by counts $n_{α}^{k}$ that represent the number of observations of base α at position k of the sites. So more generally, we will assume that when we are given 'a motif' this information can always be represented by a set of counts $n_{α}^{k}$ . For example, when we are given WM components $w_{α}^{k}$ then we transform this into a set of counts $n_{α}^{k}$ by specifying the pseudocounts $γ_{α}^{k}$ of a prior, and the effective total number of observations n on which the $w_{α}^{k}$ are based:

n_{α}^{k} = w_{α}^{k} (n + γ^{k}) - γ_{α}^{k} .

(66)

Without loss of generality we can thus think of these counts $n_{α}^{k}$ as deriving from an alignment S of sites for the motif. That is, we can generally specify our knowledge about a 'motif' by specifying an alignment S of sites drawn from the motif WM.

Such alignments S can be clustered and aligned with each other completely analogously to the procedure just described for single sequences. That is, one can think of an alignment S as a set of individual sequences s that have already been clustered and aligned with each other. When multiple such alignments S are mutually aligned and clustered into a larger alignment S_cthen we calculate the probability P(S_c) that all sequences in S_cderive from a common WM exactly like we did before, i.e. equation (49). Thus, the only difference between clustering and aligning single sequences, and clustering and aligning 'motifs' is that motifs are represented by sets of multiply aligned sequences, and that these motif alignments are so to speak 'glued together' in that these sequences will never be repartitioned during the sampling. The PROCSE software also allows for such pre-clustered and pre-aligned sequences to be submitted as input. In this way arbitrary combinations of single sequences and motifs can be aligned and clustered simultaneously.

Incorporating phylogeny

In all our approaches so far we have assumed that different sequences that contain binding sites can be considered independent samples from a WM w. In addition, the motif finding approaches that we discussed all presume that one is given sets of sequences that are likely to contain sites for common regulatory factors. In many cases researchers use independent biological evidence, such as expression data, to collect such sets of sequences that appear 'co-regulated' [27, 36]. Apart from expression data, more recently ChIP-on-chip techniques have been used to collect sets of sequences that appear to be bound by a common regulatory factor, see e.g. [37, 38].

Another possibility is to collect sets of orthologous intergenic regions from related species. It is often reasonable to assume that many of the regulatory sites occurring in the ancestor of these species have been maintained and are shared by all or most of the descendants. Therefore, orthologous intergenic sequences can generally be expected to contain sites for common regulatory factors. However, in contrast to sites in collections of upstream regions of genes from a single species, these sites cannot be considered independent samples from a common WM. That is, the orthologous sites are related evolutionary, and their sequences will therefore generally be more correlated than independent samples from a WM. Therefore, to correctly analyze orthologous intergenic regions we need to take the phylogenetic relationships of the species into account.

Binding site evolution

Let us consider a single position in a regulatory site whose WM has components w_αat that position. We now want to calculate the probabilities P_αβ(w, t) that over an evolutionary time t this position in the site evolves from base β to base α.

There is a long history of such models for the evolution of amino acids, e.g. see [39, 40]. For our application to nucleotide evolution a general treatment of this problem was given by the model of Halpern and Bruno [41]. The rate u_αβat which base β is substituted by base α during evolution is written as the product of an instantaneous rate of mutation μ_αβfrom β to α, and the probability f_αβthat a mutation from β to α will be fixed in the population (which depends on selection), i.e.

u_αβ= f_αβμ_αβ.

Under this general model the probabilities P(α|β, w, t) are the solution of the differential equations

\frac{d P_{α β} (w, t)}{d t} = \sum_{γ \neq α} [u_{α γ} P_{γ β} (w, t) - u_{γ α} P_{α β} (w, t)] .

(68)

Note that in the limit of long time the probabilities P_αβ(w, t) become independent of time, i.e. memory of the start state is lost, and by the definition of the WM components the probabilities P_αβ(w, t) limit to w_α, i.e.

\lim_{t \to \infty} P_{α β} (w, t) = w_{α} .

(69)

Assuming that the rates μ_αβare given one can then solve [41] for the substitution rates u_αβthat will lead to the limit distribution (69):

u_{α β} = μ_{α β} \frac{\log [\frac{μ_{β α} w_{α}}{μ_{α β} w_{β}}]}{1 - \frac{μ_{α β} w_{β}}{μ_{β α} w_{α}}} .

(70)

To solve equation (68) we note that it can be written as a matrix equation. Define the rate matrix U through

U_{α β} = u_{α β} - δ_{α β} \sum_{γ} u_{γ α} .

(71)

In terms of this matrix U equation (68) becomes

\frac{d P (t)}{d t} = U \cdot P (t),

(72)

with matrix P(t) having components P_αβ(t). Using the boundary condition P_αβ(0) = δ_αβthe solution is given by

P_αβ(t) = (e^Ut)_αβ.

In this general model one thus solves for P_αβ(w, t) by first determining U using equation (70), and determining its eigenvalues and eigenvectors. However, note that in general the solution is a complicated function of the WM components w_αwhich is not easily amenable to further analysis.

To allow more analytic flexibility we have developed a simpler model of the evolution of binding sites [23, 42] that assumes that all mutations are introduced at the same rate, i.e. μ_αβ= μ, and that the probability of fixation f_αβdepends only on the target base α i.e. f_αβ= w_α. Under these assumption the differential equations become

\frac{d P_{α β} (w, t)}{d t} = μ \sum_{γ \neq α} [w_{α} P_{γ β} (w, t) - w_{γ} P_{α β} (w, t)] = μ w_{α} - μ P_{α β} (w, t)

(74)

This equation can be easily solved to give

P_αβ(w, t) = δ_αβe^-μt+ w_α(1- e^-μt).

Note that e^-μtis the probability that no mutations have taken place during time t. We call this no-mutation-probability the proximity q = e^-μtbetween the ancestor and the descendant [23]. In terms of the proximity the solution becomes

P_αβ(w, q) = δ_αβq + (1 - q)w_α.

This expression has a nice simple interpretation. With probability q no mutations have taken place in going from β to α and the bases are identical. With probability (1 - q) one or more mutations took place and the probability that one then ends up with base α is simply the WM component w_α.

Probability of an orthologous set of bases

Assume that we have a set of orthologous intergenic regions and assume that we know the phylogenetic tree T that relates the species from which the regions derive. Consider now a set of orthologous bases S from these intergenic regions. That is, the bases in S have evolved from a common ancestor base in the common ancestor of the species according to the tree T. We now calculate the probability P(S|T, w) that, when evolving from a common ancestor under one of the evolutionary models just discussed, and according to the given phylogenetic tree T, the set of bases S will result at the leafs of the tree.

Note that the set S only specifies the bases at the leafs of the tree T, i.e. the bases at the internal nodes are unknown. If we also knew all the bases at the internal nodes we could calculate P(S|T, w) simply by multiplying the probabilities P_αβ(w, t) for each branch, i.e.

P (S | T, w) = \prod_{n} P_{s_{n} s_{a (n)}} (w, t_{n}),

(77)

where the product is over all nodes n, s_nis the base at node n, a(n) is the ancestor of node n, and t_nis the length of the branch from a(n) to n. This is illustrated in the left panel of Fig. 7.

However, as we do not know the identities of the bases at the internal nodes, we thus have to sum over all possibilities. This can be done using a dynamic programming scheme first presented by Felsenstein [43]. We denote by D_α(n, w) the probability to observe all bases of S that are descendants of node n of the tree given that node n has base α. For nodes n that are leafs, i.e. bases of S, we of course have D_α(n, w) = δ_αsn. We can determine D_α(n, w) for all nodes using the following recursion relation

D_{α} (n, w) = \prod_{m \in c (n)} [\sum_{β} P_{α β} (w, t_{m}) D_{β} (m, w)] .

(78)

where c(n) is the set of children of node n, and t_mis the length of the branch connecting m to its parent n. This basic recursion is illustrated in the middle panel of Fig. 7. Starting from the leafs we can use (78) to calculate D_α(n, w) for all nodes up to the root of the tree. Finally the probability P(S|T, w) for the whole tree is obtained by summing over the bases of the root node r, noting that the prior probability that root r has base α is w_α. This gives

P (S | T, w) = \sum_{α} w_{α} D_{α} (r, w) .

(79)

In complete analogy we can calculate the probability P(S|T, b) of the column of bases S assuming that they evolved under a background model b. which is given by background probabilities b_α. To obtain P(S|T, b) we just replace P_αβ(w, t) with P_αβ(b, t) for each branch of the tree in equation (78) and replace w_αwith b_αin (79). Finally, we can also easily accommodate cases in which the regulatory site has been maintained in some but not all species. That is, we can have some branches of the tree T evolve according to the background model b whereas other branches evolve according to the WM column w, simply by using P_αβ(w, t) for each branch evolving according to the WM, and using P_αβ(b, t) for each branch evolving according to the background. An example of such a more complicated 'selection pattern' is shown in the right panel of Fig. 7.

Finding sites and modules in multiple alignments

To apply the probabilities P(S|T, w) and P(S|T, b) to a set of orthologous intergenic regions we of course first have to identify which sets of bases in these sequences form orthologous groups. That is, we have to produce a multiple alignment of the orthologous intergenic regions. Given a multiple alignment we can then assume that every column of the alignment corresponds to a set of orthologous bases. The problem of producing accurate multiple alignments of non-coding sequences is extremely challenging and is beyond the scope of this article. There are now a number of algorithms available that focus specifically on alignment of non-coding DNA [44–46], although our personal experience is that consistency based methods [47, 48] and evolutionary explicit progressive alignment [49] often outperform these methods significantly. From this point on we will assume that a global multiple alignment of the orthologous intergenic regions is given and that we can assume that vertically aligned bases in this alignment are orthologous.

We can use the probabilities P(S|T, w) and P(S|T, b) that we derived above to extend the formalism of sections "Finding WM matches" and "Finding clusters of binding sites: regulatory modules" to multiple alignments. The simplest way of doing this is to take one of the sequences in the multiple alignment as a reference sequence and to consider all binding site configurations for this reference sequence. This is often natural since in many cases we are really only interested in finding regulatory sites in one particular species and it is thus natural to take this species as a reference.

Let s_[i,l]denote a segment of length l in this reference sequence, and let S_[i,l]denote the corresponding block in the multiple alignment. To calculate the probability that a regulatory site occurs at s_[i,l]we will now calculate the probabilities of observing the alignment segment S_[i,l]under different assumptions for the selection that was operating at each branch of the tree T relating the species in the alignment. The simplest assumptions about the selection are that either all sequences in S_[i,l]evolved according to the background model, i.e. using expression P(S|T, b) for each column S in S_[i,l], or that all sequences evolved according to WM w, i.e. using P(S|T, w) for each column S in S_[i,l]. Many algorithms [23, 50, 51] in fact restrict themselves to these two possibilities. However, there are many other possibilities. If there are B branches in the tree then there are in principle 2^Bpossible ways of assigning selection to the branches, i.e. either WM w or background b for each branch. Formally, to calculate the probability that a regulatory site occurs at s_[i,l]we would want to consider all 2^{B - 1}'selection patterns' σ for which s_[i,l]is under selection of the WM w. We would want to assign prior probabilities P(σ) to all 2^Bpossible selection patterns σ, and calculate the probabilities P(S_[i,l]|T, σ) for each. Finally, by summing P(S_[i,l]|T,σ)P(σ) over all selection patterns for which s_[i,l]is under selection of the WM w one would obtain the total probability of the data S_[i,l]under the assumption that a regulatory site occurs at s[i,l]. Unfortunately, there is no simple way of determining a reasonable distribution P(σ) and the sum would generally involve a large number of terms. This author is not aware of any algorithm that currently implements this general scheme.

In the MotEvo algorithm [35] a single selection pattern σ_* is chosen that best fits the alignment and the sequences in it. Note first that, since WMs have a fixed width, a site in the reference species can only occur in another species if the corresponding segment in that species is gaplessly aligned with the site in the reference species. Therefore, we first check which of the other sequences in S_[i,l]are gaplessly aligned with the reference sequence and which are not. For those sequences not gaplessly aligned with the reference we assign the background evolution model to the branches leading to these sequences. For each of the other sequences s in S_[i,l]we calculate the probability P(s|w) of the sequence under the WM w, and the probability P(s|b) of the sequence under the background model b. Whenever P(s|w) > P(s|b) we assign the WM model to the branch leading to s, and for all others we assign the background model. Finally, we assume that an internal node evolved according to the WM if any of its descendants do. This defines a unique selection pattern σ for S_[i,l]and we calculate P(S|T, w) using this selection pattern. The procedure is illustrated in Fig. 8.

We also calculate P(S_[i,l]|T, b) assuming all branches evolved according to background. Finally, if we assign a prior probability π that a site occurs at s_[i,l] the posterior probability P(site|S_[i,l]) that the reference species has a functional site at i becomes

P (s i t e | S_{[i, l]}) = \frac{P (S_{[i, l]} | T, w) π}{P (S_{[i, l]} | T, w) π + P (S_{[i, l]} | T, b) (1 - π)} .

(80)

This is essentially the expression used by the MotEvo algorithm [35] to find regulatory sites. The MONKEY algorithm finds regulatory sites in a very similar manner. Instead of the simple evolutionary model (76) MONKEY uses the more general Halpern/Bruno model (70). However, MONKEY does not consider the possibility that the site is conserved in some but not all of the aligned species, i.e. it assumes that either all branches of the tree evolve according to the WM, or all branches evolve according to background.

Instead of looking at one sequence segment at a time, we can of course also use this formalism to calculate sums of the probabilities of all possible binding site configurations as in section "Finding WM matches". Instead of calculating the probability P(s_[i,l]|w) of a single sequence segment under the WM we instead calculate the probability P(S_[i,l]|w) of the ungapped alignment block at that location using the procedure just outlined. That is, for every segment s_[i,l]we find which other sequences are ungapped at the segment and choose which of these are evolving according to the WM based on the probabilities of the individual sequence segments under the WM. The generalization of equation (20) is then simply

F_{n} = F_{n - 1} P (S_{[n - 1, 1]} | T, b) π_{b g} + \sum_{w} F_{n - l_{w}} P (S_{[n - l_{w}, l_{w}]} | T, w) π_{w} .

(81)

Note that position n here always refers to the n th base in the reference sequence.

Finally, using this formalism we can of course also search for regulatory modules in multiple alignments in complete analogy with the equations in section "Finding clusters of binding sites: regulatory modules".

This procedure has been implemented for two-species alignments in the Stubb algorithm [42]. Applying the Stubb algorithm to predict developmental regulatory modules in Drosophila it was shown in [52] that using two-species alignments improves predictions of the locations of regulatory modules over the single species algorithms.

Motif finding incorporating phylogeny

In section "Motif finding" we discussed two approaches to motif finding, one based on maximizing the probability P(D|w) of the data given a WM w using expectation maximization, and one using Markov chain Monte-Carlo sampling to find the site configuration c that maximizes the posterior P(c|D). These methods can also be extended in a straightforward way to multiple alignments and we now discuss these in turn.

Motif EM incorporating phylogeny

The PhyME algorithm implements an extension of the MEME algorithm to multiple alignments of orthologous intergenic sequences from related species. It uses a reference species and considers all configurations of binding sites that can be assigned to the reference species in the same way as discussed in the previous section, i.e. it uses equation (81) to calculate the overall likelihood P(D|w) of the alignment given the WM w. The evolutionary model that is used by PhyME to score ungapped alignment blocks P(S_[i,l]|w) is precisely the simplified model of equation (76). However, like MONKEY and in contrast to MotEvo, PhyME assumes that either all branches in the tree evolved according to the WM model, or that all evolved according to background.

To maximize P(D|w) with respect to the WM PhyMe needs to solve, for each column k in the WM, the equations

\frac{d P (D | w)}{d w_{α}^{k}} = c o n s t a n t \forall α .

(82)

Note that for the single sequence case, the derivative of P(s|w) with respect to the WM components $w_{α}^{k}$ , was very simple, i.e. see (43). In contrast,the derivative dP(S_[i,l]T, w)/d $w_{α}^{k}$ is a much more complicated function of the WM components $w_{α}^{k}$ which needs to be calculated recursively just as P(S_[i,l]|T, w) itself. Here it becomes particularly advantageous that in the simplified model (76) the probability P_αβ(w,t) is such a simple function of the WM components. We do not discuss the mathematical details of solving (82) here except for mentioning the fact that it involves an iterative procedure similar to EM that leads to a local optimum in P(D|w).

Motif sampling incorporating phylogeny

We now discuss extending the motif sampling approach of section "Motif sampling" to alignments of phylogenetically related sequences. Remember that in the motif sampling approach, instead of summing over all possible binding site configurations to calculate the probability P(D|w) conditioned on the WM, we condition on a particular binding site configuration c and calculate the probability P(D|c) by integrating over all possible WMs w.

Instead of a set of single sequences the input will now generally consist of a set of multiple alignments of orthologous non-coding sequences or a combination of multiple alignments and single sequences. As in section "Motif sampling" we want to consider all possible configurations c of binding sites that can be assigned to the input data D, and calculate the probability of the data P(D|c) for each possible configuration. Whereas for single sequences the space of all possible configurations existed simply of all ways in which sets of non-overlapping windows can be assigned to the sequences, i.e. see Fig. 2, for multiple alignments the situation is a bit more complicated and illustrated in Fig. 9.

Above we assumed that a reference sequence s is given for each multiple alignment and that the set of binding ste configurations for the alignment is simply the set of all binding site configurations for the reference species. In the PhyloGibbs algorithm [23] there is no reference sequence and each sequence in the multiple alignment is treated the same. A site can be hypothesized to occur at any position of any of the sequences. By definition the algorithm assumes that, whenever a site occurs in one species, it will also occur in all other species that are gaplessly aligned with it at that location. That is, sites are automatically extended to all species that are mutually gaplessly aligned at that position, see Fig. 9. The algorithm makes sure to only allow configurations in which none of the sites overlap.

Next we need to calculate P(D|c) for every possible such configuration c. This probability P(D|c) is given by an equation essentially identical to equation (61). However, instead of single background bases σ with probability b_σwe will now have alignment columns S with probability P(S|T, b) as calculated in section "Probability of an orthologous set of bases". The set of sequences S_wassigned to a WM w will now generally consist of several ungapped segments from the multiple alignments, i.e. alignment blocks, and possibly some single sequences as well, see Fig. 9. The probability P(S_w) will again be an integral over all possible WMs but the integrand in this case will be considerably more complicated. For simplicity let's focus on a single column from the set S_wof sequence segments and alignment blocks. For simplicity assume that this column from S_wcontains two independent columns S, and $\tilde{S}$ from the multiple alignments, see Fig. 9. The probability P(S_w) would then be formally given by

P (S_{w}) = \int P (S | T, w) P (\tilde{S} | \tilde{T}, w) P (w) d w,

(83)

where T is the phylogenetic tree of alignment column S, $\tilde{T}$ the phylogenetic tree of alignment column $\tilde{S}$ , and the expressions P(S|T,w) and P( $\tilde{S}$ | $\tilde{T}$ ,w) are given as in equations (78) and (79). To calculate the integral notice that, formally, the expression P(S|T,w) is a polynomial in the WM components of the following form

P (S | T, w) = \sum_{k} c_{k} \prod_{α} {(w_{α})}^{m_{α}^{k}}

(84)

where the prefactors c_kdepend on the branch lengths in the tree and the $m_{α}^{k}$ are sets of integers. The expression P( $\tilde{S}$ | $\tilde{T}$ ,w) can of course also be written in this form. Denote its prefactors ${\tilde{c}}_{\tilde{k}}$ , and its exponents ${\tilde{m}}_{α}^{k}$ . Using this the integral can be rewritten as

P (S_{w}) = \sum_{k, \tilde{k}} c_{k} {\tilde{c}}_{\tilde{k}} \int {\prod_{α} (w_{α})}^{m_{α}^{k} + {\tilde{m}}_{α}^{\tilde{k}} + γ_{α} - 1} d w .

(85)

Note that each monomial term of the form $\prod_{α} {(w_{α})}^{m_{α}^{k} + {\tilde{m}}_{α}^{\tilde{k}} + γ_{α} - 1}$ can be easily integrated using the general expression (31). We then obtain for the integral

P (S_{w}) = \sum_{k, \tilde{k}} c_{k} {\tilde{c}}_{\tilde{k}} \frac{Γ (γ)}{Γ (m^{k} + {\tilde{m}}^{\tilde{k}} + γ)} \prod_{α} \frac{Γ (m_{α}^{k} + {\tilde{m}}_{α}^{\tilde{k}} + γ_{α})}{Γ (γ_{α})} .

(86)

So in principle we can analytically determine the value of the integral P(S_w) in this way. However, the number of terms in the above sum grows exponentially both with the number of sequences in each alignment and, more importantly, with the number of alignments under the integral. That is, if the configuration c contains 10 multiple alignment segments for WM w, then even if there were only 10 terms for each alignment column P(S|T, w), there would still be 10¹⁰ terms in total. In practice we thus have to resort to approximations of the above integral. The approach that is taken in the PhyloGibbs algorithm is to approximate the expression P(S|T, w) with a monomial for each alignment column, i.e.

P (S | T, w) \approx c \prod_{α} {(w_{α})}^{x_{α}},

(87)

where the x_αmay be non-integer. The prefactor c and the exponents x_αare set such that the first moments of the approximation match those of P(S|T, w). That is, we demand that

c \int \prod_{α} {(w_{α})}^{x_{α}} d w = \int P (S | T, w) d w .

(88)

and

c \int w_{β} \prod_{α} {(w_{α})}^{x_{α}} d w = \int w_{β} P (S | T, w) d w

(89)

for all β. As shown in [23] this fixes c and the relative sizes of the x_αbut leaves ∑_αx_αstill free. The absolute magnitude of the x_αwe set so as to approximate the second moments, i.e. such that

c \int w_{β} w_{γ} \prod_{α} {(w_{α})}^{x_{α}} d w \approx \int w_{β} w_{γ} P (S | T, w) d w .

(90)

for all combinations of β and γ. With these approximations the integral for P(S_w) becomes simply

P (S_{w}) = c \tilde{c} \frac{Γ (γ)}{Γ (x + \tilde{x} + γ)} \prod_{α} \frac{Γ (x_{α} + {\tilde{x}}_{α} + γ_{α})}{Γ (γ_{α})},

(91)

where x = ∑_αx_αand the variables with a tilde are those of the approximation to P( $\tilde{S}$ | $\tilde{T}$ , w). The crucial point of this approximation procedure is that, at the start of the algorithm, we can determine these approximations, i.e. the values of the x_α, for every multiple alignment column S that occurs in the input data once and store the results. We thus replace the complex expression P(S|T, w) with the simple expression $c \prod_{α} {(w_{α})}^{x_{α}}$ for each alignment column S. After that, when we are sampling different configurations, the expression P(S_w) can be as efficiently calculated as for single sequences. That is, we can simply use equation (62), where $n_{α}^{k}$ (S_w) is now the sum over the x_αof all the alignment segments that occur in S_w.

For the prior over configurations P(c) PhyloGibbs uses the same priors (60) as for configurations over single sequences. PhyloGibbs uses Markov chain Monte-Carlo sampling to sample the space of all binding site configurations. The move-set employed when sampling binding site configurations in multiple alignments is essentially the same as the move-set for binding site configurations in single sequences illustrated in Fig. 5. The only difference is that 'sites' now typically extend over multiple aligned sequences, as illustrated in Fig. 9. Simulated annealing is used to find a configuration c_* that maximizes the posterior probability P(c|D). Finally, a further sampling run is used to calculate the posterior probabilities of the sites in configuration c_*. PhyloGibbs reports both the configuration c_* and the inferred WMs of the motifs in c_*, as well as posterior probabilities for all sites occurring in c_*. In [23] we demonstrate the performance of PhyloGibbs on synthetic data, on individual multiple alignments of orthologous intergenic regions from yeast, and on sets of multiple alignments of intergenic regions from yeast that are bound by a common regulatory factor [38]. These tests show that taking phylogeny into account significantly improves the performance in motif finding.

Finally, it is important to distinguish the motif finding methods that rigorously incorporate phylogeny by probabilistically modeling the evolution of binding sites, such as the PhyME and PhyloGibbs algorithms just discussed, from more ad hoc algorithms that use comparative genomic information in various ways in motif finding. This includes for example methods that simply identify significantly conserved sequence segments in multiple alignments, [30–32]. These conserved segments can then be post-processed to search for over-represented motifs. In other approaches, e.g. [29, 53], orthologous upstream regions are searched in the same way as set of upstream regions of co-regulated genes from a single species would be searched, i.e. ignoring the evolutionary relationships between the sequences. In other algorithms [54, 55] one only takes the topology of the phylogenetic tree into account and searches for length-l segments that occur in all orthologous sequences, such that the minimal number of mutations necessary to relate the length-l segments, i.e. the parsimony score, is under some prespecified cut-off. Another approach is to first search for significantly conserved segments in orthologous intergenic regions, and to then multiply align conserved segments from the upstream regions of co-regulated genes. This approach is taken by the PhyloCon algorithm [56] which, in spite of its name, ignores the phylogenetic relations between the species.

The biggest challenge for incorporating comparative genomic information in motif finding that is currently outstanding is the treatment of the multiple alignment. It is clear that errors in the multiple alignment can have very deleterious effects on the performance of algorithms such as PhyME, PhyloGibbs, and MotEvo. Ideally one would simultaneously search the space of all multiple alignments and all binding site configurations. However, this space is very large and it is currently unclear if and how it can be effectively searched, especially for large data-sets.

References

Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987, 193: 723-750. 10.1016/0022-2836(87)90354-8.
Article CAS PubMed Google Scholar
Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P: High-throughput SELEX-SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol. 2002, 20: 831-835.
Article CAS PubMed Google Scholar
Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an approximation is it?. Nucl acids res. 2002, 30 (20): 4442-4451. 10.1093/nar/gkf578.
Article PubMed Central CAS PubMed Google Scholar
Djordjevic M, Sengupta AM, Shraiman BI: A Biophysical approach to Transcription Factor Binding Site Discovery. Genome Research. 2003, 13: 2381-2390. 10.1101/gr.1271603.
Article PubMed Central CAS PubMed Google Scholar
Bintu L, Buchler NE, Garcia HG, Gerland U, Hwa T, Kondev J, Phillips R: Transcriptional regulation by the numbers: models. Curr Opin Genet Dev. 2005, 15 (2): 116-124. 10.1016/j.gde.2005.02.007.
Article PubMed Central CAS PubMed Google Scholar
Jaynes ET: Probability Theory: The Logic of Science. 2003, Cambridge University Press
Book Google Scholar
Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in protein-DNA binding sites. RECOMB. 2003, 28-37.
Google Scholar
Rabiner LR: A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.
Article Google Scholar
Durbin R, Eddy S, Krogh G, Mitchison G: Biological Sequence Analysis. 1998, Cambridge University Press
Book Google Scholar
Davidson EH: Genomic regulatory systems. 2001, San Diego: Academic Press
Google Scholar
Rivera-Pomar R, Jackle H: From gradients to stripes in Drosophila embryogenesis: filling in the gaps. Trends Genet. 1996, 12 (11): 478-483. 10.1016/0168-9525(96)10044-5.
Article CAS PubMed Google Scholar
Frith MC, Hansen U, Weng Z: Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics. 2001, 17 (10): 878-889. 10.1093/bioinformatics/17.10.878.
Article CAS PubMed Google Scholar
Berman BP, Nibu Y, Pfeifferdagger BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 2002, 99: 757-762. 10.1073/pnas.231608898.
Article PubMed Central CAS PubMed Google Scholar
Rajewsky N, Vergassola M, Gaul U, Siggia ED: Computational detection of genomic cis regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics. 2002, 3 (30):
Zavolan M, Rajewsky N, Socci ND, Gaasterland T: SMASHing regulatory sites in DNA by human-mouse sequence comparisons. Proc IEEE Conf on Comp Sys Bioinf. 2003
Google Scholar
Eisen MB: All motifs are NOT created equal: structural properties of transcription factor-DNA interactions and the inference of sequence specificity. Genome Biol. 2005, 6 (5): P7-10.1186/gb-2005-6-5-p7.
Article Google Scholar
Bailey T, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for MolecularBiology. 1994, 2: 28-36.
CAS Google Scholar
Liu XS, Brutlag DL, Liu JS: algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation experiments. Nat Biotechnol. 2002, 20: 835-839.
Article CAS PubMed Google Scholar
Liu JS: Monte Carlo Strategies in Scientific Computing. 2001, Springer-Verlag
Google Scholar
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.
Article CAS PubMed Google Scholar
Thompson W, Rouchka EC, Lawrence CE: Gibbs Recursive Sampler: finding transcription factor binding sites. Nucl Acids res. 2003, 31 (13): 3580-3585. 10.1093/nar/gkg608.
Article PubMed Central CAS PubMed Google Scholar
Kirkpatrick S, Jr CDG, Vecchi MP: Optimization by Simulated Annealing. Science. 1983, 220 (4598): 671-680. 10.1126/science.220.4598.671.
Article CAS PubMed Google Scholar
Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny. PLoS Comput Biol. 2005, 1 (7): e67-10.1371/journal.pcbi.0010067.
Article PubMed Central PubMed Google Scholar
Frenkel D, Smit B: Understanding Molecular Simulation: From Algorithms to Applications. 1996, Academic Press
Google Scholar
Liu JS, Neuwald AF, Lawrence CE: Markovian structures in biological sequence alignment. Journal of the American Statistical Association. 1999, 1-15. 10.2307/2669673.
Google Scholar
Roth FP, Hughes JD, Estep PW, Church CM: Finding DNA-regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945. 10.1038/nbt1098-939.
Article CAS PubMed Google Scholar
Liu X, Liu JS, Brutlag DL: Bioprospector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 2001, 127-138.
Google Scholar
Thijs G, Lescot M, Marchal K, Rombauts S, Moor BD, Rouzé P, Moreau Y: A higher order background model improves the detection of regulatory elements by Gibbs Sampling. Bioinformatics. 2001, 17 (12): 1113-1122. 10.1093/bioinformatics/17.12.1113.
Article CAS PubMed Google Scholar
McCue LA, Thompson W, Carmack CS, Ryan MP, Liu JS, Derbyshire V, Lawrence CE: Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl Acids Res. 2001, 29 (3): 774-782. 10.1093/nar/29.3.774.
Article PubMed Central CAS PubMed Google Scholar
Rajewsky N, Socci ND, Zapotocky M, Siggia ED: The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons. Genome Res. 2002, 12: 298-308. 10.1101/gr.207502. Article published online before print in January 2002.
Article PubMed Central CAS PubMed Google Scholar
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003, 301: 71-76. 10.1126/science.1084337.
Article CAS PubMed Google Scholar
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.
Article CAS PubMed Google Scholar
Abramowitz M, Stegun IA, Eds: Handbook of Mathematical Functions. With Formulas. Graphs, and Mathematical Tables. 1974, Dover Pubns
Google Scholar
van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA. 2002, 99: 7323-7328. 10.1073/pnas.112690399.
Article PubMed Central CAS PubMed Google Scholar
Erb I, van Nimwegen E: Statistical Features of yeast's transcriptional regulatory code. IEE Proceedings Systems Biology ICCSB. 2006
Google Scholar
Hughes JD, Estep PW, Tavazoie S, Church CM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.
Article CAS PubMed Google Scholar
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Jospeh Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tange JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcription regulatory neworks in Saccharomyces cerivisiae. Science. 2002, 799-804. 10.1126/science.1075090.
Google Scholar
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, DK DKP, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of a eukaryotic genome. Nature. 2004, 431: 99-104. 10.1038/nature02800.
Article PubMed Central CAS PubMed Google Scholar
Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas of protein sequence and structure. 1978, 5: 345-352.
Google Scholar
Müller T, Spang P, Vingron M: Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method. Mol Biol Evol. 2002, 19: 8-13.
Article PubMed Google Scholar
Halpern AL, Bruno WJ: Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998, 15 (7): 910-917.
Article CAS PubMed Google Scholar
Sinha S, van Nimwegen E, Siggia ED: A probabilistic method to detect regulatory modules. Bioinformatics. 2003, 19 (suppl 1): i292-i301. 10.1093/bioinformatics/btg1040.
Article PubMed Google Scholar
Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.
Article CAS PubMed Google Scholar
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzogloue S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003, 13 (4): 721-731. 10.1101/gr.926603.
Article PubMed Central CAS PubMed Google Scholar
Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA. 1996, 93: 12098-12103. 10.1073/pnas.93.22.12098.
Article PubMed Central CAS PubMed Google Scholar
Bray N, Pachter L: MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res. 2004, 14: 693-699. 10.1101/gr.1960404.
Article PubMed Central CAS PubMed Google Scholar
Do C, Mahabhashyam M, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research. 2005, 15: 330-340. 10.1101/gr.2821705.
Article PubMed Central CAS PubMed Google Scholar
Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.
Article CAS PubMed Google Scholar
Loytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 2005, 102: 10557-10562. 10.1073/pnas.0409137102.
Article PubMed Central PubMed Google Scholar
Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB: MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004, 5: R98-10.1186/gb-2004-5-12-r98.
Article PubMed Central PubMed Google Scholar
Sinha S, Blanchette M, Tompa M: PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004, 5: 170-10.1186/1471-2105-5-170.
Article PubMed Central PubMed Google Scholar
Sinha S, Schroeder MD, Unnerstall U, Gaul U, Siggia ED: Cross-species comparison significantly improves genome-wide prediction of cis-regulatory modules in Drosophila. BMC Bioinformatics. 2004, 5: 129-10.1186/1471-2105-5-129.
Article PubMed Central PubMed Google Scholar
McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 2002, 12: 1523-1532. 10.1101/gr.323602.
Article PubMed Central CAS PubMed Google Scholar
Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol. 2002, 9 (2): 211-223. 10.1089/10665270252935421.
Article CAS PubMed Google Scholar
Blanchette M, Tompa M: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 2002, 12 (5): 739-748. 10.1101/gr.6902.
Article PubMed Central CAS PubMed Google Scholar
Wang T, Stormo G: Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003, 19 (18): 2369-2380. 10.1093/bioinformatics/btg329.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

EvN thanks Ionas Erb, Nacho Molina, and Mikhail Pachkov for useful comments.

This article has been published as part of BMC Bioinformatics Volume 8 Supplement 6, 2007: Otto Warburg International Summer School and Workshop on Networks and Regulation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S6

Author information

Authors and Affiliations

Biozentrum, University of Basel, and Swiss Institute of Bioinformatics, Klingelbergstrasse 50/70, Basel, Switzerland
Erik van Nimwegen

Authors

Erik van Nimwegen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Erik van Nimwegen.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

van Nimwegen, E. Finding regulatory elements and regulatory motifs: a general probabilistic framework. BMC Bioinformatics 8 (Suppl 6), S4 (2007). https://doi.org/10.1186/1471-2105-8-S6-S4

Download citation

Published: 27 September 2007
DOI: https://doi.org/10.1186/1471-2105-8-S6-S4

Otto Warburg International Summer School and Workshop on Networks and Regulation

Finding regulatory elements and regulatory motifs: a general probabilistic framework

Abstract

The weight matrix representation of regulatory sites

Finding WM matches

Finding clusters of binding sites: regulatory modules

Motif finding

Maximizing P(D|w) through Expectation Maximization

Motif sampling

The meaning of equation (50)

Back to motif sampling

Clustering sites and motifs

Incorporating phylogeny

Binding site evolution

Probability of an orthologous set of bases

Finding sites and modules in multiple alignments

Motif finding incorporating phylogeny

Motif EM incorporating phylogeny

Motif sampling incorporating phylogeny

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

BMC Bioinformatics

Contact us

Otto Warburg International Summer School and Workshop on Networks and Regulation

Finding regulatory elements and regulatory motifs: a general probabilistic framework

Abstract

The weight matrix representation of regulatory sites

Finding WM matches

Finding clusters of binding sites: regulatory modules

Motif finding

Maximizing P(D|w) through Expectation Maximization

Motif sampling

The meaning of equation (50)

Back to motif sampling

Clustering sites and motifs

Incorporating phylogeny

Binding site evolution

Probability of an orthologous set of bases

Finding sites and modules in multiple alignments

Motif finding incorporating phylogeny

Motif EM incorporating phylogeny

Motif sampling incorporating phylogeny

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us