### Mathematical model

Let

*S* be the set of

*N* sequences to be analyzed. Each sequence

*s*
_{
i
}∈

*S*, of length

*n*
_{
i
}, (

*i* = 1...

*N*) is assumed to contain zero or one motif pair. Below, we refer to motif-containing sequences as being

*regulated* and denote this by

*R*
_{
i
}
*= true*. The position of the first and second motif in a particular sequence

*s*
_{
i
}is denoted

*a*
_{
i
}and

*b*
_{
i
}respectively. Motifs are modeled as two position frequency matrices

*A* and

*B*, where

*A*
_{
j
}[

*l*] and

*B*
_{
j
}[

*l*] denotes the probability of the nucleotide

*l* appearing in position

*j* of motif A and B respectively. Unregulated sequences are modeled as background sequence, described by an order 0 Markov process with nucleotide frequencies

*θ*
_{0}. Regulated sequences are modeled as a combination of motif and background sequence. The probability of a sequence

*s*
_{
i
}(where

*s*
_{i,j}denotes the

*j*:th base in the sequence) can therefore be written

and where *W*
_{
A
}and *W*
_{
B
}are widths of the motifs. *a*
_{
i
}and *b*
_{
i
}cannot take on arbitrary values but will depend on each other, since we are looking for motif pairs where the distance between the two must follow certain criteria. We use a prior *p*(*a*
_{
i
},*b*
_{
i
}) to reflect this, described below. We also assume there is a fixed prior probability *p(R*
_{
i
}= *true)* for any sequence to be regulated. For *θ*
_{0}, A
_{
j
} and B
_{
j
} we use Dirichlet priors, with pseudocounts *α*[*l*] proportional to the frequencies of the bases in all the sequences. Our goal is to find values for *R* = (*R*
_{1}, ..., *R*
_{
N
}), *a* = (*a*
_{1}, ..., *a*
_{
N
}) and *b* = (*b*
_{1}, ..., *b*
_{
N
}) which maximize the posterior *p*(*R*,*a*,*b* | *S*). To accomplish this we use an algorithm based on the Gibbs sampling principle for motif discovery [21], which makes use of the predictive update version of the Gibbs sampler [22].

Given a partitioning of the sequences into motifs and background (

*a*,

*b* and

*R*) we can calculate the total observed counts of nucleotide

*l* in the background (

*c*
_{0}[

*l*]) and in the different positions

*i* of motif A (

*c*
_{A,i}[

*l*]) and motif B (

*c*
_{B,i}[

*l*]). Sequences where

*R*
_{
i
}= 0 are assumed to contain only background sequence. We can then estimate

*A*,

*B*, and

*θ*
_{0} as the expectation of

*p*(

*A*,

*B*,

*θ*
_{0} |

*R*,

*a*,

*b*,

*S*):

As in other Gibbs motif samplers, an iterative update/sampling procedure is applied. One of the sequences,

*s*
_{
i
}, is removed from the alignment by setting

*R*
_{
i
}= 0. Given values for

*A*,

*B*, and

*θ*
_{0} according to the formulas above, new values for

*R*
_{
i
},

*a*
_{
i
}and

*b*
_{
i
}are determined by sampling from

*p*(

*R*
_{
i
},

*a*
_{
i
},

*b*
_{
i
}|

*A*,

*B*,

*θ*
_{0},

*S*) using the following steps: Bayes formula on odds form gives that

which is used to sample whether

*R*
_{
i
}=

*true*. Note that, using (1) and (2), we have

We define the prior

*p*(

*a*
_{
i
},

*b*
_{
i
}) to be proportional to an indicator function

*e*(

*a*
_{
i
},

*b*
_{
i
}) which is zero unless

*a*
_{
i
}and

*b*
_{
i
}represent a pair of motif positions compatible with the assumptions that the motifs are both within the sequence, do not overlap, and have a distance conforming to the assumed periodicity and the assumed possible variation around this periodicity. As described above, the allowed distance is modeled as a fixed phase

*φ* plus a variable integer multiple of the period

*T* (Figure

1). Specifically, given

*W*
_{
A
},

*W*
_{
B
}, the period

*T*, the phase

*φ*, the allowed deviation from exact periodic distance ("noise"), the length of sequence

*i* and the minimum and maximum distances, we can for all

*i* = 1..

*n*
_{
i
}find all

*j* such that

*e*(

*i*,

*j*) = 1, and the value of (8) can be calculated as

Secondly, we get that

*p*(

*a*
_{
i
}|

*R*
_{
i
}=

*true*,

*A*,

*B*,

*θ*
_{0}, S) is proportional to

so if *R*
_{
i
}= *true*, a value for *a*
_{
i
}can be sampled by using probabilities proportional to the numbers (10). Finally, *b*
_{
i
}can be sampled by noting that given *R*
_{
i
}= *true* and a value for *a*
_{
i
}, the probabilities for valid values of *b*
_{
i
}according to *e*(*i*, *a*
_{
i
}) are proportional to *Q*
_{
B
}[*b*
_{
i
}].

The algorithm is initiated by setting all

*R*
_{
i
}=

*false*. The update/sampling procedure described above is then performed for each sequence

*s*
_{
i
},

*i* = 1...

*N*. When all

*R*
_{
i
},

*a*
_{
i
}and

*b*
_{
i
}have been updated, the alignment is scored according to

We are interested in finding values which maximize *p*(*R*, *a*, *b* | *S*), which approximately corresponds to maximizing *F* above. Having completed a full iteration of the update/sampling procedure, sampling continues at the first sequence. The algorithm stops when the same *F* has been observed several times in a row or when the maximum number of iterations is reached. To avoid getting stuck in local maxima, the algorithm is restarted several times. It is also systematically restarted with different settings of the phase *φ* (all values between 0...*T*-1 are evaluated), as this parameter is not updated during each run of the algorithm and therefore has to be determined exhaustively.

To avoid that the algorithm finds "shifted versions" of the actual motifs, a type of shift jump is introduced. Each time the score *F* is improved, possible shifts of the motifs are found, defined by adding or subtracting some integer to all *a*
_{
i
}and *b*
_{
i
}. For each of the possible shifts (*a**, *b**), we calculate *F*. If a better score is encountered, the positions are updated and used as a starting point for the next update/sampling iteration.

For simplicity, we have described the case where motif pairs are assumed to occur only on the forward strand. Our method optionally permits both forward and reverse strands to be searched. In this case, the sampling distribution and the calculation of the posterior probability for *R* is extended to included both strands. Optionally, information about conservation between species can be used to favor placement of motifs in evolutionarily conserved regions. In this case, instead of single sequences, pairwise alignments of orthologous sequences are loaded into the program. Gaps are removed from the "base" sequences to ensure that correct distances are maintained. The fraction of conserved bases over windows the same size as the motifs is calculated for each possible motif position. The sampling distributions are then weighted according to this vector. A similar strategy is implemented in [23]. The same vector is also used to exclude regions from being searched. This allows the sampler to be restarted after convergence to search for a new set of non-overlapping binding sites.