Volume 10 Supplement 1
Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)
DNA motif alignment by evolving a population of Markov chains
- Chengpeng Bi^{1, 2}Email author
DOI: 10.1186/1471-2105-10-S1-S13
© Bi; licensee BioMed Central Ltd. 2009
Published: 30 January 2009
Abstract
Background
Deciphering cis-regulatory elements or de novo motif-finding in genomes still remains elusive although much algorithmic effort has been expended. The Markov chain Monte Carlo (MCMC) method such as Gibbs motif samplers has been widely employed to solve the de novo motif-finding problem through sequence local alignment. Nonetheless, the MCMC-based motif samplers still suffer from local maxima like EM.
Therefore, as a prerequisite for finding good local alignments, these motif algorithms are often independently run a multitude of times, but without information exchange between different chains. Hence it would be worth a new algorithm design enabling such information exchange.
Results
This paper presents a novel motif-finding algorithm by evolving a population of Markov chains with information exchange (PMC), each of which is initialized as a random alignment and run by the Metropolis-Hastings sampler (MHS). It is progressively updated through a series of local alignments stochastically sampled. Explicitly, the PMC motif algorithm performs stochastic sampling as specified by a population-based proposal distribution rather than individual ones, and adaptively evolves the population as a whole towards a global maximum. The alignment information exchange is accomplished by taking advantage of the pooled motif site distributions. A distinct method for running multiple independent Markov chains (IMC) without information exchange, or dubbed as the IMC motif algorithm, is also devised to compare with its PMC counterpart.
Conclusion
Experimental studies demonstrate that the performance could be improved if pooled information were used to run a population of motif samplers. The new PMC algorithm was able to improve the convergence and outperformed other popular algorithms tested using simulated and biological motif sequences.
Background
Discovering cis-regulatory elements or DNA motifs in genomic sequences is fundamental to build genetic networks and important to understand gene regulation in biological and pathological processes [1]. Although much algorithmic effort has been expended, it still remains challenging (see recent reviews in [2, 3]). Multiple sequence local alignment coupling with position weight matrix (PWM) updating, or PWM-based technique for short, has been widely used to solve the de novo motif discovery problem [2, 3]. This problem is of combinatorial optimization, and it has proven NP-complete [4]. A motif model or PWM can be uniquely defined by a local alignment, and the PWM updating is applied to progressively calibrate the alignments until some specified criterion is met. In the following text, the terms local alignment and motif discovery are often interchangeable. Basically, the PWM updating technique consists of two broad approaches to approximating the local alignment solutions, deterministic and stochastic motif-finding algorithms. The deterministic local alignment algorithm developed by Lawrence and Reiley in 1990 [5] is rooted on the Expectation Maximization (EM) method [6]. EM has been widely applied in various scientific computing since its introduction due to its simplicity and efficiency [7]. The EM motif algorithm has spawned a plethora of its variations (see review in [8] and references therein), for example, a popular motif algorithms called MEME is an enhanced EM version [9].
The stochastic local alignment method is the second approach in the PWM updating. It has its origin in Markov chain Monte Carlo (MCMC) methods [10] pioneered by Metropolis et al. in 1953 [11] and generalized later on by Hastings [12]. The earliest MCMC-based motif discovery algorithm is called Gibbs motif sampler developed in 1993 [13], and later many other MCMC-based motif samplers have been investigated such as BioProspector [14]. Moreover, the EM algorithm relates to the MCMC algorithms in the sense that it can be viewed as a forerunner of the Gibbs sampler in its data augmentation version [10, 15, 16], also as seen in a recent comparative study on these two approaches to detecting protein domains [17]. The MCMC algorithms have some appealing features better than EM, for example, they can escape from the local optima suffered by the EM algorithms. Most importantly, MCMC can be used as a general framework for solving a wide range of complex problems where EM often fails, because for these problems computing the expectation or maximization steps as required in EM often become infeasible. In particular, MCMC provides a framework for drawing samples from a complicated target distribution that cannot be sampled with simpler deterministic methods.
In practical sequence alignment, the MCMC sampling algorithms may also suffer from local maxima like EM [3]. To remedy this limitation, a MCMC motif algorithm is often run a multitude of times each starting from different points, that is, there is a population of Markov chains running in parallel. This simple strategy is strongly recommended and should be able to efficiently explore the probability landscape with multiple modes [3]. The population-based Monte Carlo methods serve two purposes, that is, to improve or diagnose convergence [10, 18] and escape from local maxima [3]. It can be classified into two categories: (1) running multiple independent Markov chains (IMC) without any information exchange between chains, and (2) enabling information exchange among a population of Markov chains (PMC). It is straightforward to implement IMC for any MCMC-based samplers. In real world, IMC is often encouraged in running MCMC motif samplers. However, as up to now PMC is rarely concerned with in sequence local alignment [3]. In literature, PMC exists in a diversity of forms. For example, parallel tempering (PT) evolves R Markov chains each attaching a different temperature [19]. Chains in PT exchanges information by swapping two states that are generated by mutation, and accepted by a modified Metropolis-Hasting rule. Note that PT is the population-based simulated annealing algorithm and it only exchanges information pairwise. The real PMC algorithm should be able to exchange information at the population level, for example, the evolutionary Monte Carlo (EMC) [20] and population MCMC [21]. This paper presents the PMC motif-finding algorithm, a novel local alignment method. It evolves a population of Markov chains (PMC) with information exchange, each chain being updated according to the population-based or pooled proposal distributions and Metropolis-Hasting rule. Experimental studies demonstrate that the new algorithm was able to improve the convergence as well as evolve some better local alignments while compared to IMC and other motif algorithms tested.
Results
Metropolis-Hastings sampler for local alignment
A Metropolis-Hastings motif sampler (MHS) was first devised to carry out multiple local alignment. MHS run a single Markov chain initialized from a random alignment seed, which is the so-called initial alignment (A^{(0)} with t = 0 step). A motif model or PWM matrix Θ^{(t)}at step t can be derived from such a sequence alignment (A^{(t)}). Then a Monte Carlo simulation is processed as follows: first scanning all sequences by the motif model built at step t. The scanning process simply calculates the probabilities of each potential motif sites on the input sequences using equation (1). Second, new samples are drawn from each sequence according to proposed distributions built from the sites scanned in the step t. Third, a energy function as defined in equation (3) is calculated, and the Metropolis-Hastings acceptance rate defined in equation (6) is applied to decide whether or not the new samples (i.e. new alignment) are passed to the next iteration. Finally, the above procedure is iterated for a number of iterations or cycles (C). A parallel MHS (i.e IMC) run the MHS motif sampler a multitude of times (R) independently without any information exchange, whereas the population-based MHS (i.e. PMC) run the same MHS sampler R times with information exchange enabled among all chains.
Scanning function
where a' = a_{ i }+ w - 1, j' = j - a_{ i }+ 1, and δ is the indicator function. The information weight function simply indicates how similar a motif nucleotide probability (${\overrightarrow{\mathbf{\theta}}}_{j}$) is relative to the background (${\overrightarrow{\mathbf{\theta}}}_{0}$). Therefore, a site gets a larger scanning probability if its motif sequence is more different than the background. The scanning function can be derived in the EM motif algorithms [5, 8] or Gibbs motif samplers [13, 14, 17].
Scoring function
Given a set of unaligned sequences (S), each local alignment can be treated as a configuration (v) in the whole local alignment space. The potential energy of a configuration (i.e. a local alignment A^{(t)}) is defined by,
H(v) = H(Θ^{(t)}, A^{(t)}, S),
where |A| is the aligned motif sites, and ${\tau}_{t}({\mathbf{\theta}}_{jk})=-{\mathbf{\theta}}_{jk}^{(t)}\mathrm{log}(1.0/{\mathbf{\theta}}_{jk}^{(t)})$ is a negative entropy function. The energy function described as above and its variants have been widely adopted in maximum likelihood [5, 8, 9, 17] or maximum a posterior based local alignment algorithms [10, 13, 14].
MHS sampling
The probability of a local alignment follows the Boltzmann distribution as,
p(A|S) = Z^{-1} exp{-λH(A)},
where ΔH = H(A^{(n)}) - H(A^{(t)}). The transition probability from alignment A^{(t)}to A^{(n)}is thus expressed as follows: T(A^{(n)}|A^{(t)}) = P(A^{(n)}|A^{(t)})α_{ H }(A^{(n)}|A^{(t)}). The MHS algorithm keeps updating the current best alignment and associated maximum energy along the sampling process or Markov chain. The output is the final best alignment (A*) and its corresponding maximum energy (H_{ max }) as well as the associated best motif model (Θ*). It should be pointed out that the energy maximization is equivalent to the minimization, that is, the optimizations max{H} and min{-H} are the same.
The IMC version of the MHS motif algorithm can be easily implemented in parallel. Let R independent Markov chains each starting from a different initial alignment, and then the IMC algorithm run R chains separately. Thus, one can send each run (i.e. a single MHS sampler) to a different compute node for execution. A set of the best alignments collected from all samplers in nodes can be sorted out and output the top best alignments as the potential solutions (note that the top-10 solutions are output as the default).
The PMC motif-finding algorithm
where ΔH^{ r }= H(A^{(rn)}) - H(A^{(rt)}). The α_{ P }rule is independently applied to all chains in the population.
This would ensure that each chain evolves independently, but adaptively progresses as a member in the population towards a unified direction.
Comparison of sampling trajectories
On the other hand, the IMC version took the longest to come up with the best alignment at iteration 93 in 3 out of 5 chains. In addition, the IMC chains ended up with different results (Figure 1A), because some initial seeds result in good alignments with high energy (H), but others in bad alignments with low H. This nicely demonstrates that a prerequisite for achieving good alignment solutions would be running multiple times of a MCMC-based sampler, which provides more opportunities of being able to search more space and thus escape from local maxima, as already suggested in the literature [3, 17]. As noted in Figure 1B–1C, all PMC chains quickly converged to the same equilibrium, which produced a global optimal solution rather than a set of different solutions exhibited in IMC (Figure 1A).
Subtle motif discovery
Comparison using (l, d)-motifs. Note that here EM is DEM [8]. The number in bold corresponds to the best predictor in that case row.
Algorithms | |||||
---|---|---|---|---|---|
(l, d) | WEE | PRO | EM | IMC | PMC |
10,2 | 0.46 | 0.53 | 0.32 | 0.42 | 0.73 |
11,2 | 0.74 | 0.95 | 0.47 | 0.96 | 0.93 |
12,3 | 0.27 | 0.30 | 0.29 | 0.22 | 0.59 |
13,3 | 0.44 | 0.85 | 0.43 | 0.72 | 0.85 |
14,4 | 0.29 | 0.25 | 0.31 | 0.21 | 0.55 |
15,4 | 0.27 | 0.67 | 0.40 | 0.50 | 0.75 |
16,5 | 0.20 | 0.16 | 0.32 | 0.13 | 0.28 |
17,5 | 0.23 | 0.84 | 0.59 | 0.39 | 0.76 |
18,6 | 0.20 | 0.12 | 0.34 | 0.14 | 0.32 |
19,6 | 0.20 | 0.64 | 0.58 | 0.26 | 0.75 |
ave | 0.33 | 0.53 | 0.41 | 0.40 | 0.65 |
The IMC and EM motif-finders performed nearly equal on average (i.e. 0.40 vs. 0.41). These poor results achieved are in tune with previous reports [26, 27], that is, the EM and MCMC-based motif algorithms had difficulties in solving the planted motif problem. IMC only correctly predicted three easy cases: (11,2), (13,3) and (15,4) with average nla = 0.96, 0.72 and 0.50, respectively, whereas EM barely uncovered two easy cases with long motif widths: (17,5) and (19,6) with nla = 0.59 and 0.58, respectively. Notice that the PMC algorithm is significantly better than its independent counterpart IMC in almost all the simulated cases. It is evident that the PMC motif algorithm is superior to IMC, see the 25% increment on average performance (i.e. 0.65 vs. 0.40). However, in finding highly conserved motifs, IMC and PMC may perform approximately the same, as will be shown in the JASPAR benchmarks.
Weeder only detected the (11,2) case and failed the remaining cases with average performance of nla = 0.33. Previous studies showed that Weeder was one of the best motif predictors in mammalian promoter regions [29], it may imply that a highly heterogeneous word frequency distribution made a major contribution to its success in real biological cases, whereas the word frequencies in the simulated backgrounds were nearly uniformly distributed, that may render the motif-finding much more difficulty to Weeder. In addition, Weeder may be preferred in non-oops motif discovery, because its strategy is that selecting as many significant words as possible and put all of them in the candidate list. In difficult situations, while other algorithms totally fail, Weeder may still have the chance to find some meaningful signals. However, such winning strategy becomes the weakness in oops motif discovery, because the more motifs reported, the higher false positives incurred. Currently, there is no way to force Weeder run as an oops model.
Testing JASPAR benchmark
In order to test the algorithm performance using the JASPAR data sets, one needs to generate simulated promoter sequences since the original JASPAR sequences are too short (about 25 bp on average). The simulated eukaryotic promoter sequences were generated by zero-order Markov model and they were composed of 45% GC content, which conforms to the ENCODE promoter regions [1]. Each JASPAR binding site was implanted into one simulated promoter sequence of length 500 bp. Note that the longer the sequence, the harder to locate the target as demonstrated in Figure 2. The 500-bp long sequence is approximately the same size as in a ChIP-chip data set, and should be a moderate size to challenge each algorithms tested. The Weeder and Projection programs were run the same as in the planted motif cases. For Projection, when l = 9 bp, d is still set to 2 mismatches, and when l = 20 – 22 bp, d = 7 – 8 mismatches as an extension to the original problem. Here the EM motif algorithm used is the popular MEME [9] run as its defaults.
Comparison using JASPAR. Note that here EM is MEME [9].
Algorithms | |||||
---|---|---|---|---|---|
w | WEE | EM | PRO | IMC | PMC |
9 | 0.38 | 0.39 | 0.43 | 0.53 | 0.53 |
10 | 0.55 | 0.65 | 0.63 | 0.65 | 0.77 |
11 | 0.48 | 0.73 | 0.74 | 0.80 | 0.80 |
12 | 0.73 | 0.79 | 0.84 | 0.88 | 0.90 |
13 | 0.21 | 0.75 | 0.77 | 0.82 | 0.85 |
14 | 0.65 | 0.82 | 0.82 | 0.91 | 0.87 |
15 | 0.90 | 0.90 | 0.90 | 0.97 | 0.97 |
16 | 0.53 | 0.79 | 0.73 | 0.85 | 0.95 |
20 | 0.82 | 0.90 | 0.89 | 0.94 | 0.94 |
ave | 0.58 | 0.75 | 0.75 | 0.82 | 0.84 |
The above experimental results illustrated the ability of the PMC algorithm to predict the binding sites not only in the simulated subtle motif data, but also in the experimentally verified data. PMC did outperform its independent counterpart IMC. In other words, information exchange between multiple Markov chains as implemented in PMC improved the convergence and the motif prediction as well. If a population of MHS motif samplers evolves towards a common target function, it would be expected that the performance might be improved, because pooled information is used to inform each individual proposal distribution. Suppose an individual chain from one sampler starts from a bad point and it definitely results in a poor solution while evolving independently. Now, if the population information is used to define its proposal distribution, that would render the poor chain much better with the benefit of being a member, as illustrated in Figure 1.
Discussion
This paper presents a novel PMC motif-finding algorithm by evolving a population of Markov chains. The PMC motif algorithm exchanges local alignment information between individual chains by applying a pooled proposal distribution to all chains, and thus each individual chain can adaptively evolve towards a population-level equilibrium or global target function. Experimental studies demonstrate that the new PMC algorithm was able to improve the convergence and evolve better alignment solutions while compared to its multiple independent Markov chains method (IMC) and other algorithms. Further investigation into the population MCMC methods may be as follows: (1) disturbance added to the PMC chains may assist in exploring the alignment space, for example, one can incorporate genetic operators such as crossover and mutation into the PMC procedure; (2) fine-tuning sampling parameters might be informative to boost the accuracy of subtle motif discovery.
On the other hand, genetic algorithms [30], a class of adaptive global search methods modeled after biological systems, have been recently tried to overcome the limitations inherent in EM [31] as well as in MCMC [32]. It can be expected that computational intelligence techniques, which include genetic algorithms, neural networks and swarm intelligence, are likely to play important roles in sequence motif alignment.
Methods
Multiple sequence local alignment
The motif discovery problem is simply defined as: finding some recurrent short sequence patterns or motifs that are likely embedded in a given set of biological related sequences (S), for example, upstream promoter regions of co-regulated genes or enriched binding sequences in ChIP-chip experiments among others. Multiple local alignment is the most widely used method to locate over-represented sites in a given sequence set. The aligned over-represented sites are then used to build a frequency matrix that depicts a conserved domain or motif. Let S = {S_{1}, ..., S_{ i }, ..., S_{ N }} denote the sequence data set. Let L_{ i }be the length of the sequence i (S_{ i }) and S_{ ij }denote a residue symbol taking on a value in K, for instance, K = {A, C, G, T} is an alphabet of DNA sequences. If only one motif per sequence (i.e. oops model) is assumed, there are N motifs in total for N sequences. A zero or one motif per sequence (i.e. zoops) model is also frequently used. Nonetheless, both oops and zoops models assume that sequence data come from a two-component multinomial mixture model: (1) the background model, assuming that residues at non-motif positions follow an independent and identical multinomial distribution (${\overrightarrow{\mathbf{\theta}}}_{0}$); and (2) the w-mer motif model (or w-motif), assuming that residues within the motif are independent but not identical, in other words, residues at different motif positions come from different multinomial distributions (${\overrightarrow{\mathbf{\theta}}}_{j}$).
Let A_{ i }be an indicator variable drawn from the location space ${\{0,1\}}^{({L}_{i}-w+1)}$ of sequence i, A = [A_{1}, ..., A_{ i }, ..., A_{ N }]^{ T }be the set of indicator variables representing the motif start sites (i.e. a local alignment) in the sequences, and w be the motif width. The total number of local alignments (V) can be generally expressed as: $V={\displaystyle {\prod}_{i=1}^{N}\left(\begin{array}{l}{L}_{i}-w+1\\ \left|{A}_{i}\right|\end{array}\right)}$, and here the number of motif sites on sequence i is defined as: |A_{ i }| = Σ_{ l }A_{ il }. Therefore, if |A_{ i }| = 1 for all i, it is an oops model, otherwise it is a zoops or multiple-site model. The total number of motif sites is |A| = Σ_{ i }|A_{ i }|. Alternatively, an indicator variable a_{ i }= l is used to represent the motif starting at position l on sequence i, which is equivalent to A_{ il }= 1. Note that a_{ i }= 0 means no motifs found on sequence i. If multiple sites occur on a sequence, a vector (${\overrightarrow{a}}_{i}$) is used to store all the positions. Obviously a motif indicator vector ${\overrightarrow{a}}_{i}\in {\mathcal{P}}_{i}$ ({1, 2, ..., L_{ i }- w + 1}), here ${\mathcal{P}}_{i}$ is the power set of the i-th sequence motif sites. The alignment of motif sites is initialized by randomly generating a set of motif start sites (i.e. A^{(0)} or equivalently ${[{\overrightarrow{a}}_{1}^{(0)},\mathrm{...},{\overrightarrow{a}}_{N}^{(0)}]}^{T}$) and then it is progressively refined until convergence.
Evaluating the weight function
where τ_{ jk }is the information weight of the base k on motif position j. The pseudo-counts $\overrightarrow{\beta}$ are added to avoid zero counting in a motif alignment due to small sample, β_{ k }is the pseudo-count of the base k, and |$\overrightarrow{\beta}$| = Σ_{ k }β_{ k }. It is often set as follows: β_{ k }= 1.25/4 for a DNA sequence [33]. The background distribution ${\overrightarrow{\mathbf{\theta}}}_{0}$ can be estimated from the sequences or user-specified. Note that the expectation of the information weight defines the relative entropy function [34] or information content [24].
Overall objective function
Therefore, the optimized alignment state (v*) is the one with the maximum probability (p*). If v* is found, then the parameter estimation (Θ*) is done. However, computing the partition function or normalized constant (i.e. Z) is commonly intractable, because the alignment problem has proven NP-complete [4]. Markov chain Monte Carlo methods are frequently applied to approximate this kind of hard problem. This paper defined the maximum log-likelihood as the potential energy function, see equation (3).
Metropolis-Hastings sampler
Since it is hard to draw a sample directly according to equation (11), the Monte Carlo optimization methods such as the Metropolis-Hastings sampling (MHS) algorithm are often employed to iteratively draw a series of samples according to a proposed probability distribution such as the one in equation (5). Therefore, the sequence local alignment can be formulated as a Markov chain Monte Carlo (MCMC) optimization problem. Note that MCMC is a general framework for approximation methods in search of a large space characterized by complicated or unknown functions.
In solving the sequence local alignment problem, MHS generates samples (e.g. A) from a probability distribution p(·), and explores the local alignment space Ω(S) using a Markov chain. MCMC does not sample directly from p(·), but only requires that the density p(v) can be evaluated within a multiplicative constant p(v) = $\tilde{p}$(v)/Z and $\tilde{p}$(v) is the un-normalized target distribution. In equation (4), $\tilde{p}$ = exp{-λH(·)}. A Markov chain is a discrete-time stochastic process {A^{(0)}, A^{(1)}, ⋯} with property that the state A^{(t)}given all previous values {A^{(0)}, A^{(1)}, ⋯ A^{(t-1)}} only depends on A^{(t-1)}: P(A^{(t)}|A^{(0)}, A^{(1)}, ⋯ A^{(t-1)}) = P(A^{(t)}|A^{(t-1)}). We call P(·|·) the transition matrix of the Markov chain. P(·|·) is stationary, that is to say, independent of time or step (t).
The MHS sampler starts with any alignment A^{(0)} by randomly initializing a seed. Notice there exists a one-to-one map: A^{(t)}→ Θ^{(t)}. Then the MHS algorithm iterates the following two steps: (1) Propose a random perturbation of the current state (t), i.e., A^{(t)}→ A', where A^{'} can be viewed as generating from a proposal distribution P(A^{(t)}→ A') = P(A'|A^{(t)}) by a series of independent random sampling. Then, one can toss a Bernoulli coin with probability of α_{ H }coming up heads: if heads show up, accepts the new move: A^{(t+1)}= A'; otherwise stay as it was: A^{(t+1)}= A^{(t)}.
Parallel MHS samplers
The above function treats the temperature (T) as constant and can be easily extended to a temperature variable (T_{ r }) attached to a MCMC chain [20]. Notice that there are no information exchange among these chains. The population-based MCMC (i.e. PMC) simulates samples according to a pooled proposal distribution, that is, v_{ r }comes from the $\overline{p}$(v_{ r }) defined in equation (7) rather than the p(v_{ r }) in equation (4) as information exchange occurs among chains.
Performance evaluation
where |A| is the number of motif sites, |a_{ i }∩ o_{ i }| is the size of overlapping block between the predicted and observed motifs from the sequence i. The predicted alignment may exactly match the observed (nla = 1.0) or have some phase shifts (0.0 <nla < 1.0) or completely misaligned (nla = 0.0).
Motif-finders and their parameter settings
Both IMC and PMC versions run the same MHS motif-finding algorithms, and they were executed in the same parameter settings. The PMC population size was R = 20 as the default in all cases if not specified otherwise. The MEME algorithm (version 3.5.4) was also tested in the JASPAR data sets, and its parameters were set as its defaults except that MEME was notified of running the oops model, and the exact motif lengths were given as its input like other algorithms tested. The Projection algorithm [27] was specifically designed to deal with the planted (l, d)-motif discovery, and it requires the input of the l and d. The Projection program version 0.42 was tested with the same simulation data. Note that other algorithms tested do not require the d input. For example, to run the planted (10,2)-motif case, Projection executes the following command: "findmotif -l 10 -d 2 -M 20 -s seqfile". Projection can be downloaded from the website [35].
The Weeder algorithm [28] is one of the best motif predictors in mammalian promoters [29], however, its performance may be largely dependent on the nucleotide composition of real promoter regions. Several pre-calculated tables of word frequencies for various species (6-mers and 8-mers) come with the Weeder software. To cope with this requirement, new word frequency tables (6-mer and 8-mer words) were generated, for example, in the uniform case, each word was simply given frequency at 1. For other zero-order Markov backgrounds, first compute a w-word with minimum frequency: f_{ min }= (θ_{0k'})^{ w }, here k' is the nucleotide with the least probability and adjust the value to 1.0, and accordingly other words with the frequency f_{ word }are set to ⌊f_{ word }/f_{ min }+ 0.5⌋ (personal communication with Dr. Giulio Pavesi). The Weeder program version 1.31 was run with all the simulated data like this: "weederlauncher.out seqfile bk large A", here bk is the generated word frequency table. The Weeder user's manual can be found on the website [36].
Experimental data sets tested
The first example used is the gold standard CRP data set [5, 8, 9], which consists of 18 experimentally determined binding sequences (w = 22 bp) each being embedded within a 105 bp promoter sequence. The motif width is set as 22 base pairs long according to the verified site length. The local alignment is performed on both forward and reverse DNA strands. The motif discovery problem becomes more difficulty as the sequence length gets longer. To test the length (L) effect, the 18 verified sites were planted into different data sets, each being different sequence length: i.e. L = 100, 200, 400, 800 and 16,000 bp long. The simulated background sequences were generated based on the zero-order Markov model (${\overrightarrow{\mathbf{\theta}}}_{0}$ = [0.302, 0.183, 0.209, 0.306]^{ T }), which was estimated according to the original CRP data set.
Besides the length effect, other major factors such as the degree of motif conservation also have heavily impact on the performance. The data sets of the planted (l, d)-motif problem were generated as follows: given a fixed l-mer short motif (l = 10 to 19 bp), one generates 20 background sequences (N = 20) each with 600 bp long (L = 600), and each background sequence was randomly embedded with a variant of the consensus. A variant motif is a substring derived from the consensus with exactly d-position mutation (d = 2 to 6). The background distribution used in simulation is ${\overrightarrow{\mathbf{\theta}}}_{0}$ = [0.3, 0.2, 0.2, 0.3]^{ T }. Each simulation was repeated 20 times and all simulated data sets were fed to the five algorithms tested. All computing jobs were submitted to the Apple Xserve cluster facility located in Children's Mercy Hospital's main campus. Notice that the same motif sets were also planted in two other backgrounds: [0.25, 0.25, 0.25, 0.25] and [0.2, 0.3, 0.3, 0.2], and similar conclusions could be drawn (data not shown.)
The JASPAR web server [37] provides experimentally verified binding sites data sets of eukaryotic transcription factors (TFs). However, the vast majority of these binding data were obtained by the SELEX method [24] that tested randomly generated double-strand oligonucleotides rather than genomic DNA sequences (personal communication with Dr. Boris Lenhard). Note that these verified motifs are highly conserved, each being short sequence (i.e. 25 bp on average). To challenge the de novo motif algorithms, these verified sites were planted into randomly simulated background sequences, each being 500 bp. A subset of the JASPAR binding sites was used to test the algorithms in the paper, that includes those TFs with more than 25 verified binding sequences and the motif width is at least 9 bp (i.e w ≥ 9), as detailed in Table 2.
Declarations
Acknowledgements
The author would like to express his thanks to Dr. Boris Lenhard for discussing on the JASPAR data as well as to Dr. Giulio Pavesi for providing guidance on generating word frequency tables of simulated sequences.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Authors’ Affiliations
References
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Science. 2007, 447: 799-816.
- MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Computat Biol. 2006, 2: e36-10.1371/journal.pcbi.0020026.View Article
- Ji H, Wong WW: Computational biology: Towards deciphering gene regulatory information in mammalian genomes. Biometrics. 2006, 62: 645-663. 10.1111/j.1541-0420.2006.00625.x.View ArticlePubMed
- Wang L, Jiang T: On the complexity of multiple sequence alignment. J Comput Biol. 1994, 1: 337-348.View ArticlePubMed
- Lawrence CE, Reilly AA: An expectation maximization algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function and Genetics. 1990, 7: 41-51. 10.1002/prot.340070105.View Article
- Dempster AP, Laird AM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Statist Soc Ser B. 1977, 39: 1-38.
- Meng XL, Dyk D: The EM algorithm – an old folk-song sung to a fast new tune. J Roy Statist Soc Ser B. 1997, 59: 511-567. 10.1111/1467-9868.00082.View Article
- Bi CP: SEAM: A stochastic EM-type algorithm for motif-finding in biopolymer sequences. J Bioinform Comput Biol. 2007, 5: 47-77. 10.1142/S0219720007002527.View ArticlePubMed
- Bailey T, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning. 1995, 21: 51-80.
- Liu JS: Monte Carlo Strategies in Scientific Computing. 2002, New York: Springer-Verlag
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller A, Teller H: Equations of state calculations by fast computing machines. J Chem Phys. 1953, 21: 1087-1091. 10.1063/1.1699114.View Article
- Hastings WK: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970, 57: 97-109. 10.1093/biomet/57.1.97.View Article
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214. 10.1126/science.8211139.View ArticlePubMed
- Liu X, Brutlag DL, Liu JS: BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Proc. Pacific Symposium on Biocomputing (PSB). 2001, 127-138.
- Robert CP, Casella G: Monte Carlo Statistical Methods. 1999, New York: SpringerView Article
- van Dyk DA, Meng XL: The art of data augmentation. J Comput Graph Stat. 2001, 10: 1-50. 10.1198/10618600152418584.View Article
- Bi CP: Data augmentation algorithms for detecting conserved domains in protein sequences: A comparative study. J Proteome Res. 2008, 7: 192-201. 10.1021/pr070475q.View ArticlePubMed
- Gilks WR, Richardson S, Spielgelhalter DJ: Markov Chain Monte Carlo in Practice. 1996, New York: Chapman and Hall
- Geyer CJ: Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proc. of the 23rd Symposium on the Interface. 1991, 156-163.
- Liang F, Wong WH: Evolutionary Monte Carlo: Applications to C_{ p }model sampling and change point problem. Statistica Sinica. 2000, 10: 317-342.
- Laskey KB, Myers J: Population Markov chain Monte Carlo. Machine Learning. 2003, 50: 175-196. 10.1023/A:1020206129842.View Article
- Kullback S, Leibler RA: On information and sufficiency. Ann Math Statist. 1951, 22: 79-86. 10.1214/aoms/1177729694.View Article
- Berg O, von Hippel PH: Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology. 1987, 193: 723-750. 10.1016/0022-2836(87)90354-8.View ArticlePubMed
- Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.View ArticlePubMed
- Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. Theoretical informatics, 1380. Edited by: Lucchesi C, Moura A. 1998, 111-127.
- Pevzner P, Sze SH: Combinatorial approaches to finding subtle signals in DNA sequences. Proc. First ISMB Conference. 2000, 1: 269-278.
- Buhler J, Tompa M: Finding motifs using random projection. Journal of Computational Biology. 2002, 9: 225-242. 10.1089/10665270252935430.View ArticlePubMed
- Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001, 17 (suppl 1): S207-S214.View ArticlePubMed
- Tompa M, Li N, Bailey TL, Church GM, De Moor B: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005, 23: 137-144. 10.1038/nbt1053.View ArticlePubMed
- Holland JH: Adaptation in Natural and Artificial Systems. 1975, Ann Arbor: Michigan: The University of Michigan Press
- Bi CP: A genetic-based EM motif-finding algorithm for biological sequence analysis. Proceeding of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, IEEE. 2007, 275-282.
- Bi CP: Evolutionary Metropolis sampling in sequence alignment space. Proc. 2008 IEEE Congress on Evolutionary Computation (CEC), IEEE. 2008, 189-194.
- Frith MC, Hansen U, Spouge JL, Weng Z: Finding functional sequence elements by multiple local alignment. Nucleic Acids Res. 2004, 32: 189-200. 10.1093/nar/gkh169.PubMed CentralView ArticlePubMed
- MacKay DJC: Information Theory, Inference, and Learning Algorithms. 2003, New York: Cambridge University Press
- The Projection Software. [http://www.cse.wustl.edu/~jbuhler/pgt/]
- The Weeder Software. [http://159.149.109.9/modtools/]
- Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004, 32: D91-D94. 10.1093/nar/gkh012.PubMed CentralView ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.