 Methodology article
 Open access
Discovering multiple realistic TFBS motifs based on a generalized model
BMC Bioinformatics volume 10, Article number: 321 (2009)
Abstract
Background
Identification of transcription factor binding sites (TFBSs) is a central problem in bioinformatics research on gene regulation. De novo motif discovery serves as a promising way to predict and better understand TFBSs for biological verification. Real TFBSs of a motif may vary in their widths and their conservation degrees within a certain range. Deciding on a single motif width with existing models may be biased and misleading. Additionally, multiple, possibly overlapping, candidate motifs are desired and necessary for biological verification in practice. However, current techniques either prohibit overlapping TFBSs or lack explicit control of different motifs.
Results
We propose a new generalized model to tackle the motif widths by considering and evaluating a width range of interest simultaneously, which should better address the width uncertainty. Moreover, a metaconvergence framework for genetic algorithms (GAs) is proposed to provide multiple overlapping optimal motifs simultaneously in an effective and flexible way. Users can easily specify the difference amongst expected motif kinds via a similarity test. Incorporating Genetic Algorithm with Local Filtering (GALF) for searching, the new GALFG (G for generalized) algorithm is proposed based on the generalized model and the metaconvergence framework.
Conclusion
GALFG was tested extensively on over 970 synthetic, real and benchmark datasets, and is usually better than the state-of-the-art methods. The range model shows an increase in sensitivity compared with the single-width ones, while providing competitive precision on the E. coli benchmark. Effectiveness can be maintained even using a very small population, exhibiting very competitive efficiency. In discovering multiple overlapping motifs in a real liver-specific dataset, GALFG outperforms MEME by up to 73% in overall F-scores. GALFG also helps to discover an additional motif which has probably not been annotated in the dataset. http://www.cse.cuhk.edu.hk/%7Etmchan/GALFG/
Background
In this section, we introduce motif discovery, summarize existing methods and briefly mention methods beyond the scope of this paper. We then give our motivations and present the paper layout.
Motif Discovery
Transcription Factor Binding Sites (TFBSs) are small nucleotide fragments (usually ≤ 30 bp) in the cis-regulatory/intergenic regions in DNA sequences. Regulatory proteins, namely the Transcription Factors (TFs), bind in a sequence-specific manner to TFBSs to activate or suppress gene transcription (gene expression). Therefore, TFBSs are a critical component in gene regulation, and identification of TFBSs is a central problem for understanding gene regulation in molecular biology.
The DNA binding domain(s) of a TF can recognize and bind to a collection of similar TFBSs, from which a conserved pattern called a motif can be obtained. Based on this phenomenon, de novo motif discovery using computational methods has been proposed to identify and predict TFBSs and their corresponding motifs. Motif discovery provides significant insights into the mechanisms of gene regulation. By pre-screening and predicting unknown TFBS motifs, it serves as an attractive alternative to expensive and laborious biological experiments such as DNA footprinting [1] and gel electrophoresis [2]. The recent technology of Chromatin immunoprecipitation (ChIP) [3, 4] measures the binding of a particular TF to DNA using microarray technology at low resolution in a high-throughput manner, and produces more reliable input data of co-regulated genes for motif discovery [5].
Existing Methods
Categorization
Because the conservation of motifs is often degenerate due to TFBS mutations, the search is difficult (NP-hard [6]). Numerous algorithms have been proposed for de novo motif discovery over the last decades. There are two major representations for TFBS motifs (conserved patterns): (i) Consensus Representation and (ii) Matrix Representation; and there are two main searching paradigms: (a) Enumeration Methods and (b) Stochastic Searching [4]. They are briefly described as follows:
(i) Consensus Representation is based on discrete strings. A simple model is to minimize the mismatches between the consensus and the TFBS instances [7–10].
(ii) Matrix Representation is usually a Position Frequency Matrix (PFM; see Table 1), or a Position Weight Matrix (PWM), to show the quantitative frequencies or weights of nucleotides in the motif. Representative evaluations for a motif matrix include Information Content (IC) [11], maximum a posteriori (MAP) [12] and the Bayesian models [13] (see the probabilistic models in the Methods section).
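To make the matrix representation concrete, the following minimal Python sketch builds a PFM from a few aligned sites and scores it with one common form of information content (relative entropy against a background distribution). The example sites, the uniform background and the function names are illustrative assumptions, not taken from the paper or from any of the cited tools.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequency_matrix(sites):
    """Count each nucleotide in every column of equal-width aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    pfm = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        pfm.append({b: counts.get(b, 0) for b in ALPHABET})
    return pfm

def information_content(pfm, background=None):
    """Sum over columns of sum_b f_b * log2(f_b / q_b), in bits.

    Assumption: a uniform background q_b = 0.25 unless one is supplied.
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    total = 0.0
    for column in pfm:
        n = sum(column.values())
        for b in ALPHABET:
            f = column[b] / n
            if f > 0:  # 0 * log(0) is treated as 0
                total += f * math.log2(f / background[b])
    return total

# Hypothetical aligned sites, for illustration only.
sites = ["TGTGA", "TGTGA", "TGTCA", "TGAGA"]
pfm = position_frequency_matrix(sites)
ic = information_content(pfm)
```

Fully conserved columns contribute 2 bits each under the uniform background, while the two partially conserved columns contribute less, so the score reflects how degenerate the motif is.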
The searching techniques with respect to the two representations, are discussed below.
(a) Enumeration Methods are usually applied [7, 8, 14–16] to the consensus representation, but they do not scale up for long widths. However, they are useful to provide candidates for further searching and evaluations [5, 17, 18]. Weeder [15, 16] is one well-known representative in this category.
(b) Stochastic Searching is usually applied to align TFBSs and obtain the motif matrix for the matrix representation. Typical techniques can be categorized into local searching [5, 12] and global searching, where the latter can be classified into (S) single-point and (M) multi-point or group-based searching. Global searching is more likely to find the global optima than local searching. Gibbs sampling is popular in motif discovery tools (e.g. BioProspector [19], AlignACE [20] and MotifSampler [21]), but its single-point nature requires numerous iterations to converge to the global optima; otherwise the performance may be affected significantly. Alternatively, a multi-point global searching approach, the genetic algorithm [22, 23], has shown promising results in motif discovery [9, 10, 24–28]. Genetic algorithms have great potential to be applied to more sophisticated models and to provide multiple optimal motifs [26].
Table 2 summarizes the representations, the associated models and the searching techniques employed by the motif discovery methods. The table serves to illuminate the representative methods in each category including those we have compared in our experiments.
Methods beyond
Methods out of the scope of this paper but worth introducing are briefly mentioned as follows: Ensembles of multiple motif discovery programs have been recently shown to improve their performance [4, 29, 30]. However, modelling TFBS motifs is critically beneficial for better understanding and predicting novel motifs, and provides essential performance improvement for ensembles. As a result, we will focus on individual motif discovery methods in this paper.
Incorporating additional information sources [31, 32] is another trend to improve the motif prediction accuracy. While extra requirements are needed for their success, the sequence-based motif discovery problem remains challenging [33–35] and calls for our serious attention because generalization and improvement on the sequence-based methods will without doubt help the integrated approaches.
Motivations
Challenges
Great challenges still exist for de novo motif discovery algorithms. They mainly include (i) NP-hardness, (ii) width uncertainty and (iii) multiple (overlapping) motifs, of which the latter two demand more attention.

(i) NP-hardness: The most well-known challenge is NP-hardness [6] due to the unknown conservation degree, for which numerous approaches have been proposed to achieve optimality under certain models, as surveyed in the previous subsections.

(ii) Width uncertainty: An often overlooked challenge in reallife problems is the uncertainty in the motif widths.
In real datasets, it is not easy to determine a single motif width (1) experimentally or (2) biologically. (1) Experimental: Annotated TFBSs are often affected by limited experimental resolutions, and it is thus difficult to choose any single width to fit the TFBSs before a motif can be discovered. (2) Biological: The most conserved binding contacts are between the short binding core of the target TFBS and the binding domain of a TF. The binding core may be fixed-width (< 6 bp). However, the short binding core may not provide enough binding affinity for its corresponding TF to recognize. Instead, a TF contains flexible segments of polypeptide chain, and these flexible arms work together with the DNA binding domain of the TF to add additional affinity [36]. This complication makes it hard to fix the effective width at a single value. For example, the TFBS widths vary in the familial binding cases of the Zn2Cys6 motif [37].
Existing methods usually assume a known and fixed TFBS motif width or model a distribution around an expected width when there are uncertainties involved. The conservation contributed by different motif parts at varying widths may be underutilized in a single-width approach, and the so-called expected value may be misleading and biased. Using statistical significance to rank different widths, e.g. the E-value [38], is computationally intensive and still only picks a single-value width at the end. In the illustrative example of a real motif with 19 LexA binding sites in Figure 1, if a single width is chosen, it may be 5 if only the stringent core part (3–7) is chosen; or it may be 12 if considering all columns (1–12). In the former case, the short motif may not be ranked higher than those non-TFBS frequent patterns happening by chance. In the latter case, since both highly and weakly conserved columns are evaluated equally, it may include additional false positives. By contrast, modelling those uncertain bases with a range concept can better capture the different resolutions for assessing the motif signals, and thus potentially better describe the real TFBS motif.

(iii) Multiple (overlapping) motifs: Another challenge which is not well handled is the overlapping nature of TFBSs for different motifs because competitive binding exists amongst different TFs in the same regulatory region. Current techniques used are mainly masking/erasing and implicit maintaining.

Masking/erasing: These techniques can only discover one motif in a single execution, and thus several executions are required for outputting multiple motifs. Masking/erasing techniques also prohibit the subsequent discovery of TFBSs that overlap with previously masked ones. However, in real cases, different kinds of TFBSs may overlap with each other due to competitive binding of TFs.

Implicit maintaining: There are existing methods to sample different motifs simultaneously but with little or no mechanism to explicitly distinguish different solutions or flexibly control the overlapping degrees of TFBSs. As a result, highly redundant motifs may be produced. If the number of output solutions is limited, redundant top-scored variant motifs will dominate and less-fit but different solutions will be missed. If non-redundant and different solutions need to be provided, a large output number has to be set and post-processing is required [39] with additional costs.
Therefore, it is desirable to discover multiple motifs more effectively and efficiently with certain flexible and explicit overlapping control.
Paper Layout
To overcome these drawbacks of the existing de novo motif discovery algorithms, we propose the generalized model, which presents a new angle on handling variable motif widths to better reflect the biological uncertainty. Then we present the metaconvergence framework to support multiple optimal solutions with flexible overlapping control using similarity tests. Based on the generalized model and the framework, a new algorithm called GALFG is developed.
The rest of the paper is arranged as follows. The generalized model, the metaconvergence framework and the new algorithm GALFG are first given. Extensive experimental results are reported, including single/multiple motif discovery problems with fixed-width/variable-width inputs. A large number of both synthetic and real benchmark datasets are used in the experiments. After a substantial analysis of the results, discussion and concluding remarks are made. The detailed implementations of our algorithm are given in the last Methods section.
Results
In this section, we present the generalized model and the metaconvergence framework in detail, and propose the resulting GALFG algorithm.
The Generalized Motif Model
To tackle the challenge raised from the uncertainty of motif widths, we propose a new generalized model by considering a width range of interest simultaneously. A range is more practical and suitable for real biological cases for two reasons:

First, it is easier to define a rough range than a particular width. All widths within the range contribute accordingly to the motif solution, and thus the result is less sensitive than one based on a wrongly chosen single width.

Second, TFBSs of a motif in reality vary in their widths and exhibit certain higher degrees of conservation compared to the non-site fragments (the background). A range model can more appropriately capture the different conservation degrees than any single width.
Assume the width input is $R = [w_{min}, w_{max}]$ with $|R| = w_{max} - w_{min} + 1$, and a candidate solution, i.e. a set of TFBSs forming a motif, is defined as $A$, with the TFBS positions denoted by $\{p_i\}$. The formal problem notation and formulations are shown in the Methods section: The Proposed Model and Evaluation. The generalized model evaluates $A$ based on the whole range $R$. An illustrative example is shown in Figure 1. The model or scoring function (illustrated by the heights of the colored nucleotides in the figure) for a fixed width $w_i$ is well established, e.g. a probabilistic model, denoted as $P(A(w_i) \mid w_i)$, where $A(w_i)$ is the part of the complete candidate solution $A$ with respect to $w_i$. The generalized model can then be formulated by summing them together as
$$P(A \mid R) = \sum_{w_i \in R} P(A(w_i) \mid w_i)\, P(w_i)$$
For the most common case when there is no prior knowledge on which width is more likely to happen, $w_i$ can take a uniform distribution, i.e. $P(w_i) = 1/|R|$ for each $w_i$. On the other hand, any prior distribution such as the Poisson one used in Bayesian models [40] can also be adopted. For each $w_i$ component where $w_{min} \le w_i < w_{max}$, there is more than one choice, and we only pick the component $A(w_i)$ given by $\arg\max P(A(w_i) \mid w_i)$ (caps in Figure 1). The additional computational cost compared to a fixed-width model is $O(|R|^2)$, which is feasible since motif ranges (width variations) are usually short (≤ 10 bp). The major difference of the generalized model from the previous ones is that all the widths from the input range $R$ contribute to the solution score/fitness in the model, rather than choosing a certain single width by $\arg\max_{w_i} P(A(w_i) \mid w_i)\, P(w_i)$, which has the risk of bias towards a certain single value. If only one width is input, the generalized model reduces to one of the existing fixed-width models.
Intuitively, the generalized model is a weighted sum of the probability of different widths from the range $R$. It is compatible with the existing probability models and is even applicable to non-probability models, as long as there is a consistent expression of $P(A(w_i))$; here it refers to an evaluation function in general. We employ the fixed-width probabilistic model in our generalized model, which will be discussed in detail in the Methods section.
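The weighted sum over widths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summed per-column information content stands in for the fixed-width score (the paper's own probabilistic model is given in its Methods section), each width's component is taken to be the best contiguous sub-window of the sites aligned at the maximum width, and all function names are hypothetical.

```python
import math
from collections import Counter

def column_scores(sites, alphabet="ACGT"):
    """Per-column information content (bits) against a uniform background.

    Stand-in for a fixed-width scoring model; not the paper's exact model.
    """
    q = 1.0 / len(alphabet)
    scores = []
    for col in zip(*sites):
        counts = Counter(col)
        n = len(col)
        scores.append(sum((c / n) * math.log2((c / n) / q) for c in counts.values()))
    return scores

def best_window_score(scores, width):
    """Best total score over all contiguous windows of the given width."""
    return max(sum(scores[s:s + width]) for s in range(len(scores) - width + 1))

def generalized_score(sites, w_min, w_max, prior=None):
    """Weighted sum over widths in [w_min, w_max] of the best sub-window score."""
    if any(len(s) != w_max for s in sites):
        raise ValueError("sites must be aligned at the maximum width w_max")
    widths = range(w_min, w_max + 1)
    if prior is None:
        # Uniform prior: every width in the range gets weight 1/|R|.
        prior = {w: 1.0 / len(widths) for w in widths}
    scores = column_scores(sites)
    return sum(prior[w] * best_window_score(scores, w) for w in widths)

# Hypothetical sites: a conserved 2 bp core flanked by variable columns.
sites = ["ATGC", "CTGA", "GTGT"]
range_score = generalized_score(sites, 2, 4)
single_score = generalized_score(sites, 4, 4)
```

With a single width as input (`w_min == w_max`), the sketch reduces to the plain fixed-width score, mirroring the reduction described in the text.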
The Metaconvergence Framework
For practitioners in molecular biology and medical research, it is desirable that multiple optimal candidate motifs can be provided concurrently for biological verification. Due to the limitations of masking/erasing and implicit maintaining, it is desired to explicitly maintain different solutions with flexible (typically overlapping) control efficiently. To address these issues, we propose a metaconvergence framework employing Genetic Algorithm (GA) with the similarity test as the overlapping control.
(i) The similarity test is first introduced to provide flexible overlapping control over different motifs. Positional information is considered since the generalized model involves a width range $R$ of positions. In particular, to compare two candidate solutions/individuals $A_a$ and $A_b$, the test calculates the relaxed Hamming distance $h$ between each pair of their aligned TFBS positions $p_i(A_a)$ and $p_i(A_b)$ in sequence $i$,
$$h(p_i(A_a), p_i(A_b)) = \begin{cases} 0, & |p_i(A_a) - p_i(A_b)| \le tol \\ 1, & \text{otherwise} \end{cases}$$
where $tol$ is the shift tolerance. The similarity test is passed if
$$dr = \frac{1}{m} \sum_{i=1}^{m} h(p_i(A_a), p_i(A_b)) < st$$
where $dr$ is defined as the difference ratio, $m$ indicates the number of sequences, and $st$ is the similarity threshold. When $dr < st$, $A_a$ and $A_b$ are considered to be similar, i.e. to belong to the same motif kind. The intuitive settings of $tol$ and $st$ for different purposes, and how the test is applied, are detailed in Methods: Metaconvergence Framework Details.
The proposed similarity test allows users to control the differences between the expected motifs in an easy and intuitive way. By contrast, other possible comparisons based on the PFM involve complicated cutoffs that are not trivial to specify and are counterintuitive for common users.
(ii) Metaconvergence, with the similarity test, monitors the convergence of different optimal solutions and adaptively controls the numbers of GA runs rather than using a relatively large fixed number of GA runs in previous works [27, 28]. Furthermore, only a small number of candidates are subject to the similarity test to compete for the multiple optimal motifs, compared with the other method [26] that compares the whole population of solutions with nontrivial overhead. Therefore, the efficiency can be significantly improved. More details can be found in Methods: Metaconvergence Framework Details.
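The similarity test in (i) can be sketched as follows. This is a minimal Python illustration assuming one predicted site position per sequence in each candidate solution, and assuming that a shift of exactly tol still counts as a match; the function names and example positions are hypothetical.

```python
def relaxed_hamming(p_a, p_b, tol):
    """Return 0 if two site positions are within tol of each other, else 1."""
    return 0 if abs(p_a - p_b) <= tol else 1

def difference_ratio(positions_a, positions_b, tol):
    """Fraction of sequences whose two predicted sites differ by more than tol."""
    if len(positions_a) != len(positions_b):
        raise ValueError("both solutions must cover the same sequences")
    m = len(positions_a)
    mismatches = sum(relaxed_hamming(a, b, tol)
                     for a, b in zip(positions_a, positions_b))
    return mismatches / m

def is_similar(positions_a, positions_b, tol, st):
    """Two solutions belong to the same motif kind when dr < st."""
    return difference_ratio(positions_a, positions_b, tol) < st

# Hypothetical site positions in four sequences for two candidate solutions.
a = [10, 25, 40, 7]
b = [11, 25, 60, 9]
dr = difference_ratio(a, b, tol=2)
```

Here only the third sequence differs by more than the tolerance, so the two candidates are treated as the same motif kind for any threshold st above 0.25. Raising tol or st makes the test more permissive, which is how a user would allow more overlap between reported motifs.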
GALFG
Incorporating Genetic Algorithm with Local Filtering (GALF) with the generalized model and the metaconvergence framework, GALFG (G for generalized) is proposed to discover multiple optimal motifs with flexible overlapping control using the similarity test. To fit into the generalized model with range input, the operators in GALF are extended accordingly and detailed in the Methods section: GALFG implementations.
In the following section, we will report the results of GALFG tested on both synthetic and real benchmark datasets for various cases, namely fixed-width and variable-width inputs, for single motif [with single (K = 1) or multiple outputs (K > 1) for single motif] and multiple motifs (K > 1) discoveries.
Experiments
In this section, the summary of the experiments is introduced, and then the experimental results are reported and analyzed in corresponding categories. Finally, experiments concerning the efficiency of GALFG are presented.
Experiment Summary
First of all, the evaluation measurements are introduced here. For most experiments except the benchmark ones [34, 35], the measurements employed are the site level (prefix s) ones: positive predictive value/precision sPPV, sensitivity/recall sSn and F-score sF with shift restrictions, similar to [27, 28]. The advantage is that they reflect both site level and part of the nucleotide level performances concisely. For the benchmark experiments, we have to follow their standard measurements which employ looser site level measurements but introduce additional nucleotide level (prefix n) PPV (nPPV) and sensitivity (nSn), as well as performance coefficient (PC) [14, 33–35] and correlation coefficient (CC) [33, 35] on both levels [see Additional file 1 for details of evaluation measurements for different experiments].
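As a rough illustration of the site level measurements, the Python sketch below computes sPPV, sSn and sF from predicted and annotated site starts. The matching rule (a prediction counts as correct if it lies in the same sequence and within max_shift positions of a not yet matched annotated site) is an assumption made for this example; the exact shift restrictions used in the paper are described in its Additional file 1. The F-score is taken as the usual harmonic mean of precision and recall.

```python
def site_level_scores(predicted, annotated, max_shift):
    """Return (sPPV, sSn, sF) for lists of (sequence_id, start) pairs."""
    unmatched = list(annotated)
    tp = 0
    for seq_id, start in predicted:
        for site in unmatched:
            if site[0] == seq_id and abs(site[1] - start) <= max_shift:
                unmatched.remove(site)  # each annotated site matches at most once
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(annotated) - tp
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sn = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * ppv * sn / (ppv + sn) if ppv + sn else 0.0
    return ppv, sn, f

# Hypothetical predictions and annotations, for illustration only.
predicted = [(0, 10), (1, 20), (2, 5), (2, 50)]
annotated = [(0, 11), (1, 30), (2, 5)]
s_ppv, s_sn, s_f = site_level_scores(predicted, annotated, max_shift=3)
```

In this example two of the four predictions hit annotated sites, giving a precision of 0.5, a recall of 2/3 and an F-score of 4/7.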
(i) Single motif discovery experiments (K = 1) were first performed to test the generalized model. GALFG was verified on the 800 synthetic datasets from [28], and compared with other state-of-the-art algorithms with fixed-width inputs as a special/degenerate case. GALFG was then further tested on the 8 real datasets employed in GAME [27] with both fixed-width (the assumed true widths from [27]) inputs and range (variable widths) inputs relatively close to the true widths. The challenges raised by the eukaryotic benchmark [33, 35] are then addressed, where there is no dataset-specific prior knowledge on the motif widths and only single motif outputs (K = 1) are compared.
(ii) Multiple motifs experiments (K > 1) were then performed for two scenarios. In the first scenario, since multiple candidates are desirable for biological testing even for single motif discovery [34], GALFG was tested and compared with the state-of-the-art algorithms on the 62 E. coli benchmark datasets [34], without dataset-specific prior knowledge on the motif widths. In the second scenario, since it is also desirable to discover different real motifs simultaneously, GALFG, GAME and MEME were tested on the real liver-specific dataset with multiple (overlapping) motifs. By investigating the exceptional case among GAME's 8 datasets using GALFG with multiple motifs discovery, we identified a putative motif not previously annotated in the dataset.
Single Fixed-width Motif Discovery on Synthetic Data
GALFG was first verified in the special cases of fixed-width single motif discovery (K = 1) on the 800 synthetic datasets used to test GALFP in [28], which had performed best for these fixed-width cases. We compared GALFG with GALFP, GAME, MEME, BioProspector (BioPro.), and BioOptimizers based on MEME and BioProspector. Weeder was not compared because it cannot be run on the long-width (16) datasets due to its width limit of 12. Details on generating the datasets were provided in [28] [see Additional file 1]. The average F-scores sF on the site level for each scenario are presented in Table 3, with the best results shown in bold. The full table with precisions (sPPV) and recalls (sSn), including BioOptimizer results (almost identical to MEME and BioProspector), is shown in [Additional file 1]. GALFG and GALFP are in general the best among all scenarios, especially in the difficult scenarios (for example, short widths and low conservation). GALFG is slightly better than GALFP in the last 4 scenarios. To compare GALFG with another close competitor, MEME, a t-test was employed [see Additional file 1]. GALFG is shown to be better than MEME within the significance level 0.05 in 4 out of the 6 scenarios with better sF, while MEME shows no convincing significance of being better in the other 2 scenarios.
We do not expect great differences between GALFG and other algorithms here, because in the fixed-width cases the generalized model is similar to other models in representative power. The experiments demonstrate that the search capability of GALFG is comparable to or better than that of the previous best, GALFP, on the synthetic datasets. The main reason is that they use similar effective searching techniques based on local filtering [28]. The results from the synthetic data can be interpreted intuitively with respect to searching difficulties, because their respective conservation degrees are explicitly generated. For variable-width (range) cases, the complicated nature of different conservation degrees of TFBSs is not easy to model or evaluate with synthetic data; hence it is more appropriate to test different methods with substantial real datasets, and the experimental results are presented in the following subsections.
Single Motif Discovery on Real Datasets
In this subsection, GALFG was evaluated and compared with other methods on the 8 real datasets used to test GAME [27], for both fixed and variable widths cases in single motif discovery (K = 1).
Information on the 8 datasets is shown in Table 4. The CRP dataset contains the binding sites for cyclic AMP receptor, and has been widely tested since [41] was published. The ERE dataset contains the binding sites for the ligand-activated enhancer protein estrogen receptor (ER) [42]. The E2F datasets correspond to TFBSs of the E2F family from mammalian sequences [43]. CREB, MEF2, MyoD, SRF and TBP are chosen from the ABS database of annotated regulatory binding sites [44]. More details of the datasets can be found in [27].
The comparison studies for fixed and variable widths cases are given as follows:
(i) Fixed-width single motif discovery (K = 1) experiments were performed, where GALFP was previously tested and compared with GAME in a fixed-width manner. GALFG shows comparable overall F-scores sF (0.81) to the best average results from GALFP (0.82) and is better than GAME (0.61) by 33% on average from 20 runs. While GALFP shows significantly smaller variations than GAME in the performance [28], GALFG shows even more stable and robust performance than GALFP, which is discussed further in the Efficiency Experiments.
We have also tried Weeder [15, 16] on part of the datasets because Weeder can only handle widths 6, 8, 10 and 12. Weeder is optimized for several width range modes [16] rather than fixed widths and will be formally compared in the following range experiments. For the fixed-width experiments, only CREB, MyoD, SRF and TBP were tested. The averaged sPPV, sSn and sF of Weeder for the 4 datasets are 0.43, 0.63 and 0.51, respectively. On the other hand, GALFG is better where the corresponding values are 0.79, 0.83 and 0.81.
Similar to the conclusion on fixed-width synthetic experiments, GALFG demonstrates competitive searching capacity on the fixed-width real data experiments, while GALFG makes a looser assumption.
(ii) Variable-width (range) experiments (K = 1) were performed, where GALFG was compared with GAME, MEME, Weeder, and FlexModule from CisGenome [45] on the previous 8 real datasets. The additional FlexModule is a Gibbs sampling [46] motif discovery module implemented in the recent integrated system CisGenome [45] for analyzing transcriptional regulation.
For each dataset, 3 different width ranges were input for testing, where
$$R_i = [w_{min}^{(i)}, w_{max}^{(i)}] = [\max(5, w_i - 3),\ w_i + 3], \quad i = 1, 2, 3$$
Each range represented variations of ± 3 bp on the width $w_i$, while the lower bound for $w_{min}^{(i)}$ was set to 5 because it is rare for a motif width to be smaller than 5. With increasing $i$, $w_i = w_{true} + (i - 1)$ reflects a larger shift away from the biological truth $w_{true}$ [see Additional file 1 for the running parameters]. The average results of executing each program 20 times are shown in Tables 5 and 6. Weeder is deterministic, and MEME performs constantly in different runs for the same dataset (in contrast to different datasets in Table 3), so there are no standard deviations shown for them.
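The three input ranges described above (± 3 bp around a width that starts at the true width and shifts by one per range, with the lower end never going below 5) can be generated with a short Python helper. The function name, its parameters and the example true widths are hypothetical and chosen only for illustration; they are not the widths of the 8 datasets.

```python
def width_ranges(w_true, n_ranges=3, spread=3, min_width=5):
    """Return [(w_min, w_max), ...] for centre widths w_true, w_true + 1, ..."""
    ranges = []
    for i in range(1, n_ranges + 1):
        w_i = w_true + (i - 1)            # centre width drifts away from the truth
        w_min = max(min_width, w_i - spread)  # never allow widths below min_width
        ranges.append((w_min, w_i + spread))
    return ranges

# Hypothetical true widths.
long_motif = width_ranges(16)
short_motif = width_ranges(7)
```

For a true width of 16 the helper yields (13, 19), (14, 20) and (15, 21); for a short motif of width 7 the lower bound is clipped, giving (5, 10), (5, 11) and (6, 12).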
In most cases (19/24) GALFG achieves the best F-scores sF on the site level, as well as the best sPPV, sSn and sF averaged over all the cases. The overall F-score of GALFG is 19% better than GAME, 14% better than MEME, 85% better than Weeder, and 21% better than FlexModule. The standard deviations of GALFG are also lower than those of GAME and FlexModule in most cases. The t-test on sF shows that GALFG is better than MEME in 20 cases within significance level 0.01, and in 1 case within significance level 0.02, while MEME is better in 3 cases within level 0.01. It should be noted that GALFG significantly outperforms the other algorithms in sSn, probably because the generalized model not only predicts motifs as precisely as the other models, but also accepts more correct TFBSs based on a wider range than single widths.
The above experiments demonstrate that with a range relatively close to the true widths, GALFG with the generalized model shows favorable performance even compared with the results based on E-values. In fact, the performance with the input width ranges close to the true widths is comparable to that with fixed-width inputs, except for the MyoD dataset. The exceptional case of MyoD will be investigated separately and shown to contain multiple motifs later.
To summarize, on the 8 real datasets for single motif discovery, GALFG demonstrates competitive performance in fixed-width experiments, and provides obvious improvement over other methods in variable-width (range) experiments. For the cases without much prior information on the exact widths, experiments will be described in the next subsections.
Single Motif Discovery Challenges on Eukaryotic Benchmarks
The recent well-known eukaryotic benchmark by Tompa et al. [33] imposes great challenges on motif discovery algorithms. The problems of the Tompa et al. benchmark include insufficient signals (few but long sequences) and inappropriate evaluation methods (unclear expert-tuned parameters for running and single top-scored motif outputs for comparisons) [see Additional file 1 for a more detailed discussion]. It has been indicated that many motifs in the Tompa et al. benchmark cannot be discriminated from the remaining sequence by common motif models [35]. An improved benchmark [35] has thus been proposed as more suitable for evaluating motif discovery algorithms. The algorithm benchmark suite [35] extracts motifs from TRANSFAC and includes representative eukaryotic species. There are 50 datasets with backgrounds generated by Markov models and 50 with real cis-regulatory region backgrounds. The widths are not given in the benchmark and thus a uniform width range input has to be set for all experiments. The additional evaluation measure corresponding to this benchmark is the nucleotide level correlation coefficient (nCC) [33–35].
GALFG was tested on the corresponding algorithm benchmark suite [35] and compared with MEME and Weeder, the two most widely used algorithms [see Additional file 1 for the running parameters of GALFG]. The average results of nSn, nPPV, nPC and nCC are shown in Table 7. For Markov backgrounds, GALFG is 31% better than MEME and 214% better than Weeder in nPC, and 42% better than MEME and 165% better than Weeder in nCC. Similar conclusions can be drawn for the real backgrounds. It should be noted that while MEME and Weeder perform poorly in one of the two backgrounds, GALFG maintains competitive performance in both.
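For reference, the four nucleotide level measures can be computed from the counts of true/false positive and negative nucleotides as in the Python sketch below. It follows the definitions commonly used with the Tompa et al. style benchmarks (performance coefficient TP/(TP + FN + FP) and a Pearson-type correlation coefficient); the counts in the example are made up for illustration and the function name is hypothetical.

```python
import math

def nucleotide_level_scores(tp, fp, fn, tn):
    """Return nSn, nPPV, nPC and nCC from nucleotide-level confusion counts."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    pc = tp / (tp + fn + fp) if tp + fn + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return {"nSn": sn, "nPPV": ppv, "nPC": pc, "nCC": cc}

# Hypothetical counts: 1000 nucleotides, 50 of them in annotated sites.
scores = nucleotide_level_scores(tp=30, fp=10, fn=20, tn=940)
```

Unlike nSn and nPPV taken separately, nPC and nCC penalize both missed and spurious nucleotides in a single number, which is why they are convenient for ranking algorithms.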
In the improved eukaryotic benchmark [35], which is considered more suitable to test motif discovery algorithms, GALFG shows superior performance to the widely used MEME and Weeder, while only top-scored motifs are compared. However, as stated in [33], it is more meaningful in practice to provide multiple motifs for testing [5]; the corresponding experiments are reported next.
Multiple Motifs Outputs on the E. coli Benchmark
In this subsection, GALFG was tested on the E. coli benchmark to address a more realistic scenario, where multiple candidate motifs are desired for identifying the true TFBSs in biological research. The E. coli benchmark ECRDB62A [34] has 62 datasets. On average, the sequences are about 300 bp long (varying from 86 to 676 bp), with 12 sequences per dataset and around 1.85 sites per sequence. The average site width is 22.83 with standard deviation > 10, which indicates very diversified widths.
Specifically, a minimal parameter-tuning policy was employed, as if the programs were run by a common user with minimum prior knowledge in practice. Results of AlignACE [20], BioProspector [19], MDScan [5], MEME [12], MotifSampler [21] and Weeder [16] were obtained for comparison. A uniform width of 15 was input for the fixed-width algorithms, namely AlignACE, BioProspector, MDScan and MotifSampler. On the other hand, MEME was run with the default setting for widths and the optimal one was chosen automatically within it. Weeder was run with the large width mode. For GALFG, we ran it on the benchmark datasets with both the uniform fixed width 15 and the widest range accepted by the program, $R = [10, 20]$, spanning 10 bp around the central width 15. For all algorithms, 5 motifs were output for detailed comparisons.
We employ the evaluation criteria from [34], namely precision PPV, sensitivity Sn, performance coefficient PC and Fscore F, on both nucleotide (prefix n) and site (prefix s) levels [see Additional file 1] (We use the standard notation of PPV instead of the nonstandard specificity definition in their work). In the comparisons shown in Table 8, the accuracy of the best prediction out of the top 5 scoring predictions is evaluated with respect to nPC. With both fixedwidth and range inputs, GALFG outperforms the other algorithms in all evaluation criteria. For example, GALFG (15) outperforms the best among the other algorithms by 49% in nPC, 29% in nF, 28% in sPC and 18% in sF. GALFG (rg), with width range input [10,20], outperforms the other best algorithms by 46% in nPC, 29% in nF, 25% in sPC and 24% in sF. By comparing the two different input settings for GALFG we can see that with little sacrifice in other measures (< 0.01 on the nucleotide level and < 0.02 on the site level), the generalized model based on the range (rg) demonstrates improved site level sensitivity, in particular 15% (or 0.082) in sSn compared with GALFG (15) and 34% (or 0.172) compared with the best among other algorithms.
Besides the best predictions out of the 5 outputs, the top-scored motifs and the remaining outputs were also analyzed individually for the different algorithms. The statistics in terms of nPC, which reflects both nPPV and nSn, are shown in Table 9. As indicated in [34], the top-scored predictions are not necessarily the best predictions, implying that outputting only a single prediction may not be a good choice in practice or for comparison studies. Nevertheless, the top-scored predictions from GALFG are significantly better than the best among the other algorithms, by 30% (w15) and 36% (rg) respectively. We can also see that, for GALFG, the generalized model based on the range performs better than the fixed-width one with respect to both the top-scored and the mean predictions. This implies that the generalized model using ranges is useful in practice, where prior width information is usually weak. On this benchmark for multiple motif outputs, GALFG outperforms other state-of-the-art algorithms considerably. The generalized model exhibits improved sensitivity while maintaining competitive precision, and thus achieves better overall performance on the site level.
Multiple Motif Types in Real Datasets
In gene regulation, TFBSs of different kinds of motifs may appear in the same promoter region. They either work together to regulate transcription or, when some of the TFBSs overlap with each other, compete for TF binding. Thus it is meaningful to discover multiple TFBS motifs, possibly with overlaps in some of their TFBSs, from a dataset simultaneously. The following experiments tested GALFG under this scenario.
The liver-specific dataset
The liver-specific dataset [47] contains 19 sequences embedded with several major motifs (with 6-19 sites) varying in widths, namely HNF1, HNF3, HNF4 and C/EBP, and some other motifs with fewer sites, such as CRE, BRF3 and BRF4 with only one occurrence each. Some TFBSs from different types of motifs overlap with each other in the dataset. For example, a TFBS of HNF1 (width 15) overlaps with a TFBS of HNF4 (width 12) by 7 bp in a particular sequence, while co-occurring TFBSs of HNF1 and HNF4 in some other sequences do not overlap at all. The total number of (overlapping) TFBS instances is 60. The widths vary dramatically, from 7 bp to 31 bp.
On this dataset, GALFG, GAME and MEME were compared using the width range input R = [8,16], which is considered a common range for TFBSs, to discover different types of motifs. The expected width for GAME was 12, the mean of the input range. Different numbers of motifs, K, ranging from 5 to 20 with step 5, were output and evaluated.
The site-level results (with shift restrictions) for sPPV, sSn and F-scores sF based on all TFBSs are shown in Figure 2 for different K. MEME fails to produce recalls or F-scores comparable to the others. This is probably caused by its masking technique, which does not allow motifs to overlap. GAME masks TFBSs individually rather than whole motifs, so better sSn (recall) can be obtained from a diverse GA population. With overlapping control on the GA, GALFG shows recalls comparable to or better than GAME. Moreover, GALFG has the best sPPV (precision) while GAME generally has the worst. Both GALFG and MEME show an increasing trend in recall as K increases. The sudden drop of GAME for K = 20 is probably because the expected width no longer suits some of the motifs, as GAME actually performs a fixed-width search in its GA. GALFG provides the best balance between precision and sensitivity, and thus gives the best F-scores in all cases. Averaged over all K, the F-scores are GALFG: 0.54, GAME: 0.45 and MEME: 0.31, where GALFG outperforms the other two by 20% and 73% respectively.
Besides the previous evaluation, which treats all the TFBSs as a whole, a type-specific investigation was also carried out on the output of GALFG. With the help of STAMP [48], the motifs predicted by GALFG with K = 5 were searched for matches among the annotated TFBS motifs in the TRANSFAC database V11.3, based on ALLR (Average Log Likelihood Ratio). ALLR was considered to be the most effective metric for comparing single columns of motifs [48].
The relevant matches for the top 2 motifs are displayed in sequence logo format in Figure 3. The top 2 high-scored motifs, labeled in STAMP as Motif (width: 13) and Motif v2 (width: 11), match HNF1 and HNF4 in TRANSFAC respectively with high statistical significance, i.e., low E-values (< 0.05). Motif v4 (width: 16) matches part of HNF3 alpha without high statistical significance (E-value 2.71e-01), because only part of the HNF3 TFBSs are identified in the predicted motif. This indicates that top-scored motifs output by GALFG in general match true TFBS motifs with high confidence. The other two motifs do not have relevant top 10 matches in TRANSFAC. C/EBP cannot be discovered as a whole motif, possibly due to its low conservation compared to the HNF motifs. STAMP also provides a phylogenetic profile in which Motif (HNF1) and Motif v2 (HNF4) are grouped together, and so is Motif v4 (HNF3), implying that they belong to the same HNF family. For K = 10, similar results are obtained, with matches mainly including HNF1 and HNF4.
In-depth investigation on the MyoD dataset
The MyoD dataset seems to be an exceptional case among the 8 real datasets tested by GAME [27]. Only GALFG (sPPV: 19/22, sSn: 19/21, sF: 0.88) and GALFP (sPPV: 21/37, sSn: 21/21, sF: 0.72) are able to show acceptable site-level results (with shift restrictions) in the fixed-width (w = 6) experiments, while in the variable-width experiments none of the programs succeeds in providing good results.
To investigate this exception, GALFG was set to output K = 3 different motifs with the annotated width 6. Besides the fittest output being the annotated MyoD motif, the other two are only marginally lower in fitness than the best one (differences < 2%). That is probably the reason why most existing algorithms perform poorly on this dataset: they either locate a suboptimal motif because of the low signal-to-noise ratio, or rank the motifs inappropriately due to subtle differences in the modelling. This indicates that accurate width information is still crucial for such subtle and short motifs. We searched the 2nd ranked motif, Motif v2, for matches in the TRANSFAC database using STAMP, based on the various column comparison metrics provided by STAMP. Consistent matches, such as the E2A [49, 50], p53 [51, 52], E47 [53] and E-box [54] motifs, were obtained with high rankings (within the top 10), and these motifs are closely related to MyoD in muscle cell regulation according to the references [49–54]. The most consistent matches are shown in Figure 4. Thus there is a high probability that Motif v2 is a true motif which may not have been annotated previously in the MyoD dataset. In summary, GALFG outperforms GAME and MEME by 14% and 73% on average in sF respectively on the liver-specific dataset for multiple motif discovery. Additionally, GALFG sheds light on an additional motif which may not have been annotated previously in the MyoD dataset.
Efficiency Experiments
Although effectiveness is the major concern in motif discovery, practitioners also prefer efficient algorithms that can handle large-scale data. In this subsection, we tested GALFG with different GA population sizes to investigate the tradeoff between the effectiveness and the efficiency of metaconvergence. Firstly, different population sizes (PS = 500, 200, 100, 50, 10) were used to run GALFG, GALFP and GAME (results from [28]) on the 8 real datasets [27] for fixed-width single motif discovery. PS = 500 is the default: in the previous work GALFP adopted it to be consistent with GAME's PS = 500, and GALFG follows it for the purpose of minimum parameter tuning. For each PS, the programs were run 20 times on the same Pentium D 3.00 GHz machine with 1 GB memory, running Windows XP, and the results were averaged. The effectiveness (site-level F-scores sF) and efficiency are shown in Figures 5 (a) to 5 (c). For the default PS = 500, the average times (in seconds) are ordered as GALFG (43.38) < GALFP (61.91) < GAME (291.11). Since the standard deviation of GAME's effectiveness is already large with PS = 500, we focus only on GALFG and GALFP when comparing the effects of different PS (except for the special MyoD case, which is better run with K > 1). In Figure 5 (a), the overall performances for PS = 500 are similar, as are the standard deviations: GALFG 0.004; GALFP 0.029. However, when the population size drops to PS = 10, the performance of GALFP drops significantly, and its standard deviation becomes 0.17 on average, and even ≥ 0.40 for the MEF2 and TBP datasets (Figure 5 (c)). In contrast, the average performance of GALFG is maintained, and its overall standard deviation is only 0.031, which is still very small. Furthermore, the average time of GALFG for PS = 10 is just 1.80 seconds, a speedup of over 24 times relative to the default PS, as shown in Figure 5 (b).
It is interesting that even with a population size of 10, GALFG still performs comparably well, while GALFP degenerates significantly. The major reason is the metaconvergence framework with the similarity test, which is not used in GALFP. With an extremely small population, GALF may not be able to provide the optimal motif in every run. However, since different motifs are controlled and maintained on a meta level in GALFG, converged suboptimal motifs will be replaced by better ones and eventually the global optimum can be found.
The above results imply that GALFG is able to provide comparable and consistent performance for fixed-width single motif discovery with a small population, and hence competitive efficiency.
On the E. coli benchmark for multiple outputs (K = 5) with range inputs, we observed similar performance maintenance with different PS for GALFG in Figure 5 (d), thanks to the metaconvergence mechanism to maintain different optimal motifs in the solutions. The average time on each dataset for the three PS is 655.80 (500), 74.40 (50) and 16.05 (10) seconds respectively, where the PS = 10 demonstrates a speedup of over 40 times compared to that of the default size (PS = 500). For PS = 10, the standard deviation of nPC is 0.0098, which is still small compared with 0.0070 for the default PS.
According to the efficiency experiments, GALFG is able to maintain competitive effectiveness with very high efficiency. Therefore GALFG has great potential to work on ever larger scale datasets successfully.
Discussion and Conclusion
To conclude, we summarize the proposed work on GALFG, discuss the challenges and point out future directions.
Summary
In this paper, the generalized motif model is proposed for realistic motif discovery problems. It models a possible range of widths rather than any single width. The model has the potential to address the biological uncertainty better and is more practical in reality because TFBSs of the same motif may vary in widths and exhibit different degrees of conservation. The metaconvergence framework is proposed to support multiple and possibly overlapping optimal motifs, based on the flexible and easy control of the similarity test for users. GALFG is developed by incorporating the extended GALF searching methodology into the metaconvergence framework based on the generalized model.
GALFG has been tested extensively on over 970 datasets, including 800 synthetic datasets, 8 real datasets (and a further 24 range cases), 100 eukaryotic and 62 E. coli benchmark datasets, as well as a real liver-specific dataset with multiple overlapping motifs. GALFG has shown its competitiveness and better effectiveness for different kinds of motif discovery problems with both fixed-width and range inputs. The generalized model not only predicts the motifs accurately but also includes more correct TFBSs. The searching capacity for optimal solutions and the efficiency of the metaconvergence framework have also been demonstrated with the synthetic and real datasets. GALFG has also discovered an additional motif which might not have been annotated previously in the MyoD dataset.
Discussion
However, the motif discovery problem remains challenging due to the weak underlying motif signals in the input data, as well as the diversity and complexity of TF-TFBS binding [55]. There are also a number of potential improvements to the generalized motif model and GALFG for our future work, such as further analysis of the effect of different width ranges, more efficient evaluation when handling fragments of different widths, flexible width distributions for different motif types, validation of the putative motif in the MyoD dataset, etc. The candidate fixed-width model for the generalized model still needs more investigation to better suit biological observations. Integrating the generalized model for motif discovery with additional evidence, such as expression data, to increase the prediction power is another attractive research direction.
Methods
The Proposed Model and Evaluation
Denotations and Formulations
With our focus on the matrix representation (PFM), the motif discovery problem is formulated as follows. Defined on the alphabet $\Sigma = \{A, T, G, C\}$ for DNA sequences, the input data are a set of sequences $S = \{S_i \mid i = 1, 2, \ldots, m\}$, where each $S_i$ is a sequence of $l_i$ nucleotides from the alphabet. The motif width $w$ is assumed to be known for the time being. TFBS instances are represented by $R = \{r_i^k\}$, where each $r_i^k$ is the $k$th instance of width $w$ in $S_i$. If we assume each sequence has at most one instance (ZOOPS), then $r_i^k$ is collapsed to $r_i$ ($r_i$ = null if $k = 0$) for short. Table 1 illustrates an artificial example of motif discovery. A site indicator matrix (SIM) $A$, which is also used to represent the solution, locates the TFBS instances as sites, where $A_{ij} = 1$ if a motif instance (site) starts at position $j$ of $S_i$ and 0 otherwise. Alternatively, we can use the position $p_i^k = j$ to represent an instance given $w$. Thus we have a compact position representation $A = \{p_1, p_2, \ldots, p_m\}$, especially for ZOOPS, where some of the positions can be NULL. A profile of the motif can be built by aligning the TFBS instances indexed by $A$. The profile is represented as a $4 \times w$ Position Frequency Matrix (PFM) $\Theta$, where $\Theta_{jb}$ is the frequency of nucleotide $b$ in column $j$ of the motif. The nucleotides from the background (non-motif sites) are represented by $\Theta_0$, where $\Theta_{0b}$ is the frequency of nucleotide $b$ in the background and is treated as known from the input.
The motif discovery problem (for a known width $w$) can thus be formulated as finding $A$ (with only the TFBS sites being considered) and the corresponding PFM $\Theta$ such that one of the above scoring/fitness functions is maximized according to different assumptions.
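The compact position representation can be made concrete with a small Python sketch that builds a PFM from one start position per sequence under ZOOPS. The function name `build_pfm`, the 0-based positions, the use of `None` for a NULL position and the column-major list-of-dicts layout (rather than a 4 × w array) are all illustrative assumptions, not details of the GALFG implementation.

```python
ALPHABET = "ATGC"  # same order as the alphabet in the text


def build_pfm(sequences, positions, w):
    """Return (pfm, n_sites) for a ZOOPS solution.

    positions[i] is the 0-based start of the site in sequences[i],
    or None when sequence i contains no site.
    pfm[j][b] is the frequency of nucleotide b in motif column j.
    """
    counts = [{b: 0 for b in ALPHABET} for _ in range(w)]
    n_sites = 0
    for seq, start in zip(sequences, positions):
        if start is None:          # NULL position: no site in this sequence
            continue
        site = seq[start:start + w]
        if len(site) != w:
            raise ValueError("site runs past the end of the sequence")
        n_sites += 1
        for j, b in enumerate(site):
            counts[j][b] += 1
    if n_sites == 0:
        raise ValueError("the solution contains no sites")
    pfm = [{b: c / n_sites for b, c in column.items()} for column in counts]
    return pfm, n_sites


# Toy input: two sequences carry the site ACGT, the third has no site.
pfm, n_sites = build_pfm(["ACGTACGT", "TTACGTAA", "GGGGGGGG"], [0, 2, None], w=4)
print(n_sites, pfm[0])
```

Keeping one dictionary per column makes a lookup such as `pfm[j][b]` read like $\Theta_{jb}$ in the text.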
The Probabilistic Models
To complete our generalized model, an important component comes from the existing models that handle a known width input. In this paper, we employ the probabilistic models, which have the most intuitive interpretation under the generalized model. For a candidate solution $A$ (which also determines $\Theta$), the full Bayesian model of likelihood [13, 40] can be written as
where $\Theta$ is the motif PFM, $\Theta_{0b}$ is the background distribution of nucleotide $b$, $n_{jb}$ is the count of nucleotide $b$ in column $j$ of the PFM, $n_{0b}$ is the count of nucleotide $b$ in the background, $|A|$ is the total number of sites in the motif, $L^* = \sum_{i=1}^{m} l_i$ is approximately the number of all possible sites (the number of invalid sites is trivial and can be ignored), and $p_0 = |A|/L^*$ is the estimated abundance ratio, which represents the probability of any position being a site in the dataset. $\Theta_{jb} = n_{jb}/|A|$ (strictly it should be $\hat{\Theta}_{jb}$ as an estimate, but we just use $\Theta_{jb}$ for simplicity). Similarly $\Theta_{0b} \approx n_{0b}/L^*$ (ignoring the relatively small effect of $|A|$).
In Bayesian analysis, noninformative priors of the independent $p(\Theta)$ and $p(p_0)$ are integrated out for convenience. Alternatively, by assuming them to be constant, we have the log likelihood as follows:
By ignoring the constant parts and approximating $L^* \log(1 - p_0) \approx -L^* p_0 = -|A|$, since $p_0$ is very small, the equivalent score $\psi'$ can be written as
which is exactly the approximation form used in the Bayesian analysis [40]. Going one step further and ignoring the penalty of $-|A|$, we have the approximation form for a known $p$ [40], which is also coined as the Kullback-Leibler divergence with parameter (we use this form in the generalized model since we find that the previous one imposes too much penalty on the number of TFBSs):
Furthermore, if we assume each sequence $S_i$ has exactly one site, i.e. one occurrence per sequence (OOPS), then $p_0$ also becomes constant. As a result, we only have to consider part of Equation 8
which is the well-known information content (IC) [11]. $IC(j)$ is defined as the positional IC for column $j$.
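For orientation, the chain of scores described in this subsection can be written out from the symbol definitions given in the text. The block below is a sketch under stated assumptions, not a verbatim copy of the typeset equations: it assumes a product-multinomial form for motif and background, writes $N_b$ for the total count of nucleotide $b$ in the data (so that $n_{0b} = N_b - \sum_j n_{jb}$, which makes the pure background term a constant), and assumes the numbering (5) to (9) implied by the references to Equations 8 and 10.

```latex
\begin{align}
P(S, A \mid \Theta, \Theta_0, p_0)
  &= \prod_{b \in \Sigma} \Theta_{0b}^{\,n_{0b}}
     \prod_{j=1}^{w} \prod_{b \in \Sigma} \Theta_{jb}^{\,n_{jb}}
     \; p_0^{|A|} \, (1 - p_0)^{L^* - |A|}
     \tag{5} \\
\log P
  &= \sum_{b} N_b \log \Theta_{0b}
   + \sum_{j=1}^{w} \sum_{b} n_{jb} \log \frac{\Theta_{jb}}{\Theta_{0b}}
   + |A| \log p_0 + (L^* - |A|) \log(1 - p_0)
     \tag{6} \\
\psi'
  &= |A| \left( \sum_{j=1}^{w} \sum_{b} \Theta_{jb}
       \log \frac{\Theta_{jb}}{\Theta_{0b}} + \log p_0 - 1 \right)
     \tag{7} \\
\psi
  &= |A| \left( \sum_{j=1}^{w} \sum_{b} \Theta_{jb}
       \log \frac{\Theta_{jb}}{\Theta_{0b}} + \log p_0 \right)
     \tag{8} \\
IC
  &= \sum_{j=1}^{w} IC(j), \qquad
  IC(j) = \sum_{b} \Theta_{jb} \log \frac{\Theta_{jb}}{\Theta_{0b}}
     \tag{9}
\end{align}
```

The step from (6) to (7) uses $n_{jb} = |A|\,\Theta_{jb}$, drops the constant $\sum_b N_b \log \Theta_{0b}$, and applies $(L^* - |A|)\log(1 - p_0) \approx L^* \log(1 - p_0) \approx -|A|$. Dropping the $-|A|$ penalty gives (8), and fixing $|A|$ and $p_0$ under OOPS leaves only the IC term in (9).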
The Fitness Function and Evaluation
Recalling the generalized model in Equation 1, we can now choose $P(A(w_i) \mid w_i) = \exp(\psi(w_i))$ accordingly from the previous probabilistic models, where $\psi(w_i)$ is a simplified notation for $\psi(\Theta, A \mid S, \Theta_0)$ in Equation 8 given $w_i$. For computational convenience, we represent the fitness function $f$ in log likelihood form as
In the evaluation, a candidate solution consists of $A$ (and the derived $\Theta$) with the maximal width $w_{max}$. For each particular $w_i$ from the range $R$, we have to choose the fragment (a contiguous $w_i$-column submatrix $A(w_i)$ of the full matrix $\Theta$) that maximizes $\psi(w_i)$ (see Figure 1). This is equivalent to maximizing IC for width $w_i$, since $p$ in Equation 8 is now fixed for all $A(w_i)$. With the log form of $f$, we can avoid overflow in the exp function by taking out the largest log component during intermediate computation and adding it back upon finishing the evaluation.
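The evaluation just described can be sketched in Python as a sliding-window search for the best fragment of each width, followed by a log-sum-exp over the widths. Several details are assumptions rather than the paper's specification: the score of each width is taken as $\psi(w) = |A|\,(IC_w + \log p_0)$, the widths are given a uniform prior whose constant term is dropped, and the names `best_fragment` and `generalized_fitness` are hypothetical.

```python
import math


def best_fragment(column_ic, width):
    """Return (summed IC, offset) of the best contiguous window of `width` columns."""
    window = sum(column_ic[:width])
    best, best_offset = window, 0
    for offset in range(1, len(column_ic) - width + 1):
        # Slide the window one column to the right.
        window += column_ic[offset + width - 1] - column_ic[offset - 1]
        if window > best:
            best, best_offset = window, offset
    return best, best_offset


def generalized_fitness(column_ic, n_sites, p0, w_min, w_max):
    """Return log(sum over widths of exp(psi(w))) without overflowing exp.

    column_ic holds the positional IC(j) of the full w_max-column PFM.
    """
    if len(column_ic) != w_max:
        raise ValueError("column_ic must have exactly w_max entries")
    psi = [n_sites * (best_fragment(column_ic, w)[0] + math.log(p0))
           for w in range(w_min, w_max + 1)]
    top = max(psi)  # take out the largest log component ...
    return top + math.log(sum(math.exp(v - top) for v in psi))  # ... and add it back


ic = [0.1, 1.5, 1.8, 0.2]
print(best_fragment(ic, 2))                    # best 2-column fragment and its offset
print(generalized_fitness(ic, 10, 0.01, 2, 4))
```

Because every exponent passed to `math.exp` is at most zero after subtracting `top`, the sum cannot overflow even when the individual $\psi(w)$ values are in the thousands.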
For convenience in implementing the search, and for consistency with other methods in evaluation (which output single-width motifs), a core fragment, located by the width $w_{cor}$ and offset $w_0$, is selected. $w_{cor}$ and $w_0$ are also determined based on IC. Starting from the two ends of the maximal PFM with $w_{max}$, we iteratively remove each column $j$ whose positional $IC(j)$ is lower than the average. The remaining submatrix (or $A(w_{cor})$) thus has width $w_{cor}$ and offset $w_0$. The complexity of the whole evaluation grows quadratically with $|R| = w_{max} - w_{min} + 1$. Since the ranges are usually restricted to within 5-10 bp, $f$ is computationally feasible in practice, with an additional $O(|R|^2)$ overhead compared with a fixed-width model for $w_{max}$. The offset $w_0$, combined with the position $p_i$ of $A$ in the $i$th sequence, is also used to determine the aligned position in the similarity test in Equation 2.
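The trimming rule for the core fragment can be sketched as follows. The sketch assumes that "the average" is the mean positional IC over all $w_{max}$ columns, computed once before trimming, and that trimming stops at the first column from each end that reaches this mean; the function name `core_fragment` is hypothetical.

```python
def core_fragment(column_ic):
    """Trim low-IC columns from both ends; return (w_cor, w_0).

    column_ic holds the positional IC(j) of the full w_max-column PFM.
    """
    mean_ic = sum(column_ic) / len(column_ic)
    left, right = 0, len(column_ic)            # the core is column_ic[left:right]
    while left < right and column_ic[left] < mean_ic:
        left += 1                              # drop a weak column at the left end
    while right > left and column_ic[right - 1] < mean_ic:
        right -= 1                             # drop a weak column at the right end
    return right - left, left


# The weak interior column (0.4) is kept because trimming only works from the ends.
print(core_fragment([0.1, 0.3, 1.5, 0.4, 1.8, 0.2]))  # → (3, 2)
```

Since the largest column IC is never below the mean, the core fragment always retains at least one column.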
Metaconvergence Framework Details
Similarity test settings
The shift tolerance in Equation 2 is set as $tol = 3 + (|R| - 1)/2$. The first part of $tol$ is chosen for convenience to separate two TFBSs, and the latter part is the tolerance for the range involved. In the case of competition for the same slot in slot dispatching, the threshold can be flexibly specified by the users (for general usage, the default is st = 0.3, which is used throughout this paper). Users can customize st based on their needs, either with a large value (e.g. ≥ 0.5) to force solutions of highly different motifs, or with a small value (e.g. ≤ 0.1) to allow fine variations of the same motif type. On the other hand, for deleting individuals in the case of near convergence, the threshold is automatically fixed at st' = 0.5 to make room for the other solutions. The results are not sensitive to st' because the similar optimal motifs are finally controlled by the user-specified threshold st. However, if st' is set too low, many similar variations of the converged motif will remain in the population, and time will be wasted converging repeatedly to the same motif kind.
Metaconvergence
In greater detail, the metaconvergence framework can incorporate any GA procedure (Genetic Algorithm with Local Filtering (GALF) [28] in our case). As in the previous approaches [27, 28], the GA can be executed up to a maximum number of times, MAXRUN, but stops earlier if the convergence test is satisfied. Additionally, in metaconvergence, K+1 slots are maintained, where K is the number of optimal solutions expected. Each slot stores the best solution of a different motif kind and is allocated a counter Cnt, which keeps track of its motif convergence count. At the end of each GA run, a number (NUM) of best solutions (individuals) are dispatched to the K+1 slots, subject to the similarity test. The corresponding counter is incremented for each update by a solution of the same motif kind and is reset if the motif is replaced by a new one. A convergence threshold MAXIND is used to monitor convergence. MAXIND is a relatively small number because each dispatched solution is already a converged one obtained by the GA. In general, the metaconvergence framework needs at most MAXRUN GA runs to obtain K optimal solutions, while previous methods such as GAME and GALFP need K*MAXRUN runs. The whole procedure of metaconvergence is illustrated in Figure 6.
Similarity test applied in the framework
Solutions that pass the similarity test, i.e. those that belong to the same motif kind as the solution in a particular slot, will compete for that slot based on their fitness. On the other hand, the solution of a new motif will occupy an empty slot or the slot storing the solution with the worst fitness. After each GA run, when a slot is near convergence (we define this situation as Cnt > MAXIND/2), solutions similar to it will be eliminated, again based on the similarity test, to make room for the other optimal solutions in the next GA run. When the solution of a particular motif in the slot has converged (i.e. Cnt ≥ MAXIND), the motif will be taken out of the search process, i.e. all the exactly matched TFBSs belonging to this motif will be deleted, making room for efficient discovery of other motifs. The extra (K+1)th slot is used to keep certain suboptimal solutions in the early stage in order not to lose them, because otherwise Cnt may fluctuate, especially in the K = 1 case when several motifs with close fitness compete for the only slot.
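The slot bookkeeping can be sketched in Python as below. This is an illustration under assumptions, not the GALFG pseudocode (which is in Additional file 1): the similarity test is passed in as a hypothetical `same_kind` predicate, the counter is incremented whenever a same-kind solution arrives, a newcomer evicts the worst slot only if it is fitter, and the value MAXIND = 3 in the toy run is arbitrary.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Slot:
    solution: Any
    fitness: float
    cnt: int = 0  # convergence count Cnt for this motif kind


def dispatch(slots: List[Slot], k: int, solution: Any, fitness: float,
             same_kind: Callable[[Any, Any], bool]) -> None:
    """Dispatch one converged GA solution into at most K+1 slots (in place)."""
    for slot in slots:
        if same_kind(slot.solution, solution):
            # Same motif kind: keep the fitter solution and count the hit.
            if fitness > slot.fitness:
                slot.solution, slot.fitness = solution, fitness
            slot.cnt += 1
            return
    if len(slots) < k + 1:
        slots.append(Slot(solution, fitness))  # a new kind takes an empty slot
        return
    worst = min(slots, key=lambda s: s.fitness)
    if fitness > worst.fitness:  # assumption: only a fitter newcomer evicts
        worst.solution, worst.fitness, worst.cnt = solution, fitness, 0


def state(slot: Slot, maxind: int) -> str:
    """Classify a slot using the thresholds Cnt >= MAXIND and Cnt > MAXIND/2."""
    if slot.cnt >= maxind:
        return "converged"
    if slot.cnt > maxind / 2:
        return "near convergence"
    return "open"


# Toy run: solutions are strings, and two solutions are of the same kind
# when they share a first letter (a stand-in for the real similarity test).
slots: List[Slot] = []
for sol, fit in [("a1", 1.0), ("b1", 0.5), ("a2", 1.2), ("c1", 0.8), ("a3", 1.1)]:
    dispatch(slots, k=1, solution=sol, fitness=fit,
             same_kind=lambda x, y: x[0] == y[0])
print([(s.solution, s.cnt, state(s, maxind=3)) for s in slots])
```

In the toy run with K = 1, the second slot plays the role of the extra (K+1)th slot: it first holds kind "b", which is then evicted by the fitter kind "c", while kind "a" accumulates hits and approaches convergence.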
GALFG Implementations
We employ the genetic algorithm (GA [see Additional file 1]) based GALF [28] as the searching procedure. However, since GALF was previously based on simpler assumptions, it has to be extended accordingly to suit the need of the generalized model.
Extended GALF Operators
Local filtering (LF) is the feature operator of GALF, which employs the combined representations for the whole motif (PFM $\Theta$) and individual instances (SIM $A$). However, it was based on the simple OOPS and fixed-width assumptions. As a result, extensions have to be made for the more general cases addressed by GALFG.
Generally, LF refines each individual (candidate solution) by iteratively scanning the sequence containing the currently worst instance and choosing the best replacement. To evaluate each instance (site) of the individual, a similarity score based on the consensus concept was previously proposed. However, the relation between this heuristic score and the fitness is implicit. In GALFG, we propose to use the log likelihood ratio for an instance fragment starting at a given column with width $w'$,
to evaluate each instance $r_i$, where $r_i(j) \in \Sigma$ is the nucleotide in column $j$ of $r_i$, $\Theta_{j\,r_i(j)}$ is the corresponding frequency from the PFM and $\Theta_{0\,r_i(j)}$ is the corresponding background frequency. It measures the likelihood ratio of $r_i$ being generated by the motif PFM rather than by the background, and is more closely related to $\psi(w_i)$ in Equation 10. The effectiveness of the log likelihood ratio and the mutation operator is verified [see Additional file 1] on the 8 datasets tested in [27]. In range input cases, with the $w_{cor}$ core fragment stored, we encourage LF to match instances with a longer width (≥ $w_{cor}$), so the width $w'$ is chosen randomly from $[w_{cor}, w_{max}]$ and thus LF can be applied with the fewest modifications.
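A per-instance log likelihood ratio of this kind can be sketched in Python as the sum over fragment columns of $\log(\Theta_{j\,r_i(j)} / \Theta_{0\,r_i(j)})$. The function name `instance_llr`, the list-of-dicts PFM layout and the small `floor` that guards against zero frequencies are assumptions made for illustration only.

```python
import math


def instance_llr(instance, pfm, background, start, width, floor=1e-6):
    """Log likelihood ratio of an instance fragment under the PFM vs. the background.

    instance is aligned with the PFM columns; the fragment covers
    columns start .. start + width - 1.
    """
    total = 0.0
    for j in range(start, start + width):
        b = instance[j]
        # floor avoids log(0) for nucleotides never seen in column j (assumption)
        total += math.log(max(pfm[j][b], floor) / background[b])
    return total


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
instances = ["ACG", "ACT", "TTT"]
scores = {r: instance_llr(r, pfm, background, start=0, width=3) for r in instances}
worst = min(scores, key=scores.get)  # the instance LF would try to replace first
print(scores, worst)
```

An instance matching the consensus scores above zero, while one that looks like background scores below zero, which is what makes the lowest-scoring instance a natural candidate for replacement.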
Because the fitness $f$ can now handle the general case with any number of motif instances, the new GALFG can search based on the zero or one occurrence per sequence (ZOOPS) assumption rather than OOPS. However, it is unwise to randomly generate null positions for non-sites at the very beginning of the search. This is because, when most of the individuals have poor fitness, the search will be strongly biased towards fewer instances and the population will suffer from undesirable premature convergence. To alleviate this problem, we initialize the population under the OOPS assumption and refine the abundance ratio ($p_0$ in Equation 8) in later generations using a new mode of LF. The convergence (CONVER) mode of LF is triggered when the best individual stagnates for more than 1/4 of the convergence count MAXCONVER, or when the GA approaches its maximal generation. The convergence mode LF is applied to all individuals to adjust the motif abundance. The procedure is similar to normal LF except that the full $w_{max}$ fragment is chosen for each instance, and the worst instances are removed rather than refined if eliminating them increases the overall fitness $f$.
Other Extensions
We adopt the single-point mutation and pre-selection from GALFP [28] and choose multi-point (close to uniform) crossover instead of single-point because it provides higher diversity. Since the new model adjusts widths automatically, the shift operator in [28] is no longer needed.
To handle general cases other than the ZOOPS assumption, where there may be several occurrences in a sequence, we employ a refinement process for additional instances upon the metaconvergence of GALF runs. Generally, if a fixed width is input, instances have to increase f in order to be added, while in the width range case, the threshold of f is relaxed slightly [see Additional file 1 for the details].
Combining the metaconvergence framework with extended GALF based on the generalized model, as well as the refinement procedure, we have the proposed GALFG to discover multiple TFBS motifs [see Additional file 1 for the pseudocodes of the new LF, the extended GALF and GALFG].
Parameter Setting
Besides the parameters discussed specifically (such as motif widths and the output motif number K), and except in the efficiency experiments (with different PS), the other parameter settings exactly follow GALFP [28] for the purpose of minimum tuning. In the extended GALF: default population size PS: 500; maximal number of generations MAXGEN: 300; interval of generations to trigger local filtering (LF) INTL: 10; convergence count MAXCONVER: 50; mutation rate: 0.9; crossover rate: 0.3; and maximal runs of GALF MAXRUN: 20. The quite large population size follows the setting of GAME for fair and consistent comparisons, though it turns out that a smaller population size also works comparably well (in the efficiency experiments).
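For readers scripting their own experiments, the defaults listed above can be collected in one place. The Python sketch below is only a convenience container: the class name `GalfgDefaults` and the field names are hypothetical and do not correspond to any option names of the GALFG program, and the similarity threshold st = 0.3 is taken from the similarity test settings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GalfgDefaults:
    population_size: int = 500          # PS
    max_generations: int = 300          # MAXGEN
    lf_interval: int = 10               # INTL: generations between LF calls
    max_conver: int = 50                # MAXCONVER: convergence count
    mutation_rate: float = 0.9
    crossover_rate: float = 0.3
    max_run: int = 20                   # MAXRUN: maximal runs of GALF
    similarity_threshold: float = 0.3   # st


defaults = GalfgDefaults()
# Setting used at the small end of the efficiency experiments.
small_population = replace(defaults, population_size=10)
# Stagnation length after which the CONVER mode of LF is triggered.
print(defaults.max_conver / 4, small_population.population_size)
```

Freezing the dataclass keeps the defaults immutable, so a variant such as the PS = 10 setting is derived with `replace` rather than by editing the shared instance.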
References
Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of proteinDNA binding specificity. Nucleic Acids Res 1987, 5(9):3157â€“3170. 10.1093/nar/5.9.3157
Garner MM, Revzin A: A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res 1981, 9(13):3047â€“3060. 10.1093/nar/9.13.3047
Smith AD, Sumazin P, Das D, Zhang MQ: Mining ChIPchip data for transcription factor and cofactor binding sites. Bioinformatics 2005, 20(Suppl 1):i403i412. 10.1093/bioinformatics/bti1043
MacIsaac KD, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2006, 2(4):e36. 10.1371/journal.pcbi.0020036
Liu XS, Brutlag DL, Liu JS: An algorithm for finding proteinDNA binding sites with applications to chromatinimmunoprecipitation microarray experiments. Nat Biotechnol 2002, 20: 835â€“839.
Li M, Ma B, Wang L: Finding similar regions in many sequences. Journal of Computer and System Sciences 2002, 65: 73â€“96. 10.1006/jcss.2002.1823
Bieganski P, Riedl J, Carlis JV, Retzel E: Generalized suffix trees for biological sequence data: applications and implementations. Proc. of the 27th Hawaii Int. Conf. on Systems Sci 1994, 35â€“44.
Sagot MF: Spelling approximate repeated or common motifs using a suffix tree. LATIN'98, LNCS 1380 1998, 374â€“390.
Liu FFM, Tsai JJP, Chen RM, Chen SN, Shih SH: FMGA: finding motifs by genetic algorithm. BIBE '04: Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering 2004, 459â€“466. full_text
Paul TK, Iba H: Identification of weak motifs in multiple biological sequences using genetic algorithm. GECCO '06: Proceedings of the 8th annual conference on Genetic and evolutionary computation 2006, 271â€“278. full_text
Stormo GD: Computer methods for analyzing sequence recognition of nucleic acids. Annu Rev BioChem 1988, 17: 241â€“263.
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology 1994, 28â€“36.
Jensen ST, Liu XS, Zhou Q, Liu JS: Computational discovery of gene regulatory binding motifs: a Bayesian perspective. Statistical Science 2004, 19: 188â€“204. 10.1214/088342304000000107
Pevzner PA, Sze SH: Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings International Conference on Intelligent Systems for Molecular Biology. AAAI Press; 2000:269â€“278.
Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17: S207S214.
Pavesi G, Mereghetti P, Mauri G, Pesole G: Weeder web: discovery of transcription factor binding sites in a set of sequences from coregulated genes. Nucleic Acids Res 2004, 32: W199W203. 10.1093/nar/gkh465
Buhler J, Tompa M: Finding motifs using random projections. RECOMB 2001, 69â€“76. full_text
Raphael B, Liu LT, Varghese G: A uniform projection method for motif discovery in DNA sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2004, 1(2):91â€“94. 10.1109/TCBB.2004.14
Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of coexpressed genes. Pac Symp Biocomput 2001, 6: 127â€“138.
Roth F, Hughes J, Estep P, Church G: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by wholegenome mRNA quantitation. Nat Biotechnol 1998, 16: 939â€“945. 10.1038/nbt1098939
Thijs G, Marchal K, Lescot M, Rombauts S, DeMoor B, Rouze P, Moreau Y: A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol 2002, 9: 447â€“464. 10.1089/10665270252935566
Holland JH: Adaptation in natural and artificial systems. In Ann Arbor. University of Michigan Press; 1975.
Goldberg DE: Genetic algorithms in search, optimization and machine learning. Boston, MA: Kluwer Academic Publishers; 1989.
Che D, Song Y, Rasheed K: MDGA: motif discovery using a genetic algorithm. GECCO '05: Proceedings of the 2005 conference on Genetic and evolutionary computation 2005, 447â€“452. full_text
Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB, Onyia JE, Su C: Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res 2004, 32(13):3826â€“3835. 10.1093/nar/gkh713
Lones MA, Tyrrell AM: Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007, 4(3):403â€“414. 10.1109/tcbb.2007.1044
Wei Z, Jensen ST: GAME: detecting cisregulatory elements using a genetic algorithm. Bioinformatics 2006, 22(13):1577â€“1584. 10.1093/bioinformatics/btl147
Chan TM, Leung KS, Lee KH: TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics 2008, 24(3):341–349. 10.1093/bioinformatics/btm606
Hu J, Yang YD, Kihara D: EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics 2006, 7: e342. 10.1186/1471-2105-7-342
Wijaya E, Yiu SM, Son NT, Kanagasabai R, Sung WK: MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics 2008, 24(20):2288–2295. 10.1093/bioinformatics/btn420
Siddharthan R, Siggia ED, van Nimwegen E: PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol 2005, 1(7):e67. 10.1371/journal.pcbi.0010067
Das MK, Dai HK: A survey of DNA motif finding algorithms. BMC Bioinformatics 2007, 8: S21.
Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 2005, 23: 137–144. 10.1038/nbt1053
Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33: 4899–4913. 10.1093/nar/gki791
Sandve GK, Abul O, Walseng V, Drablos F: Improved benchmarks for computational motif discovery. BMC Bioinformatics 2007, 8: 193. 10.1186/1471-2105-8-193
Garvie CW, Wolberger C: Recognition of specific DNA sequences. Molecular Cell 2001, 8: 937–946. 10.1016/S1097-2765(01)00392-6
Morozov AV, Siggia ED: Connecting protein structure with predictions of regulatory sites. Proc Natl Acad Sci USA 2007, 104(17):7068–7073. 10.1073/pnas.0701356104
Hertz G, Stormo G: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(7–8):563–577. 10.1093/bioinformatics/15.7.563
Habib N, Kaplan T, Margalit H, Friedman N: A novel Bayesian DNA motif comparison method for clustering and retrieval. PLoS Comput Biol 2008, 4(2):e1000010. 10.1371/journal.pcbi.1000010
Jensen ST, Liu JS: BioOptimizer: a Bayesian scoring function approach to motif discovery. Bioinformatics 2004, 20: 1557–1564. 10.1093/bioinformatics/bth127
Stormo GD, Hartzell GW: Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 1989, 86: 1183–1187. 10.1073/pnas.86.4.1183
Klinge CM: Estrogen receptor interaction with estrogen response elements. Nucleic Acids Res 2001, 29: 2905–2919. 10.1093/nar/29.14.2905
Kel AE, Kel-Margoulis OV, Farnham PJ, Bartley SM, Wingender E, Zhang MQ: Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J Mol Biol 2001, 309: 99–120. 10.1006/jmbi.2001.4650
Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of annotated regulatory binding sites from orthologous promoters. Nucleic Acids Res 2006, 34: D63–D67. 10.1093/nar/gkj116
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotechnology 2008, 26(11):1293–1300. 10.1038/nbt.1505
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wooton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, 262(8):208–214. 10.1126/science.8211139
Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Research 2001, 11: 1559–1566. 10.1101/gr.180601
Mahony S, Benos PV: STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res 2007, 35: W253–W258. 10.1093/nar/gkm272
Blackwell TK, Weintraub H: Differences and similarities in DNA-binding preferences of MyoD and E2A protein complexes revealed by binding site selection. Science 1990, 250(4984):1104–1110. 10.1126/science.2174572
Aronheim A, Shiran R, Rosen A, Walker MD: Cell-specific expression of helix-loop-helix transcription factors encoded by the E2A gene. Nucleic Acids Res 1993, 21(7):1601–1606. 10.1093/nar/21.7.1601
Zambetti GP, Bargonetti J, Walker K, Prives C, Levine AJ: Wild-type p53 mediates positive regulation of gene expression through a specific DNA sequence element. Genes Dev 1992, 6: 1143–1152. 10.1101/gad.6.7.1143
Zhao J, Schmieg FI, Simmons DT, Molloy GR: Mouse p53 represses the rat brain creatine kinase gene but activates the rat muscle creatine kinase gene. Mol Cell Biol 1994, 14(12):8483–8492.
Lassar AB, Davis RL, Wright WE, Kadesch T, Murre C, Voronova A, Baltimore D, Weintraub H: Functional activity of myogenic HLH proteins requires hetero-oligomerization with E12/E47-like proteins in vivo. Cell 1991, 58: 305–315. 10.1016/0092-8674(91)90620-E
Martin KA, Walsh K, Mader SL: The mouse creatine kinase paired E-box element confers muscle-specific expression to a heterologous promoter. Gene 1994, 142: 275–278. 10.1016/0378-1119(94)90274-7
Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML: Diversity and Complexity in DNA Recognition by Transcription Factors. Science 2009, 324: 1720–1723. 10.1126/science.1162327
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. They are also grateful to the authors of CisGenome for suggesting parameters to run FlexModule. This research is partially supported by grants from the Research Grants Council of the Hong Kong SAR, China (Project CUHK414107 and CUHK414708).
Authors' contributions
TMC proposed the ideas, developed the algorithm and interpreted the results. GL refined the ideas and implementations, and carried out the experiments for comparisons. KSL and KHL were involved in the design and supervision of the project. TMC, KSL and KHL jointly wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2009_3051_MOESM1_ESM.PDF
Additional file 1: Supplementary materials for discovering multiple realistic TFBS motifs based on a generalized model. Supplementary materials of additional details about implementations, datasets and experiments. (PDF 222 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Chan, TM., Li, G., Leung, KS. et al. Discovering multiple realistic TFBS motifs based on a generalized model. BMC Bioinformatics 10, 321 (2009). https://doi.org/10.1186/1471-2105-10-321
DOI: https://doi.org/10.1186/1471-2105-10-321