New scoring schema for finding motifs in DNA Sequences
- Fatemeh Zare-Mirakabad^{1},
- Hayedeh Ahrabian^{2},
- Mehdei Sadeghi^{3, 4},
- Abbas Nowzari-Dalini^{2} and
- Bahram Goliaei^{1}
https://doi.org/10.1186/1471-2105-10-93
© Zare-Mirakabad et al; licensee BioMed Central Ltd. 2009
Received: 14 September 2008
Accepted: 20 March 2009
Published: 20 March 2009
Abstract
Background
Pattern discovery in DNA sequences is one of the most fundamental problems in molecular biology, with important applications in finding regulatory signals and transcription factor binding sites. An important task in this problem is to search for (or predict) known binding sites in a new DNA sequence. To this end, all subsequences of the given DNA sequence are scored with a scoring function, and the prediction is made by selecting the best score. Most of the available tools for known binding site prediction assume no dependency between binding site base positions. Recently, Tomovic and Oakeley investigated the statistical basis for a claim of either dependence or independence, to determine whether such a claim is generally true, and presented a scoring function for binding site prediction based on the dependency between binding site base positions. Our primary objective is to investigate the scoring functions that can be used in known binding site prediction under the assumption of dependency or independency between binding site base positions.
Results
We propose a new scoring function based on the dependency between all binding site base positions. This scoring function uses joint information content and mutual information as measures of dependency between positions in a transcription factor binding site. Our method for modeling dependencies is simply an extension of position independency methods. We evaluate our new scoring function on real data sets extracted from the JASPAR and TRANSFAC databases, and compare the obtained results with two other well-known scoring functions.
Conclusion
The results demonstrate that the new approach improves known binding site discovery and show that the joint information content and mutual information provide a better and more general criterion for investigating the relationships between positions in the TFBS. Our scoring function is formulated by simple mathematical calculations. From applying our method to several biological data sets, it can be inferred that this method performs better than methods that do not consider dependencies.
Background
DNA-binding proteins, called transcription factors (TFs), are involved in transcription regulation. These factors bind to specific positions in promoter regions for modulating the expression of genes. The common pattern of the recognition sites of a TF is called a motif. We use the term transcription factor binding site or motif instance to mean the occurrence of the motif with some mutations in promoter regions.
Identifying transcription factor binding sites (TFBSs) in promoter regions is a difficult problem in molecular biology. The main reason for this difficulty is that a single transcription factor might bind to regions that vary greatly in their sequences. Although the binding sites for a particular transcription factor share short similar subsequences, they are sometimes highly degenerate. Such short sequences are expected to occur randomly every few hundred base pairs, and thus finding them is a difficult task. Since experimental procedures to determine the exact binding sites are expensive and time-consuming, computational methods have been developed over the past two decades for discovering novel motifs and TFBSs in a set of promoter sequences [1, 2].
There are two main classes of algorithms for finding regulatory motifs. The first class consists of methods that search for known transcription factor binding sites in a new sequence (known motif prediction). Example tools include ConSite [3], Match [4], Mapper [5], Patser [6] and rVista [7]. The second class consists of methods that try to detect new motifs within a set of DNA sequences based on sequence homology (unknown motif prediction). Example tools include Gibbs sampler [8], AlignACE [9], MEME [10] and Yeast Motif Finder [11]. Generally, motif finding algorithms in both classes have three important elements: a motif model that can capture the similarities of a diverse set of binding sites for the same transcription factor, an objective function defining the ranking of known motifs (in the first class) or potential motifs (in the second class), and a search strategy for parameterizations of the motif model. The first two elements can be given an abstract representation or modeling, but should be designed to utilize and enhance biologically relevant information.
Until now, the most common way of modeling binding sites has been to assume that each base in a site occurs independently of the others. Under this assumption, a motif is represented by consensus sequences [12], position weight matrices (PWMs) [9, 10], matrix profiles [13], sequence logos [14], mismatch strings (MMs) [15, 16] (a consensus string allowing some mismatches) or IUPAC strings (IUPACs) [11, 17] (a consensus string with degenerate symbols). Methods based on the assumption of independency between positions are simple and have a small number of parameters, which makes them easy to implement. These methods are widely used and often considered acceptable models for binding site prediction [18]. However, recent experimental evidence [18, 19] has promoted the development of models that incorporate position dependencies. The related methods include Bayesian networks [20], permuted Markov models [21], Markov chain optimization [22], hidden Markov models [5], non-parametric models [23] and generalized weight matrix models [1]. Along the same lines, another method for modeling binding sites was presented by Tomovic and Oakeley [24]. In this method, dependent and independent positions are identified for a given TFBS, and the score used when searching for a motif is calculated from both. The dependencies between positions of a given TFBS are predicted by a statistical approach and may be explained by the structure of TF-DNA complexes. Methods based on position dependencies usually have better binding site prediction accuracy with lower false positive rates. However, these methods require more complicated mathematical tools with more parameters to estimate, and more experimental data than is typically available [5, 20, 22, 23]. On one hand, a more comprehensive model may allow a better fit to the data. On the other hand, a more complex model may over-fit the data and result in inferior predictive power.
In this paper, we focus on TFBS models and search methods for known motif prediction, which find known transcription factor binding sites in a given sequence, and investigate known motif ranking (the scoring schema). We study whether TFs show position dependencies in their binding sites. We also investigate the use of joint information content and mutual information as measures of dependency between positions in a TFBS. We suggest a statistical approach for testing dependencies, and present a new scoring schema that can be used in search methods for finding known transcription factor binding sites. Our method for modeling dependencies is simply an extension of position independency methods. It is formulated by simple mathematical calculations and, as will be shown, the proposed algorithm is simple, efficient, and easily applied to any data set. We test our new scoring schema on real data sets and compare the obtained results with two other well-known scoring schemas, one independent and one dependent. This comparison shows that our scoring function performs better than the independent scoring schema, that is, better than methods that do not consider dependencies. It also shows that the joint information content and mutual information provide a better and more general criterion for investigating the relationships between positions in the TFBS. Moreover, with these two measures we obtain results comparable to those of the dependent scoring schema.
Methods
As mentioned in the previous section, one of the important problems in motif discovery is finding the known TFBSs in a given DNA sequence or promoter region (known motif prediction). In this section we focus on this problem and first introduce some definitions and notation used throughout the paper. Let N = {A, C, G, T} be the set of four nucleotide letters of which DNA sequences are composed. We have a DNA sequence D = d_{1},..., d_{n} (a promoter region) over N, and we suppose that we have t known TFBSs of length ℓ for a given TF, represented by a matrix B_{t × ℓ}. We intend to use B to determine whether D possesses a motif instance (transcription factor binding site) corresponding to the given TF. To find the position of this motif instance in D, we first create a position weight matrix W from B, then scan all subsequences R = d_{i},..., d_{i+ℓ-1} of D for i = 1,..., n - ℓ + 1, and align W with each R. All subsequences whose score is greater than a cutoff are reported as motif instances. The construction of the position weight matrix W from the TFBSs, together with the calculation of the score of aligning W with a subsequence, is called a scoring schema.
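The scan over all subsequences R described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `scan_sequence`, the pluggable `score` callable and the toy consensus-matching score are assumptions introduced only for this example.

```python
from typing import Callable, List, Tuple

def scan_sequence(D: str, ell: int, score: Callable[[str], float],
                  cutoff: float) -> List[Tuple[int, str, float]]:
    """Return (1-based start i, subsequence R, score) for every window of
    length ell in D whose score is strictly greater than the cutoff."""
    hits = []
    n = len(D)
    for i in range(n - ell + 1):      # 0-based i covers i = 1..n-ell+1 of the text
        R = D[i:i + ell]
        s = score(R)
        if s > cutoff:
            hits.append((i + 1, R, s))
    return hits

# Hypothetical stand-in for a scoring schema: fraction of positions that
# match a fixed consensus string (used only to make the sketch runnable).
def toy_score(R: str, consensus: str = "TATA") -> float:
    return sum(a == b for a, b in zip(R, consensus)) / len(consensus)

hits = scan_sequence("GGTATACC", 4, toy_score, cutoff=0.9)
print(hits)
```

Any of the scoring schemas discussed below could in principle be passed as `score`; only the windowing and the cutoff test are fixed by the description in the text.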
The accuracy of the solution to this search problem depends on how the scoring schema is designed and how the position weight matrix is constructed. In this section we first discuss two existing scoring schemas that are employed for ranking known motifs and predicting TFBSs [24]; a new scoring schema is then presented.
Independent scoring schema
The first scoring schema is a conventional method and is employed in many papers [4, 8, 11, 16, 25, 26]. In this scoring schema, it is assumed that all positions in a given motif are completely independent. This scoring schema is defined as follows.
where each p(b) shows the occurrence probability of nucleotide b (independent of nucleotides in the other position) in a random sequence (obviously p(b) = 0.25 for every b ∈ N).
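As a point of reference, a conventional position-independent log-odds score with the uniform background p(b) = 0.25 can be sketched as below. The exact formula of the schema is not reproduced in this text, so the pseudocount value, the base-2 logarithm and the names `build_pwm` and `independent_score` are assumptions of this sketch rather than the paper's definition.

```python
import math

NUCLEOTIDES = "ACGT"

def build_pwm(sites, pseudocount=0.25):
    """Column-wise smoothed nucleotide frequencies of the aligned sites
    (the rows of the matrix B). The pseudocount is an assumed smoothing value."""
    t = len(sites)
    ell = len(sites[0])
    total = t + pseudocount * len(NUCLEOTIDES)
    pwm = []
    for j in range(ell):
        column = [s[j] for s in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total
                    for b in NUCLEOTIDES})
    return pwm

def independent_score(R, pwm, background=0.25):
    """Sum over positions of log2(frequency / background); every position
    contributes independently of the others."""
    return sum(math.log2(pwm[j][b] / background) for j, b in enumerate(R))

sites = ["TATA", "TATA", "TACA", "TATA"]   # toy TFBSs, for illustration only
pwm = build_pwm(sites)
print(independent_score("TATA", pwm), independent_score("GGGG", pwm))
```

A subsequence close to the aligned sites receives a positive score, while one that shares no bases with them receives a negative score, which is what makes a cutoff meaningful.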
Dependent scoring schema
The second scoring schema was first introduced in [24]. In this scoring schema, dependency between some positions in a given TFBS is assumed. This method uses a statistical approach to find dependent positions in a set of known TFBSs. Therefore, if the dependent positions of a set of TFBSs are available, then this scoring schema is defined as follows.
Similar to the previous definition, we have a promoter region D and t binding sites of the length ℓ which are represented by a matrix B_{t × ℓ}for a given TF. Also, assume that F ([b_{1},..., b_{ m }], [j_{1},..., j_{ m }]) shows the occurrences of bases b_{1},..., b_{ m }(b_{ i }∈ N for 1 ≤ i ≤ m) in dependent positions j_{1},..., j_{ m }in the matrix B (positions j_{1},..., j_{ m }are determined by statistical approaches [24]). As an example, F([A, C, A, T ],[3, 4, 8, 11]) represents the number of occurrences of A, C, A, and T in the positions 3, 4, 8, and 11 in a given matrix B. It should be noted that the positions j_{1},..., j_{ m }are dependent and not necessarily consecutive.
where a(b_{1},..., b_{ m }) is a smoothing parameter and can be calculated as follows:
a(b_{1},..., b_{ m }) = a(b_{1}) × ... × a(b_{ m }).
where k_{1} is the number of independent positions, k_{2} is the number of dependent positions order 2 (nucleotides at positions j_{ i }and j_{i+1}) and k_{ m }the number of dependent positions order m (nucleotides at positions j_{ i }, j_{i+1},..., j_{i+m-1}).
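The count F([b_{1},..., b_{m}], [j_{1},..., j_{m}]) defined above is straightforward to compute from B. The following sketch uses 1-based positions as in the text; the three rows of B are made-up sequences chosen only so that the example F([A, C, A, T], [3, 4, 8, 11]) has a non-trivial value.

```python
def F(bases, positions, B):
    """Number of rows of B that carry bases[k] at the 1-based column
    positions[k] for every k; positions need not be consecutive."""
    return sum(
        all(row[j - 1] == b for b, j in zip(bases, positions))
        for row in B
    )

# Hypothetical binding sites of length 12 (not taken from any database).
B = [
    "GGACTTTACCTG",
    "GGACTTTACCTG",
    "GGTCTTTACCTG",   # T instead of A at position 3, so it is not counted
]

print(F(["A", "C", "A", "T"], [3, 4, 8, 11], B))
```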
New scoring schema
In the previous subsections we presented two scoring schemas. In the first, the nucleotides at all positions of a given TFBS are considered independent, which may not hold in all cases, since dependencies between some positions have been shown to be important [19, 27]. In the second, dependencies between some positions in a TFBS are considered, but this model has two problems: first, calculating the dependency between positions is complicated, and second, the final score is obtained by summing the scores obtained for each order of dependent positions, which are not in the same range.
As mentioned, all positions in a TFBS may be dependent: because TFBSs are short, all positions may be involved in the interaction with the factor, and the dependency between all positions is important. TFBSs are short regions in a promoter region to which TFs bind to provide the initial conditions for gene transcription. Comparing the TFBSs of a specific TF with one another, we see that some positions are mutated and others are conserved. Since a TFBS is short, it seems that both mutated and conserved positions play an important role in the binding of the TF to the TFBS. During transcription, the TFBS region forms a structure through hydrogen bonds, and this attracts the TF to the region. Thus it seems that both the conserved and the mutated positions contribute to this attraction. Moreover, the average specific free energy of binding to all binding sites plays an important role in this attraction, and since this energy is directly related to the information content of the preferred binding sites [26], we use the information content for TFBS scoring.
and for this formula we have 0 ≤ JIC ≤ 4ℓ.
and from this formula we have 0 ≤ MI ≤ 2ℓ. The relation between MI and JIC for each pair of positions is as follows. If MI = 0 then JIC = 4 and consequently MI + JIC = 4; if MI = 2 then JIC = 2 and consequently MI + JIC = 4. This implies that JIC alone conveys less information, and that adding MI recovers more of it. MI thus carries meaningful information that cannot be discarded. On the other hand, IC = 2 means that conservation is low but dependency between positions is high.
where a(b_{1}, b_{2}) is a smoothing parameter and can be calculated as:
a(b_{1}, b_{2}) = a(b_{1}) × a(b_{2}).
where [b_{1}, b_{2}] ∈ (N × N), 1 ≤ j_{1}, j_{2} ≤ ℓ, and j_{1} ≠ j_{2}.
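The two boundary cases quoted above (MI = 0 with JIC = 4, and MI = 2 with JIC = 2) can be checked numerically for a single pair of positions. The sketch below assumes the textbook definitions, JIC as the relative entropy of the joint dinucleotide distribution against a uniform background of 1/16 and MI as the usual mutual information, both in bits and computed from raw frequencies without the smoothing parameter a(b_{1}, b_{2}); the paper's own formulas are not reproduced in this text, so this is an illustration rather than the exact schema.

```python
import math
from collections import Counter

def pair_jic_mi(col1, col2):
    """Joint information content and mutual information (in bits) of two
    aligned columns, using unsmoothed observed frequencies."""
    t = len(col1)
    joint = Counter(zip(col1, col2))
    marg1 = Counter(col1)
    marg2 = Counter(col2)
    jic = 0.0
    mi = 0.0
    for (b1, b2), count in joint.items():
        p = count / t
        jic += p * math.log2(p / (1 / 16))                          # vs uniform pairs
        mi += p * math.log2(p / ((marg1[b1] / t) * (marg2[b2] / t)))
    return jic, mi

# Both positions fully conserved: no dependency to measure, maximal JIC.
print(pair_jic_mi("AAAA", "CCCC"))
# Neither position conserved, but each base determines its partner.
print(pair_jic_mi("ACGT", "TGCA"))
```

In the second case neither column is conserved on its own, yet the pair is perfectly coupled; this is the situation in which MI contributes information that a per-position measure would miss.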
Results and discussion
In order to determine the distribution of TFs with dependent positions and verify that our scoring schema indeed improves the specificity of known motif discovery, we extract TFs from two public databases, JASPAR [28] and TRANSFAC [29]. For JASPAR, we select all TFs from the database and implant the TFBSs of these TFs in random sequences generated in a way similar to [24]. For TRANSFAC, we use the benchmark data sets generated by Sandve et al. [30] and Tompa et al. [31]. Sandve et al. generated three data set versions from TRANSFAC based on collections of binding site fragments ranked according to the optimal level of discrimination. These data sets are called 'algorithm-Markov', 'algorithm-real', and 'model-real'. Tompa et al. also generated three data set versions from TRANSFAC, which differ in the background sequences employed. These data sets are called 'Generic', 'MChain', and 'Real'. We therefore have seven data sets (JASPAR, algorithm-Markov, algorithm-real, model-real, Generic, MChain and Real), each of which contains some TFs (motifs), where each TF contains some TFBSs (motif instances).
We compare our new scoring schema with the two scoring schemas introduced in the Methods section on the above data sets for finding known motif instances. The comparisons are carried out at two levels, comparison of sites (site level) and comparison of nucleotides (nucleotide level), with respect to the positions of the motifs in the main sequences. To this end, we first introduce the following criteria for comparison [31].
- 1. nTP is the number of nucleotide positions in both the known sites and the predicted sites.
- 2. nFP is the number of nucleotide positions not in the known sites but in the predicted sites.
- 3. nFN is the number of nucleotide positions in the known sites but not in the predicted sites.
- 4. nTN is the number of nucleotide positions in neither the known sites nor the predicted sites.
- 5. sTP is the number of known sites overlapped by the predicted sites.
- 6. sFP is the number of predicted sites not overlapped by the known sites.
- 7. sFN is the number of known sites not overlapped by the predicted sites.
- 1.Nucleotide Performance Coefficient (nPC): Following Pevzner and Sze [25], nPC is defined in the nucleotide level of the predicted sites and is equal to
- 2.Nucleotide Correlation Coefficient (nCC): Following Burset and Guigo [32], nCC is defined in the nucleotide level as
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
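The two nucleotide-level measures named above can be computed from the counts as sketched below. The formulas are not reproduced in this text, so the sketch assumes the standard definitions used in motif-finder assessments: nPC = nTP / (nTP + nFN + nFP) and nCC as the Pearson (Matthews) correlation between known and predicted nucleotide labels. Returning 0 for a zero denominator is a convention chosen for this example.

```python
import math

def nPC(nTP, nFP, nFN):
    """Nucleotide performance coefficient (assumed standard definition)."""
    denom = nTP + nFN + nFP
    return nTP / denom if denom else 0.0

def nCC(nTP, nFP, nFN, nTN):
    """Nucleotide correlation coefficient (assumed standard definition)."""
    denom = math.sqrt((nTP + nFN) * (nTN + nFP) * (nTP + nFP) * (nTN + nFN))
    return (nTP * nTN - nFN * nFP) / denom if denom else 0.0

# Toy counts for one motif: 8 correctly predicted positions, 2 spurious, 2 missed.
print(nPC(nTP=8, nFP=2, nFN=2))
print(nCC(nTP=8, nFP=2, nFN=2, nTN=88))
```

A perfect prediction gives nPC = 1 and nCC = 1, whereas nCC also accounts for the true negatives and can become negative for predictions that are worse than random.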
- 1. Average: For each program, the measurement M is calculated on each data set and then the usual arithmetic mean of the measurement M is evaluated for each program.
- 2. Combined: Adding up nTP, nFP, nFN, nTN, sTP, sFP and sFN over all data sets, the measurement M is computed for all data sets, which are considered as one large data set.
- 3. Normalized: For each motif, the measurement M is normalized by subtracting the mean and dividing by the standard deviation over all the programs on that motif, and the average of these normalized scores over all motifs is obtained. This method puts easy and hard motifs on the same scale.
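The three ways of summarizing a measurement M can be sketched as follows. The helper names and the dictionary layout are assumptions of this example, nPC (in its assumed standard form) stands in for M, and the population standard deviation is used for the normalization because the text does not say which estimator is meant.

```python
from statistics import mean, pstdev

def n_pc(c):
    # Stand-in for a measurement M computed from a dictionary of counts.
    return c["nTP"] / (c["nTP"] + c["nFN"] + c["nFP"])

def average(counts_per_set, measure):
    """Average: compute M on each data set, then take the arithmetic mean."""
    return mean(measure(c) for c in counts_per_set)

def combined(counts_per_set, measure):
    """Combined: add up the counts over all data sets, then compute M once."""
    total = {k: sum(c[k] for c in counts_per_set) for k in counts_per_set[0]}
    return measure(total)

def normalized(scores_per_motif):
    """Normalized: z-score M over the programs on each motif, then average
    the z-scores of each program over all motifs."""
    programs = list(scores_per_motif[0])
    z = {p: [] for p in programs}
    for scores in scores_per_motif:
        values = list(scores.values())
        mu, sd = mean(values), pstdev(values)
        for p in programs:
            z[p].append((scores[p] - mu) / sd if sd else 0.0)
    return {p: mean(v) for p, v in z.items()}

counts = [{"nTP": 8, "nFP": 2, "nFN": 2}, {"nTP": 1, "nFP": 9, "nFN": 9}]
print(average(counts, n_pc), combined(counts, n_pc))
print(normalized([{"ours": 0.8, "indep": 0.6}, {"ours": 0.5, "indep": 0.3}]))
```

Average and Combined generally differ, because Combined weights each data set by its size while Average weights all data sets equally.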
For finding TFBSs in the data sets generated from the above-mentioned databases, we use three test methods. In the first test method, for each TF we have t known TFBSs of length ℓ implanted in t sequences of length n. We first calculate the motif matrix B and the corresponding PWM. We then scan the t sequences with the PWM, calculate the score of every subsequence under each of the three scoring schemas (the independent scoring schema, the dependent scoring schema, and our scoring schema), and report the subsequences whose score is above a predefined cutoff (a value in the range [0, 1]) as motif instances or predicted TFBSs. This process is repeated for all TFs in each data set. The cutoff is chosen to give the best nCC for each TF. Roughly, one way to compute the cutoff is to fix an initial (maximal) value and select all subsequences of length ℓ from the t sequences with a score above this cutoff as motif instances. Using the known actual positions of the binding sites of each TF, the nCC value of these predicted TFBSs is computed. We then decrease the cutoff, predict the TFBSs again, and calculate the corresponding nCC. If nCC increases, we repeat the process with a smaller cutoff, until we reach a cutoff at which nCC decreases. The cutoff preceding this decrease is selected as the final cutoff. Note that if the cutoff is small, TP and FP are large and TN and FN are small, whereas if the cutoff is large, TP and FP are small and TN and FN are large. We therefore choose the cutoff that maximizes the calculated nCC.
In the second test method we use the jackknife method. Again, for each TF we have t known TFBSs of length ℓ implanted in t sequences of length n. We first leave out the j-th TFBS of this set, and calculate the motif matrix B and the corresponding PWM from the t - 1 remaining TFBSs. Then, based on the scoring schema under test, we scan the j-th sequence with the PWM to find the subsequence with the maximum score, which is taken as the predicted TFBS or motif instance. We repeat this process for each j (1 ≤ j ≤ t). Finally, the accuracy of the methods is investigated on all TFs in each data set. In the third test method, we use the method introduced in [24]. We have m = t + q sequences of length n and, for each TF, t TFBSs of length ℓ implanted into t of the m sequences, so that q sequences contain no motif. Using the cutoff calculated in the first test method for each TF, we try to find motif instances in the m sequences with the PWM of the t known TFBSs. The accuracy of known motif prediction is investigated in all three test methods for the predicted motifs.
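The cutoff selection used in the first test method (start from a maximal cutoff, lower it while nCC keeps improving, and keep the cutoff that precedes the first decrease) can be sketched as a simple loop. The step size, the treatment of ties as "no decrease" and the callable `ncc_at`, which stands for "predict TFBSs at this cutoff and compute nCC", are assumptions of this sketch.

```python
def choose_cutoff(ncc_at, start=1.0, step=0.05, floor=0.0):
    """Lower the cutoff from `start` in increments of `step` while nCC does
    not decrease; return (cutoff, nCC) from just before the first decrease."""
    best_cutoff = start
    best_ncc = ncc_at(start)
    k = 1
    while start - k * step >= floor:
        cutoff = start - k * step
        value = ncc_at(cutoff)
        if value < best_ncc:          # nCC decreased: keep the previous cutoff
            break
        best_cutoff, best_ncc = cutoff, value
        k += 1
    return best_cutoff, best_ncc

# Toy nCC curve with a single peak at cutoff 0.7 (illustration only).
def toy_ncc(cutoff):
    return 1.0 - abs(cutoff - 0.7)

cutoff, ncc = choose_cutoff(toy_ncc)
print(round(cutoff, 2), round(ncc, 2))
```

Because the search stops at the first decrease, it returns the first local maximum encountered from above; this coincides with the global maximum of nCC only when the curve has a single peak.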
Finally, we perform a statistical analysis of the nCC measurements of the motifs in each data set (JASPAR, algorithm-Markov, algorithm-real, model-real, Generic, MChain and Real), which we describe in the following. Let P_{i, j, k} be the set of nCC values obtained from the j th test method (1 ≤ j ≤ 3) by the k th scoring schema (1 ≤ k ≤ 3) on the i th data set (1 ≤ i ≤ 7), where k = 1, k = 2 and k = 3 stand for the independent, the dependent and our scoring schema, respectively. Also, let μ_{i, j, k} denote the average of the values in P_{i, j, k}. We first check whether the values in P_{i, j, k} follow a normal distribution, using the Kolmogorov-Smirnov (K-S) test, which indicates whether the data deviate significantly from normality. If the p-value is greater than 0.05, the null hypothesis (H_{0}) that the data come from a normal distribution is not rejected. We also compare the nCC values of our scoring schema with those of the two other scoring schemas (independent and dependent) by comparing the corresponding mean values pairwise: (μ_{i, j, 3}, μ_{i, j, 1}) and (μ_{i, j, 3}, μ_{i, j, 2}) for each 1 ≤ i ≤ 7 and 1 ≤ j ≤ 3. To compare μ_{i, j, 3} and μ_{i, j, k} (k = 1 or k = 2), we use the paired t-test if P_{i, j, 3} and P_{i, j, k} are normally distributed; otherwise the Wilcoxon signed-ranks test is applied. This analysis estimates whether the two averages differ significantly. The results of our statistical analysis are shown in the next subsections.
JASPAR database
As mentioned, for extracting the data from JASPAR, all 107 TFs are selected from this database. Let t_{i} denote the number of TFBSs of the i th TF, 1 ≤ i ≤ 107. We implant the TFBSs of each TF in random sequences taken from supplementary file No. 8 of [24]. There are 1800 of these random sequences; they have lengths of 250 to 500 and are sampled from a third-order Markov model background distribution. We generate our data set as follows. Suppose the i th TF in JASPAR consists of t_{i} TFBSs. We randomly select t_{i} sequences from the 1800 background sequences and implant these TFBSs in the t_{i} sequences at random positions. We repeat this process for all TFs in the JASPAR database, which yields 107 sets. Let S_{i} be the set of t_{i} sequences in which the t_{i} known TFBSs are implanted. The position dependencies used in this paper for evaluating the dependent scoring schema are the values given in [24], which are obtained by a statistical approach with respect to their structures. The three test methods are applied to the generated data set as follows. In the first test method, we use an ordinary search method. First, for the i th TF, the corresponding position weight matrix is constructed from its known TFBSs. Next, each subsequence R (|R| = ℓ) of S_{i} is aligned to the constructed PWM, and Score_{j}(R) and NScore_{j}(R) (1 ≤ j ≤ 3) are computed. Finally, the subsequences with a score above the cutoff are considered motif instances or predicted TFBSs.
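The construction of one set S_{i} described above (draw t_{i} background sequences at random and implant one TFBS in each at a random position) can be sketched as follows. Whether implanting overwrites background bases or inserts extra ones is not stated in the text; this sketch overwrites, so that sequence lengths are preserved, and that choice as well as the function name `implant_sites` and the toy inputs are assumptions.

```python
import random

def implant_sites(backgrounds, sites, seed=0):
    """Pick len(sites) distinct background sequences at random and overwrite
    a randomly placed window in each with one site.
    Returns a list of (sequence, 1-based start position of the site)."""
    rng = random.Random(seed)
    chosen = rng.sample(backgrounds, len(sites))
    data = []
    for seq, site in zip(chosen, sites):
        pos = rng.randrange(len(seq) - len(site) + 1)      # 0-based start
        planted = seq[:pos] + site + seq[pos + len(site):]
        data.append((planted, pos + 1))
    return data

# Toy inputs: three uniform "background" strings and two copies of a site.
backgrounds = ["A" * 30, "C" * 30, "G" * 30]
S_i = implant_sites(backgrounds, ["TTTT", "TTTT"])
for seq, start in S_i:
    print(start, seq)
```

Keeping the start position alongside each sequence provides the known site locations that the nucleotide-level and site-level measures require.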
We have also performed a statistical analysis of the 107 nCC values. Since the sets P_{1, 1, k}, 1 ≤ k ≤ 3, do not follow a normal distribution, we use the Wilcoxon signed-ranks test on (μ_{1, 1, 3}, μ_{1, 1, 1}) and (μ_{1, 1, 3}, μ_{1, 1, 2}). The calculated p-values indicate that μ_{1, 1, 3} ≥ μ_{1, 1, 1} (p-value = 1) and μ_{1, 1, 3} ≥ μ_{1, 1, 2} (p-value = 0.8686).
Sandve's Benchmark
As mentioned, the data sets used for testing and comparing the three scoring schemas on the TRANSFAC database are the 'algorithm-Markov', 'algorithm-real', and 'model-real' benchmark data sets generated by Sandve et al. [30]. As described in [30], these data sets are created by extracting sets of binding site fragments of the same length for 213 different TF matrices. A binding site fragment is the binding site region that is used in the construction of a matrix in the TRANSFAC alignment. All three data set versions, 'algorithm-Markov', 'algorithm-real', and 'model-real', are constructed from the same fragment sets, and the selection of data sets is based on nCC. For the 'algorithm-real' version, binding sites are kept in their original genomic sequence, which is truncated to a maximum length of 2000 bp. To make the data sets more coherent, binding site fragments that contain degenerate bases, have gaps in the TRANSFAC alignment, or are not located within the 2000 bp upstream of the transcription start site in the sequence linked to by TRANSFAC are removed. Additionally, the selected motifs have an nCC value higher than 0.79. For the 'algorithm-Markov' version, binding sites are implanted in sequences generated from a third-order Markov model inferred from all sequences of the corresponding real data set. In addition, the selected motifs have an nCC value higher than 0.87. Both the lengths of the 'algorithm-Markov' sequences and the positions of the implanted binding sites are kept equal to those of the corresponding real sequences. Motifs with fewer than five binding sites are removed, and 50 motifs (each with some motif instances or TFBSs) are kept for 'algorithm-real' and 50 motifs for 'algorithm-Markov'. For the 'model-real' version, 25 motifs with nCC below 0.72 are selected. Each of these motifs has at least 18 motif instances (binding sites), which are kept in their original genomic sequences. It should be noted that in each motif, the motif instances are similar to the background sequences (nCC ≤ 0.72).
We have also performed a statistical analysis of the 50 nCC values for the first and second test methods. Since P_{2, 1, 1} and P_{2, 1, 3} do not follow a normal distribution, and P_{2, 2, 1} and P_{2, 2, 3} do not follow a normal distribution either, we use the Wilcoxon signed-ranks test on (μ_{2, 1, 3}, μ_{2, 1, 1}) and (μ_{2, 2, 3}, μ_{2, 2, 1}). The calculated p-values indicate that μ_{2, 1, 3} ≥ μ_{2, 1, 1} (p-value = 1) and μ_{2, 2, 3} ≥ μ_{2, 2, 1} (p-value = 0.5316).
As we can see, in the jackknife test method our scoring schema and the independent scoring schema perform similarly, but our scoring schema performs better when the motif instances are planted in the data sets and the search method is based on a cutoff.
Tompa's benchmark
As mentioned, the other data sets used for testing and comparing the two mentioned scoring schemas on the TRANSFAC database are the 'Generic', 'MChain', and 'Real' data sets generated by Tompa et al. [31]. Recall from [31] that the data set 'Real' is created by implanting the selected TFBSs of TRANSFAC into real promoter sequences as a background, the data set 'Generic' is created by implanting the selected TFBSs of TRANSFAC into randomly chosen promoter sequences from the same genome, and the data set 'MChain' is created by implanting the selected TFBSs of TRANSFAC into sequences generated by a Markov chain of order 3. The implanted TFBSs do not have the same length in all three data set types. The TFBSs are selected from TRANSFAC by the following process. Initially, only TFs for which TRANSFAC also lists a binding site consensus sequence are selected. For each factor, the following are removed: duplicate instances of the same binding site, binding sites missing sequence or position information, binding sites whose position is annotated as start site, binding sites whose position is less than -3000 or greater than 0, and sequences with two reported binding sites contradicting each other in the sequence. The remaining binding sites are implanted into the three types of background sequences. In addition, some sequences without motifs are inserted in each data set, and consequently 52 motifs of each type are obtained. Since each data set contains motifs of differing lengths, and some sequences in each data set do not contain any motif, we omit some sequences from the data sets, and finally, for each data set type, we choose 16 motifs that have at least 4 motif instances of the same length. It should be noted that the PWM is built from these data but the test is performed on all members of these data sets.
Conclusion
In this work, we investigate the dependencies within transcription factor binding sites, and present a simple way of modeling these dependencies. We have developed a new scoring schema for known binding site prediction. In this scoring schema, the joint information content and mutual information are used as measures of dependency between positions in a TFBS. We have evaluated different aspects of the scoring schema, and the method is implemented and tested on real data sets. The results are compared with two well-known scoring schemas. For the comparison, several statistical measurements are considered, which show that our scoring schema can improve motif prediction.
To investigate the improvement of our scoring schema on sTP, we calculated the sTP of the three scoring schemas on all seven tested data sets. On all tested data sets, our scoring schema shows an average 21% improvement in sTP over the independent scoring schema and an 11% improvement over the dependent scoring schema.
The nTP, nFN, and nFP values obtained by our scoring and independent scoring schema on REB1, ROX1, UASH and URS1.
Scoring Schema | TF | nTP | nFN | nFP |
---|---|---|---|---|
Our scoring | REB1 | 140 | 64 | 16590 |
Independent scoring | REB1 | 140 | 64 | 7264880 |
Our scoring | ROX1 | 96 | 8 | 1632 |
Independent scoring | ROX1 | 96 | 8 | 5067288 |
Our scoring | UASH | 270 | 66 | 2130 |
Independent scoring | UASH | 270 | 66 | 347040 |
Our scoring | URS1 | 182 | 0 | 1534 |
Independent scoring | URS1 | 182 | 0 | 115427 |
In general, the obtained results on the biological data sets demonstrated that the joint information content and mutual information provide a better and more general criterion to investigate the relationship between positions in the TFBS, and motif detection can be improved with the scoring schema that considers dependency in TFBSs.
Declarations
Acknowledgements
The authors would like to thank the anonymous referees for their helpful and valuable suggestions.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.