Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data
© Dabrowski et al.; licensee BioMed Central. 2015
Received: 26 August 2014
Accepted: 14 April 2015
Published: 1 May 2015
For many years now, binding preferences of transcription factors (TFs) have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest, and there is little published information regarding the optimality of different choices. Thanks to the availability of a large number of different motif models, as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.
We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TFs coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate.
We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.
Transcription regulation is one of the key processes that allow cells to react to environmental cues and differentiate. It would not be possible without the specific interactions between DNA-binding proteins called transcription factors (TFs) and specific target regulatory elements.
While TFs exhibit some specificity in choosing their target binding sites, this specificity is imperfect, i.e. instead of binding a single specific DNA sequence, a typical TF recognizes a number of similar DNA fragments with varying affinity. A pattern describing these fragments is usually called a binding motif, and it has been defined mathematically with position weight matrices (PWMs) for almost 30 years now. Importantly, the PWM model makes a number of simplifying assumptions to remain useful in practice: in particular, it assumes independence between columns and additivity of the column scores. The cost of this simplification is offset by the great advantage of being able to easily score any DNA sequence of the specified length with a log-odds measure, usually interpreted as a rough analog of the free energy of TF-DNA binding.
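To make the additive log-odds scoring concrete, the following minimal sketch builds a log-odds matrix from per-position nucleotide counts and scores every window of a sequence. The function names, the pseudocount of 1 and the uniform background are illustrative assumptions and do not describe any of the scanners evaluated in this study.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background=None, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores (one dict per column)."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform background
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pwm

def score(pwm, site):
    """Additive score: columns are treated as independent, so scores are summed."""
    return sum(column[base] for column, base in zip(pwm, site))

def scan(pwm, sequence):
    """Score every window of motif width along one strand of the sequence."""
    width = len(pwm)
    return [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

# Hypothetical three-column motif built from eight aligned sites, all reading ACG.
counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
]
pwm = log_odds_matrix(counts)
hits = scan(pwm, "TTACGTT")
```

A real scanner would also score the reverse complement strand and handle ambiguous bases; both are omitted here for brevity.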
There have been a number of suggested extensions of the PWM model, including Bayesian Markovian models and so-called PBM-motifs, which use mixtures of simpler models to alleviate the problems associated with the simplifying assumptions of the PWM model. More recently, approaches have also been published that combine the traditional PWM models with additional constraints on the physical properties of the DNA strand to be bound by a transcription factor. However, while these approaches can be useful in a few situations, it has been shown that in the vast majority of real biological cases the benefits of the simple PWM model outweigh the potential of the more complex models to give a slightly better description of the binding site.
These data, together with the fact that PWM models have been used for three decades in practically unchanged form, attest to their tremendous applicability. However, in order to use such a model to determine potential binding sites of a TF with a known motif, one needs to choose a log-odds threshold that separates scores high enough to facilitate a binding event from those of non-specific sequences. Choosing such a threshold value might be guided by a natural interpretation of the log-odds score. However, the authors of the early motif scanning tool PATSER already realized that, due to the dependency of the log-odds distribution on the information content (IC) of a motif, the log-odds thresholds for different motifs should be different. Later, Rahmann et al. showed that there are more meaningful ways of selecting the TF binding threshold, controlling for type I errors, type II errors, or a certain balance between them.
At the time, the number of TF motifs was limited and there were very few large-scale datasets to serve as a gold standard of TF binding. Since then, the community has accumulated hundreds of TF motifs in several databases [8-16], and large-scale ChIP-seq projects such as ENCODE have provided us with dozens of ChIP-seq datasets for different TFs. This wealth of data now allows us to revisit the different ways of selecting the threshold and to put them in the context of different motif databases, with respect to their coverage and accuracy, in order to find the optimal choices for practical applications.
Comparison of motif scanners
Comparison of the performance of motif databases should preferably be performed with the same motif scanning program, to separate the effect of the database from the effects of different scanning programs. However, the two commercial motif databases, Transfac (Biobase) and MatBase (Genomatix), are each provided with a dedicated scanning program, Match and MatInspector, respectively, which use proprietary threshold files. Therefore, as a preliminary step, we compared the performance of these dedicated programs to the performance of two motif scanners available in the public domain: matrix-scan and Bio.Motif [21,22]. Both public scanners use background model-derived thresholds.
As comparison metrics we used specificity and sensitivity, with ChIP-seq peaks for human TFs from Ensembl v.60 funcgen as the positive sets, and either third exons or genomic flanks of ChIP-seq peaks as the negative sets. The specificity and sensitivity were computed for every TF-motif pair, i.e. a pair of a TF and a related TFBS motif, and then averaged for a given database and scanning program.
Each commercial scanner was compared to the two public scanners using all motifs from its respective proprietary motif database. That is, we compared MatInspector to matrix-scan and Bio.Motif using 210 motifs from MatBase (v.8.3), representing 37 funcgen TFs, while Match was compared to the same two scanners using 106 motifs from Transfac (2010.3), representing 33 funcgen TFs. In this way, each commercial scanner could use its proprietary threshold file. We also repeated the analysis on 32 funcgen TFs common to both motif databases, which gave nearly identical results (data not shown). For MatInspector/MatBase, in addition to individual motifs (matrices), we also used Genomatix-defined motif families (matrix families).
The results for individual motifs when using exons as the negative set were straightforward: the two public scanners achieved higher sensitivity than the commercial scanners, while maintaining the same uniformly high specificity (Figure 1A, B). With flanks of ChIP-seq peaks as the negative sets, the situation was similar for MatInspector used with individual motifs (Figure 1C), whereas for Match, as compared to the public scanners, there was some trade-off between specificity and sensitivity (Figure 1D), with balanced accuracy (BA) higher for the two public scanners. On both types of negative sets, the use of motif families resulted in a large increase of sensitivity and balanced accuracy (Figure 1A,C), accompanied by a marked decrease of specificity.
We conclude that on individual motifs the performance of the two public scanners, as measured by balanced accuracy, was clearly better than that of either of the two commercial scanners. When the proprietary motif families were used, MatInspector achieved nearly the same balanced accuracy as the public scanners. However, the use of motif families resulted in a large decrease of specificity, which we consider undesirable in whole-genome applications.
Significance of differences between balanced accuracies for public and commercial motif scanners
Match vs matrix-scan
Match vs Bio.Motif
MatInspector vs matrix-scan
MatInspector vs Bio.Motif
MatBase – families
MatInspector vs matrix-scan
MatInspector vs Bio.Motif
Comparison of databases coverage
Motif databases compared in the current study
HOCOMOCO v.9 (2013)
Jaspar vertebrates (2014)
TRANSFAC Professional 2013.1
MatBase v.9.0 (2012)
Comparison of databases quality
Every motif matrix defines a log-likelihood function that discriminates between true and false binding sites. Motif scanners use matrices supplied with thresholds separating positive and negative predictions.
In order to characterize the overall quality of motif matrices, we analyzed their behavior for the whole spectrum of possible threshold values. To this aim we generated their receiver operating characteristic (ROC) curves and calculated the area under the curve (AUC) for each ROC. The expected AUC value for a random scoring function is 0.5, and the AUC for a function perfectly discriminating true and false predictions is 1.
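The ROC analysis can be illustrated with a minimal sketch. It assumes that every positive and negative sequence has already been reduced to a single score (for example, the best log-odds score of a motif within that sequence) and that a sequence is predicted as bound when its score is at least the threshold. Both conventions and all function names are assumptions made for illustration, not a description of the scripts used in this study.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold from high to low."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical per-sequence scores for ChIP-seq peaks and negative sequences.
peak_scores = [5.0, 4.0, 2.0]
negative_scores = [3.0, 1.0, 0.0]
area = auc(roc_points(peak_scores, negative_scores))
```

With this definition, scores that separate the two sets perfectly give an area of 1, and identical score lists give 0.5.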
We computed ROC and AUC for all motifs in all databases with respect to 4 datasets of negative sequences: third exons, genomic flanks of ChIP-seq peaks, random sequences with dinucleotide composition the same as in ChIP-seq peaks, and sequences generated by 3rd order Markov chains learned on peak sequences (for details see Methods section). Below we focus our analysis on AUC statistics; full results, including all ROC curves, are presented in Supplementary Materials at the authors’ website http://bioputer.mimuw.edu.pl/papers/motifs/.
However, if one focuses on comparing AUC distributions between databases, all negative datasets yield a similar picture. In all cases the highest and the lowest AUC belong to Hocomoco and MatBase motifs, respectively. The union of public databases always performs visibly better than both MatBase and Transfac. Choosing the best transcription factor for a motif favors databases having one motif for each TF over databases offering a large family of motifs for each TF. The second strategy is preferred by commercial databases (e.g. in the case of MatBase 63 TFs are represented by 377 motifs), which may to some degree explain their performance in this evaluation.
Comparison of optimal motifs for TFs common to MatBase, Transfac and the union of public databases
Selection of log-odds thresholds
As was mentioned previously, motif matrices define log-likelihood functions that should be supplied with thresholds separating positive and negative predictions. Selection of such a threshold determines the balance between prediction sensitivity and specificity. Usually, a reasonable solution is to maximize balanced accuracy, i.e. the mean of sensitivity and specificity.
Therefore we calculated maximal balanced accuracies and corresponding thresholds for all transcription factors represented in our benchmark dataset and related motifs. Results are presented in Additional file 2.
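The search for a threshold maximizing balanced accuracy can be sketched as follows. As an illustrative assumption, each sequence is represented by its best motif score and is predicted as bound when that score is at least the threshold; the function name and the tie-breaking rule (the lowest of several equally good thresholds is kept) are hypothetical.

```python
def best_threshold(pos_scores, neg_scores):
    """Return (threshold, balanced accuracy) maximizing BA over observed scores."""
    best_t, best_ba = None, -1.0
    for t in sorted(set(pos_scores) | set(neg_scores)):
        sensitivity = sum(s >= t for s in pos_scores) / len(pos_scores)
        specificity = sum(s < t for s in neg_scores) / len(neg_scores)
        ba = (sensitivity + specificity) / 2.0
        if ba > best_ba:  # strict comparison keeps the lowest optimal threshold
            best_t, best_ba = t, ba
    return best_t, best_ba

# Hypothetical per-sequence scores for one TF-motif pair.
threshold, accuracy = best_threshold([5.0, 4.0, 2.0], [3.0, 1.0, 0.0])
```

Only observed scores need to be tried as thresholds, because sensitivity and specificity change only when the threshold crosses one of them.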
We also decided to examine generic threshold selection approaches, i.e. methods setting the threshold on the basis of the motif matrix only. They have considerable advantages – simplicity (calculations are much easier) and wide applicability (no benchmark dataset is required).
Probably the simplest generic approach is to set a common log-likelihood threshold for all motifs. Unfortunately, log-likelihood distributions substantially vary across motifs and the same threshold value may result in underestimating occurrences for one motif and overestimating them for the other. Therefore several score distribution based approaches for threshold selection were proposed. In the current study we analyze 3 representative methods for threshold selection implemented in the Bio.Motif package: FPR, FNR and balanced (see 3 for details).
It should be noted that all these approaches (including pure log-likelihood score) are parameterized and in each case the parameter enables the user to select any sensitivity-specificity configuration obtainable for a given motif. In particular, each method allows optimizing balanced accuracy for individual motifs. Therefore the point in which the approaches differ lies in the ability to select the threshold consistently across motifs.
Average balanced accuracy for optimal threshold parameters
Column groups: Average BA for parameter (including parameters fitted with linear model); Optimal constant parameter for negative dataset (including 3rd order MCs)
Therefore we selected several characteristics of the log-likelihood distribution under the background model (i.e. the theoretical distribution of the log-likelihood function of sequences randomly generated from the background model). Namely, we considered the expected value (which equals the motif IC), standard deviation, minimum and maximum. We fitted a linear regression model describing optimal threshold parameters by these characteristics.
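The four characteristics can be computed directly from a log-odds matrix, because under a position-independent background the column scores are independent and their means, variances, minima and maxima simply add up. The sketch below assumes a uniform zero-order background and a matrix stored as one dictionary per column; both are illustrative assumptions, and the regression step itself is not shown.

```python
import math

BASES = "ACGT"

def background_score_features(pwm, background=None):
    """Mean, SD, minimum and maximum of the motif score under the background."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    mean = variance = lowest = highest = 0.0
    for column in pwm:
        column_mean = sum(background[b] * column[b] for b in BASES)
        mean += column_mean
        # Independence between columns lets the variances be summed.
        variance += sum(background[b] * (column[b] - column_mean) ** 2
                        for b in BASES)
        lowest += min(column[b] for b in BASES)
        highest += max(column[b] for b in BASES)
    return {"mean": mean, "sd": math.sqrt(variance),
            "min": lowest, "max": highest}

# Hypothetical two-column log-odds matrix.
example_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -2.0, "C": 2.0, "G": 0.0, "T": 0.0},
]
features = background_score_features(example_pwm)
```

The resulting feature vector could then serve as the predictor in an ordinary least-squares fit of the optimal threshold parameter, one observation per motif.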
The averages of BA (calculated with respect to negative sequences composed of flanks of ChIP-seq peaks) for constant parameters and for parameters predicted from motif features are compared in Table 4. As expected, the linear model visibly improved average BAs for methods with worse results for constant parameters, narrowing the range of average BAs to the interval [0.6641;0.6757]. However, the average BA of 0.6749 for the constant FPR threshold was not reached by any of the other methods. To put these differences in perspective, the reader should be aware of the distance between these values and the upper bound of the average BA for the considered motifs. The ideal method would select for each motif a threshold yielding its optimal balanced accuracy. Consequently, it would reach a result equal to the average of BA maxima, which is 0.6949 for our dataset. Thus setting individual thresholds for each motif may improve the average BA by 0.02, when compared to thresholds globally set with the FPR parameter α=0.0003.
Estimated optimal parameters obviously depend on the choice of the negative dataset (see the last 4 columns of Table 4), but the dispersions are relatively small: the log-likelihood range is less than 2, and the ratios between the extremes for other parameters (the logarithmic scale is more suitable here) are ∼3. Moreover, the variability of optimal parameters is consistently explained by the characteristics of negative datasets: the lower the AUC values for a dataset, the sharper the parameter.
Discussion and conclusion
We have aimed to analyze the coverage and accuracy of different motif databases as well as the optimality of different approaches to motif scanning.
As one of the key advantages of the public resources is the possibility of freely combining them, we included into our analysis as an important category the union of the public databases.
With respect to the coverage of human TFs, the two commercial databases maintain their lead as compared to each of the public databases separately (Figure 2A). However, the union of the public databases achieves coverage comparable to that of the commercial databases (Figure 2B).
With respect to motif performance as judged by the distributions of the AUC, all the compared databases produce broadly similar results, with the median AUC higher for the union of the public databases than for either commercial one (Figure 3). At the single database level, Hocomoco was consistently the best performer for all choices of the negative datasets. Relative performance of some of the databases, in particular Jaspar and SwissRegulon, depended on the choice of the negative dataset. Jaspar performed better on peaks’ flanks and negative datasets generated from Markov models, while SwissRegulon performed better on third exons.
As another measure of motifs’ performance, we identified which database supplied the best-performing motif (Table 3) for every TF represented across all the databases. In more than half of the cases, the best-performing motif originated from the public databases. At the single database level, the largest number of best motifs was supplied by MatBase, but with Jaspar and Hocomoco at the second and third position supplying together more best motifs than MatBase.
In conclusion, public databases together match the coverage and the quality of their commercial counterparts, without the costs and usage limits imposed by the latter. On the other hand, the overhead associated with combining several public resources (unifying formats, installing updates, etc.) and differences in content other than the PWMs make the choice between public and commercial databases not always obvious.
Given the sizeable number of TF motifs available in only one of the databases, a researcher who wants to predict binding of a particular TF needs a guide that helps to find the most appropriate public database. Our findings provide such a guide for a number of human TFs supported by reliable experimental data. Moreover, the usability of such guides may increase in the near future, when one can expect that the contents of several public databases will be available from a single server.
In terms of the performance of different scanning tools, publicly available mature software packages have exceeded the commercial tools in accuracy. While this might be at least in part a result of the commercial providers' reluctance to change the behavior of the tools their customers depend on, we would advise researchers to use the publicly available tools.
With respect to the choice of the optimal threshold for multiple motifs, the method based on controlling the false positive rate is clearly the one least biased by the information content of the motif and therefore the most consistent between motifs.
In conclusion, the recent increase in availability of both sequence motifs and binding data has given us the opportunity to assess different motif databases and scanning methods for predicting potential TF binding sites. The results allow us to give some particular recommendations (such as the choice of the best motif and corresponding optimal threshold for a given TF) as well as general conclusions (superiority of the FPR measure and public scanners) for users of these databases. We believe that our findings will also prove useful for hybrid methods [24,25], which use other data to improve motif-based predictions of binding sites.
Parsing and annotation of motif databases
Database distribution files downloaded from their providers' websites (Table 1) were parsed using custom-made scripts to yield matrices in a common format, as well as the annotation of matrices to transcription factors. For the public databases, the available identifiers were the TF gene symbols or UniProt identifiers. For the purpose of comparison and mapping to the ChIP-seq datasets, these identifiers were mapped to human Entrez Gene IDs, using the db2db function of the bioDBNet webservice (http://biodbnet.abcc.ncifcrf.gov/webServices/bioDBnet.wsdl). For the commercial databases the Entrez Gene IDs were directly available. Genomatix uses the concept of a matrix family, here referred to as ‘motif family’, i.e. a family of motifs so similar that they are predicted to bind a common set of TF orthologs in related species (e.g. all vertebrates). In Additional file 3 we provide a comprehensive mapping table for each motif database, for the commercial databases limited to the motifs used in this work.
Preparation of positive and negative TF-binding datasets
all human third exons, excluding exons with UTRs (common for all TFs),
sequences flanking ChIP-seq peaks for particular TFs (each peak shifted by its length + 40 nt along the chromosome),
random sequences of length and dinucleotide composition following ChIP-seq peaks for particular TFs, generated by BiasAway,
random sequences of length following ChIP-seq peaks for particular TFs, generated by 3rd order Markov chains learned on these peaks.
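The last type of negative set in the list above can be illustrated with a minimal sketch of a k-th order Markov chain that is trained on a set of sequences and then used to generate a random sequence of a requested length. The function names, the choice of the starting context and the uniform fallback for unseen contexts are assumptions made for illustration; they are not taken from the scripts used in this study.

```python
import random
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts

def generate(counts, length, order=3, rng=random):
    """Generate one sequence; requires at least one trained context."""
    out = list(rng.choice(list(counts)))  # assumed start: a random seen context
    while len(out) < length:
        followers = counts.get("".join(out[-order:]))
        if followers:
            bases, weights = zip(*followers.items())
        else:
            bases, weights = "ACGT", [1, 1, 1, 1]  # assumed fallback: uniform
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

# Hypothetical peak sequences; real peaks would be hundreds of bases long.
model = train_markov(["ACGTACGTACGT", "ACGGACGTTACG"])
negative = generate(model, 50, rng=random.Random(0))
```

To follow the length distribution of the peaks, one sequence would be generated per peak with `length` set to that peak's length.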
All sequences are available in Supplementary Materials at the authors’ website http://bioputer.mimuw.edu.pl/papers/motifs/.
Sensitivity – the proportion of ChIP-seq peaks that contain at least one predicted binding site,
Specificity – the proportion of fragments in the negative dataset that contain no predicted binding site,
Balanced accuracy – the average of sensitivity and specificity.
Mapping of TFs to gene identifiers
Some TFs were represented by more than one ChIP-seq dataset and many TFs were mapped to several motifs from each database (for the details see Supplementary Materials). The annotation of the TF ChIP-seq datasets in Ensembl funcgen v.71 was downloaded from the Ensembl website (http://apr2013.archive.ensembl.org/Homo_sapiens/Experiment?db=core;ex=project-ENCODE-). The Ensembl gene IDs of the features in the classes: ‘Transcription factor’ and ‘Insulator’ (for CTCF) were mapped to Entrez Gene ID using the aforementioned db2db function of the bioDBNet webservice. For one entry (ZEB1), its Ensembl gene ID was corrected (to ENSG00000148516) prior to the mapping.
The two commercial motif databases, Transfac Professional (Biobase) and MatBase (Genomatix), are distributed with dedicated motif scanning programs, Match and MatInspector [19,30], respectively, and with proprietary threshold files aimed at controlling the false positive rate. For matrix-scan we chose a 1st order Markov chain background model learned on human gene upstream no-orf sequences, and for Bio.Motif we used the uniform background model. We chose a 0.0001 chance of Type I error as the threshold for both public scanners, because it yielded numbers of genome-wide matches similar to those of Match or MatInspector when run with the two respective motif databases.
Computation of intersections
For the comparison of scanners, we computed genome-wide intersections between ChIP-seq peaks and motifs using stored procedures of the NGD database (http://www.nencki-genomics.org). The NGD database stores the TF ChIP-seq data imported from funcgen and the results of genome-wide motif scanning. The intersections, specificity and sensitivity were computed separately for each dataset-motif pair, then averaged twice: first for each TF-motif pair, then for a given scanner/database. For the in-depth comparison of performance we used an extended set of motif databases: Jaspar (2014) vertebrates, Transfac Professional 2013.1, MatBase 9.0, HOCOMOCO v.9, HT-SELEX, SwissRegulon (2013). This time we scanned only the positive and the negative sequences with the Bio.Motif scanner. As the average length of human third exons (153 +/− 304) was smaller than the average size of the ChIP-seq peaks (403 +/− 172), before the scanning we added 100-nt flanks on either side of the third exons and used these flanked exons as the negative set.
Threshold selection methods
FPR approach aims at restricting the number of false positive motif occurrences. For assumed type I error level $\alpha$, $t_M$ is chosen to satisfy $P(L_M(w) > t_M \mid B) = \alpha$.
FNR approach restricts the number of false negatives. In this method $t_M$ satisfies $P(L_M(w) < t_M \mid M) = \beta$ for assumed type II error level $\beta$.
Balanced approach constrains the proportion between the levels of both errors, i.e. threshold $t_M$ satisfies $P(L_M(w) < t_M \mid M) = \gamma \cdot P(L_M(w) > t_M \mid B)$ for assumed parameter $\gamma$. Setting $\gamma$ to the inverse of the expected frequency of motif occurrences results in roughly the same number of false positive and false negative binding site predictions.
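The three approaches above can be sketched on top of an exact score distribution computed column by column. This is a minimal illustration under stated assumptions, not the Bio.Motif implementation: scores are rounded to multiples of a hypothetical `precision`, a word is predicted as a site when its score is at least the threshold, and since a discrete distribution rarely hits the target level exactly, each function returns the closest threshold that does not exceed it.

```python
from collections import defaultdict

BASES = "ACGT"
EPS = 1e-12  # tolerance for floating-point comparisons

def score_distribution(pwm, column_probs, precision=1e-3):
    """Sorted (score, probability) pairs when column i emits base b
    with probability column_probs[i][b]."""
    dist = {0: 1.0}  # keys are scores in integer units of `precision`
    for column, probs in zip(pwm, column_probs):
        new = defaultdict(float)
        for s, p in dist.items():
            for b in BASES:
                new[s + round(column[b] / precision)] += p * probs[b]
        dist = new
    return sorted((k * precision, p) for k, p in dist.items())

def threshold_fpr(bg_dist, alpha):
    """Lowest score t with P(score >= t | B) <= alpha (None if impossible)."""
    tail, best = 0.0, None
    for s, p in reversed(bg_dist):
        if tail + p > alpha + EPS:
            break
        tail, best = tail + p, s
    return best

def threshold_fnr(motif_dist, beta):
    """Highest score t with P(score < t | M) <= beta."""
    below, best = 0.0, motif_dist[0][0]
    for s, p in motif_dist:
        if below > beta + EPS:
            break
        best, below = s, below + p
    return best

def threshold_balanced(bg_dist, motif_dist, gamma):
    """Lowest score t with P(score < t | M) >= gamma * P(score >= t | B)."""
    fpr, fnr = 1.0, 0.0
    # Both distributions come from the same matrix, so their scores align.
    for (s, p_bg), (_, p_motif) in zip(bg_dist, motif_dist):
        if fnr >= gamma * fpr:
            return s
        fpr, fnr = fpr - p_bg, fnr + p_motif
    return None

# Hypothetical two-column motif with a uniform background model B.
pwm = [{"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0}] * 2
bg = score_distribution(pwm, [{b: 0.25 for b in BASES}] * 2)
motif = score_distribution(pwm, [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}] * 2)
```

The distribution under B uses the background base frequencies in every column, while the distribution under M uses the motif's own column frequencies; the same dynamic programming routine serves both.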
This work was supported by: Polish National Science Centre grant No. 2013/09/B/NZ2/03170, Polish Ministry of Science and Higher Education grant No. N N519 652740 and Polish National Center for Research and Development grant No. ERA-NET-NEURON/10/2013.
- Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J Mol Biol. 1986; 188(3):415–31.
- Xing EP, Jordan MI, Karp RM, Russell S. A hierarchical bayesian markovian model for motifs in biopolymer sequences. In: Becker S, Thrun S, Obermayer K, editors. Advances in Neural Information Processing Systems 15. Vancouver Canada: MIT Press: 2003. p. 1513–20.
- Zhao Y, Ruan S, Pandey M, Stormo GD. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics. 2012; 191(3):781–90. Accessed 2015-03-10.
- Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, et al. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014; 42(Database issue):148–55. Accessed 2015-03-10.
- Zhao Y, Stormo GD. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotech. 2011; 29(6):480–3.
- Hertz GZ, Stormo GD. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999; 15(7):563–77.
- Rahmann S, Muller T, Vingron M. On the power of profiles for transcription factor binding site detection. Stat Appl Genet Mol Biol. 2003; 2(1):1032.
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, et al. Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature. 2005; 434(7031):338–45.
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, et al. Transfac and its module transcompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006; 34(Database issue):108–10.
- Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Peña-Castillo L, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008; 133(7):1266–76.
- Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009; 324(5935):1720–23.
- Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, et al. Jaspar 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010; 38(Database issue):105–10.
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. Dna-binding specificities of human transcription factors. Cell. 2013; 152(1-2):327–39.
- Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, et al. Hocomoco: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013; 41(Database issue):195–202.
- Pachkov M, Balwierz PJ, Arnold P, Ozonov E, van Nimwegen E. Swissregulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res. 2013; 41(Database issue):214–20.
- Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014; 42(Database issue):142–7.
- ENCODE Project Consortium, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.
- Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003; 31(13):3576–579.
- Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, et al. MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005; 21(13):2933–42.
- Turatsinze JV, Thomas-Chollier M, Defrance M, van Helden J. Using rsat to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc. 2008; 3(10):1578–88.
- Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25(11):1422–3.
- Wilczynski B, Dojer N, Patelak M, Tiuryn J. Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs. BMC Bioinformatics. 2009; 10:82.
- Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, Klingenhoff A, et al. Matinspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005; 21(13):2933–42.
- Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from dna sequence and chromatin accessibility data. Genome Res. 2011; 21(3):447–55.
- Kaplan T, Li XY, Sabo PJ, Thomas S, Stamatoyannopoulos JA, Biggin MD, et al. Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early drosophila development. PLoS Genet. 2011; 7(2):1001290.
- Mudunuri U, Che A, Yi M, Stephens RM. bioDBnet: the biological database network. Bioinformatics. 2009; 25(4):555–6.
- Krystkowiak I, Lenart J, Debski K, Kuterba P, Petas M, Kaminska B, et al. Nencki genomics database–ensembl funcgen enhanced with intersections, user data and genome-wide TFBS motifs. Database (Oxford). 2013; 2013:069.
- Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013; 41(Database issue):48–55.
- Worsley Hunt R, Mathelier A, del Peso L, Wasserman W. Improving analysis of transcription factor binding sites within chip-seq data based on topological motif enrichment. BMC Genomics. 2014; 15(1):472.
- Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995; 23(23):4878–884.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.