A Parzen window-based approach for the detection of locally enriched transcription factor binding sites
© Vandenbon et al.; licensee BioMed Central Ltd. 2013
Received: 6 July 2012
Accepted: 14 January 2013
Published: 21 January 2013
Identification of cis- and trans-acting factors regulating gene expression remains an important problem in biology. Bioinformatics analyses of regulatory regions are hampered by several difficulties. One is that binding sites for regulatory proteins are often not significantly over-represented in the set of DNA sequences of interest, because of high levels of false positive predictions, and because of positional restrictions on functional binding sites with regard to the transcription start site.
We have developed a novel method for the detection of regulatory motifs based on their local over-representation in sets of regulatory regions. The method makes use of a Parzen window-based approach for scoring local enrichment, and during evaluation of significance it takes into account GC content of sequences. We show that the accuracy of our method compares favourably to that of other methods, and that our method is capable of detecting not only generally over-represented regulatory motifs, but also locally over-represented motifs that are often missed by standard motif detection approaches. Using a number of examples we illustrate the validity of our approach and suggest applications, such as the analysis of weaker binding sites.
Our approach can be used to suggest testable hypotheses for wet-lab experiments. It has potential for future analyses, such as the prediction of weaker binding sites. An online application of our approach, called LocaMo Finder (Local Motif Finder), is available at http://sysimm.ifrec.osaka-u.ac.jp/tfbs/locamo/.
KeywordsRegulation of transcription Promoter sequence Transcription factor binding site Parzen window
Regulation of transcription in eukaryote cells is controlled by the binding of transcription factors (TFs) to specific binding sites in the regulatory regions of their target genes. In this way, transcription factor binding sites (TFBSs) play an important role in the regulation of gene expression. Unfortunately, TFBSs are hard to identify; experimental approaches are laborious and costly, and computational analyses are plagued by high false positive rates. TFBSs are extremely short compared to the regions in which they are present, and TFs typically bind to a variety of motifs.
One of the many difficulties faced by TFBS detection approaches is that some TFBSs are restricted in their location with regard to the transcription start site (TSS). An extreme example of such a motif is the TATA-box, which is positioned about 25 to 30 bases upstream of the TSS. However, computational analyses usually use sequences of a fixed length (for eukaryotes typically 1000 bps or longer). In such cases, the region in which genuine regulatory motifs are positioned is small compared to the input sequence length, which makes position-restricted TFBSs hard to detect using standard approaches.
A number of studies have investigated the positional tendencies of nucleotide motifs and TFBSs in genome-wide or large sets (>1000 sequences) of promoter sequences [1-5]. Approaches aiming at predicting locally enriched TFBSs in smaller sets of sequences have also been reported. Many of these approaches involve counting the number of TFBSs in windows within the regulatory regions, either using a sliding window approach [6, 7] or by binning the TFBSs according to their position [8, 9]. One weakness of these approaches is that they process the TFBSs in a binary way; either a site is present within the window, or it is not. Within the region defined by the window or bin itself the distribution of sites is ignored, and sites at the edge of the region contribute as much to the score as do sites located at the center. This causes such approaches to be highly dependent on window sizes, and - especially for smaller window sizes - susceptible to noise: moving the region or changing its size even just a few bases can drastically change the score. Alternative methods for combining positional information and nucleotide motifs, such as one based on positional word counting [10, 11], are not suitable for application on smaller datasets where counts of TFBSs tend to be too low to process in such a way.
Here we introduce a novel method for the prediction of locally enriched TFBSs in sets of promoter sequences of co-regulated genes. Our approach is based on the Parzen window technique for density estimation, and is capable of converting predicted TFBSs in even small sets of promoters into continuous functions that reflect local TFBS enrichment. Usage of different window function widths allows for the detection of both TFBS motifs with loosely defined positional preferences as well as TFBSs that require very precise positioning. Furthermore, a random sampling technique is used to incorporate genome-wide tendencies into the evaluation of significance of local enrichment. Application on artificial and real datasets showed that our method compares favourably to other methods and measures. We also applied our method to a number of promoter sequence sets regulating tissue- or condition-specific expression profiles, and detected regions with local enrichment of TFBSs that could not be detected using a more standard TFBS over-representation analysis. We illustrate the validity of our findings and how our approach can be used for further dissection of the architecture of regulatory regions.
Our proposed method for detection of locally enriched regulatory motifs is described below. A workflow of our approach as well as details not covered below can be found in the Supplementary Methods section and Figure S1 (Additional file 1).
Parzen window-based scoring of local TFBS enrichment
We start with a set S of N promoter sequences, indexed by s = 1, 2, …, N. For each sequence s its m s associated TFBSs are represented by their location relative to a landmark within each sequence,
In this case p(x) simply reflects the count of sites in the window around x.
That is, ϕ is a Gaussian function fit over each position x in the set of promoters, with the distance to predicted TFBSs determining their contributions to the score of x. The h value here is essentially a measure of the width of this Gaussian. Other window functions can easily be introduced.
where x start and x stop indicate the start and stop coordinates of the region of interest relative to the TSS, respectively. In practice, if a sequence contains multiple predicted TFBSs, Z s will be greater than 1, resulting in a reduced contribution per site in this sequence to the S local . This scaling factor was introduced to limit the effect of a single sequence containing high numbers of predicted TFBSs, especially in simple repeat sequences (see Additional file 1: Figure S2).
Note that the S local is not a probability function and does not sum to unity. Using Equations 4 to 7, a local enrichment score can be calculated for each TFBS motif at each bp in the region from x start to x stop . In practice we focused on the region from -2 kb to 1 kb around the TSS, but to avoid irregularities at the boundaries of this region we based calculations on TFBSs predicted in the region from -3 kb to +2 kb. Wider window functions lead to broad, slowly changing S local values which are useful for detecting over-representation of TFBSs with relatively loosely defined regions of preference. Narrower window functions on the other hand result in sharper peaks in S local values, allowing for the detection of local enrichment in narrow, precisely defined regions. In order to detect both broad and narrow regions of enrichment we used the values h/2 = 10, 20, 50, 100, and 200 bps.
Evaluation of significance of local enrichment
The significance of observed S local values was evaluated using a random sampling approach, in which we sample N sequences with their predicted binding sites from the genomic set of promoters. Sampling was done such that sampled sequences had a similar GC content composition as the input sequences, in an effort to limit biases (see below). Using Equations 4 to 7, for each position x, we calculated
The use of the combination of both thresholds ensures that enriched regions are enriched in S in comparison to the same region in sampled sets (P dep ), but also that the regions have a certain degree of enrichment irrespective of their position to the TSS (P ind ). In order to reduce false positive predictions caused by multiple testing, we defined the thresholds for P dep (x) as a function of the window function width, with values of 0.001, 5e-4, 2e-4, 1e-4, and 5e-5 for scores calculated with h/2 = 200, 100, 50, 20, and 10 bps respectively. Here, more stringent thresholds for the smaller peak widths reflect the increasing number of independent tests performed as peak widths decrease. The P ind (x) threshold was set uniformly to 0.01. For every region within the sequences with start position x 1 and end position x 2 passing both P value restrictions we defined the corresponding region of local enrichment to be the region from x1 - h/2 to x2 + h/2.
Removal of redundant locally enriched regions
Pairs of overlapping enriched regions detected for the same motif with the same window function width were merged together to form one single region. Redundancy between enriched regions detected using different window function width values was removed such that for each set of overlapping regions only 1 representative region was retained. For each region, the number of TFBSs contained was counted. Next, for each set of overlapping regions we retained the largest region, unless there were smaller regions containing more than 2/3 of the predicted TFBSs in the largest region. In the latter case we retained the smallest region containing more than 2/3 of the predicted TFBSs. This approach ensured that, where possible, narrower enriched regions containing high numbers of TFBSs were returned.
A random sampling approach that takes into account GC content biases
At several points in this study we evaluate the significance of findings using a random sampling strategy. Observations made for the set of sequences S were compared to values obtained in a large number of sets of N sequences randomly sampled from the genomic set of sequences in a way that limits the difference in GC content between sampled sequences and the sequences in S. The p-value of observed values was estimated by the ratio of sampled sets having a higher value than the observed one.
First, we clustered the genomic set of promoter sequences by their GC content. Each sequence was represented by a vector of 20 values representing the GC content in bins of 100 bps in the region from -1 kb to +1 kb. Values for each bin were scaled to have mean 0 and standard deviation 1. Clustering was done using k-means clustering using the Hartigan-Wong algorithm, with k = 2 to 10. For each value of k, clustering was done 100 times using random initializations, each run had up to 100 iterations, and the result with the smallest sum of squares between samples and assigned centers was retained. The clusters obtained with k = 2 corresponded to sequences with high (especially proximal to the TSS) and low GC content, respectively. Clusters obtained by clustering with k > 2 corresponded to variations of both classes (see Additional file 1: Figure S3).
In our sampling approach, given a set S of N input sequences and a value of k*, we get the number of sequences in S belonging to each cluster,
where Density S is the number of predicted TFBSs in S divided by N, Density genomic the number of predicted TFBSs in the genomic set of promoter sequences divided by the total number of promoter sequences, and Proportion S and Proportion genomic the ratio of sequences that have at least 1 predicted TFBS, in S and the genome-wide set of promoters, respectively. The significance of observed ORI values is evaluated using a sampling approach equivalent to the one used for local enrichment scores, with P ORI being the proportion of random samples with higher ORI scores than the value observed in S. See (Additional file 1: Figure S4) for a flowchart of this approach. After application on artificial datasets, and based on our experience in previous analyses, we set the threshold for P ORI to 0.01.
Comparison to existing methods and measures
We compared the performance of our approach to that of existing methods and measures for the prediction of local and global enrichment, using artificial as well as real data. A number of realistic, artificial sequence datasets were prepared in which instances of known regulatory motifs were inserted into specific regions with respect to the TSSs. Real datasets consisted of 6 sets of genes with similar expression profiles in dendritic cells upon stimulation with LPS, for which ChIP-seq time-course data is available for 25 TFs . Using the ChIP-seq data we inferred the TFs controlling the expression of each set of genes. On these datasets we applied the methods and evaluated their performance in terms of precision, recall, and F-measure. For a more detailed description of the construction of the artificial and real datasets, the methods and measures we included in the comparison, and the measures of performance we used, we refer to the Supplementary Methods section (Additional file 1).
Summary of additional materials and methods
For a more detailed description we refer to the Supplementary Methods section (see Additional file 1). In brief, sets of co-expressed genes were defined based on micro-array gene expression data for 79 human and 61 mouse tissues and cell types from the GNF GeneAtlas dataset , and for mouse dendritic cells (DCs) stimulated with a number of immune stimuli . For all genes of the human (hg19) and mouse (mm9) genome, we scanned regions from -3 kb to +2 kb around the TSS for sites for a set of non-redundant, vertebrate position weight matrices (PWMs) constructed from TRANSFAC  and JASPAR  PWMs. Analysis of local enrichment of TFBSs was based on these predicted sites.
Evolutionary conservation of TFBSs within enriched regions was evaluated using PhastCons scores as available on the UCSC Genome Browser . Z scores for PhastCons scores corresponding to bases included in TFBSs were calculated based on the average and standard deviation of PhastCons scores of an equal number of bases located at equal distances to randomly selected TSSs.
Enrichment of weak TFBSs in locally enriched regions was evaluated by comparing the count of weak sites within predicted regions in S with the count in randomly sampled sets of sequences.
Results and discussion
Parzen window-based detection of local enrichment of TFBSs
The starting point for many TFBS analyses is a set of predicted TFBS in a set of regulatory sequences believed to be under the control of the same regulatory mechanism. Finding local over-representation of TFBSs in a set of sequences can be thought of as finding a region at a certain distance from a landmark, such as the TSS, in which there is a significant enrichment or increase in density of TFBSs compared to a reference set of sequences.
Various general approaches for the estimation of densities have been described, such as frequency histograms and kn-nearest-neighbor estimates . Here, we focused our attention on the Parzen window approach, a non-parametric technique where the density estimate is obtained by summing over all samples weighted by a function of choice, the so-called window function . Parzen widow approaches have been used in bioinformatics for ChIP-seq peak calling , but to our best knowledge it has never been used for the analysis of regulatory motifs. In our analysis, samples consist of predicted TFBSs in promoter sequences of co-expressed genes, and as a window function we use a Gaussian function of the distance to each TFBS. We choose a Gaussian function because it results in a smooth enrichment score, and because it has only one additional parameter to specify: the width of the Gaussian. We made a number of adjustments to the original Parzen window approach (see Methods section), which allow us to estimate local enrichment scores (S local ) for each TFBS. A sampling approach is then used to estimate the significance of observed scores in a way that takes into account GC content profiles of sequences.
Several studies have indicated differences in GC content and CpG scores between housekeeping genes and tissue- or condition-specific genes [24, 25], and a recent study has suggested that analysis of regulatory regions can lead to better results when treating CpG island-associated sequences and non-CpG island-associated sequences separately . Since PWMs with high (low) GC content naturally tend to have more hits in sequences with high (low) GC content, local regions of higher (lower) GC content in an input set of sequences can easily result in apparent local enrichment of GC-rich (GC-poor) TFBSs. To avoid such biases we designed a way to evaluate the significance of peaks using a sampling approach in which sampled sequences have a GC content profile similar to that of the input sequences (see Methods section). Figure 1 illustrates the importance of this approach using the set of mouse promoters with high activity in small and large intestine as an example. Figure 1A shows the GC content profile in the intestine-specific promoters, as well as in sequences randomly sampled using k = 1, 2, 3 cluster(s). If we were to ignore GC content profiles of sequences, sampling from the genomic set of sequences (equivalent to k = 1) would result in sampled sets having a significantly higher GC content than the input sequences, especially around the TSS. This in turn would lead to high numbers of false positive enriched regions for AT-rich TFBSs. However, when clustering genomic sequences into 2 clusters according to their GC content profile (k = 2), and sampling according to the number of input sequences in each cluster, the difference in average GC content between sampled and input sequences is strongly reduced. Doing the same using 3 clusters (k = 3) does not further decrease the difference in this particular case. Figure 1B shows the discrepancy between GC content of the same set of intestine-specific promoters and sampled sets of promoters as a function of the number of clusters k used to sample. Although the RMSD has a tendency to decrease as k increases, higher k values also result in smaller sets of promoter sequences to sample from. As a compromise, we therefore used the clusters obtained using k*, the smallest k value where the RMSD is smaller than the RMSD + SD of k + 1 clusters. In the case of promoters with high expression in small and large intestine, the random sampling for the evaluation of significance of local enrichment of TFBSs was thus done according to k* = 2. Since the existence of high-CG genes and low-CG genes is well established, k* = 1 was not considered. The k* values were 2 in 36 out of 44 sets of co-expressed genes in mouse, and in 17 out of 32 sets in human, indicating that in many cases a simple distinction between GC-rich and GC-poor sequences is sufficient. For a number of sets, however, high k* values were found (Additional file 1: Table S1, Additional file 1: Table S2 and Additional file 1: Table S3).
Comparison to existing methods and measures
Overview of performance of several methods and measures for prediction of motif enrichment on artificial and real datasets
Method or measure
LocaMo Finder (Gaussian)
LocaMo Finder (uniform)
RSAT (Binomial distribution) ($)
Hypergeometric distribution (*)
Fisher’s exact test (*)
RSAT (Binomial distribution) ($$)
Hypergeometric distribution (***)
Fisher’s exact test (***)
Casimiro et al.
Berendzen et al.
Vardhanabhuti et al.
FIRE (Information content)
In addition, we observed how our method significantly increases the accuracy of PWM-based TFBS predictions. When we assume that only predicted TFBS present inside bound regions of the TF in question (as defined by ChIP-seq data) are truly functional, and regard any other predicted TFBSs as false positives, PWM-based TFBS predictions resulted in a recall of 33.1%, with a low precision of 5.0%. However, when we filtered the PWM-based predicted TFBSs using the predicted regions of enrichment for the regulatory motif in question, while recall dropped 3.7-fold to 8.9%, there was a more than 7-fold increase in precision to 37.4%, resulting from a strongly reduced number of false positive predictions.
Application on a large number of gene sets
Next, we applied our approach on a large number of sets of genes with tissue- and cell type-specific expression, in human and mouse . Genes were clustered into 44 and 32 clusters of co-expressed genes for the mouse and human case, respectively. Promoter regions for all genes were scanned using a set of 198 non-redundant vertebrate PWMs, and we predicted local TFBS motif enrichment in each of the clusters using our approach, as well as TFBS enrichment using a standard motif over-representation approach.
A selection of regions of local enrichment that could not be found using standard TFBS over-representation analysis
Tissues or cell types (cluster index)
Region of enrichment:
x1to x2( h/ 2)
B cells, T cells (2)
ETS domain TFs, including SPI1 or PU.1
-10 to 0 (10)
B cells, T cells (2)
55 to 200 (200)
-91 to 129 (200)
CREB-binding TFs, including ATF family
-148 to 31 (100)
Cux1 (CR3 + HD)
-103 to -90 (10)
small and large intestine (19)
-93 to -37 (20)
small and large intestine (19)
RXR, RAR, and VDR
-52 to -43 (10)
MYB family TFs
-72 to 95 (100)
heat shock factors
-58 to 217 (200)
skeletal muscle (42)
THR alpha and beta
-30 to -15 (50)
General tendencies of local TFBS enrichment
Regulatory motifs show a preference for similar regions across different sets of promoters
Regions with local enrichment of TFBSs of ETS domain TFs
Tissues or cell types (cluster index)
Region of enrichment:
x1to x2( h/2)
T cells, NK cells (5)
-141 to 51 (100)
721 B-lymphoblasts, BM CD34+ cells (11)
-174 to 42 (100)
721 B-lymphoblasts, BM CD34+ cells (12)
-151 to 37 (100)
B cells, Burkitt's lymphoma (13)
-148 to 23 (100)
BM CD34+ cells, 721 B-lymphoblasts (14)
-81 to 10 (50)
NK cells, T cells (15)
-97 to 34 (50)
B cells, T cells (2)
-106 to 17 (50)
skeletal muscle, heart (4)
-161 to 26 (100)
thymus, ovary (6)
-137 to 18 (100)
-133 to -58 (100)
T cells, B cells (12)
-133 to -98 (100)
oocyte, fertilized egg (25)
-139 to 42 (100)
oocyte, fertilized egg (34)
-240 to 34 (200)
-178 to 71 (200)
oocyte, fertilized egg (37)
-77 to -11 (50)
TFBSs in locally enriched regions tend to have high evolutionary conservation
Locally enriched regions tend to be enriched for weak TFBSs
Recently a number of studies, including computational studies using thermodynamic models, have illustrated the importance of weak TFBSs in the regulation of transcription [33-35]. Although PWM scores are believed to be correlated with the TF-DNA binding energy , the sheer number of sites having only moderately high PWM scores prohibits most computational methods from analyzing them in more detail. One advantage our approach offers is that we can restrict the region of interest to the region preferred by the stronger TFBSs, and evaluate if weaker binding sites show a similar local enrichment (see Supplementary Methods section in Additional file 1). Our analysis of weak TFBSs in the enriched regions found in the human clusters shows that 42 out of 190 regions (22.1%) show a significant enrichment (defined as p-value < 0.01) of weaker TFBSs (Figure 3C). Similarly, in mouse datasets, 49 out of 269 regions (18.2%) were enriched for weak TFBSs (Additional file 1: Figure S5). This result does not include the presence of stronger TFBSs, nor weaker sites overlapping with them. For example, promoters of genes expressed in human liver are enriched for HNF1 TFBSs in the region -225 to +49. Our analysis indicates that in addition to the 56 strong HNF1 sites, there are 49 weak TFBSs predicted in the region of local enrichment in the promoters in this set of genes, (expected: 19.8 sites, p-value < 1e-4).
Application on sets of genes with differential expression upon TLR stimulation
The regulatory pathways controlling Toll-like receptor (TLR) signalling have been relatively well studied, and main regulators such as NF-κB and IRFs have been identified. However, it is likely that promoters of genes that are induced or repressed after TLR stimulation contain additional regulatory motifs that are still to be identified. From gene expression data taken from mouse DCs stimulated with 5 TLR ligands , we prepared 18 clusters of co-expressed genes. Applying our method to these clusters, we could detect locally enriched regions for the sites of CREB-binding TFs, NF-κB, and IRFs in clusters of transiently induced genes with peak expression around 2, 4, and 6 hrs after stimulation, respectively (See Additional file 2: Table S11). These motifs could also be found by standard motif over-representation analysis. However, our approach also detected a local enrichment for STAT binding sites in a cluster with induced expression peaking around 6 hours after stimulation , which could not be detected by the standard approach. In addition, we could make a number of findings that might offer hypotheses for further wet-lab experiments. One is the enrichment of TATA box motifs only in the set of promoters corresponding to genes with the fastest transient activation (peak induction at 2 hrs), which suggests that the TATA box plays a role in the rapid induction of these genes. Finally, a number of enriched regions downstream of the TSS were predicted. These include regions enriched in sites for Hmx3 (transiently induced, 2 hrs), RFX1 sites (transiently induced, 8 hrs), Mtf1 (late induction, 24 hrs), Zfp161 and E-box motifs (transiently repressed, 6-8 hrs).
Although various studies have illustrated that in eukaryotic genomes cis-regulatory motifs can be positioned several kilobases or even megabases away from their target genes, it has also been reported that in a number of cases TFBSs show a tendency to be present at a more or less fixed distance with regard to the TSS. Nevertheless, in general, no clear positional preferences have been described for most regulatory motifs, even though such information could be extremely useful for their prediction.
Here, we present a novel method for detecting locally enriched TFBSs in the regulatory regions of sets of co-regulated genes. Our approach is based on a non-parametric approach for sample density estimation, with adjustments which allow it to detect sequence regions that have a significant local enrichment in TFBSs, on a 1 bp resolution. An online tool of our approach, which we call LocaMo Finder (Local Motif Finder), is available at http://sysimm.ifrec.osaka-u.ac.jp/tfbs/locamo/. Our approach evaluates both positioning and enrichment of TFBSs simultaneously, using a set of control sequences as a reference. We implemented our method for the estimation of significance of enrichment in a way that takes into account GC content profiles of the input set of promoter sequences. Evaluation of significance is done against randomly selected promoters that have GC content profiles similar to the input sequences. We showed how our approach has a high accuracy compared to other methods and measures for local and global motif enrichment.
We applied our method on a large number of sets of genes with tissue- or cell type-specific expression, as well as on a number of sets of genes with similar expression profiles after TLR ligand stimulation in mouse DCs. Detected regions of local enrichment of TFBSs are supported by known regulatory interactions reported in literature, as well as by the observation that several regulatory motifs are found to be locally enriched in similar regions in different sets of promoters. In addition, detected regions tend to contain TFBSs with higher evolutionary conservation than expected, and they also tend to be enriched for weak TFBSs. Together, these results illustrated the usefulness and validity of our approach.
Approaches for finding local enrichment of TFBSs heavily rely on TSS annotations. Recent studies have shown that a large fraction of genes have several TSSs, and that different TSSs might allow different amounts of variation in the bases from which transcription is initiated [38, 39]. Future approaches that incorporate such features are likely to present more biological insights into the relationship between TFBS positioning and transcription initiation and TSS variability.
Position weight matrix
Transcription factor binding site
Transcription start site
Zero or one occurrence per sequence
The authors would like to thank all members of the laboratory of Systems Immunology and the laboratory of Host Defense for helpful discussions and advice; Dr K Nakai for helpful suggestions; M Ogawa, E Kamada and M Kageyama for secretarial assistance. We also thank the reviewers for their suggestions, which improved the paper considerably.
This work was supported by the Special Coordination Funds of the Japanese Ministry of Education, Culture, Sports, Science and Technology, and the Ministry of Health, Labour and Welfare in Japan, the Japan Society for the Promotion of Science through Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program), and by a Kakenhi Grant-in-Aid for Scientific Research (23710234) from the Japan Society for the Promotion of Science.
- Berendzen KW, Stuber K, Harter K, Wanke D: Cis-motifs upstream of the transcription and translation initiation sites are effectively revealed by their positional disequilibrium in eukaryote genomes using frequency distribution curves. BMC Bioinformatics 2006, 7: 522.PubMed CentralView ArticlePubMedGoogle Scholar
- Bellora N, Farre D, Alba MM: Positional bias of general and tissue-specific regulatory motifs in mouse gene promoters. BMC Genomics 2007, 8: 459.PubMed CentralView ArticlePubMedGoogle Scholar
- Vardhanabhuti S, Wang J, Hannenhalli S: Position and distance specificity are important determinants of cis-regulatory motifs in addition to evolutionary conservation. Nucleic Acids Res 2007, 35: 3203-3213.PubMed CentralView ArticlePubMedGoogle Scholar
- Tharakaraman K, Bodenreider O, Landsman D, Spouge JL, Marino-Ramirez L: The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Research 2008, 36: 2777-2786.PubMed CentralView ArticlePubMedGoogle Scholar
- Yokoyama KD, Ohler U, Wray GA: Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucleic Acids Research 2009, 37: e92.PubMed CentralView ArticlePubMedGoogle Scholar
- Defrance M, Touzet H: Predicting transcription factor binding sites using local over-representation and comparative genomics. BMC Bioinformatics 2006, 7: 396.PubMed CentralView ArticlePubMedGoogle Scholar
- Bellora N, Farre D, Alba MM: PEAKS: identification of regulatory motifs by their position in DNA sequences. Bioinformatics 2007, 23: 243-244.View ArticlePubMedGoogle Scholar
- FitzGerald PC, Shlyakhtenko A, Mir AA, Vinson C: Clustering of DNA sequences in human promoters. Genome Research 2004, 14: 1562-1574.PubMed CentralView ArticlePubMedGoogle Scholar
- Casimiro AC, Vinga S, Freitas AT, Oliveira AL: An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance. BMC Bioinformatics 2008, 9: 89.PubMed CentralView ArticlePubMedGoogle Scholar
- Salisbury J, Hutchison KW, Graber JH: A multispecies comparison of the metazoan 3 '-processing downstream elements and the CstF-64 RNA recognition motif. BMC Genomics 2006, 7: 55.PubMed CentralView ArticlePubMedGoogle Scholar
- Hutchins LN, Murphy SM, Singh P, Graber JH: Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics 2008, 24: 2684-2690.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2: 28-36.PubMedGoogle Scholar
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. J Roy Stat Soc B 2001, 63: 411-423.View ArticleGoogle Scholar
- Bajic VB, Choudhary V, Hock CK: Content analysis of the core promoter region of human genes. In Silico Biol 2004, 4: 109-125.PubMedGoogle Scholar
- Garber M, Yosef N, Goren A, Raychowdhury R, Thielke A, Guttman M, Robinson J, Minie B, Chevrier N, Itzhaki Z: A high-throughput chromatin immunoprecipitation approach reveals principles of dynamic gene regulation in mammals. Mol Cell 2012, 47: 810-822.View ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101: 6062-6067.PubMed CentralView ArticlePubMedGoogle Scholar
- Amit I, Garber M, Chevrier N, Leite AP, Donner Y, Eisenhaure T, Guttman M, Grenier JK, Li WB, Zuk O: Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science 2009, 326: 257-263.PubMed CentralView ArticlePubMedGoogle Scholar
- Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34: D108-110.PubMed CentralView ArticlePubMedGoogle Scholar
- Bryne JC, Valen E, Tang MHE, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Research 2008, 36: D102-D106.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou MM, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 2005, 15: 1034-1050.PubMed CentralView ArticlePubMedGoogle Scholar
- Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. New York: Wiley; 2001.Google Scholar
- Parzen E: Estimation of a probability density-function and mode. Ann Math Stat 1962, 33: 1065.View ArticleGoogle Scholar
- Lee BK, Bhinge AA, Battenhouse A, McDaniell RM, Liu Z, Song LY, Ni YY, Birney E, Lieb JD, Furey TS: Cell-type specific and combinatorial usage of diverse transcription factors revealed by genome-wide binding studies in multiple human cells. Genome Research 2012, 22: 9-24.PubMed CentralView ArticlePubMedGoogle Scholar
- Yamashita R, Suzuki Y, Sugano S, Nakai K: Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene 2005, 350: 129-136.View ArticlePubMedGoogle Scholar
- Illingworth RS, Bird AP: CpG islands - 'A rough guide'. Febs Lett 2009, 583: 1713-1720.View ArticlePubMedGoogle Scholar
- Roider HG, Lenhard B, Kanhere A, Haas SA, Vingron M: CpG-depleted promoters harbor tissue-specific transcription factor binding signals-implications for motif overrepresentation analyses. Nucleic Acids Research 2009, 37: 6305-6315.PubMed CentralView ArticlePubMedGoogle Scholar
- Kim NK, Tharakaraman K, Marino-Ramirez L, Spouge JL: Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 2008, 9: 262.PubMed CentralView ArticlePubMedGoogle Scholar
- Gallant S, Gilkeson G: ETS transcription factors and regulation of immunity. Arch Immunol Ther Ex 2006, 54: 149-163.View ArticleGoogle Scholar
- Taniguchi T, Ogasawara K, Takaoka A, Tanaka N: IRF family of transcription factors as regulators of host defense. Annu Rev Immunol 2001, 19: 623-655.View ArticlePubMedGoogle Scholar
- Bonizzi G, Karin M: The two NF-kappa B activation pathways and their role in innate and adaptive immunity. Trends Immunol 2004, 25: 280-288.View ArticlePubMedGoogle Scholar
- Odom DT, Zizlsperger N, Gordon DB, Bell GW, Rinaldi NJ, Murray HL, Volkert TL, Schreiber J, Rolfe PA, Gifford DK: Control of pancreas and liver gene expression by HNF transcription factors. Science 2004, 303: 1378-1381.PubMed CentralView ArticlePubMedGoogle Scholar
- Nagaki M, Moriwaki H: Transcription factor HNF and hepatocyte differentiation. Hepatol Res 2008, 38: 961-969.View ArticlePubMedGoogle Scholar
- Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. Genome Research 2006, 16: 962-972.PubMed CentralView ArticlePubMedGoogle Scholar
- Gertz J, Siggia ED, Cohen BA: Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 2009, 457: 215-U113.PubMed CentralView ArticlePubMedGoogle Scholar
- Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U: Predicting expression patterns from regulatory sequence in drosophila segmentation. Nature 2008, 451: 535-540.View ArticlePubMedGoogle Scholar
- Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci 1998, 23: 109-113.View ArticlePubMedGoogle Scholar
- Leonard WJ, O'Shea JJ: JAKS AND STATS: biological implications. Annu Rev Immunol 1998, 16: 293-322.View ArticlePubMedGoogle Scholar
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38: 626-635.View ArticlePubMedGoogle Scholar
- Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J, Sekine M, Tsuritani K, Wakaguri H: Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Research 2006, 16: 55-65.PubMed CentralView ArticlePubMedGoogle Scholar
- van Helden J: Regulatory sequence analysis tools. Nucleic Acids Research 2003, 31: 3593-3596.PubMed CentralView ArticlePubMedGoogle Scholar
- Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296: 1205-1214.View ArticlePubMedGoogle Scholar
- Sui SJH, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Research 2005, 33: 3154-3164.View ArticleGoogle Scholar
- Elemento O, Slonim N, Tavazoie S: A universal framework for regulatory element discovery across all genomes and data types. Mol Cell 2007, 28: 337-350.PubMed CentralView ArticlePubMedGoogle Scholar
- Fisher RC, Scott EW: Role of PU.1 in hematopoiesis. Stem Cells 1998, 16: 25-37.View ArticlePubMedGoogle Scholar
- Scott EW, Simon MC, Anastasi J, Singh H: Requirement of transcription factor Pu.1 In the development of multiple hematopoietic lineages. Science 1994, 265: 1573-1577.View ArticlePubMedGoogle Scholar
- Sitkovsky M, Lukashev D: Regulation of immune cells by local. tissue oxygen tension: Hif1 alpha and adenosine receptors. Nat Rev Immunol 2005, 5: 712-721.View ArticlePubMedGoogle Scholar
- Miki Y, Morotami-Yano K, Yano K, Saito H, Sun Z, Iwama A: Human regulatory factor X 4 (RFX4) is a testis-specific dimeric DNA-binding protein that cooperates with other human RFX members. J Biol Chem 2002, 277: 836-842.View ArticlePubMedGoogle Scholar
- Kistler WS, Horvath GC, Kistler MK: RFX2 is a candidate downstream amplifier of A-MYB regulation in mouse spermatogenesis. Bmc Dev Biol 2009, 9: 63.PubMed CentralView ArticlePubMedGoogle Scholar
- Don J, Stelzer G: The expanding family of CREB/CREM transcription factors that are involved with spermatogenesis. Mol Cell Endocrinol 2002, 187: 115-124.View ArticlePubMedGoogle Scholar
- Walker WH, Habener JF: Role of transcription factors CREB and CREM in cAMP-regulated transcription during spermatogenesis. Trends Endocrin Met 1996, 7: 133-138.View ArticleGoogle Scholar
- Vanden Heuvel GB, Brantley JG, Alcalay NI, Sharma M, Kemeny G, Warolin J, Ledford AW, Pinson DM: Hepatomegaly in transgenic mice expressing the homeobox gene Cux-1. Mol Carcinog 2005, 43: 18-30.PubMed CentralView ArticlePubMedGoogle Scholar
- Bosse T, Fialkovich JJ, Piaseckyj CM, Beuling E, Broekman H, Grand RJ, Montgomery RK, Krasinski SD: Gata4 and Hnf1 alpha are partially required for the expression of specific intestinal genes during development. Am J Physiol-Gastr L 2007, 292: G1302-G1314.Google Scholar
- Modica S, Gofflot F, Murzilli S, D'Orazio A, Salvatore L, Pellegrini F, Nicolucci A, Tognoni G, Copetti M, Valanzano R: The intestinal nuclear receptor signature with epithelial localization patterns and expression modulation in tumors. Gastroenterology 2010, 138: 636-U295.View ArticlePubMedGoogle Scholar
- Ogura Y, Suruga K, Takase S, Goda T: Developmental changes of the expression of the genes regulated by retinoic acid in the small intestine of rats. Life Sci 2005, 77: 2804-2813.View ArticlePubMedGoogle Scholar
- Latham KE, Litvin J, Orth JM, Patel B, Mettus R, Reddy EP: Temporal patterns of A-myb and B-myb gene expression during testis development. Oncogene 1996, 13: 1161-1168.PubMedGoogle Scholar
- Schimenti JC, Bolcun-Filas E, Bannister LA, Barash A, Schimenti KJ, Hartford SA, Eppig JJ, Handel MA, Shen LS: A-MYB (MYBL1) transcription factor is a master regulator of male meiosis. Development 2011, 138: 3319-3330.PubMed CentralView ArticlePubMedGoogle Scholar
- Sarge KD, Parksarge OK, Kirby JD, Mayo KE, Morimoto RI: Expression of heat-shock factor-2 in mouse testis - potential role as a regulator of heat-shock protein gene-expression during spermatogenesis. Biol Reprod 1994, 50: 1334-1343.View ArticlePubMedGoogle Scholar
- White P, Burton KA, Fowden AL, Dauncey MJ: Developmental expression analysis of thyroid hormone receptor isoforms reveals new insights into their essential functions in cardiac and skeletal muscles. Faseb J 2001, 15: 1367-1376.View ArticlePubMedGoogle Scholar
- Yu FS, Gothe S, Wikstrom L, Forrest D, Vennstrom B, Larsson L: Effects of thyroid hormone receptor gene disruption on myosin isoform expression in mouse skeletal muscles. Am J Physiol-Reg I 2000, 278: R1545-R1554.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.