A computational evaluation of over-representation of regulatory motifs in the promoter regions of differentially expressed genes

Background Observed co-expression of a group of genes is frequently attributed to co-regulation by shared transcription factors. This assumption has led to the hypothesis that promoters of co-expressed genes should share common regulatory motifs, which forms the basis for numerous computational tools that search for these motifs. While frequently explored for yeast, the validity of the underlying hypothesis has not been assessed systematically in mammals. This demonstrates the need for a systematic and quantitative evaluation to what degree co-expressed genes share over-represented motifs for mammals. Results We identified 33 experiments for human and mouse in the ArrayExpress Database where transcription factors were manipulated and which exhibited a significant number of differentially expressed genes. We checked for over-representation of transcription factor binding sites in up- or down-regulated genes using the over-representation analysis tool oPOSSUM. In 25 out of 33 experiments, this procedure identified the binding matrices of the affected transcription factors. We also carried out de novo prediction of regulatory motifs shared by differentially expressed genes. Again, the detected motifs shared significant similarity with the matrices of the affected transcription factors. Conclusions Our results support the claim that functional regulatory motifs are over-represented in sets of differentially expressed genes and that they can be detected with computational methods.


Background
Patterns of differential gene expression in organisms are known to result from a complex and dynamic gene regulatory network, where the interactions between transcription factors (TFs) and their target genes take center stage. Therefore, the activation or deactivation of TFs in specific signaling pathways triggers up-or down-regulation of their direct targets. Those effects have been subject of numerous studies dealing with different signaling pathways such as development and hormone signaling [1][2][3][4]. For some of these processes, it is well understood how TFs directly transform regulatory signals into gene expression levels by binding to proximal or distal promoters of genes.
The roles of TFs in regulating gene expression have been widely observed in microarray experiments, in which TF genes were knocked out, over-expressed, or stimulated with ligands [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. These studies generally investigated the change of gene expression induced by altering the activity of certain TFs and approved the roles of TFs in gene expression. Furthermore, computational studies have also demonstrated that genes with common regulatory binding sites are more likely to have similar expression profiles [22,23]. The importance of TFs in gene expression regulation naturally raises the question to what degree differential expression of genes under different conditions indicates the presence of shared regulatory motifs. If so, this provides a useful theoretical foundation for novel motif prediction and functional studies. Indeed, it has been a widely used and accepted hypothesis that co-expressed genes share common regulatory motifs. It serves as a useful working hypothesis in many scenarios, and numerous computational tools for regulatory motif discovery built with considerable suc-cess on this hypothesis [24][25][26][27][28][29][30][31][32][33][34][35][36]. While it has been fully explored and approved in yeast [37][38][39], little is known about the applicability of this working hypothesis for mammals.
Considering the rather anecdotal basis for its acceptance, the hypothesis of co-expressed genes sharing common regulatory motifs calls for a systematic evaluation. In fact, microarray experiments in public databases are now widely available, providing expression profiles of thousands of genes under numerous different conditions on a genome-wide scale. As these data are a popular basis for regulatory motif discovery, there is a big demand for a systematic evaluation of the underlying hypothesis.
In this work, we analyzed differentially expressed genes in microarray experiments from the ArrayExpress database [40] that were related to transcription factor activity modifications. We particularly analyzed such experiments, where the perturbation was aimed at a transcription factor. This setting allows us to test whether we are able to recover binding sites of the altered transcription factor from the differential genes alone. This is clearly not trivial because the set of differential genes will encompass a whole cascade of up-or down-regulated genes due to the initial perturbation. Although the microarray database contains many more experiments from which co-expressed genes could be derived, we focus on the ones where we know the identity of the causal transcription factor, such that we can evaluate the success rate of our recovery method.
We study two approaches toward checking whether the binding sites of the affected TFs are over-represented in the differentially expressed genes. In the first approach, we use oPOSSUM [25] to analyze the over-representation of Jaspar matrices [41,42], which represent profiles of binding sites derived from known TF binding sites. Among these matrices, we focus our attention on the matrices corresponding to the affected TFs, which we will hence refer to as target matrices throughout the rest of this paper. We apply oPOSSUM to evaluate the over-representation of target matrices in the promoter regions of differentially expressed genes according to a probabilistic scoring scheme. The second approach we investigate is based on de novo predictions using Weeder [43]. This motif finding tool computes de novo motif predictions, which allow us to compare the similarity between those predictions and matrices in the Jaspar database. High similarity suggests that affected TF binding sites were recovered in de novo prediction. Figure 1 shows the basic workflow of these two approaches.

Description of TF binding sites
Recognition of TF binding sites in promoter regions of differentially expressed genes was performed by detect-ing over-represented position frequency matrices (PFMs), which were taken from the publicly available Jaspar database [41,42]. This database contains a set of 138 matrices representing experiment-determined binding profiles, including 101 matrices for vertebrate TFs. We used percent similarity scores, predicted by Jaspar web-interfaced tool for similarity comparison of different Jaspar matrices [44]. Percent similarity has a maximal score of 100%, which indicates the highest similarity.

Microarray experiment selection and analysis
To obtain a set of suitable microarray experiments, we searched the ArrayExpress database for experiments with modified TF activity. We searched the TFs against the ArrayExpress database [40]. We verified the relationship of the TFs with the associated experiments by inspecting the literature references or experiment descriptions, and selected those experiments where TFs or their genes were modified by the experimental methods. The TF activity modifications we encountered included gene knockout, transgenic over-expression, ligand stimulation or stimulation by mimicking the action of transcription factor, among others.
Most of the microarray experiments in the ArrayExpress database provide both raw and processed (or normalized) data. In this work, we preferably chose the former. Raw data were normalized by RMA [45], a popular normalization method for Affymetrix data, with   [46] method was used for differential expression analysis and p-value was assigned to each gene for its significance of differentially expression. We sorted genes with ascending p-value as a gene list. In next step, we would choose the top n genes for over-representation and de novo prediction analysis, where n was an parameter for input gene number, e.g. set to n = 100, n = 200 or n = 400. For many of the experiments SAM did not return any differentially expressed genes with certain arbitrary cutoff. In search of the reason for this we studied the quality of the experiments from the database. In principle, microarray experiments involve a number of steps that are prone to errors, which may significantly distort the outcome of subsequent analysis. We studied primarily two criteria for the quality of an experiment. The first one was based on scatter plots, in which the averaged normalized expression level of one condition was plotted against that of another condition. For a meaningful microarray experiments, most of genes lies around diagonal line while differentially expressed genes are recognized by their distance to the main diagonal [47]. Another criterion for the quality of an experiment is the distribution of the p-value computed by SAM. Informative experiments should show a distribution of p-values which is roughly uniform in general with an increase or a peak for small p-values [48]. For all experiments, we inspected both scatter plot and p-value histogram and excluded experiments that did not obey the above criteria. All these plots are available in Additional file 1.

Over-representation of Jaspar matrices
Numerous tools for finding over-represented regulatory motifs in differentially expressed genes are available [49]. Among them, we employed oPOSSUM [24,25] for overrepresentation analysis. oPOSSUM is a tool that combines the phylogenetic footprinting method with statistical approaches for identifying over-represented Jaspar matrices in a set of co-expressed genes; it takes gene IDs as input and ranks matrices by two scores to describe their over-representation significance, namely the z-score and the Fisher-score.
While there is no systematic comparison between the performance of different over-representation analysis tools, we relied on the oPOSSUM tool for several reasons. First of all, oPOSSUM is relatively fast if the number and lengths of promoters are within reasonable bounds. Furthermore, oPOSSUM can handle long promoter sequences ranging from -20, 000 bp to +20, 000 bp around the transcription start site (TSS) and takes into account TF binding sites throughout this full range. As another advantage over other over-representation analysis tools, oPOSSUM uses phylogenetic footprinting to improve performance. Finally, the authors of oPOSSUM validated its performance with NF-κB microarray experiments and random sampling data in a setting that is similar to ours [24].
The oPOSSUM tool allows the user to specify a number of parameters, including species, Jaspar matrices, level of conservation (background conservation), matrix match threshold, promoter length, and display option. For most of the tested cases, top 30% conservation, 85% matrix match threshold and 200 input sequences with -2000 to +2000 bp around the TSS (+1 bp) were good choices (see Additional file 2). We set those parameters for all the experiments as default parameter setting. Whenever we did not find the target matrices to be overrepresented under those settings, we manually tried different setting of promoter number and length to check whether target matrices would rank among the over-represented matrices. We followed the suggestion by the authors of oPOSSUM that motifs with a z-score exceeding 10 and a Fisher-score below 0.01 could be considered significantly over-represented [25]. However, when the target matrices satisfied only one of the above cutoffs, we would treat it as weakly over-represented. Hence, for each experiment, according to the z-scores and Fisher-scores, target matrices would be categorized as either significantly over-represented (S), weakly over-represented (W), or not over-represented (N).

De novo prediction of motifs
To further study over-representation of target matrices in promoter regions of co-expressed genes, we predicted over-represented motifs using de novo motif finding methods. In choosing an appropriate de novo motif finding tool among the numerous available approaches, we followed the systematic evaluation by Tompa et al. [50], which found the Weeder tool [43] particularly successful in the context of binding site discovery. Using the same settings as with oPOSSUM (promoters of 200 top ranking differential genes, -2000 to +2000 bp around the TSS), we further analyzed all experiments using Weeder. Each run of Weeder predicted 10 motif profiles by default. We then compared the similarity between those motifs and the Jaspar target matrices using the Jaspar webinterface tool [44] and recorded the percent similarity score for the most similar pairs.

Microarray analysis
We searched the ArrayExpress database for experiments involving hybridizations that differed in loss or gain of the function of a specific TF. We retrieved 88 microarray experiments for human and mouse. Those experiments cover a whole bandwidth of methods to modify the activity of TFs; at least 59 experiments involve methods that decreased the activity of TFs such as gene knockout or RNAi. In more than 34 experiments, TF activity was increased by techniques such as ligand stimulation, or transgenic over-expression. A summary of TF activity modifications used is given in Additional file 3. In the process of eliminating low-quality experiments, we excluded 11 experiments that either had only one replication, or where our standard analysis procedure reported errors without clear reason. For the remaining experiments, we manually assessed the microarray quality based on scatter plots and p-value frequency distribution (see Additional file 1). Whenever the scatter plot or p-value distribution was obviously unreasonable, which indicates some problems of the underlying experiment, we excluded them from further step. As a result, the differentially expressed genes in 33 out of the 77 experiments were used for over-representation and de novo analysis. The following TFs were perturbed in those 33 experiments: cMyc, ESRalpha, irf1, HNF4a, nmyc, Myf, FoxQ, Myb NFkappaB2, AIbZIP, HiF1, Cepba, Evi1, Foxa2, CREB, PPARg2, p53, PPARalpha, PPARI, USF1, IRF6, HMGA2, STAT2, e2f2, HNF1a, Mef2c, Gata-1, KLF15, Nkx2.5 and Gata-3. They are associated with 30 target matrices in the Jaspar database. We summarize those TFs and their target matrices in Table 1. In the next step of our work, we would evaluate the over-representation of those target matrices in promoter regions of differentially expressed genes. According to the classification of Jaspar matrices by Sandelin and Wassserman, these TFs cover nine out of the 11 TF classes identified in [41] (see Additional file 4). Besides of the matrices falling into these nine familial profiles, another eight out of 30 target matrices remain unclassified in the scheme by Sandelin and Wasserman.

A case-study for over-representation
In order to illustrate our procedure, we take an exemplary in-depth look at the estrogen receptor α (ERα). The estrogen receptor (ER) is a ligand-dependent TF that can be activated by estrogen; ER can recognize short DNA sequences, the so-called estrogen response elements (EREs) (5'-GGTCAnnnTGACC-3') in the proximal and distal promoters of genes and regulates gene expression [51]. Here we used the microarray experiment supplied by Lin et al. in the ArrayExpress database (ArrayExpress ID E-GEOD-11352) [8]. In their work, cells in a estrogen-receptor positive breast cancer cell line (MCF7) were either exposed to 10 nM estradiol (a sex hormone, the major estrogen of human) or control only. Then sampled cells were prepared for microarray analysis at the time-points of 12, 24 and 48 hours; each sample hybridization was repeated three times. In this way, the authors obtained 18 hybridizations. We used SAM for differential expression and all the genes were assigned with p-values, which indicated the significance of differentially expres-sion. We then sorted those genes according to their p-values and formed a gene list. The top n up-or downregulated genes were selected as input of oPOSSUM analysis.
In the Jaspar database, we identified matrix ESR1 as a profile for ERα binding sites. Figure 2 shows the output of oPOSSUM with different gene numbers. In this example, we used the top 100 and top 200 up-regulated genes, respectively, of gene list as input to oPOSSUM, with background conservation of 30% and sequences from -2, 000 to +2, 000 bp around the TSS. Under both conditions, oPOSSUM found ESR1 as a top ranked matrix under both the Fisher-score and the z-score, which satisfied the thresholds for significant over-representation. This demonstrates that ER binding sites are indeed over-represented in differentially expressed gene promoters, and that this over-representation can be recovered computationally.
Beside ESR1, we also found other matrices, such as Ar, TLX1-NFIC and NFKB1. However, those matrices were not as significantly over-represented as ESR1. Since frequently several transcription factors are involved in regulating gene expression [52][53][54][55][56], it is conceivable that the additional matrices are reflections of interacting TFs rather than false positive discoveries. Without further experimental evidence, however, it is hard to tell in general, even if they are only weakly over-represented.
As it turned out, input promoter number and promoter length had a great influence on the sensitivity of oPOS-SUM. Hence, we used the ESR1 matrix as a showcase to evaluate different parameter settings for oPOSSUM. The following points summarized our findings: 1. ESR1 can be detected as over-represented in a wide range of promoter lengths from 4000 bp to 7000 bp. One possible reason for this is that ER can bind to proximal promoters as well as distal ones [8]. However, the region stretching from -2000 bp to +2000 bp around the TSS is the preferred region. 2. ESR1 can be found significantly over-represented in up-regulated genes under different numbers of upregulated genes, ranging between 40 and 800 genes. For the down-regulated genes, ESR1 was found to be over-represented when the gene number was greater than 400, however, at a very low level of significance. The importance of parameter settings for the performance of oPOSSUM might indeed reflect properties how a specific TF regulates its targets. For example, the ESR1 matrix was recognized as significantly over-represented by oPOSSUM in promoters ranging from -2000 bp to +2000 bp around the TSS, but not in the range of -2000 to 0 bp around the TSS, which might indicate the distribution of TF binding sites in promoter regions. Indeed, this had already been addressed specifically for the ER transcription factor by Lin et. al. Their Chip-PET experiment showed that the largest fraction (38%) of binding regions mapped to intragenic regions of transcripts and were localized within introns, whereas 23% were within 100 kb from the 5' start sites, and 19% were within 100 kb of 3' polyadenylation sites [8]. This clearly indicated significant enrichment of ER binding sites in downstream regions of promoters. This is in line with our observation that ignoring the promoter ranging from 0 to +2000 bp makes ESR1 not discoverable by over-representation analysis. This allows the conclusion that over-representation conditions reflect the distribution of TF binding sites, which is an important aspect in choosing proper promoter regions in our motif finding and analysis.

Systematic analysis of performance of over-representation analysis
As the above example demonstrates, the ESR1 binding site can be recovered through over-representation analysis in up-regulated genes. To see whether this carries over to other TFs, we proceeded to analyze the remaining experiments for which we had identified differential genes in the microarray experiments (see Methods). Under the default parameter settings, we repeated the above process for over-representation analysis with oPOSSUM. We summarized the result of oPOSSUM analysis in Table 2 (for detailed result, see Additional file 5). Under default parameter settings, up to 12 target matrices were found to be significantly over-represented in either of up-or down-regulated genes. In seven experi-ments, target matrices were over-represented at a low level of significance. Due to the great influence of parameters, for those 21 experiments whose target matrices were not significantly over-represented under default parameter settings, we subsequently altered input promoter number and length, leading to the identification of significantly over-represented target matrices in seven experiments and two new weakly over-represented experiments. The remaining eight experiments did not yield any of the target matrices to satisfy z-score above 10 or Fisher-score below 0.01. For all the experiments, we also recorded conditions which recovered the target matrices as over-represented at highest possible level of significance (see Table 2). We proceeded to determine whether this success rate could actually be due to chance. For all the tested experiments, oPOSSUM found on average 3.5 matrices to be significantly over-represented per analysis, out of which one happened to be the target matrix. We determined the probability of this event by comparing to the overall number of candidate matrices in Jaspar. Then, the event of finding the target matrix in a certain number of cases is binomially distributed. We found target matrices to be over-represented in 25 out of 33 experiments, including significantly over-representation in 19 experiments. In fact, the significance of finding the target matrix to be significantly over-represented out of 33 cases has a binomial tail probability below 2.2e -16, which makes it appear  highly unlikely that this performance would be due to chance rather than the ability of the computational pipeline to pick up the right matrix. We recovered target matrices correspond to 9 out of the 11 TF classes identified in [41], with the exceptions coming from the HMG and Homeobox groups of transcription factors. The number of experiments we had available for study seems to be insufficient, however, to fully decide whether this represents a bias in over-representation analysis with respect to certain TF classes. For all those 19 experiments where target motifs appeared significantly over-represented under default parameter setting, target matrices were found significantly over-represented in either up-regulated genes or down-regulated genes, but never in both. Although in some experiments, target matrices were also weakly overrepresented, we can still conclude that TFs generally play unequal roles in activating and repressing gene expression.
In this work, eight experiments did not allow us to identify the target matrices to be over-represented. There might be a plethora of reasons for this. First of all, we evaluated the hypothesis that the information content of PFMs had great influences on the performance of oPOS-SUM. PFMs with low information content would be likely to lead to more false positive binding site predictions, which results in low performance of oPOSSUM. Therefore, we carried out a Student's t-test for information content of over-represented and not over-represented matrices. The result showed a great difference in information content (one-tailed p < 0.029). Although this was in line with our hypothesis, we still could not ascribe all failures to recover matrices to low information content. Another hypothesis we investigated was that the real distribution of TF binding sites was out of the ability of oPOSSUM. As two experiments related to Gata factors did not yield over-represented target matrices, we investigated their properties in more detail. Although ChIP-chip experiments were available that indicated the binding sites of Gata factors in proximal promoters [57], many experiments suggested that Gata factors took important roles by binding to regions out of -2000 bp and +2000 bp of the TSS [58,59]. Together with seven other TFs, no whole-genome binding site investigation was available in public databases, making it hard to draw conclusions without further experimental data. A final reason why over-representation might fail in some cases lies in the networked nature of regulation by transcription factors. TFs do not act purely by themselves, but interact with other TFs through a cascade of signals. In microarray experiments, genes with differential expression may not be the direct target of TFs. For example, c-myc can be regulated by other TFs, and c-myc may also regulate about 15% of all other genes, including numerous other TF genes [60,61]; under such conditions, it is hard to distin- S: significantly over-represented; W: weakly over-represented; N: not over-represented guish signals directly induced by a TF from such cascaded "second-round" signals.

De novo prediction
In the previous step, oPOSSUM was applied to determine over-represented TF binding sites related Jaspar matrices in differentially expressed genes. A natural next step was to determine whether those regulatory motifs could also be recovered by de novo predictions. We performed de novo prediction in promoter regions of differentially expressed genes using the Weeder tool [43]. Figure 3 shows the logos for predicted motifs and their similarity with target matrices. In order to evaluate how well target matrices could be recovered by Weeder, we summarized the number of experiments with recovered target matrices at different similarity percentage cutoffs, as shown in Figure 4. For all the experiments, Weeder predicted at least one motif sharing ≥ 60% similarity and the number of recovered experiments decreased with stricter similarity percentage cutoff. Considering the nature of TF binding sites and the mechanism of de novo prediction methods [62], we could not expect the predicted motifs to share a high degree of similarity with the target matrices. If we set the similarity cutoff 75% for recovering target matrices, our predictions could recover the TF binding sites in about 73% of the 33 experiments. In general, we may conclude that the affected TF binding sites can indeed be recovered in many cases using de novo prediction methods.

Conclusions
In this work, we report a computational evaluation on recovering TF binding sites from differentially expressed genes using two different methods. Our over-representation analysis with oPOSSUM proves successful in 25 out of 33 experiments exhibiting differential expression patterns as a consequence of activating or deactivating TFs, indicating that TF binding site recovery is generally possible with computational methods when dealing with one single manipulated transcription factor. Our evaluation of de novo prediction for all experiments succeeds in recovering motifs similar to the binding site of the affected TFs in about 73% of all cases with a cutoff of similarity percentage of 75%. This allows the conclusion that TF binding site recovery may even be achieved using a de novo approach, though less reliable than oPOSSUM over-representation analysis.
In general, our findings support the hypothesis that the over-representation of TF binding sites in the promoter regions of differentially expressed genes can be detected with computational tools and it confirms that TF binding sites can be predicted by utilizing information of differential expression. With the increasing availability of microarray data in public databases, it will be a useful theoretical foundation for novel TF binding site prediction and functional studies.
In this work, we could also specify the influence of input gene numbers and promoter length and their importance for the sensitivity of computational methods, which indicates the different properties of TF regulating gene expression. More specifically, we could observe very particular regulatory effects such as the critical effect of downstream ESR1 binding sites on gene expression.

Additional material
Additional file 1 Assessment of microarray quality with scatter plots and p-value frequency histograms. To eliminate those microarrays with low quality, we used two methods for quality evaluation. The first one was based on scatter plots, in which the averaged normalized expression value of manipulated hybrids and control hybrids were plotted. Another methods was histograms of q-value frequency distributions, predicted by SAM. We manually checked those distribution and selected reasonable experiments for differential expression analysis.