Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data
© Gao et al 2004
Received: 12 November 2003
Accepted: 18 March 2004
Published: 18 March 2004
Skip to main content
© Gao et al 2004
Received: 12 November 2003
Accepted: 18 March 2004
Published: 18 March 2004
Functional genomics studies are yielding information about regulatory processes in the cell at an unprecedented scale. In the yeast S. cerevisiae, DNA microarrays have not only been used to measure the mRNA abundance for all genes under a variety of conditions but also to determine the occupancy of all promoter regions by a large number of transcription factors. The challenge is to extract useful information about the global regulatory network from these data.
We present MA-Networker, an algorithm that combines microarray data for mRNA expression and transcription factor occupancy to define the regulatory network of the cell. Multivariate regression analysis is used to infer the activity of each transcription factor, and the correlation across different conditions between this activity and the mRNA expression of a gene is interpreted as regulatory coupling strength. Applying our method to S. cerevisiae, we find that, on average, 58% of the genes whose promoter region is bound by a transcription factor are true regulatory targets. These results are validated by an analysis of enrichment for functional annotation, response for transcription factor deletion, and over-representation of cis-regulatory motifs. We are able to assign directionality to transcription factors that control divergently transcribed genes sharing the same promoter region. Finally, we identify an intrinsic limitation of transcription factor deletion experiments related to the combinatorial nature of transcriptional control, to which our approach provides an alternative.
Our reliable classification of ChIP positives into functional and non-functional TF targets based on their expression pattern across a wide range of conditions provides a starting point for identifying the unknown sequence features in non-coding DNA that directly or indirectly determine the context dependence of transcription factor action. Complete analysis results are available for browsing or download at http://bussemaker.bio.columbia.edu/papers/MA-Networker/.
For various organisms, DNA microarrays have been used to measure the mRNA abundance for essentially all protein-coding genes in the genome under a large number of conditions [1, 2]. Microarray technology can also be combined with chromatin-immunoprecipitation (ChIP) or chromatin profiling (DamID) to quantify the occupancy of upstream non-coding regions by transcription factors or other chromatin-associated proteins [3–9]. In the budding yeast Saccharomyces cerevisiae, ChIP has been used to globally map the binding sites of over a hundred transcription factors . Moreover, mRNA expression data for over a thousand conditions has been published. The challenge is to find new ways to extract knowledge about the regulatory mechanisms that govern the cell by combining these different types of data [10–14].
Initiation of transcription in eukaryotes is a complicated process that depends on the binding of transcription factors (TFs) and chromatin-modifying enzymes to the promoter region as well as the recruitment of the RNA Polymerase II complex to the transcription start site. Transcriptional control is combinatorial, and cooperative binding of multiple factors on the same promoter region and/or cooperative recruitment of the Pol II complex is often required for transcriptional activation . Occupancy of the promoter region of a gene by a transcription factor is thus a necessary but not a sufficient condition for the gene to be controlled by it. As a consequence, genome-wide transcription factor binding patterns measured using ChIP or DamID microarray experiments alone can only indicate the potential for a gene to be regulated by a given TF. Independent information will be required to establish that the gene is indeed a functional target of the factor.
Deletion or over-expression of a transcription factor, combined with genomewide microarray profiling of the difference in expression between mutant and wild type, is also widely used to infer regulatory interactions. However, drastic perturbation of the genetic network outside the physiologically relevant range may lead to false target prediction, or the mutant strain may simply not be viable. Moreover, direct and indirect targets of the factor cannot be distinguished using this approach.
When mRNA abundances for all genes are compared between two experimental conditions in a microarray experiment, the observed differential expression pattern is usually a superposition of responses of various pathways, mediated by signaling cascades that end at the level of transcription factors. It has previously been shown that these changes in TF activity can be quantitatively inferred by performing multivariate regression analysis on the expression log-ratios from a single microarray experiment [16–19]. Transcription factors are implicitly represented by a consensus motif for their DNA binding sites, and the regression coefficients estimate the changes in TF activity.
In the present study we will instead use ChIP occupancy log-ratios as predictors for expression. No sequence information will be used, and it is therefore not necessary that a DNA consensus motif be known for the TFs. Again, multivariate regression analysis of a single genomewide set of mRNA log-ratios on the genomewide binding profiles of a large number of TFs for which ChIP data is available can be used to quantify to what extent each transcription factor is responsible for the observed changes in mRNA expression.
One expects the mRNA expression profile of a gene regulated by a specific transcription factor to be similar to the TFAP of that factor. We therefore investigated whether the linear correlation across the experiment library between a TFAP and the mRNA expression profile of a gene whose promoter is bound by the factor could be interpreted as a regulatory coupling strength and used to improve the specificity of target prediction. To this end, we constructed a matrix of regulatory coupling strengths between all transcription factors and all genes ("Step 2" in Fig. 1a). When this information is combined with the original ChIP data for a given TF, the ChIP log-ratio and coupling strength for each gene can be shown simultaneously in a 2D scatter plot (Fig. 1c). The fact that each gene has two parameters associated with it allows a more sophisticated classification than is possible based on ChIP alone. We first defined a set "B+" of genes that are significantly bound by a TF (we required the P-value reported by Lee et al. to be smaller than 10-3) . Next, we then partitioned the "B+" gene set into two subsets "B+/C+" and "B+/C-" based on whether or not their mRNA expression profile was significantly correlated with the TFAP (Pearson correlation, 5% false discovery rate). Our hypothesis is that the B+/C+ genes (shown in red in Fig. 1c) are the functional direct targets of the factor, while the binding to the promoter region of the B+/C- genes (shown in green) is non-functional.
Classification of genes according to ChIP data and inferred regulatory coupling.
Analysis of transcriptional response to transcription factor deletion.
-log 10 (p)
-log 10 (p)
Over-representation of four cell cycle related motifs.
Replacing regulatory coupling strength by response to transcription factor deletion.
Number of genes in each group
Our results underscore the unique added value of ChIP data such as that of Lee et al. when it is used in combination with a library of mRNA expression data . We found that roughly half of the transcription factor targets predicted by ChIP are nonfunctional. Although some of these will be false positives of the ChIP technology, especially for TFs that are not present in active form in the nucleus under the conditions used by Lee et al., we believe that our results instead point to interesting biology: TF binding can fail to confer transcription of a nearby gene for a variety of reasons, including competition with nearby activators or repressors, local or global chromatin conformation, or lack of partners for cooperative recruitment of the Pol II complex.
Several works have relied on representing a TF by its mRNA expression profile in order to discover connections between transcription factors and their targets [23–25]. By contrast, our method infers changes in TF activity by analyzing the mRNA levels of putative TF targets. This allows us to analyze regulatory relationships even if the TF is modulated in a purely post-translational manner, e.g. by phosphorylation. The reliable classification of ChIP positives into functional and non-functional TF targets, as it has been presented here, provides a starting point for future research aimed at identifying the unknown sequence features in non-coding DNA that directly or indirectly determine the context dependence of TF action.
A library of 751 genomewide mRNA expression patterns (transcriptomes) was compiled from a variety of sources (see supplementary data for complete references). ChIP data for 113 transcription factors was downloaded from the website accompanying Lee et al. . We used the P-values provided by these authors to determine which genes were significantly bound by each given factor at a confidence level of P < 10-3. All microarray data used in our analysis was represented as log-ratio base two.
For each separate transcriptome t, we used the following multivariate regression model to infer transcription factor activities for each microarray experiment:
Here E gt represents the mRNA expression log-ratio of gene g in experiment t, while B fg represents the ChIP log-ratio for transcription factor f and the promoter region of gene g. The intercept F 0t represents a baseline expression level, while the regression coefficients F ft can be interpreted as inferred transcription factor activities. Starting with the full set of 113 transcription factors, we used backward selection to eliminate uninformative transcription factors from our model: First, for each microarray experiment a P-value corresponding to each regression coefficient was determined, based on an F-test . The transcription factors were then sorted based on the smallest P-value among all 751 experiments. In an iterative procedure, the transcription factor with the most insignificant P-value was removed until all factors were significant at a P-value of 0.005/751. Since this analysis in itself is novel and useful, we have made an online ChIP regression tool available at http://bussemaker.bio.columbia.edu/tools/.
For each pair-wise combination of a gene g (represented by its mRNA expression profile E gt ) and a transcription factor f (represented by the inferred activity profile F ft ), a regulatory coupling factor was calculated, equal to the Pearson correlation between E gt and F ft :
For each value of r, an associated P-value was computed by performing a t-test on t = r[(G-2)/(1-r 2 )] 1/2. To account for the parallel testing of many TF-target pairs, but at the same time avoid the overly conservative Bonferroni correction, we set a threshold for t by requiring a false discovery rate of 5% . The end result of this procedure is a list of genes that are significantly coupled to a transcription factor. Strictly speaking, to avoid circularity, the coupling of each gene should be evaluated based on a TFAP derived from expression data for all but that gene. However, as the TFAP is derived from the expression profile of all genes bound by the TF, the effect of leaving out one gene is relatively insignificant in practice. Moreover, repeating this procedure for every gene in the genome would be computationally unfeasible.
Based on the regulatory coupling analysis described above, the genes bound a given transcription factor (B+) were sorted in two classes, B+/C+ (bound and coupled) and B+/C- (bound but not coupled). These two sets were used as input for further analysis. The cumulative hypergeometric distribution was used to determine whether a set of genes is enriched for one or more Gene Ontology categories . The Bonferroni correction was applied to all P-values to deal with the parallel testing of GO categories. The organism-independent ontology and the gene-association table (version May 2003) for S. cerevisiae were downloaded from http://www.geneontology.org.
Expression data for mutant vs. wild-type comparison for the transcription factors Dig1, Gcn4, Hir2, Mbp1, Swi4, Swi5, and Yap1 were obtained from Hughes et al. . To test whether a given subset of genes responded to TF deletion, a sample t-test was performed, comparing the average expression log-ratio in the subset with the genome-wide distribution of expression changes. To guarantee that this analysis was fair, the respective TF deletion experiments were excluded from the library used to calculate the coupling factors that define the C+ and C- groups.
Four different DNA motifs found as top-scoring motifs by REDUCE and also reported in Spellman et al. were tested for over-representation in the set of B+/C+ and B+/C- genes, respectively, for the 7 cell-cycle related transcription factors within the set of 37 factors analyzed by us [16, 28]. These motifs are: ACGCGT (MCB), CGCGAAA (SCB), AACCAGC (Swi5p) and GTAAACA (SFF). Motifs were counted in non-coding regions up to 600 bp upstream from the ORF start position, and expected counts were based on upstream regions of all genes. No overlapping matches were counted. The cumulative binomial distribution was used to assign a P-value to the enrichment for these motifs.
DNA adenine methyltransferase identification
transcription factor activity profile
We would like to thank Marcel van Batenburg and Crispin Roven for their assistance and helpful suggestions. We are also grateful to Frank Holstege, Bas van Steensel, and Kevin White for valuable comments and a critical reading of the manuscript. F.G. was partially funded by the Netherlands Organization for Scientific Research (NWO) and the Human Frontier Science Programme (HFSP). B.C.F. and H.J.B. were partially funded by the National Institutes of Health.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.