A regression analysis of gene expression in ES cells reveals two gene classes that are significantly different in epigenetic patterns
© Park and Nakai; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Skip to main content
© Park and Nakai; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
To understand the gene regulatory system that governs the self-renewal and pluripotency of embryonic stem cells (ESCs) is an important step for promoting regenerative medicine. In it, the role of several core transcription factors (TFs), such as Oct4, Sox2 and Nanog, has been intensively investigated, details of their involvement in the genome-wide gene regulation are still not well clarified.
We constructed a predictive model of genome-wide gene expression in mouse ESCs from publicly available ChIP-seq data of 12 core TFs. The tag sequences were remapped on the genome by various alignment tools. Then, the binding density of each TF is calculated from the genome-wide bona fide TF binding sites. The TF-binding data was combined with the data of several epigenetic states (DNA methylation, several histone modifications, and CpG island) of promoter regions. These data as well as the ordinary peak intensity data were used as predictors of a simple linear regression model that predicts absolute gene expression. We also developed a pipeline for analyzing the effects of predictors and their interactions.
Through our analysis, we identified two classes of genes that are either well explained or inefficiently explained by our model. The latter class seems to be genes that are not directly regulated by the core TFs. The regulatory regions of these gene classes show apparently distinct patterns of DNA methylation, histone modifications, existence of CpG islands, and gene ontology terms, suggesting the relative importance of epigenetic effects. Furthermore, we identified statistically significant TF interactions correlated with the epigenetic modification patterns.
Here, we proposed an improved prediction method in explaining the ESC-specific gene expression. Our study implies that the majority of genes are more or less directly regulated by the core TFs. In addition, our result is consistent with the general idea of relative importance of epigenetic effects in ESCs.
Embryonic stem cells (ESCs) derived from blastocysts are self-renewal and pluripotent [1–3]. To understand the gene regulatory system in ESCs is an important step for uncovering the process of cell fate determination and for promoting regenerative medicine. Considerable recent evidence indicates that several transcription factors (TFs), so-called core TFs, are indispensable to maintain the pluripotency [4, 5]. Some of the core TFs reprogram somatic cells back to pluripotent states [6, 7]. These observations suggest that the regulatory network of TFs apparently governs the self-renewal and pluripotency [8, 9]. On the other hand, many studies have reported that other TFs can functionally substitute for the core TFs [10–13], suggesting that there still exist additional or alternative TFs unrevealed in the network. Epigenetic modifications are also essential for ESCs [14, 15]. Their involvement in the maintenance of the pluripotency is still not well clarified.
To understand the regulatory mechanism underlying in ESCs, a number of methods have been developed. In particular, massive parallel sequencing [9, 16–19] and various in silico approaches [8, 9, 20, 21] have yielded comprehensive recent advances in our understanding. In this study, we focus on predicting the gene expression in ESCs with the massive parallel sequencing data. Although a previous study successfully applied a regression model to the prediction , the model is based on a generalized weighting scheme to prepare predictors (explanatory variables). Intuitively, such weighting scheme cannot reflect the nature of the spatial rearrangement of TF-binding.
Here, we propose a density-based approach that uses the genome-wide bona fide TF binding sites. First, a publicly available ChIP-seq data  is reanalyzed. Then, density profiles of TFs estimated from the ChIP-seq data are adopted as predictors in a simple linear regression model to predict the genome-wide gene expression. Predictors are also combined with epigenetic data, such as H3K4me3, H3K27me3, DNA methylation, and CpG island [16, 17]. Furthermore, we analyze the regulatory effects of TFs, epigenetic states, and their higher-order interactions by using a pipeline developed in house. We demonstrate the predictive power of the density-based regression model and discuss our findings.
To minimize artifacts, we refined the binding signals of 12 core TFs in mouse ESC publicly available  (see methods). The ChIP-seq peak datasets generated by various tools are hereafter denoted as FP4_Bowtie, FP4_MAQ, and FP4_Soap2. Also, tag positions mapped by Eland  are used for the peak detection (FP4_Eland), and the peak data of Chen et al. is involved (Chen Eland). Thus, we prepared five peak datasets in total.
Reproducibility of newly detected peaks
Overlap of Chen Eland (%)
Correlation of Peak Intensity
The reason why the numbers vary is twofold. First, algorithmic differences in alignment tools cause the different numbers, particularly due to the gapped or ungapped alignment and random indel for mismatches. Second, thresholds for the peak intensity to distinguish experimental noise are different (Table S4 in Additional file 1). That is, Chen et al. used qPCR refinement with small number of peaks, whereas we used Monte Carlo simulation on each chromosome.
Next, we prepared 17060 genes by removing inconsistency between Ouyang’s study and Chen’s study. This procedure is prerequisite for gathering precise TF-binding instances. TFAS data for the genes were calculated by the exactly same procedure of Ouyang et al. As a result, the exponential-based model shows CV- R 2=0.495 with Chen_Eland. In contrast, CV- R 2 increases to 0.542 (FP4_Eland), 0.587 (FP4_Bowtie), 0.581 (FP4_MAQ), and 0.590 (FP4_Soap2).
These results clearly suggest that the proposed simple linear regression model is applicable to the prediction. Furthermore, it has been demonstrated that the peak datasets we remapped give more information for explaining the gene expression.
To investigate the characteristics of TF binding sites in ESC, the density profiles of TF-bindings are estimated from each of peak datasets (Figure 1C), then any two density profiles for a TF in different peak datasets are tested by Kolmogorov-Smirnov (KS) test. According to the KS test, the profiles of a TF are almost identical even if the number of mapped tags and peaks are largely different in, say, Esrrb (Figure 1D). The exceptional case is Sox2 in Chen_Eland and FP4_Bowtie (Figure 1E) due in part to the stringent filter used in Chen Eland; e.g. loss of Sox2 peaks in Chen Eland at gene clusters on chromosome X (Figure S1 in Additional file 2).
Importantly, in the same peak dataset, the profiles are significantly different among TFs, e.g. Oct4 and Smad1 in FP4_Bowtie are shown in Figure S2. It is, therefore, thought that spatial preference of TF-bindings cannot be explained by one generalized distribution. In fact, the binding distributions of Nanog, Smad1, Sox2, and Stat3 definitely do not follow the exponential distribution (Figure S2 in Additional file 3).
Our observations from the genome-wide distribution of TF binding sites revealed the distinct binding preference from exponential function (Figure 1E). Thus, we use the density profiles as predictors given as equation (2), which we call the density-based regression model. The predictive power of the density-based model with Chen_Eland (Figure 2B) is slightly higher (CV- R 2=0.508) than the exponential-based model (CV- R 2=0.495). Similar results were obtained when other peak datasets were used.
We suspect that the prediction quality of two regression models may depend on downstream genes that cause specific density profiles. To confirm it, we extracted 4095 ESC-specific genes. E2f1 was excluded here due to its excessive regression coefficient . Then, a subset of 4095 genes that is co-bound by a TF pair was prepared. Since the TFs used are well-known essential regulators in ESCs, the TF pairs, such as Oct4 and Sox2, possibly play an important regulatory role in their downstream ESC-specific genes. All subsets by any combination of two TFs have been prepared.
Figure 2D illustrates that the density-based regression model outperforms in many cases. Furthermore, 55 gene subsets that are co-bound by TF pairs whose density profiles are significantly different (p < 0.05) were successfully predicted (red solid circles in Figure 2D). These gene subsets cannot be modeled by a generalized exponential function. The results suggest that the spatial preferences of TF bindings are much more dynamically changed in ESC-specific gene subsets rather than observed from all the genes. This is why the density-based model improved the predictive power with respect to the generalized exponential-based model.
It was demonstrated previously that the absolute gene expression in ESCs is predictable by the ChIP-seq data of core TFs . We also confirmed the high predictive power of the regression model. However, the results strongly rely on certain genes whose ‘ predicted ’ expressions are constantly lower, but ‘ observed ’ expressions are more varied (Figure 2A-C). In Figure 2A, we observed the binomial distribution of predicted expressions that can be partitioned by 1 RPKM (zero on the horizontal axis). We denote C1 for genes where predicted expression is ≥1 RPKM, C2 for the remains. The conspicuous frequency of C2 is also observed from Figure 2B-C. C2 genes in Figure 2C consist of 1205 up- and 1254 down-regulated genes. Further, the subset of C2 (C2′) where observed expression is greater than 1 RPKM consists of 148 up- and 159 down-regulated genes.
C2 genes lack the TF-binding instances, implying less direct regulation by the core TFs. This depletion is due in part to excessive non-CpG DNA methylation . Gene ontology analysis shows that C2 genes are often related to differentiation. Thus, they should be preferentially repressed in ESCs. Interestingly, as the histone marks are relatively rare among C2 genes, they are likely to be controlled by other regulatory pathways connecting to the maintenance of self-renewal. One possibility is the competitive binding of additional TFs not involved in this study because of the global open chromatin conformation in ESC . Other possibilities include additional epigenetic patterns and homeostatic regulation, further investigations are required.
To further understand the epigenetic effects in gene regulation, we add three epigenetic states to the regression models; histone mark (HistM), DNA methylation (Methy), and CpG island (CpGI). Thus, 14 explanatory variables are used. To identify effective variables in the prediction, we reduced the regression model by using the stepwise model selection. Also, 100 runs of computer simulation that randomly assign the epigenetic states were performed.
Effects of epigenetic patterns in reduced regression models
11TFs + 3 epigenetic effects
CV- R 2
CV- R 2
CV- R 2
To investigate the cooperative effects among TFs and epigenetic patterns in gene regulation, we exhaustively searched significant interaction terms from our regression model. First, a subset of ESC-specific genes that are co-bound by a specific TF pair is prepared. Then, the saturated model for the genes is constructed. The model involves 469 variables; 14 main effect terms (11 TFs and 3 epigenetic states) and 455 higher-order interaction terms (all the possible pairwise and triplewise interactions). Finally, our pipeline greedily identifies important variables (see methods). This procedure is independently performed with each of five peak datasets.
With considering the redundancy and conservativity, we represented the interactive terms of 215 models as a network (Figure S4 in Additional file 2). As a result, 19 gene sets covering approximately 86% of genes (3523 out of 4095 genes) were linked by 28 regulatory edges of the epigenetic effects that are commonly found in the five peak datasets (Figure S4 in Additional file 2). These results suggest that the cooperative interactions between TF and the epigenetic state are indispensable to explain the majority of gene expression in ESCs. In addition, we confirmed that the regression coefficients in Figure 2F are dramatically changed in the regression of given gene sets, and also CpGI significantly contributes to the prediction of gene expression (Figure 4).
ESCs are the widely accepted source for the study of many biological principles. Despite recent advances in our understanding of biological systems, the gene regulation in ESCs is only incompletely understood. To explore the regulatory mechanism underlying in ESCs, we constructed a predictive model for explaining the absolute gene expression in mouse ESC. This model uses a novel density-based approach to exploit the recent massive parallel sequencing straightforwardly.
We first reanalyzed the publicly available ChIP-seq data for 12 well-known pluripotent core TFs , and retrieved the reproduced and extended TF-binding sites and intensities (Table 1). Using our regression model based on the exponential function , we found that the remapped peaks are more informative to explain the gene expression (Table 2). Therefore, we concluded that the algorithmic differences in computer tools for ChIP-seq data significantly affect the downstream analysis. Analyzing the heterogeneous peak datasets in a comparative manner, we found that the spatial binding preference of each TF is well conserved in all the datasets, whereas the preferences of TFs in a dataset are significantly different from each other (Figure 1D-E). These results imply that density profiles are better explanatory variables than the generalized exponential function. In fact, the predictive power of density-based model is constantly higher than the exponential-based model (Figure 2A-C, Table 2). Even if the density profiles are dynamically changed in certain downstream genes, the proposed model is still outstanding (Figure 2D).
Unexpectedly, we found two gene classes that are either well explained or inefficiently explained by the regression model. The latter class genes have less binding instances of the pluripotent TFs (Figure 2E), possibly related to excessive DNA methylation (Figure 3A). The gene classes show apparently different characteristics in epigenetic modifications (Figure 3), suggesting that they are likely to be under control in different regulatory mechanisms. In the present study, we simply combined the discrete epigenetic states with the powerful density-based model. This model significantly improved the predictive power (Table 2). Investigating higher-order interactions among the predictors, we found that the cooperative interactions between TF and epigenetic pattern are indispensable for regulating approximately 86% of ESC-specific genes (Figure S4 in Additional file 2). These results suggest that the relative importance of epigenetic effects to regulate the gene expression in ESCs, supporting the general idea [14, 15].
We proposed a powerful regression model, and uncovered the relative importance of epigenetic regulation in ESCs. Overall prediction quality is still insufficient. As future works, comprehensive representation of epigenetic patterns is required, and additional or alternative TFs in ESCs should be considered.
Raw tag sequences and a control library were downloaded from GEO database (GSE11431). High-quality 26 base pair (bp) tags that have less than three ambiguous bases were mapped to mm8 by Bowtie , MAQ , and Soap2  with allowing two mismatches. Only uniquely mapped tags were extended to 200-bp virtual fragments (Figure 1A). FP4 (FindPeaks 4.0)  detected significant peak regions. Monte Carlo simulation was performed on each chromosome to calculate false discovery rate (FDR). Also, the fold enrichment of tags in each peak region over remapped control tags was measured. Finally, we prepared peaks by criteria, FDR <5% and 5-fold enrichment.
For the absolute gene expression, the number of tags per kilobase of exon region per million mapped tags (RPKM)  for 18936 mouse genes in ESC and in embryoid body (EB) were prepared from . Positional information of transcription start sites (TSSs) of 17443 Refseq genes in mm8 were prepared from . Removing inconsistent gene IDs between RPKM data and TSS data, we compiled 17060 genes. We prepared 4095 ESC-specific genes whose expressions are 4-fold up- or down-regulated in ESC over EB. The dataset used in here is available at http://www.hgc.jp/~park/research/.
DNA methylation maps are prepared from two datasets that use different high-throughput detection methods [16, 17]. Methylation states of high-CpG-density promoters (GC content ≥0.55) are defined by mean methylation levels; unmethylated if mean ≤0.25, methylated if mean ≥0.75. The genome-wide distribution of CpG islands and histone mark were downloaded from UCSC genome browser. We consider three histone states; histone H3 lysine 4 trimethylation (H3K4me3), an active mark of expression, H3 lysine 27 trimethylation (H3K27me3), a repressive mark, and bivalent domain of H3K4me3 and H3K27me3, a ‘poised’ mark of expression.
Given a genome-wide location map of a TF-bindings, all peak positions were converted to relative positions to the nearest TSSs. Gaussian kernel density function (bandwidth=300 bps) estimated the density profile of the TF-bindings within ±50K bps. The profile was normalized into range of [0, 1] by dividing by the maximal density height.
where g k is the perk intensity of the k th binding peak of the j th TF, F j is the normalized density function for the j th TF, and l k is the relative position of the k th peak to TSS of gene i. Note that a small value is added to Y i for the logarithm.
where H is the type of histone mark (neither mark=1.0, H3K27me3=2.0, bivalent mark=3.0, H3K4me3=4.0), M is the DNA methylation (no annotation=1.0, methylation=2.0, unmethylation=3.0), C is the CpG island (absence=1.0, presence=2.0), and α, β, γ are the regression coefficients for H, M, C, respectively.
Explanatory variables in a regression model are log-transformed and quantile-normalized. 10 runs of 10-fold cross validation (CV) measure the average correlation coefficient (CV- R) and the average proportion of variation explained by the model (CV- R 2). The stepwise model selection is done by stepAIC in R language with the backward and forward procedure. The regression model with higher-order interactions are reduced by a pipeline developed in house; ANOVA in R language first diagnoses the significance of each explanatory variable in the given saturated model. Next, significant variables (p < 0.05 in F-test) are gathered. Finally, the best model is constructed by adding and removing the collected variables one by one in increasing order of p -value until CV- R 2 is not improved anymore.
We thank Dr. Tetsushi Yada (Kyoto University) for helpful discussions. Computational resources were provided by the super computer system at Human Genome Center, Institute of Medical Science, University of Tokyo. This work was supported by the Research Program of Innovative Cell Biology by Innovative Technology (Cell Innovation) by the Ministry of Education, Culture, Sports, Science and Technology-Japan.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.