Epigenetic domains found in mouse embryonic stem cells via a hidden Markov model
© Larson and Yuan. 2010
Received: 28 June 2010
Accepted: 12 November 2010
Published: 12 November 2010
Skip to main content
© Larson and Yuan. 2010
Received: 28 June 2010
Accepted: 12 November 2010
Published: 12 November 2010
Epigenetics is an important layer of transcriptional control necessary for cell-type specific gene regulation. Recent studies have shown significant epigenetic patterns associated with developmental stages and diseases. However, previous studies have been mostly limited to focal epigenetic patterns, whereas methods for analyzing large-scale organizations are still lacking.
We developed a hidden Markov model (HMM) approach for detecting the types and locations of epigenetic domains from multiple histone modifications. We used this method to analyze a published ChIP-seq dataset of five histone modification marks (H3K4me2, H3K4me3, H3K27me3, H3K9me3, and H3K36me3) in mouse embryonic stem (ES) cells. We identified three types of domains, corresponding to active, non-active, and null states. In total, our three-state HMM identified 258 domains in the mouse genome containing 9.6 genes on average. These domains were validated by a number of criteria. The largest domains correspond to olfactory receptor (OR) gene clusters. Each Hox gene cluster also forms a separate epigenetic domain. We found that each type of domain is associated with distinct biological functions and structural changes during early cell differentiation.
The HMM approach successfully detects domains of consistent epigenetic patterns from ChIP-seq data, providing new insights into the role of epigenetics in long-range gene regulation.
Well before the first eukaryotic genome was sequenced, the notion that chromatin is partitioned into larger than gene-size domains (such as the heterochromatin and euchromatin) was conceived [1, 2]. Genes that are functionally related and co-regulated are often located close to each other. These include the Hox and the β-globin gene clusters [3, 4]. More generally, unrelated spatially proximal genes can still be co-regulated, and such co-regulation can in part be explained by long-range chromosomal interactions . However, a major barrier for understanding the mechanism for long-range gene regulation is the difficulty of generating high-resolution long-range chromosomal interaction data on a genomic scale .
The fundamental unit of chromatin is the nucleosome, which is a histone octamer wrapping up ~147 bp DNA. The amino acid residues on the N-terminal tails of the histone proteins can be covalently modified in a number of different ways, and different biochemical modification marks may have very different biological functions . In recent years, genome-wide distributions of various histone modifications in several eukaryotic organisms have been mapped using chromatin immunoprecipitation followed by either microarrays (ChIP-chip) or DNA sequencing (ChIP-seq) [8–10]. It is thought that different combinations of histone modifications result in different functional specificities  and that several modifications form broad domains [12–14], which is helpful in stabilizing chromatin state and transmitting the states in cell division . Integration of multiple histone modification marks has identified distinct epigenetic patterns, associated gene activities, and regulatory elements [16, 17]. However, most of these earlier studies are done on a gene-by-gene basis and have ignored the spatial correlation of epigenetic patterns.
Because it is difficult to detect long-range chromosomal interactions through experimental methods, it is valuable to develop computational methods to identify long-range correlations such as epigenetic domains based on histone modification data. This can be used to infer such interactions. Here we use the term "epigenetic domain" to refer to a large-scale region containing multiple genes with epigenetic patterns. For the rest of the paper, we will use the terms "epigenetic domain" and "domain" interchangeably. Such epigenetic domain methods should provide mechanistic insights into coordinated gene regulation. Several approaches have already been developed in recent years [18–24]. However, two of the studies [18, 20] partition the genome without any constraint, making the results hard to interpret. Other studies consider only a single histone modification mark, which can only give partial epigenetic information [19, 20]. Or the researchers only examined a certain pattern of two modifications [21, 22], simply small regions , or were not looking at neighboring regions, only clusters as a whole . Therefore, further improvement is still needed.
Here, we developed a novel method that uses genome-wide histone modification data to identify epigenetically consistent, multi-gene domains. We applied our method to analyze a recently published ChIP-seq dataset for mouse embryonic stem (ES) cells [10, 25] and found that we were able to identify a number of domains that are significantly large (i.e. not just due to chance). We validated our predictions by integrating a number of data sources. We also explored these histone modifications in the neural progenitor (NP) cell line to determine what, if any, changes in domain size, structure, and/or function occur during early cell differentiation. Our method provides a useful tool to investigate the role of epigenetic domains in development.
To constrain on spatial correlation of histone modification patterns, we applied a hidden Markov model (HMM) to partition the genome into epigenetic domains based upon the five-dimensional summary histone modification scores. In accordance with previous methods , we assumed that the emission probability can be approximated by multivariate Gaussian distribution. This gives a better fit than a Poisson distribution (Additional file 1: Figure S1).
For hidden-state assignment, we compared the results from two commonly used methods: the Viterbi and Forward-Backward algorithms, on Ch19. The results from the methods were 99.8% identical (Additional file 2: Figure S2). Since the Viterbi method is more computationally efficient and essentially as accurate, it was used to assign our hidden gene states.
We considered two model configurations, corresponding to two and three hidden states. For the rest of this paper, we will further explore these two choices and compare the corresponding results.
Summary statistics for significant domains in the two-state HMM
Number of significant domains
Min domain size
Max domain size
Average domain size
Variance in domain size
Summary statistics for significant domains in the three-state HMM
Number of significant domains
Min domain size
Max domain size
Average domain size
Variance in domain size
Average (variance) histone modification activity within a state for the two-state HMM
Average (variance) histone modification activity within a state for the three-state HMM
It is intrinsically difficult to define a 'gold-standard' set of epigenetic domains. To test whether our predicted domains are biologically meaningful, we examined a number of properties that are associated with chromatin domains, as explained below.
Third, we asked whether the genes that are embedded in the same domain tend to have similar biological functions. To this end, we examined the Gene Ontology (GO) patterns within and between domains. We used Fleiss' Kappa , an accordance statistic, to measure the coherence of GO terms within our significant domains. Compared to a random selection of domain bounds, our predicted significant domains had a more concordant GO structure (p-values < 0.001) for the two- and three-state models (Figure 5b). Again, the three-state HMM had a higher level of accordance, suggesting that it is a better fit of the data.
Finally, we recognized that the histone modification patterns within a predicted domain are consistent, but substantial variance still remains. As a further validation, we asked whether our HMM was expected to provide reliable predictions under such circumstances. To this end, we designed numerical simulations mimicking the parameters for the real data (Methods). For the two-state HMM, we were 98% accurate in predicting the true states when using the observed variances, and 97% accurate when using twice the observed variances (Additional file 8: Figure S3a & Additional file 9: Figure S4a). For the three-state HMM, we were 99% accurate in our simulations (Additional file 8: Figure S3b & Additional file 9: Figure S4b). This suggests that if such domain patterns actually exist in the data, our model would be sufficiently able to detect them.
The Hox gene clusters are a well-described epigenetic domain family. These genes regulate the anterior-posterior axis of metazoan organisms and are expressed in a sequential order during cell differentiation. In ES cells, the Hox genes are targeted by the Polycomb group (PcG) proteins and associated with bivalent domains [12, 28]. Our method correctly detected each of the four Hox clusters to be in a non-active and significant domain. The results for the Hoxa cluster on chromosome (Ch) 6 are shown in Figure 6a. Simple K-means clustering failed to capture these state assignments.
The OR genes are only expressed on in sensory neurons, and only a single gene (out of 1300) is activated in each cell . We searched the literature for domain-level regulation of OR genes and noticed a recent paper showing that the selectivity of OR gene activation is established by the long-range chromosomal interaction between a single enhancer element and its target promoter . Due to the lack of sequence specificity of such an interaction, it is reasonable to assume that the maintenance of an open chromatin environment over a large domain plays an important role in the regulation of OR genes. A comparison of the domain organization between sensory neurons and other cell types may provide further insights into the unique feature of OR gene regulation.
Previous studies have identified dramatic epigenetic changes during cell differentiation [10, 14, 33]. To test whether the epigenetic changes also occur at the domain level, we applied our three-state HMM to infer domain states in the NP cells and compared the results for ES cells (Additional file 10: Table S6). While the overall change is moderate, we noticed some important changes at specific loci. For example, we found that the non-active Hoxa domain shrank significantly in NP cells (Figure 6c), consistent with a previous time-course study . Such change is accompanied by activation of certain Hoxa genes in NP cells (Additional file 11: Figure S5a). The loss of H3K27me3 is accompanied by a moderate increase of H3K4me3 and expression levels. Little changed for the OR genes between the ES and NP cell lines (Figure 6d and Additional file 11: Figure S5b).
In total, we found that 179 of the 258 domains in ES cells contain at least one gene that changes epigenetic state in the NP cell line. For each significant ES domain, if any genes within the domain have a new state in the NP cell line, then those genes would represent a domain change. Note that there may be smaller NP domains than their corresponding ES domains, and that one ES domain could be multiple NP domains. Thus our 258 significant ES domains were 450 NP domains (Additional File 10: Table S6). For these NP domains changes, we again used DAVID to analyze the functions of the genes that change state during this early stage of development by examining the top three significant DAVID clusters (Additional file 12: Figure S6) for each of our 6 types of change (and for genes that remained state unchanged). Eight domains are non-active in ES cells but become active in NP cells. These domains are enriched with developmental regulators. On the other hand, six domains, containing 108 genes, switch from null to active states during differentiation.
Some of the non-active domains in ES cells remain non-active in NP cells, and they may be important for further development. On the other hand, a number of domains switch from the active to the non-active state in NP cells, and the functions of these domains are typically related to early embryonic development. They are enriched with functions such as sex differentiation and apoptosis. Thus, the early developmental genes are epigenetically marked in ES cells rather than at a later developmental stage.
As the nucleosome level chromatin states become increasingly well described , the next frontier becomes the characterization of higher-order chromatin structure. Numerous studies have suggested that epigenetic domains play important roles in gene regulation [5, 35], yet the detection of genome-wide long-range chromosomal correlations remains technically challenging . On the other hand, genome-wide histone modification data provides important information about long-range gene regulation [12–14, 36]. Thus it is valuable to develop computational methods to detect large-scale domains based on histone modification data.
Here we developed an HMM-based method to predict epigenetic domains. A similar method has recently been used to characterize the epigenetic states associated with gene promoters . However, we extend this approach to identify large-scale epigenetic patterns. Compared to previous domain detection methods [18–20], our model can easily accommodate additional histone modification marks and provide easily interpretable prediction outcomes.
Our model detects three distinct types of epigenetic domains, two of which are transcriptionally inactive, which we call non-active and null. These two domain types are also distinct functionally in terms of both activation potential and biological functions. For example, both the Hox and OR gene clusters form epigenetic domains that are transcriptionally inactive in ES cells. Yet the Hox genes are critical for the overall development and are found in non-active domains in the ES cell line, whereas OR genes are expressed only in specific adult cell types and are found in null domains in the ES and NP cell lines. Therefore the epigenetic patterns provide more regulatory information than can be appreciated by gene expression data alone, signifying the importance of characterizing domain types.
Recent studies have shown that spreading of histone modification marks is an important epigenetic signature of cell differentiation [13, 14]. Our work can be viewed as an extension in terms of considering the combinatorial patterns of multiple histone modifications instead of focusing on a single modification alone. Indeed, we found changes of epigenetic domains from ES to NP cells, which are accompanied by coordinated activation of neuron-specific genes. Our analysis suggests that epigenetic domain-level changes may play an important role in neuron differentiation and organismal development.
We recognize that our model still has a number of limitations. For example, the reduction of the spatial epigenetic patterns by gene-level summary scores precludes us from pinpointing the exact locations of domain boundaries. In addition, we have ignored the correlation between different histone modification marks, which may important if data for a large number of marks is available. We plan to overcome these limitations in future studies.
Gene annotation was based on Refseq; we obtained 17,772 genes in total. For four of the five modifications (H3K4me2, H3K4me3, H3K27me3 and H3K9me3), the tag counts peak near the TSS; therefore the average was taken over the regions from -2 kb to +2 kb with respect to the TSS. The tag counts for H3K36me3 are highest around the 3'-end of the coding region of a gene; thus, we took the average tag counts over this area as a gene's corresponding score. Genes with more than 50% repetitive sequences in either of these two regions were not used in further analysis. Our preliminary data manipulations led to the elimination of sites with poor ChIP-seq coverage, resulting 17,469 (98.3%) genes used in further analysis.
where is the observed within-cluster sum of squares around the clusters means for one run, and E(.) represents the mean value for 1000 random bootstrap permutations.
We chose HMM to infer domain locations, where the hidden state at a given gene represented the associated domain type, and the emission variables are the m-dimensional vector summarizing the local histone modification pattern. We assumed that our emissions followed a multivariate Gaussian distribution.
When dealing with ChIP-seq data, researchers often like to assume a Poisson distribution for the counts mapped to each bin. This was not appropriate for our analysis for two main reasons: (1) we fit our model on non-integer summary scores to examine domain structure at the gene level and (2) the multi-dimensionality of our study. By the central limit theory, even if our raw counts followed a Poisson distribution, an average score of these counts (say over a promoter region) would follow a Gaussian distribution. To evaluate all five modifications simultaneously, we assumed that together they were from a multivariate Gaussian emission distribution with no covariance structure. We also checked the validity of our assumption by comparing distribution of the score data to its corresponding Poisson and Gaussian distributions. For these reasons, we assumed that our score data followed a multivariate Gaussian distribution.
where μ ks is the prior mean for modification k (k = 1, 2, ..., 5) and state s (s = 1, 2, 3), σ ks 2 is the initial variance for modification k and state s, θ k is the sample mean for modification k, s 2 k is the sample variance for modification k, and ν k is the degrees of freedom for modification k such that E(σ ks 2 )=(ν k -2) -1 = s 2 k . We found that using the results from k-means clustering to pick our prior results in a much higher final log-likelihood than this hierarchal prior (Additional file 13: Figure S7).
We also used the Viterbi algorithm, an approximation of the Forward-Backward algorithm, to assign a state to each gene. To determine the accuracy of the Viterbi method, we compared its corresponding hidden maximum likelihood estimate (MLE) state assignment to that obtained by the forward-backward algorithm, which assigns states based on posterior probabilities, also known as a maximum a posteriori (MAP) estimate. These two algorithms were compared on the 615 gene chromosome 19 and produced similar hidden states. The Viterbi algorithm was used to assign gene state as it is more computationally efficient.
The number of hidden states was equal to the optimal number of clusters from our gap statistic results. For the mouse data considered in this paper, our mouse data, the optimal number was K = 3, where K is the number of clusters. However, the gap statistic for K = 2 was similar, so we compared both setups in our analysis.
where x i is the observed m-dimensional vector of histone modifications for gene i (i = 1, 2, n j ), n j is the number of genes in domain j, X i | H 0 ~N(μ 0 , Σ 0 ) and X i | HMM ~N(μ s , Σ s ). Note that μ 0 and Σ 0 are the m-dimensional mean vector and diagonal variance matrix of the entire dataset whereas μ s and Σ s are their individual state-based counterparts. To calculate λ j ( X ) for each domain, the μ s and Σ s of the corresponding maximum likelihood estimate state was used. Based on likelihood theory, λj(X) ~ χ2 df where the degrees of freedom (df) is the difference in parameters between models. Thus, df = 2*S*m-2*m, where S is the number of states in the model. To correct for multiple hypothesis testing, the significant domains were selected such that the FDR = 0.01.
We simulated histone modification data corresponding to prescribed domain configurations and assessed the accuracy of our model. We assumed that the histone modification data for each gene was normally distributed with the mean and variances estimated from the real data (Tables 3 &4). To further explore the model robustness against data noise, we also repeated the above simulations with variances equal to twice the observed variances. We applied our HMM inference procedure to the simulated data. The accuracy of our model was quantified as the percentage of correctly assigned states.
To test for functional coherence, we examined the accordance of GO memberships within a domain via a new statistic for each significant domain. For domain j, we calculated the level of GO accordance, Y j , with Fleiss' Kappa , which is a generalization of the standard kappa for more than two raters (or in our case, ontologies). We then calculated the average accordance for these domains, and compared it to a distribution made under the null hypothesis (i.e., the hypothesis that domains are independent of GO memberships).
For the null distribution, we took into account the fact that neighboring genes often share GO annotations. To this end, we selected 1000*n random sections of the genome (of corresponding equal length in terms of number of genes to our significant domains), calculated their Y j 's and averaged them to get a null distribution for our average accordance. The p-values were evaluated as the proportion of permuted means that are larger than the observed mean accordance. Thus the minimum possible p-value was 0.001.
To account for variability in gene activity, expression data in the ES cell line was normalized across data from 23 cell lines [10, 25] to get a Z-statistic for each gene. We then calculated the within-domain variance for each significant domain and compared the average of these values to that of a random permutation.
To determine the significance of our within domain variance, we randomly selected 1000*n sections of the genome (each with an equal number of genes as its corresponding significant domains) and calculated the average within domain variances.
The histone modification data in NP cells were obtained from a published dataset [10, 25]. The ChIP-seq data in NP cells were normalized against those in ES cells by a negative-binomial regression as recommended . We assumed that the model parameters in the NP cells are identical to those in the ES cells, and inferred the hidden states in NP cells by using the Viterbi algorithm again. We determined domain changes by comparing the results in NP and ES cells. A domain is called changed if at least one gene changed state between the two cell lines.
hidden Markov model
maximum a posteriori
maximum likelihood estimate
We would like to thank Drs. Yi Li, John Quackenbush, Curtis Huttenhower and Jason Ernst for helpful discussions.
Funding: This research was supported by an NIH training grant (T32 GM074897) awarded to J. Larson and a Claudia Adams Barr Award (to GY).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.