Volume 14 Supplement 5
Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2013)
Discovering and mapping chromatin states using a tree hidden Markov model
 Jacob Biesinger†^{1, 3},
 Yuanfeng Wang†^{2} and
 Xiaohui Xie^{1, 3}Email author
DOI: 10.1186/1471210514S5S4
© Biesinger et al.; licensee BioMed Central Ltd. 2013
Published: 10 April 2013
Abstract
New biological techniques and technological advances in highthroughput sequencing are paving the way for systematic, comprehensive annotation of many genomes, allowing differences between cell types or between disease/normal tissues to be determined with unprecedented breadth. Epigenetic modifications have been shown to exhibit rich diversity between cell types, correlate tightly with celltype specific gene expression, and changes in epigenetic modifications have been implicated in several diseases. Previous attempts to understand chromatin state have focused on identifying combinations of epigenetic modification, but in cases of multiple cell types, have not considered the lineage of the cells in question.
We present a Bayesian network that uses epigenetic modifications to simultaneously model 1) chromatin mark combinations that give rise to different chromatin states and 2) propensities for transitions between chromatin states through differentiation or disease progression. We apply our model to a recent dataset of histone modifications, covering nine human cell types with nine epigenetic modifications measured for each. Since exact inference in this model is intractable for all the scale of the datasets, we develop several variational approximations and explore their accuracy. Our method exhibits several desirable features including improved accuracy of inferring chromatin states, improved handling of missing data, and linear scaling with dataset size. The source code for our model is available at http://github.com/ucicbcl/treehmm
Background
Although identical DNA is shared amongst most cells in an organism, a key question in biology relates to how different cell types are formed, maintained, and made to perform vastly different functions. Recent studies have shown that these processes are in part mediated by the posttranslational modifications of histone tails, which in turn affect chromatin accessibility and other properties of chromatin structures in a celltype specific way [1]. There are also interactions between these modifications [2, 3], which act combinatorially to exert dynamic control over gene expression and other fundamental cellular processes [4]. Although we do not fully understand the role of epigenetic modifications, their effect in the development of disease and in defining cell type is becoming clearer. For example, epigenetic changes have been shown to be tightly correlated with gene expression [5–7], have been linked to metastasis development in certain types of cancer [8] and are shown to control recombination [9]. Epigenetic inheritance across cells and across individuals has been highlighted in recent research (see [10] for a review) and our understanding of the scope of epigenetic modifications has expanded considerably in recent years.
Chromatin immunoprecipitation coupled with highthroughput sequencing (ChIPseq) has emerged as a costeffective method for determining epigenetic modifications. Although initially used as a highresolution transcription factor binding site discovery mechanism (see [11, 12] for review), ChIPseq has recently been used to target histone tail modifications and is proving to be particularly costeffective method for epigenomic annotation. Thanks to the ENCODE project [6], hundreds of ChIPseq datasets are now publicly available and the process of integrating speciesspecific and celltype specific binding site information, gene expression, and chromatin state is now underway. These highthroughput datasets provide an unbiased, comprehensive view of the function of different genomic regions.
Several computational approaches have been used to tackle the important problem of genome annotation using these highthroughput datasets. In particular, methods that integrate histone modification data can be segregated into two general approaches: one approach searches near known genomic annotations to identify characteristic marks of particular classes of regions, such as promoters and enhancers, and subsequently uses the learned characteristics to find new instances of the class [13–15]. The other approach learns the characteristic patterns of histone marks de novo using unsupervised methods, "rediscovering" and predicting genomic features associated with mark combinations. Methods for identifying these patterns have included clustering [16, 17], a dynamic Bayesian network [18], and hidden Markov models (HMM) [6, 19, 20]. These methods differ mostly in how they model the chromatin mark signal intensity. Some determine a characteristic signal shape while others focus on modeling the mark signal using nonparametric histograms, multivariate normal distributions, or binary presence and mark cooccurrence. Each of these methods focuses on modeling the histone mark combinations; none explicitly incorporate the lineage information by which the data are related.
Here, we expand the HMM methodology of Ernst et al. [21] (called ChromHMM), who originally analyzed nine transcription factors (TF) or histone modifications (plus control) performed in nine different human cell types. Their multivariate HMM model concatenated several cell types to form a single chain with the goal of learning a global set of histone mark combinations and left as secondary all comparative analysis between cell types. We generalize the model to more closely reflect biological reality: chromatin remodeling occurs as cells progress through several stages of differentiation. We expect many genomic regions to be correlated across a lineage since cell types diverged from a common progenitor are likely to share the chromatin changes that took place in that progenitor. To capture this reality, we simultaneously model both the genomic localization of histone marks and the chromatin dynamics along a lineage by explicitly aligning each cell type and connecting their internal, hidden nodes vertically in a tree structure. Our model learns both histone modifications' association with chromatin state and state transitions between cell types, capturing epigenetic changes that occur through differentiation or disease progression. Our method effectively pools information across species, and we expect it to show improved accuracy of genome segmentation over the previous HMM approach which does not incorporate cell lineage information.
Methods
Tree hidden Markov model
Model description and notation
We propose a tree hidden Markov model (TreeHMM) to discover and map chromatin states using the observed chromatin modification data. We begin by introducing some notation. We denote the chromatin modification of type l at position t of cell type i as ${x}_{t,l}^{i}$, which can take binary values, i.e.${x}_{t.l}^{i}\in \left\{0,1\right\}$. Subsequently we denote all the histone marks at position (i, t) to be ${x}_{t}^{i}=\left({x}_{t,1}^{i},.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},{x}_{t,L}^{i}\right)$, which is a vector of length L and $X=\left\{{x}_{t}^{i}:\phantom{\rule{2.77695pt}{0ex}}i=1,\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},I;t=1,.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},T\right\}$ to be the collection of all observed data. We further introduce a hidden variable ${z}_{t}^{i}$ to denote the underlying chromatin state at chromosomal position t of cell type i. We assume ${z}_{t}^{i}$'s are discrete taking K possible values, i.e., ${z}_{t}^{i}\in \left\{1,.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},K\right\}$ for all t and i. Let $Z=\left\{{z}_{t}^{i}:\phantom{\rule{2.77695pt}{0ex}}i=1,\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},I;t=1,.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}}.\phantom{\rule{2.77695pt}{0ex}},T\right\}$ denote the collection of all hidden chromatin state variables. We assume that these chromatin state variables are the key determinant of the observed chromatin modifications, and that ${x}_{t}^{i}$'s are independent of each other conditioned on Z, i.e., $\mathbb{P}\left(X\phantom{\rule{0.3em}{0ex}}Z\right)={\prod}_{i=1}^{l}{\prod}_{t=1}^{T}\mathbb{P}\left({x}_{t}^{i}{z}_{t}^{i}\right)$.
where by definition ${z}_{t1}^{i}=\varnothing $ if t = 1 and ${z}_{t}^{\pi \left(i\right)}=\varnothing $ if node i is the root cell type. As a notation, we also use π (i, t) to denote the parent nodes of node (i, t) in the model, and use z_{π(i, t)}to denote the state variables at these parent nodes if they exist.
Parameters
The TreeHMM model presented above requires us to specify two sets of conditional distributions. One is the emission probabilities $P\left({x}_{t}^{i}{z}_{t}^{i}\right)$, that is, the probability of observing chromatin modification vector ${x}_{t}^{i}$ conditioned on chromatin state ${z}_{t}^{i}$. For simplicity, we assume different chromatin modification marks are independent of each other conditioned on the chromatin state, and use ${e}_{l}^{k}=\mathbb{P}\left({x}_{t,l}^{i}=1\phantom{\rule{0.3em}{0ex}}{z}_{t}^{i}=k\right)$ to denote the probability of observing mark l at position t of cell type i conditioned on the underlying state being k.
We will also use $\mathrm{\Theta}=\left\{{\theta}_{bc}^{a},\phantom{\rule{2.77695pt}{0ex}}{\alpha}_{b}^{a},\phantom{\rule{2.77695pt}{0ex}}{\beta}_{b}^{a},\phantom{\rule{2.77695pt}{0ex}}{\gamma}^{a},\phantom{\rule{2.77695pt}{0ex}}{e}_{l}^{a}\left(a,\phantom{\rule{2.77695pt}{0ex}}b,\phantom{\rule{2.77695pt}{0ex}}c\right)\in 1,\phantom{\rule{2.77695pt}{0ex}}\dots ,\phantom{\rule{2.77695pt}{0ex}}K;l\in 1,\phantom{\rule{2.77695pt}{0ex}}\dots ,\phantom{\rule{2.77695pt}{0ex}}L\right\}$ to denote the collection of all parameters associated with the model.
Inference and parameter learning
Note that in the above notation, we put Θ into the distributions to emphasize the dependency of the distributions on the parameters. However, we will also the simplified notation $\mathbb{P}\left(ZX\right)$ or $\mathbb{P}\left(X\right)$ when the context is clear. After finding the optimal parameters, we infer the underlying chromatin states using posterior inference, to calculating the posterior probability of each chromatin state conditioned on the observed data, $\mathbb{P}\left({z}_{t}^{i}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}X;\mathrm{\Theta}\right)$.
We explore various inference methods for the TreeHMM model, including exact methods and approximate methods. For exact inference, we provide two implementations: first, we generate a lattice for the Graphical Models Toolkit (GMTK) [22], which provides an efficient framework for exact inference and learning using the junction tree algorithm [23]. We also provide a custom library which implements a "cliqued" method in which each slice t of the model has all its nodes in that slice treated as if they were part of a single "cliqued" node that has K^{ I } states. In this cliqued node representation, we can apply standard HMM methodology to do inference and learning. The state space of the cliqued inference method grows exponentially with I, but we found it to be faster than the GMTK implementation for small trees. Both implementations gave the same results in our testing.
By Jensen's inequality, F is always lower bounded by the negative log likelihood function, i.e.$F\ge \text{log}\mathcal{L}\left(\mathrm{\Theta};X\right)$, with equality holding if and only if $\mathbb{Q}\left(Z\right)=\mathbb{P}\left(ZX\right)$. The goal of the variational inference is to find a distribution (usually under some approximate form) that minimizes the free energy function. We will consider three different forms of surrogate distributions and briefly describe variational inference for each of them. Details of the derivations are given in Additional file 1, section 1.3.
Mean field (MF) variational inference
where the expectation is with respect to , as will always be the case in the remainder of this paper.
The (j, s) nodes in the last term are all children of node (i, t), but the expectation involves all the parents of (j, s) except (i, t).
Structured mean field(SMF) variational inference
where we have defined ${f}_{it}\left({z}_{t}^{i},\phantom{\rule{2.77695pt}{0ex}}{z}_{t1}^{i}\right)=\text{exp}\left\{{E}_{{q}_{\pi \left(i\right)}}\left[\text{log}\mathbb{P}\left({z}_{t}^{i}{z}_{\pi \left(t,i\right)}\right)\right]+{\sum}_{j:i=\pi \left(j\right)}{E}_{{q}_{j}}\left[\text{log}\mathbb{P}\left({z}_{t}^{j}{z}_{\pi \left(j,t\right)}\right)\right]\right\}$. Since f_{ it } only involves expectations with respect to the distributions other than q_{ i }, it is a fixed function of ${z}_{t}^{i}$ and ${z}_{t1}^{i}$ during the update of the q_{ i }(z_{ i }). If the f_{ it } functions can be normalized to be conditional probability distributions, then Equation (8) shares the exact form of the free energy of a hidden Markov model with transmission probabilities specified by ${f}_{it}$ and emission probabilities specified by $\mathbb{P}\left({x}_{t}^{i}{z}_{t}^{i}\right)$. As such, the optimal q_{ i } minimizing the free energy is the same as the posterior probabilities of the states in the hidden Markov model, which can be efficiently calculated using the forwardbackward algorithm [25]. The details of how to normalize the f_{ it } functions to be proper transition probabilities are shown in Additional file 1, section 1.3.
Loopy belief propagation (LBP)
The third inference method we used is loopy belief propagation. Belief propagation is a message passing algorithm commonly used in probabilistic graphical models. The algorithm is exact for tree and polytree structured graphs. For general graphs that contain cycles or loops, it is an approximate algorithm also called loopy belief propagation. In this case, the algorithm is not guaranteed to converge nor is the approximate freeenergy a bound of the loglikelihood. Nevertheless, it has shown empirical success in some cases [26]. Loopy belief propagation can be also viewed as a variational method with the $\mathbb{Q}$ distribution taking the Bethe approximation form upon convergence [27]. Here we use Pearl's belief propagation algorithm which is directly applicable to the Bayesian network representation. We refer readers to [28] for the details of the algorithm.
Parameter learning
Above we have introduced different inference methods. To do parameter learning, we use a variant of the expectationmaximization (EM) algorithm called variational EM algorithm. Like the EM algorithm, the variational EM algorithm iterates between two steps: an expectation and a maximization step. The expectation step (Estep) is performed by the inference methods, during which we calculate $\mathbb{Q}\left(Z\right)$ in the approximate forms outlined above with fixed parameter values. In the maximization step (Mstep), we seek parameter values that minimize F (or maximize F) under $\mathbb{Q}\left(Z\right)$.
The above optimization can be solved explicitly. As a result, the state transition parameters are calculated as ${\theta}_{bc}^{a}\propto {\sum}_{i>1}{\sum}_{t>1}\mathbb{Q}\left({z}_{t}^{i}=a,\phantom{\rule{2.77695pt}{0ex}}{z}_{t}^{\pi \left(i\right)}=b,\phantom{\rule{2.77695pt}{0ex}}{z}_{t1}^{i}=c\right)$, ${\alpha}_{b}^{a}\propto {\sum}_{t>1}\mathbb{Q}\left({z}_{t}^{1}=a,\phantom{\rule{2.77695pt}{0ex}}{z}_{t1}^{1}=b\right)$, ${\beta}_{b}^{a}\propto {\sum}_{i>1}\mathbb{Q}\left({z}_{1}^{i}=a,{z}_{1}^{\pi \left(i\right)}=b\right)$, ${\gamma}^{a}\propto \mathbb{Q}\left({z}_{1}^{1}=a\right)$ up to a normalization constant, where $\mathbb{Q}\left(\cdot \right)$ denotes the marginal distribution of the variables inside the brackets. The emission parameters are given by ${e}_{l}^{a}=\frac{{\Sigma}_{i,t}\mathbb{Q}\left({z}_{t}^{i}\phantom{\rule{2.77695pt}{0ex}}=\phantom{\rule{2.77695pt}{0ex}}a\right)I\left({x}_{t,l}^{i}\phantom{\rule{2.77695pt}{0ex}}=\phantom{\rule{2.77695pt}{0ex}}1\right)}{{\Sigma}_{i,t}\mathbb{Q}\left({z}_{t}^{i}=a\right)}$ where I(·) is the indicator function. The variational EM algorithm for the SMF case is outlined in Algorithm 1. Notationally, we have considered the entire genome as a single chunk. In practice, we break up the genome into many smaller chunks to allow more efficient, parallel execution and to reduce memory consumption, at the cost of computational artifacts at chunk borders.
Data processing
As a preprocessing step, we create a histogram of mapped reads by dividing the genome into 200 bp nonoverlapping bins and counting the number of mapped reads whose middle base fell into each bin. All replicates, if any, were added to the histogram and the histogram was then binarized using a threshold corresponding to a Poisson pvalue of 10^{4}, similar to [21]. We further segmented the genome into regions with and without chromatin marks by applying a smoothing filter to the raw count data, retaining regions that contained mapped reads. Further data processing details can be found in Additional file 1, section 1.1, and all preprocessing methods are available as part of the released source code.
Our model's preprocessing and parameterization are very similar to the multivariate HMM methodology of [21], however Ernst's implementation suffered from a very slow runtime on our processed data, which contains many regions to facilitate parallel inference. We reimplemented the method as described [20] and use this implementation for comparison in later sections. The implementation is available in the released source code.
Results
We used the same human ENCODE dataset reported in [21] which contains ChIPseq profiles for nine human cell types including human embryonic stem cells (H1 ES), erythrocytic leukaemia cells (K562), Blymphoblastoid cells (GM12878), hepatocellular carcinoma cells (HepG2), umbilical vein endothelial cells (HUVEC), skeletal muscle myoblasts (HSMM), normal lung fibroblasts (NHLF), normal epidermal keratinocytes (NHEK), and mammary epithelial cells (HMEC). For each cell type, ten different markers are used including eight histone modifications (H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, and H3K9ac), one transcription factor closely related to chromatin dynamics (CTCF), and a control data set (whole cell extract). Altogether, the dataset contains 90 ChIPseq profiles, which were downloaded from the ENCODE website [29].
Comparing approximate inference methods
TreeHMM on the complete genome using the SMF approximation
To explore the celltype specificity of our learned states, we performed Kmeans clustering regions assigned to each state in any cell type. We show three of the states in Supplementary Figure 7 (Additional file 1), including the insulator regions (state 14), strong enhancer regions (state 5) and active promoter regions (state 3). We can see that the distribution of different states across cell types differs drastically. Almost half of all insulator sites (state 14) are shared amongst all nine cell types or are only missing in one or two cell types. Many of the remainder are specific to a single cell type. Likewise, some active promoter regions (state 3) are shared amongst all or most cell types, but many more of the promoter regions are celltype specific. Finally, the strong enhancer regions (state 5) are almost entirely celltype specific. These overall patterns of celltype specificity are captured by the learned transition matrices α and θ, which are shown in Supplemental Figures 3 and 4 (Additional file 1).
Several states are dominated by their vertical component in the θ transition matrix, including the states localizing to TSS's (states 2, 3, 8, 10, and 11), copy number variant/repeat regions (state 4), and the insulator state marked by CTCF (state 14). Other states have weak vertical components: consistent with the celltype specificity of enhancers and chromatin remodeling, three of the enhancer regions (states 1, 5 and 6) and the polycomb repressed regions (state 17) show little to no vertical correlation. In particular, enhancer state 1 does not show the vertical correlation that might be expected given its propensity for TSS regions (4.24 fold enrichment).
Comparison with ChromHMM
We compare our result with ChromHMM  a similar method based on hidden Markov model described in [21] that does not utilize lineage information. We ran the HMM on the same histone data, treating each cell type's segment as a separate chain with inference performed in parallel but with tied parameters. We set number of states to be the same as in the TreeHMM result for consistency.
The learned emission probability matrix from ChromHMM together with the confusion matrix between the assigned states of the two results is shown in Supplemental Figure 5 (right panel, Additional file 1). Comparing the emission matrix from two methods (Figure 4 and Supplemental Figure 5 (left panel, Additional file 1)), we observe similar cooccurrence patterns of markers. But as revealed by the confusion matrix, there is a substantial set of regions that are assigned different states due to the lineage constraint introduced in our model. For example, the weak promoter state (state 8) overlaps with ChromHMM's inactive promoter and enhancer states (2 and 8). Also ChromHMM exhibits two repetitive states (similar to [21]) while there is only one such state in the TreeHMM result. To assess the accuracy of our methods, we tested our predicted states' overlap with several human EScellspecific ChIPseq datasets.
H1ES ChIPseq enrichment in predicted promoter and enhancer regions.
Promoters  

Factor  TreeHMM  ChromHMM  
All  Unique  All  Unique  
Taf1  32,069 (41.6x)  1,489 (15.2x)  35,082 (26.0x)  4,502 (6.7x) 
Oct4  4,980 (23.8x)  231 (8.7x)  6,932 (19x)  2,183 (12x) 
Klf4  2,622 (18.1x)  105 (5.7x)  3,819 (15.1x)  1,302 (10.3x) 
p300  141 (1.0x)  16 (0.9x)  1,597 (6.4x)  1,472 (11.8x) 
Nanog  1,556 (1.5x)  227 (1.7x)  8,650 (4.7x)  7,321 (7.7x) 
Sox2  412 (1.6x)  63 (2.0x)  2,509 (5.7x)  2,160 (9.8x) 
Enhancers  
Factor  TreeHMM  ChromHMM  
All  Unique  All  Unique  
Taf1  8,095 (2.5x)  4,293 (4.4x)  5,611 (2.2x)  1,809 (5.3x) 
Oct4  3,914 (4.5x)  2,060 (7.8x)  2,274 (3.3x)  420 (4.5x) 
Klf4  2,143 (3.6x)  1,294 (7.1x)  1,003 (2.1x)  154 (2.4x) 
p300  7,253 (12.2x)  1,517 (8.4x)  5,861 (12.2x)  125 (2.0x) 
Nanog  39,829 (9.1x)  7,941 (6.0x)  33,561 (9.6x)  1,673 (3.5x) 
Sox2  9,786 (9.4x)  2,185 (6.9x)  7,952 (9.5x)  351 (3.1x) 
As shown in Table 1, Taf1 shows strong enrichment in the promoter regions annotated by both ChromHMM and TreeHMM methods (26 and 41.6 fold enrichment over background, respectively). Although the two methods identify a similar number of active promoters (136,702 for TreeHMM vs. 239,792 by ChromHMM), a larger fraction of TreeHMM's predicted promoter overlaps with Taf1 binding sites than ChromHMM (32,069 or 23.5% of sites predicted by TreeHMM vs. 35,082 or 18.5% of sites predicted by ChromHMM). The enhancer regions predicted by the two methods with similar fold enrichment (12.2 and 12.3 fold) in p300 ChIPseq binding peaks, but 24% more sites are correctly predicted by TreeHMM (7,253 vs. 5,861). An interesting observation is that Oct4 and Klf4 both show preference for promoter regions over enhancer regions and in these cases, ChromHMM captures more of the ChIPSeq binding sites but at the cost of calling many more total sites (23.8 vs. 19 fold enrichment of Oct4; 18.1 vs. 15.1 fold enrichment of Klf4). Distinctly, Nanog and Sox2 show a strong preference for enhancer regions. For these predictions, more ChIP binding sites (19% more for Nanog, 23% more for Sox2) are captured by TreeHMM at similar enrichment levels. These results indicate TreeHMM's lower false negative rate for enhancer regions and lower false positive rate for promoter regions.
We also investigated the recovery of active transcription start sites. We compared the predicted promoter regions (states 2, 3, and 8) with the ENCODE Capped Analysis Gene Expression (CAGE) data for H1ES and K562 cells. Supplemental Figure 3 (Additional file 1) shows TreeHMM's improved precision (59% better) and similar recall (2% worse to 0.5% better) in predicting active TSS regions.
Discussion
We have here presented a tree hidden Markov model for identifying chromatin state based on measurements from multiple cell types in a principled way. The major improvement over the previous HMM approach is the incorporation of cell lineage explicitly in the model. While previous methods have focused only on the marks present at a particular region in a particular cell type, we pool information across the same genomic location at different cell types. This allows increased discernment in regions of uncertainty. Although model learning in our proposed model is intractable except in the smallest cases, we developed several approximate methods and demonstrated their accuracy using the ENCODE histone modification data for nine different cell types. Interestingly, we found strong correlations along cell lineages and show that in many cases the information gained from lineage correlations increases state inference accuracy. Inherent to our method is the discovery of states that are more likely to change during differentiation or disease progression. This information allows more accurate prediction and allows accurate delineation between housekeeping genes present in all cell types and genes regulated in a lineagespecific fashion.
In this work, we have focused on developing approximate methods for doing inference and learning in the general framework. Our implementation is general and can deal with missing marks and missing species (discussed in Additional file 1, section 1.4). With the capabilities of the model, there can be many further improvements including incorporating more cell types with incomplete measurements, modifying the lineage tree to include hidden nodes, and incorporating heterogenous data beyond histone marks. By pooling information from similar cell types and learning combinations of marks, it should be possible to infer cell state without a full spectrum of histone modifications measurements. We plan on exploring the rapidly increasingly heterogenous datasets to gain further insight into role of chromatin modifications in determining epigenetic states and their relationship with disease phenotype. Another possible application of the framework is to look into crossspecies correlation of histone modification [33] to gain insight into interspecies conservation or divergence of epigenetic mechanisms.
Conclusions
Understanding epigenetic factors' associations with cell state is a primary step towards proper context for biological systems. Histone modifications play an essential role in regulating and maintaining gene expression and determining cell state. We have developed a novel graphical model for determining chromatin state from epigenetic modifications. Our method explicitly models transitions between cell types during differentiation or disease progression by considering cell lineage relationship. Although performing exact inference in our model is intractable, we develop highly accurate approximate inference methods that scale well with dataset size. By utilizing information from several cell types, our method can infer epigenetic state more accurately and has the ability to incorporate tendency of transitions between cell states in a more principled way. These crosscell type correlations may be especially useful in datasets where the complete battery of experiments have not been performed in all cell types.
Notes
List of abbreviations used
 MF:

Mean Field
 SMF:

Structured Mean Field
 LBP:

loopy belief propagation
 BIC:

Bayes Information Criterion
 CAGE:

Capped Analysis Gene Expression
Declarations
Acknowledgements
We thank Ali Mortazavi, Yifei Chen and members of Xie lab for helpful discussions. This work was supported by National Institutes of Health grant T15LM07443 and National Science Foundation grant DBI0846218.
Declarations
Publication of this article was supported by National Institutes of Health grant T15LM07443.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.
Authors’ Affiliations
References
 Bannister A, Kouzarides T: Regulation of chromatin by histone modifications. Cell research. 2011, 21 (3): 381395. 10.1038/cr.2011.22.PubMed CentralView ArticlePubMedGoogle Scholar
 Strahl BD, Allis CD: The language of covalent histone modifications. Nature. 2000, 403 (6765): 415. 10.1038/47412.View ArticlePubMedGoogle Scholar
 Schreiber SL, Bernstein BE: Signaling network model of chromatin. Cell. 2002, 111 (6): 7718. 10.1016/S00928674(02)011960.View ArticlePubMedGoogle Scholar
 Felsenfeld G, Groudine M: Controlling the double helix. Nature. 2003, 421 (6921): 448453. 10.1038/nature01411.View ArticlePubMedGoogle Scholar
 Xu X, Hoang S, Mayo MW, Bekiranov S: Application of machine learning methods to histone methylation ChIPSeq data reveals H4R3me2 globally represses gene expression. BMC bioinformatics. 2010, 11: 396PubMed CentralPubMedGoogle Scholar
 Khatun J: An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012, 489: 5774. 10.1038/nature11247.View ArticleGoogle Scholar
 Dong X, Greven M, Kundaje A, Djebali S, Brown J, Cheng C, Gingeras T, Gerstein M, Guigó R, Birney E: Modeling gene expression using chromatin features in various cellular contexts. Genome biology. 2012, 13 (9): R5310.1186/gb2012139r53.PubMed CentralView ArticlePubMedGoogle Scholar
 Ju HX, An B, Okamoto Y, Shinjo K, Kanemitsu Y, Komori K, Hirai T, Shimizu Y, Sano T, Sawaki A, Tajika M, Yamao K, Fujii M, Murakami H, Osada H, Ito H, Takeuchi I, Sekido Y, Kondo Y: Distinct profiles of epigenetic evolution between colorectal cancers with and without metastasis. The American journal of pathology. 2011, 178 (4): 183546. 10.1016/j.ajpath.2010.12.045.PubMed CentralView ArticlePubMedGoogle Scholar
 Bergman Y, Cedar H: Epigenetic control of recombination in the immune system. Seminars in immunology. 2010, 22 (6): 3239. 10.1016/j.smim.2010.07.003.PubMed CentralView ArticlePubMedGoogle Scholar
 Jablonka E, Raz G: Transgenerational epigenetic inheritance: prevalence, mechanisms, and implications for the study of heredity and evolution. The Quarterly review of biology. 2009, 84 (2): 131176. 10.1086/598822.View ArticlePubMedGoogle Scholar
 Park P: ChIPseq: advantages and challenges of a maturing technology. Nature Reviews Genetics. 2009, 10 (10): 669680. 10.1038/nrg2641.PubMed CentralView ArticlePubMedGoogle Scholar
 Pepke S, Wold B, Mortazavi A: Computation for ChIPseq and RNAseq studies. Nature methods. 2009, 6 (11 Suppl): S2232.PubMed CentralView ArticlePubMedGoogle Scholar
 Heintzman N, Stuart R, Hon G, Fu Y, Ching C, Hawkins R, Barrera L, Van Calcar S, Qu C, Ching K: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature genetics. 2007, 39 (3): 311318. 10.1038/ng1966.View ArticlePubMedGoogle Scholar
 Mitchell Guttman I, Garber M, French C, Lin M, Feldser D, Huarte M, Zuk O, Carey B, Cassady J, Cabili M: Chromatin signature reveals over a thousand highly conserved large noncoding RNAs in mammals. Nature. 2009, 458 (7235): 223227. 10.1038/nature07672.PubMed CentralView ArticlePubMedGoogle Scholar
 Hon G, Wang W, Ren B: Discovery and annotation of functional chromatin signatures in the human genome. PLoS computational biology. 2009, 5 (11): e100056610.1371/journal.pcbi.1000566.PubMed CentralView ArticlePubMedGoogle Scholar
 Ucar D, Hu Q, Tan K: Combinatorial chromatin modification patterns in the human genome revealed by subspace clustering. Nucleic acids research. 2011, 39 (10): 406375. 10.1093/nar/gkr016.PubMed CentralView ArticlePubMedGoogle Scholar
 Jaschek R, Tanay A: Spatial clustering of multivariate genomic and epigenomic information. Research in Computational Molecular Biology. 2009, Springer, 170183.View ArticleGoogle Scholar
 Hoffman M, Buske O, Wang J, Weng Z, Bilmes J, Noble W: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods. 2012, 9 (5): 473476. 10.1038/nmeth.1937.PubMed CentralView ArticlePubMedGoogle Scholar
 Xu H, Wei C, Lin F, Sung W: An HMM approach to genomewide identification of differential histone modification sites from ChIPseq data. Bioinformatics. 2008, 24 (20): 23442349. 10.1093/bioinformatics/btn402.View ArticlePubMedGoogle Scholar
 Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature biotechnology. 2010, 28 (8): 81725. 10.1038/nbt.1662.PubMed CentralView ArticlePubMedGoogle Scholar
 Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 439. 10.1038/nature09906.PubMed CentralView ArticlePubMedGoogle Scholar
 Bilmes J, Bartels C: On Triangulating Dynamic Graphical Models. Proceedings of the Nineteenth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI03). 2003, San Francisco, CA: Morgan Kaufmann, 4756.Google Scholar
 Dean T, Kanazawa K: Probabilistic temporal reasoning. Proceedings of the Nineteenth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI03). 1988, AAAIGoogle Scholar
 Wainwright M, Jordan M: Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning. 2008, 1 (12): 1305.Google Scholar
 Durbin R: Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998, Cambridge Univ PrView ArticleGoogle Scholar
 Murphy K, Weiss Y, Jordan M: Loopy belief propagation for approximate inference: An empirical study. Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. 1999, Morgan Kaufmann Publishers Inc., 467475.Google Scholar
 Yedidia J, Freeman W, Weiss Y: Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, Volume 8. 2003, 236239.Google Scholar
 Darwiche A: Modeling and reasoning with Bayesian networks, Volume 54. 2009, Cambridge University PressView ArticleGoogle Scholar
 ENCODE: 2012, [http://genome.ucsc.edu/ENCODE/]
 Bibikova M, Chudin E, Wu B, Zhou L, Garcia E, Liu Y, Shin S, Plaia T, Auerbach J, Arking D: Human embryonic stem cells have a unique epigenetic signature. Genome research. 2006, 16 (9): 10751083. 10.1101/gr.5319906.PubMed CentralView ArticlePubMedGoogle Scholar
 Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006, 125 (2): 31526. 10.1016/j.cell.2006.02.041.View ArticlePubMedGoogle Scholar
 Lister R, Pelizzola M, Dowen R, Hawkins R, Hon G, TontiFilippini J, Nery J, Lee L, Ye Z, Ngo Q: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315322. 10.1038/nature08514.PubMed CentralView ArticlePubMedGoogle Scholar
 KolasinskaZwierz P, Down T, Latorre I, Liu T, Liu X, Ahringer J: Differential chromatin marking of introns and expressed exons by H3K36me3. Nature genetics. 2009, 41 (3): 376381. 10.1038/ng.322.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.