Discovering and mapping chromatin states using a tree hidden Markov model

Biesinger, Jacob; Wang, Yuanfeng; Xie, Xiaohui

doi:10.1186/1471-2105-14-S5-S4

Volume 14 Supplement 5

Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013)

Proceedings
Open access
Published: 10 April 2013

Discovering and mapping chromatin states using a tree hidden Markov model

Jacob Biesinger^1,3,
Yuanfeng Wang² &
Xiaohui Xie^1,3

BMC Bioinformatics volume 14, Article number: S4 (2013) Cite this article

6117 Accesses
29 Citations
2 Altmetric
Metrics details

Abstract

New biological techniques and technological advances in high-throughput sequencing are paving the way for systematic, comprehensive annotation of many genomes, allowing differences between cell types or between disease/normal tissues to be determined with unprecedented breadth. Epigenetic modifications have been shown to exhibit rich diversity between cell types, correlate tightly with cell-type specific gene expression, and changes in epigenetic modifications have been implicated in several diseases. Previous attempts to understand chromatin state have focused on identifying combinations of epigenetic modification, but in cases of multiple cell types, have not considered the lineage of the cells in question.

We present a Bayesian network that uses epigenetic modifications to simultaneously model 1) chromatin mark combinations that give rise to different chromatin states and 2) propensities for transitions between chromatin states through differentiation or disease progression. We apply our model to a recent dataset of histone modifications, covering nine human cell types with nine epigenetic modifications measured for each. Since exact inference in this model is intractable for all the scale of the datasets, we develop several variational approximations and explore their accuracy. Our method exhibits several desirable features including improved accuracy of inferring chromatin states, improved handling of missing data, and linear scaling with dataset size. The source code for our model is available at http://github.com/uci-cbcl/tree-hmm

Background

Although identical DNA is shared amongst most cells in an organism, a key question in biology relates to how different cell types are formed, maintained, and made to perform vastly different functions. Recent studies have shown that these processes are in part mediated by the post-translational modifications of histone tails, which in turn affect chromatin accessibility and other properties of chromatin structures in a cell-type specific way [1]. There are also interactions between these modifications [2, 3], which act combinatorially to exert dynamic control over gene expression and other fundamental cellular processes [4]. Although we do not fully understand the role of epigenetic modifications, their effect in the development of disease and in defining cell type is becoming clearer. For example, epigenetic changes have been shown to be tightly correlated with gene expression [5–7], have been linked to metastasis development in certain types of cancer [8] and are shown to control recombination [9]. Epigenetic inheritance across cells and across individuals has been highlighted in recent research (see [10] for a review) and our understanding of the scope of epigenetic modifications has expanded considerably in recent years.

Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has emerged as a cost-effective method for determining epigenetic modifications. Although initially used as a high-resolution transcription factor binding site discovery mechanism (see [11, 12] for review), ChIP-seq has recently been used to target histone tail modifications and is proving to be particularly cost-effective method for epigenomic annotation. Thanks to the ENCODE project [6], hundreds of ChIP-seq datasets are now publicly available and the process of integrating species-specific and cell-type specific binding site information, gene expression, and chromatin state is now underway. These high-throughput datasets provide an unbiased, comprehensive view of the function of different genomic regions.

Several computational approaches have been used to tackle the important problem of genome annotation using these high-throughput datasets. In particular, methods that integrate histone modification data can be segregated into two general approaches: one approach searches near known genomic annotations to identify characteristic marks of particular classes of regions, such as promoters and enhancers, and subsequently uses the learned characteristics to find new instances of the class [13–15]. The other approach learns the characteristic patterns of histone marks de novo using unsupervised methods, "rediscovering" and predicting genomic features associated with mark combinations. Methods for identifying these patterns have included clustering [16, 17], a dynamic Bayesian network [18], and hidden Markov models (HMM) [6, 19, 20]. These methods differ mostly in how they model the chromatin mark signal intensity. Some determine a characteristic signal shape while others focus on modeling the mark signal using non-parametric histograms, multivariate normal distributions, or binary presence and mark co-occurrence. Each of these methods focuses on modeling the histone mark combinations; none explicitly incorporate the lineage information by which the data are related.

Here, we expand the HMM methodology of Ernst et al. [21] (called ChromHMM), who originally analyzed nine transcription factors (TF) or histone modifications (plus control) performed in nine different human cell types. Their multivariate HMM model concatenated several cell types to form a single chain with the goal of learning a global set of histone mark combinations and left as secondary all comparative analysis between cell types. We generalize the model to more closely reflect biological reality: chromatin remodeling occurs as cells progress through several stages of differentiation. We expect many genomic regions to be correlated across a lineage since cell types diverged from a common progenitor are likely to share the chromatin changes that took place in that progenitor. To capture this reality, we simultaneously model both the genomic localization of histone marks and the chromatin dynamics along a lineage by explicitly aligning each cell type and connecting their internal, hidden nodes vertically in a tree structure. Our model learns both histone modifications' association with chromatin state and state transitions between cell types, capturing epigenetic changes that occur through differentiation or disease progression. Our method effectively pools information across species, and we expect it to show improved accuracy of genome segmentation over the previous HMM approach which does not incorporate cell lineage information.

Methods

Tree hidden Markov model

Model description and notation

We propose a tree hidden Markov model (TreeHMM) to discover and map chromatin states using the observed chromatin modification data. We begin by introducing some notation. We denote the chromatin modification of type l at position t of cell type i as $x_{t, l}^{i}$ , which can take binary values, i.e. $x_{t . l}^{i} \in {0, 1}$ . Subsequently we denote all the histone marks at position (i, t) to be $x_{t}^{i} = (x_{t, 1}^{i}, . . ., x_{t, L}^{i})$ , which is a vector of length L and $X = {x_{t}^{i} : i = 1, . . ., I; t = 1, . . ., T}$ to be the collection of all observed data. We further introduce a hidden variable $z_{t}^{i}$ to denote the underlying chromatin state at chromosomal position t of cell type i. We assume $z_{t}^{i}$ 's are discrete taking K possible values, i.e., $z_{t}^{i} \in {1, . . ., K}$ for all t and i. Let $Z = {z_{t}^{i} : i = 1, . . ., I; t = 1, . . ., T}$ denote the collection of all hidden chromatin state variables. We assume that these chromatin state variables are the key determinant of the observed chromatin modifications, and that $x_{t}^{i}$ 's are independent of each other conditioned on Z, i.e., $ℙ (X | Z) = \prod_{i = 1}^{l} \prod_{t = 1}^{T} ℙ (x_{t}^{i} | z_{t}^{i})$ .

We assume the I cell types are related to each other through a lineage tree $T$ and use π(i) to denote the parent node of the cell type i within the lineage tree $T$ . The conditional dependencies among the variables are modelled by a Bayesian network as shown in Figure 1 with the chromatin state variables at neighboring positions of each cell type linked as a chain (referred to as horizontal connections) and the state variables of different cell types at the same chromosomal position connected according to the lineage tree $T$ (referred to as vertical connections). The horizontal connections capture the spatial correlation between chromatin states, i.e., the tendency of histone modifications to spread and cluster spatially across the genome, allowing for example large inactivated regions and short "poised" regions. The lineage relation is modelled by vertical connections between the same locations of different chains, and captures temporal changes in chromatin states during differentiation or disease progression over the cell lineage. Given the conditional dependency specification, the joint distribution of the chromatin state variables can then be written as

ℙ (Z) = \prod_{i = 1}^{I} \prod_{t = 1}^{T} ℙ (z_{t}^{i} | z_{t - 1}^{i}, z_{t}^{π (i)})

(1)

where by definition $z_{t - 1}^{i} = \emptyset$ if t = 1 and $z_{t}^{π (i)} = \emptyset$ if node i is the root cell type. As a notation, we also use π (i, t) to denote the parent nodes of node (i, t) in the model, and use z_{π(i, t)}to denote the state variables at these parent nodes if they exist.

Parameters

The TreeHMM model presented above requires us to specify two sets of conditional distributions. One is the emission probabilities $P (x_{t}^{i} | z_{t}^{i})$ , that is, the probability of observing chromatin modification vector $x_{t}^{i}$ conditioned on chromatin state $z_{t}^{i}$ . For simplicity, we assume different chromatin modification marks are independent of each other conditioned on the chromatin state, and use $e_{l}^{k} = ℙ (x_{t, l}^{i} = 1 | z_{t}^{i} = k)$ to denote the probability of observing mark l at position t of cell type i conditioned on the underlying state being k.

The second set of conditional probabilities we need to specify are the transition probabilities among chromatin states, that is, $ℙ (z_{t}^{i} | z_{t - 1}^{i}, z_{t}^{π (i)})$ . When t > 1 and π(i) is not empty, we will use a K × K × K matrix to specify $ℙ (z_{t}^{i} | z_{t - 1}^{i}, z_{t}^{π (i)})$ . However, when one of the conditioned variable is non-existent, we use K × K matrix to specify the transition probability. More specifically, the state transition probabilities are

\begin{matrix} θ_{b c}^{a} = ℙ (z_{t}^{i} = a | z_{t}^{π (i)} = b, z_{t - 1}^{i} = c) & t \neq 1, i is not root \\ α_{b}^{a} = ℙ (z_{t}^{i} = a | z_{t - 1}^{i} = b) & t \neq 1, i is root \\ β_{b}^{a} = ℙ (z_{t}^{i} = a | z_{t}^{π (i)} = b) & t = 1, i is not root \\ γ^{a} = ℙ (z_{t}^{i} = a) & t = 1, i is root . \end{matrix}

We will also use $Θ = {θ_{b c}^{a}, α_{b}^{a}, β_{b}^{a}, γ^{a}, e_{l}^{a} | (a, b, c) \in 1, \dots, K; l \in 1, \dots, L}$ to denote the collection of all parameters associated with the model.

Inference and parameter learning

Given the TreeHMM model described above and the set of observed chromatin modification data X, our goal is to: 1) estimate the parameters of the model, and 2) infer the underlying hidden state at each chromosomal location of each cell type. For parameter learning, we will use the maximum likelihood method, that is, we seek to find the optimal parameter set Θ* that maximizes the log likelihood function

log L (Θ; X) = log ℙ (X; Θ) = log \sum_{z} ℙ (Z; Θ) ℙ (X | Z; Θ)

(2)

Note that in the above notation, we put Θ into the distributions to emphasize the dependency of the distributions on the parameters. However, we will also the simplified notation $ℙ (Z | X)$ or $ℙ (X)$ when the context is clear. After finding the optimal parameters, we infer the underlying chromatin states using posterior inference, to calculating the posterior probability of each chromatin state conditioned on the observed data, $ℙ (z_{t}^{i} | X; Θ)$ .

We explore various inference methods for the TreeHMM model, including exact methods and approximate methods. For exact inference, we provide two implementations: first, we generate a lattice for the Graphical Models Toolkit (GMTK) [22], which provides an efficient framework for exact inference and learning using the junction tree algorithm [23]. We also provide a custom library which implements a "cliqued" method in which each slice t of the model has all its nodes in that slice treated as if they were part of a single "cliqued" node that has K^I states. In this cliqued node representation, we can apply standard HMM methodology to do inference and learning. The state space of the cliqued inference method grows exponentially with I, but we found it to be faster than the GMTK implementation for small trees. Both implementations gave the same results in our testing.

Since the TreeHMM model contains undirected cycles, exact inference methods such as junction tree and the "cliqued" method quickly become intractable in computational time and memory consumption when the number of nodes I or the number of inferred states K increases. Therefore, we introduce several approximate inference methods to solve the inference and learning problem presented above. We focus on variational methods since they are usually computationally efficient and scale well with size of the dataset [24]. The overall strategy of variational methods is to find an easier-to-handle surrogate distribution of the states $ℚ (Z)$ that can be used to approximate the true posterior distribution $ℙ (Z | X)$ This is done through the venue of the free energy function

F = - \sum_{Z} ℚ (Z) log \frac{ℙ (X, Z; Θ)}{ℚ (Z)} = E_{ℚ} [log ℚ (Z)] - E_{ℚ} [log ℙ (X, Z; Θ)]

(3)

By Jensen's inequality, F is always lower bounded by the negative log likelihood function, i.e. $F \geq - log L (Θ; X)$ , with equality holding if and only if $ℚ (Z) = ℙ (Z | X)$ . The goal of the variational inference is to find a distribution (usually under some approximate form) that minimizes the free energy function. We will consider three different forms of surrogate distributions and briefly describe variational inference for each of them. Details of the derivations are given in Additional file 1, section 1.3.

Mean field (MF) variational inference

In the mean field variational method, we consider the surrogate distribution to be the product of the marginal distributions of each individual state variable

ℚ (Z) = \prod_{i = 1}^{I} \prod_{t = 1}^{T} q (z_{t}^{i})

(4)

where $q (z_{t}^{i})$ represents the marginal distribution of $z_{t}^{i}$ . For notational simplicity, we also use q_it as an abbreviation of $q (z_{t}^{i})$ . In this case, the free energy becomes

F = \sum_{i = 1}^{I} \sum_{t = 1}^{T} E [log q (z_{t}^{i}) - log ℙ (x_{t}^{i} | z_{t}^{i})] - E [log ℙ (z_{t}^{i} | z_{π (i, t)})]

(5)

where the expectation is with respect to , as will always be the case in the remainder of this paper.

To find the optimal $ℚ$ that minimizes the free energy, we use a coordinate descent method - alternatively updating each component q_it while keeping all other components fixed. To update q_it we collect the terms in F that involve q_it,

F_{i t} = E [log q (z_{t}^{i}) - log ℙ (x_{t}^{i} | z_{t}^{i})] - E [log ℙ (z_{t}^{i} | z_{π (i, t)})] - \sum_{{(j, s) : (i, t) \in π (j, s)}} E [log ℙ (z_{s}^{j} | z_{π (j, s)})] .

The last term involves nodes that are children of (i, t). The update formula for q_it is thus given by $q (z_{t}^{i}) ~ exp {ϕ (z_{t}^{i})}$ , up to a normalizing constant, where

ϕ (z_{t}^{i}) = log ℙ (x_{t}^{i} | z_{t}^{i}) + E_{q_{π (i, t)}} [log ℙ (z_{t}^{i} | z_{π (i, t)})] + \sum_{{(j, s) : (i, t) \in π (j, s)}} E_{q_{j s}, q_{π (j, s) \ (i, t)}} [log ℙ (z_{s}^{j} | z_{π (j, s)})] .

The (j, s) nodes in the last term are all children of node (i, t), but the expectation involves all the parents of (j, s) except (i, t).

Structured mean field(SMF) variational inference

In the structured mean field variational method, we consider the surrogate distribution to be the product of the marginal distributions of disjoint sets of state variables. Let $z_{i} = {z_{t}^{i} : t = 1, \dots, T}$ denote the set of all state variables within cell type i, corresponding to the state variables within each horizontal chain of the TreeHMM model. We consider the $ℚ$ to be of the following form

ℚ (Z) = \prod_{i = 1}^{I} q_{i} (z_{i}),

(6)

written as the product of marginal distributions of z_i variables. In this case, the free energy becomes

F = \sum_{i = 1}^{I} [E [log q_{i} (z_{i})] - \sum_{t = 1}^{T} (E [log ℙ (z_{t}^{i} | z_{π (i, t)})] + E [log ℙ (x_{t}^{i} | z_{t}^{i})])] .

(7)

To find the optimal distribution $ℚ$ that minimizes the free energy, we again alternatively optimize each marginal distribution component while keeping others fixed. To update q_i(z_i), we collect the terms in F that involve q_i(z_i),

F_{i} = E_{q_{i}} [log q_{i} (z_{i}) - \sum_{t = 1}^{T} (log f_{i t} (z_{t}^{i}, z_{t - 1}^{i}) + log ℙ (x_{t}^{i} | z_{t}^{i}))],

(8)

where we have defined $f_{i t} (z_{t}^{i}, z_{t - 1}^{i}) = exp {E_{q_{π (i)}} [log ℙ (z_{t}^{i} | z_{π (t, i)})] + \sum_{j : i = π (j)} E_{q_{j}} [log ℙ (z_{t}^{j} | z_{π (j, t)})]}$ . Since f_it only involves expectations with respect to the distributions other than q_i, it is a fixed function of $z_{t}^{i}$ and $z_{t - 1}^{i}$ during the update of the q_i(z_i). If the f_it functions can be normalized to be conditional probability distributions, then Equation (8) shares the exact form of the free energy of a hidden Markov model with transmission probabilities specified by $f_{i t}$ and emission probabilities specified by $ℙ (x_{t}^{i} | z_{t}^{i})$ . As such, the optimal q_i minimizing the free energy is the same as the posterior probabilities of the states in the hidden Markov model, which can be efficiently calculated using the forward-backward algorithm [25]. The details of how to normalize the f_it functions to be proper transition probabilities are shown in Additional file 1, section 1.3.

Loopy belief propagation (LBP)

The third inference method we used is loopy belief propagation. Belief propagation is a message passing algorithm commonly used in probabilistic graphical models. The algorithm is exact for tree and poly-tree structured graphs. For general graphs that contain cycles or loops, it is an approximate algorithm also called loopy belief propagation. In this case, the algorithm is not guaranteed to converge nor is the approximate free-energy a bound of the log-likelihood. Nevertheless, it has shown empirical success in some cases [26]. Loopy belief propagation can be also viewed as a variational method with the $ℚ$ distribution taking the Bethe approximation form upon convergence [27]. Here we use Pearl's belief propagation algorithm which is directly applicable to the Bayesian network representation. We refer readers to [28] for the details of the algorithm.

Parameter learning

Above we have introduced different inference methods. To do parameter learning, we use a variant of the expectation-maximization (EM) algorithm called variational EM algorithm. Like the EM algorithm, the variational EM algorithm iterates between two steps: an expectation and a maximization step. The expectation step (E-step) is performed by the inference methods, during which we calculate $ℚ (Z)$ in the approximate forms outlined above with fixed parameter values. In the maximization step (M-step), we seek parameter values that minimize F (or maximize -F) under $ℚ (Z)$ .

Consider the free energy F as a function of Θ, the variational maximization step seeks the parameters that minimize F given the current hidden variable distribution $ℚ (Z)$ , i.e.

\hat{Θ} = arg min_{Θ} F (Θ, ℚ (Z)) .

The above optimization can be solved explicitly. As a result, the state transition parameters are calculated as $θ_{b c}^{a} \propto \sum_{i > 1} \sum_{t > 1} ℚ (z_{t}^{i} = a, z_{t}^{π (i)} = b, z_{t - 1}^{i} = c)$ , $α_{b}^{a} \propto \sum_{t > 1} ℚ (z_{t}^{1} = a, z_{t - 1}^{1} = b)$ , $β_{b}^{a} \propto \sum_{i > 1} ℚ (z_{1}^{i} = a, z_{1}^{π (i)} = b)$ , $γ^{a} \propto ℚ (z_{1}^{1} = a)$ up to a normalization constant, where $ℚ (\cdot)$ denotes the marginal distribution of the variables inside the brackets. The emission parameters are given by $e_{l}^{a} = \frac{Σ_{i, t} ℚ (z_{t}^{i} = a) I (x_{t, l}^{i} = 1)}{Σ_{i, t} ℚ (z_{t}^{i} = a)}$ where I(·) is the indicator function. The variational EM algorithm for the SMF case is outlined in Algorithm 1. Notationally, we have considered the entire genome as a single chunk. In practice, we break up the genome into many smaller chunks to allow more efficient, parallel execution and to reduce memory consumption, at the cost of computational artifacts at chunk borders.

Data processing

As a preprocessing step, we create a histogram of mapped reads by dividing the genome into 200 bp non-overlapping bins and counting the number of mapped reads whose middle base fell into each bin. All replicates, if any, were added to the histogram and the histogram was then binarized using a threshold corresponding to a Poisson p-value of 10^-4, similar to [21]. We further segmented the genome into regions with and without chromatin marks by applying a smoothing filter to the raw count data, retaining regions that contained mapped reads. Further data processing details can be found in Additional file 1, section 1.1, and all preprocessing methods are available as part of the released source code.

Our model's preprocessing and parameterization are very similar to the multivariate HMM methodology of [21], however Ernst's implementation suffered from a very slow runtime on our processed data, which contains many regions to facilitate parallel inference. We re-implemented the method as described [20] and use this implementation for comparison in later sections. The implementation is available in the released source code.

Results

We used the same human ENCODE dataset reported in [21] which contains ChIP-seq profiles for nine human cell types including human embryonic stem cells (H1 ES), erythrocytic leukaemia cells (K562), B-lymphoblastoid cells (GM12878), hepatocellular carcinoma cells (HepG2), umbilical vein endothelial cells (HUVEC), skeletal muscle myoblasts (HSMM), normal lung fibroblasts (NHLF), normal epidermal ker-atinocytes (NHEK), and mammary epithelial cells (HMEC). For each cell type, ten different markers are used including eight histone modifications (H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, and H3K9ac), one transcription factor closely related to chromatin dynamics (CTCF), and a control data set (whole cell extract). Altogether, the dataset contains 90 ChIP-seq profiles, which were downloaded from the ENCODE website [29].

Since the cell types in the ENCODE data represent very diverse, distinct cell types, we used a simple lineage tree structure with the H1 ES cell type forming the tree root and all other cell types connecting to it directly as leaves. ES cells exhibit unique epigentic biology [30], however hierarchical clustering of the observed marks reveals that each mark exhibits substantial correlation between all cell types (see Supplemental Figure 2 (Additional file 1)). Further, TreeHMM can incorporate information from marks that are only available in certain cell types and can be adapted to more interesting tree structures by including additional latent cell types. Although the current choice of tree structure may be an oversimplification of the underlying biology, we are mostly focusing on the methodology for approximate inference in TreeHMM; we explored the performance on artificial data with more interesting tree structures in Additional file 1, section 1.5. Finally, we note that while exact inference methods scale exponentially in the tree width, the approximate inference methods developed here scale linearly with I, allowing deeper lineages and more complex tree structures to be examined eventually.

Comparing approximate inference methods

To determine the accuracy of our approximate inference methods, we apply the TreeHMM model to the human ENCODE dataset described above using the following scheme: Exact inference and learning are used to define a set of parameters at each iteration. Each of the approximate inference methods performs inference on the parameters' values to get the free energy. We apply this procedure on a randomly selected 2 MB region with 3 cell types (H1 ES, K562, GM12878) using K = 5. Figure 3 shows the log likelihood of the exact inference and the corresponding free energy of different inference methods during exact EM iterations. We observe that the SMF approximation gives the highest negative free energy in this test dataset. The closeness between SMF free energy and the exact log likelihood indicates that the SMF method captures the majority of correlation between variables. Notably, the free energy curves of MF approximation and LBP fluctuate widely as the parameters are refined by the exact algorithm, indicating inconsistency in the free energy landscapes of these approximations and the true one. We also experiment with parameter recovery in several artificial datasets with different tree structures (Additional file 1, section 1.5), and observe that SMF typically outperforms the other approximate methods. As SMF seems to be the most accurate approximation in both the artificial and real data cases, we proceed with the SMF approximation in the following real data genomic segmentation and prediction problems.

TreeHMM on the complete genome using the SMF approximation

We next apply the TreeHMM model's SMF approximation to the complete genomic histone data. We use the Bayes Information Criterion, a complexity-penalized likelihood, to determine the optimal number of states K = 18 (see Additional file 1, section 1.6). After running several random initializations of the EM algorithm to convergence, we report the one with highest final likelihood. Figure 4 shows the learned states' characteristic chromatin modification co-occurrence patterns (the emission matrix e) and their enrichment in different genomic regions. Although states are learned de novo based only on the chromatin markers, many marker co-occurrences correspond to previous biological observations (e.g. H3K4 di- and tri-methylation in promoter regions and H3K4 mono- and di-methylation in enhancer regions [31]). We have annotated the likely function of each state (Figure 4) based on its genomic localization and concordance with previously reported findings [21]. The states show distinct enrichment patterns in different genomic locations. Several of the states (3, 8, and 11) are strongly enriched (8-17 fold) in the ± 2 kb TSS region. Other states (7, 13, and 15) are enriched (2-3 fold) in coding genes. The coverage of each chromatin state region also varies widely, as shown in Supplementary Table 2 (Additional file 1). The promoter and enhancer states cover a relatively small portion of the genome, e.g. ~ 1.1% for both active promoter and strong enhancer regions while low signal regions combine for around 75% of the genome. The state distribution also shows some cell-type specific properties, e.g., enhancer states 5, 10 and 11 are largely depleted in H1 ES cells, while other enhancer states are not (one being 2 fold enriched), indicating different functional roles of the learned enhancer states.

To explore the cell-type specificity of our learned states, we performed K-means clustering regions assigned to each state in any cell type. We show three of the states in Supplementary Figure 7 (Additional file 1), including the insulator regions (state 14), strong enhancer regions (state 5) and active promoter regions (state 3). We can see that the distribution of different states across cell types differs drastically. Almost half of all insulator sites (state 14) are shared amongst all nine cell types or are only missing in one or two cell types. Many of the remainder are specific to a single cell type. Likewise, some active promoter regions (state 3) are shared amongst all or most cell types, but many more of the promoter regions are cell-type specific. Finally, the strong enhancer regions (state 5) are almost entirely cell-type specific. These overall patterns of cell-type specificity are captured by the learned transition matrices α and θ, which are shown in Supplemental Figures 3 and 4 (Additional file 1).

Several states are dominated by their vertical component in the θ transition matrix, including the states localizing to TSS's (states 2, 3, 8, 10, and 11), copy number variant/repeat regions (state 4), and the insulator state marked by CTCF (state 14). Other states have weak vertical components: consistent with the cell-type specificity of enhancers and chromatin remodeling, three of the enhancer regions (states 1, 5 and 6) and the polycomb repressed regions (state 17) show little to no vertical correlation. In particular, enhancer state 1 does not show the vertical correlation that might be expected given its propensity for TSS regions (4.24 fold enrichment).

Comparison with ChromHMM

We compare our result with ChromHMM - a similar method based on hidden Markov model described in [21] that does not utilize lineage information. We ran the HMM on the same histone data, treating each cell type's segment as a separate chain with inference performed in parallel but with tied parameters. We set number of states to be the same as in the TreeHMM result for consistency.

The learned emission probability matrix from ChromHMM together with the confusion matrix between the assigned states of the two results is shown in Supplemental Figure 5 (right panel, Additional file 1). Comparing the emission matrix from two methods (Figure 4 and Supplemental Figure 5 (left panel, Additional file 1)), we observe similar co-occurrence patterns of markers. But as revealed by the confusion matrix, there is a substantial set of regions that are assigned different states due to the lineage constraint introduced in our model. For example, the weak promoter state (state 8) overlaps with ChromHMM's inactive promoter and enhancer states (2 and 8). Also ChromHMM exhibits two repetitive states (similar to [21]) while there is only one such state in the TreeHMM result. To assess the accuracy of our methods, we tested our predicted states' overlap with several human ES-cell-specific ChIP-seq datasets.

We use a recent series of ChIP-seq datasets of transcription factor binding in H1-ES cells [32] including Taf1, p300, Nanog, Klf4, Oct4, and Sox2. Among those, Taf1 is part of the machinery that recruits Polymerase II to the transcription start site and we expect its enrichment in promoter regions. p300 is a transcription factor (TF) that interacts with many other TFs in enhancer regions and we expect its presence in predicted enhancer regions. The other TFs in this dataset are important in maintaining stem-cell state, but a preference for promoter vs. enhancer has not been established. We investigated the overlap of ChIP-seq peaks in these datasets with our predicted states. For each method, we pooled the "enhancer" states (states 1, 5, 6, 10 and 11 in both methods) and report the fraction of sites overlapping called peaks for each transcription factor in Table 1. Similar results are reported for "promoter" regions (states 2, 3 and 8 in both methods).

Table 1 H1-ES ChIP-seq enrichment in predicted promoter and enhancer regions.

Full size table

As shown in Table 1, Taf1 shows strong enrichment in the promoter regions annotated by both ChromHMM and TreeHMM methods (26 and 41.6 fold enrichment over background, respectively). Although the two methods identify a similar number of active promoters (136,702 for TreeHMM vs. 239,792 by ChromHMM), a larger fraction of TreeHMM's predicted promoter overlaps with Taf1 binding sites than ChromHMM (32,069 or 23.5% of sites predicted by TreeHMM vs. 35,082 or 18.5% of sites predicted by ChromHMM). The enhancer regions predicted by the two methods with similar fold enrichment (12.2 and 12.3 fold) in p300 ChIP-seq binding peaks, but 24% more sites are correctly predicted by TreeHMM (7,253 vs. 5,861). An interesting observation is that Oct4 and Klf4 both show preference for promoter regions over enhancer regions and in these cases, ChromHMM captures more of the ChIP-Seq binding sites but at the cost of calling many more total sites (23.8 vs. 19 fold enrichment of Oct4; 18.1 vs. 15.1 fold enrichment of Klf4). Distinctly, Nanog and Sox2 show a strong preference for enhancer regions. For these predictions, more ChIP binding sites (19% more for Nanog, 23% more for Sox2) are captured by TreeHMM at similar enrichment levels. These results indicate TreeHMM's lower false negative rate for enhancer regions and lower false positive rate for promoter regions.

We also investigated the recovery of active transcription start sites. We compared the predicted promoter regions (states 2, 3, and 8) with the ENCODE Capped Analysis Gene Expression (CAGE) data for H1-ES and K562 cells. Supplemental Figure 3 (Additional file 1) shows TreeHMM's improved precision (5-9% better) and similar recall (2% worse to 0.5% better) in predicting active TSS regions.

Discussion

We have here presented a tree hidden Markov model for identifying chromatin state based on measurements from multiple cell types in a principled way. The major improvement over the previous HMM approach is the incorporation of cell lineage explicitly in the model. While previous methods have focused only on the marks present at a particular region in a particular cell type, we pool information across the same genomic location at different cell types. This allows increased discernment in regions of uncertainty. Although model learning in our proposed model is intractable except in the smallest cases, we developed several approximate methods and demonstrated their accuracy using the ENCODE histone modification data for nine different cell types. Interestingly, we found strong correlations along cell lineages and show that in many cases the information gained from lineage correlations increases state inference accuracy. Inherent to our method is the discovery of states that are more likely to change during differentiation or disease progression. This information allows more accurate prediction and allows accurate delineation between housekeeping genes present in all cell types and genes regulated in a lineage-specific fashion.

In this work, we have focused on developing approximate methods for doing inference and learning in the general framework. Our implementation is general and can deal with missing marks and missing species (discussed in Additional file 1, section 1.4). With the capabilities of the model, there can be many further improvements including incorporating more cell types with incomplete measurements, modifying the lineage tree to include hidden nodes, and incorporating heterogenous data beyond histone marks. By pooling information from similar cell types and learning combinations of marks, it should be possible to infer cell state without a full spectrum of histone modifications measurements. We plan on exploring the rapidly increasingly heterogenous datasets to gain further insight into role of chromatin modifications in determining epigenetic states and their relationship with disease phenotype. Another possible application of the framework is to look into cross-species correlation of histone modification [33] to gain insight into inter-species conservation or divergence of epigenetic mechanisms.

Conclusions

Understanding epigenetic factors' associations with cell state is a primary step towards proper context for biological systems. Histone modifications play an essential role in regulating and maintaining gene expression and determining cell state. We have developed a novel graphical model for determining chromatin state from epigenetic modifications. Our method explicitly models transitions between cell types during differentiation or disease progression by considering cell lineage relationship. Although performing exact inference in our model is intractable, we develop highly accurate approximate inference methods that scale well with dataset size. By utilizing information from several cell types, our method can infer epigenetic state more accurately and has the ability to incorporate tendency of transitions between cell states in a more principled way. These cross-cell type correlations may be especially useful in datasets where the complete battery of experiments have not been performed in all cell types.

Abbreviations

MF:: Mean Field
SMF:: Structured Mean Field
LBP:: loopy belief propagation
BIC:: Bayes Information Criterion
CAGE:: Capped Analysis Gene Expression

References

Bannister A, Kouzarides T: Regulation of chromatin by histone modifications. Cell research. 2011, 21 (3): 381-395. 10.1038/cr.2011.22.
Article PubMed Central CAS PubMed Google Scholar
Strahl BD, Allis CD: The language of covalent histone modifications. Nature. 2000, 403 (6765): 41-5. 10.1038/47412.
Article CAS PubMed Google Scholar
Schreiber SL, Bernstein BE: Signaling network model of chromatin. Cell. 2002, 111 (6): 771-8. 10.1016/S0092-8674(02)01196-0.
Article CAS PubMed Google Scholar
Felsenfeld G, Groudine M: Controlling the double helix. Nature. 2003, 421 (6921): 448-453. 10.1038/nature01411.
Article PubMed Google Scholar
Xu X, Hoang S, Mayo MW, Bekiranov S: Application of machine learning methods to histone methylation ChIP-Seq data reveals H4R3me2 globally represses gene expression. BMC bioinformatics. 2010, 11: 396-
PubMed Central PubMed Google Scholar
Khatun J: An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.
Article Google Scholar
Dong X, Greven M, Kundaje A, Djebali S, Brown J, Cheng C, Gingeras T, Gerstein M, Guigó R, Birney E: Modeling gene expression using chromatin features in various cellular contexts. Genome biology. 2012, 13 (9): R53-10.1186/gb-2012-13-9-r53.
Article PubMed Central CAS PubMed Google Scholar
Ju HX, An B, Okamoto Y, Shinjo K, Kanemitsu Y, Komori K, Hirai T, Shimizu Y, Sano T, Sawaki A, Tajika M, Yamao K, Fujii M, Murakami H, Osada H, Ito H, Takeuchi I, Sekido Y, Kondo Y: Distinct profiles of epigenetic evolution between colorectal cancers with and without metastasis. The American journal of pathology. 2011, 178 (4): 1835-46. 10.1016/j.ajpath.2010.12.045.
Article PubMed Central CAS PubMed Google Scholar
Bergman Y, Cedar H: Epigenetic control of recombination in the immune system. Seminars in immunology. 2010, 22 (6): 323-9. 10.1016/j.smim.2010.07.003.
Article PubMed Central CAS PubMed Google Scholar
Jablonka E, Raz G: Transgenerational epigenetic inheritance: prevalence, mechanisms, and implications for the study of heredity and evolution. The Quarterly review of biology. 2009, 84 (2): 131-176. 10.1086/598822.
Article PubMed Google Scholar
Park P: ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics. 2009, 10 (10): 669-680. 10.1038/nrg2641.
Article PubMed Central CAS PubMed Google Scholar
Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nature methods. 2009, 6 (11 Suppl): S22-32.
Article PubMed Central CAS PubMed Google Scholar
Heintzman N, Stuart R, Hon G, Fu Y, Ching C, Hawkins R, Barrera L, Van Calcar S, Qu C, Ching K: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature genetics. 2007, 39 (3): 311-318. 10.1038/ng1966.
Article CAS PubMed Google Scholar
Mitchell Guttman I, Garber M, French C, Lin M, Feldser D, Huarte M, Zuk O, Carey B, Cassady J, Cabili M: Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009, 458 (7235): 223-227. 10.1038/nature07672.
Article PubMed Central PubMed Google Scholar
Hon G, Wang W, Ren B: Discovery and annotation of functional chromatin signatures in the human genome. PLoS computational biology. 2009, 5 (11): e1000566-10.1371/journal.pcbi.1000566.
Article PubMed Central PubMed Google Scholar
Ucar D, Hu Q, Tan K: Combinatorial chromatin modification patterns in the human genome revealed by subspace clustering. Nucleic acids research. 2011, 39 (10): 4063-75. 10.1093/nar/gkr016.
Article PubMed Central CAS PubMed Google Scholar
Jaschek R, Tanay A: Spatial clustering of multivariate genomic and epigenomic information. Research in Computational Molecular Biology. 2009, Springer, 170-183.
Chapter Google Scholar
Hoffman M, Buske O, Wang J, Weng Z, Bilmes J, Noble W: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods. 2012, 9 (5): 473-476. 10.1038/nmeth.1937.
Article PubMed Central CAS PubMed Google Scholar
Xu H, Wei C, Lin F, Sung W: An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics. 2008, 24 (20): 2344-2349. 10.1093/bioinformatics/btn402.
Article CAS PubMed Google Scholar
Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature biotechnology. 2010, 28 (8): 817-25. 10.1038/nbt.1662.
Article PubMed Central CAS PubMed Google Scholar
Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 43-9. 10.1038/nature09906.
Article PubMed Central CAS PubMed Google Scholar
Bilmes J, Bartels C: On Triangulating Dynamic Graphical Models. Proceedings of the Nineteenth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-03). 2003, San Francisco, CA: Morgan Kaufmann, 47-56.
Google Scholar
Dean T, Kanazawa K: Probabilistic temporal reasoning. Proceedings of the Nineteenth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-03). 1988, AAAI
Google Scholar
Wainwright M, Jordan M: Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning. 2008, 1 (1-2): 1-305.
Google Scholar
Durbin R: Biological sequence analysis: probabilistic models of proteins and nucleic acids. 1998, Cambridge Univ Pr
Book Google Scholar
Murphy K, Weiss Y, Jordan M: Loopy belief propagation for approximate inference: An empirical study. Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. 1999, Morgan Kaufmann Publishers Inc., 467-475.
Google Scholar
Yedidia J, Freeman W, Weiss Y: Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, Volume 8. 2003, 236-239.
Google Scholar
Darwiche A: Modeling and reasoning with Bayesian networks, Volume 54. 2009, Cambridge University Press
Book Google Scholar
ENCODE: 2012, [http://genome.ucsc.edu/ENCODE/]
Bibikova M, Chudin E, Wu B, Zhou L, Garcia E, Liu Y, Shin S, Plaia T, Auerbach J, Arking D: Human embryonic stem cells have a unique epigenetic signature. Genome research. 2006, 16 (9): 1075-1083. 10.1101/gr.5319906.
Article PubMed Central CAS PubMed Google Scholar
Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006, 125 (2): 315-26. 10.1016/j.cell.2006.02.041.
Article CAS PubMed Google Scholar
Lister R, Pelizzola M, Dowen R, Hawkins R, Hon G, Tonti-Filippini J, Nery J, Lee L, Ye Z, Ngo Q: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462 (7271): 315-322. 10.1038/nature08514.
Article PubMed Central CAS PubMed Google Scholar
Kolasinska-Zwierz P, Down T, Latorre I, Liu T, Liu X, Ahringer J: Differential chromatin marking of introns and expressed exons by H3K36me3. Nature genetics. 2009, 41 (3): 376-381. 10.1038/ng.322.
Article PubMed Central CAS PubMed Google Scholar

Download references

Acknowledgements

We thank Ali Mortazavi, Yifei Chen and members of Xie lab for helpful discussions. This work was supported by National Institutes of Health grant T15LM07443 and National Science Foundation grant DBI-0846218.

Declarations

Publication of this article was supported by National Institutes of Health grant T15LM07443.

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.

Author information

Authors and Affiliations

Department of Computer Science, University of California, Irvine, CA, USA
Jacob Biesinger & Xiaohui Xie
Department of Physics and Astronomy, University of California, Irvine, CA, USA
Yuanfeng Wang
Institute for Genomics and Bioinformatics, University of California, Irvine, CA, USA
Jacob Biesinger & Xiaohui Xie

Authors

Jacob Biesinger
View author publications
You can also search for this author in PubMed Google Scholar
Yuanfeng Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaohui Xie
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaohui Xie.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JB and YW implemented the methodology. All authors worked on, read and approved the final manuscript.

Jacob Biesinger, Yuanfeng Wang contributed equally to this work.

Electronic supplementary material

12859_2013_5772_MOESM1_ESM.PDF

Additional file 1: Supplemental material. Additional details on data processing, model derivation, model parametrization, and training results on the ENCODE and synthetic datasets are available in Additional file 1. (PDF 692 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Biesinger, J., Wang, Y. & Xie, X. Discovering and mapping chromatin states using a tree hidden Markov model. BMC Bioinformatics 14 (Suppl 5), S4 (2013). https://doi.org/10.1186/1471-2105-14-S5-S4

Download citation

Published: 10 April 2013
DOI: https://doi.org/10.1186/1471-2105-14-S5-S4

Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013)

Discovering and mapping chromatin states using a tree hidden Markov model

Abstract

Background