 Methodology article
 Open Access
Parameter estimation for robust HMM analysis of ChIP-chip data
BMC Bioinformatics volume 9, Article number: 343 (2008)
Abstract
Background
Tiling arrays are an important tool for the study of transcriptional activity, protein-DNA interactions and chromatin structure on a genome-wide scale at high resolution. Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. Especially in the context of ChIP-chip experiments, no standard procedures exist to obtain parameter estimates from the data. Common methods for the calculation of maximum likelihood estimates such as the Baum-Welch algorithm or Viterbi training are rarely applied in the context of tiling array analysis.
Results
Here we develop a hidden Markov model for the analysis of chromatin structure ChIP-chip tiling array data, using t emission distributions to increase robustness towards outliers. Maximum likelihood estimates are used for all model parameters. Two different approaches to parameter estimation are investigated and combined into an efficient procedure.
Conclusion
We illustrate an efficient parameter estimation procedure that can be used for HMM-based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.
1 Background
High density oligonucleotide tiling arrays allow the investigation of transcriptional activity, protein-DNA interactions and chromatin structure across a whole genome. Tiling arrays have been used in a wide range of studies, including investigation of transcription factor activity [1] and of histone modifications in animals [2] and plants [3], as well as DNA methylation [4]. Analyses of these data are usually based either on a sliding window [1, 5], or on hidden Markov models (HMMs) [6–8]. Other approaches have been suggested, e.g., by Huber et al. [9] and Reiss et al. [10], but are less common.
Parameter estimates for sliding window approaches as well as hidden Markov models are typically ad hoc. Although there are some notable exceptions in gene expression studies [8, 11], no established procedures exist to obtain good parameter estimates from tiling array data, especially in the context of chromatin immunoprecipitation (ChIP-chip) experiments. Attempts have been made to obtain parameter estimates by integrating genome annotations into the analysis [12]. While this may provide good results when investigating transcriptional activity in well studied organisms, it is limited by the quality of available annotations. For ChIP-chip studies the required annotation data are unavailable. A method for the localisation of transcription factors from ChIP-chip experiments by Keleş [13] does obtain the required parameter estimates from the data and allows for variations in length of enriched regions.
Methods designed for the analysis of ChIP-chip data focus almost exclusively on the study of transcription factors [6, 7, 10, 13]. While this is an important class of experiments, ChIP-chip studies are not limited to transcription factors, and the analysis of other ChIP-chip experiments may require new methods. One other area of active research that utilises ChIP-chip experiments is the study of histone modifications and chromatin structure [3]. Although both types of experiment employ the same technology, there are several important differences between them. Most importantly, the 147 bp of DNA bound by a histone complex are considerably longer than the typical transcription factor binding site, and the histone modifications of interest are expected to affect several neighbouring histones. Consequently the ChIP fragments derived from a transcription factor binding site all originate from a small region containing the given binding site, while regions affected by histone modifications can be much longer than the ChIP fragments used. As a result, the data from histone modification experiments usually contain long regions of interest encompassing several non-overlapping ChIP fragments, rather than the short and relatively isolated peaks produced by transcription factor studies.
Here we consider the analysis of data from a histone modification study in Arabidopsis [3]. These data consist of four ChIP samples for histone H3 with lysine 27 trimethylation (H3K27me3) and four histone H3 ChIP samples that act as a control. The aim of this analysis is to identify and characterise regions throughout the genome that exhibit enrichment for H3K27me3. It is desirable to use a method which is specifically designed for the analysis of histone modifications or flexible enough to accommodate the varying length of enriched regions. Furthermore, the method should obtain all parameter estimates from the data without the use of genome annotations and be robust towards outliers. Amongst the methods discussed above, TileMap [7] comes closest to these requirements. Although it was developed with transcription factor analysis in mind, it is general enough that it should provide useful results for other ChIP-chip experiments. This is emphasised by its application to histone modification [3] and DNA methylation [4] data as well as transcription factor analysis [14, 15]. TileMap obtains some, but not all, of the required parameter estimates from the data. To provide a method which meets the requirements outlined above we develop a two-state HMM with t emission distributions. All parameter estimates for the model are obtained by maximum likelihood estimation using the Baum-Welch algorithm [16] and Viterbi training [17]. These methods have the advantage that no prior knowledge about parameter values is required and one need not rely on frequently unavailable genome annotations. To assess the performance of our model, we apply it to simulated and real data. Results are compared to those produced by TileMap. The remainder of this article is structured as follows. In Section 2 the hidden Markov model is developed; MLEs for all parameters are derived in Section 4.
The performance of the resulting model is assessed in terms of sensitivity and specificity on simulated data in Sections 2.3.3–2.3.6. In Section 2.3.7 the model is used to analyse a public ChIP-chip data set [3] and results are compared to the original analysis of these data.
2 Results and discussion
Tiling array data consist of a series of measurements taken along the genome. Typically, microarray probes are designed to interrogate the genome at regular intervals. Design constraints such as probe affinity and uniqueness cause differences in probe density along the genome and can lead to large gaps between probes. Here we assume that the probe density is homogeneous except for a number of large gaps where the distance between adjacent probes is larger than max_gap. In the following analyses we use max_gap = 200 bp. This is identical to the value used by Zhang et al. [3], allowing for a direct comparison of results. Consider a ChIP-chip tiling array experiment with two conditions, a ChIP sample X_1 targeting the protein of interest and a control sample X_2. Each sample X_l has m_l replicates (l = 1, 2), providing measurements for K genomic locations. The measurements for each probe are summarised by the "shrinkage t" statistic [18]:

$$y_k = \frac{\bar{x}_{1k} - \bar{x}_{2k}}{\sqrt{v_{1k}^{\ast}/m_1 + v_{2k}^{\ast}/m_2}}, \qquad (1)$$
where $\bar{x}_{lk}$ denotes the mean over the $m_l$ replicates of probe $k$ in sample $l$ and $v_{lk}^{\ast}$ is a James-Stein shrinkage estimate of the probe variance obtained by calculating

$$v_{lk}^{\ast} = \hat{\lambda}_{l}^{\ast}\, v_{l,\mathrm{median}} + \left(1 - \hat{\lambda}_{l}^{\ast}\right) s_{lk}^{2}, \qquad (2)$$
and $s_{lk}^{2}$ are the usual unbiased empirical variances, $v_{l,\mathrm{median}}$ is the median of these variances, and $\hat{\lambda}_{l}^{\ast}$ is the estimated optimal pooling parameter

$$\hat{\lambda}_{l}^{\ast} = \min\!\left(1,\ \frac{\sum_{k=1}^{K} \widehat{\mathrm{Var}}\left(s_{lk}^{2}\right)}{\sum_{k=1}^{K} \left(s_{lk}^{2} - v_{l,\mathrm{median}}\right)^{2}}\right). \qquad (3)$$
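As a concrete illustration, the shrinkage-t computation can be sketched as follows. This is our own sketch (function and variable names are assumptions, not the paper's software), using the median of the empirical variances as the shrinkage target:

```python
import numpy as np

def shrinkage_t(x1, x2):
    """Shrinkage t statistic for each probe.

    x1, x2 are replicate-by-probe matrices (m_l rows, K columns) for the
    ChIP and control samples respectively.
    """
    summaries = []
    for x in (x1, x2):
        m = x.shape[0]
        s2 = x.var(axis=0, ddof=1)          # unbiased empirical variances s2_lk
        v_med = np.median(s2)               # shrinkage target: median variance
        # Estimated variance of each s2_lk (following Opgen-Rhein & Strimmer)
        w = (x - x.mean(axis=0)) ** 2
        var_s2 = m / (m - 1) ** 3 * ((w - w.mean(axis=0)) ** 2).sum(axis=0)
        # Estimated optimal pooling parameter, truncated at 1
        lam = min(1.0, var_s2.sum() / ((s2 - v_med) ** 2).sum())
        v_star = lam * v_med + (1.0 - lam) * s2   # James-Stein shrinkage estimate
        summaries.append((x.mean(axis=0), v_star, m))
    (mean1, v1, m1), (mean2, v2, m2) = summaries
    return (mean1 - mean2) / np.sqrt(v1 / m1 + v2 / m2)
```

The result is a moderated two-sample t statistic whose denominator pools each probe's variance towards the array-wide median, stabilising the estimates obtained from few replicates.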
Other moderated t statistics have been suggested and could be used instead, most notably the empirical Bayes t statistic used by Ji and Wong [7] and the moderated t of Smyth [19]. All of these approaches are designed to increase performance compared to the ordinary t statistic by incorporating information from all probes on the microarray into individual probe statistics. Here we choose the "shrinkage t" because it does not require any knowledge about the underlying distribution of probe values while providing similar performance compared to more complex models [18].
2.1 Hidden Markov Model
To detect enriched regions we use a two-state discrete time hidden Markov model with continuous emission distributions and homogeneous transition probabilities (Figure 1), i.e., the transition probabilities depend only on the current state of the model. The use of homogeneous transition probabilities assumes equally spaced probes within each observation sequence as well as a geometric distribution of the length of enriched regions. As discussed above there will be some variation in probe distances. Using a relatively small value for max_gap ensures that the assumption of homogeneity holds at least approximately. The two states of the model correspond to enrichment or no enrichment in the ChIP sample. The model is characterised by the set of states S = {S_1, S_2}, the initial state distribution p, the matrix of transition probabilities A and the state-specific emission density functions f_i, i = 1, 2. The emission distribution of state S_i is modelled as a t distribution with location parameter μ_i, scale parameter σ_i, and ν_i degrees of freedom.
The use of t distributions has the advantage that their sensitivity to outliers can be adjusted via the degrees of freedom parameter, making them more robust and versatile than normal distributions. This is particularly useful when ν is estimated from the data [20]. It should be noted that the y_k modelled here are from a t-like distribution (Equation (1)). While this in itself might suggest the use of t distributions for the f_i, they are primarily chosen for their robustness. In the following we will refer to this model by its parameter vector θ = (θ_1, θ_2), where θ_1 is the ordered pair (p, A) and θ_2 the ordered triple (μ, σ, ν).
Given a hidden Markov model θ and an observation sequence Y, it is possible to compute the sequence of states Q = q_1 q_2 ... q_K that is most likely to produce Y. There are several approaches to obtaining Q [21]. Usually Q is computed either by maximising the posterior probabilities P(q_k = S_i | Y; θ), k = 1, ..., K, or by calculating the sequence that maximises P(Q | Y; θ). The latter provides the single most likely sequence of states and can be computed efficiently by the Viterbi algorithm [22]. For the particular model used here both approaches are equivalent.
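The computation of the most likely state sequence can be sketched as follows; this is a generic log-space Viterbi implementation (function and variable names are ours), not the code used for the analysis:

```python
import numpy as np

def viterbi(log_emit, log_A, log_p):
    """Most likely state path for a discrete-time HMM, in log-space.

    log_emit: (K, N) array of log emission densities ln f_i(y_k);
    log_A: (N, N) log transition matrix; log_p: (N,) log initial distribution.
    """
    K, N = log_emit.shape
    delta = log_p + log_emit[0]            # best log-probability ending in each state
    psi = np.zeros((K, N), dtype=int)      # back-pointers
    for k in range(1, K):
        scores = delta[:, None] + log_A    # scores[i, j]: best path entering j from i
        psi[k] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[k]
    q = np.empty(K, dtype=int)
    q[-1] = int(delta.argmax())
    for k in range(K - 1, 0, -1):          # trace the optimal path backwards
        q[k - 1] = psi[k, q[k]]
    return q
```

For a two-state model of this kind, probes assigned to the enrichment state along the returned path form the predicted enriched regions.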
2.2 Parameter Estimation
In this section we will discuss two different approaches to estimating θ for the model described in Section 2.1. The methods under consideration are the EM algorithm, which is usually known as the Baum-Welch algorithm in the context of HMMs, and Viterbi training. While the Baum-Welch algorithm is guaranteed to converge to a local maximum of the likelihood function, it is computationally intensive. Viterbi training provides a faster alternative but may not converge to a local maximum.
2.2.1 Initial Estimates
Both optimisation algorithms discussed here require initial parameter estimates. These are obtained from the data by first partitioning the vector of observations Y into two clusters using k-means clustering [23]. From these clusters the location and scale parameters of the corresponding states are obtained as the mean and variance of the observations in the cluster. In the following, ν_1 = ν_2 = 6 is used as the initial estimate for the degrees of freedom parameters.
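The initialisation step can be sketched as a minimal one-dimensional two-means clustering; this is our own sketch (names and the choice of starting centres are assumptions), with ν_1 = ν_2 = 6 fixed as in the text:

```python
import numpy as np

def initial_estimates(y, n_iter=20):
    """Initial HMM emission parameters via 1-D two-means clustering.

    Returns per-state location (cluster mean), scale (cluster standard
    deviation) and the fixed initial degrees of freedom nu = 6.
    """
    y = np.asarray(y, dtype=float)
    centres = np.array([y.min(), y.max()])       # spread-out starting centres
    for _ in range(n_iter):
        labels = np.abs(y[:, None] - centres).argmin(axis=1)
        for i in (0, 1):
            if np.any(labels == i):              # keep centre if cluster empties
                centres[i] = y[labels == i].mean()
    labels = np.abs(y[:, None] - centres).argmin(axis=1)
    mu = np.array([y[labels == i].mean() for i in (0, 1)])
    sigma = np.array([y[labels == i].std(ddof=1) for i in (0, 1)])
    nu = np.array([6.0, 6.0])
    return mu, sigma, nu
```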
2.2.2 Baum-Welch Algorithm
The Baum-Welch algorithm [16] is a well established iterative method for estimating parameters of HMMs. It represents the EM algorithm [24] for the specific case of HMMs. This algorithm can be used to optimise the transition parameters θ_1 as well as the emission parameters θ_2. Each iteration of the algorithm consists of two phases. During the first phase, the current parameter estimates are used to determine, for each probe statistic in the observation sequence, how likely it is to be produced by the different states of the model. In the second phase, parameters for the emission distributions of each state are estimated using contributions from all observations, according to the probability that they were produced by the respective state of the model. The state transition parameters are updated in a similar fashion, accounting for the probability of transitions between states based on the observation sequence and the current model. After each iteration this procedure results in a model which explains the observed data better than the previous one, approaching a locally optimal solution. Using this method, parameter estimates are updated until convergence is achieved. The details of the resulting algorithm are outlined in Section 4.1.
This method of parameter estimation is computationally expensive and time-consuming for a typical tiling array data set. The computing time can be reduced by fixing the degrees of freedom for the emission distributions in advance, thus avoiding the root-finding required for the estimation of these parameters. While this does not provide the same flexibility as estimating the required degree of robustness from the data, it reduces the complexity of the optimisation problem. Liu and Rubin [25] note that attempts to estimate the degrees of freedom are prone to producing results of little practical interest. The impact of this choice on classification performance is investigated in Section 2.3.
The formulation of the Baum-Welch algorithm used in this article is based on the description given by Rabiner [21] and on the EM algorithm derived by Peel and McLachlan [26] for fitting mixtures of t distributions.
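For fixed ν, the second-phase update of one state's emission parameters can be sketched as follows, in the spirit of the EM for t distributions of Peel and McLachlan [26]. This is our illustrative sketch, not the paper's code; `gamma` stands for the per-probe state posteriors computed in the first phase, and the scale update shown divides by the summed posteriors, which is one common variant:

```python
import numpy as np

def t_mstep(y, gamma, mu, sigma, nu):
    """One weighted update of a t emission's location and scale for a
    single state, given state posteriors gamma and fixed degrees of
    freedom nu."""
    delta2 = ((y - mu) / sigma) ** 2       # squared standardised distances
    u = (nu + 1.0) / (nu + delta2)         # latent weights: outliers are down-weighted
    w = gamma * u
    mu_new = (w * y).sum() / w.sum()
    sigma2_new = (w * (y - mu_new) ** 2).sum() / gamma.sum()
    return mu_new, float(np.sqrt(sigma2_new))
```

As ν → ∞ the weights u approach 1 and the update reduces to the usual Gaussian responsibility-weighted mean and variance, which is why large ν recovers a normal-emission HMM.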
2.2.3 Viterbi Training
While the Baum-Welch algorithm described in Section 2.2.2 is expected to provide good parameter estimates, it is computationally expensive. A faster model-fitting procedure can be devised by replacing the first phase of the Baum-Welch algorithm with a maximisation step. This method was introduced in [17] as segmental k-means and is now commonly referred to as Viterbi training. Unlike the Baum-Welch algorithm, which allows each probe statistic to contribute to the parameter estimates for all states, Viterbi training assigns each observation to the state that is most likely to produce the given probe statistic. Thus each observation contributes to exactly one state of the model. While each iteration of this method is faster than one iteration of the Baum-Welch algorithm, some iterations may decrease the likelihood of the model, thus failing to advance it towards a useful solution. See Section 4.2 for further details on the implementation of Viterbi training used here.
2.3 Testing
2.3.1 Simulated Data
To assess the ability of the models obtained by the different parameter estimation methods discussed in Section 2.2 to distinguish between enriched and non-enriched probes, we simulate data with known enriched regions. To ensure that the simulation study provides meaningful results, it is based as closely on real data as possible. To this end, two independent analyses of the H3K27me3 data published by Zhang et al. [3] are carried out, one using TileMap [7], the other based on our model. The result of each analysis is used to generate a new dataset with known enriched regions. See Section 4 for further details. In the following these data are referred to as datasets I and II respectively. Since the simulation procedure is likely to bias results towards the model that was used in the process, we concentrate on the analysis of dataset I, with some results for dataset II presented for comparison. The use of data based on both models allows us to consider their performance under advantageous and disadvantageous conditions.
2.3.2 Performance Measure
The performance of different models on these data is determined in terms of false positive and false negative rates at probe level. While the relative importance of false positives and false negatives depends on the experiment under consideration, they are often equally problematic in the context of ChIP-chip experiments. This is especially true of experiments that investigate differences between cell lines or developmental stages, where all incorrect classifications are of equal concern. In this context, we define false positives as probes that are classified as non-enriched by the analysis of the real data but are called enriched in the subsequent analysis of simulated data, and vice versa for false negatives.
The output of each model is the estimated posterior probability of enrichment for each probe. In practice, probe calls ("enriched" or "non-enriched") are generated from this posterior probability based on a 0.5 cutoff. For any given model, classification performance will change with the chosen threshold. Thus we assess model performance across a range of cutoffs, reporting the relative number of false positives and false negatives as well as the error rate. The latter is also used to determine the cutoff that minimises incorrect classification results, and model performance is judged on the numbers of incorrect classifications at this optimal cutoff and at the usual 0.5 cutoff, and on the distance between the optimal cutoff and 0.5. The trade-off between sensitivity and specificity provided by the different models is characterised with ROC curves and the associated AUC values.
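The evaluation described above can be sketched as follows; this is a generic sketch (names are ours) that computes the error rate over a cutoff grid and a trapezoidal ROC AUC, without special handling of tied posteriors:

```python
import numpy as np

def evaluate_calls(post, truth, cutoffs):
    """Error rate of probe calls over a grid of posterior cutoffs,
    plus the ROC AUC obtained by sweeping the cutoff."""
    post = np.asarray(post, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # error rate: fraction of probes whose call disagrees with the truth
    err = np.array([np.mean((post >= c) != (truth == 1)) for c in cutoffs])
    best_cutoff = cutoffs[int(err.argmin())]
    # ROC curve: sort probes by decreasing posterior, accumulate TPR/FPR
    order = np.argsort(-post)
    t = truth[order]
    tpr = np.concatenate(([0.0], np.cumsum(t) / t.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - t) / (1 - t).sum()))
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
    return err, best_cutoff, auc
```

Comparing the cutoff that minimises `err` with the conventional 0.5 cutoff gives exactly the kind of calibration check applied to the models in this section.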
Another measure of interest is the ability to characterise the length distribution of enriched regions correctly. When studying chromatin structure the extent of structural changes is of interest; this is the case for the data studied in Section 2.3.7. This property of the different models is investigated in Section 2.3.6.
2.3.3 Estimating Degrees of Freedom
We now consider the performance of both the Baum-Welch procedure and Viterbi training when all model parameters, including the degrees of freedom ν, are estimated from the data. Both parameter estimation methods are used to fit an HMM to datasets I and II, and the performance of the resulting models is assessed in terms of the achieved error rate (Figure 2), ROC curves (Figure 3) and their associated AUC (Table 1) for both datasets. To assess how well these methods perform in comparison to an established algorithm, we also fit a TileMap model to the two simulated datasets. The three models are compared to each other, as well as to an ad hoc model which simply uses, without optimisation, the initial parameter estimates employed by the two parameter optimisation methods. When comparing the performance of these models on both simulated datasets, it is important to consider that the simulation procedure introduces a bias towards the underlying model.
Estimating all parameters from the data with either the Baum-Welch algorithm or Viterbi training leads to models with high sensitivity, producing fewer false negatives than TileMap for any given cutoff [see Additional file 1]. At the same time they lead to an increased number of false positives [see Additional file 2] compared to TileMap, indicating a slight reduction of specificity. When considering the error rate it becomes apparent that both Baum-Welch and Viterbi training provide a favourable trade-off between sensitivity and specificity. These models reduce the number of incorrect classifications compared to TileMap both at the usual 0.5 cutoff and at the optimal cutoff. Moreover, while the Baum-Welch algorithm and Viterbi training both lead to models with an optimal cutoff close to 0.5 (0.51 and 0.42 respectively), TileMap provides an optimal cutoff of 0.19, indicating that it underestimates the posterior probability of enrichment. This becomes even more apparent when considering the result for dataset II where the optimal cutoff for TileMap is at 0.002 compared to 0.5 for Baum-Welch and 0.41 for Viterbi training. This result suggests that TileMap is more tuned towards avoiding false positives than false negatives. From the above results we estimate that the weight given to false positives by TileMap is approximately 3.2 and 26 times larger than the weight for false negatives on datasets I and II respectively. The ROC curves (Figure 3) provide further evidence that the models with MLEs outperform TileMap. Although all three models perform well on dataset I, both parameter optimisation methods lead to better results than TileMap. The benefits of optimising parameter estimates are further highlighted by the performance of the model with ad hoc estimates that is used as starting point for the optimisation procedures.
On both datasets, optimised parameters provide a notable increase in performance, with TileMap performing only slightly better than the ad hoc model on dataset II.
2.3.4 Fixed Degrees of Freedom
Estimating ν, the degrees of freedom of the t distributions, from the data is time-consuming and may not be very accurate, especially for relatively large values of ν. In this section we investigate the effect of fixing ν a priori for both states of the model. Only the case ν_1 = ν_2 is considered here. The remaining parameters are estimated from the training data using the Baum-Welch algorithm and Viterbi training with ν = 3, 4, ..., 50. For each value of ν, we report the error rate (Figure 4) as well as the AUC (Figure 5) on the simulated data.
For the best combination of ν and cutoff, both parameter estimation methods result in models with a classification performance comparable to the case of variable degrees of freedom (Figure 2). While the Baum-Welch algorithm tends to produce models with an optimal cutoff close to 0.5, Viterbi training only achieves this for large values of ν. Notably, the best classification performance of the Viterbi-trained model is achieved with 14 degrees of freedom and a 0.37 cutoff, compared to 7 degrees of freedom and a 0.49 cutoff for Baum-Welch. This results in decreased performance of the Viterbi model relative to the Baum-Welch model at the 0.5 cutoff.
2.3.5 Convergence
To reduce the time required for parameter estimation it is useful to limit the number of iterations. While each iteration of the Baum-Welch algorithm is guaranteed to improve the likelihood of the model, small changes to the parameter values do not necessarily lead to significant changes in the classification result. Furthermore, Viterbi training is not guaranteed to converge to a local maximum of the likelihood function and a likelihood-based convergence criterion may not be appropriate for this method. Here we investigate the convergence of both algorithms based on the error rate and AUC to gauge the number of iterations required to achieve good classification results. Parameter estimation is performed with 60 iterations for both algorithms. Current estimates are used to classify the test data at every 5th iteration and the AUC (Figure 5) and error rate (Figure 6) are determined.
The most striking difference in the convergence behaviour of the two methods is that Viterbi training appears to obtain good parameter estimates within a small number of iterations, and further iterations of the algorithm do not improve results substantially. The Baum-Welch procedure provides parameter estimates that are better than the ones obtained by Viterbi training, both in terms of likelihood and classification performance, but takes substantially longer to obtain these estimates. The Baum-Welch algorithm not only requires more iterations than Viterbi training, but the time required for each iteration is also longer.
2.3.6 Length Distribution of Enriched Regions
When studying histone modifications one possible characteristic of interest is the length of enriched regions. To assess how accurately the different methods reflect the length distribution of enriched regions, we compare the length of regions predicted by TileMap and by the model (using Baum-Welch parameter estimates) to the length distribution of enriched regions in the simulated data (the "true length distribution"). Note that this length distribution may vary from the one found in real data. Nevertheless this comparison highlights some of the differences between the two models. Quantile-quantile plots of the respective length distributions show that TileMap systematically underestimates the length of enriched regions (Figure 7 (bottom left) and Figure 8 (bottom left)). While this effect is relatively small on dataset I there is some indication that it increases with region length and long regions may not be characterised appropriately by TileMap (Figure 7 (top left)). This observation is further supported by the length distribution of enriched regions produced by TileMap on dataset II (Figure 8 (left)). Enriched regions in dataset II are generally longer than regions in dataset I. This difference is not captured by TileMap. Both TileMap and the Baum-Welch trained model produce several regions that are shorter than the shortest enriched region in the simulated data (Figure 7 (bottom)). There are two possible explanations for these short regions. They may be caused by underestimating the length of enriched regions, possibly splitting one enriched region into several predicted regions, or they may represent spurious enrichment calls produced by the model. In each case there is the possibility that the occurrence of extremely short regions is caused either by an intrinsic shortcoming of the model or by artifacts introduced during the simulation process.
Since the simulation relies on TileMap to identify enriched and non-enriched probes it is inevitable that some probes will be misclassified. Subsequently these probes may be included in the simulated data, causing short disruptions of enriched and non-enriched regions. A sufficiently sensitive model could detect these unintended changes between enriched and non-enriched states.
To investigate further which of these is the case, we first examine the number of enriched probes contained in the short regions found by the Baum-Welch model and by TileMap respectively. The model with Baum-Welch parameter estimates found 126 regions with fewer than 10 probes. These regions contain a total of 866 probes, of which 717 are in enriched regions. While this indicates that the majority of short regions are due to underestimating the length of enriched regions, several spurious probe calls remain. TileMap produced 249 regions with fewer than 10 probes, containing a total of 1781 probes, of which 1753 are in enriched regions. This is strong evidence that almost all of these short regions are caused by underestimating the length of enriched regions, and is consistent with the above observation that TileMap systematically underestimates the length of enriched regions.
To investigate whether the spurious short regions produced by the Baum-Welch model are due to an intrinsic shortcoming of the model or are artifacts introduced by the simulation procedure, we turn to the real data. Here we focus on enriched regions containing only a single probe, which are most likely to be false positives. On dataset I the Baum-Welch model produced six of these extremely short regions. One of these probes is a true positive from an enriched region containing ten probes, i.e., the length of this region is underestimated by the Baum-Welch model. Of the remaining five probes, three are identical, leaving three unique probes to be investigated further. For each of these three probes, we determine its position in the real data and its distance from enriched regions identified by TileMap and by our model (Section 2.3.7). Two of the probes are found to be located close to enriched regions identified by TileMap (142 and 391 bp) and all three probes are contained within enriched regions identified by our model [see Additional file 3]. This suggests that these probes may have been misclassified by TileMap during the original analysis, leading to an overestimation of the number of false positives produced by the Baum-Welch model on dataset I.
2.3.7 Application to ChIP-chip Data
To investigate the performance of our model further, we apply it to the data of [3] and compare the result to the original analysis. Based on the results of the simulation study (Sections 2.3.3–2.3.6) we use the following procedure:

1. Quantile normalise and log transform the data;

2. Calculate probe statistics (Equation (2));

3. Obtain initial estimates (Section 2.2.1);

4. Use 5 iterations of Viterbi training to improve the initial estimates;

5. Use 15 iterations of the Baum-Welch algorithm to obtain maximum likelihood estimates;

6. Apply the resulting model to the data to identify enriched regions.
This results in the detection of 5285 H3K27me3 regions covering 12.9 Mb of genomic sequence. Of these enriched regions, 3962 (~75%) overlap at least one annotated transcript. A total of 4982 genes, or about 18.9% of all annotated genes, are found to be enriched for H3K27me3. While most of the enriched regions cover a single gene, some regions are found to contain up to seven genes (Figure 9(b)). Enriched regions are predominantly longer than 1 kb, with some extending over more than 20 kb (Figure 9(c)).
To assess whether there is a difference between regions of the genome that show H3K27me3 enrichment and the rest of the genome, we investigate the density of genes in the neighbourhood of genes that appear to be regulated by H3K27me3, and compare this to the gene density in other regions of the genome. For this purpose we obtain the gene density for the 50 kb upstream and downstream of each gene as (bp annotated as genes)/100 kb. The resulting gene densities for genes with and without enriched regions are summarised in Figure 9(a). There are visible differences between the two distributions, which we test for significance with a two-sided Kolmogorov-Smirnov test; this results in an approximate p-value of 2 × 10^{−15}. The significance of this result is further confirmed by a resampling experiment: the smallest p-value obtained from a series of 10000 resampled datasets is 1 × 10^{−6}.
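As an illustration of this comparison, the two-sample KS distance and a resampling check can be sketched as follows. The KS statistic is the standard maximum ECDF distance; the permutation scheme shown is only our assumption about the form of the resampling experiment, and all names are ours:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical distribution functions of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def resampling_check(a, b, n_rounds=1000, seed=None):
    """Compare the observed KS distance to distances obtained after
    randomly reassigning the pooled values to the two groups."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    obs = ks_stat(a, b)
    null = np.array([ks_stat(*np.split(rng.permutation(pooled), [len(a)]))
                     for _ in range(n_rounds)])
    return obs, float(np.mean(null >= obs))
```

A vanishing tail fraction from the resampled datasets mirrors the finding above that the observed difference in gene density is far outside what random group assignment produces.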
3 Conclusion
With the use of MLEs for all model parameters, our model clearly improves classification performance on simulated data compared to ad hoc estimates, and outperforms TileMap. While our model produced some short regions that appear to be false positives, they are readily explained as a result of the simulation process. Comparison of results on simulated and real data suggests that TileMap produced a large number of false negatives in the original analysis used as the basis for the simulation. Inevitably, these false negatives were selected as part of non-enriched regions during the simulation process. The fact that the model with Baum-Welch parameter estimates was able to identify these isolated enriched probes despite the non-enriched contexts in which they appeared emphasises the high sensitivity of the model.
TileMap's apparent tendency to penalise false positives more than false negatives clearly contributes to its relatively low performance in our comparisons, which are based on the assumption that both types of error are equally problematic. While this is the case for the application considered here, one may argue that false positives are of greater concern in some settings. In such cases, TileMap's trade-off between sensitivity and specificity may lead to better results. However, it should be noted that the relative weights given to false positives and false negatives by TileMap can vary substantially between datasets. The parameter estimation procedure used for our model, on the other hand, provides consistent performance at the chosen cutoff.
The model-fitting procedure derived from the results of the simulation study (Sections 2.3.3–2.3.6) provides a fast and reliable approach to parameter estimation. This method retains all the favourable properties of the Baum-Welch algorithm while benefiting from the reduced computing time provided by Viterbi training. The use of MLEs ensures that model parameters are appropriate for the data. Results from the simulation study show that estimating model parameters from the data improves the model's ability to recognise enriched regions of varying length and generally improves classification performance.
3.1 Future Work
The analysis of the H3K27me3 data (Section 2.3.7) largely confirms the analysis of [3], although there are some notable differences. Most importantly, the H3K27me3 regions detected by our analysis are longer than those determined by TileMap (Figure 10). While Zhang et al. [3] found few regions longer than 1 kb, our analysis indicates that over 70% of enriched regions are at least 1 kb long, with the longest region spanning over 20 kb. Accordingly, we find more regions that extend over several genes (Figure 9(b)). This may have implications for conclusions about the spreading of H3K27me3 regions in Arabidopsis.
At this stage, the biological significance of the observed difference in gene density in the neighbourhood of enriched and nonenriched genes is unclear. However, it indicates that the two groups of genes differ in a significant way. This suggests that the partition into enriched and nonenriched genes produced by our analysis is indeed meaningful.
The hidden Markov model presented in this article uses homogeneous transition probabilities, which assumes that all probes are equally spaced along the genome. To satisfy this assumption at least approximately, we use a fixed cutoff of 200 bp to partition the sequence of probe statistics so that there are no large gaps between probes within a partition. This arbitrary cutoff could be avoided by using a continuous-time hidden Markov model.
4 Methods
4.1 Baum-Welch Algorithm
The Baum-Welch algorithm [16] used to estimate parameters for our model is outlined in Section 2.2.2; further details are given below. Computing the likelihood of the long observation sequences produced by tiling arrays involves products of many small contributions. This typically results in likelihoods below machine precision. To avoid this effect, computations are carried out in log-space, using the identity

ln(x + y) = ln(x) + ln(1 + e^{ln(y) − ln(x)}).     (4)
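As an illustration, the identity can be implemented as follows (a minimal Python sketch, not the tileHMM implementation; the function name is ours):

```python
import math

def ln_sum(ln_x, ln_y):
    """Compute ln(x + y) given ln(x) and ln(y), without leaving log-space.

    Factoring out the larger term keeps the exponent non-positive, so the
    exponential cannot overflow; math.log1p improves accuracy when
    exp(ln_y - ln_x) is small."""
    if ln_x < ln_y:                       # ensure ln_x is the larger term
        ln_x, ln_y = ln_y, ln_x
    if ln_y == float("-inf"):             # adding zero: ln(x + 0) = ln(x)
        return ln_x
    return ln_x + math.log1p(math.exp(ln_y - ln_x))

# ln(3 + 5) recovered from ln(3) and ln(5)
print(ln_sum(math.log(3), math.log(5)))  # ≈ ln(8)
```

Repeated application of this two-argument form is what turns the products of small probabilities in the likelihood into numerically stable sums of logarithms.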
In the following we use ^{ln}∑ to denote summations that should be computed via Equation (4). The sequence of probe statistics Y is split into D observation sequences Y^{(d)} such that the distance between probes within each observation sequence is at most max_gap and the distance between the end points of different observation sequences is greater than max_gap.
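The partitioning by max_gap can be sketched as follows (illustrative Python; the function and argument names are hypothetical, and the default cutoff of 200 bp follows the choice discussed in Section 3.1):

```python
def split_by_gap(positions, values, max_gap=200):
    """Partition probe statistics into observation sequences such that
    consecutive probes within a sequence are at most max_gap apart.

    `positions` must be sorted genomic coordinates with one entry per
    probe statistic in `values`; both are assumed non-empty.
    (Illustrative sketch, not the tileHMM API.)"""
    sequences, current = [], [values[0]]
    for k in range(1, len(positions)):
        if positions[k] - positions[k - 1] > max_gap:
            # gap too large: close the current observation sequence
            sequences.append(current)
            current = []
        current.append(values[k])
    sequences.append(current)
    return sequences

# probes at 0, 100, 150 form one sequence; 600 and 700 start a new one
print(split_by_gap([0, 100, 150, 600, 700], [1.0, 2.0, 3.0, 4.0, 5.0]))
# [[1.0, 2.0, 3.0], [4.0, 5.0]]
```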
The emission distribution of state S_i is given as
For a given parameter set θ we can obtain new parameter estimates for transition probabilities by calculating
Here α_k and β_k are known as forward and backward variables. For observation sequence d, d = 1, ..., D, they are defined as
where 1 ≤ i ≤ N, 1 ≤ j ≤ N, 1 ≤ k < K_d and
where 1 ≤ i ≤ N, 1 ≤ j ≤ N, k = K_d − 1, ..., 1. Note that ln[P(Y^{(d)}; θ)] is given by ^{ln}∑_{i=1}^{N} α^{d}_{K_{d}i}. We then calculate
Combining the estimates from all observation sequences we obtain new parameter estimates for the transition probabilities:
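The log-space forward recursion underlying these quantities can be sketched as follows (a generic HMM sketch under standard notation, not the article's code; log_emis[k][i] denotes ln b_i(y_k)):

```python
import math

def log_forward(log_pi, log_A, log_emis):
    """Forward recursion for one observation sequence, entirely in
    log-space.  Returns the forward lattice and ln P(Y; theta), obtained
    by summing the final forward variables via the log-sum identity."""
    def ln_sum(terms):
        # log-sum-exp over a list of log-scale terms
        m = max(terms)
        if m == float("-inf"):
            return m
        return m + math.log(sum(math.exp(t - m) for t in terms))

    n_states = len(log_pi)
    # initialisation: alpha_1(i) = pi_i * b_i(y_1), in log-space
    alpha = [[log_pi[i] + log_emis[0][i] for i in range(n_states)]]
    for k in range(1, len(log_emis)):
        prev = alpha[-1]
        # induction: alpha_{k}(i) = (sum_j alpha_{k-1}(j) a_{ji}) b_i(y_k)
        alpha.append([
            ln_sum([prev[j] + log_A[j][i] for j in range(n_states)])
            + log_emis[k][i]
            for i in range(n_states)
        ])
    return alpha, ln_sum(alpha[-1])
```

The backward variables are computed analogously, running the recursion from the end of the sequence towards its start.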
Calculations for the re-estimation of θ_2 may involve negative values and cannot be carried out in log-space.
To obtain the required parameter estimates we first define ln[τ^{d}_{ki}] = γ^{d}_{ki} and then compute
There is no closed-form estimate for ν_i. To obtain the estimate ν̂_i one has to find a solution to the equation
where ψ is the digamma function. Standard root-finding techniques are employed to find a solution to Equation (20).
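Since Equation (20) is not reproduced here, the sketch below only illustrates the general approach: bisect a digamma-based score function on a bracketing interval. The score follows the generic t-distribution degrees-of-freedom update of Liu and Rubin [25], with a constant c standing in for the data-dependent term; the article's equation may differ in its exact form.

```python
import math

def digamma(x, h=1e-6):
    """Numerical digamma via a central difference of log-gamma; adequate
    for illustration (a production implementation would use a series
    expansion, e.g. scipy.special.digamma)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def nu_score(nu, c):
    """Generic score for the t-distribution degrees of freedom, of the
    form arising in ECM-style updates; c stands in for the data term."""
    return (-digamma(nu / 2.0) + math.log(nu / 2.0) + c + 1.0
            + digamma((nu + 1.0) / 2.0) - math.log((nu + 1.0) / 2.0))

def estimate_nu(c, lo=0.05, hi=500.0, tol=1e-8):
    """Bisection: the score decreases monotonically from +inf (nu -> 0)
    towards c + 1, so a root is bracketed whenever c + 1 < 0."""
    f_lo = nu_score(lo, c)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = nu_score(mid, c)
        if f_mid * f_lo > 0.0:   # root lies in the upper half
            lo, f_lo = mid, f_mid
        else:                    # root lies in the lower half
            hi = mid
    return 0.5 * (lo + hi)
```

Any standard one-dimensional root-finder (bisection, Brent's method) works here, since the score is monotone in ν on the bracketing interval.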
4.2 Viterbi Training
Viterbi training provides a faster alternative to the Baum-Welch algorithm; see Section 2.2.3 for a high-level description of the algorithm. Details of the parameter estimation procedure are given below. Instead of calculating the conditional expectation of the complete-data log-likelihood, this algorithm first computes the most likely state sequence Q given the observation sequence Y and the current model θ. The sequence Y is then partitioned according to Q, assigning each observation to the state from which it most likely originated. New estimates for θ_1 are then obtained by calculating
Updates for μ and σ are obtained as in Section 2.2.1. The degrees of freedom ν can be either fixed in advance or estimated from the data using Equation (20) by setting τ^{d}_{ki} = 1 if q^{d}_{k} = S_i and τ^{d}_{ki} = 0 otherwise.
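A single Viterbi-training update can be sketched as follows (a generic Python illustration with hypothetical names, not the tileHMM code; only the transition and mean re-estimation from a decoded state path are shown):

```python
from collections import Counter

def viterbi_reestimate(state_path, observations, states):
    """One Viterbi-training update: each observation is assigned to its
    decoded state, transition probabilities are re-estimated from
    transition counts along the path, and state means are re-estimated
    from the assigned observations."""
    trans = Counter(zip(state_path, state_path[1:]))
    a_hat = {}
    for i in states:
        total = sum(trans[(i, j)] for j in states)
        for j in states:
            # fall back to a uniform row if state i never occurs
            a_hat[(i, j)] = trans[(i, j)] / total if total else 1.0 / len(states)
    mu_hat = {}
    for i in states:
        assigned = [y for q, y in zip(state_path, observations) if q == i]
        mu_hat[i] = sum(assigned) / len(assigned) if assigned else 0.0
    return a_hat, mu_hat

path = ["E", "E", "N", "N", "E"]          # decoded state sequence
obs = [2.0, 3.0, 0.0, 1.0, 4.0]           # probe statistics
a, mu = viterbi_reestimate(path, obs, ["E", "N"])
print(a[("E", "E")], mu["E"])             # 0.5 3.0
```

Iterating decoding and re-estimation until the path stops changing yields the full training procedure; because hard assignments replace posterior weights, each iteration is considerably cheaper than a Baum-Welch step.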
4.3 Simulated Data
In a first step, following the original analysis by [3], TileMap [7] is used with the HMM option to classify probes as enriched or non-enriched. Note that, although this classification of probes is not perfect, it can be assumed that most probes are assigned to the correct group. The length distribution of enriched and non-enriched regions detected by TileMap is used to determine the length distributions for the simulated data, after removing all regions that contain fewer than 10 probes (Figure 10). Data are generated by first drawing the lengths of enriched and non-enriched regions from the empirical length distributions and then sampling data points from the respective TileMap-generated clusters. Following this procedure, 600 sequences with one to ten enriched regions each are generated. A second dataset is generated by applying the model described in Section 2. Note that, although this procedure relies on the classifications produced by the respective models, the resampling procedure places individual probe values in a new context of surrounding probes, which may lead to different probe calls in the analysis of the simulated data. Prior to analysis all data are quantile normalised.
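The resampling scheme can be sketched as follows (hypothetical names throughout; the empirical length distributions and probe pools stand in for the TileMap-derived quantities described above):

```python
import random

def simulate_sequence(enriched_lengths, null_lengths, enriched_pool,
                      null_pool, n_regions, seed=0):
    """Sketch of the resampling procedure: region lengths are drawn from
    empirical length distributions, and each region is filled with probe
    statistics resampled from the corresponding classified pool.  Regions
    alternate non-enriched / enriched; `labels` records the true class
    of every simulated probe (0 = non-enriched, 1 = enriched)."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n_regions):
        for lengths, pool, label in ((null_lengths, null_pool, 0),
                                     (enriched_lengths, enriched_pool, 1)):
            length = rng.choice(lengths)           # empirical length draw
            data.extend(rng.choice(pool) for _ in range(length))
            labels.extend([label] * length)
    return data, labels
```

Because the true labels are retained alongside the resampled values, sensitivity and specificity of any probe classifier can be evaluated directly on the simulated sequences.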
5 Availability
The parameter estimation methods used in this article are available as part of the R package tileHMM from the authors' web page http://www.bioinformatics.csiro.au/TileHMM/ and from CRAN. The simulated data used in this study are available from the authors' web page.
References
Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, Wheeler R, Wong B, Drenkow J, Yamanaka M, Patel S, Brubaker S, Tammana H, Helt G, Struhl K, Gingeras TR: Unbiased Mapping of Transcription Factor Binding Sites along Human Chromosomes 21 and 22 Points to Widespread Regulation of Noncoding RNAs. Cell 2004, 116: 499–509. 10.1016/S0092-8674(04)00127-8
Bernstein BE, Kamal M, Lindblad-Toh K, Bekiranov S, Bailey DK, Huebert DJ, McMahon S, Karlsson EK, Kulbokas EJ III, Gingeras TR, Schreiber SL, Lander ES: Genomic Maps and Comparative Analysis of Histone Modifications in Human and Mouse. Cell 2005, 120: 169–181. 10.1016/j.cell.2005.01.001
Zhang X, Clarenz O, Cokus S, Bernatavichute YV, Goodrich J, Jacobsen SE: WholeGenome Analysis of Histone H3 Lysine 27 Trimethylation in Arabidopsis. PLoS Biol 2007, 5(5):e129. 10.1371/journal.pbio.0050129
Zhang X, Yazaki J, Sundaresan A, Cokus S, Chan SWL, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE, Ecker JR: Genomewide HighResolution Mapping and Functional Analysis of DNA Methylation in Arabidopsis. Cell 2006, 126: 1189–1201. 10.1016/j.cell.2006.08.003
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M: Global Identification of Human Transcribed Sequences with Genome Tiling Arrays. Science 2004, 306(5705):2242–2246. 10.1126/science.1103388
Li W, Meyer CA, Liu XS: A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 2005, 21(Suppl 1):i274–i282. 10.1093/bioinformatics/bti1046
Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridisations. Bioinformatics 2005, 21(18):3629–3636. 10.1093/bioinformatics/bti593
Munch K, Gardner PP, Arctander P, Krogh A: A hidden Markov model approach for determining expression from genomic tiling micro arrays. BMC Bioinformatics 2006, 7: 239. 10.1186/1471-2105-7-239
Huber W, Toedling J, Steinmetz LM: Transcript mapping with highdensity oligonucleotide tiling arrays. Bioinformatics 2006, 22(16):1963–1970. 10.1093/bioinformatics/btl289
Reiss DJ, Facciotti MT, Baliga NS: Modelbased deconvolution of genomewide DNA binding. Bioinformatics 2008, 24(3):396–403. 10.1093/bioinformatics/btm592
Toyoda T, Shinozaki K: Tiling array-driven elucidation of transcriptional structures based on maximum-likelihood and Markov models. The Plant Journal 2005, 43: 611–621. 10.1111/j.1365-313X.2005.02470.x
Du J, Rozowsky J, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M: A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 2006, 22(24):3016–3024. 10.1093/bioinformatics/btl515
Keleş S: Mixture Modeling for Genome-Wide Localization of Transcription Factors. Biometrics 2007, 63: 10–21. 10.1111/j.1541-0420.2005.00659.x
Ji H, Vokes SA, Wong WH: A comparative analysis of genomewide chromatin immunoprecipitation data for mammalian transcription factors. Nucl Acids Res 2006, 34(21):e146. 10.1093/nar/gkl803
Sandmann T, Girardot C, Brehme M, Tongprasit W, Stolc V, Furlong EEM: A core transcriptional network for early mesoderm development in Drosophila melanogaster. Genes & Development 2007, 21(4):436–449. 10.1101/gad.1509007
Baum LE, Petrie T, Soules G, Weiss N: A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 1970, 41: 164–171. 10.1214/aoms/1177697196
Juang BH, Rabiner LR: A segmental k-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 1990, 38(9):1639–1641. 10.1109/29.60082
Opgen-Rhein R, Strimmer K: Accurate Ranking of Differentially Expressed Genes by a Distribution-Free Shrinkage Approach. Statistical Applications in Genetics and Molecular Biology 2007, 6: Article 9. 10.2202/1544-6115.1252
Smyth GK: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article 3. 10.2202/1544-6115.1027
Lange KL, Little RJA, Taylor JMG: Robust Statistical Modeling Using the t Distribution. Journal of the American Statistical Association 1989, 84(409):881–896. 10.2307/2290063
Rabiner LR: A tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 1989, 77(2):257–286. 10.1109/5.18626
Viterbi AJ: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 1967, 13: 260–269. 10.1109/TIT.1967.1054010
Hartigan JA, Wong MA: A K-means clustering algorithm. Applied Statistics 1979, 28: 100–108. 10.2307/2346830
Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 1977, 39(1):1–38.
Liu C, Rubin DB: ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica 1995, 5: 19–39.
Peel D, McLachlan GJ: Robust mixture modelling using the t distribution. Statistics and Computing 2000, 10: 339–348. 10.1023/A:1008981510081
Acknowledgements
PH is supported by an MQRES scholarship from Macquarie University and a topup scholarship from CSIRO. The authors would like to thank Michael Buckley for his helpful suggestions.
Author information
Additional information
6 Authors' contributions
PH conducted the research and wrote the manuscript. DB critically revised the manuscript. GS conceived the project. DB and GS provided supervision to PH. All authors have read and approved the final manuscript.
Electronic supplementary material
12859_2008_2328_MOESM1_ESM.png
Additional file 1: False negative probe calls resulting from different models. For any given cutoff TileMap produces more false negatives than the BaumWelch and Viterbi trained models. (PNG 76 KB)
12859_2008_2328_MOESM2_ESM.png
Additional file 2: False positive probe calls resulting from different models. For any given cutoff TileMap produces fewer false positives than the BaumWelch and Viterbi trained models. (PNG 444 KB)
12859_2008_2328_MOESM3_ESM.png
Additional file 3: Origin of isolated enriched probes in dataset I. The isolated enriched probes identified in dataset I by the BaumWelch model originate from enriched regions identified by the BaumWelch model in the real data. Two out of three probes are located close to enriched regions identified by TileMap. (PNG 3 MB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Humburg, P., Bulger, D. & Stone, G. Parameter estimation for robust HMM analysis of ChIPchip data. BMC Bioinformatics 9, 343 (2008). https://doi.org/10.1186/147121059343
DOI: https://doi.org/10.1186/147121059343
Keywords
 Hidden Markov Model
 Length Distribution
 Tiling Array
 Observation Sequence
 Parameter Estimation Method