Predicting nucleosome positioning using a duration Hidden Markov Model
© Xi et al; licensee BioMed Central Ltd. 2010
Received: 12 April 2010
Accepted: 24 June 2010
Published: 24 June 2010
The nucleosome is the fundamental packing unit of DNAs in eukaryotic cells. Its detailed positioning on the genome is closely related to chromosome functions. Increasing evidence has shown that genomic DNA sequence itself is highly predictive of nucleosome positioning genome-wide. Therefore a fast software tool for predicting nucleosome positioning can help understanding how a genome's nucleosome organization may facilitate genome function.
We present a duration Hidden Markov model for nucleosome positioning prediction by explicitly modeling the linker DNA length. The nucleosome and linker models trained from yeast data are re-scaled when making predictions for other species to adjust for differences in base composition. A software tool named NuPoP is developed in three formats for free download.
Simulation studies show that modeling the linker length distribution and utilizing a base composition re-scaling method both improve the prediction of nucleosome positioning regarding sensitivity and false discovery rate. NuPoP provides a user-friendly software tool for predicting the nucleosome occupancy and the most probable nucleosome positioning map for genomic sequences of any size. When compared with two existing methods, NuPoP shows improved performance in sensitivity.
Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and strongly distort the wrapped DNA. Accumulating evidence shows that the DNA sequence itself is highly predictive of nucleosome positioning in vivo [1–7], and that nucleosome positioning is closely related to chromosome functions [1, 8–11]. A fast software tool for predicting nucleosome positioning is highly desirable.
Several statistical methods for nucleosome positioning prediction have been proposed in the literature. In  a method was proposed based on cross-correlation with a nucleosome DNA sequence signature. In  a Markov model was used together with consideration of steric exclusion and thermodynamic equilibrium. In , a support vector machine (SVM) was trained based on the differential k-mer usage between nucleosome and linker DNAs. In , the authors proposed an N-score model to discriminate nucleosome and linker DNAs using wavelet energies as covariates in a logistic regression model. In , a web-interface called "nuScore" for estimation of the affinity of histone core to DNA and prediction of nucleosome positioning was developed based on the DNA deformation energy score. In [6, 7], the model from  was improved by incorporation of differential k-mer usage (most notably, poly(dA:dT) tracts, which are strongly disfavored by nucleosomes). This model can be further improved by accounting for nucleosome-nucleosome interaction .
While the nucleosomal features are universal, eukaryotic genomes vary in nucleosomal repeat length  and base composition. The nucleosomal repeat length is dictated by the length distribution of linker DNAs that separate neighboring nucleosomes, and it determines the overall nucleosome density in the chromatin fiber. The contribution of this paper is a duration Hidden Markov Model and a software tool called NuPoP for genome-wide nucleosome positioning prediction. We show that incorporation of linker length information can achieve better sensitivity in prediction. In addition, we propose a re-scaling method to adjust for base composition variation when using yeast models to make predictions for other species. A relatively superior performance of this approach is established by comparing it with other existing tools.
The hidden Markov model (HMM) has been known for decades. An excellent and famous tutorial is Rabiner's 1989 paper , in which the model, algorithms, and applications were carefully and thoroughly reviewed. A conventional HMM implicitly assumes a geometric duration distribution for each state, which can be wrong in real applications. Modeling the explicit duration of each state can improve the prediction of HMMs (e.g., [14–17]). We model each chromosomal DNA sequence with a duration hidden Markov model (dHMM) of two oscillating states: nucleosome (N) and linker (L), where the nucleosome state has a fixed length of 147 bp, and the linker state has a variable length. We assume that at the end of each state, the chain must transit to the other state; additionally, a complete chromatin sequence must start with and end in a linker state. We trained a 4th order time-dependent Markov chain for the the N state, and a homogeneous 4th order Markov chain for the L state to distinguish the k-mer usage preferences for k up to 5 between the nucleosome and linker states as shown in other methods, e.g., [3, 6] (see below for details).
Given the models P N , G L and F L , the optimal path z can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms.
Data and model training
We utilized the 503,264 yeast nucleosome DNA reads from 454 pyrosequencing published in  for model training and assessment. Among 371,914 reads that each were mapped to a unique region of the yeast genome, we first selected reads of length between 146 and 149 bp. If multiple such reads existed for the same nucleosome, we selected the one with the highest BLAST score. The resulting non-redundant set of 18,547 nucleosome sequences were center aligned to train the nucleosome model P N . The 4th order time dependent Markov chain can be defined by the base composition at the first position qN(x1), and the transitional probabilities qN(x2|x1), qN(x3|x1, x2), qN(x4|x1, x2, x3), qN(x k |xk-4, x k -3, xk-2, xk-1), for k = 5, ..., 147, x i = A/C/G/T, i = 1, ..., 147, where the subscript k, i index the positions within a nucleosome. These probabilities are trained using the corresponding observed fractions or conditional fractions based on the center alignment, with a three bp moving average (as explained in [1, 18]). We further identified 8,090 reads-free regions of length 7-500 bp to train the linker state model G L . The 4th order homogeneous Markov model for the linker DNAs can be completely defined by the stationary base composition qL(xi), and the transition probabilities qL(x i |xi-1), qL(x i |xi-1, xi-2), qL(x i |xi-1, xi-2, xi-3), qL(x i |xi-1, xi-2, xi-3, xi-4). By "homogeneous", we mean that these probabilities are all constants as functions of i. These probabilities were trained using their observed values as in the nucleosome model. For example, qL(x i |xi-1, xi-2, xi-3, xi-4) was trained by calculating the fraction of occurrences of transitions from any four letters to the fifth letter in the selected putative linker DNA sequences.
Our initial nucleosome/linker model was trained using the yeast data. A complication arises when predicting nucleosomes for other species because organisms may differ significantly in their DNA base composition. We propose to scale up or down the probabilities in the Markov models by a factor determined by the difference of the base composition between the current species and yeast. For example, in C. elegans, the fraction of A plus T bases is 0.645 compared to 0.617 in yeast. For a specific transition probability qN(A|....) at any specific nucleosomal position defined for yeast, we scaled it up as qN(A|....) × 0.645/0.617 for the corresponding transition probability at that given position for C. elegans. Likewise the transition probabilities for G/C will be scaled down by a factor of 0.355/0.383.
All the re-scaled probabilities are then normalized. The same re-scaling applies to the linker model. We shall use simulations below to show that re-scaling improves prediction regarding sensitivity and false discovery rate. Using the trained nucleosome model (P N ) and linker model (G L ), we further train the linker DNA length distribution as follows. We assume that the linker DNAs in any given species or cell type have a maximum length τ L = 500 bp.
Initialize the algorithm with a uniform distribution for F L (k) for k = 1, ... τ L where τ L is the maximum allowable linker length.
Use the forward and backward algorithm to obtain the posterior expectation of F L (k) for each k. For a sequence x = x1, ..., x n , let n k be the number of linker DNAs of length k. Then
Use the updated linker length distribution from step 3 to compute the nucleosome occupancy and optimal positioning.
Compared to Viterbi training (i.e., using linker length predicted from the Viterbi algorithm), using the posterior expectation obtained in Eq. (1) combined with the kernel method in Eq. (2) performs overwhelmingly better in minimizing the summed square errors (unpublished work ). In the developed software tool NuPoP, we have trained the linker DNA length distributions for 11 different species including human, mouse, rat, zebrafish, D. melanogaster, C. elegans, S. cerevisiae, C. albicans, S. pombe; A. thaliana and maize. The linker DNA length distribution (F L ) for each species has been trained by scanning the corresponding genome sequences based on τ L = 500. We found that the re-scaled nucleosome and linker profiles, together with the trained linker length distribution, not only roughly recover the genome-wide base compositions, but also the dinucleotide frequencies for different species. The frequency of each single or di-nucleotide in simulated genomes typically differs by ≤ 1% from that observed in the corresponding real genomes (results not shown). As different cell types from the same organism (with the same genome) can exhibit quite different linker DNA length distributions , a useful future refinement would utilize high quality nucleosome maps for the given cell type, when such data become available.
We have developed a software tool called NuPoP, implemented in three different formats: an R package tested for Windows XP, Linux and Mac OS X; a stand-alone Fortran program; and an NuPoP web server, all available from http://nucleosome.stats.northwestern.edu. The R package is built upon the Fortran program. It provides additional handy functions to visualize the resulting Viterbi (most probable nucleosome position map) and nucleosome occupancy predictions. Both the R package and Fortran program can handle a genomic sequence of any length with a RAM demand <400 M bytes. The predicted results are stored locally in the working directory. The web server provides an interface through which the user can submit their own sequence up to 500 K bp in length for fast online prediction. When making a prediction, the user is required to specify which species the genomic sequence is from. If the species is not on the list, NuPoP will calculate the base composition of the input DNA sequence and then choose the nucleosome and linker models from a species that has the most similar base composition. An alternative model with a 1st order time-dependent Markov chain for the nucleosome state and a homogeneous 1st order Markov model for the linker state, trained in the same way, is also implemented in NuPoP as an option.
Updating F L improves prediction
Updating linker length improves prediction - Normal linker length model
Updating linker length improves prediction - Gamma linker length model
We also observe that under the same setting and condition, the fourth order model performs slightly but uniformly better than the first order model in both sensitivity and FDR. This is given that the true models are known and we scan the sequence using the true models. In theory, the first order Markov chain model is nested in the fourth order model. Therefore if the true model is the first order, a well trained fourth order model will have the same prediction power as the first order model, but not vice versa. Since training a higher order Markov chain model requires more data, inadequate training can undermine the prediction power.
Re-scaling vs. not Re-scaling
Re-scaling models improves prediction - Normal linker length model
Re-scaling models improves prediction - Gamma linker length mode l
NuPoP vs. other software tools
The sensitivity results suggest that the predictions from NuPoP outperforms the other two methods in two senses. Firstly, the sensitivity from NuPoP is 4.9-8.3% higher than the other two methods at different threshold values (Figure 3a). Secondly, while controlling the total predictions to be the same, NuPoP has ~ 3.2-5.3% higher sensitivity than the MM/TE method (Figure 3b), when the precision threshold is ≤ ±35. As the precision threshold gets less stringent, the difference attenuates and eventually vanishes. The contrast between NuPoP and the N-score method is even larger as shown in Figure 3c.
As a further comparison, we computed the predictions of the dHMM method under a uniform linker length distribution defined on 1, 2,..., 500. This method predicted 48,334 nucleosomes under the 4th order models, achieving a sensitivity 3.0-6.6% higher than the MM/TE method (Figure 3a). When controlling the total predictions to be the same as MM/TE method or N-score method, the resulting sensitivity curve almost perfectly overlaps with that from NuPoP. Therefore we omitted these results from Figure 3b and 3c.
One could further attempt to evaluate the false positive rate (FPR), measuring the fraction of linker regions that were falsely classified as nucleosome regions (or similarly the false discovery rate, FDR). This task requires well-defined linker regions. A problem, however, is that the average length of linker DNAs in yeast (20 bp; ) is smaller than the dispersion in lengths of the nucleosome DNAs as isolated biochemically (which is often 30-50 bp full width at half maximum, notwithstanding that the nucleosome as defined crystallographically has precisely 147 bp of DNA). Thus existing nucleosome maps lack the precision needed to define such short linker DNAs. Moreover, various sampling biases such as the DNA sequence preferences of the micrococcal nuclease used to liberate nucleosomes biochemically (which preferentially cleaves A/T rich regions) could yield longer genomic regions that are free of recovered nucleosome DNA reads even if they are actually nucleosome occupied . Attempts to evaluate the FPR given these problems in the data could result in misleading conclusions. For these reasons, FPR evaluation is not pursued in this paper.
The duration Hidden Markov model proposed in this paper is a generic model for the oscillating structure of nucleosome and linker DNAs in chromatin fiber. The Markov models can be replaced by any other models for the nucleosome and linker states. The kernel method for linker length training is nonparametric and typically robust. We showed in the simulation that updating the linker length distribution iteratively improves sensitivity and FDR in prediction if appropriate nucleosome and linker models are used. In particular, the first iteration often achieves the most pronounced improvement. In contrast inappropriate nucleosome and linker models could lead to the opposite outcome, as shown in the simulation studies (Table 3 and Table 4). In reality, the genomic DNAs are complicated by their biological functions. The models trained based on typical nucleosomes or linkers may not well fit some special genomic regions like repeated elements. To avoid possible risks due to such complications, we trained the linker length distribution less greedily by using only one iteration in NuPoP.
Limitations may still exist in the model training and assessment used in this study. The MNase is known to have strong preference to cleave dinucleotides containing only A/T . Consequently the MNase-mapped nucleosome sequences can be systematically biased in some regions. This bias could undermine the prediction power because of the dampened signal in the trained nucleosome model. The systematic bias may also exist in the well-defined nucleosomes, causing inaccuracy in sensitivity estimation. A better map of nucleosomes is highly desirable for both purposes. For species other than yeast, we currently lack high-quality genome-wide nucleosome sequence data (e.g., like the 454 reads) for model training and model validation. The advantages of the re-scaling method shown using simulation in this paper need to be further assessed once such high-quality data becomes available. Moreover, the results from different methods in this paper were all based on the default settings. The N-score method was originally trained based on a much smaller set of nucleosome and linker sequences. A better training using a larger set could improve this method's predictions. In addition, different settings in the N-score or MM/TE methods can lead to different predictions, which we did not further investigate here. Finally, the software for MM/TE method only provides the occupancy score. Different ways to call a predicted nucleosome based on the occupancy score might lead to different conclusions.
Finally, we address the question of which subset of the available 454 reads data might best be used for training the nucleosome model. In NuPoP, we trained the nucleosome model using selected non-redundant nucleosome reads of length within a short range (146-149 bp), to retain strong high resolution nucleosome sequence signatures, e.g., the _10 bp-periodic dinucleotide signals. As comparisons, we trained two additional nucleosome models: one using the selected non-redundant reads of length 122-177 bp (retaining the non-redundancy but yielding far more training data), and the other using all reads of length 122-177 bp. The resulting models both contain the k-mer usage information that distinguishes nucleosomes from linkers (e.g., [3, 4]), while the dinucleotide signals in these models are much weaker due to poor alignment of these reads. Furthermore, as the reads count at a nucleosome site is heavily biased by the G/C content due to MNase specificity and other effects in the experiment, the model trained from the redundant reads tends to be over-enriched in G/C. When combined with the linker model from NuPoP, the two alternative nucleosome models yielded comparable sensitivity as NuPoP in predicting the approximate positioning of nucleosomes, assessed based on the 20,471 well-defined nucleosomes used above. This comparison, however, is not sensitive to spatial precision of the predictions. Therefore, we asked further, given that a nucleosome is predicted within ±73 of a true nucleosome, which model predicts the location more accurately? To investigate this, we simulated genomic sequences using the nucleosome and linker models from NuPoP. We compared the prediction from the three models and found that the true model with strong signals achieves much better prediction accuracy than the two alternative models. For example, 16.1% of the predictions from the true model were prefect (with 0 bp offset), compared to 8.7% and 5.9% respectively from the other two models (results not shown).
The dHMM model proposed in this paper is effective in characterizing the oscillating structure of nucleosome and linker DNAs in chromatin fiber. Explicit modeling of linker length improves the prediction of nucleosome positioning regarding sensitivity. The developed software tool NuPoP provides a user-friendly interface for predicting nucleosome occupancy and the most probable nucleosomes positioning map genome-wide.
Availability and requirements
NuPoP software tools are freely available from http://nucleosome.stats.northwestern.edu. The R package shall be made available through bioconductor http://www.bioconductor.org upon publication. To run the NuPoP Fortran stand-alone program, a Fortran compiler is required. For the NuPoP R package, an R version later than 2.9 is required.
The research is supported by NIH grant R01GM075313 and NCI grant U54CA143869. We acknowledge with gratitude the gift of a parallel sequencing run from Roche/454 Life Sciences, and thank Ms. Jolene Osterberger (Roche/454 Life Sciences) and Dr. Nadereh Jafari (Northwestern University) for arranging this. The authors would also like to thank Drs. Eran Segal, Guocheng Yuan, Zhiping Weng and Yutao Fu for help in providing data and their codes.
- Segal E, Fondufe-Mittendorf Y, Chen L, Thåström A, Field Y, Moore I, Wang JPZ, Widom J: A genomic code for nucleosome positioning. Nature 2006, 442: 772–778. 10.1038/nature04979View ArticlePubMedPubMed CentralGoogle Scholar
- Ioshikhes I, Bolshoy A, Derenshteyn K, Borodovsky M, Trifonov EN: Nucleosome DNA Sequence Pattern Revealed by Multiple Alignment of Experimentally Mapped Sequences. J Mol Biol 1996, 262: 129–139. 10.1006/jmbi.1996.0503View ArticlePubMedGoogle Scholar
- Peckham H, Thurman R, Fu Y, Stamatoyannopoulos J, Noble W, Struhl K, Weng Z: Nucleosome positioning signals in genomic DNA. Genome Res 2007, 17(8):1170–7. 10.1101/gr.6101007View ArticlePubMedPubMed CentralGoogle Scholar
- Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion. PLoS Computational Biology 2008, 4: e13. 10.1371/journal.pcbi.0040013View ArticlePubMedPubMed CentralGoogle Scholar
- Tolstorukov M, Choudhary V, Olson W, Zhurkin V, Park P: nuScore: a web-interface for nucleosome positioning predictions. Bioinformatics 2008, 28: 1456–1458. 10.1093/bioinformatics/btn212View ArticleGoogle Scholar
- Field Y, Kaplan N, Fondufe-Mittendorf Y, Moore I, Sharon E, Lubling Y, Widom J, Segal E: Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput Biol 2008, 4(9):e1000175. 10.1371/journal.pcbi.1000175View ArticleGoogle Scholar
- Kaplan N, Moore I, Fondufe-Mittendorf Y, Gopssett A, Tillo D, Field Y, LeProust E, Hughes T, Lieb J, Widom J, Segal E: The DNA-Encoded Nucleosome Organization of a Eukaryotic Genome. Nature 2009, 458: 362–368. 10.1038/nature07667View ArticlePubMedPubMed CentralGoogle Scholar
- Schones D, Cui K, Cuddapah S, Roh T, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome. Cell 2008, 132(5):887–98. 10.1016/j.cell.2008.02.022View ArticlePubMedGoogle Scholar
- Lee C, Shibata Y, Rao B, Strahl B, Lieb J: Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics 2004, 36: 900–905. 10.1038/ng1400View ArticlePubMedGoogle Scholar
- Lee W, Tillo D, Bray N, Morse R, Davis R, Hughes T, Nislow C: A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics 2007, 39(10):1235–44. 10.1038/ng2117View ArticlePubMedGoogle Scholar
- Whitehouse I, Tsukiyama T: Antagonistic forces that position nucleosomes in vivo. Nat Struct Mol Biol 2006, 13: 633–640. 10.1038/nsmb1111View ArticlePubMedGoogle Scholar
- Lubliner S, Segal E: Modeling interactions between adjacent nucleosomes improves genome-wide predictions of nucleosome occupancy. Bioinformatics 2009, 25(12):i348. i355 i355 10.1093/bioinformatics/btp216View ArticlePubMedPubMed CentralGoogle Scholar
- van Holde KE: Chromatin. Springer-Verlag; 1989.View ArticleGoogle Scholar
- Rabiner LR: A tutorial on hidden markov models and selected applications in speech recognition. Proc IEEE 1989, 77: 257–285. 10.1109/5.18626View ArticleGoogle Scholar
- Rabiner L, Wilpon JG, Soong FK: High performance connected digit recognition using hidden Markov models. IEEE Transaction on Acoustics, Speech and Signal Processing 1990, 37(8):1214–1225. 10.1109/29.31269View ArticleGoogle Scholar
- Burshtein D: Robust parametric modeling of durations in hidden Markov models. IEEE Transactions on Speech and Audio Processing 1996, 4: 240–242. 10.1109/89.496221View ArticleGoogle Scholar
- Wang JP, Xi L: Duration estimation for Hidden Markov Models. unpublished 2010.Google Scholar
- Wang JPZ, Widom J: Improved alignment of nucleosome DNA sequences using a mixture model. Nucleic Acids Research 2005, 22(22):6743–55. 10.1093/nar/gki977View ArticleGoogle Scholar
- Wang JP: Estimating the species richness by a Poisson-compound Gamma model. To Appear in Biometrika 2010.Google Scholar
- Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S: A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 2008, 18(7):1051–63. 10.1101/gr.076463.108View ArticlePubMedPubMed CentralGoogle Scholar
- Wang JP, Fondufe-Mittendorf Y, Xi L, Tsai GF, Segal E, Widom J: Preferentially quantized linker DNA lengths in Saccharomyces cerevisiae . PLoS Computational Biology 2008, 4(9):e1000175. 10.1371/journal.pcbi.1000175View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.