Model
The hidden Markov model (HMM) has been known for decades. An excellent and famous tutorial is Rabiner's 1989 paper [14], in which the model, algorithms, and applications were carefully and thoroughly reviewed. A conventional HMM implicitly assumes a geometric duration distribution for each state, which can be wrong in real applications. Modeling the explicit duration of each state can improve the prediction of HMMs (e.g., [14–17]). We model each chromosomal DNA sequence with a duration hidden Markov model (dHMM) of two oscillating states: nucleosome (N) and linker (L), where the nucleosome state has a fixed length of 147 bp, and the linker state has a variable length. We assume that at the end of each state, the chain must transit to the other state; additionally, a complete chromatin sequence must start with and end in a linker state. We trained a 4th order time-dependent Markov chain for the the N state, and a homogeneous 4th order Markov chain for the L state to distinguish the k-mer usage preferences for k up to 5 between the nucleosome and linker states as shown in other methods, e.g., [3, 6] (see below for details).
Let e = e1, ..., e147 be a nucleosome DNA sequence. Let P
N
be the probability of observing e as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. We assume that the linker DNA length of a given species has an unknown distribution F
L
(k) defined for k = 1, ..., τ
L
(the maximum linker length we allow). An observed linker DNA sequence e = e1, ..., e
k
carries two pieces of information, the length is k bp, and given which, the emitted letters are e1, ..., e
k
. Let G
L
(e|k) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing e as a linker DNA has probability
Suppose x = x1, ..., x
n
is a genomic DNA sequence of length n, where x
i
= A/C/G/T. Let z = z1, ..., z
n
be the corresponding hidden state path, where z
i
= 1 if x
i
is covered by a nucleosome state, and 0 otherwise. Suppose that the path z = z1, ..., z
n
partitions x into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. We denote these blocks as y = y+, ..., y
B
, and their state identification as s = s1, ..., s
B
, where s
i
= 1 if y
i
is nucleosome state, and 0 otherwise. The probability of observing (x, z) is given by
where π0(s1) and πe(s
B
) stand for the probabilities that the chain initializes and ends with state s1 and s
k
respectively, and I is an indicator function. Since we assume that a complete chromatin sequence must start with and end in a linker state, π0(s1 = 0) = π
e
(s
B
= 0) = 1. We define the nucleosome occupancy at a specific position i, denoted o
i
, as the posterior probability that z
i
= 1, i.e.,
We also define the histone binding affinity score at position i as the log likelihood ratio for the region x
i
-73, ... x
i
, ..., x
i
+73 to be a nucleosome vs. a linker, i.e.,
Given the models P
N
, G
L
and F
L
, the optimal path z can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms.
Data and model training
We utilized the 503,264 yeast nucleosome DNA reads from 454 pyrosequencing published in [6] for model training and assessment. Among 371,914 reads that each were mapped to a unique region of the yeast genome, we first selected reads of length between 146 and 149 bp. If multiple such reads existed for the same nucleosome, we selected the one with the highest BLAST score. The resulting non-redundant set of 18,547 nucleosome sequences were center aligned to train the nucleosome model P
N
. The 4th order time dependent Markov chain can be defined by the base composition at the first position qN(x1), and the transitional probabilities qN(x2|x1), qN(x3|x1, x2), qN(x4|x1, x2, x3), qN(x
k
|xk-4, x
k
-3, xk-2, xk-1), for k = 5, ..., 147, x
i
= A/C/G/T, i = 1, ..., 147, where the subscript k, i index the positions within a nucleosome. These probabilities are trained using the corresponding observed fractions or conditional fractions based on the center alignment, with a three bp moving average (as explained in [1, 18]). We further identified 8,090 reads-free regions of length 7-500 bp to train the linker state model G
L
. The 4th order homogeneous Markov model for the linker DNAs can be completely defined by the stationary base composition qL(xi), and the transition probabilities qL(x
i
|xi-1), qL(x
i
|xi-1, xi-2), qL(x
i
|xi-1, xi-2, xi-3), qL(x
i
|xi-1, xi-2, xi-3, xi-4). By "homogeneous", we mean that these probabilities are all constants as functions of i. These probabilities were trained using their observed values as in the nucleosome model. For example, qL(x
i
|xi-1, xi-2, xi-3, xi-4) was trained by calculating the fraction of occurrences of transitions from any four letters to the fifth letter in the selected putative linker DNA sequences.
Our initial nucleosome/linker model was trained using the yeast data. A complication arises when predicting nucleosomes for other species because organisms may differ significantly in their DNA base composition. We propose to scale up or down the probabilities in the Markov models by a factor determined by the difference of the base composition between the current species and yeast. For example, in C. elegans, the fraction of A plus T bases is 0.645 compared to 0.617 in yeast. For a specific transition probability qN(A|....) at any specific nucleosomal position defined for yeast, we scaled it up as qN(A|....) × 0.645/0.617 for the corresponding transition probability at that given position for C. elegans. Likewise the transition probabilities for G/C will be scaled down by a factor of 0.355/0.383.
All the re-scaled probabilities are then normalized. The same re-scaling applies to the linker model. We shall use simulations below to show that re-scaling improves prediction regarding sensitivity and false discovery rate. Using the trained nucleosome model (P
N
) and linker model (G
L
), we further train the linker DNA length distribution as follows. We assume that the linker DNAs in any given species or cell type have a maximum length τ
L
= 500 bp.
This algorithm contains the following steps:
-
1.
Initialize the algorithm with a uniform distribution for F
L
(k) for k = 1, ... τ
L
where τ
L
is the maximum allowable linker length.
-
2.
Use the forward and backward algorithm to obtain the posterior expectation of F
L
(k) for each k. For a sequence x = x1, ..., x
n
, let n
k
be the number of linker DNAs of length k. Then
for k = 1, ..., τ
L
.
for k = 1, ..., τ
L
.
-
3.
Update the empirical linker length distribution from step 2 using a kernel smoothing method as follows:
where K is the standard Gaussian kernel and h is the bandwidth parameter optimally chosen by the leave-one cross-validation method as in [19].
-
4.
Use the updated linker length distribution from step 3 to compute the nucleosome occupancy and optimal positioning.
Compared to Viterbi training (i.e., using linker length predicted from the Viterbi algorithm), using the posterior expectation obtained in Eq. (1) combined with the kernel method in Eq. (2) performs overwhelmingly better in minimizing the summed square errors
(unpublished work [17]). In the developed software tool NuPoP, we have trained the linker DNA length distributions for 11 different species including human, mouse, rat, zebrafish, D. melanogaster, C. elegans, S. cerevisiae, C. albicans, S. pombe; A. thaliana and maize. The linker DNA length distribution (F
L
) for each species has been trained by scanning the corresponding genome sequences based on τ
L
= 500. We found that the re-scaled nucleosome and linker profiles, together with the trained linker length distribution, not only roughly recover the genome-wide base compositions, but also the dinucleotide frequencies for different species. The frequency of each single or di-nucleotide in simulated genomes typically differs by ≤ 1% from that observed in the corresponding real genomes (results not shown). As different cell types from the same organism (with the same genome) can exhibit quite different linker DNA length distributions [13], a useful future refinement would utilize high quality nucleosome maps for the given cell type, when such data become available.
Software tools
We have developed a software tool called NuPoP, implemented in three different formats: an R package tested for Windows XP, Linux and Mac OS X; a stand-alone Fortran program; and an NuPoP web server, all available from http://nucleosome.stats.northwestern.edu. The R package is built upon the Fortran program. It provides additional handy functions to visualize the resulting Viterbi (most probable nucleosome position map) and nucleosome occupancy predictions. Both the R package and Fortran program can handle a genomic sequence of any length with a RAM demand <400 M bytes. The predicted results are stored locally in the working directory. The web server provides an interface through which the user can submit their own sequence up to 500 K bp in length for fast online prediction. When making a prediction, the user is required to specify which species the genomic sequence is from. If the species is not on the list, NuPoP will calculate the base composition of the input DNA sequence and then choose the nucleosome and linker models from a species that has the most similar base composition. An alternative model with a 1st order time-dependent Markov chain for the nucleosome state and a homogeneous 1st order Markov model for the linker state, trained in the same way, is also implemented in NuPoP as an option.