Model
Since a nucleosome wraps 147 base pairs of DNA, the space of possible sequences contains $4^{147}$, or about $10^{88}$, possibilities. It is impossible to enumerate all of these, so a simple functional form is needed for the probability distribution.
Segal et al. do this by treating a DNA sequence as a Markov chain of order 1, in which the probability of a nucleotide at a given position depends only on the preceding nucleotide. The probability of the sequence as a whole is then the product of the probabilities of its constituent nucleotides. More precisely, defining $S$ as a sequence of length 147, consisting of nucleotides $S_i$ with $i$ from 1 to 147,
$$\begin{array}{*{20}l} P(S) &= P\left(\bigcap_{i=1}^{147} S_{i}\right) = P\left(S_{147}|\bigcap_{i=1}^{146} S_{i}\right) P\left(\bigcap_{i=1}^{146} S_{i}\right) \end{array} $$
(1)
$$\begin{array}{*{20}l} &= \prod_{n=1}^{147} P\left(S_{n}|\bigcap_{i=1}^{n-1} S_{i}\right), \end{array} $$
(2)
where we have applied the chain rule of probabilities. If we now introduce the assumption we mentioned earlier, that the probability of a nucleotide depends only on the preceding nucleotide, we find the expression given by Segal et al., i.e.
$$\begin{array}{*{20}l} P(S) &= P(S_{1}) \prod_{n=2}^{147} P(S_{n}|S_{n-1}). \end{array} $$
(3)
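As an illustration of Eq. 3, the following minimal Python sketch (our own, not from the original work) evaluates the log-probability of a single 147-bp sequence; the arrays P1 and Pcond are hypothetical stand-ins for the position-dependent distributions $P(S_1)$ and $P(S_n|S_{n-1})$, filled with random numbers purely so the sketch runs.

```python
import numpy as np

# Hypothetical position-dependent distributions: P1[a] stands in for
# P(S_1 = a) and Pcond[n, a, b] for P(S_n = a | S_{n-1} = b),
# with A, C, G, T mapped to 0..3.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
L = 147

rng = np.random.default_rng(0)
P1 = rng.random(4)
P1 /= P1.sum()
Pcond = rng.random((L, 4, 4))
Pcond /= Pcond.sum(axis=1, keepdims=True)  # normalize over S_n for each S_{n-1}

def log_prob(seq):
    """log P(S) under the order-1 Markov model of Eq. 3."""
    s = [BASES[c] for c in seq]
    logp = np.log(P1[s[0]])
    for n in range(1, L):
        logp += np.log(Pcond[n, s[n], s[n - 1]])
    return logp

print(log_prob("ACGT" * 36 + "ACG"))  # any 147-bp sequence
```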
We should stress that the value of quantities like $P(S_n)$ depends not just on the value of $S_n$ (i.e. which nucleotide is represented) but also on the position $n$ along the nucleosome. These probability distributions (over dinucleotides, in the case of Segal et al.) can be obtained by analyzing a suitable ensemble of sequences with high affinity for the nucleosome. Segal et al. generate such an ensemble from the genome they wish to make predictions for, by mapping actual (in vitro) nucleosome positions along the DNA. Although the original model did not perform very well [12], it has been applied with success, after a refinement of the model and with a better training data set, to predicting nucleosome positions by Field et al. [13] and Kaplan et al. [14].
These experimental probability distributions capture more than just the intrinsic mechanical preferences of the DNA. They also capture inherent biases in the sample (a genomic sequence necessarily contains only a small subset of all $10^{88}$ possible sequences of length 147) and biases of the experimental method. This makes it difficult to evaluate the accuracy of the model: both the training of the model and its testing generally rely on the same experimental methods, so there is a risk that the agreement between model and reality is overestimated because the model correctly fits experimental artifacts. It therefore becomes of interest to study the model in a theoretical framework, where we can isolate the purely mechanical effects.
Ensembles to inform this type of bioinformatics model can also be generated from a theoretical nucleosome model using the Mutation Monte Carlo (MMC) method [5]. This method adds mutation moves to a standard Monte Carlo simulation of a nucleosome, thereby sampling the Boltzmann probability distribution of pairs of sequences and spatial configurations (S,θ),
$$ P(S,\theta) = e^{-\beta E(S,\theta)}. $$
(4)
By sampling the sequences during the MMC simulation, the spatial degrees of freedom of the nucleosome model are marginalized out, and one obtains the probability distribution of the sequences
$$ P(S) = \int d\theta e^{-\beta E(S,\theta)} $$
(5)
and their free energy
$$ F(S) = -kT \log(P(S)). $$
(6)
Note that in Eqs. 4–6 we have neglected the overall normalization of the probability distributions by the partition function $Z$, and hence a constant offset $-kT\log(Z)$ to the free energy. Because the probabilities we derive are simply relative frequencies with respect to our sequence ensemble, they are inherently normalized (i.e. summing them over all possible sequences gives unity) and we have no information on the partition function. This is not usually an impediment, as we are mostly interested in relative energy differences.
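As a concrete (toy) illustration, the sketch below converts relative sequence frequencies into free energies via Eq. 6; the four-sequence ensemble is a hypothetical stand-in for MMC output, and $kT$ is set to 1.

```python
import math
from collections import Counter

# Toy stand-in for MMC output; kT = 1 so energies are in units of kT.
ensemble = ["ACGT", "ACGT", "ACGA", "TTTT"]
counts = Counter(ensemble)
N = len(ensemble)

kT = 1.0
F = {s: -kT * math.log(c / N) for s, c in counts.items()}

# Only differences F[s1] - F[s2] are meaningful: the frequencies are
# normalized over the ensemble, so the -kT*log(Z) offset is unknown.
for s, f in sorted(F.items()):
    print(s, round(f, 3))
```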
Sampling the entire sequence space is not feasible, but making the same assumption about long-range correlations in the sequence preferences as Segal et al., we may write our $P(S)$ as in Eq. 3. It turns out to be feasible to produce a sequence ensemble large enough that the distributions $P(S_i|S_{i-1})$ can be determined.
Generalization of the Dinucleotide Model
We used an MMC simulation of the model put forward by Eslami-Mossallam et al., at 1/6 of room temperature, to generate an ensemble of $10^7$ sequences, from which the oligonucleotide distributions were derived (see Additional files 1, 2, and 3). At each position, we counted the number of instances of every mono-, di- and trinucleotide and divided these counts by the total number of sequences to obtain probability distributions.
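A minimal sketch of this counting procedure, assuming the ensemble is available as a list of 147-bp strings, is given below; the function name joint_distributions and the small pseudocount (which prevents zero probabilities for k-mers absent from a finite ensemble) are our own illustrative choices, not part of the original pipeline.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
L = 147

def joint_distributions(ensemble, k, pseudo=1e-3):
    """Joint k-mer probabilities P(S_n, ..., S_{n+k-1}) for every start
    position n. A small pseudocount guards against log(0) for k-mers
    absent from a finite ensemble."""
    counts = np.full((L - k + 1,) + (4,) * k, pseudo)
    for seq in ensemble:
        idx = [BASES[c] for c in seq]
        for n in range(L - k + 1):
            counts[(n,) + tuple(idx[n:n + k])] += 1
    # normalize per position so each slice sums to one
    return counts / counts.sum(axis=tuple(range(1, k + 1)), keepdims=True)

# e.g. P1, P2, P3 = (joint_distributions(ensemble, k) for k in (1, 2, 3))
```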
This gives us the joint probability distribution $P(S_n \cap S_{n-1})$ and not the conditional probability $P(S_n|S_{n-1})$ that we need for Eq. 3. This is easily remedied. We can rewrite Eq. 3 as
$$ P(S) = P(S_{1}) \prod_{n=2}^{147} \frac{P(S_{n} \cap S_{n-1})}{P(S_{n-1})} = \frac{\prod_{n=2}^{147} P(S_{n} \cap S_{n-1})}{\prod_{n=2}^{146}P(S_{n})}. $$
(7)
We see that we can write this equation in terms of the probability distributions of mono- and dinucleotides, which we can obtain from a sequence ensemble. Analogously, if we want to extend the model to trinucleotides, we introduce the assumption that the probability of a nucleotide depends only on the previous two (creating a Markov chain of order 2) and we find
$$ P(S) = \frac{\prod_{n=3}^{147} P(S_{n} \cap S_{n-1} \cap S_{n-2})}{\prod_{n=3}^{146} P(S_{n} \cap S_{n-1})}. $$
(8)
This model can thus be applied using probability distributions for di- and trinucleotides, both to be obtained from a suitable sequence ensemble. The result easily generalizes to tetranucleotides and beyond. For mononucleotides, the model simplifies to
$$ P(S) = \prod_{i=1}^{147} P(S_{i}). $$
(9)
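To make the construction concrete, the sketch below (our illustration) evaluates Eqs. 7–9 for general k, given joint distributions of the form produced by the counting sketch above; k = 1, 2, 3 correspond to the mono-, di- and trinucleotide models, and BASES is the same nucleotide-to-index map as before.

```python
import numpy as np

def log_prob_kmer(seq, Pk, Pkm1, k):
    """log P(S) from Eqs. 7-9: the sum of log joint k-mer probabilities
    minus the log joint (k-1)-mer probabilities of the interior
    (k-1)-mers; for k = 1 the denominator is absent (Eq. 9)."""
    idx = [BASES[c] for c in seq]
    n_pos = len(idx) - k + 1
    logp = 0.0
    for n in range(n_pos):            # numerator of Eqs. 7/8
        logp += np.log(Pk[(n,) + tuple(idx[n:n + k])])
    if k > 1:
        for n in range(1, n_pos):     # denominator of Eqs. 7/8
            logp -= np.log(Pkm1[(n,) + tuple(idx[n:n + k - 1])])
    return logp
```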
Analysis
Segal et al. test their model by predicting nucleosome positions along the genome they are studying and comparing these predictions with experiment; they find that their model has some predictive power, even on genomes on which it was not trained. However, their study is inevitably hampered by limited statistics and by their use of natural materials. The latter makes it difficult to judge the quality of their model.
The in silico methods allow us to test the model, as an approximation to the full underlying model, much more rigorously. Because we can explicitly calculate the energy of a given sequence, we can directly measure the correlation between the energy given by the theoretical nucleosome model and the probability calculated by the bioinformatics model. Using a standard Monte Carlo simulation of the nucleosome with a given sequence, we can measure the average energy
$$ \left<E\right>_{S} = \frac{\int d\theta\, E(S,\theta)\, e^{-\beta E(S,\theta)}}{\int d\theta\, e^{-\beta E(S,\theta)}} $$
(10)
of the sequence. Unfortunately, calculating the free energy within the Eslami-Mossallam nucleosome model is not straightforward, so we will compare $\left<E\right>_S$ as predicted by the biophysical model with $F(S)$ as predicted by the approximate model. At finite temperature these quantities are not the same, differing by an entropic contribution, but at low enough temperatures they converge. We compare the predictions at 1/6 of room temperature, as some finite temperature is needed for the statistical simulations to function. This comparison thus provides an upper limit for the discrepancy between the approximation and the real $\left<E\right>_S$.
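As an illustration of such a measurement, the sketch below runs a bare-bones Metropolis Monte Carlo estimate of $\left<E\right>_S$ for a fixed sequence. The energy function, the number of spatial degrees of freedom, and all numerical parameters are toy placeholders; the actual model of Eslami-Mossallam et al. is far more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 6.0  # inverse temperature in reduced units, e.g. 1/6 of room T

def energy(theta):
    """Toy placeholder for E(S, theta) at fixed sequence S."""
    return 0.5 * np.sum(theta**2)

theta = np.zeros(10)  # placeholder spatial degrees of freedom
samples = []
for step in range(100_000):
    prop = theta + rng.normal(scale=0.1, size=theta.shape)
    # Metropolis acceptance with Boltzmann weights
    if rng.random() < np.exp(-beta * (energy(prop) - energy(theta))):
        theta = prop
    if step >= 10_000:  # discard equilibration
        samples.append(energy(theta))

print("<E>_S estimate:", np.mean(samples))
```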
In order to generate an energy landscape with which to compare the results of the probability-based models, we take the first chromosome of S. cerevisiae ($\sim 2\times 10^5$ base pairs) and perform a Monte Carlo simulation of the nucleosome wrapped with each 147-base-pair subsequence of the chromosome, using the nucleosome model put forth by Eslami-Mossallam et al. After letting the simulation equilibrate, we sample the energy of the system and take the average. To compare this energy landscape with a probability landscape, we calculate the (Boltzmann) probability distribution, normalize it over the set of sequences for which we calculated the energy, and take the logarithm to regain our (shifted) energy landscape.
Analogously, we use the probability-based model to generate a probability landscape of the same sequence. This we normalize over the set of sequences analyzed and convert to an energy using Eq. 6. We find that this procedure is about five orders of magnitude faster than using the full biophysical model.
We only know the free energy up to some constant offset, but by making sure both the real energy landscape given by the energetic model and the approximate energy landscape provided by our probability-based model have the same normalization, we can readily compare the two.
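To make the normalization explicit, the following sketch (our illustration, not the original analysis code) shifts a set of log-probabilities so that the corresponding Boltzmann weights sum to one over the analyzed windows; applying the same shift to both landscapes makes them directly comparable.

```python
import numpy as np

def model_energy_landscape(logP, beta=1.0):
    """Shift log-probabilities so the Boltzmann weights sum to one over
    the analyzed set, then return the corresponding (shifted) energies,
    as in Eq. 6."""
    logP = np.asarray(logP)
    logZ = logP.max() + np.log(np.exp(logP - logP.max()).sum())
    return -(logP - logZ) / beta

# The same shift applied to both landscapes makes them comparable:
# F_model = model_energy_landscape(logP_model, beta)
# E_ref   = model_energy_landscape(-beta * E_avg, beta)  # from the MC energies
# print(np.corrcoef(F_model, E_ref)[0, 1])
```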
In doing so, we may draw some conclusions about this kind of Markov-chain model not only as it relates to the nucleosome model we consider here, but also about the assumptions that go into such models in general, i.e. the explicit assumption of short-range correlations and the implicit assumption that the sequence ensemble on which the model is trained is large enough. To test the first assumption, we extend the dinucleotide model used by Segal et al. to mononucleotides (which assumes no correlations at all) and to trinucleotides (which relaxes the assumption of short-range correlations) and compare their accuracy. For the second, we examine the accuracy of these three models as a function of the size of the ensemble on which they are trained, as sketched below.
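Such an ensemble-size sweep might look as follows, reusing the hypothetical helpers from the earlier sketches (joint_distributions, log_prob_kmer, model_energy_landscape) on toy data; in the actual study the ensemble sizes would range up to $10^7$ and the reference landscape would come from the Monte Carlo simulations described above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy random ensemble and reference landscape, purely so the sketch runs.
ensemble = ["".join(rng.choice(list("ACGT"), size=147)) for _ in range(1000)]
windows = ensemble[:50]                 # stand-in for chromosome windows
E_ref = rng.normal(size=len(windows))   # stand-in for the MC landscape

for size in (100, 300, 1000):           # up to 1e7 in the actual study
    sub = ensemble[:size]
    for k in (1, 2, 3):                 # mono-, di-, trinucleotide models
        Pk = joint_distributions(sub, k)
        Pkm1 = joint_distributions(sub, k - 1) if k > 1 else None
        logP = [log_prob_kmer(w, Pk, Pkm1, k) for w in windows]
        r = np.corrcoef(model_energy_landscape(logP), E_ref)[0, 1]
        print(size, k, round(r, 3))
```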