### A model of pulldown data from SssI-treated DNA

An MBD enrichment-capture experiment produces a pool of DNA fragments enriched for DNA methylation. Those fragments are sequenced and the reads are aligned to a reference genome to form the “pulldown” data set. In order to generate a control sample that approximates the pulldown of a genomic window at full methylation, the experiment can be done on DNA treated with M.SssI to methylate all cytosines in the CpG dinucleotide context. In this study, we formulate a model of the expected pulldown data from such an SssI Control sample to use in place of an experimental SssI Control pulldown data set. Let *Λ*_{x} represent the average number of pulldown reads that align to the genome starting at location *x*. For our purposes, we assume that a fragment that aligns to location *x*, on the forward or reverse strand, with a length *ℓ* possesses the corresponding sequence that is encoded in the genome. From our previous results [13], the parameters that most determine the probability that such a fragment starting at *x* would be pulled down are the number and spacing of the CpGs on the DNA fragment. We summarize the number and spacing of the CpGs by our term “accessible CpGs”. This refers to the number of CpGs on the fragment that can be simultaneously bound by MDB2 protein domains after taking into account that MBD2 domains are sterically excluded from binding if the CpGs are too close to each other along the DNA molecule. Thus, for *Λ*_{x}, we consider every fragment that could align starting at *x* and sum over the expected amount of pulldown of that fragment, weighted by the probability that a fragment of that length would appear in the sample to begin with:

$$ \Lambda_{x} \propto \sum\limits_{\ell = \ell_{\min}}^{\ell_{\max}} P(\ell) [C_{n(x \to x+\ell-1)} + C_{n(x-\ell+1 \to x)}], $$

(1)

where the sum is over the range of possible pulldown fragment lengths *ℓ*. In Eq. (1), *P*(*ℓ*) represents the probability that a fragment sequenced from the pulldown data set is of length *ℓ*,*n*(*a*→*b*) the number of accessible CpGs that would be on a fragment that starts at genomic location *a* and ends at *b* (inclusive), and *C*_{n} the scale factor for the representation of sequenced fragments that have *n* accessible mCpGs. The two terms correspond to alignments on the forward strand and reverse strand that could both align to location *x*. These *C*_{n} factors scale the probability of observing a fragment with *n* accessible mCpGs to that of observing a fragment with 0 mCpGs. The other normalization to consider is that which scales *Λ*_{x} to the sequencing depth of the modeled experiment. We choose a pre-factor of 1 and later show that the overall results are not affected by this choice over a large range of values.

To evaluate Eq. (1) for each site in a reference genome, parameters *n*(*a*→*b*),*C*_{n}, and *P*(*ℓ*) were derived as detailed next from the experiments performed in [13]. These experiments resulted in two data sets. For both data sets the DNA was first treated with SssI and then fragmented. For the Input Control (*I*) data set, the DNA fragments were simply sequenced. The SssI Control (*C*) data set, on the other hand, represents the library of DNA fragments that were submitted to the enrichment-capture experiment, pulled down, and then sequenced. Comparing SssI Control and Input Control data sets yields the parameters of our model as follows:

#### Number of accessible CpGs

In [13], we deduced that the physical size of the binding protein used for the pulldown experiment (in this case MBD2) can limit the accessibility of an mCpG to a binding protein if another nearby mCpG has already bound a protein. Specifically, we found that pulldown efficiency was suppressed for fragments with two mCpGs separated by 2 bp or less, relative to those with separations ≥3 bp. We want the number of “accessible CpGs” to refer to the largest number of CpGs that can be simultaneously bound by protein. To approximate this, we calculate *n*(*x*→*x*+*ℓ*−1) for the DNA sequence represented by [*x,x*+*ℓ*−1], by finding the largest subset of CpGs on the sequence such that each pair of CpGs in the subset are separated by at least 3 bp. This is equivalent to allowing an MBD protein to be bound at the first CpG, where the location of the CpG is identified with the position of the Cytosine and labeled *c*_{1}. Then, going through the rest of the downstream CpGs, the number of bound CpGs is only increased, and the location of the last bound CpG is updated from *c*_{i} to *c*_{i+j}, if (*c*_{i}+1)+3<*c*_{i+j}.

#### Coefficients of relative enrichment

In [13] we also, as a corollary, introduced the pulldown efficiency, *E*(*n* CpGs), for DNA fragments with *n* accessible mCpGs as the ratio between the fraction of the SssI Control data with *n* CpGs and the fraction of the Input Control data with *n* CpGs. Then *C*_{n}=*E*(*n*)/*E*(0) represents how much more likely a fragment with *n* mCpGs will be sequenced in the SssI Control data set than a fragment with 0 mCpGs. Hence, as we sum over the fragments that could align to location *x*, this factor accounts for relatively how often we should expect to see a fragment with that many mCpGs. See “Modeled pulldown from SssI-treated DNA” subsection of the Methods for the specific values calculated.

#### Length distribution of DNA fragments

The length distribution of DNA fragments that are aligned to the reference genome can be approximated by Bioanalyzer analysis on the pulldown library after fragmentation. To get a more precise description of the fragment length distribution, we compared the distribution of alignments from the SssI Control to the distribution in the Input Control. Let the fragment length probability distribution be approximated by a Gaussian, *P*(*ℓ*)∼*N*(*L,S*), where *L* is the average fragment length and *S* is the standard deviation. To determine its parameters *L* and *S*, we take all reads that have been aligned to the reference genome and then extend them to a segment of length *ℓ*_{max}=250 (larger than the expected actual fragment length) and then consider only those segments where the genomic sequence contains only a single CpG. To avoid edge effects we in addition require that the C of this CpG must be located at, or downstream of, the 11th nucleotide from the 5’ end of the fragment. Let then *p*(1 CpG,*t* nt) be the fraction of segments observed in the SssI Control data set with one CpG, which starts at the *t*th nucleotide (with respect to the 5’-end). Similarly, we define *q*(1 CpG,*t* nt) for the Input Control data set. Then the predicted pulldown efficiency for reads that are sequenced and align to a position with one CpG located at the *t*th nucleotide downstream with respect to that position is:

$$\begin{array}{@{}rcl@{}} \frac{p(1 \text{ CpG}, t\ \text{nt})}{q(1\ \text{CpG}, t\ \text{nt})} &=& \sum\limits_{\ell=\ell_{\min}}^{t} R_{0} P(\ell) + \sum\limits_{\ell=t+1}^{\ell_{\max}} R_{1} P(\ell) \\ &=& R_{0} + (R_{1} - R_{0}) P(\ell \ge t+1), \end{array} $$

(2)

where *R*_{0} and *R*_{1} represent the pulldown efficiency for fragments with 0 and 1 CpGs, as a ratio of fractions of the pool of reads with only 1 CpG between 11 bp and 250 bp from the 5^{′}-end. This expression captures how, as we look at fragments with a single CpG further and further downstream of the 5^{′}-end, we will find the length past which the CpG is not likely to actually be on the fragment and not contribute to that fragment’s pulldown probability, and therefore it marks the typical length of the sequenced fragments. For a Gaussian *P*(*ℓ*), we can approximate \(P(\ell \ge t) \approx \text { Erf}\left (\frac {t - L}{S \sqrt {2}}\right) - \text { Erf}\left (\frac {1 - L}{S \sqrt {2}}\right)\). We fit the experimental data to Eq. (2) using the Python SciPy function curve_fit to perform least-squares optimization, obtaining parameters (*R*_{0}=0.889,*R*_{1}=1.187,*L*=100.7,*S*=12.98) from an initial guess of (1.0,1.0,200,50), (Fig. 2). This completely characterizes the length distribution *P*(*ℓ*) and from this we set *ℓ*_{min}=3 and *ℓ*_{max}=200.

When comparing the fit to the data, we notice an increase over the expected pulldown efficiency at *ℓ*=10∼30. It is not clear what the source of this trend is, though there is another similar increase for *ℓ*>200. We found the latter to be an artifact of our maximum *ℓ* cutoff; when we shifted *ℓ*_{max} from 250 to 300, the increase at the largest *ℓ* shifted with it and the parameter fits for *L* and *S* were not significantly changed. We also note that the values for *R*_{0} and *R*_{1} do not match those for *E*(0 CpGs) and *E*(1 CpG) because the latter are normalized to the larger pool of all fragments with no CpGs contained in the first 10 bases versus the subset with just one CpG within 250 bp downstream of the alignment start.

### Fragment length and mCpG number are sufficient to model pulldown alignment to a genomic window

To begin thinking about using our model of *Λ*_{x} as a substitute for experimentally observed MBD pulldown alignments, we wish to see how well SssI Control window coverage correlates with modeled window coverage. We calculated expected SssI Control alignments, *Λ*_{x}, for every site in chromosome 7 and summed it over 100 bp non-overlapping windows to generate modeled SssI Control window coverage, \(y_{i\Lambda }= \left \lfloor \sum \nolimits _{x \in i} \Lambda _{x} \right \rfloor \). Using the same mappability cutoff as in [14], we compare this quantity to the observed SssI Control window coverage, *y*_{iC}, at every genomic window *i* that has at least 75% mappable bases. In Fig. 3a, there is a general increase in the SssI Control coverage as the modeled SssI Control coverage increases (Pearson correlation between *y*_{iΛ} and *y*_{iC} is 0.78). Wondering at the reason for the slight turnover in SssI Control coverage for modeled coverage values log10(*y*_{iΛ})>4.5, we suspect these regions are more likely to have high GC content and were therefore less likely to be sequenced in these experiments [24]. We can control for which windows are likely to be sequenced by dividing the SssI Control coverage by the Input Control Coverage, *y*_{iI}, to essentially obtain an unnormalized pulldown efficiency of reads that overlap the window. Comparing that to the modeled SssI Control coverage, we see a positive correlation between a window’s SssI Control coverage and pulldown efficiency at these larger values of *y*_{iΛ} (Fig. 3b).

### Model of pulldown alignment improves estimates of methylation

#### Methylation prediction

To predict methylation level from MBD pulldown and assess our model of predicted SssI Control coverage, we use the previously published program BayMeth [14]. BayMeth uses a Bayesian framework to quantify methylation fraction on genomic windows from data from pulldown experiments – done on both a Sample of interest (*S*) that has undergone MBD pulldown but not SssI treatment and an SssI Control (*C*) version of that sample. For each genomic window *i*, the probability of the observed read counts from the Sample of Interest (*y*_{iS}) and the SssI Control sample (*y*_{iC}) update the prior distribution for the methylation fraction *μ*. The read counts are assumed to be Poisson-distributed with average read density scaled by parameter *λ*_{i}, which represents the expected read density for a window of the same CpG density at full methylation. There are two main modes of BayMeth that we compare and modify in this study. The first uses pulldown read coverage from both the Sample of interest and the SssI Control sample (what we call BayMeth-SssI); this is the mode recommended by the authors of BayMeth. The second only uses pulldown data from the Sample of interest (BayMeth-noSssI), which is of use if a matching SssI Control sample is not available. The new implementation that we test here is to run BayMeth with expected read coverage, \(y_{i\Lambda } = \left \lfloor \sum \nolimits _{x \in i} \Lambda _{x} \right \rfloor \), calculated by our model developed above, as a proxy for the SssI-treated Control read coverage *y*_{iC}; we call this mode BayMeth-calcSssI, which can be used even in the absence of a real SssI-treated Control sample. While this formulation of *y*_{iΛ} only explicitly models reads that start or end in window *i* – in contrast to the window coverage described by *y*_{iC} – we will find that including modeled reads spanning, but not starting in, a window does not meaningfully improve predictions by BayMeth-calcSssI, even when the window width is much smaller than the average fragment length.

#### Methylation predictions on 100 bp windows

The methylation predictions generated by BayMeth-SssI, BayMeth-noSssI, and BayMeth-calcSssI are each assessed against the methylation predictions measured by RRBS, \(\mu _{i}^{\text {RRBS}}\), for a Sample of interest from [22]. We set a window’s methylation state to be methylated if its RRBS methylation is >0.50 and unmethylated if it is ≤0.50. Then ROC curves are generated and we ultimately evaluate each method’s performance in separating methylated from unmethylated windows through its corresponding AUC.

We first compare the performance of BayMeth-calcSssI on all 100 bp windows on hg18 that pass the minimum RRBS coverage and mappable base percentage (212,252 windows out of 14,506,245 total windows with at least one annotated CpG). In Fig. 4, the ROC curves for BayMeth-SssI, BayMeth-noSssI, and BayMeth-calcSssI are plotted, showing that BayMeth-calcSssI has the largest AUC (0.948) followed by BayMeth-SssI (0.936) and BayMeth-noSssI (0.925).

For each method’s most efficient ordering (the cutoff corresponding to the point on the ROC curve furthest from the *y*=*x* line), 94.5*%* of methylated windows are correctly categorized by BayMeth-calcSssI, a ∼ 5*%* improvement over BayMeth-noSssI, and ∼ 1*%* improvement over BayMeth-SssI. On the reverse, 14.3*%* of unmethylated windows are incorrectly categorized by BayMeth-calcSssI, a ∼ 5*%* improvement over BayMeth-noSssI, and ∼ 3*%* improvement over BayMeth-SssI.

To get a sense of what methylation states each method is better at predicting, smoothed density plots in Fig. 5 compare predicted methylation levels to their RRBS methylation. First, methylation level among 100 bp windows with RRBS coverage is highly bimodal, similar to the mean methylation levels observed at individual CpGs. There are 55% more unmethylated windows than methylated windows, but as both methylated and unmethylated states are well-represented in this RRBS sample, an indicator of methylation state has to achieve high accuracy in both regimes. About 81% of plotted windows have an RRBS methylation level that is ≤0.10 or ≥0.90. From Fig. 5a-c, we see that BayMeth-SssI and BayMeth-calcSssI give less precise predictions to windows with medium levels of methylation than those given by BayMeth-noSssI. For windows with RRBS methylation level ≤0.10 (Fig. 5d), more windows are predicted by BayMeth-noSssI to still have a methylation level ≤0.10 than by BayMeth-SssI, and BayMeth-noSssI also miscategorizes fewer windows overall (4.80*%* versus 5.50*%*, of windows with \(\mu _{i}^{\text {RRBS}} \leq 0.10\)). Among windows with an RRBS methylation level of at least 0.90 (Fig. 5e), more windows are predicted by BayMeth-noSssI than by BayMeth-SssI to have a methylation level ≥0.90. However, overall BayMeth-noSssI ends up miscategorizing more of these high-methylation windows than BayMeth-SssI because of how many more windows it predicts with a methylation level ≤0.50 (23.3*%* versus 15.9*%*, of windows with \(\mu _{i}^{\text {RRBS}} \geq 0.90\)). Interestingly, the methylation prediction profile of BayMeth-calcSssI is most similar to BayMeth-SssI in the high-methylation regime (and miscategorizes the lowest percentage of these windows at 15.5*%*) and most similar to BayMeth-noSssI in the low-methylation regime (and again miscategorizes the lowest percentage at 4.07*%*). Thus, BayMeth-calcSssI appears to capture the strengths of the other two BayMeth configurations and, on this scale, perform better than both of them.

#### Methylation predictions on 10 bp windows

Since one of the advantages of RRBS is single CpG resolution in methylation predictions, we explore the accuracy of methylation predictions informed by MBD pulldown on 10 bp windows. Of the 25,858,448 windows of width 10 bp on hg18 with at least one annotated CpG, 457,335 pass the minimum mappable base percentage (75%) and RRBS coverage (10). Of these windows, 63% contain only one CpG, 29% contain two CpGs, and the remaining 8% have three or more. Evaluating the three methylation prediction methods on these windows, the ROC curves for BayMeth-SssI, BayMeth-noSssI, and BayMeth-calcSssI are plotted in Fig. 6. Again, the AUC for BayMeth-calcSssI (0.955) is highest followed by BayMeth-SssI (0.936) and BayMeth-noSssI (0.930). These AUCs are even higher than they were for the 100 bp windows in the case of BayMeth-calcSssI and BayMeth-noSssI.

Smoothed density plots of the predicted methylations versus the RRBS methylation are shown in Fig. 7a-c. The percentage of windows with extremal RRBS methylation levels (≤0.10 or ≥0.90) increases to 88%. As with the methylation predictions on 100 bp windows, BayMeth-calcSssI and BayMeth-noSssI provide similar distributions of methylation predictions for windows in the low methylation regime, while BayMeth-calcSssI and BayMeth-SssI provide similar distributions of predictions for windows in the high methylation regime (Fig. 7d-e). Though, on this window size, BayMeth-calcSssI miscategorizes the most windows in the high methylation regime (20.6*%*), followed by BayMeth-SssI (16.1*%*), and BayMeth-noSssI (15.9*%*). However, in the low methylation regime, BayMeth-calcSssI performs best, categorizing 96.9*%* of windows correctly, followed by BayMeth-SssI (95.1*%*), and then BayMeth-noSssI (94.3*%*).

### Testing parameter robustness on chromosome 7

With these results, we must also make sure that the quality of BayMeth-calcSssI predictions is robust to variation of the input parameters involved in calculating the modeled SssI Control coverage, *y*_{iΛ}. We use chromosome 7 to test the sensitivity of the AUC measure since it was the chromosome reported on in [14]. On chromosome 7, 11,158 windows of width 100 bp meet the minimum mappable base percentage and RRBS coverage (which represents 1.4*%* of windows on chromosome 7 with at least one CpG). Running our three configurations of BayMeth with the previously stated parameters, we find that methylation predictions from the BayMeth-calcSssI method produced the largest AUC (0.946), followed by BayMeth-SssI (0.938) and BayMeth-noSssI (0.925). If we, instead, reduce the minimum separation between consecutive CpGs to 2 bp (from 3 bp) to potentially count additional CpGs as “accessible” (0.946); scale the overall depth normalization, i.e. the prefactor in Eq. (1), by 10 (0.946), or by 0.10 (0.946); or collapse the fragment length distribution to the determined average fragment length (0.945), i.e. *P*(*ℓ*)∼*δ*(*ℓ*−*ℓ*_{avg}); the AUCs calculated after each of these individual modifications all remain within 0.001 of the initial model. Additionally, we considered whether information is effectively lost by only summing the *Λ*_{x} terms for sites, *x*, in window *i*, and thus only modeling contributions from reads that start in window *i* (on either the forward or reverse strand). To instead model a sum over all reads that likely overlap window *i*, we tested using a fixed fragment length (*P*(*ℓ*)∼*δ*(*ℓ*−*ℓ*_{avg})) and, for the proxy SssI Control measure, summing over all sites and strands where a fragment of length *ℓ*_{avg} would overlap window *i*:

$$ {y_{i\Lambda} = \left \lfloor \left(\sum\limits_{\underset{\text{overlapping window}\ i}{[x, x + (\ell_{\text{avg}} - 1)]}} \Lambda_{x,+}\right) + \left(\sum\limits_{\underset{\text{overlapping window}\ i}{[x-(\ell_{\text{avg}} - 1), x]}} \Lambda_{x,-} \right)\right \rfloor,} $$

(3)

where we use *Λ*_{x,+} to refer to the expected pulldown to position *x* from the forward strand, and similarly for *Λ*_{x,−} for the reverse strand. This formulation also produces an AUC of 0.945. Additionally, this formulation and each of the other modifications to the BayMeth-calcSssI method applied to analysis of chromosome 7 on 10 bp windows also produced methylation prediction profiles that all had an AUC within 0.001 of the AUC produced by the original method.

### Model of pulldown alignment can be applied to methylation predictions on external experiments done with the same MBD protein

For our model of expected pulldown alignments, *Λ*_{x}, to be of general use in predicting methylation level from MBD pulldown experiments, we must show that it may be applied to pulldown experiments that it has not been trained on. To that end, we adapted our model to be used in place of the SssI Control data from the human fibroblast (IMR-90) sample analyzed in Riebler et al. [14]. We used the average fragment length of 300 bp given in that study to set the fragment length probability distribution to *P*(*ℓ*)=*δ*(*ℓ*−300 bp), since the width of that distribution was not described, and kept all other parameters the same. On chromosome 7, 426,366 windows of width 100 bp pass both the 75% minimum mappable base percentage and the minimum WGBS coverage, set in that study to 33. Applying the three configurations of methylation prediction on these windows, the ROC curves for BayMeth-SssI, BayMeth-calcSssI, and BayMeth-noSssI are plotted in the top leftmost panel of Fig. 8 (“All” windows). The BayMeth-SssI mode performs best with an AUC of 0.763, followed by BayMeth-calcSssI (0.715) and BayMeth-noSssI (0.681). Since the AUC produced by BayMeth-calcSssI on this set of windows is not as high as we have seen in the previous sections, it would be convenient to have a quantity that indicates for each window whether BayMeth-calcSssI is likely to provide a good methylation prediction. A suitable choice is the calculated SssI Control window coverage, *y*_{iΛ}, itself. We had developed our model for SssI Control pulldown alignments by considering what DNA fragment features significantly change the pulldown efficiency, and the calculated SssI Control coverage approximates how efficiently, relatively, an MBD pulldown experiment should be expected to sample reads from that window. In the Bayesian framework used in BayMeth, the windows that it samples the most information from should be predicted on more accurately. To test the usefulness of calculated SssI Control coverage as an indicator of BayMeth-calcSssI performance, we range through different minimum cutoffs on *y*_{iΛ} (in steps of 5 percentile) and calculate ROC curves on all windows above that threshold. We see in the main body of Fig. 8 that the AUCs for the three methods increase monotonically with the minimum window *y*_{iΛ}. For cutoffs above the 85th percentile in SssI Control coverage, the AUCs from all three methods are above 0.90.

Considering this framework for assessment, then, we see that the quality of methylation predictions produced by BayMeth-SssI and BayMeth-calcSssI follow each other closely and their AUC values stay within 0.05 of each other at all cutoffs. The AUCs for BayMeth-noSssI get within 0.05 only for a minimum modeled SssI Control coverage ≥75^{th} percentile, and at all cutoffs, BayMeth-calcSssI performs better than BayMeth-noSssI. Thus, in the absence of SssI Control data, our model improves methylation predictions. While additional comparisons to external data sets would be beneficial in showing our model’s broad applicability, MBD-seq data sets are not often generated with an SssI Control, which happens to be something our method attempts to rectify. More pressingly, genome-wide bisulfite sequencing data (either RRBS or WGBS) that can act as a truth data set against an MBD-seq data set is not usually available simultaneously for the same samples. Thus, to our knowledge there are no other MBD-seq/BS samples that use the same protein domain for enrichment, are of sufficient sequencing quality, and whose results would not be potentially confounded by copy number variation.