An information-theoretic approach to the modeling and analysis of whole-genome bisulfite sequencing data

Background DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads. Results We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied on single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrate a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data. Conclusions This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods. Electronic supplementary material The online version of this article (10.1186/s12859-018-2086-5) contains supplementary material, which is available to authorized users.


PMF of methylation within a genomic region
We can approximately express the PMF of the methylation state within a genomic region R k in terms of a 1D Ising model. To see why this is true, let us assume that R k comprises R CpG sites 1, 2, . . . , R, which are labeled according to their order of appearance along the region. Note that the PMF of methylation within R k satisfies P X (x 1 , x 2 , . . . , x R ) = P X (x 2 , x 3 , . . . , where we assume that X 1 and X R are approximately statistically independent. This is a reasonable assumption considering the fact that, in most cases, the two CpG sites near the boundary of R k are separated by many intermediate CpG sites. We can now show from Eqs. (1)-(4) of the Main Paper that Therefore, if we set α k = 1 2 ln Pr[X 1 = 1] 1 − Pr[X 1 = 1] and α k = 1 2 ln and use (S1) and (S2), we approximately obtain where Z = u u u exp α k (2u 1 − 1) + R−1 r=2 (α k + β k ρ r )(2u r − 1) + α k (2u R − 1) Notably, this is an Ising model, albeit with a smaller number of parameters when R ≥ 4 (5 vs. 2R − 1) than the one without using Eqs. (3) and (4) of the Main Paper. Note also that parameters α k and α k account for boundary effects that occur when restricting the Ising model associated with the entire genome to the Ising model within R k .

Computing marginal PMFs
Fitting the Ising methylation model to available WGBS data requires evaluation of marginal PMFs within a genomic region R k of the form P X (x q:q+s ) = P X (x q , x q+1 , . . . , x q+s ), where q and s are such that 1 ≤ q ≤ q + s ≤ R. From (S6) and (S9), we can show that where and This result provides an alternative representation of the 1D Ising model in terms of an inhomogeneous Markov chain with initial probability z 1 (x 1 ) and transition probabilities z r+1 (x r+1 | x r ). From (S10), we have that which implies where Moreover, from (S11), (S12), and (S15), note that This implies that where Z(x q ) is computed by for x = 0, 1, r = 2, 3, . . . , q.
Finally, by combining (S11), (S12), (S14), and (S15), we obtain which provides a formula for calculating the marginal PMF P X (x q:q+s ) using the iterations in (S9) and (S18). Note that computation of the marginal PMFs P X (x q:q+s ) requires the iterations (S9) and (S18) to be performed only once as long as the intermediate values Z r (x), Z r (x), x = 0, 1, r = 1, . . . , R, are stored.

Identifiability
It is worth pointing out here that the PMF P X (x x x | θ θ θ k ) within a genomic region R k forms a five-parameter exponential family of distributions with natural sufficient statistic given by [see (S4) and (S5)] Since the CpG density ρ r depends in general on r (it changes from a CpG site to the next), {S 1 (X X X), S 2 (X X X), S 3 (X X X), S 4 (X X X), S 5 (X X X), 1} are linearly independent with positive probability, due to the fact that the Ising model assigns positive probabilities over the methylation state-space. In this case, the exponential family has rank 5 and the parameter vector θ θ θ k is identifiable according to Theorem 1.6.4 in [1].

Computing the PMF of methylation level
For GUs containing at most 18 CpG sites, we calculate the PMF P L ( ) of the methylation level L using the exact summation formula in Eq. (10) of the Main Paper. For GUs with more than 18 CpG sites, we estimate P L ( ) by combining Monte Carlo sampling with the maximum entropy principle [5,6]. To do so, we first draw M samples x x x (1) , x x x (2) , . . . , x x x (M ) from the Ising distribution P X (x x x) associated with the GU, which we then use to produce M samples 1 , 2 , . . . , M of the methylation level L by We then estimate the first Q non-central moments E[L q ], q = 1, 2, . . . , Q, of L by means of Finally, we approximate P L ( ) by a probability distribution over the values of L that maximizes the Shannon entropy − P L ( ) log 2 P L ( ), subject to the moment constraints q P L ( ) = µ q , for q = 1, 2, . . . , Q.
Note that we can rapidly draw exact samples from the Ising probability distribution P X (x x x) within a GU with K CpG sites by noting that P X (x x x) is given by Eqs. (S10)-(S12), with R = K. Equation (S10) implies that the methylation sequence {X 1 , X 2 , . . . , X K } is a first-order nonhomogeneous Markov chain with initial and transition probabilities given by (S11) and (S12), respectively. As a consequence, we can obtain an exact realization x x x by iteratively drawing samples x 1 , x 2 , . . . , x K from these conditional probability distributions.
In (S22), we use M = 2 17 = 131,072 Monte Carlo samples and set Q = 4 (i.e., we use Monte Carlo sampling to estimate the first four non-central moments of L). We investigated a number of Q values (Q = 3, 4, 5, 6) and determined that Q = 4 leads to robust Monte Carlo estimation of the probability distribution P L ( ) within GUs with at most 18 CpG sites for which we can exactly compute P L ( ) using the summation in Eq. (10) of the Main Paper. We also determined the value of M by noting that exact summation requires evaluation of at most 2 18 = 262,144 probabilities P X (x x x) when K ≤ 18, a computation that we found to be feasible on our computer system. However, when K > 18, this calculation becomes increasingly less efficient, and that is why we switch from exact summation to Monte Carlo estimation. On our computer system, each probability evaluation is done in half the time required for drawing a sample from P X (x x x). As a consequence, when K ≤ 18, exact summation is at least as computationally efficient as drawing 2 17 = 131,072 Monte Carlo samples, and this observation guides us to set M = 2 17 . Finally, note that there are 11,273,889 GUs with at least 1 CpG site and at most 18 CpG sites in the human genome, but only 30,024 GUs with the number of CpG sites ranging from 19 to 44 (Additional file 2: Table S1). As a consequence, we compute the PMF of the methylation level L by exact summation for the vast majority (99.73%) of the GUs with at least one CpG site. and employ these probabilities to classify the GU using the scheme depicted in Fig. S1. This scheme classifies a GU as being unmethylated if more than 60% of its copies in a cell population have methylation level below 50% (i.e., if p 1 + p 2 > 0.6). In particular, the GU is classified as being partially unmethylated if no more than 60% of these copies have methylation level below 25% (i.e., if p 1 ≤ 0.6); otherwise, it is classified as being highly unmethylated. Similarly, the GU is classified as being methylated if more than 60% of its copies in a cell population have methylation level above 50% (i.e., if p 1 + p 2 < 0.4, which implies that p 3 + p 4 > 0.6). In particular, the GU is classified as being partially methylated if no more than 60% of these copies have methylation level above 75% (i.e., if p 4 ≤ 0.6); otherwise, it is classified as being highly methylated.
If a GU is neither unmethylated or methylated (i.e., if 0.4 ≤ p 1 + p 2 ≤ 0.6), then it is classified as being mixed if no more than 40% of its copies with methylation level below 50% have methylation level of at most 25% [i.e., if p 1 /(p 1 + p 2 ) ≤ 0.4] and no more than 40% of its copies with methylation level above 50% have methylation level of at least 75% [i.e., if On the other hand, the GU is classified as being highly mixed if more than 40% but no more than 60% of its copies with methylation level below 50% have methylation level of at most 25% [i.e., if 0.4 < p 1 /(p 1 + p 2 ) < 0.6] and more than 40% but no more than 60% of its copies with methylation level above 50% have methylation level of at least 75% [i.e., if 0.4 < p 4 /(p 3 + p 4 ) ≤ 0.6]. Finally, the GU is classified as being bistable if at least 60% of its copies with methylation level below 50% have methylation level of at most 25% [i.e., if p 1 /(p 1 + p 2 ) ≥ 0.6] and at least 60% of its copies with methylation level above 50% have methylation level above 75% [i.e., if p 4 /(p 3 + p 4 ) ≥ 0.6].
As we discuss in the Main Paper, the mixed and highly mixed GUs are characterized by appreciable methylation variability, whereas, bistable GUs are characterized by the highest possible variance in methylation level. For this reason, we refer to a GU that is mixed, highly mixed, or bistable as being variably methylated.
To see why a bistable GU is characterized by the highest possible variance, let us consider for example a highly mixed GU with N CpG sites in which the distribution of methylation level is uniform, and a bistable GU with probability distribution whose mass is equally distributed at methylation levels 0 and 1/N . In both cases, the mean is given by 1/2. However, the highly mixed GU is characterized by a variance of (N + 2)/12N , whereas the bistable GU is characterized by a variance of 1/4 ≥ (N + 2)/12N .

Entropy-based classification
To effectively summarize the status of the NME within a GU, we employ the following classification scheme: We determined the previous thresholds by studying the NME within a region containing one CpG site. In this case, the methylation level L assumes two possible values, 1 or 0, with probabilities p and 1 − p, respectively. By focusing on the odds ratio r = p/(1 − p), we consider the random variable L to be "moderately ordered" if r ≥ 10 or r ≤ 1/10, whereas we consider it to be "highly ordered" if r ≥ 20 or r ≤ 1/20. In the first case, p ≥ 0.9091 or r ≤ 0.0909, which corresponds to a maximum NME value of 0.44, whereas, in the second case, p ≥ 0.9524 or p ≤ 0.0476, which corresponds to a maximum NME value of 0.28. Furthermore, we consider L to be "moderately disordered" if 1/2 ≤ r ≤ 2, whereas we consider it to be "highly disordered" if 1/1.2 ≤ r ≤ 1.2. In the first case, 0.3333 ≤ p ≤ 0.6667, which corresponds to a minimum NME value of 0.92, whereas, in the second case, 0.4545 ≤ p ≤ 0.5455, which corresponds to a minimum NME value of 0.99.
7 Differential classification of GUs 7.1 Methylation-based differential classification Using the PMF of the difference D L = L t − L r in methylation level within a GU between a test and a reference sample, we calculate the following probabilities: and use them to classify the GU by employing the scheme depicted in Fig. S2. It turns out that a GU is classified as being isomethylated if the absolute difference in methylation levels between two copies of the GU, one randomly chosen from the test sample and the other from the reference sample, is less than 0.1 more than 55% of the time (i.e., if q 3 > 0.55). When the GU is not isomethylated, it is classified as being hypomethylated if the difference in methylation levels between its two randomly chosen copies is no more than −0.1 more than 55% of the time (i.e., if Pr[D L ≤ −0.1 | |D L | ≥ 0.1] > 0.55, which implies that (q 1 + q 2 )/(1 − q 3 ) > 0.55). In particular, the GU is classified as being strongly hypomethylated if the difference in methylation levels is less than −0.55 more than 55% of the time (i.e., if q 1 > 0.55). On the other hand, it is classified as being moderately hypomethylated if the difference in methylation levels is between −1 and −0.1 more than 55% of the time with the difference being smaller than −0.55 no more than 55% of the time (i.e., if q 1 ≤ 0.55 but q 1 + q 2 > 0.55). Finally, the GU is classified as being weakly hypomethylated if the difference in methylation levels is between −1 and −0.1 no more than (1 − q 3 ) × 55% but no more than 55% of the time (i.e., if 0.55 × (1 − q 3 ) < q 1 + q 2 ≤ 0.55). Similar remarks apply for classifying the GU as being hypermethylated.

Entropy-based differential classification
By computing the difference D h = h t −h r in NMEs between a test and a reference sample within a GU, we classify the GU into one of seven classes using the following simple thresholding scheme: We choose the previous thresholds to approximately be in agreement with the thresholds 0.28 and 0.44 used for entropy-based classification of GUs in single samples; see (S24 Metiline, a recently proposed method for differential methylation analysis [4], performs hypothesis testing using the Mann-Whitney U (MWU) test (also known as the Wilcoxon ranksum test) or a two-dimensional extension of the Kolmogorov-Smirnov (KS) test [3]. Similarly to most methods of methylation analysis published in the literature, such as DSS, metilene fails to consider the joint statistical properties of DNA methylation and does not test against a null hypothesis that accurately reflects the goal of differential analysis. Moreover, and inconsistently with other methods (such as DSS), metilene does not account for the data generation process and variations in depth of coverage that are always present in sequencing data. Most "marginal" methods of methylation analysis of sequencing data require two random variables to be observed at each CpG site n of the genome: the number M n of methylated observations as well as the total number of observations (coverage) C n . In this case, the data generation process at each CpG site can be modeled as a binomial distribution and the marginal probability of the CpG site to be methylated can be estimated by using the maximum likelihood estimator p n = M n /C n . Clearly, the quality of the resulting estimate increases as the coverage C n increases (i.e., the width of the confidence interval for this estimate shrinks as the number of observations increases). For small C n , the maximum likelihood estimator p n is highly unreliable. However, it asymptotically converges to the true marginal probability of methylation as the coverage increases. It is therefore critical to take the coverage C n into account when formulating a statistical procedure to detect methylation differences. Surprisingly, metilene uses only the empirical estimator p n and thus it cannot account for the data generation process and the resulting uncertainty in the estimate. For example, a value p n = 1 that comes from a single observation and results in a 95% confidence interval of [0.025, 1] (using the binomial distribution) would be treated by metilene exactly the same as a value p n = 1 that comes from 50 observations that results in a 95% confidence interval of [0.93, 1]. As a result, the statistical procedures employed by metilene are highly questionable regardless of its mode of operation or choice of the statistical hypothesis test used. We discuss this important issue in the following.

Differential analysis using the MWU test
When examining one CpG site at a time, say the n-th CpG site, the input to metilene consists of observations that are split into a test and a reference group. Specifically, the method uses the maximum likelihood estimators p (t) n (k) and p (r) n (l) to estimate the marginal probabilities of the n-th CpG site to be methylated in the k-th test and l-th reference samples, respectively. It then attempts to detect differential methylation at this CpG site by comparing the probability distributions p (t) n and p (r) n using a two-sample MWU test on the observed values. In the case of a predefined region of interest or a potential DMR, metilene performs MWU analysis by pooling the p values associated with all observed CpG sites within the region of interest or DMR in each group (test or reference).
The appropriateness of using a MWU test in the context of differential methylation analysis is questionable, since the null hypothesis is difficult to precisely articulate except for the case of distributions that differ only by a shift in location [2]. Since the support of the probability distributions of the two estimators p (t) n and p (r) n is the unit interval, it is not always the case that these distributions differ only by a shift. This means that the MWU test will reject the underlying null hypothesis for differences other than location shifts. In this case, "interpretation of a small p-value is not always straightforward" [2]. This issue is further exacerbated by the failure of metilene to account for the data generation process that produces the p values.
A simulation example demonstrates that the previous statistical framework is problematic. Let us consider the simplest possible context for methylation analysis: a CpG island (CGI) with 100 CpG sites that are methylated in an independent and identically distributed fashion with probability 0.05 in both the test and the reference samples. In this case, there is no differential methylation present within the CGI, since both the test and the reference samples are characterized by the same distributions for DNA methylation. However, let us assume that there is a difference in sequencing coverage between the test and reference samples, a situation that is common in real data. Recall that any valid statistical hypothesis testing procedure must control the Type I error rate (i.e., the percentage of false positives when the null hypothesis is true) at a level α when the null hypothesis is rejected for p-values less than α. Otherwise, the method cannot be trusted when it rejects the null hypothesis, since this could be a false-positive with high probability.
Using Monte Carlo, we simulated the previous example with 5 reference and 5 test samples subject to a small difference in coverage, which is quite common in WGBS sequencing data: 15× in the test data and 10× in the reference data. For a p-value threshold of 0.05, the Type I error rate of metilene was 16% (estimated from 10,000 Monte Carlo trials), more than three times over the maximum allowed in a correct statistical test. Specifically, if the statistical procedure were valid, the Type I error rate should have been no more than 5%. A higher difference in coverage worsens the problem. For instance, for a 30× versus 10× coverage, which is not outside the norm for WGBS data, metilene falsely calls this region as differentially methylated 98% of the time. Note that our simple example does not even consider additional complications of real-world data that would cause further problems for metilene: coverages that vary from CpG to CpG site within a given sample, missing data at certain CpG sites (although an attempt was made in [4] to fill-in missing data), correlation in methylation between CpG sites, etc. Notably, the issue of correlation results in violation of a critical assumption (independence) underlying the MWU test.

Differential analysis using the KS test
In addition to the MWU test, metilene employs another way to assess statistical significance using a two-dimensional version of the KS test. This test plays a crucial role in the circular binary segmentation algorithm used in the first mode of metilene. In addition, the KS test is used in the second mode to test for significant differences in each region of interest. However, no documentation has been provided on why a second statistical test is necessary, or why the KS test should be more preferable than the MWU test. The two tests are based on different formulations for the null and alternative hypotheses, and the lack of formal justification for either hypothesis testing procedure further contributes to the confusion. Regardless, the KS test cannot overcome the fundamental issue of ignoring the data generation process and the associated coverage information necessary to conduct valid statistical analysis. We demonstrate this fact next.
Metiline employs the KS test only in regions with multiple CpG sites. It is based on randomly selecting a CpG site within the region and calculating the associated maximum likelihood estimates p (t) and p (r) of the methylation probabilities in the test and reference samples, respectively. This process generates data points randomly distributed within a two-dimensional space determined by the two-dimensional random variable (X (t) , p (t) ), where X (t) is a random variable indicating the location of the chosen CpG site. Likewise, the reference samples contribute data points randomly distributed within a two-dimensional space determined by the two-dimensional random variable (X (r) , p (r) ). The two groups of points are then passed to a two-dimensional KS test to determine whether there is a statistically significant difference between the joint probability distributions of (X (t) , p (t) ) and (X (r) , p (r) ).
The issue of coverage variability seriously affects the KS test as well. Consider the simple case of 15× test versus 10× reference coverage. The maximum likelihood estimator p (t) n at the n-th CpG site of the probability of methylation in the test sample distributes its probability mass over the set {0, 1/10, 2/10, . . . , 1}, whereas the p (r) n distributes its probability mass over a different set {0, 1/15, 2/15, . . . , 1}. Therefore, even if the marginal probabilities of methylation were identical in the test and reference samples, one would expect the null hypothesis to be rejected when there are different coverages between the samples. This is because the KS test is based on the null hypothesis that the cumulative probability distributions of p (t) n and p (r) n are identical. Indeed, by using the Monte Carlo simulation discussed in the previous subsection, we calculated a 100% false-positive rate for the case of 15× versus 10× coverage using 10,000 Monte Carlo samples, but worse-yet, even with a coverage of 11× versus 10×, the KS test still produces a 100% false positive rate.
There are additional problems associated with metilene's KS formulation. For example, by considering the random variable X, the null hypothesis of the KS test is false whenever the two samples have uneven missing data patterns within a region. To see why this is true, consider a case in which there are no missing data at the CpG sites in the test samples, while there are missing data in the reference samples. In this case, the probability distribution of X (t) in the test samples will follow a uniform distribution over the CpG locations within the region, while the probability distribution of X (r) in the reference samples will not be uniform. Therefore, the joint distribution of (X (t) , p (t) ) will differ from the joint distribution of (X (r) , p (r) ) regardless of the distributions of the methylation probabilities p (t) and p (r) in the region. Although metilene seems to handle missing data in a per CpG site basis, it can be the case that the set of CpG sites from each group included in the analysis is different for a given region of interest. This will happen when the total number of CpG sites in the region is above a certain threshold and when there is no data for a given CpG site in either group.
Clearly using the KS tests for determining regions of significant differential methylation between test and reference samples is highly problematic. As a result of our previous discussion and simple analysis, we conclude that metilene should not be used in practice. Only methods, such as DSS and informME, that properly account for the process that generates the sequencing data should be considered to be statistically valid procedures for differential methylation analysis.

bedGraph files generated by informME
Below we provide a list of the browser extensible data graph (bedGraph) tracks and excel spreadsheets generated by informME. The bedGraph files can be directly displayed in the UCSC genome browser (https://genome.ucsc.edu). "PhenoName" is the name for the phenotype used in inter-sample analysis (e.g., lungnormal-1). "tPhenoName" and "rPhenoName" are the names of the test and reference phenotypes used in differential analysis (e.g., lungcancer-1 and lungnormal-1, respectively). Finally, "tName" and "rName" are names for the test and reference phenotypes used in differential analysis (e.g., lungcancer and lungnormal, respectively).

Gene Ranking
• gRank-JSD-tName-VS-rName.xlsx: Excel spreadsheet containing a list of ranked genes using the JSD when no replicate reference data is available.
• gRankRDD-JSD-rName-VS-rName.xlsx: Excel spreadsheet containing a list of ranked genes using the JSD when replicate reference data is available.  Figure S1. Methylation-based classification scheme for GUs that leads to two genome-wide bedGraph tracks: METH and VAR (see Section 9 for the associated codes). The probabilities p 1 , p 2 , p 3 , and p 4 are given by (S23). Note that a (small) number of GUs are not classified by this scheme.  Figure S2. Methylation-based differential classification scheme for GUs that leads to the genome-wide DMU bedGraph track (see Section 9 for the associated codes). The probabilities q 1 , q 2 , q 3 , q 4 , and q 5 are given by (S25). Note that a (small) number of GUs are not classified by this scheme. . Distribution of p-values obtained genomewide using the (lungnormal-1, lungnormal-2) pair of our lung normal data by informME, DSS-single, metilene in the "DMR de-novo annotation" mode 1 based on the KS test statistic, metilene in the "DMR de-novo annotation" mode 1 based on the MWU test statistic, metilene in "DMR annotation in known features" mode 2 based on the KS test statistic, and metilene in "DMR annotation in known features" mode 2 based on the MWU test statistic.  Figure S4. Distribution of p-values obtained genomewide using the (lungnormal-1, lungnormal-3) pair of our lung normal data by informME, DSS-single, metilene in the "DMR de-novo annotation" mode 1 based on the KS test statistic, metilene in the "DMR de-novo annotation" mode 1 based on the MWU test statistic, metilene in "DMR annotation in known features" mode 2 based on the KS test statistic, and metilene in "DMR annotation in known features" mode 2 based on the MWU test statistic. . Distribution of p-values obtained genomewide using the (lungnormal-2, lungnormal-3) pair of our lung normal data by informME, DSS-single, metilene in the "DMR de-novo annotation" mode 1 based on the KS test statistic, metilene in the "DMR de-novo annotation" mode 1 based on the MWU test statistic, metilene in "DMR annotation in known features" mode 2 based on the KS test statistic, and metilene in "DMR annotation in known features" mode 2 based on the MWU test statistic.  Figure S6. Distributions of aggregate methylation-based GU classifications (METH and VAR tracks -see Section 9) within the entire genome, enhancers, promoters, gene bodies, CGIs, and CGI shores.  Figure S7. Distributions of aggregate entropy-based GU classifications (ENTR track -see Section 9) within the entire genome, enhancers, promoters, gene bodies, CGIs, and CGI shores.  Figure S8. Distributions of aggregate methylation-based differential GU classifications (DMU track -see Section 9) within the entire genome, enhancers, promoters, gene bodies, CGIs, and CGI shores.  Figure S9. Distributions of aggregate entropy-based differential GU classifications (DEU track -see Section 9) within the entire genome, enhancers, promoters, gene bodies, CGIs, and CGI shores.