Estimation of the methylation pattern distribution from deep sequencing data

Background Bisulphite sequencing enables the detection of cytosine methylation. The sequence of the methylation states of cytosines on any given read forms a methylation pattern that carries substantially more information than merely studying the average methylation level at individual positions. In order to understand better the complexity of DNA methylation landscapes in biological samples, it is important to study the diversity of these methylation patterns. However, the accurate quantification of methylation patterns is subject to sequencing errors and spurious signals due to incomplete bisulphite conversion of cytosines. Results A statistical model is developed which accounts for the distribution of DNA methylation patterns at any given locus. The model incorporates the effects of sequencing errors and spurious reads, and enables estimation of the true underlying distribution of methylation patterns. Conclusions Calculation of the estimated distribution over methylation patterns is implemented in the R Bioconductor package MPFE. Source code and documentation of the package are also available for download at http://bioconductor.org/packages/3.0/bioc/html/MPFE.html. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0600-6) contains supplementary material, which is available to authorized users.

Akman et al. [1] have developed the Bioconductor package BEAT (BS-Seq Epimutation Analysis Toolkit) for quantitative analysis of DNA methylation patterns from bisulphite sequencing data. The basis of their software is a statistical model for the distribution of the number of reads which register as being methylated from a pooled set of bisulphite-sequencing reads from CpG sites in a given region of a genome. The distribution is derived in terms of the total number of pooled reads from the region, a false positive rate and a false negative rate representing incorrect conversions or non-conversions, and an underlying methylation rate which is assumed to be common to all CpG sites within the region. Their purpose is to estimate the common underlying methylation rate for the region. Their model does not deal with the distribution over methylation patterns.
We argue that because of the assumption that all CpG sites within the pooled region share common values for the distributional parameters, their model is mathematically equivalent to the n = 1 CpG site version of the model set out in Section 3.1 of the Methods section of the current paper. More importantly, the binomial mixture model arrived at by Akman et al. by repeated application of the law of total probability is nothing more than a single binomial distribution. Once this is recognised, it is a trivial matter to estimate the underlying methylation rate from maximum likelihood without the need for the Bayesian analysis employed by Akman et al. Table 1 lists the correspondence between the notation of Akman et al. and that used the current paper. If we represent the number of of reads registering as being methylated by the random variable K, then Eq.
(2) because M has the properties of a Markov transition matrix, and Note that M −1 exists provided = 1 and η s = 1 2 (see Eq. (4) of the main paper), which will hold in any realistic situation. An illustrated example of the constraint onφ for the case n = 2 is shown in Figure 8 of the main paper.
The computational efficiency arises because the number of constraints reduces considerably for large n and physically realistic values of l.
It is easy to check that ∂ 2L (φ)/∂φ j ∂φ k is a diagonal matrix with strictly negative diagonal elements, implying thatL(φ) is a concave function, and that the constraints Eq. (16) and (17) define a convex set. Thus the fast algorithm is also guaranteed to give a unique result.

Reduction in the number of patterns present
One may be tempted to ask whether the maximum likelihood estimate described above is guaranteed to lead to a reduction in the number of methylation patterns reported relative to the number of patterns naïvely observed. We argue here that if read errors are ignored (i.e. we set η = 0), the estimated distribution over patternsθ i is shifted, relative to the naïve distribution y i /n read , towards the totally unmethylated pattern (0, . . . , 0). As a consequence the set of patterns reported to be present is guaranteed to be a subset of or equal to the set of observed patterns.
If the maximum of the function in chain brackets in Eq (19), namely y (0) /n read , lies in the interval [0, 1− ], the transformation M −1 back to θ-space clearly skews the estimated abundances towards the unmethylated state. If the unconstrained maximum lies in the interval [ , 1], the constrained maximum returns the valuê φ (0) = 1 − , which transforms to the estimateθ (0) = 1, and any methylated reads are interpreted as spurious. More specifically, the reported estimate is To see how the result generalises to higher numbers of CpG sites, consider the n = 2 case illustrated in Figure 1B. The fully methylated pattern (1, 1) is a fixed point of the transformation M . Points in each triangular face of the black wire tetrahedron correspond to reported distributions in θ-space for which the pattern i indicated at the opposite corner occurs with probabilitŷ θ i = 0. Thus the transformation M −1 of the red tetrahedron to the black wire frame skews the distribution towards less-methylated patterns. Furthermore, whenever the maximum y i /n read of the likelihood function Eq (4) lies outside the red tetrahedron but inside the wire frame, the constrained maximum lies on the surface of the red tetrahedron, which maps to a point in one of the triangular faces of the wire tetrahedron, and possibly to an edge or corner. In this case at least one pattern is reported to be spurious 1 .
For an arbitrary number n of CpG sites, the only fixed point of the 2 n × 2 n matrix M is vector (1, . . . , 1) corresponding to the fully methylated pattern. Again the transformation M −1 skews the distribution towards less-methylated patterns, and any observed pattern with a small number of observed counts can potentially be reported as spurious. Figure 1: (A) Schematic of the region of φ-space constrained by Eq. (5) and Eq. (6) shown in red and its mapping back to θ-space for the case of an amplicon with n = 1 CpG site and read-error rate η = 0; (B) the constrained region of φ-space (red solid tetrahedron) and its mapping back to θ-space (black wire tetrahedron) for the case of an amplicon with n = 2 CpG site and read-error rate η = 0; and (C) the same as (A) with η > 0.
The situation when η > 0 is more complicated (see Figure 1C for the case of n = 1 CpG site). In principle, an observed distribution y i /n read can be skewed either towards the fully unmethylated pattern (0, . . . , 0) or the fully methylated pattern (1, . . . , 1) by the transformation M −1 . In practice, however, and particularly in the case of non-vertebrates, observed distributions are typically weighted in favour of the fully unmethylated pattern and patterns within a small Hamming distance of the fully unmethylated pattern. Such distributions will be skewed towards the fully unmethylated pattern and the number of reported patterns will decrease, as observed in Figure 3 of the main paper.