BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data

Mao, Zijing; Ma, Chifeng; Huang, Tim H-M; Chen, Yidong; Huang, Yufei

doi:10.1186/1471-2105-15-S12-S6

Volume 15 Supplement 12

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics

Research
Open access
Published: 06 November 2014

BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data

Zijing Mao¹,
Chifeng Ma¹,
Tim H-M Huang^2,5,
Yidong Chen^3,4,5 &
…
Yufei Huang^1,3

BMC Bioinformatics volume 15, Article number: S6 (2014) Cite this article

1687 Accesses
3 Citations
1 Altmetric
Metrics details

Abstract

DNA methylation is a common epigenetic marker that regulates gene expression. A robust and cost-effective way for measuring whole genome methylation is Methyl-CpG binding domain-based capture followed by sequencing (MBDCap-seq). In this study, we proposed BIMMER, a Hidden Markov Model (HMM) for differential Methylation Regions (DMRs) identification, where HMMs were proposed to model the methylation status in normal and cancer samples in the first layer and another HMM was introduced to model the relationship between differential methylation and methylation statuses in normal and cancer samples. To carry out the prediction for BIMMER, an Expectation-Maximization algorithm was derived. BIMMER was validated on the simulated data and applied to real MBDCap-seq data of normal and cancer samples. BIMMER revealed that 8.83% of the breast cancer genome are differentially methylated and the majority are hypo-methylated in breast cancer.

Introduction

DNA methylation refers to the chemical modification of DNA nucleotides. One of the most common DNA methylation is the modification of cytosine, which typically occurs in CpG sites. When CpG sites in the promoter region that transcription factors bind are methylated, permanent silencing of gene expression is observed in the cell. DNA methylation is highly prevalent in cancer, involved in almost all types of cancer development by altering the normal regulation of gene expression and silencing the tumor suppressor genes [1]. There are three sequencing-based technologies for whole-genome DNA methylation profiling: bisulfite treatment[2] based or bisulfite sequencing, methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq)[3], and Methyl-CpG binding domain-base capture followed by sequencing (MBDCap-seq)[4]. Among the three technologies, MBDCap-seq has higher dynamic range and better sensitivities and it detects more enrichment in CpG-dense methylated DNA regions [5, 6]. We choose to focus on MDBCap-seq data analysis in this study.

Two computational problems concern these genome-wide methylation data including methylation site detection and differential methylation region (DMR) detection. The problem of methylation site detection is similar to the peak detection for ChIP-seq. However, since methylation signals give rise to wider sequence read distribution than that from ChIP-seq peak identification algorithms such as SPP[7] and MACS[8] that are designed primarily for ChIP-seq data analysis would produce poor identification of methylation sites. Specific changes and new algorithms have been proposed to account for the nature of wider read distribution in methylation sequencing data. For instance, Hidden Markov Model [9, 10] have been proposed to model the correlation between adjacent bins of a methylation site. The main aim of DMR detection is to identify aberrant DNA methylation regions that are specifically associated with disease phenotype. It is also fundamental to understanding the cause of altered gene expression in cancer. Most of the popular DMR detection pipelines includes two parts: the first part concerns detection of methylation sites in normal and disease samples individually and the second part includes identification of differential methylated regions in disease sample versus normal samples [11]. Many algorithms for differential methylation detection have been proposed including for example ChIPnorm [11] and ChIPDiff [12]. ChIPnorm performs a quantile normalization on normal and disease samples and applies differential analysis to detect DMRs, whereas ChIPDiff detects enriched methylation regions in normal and disease samples with a Binomial model, and then performs differential analysis based on a HMM. These existing algorithms are very powerful tools in differential methylation analysis but they have clear disadvantages especially when applied to MBDCap-seq. First, the targeted resolutions of the data are relatively low, for instance, the bin size in ChIPDiff is 1000 base pairs (bps). However for a typical MBDCap-seq data, the resolution is normally 100bp bins. Second, these existing pipelines mentioned above are two-step procedures, which are prone to error propagation. If there is an error in methylation site detection, this error will be passed on to the following DMR detection step and impact negatively the performance of differential methylation. Last but not the least, with an exception in [13], most existing algorithms were developed to handle single sample. When replicates are available, they perform prediction on individual samples separately and then fuse the detection results from individual together. Such fusion based algorithms is easily influenced by the erroneous predictions made at individual level. The algorithm in [13] applies LOESS to normalize the difference between replicate samples. However, it assumes that only a small portions of methylation regions are DMRs [11], which might not be applicable to all cases.

In this paper, we proposed a novel algorithm for differential methylation regions (DMRs) detection based on Hidden Markov Model (HMM) and we call the algorithm BIMMER. BIMMER models the methylation status and detect DMRs in normal and cancer samples simultaneously. By doing this, BIMMER avoids error propagation in the existing two-step pipelines and therefore can improve the performance of DMRs detection. BIMMER was tested first on a simulated datasets and applied to a real breast cancer MBDCAP-seq data. The results from breast cancer data revealed that there are 8.83% of 30,804,183 bines detected with differentially methylated status, most of which are hypo-methylated in breast cancer samples.

Methods

Notation

Each MBDCap-seq data sample is pre-processed to be in a BED file, which records the sequence reads counts in consecutive 100 base pair (bp) bins over the entire genome. Let's denote the sample size of the normal sample MBDCap-Seq datasets as $N_{1}$ , that of the cancer dataset as $N_{2}$ , and the total number of the bins is denoted as $M$ . We further denote the reads count of the $i_{t h}$ bin in $N_{1}$ normal samples by a vector $X_{n_{i}} = {[X_{n_{i} . 1}, X_{n_{i} . 2}, \dots, X_{n_{i} . N_{1}}]}^{⊤}$ , where $X_{n_{i}, j}$ is the reads count of the $j_{t h}$ sample, and similarly the reads count of the $i_{t h}$ bin in $N_{2}$ cancer samples by $X_{c_{i}} = {[X_{c_{i} . 1}, X_{c_{i} . 2}, \dots, X_{c_{i} . N_{2}}]}^{⊤}$ , where $X_{c_i, k}$ represents the reads count in the $k_{t h}$ sample. The aim of this work is to predict the differential methylation status of the cancer samples over the normal samples for every bin in the genome.

Two layer HMM model for differential methylation

A bin is considered differential methylated if its methylation status in the cancer sample is different from that in the normal samples. Therefore, the methylation models for the $N_{1}$ normal samples and the $N_{2}$ cancer samples needs to be defined before proceeding to model the differential methylation. To model the methylation status, let $m_{n_{i}} = [0, 1]$ denote the methylation status of the $i_{t h}$ bin for the normal sample, where $m_{n_{i}} = 1$ when the $i_{t h}$ bin is methylated and $m_{n_{i}} = 0$ otherwise. Because the methylation statuses in the adjacent binds are highly correlated, a first order Markov chain is introduced (Figure 1), where the transition probability is defined as $A_{n} (i) = p (m_{n_{i}} | m_{n_{i - 1}})$ and the initial probability for bin 1 is defined as $τ_{n} (i) = P (m_{n_{1}})$ . Then for each bin, the read counts would depend on the methylation status, which is modeled as an i.i.d discrete distribution as $B_{n} (r_{j}, d) = \prod_{j = 1}^{N_{1}} P (x_{n_{j}} = r_{j} | m_{n_{j}} = d) .$ Taken together, the methylation in the normal samples is modeled by an HMM. Similarly, the methylation status for the cancer samples can be also modeled by an HMM. Specifically, if let $m_{c_{i}} = [0, 1]$ denote the methylation status of the $j_{t h}$ bin of the cancer samples, the transition probability and the initial state probability are modeled as $A_{c} (i) = p (m_{c_{i}} | m_{c_{i} - 1})$ and $τ_{c} (i) = P (m_{c_{1}}),$ respectfully and the emission probability is represented as $B_{c} (r_{k}, d) = \prod_{k = 1}^{N_{2}} P (x_{c_{k}} = r_{k} | m_{c_{k}} = d) .$ Next, let differential status at the $i_{t h}$ bin denoted by $d m_{i} = [0, 1]$ , where $d m_{i} = \{\begin{matrix} 1, w h e n m_{n_{i}} \neq m_{c_{i}} \\ 0, o t h e r w i s e \end{matrix}$ . Because the differential methylation statuses for the adjacent bins are also correlated, $d m_{i}$ is further assumed to follow another first order Markov chain (Figure 1), whose transition probability and initial state probability are defined as $A_{d m} (i) = p (d m_{i} | d m_{i - 1})$ and $τ_{d m} (i) = P (d m_{i})$ . Finally, we need to model the relationship between the differential methylation status $d m_{i}$ and the methylation statuses $m_{n_{i}}$ and $m_{c_{i}}$ . In this work, we propose to model it as depicted in Figure 1 by the emission probability $P (m_{n_{i}} | m_{n_{i - 1}}, m_{c_{i}}, d m_{i})$ , i.e., the normal sample methylation status depends directly on the cancer sample methylation status and the differential methylation status, in addition to its own correlations between adjacent bins. It is easy to see from Figure 1 that there are two sets of relatively well defined relationships involved in this emission probability: $p (m_{n_{i}} | m_{n_{i - 1}})$ and $P (m_{n_{i}} | m_{c_{i}}, d m_{i})$ . The first one $p (m_{n_{i}} | m_{n_{i - 1}})$ is the transition probability for the methylation status in normal sample and the second one $P (m_{n_{i}} | m_{c_{i}}, d m_{i})$ models the dependence of $m_{n_{i}}$ on $m_{c_{i}}$ and $d m_{i}$ , which can be intuitively defined as $P (m_{n_{i}} | m_{c_{i}}, d m_{i}) = 0, i f m_{n_{i}} = m_{c_{i}} b u t d m_{i} = 1, o r m_{n_{i}} \neq m_{c_{i}} b u t d m_{i} = 0$ and $P (m_{n_{i}} | m_{c_{i}}, d m_{i}) = 1, o t h e r w i s e$ , i.e., $m_{n_{i}}$ and $m_{c_{i}}$ have to be different if $d m_{i} = 1$ and otherwise they must be the same. Now, the question is how to integrate $p (m_{n_{i}} | m_{n_{i - 1}})$ and $P (m_{n_{i}} | m_{c_{i}}, d m_{i})$ to model the emission probability $P (m_{n_{i}} | m_{n_{i - 1}}, m_{c_{i}}, d m_{i})$ . To this end, a popular approach in data fusion is adopted, which combines them through a weighted sum: $P (m_{n_{i}} | m_{n_{i - 1}}, m_{c_{i}}, d m_{i}) = α p (m_{n_{i}} | m_{n_{i - 1}}) + (1 - α) P (m_{n_{i}} | m_{c_{i}}, d m_{i})$ where $α$ is the weighting factor to be determined from data. Taken together, we propose a two-layer HMM model as depicted in Figure 1 for differential methylation and we refer this model as BIMMER. With BIMMER, the differential methylation status is predicted according to the posterior distribution $P (d m_{i} | X_{n_{i}}, X_{c_{i}} \forall i)$ . This posterior distribution does not depend on the hard decisions on the methylation states for the normal and cancer samples and therefore overcomes the aforementioned problems of error propagation. However, one difficulty is that in the calculation of this posterior distribution, we need to calculate the integration of all 9 model parameters: $τ_{n}$ , $τ_{c}$ , $B_{n}$ , $B_{c}$ , $A_{n}$ , $A_{c}$ , $τ_{d m}$ , $A_{d m} a n d α$ , which is analytically intractable. To solve this problem, we propose the next Expectation and Maximization (EM) solution.

The EM solution

Let $X_{n}$ denote the collection of the reads counts in all M bins for all $N_{1}$ normal samples and $X_{c}$ the collection of reads counts in all M bins for all $N_{2}$ cancer samples. Also, let $m_{n} = {[m_{n 1}, m_{n 2}, \dots m_{n i}]}^{T}$ , $m_{c} = {[m_{c 1}, m_{c 2}, \dots m_{c i}]}^{T}$ and $d m = {[d m_{1}, d m_{2}, \dots d m_{i}]}^{T}$ . In order to obtain the EM solution, $X_{n}$ and $X_{c}$ are treated as the observed data but $m_{n}$ and $m_{c}$ are considered as the unobserved data for the first layer HMM while $d m$ is the unobserved data for the second layer HMM. Here, $Ψ$ is used to denote the model parameter set. For the simplicity of the computation, the first layer HMM parameters $τ_{n}$ , $τ_{c}$ , $B_{n}$ , $B_{c}$ , $A_{n}$ , $A_{c}$ are learned directly from $X_{n}$ and $X_{c}$ with Baum-Welch algorithm and excluded from the EM process. Therefore, the parameter set $Ψ$ for BIMMER includes 3 parameter: $Ψ = {τ_{d m}, A_{d m}, α}$ . Given a set of initial or estimated parameters, the complete data likelihood function is

\begin{gathered} L (Ψ) = P (X_{n}, X_{c}, m_{c}, m_{n}, d m | Ψ) \\ = τ_{d m} \times τ_{c} \times (α \times τ_{n} + (1 - α) \times P (m_{n 1} | m_{c 1}, d m_{1}, Ψ)) \times P (x_{n 1} | m_{n 1}, Ψ) \times P (x_{c 1} | m_{c 1}, Ψ) \\ \times \prod_{i = 2}^{M} [P (x_{n i} | m_{n i}, Ψ) \times P (x_{c i} | m_{c i}, Ψ) \times P (m_{c i} | m_{c (i - 1)}, Ψ) \times P (m_{n i} | m_{n (i - 1)}, m_{c i}, d m_{i}, Ψ) \\ \times P (d m_{i} | d m_{i - 1}, Ψ)] \end{gathered}

Then the log-likelihood function can be expressed as

\begin{gathered} log (l (Ψ)) = log (τ_{d m}) \\ + log (τ_{c}) + log (α \times τ_{n} + (1 - α) \times P (m_{n 1} | m_{c 1}, d m_{1}, Ψ)) + log (P (x_{n 1} | m_{n 1}, Ψ)) + log (P (x_{c 1} | m_{c 1}, Ψ)) \\ + \sum_{i = 2}^{M} [log (P (x_{n i} | m_{n i}, Ψ)) + log ((x_{c i} | m_{c i}, Ψ)) + log (P (m_{c i} | m_{c (i - 1)}, Ψ)) \\ + log (P (m_{n i} | m_{n (i - 1)}, m_{c i}, d m_{i}, Ψ)) + log (P (d m_{i} | d m_{i - 1}, Ψ))] \end{gathered}

At the $k_{t h}$ iteration, suppose that the estimated parameter set at the previous iteration is $Ψ^{k - 1}$ . Then, at E-step, the conditional expectation of this log complete data likelihood is calculated

Q (Ψ; Ψ^{k - 1}) = E_{Ψ^{k - 1}} [l o g L (Ψ) | X_{n}, X_{c}] = \sum_{m_{c}} \sum_{m_{n}} \sum_{d m} \sum_{i = 1}^{M} l o g P (X_{n}, X_{c}, m_{c i}, m_{n i}, d m_{i} | Ψ) \prod_{i = 1}^{M} P (m_{c i}, m_{n i}, d m_{i} | X_{n}, X_{c}, Ψ^{k - 1})

In order to obtain $P (m_{c i}, m_{n i}, d m_{i} | X_{n}, X_{c}, Ψ^{k - 1})$ , the forward-backward algorithm is used, where

\begin{gathered} P (m_{c i}, m_{n i}, d m_{i} | X_{n}, X_{c}, Ψ^{k - 1}) \propto P (d m_{i}, m_{c i}, m_{n i}, X_{n}, X_{c} | Ψ^{k - 1}) \\ = P (d m_{i}, m_{c i}, m_{n i}, X_{n, 1 : i}, X_{c, 1 : i} | Ψ^{k - 1}) \times P (X_{n, i + 1 : M}, X_{c, i + 1 : m} | d m_{i}, m_{c i}, m_{n i}, X_{n,}, X_{c, 1 : M}, Ψ^{k - 1}) \\ = P (d m_{i}, m_{c i}, m_{n i}, X_{n, 1 : i}, X_{c, 1 : i} | Ψ^{k - 1}) \times P (X_{n, i + 1 : M}, X_{c, i + 1 : m} | d m_{i}, m_{c i}, m_{n i}, Ψ^{k - 1}) \end{gathered}

P (d m_{i}, m_{c i}, m_{n i}, X_{n, 1 : i}, X_{c, 1 : i} | Ψ^{k - 1}) = \sum_{d m_{i - 1}}^{} \sum_{m_{c (i - 1)}}^{} \sum_{m_{n (i - 1)}}^{} P (d m_{i}, d m_{i - 1}, m_{c i}, m_{c (i - 1)}, m_{n i}, m_{n (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1}, | Ψ^{k - 1})

and

$\begin{gathered} P (X_{n, i + 1 : M}, X_{c, i + 1 : m} | d m_{i}, m_{c i}, m_{n i}, Ψ^{k - 1}) = \\ \sum_{d m_{i - 1}}^{} \sum_{m_{c (i - 1)}}^{} \sum_{m_{n (i - 1)}}^{} P (d m_{i + 1}, m_{n (i + 1)}, m_{c (i + 1)}, X_{n, i + 1 : M}, X_{c, i + 1 : M} | d m_{i}, m_{c i}, m_{n i}, Ψ^{k - 1}) \end{gathered}$ .

where $X_{n, p : q}$ and $X_{c, p : q}$ denote the collection of the reads counts from bin p to bin q from the N₁ normal samples and the N₂ cancer samples, respectively. In the forward step, we calculate

\begin{gathered} P (d m_{i}, d m_{i - 1}, m_{c i}, m_{c (i - 1)}, m_{n i}, m_{n (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1} | Ψ^{k - 1}) \\ = P (d m_{i}, d m_{i - 1}, m_{c i}, m_{c (i - 1)}, m_{n i}, m_{n (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1} | Ψ^{k - 1}) \\ \times P (m_{c i} | d m_{i}, d m_{i - 1}, m_{n (i - 1)}, m_{c (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1}, Ψ^{k - 1}) \\ \times P (d m_{i} | d m_{i - 1}, m_{n (i - 1)}, m_{c (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1}, Ψ^{k - 1}) \\ \times P (d m_{i - 1}, m_{n (i - 1)}, m_{c (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i - 1} | Ψ^{k - 1}) \end{gathered}

$= P (X_{n i} | m_{n i}, Ψ^{k - 1}) \times P (X_{n, i + 2 : M}, X_{c, i + 2 : m} | d m_{i + 1}, m_{c (i + 1)}, m_{n (i + 1)}, Ψ^{k - 1})$ .

In the backward step, we have

\begin{gathered} P (d m_{i + 1}, m_{c (i + 1)}, m_{n (i + 1)}, X_{n, i + 1 : M}, X_{c, i + 1 : M} | d m_{i}, m_{c i}, m_{n i}, Ψ^{k - 1}) \\ = P (X_{n (i + 1)} | d m_{i}, d m_{i + 1}, m_{c i}, m_{c (i + 1)}, m_{n i}, m_{n (i + 1)}, X_{n, i + 2 : M}, X_{c, i + 2 : M}, Ψ^{k - 1}) \\ \times P (X_{c (i + 1)} | d m_{i}, d m_{i + 1}, m_{c i}, m_{c (i + 1)}, m_{n i}, m_{n (i + 1)}, X_{n, i + 2 : M}, X_{c, i + 2 : M}, Ψ^{k - 1}) \\ \times P (m_{n (i + 1)} | d m_{i}, d m_{i + 1}, m_{c i}, m_{c (i + 1)}, m_{n i}, X_{n, i + 2 : M}, X_{c, i + 2 : M}, Ψ^{k - 1}) \\ \times P (m_{c (i + 1)} | d m_{i}, d m_{i + 1}, m_{c i}, m_{n i}, X_{n, i + 2 : M}, X_{c, i + 2 : M}, Ψ^{k - 1}) \times P (d m_{i + 1} | d m_{i + 2}, Ψ^{k - 1}) \\ \times P (X_{n, i + 2 : M}, X_{c, i + 2 : M} | d m_{i + 1}, m_{c (i + 1)}, m_{n (i + 1)}, Ψ^{k - 1}) \end{gathered}

Then, at M-step, the parameter set $Ψ^{k}$ is updated from $Ψ^{k - 1}$ by maximizing the likelihood expectation with respect to $Ψ^{k - 1}$ . This process is equivalent to maximizing he $Q$ function with respective to the parameters $Ψ$

Ψ^{k} = a r g m a x_{Ψ} Q (Ψ; Ψ^{k - 1})

The maximization yields

$τ_{d m} (z) = \sum_{m_{c 1}}^{} \sum_{m_{n 1}}^{} P (d m_{1} = z, m_{c 1}, m_{n 1}, X_{n i}, X_{c i} | Ψ^{k - 1})$ ,

A_{d m} (i) = P (d m_{i} | d m_{i - 1}, Ψ^{k - 1}) = \frac{P (d m_{i}, d m_{i - 1,} X_{n i}, X_{c i} | Ψ^{k - 1})}{P (d m_{i - 1,} X_{n i}, X_{c i} | Ψ^{k - 1})}

where

\begin{gathered} P (d m_{i}, d m_{i - 1,} X_{n i}, X_{c i} | Ψ^{k - 1}) \\ = \sum_{m_{c i}} \sum_{m_{n i}} \sum_{m_{n (i - 1)}} \sum_{m_{c (i - 1)}} P (X_{n i} | m_{n i}, Ψ^{k - 1}) P (X_{c i} | m_{c i}, Ψ^{k - 1}) P (X_{n, i + 1 : M}, X_{c, i + 1 : M} | d m_{i}, m_{c i}, m_{n i}, Ψ^{k - 1}) \\ \times P (m_{n i} | d m_{i}, m_{c i}, m_{n i}, m_{n (i - 1)}, Ψ^{k - 1}) P (m_{c i} | m_{c (i - 1)}, Ψ^{k - 1}) P (d m_{i} | d m_{i - 1}, Ψ^{k - 1}) \\ \times P (d m_{i - 1}, m_{c i - 1}, m_{n i - 1}, X_{n 1 : i - 1}, X_{c 1 : i + 1} | Ψ^{k - 1}) \end{gathered}

and

\begin{gathered} P (d m_{i - 1,} X_{n i}, X_{c i} | Ψ^{k - 1}) \\ = \sum_{m_{n i - 1}}^{} \sum_{m_{c i - 1}}^{} P (X_{n, i - 1 : M}, X_{c, i - 1 : M} | d m_{i - 1}, m_{c (i - 1)}, m_{n (i - 1)}, Ψ^{k - 1}) P (d m_{i - 1}, m_{c (i - 1)}, m_{n (i - 1)}, X_{n, 1 : i - 1}, X_{c, 1 : i + 1} | Ψ^{k - 1}) \end{gathered}

and

\begin{gathered} \frac{\partial \sum_{m_{c}, m_{n}, d m} l o g (P (X_{c}, X_{n}, m_{c}, m_{n}, d m | Ψ) P (X_{c}, X_{n}, m_{c}, m_{n}, d m | Ψ^{k - 1}))}{\partial α} \\ = \sum_{i = 2}^{M} \sum_{m_{c i}} \sum_{m_{n i}} \sum_{d m_{i}} \sum_{m_{n i - 1}} \frac{P (m_{c i}, m_{n i}, d m_{i}, m_{n i - 1}, X_{c}, X_{n}, | Ψ^{k - 1})}{α + \frac{P (m_{n i} | m_{c i}, d m_{i}, Ψ^{k - 1})}{P (m_{n i} | m_{n (i - 1)}, Ψ^{k - 1}) - P (m_{n i} | m_{c i}, d m_{i}, Ψ^{k - 1})}} \\ + \sum_{m_{c 1}} \sum_{m_{n 1}} \sum_{d m_{1}} \frac{P (m_{c 1}, m_{n 1}, d m_{1}, X_{c}, X_{n}, | Ψ^{k - 1})}{α + \frac{P (m_{1} | m_{1}, d m_{1}, Ψ^{k - 1})}{τ_{c} - P (m_{n 1} | m_{c 1}, d m_{1}, Ψ^{k - 1})}} \end{gathered}

where the last equation is calculated by the Newton-Raphson algorithm. Maximizing this $Q$ function guarantees that the likelihood $L (Ψ^{k})$ is always greater than $L (Ψ^{k - 1})$ , hence ensures global convergence of the solution.

Model initialization and prediction of DMRs

To implement the EM algorithm, the initial parameter set Φ⁽⁰⁾ and the parameters for the first layer needs to be carefully defined because specific choice of these initial parameter values could lead to difference local optimal solutions and affect the prediction performance. After the convergence of the EM solution, the differential methylation statuses, dm, are predicted using the Viterbi algorithm[14] as the chain of the states with the largest probability given the estimated parameter set. Additionally, the methylation statuses $m_{c} and m_{n}$ can also be predicted using the Viterbi algorithm provided the parameters of the first layer HMM are set to the estimated ones.

Results

BIMMER was validated on both simulated data and applied to a real breast cancer dataset. It was first tested on the simulated systems, where the data models were assumed known. Then, BIMMER was applied to a real breast cancer dataset to explore the state of differential methylation.

Test on simulated data

A test dataset was simulated based on the graphical model in Figure 1 to evaluate the performance of BIMMER. A chain of dm was first generated based on given $τ_{d m}$ and $A_{d m}$ . The methylation status in normal and cancer sample $m_{c}$ and $m_{n}$ were then generated based on a set of $τ_{c}$ , $τ_{n}, A_{c}$ , $A_{n}$ and weight parameter $α$ . The read counts in each bin of the normal and cancer sample $X_{c}$ and $X_{n}$ were generated according to the emission probabilities $B_{n}$ amd $B_{c}$ . In addition, a Poisson noise was also added to the reads. Multiple sets of 10 samples with 200,000 bins were generated with different transition probabilities and weight factors (See Table 1 for detailed parameter settings). For comparison, two commonly used differential analysis algorithms: two sample t-test and Wilcoxon Test [15, 16] were also applied to the simulation data. To test the performance of BIMMER under different conditions, two scenarios were considered, where in the first all the parameters except the transition probabilities were fixed, whereas in the other situation only the weight factors were allowed to change. The prediction results were evaluated by the precision and recall (PR) curve and receiver operating characteristic (ROC) curve while the area under the curve (AUC) were calculated for each algorithm (Figure 2). For all the simulation tests, BIMMER outperformed both the two sample t-test and Wilcoxon test. In each simulated scenario, the performance of Wilcoxon test and the two sample t-test were very similar. The Wilcoxon slightly outperformed the t-test because the Wilcoxon test is more robust. For the scenario, the transition probabilities of the simulation were sets from 0.9 to 0.7 and the performance of BIMMER did not change much in terms of its PR and ROC curve. For the second scenario, as the weight factors increasing, the performance of BIMMER slightly decreases. This makes sense because the weight factors actually models the contribution factors of two probabilities $p (m_{n_{i}} | m_{n_{i - 1}}) a n d P (m_{n_{i}} | m_{c_{i}}, d m_{i})$ . The larger this weight factors is, the more uncertainty exist. As the result, the prediction performance decrease. We further tested the influence of different initial values of the weight on the final prediction of differential methylation in the EM solution. This requires to be tested because $α$ is unique in our model. Different initial weights (0.01 and 0.3) were tested used in three simulations and the prediction performance of BIMMER (Figure 3) showed little difference, indicating that the initial ω has little influence on BIMMER's prediction results. (The simulation transition probability and the training result are provided in Table 2) In conclusion, BIMMER can produce satisfactory prediction results on the simulation data; it is also robust against changes in the transition probabilities and the weights.

Table 1 Parameter set used for simulation.

Full size table

Table 2 Estimated parameters after training using different initial weight: 0.01 and 0.3

Full size table

Test on real data

To demonstrate the utility and further validate the performance of BIMMER, we applied BIMMER to a real dataset published in [4], which includes MBDCap-seq reads of whole genome methylation profiles from 10 normal and 75 breast cancer tissues from the 1000 methylome project (http://cbbiweb.uthscsa.edu/KMethylomes). The raw reads (FASTQ) file of MBDCap-seq data was first aligned to UCSC hg18 genome by BWA aligner [17]. The aligned SAM file was then converted to BED format later for further analysis.

The initial model parameters of the EM algorithm are defined in Table 3. Table 4 shows the estimated parameter set of the second hidden layer. The weight $α$ was predicted to be 0.3519, which means the transition probability A_n possesses about 35.2% of influence while the conditional probability $P (n_{t} | d_{t}, c_{t})$ has about a weight of 64.8% on the state of n_t.

Table 3 Initial values and the prior probabilities of BIMMER.

Full size table

Table 4 The estimated parameters of the second hidden layer

Full size table

Among the entire genome, about 8.83% of the bins were detected with differential methylation. Among these differential methylated bins, 95.6% of them are hypo-methylation (less degree of methylation in cancer), while only a minority of bins (4.4%) presented hyper-methylation (more degree of methylation in the cancer samples). Genome-wide differential rates on 4 regions (promoter region (±2kbp of transcription start position), enhancer region (100kbp after transcription end position), exons region and gene body) are plotted in Figure 4, where the detailed differential rates of the 4 regions in the 24 chromosomes are shown in Table 5. As expected, the promoter region and the exon possess higher differential methylation rate than the enhancer regions and gene body. Interestingly, chromosomes 1-2 have a significantly higher differential methylation rates in 4 genomic regions than those regions in other chromosomes.

Table 5 Differential rate of 4 regions on 24 chromosomes

Full size table

Next the genome-wide methylation information was mapped to individual genes to determine whether a gene is differential methylated in the cancer samples vs. the normal samples. For this mapping, 17814 gene symbols were selected (TCGA-BRCA entry). The location information of these gene symbols were downloaded and mapped to the bin location. To avoid possible false positive, a permutation test was conducted on the predicted methylation result to obtain the prediction p-value and a 0.05 significant level was applied. Among this 17814 genes, 293 genes (additional file 1) were detected with significant differential methylation. The methylation status was very similar to that of the genome-wide result, where among these 293, only 4 genes were hyper-methylated and the rest are all hypo-methylated. Table 6 listed the top 20 methylated genes according to their differential methylation rates. The first ranked gene is CDC5L, which encodes the cell division cycle 5-like protein. [18, 19] Research showed that this gene is highly involved in the RNA-splicing and could be a target for cancers [20, 21]. The next 16 genes that shared the same differential methylation rate include BCL3, which is highly involved in breast cancer metastasis and tumor progression [22–24], c6orf123 and c6orf124, two RNA genes which have been showed to be associated with ovarian cancer, CRYAB, a tumor suppressor gene [25], CRIP2, a gene that encodes the cysteine rich intestinal protein 2 and has been implicated to have effect on suppressing tumorigenesis [26], HSD17B1 gene that produces an enzyme that catalyzes the conversion of esrone to estradiol, and is hypothesized to influence endometrial and breast cancer risk[27, 28], and PTPN12, which has been shown to be involved in the ovarian cancer and breast cancer and is also survival related [29–31]. Over all, a lot of top ranked differential methylated genes show associations with breast cancer or other cancers. In addition, the differential methylation status of three sets of breast cancer genes including six survival related genes, 8 tumor size related genes and eight ER+ related genes (Table 7) were examined. For the six survival related genes, 3 out of five were detected with significant differential methylation. In contrast, both tumor size related and ER+ gene sets have about 25% differential methylation rate. The differential methylation density maps of a subset of these genes were also shown in Figure 5. The density clearly confirms that BIMMER has correctly identified the differential methylation regions and the advantages of the HMM model for differential methylation analysis is clearly shown. When a bin having similar reads counts in cancer and normal sample sits in the middle of a stretch differential methylated region (Figure 5.A. second DMR region; Figure 5.C last DMR region), it will be predicted differential methylation by BIMMER because BIMMER considers correlation between adjacent bins. This gives BIMMER the ability to avoid possible false negative predictions.

Table 6 Top 20 differential methylated gene

Full size table

Table 7 Differential rate of normal and patient samples for 22 breast cancer related genes

Full size table

Discussion and future work

In this work, BIMMER, an HMM based algorithm for DMRs detection for MBDCap-seq data is proposed. BIMMER models the methylation status and differential methylation status simultaneously, which does not suffer from the error propagation of existing two-step DMRs detection algorithms. In addition, BIMMER can handle replicate samples at the same time, producing more coherent detections. BIMMER relies on an EM algorithm to estimate the model parameters jointly. BIMMER was validated using simulated data and applied to real breast cancer datasets.

In the future work, four possible aspects could contribute to the performance improvement of BIMMER. First, adding more states of differential methylation into the HMM model and including hyper- and hypo- methylation type status will clearly provide better interpretation of the result. Second, more accurate models can be developed to model the differential methylation status and methylations in different phenotype of samples. Third, more accurate solution could be introduced to replace the weighted average approach. For example, product of experts (PoE) [32] has been shown to be a power tool in recent studies. Finally, more epigenetic information such as CpG island or histone modification can be included into BIMMER to produce biologically more relevant results.

References

Kulis M, Esteller M: DNA methylation and cancer. Adv Genet. 2010, 70: 27-56.
Article PubMed Google Scholar
Yang AS: A simple method for estimating global DNA methylation using bisulfite PCR of repetitive DNA elements. Nucleic Acids Research. 2004, 32 (3):
Hon GC: Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome Res. 2012, 22 (2): 246-58. 10.1101/gr.125872.111.
Article PubMed Central CAS PubMed Google Scholar
Gu F: CMS: a web-based system for visualization and analysis of genome-wide methylation data of human cancers. PLoS One. 2013, 8 (4): e60980-10.1371/journal.pone.0060980.
Article PubMed Central CAS PubMed Google Scholar
Robinson MD: Evaluation of affinity-based genome-wide DNA methylation data: effects of CpG density, amplification bias, and copy number variation. Genome Res. 2010, 20 (12): 1719-29. 10.1101/gr.110601.110.
Article PubMed Central CAS PubMed Google Scholar
Bock C: Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat Biotechnol. 2010, 28 (10): 1106-14. 10.1038/nbt.1681.
Article PubMed Central CAS PubMed Google Scholar
Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26 (12): 1351-9. 10.1038/nbt.1508.
Article PubMed Central CAS PubMed Google Scholar
Zhang Y: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.
Article PubMed Central PubMed Google Scholar
Qin ZS: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics. 2010, 11: 369-10.1186/1471-2105-11-369.
Article PubMed Central PubMed Google Scholar
Seifert M: MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics. 2012, 28 (22): 2930-9. 10.1093/bioinformatics/bts562.
Article CAS PubMed Google Scholar
Nair NU: ChIPnorm: a statistical method for normalizing and identifying differential regions in histone modification ChIP-seq libraries. PLoS One. 2012, 7 (8): e39573-10.1371/journal.pone.0039573.
Article PubMed Central CAS PubMed Google Scholar
Xu H: An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics. 2008, 24 (20): 2344-9. 10.1093/bioinformatics/btn402.
Article CAS PubMed Google Scholar
Taslim C: Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics. 2009, 25 (18): 2334-40. 10.1093/bioinformatics/btp384.
Article PubMed Central CAS PubMed Google Scholar
Yudkin HL: Channel state testing in information decoding. 1965, 126-leaves
Google Scholar
Malentacchi F: Quantitative evaluation of DNA methylation by optimization of a differential-high resolution melt analysis protocol. Nucleic Acids Res. 2009, 37 (12): e86-10.1093/nar/gkp383.
Article PubMed Central PubMed Google Scholar
Lewin J: Comparative DNA methylation analysis in normal and tumour tissues and in cancer cell lines using differential methylation hybridisation. Int J Biochem Cell Biol. 2007, 39 (7-8): 1539-50. 10.1016/j.biocel.2007.03.006.
Article CAS PubMed Google Scholar
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-60. 10.1093/bioinformatics/btp324.
Article PubMed Central CAS PubMed Google Scholar
Ajuh P: Functional analysis of the human CDC5L complex and identification of its components by mass spectrometry. EMBO J. 2000, 19 (23): 6569-81. 10.1093/emboj/19.23.6569.
Article PubMed Central CAS PubMed Google Scholar
Groenen PM: Rearrangement of the human CDC5L gene by a t(6;19)(p21;q13.1) in a patient with multicystic renal dysplasia. Genomics. 1998, 49 (2): 218-29. 10.1006/geno.1998.5254.
Article CAS PubMed Google Scholar
Ganesh K: CTNNBL1 is a novel nuclear localization sequence-binding protein that recognizes RNA-splicing factors CDC5L and Prp31. Journal of Biological Chemistry. 2011, 286 (19): 17091-102. 10.1074/jbc.M110.208769.
Article PubMed Central CAS PubMed Google Scholar
Lu XY: Cell cycle regulator gene CDC5L, a potential target for 6p12-p21 amplicon in osteosarcoma. Mol Cancer Res. 2008, 6 (6): 937-46. 10.1158/1541-7786.MCR-07-2115.
Article PubMed Central CAS PubMed Google Scholar
Wakefield A: Bcl3 selectively promotes metastasis of ERBB2-driven mammary tumors. Cancer Res. 2013, 73 (2): 745-55. 10.1158/0008-5472.CAN-12-1321.
Article CAS PubMed Google Scholar
Mitsui T: Inhibition of Bcl3 gene expression mediates the anti-proliferative action of estrogen in pituitary lactotrophs in primary culture. Mol Cell Endocrinol. 2011, 345 (1-2): 68-78. 10.1016/j.mce.2011.07.021.
Article CAS PubMed Google Scholar
Choi HJ: Bcl3-dependent stabilization of CtBP1 is crucial for the inhibition of apoptosis and tumor progression in breast cancer. Biochem Biophys Res Commun. 2010, 400 (3): 396-402. 10.1016/j.bbrc.2010.08.084.
Article CAS PubMed Google Scholar
Huang Z: Tumor suppressor Alpha B-crystallin (CRYAB) associates with the cadherin/catenin adherens junction and impairs NPC progression-associated properties. Oncogene. 2012, 31 (32): 3709-20. 10.1038/onc.2011.529.
Article CAS PubMed Google Scholar
Cheung AK: Cysteine-rich intestinal protein 2 (CRIP2) acts as a repressor of NF-kappaB-mediated proangiogenic cytokine transcription to suppress tumorigenesis and angiogenesis. Proc Natl Acad Sci USA. 2011, 108 (20): 8390-5. 10.1073/pnas.1101747108.
Article PubMed Central CAS PubMed Google Scholar
Zhang LS: Association of Genetic Polymorphisms in HSD17B1, HSD17B2 and SHBG Genes with Hepatocellular Carcinoma Risk. Pathol Oncol Res. 2014
Google Scholar
Obazee O: Confirmation of the reduction of hormone replacement therapy-related breast cancer risk for carriers of the HSD17B1_937_G variant. Breast Cancer Res Treat. 2013, 138 (2): 543-8. 10.1007/s10549-013-2448-7.
Article CAS PubMed Google Scholar
Luo RZ: Decreased Expression of PTPN12 Correlates with Tumor Recurrence and Poor Survival of Patients with Hepatocellular Carcinoma. PLoS One. 2014, 9 (1): e85592-10.1371/journal.pone.0085592.
Article PubMed Central PubMed Google Scholar
Villa-Moruzzi E: PTPN12 controls PTEN and the AKT signalling to FAK and HER2 in migrating ovarian cancer cells. Mol Cell Biochem. 2013, 375 (1-2): 151-7.
CAS PubMed Google Scholar
Sun T: Activation of multiple proto-oncogenic tyrosine kinases in breast cancer via loss of the PTPN12 phosphatase. Cell. 2011, 144 (5): 703-18. 10.1016/j.cell.2011.02.003.
Article CAS PubMed Google Scholar
Hinton GE, Osindero S, Teh YW: A fast learning algorithm for deep belief nets. Neural Computation. 2006, 18 (7): 1527-1554. 10.1162/neco.2006.18.7.1527.
Article PubMed Google Scholar

Download references

Declaration

Publication of this article has been funded by NSF Grant (CCF-1246073) to YH and a Qatar National Research Fund(09-897-3-235)to YH and YC to support this project. This work also has been supported by the help of UTSA Computational System Biology Core,funded by the National Institute on Minority Health and Health Disparities(G12MD007591) from the National Institutes of Health.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 12, 2014: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S12.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, TX, 78249-0669, USA
Zijing Mao, Chifeng Ma & Yufei Huang
Department of Molecular Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
Tim H-M Huang
Department of Epidemiology and Biostatistics, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
Yidong Chen & Yufei Huang
Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
Yidong Chen
Cancer Therapy and Research Center, University of Texas Health Science Center at San Antonio, San Antonio, TX, 78229, USA
Tim H-M Huang & Yidong Chen

Authors

Zijing Mao
View author publications
You can also search for this author in PubMed Google Scholar
Chifeng Ma
View author publications
You can also search for this author in PubMed Google Scholar
Tim H-M Huang
View author publications
You can also search for this author in PubMed Google Scholar
Yidong Chen
View author publications
You can also search for this author in PubMed Google Scholar
Yufei Huang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Yidong Chen or Yufei Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ZM and CM designed the method and drafted the manuscript. TH provided the data. YC helped with preprocessing the data. YC and YH supervised the work, made critical revisions of the paper and approved the submission of the manuscript.

Zijing Mao, Chifeng Ma contributed equally to this work.

Electronic supplementary material

12859_2014_6677_MOESM1_ESM.xlsx

Additional File 1: List of significantly differentially methylated genes reported by BIMMER. Differential methylation rates and methylation status are also provided. (XLSX 15 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Mao, Z., Ma, C., Huang, T.HM. et al. BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data. BMC Bioinformatics 15 (Suppl 12), S6 (2014). https://doi.org/10.1186/1471-2105-15-S12-S6

Download citation

Published: 06 November 2014
DOI: https://doi.org/10.1186/1471-2105-15-S12-S6

Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics

BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data

Abstract

Introduction