BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data

DNA methylation is a common epigenetic marker that regulates gene expression. A robust and cost-effective way for measuring whole genome methylation is Methyl-CpG binding domain-based capture followed by sequencing (MBDCap-seq). In this study, we proposed BIMMER, a Hidden Markov Model (HMM) for differential Methylation Regions (DMRs) identification, where HMMs were proposed to model the methylation status in normal and cancer samples in the first layer and another HMM was introduced to model the relationship between differential methylation and methylation statuses in normal and cancer samples. To carry out the prediction for BIMMER, an Expectation-Maximization algorithm was derived. BIMMER was validated on the simulated data and applied to real MBDCap-seq data of normal and cancer samples. BIMMER revealed that 8.83% of the breast cancer genome are differentially methylated and the majority are hypo-methylated in breast cancer.


Introduction
DNA methylation refers to the chemical modification of DNA nucleotides. One of the most common DNA methylation is the modification of cytosine, which typically occurs in CpG sites. When CpG sites in the promoter region that transcription factors bind are methylated, permanent silencing of gene expression is observed in the cell. DNA methylation is highly prevalent in cancer, involved in almost all types of cancer development by altering the normal regulation of gene expression and silencing the tumor suppressor genes [1]. There are three sequencing-based technologies for whole-genome DNA methylation profiling: bisulfite treatment [2] based or bisulfite sequencing, methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq) [3], and Methyl-CpG binding domain-base capture followed by sequencing (MBDCapseq) [4]. Among the three technologies, MBDCap-seq has higher dynamic range and better sensitivities and it detects more enrichment in CpG-dense methylated DNA regions [5,6]. We choose to focus on MDBCap-seq data analysis in this study.
Two computational problems concern these genomewide methylation data including methylation site detection and differential methylation region (DMR) detection. The problem of methylation site detection is similar to the peak detection for ChIP-seq. However, since methylation signals give rise to wider sequence read distribution than that from ChIP-seq peak identification algorithms such as SPP [7] and MACS [8] that are designed primarily for ChIP-seq data analysis would produce poor identification of methylation sites. Specific changes and new algorithms have been proposed to account for the nature of wider read distribution in methylation sequencing data. For instance, Hidden Markov Model [9,10] have been proposed to model the correlation between adjacent bins of a methylation site. The main aim of DMR detection is to identify aberrant DNA methylation regions that are specifically associated with disease phenotype. It is also fundamental to understanding the cause of altered gene expression in cancer. Most of the popular DMR detection pipelines includes two parts: the first part concerns detection of methylation sites in normal and disease samples individually and the second part includes identification of differential methylated regions in disease sample versus normal samples [11]. Many algorithms for differential methylation detection have been proposed including for example ChIPnorm [11] and ChIPDiff [12]. ChIPnorm performs a quantile normalization on normal and disease samples and applies differential analysis to detect DMRs, whereas ChIPDiff detects enriched methylation regions in normal and disease samples with a Binomial model, and then performs differential analysis based on a HMM. These existing algorithms are very powerful tools in differential methylation analysis but they have clear disadvantages especially when applied to MBDCapseq. First, the targeted resolutions of the data are relatively low, for instance, the bin size in ChIPDiff is 1000 base pairs (bps). However for a typical MBDCap-seq data, the resolution is normally 100bp bins. Second, these existing pipelines mentioned above are two-step procedures, which are prone to error propagation. If there is an error in methylation site detection, this error will be passed on to the following DMR detection step and impact negatively the performance of differential methylation. Last but not the least, with an exception in [13], most existing algorithms were developed to handle single sample. When replicates are available, they perform prediction on individual samples separately and then fuse the detection results from individual together. Such fusion based algorithms is easily influenced by the erroneous predictions made at individual level. The algorithm in [13] applies LOESS to normalize the difference between replicate samples. However, it assumes that only a small portions of methylation regions are DMRs [11], which might not be applicable to all cases.
In this paper, we proposed a novel algorithm for differential methylation regions (DMRs) detection based on Hidden Markov Model (HMM) and we call the algorithm BIMMER. BIMMER models the methylation status and detect DMRs in normal and cancer samples simultaneously. By doing this, BIMMER avoids error propagation in the existing two-step pipelines and therefore can improve the performance of DMRs detection. BIMMER was tested first on a simulated datasets and applied to a real breast cancer MBDCAP-seq data. The results from breast cancer data revealed that there are 8.83% of 30,804,183 bines detected with differentially methylated status, most of which are hypo-methylated in breast cancer samples.

Notation
Each MBDCap-seq data sample is pre-processed to be in a BED file, which records the sequence reads counts in consecutive 100 base pair (bp) bins over the entire genome. Let's denote the sample size of the normal sample MBDCap-Seq datasets as N 1 , that of the cancer dataset as N 2 , and the total number of the bins is denoted as M. We further denote the reads count of the i th bin in N 1 normal samples by a vector X n i = X n i .1 , X n i .2 , . . . , X n i .N 1 , where X n i ,j is the reads count of the j th sample, and similarly the reads count of the i th bin in N 2 cancer samples by where X c i,k represents the reads count in the k th sample. The aim of this work is to predict the differential methylation status of the cancer samples over the normal samples for every bin in the genome.

Two layer HMM model for differential methylation
A bin is considered differential methylated if its methylation status in the cancer sample is different from that in the normal samples. Therefore, the methylation models for the N 1 normal samples and the N 2 cancer samples needs to be defined before proceeding to model the differential methylation. To model the methylation status, let m n i = [0, 1] denote the methylation status of the i th bin for the normal sample, where m n i = 1 when the i th bin is methylated and m n i = 0 otherwise. Because the methylation statuses in the adjacent binds are highly correlated, a first order Markov chain is introduced (Figure 1), where the transition probability is defined as A n (i) = p m n i |m n i−1 and the initial probability for bin 1 is defined as τ n (i) = P m n 1 . Then for each bin, the read counts would depend on the methylation status, which is modeled as an i.i.d discrete distribution as B n r j , d = N 1 j=1 P x n j = r j |m n j = d . Taken together, the methylation in the normal samples is modeled by an HMM. Similarly, the methylation status for the cancer samples can be also modeled by an HMM. Specifically, if let m c i = [0, 1] denote the methylation status of the j th bin of the cancer samples, the transition probability and the initial state probability are modeled as A c (i) = p m c i |m c i −1 and τ c (i) = P m c 1 , respectfully and the emission probability is represented as k=1 P x c k = r k |m c k = d . Next, let differential status at the i th bin denoted by dm i = [0, 1], where dm i = 1, when m n i = m c i 0, otherwise . Because the differential methylation statuses for the adjacent bins are also correlated, dm i is further assumed to follow another first order Markov chain (Figure 1), whose transition probability and initial state probability are defined as A dm (i) = p (dm i |dm i−1 ) and τ dm (i) = P (dm i ). Finally, we need to model the relationship between the differential methylation status dm i and the methylation statuses m n i and m c i . In this work, we propose to model it as depicted in Figure 1 by the emission probability P(m n i |m n i−1 , m c i , dm i ), i.e., the normal sample methylation status depends directly on the cancer sample methylation status and the differential methylation status, in addition to its own correlations between adjacent bins. It is easy to see from Figure 1 that there are two sets of relatively well defined relationships involved in this emission probability: p m n i |m n i−1 and P m n i |m c i , dm i . The first one p m n i |m n i−1 is the transition probability for the methylation status in normal sample and the second oneP m n i |m c i , dm i models the dependence of m n i on dm i anddm i , which can be intuitively defined as P m ni |m ci , dm i = 0, if m ni = m ci but dm i = 1, or m ni = m ci but dm i = 0 and P m n i |m c i , dm i = 1, otherwise, i.e., m n i and m c i have to be different if dm i = 1and otherwise they must be the same. Now, the question is how to integrate p m n i |m n i−1 and P m n i |m c i , dm i to model the emission probabilityP m n i |m n i−1 , m c i , dm i . To this end, a popular approach in data fusion is adopted, which combines them through a weighted sum: whereα is the weighting factor to be determined from data. Taken together, we propose a two-layer HMM model as depicted in Figure 1 for differential methylation and we refer this model as BIMMER. With BIMMER, the differential methylation status is predicted according to the posterior distribution P(dm i |X n i , X c i ∀i). This posterior distribution does not depend on the hard decisions on the methylation states for the normal and cancer samples and therefore overcomes the aforementioned problems of error propagation. However, one difficulty is that in the calculation of this posterior distribution, we need to calculate the integration of all 9 model parameters: τ n, τ c, B n , B c , A dm and α, A c , τ dm, A dm and α, which is analytically intractable. To solve this problem, we propose the next Expectation and Maximization (EM) solution.

The EM solution
Let X n denote the collection of the reads counts in all M bins for all N 1 normal samples and X c the collection of reads counts in all M bins for all N 2 cancer samples. Also, In order to obtain the EM solution, X n and X c are treated as the observed data but m n and m c are considered as the unobserved data for the first layer HMM while dm is the unobserved data for the second layer HMM. Here, is used to denote the model parameter set. For the simplicity of the computation, the first layer HMM parametersτn, τ c, B n , B c , A n , A c are learned directly from X n and X c with Baum-Welch algorithm and excluded from the EM process. Therefore, the parameter set for BIM-MER includes 3 parameter: = {τ dm , A dm , α}. Given a set of initial or estimated parameters, the complete data likelihood function is Then the log-likelihood function can be expressed as log (l ( )) = log (τdm) [log (P (xni|mni, )) + log ((xci|mci, )) + log P mci|mc(i−1), At the k th iteration, suppose that the estimated parameter set at the previous iteration is k−1 . Then, at Estep, the conditional expectation of this log complete data likelihood is calculated and where X n,p:q and X c,p:q denote the collection of the reads counts from bin p to bin q from the N 1 normal Then, at M-step, the parameter set k is updated from k−1 by maximizing the likelihood expectation with respect to k−1 . This process is equivalent to maximizing he Q function with respective to the parameters where the last equation is calculated by the Newton-Raphson algorithm. Maximizing this Q function guarantees that the likelihood L k is always greater than L k−1 , hence ensures global convergence of the solution.

Model initialization and prediction of DMRs
To implement the EM algorithm, the initial parameter set Φ (0) and the parameters for the first layer needs to be carefully defined because specific choice of these initial parameter values could lead to difference local optimal solutions and affect the prediction performance. After the convergence of the EM solution, the differential methylation statuses, dm, are predicted using the Viterbi algorithm [14] as the chain of the states with the largest probability given the estimated parameter set. Additionally, the methylation statuses m c and m n can also be predicted using the Viterbi algorithm provided the parameters of the first layer HMM are set to the estimated ones.

Results
BIMMER was validated on both simulated data and applied to a real breast cancer dataset. It was first tested on the simulated systems, where the data models were assumed known. Then, BIMMER was applied to a real breast cancer dataset to explore the state of differential methylation.

Test on simulated data
A test dataset was simulated based on the graphical model in Figure 1 to evaluate the performance of BIMMER. A chain of dm was first generated based on given τ dm and A dm . The methylation status in normal and cancer sample m c and m n were then generated based on a set of τ c, τ n , A c , A n and weight parameter α. The read counts in each bin of the normal and cancer sample X c and X n were generated according to the emission probabilities B n amd B c . In addition, a Poisson noise was also added to the reads. Multiple sets of 10 samples with 200,000 bins were generated with different transition probabilities and weight factors (See Table 1 for detailed parameter settings). For comparison, two commonly used differential analysis algorithms: two sample t-test and Wilcoxon Test [15,16] were also applied to the simulation data. To test the performance of BIM-MER under different conditions, two scenarios were considered, where in the first all the parameters except the transition probabilities were fixed, whereas in the other situation only the weight factors were allowed to change. The prediction results were evaluated by the precision and recall (PR) curve and receiver operating characteristic (ROC) curve while the area under the curve (AUC) were calculated for each algorithm (Figure 2). For all the simulation tests, BIMMER outperformed both the two sample t-test and Wilcoxon test. In each simulated scenario, the performance of Wilcoxon test and the two sample t-test were very similar. The Wilcoxon slightly outperformed the t-test because the Wilcoxon test is more robust. For the Table 1 Parameter set used for simulation  scenario, the transition probabilities of the simulation were sets from 0.9 to 0.7 and the performance of BIM-MER did not change much in terms of its PR and ROC curve. For the second scenario, as the weight factors increasing, the performance of BIMMER slightly decreases. This makes sense because the weight factors actually models the contribution factors of two probabilities p m n i |m n i−1 and P m n i |m c i , dm i . The larger this weight factors is, the more uncertainty exist. As the result, the prediction performance decrease. We further tested the influence of different initial values of the weight on the final prediction of differential methylation in the EM solution. This requires to be tested because α is unique in our model. Different initial weights (0.01 and 0.3) were tested used in three simulations and the prediction performance of BIMMER (Figure 3) showed little difference, indicating that the initial ω has little influence on BIMMER's prediction results. (The simulation transition probability and the training result are provided in Table 2) In conclusion, BIMMER can produce satisfactory prediction results on the simulation data; it is also robust against changes in the transition probabilities and the weights.

Test on real data
To demonstrate the utility and further validate the performance of BIMMER, we applied BIMMER to a real dataset published in [4], which includes MBDCap-seq reads of whole genome methylation profiles from 10 normal and 75 breast cancer tissues from the 1000 methylome project (http://cbbiweb.uthscsa.edu/KMethylomes). The raw reads (FASTQ) file of MBDCap-seq data was first aligned to UCSC hg18 genome by BWA aligner [17]. The aligned SAM file was then converted to BED format later for further analysis.
The initial model parameters of the EM algorithm are defined in Table 3. Table 4 shows the estimated parameter set of the second hidden layer. The weight α was predicted to be 0.3519, which means the transition probability A n possesses about 35.2% of influence while the conditional probability P (n t |d t , c t ) has about a weight of 64.8% on the state of n t .
Among the entire genome, about 8.83% of the bins were detected with differential methylation. Among these differential methylated bins, 95.6% of them are hypo-methylation (less degree of methylation in cancer), while only a minority of bins (4.4%) presented hyper-methylation (more degree of methylation in the cancer samples). Genome-wide differential rates on 4 regions (promoter region (±2kbp of transcription start position), enhancer region (100kbp after transcription end position), exons region and gene body) are plotted in Figure 4, where the detailed differential rates of the 4 regions in the 24 chromosomes are shown in Table 5. As expected, the promoter region and the exon possess higher differential methylation rate than the enhancer regions and gene body. Interestingly, chromosomes 1-2 have a significantly higher differential methylation rates in 4 genomic regions than those regions in other chromosomes.
Next the genome-wide methylation information was mapped to individual genes to determine whether a gene is differential methylated in the cancer samples vs. the normal samples. For this mapping, 17814 gene symbols were selected (TCGA-BRCA entry). The location information of these gene symbols were downloaded and mapped to the bin location. To avoid possible false    Table. 3-1 Table. 3-2 Table. 3-3  Table. 3-4 Table. 3-1 to Table. 3-3 are initial transition probabilities for m n , m c and dm; Table. 3-4 enlists the initial probabilities of m n , m c and dm.   positive, a permutation test was conducted on the predicted methylation result to obtain the prediction p-value and a 0.05 significant level was applied. Among this 17814 genes, 293 genes (additional file 1) were detected with significant differential methylation. The methylation status was very similar to that of the genome-wide result, where among these 293, only 4 genes were hyper-methylated and the rest are all hypo-methylated. Table 6 listed the top 20 methylated genes according to their differential methylation rates. The first ranked gene is CDC5L, which encodes the cell division cycle 5-like protein. [18,19] Research showed that this gene is highly involved in the RNA-splicing and could be a target for cancers [20,21]. The next 16 genes that shared the same differential methylation rate include BCL3, which is highly involved in breast cancer metastasis and tumor progression [22][23][24], c6orf123 and c6orf124, two RNA genes which have been showed to be associated with ovarian cancer, CRYAB, a tumor suppressor gene [25], CRIP2, a gene that encodes the cysteine rich intestinal protein 2 and has been implicated to have effect on suppressing tumorigenesis [26], HSD17B1 gene that produces an enzyme that catalyzes the conversion of esrone to estradiol, and is hypothesized to influence endometrial and breast cancer risk [27,28], and PTPN12, which has been shown to be involved in the ovarian cancer and breast cancer and is also survival related [29][30][31]. Over all, a lot of top ranked differential methylated genes show associations with breast cancer or other cancers. In addition, the differential methylation status of three sets of breast cancer genes including six survival related genes, 8 tumor size related genes and eight ER+ related genes (Table 7) were examined. For the six survival related genes, 3 out of five were detected with significant differential methylation. In contrast, both tumor size related and ER+ gene sets have about 25% differential methylation rate. The differential methylation density maps of a subset of these genes were also shown in Figure 5. The density clearly confirms that BIMMER has correctly identified the differential methylation regions and the advantages of the HMM model for differential methylation analysis is clearly shown. When a bin having similar reads counts in cancer and normal sample sits in the middle of a stretch differential methylated region ( Figure 5.A. second DMR region; Figure 5.C last DMR region), it will be predicted differential methylation by BIMMER because BIMMER considers correlation between adjacent bins. This gives BIMMER the ability to avoid possible false negative predictions.

Discussion and future work
In this work, BIMMER, an HMM based algorithm for DMRs detection for MBDCap-seq data is proposed. BIM-MER models the methylation status and differential methylation status simultaneously, which does not suffer from the error propagation of existing two-step DMRs detection algorithms. In addition, BIMMER can handle replicate samples at the same time, producing more coherent detections. BIMMER relies on an EM algorithm to estimate the model parameters jointly. BIMMER was validated using simulated data and applied to real breast cancer datasets.
In the future work, four possible aspects could contribute to the performance improvement of BIMMER. First, adding more states of differential methylation into the HMM model and including hyper-and hypo-methylation type status will clearly provide better interpretation of the result. Second, more accurate models can be developed to model the differential methylation status and methylations in different phenotype of samples. Third, more accurate solution could be introduced to replace the weighted average approach. For example, product of experts (PoE) [32] has been shown to be a power tool in recent studies. Finally, more epigenetic information such as CpG island or histone modification can be included into BIMMER to produce biologically more relevant results.