BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data
BMC Bioinformatics volume 15, Article number: S6 (2014)
DNA methylation is a common epigenetic marker that regulates gene expression. A robust and cost-effective way for measuring whole genome methylation is Methyl-CpG binding domain-based capture followed by sequencing (MBDCap-seq). In this study, we proposed BIMMER, a Hidden Markov Model (HMM) for differential Methylation Regions (DMRs) identification, where HMMs were proposed to model the methylation status in normal and cancer samples in the first layer and another HMM was introduced to model the relationship between differential methylation and methylation statuses in normal and cancer samples. To carry out the prediction for BIMMER, an Expectation-Maximization algorithm was derived. BIMMER was validated on the simulated data and applied to real MBDCap-seq data of normal and cancer samples. BIMMER revealed that 8.83% of the breast cancer genome are differentially methylated and the majority are hypo-methylated in breast cancer.
DNA methylation refers to the chemical modification of DNA nucleotides. One of the most common DNA methylation is the modification of cytosine, which typically occurs in CpG sites. When CpG sites in the promoter region that transcription factors bind are methylated, permanent silencing of gene expression is observed in the cell. DNA methylation is highly prevalent in cancer, involved in almost all types of cancer development by altering the normal regulation of gene expression and silencing the tumor suppressor genes . There are three sequencing-based technologies for whole-genome DNA methylation profiling: bisulfite treatment based or bisulfite sequencing, methylated DNA immunoprecipitation followed by sequencing (MeDIP-seq), and Methyl-CpG binding domain-base capture followed by sequencing (MBDCap-seq). Among the three technologies, MBDCap-seq has higher dynamic range and better sensitivities and it detects more enrichment in CpG-dense methylated DNA regions [5, 6]. We choose to focus on MDBCap-seq data analysis in this study.
Two computational problems concern these genome-wide methylation data including methylation site detection and differential methylation region (DMR) detection. The problem of methylation site detection is similar to the peak detection for ChIP-seq. However, since methylation signals give rise to wider sequence read distribution than that from ChIP-seq peak identification algorithms such as SPP and MACS that are designed primarily for ChIP-seq data analysis would produce poor identification of methylation sites. Specific changes and new algorithms have been proposed to account for the nature of wider read distribution in methylation sequencing data. For instance, Hidden Markov Model [9, 10] have been proposed to model the correlation between adjacent bins of a methylation site. The main aim of DMR detection is to identify aberrant DNA methylation regions that are specifically associated with disease phenotype. It is also fundamental to understanding the cause of altered gene expression in cancer. Most of the popular DMR detection pipelines includes two parts: the first part concerns detection of methylation sites in normal and disease samples individually and the second part includes identification of differential methylated regions in disease sample versus normal samples . Many algorithms for differential methylation detection have been proposed including for example ChIPnorm  and ChIPDiff . ChIPnorm performs a quantile normalization on normal and disease samples and applies differential analysis to detect DMRs, whereas ChIPDiff detects enriched methylation regions in normal and disease samples with a Binomial model, and then performs differential analysis based on a HMM. These existing algorithms are very powerful tools in differential methylation analysis but they have clear disadvantages especially when applied to MBDCap-seq. First, the targeted resolutions of the data are relatively low, for instance, the bin size in ChIPDiff is 1000 base pairs (bps). However for a typical MBDCap-seq data, the resolution is normally 100bp bins. Second, these existing pipelines mentioned above are two-step procedures, which are prone to error propagation. If there is an error in methylation site detection, this error will be passed on to the following DMR detection step and impact negatively the performance of differential methylation. Last but not the least, with an exception in , most existing algorithms were developed to handle single sample. When replicates are available, they perform prediction on individual samples separately and then fuse the detection results from individual together. Such fusion based algorithms is easily influenced by the erroneous predictions made at individual level. The algorithm in  applies LOESS to normalize the difference between replicate samples. However, it assumes that only a small portions of methylation regions are DMRs , which might not be applicable to all cases.
In this paper, we proposed a novel algorithm for differential methylation regions (DMRs) detection based on Hidden Markov Model (HMM) and we call the algorithm BIMMER. BIMMER models the methylation status and detect DMRs in normal and cancer samples simultaneously. By doing this, BIMMER avoids error propagation in the existing two-step pipelines and therefore can improve the performance of DMRs detection. BIMMER was tested first on a simulated datasets and applied to a real breast cancer MBDCAP-seq data. The results from breast cancer data revealed that there are 8.83% of 30,804,183 bines detected with differentially methylated status, most of which are hypo-methylated in breast cancer samples.
Each MBDCap-seq data sample is pre-processed to be in a BED file, which records the sequence reads counts in consecutive 100 base pair (bp) bins over the entire genome. Let's denote the sample size of the normal sample MBDCap-Seq datasets as , that of the cancer dataset as , and the total number of the bins is denoted as . We further denote the reads count of the bin in normal samples by a vector, where is the reads count of the sample, and similarly the reads count of the bin in cancer samples by , where represents the reads count in the sample. The aim of this work is to predict the differential methylation status of the cancer samples over the normal samples for every bin in the genome.
Two layer HMM model for differential methylation
A bin is considered differential methylated if its methylation status in the cancer sample is different from that in the normal samples. Therefore, the methylation models for the normal samples and the cancer samples needs to be defined before proceeding to model the differential methylation. To model the methylation status, let denote the methylation status of the bin for the normal sample, where when the bin is methylated and otherwise. Because the methylation statuses in the adjacent binds are highly correlated, a first order Markov chain is introduced (Figure 1), where the transition probability is defined as and the initial probability for bin 1 is defined as . Then for each bin, the read counts would depend on the methylation status, which is modeled as an i.i.d discrete distribution as Taken together, the methylation in the normal samples is modeled by an HMM. Similarly, the methylation status for the cancer samples can be also modeled by an HMM. Specifically, if let denote the methylation status of the bin of the cancer samples, the transition probability and the initial state probability are modeled as and respectfully and the emission probability is represented as Next, let differential status at the bin denoted by , where . Because the differential methylation statuses for the adjacent bins are also correlated, is further assumed to follow another first order Markov chain (Figure 1), whose transition probability and initial state probability are defined as and . Finally, we need to model the relationship between the differential methylation status and the methylation statuses and . In this work, we propose to model it as depicted in Figure 1 by the emission probability , i.e., the normal sample methylation status depends directly on the cancer sample methylation status and the differential methylation status, in addition to its own correlations between adjacent bins. It is easy to see from Figure 1 that there are two sets of relatively well defined relationships involved in this emission probability: and . The first one is the transition probability for the methylation status in normal sample and the second one models the dependence of on and, which can be intuitively defined as and , i.e., and have to be different if and otherwise they must be the same. Now, the question is how to integrate and to model the emission probability. To this end, a popular approach in data fusion is adopted, which combines them through a weighted sum: where is the weighting factor to be determined from data. Taken together, we propose a two-layer HMM model as depicted in Figure 1 for differential methylation and we refer this model as BIMMER. With BIMMER, the differential methylation status is predicted according to the posterior distribution. This posterior distribution does not depend on the hard decisions on the methylation states for the normal and cancer samples and therefore overcomes the aforementioned problems of error propagation. However, one difficulty is that in the calculation of this posterior distribution, we need to calculate the integration of all 9 model parameters: , , ,, ,,, , which is analytically intractable. To solve this problem, we propose the next Expectation and Maximization (EM) solution.
The EM solution
Let denote the collection of the reads counts in all M bins for all normal samples and the collection of reads counts in all M bins for all cancer samples. Also, let , and . In order to obtain the EM solution, and are treated as the observed data but and are considered as the unobserved data for the first layer HMM while is the unobserved data for the second layer HMM. Here, is used to denote the model parameter set. For the simplicity of the computation, the first layer HMM parameters,, ,, , are learned directly from and with Baum-Welch algorithm and excluded from the EM process. Therefore, the parameter set for BIMMER includes 3 parameter: . Given a set of initial or estimated parameters, the complete data likelihood function is
Then the log-likelihood function can be expressed as
At the iteration, suppose that the estimated parameter set at the previous iteration is . Then, at E-step, the conditional expectation of this log complete data likelihood is calculated
In order to obtain, the forward-backward algorithm is used, where
where and denote the collection of the reads counts from bin p to bin q from the N1 normal samples and the N2 cancer samples, respectively. In the forward step, we calculate
In the backward step, we have
Then, at M-step, the parameter set is updated from by maximizing the likelihood expectation with respect to . This process is equivalent to maximizing he function with respective to the parameters
The maximization yields
where the last equation is calculated by the Newton-Raphson algorithm. Maximizing this function guarantees that the likelihood is always greater than , hence ensures global convergence of the solution.
Model initialization and prediction of DMRs
To implement the EM algorithm, the initial parameter set Φ(0) and the parameters for the first layer needs to be carefully defined because specific choice of these initial parameter values could lead to difference local optimal solutions and affect the prediction performance. After the convergence of the EM solution, the differential methylation statuses, dm, are predicted using the Viterbi algorithm as the chain of the states with the largest probability given the estimated parameter set. Additionally, the methylation statuses can also be predicted using the Viterbi algorithm provided the parameters of the first layer HMM are set to the estimated ones.
BIMMER was validated on both simulated data and applied to a real breast cancer dataset. It was first tested on the simulated systems, where the data models were assumed known. Then, BIMMER was applied to a real breast cancer dataset to explore the state of differential methylation.
Test on simulated data
A test dataset was simulated based on the graphical model in Figure 1 to evaluate the performance of BIMMER. A chain of dm was first generated based on given and . The methylation status in normal and cancer sample and were then generated based on a set of , , and weight parameter . The read counts in each bin of the normal and cancer sample and were generated according to the emission probabilities amd . In addition, a Poisson noise was also added to the reads. Multiple sets of 10 samples with 200,000 bins were generated with different transition probabilities and weight factors (See Table 1 for detailed parameter settings). For comparison, two commonly used differential analysis algorithms: two sample t-test and Wilcoxon Test [15, 16] were also applied to the simulation data. To test the performance of BIMMER under different conditions, two scenarios were considered, where in the first all the parameters except the transition probabilities were fixed, whereas in the other situation only the weight factors were allowed to change. The prediction results were evaluated by the precision and recall (PR) curve and receiver operating characteristic (ROC) curve while the area under the curve (AUC) were calculated for each algorithm (Figure 2). For all the simulation tests, BIMMER outperformed both the two sample t-test and Wilcoxon test. In each simulated scenario, the performance of Wilcoxon test and the two sample t-test were very similar. The Wilcoxon slightly outperformed the t-test because the Wilcoxon test is more robust. For the scenario, the transition probabilities of the simulation were sets from 0.9 to 0.7 and the performance of BIMMER did not change much in terms of its PR and ROC curve. For the second scenario, as the weight factors increasing, the performance of BIMMER slightly decreases. This makes sense because the weight factors actually models the contribution factors of two probabilities . The larger this weight factors is, the more uncertainty exist. As the result, the prediction performance decrease. We further tested the influence of different initial values of the weight on the final prediction of differential methylation in the EM solution. This requires to be tested because is unique in our model. Different initial weights (0.01 and 0.3) were tested used in three simulations and the prediction performance of BIMMER (Figure 3) showed little difference, indicating that the initial ω has little influence on BIMMER's prediction results. (The simulation transition probability and the training result are provided in Table 2) In conclusion, BIMMER can produce satisfactory prediction results on the simulation data; it is also robust against changes in the transition probabilities and the weights.
Test on real data
To demonstrate the utility and further validate the performance of BIMMER, we applied BIMMER to a real dataset published in , which includes MBDCap-seq reads of whole genome methylation profiles from 10 normal and 75 breast cancer tissues from the 1000 methylome project (http://cbbiweb.uthscsa.edu/KMethylomes). The raw reads (FASTQ) file of MBDCap-seq data was first aligned to UCSC hg18 genome by BWA aligner . The aligned SAM file was then converted to BED format later for further analysis.
The initial model parameters of the EM algorithm are defined in Table 3. Table 4 shows the estimated parameter set of the second hidden layer. The weight was predicted to be 0.3519, which means the transition probability A n possesses about 35.2% of influence while the conditional probability has about a weight of 64.8% on the state of n t .
Among the entire genome, about 8.83% of the bins were detected with differential methylation. Among these differential methylated bins, 95.6% of them are hypo-methylation (less degree of methylation in cancer), while only a minority of bins (4.4%) presented hyper-methylation (more degree of methylation in the cancer samples). Genome-wide differential rates on 4 regions (promoter region (±2kbp of transcription start position), enhancer region (100kbp after transcription end position), exons region and gene body) are plotted in Figure 4, where the detailed differential rates of the 4 regions in the 24 chromosomes are shown in Table 5. As expected, the promoter region and the exon possess higher differential methylation rate than the enhancer regions and gene body. Interestingly, chromosomes 1-2 have a significantly higher differential methylation rates in 4 genomic regions than those regions in other chromosomes.
Next the genome-wide methylation information was mapped to individual genes to determine whether a gene is differential methylated in the cancer samples vs. the normal samples. For this mapping, 17814 gene symbols were selected (TCGA-BRCA entry). The location information of these gene symbols were downloaded and mapped to the bin location. To avoid possible false positive, a permutation test was conducted on the predicted methylation result to obtain the prediction p-value and a 0.05 significant level was applied. Among this 17814 genes, 293 genes (additional file 1) were detected with significant differential methylation. The methylation status was very similar to that of the genome-wide result, where among these 293, only 4 genes were hyper-methylated and the rest are all hypo-methylated. Table 6 listed the top 20 methylated genes according to their differential methylation rates. The first ranked gene is CDC5L, which encodes the cell division cycle 5-like protein. [18, 19] Research showed that this gene is highly involved in the RNA-splicing and could be a target for cancers [20, 21]. The next 16 genes that shared the same differential methylation rate include BCL3, which is highly involved in breast cancer metastasis and tumor progression [22–24], c6orf123 and c6orf124, two RNA genes which have been showed to be associated with ovarian cancer, CRYAB, a tumor suppressor gene , CRIP2, a gene that encodes the cysteine rich intestinal protein 2 and has been implicated to have effect on suppressing tumorigenesis , HSD17B1 gene that produces an enzyme that catalyzes the conversion of esrone to estradiol, and is hypothesized to influence endometrial and breast cancer risk[27, 28], and PTPN12, which has been shown to be involved in the ovarian cancer and breast cancer and is also survival related [29–31]. Over all, a lot of top ranked differential methylated genes show associations with breast cancer or other cancers. In addition, the differential methylation status of three sets of breast cancer genes including six survival related genes, 8 tumor size related genes and eight ER+ related genes (Table 7) were examined. For the six survival related genes, 3 out of five were detected with significant differential methylation. In contrast, both tumor size related and ER+ gene sets have about 25% differential methylation rate. The differential methylation density maps of a subset of these genes were also shown in Figure 5. The density clearly confirms that BIMMER has correctly identified the differential methylation regions and the advantages of the HMM model for differential methylation analysis is clearly shown. When a bin having similar reads counts in cancer and normal sample sits in the middle of a stretch differential methylated region (Figure 5.A. second DMR region; Figure 5.C last DMR region), it will be predicted differential methylation by BIMMER because BIMMER considers correlation between adjacent bins. This gives BIMMER the ability to avoid possible false negative predictions.
Discussion and future work
In this work, BIMMER, an HMM based algorithm for DMRs detection for MBDCap-seq data is proposed. BIMMER models the methylation status and differential methylation status simultaneously, which does not suffer from the error propagation of existing two-step DMRs detection algorithms. In addition, BIMMER can handle replicate samples at the same time, producing more coherent detections. BIMMER relies on an EM algorithm to estimate the model parameters jointly. BIMMER was validated using simulated data and applied to real breast cancer datasets.
In the future work, four possible aspects could contribute to the performance improvement of BIMMER. First, adding more states of differential methylation into the HMM model and including hyper- and hypo- methylation type status will clearly provide better interpretation of the result. Second, more accurate models can be developed to model the differential methylation status and methylations in different phenotype of samples. Third, more accurate solution could be introduced to replace the weighted average approach. For example, product of experts (PoE)  has been shown to be a power tool in recent studies. Finally, more epigenetic information such as CpG island or histone modification can be included into BIMMER to produce biologically more relevant results.
Kulis M, Esteller M: DNA methylation and cancer. Adv Genet. 2010, 70: 27-56.
Yang AS: A simple method for estimating global DNA methylation using bisulfite PCR of repetitive DNA elements. Nucleic Acids Research. 2004, 32 (3):
Hon GC: Global DNA hypomethylation coupled to repressive chromatin domain formation and gene silencing in breast cancer. Genome Res. 2012, 22 (2): 246-58. 10.1101/gr.125872.111.
Gu F: CMS: a web-based system for visualization and analysis of genome-wide methylation data of human cancers. PLoS One. 2013, 8 (4): e60980-10.1371/journal.pone.0060980.
Robinson MD: Evaluation of affinity-based genome-wide DNA methylation data: effects of CpG density, amplification bias, and copy number variation. Genome Res. 2010, 20 (12): 1719-29. 10.1101/gr.110601.110.
Bock C: Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat Biotechnol. 2010, 28 (10): 1106-14. 10.1038/nbt.1681.
Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26 (12): 1351-9. 10.1038/nbt.1508.
Zhang Y: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.
Qin ZS: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics. 2010, 11: 369-10.1186/1471-2105-11-369.
Seifert M: MeDIP-HMM: genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics. 2012, 28 (22): 2930-9. 10.1093/bioinformatics/bts562.
Nair NU: ChIPnorm: a statistical method for normalizing and identifying differential regions in histone modification ChIP-seq libraries. PLoS One. 2012, 7 (8): e39573-10.1371/journal.pone.0039573.
Xu H: An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics. 2008, 24 (20): 2344-9. 10.1093/bioinformatics/btn402.
Taslim C: Comparative study on ChIP-seq data: normalization and binding pattern characterization. Bioinformatics. 2009, 25 (18): 2334-40. 10.1093/bioinformatics/btp384.
Yudkin HL: Channel state testing in information decoding. 1965, 126-leaves
Malentacchi F: Quantitative evaluation of DNA methylation by optimization of a differential-high resolution melt analysis protocol. Nucleic Acids Res. 2009, 37 (12): e86-10.1093/nar/gkp383.
Lewin J: Comparative DNA methylation analysis in normal and tumour tissues and in cancer cell lines using differential methylation hybridisation. Int J Biochem Cell Biol. 2007, 39 (7-8): 1539-50. 10.1016/j.biocel.2007.03.006.
Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-60. 10.1093/bioinformatics/btp324.
Ajuh P: Functional analysis of the human CDC5L complex and identification of its components by mass spectrometry. EMBO J. 2000, 19 (23): 6569-81. 10.1093/emboj/19.23.6569.
Groenen PM: Rearrangement of the human CDC5L gene by a t(6;19)(p21;q13.1) in a patient with multicystic renal dysplasia. Genomics. 1998, 49 (2): 218-29. 10.1006/geno.1998.5254.
Ganesh K: CTNNBL1 is a novel nuclear localization sequence-binding protein that recognizes RNA-splicing factors CDC5L and Prp31. Journal of Biological Chemistry. 2011, 286 (19): 17091-102. 10.1074/jbc.M110.208769.
Lu XY: Cell cycle regulator gene CDC5L, a potential target for 6p12-p21 amplicon in osteosarcoma. Mol Cancer Res. 2008, 6 (6): 937-46. 10.1158/1541-7786.MCR-07-2115.
Wakefield A: Bcl3 selectively promotes metastasis of ERBB2-driven mammary tumors. Cancer Res. 2013, 73 (2): 745-55. 10.1158/0008-5472.CAN-12-1321.
Mitsui T: Inhibition of Bcl3 gene expression mediates the anti-proliferative action of estrogen in pituitary lactotrophs in primary culture. Mol Cell Endocrinol. 2011, 345 (1-2): 68-78. 10.1016/j.mce.2011.07.021.
Choi HJ: Bcl3-dependent stabilization of CtBP1 is crucial for the inhibition of apoptosis and tumor progression in breast cancer. Biochem Biophys Res Commun. 2010, 400 (3): 396-402. 10.1016/j.bbrc.2010.08.084.
Huang Z: Tumor suppressor Alpha B-crystallin (CRYAB) associates with the cadherin/catenin adherens junction and impairs NPC progression-associated properties. Oncogene. 2012, 31 (32): 3709-20. 10.1038/onc.2011.529.
Cheung AK: Cysteine-rich intestinal protein 2 (CRIP2) acts as a repressor of NF-kappaB-mediated proangiogenic cytokine transcription to suppress tumorigenesis and angiogenesis. Proc Natl Acad Sci USA. 2011, 108 (20): 8390-5. 10.1073/pnas.1101747108.
Zhang LS: Association of Genetic Polymorphisms in HSD17B1, HSD17B2 and SHBG Genes with Hepatocellular Carcinoma Risk. Pathol Oncol Res. 2014
Obazee O: Confirmation of the reduction of hormone replacement therapy-related breast cancer risk for carriers of the HSD17B1_937_G variant. Breast Cancer Res Treat. 2013, 138 (2): 543-8. 10.1007/s10549-013-2448-7.
Luo RZ: Decreased Expression of PTPN12 Correlates with Tumor Recurrence and Poor Survival of Patients with Hepatocellular Carcinoma. PLoS One. 2014, 9 (1): e85592-10.1371/journal.pone.0085592.
Villa-Moruzzi E: PTPN12 controls PTEN and the AKT signalling to FAK and HER2 in migrating ovarian cancer cells. Mol Cell Biochem. 2013, 375 (1-2): 151-7.
Sun T: Activation of multiple proto-oncogenic tyrosine kinases in breast cancer via loss of the PTPN12 phosphatase. Cell. 2011, 144 (5): 703-18. 10.1016/j.cell.2011.02.003.
Hinton GE, Osindero S, Teh YW: A fast learning algorithm for deep belief nets. Neural Computation. 2006, 18 (7): 1527-1554. 10.1162/neco.2006.18.7.1527.
Publication of this article has been funded by NSF Grant (CCF-1246073) to YH and a Qatar National Research Fund(09-897-3-235)to YH and YC to support this project. This work also has been supported by the help of UTSA Computational System Biology Core,funded by the National Institute on Minority Health and Health Disparities(G12MD007591) from the National Institutes of Health.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 12, 2014: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S12.
The authors declare that they have no competing interests.
ZM and CM designed the method and drafted the manuscript. TH provided the data. YC helped with preprocessing the data. YC and YH supervised the work, made critical revisions of the paper and approved the submission of the manuscript.
Zijing Mao, Chifeng Ma contributed equally to this work.
Electronic supplementary material
Additional File 1: List of significantly differentially methylated genes reported by BIMMER. Differential methylation rates and methylation status are also provided. (XLSX 15 KB)
About this article
Cite this article
Mao, Z., Ma, C., Huang, T.HM. et al. BIMMER: a novel algorithm for detecting differential DNA methylation regions from MBDCap-seq data. BMC Bioinformatics 15 (Suppl 12), S6 (2014). https://doi.org/10.1186/1471-2105-15-S12-S6
- DNA methylation
- differential methylation
- Hidden Markov Model (HMM)