HOME: a histogram based machine learning approach for effective identification of differentially methylated regions

Background The development of whole genome bisulfite sequencing has made it possible to identify methylation differences at single base resolution throughout an entire genome. However, a persistent challenge in DNA methylome analysis is the accurate identification of differentially methylated regions (DMRs) between samples. Sensitive and specific identification of DMRs among different conditions requires accurate and efficient algorithms, and while various tools have been developed to tackle this problem, they frequently suffer from inaccurate DMR boundary identification and high false positive rate. Results We present a novel Histogram Of MEthylation (HOME) based method that takes into account the inherent difference in the distribution of methylation levels between DMRs and non-DMRs to discriminate between the two using a Support Vector Machine. We show that generated features used by HOME are dataset-independent such that a classifier trained on, for example, a mouse methylome training set of regions of differentially accessible chromatin, can be applied to any other organism’s dataset and identify accurate DMRs. We demonstrate that DMRs identified by HOME exhibit higher association with biologically relevant genes, processes, and regulatory events compared to the existing methods. Moreover, HOME provides additional functionalities lacking in most of the current DMR finders such as DMR identification in non-CG context and time series analysis. HOME is freely available at https://github.com/ListerLab/HOME. Conclusion HOME produces more accurate DMRs than the current state-of-the-art methods on both simulated and biological datasets. The broad applicability of HOME to identify accurate DMRs in genomic data from any organism will have a significant impact upon expanding our knowledge of how DNA methylation dynamics affect cell development and differentiation. Electronic supplementary material The online version of this article (10.1186/s12859-019-2845-y) contains supplementary material, which is available to authorized users.


Supplementary information 1.Differential ATAC-seq peaks
Fastq files were adapter and quality trimmed and aligned to the mm10 reference genome with bowtie2 (Langmead and Salzberg, 2012). As suggested by Buenrostro et al., fragment ends were offset by +4 bp on the positive strand and -5 bp on the negative strand (Buenrostro, et al., 2013). Duplicate reads were removed with Picard and peaks were identified with MACS2 (Zhang, et al., 2008). Differential ATAC-Seq peaks between EX and PV samples were detected with MACS2 bdgdiff.

Tool parameters for simulated data
1.2.1 HOME(1.0.0)

Class 1
Parameters tested: • All parameters kept as default.
• Prediction score cutoff set to 0.3 and pruncutoff set to 0.2 and all other parameters kept as default.
• Prediction score cutoff set to 0.6 and pruncutoff set to 0.4 and all other parameters kept as default.
Best results were obtained for all parameters kept as default.

Class 2
Parameters tested: • All parameters kept as default.
• Prediction score cutoff set to 0.3 and pruncutoff set to 0.2 and all other parameters kept as default.
• Prediction score cutoff set to 0.6 and pruncutoff set to 0.4 and all other parameters kept as default.
Best results were obtained score cutoff set to 0.6 and pruncutoff set to 0.4 and all other parameters kept as default. Best results were obtained for pval 0.01, delta 0.2 and all other parameters kept as default.

Class 1
Parameters tested: • All parameters kept as default • delta set to 0.2 and all other parameters kept as default • delta set to 0.3 and all other parameters kept as default Best results were obtained for delta 0.3 and all other parameters kept as default.

Class 2
Parameters tested: • All parameters kept as default • delta set to 0.2 and all other parameters kept as default • delta set to 0.3 and all other parameters kept as default Best results were obtained for delta 0.3 and all other parameters kept as default.

Boundary accuracy measurement
To measure the boundary accuracy, the fraction of overlap in bp between simulated and predicted DMRs was calculated using BEDtools intersect (Quinlan and Hall, 2010) with fraction of overlap (-f) and reciprocal overlap (-r) features. As defined by BEDtools, -r feature require that the fraction of overlap be reciprocal for A and B. That is, if -f is 0.90 and -r is used, this requires that B overlap at least 90% of A and that A also overlaps at least 90% of B. In our case A was simulated DMR and B was predicted DMR.

HOME(1.0.0)
For CG context, all parameters were kept as default. For CH in mammalian data, prediction score cutoff set to 0.3 and for CHH and CHG context in plant data, prediction score cutoff set to 0.5. The merge distance parameter kept as default (500bp) for all contexts.

DSS (2.10.0)
p.threshold set to 0.01 and dis.merge set to 500 (all other parameters set to default).