Identification and analysis of methylation call differences between bisulfite microarray and bisulfite sequencing data with statistical learning techniques
© Döring et al.; licensee BioMed Central Ltd. 2015
Published: 13 February 2015
DNA methylation is an epigenetic modification known to play a prime role in gene silencing and is an important topic in epigenetic research. However, due to technology-dependent errors, there are inconsistencies between methylation measurements from different methods. Incorrect methylation calls could result in the discovery of spurious associations between methylation patterns and specific phenotypes in epigenome-wide association studies (EWAS). We worked towards assigning a measure of confidence to individual CpGs in order to down-weight or exclude positions with inconsistent measurements in such studies. We used methylation measurements from the Infinium HumanMethylation450 microarray (β450K) and whole genome bisulfite sequencing (βWGBS) to evaluate whether locus-specific measurement differences, Δβ = β450K − βWGBS, are predictable using statistical learning techniques.
To predict Δβ, we built support vector regression models based on Illumina WGBS data. Read similarity was measured with numerical, string [5–7], and set kernels. We introduced the notion of hybrid string kernels to provide a similarity measure for numeric and string input simultaneously. These kernels scale the motif similarity score of two sequences according to the similarity of their base frequency profiles.
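The kernel ideas described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of an RBF factor on the frequency profiles, the plain product used for the scaling, and the averaging in the set kernel are all assumptions made for illustration. The weighted degree kernel uses the commonly cited weights β_k = 2(d − k + 1) / (d(d + 1)).

```python
import math

def wd_kernel(x, y, degree=3):
    """Weighted degree kernel: weighted count of k-mers (k = 1..degree)
    that match at the same position in two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for k in range(1, degree + 1):
        # Commonly used WD weights; they sum to 1 over k = 1..degree.
        beta = 2.0 * (degree - k + 1) / (degree * (degree + 1))
        matches = sum(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))
        total += beta * matches
    return total

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel on two numeric vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def hybrid_wd_kernel(x, y, freq_x, freq_y, degree=3, gamma=1.0):
    """Hypothetical hybrid kernel: motif similarity scaled by the similarity
    of base frequency profiles (assumed here to be a simple product)."""
    return wd_kernel(x, y, degree) * rbf_kernel(freq_x, freq_y, gamma)

def set_kernel(reads_a, reads_b, base_kernel):
    """Set (multi-instance) kernel: base kernel averaged over all read pairs.
    The averaging is an assumed normalization, not taken from the source."""
    pair_sum = sum(base_kernel(a, b) for a in reads_a for b in reads_b)
    return pair_sum / (len(reads_a) * len(reads_b))

# Toy usage with made-up sequences and (A, C, G, T) frequency profiles.
k_same = hybrid_wd_kernel("ACGT", "ACGT", [0.25] * 4, [0.25] * 4)
k_diff = hybrid_wd_kernel("ACGT", "ACGT", [0.25] * 4, [0.4, 0.1, 0.1, 0.4])
k_set = set_kernel(["ACGT", "ACGA"], ["ACGT"], wd_kernel)
```

Because the product of two positive semidefinite kernels is again positive semidefinite, a product-style scaling of this kind remains a valid kernel for support vector regression.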
For a read-based set kernel utilizing the weighted degree kernel with shifts, we found that the predicted values of Δβ correlated significantly with the observed outcomes (r = 0.37, p-value < 2.2 × 10^-16). Furthermore, the hybrid weighted degree kernel (r = 0.234) outperformed the weighted degree kernel with shifts (r = 0.22) by considering the frequencies of individual bases in addition to the consensus sequences. Non-sequence features were less predictive of the outcome than the sequence; for example, RBF kernels on base quality and depth of coverage attained correlations of only r = 0.057 and r = 0.003 with the outcome, respectively.
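As an illustration of how such an evaluation can be computed, the following sketch derives Δβ = β450K − βWGBS per locus and the Pearson correlation r between predicted and observed values. The function names and all numbers are hypothetical toy values, not data from the study.

```python
import math

def delta_beta(beta_450k, beta_wgbs):
    """Locus-wise difference between microarray and WGBS methylation values."""
    return [a - b for a, b in zip(beta_450k, beta_wgbs)]

def pearson_r(pred, obs):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    return cov / (sd_p * sd_o)

# Made-up methylation values for four CpGs (illustration only).
observed = delta_beta([0.80, 0.20, 0.55, 0.90], [0.50, 0.40, 0.50, 0.95])
predicted = [0.10, -0.05, 0.02, 0.01]  # hypothetical model output
r = pearson_r(predicted, observed)
```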
To our knowledge, this is the first approach indicating that differences between methylation measurements from bisulfite sequencing and the Infinium HumanMethylation450 microarray are predictable from the reads. The results suggest that features besides the sequence play only a minor role in the emergence of inconsistent methylation measurements. We were able to show that, in this scenario, set kernels and hybrid string kernels provide well-suited similarity measures. Further work is necessary to validate the model's generalizability to data from other cell lines and to evaluate its practical merit.
Gilles Gasparoni and Karl Nordström were funded by the BMBF project 01KU1216F (DEEP). Pavlo Lutsik was funded by the European Union's Seventh Framework Programme (FP7/2007-2013) grant agreement No. 267038 (NOTOX).
- Dedeurwaerder S, Defrance M, Calonne C, Denis H, Sotiriou C, Fuks F: Evaluation of the Infinium Methylation 450K technology. Epigenomics. 2011, 3 (6): 771-784. 10.2217/epi.11.105.
- Liu Y, Siegmund KD, Laird PW, Berman BP, et al: Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biol. 2012, 13 (7): R61. 10.1186/gb-2012-13-7-r61.
- Assenov Y, Müller F, Lutsik P, Walter J, Lengauer T, Bock C: Comprehensive Analysis of DNA Methylation Data with RnBeads. Nat Methods.
- Teschendorff AE, et al: A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450K DNA methylation data. Bioinformatics. 2013, 29 (2): 189-196. 10.1093/bioinformatics/bts680.
- Sonnenburg S, Rätsch G, Schäfer G: Learning interpretable SVMs for biological sequence classification. Research in Computational Molecular Biology. 2005, Springer, 389-407.
- Rätsch G, Sonnenburg S, Schölkopf B: RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics. 2005, 21 (suppl 1): i369-i377. 10.1093/bioinformatics/bti1053.
- Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinformatics. 2004, 5 (1): 169. 10.1186/1471-2105-5-169.
- Gärtner T, Flach PA, Kowalczyk A, Smola AJ: Multi-Instance Kernels. Proceedings of 19th International Conference on Machine Learning. 2002, San Mateo, CA: Morgan Kaufmann, 179-186. Edited by Sammut C, Hoffmann A.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.