A framework for analyzing DNA methylation data from Illumina Infinium HumanMethylation450 BeadChip
© The Author(s) 2018
Published: 11 April 2018
DNA methylation has been identified to be widely associated to complex diseases. Among biological platforms to profile DNA methylation in human, the Illumina Infinium HumanMethylation450 BeadChip (450K) has been accepted as one of the most efficient technologies. However, challenges exist in analysis of DNA methylation data generated by this technology due to widespread biases.
Here we proposed a generalized framework for evaluating data analysis methods for Illumina 450K array. This framework considers the following steps towards a successful analysis: importing data, quality control, within-array normalization, correcting type bias, detecting differentially methylated probes or regions and biological interpretation.
We evaluated five methods using three real datasets, and proposed outperform methods for the Illumina 450K array data analysis. Minfi and methylumi are optimal choice when analyzing small dataset. BMIQ and RCP are proper to correcting type bias and the normalized result of them can be used to discover DMPs. R package missMethyl is suitable for GO term enrichment analysis and biological interpretation.
DNA methylation is an important epigenetic modification which has shown numerous associations with biological processes and complex diseases such as diabetes, schizophrenia and cancer [1–4]. However, the methylomic landscape in disease pathogenesis has not yet been well characterized, especially in cancer where DNA methylation can be altered dramatically. Interests of exploring the associations between DNA methylation and complex diseases increase in disease studies.
Illumina Infinium HumanMethylation450 (450K) BeadChip array, which covers over 480K CpG sites and targets 96% of CpG islands in human genome , has been widely utilized in many large studies, such as The Cancer Genome Atlas (TCGA) and The International Cancer Genome Consortium (ICGC) Project . With the availability of public data resources, a number of methods for analyzing the Illumina 450K array data became rapidly available in the past few years.
Unlike the previous platform Illumina Infinium HumanMethylation27 (27K) BeadChip, in which only one probe type is utilized, the Illumina 450K BeadChip includes two distinct probe types, Infinium I (n=135501) and Infinium II (n=350076) . Each CpG site of Infinium I is targeted by two 50bp probes: one for detecting “methylated” (M) intensity and one for detecting “unmethylated” (U) intensity, whereas each CpG site of Infinium II uses just one probe to distinguish “M” and “U” intensity through different dye colors (green and red), then the β-value, indicating the methylation level of one CpG site, can be computed as β=M/(M+U+α) where α is 100 generally. M-value, M=log2(β/(1−β)), the logit-transformed β-value, is another quantity used in following up analysis.
There are mainly two forms of the Illumina 450K array data: i) raw (*.idat) data which is the direct output from Illumina iScan system and stores intensities for each probe, ii) *.txt data, which is usually got after simple preprocessing and is easier to access. Both file formats can be handled by R packages: *.idat files can be read by illuminaio package  and *.txt files can be dealt with minfi , wateRmelon  et al. We use three datasets for expounding our framework.
Dataset 1: Illumina 450K dataset from Dedeurwaerder et al. (GEO accession number: GSE29290) . Three samples on HCT116WT cell-lines are considered, to evaluate the capability of methods to reduce the replicate variation.
Dataset 2: Illumina 450K dataset of fresh frozen head and neck cancer (HNC) samples form GSE38268 , where three of them are HPV+ and other three are HPV- samples.
Dataset 3: Illumina 450K dataset of level 1 methylation data of TCGA KIRC samples, which are *.idat files containing 160 normal samples and 325 tumor samples, to evaluate the efficiency of different modules.
After the data imported into R, we would evaluate the quality of data. First, probes displaying a high detected p-value should be filtered out (e.g. > 0.01), while such probes have a β-value of “NA” in *.txt files. It is worth mentioned that different strategies of assigning missing β-value are applied in different modules. In minfi, the β-values are assigned with “NAs” when both M and U intensities are zero, but an additional criteria is that “NAs” will be assigned if either M or U intensities don’t fluoresce above background. Second, there are 850 inbuilt control probes on the 450K array, such as bisulfite conversion I, bisulfite conversion II, extension, hybridization and negative (n=613), which can be used to evaluate the other probes’ intensities. Samples that can’t pass this quality control are excluded in further analysis. Third, probes on Chromosome X or Y should be filtered out to eliminate the impact of sex on differential methylated analysis in many studies. Fourth, Price ME et al.  reported that there were 4.3% of Illumina 450K probes containing a known SNP at the targeted site. Such probes should cause problems in inter-sample analysis. Last but not the least, cross-reactive probes on the Illumina 450K array are identified depending on , which is particularly problematic because the β-value of such probes is more likely to represent a combination of multiple sites and not the level of initially targeted CpG sites.
This step includes background correction and color bias adjustment. Background correction methods have been developed by Triche et al.  such as Noob and Normexp, which are based on convolution models and use out-of-band (OOB) probes intensities to measure the background. The lumi R package provides two different methods for eliminating the background. The first one is based on the negative-control probes inbuilt the BeadChip and the second one estimates the background from the density modes of probes intensities. The second one would show bad performance when there are more than two density modes for some sample. The popular methylumi R package  proposed a color bias adjustment based on smooth quantile or shift-and-scaling normalization. Globally, these methods seem to improve the data quality in some cases .
Correction of the type bias
The type bias is the one that is most crucial to correct as it is the main source decreasing the data quality. There have been several efforts to develop methodologies to correct the probe type bias because of the differences between Infinium I and Infinium II. Because Infinium I probes are more stable and reproducible across different samples, most methods reduce the bias of Infinium II rather than Infinium I probes.
The first method is called peak-based correction (PBC) , which rescale the methylation values of Infinium II to the same modes for distribution of methylation values of Infinium I. But this method is sensitive to the shape of β-value density curves and is therefore less robust when the methylation density distribution does not exhibit well-defined peaks.
Touleimat and Tost  developed a method called Subset Quantile Normalization (SQN) based on an assumption that the β-values of CpGs form the same biological category should have the same density distribution. They found that the normalization result of using the “relation to CpG” annotation perfectly corrected the bias.
Subset-quantile Within Array Normalization (SWAN)  was developed based on the assumption that the β-values distribution should be the same when the probes have the same number of CpGs. But SWAN also alters Infinium I probe data, which increases Infinium I technical variation, and does not seem to improve the data quality when applied to some datasets .
Beta Mixture Quantile normalization (BMIQ)  method decomposes the density profile of Infinium I and Infinium II probes by fitting a beta-mixture model of three states: unmethylated (default β-value <0.25), hemimethylated (default 0.25 ≤β-value <0.75) and fully methylated (default β-value ≥0.75). Then it uses a quantile normalization to fit β-values distribution of Infinium II to the corresponding β-values distribution of Infinium I. This method does not depend on unceremonious choices of biological characteristics to be used to normalize data. Thus it seems more suitable than other methods. However, some points appear worse after BMIQ correction.
Another method, called Regression on Correlated Probes (RCP) , uses a quantile linear regression model of correlation between pairs of nearby Infinium I and II probes that share the same genomic context to adjust the methylation levels of Infinium II probes. The weakness of RCP is that it may not fit some experimental data leading to a result worse than raw data.
While background is important for measuring absolute methylation levels for single sample/condition experiments, we ignore the background here in this analysis since it can be cancelled out when comparing two conditions. As shown by other studies , widely used normalization process, which is based on the assumption that the majority of signals should not change across compared conditions, usually makes mistakes when it was applied to experiments where large portions of signals are differentially expressed. Thus, with a good quality control on the analyzed datasets, we didn’t choose to apply common normalization strategies.
It is also important to remove non-biological variation called batch effects existing between batches and samples. Such batch effects can influence on measurement of global level that could be partially removed through between-sample normalization using principal component analysis.
Identification of DMPs/DMRs
As we mentioned above, the main focus of many methylation studies has been on detecting differentially methylated probes (DMPs) or regions (DMRs) associated with a phenotype. The β-value is the default value for methylation measurement, allowing easy biological interpretation. Another type of value, M-value, is used to express the degree of methylation obtained with Infinium. Due to the heteroscedasticity of β-value, the variance of M-value across the methylation range is approximately constant, so the M-value has better statistical properties. The two types of value are used in different methods.
Minfi offers a comprehensive package to analyze Illumina 450K array data, where candidate regions are determined for DMR analysis and locally weighted scatterplot smoothing (LOESS) is adopted to smooth the methylation differences between groups within each determined region. It also can find long-range alterations such as identified hypomethylated blocks  based on “open sea” probes. A empirical Bayes moderated t-test is used in limma  when sample sizes are less than 10, in which case M-values should be used as they rely much more on Gaussianity assumption .
There is a long list of significant CpGs to be interpreted after differential methylation analysis. Using the Infinium annotation file, Illumina 450K probes are classified according to their relations to CGIs and to the closest annotated gene. Regarding their relation to CGIs, the probes are classified into four categories: sites located inside a CGI, sites located in the CGI shores (0-2k bp), sites located in the CGI shelves (2-4k bp) and sites located in the “open sea”. As regards their relation to annotated genes, the sites are categorized as inside the promoter, inside the 5’-UTR region, inside the gene body and inside the 3’-UTR region. Then the significant DMPs can be marked with their related genes.
Performing gene set analysis is a popular way to understand the affected potential gene pathways. Although gene set analysis is well established in gene expression experimental, the research in methylation data is ongoing in different groups. In Illumina 450K array, the numbers of CpGs associated with each gene ranges largely from 1 to 1299 . Genes with larger numbers of probes are more likely to have significant differentially methylated CpGs . With the ontology and knowledgebase developing [34–40], researchers can easily annotate the genes containing DMPs or DMRs to ontology entries, which brings convenience for understanding the function of genes in the pathogenesis of diseases. Obviously, a phenotype is associated with several -omics data, such as mRNA expression and protein expression, which suggests researchers should utilize integrated analysis with multi-dimension data like TCGA project does [41, 42].
Reduce the technical variation and type bias
Identify DMPs/DMRs by preprocessed results of different methods
The number of obtained DMPs and GO terms by five methods
Efficiency of large dataset analysis
We use Dataset3 to evaluate the performance of minfi and meffil  in term of importing large data. It took 48 min and less than 3 Gb memory to import data of 485 samples (970 *.idat files) using meffil package on a computer with 8 Gb of RAM and 4 processors while minfi could not run on the same computer. Minfi took two hours and ∼ 23 Gb memory on a server to import the same dataset.
Summary of evaluation of five methods
The SQN, Dasen and RCP methods could significantly improve the replicated data quality, while SWAN and BMIQ didn’t show improvement of the replicates. SWAN and Dasen didn’t remove type bias as other methods, which might be due to that models they applying cannot fit the distribution of Infinium I well as other methods.
When evaluation focused on detecting DMPs, BMIQ and RCP got more overlapped DMPs and credible GO terms than other methods. It is should pointed out that result may vary largely when using different datasets, which will be validated in further work.
Illumina MethylationEPIC BeadChip  microarrays have been used in some project, which contain more probes on a single array. More efficient tools are in urgent need of merging 450K and EPIC array data and the efficiency of analysis should be considered. Minfi has been utilized widely but it cannot handle large dataset on personal computer in our view while the newly package meffil displayed surprising performance.
During the evaluating processing, there were conflicts between R packages, for example, the MethylSet object in wateRmelon and minfi are different because the one in minfi has been updated in the newest version while the wateRmelon still use the previous object constructor. It is should be noticed in case of using different version of them.
It is suggested that the Illumina 450K users should choose proper strategy about importing data, background eliminating, correcting dye bias, correcting the type bias and detecting DMPs or DMRs. When analyzing small dataset, minfi and methylumi are optimal choice to import data and SQN, BMIQ and RCP may be proper to correcting the Infinium I/II bias. R package missMethyl is suitable for GO term enrichment analysis and biological interpretation. In our view, minfi is a proper R package to import data, eliminate background and ENmix package can be used to correct the type bias, then the normalized data should be used in the remaining steps of the framework.
Publication costs were funded by the Major State Research Development Program of China [No:2016YFC1202302], the fundamental research funds for the central universities (grant no. HIT.NSRIF.201652), the National Natural Science Foundation of China (No:61571152) and the National High-tech R&D Program of China (863 Program) [Nos:2015AA020101, 2015AA020108].
Availability of data and materials
Dataset 1 and Dataset 2 analyzed during the current study are available in the GEO repository, GSE29290, GSE38268. Dataset 3 is available in the GDC repository TCGA-KIRC
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://doi.org/https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-5.
ZXW performed analyses and wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Koga Y, Pelizzola M, Cheng E, Krauthammer M, Sznol M, Ariyan S, Narayan D, Molinaro AM, Halaban R, Weissman SM. Genome-wide screen of promoter methylation identifies novel markers in melanoma. Genome Res. 2009; 19(8):1462–70.View ArticlePubMedPubMed CentralGoogle Scholar
- Teng M, Balch C, Liu Y, Li M, Huang TH, Wang Y, Nephew KP, Li L. The influence of cis-regulatory elements on dna methylation fidelity. PloS ONE. 2012; 7(3):32928.View ArticleGoogle Scholar
- Esteller M. Cancer epigenomics: Dna methylomes and histone-modification maps. Nat Rev Genet. 2007; 8(4):286.View ArticlePubMedGoogle Scholar
- Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, et al. The human colon cancer methylome shows similar hypo-and hypermethylation at conserved tissue-specific cpg island shores. Nat Genet. 2009; 41(2):178–86.View ArticlePubMedPubMed CentralGoogle Scholar
- Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, et al. High density dna methylation array with single cpg site resolution. Genomics. 2011; 98(4):288–95.View ArticlePubMedGoogle Scholar
- Consortium ICG, et al. International network of cancer genome projects. Nature. 2010; 464(7291):993.View ArticleGoogle Scholar
- Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, Fuks F. Evaluation of the infinium methylation 450k technology. Epigenomics. 2011; 3(6):771–84.View ArticlePubMedGoogle Scholar
- Touleimat N, Tost J. Complete pipeline for infinium®; human methylation 450k beadchip data processing using subset quantile normalization for accurate dna methylation estimation. Epigenomics. 2012; 4(3):325–41.View ArticlePubMedGoogle Scholar
- Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S. A beta-mixture quantile normalization method for correcting probe design bias in illumina infinium 450 k dna methylation data. Bioinformatics. 2012; 29(2):189–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Maksimovic J, Gordon L, Oshlack A. Swan: Subset-quantile within array normalization for illumina infinium humanmethylation450 beadchips. Genome Biol. 2012; 13(6):44.View ArticleGoogle Scholar
- Pidsley R, Wong CC, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing illumina 450k methylation array data. BMC Genomics. 2013; 14(1):293.View ArticlePubMedPubMed CentralGoogle Scholar
- Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive bioconductor package for the analysis of infinium dna methylation microarrays. Bioinformatics. 2014; 30(10):1363–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu Z, Niu L, Li L, Taylor JA. Enmix: a novel background correction method for illumina humanmethylation450 beadchip. Nucleic Acids Res. 2015; 44(3):20.View ArticleGoogle Scholar
- Niu L, Xu Z, Taylor JA. Rcp: a novel probe design bias correction method for illumina methylation beadchip. Bioinformatics. 2016; 32(17):2659–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD. illuminaio: An open source idat parsing tool for illumina microarrays. F1000Research. 2013; 2:264.PubMedPubMed CentralGoogle Scholar
- Lechner M, Fenton T, West J, Wilson G, Feber A, Henderson S, Thirlwell C, Dibra HK, Jay A, Butcher L, et al. Identification and functional validation of hpv-mediated hypermethylation in head and neck squamous cell carcinoma. Genome Med. 2013; 5(2):15.View ArticlePubMedPubMed CentralGoogle Scholar
- Price EM, Cotton AM, Lam LL, Farré P., Emberly E, Brown CJ, Robinson WP, Kobor MS. Additional annotation enhances potential for biologically-relevant analysis of the illumina infinium humanmethylation450 beadchip array. Epigenetics Chromatin. 2013; 6(1):4.View ArticlePubMedPubMed CentralGoogle Scholar
- Chen Y-a, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, Gallinger S, Hudson TJ, Weksberg R. Discovery of cross-reactive probes and polymorphic cpgs in the illumina infinium humanmethylation450 microarray. Epigenetics. 2013; 8(2):203–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Triche Jr TJ, Weisenberger DJ, Van Den Berg D, Laird PW, Siegmund KD. Low-level processing of illumina infinium dna methylation beadarrays. Nucleic Acids Res. 2013; 41(7):90.View ArticleGoogle Scholar
- Davis S, Du P, Bilke S, Triche T, Bootwalla M. methylumi: Handle illumina methylation data. R Package version 2.0. 2014.Google Scholar
- Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of infinium humanmethylation450 data processing. Brief Bioinforma. 2013; 15(6):929–41.View ArticleGoogle Scholar
- Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol. 2010; 11(3):25.View ArticleGoogle Scholar
- Berman BP, Weisenberger DJ, Aman JF, Hinoue T, Ramjan Z, Liu Y, Noushmehr H, Lange CP, van Dijk CM, Tollenaar RA, et al. Regions of focal dna hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains. Nat Genet. 2012; 44(1):40–6.View ArticleGoogle Scholar
- Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer: 2005. p. 397–420.Google Scholar
- Zhuang J, Widschwendter M, Teschendorff AE. A comparison of feature selection and classification methods in dna methylation studies using the illumina infinium platform. BMC Bioinformatics. 2012; 13(1):59.View ArticlePubMedPubMed CentralGoogle Scholar
- Stockwell PA, Chatterjee A, Rodger EJ, Morison IM. Dmap: differential methylation analysis package for rrbs and wgbs data. Bioinformatics. 2014; 30(13):1814–22.View ArticlePubMedGoogle Scholar
- Warden CD, Lee H, Tompkins JD, Li X, Wang C, Riggs AD, Yu H, Jove R, Yuan Y-C. Cohcap: an integrative genomic pipeline for single-nucleotide resolution dna methylation analysis. Nucleic Acids Res. 2013; 41(11):117.View ArticleGoogle Scholar
- Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, Ambrosone CB, Johnson CS, Smiraglia DJ, Liu S. Ima: an r package for high-throughput analysis of illumina’s 450k infinium methylation data. Bioinformatics. 2012; 28(5):729–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Hansen KD, Langmead B, Irizarry RA. Bsmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 2012; 13(10):83.View ArticleGoogle Scholar
- Park Y, Figueroa ME, Rozek LS, Sartor MA. Methylsig: a whole genome dna methylation analysis pipeline. Bioinformatics. 2014; 30(17):2414–22.View ArticlePubMedPubMed CentralGoogle Scholar
- Hebestreit K, Dugas M, Klein H-U. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics. 2013; 29(13):1647–53.View ArticlePubMedGoogle Scholar
- Phipson B, Maksimovic J, Oshlack A. missmethyl: an r package for analyzing data from illumina’s humanmethylation450 platform. Bioinformatics. 2015; 32(2):286–8.PubMedGoogle Scholar
- Geeleher P, Hartnett L, Egan LJ, Golden A, Raja Ali RA, Seoighe C. Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics. 2013; 29(15):1851–7.View ArticlePubMedGoogle Scholar
- Cheng L, Jiang Y, Wang Z, Shi H, Sun J, Yang H, Zhang S, Hu Y, Zhou M. Dissim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs. Sci Rep. 2016; 6:30024.View ArticlePubMedPubMed CentralGoogle Scholar
- Cheng L, Sun J, Xu W, Dong L, Hu Y, Zhou M. Oahg: an integrated resource for annotating human genes with multi-level ontologies. Sci Rep. 2016; 6:34820.View ArticlePubMedPubMed CentralGoogle Scholar
- Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using dnet. Methods. 2017; 131:104–10.View ArticlePubMedGoogle Scholar
- Cheng L, Yang H, Zhao H, Pei X, Shi H, Sun J, Zhang Y, Wang Z, Zhou M. Metsigdis: a manually curated resource for the metabolic signatures of diseases. Brief Bioinforma. 2017;:bbx103.Google Scholar
- Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinforma. 2017; 18(16):573.View ArticleGoogle Scholar
- Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of hpo terms. Int J Data Min Bioinforma. 2017; 17(2):173–88.View ArticleGoogle Scholar
- Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol. 2018;:12(Suppl2). In press.Google Scholar
- Network CGAR, et al. Integrated genomic analyses of ovarian carcinoma. Nature. 2011; 474(7353):609.View ArticleGoogle Scholar
- Network CGAR, et al. Integrated genomic characterization of endometrial carcinoma. Nature. 2013; 497(7447):67.View ArticleGoogle Scholar
- Min J, Hemani G, Smith GD, Relton CL, Suderman M. Meffil: efficient normalisation and analysis of very large dna methylation samples. bioRxiv. 2017:125963.Google Scholar
- Pidsley R, Zotenko E, Peters TJ, Lawrence MG, Risbridger GP, Molloy P, Van Djik S, Muhlhausler B, Stirzaker C, Clark SJ. Critical evaluation of the illumina methylationepic beadchip microarray for whole-genome dna methylation profiling. Genome Biol. 2016; 17(1):208.View ArticlePubMedPubMed CentralGoogle Scholar