Figure 2 shows the diagram of the framework and the detail of each step will be introduced below.
Importing data
There are mainly two forms of the Illumina 450K array data: i) raw (*.idat) data which is the direct output from Illumina iScan system and stores intensities for each probe, ii) *.txt data, which is usually got after simple preprocessing and is easier to access. Both file formats can be handled by R packages: *.idat files can be read by illuminaio package [15] and *.txt files can be dealt with minfi [12], wateRmelon [11] et al. We use three datasets for expounding our framework.
Dataset 1: Illumina 450K dataset from Dedeurwaerder et al. (GEO accession number: GSE29290) [7]. Three samples on HCT116WT cell-lines are considered, to evaluate the capability of methods to reduce the replicate variation.
Dataset 2: Illumina 450K dataset of fresh frozen head and neck cancer (HNC) samples form GSE38268 [16], where three of them are HPV+ and other three are HPV- samples.
Dataset 3: Illumina 450K dataset of level 1 methylation data of TCGA KIRC samples, which are *.idat files containing 160 normal samples and 325 tumor samples, to evaluate the efficiency of different modules.
Quality control
After the data imported into R, we would evaluate the quality of data. First, probes displaying a high detected p-value should be filtered out (e.g. > 0.01), while such probes have a β-value of “NA” in *.txt files. It is worth mentioned that different strategies of assigning missing β-value are applied in different modules. In minfi, the β-values are assigned with “NAs” when both M and U intensities are zero, but an additional criteria is that “NAs” will be assigned if either M or U intensities don’t fluoresce above background. Second, there are 850 inbuilt control probes on the 450K array, such as bisulfite conversion I, bisulfite conversion II, extension, hybridization and negative (n=613), which can be used to evaluate the other probes’ intensities. Samples that can’t pass this quality control are excluded in further analysis. Third, probes on Chromosome X or Y should be filtered out to eliminate the impact of sex on differential methylated analysis in many studies. Fourth, Price ME et al. [17] reported that there were 4.3% of Illumina 450K probes containing a known SNP at the targeted site. Such probes should cause problems in inter-sample analysis. Last but not the least, cross-reactive probes on the Illumina 450K array are identified depending on [18], which is particularly problematic because the β-value of such probes is more likely to represent a combination of multiple sites and not the level of initially targeted CpG sites.
Within-array normalization
This step includes background correction and color bias adjustment. Background correction methods have been developed by Triche et al. [19] such as Noob and Normexp, which are based on convolution models and use out-of-band (OOB) probes intensities to measure the background. The lumi R package provides two different methods for eliminating the background. The first one is based on the negative-control probes inbuilt the BeadChip and the second one estimates the background from the density modes of probes intensities. The second one would show bad performance when there are more than two density modes for some sample. The popular methylumi R package [20] proposed a color bias adjustment based on smooth quantile or shift-and-scaling normalization. Globally, these methods seem to improve the data quality in some cases [7].
Correction of the type bias
The type bias is the one that is most crucial to correct as it is the main source decreasing the data quality. There have been several efforts to develop methodologies to correct the probe type bias because of the differences between Infinium I and Infinium II. Because Infinium I probes are more stable and reproducible across different samples, most methods reduce the bias of Infinium II rather than Infinium I probes.
The first method is called peak-based correction (PBC) [7], which rescale the methylation values of Infinium II to the same modes for distribution of methylation values of Infinium I. But this method is sensitive to the shape of β-value density curves and is therefore less robust when the methylation density distribution does not exhibit well-defined peaks.
Touleimat and Tost [8] developed a method called Subset Quantile Normalization (SQN) based on an assumption that the β-values of CpGs form the same biological category should have the same density distribution. They found that the normalization result of using the “relation to CpG” annotation perfectly corrected the bias.
Subset-quantile Within Array Normalization (SWAN) [10] was developed based on the assumption that the β-values distribution should be the same when the probes have the same number of CpGs. But SWAN also alters Infinium I probe data, which increases Infinium I technical variation, and does not seem to improve the data quality when applied to some datasets [21].
Beta Mixture Quantile normalization (BMIQ) [9] method decomposes the density profile of Infinium I and Infinium II probes by fitting a beta-mixture model of three states: unmethylated (default β-value <0.25), hemimethylated (default 0.25 ≤β-value <0.75) and fully methylated (default β-value ≥0.75). Then it uses a quantile normalization to fit β-values distribution of Infinium II to the corresponding β-values distribution of Infinium I. This method does not depend on unceremonious choices of biological characteristics to be used to normalize data. Thus it seems more suitable than other methods. However, some points appear worse after BMIQ correction.
Another method, called Regression on Correlated Probes (RCP) [14], uses a quantile linear regression model of correlation between pairs of nearby Infinium I and II probes that share the same genomic context to adjust the methylation levels of Infinium II probes. The weakness of RCP is that it may not fit some experimental data leading to a result worse than raw data.
While background is important for measuring absolute methylation levels for single sample/condition experiments, we ignore the background here in this analysis since it can be cancelled out when comparing two conditions. As shown by other studies [22], widely used normalization process, which is based on the assumption that the majority of signals should not change across compared conditions, usually makes mistakes when it was applied to experiments where large portions of signals are differentially expressed. Thus, with a good quality control on the analyzed datasets, we didn’t choose to apply common normalization strategies.
It is also important to remove non-biological variation called batch effects existing between batches and samples. Such batch effects can influence on measurement of global level that could be partially removed through between-sample normalization using principal component analysis.
Identification of DMPs/DMRs
As we mentioned above, the main focus of many methylation studies has been on detecting differentially methylated probes (DMPs) or regions (DMRs) associated with a phenotype. The β-value is the default value for methylation measurement, allowing easy biological interpretation. Another type of value, M-value, is used to express the degree of methylation obtained with Infinium. Due to the heteroscedasticity of β-value, the variance of M-value across the methylation range is approximately constant, so the M-value has better statistical properties. The two types of value are used in different methods.
SQN simply considers a probe as DMP if the absolute value of the difference between β-value medians of paired samples is higher than 0.2:
$$\left|median\left({\beta^{N}_{1},\ldots,\beta^{N}_{n}}\right)-median\left({\beta^{T}_{1},\ldots,\beta^{T}_{n}}\right)\right|\ge0.2 $$
where \(\beta ^{N}_{i}\) and \(\beta ^{T}_{i}\) corresponding β-values of paired normal and tumor samples. The 0.2 threshold represents approximately a difference in methylation level of 20% which can be detected by the Infinium technology with 99% confidence [5]. Then the corresponding differentially methylated gene identities can be obtained from the list of DMPs.
Minfi offers a comprehensive package to analyze Illumina 450K array data, where candidate regions are determined for DMR analysis and locally weighted scatterplot smoothing (LOESS) is adopted to smooth the methylation differences between groups within each determined region. It also can find long-range alterations such as identified hypomethylated blocks [23] based on “open sea” probes. A empirical Bayes moderated t-test is used in limma [24] when sample sizes are less than 10, in which case M-values should be used as they rely much more on Gaussianity assumption [25].
Generally, DMRs are detected by applying various statistical techniques such as Fisher’s exact test [26, 27], t-test [27], Wilcoxon rank sum test [28] or different regression models [29–31].
Biological interpretation
There is a long list of significant CpGs to be interpreted after differential methylation analysis. Using the Infinium annotation file, Illumina 450K probes are classified according to their relations to CGIs and to the closest annotated gene. Regarding their relation to CGIs, the probes are classified into four categories: sites located inside a CGI, sites located in the CGI shores (0-2k bp), sites located in the CGI shelves (2-4k bp) and sites located in the “open sea”. As regards their relation to annotated genes, the sites are categorized as inside the promoter, inside the 5’-UTR region, inside the gene body and inside the 3’-UTR region. Then the significant DMPs can be marked with their related genes.
Performing gene set analysis is a popular way to understand the affected potential gene pathways. Although gene set analysis is well established in gene expression experimental, the research in methylation data is ongoing in different groups. In Illumina 450K array, the numbers of CpGs associated with each gene ranges largely from 1 to 1299 [32]. Genes with larger numbers of probes are more likely to have significant differentially methylated CpGs [33]. With the ontology and knowledgebase developing [34–40], researchers can easily annotate the genes containing DMPs or DMRs to ontology entries, which brings convenience for understanding the function of genes in the pathogenesis of diseases. Obviously, a phenotype is associated with several -omics data, such as mRNA expression and protein expression, which suggests researchers should utilize integrated analysis with multi-dimension data like TCGA project does [41, 42].