 Research Article
 Open Access
 Published:
An evaluation of statistical methods for DNA methylation microarray data analysis
BMC Bioinformatics volume 16, Article number: 217 (2015)
Abstract
Background
DNA methylation offers an excellent example for elucidating how epigenetic information affects gene expression. β values and M values are commonly used to quantify DNA methylation. Statistical methods applicable to DNA methylation data analysis span a number of approaches such as Wilcoxon rank sum test, ttest, Kolmogorov–Smirnov test, permutation test, empirical Bayes method, and bump hunting method. Nonetheless, selection of an optimal statistical method can be challenging when different methods generate inconsistent results from the same data set.
Results
We compared six statistical approaches relevant to DNA methylation microarray analysis in terms of false discovery rate control, statistical power, and stability through simulation studies and real data examples. Observable differences were noticed between β values and M values only when methylation levels were correlated across CpG loci. For small sample size (n=3 or 6 in each group), both the empirical Bayes and bump hunting methods showed appropriate FDR control and the highest power when methylation levels across CpG loci were independent. Only the bump hunting method showed appropriate FDR control and the highest power when methylation levels across CpG sites were correlated. For medium (n=12 in each group) and large sample sizes (n=24 in each group), all methods compared had similar power, except for the permutation test whenever the proportion of differentially methylated loci was low. For all sample sizes, the bump hunting method had the lowest stability in terms of standard deviation of total discoveries whenever the proportion of differentially methylated loci was large. The apparent test power comparisons based on raw pvalues from DNA methylation studies on ovarian cancer and rheumatoid arthritis provided results as consistent as those obtained in the simulation studies. Overall, these results provide guidance for optimal statistical methods selection under different scenarios.
Conclusions
For DNA methylation studies with small sample size, the bump hunting method and the empirical Bayes method are recommended when DNA methylation levels across CpG loci are independent, while only the bump hunting method is recommended when DNA methylation levels are correlated across CpG loci. All methods are acceptable for medium or large sample sizes.
Background
DNA methylation is a biochemical process of adding a methyl group at the \(\phantom {\dot {i}\!}5^{'}\) carbon of the cytosine ring to form 5methylcytosine (found at cytosineguanosine dinucleotides (CpGs)) and plays a significant role in the development and progression of human disease [1]. More than 50 % of human gene transcription initiations are from genome regions with elevated CpG contents, known as “CpG islands”. CpG loci within promoter CpG islands are normally free from DNA methylation to allow the initiation of gene expression [2]. Studies have documented associations between DNA methylation and cancer [1, 3]. Promoter hypermethylation impacts development of cancer through transcriptional silencing of crucial growth regulators. Two United States Food and Drug Administration (FDA) approved epigenetic drugs, azacitidine and decitabine, reactivate tumor suppressor genes through removing DNA methylation marks, which highlights the importance of understanding DNA methylation in disease etiology and treatment [4, 5].
Largescale examination of DNA methylation through microarray or sequencing technologies makes epigenomewide association studies (EWAS) feasible to explore associations between DNA methylation and cancers in the sustained effort to develop novel anticancer drugs, and to identify DNA methylation markers associated with certain cancers for prognosis and diagnosis purpose [6]. The Illumina HumanMethylation BeadChip technology is a popular platform for conducting epigenomewide association studies. Three platforms have been developed by Illumina for DNA methylation assay: GoldenGate, Infinium Human Methylation27 and Infinium HumanMethylation450 BeadChip. All platforms use two fluorescent dye colors to recognize the bisulphiteconverted sequence. The standard output from the BeadChip assay for quantifying methylation is the β value, which is calculated from the intensity of methylated allele (M a x(M,0)) and the intensity of unmethylated allele (M a x(U,0)) according to the following formula [7].
The β values are usually preprocessed for the downstream statistical analysis. The summary on preprocessing the β values including quality control, background correction, and normalization could be found somewhere else [8]. For differential DNA methylation analysis, the average β value denotes the methylation level, or the percentage for an interrogated locus. The average β values vary between 0 and 1. In an ideal situation, "zero" indicates that no copy of the CpG site in the sample is methylated, and "one" indicates that every copy of the site is methylated. The average β value approximates the methylation percent for the population of a sampled CpG site. Alternatively, some investigators use the Mvalue, considering the Mvalue alternative statistically more valid [9]. The Mvalue is defined as:
The range of Mvalues could be from −i n f to +i n f, consistent with the data range for a normal distribution. However, interpretations of Mvalues are not as intuitive as for βvalues. A properly normalized Mvalue approaching zero indicates that a specific CpG site is halfmethylated. Positive Mvalues suggest a methylation rate greater than 50 %, while negative Mvalues indicate a methylation inferior to 50 %. The βvalues and Mvalues are related through a log2 ratio transformation such as:
It has been shown that there is an approximately linear relationship between βvalues and Mvalues in the middle range of the methylation data ([0.2, 0.8] for β values, and [2, 2] for Mvalues) [9].
We used both βvalues and Mvalues in our simulation studies and real data examples, which should provide guidance to investigators in selecting βvalues or Mvalues for their differential DNA methylation analysis with regard to FDR control, power, and stability.
Currently available methylation differential analysis methods implemented in Bioconductor/R include several approaches such as Wilcoxon rank sum test (used in methyAnalysis package), ttest (used in methyAnalysis, CpGAssoc, RnBeads, and IMA package), KolmogorovSmirnov Tests (although not implemented in packages, but used by some investigators [10]), permutation test (used in CpGAssoc package), empirical Bayes method (used in RnBeads, IMA and minfi package), and bump hunting method (used in bumphunter and minfi package). However, with so many options available to investigators, selection of an optimal statistical method, can be challengingespecially when different methods applied to the same data set generate inconsistent results. As such, we systematically investigated these commonly used DNA methylation differential analysis methods in terms of their FDR control, power, and stability, through simulation studies. We illustrated the respective advantages and disadvantages of these methods with real methylation data sets, in order to provide empirical evidence and advice to investigators in selecting the most appropriate DNA methylation analysis methods for their studies.
Methods
Hypothesis testing for each DNA methylation locus was done using either the average βvalues or the transformed Mvalues of the different groups. Assume there are m methylation loci from the DNA methylation array assay. Among the m methylation loci, m _{0} loci are not differentially methylated. Suppose R methylation loci are rejected of m total loci, then V indicates the number of falsely rejected methylation loci (or “false discoveries”) from R rejections, and S denotes the true number of differentially methylated loci between groups in R rejections (R=V+S). The possible outcomes of testing m DNA methylation loci simultaneously are shown in Table 1. When testing m DNA methylation loci simultaneously, we need to control multiple testing error rate as opposed from testing a single DNA methylation locus.
The most commonly used multiple testing error rate for discovery purposes is the false discovery rate proposed by Benjamini and Hochberg [11], defined as:
Another definition of false discovery rate proposes to control the expected proportion of false discoveries \(E(\frac {V}{R}R > 0)\) when R>0 [12] such as:
p F D R=1 when m _{0}=m. FDR and pFDR set to 0 when R=0. FDR and pFDR are similar when the phenotype is not associated with DNA methylation for most of the CpGs.
In our simulations, two different multiple testing procedures were used to control for FDR/pFDR. Through a stepup procedure, the BenjaminiHochberg procedure [11] provides control of FDR at α level. The BenjaminiHochberg procedure compares ordered P _{(i)} with \(\frac {i}{m}\alpha \) from the largest p, rejecting all H _{(i)} i=1,2,…,k with \({P_{(i)} \le \frac {i}{m}\alpha }\). The BenjaminiHochberg procedure provides strong control for FDR at level α (for independent and positively correlated test statistics). Similarly to the BenjaminiHochberg procedure, the Storey’s qvalue procedure [12] uses conservative point estimators of m _{0}, \(\hat {m}_{0}(\lambda)\) (λ is a tuning parameter). With larger cutoffs, the Storey’s qvalue leads to higher power than the BenjaminiHochberg procedure as \({\hat {m}_{0}(\lambda) \leq m}\). The Storey’s qvalue controls pFDR at α, with test statistics correlated weakly or independently.
Besides controlling FDR at a desired α level in the multiple testing process, we would also desire that the DNA methylation analysis method possess enough power to detect true differential DNA methylation loci and be consistent from experiment to experiment. Power is defined as the expected proportion of true differentially methylated loci detected among the total number of true differentially methylated loci [13]. Stability is measured as the standard deviation (SD) of the count of the differentially methylated loci detected. Power and stability of a differential DNA methylation analysis method could be expressed using the following formulas:
Power is defined as 0 and Stability becomes a measure of standard deviation of false detections when m=m _{0}.
Wilcoxon rank sum test (rank test)
Wilcoxon rank sum test (i.e., Mann–Whitney U test) is a rankbased nonparametric test and used in the methyAnalysis package as a differential methylation analysis method [14]. It is usually used as an alternative to the twoindependent sample ttest when the assumption of normal data distribution is violated for the ttest.
Assume the methylation level denoted either by βvalues or Mvalues for ith locus, jth group, and kth subject is y _{ ijk }. Suppose j=1 denotes the normal group and j=2 denotes the cancer group. For each DNA methylation locus, the null hypothesis of the Wilcoxon rank sum test is that the distribution of y _{ i1k } equals the distribution of y _{ i2k } for i=1,2,…,m. The twoside alternative hypothesis is a location shift of the distribution of y _{ i2k } from y _{ i1k } in either direction. The raw pvalues from the Wilcoxon rank sum test are then adjusted using Benjamini and Hochberg procedure to control for FDR at level α [11] through the p.adjust function in R.
ttest
Implemented in methyAnalysis, CpGAssoc, RnBeads, and IMA packages, the ttest is a commonly used hypothesis testing method in genomic data analysis for testing equivalence of means between two groups [15]. For two independent samples ttest, there are two ttest procedures depending on whether the variances from those two groups are equal or not. The unequal variance ttest procedure (i.e., Welch’s ttest) is usually the default one used in most packages, and does not assume equal variance between groups. The raw pvalues from the ttests are computed based on the t distribution, while adjusted pvalues are obtained using the Benjamini and Hochberg procedure through the same p.adjust function in R.
KolmogorovSmirnov test (KS test)
The KolmogorovSmirnov test (KS test) is a nonparametric test in statistics for testing the equality of two continuous probability distributions [16, 17]. In DNA methylation studies, the null hypothesis is that the distribution of y _{ i1k } equals the distribution of y _{ i2k } as that in Wilcoxon rank sum test for each locus of i=1,2,…,m. Sensitive to difference in shape and location of the distribution functions of two groups, the KS test differs from the Wilcoxon rank sum test (sensitive to differences in location). The raw pvalue from the KS test are adjusted using the Benjamini and Hochberg procedure to control FDR at level α.
Permutation test
Permutation test is a resamplingbased nonparametric test, which permutes data falling under the null hypothesis of equal data distributions between groups [18]. The distributions of test statistics (usually ttest statistics) are estimated from permuted test statistics. In the CpGAssoc package, the raw pvalues from the permutation test for DNA methylation data are adjusted using the p.adjust function in R to control FDR at level α.
Empirical Bayes method
Used in RnBeads, IMA and minfi packages, the empirical Bayes method is a popular hypothesis testing method applied through the lmFit and eBayes functions [19]. First, we can fit a linear model to estimate \(\beta ^{*}_{i}\), the mean differences between two groups for ith locus. In DNA methylation studies, let \({y_{i}^{T}}=(y_{i_{1}}, \dots, y_{i_{n}})\) denote the DNA methylation level for both groups with n=n _{1}+n _{2} for ith locus. Then, we can fit a linear model for each locus using the formula:
where X is a design matrix of full column rank, and \(\beta ^{*}_{i}\) is a coefficient vector. The \(\beta ^{*}_{i}\) coefficient vector includes \((\beta ^{*}_{i0}, \beta ^{*}_{i1})\) with \(\beta ^{*}_{i0}\) denoting the mean DNA methylation level for normal group, while \(\beta ^{*}_{i1}\) denotes the mean methylation level difference between the cancer group and the normal group. Thus, the null hypothesis for testing the mean methylation level difference between the normal group and the cancer group is \(H_{0}: \beta ^{*}_{i1} = 0\) for locus i=1,…,m. The test statistic for testing H _{0} is the moderated tstatistic, based on a hybrid classical/Bayes approach, defined by:
The pvalue for testing \(H_{0}:\beta ^{*}_{\textit {ij}}=0\) (\(H_{0}:\beta ^{*}_{i0}=0\) and \(H_{0}:\beta ^{*}_{i1}=0\)) based on the moderated tstatistic is calculated from the t distribution with d _{ i }+d _{0} degrees of freedom. More information on \(\tilde {s}_{i}\), v _{ ij }, d _{ i }, and d _{0} could be found elsewhere [19].
The pvalue for testing \(H_{0}:\beta ^{*}_{i1}=0\) can be further adjusted using the p.adjust function to control for FDR at level α.
Bump hunting method
The bump hunting method used in bumphunter and minfi packages was developed to take into account the correlations of methylation levels between nearby CpG locus [20]. The bump hunting method was carried out by first fitting a linear regression model for each locus before smoothing the coefficient within clusters along the genome to identify bumps [21]. More specifically, for each locus, a linear model will be used to estimate the coefficient of difference in methylation levels between the cancer group and the normal groups. Let Y _{ ijk } denote the measured methylation level for ith locus, jth group, and kth subject. X _{ ij } is an indicator variable with X _{ i1}=0 for the normal group and X _{ i1}=1 for the cancer group at all locus i. \(\beta ^{*}_{i}\) is the estimated coefficient for X _{ ij }, and also stands for the estimated difference in DNA methylation levels between the cancer and the normal groups. We have then the following linear regression model.
where ε _{ ijk } is the error term in the model, which follows a normal distribution with mean =0 and variance \(={\sigma ^{2}_{i}}\).
After fitting the linear regression model, the bump hunting method will be implemented according to the following steps:

Estimate \(\beta ^{*}_{i}\) for each locus i.

Estimate a smooth function β ^{∗}(t) using these estimates.

Use this smooth function β ^{∗}(t) to estimate the regions R _{ n }, n=1,…,N for which β ^{∗}(t)≠0 for all t∈R _{ n }.

Assign statistical uncertainty to each estimated region using permutation tests.
We examined two pvalues generated from the bump hunting method in the minfi package. One pvalue is the raw pvalue from the bump hunting method, adjusted through the Benjamini and Hochberg procedure using the p.adjust function in R (Bump hunting BH), and the other pvalue is the qvalue  an adjusted pvalue generated from the minfi package using Storey’s procedure (Bump Hunting qvalue) [22].
Data extraction
We downloaded the ovarian cancer data set [23] and the rheumatoid arthritis data set [24] from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) public functional genomics data repository. The ovarian cancer data set on 540 whole blood samples has GEO accession number GSE19711, generated from the Illumina Infinium 27k Human DNA methylation Beadchip v1.2. The rheumatoid arthritis data set on 691 subjects has GEO accession number GSE42861 and was generated using Illumina HumanMethylation450 BeadChip array. We randomly selected 3, 6, or 12 samples from either the case or the control groups to illustrate the apparent test power comparisons.
Results
Simulation study
We conducted simulation studies to compare the power and stability of six DNA methylation differential analysis methods for independent and correlated DNA methylation levels across CpG loci. Each simulation study included 1,000 independently generated two group samples with sample size (n) of 3, 6, 12, or 24 in both the cancer and normal groups. For all simulations, we set the total number of DNA methylation loci (m) as 1000. The fractions of truly differentially methylated loci \(\left (\pi _{1} = \frac {mm_{0}}{m}\right)\) were set at 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % to cover different scenarios. To mimic the data distribution of a real DNA methylation array experiment, the β values from the DNA methylation array studies for both cancer and normal groups were generated from a mixed beta distribution (0.1B e t a(0.5,5)+0.9B e t a(5,0.5)), for independent DNA methylation levels across CpG loci. For correlated DNA methylation levels across CpG loci, the β values y _{ ij } for ith locus and jth subject were generated from \(y_{\textit {ij}} = \frac {2^{log2(b_{\textit {ij}})  log2(1b_{\textit {ij}}) + e_{\textit {ij}}}}{1 + 2^{log2(b_{\textit {ij}})  log2(1b_{\textit {ij}}) + e_{\textit {ij}}}}\) where b _{ ij } and e _{ ij } were the (i,j)th elements of m×2n matrixes of B and E respectively. B was simulated from a B e t a(0.1,0.1) distribution and E was an matrix of error following a factoranalytic structure E=L U ^{T}+Φ [25]. L=Z×Δ in which Z (a m×4 factor loadings matrix) denoted methylation profiles for constituent cell types and Δ=d i a g(0.55,0.35,0.07,0.03)^{T} (a 4×4 factor scores diagonal matrix) denoted cell proportions through its diagonal elements. U was a 2n×4 matrix of latent effects, and Φ was a m×2n random error matrix. Both Z and U were simulated from N(0,1) distribution. Φ were simulated from N(0,σ _{ i }) with \({\sigma ^{2}_{i}} = 0.5  \sum (diag(\Delta ^{2}))\) (i.e. the standard deviation of each value was \(\sqrt {0.5}\), but the errors were correlated across CpG loci). To set up the mean β value differences between groups, all β values of 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % of 1000 CpG loci in normal group subtracted a sequential vector from 0.1 to 0.4 with a length of 10, 50, 100, 250, 500, 750, and 900. For instance, with 1 % true difference between groups, the first 10 rows of β values from the normal group will equal the original β values in the normal group, generated either from the mixed beta distribution 0.1B e t a(0.5,5)+0.9B e t a(5,0.5) or from \(y_{\textit {ij}} = \frac {2^{log2(b_{\textit {ij}})  log2(1b_{\textit {ij}}) + e_{\textit {ij}}}}{1 + 2^{log2(b_{\textit {ij}})  log2(1b_{\textit {ij}}) + e_{\textit {ij}}}}\), subtracted the sequential vector with a length of 10 from 0.1 to 0.4, i.e. (0.10,0.13,0.17,0.20,0.23,0.27,0.30,0.33,0.37,0.40). The Mvalues were generated using the l o g i t2 transformation of the βvalues \(\left (M = log2\left (\frac {\beta }{1  \beta }\right)\right)\) and the FDR level was set at 5 %.
Simulation results
Independent cases
For simulated DNA methylation data with sample sizes as small as 3 in each group, all methods could control FDR at a desired level of 5 % (Fig. 1 a and Fig. 2 a). In terms of power, the empirical Bayes method was the most powerful, followed by the bump hunting method and the ttest when the proportion of differentially methylated loci was below 50 % (Table 2 and Table 3). The bump hunting method was the most powerful method, followed by the empirical Bayes method and the ttest when the proportion of differentially methylated loci was above 50 %. Within the bump hunting method, power is higher with Storey’s qvalue procedure than with Benjamini and Hochberg’s procedure. Neither the Wilcoxon rank sum test, the KolmogorovSmirnov test, nor the permutation test had power to identify any truly differentially methylated locus across all proportions of differentially methylated loci tested. For stability, the empirical Bayes method was much better than either the ttest or the bump hunting method (Table 4 and Table 5). The bump hunting method had the largest standard deviation of total discoveries once pvalues were adjusted using Storey’s qvalue procedure. The standard deviation of total discoveries from the bump hunting method increased exponentially as the proportions of differentially methylated loci increased. In the simulation studies, no significant differences were observed between β values and M values in terms of FDR control, power, mean number of total discoveries, or standard deviation of total discoveries.
Increasing sample size to 6 in each group, the Wilcoxon rank sum test, the KolmogorovSmirnov test, and the permutation test all showed greater than zero power (Fig. 1 b and Fig. 2 b). While all methods could control FDR at 5 %, the empirical Bayes method remained the most powerful among all methods, followed by the bump hunting method and the ttest, when the proportion of differentially methylated loci was below 25 % (Table 2 and Table 3). The bump hunting method was the most powerful method, followed by the empirical Bayes method and the ttest/permutation test, whenever the proportion of differentially methylated loci was above 25 %. The power of the Wilcoxon rank sum test, the KolmogorovSmirnov test, and the permutation test was lower than the ttest whenever the proportion of differentially methylated loci was below 25 %; however, the power of the permutation test was similar to the ttest, whenever the proportion of differentially methylated loci was above 25 %. The Wilcoxon rank sum test and the KolmogorovSmirnov test had relatively lower power compared to the other methods, even after the proportion of differentially methylated loci increased to 25 % or higher. In terms of stability, the bump hunting method had the largest standard deviation of total discoveries and an exponentially increasing trend, especially when the proportion of differentially methylated loci was larger than 50 % (Table 4 and Table 5). All other methods showed similar stability, while the empirical Bayes method had relatively the smallest standard deviation of total discoveries across all proportions of differentially methylated loci. Significant differences were not observed between β values and Mvalues in terms of power, mean number of total discoveries, and standard deviation of total discoveries, except that the FDR was controlled at a lower level whenever the β values were used for the empirical Bayes method.
For a moderate sample size of 12 in each group, power was not significantly different across methods whenever the proportion of differentially methylated loci was greater than 1 % (Fig. 1 c and Fig. 2 c). The mean number of total discoveries was also similar. standard deviation of total discoveries was maintained at a relatively low level for all methods across all proportions of differentially methylated loci, except for the bump hunting method, which showed a relatively large standard deviation of total discoveries and an exponentially increasing trend, whenever the proportion of differentially methylated loci was above 25 % (Table 4 and Table 5). All methods still controlled FDR within a 5 % level and had a more conservative control of FDR as the proportion of differentially methylated loci increased, with the exception of the bump hunting method which was less conservative in FDR control as the proportion of differentially methylated loci increased. Aside from the fact that the FDR was controlled at a lower level whenever the β values were used for the empirical Bayes method, no significant differences were observed between β values and Mvalues in terms of power, mean number of total discoveries, and standard deviation of total discoveries.
Similar simulation results were observed when sample size was increased to 24 in each group (Fig. 1 d and Fig. 2 d). The power of all methods became almost identical, and the large standard deviation of the bump hunting method became more obvious. Whenever β values were used for analysis using the empirical Bayes method, the FDR was controlled at a relatively lower level as compared to using M values. No significant power or stability differences were observed between the β values and M values.
Overall, the power and stability of all methods increased as sample size increased for both β values and M values (Table 2, 3, 4 and 5). It was observed that the permutation method retained lower power whenever the proportion of differentially methylated loci was as low as 1 %, regardless of sample size. The Wilcoxon rank sum test and the KolmogorovSmirnov test had exactly the same power and stability whenever either β values or M values were used.
Correlated cases
When methylation levels were correlated across CpG loci and sample size was as small as 3 in each group, the FDR and power estimates were different from independent cases. The ttest and empirical Bayes method had very large FDR estimates with both β values and M values. The bump hunting method had estimated FDR exceeding 0.05 when β values were used, but the FDR was well controlled at 5 % when M values were used (Fig. 3 a and Fig. 4 a). Interestingly, the bump hunting method had much higher power than all other methods especially when the proportion of differentially methylated loci was lower than 25 %. We also noticed that the power of the bump hunting method was higher when using β values than when using M values (Table 6 and Table 7). The bump hunting method also had a larger mean of total discoveries than all other methods, and identified more loci when β values were used. The stability trend remained the same as in the independent case. The bump hunting method still had the lowest stability among all methods compared (Table 8 and Table 9).
When increasing sample size from 3 to 6 in each group, the FDR and power estimates also showed different characteristics from independent cases. The ttest still had very large FDR estimates when either β values or M values were used. The empirical Bayes method had decent control of FDR when β values were used; however, it lost control of FDR when M values were used (Fig. 3 b and Fig. 4 b). When M values were used, the bump hunting method had the highest power among all methods compared. When β values were used, the bump hunting method was the most powerful method, followed by the empirical Bayes method and the ttest whenever the proportion of differentially methylated loci was lower than 10 % or higher than 75 %, while the empirical Bayes method had the highest power whenever the proportion of differentially methylated loci was between 10 % and 75 % (Table 6 and Table 7). For stability, the bump hunting method had still the largest standard deviation of total discoveries whenever the proportion of differentially methylated loci was high, either with β values or M values (Table 8 and Table 9).
For a sample size of 12 in each group, the power and stability of all methods started to converge. The Wilcoxon rank sum test, the ttest, and the KolmogorovSmirnov test had estimated FDR values larger than 0.05 whenever β values were used and the proportion of differentially methylated loci was smaller than 10 %. When M values were used, only the bump hunting method and the permutation test had FDR controlled at 5 % whenever the proportion of differentially methylated loci was smaller than 10 % (Fig. 3 c and Fig. 4 c). With β values, the empirical Bayes method had slightly higher power than all other methods whenever the proportion of differentially methylated loci was smaller than 25 %. The bump hunting method had the highest power whenever the proportion of differentially methylated loci was greater than 25 %. Using M values, the permutation test had the highest power whenever the proportion of differentially methylated loci was smaller than 50 %, and the bump hunting method had the highest power whenever the proportion of differentially methylated loci was greater than 50 % (Table 6 and Table 7). The stability of all methods began to converge, but the bump hunting method still showed slightly larger standard deviation than all other methods compared (Table 8 and Table 9).
With a sample size of 24 in each group, the power of all methods was similar (Table 6 and Table 7). The ttest had estimated FDR over 5 % whenever β values were used and the proportion of differentially methylated loci was smaller than 10 %. All methods had estimated FDR within 5 % when M values were used (Fig. 3 d and Fig. 4 d). The bump hunting method with Storey’s qvalue adjustment still showed low stability whenever the proportion of differentially methylated loci was large (Table 8 and Table 9).
In summary, the power and stability of all methods showed differences when using β values versus M values in all correlated cases (Table 6, 7, 8 and 9). Whenever sample sizes were 3, 6, or 12 in each group, the ttest, the empirical Bayes method, and the bump hunting method had larger power using β values than M values. The same observation was made whenever sample size was increased to 24 in each group with the proportion of differentially methylated loci smaller than 25 %. The Wilcoxon rank sum test and the KolmogorovSmirnov test had similar power using β values or M values, and the permutation test had higher power using M values than β values. The permutation method still retained low power whenever the proportion of differentially methylated loci was as low as 1 %, regardless of sample size. All methods were observed to produce slightly larger standard deviations whenever using β values rather than M values, except for the bump hunting method and the empirical Bayes method whenever the proportion of differentially methylated loci was larger than 50 % for sample size of 12 in each group.
Real data examples
Ovarian cancer
Ovarian cancer ranks fifth in cancer death among women in the United States [26]. Aberrant DNA methylation was found to be associated with ovarian cancer. A genome wide DNA methylation profiling of United Kingdom Ovarian Cancer Population Study (UKOPS) was conducted to identify methylation signatures associated with carcinogenesis [23]. The data is available publicly, downloaded from the NCBI GEO website with GEO number GSE19711. The data originated from the Illumina Infinium 27k Human DNA methylation Beadchip v1.2 with 27578 CpGs from 540 whole blood samples, and 266 samples were taken from postmenopausal ovarian cancer patients, and 274 from normal controls (agematched). To illustrate the differences in apparent test power (total number of discoveries) across the six methods at different FDR levels, we randomly selected either 3, 6, or 12 samples from both the cancer pretreatment group and control group. The FDR levels ranged from 0.01 to 0.10, with a step of 0.01. Due to lack of significants using adjusted pvalues for all methods, the raw pvalues were used for comparisons. Thus, the Storey’s qvalue procedure and the BenjaminiHochberg procedure from the bump hunting method had the same raw pvalues.
When we randomly took 3 samples from both the cancer and control groups, both the empirical Bayes method and the bump hunting method showed higher apparent test power than the four other methods (Fig. 5). No discoveries were made either with the Wilcoxon rank sum test, the KolmogorovSmirnov test, or the permutation test, below a FDR level of 0.08. The ttest had lower apparent test power than the empirical Bayes method and the bump hunting method. However, differences between the empirical Bayes method and the bump hunting method were not significant. When we randomly selected 6 samples from both groups, total discoveries were similar for all methods, except for the KolmogorovSmirnov test which had still a relatively lower apparent test power than all other methods. No significant differences were observed between results using either βvalues or Mvalues. When increasing sample size further to 12 in each group, we observed that all methods had more convergent apparent test power than when using a sample size of 6 in each group. Similarly, the KolmogorovSmirnov test showed increased power as sample size increased to 12 in each group.
Rheumatoid arthritis
Rheumatoid arthritis is a complex disease whose etiology involves the interaction of genetic, environmental, and lifestyle factors [27]. Epigenomewide associations study have implicated DNA methylation as an intermediary of genetic risk in rheumatoid arthritis using Illumina HumanMethylation450 arrays on 354 rheumatoid arthritis cases and 337 controls [24]. The Methylation data was downloaded from the NCBI GEO website with accession number GSE42861. To demonstrate further the differences in apparent test power across the six methods at different FDR levels for popular HumanMethylation450 arrays, we randomly selected samples of size 3, 6, or 12 from both the rheumatoid arthritis case group and control group with the same FDR level set in the Ovarian Cancer example. The Storey’s qvalue procedure and the BenjaminiHochberg procedure from the bump hunting method had the same raw pvalues.
The apparent test power showed similar results to those observed in the ovarian cancer example (Fig. 6). When sample size was 3 in each group, the empirical Bayes method and the bump hunting method had higher apparent test power than all other methods compared. The empirical Bayes method had a slightly higher apparent test power than the bump hunting method when β values were used, while the bump hunting method had a slighter higher apparent test power than the empirical Bayes method when Mvalues were used. All other methods showed similar results to those observed in the ovarian cancer example. When sample size was further increased to 6 or 12 in each group, the apparent test power of all methods compared were similar.
Overall, the results of the apparent test power comparisons of the six DNA methylation differential analysis methods using real data were consistent with our simulation results.
Discussion and conclusions
In simulation studies, we compared six DNA methylation data analysis methods in terms of FDR control, power, and stability in both independent and correlated cases. These methods’ apparent test power based on raw pvalues were also compared using two real data examples. For independent cases, no significant differences were detected between β values and M values in terms of FDR control, power, and stability, except that FDR was controlled at a lower level for the empirical Bayes method when β values were used for analysis. The similarity of the simulation results using either the β values or M values was probably due to the linear relationship between β and M values when β values were in the [0.2, 0.8] range and M values were in the [2, 2] range as pointed out by Du [9]. The differences between β and M values in the empirical Bayes approach are likely a result of model misspecification in the case of β values, potentially leading to an overestimation of standard deviations and thus deflation of significance. For correlated cases, the FDR control, power, and stability of the methods compared showed differences when using β values versus M values, which might have resulted from the correlations across CpG loci. The higher power and slightly lower stability observed in the ttest, the empirical Bayes method, and the bump hunting method when using β values rather than M values for small sample size data deserves further exploration.
In highthroughput data analysis, small or medium sample sizes are very common due to scant resources and funding constraints. Low statistical power challenges the reliability of studies, especially in small biomedical studies with low sample size [28]. Choosing appropriate approaches for DNA methylation data analysis could help investigators maximize the likelihood of true discoveries from small sample size studies with limited resources and funding. For small sample size data, both the empirical Bayes method and the bump hunting method showed good FDR control and much higher power than all other methods in independent cases. The empirical Bayes approach shrinks the estimated sample variance of the ordinary tstatistic towards a pooled estimate, resulting in higher power and more stable inference in small sample size studies [19]. The bump hunting method borrowed the strength of neighbor CpG loci, which improved the power of the DNA methylation analysis for small sample size [20]. When the methylation levels were correlated across CpG loci, only the bump hunting method showed decent control of FDR and much higher power than all other methods compared. The inflated FDR from the ttest and the empirical Bayes method was likely caused by the violation of the t distribution assumption when sample size is small, and by the violation of the independence assumption of methylation levels across CpG loci. The wellcontrolled FDR and high power from the bump hunting method might be due to its strength in taking the probe location information into account to model the correlation structure of error variances [20]. When sample size is very small (n=3), the zero power of the permutation test is due to the limited number of the possible combinations of permutation [29]. For the Wilcoxon rank sum test and the KolmogorovSmirnov test, the zero power when sample size is 3 may also be due to the discrete distribution of their test statistics estimated under the null hypothesis. For medium or large sample size, all methods had almost equivalent power, except for the permutation test with a very low proportion of differentially methylated loci, which deserves further exploration to elucidate causality. It is expected to see that the Storey’s qvalue adjustment from the bump hunting method has higher power than the BenjaminiHochberg’s pvalue adjustment from the bump hunting method, as the Storey’s qvalue adjustment has a larger cutoff value than the BenjaminiHochberg pvalue adjustment [22]. Meanwhile, the larger cutoff value from the Storey’s qvalue adjustment resulting from the inclusion of estimated π _{0} also brought more variability in the analysis results, as indicated by the greater standard deviation of total discoveries.
In highthroughput data analysis, it is important to examine the power and stability of multiple testing procedures to learn the likelihood of true discoveries from empirical studies [30, 31]. In general, the use of either β values or M values is appropriate; however, it is advisable to take into account the differences observed between the β values and M values whenever applying the methods to DNA methylation differential analysis. When DNA methylation levels are independent across CpG loci, we recommend the bump hunting method and the empirical Bayes method in studies constrained by small sample sizes. When DNA methylation levels are correlated across CpG loci, we do recommend the bump hunting method in studies constrained by small sample sizes. In studies with medium to large sample size, all methods are suitable. With DNA differential methylation data analysis, researchers need to exercise caution with regard to the low stability of the bump hunting method whenever the proportions of differentially methylated loci are large, and with regard to the inflated FDR of the empirical Bayes method whenever DNA methylation levels are correlated across CpG loci in studies constrained by small sample sizes.
References
 1
Baylin SB. DNA methylation and gene silencing in cancer. Nat Clin Prac Oncol. 2005; Suppl 2:4–11.
 2
Vavouri T, Lehner B. Human genes with CpG island promoters have a distinct transcriptionassociated chromatin organization. Genome Biol. 2012; 13(11):110.
 3
Das PM, Singal R. DNA methylation and cancer. J Clin Oncol. 2004; 22(22):4632–42.
 4
Kaminskas E, Farrell A, Abraham S, et al. Approval summary: azacitidine for treatment of myelodysplastic dyndrome subtypes. Clinical Cancer Res. 2005; 11:3604–8.
 5
Sharma S, Kelly T, Jones P. Epigenetics in cancer. Carcinogenesis. 2010; 31:27–36.
 6
Moshe S. DNA methylation signatures for breast cancer classification and prognosis. Genome Med. 2012; 4(3):26. doi:10.1186/gm325. http://www.genomemedicine.com/content/4/3/26.
 7
Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan J, Shen R. High density DNA methylation array with single CpG site resolution. Genomics. 2011; 98:288–95.
 8
WilhelmBenartzi CS, Koestler DC, Karagas MR, Flanagan JM, Christensen BC, Kelsey KT, Marsit CJ, Houseman EA, Brown R. Review of processing and analysis methods for DNA methylation array data. Br J Cancer. 2013; 109:1394–402.
 9
Du P, Zhang X, Huang C, Jafari N, Kibbe WA, Hou L, Lin SM. Comparison of Betavalue and Mvalue methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010; 11:587.
 10
Price EM, Cotton AM, Lam LL, Farré P, Emberly E, Brown CJ, Robinson WP, Kobor MS. Additional annotation enhances potential for biologicallyrelevant analysis of the illumina infinium humanmethylation450 beadchip array. Epigenetics & Chromatin. 2013; 6(1):4. doi:10.1186/1756893564. http://www.epigeneticsandchromatin.com/content/6/1/4.
 11
Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological). 1995; 57(1):289–300.
 12
Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B (Methodological). 2002; 64(3):479–98.
 13
Rubin D, Dudoit S, van der Laan MJ. A method to increase the power of multiple testing procedures through sample splitting. Stat Appl Genet Mol Biol. 2006; 5(1):1544–6115. http://www.degruyter.com/view/j/sagmb.2006.5.1/sagmb.2006.5.1.1148/sagmb.2006.5.1.1148.xml;jsessionid=2327FB240E9E1522A22AABE4359FABC9.
 14
Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945; 1(6):80–3.
 15
Rice JA. Mathematical Statistics and Data Analysis, Third Edition. Belmont, CA: Duxbury Advanced Series; 2006.
 16
Kolmogorov A. Sulla determinazione empirica di una legge di distribuzione. G Ist Ital Attuari. 1933; 4:83–91.
 17
Smirnov N. Table for estimating the goodness of fit of empirical distributions. Ann Math Stat. 1948; 19:279–81.
 18
Westfall PH, Stanley Young S. ResamplingBased Multiple Testing: Examples and Methods for pValue Adjustment. New York, NY: WileyInterscience; 1993.
 19
Smyth GK. Linear models and empirical bayes for asessingdifferential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):3.
 20
Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012; 41(1):200–9.
 21
Aryee MJ, Jaffe AE, CorradaBravo H, LaddAcosta C, Feinberg AP, Hansen KD, Irizarry RA. Minfi: a flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics. 2014; 30(10):1363–9.
 22
Storey JD. The positive false discovery rate: a Bayesian interpretation and the qvalue. Ann Stat. 2003; 31(6):2013–35.
 23
Teschendorff AE, Menon U, GentryMaharaj A, Ramus S, et al. Agedependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 2010; 20(4):440–6.
 24
Liu Y, Aryee M, Padyukov L, Fallin M, et al. Epigenomewide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol. 2013; 31(2):142–7.
 25
Houseman EA, Molitor J, Marsit CJ. Referencefree cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014; 30(10):1431–9. doi:10.1093/bioinformatics/btu029.
 26
Jemal A, Siegel R, Xu J, Ward E. Cancer statistics. CA Cancer J Clin. 2010; 56:106–30.
 27
Klareskog L, Catrina A, Paget S. Rheumatoid arthritis. Lancet. 2009; 373:659–72.
 28
Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013; 14:365–76.
 29
Li D, Dye TD. Power and stability properties of resamplingbased multiple testing procedures with applications to gene oncology studies. Comput Math Methods Med. 2013; 2013:11.
 30
Qiu X, Xiao Y, Gordon A, Yakovlev A. Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics. 2006; 7:50.
 31
Gordon A, Glazko G, Qiu X, Yakovlev A. Control of the mean number of false discoveries, bonferroni and stability of multiple testing. Ann Appl Stat. 2007; 1(1):179–90.
Acknowledgements
Dr. Li’s and Dr. Dye’s time is supported by the University of Rochester’s Clinical and Translational Science Award (CTSA) number UL1 TR000042 from the National Center for Advancing Translational Sciences of the National Institutes of Health. Dr. Le Pape is supported by a Center for Biomedical Research Excellence (COBRE) P20GM103516.
We would like to thank the Center for Integrated Research Computing at the University of Rochester for providing high performance computing resources. We would also like to thank Kathleen Holt for her valuable comments and the reviewers’ insightful comments and suggestions which improved the quality of our manuscript.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
DL, ZX designed the simulation studies. DL performed the simulation studies and real data analysis. DL, ZX, MLP, and TD wrote the manuscript. All authors edited and approved the final manuscript.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, D., Xie, Z., Le Pape, M. et al. An evaluation of statistical methods for DNA methylation microarray data analysis. BMC Bioinformatics 16, 217 (2015). https://doi.org/10.1186/s128590150641x
Received:
Accepted:
Published:
Keywords
 DNA methylation
 Power
 Stability