An evaluation of statistical methods for DNA methylation microarray data analysis

Background DNA methylation offers an excellent example for elucidating how epigenetic information affects gene expression. β values and M values are commonly used to quantify DNA methylation. Statistical methods applicable to DNA methylation data analysis span a number of approaches such as Wilcoxon rank sum test, t-test, Kolmogorov–Smirnov test, permutation test, empirical Bayes method, and bump hunting method. Nonetheless, selection of an optimal statistical method can be challenging when different methods generate inconsistent results from the same data set. Results We compared six statistical approaches relevant to DNA methylation microarray analysis in terms of false discovery rate control, statistical power, and stability through simulation studies and real data examples. Observable differences were noticed between β values and M values only when methylation levels were correlated across CpG loci. For small sample size (n=3 or 6 in each group), both the empirical Bayes and bump hunting methods showed appropriate FDR control and the highest power when methylation levels across CpG loci were independent. Only the bump hunting method showed appropriate FDR control and the highest power when methylation levels across CpG sites were correlated. For medium (n=12 in each group) and large sample sizes (n=24 in each group), all methods compared had similar power, except for the permutation test whenever the proportion of differentially methylated loci was low. For all sample sizes, the bump hunting method had the lowest stability in terms of standard deviation of total discoveries whenever the proportion of differentially methylated loci was large. The apparent test power comparisons based on raw p-values from DNA methylation studies on ovarian cancer and rheumatoid arthritis provided results as consistent as those obtained in the simulation studies. Overall, these results provide guidance for optimal statistical methods selection under different scenarios. Conclusions For DNA methylation studies with small sample size, the bump hunting method and the empirical Bayes method are recommended when DNA methylation levels across CpG loci are independent, while only the bump hunting method is recommended when DNA methylation levels are correlated across CpG loci. All methods are acceptable for medium or large sample sizes.

removing DNA methylation marks, which highlights the importance of understanding DNA methylation in disease etiology and treatment [4,5].
Large-scale examination of DNA methylation through microarray or sequencing technologies makes epigenomewide association studies (EWAS) feasible to explore associations between DNA methylation and cancers in the sustained effort to develop novel anti-cancer drugs, and to identify DNA methylation markers associated with certain cancers for prognosis and diagnosis purpose [6]. The Illumina HumanMethylation BeadChip technology is a popular platform for conducting epigenome-wide association studies. Three platforms have been developed by Illumina for DNA methylation assay: GoldenGate, Infinium Human Methylation27 and Infinium Human-Methylation450 BeadChip. All platforms use two fluorescent dye colors to recognize the bisulphite-converted sequence. The standard output from the BeadChip assay for quantifying methylation is the β value, which is calculated from the intensity of methylated allele (Max(M, 0)) and the intensity of unmethylated allele (Max(U, 0)) according to the following formula [7].
The β values are usually preprocessed for the downstream statistical analysis. The summary on preprocessing the β values including quality control, background correction, and normalization could be found somewhere else [8]. For differential DNA methylation analysis, the average β value denotes the methylation level, or the percentage for an interrogated locus. The average β values vary between 0 and 1. In an ideal situation, "zero" indicates that no copy of the CpG site in the sample is methylated, and "one" indicates that every copy of the site is methylated. The average β value approximates the methylation percent for the population of a sampled CpG site. Alternatively, some investigators use the M-value, considering the M-value alternative statistically more valid [9]. The M-value is defined as: The range of M-values could be from −inf to +inf , consistent with the data range for a normal distribution. However, interpretations of M-values are not as intuitive as for β-values. A properly normalized M-value approaching zero indicates that a specific CpG site is half-methylated. Positive M-values suggest a methylation rate greater than 50 %, while negative M-values indicate a methylation inferior to 50 %. The β-values and M-values are related through a log2 ratio transformation such as: It has been shown that there is an approximately linear relationship between β-values and M-values in the middle range of the methylation data ([0. 2, 0.8] for β values, and [-2, 2] for M-values) [9].
We used both β-values and M-values in our simulation studies and real data examples, which should provide guidance to investigators in selecting β-values or Mvalues for their differential DNA methylation analysis with regard to FDR control, power, and stability.
Currently available methylation differential analysis methods implemented in Bioconductor/R include several approaches such as Wilcoxon rank sum test (used in methyAnalysis package), t-test (used in methyAnalysis, CpGAssoc, RnBeads, and IMA package), Kolmogorov-Smirnov Tests (although not implemented in packages, but used by some investigators [10]), permutation test (used in CpGAssoc package), empirical Bayes method (used in RnBeads, IMA and minfi package), and bump hunting method (used in bumphunter and minfi package). However, with so many options available to investigators, selection of an optimal statistical method, can be challenging-especially when different methods applied to the same data set generate inconsistent results. As such, we systematically investigated these commonly used DNA methylation differential analysis methods in terms of their FDR control, power, and stability, through simulation studies. We illustrated the respective advantages and disadvantages of these methods with real methylation data sets, in order to provide empirical evidence and advice to investigators in selecting the most appropriate DNA methylation analysis methods for their studies.

Methods
Hypothesis testing for each DNA methylation locus was done using either the average β-values or the transformed M-values of the different groups. Assume there are m methylation loci from the DNA methylation array assay. Among the m methylation loci, m 0 loci are not differentially methylated. Suppose R methylation loci are rejected of m total loci, then V indicates the number of falsely rejected methylation loci (or "false discoveries") from R rejections, and S denotes the true number of differentially methylated loci between groups in R rejections (R = V + S). The possible outcomes of testing m DNA methylation loci simultaneously are shown in Table 1. When testing m DNA methylation loci simultaneously, we need to control multiple testing error rate as opposed from testing a single DNA methylation locus.
The most commonly used multiple testing error rate for discovery purposes is the false discovery rate proposed by Benjamini and Hochberg [11], defined as: Another definition of false discovery rate proposes to control the expected proportion of false discoveries E( V R |R > 0) when R > 0 [12] such as: pFDR = 1 when m 0 = m. FDR and pFDR set to 0 when R = 0. FDR and pFDR are similar when the phenotype is not associated with DNA methylation for most of the CpGs. In our simulations, two different multiple testing procedures were used to control for FDR/pFDR . Through a step-up procedure, the Benjamini-Hochberg procedure [11] provides control of FDR at α level. The Benjamini-Hochberg procedure compares ordered P (i) with i m α from the largest p, rejecting all H (i) i = 1, 2, . . . , k with P (i) ≤ i m α. The Benjamini-Hochberg procedure provides strong control for FDR at level α (for independent and positively correlated test statistics). Similarly to the Benjamini-Hochberg procedure, the Storey's q-value procedure [12] uses conservative point estimators of m 0 , m 0 (λ) (λ is a tuning parameter). With larger cutoffs, the Storey's q-value leads to higher power than the Benjamini-Hochberg procedure asm 0 (λ) ≤ m. The Storey's q-value controls pFDR at α, with test statistics correlated weakly or independently.
Besides controlling FDR at a desired α level in the multiple testing process, we would also desire that the DNA methylation analysis method possess enough power to detect true differential DNA methylation loci and be consistent from experiment to experiment. Power is defined as the expected proportion of true differentially methylated loci detected among the total number of true differentially methylated loci [13]. Stability is measured as the standard deviation (SD) of the count of the differentially methylated loci detected. Power and stability of a differen-tial DNA methylation analysis method could be expressed using the following formulas: Power is defined as 0 and Stability becomes a measure of standard deviation of false detections when m = m 0 .

Wilcoxon rank sum test (rank test)
Wilcoxon rank sum test (i.e., Mann-Whitney U test) is a rank-based non-parametric test and used in the methy-Analysis package as a differential methylation analysis method [14]. It is usually used as an alternative to the two-independent sample t-test when the assumption of normal data distribution is violated for the t-test.
Assume the methylation level denoted either by βvalues or M-values for ith locus, jth group, and kth subject is y ijk . Suppose j = 1 denotes the normal group and j = 2 denotes the cancer group. For each DNA methylation locus, the null hypothesis of the Wilcoxon rank sum test is that the distribution of y i1k equals the distribution of y i2k for i = 1, 2, . . . , m. The two-side alternative hypothesis is a location shift of the distribution of y i2k from y i1k in either direction. The raw p-values from the Wilcoxon rank sum test are then adjusted using Benjamini and Hochberg procedure to control for FDR at level α [11] through the p.adjust function in R.

t-test
Implemented in methyAnalysis, CpGAssoc, RnBeads, and IMA packages, the t-test is a commonly used hypothesis testing method in genomic data analysis for testing equivalence of means between two groups [15]. For two independent samples t-test, there are two t-test procedures depending on whether the variances from those two groups are equal or not. The unequal variance t-test procedure (i.e., Welch's t-test) is usually the default one used in most packages, and does not assume equal variance between groups. The raw p-values from the t-tests are computed based on the t distribution, while adjusted pvalues are obtained using the Benjamini and Hochberg procedure through the same p.adjust function in R.

Kolmogorov-Smirnov test (KS test)
The Kolmogorov-Smirnov test (KS test) is a nonparametric test in statistics for testing the equality of two continuous probability distributions [16,17]. In DNA methylation studies, the null hypothesis is that the distribution of y i1k equals the distribution of y i2k as that in Wilcoxon rank sum test for each locus of i = 1, 2, . . . , m. Sensitive to difference in shape and location of the distribution functions of two groups, the KS test differs from the Wilcoxon rank sum test (sensitive to differences in location). The raw pvalue from the KS test are adjusted using the Benjamini and Hochberg procedure to control FDR at level α.

Permutation test
Permutation test is a resampling-based nonparametric test, which permutes data falling under the null hypothesis of equal data distributions between groups [18]. The distributions of test statistics (usually t-test statistics) are estimated from permuted test statistics. In the CpGAssoc package, the raw p-values from the permutation test for DNA methylation data are adjusted using the p.adjust function in R to control FDR at level α.

Empirical Bayes method
Used in RnBeads, IMA and minfi packages, the empirical Bayes method is a popular hypothesis testing method applied through the lmFit and eBayes functions [19]. First, we can fit a linear model to estimate β * i , the mean differences between two groups for ith locus. In DNA methylation studies, let y T i = (y i 1 , . . . , y i n ) denote the DNA methylation level for both groups with n = n 1 + n 2 for ith locus. Then, we can fit a linear model for each locus using the formula: where X is a design matrix of full column rank, and β * i is a coefficient vector. The β * i coefficient vector includes (β * i0 , β * i1 ) with β * i0 denoting the mean DNA methylation level for normal group, while β * i1 denotes the mean methylation level difference between the cancer group and the normal group. Thus, the null hypothesis for testing the mean methylation level difference between the normal group and the cancer group is H 0 : β * i1 = 0 for locus i = 1, . . . , m. The test statistic for testing H 0 is the moderated t-statistic, based on a hybrid classical/Bayes approach, defined by: The p-value for testing H 0 : β * ij = 0 (H 0 : β * i0 = 0 and H 0 : β * i1 = 0) based on the moderated t-statistic is calculated from the t distribution with d i + d 0 degrees of freedom. More information ons i , v ij , d i , and d 0 could be found elsewhere [19].
The p-value for testing H 0 : β * i1 = 0 can be further adjusted using the p.adjust function to control for FDR at level α.

Bump hunting method
The bump hunting method used in bumphunter and minfi packages was developed to take into account the correlations of methylation levels between nearby CpG locus [20]. The bump hunting method was carried out by first fitting a linear regression model for each locus before smoothing the coefficient within clusters along the genome to identify bumps [21]. More specifically, for each locus, a linear model will be used to estimate the coefficient of difference in methylation levels between the cancer group and the normal groups. Let Y ijk denote the measured methylation level for ith locus, jth group, and kth subject. X ij is an indicator variable with X i1 = 0 for the normal group and X i1 = 1 for the cancer group at all locus i. β * i is the estimated coefficient for X ij , and also stands for the estimated difference in DNA methylation levels between the cancer and the normal groups. We have then the following linear regression model.
where ijk is the error term in the model, which follows a normal distribution with mean = 0 and variance = σ 2 i . After fitting the linear regression model, the bump hunting method will be implemented according to the following steps: (1) Estimate β * i for each locus i. (2) Estimate a smooth function β * (t) using these estimates. (3) Use this smooth function β * (t) to estimate the regions R n , n = 1, . . . , N for which β * (t) = 0 for all t ∈ R n . (4) Assign statistical uncertainty to each estimated region using permutation tests.
We examined two p-values generated from the bump hunting method in the minfi package. One p-value is the raw p-value from the bump hunting method, adjusted through the Benjamini and Hochberg procedure using the p.adjust function in R (Bump hunting BH), and the other p-value is the q-value -an adjusted p-value generated from the minfi package using Storey's procedure (Bump Hunting q-value) [22].

Data extraction
We downloaded the ovarian cancer data set [23] and the rheumatoid arthritis data set [24] from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) public functional genomics data repository. The ovarian cancer data set on 540 whole blood samples has GEO accession number GSE19711, generated from the Illumina Infinium 27k Human DNA methylation Beadchip v1.2. The rheumatoid arthritis data set on 691 subjects has GEO accession number GSE42861 and was generated using Illumina HumanMethylation450 BeadChip array. We randomly selected 3, 6, or 12 samples from either the case or the control groups to illustrate the apparent test power comparisons.

Simulation study
We conducted simulation studies to compare the power and stability of six DNA methylation differential analysis methods for independent and correlated DNA methylation levels across CpG loci. Each simulation study included 1,000 independently generated two group samples with sample size (n) of 3, 6, 12, or 24 in both the cancer and normal groups. For all simulations, we set the total number of DNA methylation loci (m) as 1000. The fractions of truly differentially methylated loci π 1 = m−m 0 m were set at 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % to cover different scenarios. To mimic the data distribution of a real DNA methylation array experiment, the β values a) b) c) d) Fig. 1 Estimated FDRs, powers, means of total discoveries, and SD of total discoveries of six DNA methylation differential analysis methods using β values (Independent case). Blue solid: rank test; Red dashed: t-test; Green dotted: KS test; Black dotdash: permutation test; Orange twodash: Empirical Bayes; Yellow twodash: Bump Hunting BH adjustment (BH-BH); Purple twodash: Bump Hunting q-value adjustment (BH-q) from the DNA methylation array studies for both cancer and normal groups were generated from a mixed beta distribution (0.1Beta(0.5, 5) + 0.9Beta(5, 0.5)), for independent DNA methylation levels across CpG loci. For correlated DNA methylation levels across CpG loci, the β values y ij for ith locus and jth subject were generated from and e ij were the (i, j)th elements of m × 2n matrixes of B and E respectively. B was simulated from a Beta(0.1, 0.1) distribution and E was an matrix of error following a factor-analytic structure E = LU T + [25]. L = Z × in which Z (a m × 4 factor loadings matrix) denoted methylation profiles for constituent cell types and = diag(0.55, 0.35, 0.07, 0.03) T (a 4 × 4 factor scores diagonal matrix) denoted cell proportions through its diagonal elements. U was a 2n × 4 matrix of latent effects, and was a m × 2n random error matrix. Both Z and U were simulated from value was √ 0.5, but the errors were correlated across CpG loci). To set up the mean β value differences between groups, all β values of 1 %, 5 %, 10 %, 25 %, 50 %, 75 %, or 90 % of 1000 CpG loci in normal group subtracted a sequential vector from 0.1 to 0.4 with a length of 10, 50, 100, 250, 500, 750, and 900. For instance, with 1 % true difference between groups, the first 10 rows of β values from the normal group will equal the original β values in the normal group, generated either from the mixed beta distribution 0.1Beta(0.5, 5)+0.9Beta(5, 0.5) or

Independent cases
For simulated DNA methylation data with sample sizes as small as 3 in each group, all methods could control FDR at a desired level of 5 % (Fig. 1a and Fig. 2a). In terms of power, the empirical Bayes method was the most powerful, followed by the bump hunting method and the t-test when the proportion of differentially methylated loci was below 50 % ( Table 2 and Table 3). The bump hunting method was the most powerful method, followed by the empirical Bayes method and the ttest when the proportion of differentially methylated loci was above 50 %. Within the bump hunting method, power is higher with Storey's q-value procedure than with Benjamini and Hochberg's procedure. Neither the Wilcoxon rank sum test, the Kolmogorov-Smirnov test, nor the permutation test had power to identify any truly  differentially methylated locus across all proportions of differentially methylated loci tested. For stability, the empirical Bayes method was much better than either the t-test or the bump hunting method (Table 4 and Table 5). The bump hunting method had the largest standard deviation of total discoveries once p-values were adjusted using Storey's q-value procedure. The standard deviation of total discoveries from the bump hunting method increased exponentially as the proportions of differentially methylated loci increased. In the simulation studies, no significant differences were observed between β values and M values in terms of FDR control, power, mean number of total discoveries, or standard deviation of total discoveries. Increasing sample size to 6 in each group, the Wilcoxon rank sum test, the Kolmogorov-Smirnov test, and the permutation test all showed greater than zero power ( Fig. 1b and Fig. 2b). While all methods could control FDR at 5 %, the empirical Bayes method remained the most powerful among all methods, followed by the bump hunting method and the t-test, when the proportion of differentially methylated loci was below 25 % (Table 2 and  Table 3). The bump hunting method was the most powerful method, followed by the empirical Bayes method and the t-test/permutation test, whenever the proportion of differentially methylated loci was above 25 %. The power of the Wilcoxon rank sum test, the Kolmogorov-Smirnov test, and the permutation test was lower than the t-test whenever the proportion of differentially methylated loci was below 25 %; however, the power of the permutation test was similar to the t-test, whenever the proportion of differentially methylated loci was above 25 %. The Wilcoxon rank sum test and the Kolmogorov-Smirnov test had relatively lower power compared to the other methods, even after the proportion of differentially methylated loci increased to 25 % or higher. In terms of stability, the bump hunting method had the largest standard deviation of total discoveries and an exponentially increasing trend, especially when the proportion of differentially methylated loci was larger than 50 % (Table 4 and Table 5). All other methods showed similar stability, while the empirical Bayes method had relatively the smallest standard deviation of total discoveries across all proportions of differentially methylated loci. Significant differences were not observed between β values and M-values in terms of power, mean number of total discoveries, and standard deviation of total discoveries, except that the FDR was controlled at a lower level whenever the β values were used for the empirical Bayes method. For a moderate sample size of 12 in each group, power was not significantly different across methods whenever the proportion of differentially methylated loci was greater than 1 % (Fig. 1c and Fig. 2c). The mean number of total discoveries was also similar. standard deviation of total discoveries was maintained at a relatively low level for all methods across all proportions of differentially methylated loci, except for the bump hunting method, which showed a relatively large standard deviation of total discoveries and an exponentially increasing trend, whenever the proportion of differentially methylated loci was above 25 % (Table 4 and Table 5). All methods still controlled FDR within a 5 % level and had a more conservative control of FDR as the proportion of differentially methylated loci increased, with the exception of the bump hunting method which was less conservative in FDR control as the proportion of differentially methylated loci increased. Aside from the fact that the FDR was controlled at a lower level whenever the β values were used for the empirical Bayes method, no significant differences were observed between β values and M-values in terms of power, mean number of total discoveries, and standard deviation of total discoveries. Similar simulation results were observed when sample size was increased to 24 in each group (Fig. 1d and  Fig. 2d). The power of all methods became almost identical, and the large standard deviation of the bump hunting method became more obvious. Whenever β values were used for analysis using the empirical Bayes method, the FDR was controlled at a relatively lower level as compared to using M values. No significant power or stability differences were observed between the β values and M values.
Overall, the power and stability of all methods increased as sample size increased for both β values and M values ( Table 2, 3, 4 and 5). It was observed that the permutation method retained lower power whenever the proportion of differentially methylated loci was as low as 1 %, regardless of sample size. The Wilcoxon rank sum test and the Kolmogorov-Smirnov test had exactly the same power and stability whenever either β values or M values were used.

Correlated cases
When methylation levels were correlated across CpG loci and sample size was as small as 3 in each group, the FDR and power estimates were different from independent cases. The t-test and empirical Bayes method had very large FDR estimates with both β values and M values. The bump hunting method had estimated FDR exceeding 0.05 when β values were used, but the FDR was well controlled at 5 % when M values were used ( Fig. 3a and Fig. 4a). Interestingly, the bump hunting method had much higher power than all other methods especially when the proportion of differentially methylated loci was lower than 25 %. We also noticed that the power of the bump hunting method was higher when using β values than when using M values (Table 6 and Table 7). The bump hunting method also had a larger mean of total discoveries than all other methods, and identified more loci when β values were used. The stability trend remained the same as in the independent case. The bump hunting method still had the a) b) c) d) Fig. 3 Estimated FDRs, powers, means of total discoveries, and SD of total discoveries of six DNA methylation differential analysis methods using β values (Correlated case). lowest stability among all methods compared (Table 8 and  Table 9).
When increasing sample size from 3 to 6 in each group, the FDR and power estimates also showed different characteristics from independent cases. The t-test still had very large FDR estimates when either β values or M values were used. The empirical Bayes method had decent control of FDR when β values were used; however, it lost control of FDR when M values were used (Fig. 3b and Fig. 4b).
When M values were used, the bump hunting method had the highest power among all methods compared. When β values were used, the bump hunting method was the most powerful method, followed by the empirical Bayes method and the t-test whenever the proportion of differentially methylated loci was lower than 10 % or higher than 75 %, while the empirical Bayes method had the highest power whenever the proportion of differentially methylated loci was between 10 % and 75 % ( Table 6  and Table 7). For stability, the bump hunting method had still the largest standard deviation of total discoveries  (Table 8 and  Table 9). For a sample size of 12 in each group, the power and stability of all methods started to converge. The Wilcoxon rank sum test, the t-test, and the Kolmogorov-Smirnov test had estimated FDR values larger than 0.05 whenever β values were used and the proportion of differentially methylated loci was smaller than 10 %. When M values were used, only the bump hunting method and the permutation test had FDR controlled at 5 % whenever the proportion of differentially methylated loci was smaller than 10 % (Fig. 3c and Fig. 4c). With β values, the empirical Bayes method had slightly higher power than all other methods whenever the proportion of differentially methylated loci was smaller than 25 %. The bump hunting method had the highest power whenever the proportion of differentially methylated loci was greater than 25 %. Using M values, the permutation test had the highest power whenever the proportion of differentially methylated loci was smaller than 50 %, and the bump hunting method had the highest power whenever the proportion of differentially methylated loci was greater than 50 % ( Table 6 and Table 7). The stability of all methods began to converge, but the bump hunting method still showed slightly larger standard deviation than all other methods compared (Table 8 and Table 9).
With a sample size of 24 in each group, the power of all methods was similar ( Table 6 and Table 7). The ttest had estimated FDR over 5 % whenever β values were used and the proportion of differentially methylated loci was smaller than 10 %. All methods had estimated FDR within 5 % when M values were used ( Fig. 3d and Fig. 4d).
The bump hunting method with Storey's q-value adjustment still showed low stability whenever the proportion of differentially methylated loci was large (Table 8 and  Table 9). In summary, the power and stability of all methods showed differences when using β values versus M values in all correlated cases (Table 6, 7, 8 and 9). Whenever sample sizes were 3, 6, or 12 in each group, the t-test, the empirical Bayes method, and the bump hunting method had larger power using β values than M values. The same observation was made whenever sample size was increased to 24 in each group with the proportion of differentially methylated loci smaller than 25 %. The Wilcoxon rank sum test and the Kolmogorov-Smirnov test had similar power using β values or M values, and the permutation test had higher power using M values than β values. The permutation method still retained low power whenever the proportion of differentially methylated loci was as low as 1 %, regardless of sample size. All methods were observed to produce slightly larger standard deviations whenever using β values rather than M values, except for the bump hunting method and the empirical Bayes method whenever the proportion of differentially methylated loci was larger than 50 % for sample size of 12 in each group.

Real data examples Ovarian cancer
Ovarian cancer ranks fifth in cancer death among women in the United States [26]. Aberrant DNA methylation was found to be associated with ovarian cancer. A genome To illustrate the differences in apparent test power (total number of discoveries) across the six methods at different FDR levels, we randomly selected either 3, 6, or 12 samples from both the cancer pre-treatment group and control group. The FDR levels ranged from 0.01 to 0.10, with a step of 0.01. Due to lack of significants using adjusted p-values for all methods, the raw p-values were used for comparisons. Thus, the Storey's q-value procedure and the Benjamini-Hochberg procedure from the bump hunting method had the same raw p-values. When we randomly took 3 samples from both the cancer and control groups, both the empirical Bayes method and the bump hunting method showed higher apparent test power than the four other methods (Fig. 5). No discoveries were made either with the Wilcoxon rank sum test, the Kolmogorov-Smirnov test, or the permutation test, below a FDR level of 0.08. The t-test had lower apparent test power than the empirical Bayes method and the bump hunting method. However, differences between the empirical Bayes method and the bump hunting method were not significant. When we randomly selected 6 samples from both groups, total discoveries were similar for all methods, except for the Kolmogorov-Smirnov test which had still a relatively lower apparent test power than all other methods. No significant differences were observed between results using either β-values or M-values. When increasing sample size further to 12 in each group, we observed that all methods had more convergent apparent test power than when using a sample size of 6 in each group. Similarly, the Kolmogorov-Smirnov test showed increased power as sample size increased to 12 in each group.

Rheumatoid arthritis
Rheumatoid arthritis is a complex disease whose etiology involves the interaction of genetic, environmental, and life-style factors [27]. Epigenome-wide associations study have implicated DNA methylation as an intermediary of genetic risk in rheumatoid arthritis using Illumina HumanMethylation450 arrays on 354 rheumatoid arthritis cases and 337 controls [24]. The Methylation data was downloaded from the NCBI GEO website with accession number GSE42861. To demonstrate further the differences in apparent test power across the six methods at different FDR levels for popular Human-Methylation450 arrays, we randomly selected samples of size 3, 6, or 12 from both the rheumatoid arthritis case group and control group with the same FDR level set in the Ovarian Cancer example. The Storey's q-value procedure and the Benjamini-Hochberg procedure from the bump hunting method had the same raw p-values. The apparent test power showed similar results to those observed in the ovarian cancer example (Fig. 6). When sample size was 3 in each group, the empirical Bayes method and the bump hunting method had higher apparent test power than all other methods compared. The empirical Bayes method had a slightly higher apparent test power than the bump hunting method when β values were used, while the bump hunting method had a slighter higher apparent test power than the empirical Bayes method when M-values were used. All other methods showed similar results to those observed in the ovarian cancer example. When sample size was further increased to 6 or 12 in each group, the apparent test power of all methods compared were similar.
Overall, the results of the apparent test power comparisons of the six DNA methylation differential analysis methods using real data were consistent with our simulation results.

Discussion and conclusions
In simulation studies, we compared six DNA methylation data analysis methods in terms of FDR control, power, and stability in both independent and correlated cases. These methods' apparent test power based on raw p-values were also compared using two real data examples. For independent cases, no significant differences were detected between β values and M values in terms of FDR control, power, and stability, except that FDR was controlled at a lower level for the empirical Bayes method when β values were used for analysis. The similarity of the simulation results using either the β values or M values was probably due to the linear relationship between β and M values when β values were in the [0.2, 0.8] range and M values were in the [-2, 2] range as pointed out by Du [9]. The differences between β and M values in the empirical Bayes approach are likely a result of model mis-specification in the case of β values, potentially leading to an overestimation of standard deviations and thus deflation of significance. For correlated cases, the FDR control, power, and stability of the methods compared showed differences when using β values versus M values, which might have resulted from the correlations across CpG loci. The higher power and slightly lower stability observed in the t-test, the empirical Bayes method, and the bump hunting method when using β values rather than M values for small sample size data deserves further exploration.
In high-throughput data analysis, small or medium sample sizes are very common due to scant resources and funding constraints. Low statistical power challenges the reliability of studies, especially in small biomedical studies with low sample size [28]. Choosing appropriate approaches for DNA methylation data analysis could help investigators maximize the likelihood of true discoveries from small sample size studies with limited resources and funding. For small sample size data, both the empirical Bayes method and the bump hunting method showed good FDR control and much higher power than all other methods in independent cases. The empirical Bayes approach shrinks the estimated sample variance of the ordinary t-statistic towards a pooled estimate, resulting in higher power and more stable inference in small sample size studies [19]. The bump hunting method borrowed the strength of neighbor CpG loci, which improved the power of the DNA methylation analysis for small sample size [20]. When the methylation levels were correlated across CpG loci, only the bump hunting method showed decent control of FDR and much higher power than all other methods compared. The inflated FDR from the ttest and the empirical Bayes method was likely caused by the violation of the t distribution assumption when sample size is small, and by the violation of the independence assumption of methylation levels across CpG loci. The well-controlled FDR and high power from the bump hunting method might be due to its strength in taking the probe location information into account to model the correlation structure of error variances [20]. When sample size is very small (n = 3), the zero power of the permutation test is due to the limited number of the possible combinations of permutation [29]. For the Wilcoxon rank sum test and the Kolmogorov-Smirnov test, the zero power when sample size is 3 may also be due to the discrete distribution of their test statistics estimated under the null hypothesis. For medium or large sample size, all methods had almost equivalent power, except for the permutation test with a very low proportion of differentially methylated loci, which deserves further exploration to elucidate causality. It is expected to see that the Storey's q-value adjustment from the bump hunting method has higher power than the Benjamini-Hochberg's p-value adjustment from the bump hunting method, as the Storey's q-value adjustment has a larger cutoff value than the Benjamini-Hochberg p-value adjustment [22]. Meanwhile, the larger cutoff value from the Storey's qvalue adjustment resulting from the inclusion of estimated π 0 also brought more variability in the analysis results, as indicated by the greater standard deviation of total discoveries.
In high-throughput data analysis, it is important to examine the power and stability of multiple testing procedures to learn the likelihood of true discoveries from empirical studies [30,31]. In general, the use of either β values or M values is appropriate; however, it is advisable to take into account the differences observed between the β values and M values whenever applying the methods to DNA methylation differential analysis. When DNA methylation levels are independent across CpG loci, we recommend the bump hunting method and the empirical Bayes method in studies constrained by small sample sizes. When DNA methylation levels are correlated across CpG loci, we do recommend the bump hunting method in studies constrained by small sample sizes. In studies with medium to large sample size, all methods are suitable. With DNA differential methylation data analysis, researchers need to exercise caution with regard to the low stability of the bump hunting method whenever the proportions of differentially methylated loci are large, and with regard to the inflated FDR of the empirical Bayes method whenever DNA methylation levels are correlated across CpG loci in studies constrained by small sample sizes.