Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis

Background We present here a computational shortcut to improve a powerful wavelet-based method by Shim and Stephens (Ann Appl Stat 9(2):665–686, 2015. 10.1214/14-AOAS776) called WaveQTL that was originally designed to identify DNase I hypersensitivity quantitative trait loci (dsQTL). Results WaveQTL relies on permutations to evaluate the significance of an association. We applied a recent method by Zhou and Guan (J Am Stat Assoc 113(523):1362–1371, 2017. 10.1080/01621459.2017.1328361) to boost computational speed, which involves calculating the distribution of Bayes factors and estimating the significance of an association by simulations rather than permutations. We called this simulation-based approach “fast functional wavelet” (FFW), and tested it on a publicly available DNA methylation (DNAm) dataset on colorectal cancer. The simulations confirmed a substantial gain in computational speed compared to the permutation-based approach in WaveQTL. Furthermore, we show that FFW controls the type I error satisfactorily and has good power for detecting differentially methylated regions. Conclusions Our approach has broad utility and can be applied to detect associations between different types of functions and phenotypes. As more and more DNAm datasets are being made available through public repositories, an attractive application of FFW would be to re-analyze these data and identify associations that might have been missed by previous efforts. The full R package for FFW is freely available at GitHub https://github.com/william-denault/ffw.

Bump-hunting methods [4,6] can, to some extent, alleviate these issues by detecting a predetermined difference in the spatially ordered variables (e.g., SNPs or CpGs) that correlate with the trait or disease under scrutiny. However, as the name implies, bump-hunting methods can only analyze a single bump at a time, failing to consider the combined effect of multiple bumps in a given region. Besides the lack of power [5], bump-hunting methods also rely on a re-sampling procedure, such as bootstrapping or permutation, which makes them computationally more intensive (for details, see [4]).
To help address these methodological shortcomings, we developed a wavelet-based method to enable the detection of more complex signals than those present in a single bump. Although several methods are now available for functional association analysis based on wavelets [5,[7][8][9], they do not scale well in terms of computational time when the sample size (n) or the number of regions (R) becomes exceedingly large (e.g., when n ≈ 1000 or R > 1000 ). As wavelet coefficients are not independent, Lee and Morris [5] and Ma and Soriano [9] proposed searching for associations between a trait and a function by using wavelet regression that takes into account the dependencies between the wavelet coefficients. However, this requires the use of a re-sampling procedure such as Markov Chain Monte Carlo (MCMC) (exemplified in [5,7]) or the computation of complex analytical posterior distributions as in Ma and Soriano [9]. To address these issues, Shim and Stephens [8] proposed simplifying the modeling by omitting the dependencies and using a likelihood ratio test to search for associations between a trait and a function. This simplification is, however, still limited because of the need for permutations to evaluate the significance of the likelihood ratio.
In our current approach, which we call "Fast Functional Wavelet" (FFW), we combine the framework of Shim and Stephens [8] with recent results on the theoretical null distribution of Bayes factors by Zhou and Guan [10]. Our approach allows a fast emulation of the test statistic in Shim and Stephens [8] and reduces the computational time considerably. The main difference between FFW and WaveQTL is that FFW requires regressing the trait of interest on the wavelet coefficients, regardless of the application. Hence, the design matrix for association is kept constant across all the screened regions. This simple modification enables the null distribution of the test statistic to be simulated using only a χ 2 1 distribution, thereby circumventing the need for extensive permutations to assess the significance of a region. By keeping the design matrix for association constant across the screened regions, and using the results of Zhou and Guan [10], we can show that the same distribution can be used to emulate the null distribution of each regional test. This null distribution depends on a single parameter that is easily computed. Keeping the design matrix constant across the screened regions can lead to a reverse regression, resulting in shrunken estimates and a potential loss of power (see Fuller, Chapter 1 [11]). Besides making the null distribution easier to emulate, reverse regression also allows the analysis of a wider variety of traits (continuous, binary, or count).
Although the focus of the current paper is on DNAm data, we describe FFW more broadly to highlight its utility for other types of high-dimensional data, such as those from a GWAS [3], dsQTL analysis [2], and functional data from biomechanical research [1].

Description of our wavelet-based approach
There are different types of wavelets, and readers interested in a more comprehensive introduction to wavelets are referred to Nason [12]. In our application here (FFW), we only consider the simplest type of wavelet-the Haar wavelet.

Processing
For simplicity, we consider genetic regions composed of evenly-spaced variables of length 2 J . As the assumption of evenly-spaced variables rarely holds in practice, we use the approach of Kovac and Silverman [13] to handle this limitation in our software implementation. Their approach mainly consists of using an interpolation on the observed data to obtain evenly-spaced variables of length 2 J .
A given region is defined as a compound of X 1 , . . . X 2 J , with 2 J spatially-ordered variables. Suppose we observe these 2 J variables for n individuals; we then denote X i,k as the kth observation of the ith variable. The wavelet coefficients for the individual k are defined as follows. For wavelet coefficients corresponding to the highest scale or resolution (i.e., J), and for i ∈ 1, 2 J −1 These wavelet coefficients correspond to local differences between adjacent variables. For a lower scale (i.e., j < J ), the corresponding wavelet coefficients are computed as follows: where X j,2i,k , is defined as: X j,2i,k correspond to the scaled average of the adjacent variables for individual k (for further details, see Nason [14]). Finally, the wavelet coefficients for the lowest scale (i.e., 0) are computed as follows: The procedure described above is known as Mallat's pyramid algorithm for signal processing [15]. To ease comprehension, we denote X jl as the random variable representing the wavelet coefficient at the scale j, with 1 ≤ j ≤ J , and at the location l, with 1 ≤ l ≤ 2 j−1 .

Modeling
We first summarize the work of Shim and Stephens [8] before presenting our main methodological contributions in the "Significance of a region" section further below.
To identify associations between a region and a phenotype (denoted as hereafter), we assess whether specific scales are associated with at different locations. Let π be a vector of length J, where ∀j ∈ [0 : J ], π j ∈ [0, 1] and π j represents the proportion of wavelet coefficients at scale j associated with . To assess the significance of a given genetic region, we test the following hypothesis: In the next sections, we describe the test statistic (likelihood ratio), how its different components are computed, and how its significance is tested.

Bayes factors
To test for associations between and the wavelet coefficient X jl for a given region, we use the Normal-Inverse-Gamma (NIG) prior to perform a regression between each wavelet coefficient and . Note that our framework easily allows confounders to be incorporated into the regression models. We quantile-transform each wavelet coefficient across the individuals to reduce the proportion of spurious associations due to distribution-related issues.
The association models for each scale and location are defined as follows: where C is a matrix of dimension c × n , β jl,C is a matrix of dimension 1 × c and ǫ ∼ N (0, σ 2 ) , where σ 2 is unknown. Next, we compute the Bayes factors of the wavelet regression jl using the closed form provided by Zhou and Guan [10] for the NIG prior.

Ratio statistic
Our goal is to assess the significance of the vector π = (π 0 , . . . , π j , . . . , π J ) , where π j represents the proportion of wavelet coefficients at scale j associated with , and X is the wavelet representation of the individual functions. To test the significance of π , we construct a test statistic by computing the following likelihood ratio: Following the approach of Shim and Stephens [8], we denote γ jl as the random variable with support {0, 1} . Thus, γ jl = 1 if the wavelet coefficient X jl is associated with , and 0 if not. We consider π as a hyperparameter of γ jl ; i.e., Shim and Stephens [8] assume independence between the wavelet coefficients. However, this may not to hold in practice [16]. Under the assumption of independence of the wavelet coefficients, we can rewrite 4 as: (2) H 0 : π = (0, . . . , 0)vsH 1 : ∃j ∈ [0 : J ], π j � = 0 p(x jl |γ jl =0,�) as the Bayes factor of the association between the wavelet coefficient at scale s and location l. Using this notation, we can rewrite 7 as: We then compute the likelihood ratio statistic by maximizing the lambda statistics over π and estimating π using the EM algorithm.

Significance of a region
As the distribution of is unknown, we simulate under H 0 by simulating BF jl under H 0 . Recently, Zhou and Guan [10] showed that, under H 0 and a wide spectrum of priors, the Bayes factors (including the NIG prior) follow a specific distribution for a Gaussian model. More precisely, where Q 1 is a non-central chi-squared random variable with df = 1, and ǫ = O(1) and its non-centrality parameter has a closed-form. The parameter 1 is the largest eigenvalue of is the prior effect size of the NIG prior for the intercept and the covariates). By keeping the design matrix constant across the regions, 1 stays the same for all the regions and only needs to be computed once. The non-centrality parameter is region-dependent in general, but it is exactly zero when the null hypothesis of the Bayes factor is β jl,1 = 0 . Zhou and Guan [10] showed that, for df = 1, Q 1 is asymptotically equal to the likelihood ratio test statistic for Gaussian linear models. In other words, Q 1 is equal to a simple chi-squared statistic with one degree of freedom. Therefore, we use the approximation in Eq. (11) for the distribution of the Bayes factors. Note also that Zhou and Guan [10] showed that this approximation is exact when using a Normal prior.
By using this approximation, it is only necessary to compute a single parameter for all the regions. We can then perform M independent simulations of the vector of Bayes factors under H 0 . This corresponds to M vectors of length 2 J , corresponding to the number of wavelet coefficients. Next, for each simulated vector of Bayes factors BF m , we compute the simulated likelihood ratio � m = max π ∈[0,1] J �(π , BF m ) using the procedure described above. Monte Carlo methods for p value estimation can then be applied to the set of observed statistics.

Simulations and application
We performed a set of simulations to evaluate the gain in computational time with FFW and to assess the significance of the test statistic. We also evaluated the statistical power of FFW using a realistically simulated dataset. Lastly, we ran a sensitivity analysis for the priors to assess the sensitivity of FFW according to the choice of prior. All the simulations were performed on an ordinary laptop, equipped with an Intel(R) i7-700HQ 2.80 GHz processor and 8 GB of RAM.

Gain in computational time
We performed separate EWASes using FFW and WaveQTL, and report the run time for different sample sizes. Figure 1 illustrates the substantial gain in computational time with FFW. The green and orange curves represent the total time it took to perform an EWAS based on DNAm data generated on the Illumina 450K platform (using the same pre-processing steps as in the "Power and prior sensitivity analysis" section).
The data used to estimate the computational time were generated as follows. First, we simulate each individual's DNAm profile using independent and identically distributed (iid) uniform random variables on [0, 1]. More precisely, we simulate each individual's DNAm as being generated by the Illumina 450K array, by simulating the value of each probe on the array using a random variable on [0, 1]. Next, we apply the same pre-processing steps as in the "Power and prior sensitivity analysis" section, resulting in 4731 regions to test for association.
Finally, in the "Computational cost" section of the Appendix, we derive the theoretical computational cost for WaveQTL and FFW.

Type I error
We estimated the type I error for four distinct scenarios and performed simulations under the assumption of no association, using the test functions block, bump, heaviSine, and doppler as previously described in Donoho and Johnstone [17]. The different functions are illustrated in Fig. 2 (adapted from Donoho and Johnstone [17]). For each simulation s, we propose the following model of association. For each test function T, we perform the following simulation, and using T, we generate a population of observed functions as follows: where a k is the individual amplitude of the test function, with a ∼ N (0, 1) and ǫ ∼ N (0, 1) . We denote F as the set of observed functions. We then generate Y k ∼ N (0, 1) , which represents a continuous trait not associated with the considered test function. Next, we wavelet-transform each individual function f k and quantile-transform each wavelet coefficient in the population. We then compute the likelihood ratio � (F , Y ) s , as well as 1 s , which is the largest eigenvalue of X X t X + is the prior effect size of the NIG prior for the intercept and the covariate). We simulate � (F , Y ) s M = 10 6 times using These simulations are denoted as � (F , Y ) m s . We compute the Monte Carlo p value as This procedure is repeated 75,000 times for each sample size and type of test function (block, bump, heaviSine, and doppler). Table 1 summarizes the estimated type I errors for different sample sizes and test functions. These results show that the type I errors are handled satisfactorily for all the function types and sample sizes.

Power and prior sensitivity analysis
To evaluate the power and performance of FFW on DNAm data, we used the same dataset as in Lee and Morris [5]. This dataset is a combination of 26 methylation profiles on chromosome 3, containing a total of 75,069 probes. Every patient's methylation profile was measured twice, once in cancer cells and once in control cells. The phenotype (Y) is thus a binary indicator corresponding to a cancer (Y = 1) or control (Y = 0) cell.
The simulations are designed as follows. The true mean methylation level is kept identical for all the probes except the 1901 loci that were previously found to be differentially methylated in Irizarry et al. [18]. For these 1901 probes, the mean methylation levels were made to be different between cases and controls according to the difference reported by Irizarry and colleagues [18].
The above simulations are designed to ensure that the two groups have the same DNAm profile for all CpGs except the 1901 loci reported to be differentially methylated in Irizarry et al. [18]. For further information regarding the simulated data, readers are referred to the Supporting Information section in Lee and Morris [5].
Pre-processing As CpG sites are not evenly spaced in the genome, the wavelet transform is well-suited for modeling such sites as a function, provided there is a sufficiently large number of measurements. We pre-processed the DNAm data by dividing the genome into smaller regions containing at least 10 CpGs, with any two adjacent CpGs separated by a maximum distance of 500 base pairs. This criterion is similar to the one used by Jenkinson et al. [19], where the authors studied regions of 3 Kb containing at least 10 CpGs.
The above pre-processing step resulted in a total of 1213 regions, containing 1875 of the 1901 CpGs in Irizarry et al. [18] that were scattered across 89 of the 1213 defined regions. For each region, we investigated whether the CpG patterns varied between case (n = 13) and control (n = 13) cells. As each region contains at least 10 CpGs, we used a depth of analysis of 3. We then ran FFW and WaveQTL for different values of the standard deviation of the prior on the previously defined regions. Finally, for each method and standard deviation of the prior, we computed the p value for each region and the corresponding false discovery rate (FDR) using the Benjamini-Hochberg procedure [20]. Figure 3 shows the consistency of FFW according to different standard deviations of the prior. This figure also shows that FFW and WaveQTL have similar power.
To evaluate the type I error of FFW for various standard deviations of the prior, we used the above dataset of cancer and control cells to generate the test statistics under the null. We performed 50 screenings of the dataset by permuting the phenotype in each screening. This corresponds to 56,500 observations of the test statistics under the null for different standard deviations of the prior. Table 2 summarizes the estimated type I errors for different standard deviations of the prior. The calibration is slightly worse than    Table 1, which might be because the approximation used here is only valid asymptotically (see Eq. 11). Moreover, only 26 individuals were available for analysis in the case-control dataset, and we only considered independent samples in our modeling. Even though the dataset contained repeated measurements, the estimations were similar across the different priors. As was the case with statistical power, Fig. 3 also shows that FFW and WaveQTL have a similar performance.
We assessed the power of FFW according to the number of differentially methylated CpGs per region (Table 3). We computed the average proportion of truly associated regions across different standard deviations of the prior for each FDR level and the number of differentially methylated CpGs per region. FFW had higher power for regions containing a large number of differentially methylated CpGs (Table 3). We also computed the power according to the number of differentially methylated CpGs per region using WaveQTL. We found the same power estimates as those shown in Table 3. This is unexpected, considering the relatively small number of truly associated regions (n = 89). Figure 3 also shows slight differences between FFW and WaveQTL in relation to FDR. Still, these discrepancies are negligible and indicate that the two approaches have similar power (Additional file 1). Figure 4 shows matching ROC curves for FFW and WaveQTL, which indicate that the two methods have the same power. We suspect that, since WaveQTL estimates the p value using an early-stopping Monte Carlo approximation based on 10,000 permutations, the estimated FDR might be slightly more conservative than the one obtained based on simulations. However, when we rank the regions by FDR, we obtained the same ranking for WaveQTL as for FFW. As ROC curves are invariant if the ranking of the regions does not change, we obtain the same ROC curves as a result.
To compare FFW with another wavelet-based method, we repeated the same analyses using the "Wavelet-based Functional Mixed Models" (WFMM) method by Morris and Carrol [7] on the same dataset as above. WFMM can be used to detect DMRs [5], and, more generally, to detect signals via wavelet regression. The authors used an empirical Bayes approach to perform a regularization of the estimated effects. Their model can thus take into account a larger range of correlations between the observed DNAm profiles in each individual. WFMM is thus able to handle repeated measures of DNAm. WFMM processed all the 75,069 CpG sites in one go and computed the posterior probability for each of the CpGs being above a set threshold (here 0.05) for being associated with cancer. We ran WFMM by specifying the correlation structure between the observations. Following the approach of Lee and Morris [5], we transformed the posterior probabilities of the CpGs into Bayesian FDR [5]. To compare the performance of WFMM against that of FFW, we first need to provide a regional significance criterion for WFMM. We used the minimum Bayesian FDR value for all the CpGs within a region of interest to assign a regional significance criterion. After running WFMM on the entire dataset, we used the minimum Bayesian FDR value for each of these regions to assess significance.
The results showed that WFMM had higher power than FFW, detecting all the 89 regions with an FDR below 0.01. This difference in power might be due to the refined modeling proposed by Morris and Carrol [7], which takes advantage of the correlations between DNAm profiles. Notably, Lee and Morris [5] showed that taking these correlations into account resulted in a systematic gain in power. In terms of computational time, however, WFMM took more than 6 h to complete the screening of 1213 regions, whereas FFW took a minute.

Discussion
This paper reports on a computational shortcut for improving the wavelet-based approach proposed by Shim and Stephens [8]. We drew inspiration from the work of both Shim and Stephens [8] and Zhou and Guan [10] to develop a faster functional modeling that is applicable to a wider variety of functions and phenotypes. The approach of Shim and Stephens [8] was designed to identify dsQTL. Here, we show that waveletbased approaches can also be used to detect differentially methylated regions (DMRs). Both WaveQTL by Shim and Stephens [8] and FFW offer a more flexible approach to modeling functions than conventional single-point testing. By keeping the design matrix constant across the screened regions and using simulations instead of permutations, we show that FFW is faster than WaveQTL. In addition, FFW controls the type I error satisfactorily for large sample sizes.
Reverse regression is a very useful tool for reducing the overall computational burden. However, the downside of reverse regression is that the coefficients from the analysis may become less interpretable. If the objective of a study is to estimate the effect of a particular wavelet in a given DNAm region, then one needs to rerun the procedure using individual wavelet coefficients as exposures. Therefore, FFW might function better as an initial screening tool to gain important biological insights from the DNAm data. For other types of data, such as those from biomechanical research, the wavelet coefficients are more directly interpretable. For example, if a researcher is interested in studying the effect of a particular treatment on motor function, e.g., leg function in the strength-dexterity test [21], FFW would lend itself easily to such an analysis.
An additional methodological constraint is the need to assign a given value for J in the applications of FFW. We chose a cutoff of J = 3 because of the requirement for the screened regions to contain least 10 CpGs, with any two adjacent CpGs separated by a maximum distance of 500 base pairs. In general, we advise choosing as large of a J as possible, while restricting 2 J < κ , where κ is an integer larger than 4. This can be written as J = max j {2 j < κ} when analyzing regions containing at least κ CpGs. In our current application, this corresponded to J = 3. Nevertheless, it is possible to choose a different J for each region. Given the test statistic depends on J, choosing a different J for each genetic region would require the test statistic to be simulated for different values of J. As shown in Fig. 1, one can quickly simulate the test statistic. Therefore, simulating different values of J is likely to have a negligible impact on the overall run time.
FFW is well-powered to detect DMRs containing more than 10 CpGs, even when the CpGs only have small effects. However, FFW has lesser power for detecting DMRs containing only a few CpGs ( ≤ 10 ). Although it is less powerful than WFMM [5], FFW has the advantage of being significantly faster in terms of computational time. WFMM took more than 6 h to process one chromosome for 26 individuals, whereas FFW took a minute. We thus expect WFMM to become exceedingly slow if there is a need to scale up the analysis to include hundreds of individuals and data from denser DNAm platforms, such as the Illumina 850K, or those from whole-genome bisulfite sequencing [22]. Therefore, FFW is a useful complementary tool for the rapid scanning of EWAS datasets to detect DMRs that can subsequently be used in downstream fine-mapping efforts. An attractive application of FFW would be to re-analyze DNAm data from previously published EWASes that are publicly available through, e.g., the Gene Expression Omnibus (GEO) database [23].
In future developments, we plan to extend FFW to also include phenotypes on non-ordered scales, e.g., blood types and psychiatric phenotypes. Such phenotypes are routinely treated in a case-control fashion and analyzed using multinomial regression owing to the prohibitively large computational burden. However, by exploiting reverse regression, as we do here, the phenotypes can be re-coded and readily included in the predictor matrix. Reverse regression also enables FFW to easily adapt to the setting of a phenome-wide association study (PheWAS), in which multiple phenotypes are interrogated simultaneously ( [24][25][26]). As highlighted by our analyses, this development is further simplified by the results of Zhou and Guan [10], showing that the parameter of the Bayes factors law depends primarily on the singular values of the regression matrix and the number of parameters tested. As the regression matrix remains constant across all loci, locations, and scales, these parameters only need to be computed once, thus enabling a fast computation of p values. This makes FFW a highly versatile method for analyzing phenotypes that do not lend themselves easily to either single-point or bump-hunting methods.
FFW is distributed as an R package. The package contains the analysis code and a data visualization tool to enable a more detailed inspection of the detected regions. The full R package is freely available on GitHub (https ://githu b.com/willi am-denau lt/ffw), and a comprehensive example run of the package is provided in the help function ffw.
Additional file 1: Figure S4. ROC curves at given standard deviations of the prior. The thin solid curves are the output of FFW; the thick dashed curves are the output of WaveQTL. The ROC curves match for standard deviations of the prior of 0.5 and 1.
performed, including the ones for the permutation as well as for the cost of the EM algorithm. The second term is the overall cost of the wavelet transform.
Early-stopping methods, such as the methods of Gandy [28] or Besag and Clifford [29], are algorithms that estimate p values in a sequential fashion. As in most applications, the interesting p values are the small p values. Early-stopping methods aim at avoiding using a lot of computational power on large p values. Therefore, when it is likely that the test p value is large, such algorithms stop and move on to the next p value to be estimated. Early-stopping methods thus aim at focusing only on the interesting tests and allocating most of the computational power to those. The simctest R package contains most of recent early-stopping algorithms, and include the method of Gandy [28].
Using FFW, and neglecting the cost of simulations, we obtain a computational cost of: where the first term is the computational cost of all the regressions performed, the second term is the overall cost of the wavelet transform, and the third term is the number of iterations performed by the EM algorithm. The R term is for running the EM algorithm for each region, and the R 2 term is for running the EM algorithm for each simulation used to assess the significance of each region. Using early-stopping rules for Monte Carlo p values, this can be further reduced to: where S ′ is the number of iterations needed to assess the significance of the smallest p value using the early-stopping rules for Monte Carlo p values ( S ′ ≤ R 2 ). In essence, this procedure reduces the degree of the polynomial cost of the region by one. In addition, the term np 2 + p 3 , which carries part of the computational burden, is now only associated with a first-order polynomial in R and no longer associated with a third and seconddegree polynomial. In a scenario in which all the regions have 2 J variables, the first term would thus represent the overall cost corresponding to the single-point testing strategy.