Wavelet Screaming: a novel approach to analyzing GWAS data

We present an alternative method for genome-wide association studies (GWAS) that is more powerful than the regular GWAS method for locus detection. The regular GWAS method suffers from a substantial multiple-testing burden because of the millions of single nucleotide polymorphisms (SNPs) being tested simultaneously. Furthermore, it does not consider the functional genetic effect on the response variable; i.e., it ignores more complex joint effects of nearby SNPs within a region. Our proposed method screens the entire genome for associations using a sequential sliding-window approach based on wavelets. A sequence of SNPs represents a genetic signal, and for every screened region, we transform the genetic signal into the wavelet space. We then estimate the proportion of wavelet coefficients associated with the phenotype at different scales. The significance of a region is assessed via simulations, taking advantage of a recent result on Bayes factor distributions. Our new approach reduces the number of independent tests to be performed. Moreover, we show via simulations that the Wavelet Screaming method provides a substantial gain in power compared to the classic GWAS modeling when faced with more complex signals than just single-SNP associations. To demonstrate feasibility, we re-analyze data from the large Norwegian HARVEST cohort. Keywords: Bayes factors, GWAS, SNP, Multiple testing, Polygenic


Introduction
The objective of a genetic association study is to identify the location of genetic regions (loci) that are associated with a phenotype of interest. Although the human genome is very similar across individuals, it is interspersed by single base-pair differences (SNPs) that contribute to the observed differences across individuals. One of the most common approaches to uncovering genetic associations for a given trait or disorder is to conduct a genome-wide screening for associations (GWAS) where the significance of the effect of each SNP on a phenotype of interest is assessed in a sequential fashion. Despite its many successes, this approach is limited by two important issues: (i) it incurs a substantial multiple-testing burden, and (ii) it ignores the functional nature of the genetic effect by failing to exploit the dense genotyping of the genome. Because the genome is a code for the observable phenotype and only considers the change of one incremental unit per se (i.e., one additional variant), it is unlikely that a SNP alone would be able to efficiently model how a change in the genome might mirror a change in the phenotype.
To address these issues, we developed the Wavelet Screaming approach by harnessing insights from functional mixed modeling (Morris and Carroll, 2006). More specifically, we adopt the approach described by Shim and Stephens (2015) where the authors first tested for association with a functional phenotype (the response signal) by transforming the signal using fast discrete wavelet transform (Mallat, 2008) and then performing single-SNP association tests with the spectrum. In essence, our main idea is to reverse this approach. The use of reverse regression to identify genetic loci is now more widespread in the genetic literature (Aschard et al., 2017). Our approach entails treating sizable chunks of the genome (≈ 1 million base pairs) as the functional phenotype. We then essentially perform a dimensional reduction using wavelet transform and subsequently test for associations between the wavelet coefficients and the true phenotype (considered here as an endogenous variable).
The issue of multiple testing can be resolved using other regularization methods, such as the Fused Lasso (Robert Tibshirani and Knight, 2005). The principle of Fused Lasso is to perform a penalized regression that takes into account how variables (i.e., SNPs) that are physically close to each other might have similar effects. Fused Lasso can then define a region of association between the SNPs and the phenotype. However, Fused Lasso performs local testing, whereas it is preferable to conduct a broader test by estimating the fraction of wavelets (blocks) associated at each scale. Such broader testing combined with multiple levels of information may provide additional insights into the mechanism underlying the genetic association.
Despite an increased interest for penalized regressions in the statistical community, they seldom appear in the top-tiered genetics publications. Although penalized regression has recently been added to one of the leading software for GWAS -PLINK (Purcell et al., 2007), the lack of a comparable software for meta-analysis is a major limitation of this approach. This is because a comprehensive genome-wide association meta-analysis (GWAMA) typically relies on summary statistics from multiple cohorts. Even if meta-analyses are now doable in the Lasso regression setting (Lockhart et al., 2014), they are not currently available for variants of Lasso regression or for other regularization penalties. By contrast, our method is easily amenable to meta-analysis through the use of the classic Fisher (R, 1958) method to combine p-values for loci that show associations in different cohorts.
One of the main drawbacks of a straightforward wavelet regression is that the considered sets of SNPs are not of length 2 J and the physical locations of the SNPs are not evenly spaced. To handle this, we suggest using the robust wavelet regression developed by Kovac and Silverman (Kovac and Silverman, 2000) . By reversing the regression and targeting a given region for association, we can tackle these two main issues by: (a) Using regional association instead of SNP-specific association to reduce the number of tests to be performed from 8 million (for common SNPs) to approximately 6000 by using overlapping loci of 1 Mbp in length.
(b) Using a robust wavelet regression for non-equally spaced data that takes into account the functional nature of the genetic effect, i.e., more complex joint effects of multiple nearby SNPs within a region.
The remainder of our paper is structured as follows. We first describe the statistical setting and the wavelet methodology used to generate the wavelet coefficients. Next, we describe how to compute the likelihood of the association between the wavelet spectrum and the phenotype Φ. The phenotype Φ in this paper is a univariate vector of a continuous or a binary trait. After a comprehensive evaluation of the method using different simulations, we apply it to the Norwegian HARVEST dataset, which is a subproject of the Norwegian Mother and Child Cohort Study (MoBa), Magnus et al. (2016)) with a special focus on children's gestational age at birth.

Wavelet representation of the genome
We first process the multi-SNP data using a wavelet transform. At first sight it might seem odd to transform the raw data into a different format. However, this type of approaches is widely used in the "Gene-or Region-Based Aggregation Tests of Multiple Variants" (Seunggeung Lee and Lin (2014)). In these methods, like in the Burden test, the effects of the genetic variants in a given region are summed up to construct a genetic score that can then be used in the regression.
In the following section, we assume some familiarity with wavelet transform. A comprehensive introduction to wavelets is available in Nason (2008). In the rest of this article, wavelet '(transform)' specifically refers to the 'Haar wavelet (transform)'. We code a SNP 0 if an individual is homozygous for the reference allele, 1 if heterozygous, and 2 if homozygous for the alternate allele -consistent with an additive genetic model (Purcell et al., 2007). Although this coding is arbitrary, it is the standard way of coding (Purcell et al., 2007). Our procedure is resilient to this coding (see Section 3).
For the scale 0, the wavelet coefficient d and c can be interpreted the same way: • The coefficient at scale 0 for an individual summarizes the amount of discrepancy between the individual's genotypes and the reference genotypes coded as 0...0 (which is essentially what is tested in gene/regional tests).
The d wavelet coefficients at scale s > 0 can be interpreted as: • The wavelet d coefficient at scale s and location l for an individual represents the difference in the number of minor variants between the left part of the region (defined by s, l) and the right part.
The c wavelet coefficients at scale s > 0 can be interpreted as: • The wavelet c coefficient at scale s and location l for an individual represent the amount of discrepancy between the individual's genotypes and the reference genotypes coded as 0...0 for the region defined by s, l.
The main rationale behind this modeling is that, if there is a causal effect of an allele on the phenotype, the association is likely to be spread across genomic regions of a given size (scale) at different positions (locations). By using the wavelet transform to perform a position/size (time/frequency) decomposition and then regressing on the wavelet coefficients, we are able to visualize where (location) and how (scale) the genetic signal influences the phenotype.
In the rest of this article, we use 'wavelet coefficients' to specifically refer to any one of two coefficients, c or d, but never to both at the same time. In Section 5 we discuss the use of d or c coefficient. In the rest of this section, using c or d coefficients does not change the general framework. Let us define a genetic region of a chromosome c of individual j as the set of SNPs G. G has physical positions (base pair, bp) between a lower bound lb and an upper bound ub. It follows that a given genetic region for individual j can be written as: We consider GR c,lb,ub,j as a function/signal and observe this signal for every individual at pre-determined increasing positions bp 1 , ..., bp n , with some error due to the genomewide imputation process (Li et al., 2009). Thus, where G bpi,j is the measured genetic signal (SNP) at position bp for the individual j.
The variance σ 2 bp can be interpreted as a function of the imputation quality IQ which has a value in [0, 1]. 1 represents a perfectly imputed SNP or genotyped SNP; thus, quality control, only SNPs with an IQ ∈ [0.7, 1] are retained in further analysis. We assume that the imputation qualities are independent and heteroscedastic. As the value of a SNP is in 0, 2 and then in [0,2] in the dosage convention after imputation (Purcell et al., 2007), the hypothesis of error normality is arguable. In future developments, we might consider a more suitable noise law.

Non-decimated wavelet transform
We use the method of Kovac and Silverman (2000) for non-decimated and unevenly spaced data. This method takes an irregular grid of data (e.g., representing the sampling of the different genetic regions) and interpolates the missing data into a pre-specified regular grid. More precisely, for a genetic region GR lb,ub measured at positions bp 1 ...bp n , a new grid of points is defined as t 0 , ..., t N −1 , where N = 2 J , J ∈ N, t k = (k + 1 2 )2 −J and J = min{j ∈ Z, 2 j ≥ n}. We interpolate the signal on this grid and run the classic wavelet transform to obtain the wavelet coefficients. In practice (see Section 4), we recommend selecting genetic regions that have a relatively high density of imputed SNPs.

Coefficient-dependent thresholding and quantile transform
For each individual wavelet decomposition, we use the VisuShrink approach (Kovac and Silverman, 2000) to shrink the interpolated wavelet coefficients and reduce the dependence between the wavelet coefficients within scales. This method estimates the variance of each wavelet coefficient before determining a specific threshold for each wavelet coefficient. By determining coefficient-dependent thresholds using the wavelet coefficient variance, we can account for the individual heteroscedasticity of the noise. We then quantile-transform each wavelet coefficient distribution within the population. This ensures that the endogenous variables used in the regressions are normally distributed.
Particularly in cases where there were no associations, the residuals were normally distributed.

Modeling
To gauge the effect of a genetic region on the phenotype, we first need to assess whether certain scales are associated with the phenotype at different locations. Let π be a vector of length J where ∀j ∈ [0 : J], π j ∈ [0, 1], where π j represents the proportion of wavelet coefficients at scale j associated with the phenotype. To assess the significance of a genetic region, we want to test the following hypothesis: Below, we describe our test statistic (likelihood ratio) and how we compute its different components and test its significance.

Bayes factors
To test for association between the phenotype and the wavelet coefficientG sl for a given genetic region, we perform a regression between the wavelet coefficient and the phenotype Φ using Normal-Inverse-Gamma (NIG) prior. It is important to correct for confounding factors C in a GWAS. Through our framework, we can readily incorporate those confounding factors into the regression models.
The association models for each scale and location are defined as follows: where C is a matrix of dimension c × 1 and β sl,C is a matrix of dimension 1 × c. We compute the Bayes factors of the wavelet regression sl using the closed form provided by Servin and Stephens (2007) for NIG prior with σ = 0.2.

Ratio statistic
Our goal is to assess the significance of the vector π = (π 0 , ..., π s ) where π s represents the proportion of wavelet coefficients at scale s that is associated with the phenotype Φ, andG is the wavelet representation of the genotype. To test the significance of π, we construct a test statistic by computing the following likelihood ratio: Following the method of Shim and Stephens (2015), we denote γ sl as the random variable with support {0, 1}. γ sl = 1 when the wavelet coefficientG sl is associated with the variable Φ; 0 when there is no association. We consider π as a hyperparameter of Shim and Stephens (2015) assume independence between the wavelet coefficients.
However, this is unlikely to hold in practice because of the correlation structure of the genetic regions. Under the assumption of independence of the wavelet coefficients, we can rewrite 5 as follows: as the Bayes factor of the association between the wavelet coefficient at scale s and location l. Computation of the Bayes factor will be discussed below. Using this notation, we can rewrite 8 as: We can then compute the likelihood ratio statistic by maximizing the lambda statistics over π and determining π via the EM algorithm.

Significance of a genetic region
As the distribution of Λ is unknown, we simulate Λ under H 0 by simulating BF sl under the hypothesis of no association. Recently, Zhou and Guan (2017) showed that under H 0 and a wide spectrum of priors, the Bayes factors (including the NIG prior) for a Gaussian model follow a specific law. More precisely, where Q 1 is a non-central chi-squared random variable with df = 1, and λ 1 and its non-centrality parameter have a closed form. Using reverse regression, λ 1 can be computed directly from the design matrix for all the loci at once. Even though the non-centrality parameter is dependent on the wavelet coefficients, it vanishes to zero asymptotically with increasing sample size.
Indeed, Zhou and Guan (2017) showed that for df = 1 Bayes factors, Q 1 is asymptotically equal to the likelihood ratio test statistic for Gaussian linear models. In other words, it is equal to a simple chi-squared statistic with 1 degree of freedom. For large sample sizes (e.g., those typical in a GWAS setting), the non-centrality parameter can be safely set to zero. We can then perform M independent simulations of the vector of Bayes Factor under H 0 . Then, for each simulation m, we computeΛ m = max π∈[0,1] s Λ(π, BF m ) using the procedure described above.

P-value as a tail estimation problem
One may suggest computing the p-value via the classic method from the permutation procedure. However, users of this procedure often fail to check the number of permutation needed to obtain reliable p-values, especially at the low end of the scale. By using the normal approximation of the estimation (Markus Ojala, 2010), the number of permutations k required to obtain a reliable p-value has to be more than 1 4P 2 , where P is the desired level of significance. However, in our case, we need reliable p-value at the level of significance 0.05 6000 ≈ 8 × 10 −6 for a full genome screen, which would imply ≈ 4 × 10 10 simulations. To avoid this computational burden, we suggest using a smaller number of simulations (10 7 ) to estimate the tail distribution ofΛ by fitting a Generalized Pareto distribution. Knijnenburg et al. (2009) provided an extensive review on the use of the tail approximation by Generalized Pareto distribution to assess the significance of tests. We assume that the tail distributionΛ can be modeled by extreme-value distributions (i.e., we assumeΛ is in the domain of attraction of one of the extreme-value distributions). Estimation of the high quantile by Generalized Pareto distribution fitted by maximum likelihood leads to correct inference (Hosking and Wallis (1987)), especially when the estimated quantile is within the sample. We perform the fitting using a threshold determined by the Van Kerms rule of thumb Kerm (2007). Next, we use this fitted distribution to estimate the associated quantiles of the observation, and thus the p-values.

Complex genetic signals
We performed simulations for a complex genetic signal by combining real genetic data with a simulated phenotype. We used the locus displayed in Figure 1, computed the linkage disequilibrium (LD) structure, and selected 28 SNPs located at the center of each LD block. We performed two sets of simulation: • Mono-directional : for each iteration, we randomly selected 1 to 28 SNPs. For each individual, we summed their SNP dosages within the selected set of SNPs to construct a score. On top of the individual score, we added normally distributed noise, scaled so that the genetic score explains 0.5% of the total phenotypic variance.
• Random direction: the same setting as above, but the sign of the effect (positive/negative) for each SNP is taken at random. Compared to the mono-directional simulation, where any additional variant raises the level of the phenotype, this is not necessarily the case for random direction. These simulations are made to showcase the sensitivity of Wavelet Screaming to the direction of the SNP coding.
These simulations are engineered to mimic a diluted effect of SNPs within different LD blocks, or "block polygenic effect" where each variant has a small additive effect in the same direction. The variance explained by a single SNP varies between 0.5%, which is typical for the top SNPs in a GWAS (Evan A. Boyle (2017)), to 0.018%, at which level variants are normally not detected by the standard GWAS framework.
We performed Wavelet screaming on these simulations using c and d coefficients.
Because the design matrices do not vary between iterations (the phenotype varies but not the genotype), we can compute λ 1 directly using its closed form. We obtained λ 1 = 0.99974. As the average of the simulated phenotype increases with decreasing λ 1 (at least near 1), to be conservative, we performed the simulations ofΛ using λ 1 = 0.9997.
We computedΛ on 1000 index permutations between the scores and the genotype to obtain the null distribution ofΛ. The simulated distribution has slightly heavier −20 0 20 −2 0 2 Score Phenotype Fig. 2. Simulated phenotype against the generated score (20 SNPs selected) tails than the permuted dataset (see Figure 3). Therefore, p-values computed from the simulated distribution for large values ofΛ will tend to be conservative. We further computed the p-values using the fitted distribution of the simulatedΛ. As the null distributions are 'spiked' around 1, we provide in the Annex a zoomed plot around 1. Table 1 provides a summary of the results of the above simulations. Compared to the standard GWAS procedure, Wavelet Screaming clearly improves discovery rate for both c and d coefficients. To investigate this further, we checked the dependency of power and number of components within the signal. These results are displayed in Table 2. Using a linear model, Wavelet Screaming and the standard GWAS have roughly the same power for a single-SNP effect. However, when there is dilution of an effect due to polygenicity, Wavelet Screaming performed substantially better for both d and c coefficients. Via logistic regression of power on the number of components, we find a significant difference in slopes (Fisher test for general linear model at p < 10 −5 ) but not in intercepts.

Wavelet screaming improves the discovery rate
It appears that for mono-directional Wavelet Screaming with c coefficients, the effect dilution has a positive impact on the discovery rate. This might be due to the simulation setup, as all the SNPs are assumed to have an effect in the same direction. As the c coefficients represent the amount of variation within a region, c coefficients mimic the local score correctly. However, even if the variant is arbitrarily coded as having a positive effect (i.e., random direction simulation), we still obtain a clear improvement in the discovery rate. This demonstrates the robustness of Wavelet Screaming to potential "misspecification" in the coding. As far as we know, no other regional-based test can handle this problem without assuming a rare variant distribution.
The results in  We conclude that Wavelet Screaming for d and c coefficients does not significantly improve the discovery rate for loci that harbor a single causal SNP, but it is superior than the standard GWAS framework in detecting regions with multiple signals. Thus, in a general setting, Wavelet Screaming for d coefficient is more powerful than standard GWAS. In addition, for SNPs that have an effect pointing in the same direction, Wavelet Screaming with c coefficient provides a substantial improvement in power.

Application
To test the utility of our new method, we performed a chromosome-wide association study of human gestational duration using Wavelet Screaming. Gestational duration is a complicated phenotype to study in a GWAS setting because of a combination of large measurement errors (≈ 7 days, Nils-Halvdan Morken (2006) ) and typically small genetic effects (≈ 1.5 days (Zhang et al., 2017)). We used GWAS data on mothers from the Norwegian HARVEST study (Magnus et al., 2016) to replicate the lead SNPs reported in the largest GWAS to date on gestational duration (Zhang et al., 2017). These SNPs are located on chromosome 5, near the gene for Early B cell factor 1, EBF1, and are likely to regulate EBF1 gene activity. By using the same methodology as in Zhou and Guan (2017), the lowest p-value obtained in our dataset was 2.8 × 10 −6 , which is not considered statistically significant in the classic GWAS setting.

Definition of the regions and choice of resolution
Although millions of SNPs can now be interrogated in a typical GWAS, several chromosomal regions are characterized by poor marker density, in particular near telomeres and centromeres, and in regions of low imputation quality. Most SNPs with low imputation quality are routinely discarded during quality control after imputation. As we preprocess our data using an interpolation, we aim to avoid analyzing purely interpolated regions. Our strategy entails including an additional criterion in the preprocessing step to exclude these types of regions. We propose studying regions of size 1M bp (mega basepairs), with a maximum distance of 10kb between any two SNPs. Further, we define overlapping regions where the signals are at the boundary of a given region. By applying these additional criteria, we excluded 18% of the SNPs and defined 253 regions on chromosome 5.
In addition to avoiding fully interpolated regions, we also need to decide how deep into the wavelet decomposition we would like to analyze. We know that the precision of the wavelet coefficient depends on the amount of non-interpolated points in a given region (Kovac and Silverman, 2000). As a rule of thumb, we propose to have at least 10 SNPs on average for each wavelet coefficient. Following this preassigned criterion, we ended up with a median spacing between SNP of 202 base pairs. This means that if we divide each locus of 1M b into 2 9 = 512 subregions, we would on average have

Model and results
We applied the Wavelet Screaming approach using the d and c coefficients to the aforementioned dataset on gestational duration. In our modeling, we included the first 6 principal components for each wavelet regression to control for residual population structure (Price et al., 2010). We computed λ 1 via the analytic formula of Zhou and Guan (2017) and obtained λ 1 = 0.9999687. We simulated 10 7 Λ values under these conditions (λ 1 = 0.9999687, resolution = 9). We then estimated the parameters of the general Pareto distribution by setting the location parameter at the minimum of the simulated Λ value ( ≈ 1). The shape parameter ξ = 0.1705 (confidence interval (0.1695 − 0.1715) and the scale parameter β = 0.0103 (0.01027 − 0.01033)).
We used this distribution to compute the p-values for each locus. We identified three loci for the d coefficient and one locus for the c coefficient with p-values below 0.05 6000 ≈ 8.3 × 10 −6 . The discovered loci are depicted in Figure 4. The first and third loci surround EBF1 ; notably, the main SNP from the published meta-analysis is located in the right part of the third plot of Figure 4. In addition, the second locus in Figure 4 is located in a regulatory region containing a promoter and multiple transcription factors.
Furthermore, this locus is located less than 1 Mb from two genes, which suggest that it might be involved in their regulation. This locus therefore warrants further investigation.
We use the classic pyramidal wavelet decomposition representation to display the Bayes Factors corresponding to each wavelet coefficient, with point size and darkness representing their values (i.e., the highest Bayes factors are marked by the darkest and largest points). Furthermore, if a Bayes factor is larger than one (i.e., a region contributing toΛ), we highlight the region corresponding to the wavelet coefficient.

Discussion
In this paper, we introduce Wavelet Screaming as a novel and more powerful alternative to the classic GWAS methodology. It offers a more flexible modeling scheme than the standard single-point testing approach and substantially improves the discovery rate.
In future development, we aim to expand this tool to include phenotypes on nonordered scales (e.g., blood types or psychiatric phenotypes), which are normally treated  In case of small sample sizes (n < 1000), additional parameter computations are needed. Nevertheless, it is still possible to run an efficient screening using a two-stage procedure: first, using the (non-conservative) asymptotic p-value approximation with the non-centrality parameter set as zero, one may select loci that pass the desired threshold.
Second, for the selected loci, one may compute the non-centrality parameter of each regression and then compute the specific distribution for these loci (correct p-value).
By contrast, when the sample sizes are large (n > 10 6 , like in Lee et al. (2018)), the user will face the well-known Bartlett's paradox Bartlett (1957), which implies that the Bayes factors would converge to 0. Despite the conservativeness that the paradox brings to discovery, it can also annihilate the discovery capacity. In this setting, the choice of σ b can be crucial. Servin and Stephens (2007) suggest to average through some different values of σ b to obtain the proper Bayes factor, which is unrealistic in our setting. In our future work, we will be investigating how to attain a more optimal value for σ b in term of the discovery capacity.
Due to the complexity of the test statistics, it is hard to infer directly how power would be influenced by the parameters. Future work should focus on exploring the power behavior under different conditions (e.g., sample size, variance explained, unequally distributed effect between SNPs, etc.).
Wavelet Screaming is similar to the"Gene-or Region-Based Aggregation Tests of Multiple Variants" described by (Seunggeung Lee and Lin, 2014) and others (often referred to as burden tests). In these methods, the genetic variants are summed up in a given region to construct a genetic score that are subsequently used in the regression. To some extent, our approach is analogous to extending/generalizing this by performing a multi-scale regional test. Indeed, by using the c coefficients of the wavelet decomposition instead of the d coefficients, we essentially perform a multi-scale burden test. As shown in the simulation, the c coefficient works well when the SNP alleles have an effect in the same direction, which is the main assumption of the burden test. However, our method is the first region-based method that can confidently be applied using d coefficients, without the need to worry much about errors in the coding itself. Indeed, all the current regional methods assume a mono-directional effect setup.
Lastly, our methodology is highly versatile in its applicability to various "omics" data.
We intend to investigate its application to, e.g., methylation data, and the feasibility of adding one more level of hierarchy to extend it to multi-omics analyses.

Software
The Wavelet Screaming method is distributed as an R package. In addition to the analysis code, the package contains a data visualization tool to help elucidate the underlying mechanisms detected by Wavelet Screaming. Our Wavelet Screaming package is available at https://github.com/william-denault/WaveletScreaming. To perform an analysis, the user only needs to specify one parameter (σ b ). We provide a detailed example of how to use our package in the help function wavelet screaming based on simulated data.
We show how to compute λ 1 from our package and how to simulateΛ under the null.
Finally, the user can visualize the output of the wavelet screaming as depicted in 4 using the plot WS function.
of Excellence funding scheme, grant 262700. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We are also grateful to Jonas Bacelis for his detailed comments on the paper, and to Gatien Ricotier for his thorough review of the package.

Citations and References
tional phenotypes arising from high-throughput sequencing assays. Ann. Appl. Stat.,