On the potential of models for location and scale for genome-wide DNA methylation data
- Simone Wahl^{1, 2, 3}Email author,
- Nora Fenske^{4},
- Sonja Zeilinger^{1, 2},
- Karsten Suhre^{5, 6},
- Christian Gieger^{7},
- Melanie Waldenberger^{1, 2},
- Harald Grallert^{1, 2, 3} and
- Matthias Schmid^{4, 8}
DOI: 10.1186/1471-2105-15-232
© Wahl et al.; licensee BioMed Central Ltd. 2014
Received: 18 March 2014
Accepted: 26 June 2014
Published: 3 July 2014
Abstract
Background
With the help of epigenome-wide association studies (EWAS), increasing knowledge on the role of epigenetic mechanisms such as DNA methylation in disease processes is obtained. In addition, EWAS aid the understanding of behavioral and environmental effects on DNA methylation. In terms of statistical analysis, specific challenges arise from the characteristics of methylation data. First, methylation β-values represent proportions with skewed and heteroscedastic distributions. Thus, traditional modeling strategies assuming a normally distributed response might not be appropriate. Second, recent evidence suggests that not only mean differences but also variability in site-specific DNA methylation associates with diseases, including cancer. The purpose of this study was to compare different modeling strategies for methylation data in terms of model performance and performance of downstream hypothesis tests. Specifically, we used the generalized additive models for location, scale and shape (GAMLSS) framework to compare beta regression with Gaussian regression on raw, binary logit and arcsine square root transformed methylation data, with and without modeling a covariate effect on the scale parameter.
Results
Using simulated and real data from a large population-based study and an independent sample of cancer patients and healthy controls, we show that beta regression does not outperform competing strategies in terms of model performance. In addition, Gaussian models for location and scale showed an improved performance as compared to models for location only. The best performance was observed for the Gaussian model on binary logit transformed β-values, referred to as M-values. Our results further suggest that models for location and scale are specifically sensitive towards violations of the distribution assumption and towards outliers in the methylation data. Therefore, a resampling procedure is proposed as a mode of inference and shown to diminish type I error rate in practically relevant settings. We apply the proposed method in an EWAS of BMI and age and reveal strong associations of age with methylation variability that are validated in an independent sample.
Conclusions
Models for location and scale are promising tools for EWAS that may help to understand the influence of environmental factors and disease-related phenotypes on methylation variability and its role during disease development.
Keywords
DNA methylation Beta regression GAMLSS Infinium HumanMethylation450k BeadChip EWAS Modeling variability Resampling Model performance Model comparison Models for location and scaleBackground
DNA methylation is an important epigenetic mechanism that is involved in the regulation of gene expression [1]. In humans, methylation occurs most frequently at cytosine (C) nucleotides preceding a guanine (G) nucleotide, referred to as CpG sites [2]. Owing to its role in numerous physiological processes and disease states including cancer, changes in DNA methylation introduced by lifestyle factors and disease-related phenotypes have received increasing attention [3–9]. High-throughput array technologies such as the Illumina Infinium HumanMethylation450k BeadChip [10] have enabled epigenome-wide association studies (EWAS) to explore the relationship between phenotypes and DNA methylation in large population-based studies [2].
The statistical analysis of genome-wide DNA methylation data entails specific challenges. Most commonly, methylation signals from the Infinium HumanMethylation450K BeadChip are summarized as β-values, the proportion of methylated in relation to the sum of methylated and unmethylated signals at a specific genomic site in a biological sample [10]. Methylation β-values are bounded in the unit interval, with values occuring between 0 (corresponding to 0% methylation at the respective site) and 1 (corresponding to 100% methylation at the respective site). As typical for proportion data, their distributions display substantial heteroscedasticity [11]. Specifically, they tend to show smaller variances when located near the boundaries 0 and 1 as compared to the center of the unit interval. In addition, their distributions are typically skewed. Therefore, when using β-values as response variable in a statistical regression model, traditional modeling strategies that assume a normally distributed response might be inappropriate [12]. To solve this problem, it has been proposed to use the less heteroscedastic M-values for Gaussian regression, which are approximately equal to binary logit transformed β-values [13]. An alternative transformation for proportion data, which is often used in ecology to achieve variance stabilization, is the arcsine square root transformation [14]. This transformation has also recently been applied to methylation data [15]. In addition, beta regression, a statistical regression technique that is tailored to bounded response variables, has been proposed as a “natural” modeling strategy for proportion data [11, 16]. Currently beta regression is beginning to find application in methylation data analysis [17–20]. Although improved performance of beta regression as compared to traditional approaches was reported [18], a thorough comparison of these competing approaches for the purpose of univariate screening for genome-wide phenotype-methylation associations is lacking.
Furthermore, there is increasing evidence that not only mean differences but also variability of methylation plays a role in disease processes, including cancer [21–25]. This enhances the need to screen for an effect of lifestyle factors and disease-related phenotypes on DNA methylation level and variability [9, 26, 27]. In these studies, variability was either modeled isolated from the mean, in a combined test together with the mean, or not explicitly modeled as a function of covariates at all.
Here we propose generalized additive models for location, scale and shape (GAMLSS) [28] as a flexible approach to model methylation data. GAMLSS can be specified such that a regression model is estimated for the mean (location parameter), and another model for the variability (scale parameter), so that covariate effects on both parameters can be quantified simultaneously. Joint modeling of location and scale is preferable to separate modeling since mean and variance of bounded β-values are not independent [11].
The purpose of this study was to compare model performance as well as the performance of downstream hypothesis tests of (1) beta regression as compared to Gaussian regression on raw, binary logit or arcsine square root transformed β-values, (2) with and without simultaneous modeling of covariate effects on the scale parameter within the GAMLSS framework. Using simulated and real data sets from the large population-based research platform Cooperative Health Research in the Region of Augsburg (KORA) (n=2299), we demonstrate that models for location and scale, specifically the Gaussian model on M-values, increase predictive performance as compared to models for location only, while being more sensitive towards violations of the distribution assumption and towards the presence of outliers in the methylation data. To address this problem, we propose and evaluate a resampling-based strategy based on parametric bootstrapping followed by rank-based reassignment of the original data. All findings concerning model performance and evaluation of the resampling-based strategy are subsequently validated in an independent data set of acute lymphoblastic leukemia patients and healthy controls. Finally, an application in an EWAS of BMI and age is presented.
Methods
Data sets
Our study is based on two large data sets of 1799 and 500 subjects from the F4 and F3 studies, respectively, of the population-based research platform KORA [29]. These data sets have been used in published EWAS before [3–5], and study design and data collection have been described in detail [29, 30]. Methylation measurements from whole blood were obtained from the Illumina Infinium HumanMethylation450K BeadChip [10], as described in [3, 4]. Raw methylation data were preprocessed and quality controlled as described in Additional file 1. Subjects with missing or outlying (value outside mean ± 5 standard deviations) covariate information were excluded from the analysis, so that the final F4 methylation data set comprised 413,641 autosomal sites, mainly CpG sites, and 1763 subjects. Similarly, data from 486 F3 subjects were available for analysis. A phenotypic description of both data sets is provided in Tables S1 and S2 in Additional file 2. For the F4 data, genome-wide single nucleotide polymorphism (SNP) data were obtained using the Affymetrix GeneChip array 6.0, and genotypes were imputed with IMPUTE v0.4.2 using the HapMap2 reference panel [31]. In the F3 population, genotyping was performed with the Illumina HumanOmniExpress BeadChip and the Illumina 2.5 BeadChip, followed by imputation with IMPUTE v2.3.0 using the 1000g phase1 reference panel (integrated haplotypes).
Modeling strategies for DNA methylation data
where the inclusion of an offset α_{ β }=100 is recommended as as a stabilization in the situation when both methylated and unmethylated signal intensities are small [13]. Since the total number of signals exceeds 1000 in the majority of CpG sites, the offset does not induce much bias. By definition, β-values are bounded to the unit interval. Ignoring the offset, they can be interpreted as proportion of methylation at a specific CpG site in a biological sample. Typically, β-values at specific CpG sites tend to have a skewed distribution, centered at either a low or a high methylation state, with variances smallest for CpG sites centered close to the bounds of the standard unit interval (Figure S1 in Additional file 3).
We consider regression models with β-values as response variable, and disease-related phenotypes as explanatory variables. When modeling proportions as a function of covariates, their boundedness as a function of covariates, their boundedness implies that the conditional mean must be a nonlinear function of the explanatory variables, since a linear combination of the explanatory variables would not be restricted to values within (0,1). In addition, it implies that the conditional variance must be a function of the conditional mean, i.e. the distribution is heteroscedastic [32]. Both conditions are not met when Gaussian regression is used, so that both the expectation and the variance structure are misspecified, potentially resulting in biased and inconsistent estimation.
where X represents the n×p matrix of covariate values, γ=(γ_{1},γ_{2},…,γ_{ p })^{ T } a vector of regression coefficients corresponding to p covariates, ε the n×1 vector of independently normally distributed error terms with variance σ^{2}, and E(·) the expectation. Note that this model formulation also comprises Gaussian regression with untransformed β-values as response variable. In that case h(.) would be the identity function. In addition, the model can be extended to facilitate explicit modeling of σ as a function of covariates (see below).
In a titration experiment by Du et al.[13], M-values outperformed β-values in terms of power and precision when identifying differentially methylated sites based on methylation differences between two groups. However, when differential methylation was defined based on test statistics, superiority of M-values was only observed in small sample sizes, when regularized t-statistics were used [33]. In multivariate feature selection, β-values tended to better preserve the correlation structure of methylation signals [33]. Generally, due to the logit transformation, the interpretation of a regression model based on M-values is less intuitive than the interpretation of a corresponding model based on untransformed β-values, since covariate effects have to be interpreted with regard to the expectation of the transformed response $\stackrel{~}{y}$ rather than y itself.
to achieve homoscedasticity [14]. The transformed values are then for instance modeled with a Gaussian regression model as described above, specifying h(·) as the arcsine square root transformation. However, the arcsine square root transformation maps (0,1) to (0,π/2), so that values are still bounded.
where y∈(0,1) represents the response vector, μ∈(0,1) a location parameter, σ∈(0,1) a scale parameter, and Γ(·) the gamma function. Typically, f(y;μ,σ) is skewed (see e.g. [11] for an illustration). Expectation and variance of a beta distributed random variable y are given by E(y)=μ, andV a r(y)=μ(1-μ)σ^{2}. Thus, beta regression incorporates boundedness and skewness of a proportion response [11]. Also, because V a r(y) explicitly depends on μ, heteroscedasticity of the response can be modeled via beta regression.
where μ represents the mean of a beta distributed response vector y, and g(·) a strictly monotonic and twice differentiable link function that maps the open unit interval into [16]. In practice, the logit function is commonly chosen for g(·). Note that in case of y∈[0,1], values can be rescaled to lie in (0,1) [11]. In our methylation data, zeros did not occur after preprocessing.
To model the effect of the explanatory variables on V a r(y) (and thereby the heteroscedasticity of the response explicitly), a “variable scale beta regression model” has been proposed as an extension. Thereby, location (μ) and scale (σ) parameter are both modeled, with a possibly distinct set of covariates [11].
Despite these appealing properties and modeling options, beta regression has been considered in few studies on DNA methylation data so far [17–20]. A thorough examination of its performance in comparison to competing modeling strategies, and specifically the relevance of modeling the scale parameter in beta and Gaussian regression models, has not been conducted.
Generalized additive models for location, scale and shape (GAMLSS)
GAMLSS provides a flexible modeling framework for responses from a large class of distributions, including Gaussian and beta distributions [28, 36]. It extends generalized linear and additive models first, by relaxing the distribution assumption, and second, by allowing for modeling not only the mean (location parameter) but also other distribution parameters (e.g. a scale parameter) as a function of explanatory variables. Consequently, beta regression is a special case of a GAMLSS. Moreover, the GAMLSS methodology allows for modeling the residual error variance σ^{2} of the Gaussian models described above. Together, this makes GAMLSS highly relevant for modeling DNA methylation data. Specifically, it provides a unique framework for the comparison of the different approaches to model methylation data considered in this study.
where X_{1} and X_{2} are matrices of covariate values, γ_{1} and γ_{2} the respective vectors of regression coefficients of lengths p_{1} and p_{2}, and μ and σ the location and scale parameters as defined above within the context of Gaussian and beta regression. In “fixed scale models”, i.e. models for location only (as opposed to “variable scale models”), X_{2} reduces to an all-ones vector, and γ_{2} to a scalar intercept. For beta regression models, the link functions g_{1}(·) and g_{2}(·) were both chosen as the natural logit. For Gaussian regression models based on raw β-values, M-values and A-values, they were chosen as identity and log links, respectively.
Analysis of model fit and predictive performance
Competing models for methylation data analysis
Model (abbreviation) | Transformation | Distribution | Submodels |
---|---|---|---|
of response y | |||
Fixed scale models | |||
raw (ra) | y | Gaussian | μ only |
logit2 (lo) | $\underset{2}{log}\left(\frac{y}{1-y}\right)$ | Gaussian | μ only |
arcsine (ar) | $\stackrel{-1}{sin}\left(\sqrt{y}\right)$ | Gaussian | μ only |
beta (be) | y | beta | μ only |
Variable scale models | |||
raw+ (ra+) | y | Gaussian | μ and σ |
logit2+ (lo+) | $\underset{2}{log}\left(\frac{y}{1-y}\right)$ | Gaussian | μ and σ |
arcsine+ (ar+) | $\stackrel{-1}{sin}\left(\sqrt{y}\right)$ | Gaussian | μ and σ |
beta+ (be+) | y | beta | μ and σ |
In location and scale submodels, the respective parameter was specified as a linear function of covariates that are known to have an effect on methylation levels, including age [6], sex [38], smoking state (with the categories current, never and former smoker) [3], alcohol intake (in g/d) [8], physical activity (with the categories active and inactive) [7], body mass index (BMI) [5, 9], white blood cell count (WBC) and estimated proportions of six white blood cell types as derived using the method by Houseman et al.[39]. In the location submodel, the first 15 control probe principal components (PCs) were included to avoid technical confounding (see Additional file 1).
Model performance was investigated in a random set of 10,000 CpG sites. Repeating analyses in newly drawn sets of CpG sites showed that results were stable towards the CpG sites chosen. We fitted every model to a random subsample comprising 50% of the observations as a training set, drawn anew for each CpG site, with the remaining 50% left out as a test set. The pseudo R ^{2} criterion [40, 41] was then calculated from both training and test sets as a measure of model fit and predictive model performance, respectively. The pseudo R ^{2} criterion has been used to compare beta and Gaussian regression before [34, 35]. It measures the improvement of the likelihood of the fitted as compared to the intercept-only model, and is therefore more appropriate in the context of models for location and scale than measures based on correlation or deviance of fitted versus observed response values, since it better accounts for the fit of the scale submodel. The precise definition of the pseudo R ^{2} criterion for the different models is given in Additional file 1. The analysis was repeated in a smaller sample size of 250 observations, randomly drawn for each CpG site, to evaluate the impact of sample size on model performance.
Residual model fit was assessed using Shapiro-Wilk tests on normalized quantile residuals [28].
Simulation study
To assess type I error and power of downstream hypothesis tests of the competing models, and the sensibility towards violations of the distribution assumptions, a data-driven simulation study was conducted comprising beta distributed and real-data distributed methylation responses. All data were deliberately simulated to be close to the observed methylation data, in terms of average location and scale, as well as in terms of the size of observed covariate effects, aiming to approximate the real-data scenario as closely as possible.
Beta-distributed data setting
Simulation settings
Setting | Tested parameter | ${\gamma}_{\text{BMI}}^{\left(\mu \right)}$ | ${\gamma}_{\text{BMI}}^{\left(\sigma \right)}$ |
---|---|---|---|
Settings for the assessment of type I error | |||
μ|-σ | ${\gamma}_{\text{BMI}}^{\left(\mu \right)}$ | 0 | 0 |
μ|+σ moderate | 0 | 0.005 | |
μ|+σ strong | 0 | 0.05 | |
σ|-μ | ${\gamma}_{\text{BMI}}^{\left(\sigma \right)}$ | 0 | 0 |
σ|+μ moderate | 0.005 | 0 | |
σ|+μ strong | 0.05 | 0 | |
Settings for the assessment of power | |||
μ|-σ | ${\gamma}_{\text{BMI}}^{\left(\mu \right)}$ | 0.005 | 0 |
μ|+σ moderate | 0.005 | 0.005 | |
μ|+σ strong | 0.005 | 0.05 | |
σ|-μ | ${\gamma}_{\text{BMI}}^{\left(\sigma \right)}$ | 0 | 0.005 |
σ|+μ moderate | 0.005 | 0.005 | |
σ|+μ strong | 0.05 | 0.005 |
Finally, beta distributed data were simulated as a parametric bootstrap sample [42]${\mathit{y}}_{j}^{\ast}\sim \text{BE}(\mathit{\mu}={\widehat{\mu}}_{j},\mathit{\sigma}={\widehat{\sigma}}_{j})$ for each CpG site. The resulting data were beta distributed with known effects of the covariate BMI on location and scale, in absence or presence of nuisance effects on the respective other parameter.
The eight competing models (Table 1) were fitted to these data, and observed type I error rate and power of the test with the null hypothesis H_{0}:γ_{BMI}=0 were estimated as the proportion of p-values below 0.05, given γ_{BMI}=0 or γ_{BMI}=0.005, respectively. The analysis was conducted at n=1763 and n=250 as before.
Real data setting
To investigate the influence of the response distribution on test performance, data were generated having the distribution of the real β-values, while covariate effects were maintained. This was achieved by reassigning the originally observed methylation values to the subjects, according to the ranks of the subjects’ simulated response values. Thereby, the simulated covariate effects were largely maintained, while the original response distribution was reconstructed. The eight competing models were fitted to these data, and observed type I error rates and power were estimated.
Results and discussion
Comparison of competing models for methylation data
When the analyses were repeated with a reduced sample size (n=250), training set R ^{2} values indicated a substantial increase in model fit for variable scale as compared to fixed scale models (Figure S2 in Additional file 3). At the same time, test set R ^{2} values were strongly deflated for a part of the CpG sites, indicating diminished predictive performance in the presence of overparametrization.
The unexpectedly bad beta distribution fit might be attributed to an error component arising from the microarray experiment [43]. Of note, the two assay designs present on the Infinium HumanMethylation 450K BeadChip differed in the severity of the deviation from beta distribution fit.
When we compared the eight models in terms of severity of deviation from residual normality, model fit was improved when also the scale parameter was modeled as a function of covariates (Figure 2B). Overall, the variable scale Gaussian model on M-values (lo+) showed the best residual normality fit in the majority (34.12%) of CpG sites.
Performance of hypothesis tests
Next, we investigated how strongly the violation of the distribution assumptions of the competing models, specifically the beta regression model, affected the performance of downstream t-tests for covariate effects. Specifically, we explored the impact of known covariate effects on one distribution parameter on the performance of tests for the effect of this covariate on the other distribution parameter. For this purpose, in a simulation study beta distributed and real-data distributed methylation responses were generated with known covariate effects on location and/or scale (see Methods section and Table 2).
Observed type I error rates in beta distributed data
Next, we investigated tests for covariate effects on the scale parameter σ in beta distributed data (σ|μ setting). The models be+ and ar+ almost adhered to their nominal level irrespective of covariate effects on μ, whereas ra+ and lo+ showed a mild inflation of type I error rates (>10%) in the presence of weak effects on μ, which increased to above 88% when the covariate effect on μ was strong (Figure 3B; Figure S3B, D and F in Additional file 3).
Here, ${\stackrel{~}{\sigma}}_{i}$ is approximately independent of μ_{ i }, explaining the good performance of the ar+ model in the σ|μ setting.
Observed type I error rates in the real-data scenario
Next, analyses were repeated with real-data distributed methylation responses, which had exactly the same marginal distributions as the original data. In contrast to beta distributed data, none of the models adhered to the nominal level in the presence of strong covariate effects on the other distribution parameter (Figure 3C and D; Figure S4 in Additional file 3), with type I errors of above 40% observed for all models in both μ|σ and σ|μ settings.
Although methylation data are bounded in the unit interval with a dependency between mean and variance, they show deviation from the beta distribution (Figure 2) so that the relation between mean and variance might not be correctly reflected by V a r(y)=μ(1-μ)σ^{2}. Consequently, effects on either location or scale parameter might be falsely attributed to the other during model fitting.
In addition, in the σ|μ setting, type I error rates larger than 15% were observed for all models, even in absence of covariate effects on μ. They were largest (22.9%) for the ra+ model (Figure 3D). This suggests that tests on the scale parameter are more sensitive towards violations of the distribution assumption than tests on the location parameter.
When the analyses were repeated in a smaller sample (n=250), an additional trend emerged: observed type I error was systematically larger for variable scale than for fixed scale models even in absence of covariate effects on the other distribution parameter (Figure S5 in Additional file 3). This was observed for both beta distributed and real-data distributed methylation responses, and was in line with the reduced predictive performance of variable scale models in the smaller sample (Figure S2 in Additional file 3). Together these findings suggest overfitting.
Observed power
Finally, we compared the models in terms of their power to detect existing covariate effects of moderate strength. We observed that power of downstream t-tests was generally smaller to detect similarly sized covariate effects on σ than on μ (Figure S6 in Additional file 3). However, none of the models clearly outperformed the others in terms of power in absence of covariate effects on the other distribution parameter. The presence of weak effects on σ tended to reduce power to detect effects on μ for all models. When strong effects on σ were simulated, type I error superimposed power, so that results are difficult to interpret (Figure S6 in Additional file 3).
Resampling procedure for models for location and scale
Since asymptotic inference for variable scale models for methylation data is associated with inflated type I errors, we next considered resampling-based modes of inference. Standard parametric bootstrap is not an option for DNA methylation data, since we do not know the true distribution of the data. In non-parametric bootstrap, i.e. sampling of observations with replacement, the problem arises that the mean-variance structure within each observation is maintained, and false positives due to identifiability problems between covariate effects on the two distribution parameters would not be corrected for. This problem might occur in a reverse manner in permutation testing. Therefore, we developed a resampling procedure for models for location and scale (Algorithm 1), which provides a solution for both the unknown distribution of methylation data and false positives arising from covariate effects on the respective other distribution parameter.
Validation in a data set of acute lymphoblastic leukemia patients and healthy controls
An important question is whether our observations are valid also for methylation data from cancer patients and healthy controls, which might differ fundamentally from population-based data in terms of their distribution and outlier structure. These aspects might have an impact on the performance of the different models, as well as on the performance of the proposed resampling strategy. Therefore, we repeated all performance comparisons and evaluations on a publicly available data set comprising Infinium HumanMethylation 450K methylation data from bone marrow samples of 615 acute lymphoblastic leukemia (ALL) patients of two different types (B-cell precursor ALL and T-cell ALL), as well as 80 healthy controls (GEO accession number: GSE49031 [45]; see Additional file 1 for methodological details).
In these data, we were able to confirm our main findings: First, models of location and scale outperformed models of location only in terms of model fit and predictive performance (Figure S7 in Additional file 3). Second, the Gaussian variable scale model on methylation M-values most often showed the best residual model fit (Figure S8 in Additional file 3). Third, we again observed increased type I error rates for tests of covariate effects on the scale parameter, and on both distribution parameters in the presence of strong effects on the respective other parameter (Figure S9 in Additional file 3). Finally, although genetic confounding could not be accounted for due to lack of SNP data, the proposed resampling precedure performed considerably well in the moderate effect size scenario (Figure S10 in Additional file 3).
It remains a challenge to improve the method to control type I error in the presence of strong covariate effects on the other distribution parameter. In the population-based KORA data, effects of lifestyle and phenotypic factors were rarely stronger than the simulated moderate effect size. In contrast, in the investigated cancer data set, case-control effects were sometimes larger than the simulated moderate effect size, so that the proposed resampling procedure will not be capable of clearly separating effects on location and scale. A similar issue might arise when investigating the effect of proportions of white blood cell types, which tend to have strong effects on specific CpG sites, affecting both mean [39] and variability [27]. However, we would like to emphasize that a clear separation of covariate effects on location and scale in the presence of a strong (but unknown) relation between the two is a major challenge to any parametric or nonparametric approaches.
Application to an EWAS of BMI and age in the KORA study
Our results have shown that it is worth to examine covariate effects on DNA methylation variability, since model performance was improved by modeling the scale parameter. Therefore, we undertook an EWAS of BMI and age in the large population-based KORA F4 sample. We used the Gaussian model on M-values (lo+) which achieved the best model fit and predictive performance in our model comparisons. BMI was chosen since obesity has been reported before to associate with DNA methylation variability at specific CpG sites [9]. In addition, we were interested in age since a recent investigation has revealed the presence of regions in the genome that are characterized by an increased methylation variability in an age group specific manner, where significantly more of these regions were observed in older age groups [26].
To avoid inflated type I errors of tests for covariate effects on location and scale, we included SNPs as additional covariates as described above. Otherwise, all covariates were included that were also included in the model comparisons. In addition, we used the resampling procedure described above for the significant associations.
For BMI DVCs, age DMCs and age DVCs, we observed that associations with significant resampling-based p-values could be validated significantly more often in the independent KORA F3 study (Figure 6B-D). This observation was stable to the threshold of significance in the replication study. For instance, when the threshold was changed from nominal significance (p≤0.05) to a Bonferroni-corrected threshold (p≤0.05/{number of tests}) which is often used, enrichment results essentially did not change. Assuming that validation in an independent study is an indicator of trueness of an effect (ignoring the presence of unaccounted confounders), these results suggest that the resampling procedure filtered out false positives. Note that the replication study was smaller (n=486) than the discovery study (n=1763), so that not all true effects can be expected to validate.
Investigating the observed effect directions, we noted that for the majority (79.9%) of BMI DVCs that were not confirmed by resampling, associations were positive (p=4.9·10^{-16}). This was not observed for the DVCs significant after resampling. A plausible explanation might be that the variable scale model is more susceptible towards methylation outliers at larger BMI values where BMI density is sparser. Interestingly, Xu et al.[9] reported an enrichment of positive associations of obesity and methylation variability in a study population of 48 obese and 48 lean subjects, where the standard deviation of BMI in the obese group was fivefold that of the lean group. The identified DVCs were often characterized by one or several outliers occuring in the obese group. The authors used the parametric Bartlett’s test to identify DVCs. To evaluate this test procedure using our own data, we subjected the 3481 initially discovered BMI DVCs to Bartlett’s test after dividing the study population into two groups by the median BMI. Although this approach differs considerably from ours, for instance since Bartlett’s test does not include covariate information, and the variability is modeled separately from the mean, Bartlett’s test indicated genome-wide significance for 1893 (54.4%) of the DVC initially identified by GAMLSS, 80.3% of which were positive associations. This suggests that Bartlett’s test might share the susceptibility to outlier values and likely also to genetic confounding in the methylation data, and might have increased type I error rates in that case.
In the EWAS of age, a substantial number of DMCs (22774, 87.2%) and DVCs (8279, 38%) passed both resampling-based correction and validation step (Figure 6C and D). The vast majority of validated DVCs (98.3%) showed positive associations (p-values <2.2·10^{-16} as compared to unvalidated CpG sites), indicating increased methylation variability with increasing age. This observation fits well with the previous finding of larger DNA methylation variability in adult as compared to newborn blood [27]. It is also in agreement with an increased number of age group specific highly variable CpG sites in older as compared to younger age groups [26]. Specifically, we also observed DVCs at neurotransmission-related genes. In addition, highly variable CpG sites were enriched in the vicinity of genes involved in developmental and morphogenetic processes [46, 47], suggesting a role of methylation variability during development. Age-related increases in methylation variability at specific CpG sites might be attributed to both stochastic events [46] as well as accumulating environmental and lifestyle influences. We also have to acknowledge the possibility that the observed age effects on methylation variability are partially attributed to changes in white blood cell proportions with age [48]. White blood cell types differ strongly in methylation variability [27]. Although we included estimated white blood cell proportions as covariates, residual confounding might occur [48].
Conclusions
We have addressed two challenges arising from the characteristics of methylation data: First, the appropriate treatment of methylation β-values as a proportion response, and second, the difficulty of assessing covariate effects on both location (mean) and scale (variability) of these data. The latter issue has become important because recent findings suggest a role of methylation variability in addition to methylation level in disease processes, including cancer [21–25]. In contrast to recent strategies to assess associations between methylation variability and disease-related traits [9, 25–27], we propose simultaneous modeling of location and scale using the GAMLSS framework [28].
A key result of our study is that simultaneous modeling of mean and variability of DNA methylation data improved the predictive performance as compared to modeling the mean only. A particularly good performance was observed for the Gaussian model on methylation M-values. To avoid false positives arising from violations of the distribution assumption, we proposed and applied a resampling procedure as a mode of inference for models for location and scale. In our experiments, this procedure substantially reduced type I error rates so that they became close to the nominal level in practically relevant settings. The validity of this approach could be confirmed both in population-based data and in a data set of cancer patients and healthy controls. Moreover, the application of our methodology to an EWAS of BMI and age in the large population-based KORA F4 study revealed biologically plausible positive effects of age on methylation variability. These effects were stable towards validation in an independent study.
Our findings suggest that GAMLSS is a useful tool to explore environmental and lifestyle effects on methylation variability, simultaneously to effects on the mean. Since the investigated models for location and scale were susceptible towards overfitting when sample size was moderate, it could be promising to investigate extensions based on regularization techniques such as boosting (which has been implemented for GAMLSS [49]). In addition, in our data, methylation at the majority of CpG sites follows none of the compared distributions. Thus, it might be relevant to explore robust methods such as quantile regression [50].
Declarations
Acknowledgements
The KORA research platform and the KORA Augsburg studies are financed by the Helmholtz Zentrum München – German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Research (BMBF) and by the State of Bavaria. We thank all KORA study participants and all members of the field staff in Augsburg who planned and conducted the study. The German Diabetes Center is funded by the German Federal Ministry of Health and the Ministry of School, Science and Research of the State of North-Rhine-Westphalia. Additional support was given by the BMBF (National Genome Research Network NGFNplus 01GS0823). The work of MS was supported by Deutsche Forschungsgemeinschaft (DFG), grant SCHM 2966/1-1. KS is supported by ‘Biomedical Research Program’ funds at Weill Cornell Medical College in Qatar, a program funded by the Qatar Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Nadine Lindemann, Franziska Scharl and Nicole Spada for excellent technical support.
Authors’ Affiliations
References
- Portela A, Esteller M: Epigenetic modifications and human disease. Nat Biotechnol. 2010, 28: 1057-1068.View ArticlePubMedGoogle Scholar
- Rakyan VK, Down TA, Balding DJ, Beck S: Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011, 12: 529-541.View ArticlePubMed CentralPubMedGoogle Scholar
- Zeilinger S, Kühnel B, Klopp N, Baurecht H, Kleinschmidt A, Gieger C, Weidinger S, Lattka E, Adamski J, Peters A, Strauch K, Waldenberger M, Illig T: Tobacco smoking leads to extensive genome-wide changes in DNA methylation. PLoS ONE. 2013, 8: e63812-View ArticlePubMed CentralPubMedGoogle Scholar
- Petersen AK, Zeilinger S, Kastenmüller G, Römisch-Margl W, Brugger M, Peters A, Meisinger C, Strauch K, Hengstenberg C, Pagel P, Huber F, Mohney RP, Grallert H, Illig T, Adamski J, Waldenberger M, Gieger C, Suhre K: Epigenetics meets metabolomics: an epigenome-wide association study with blood serum metabolic traits. Hum Mol Genet. 2014, 23 (2): 534-545.View ArticlePubMed CentralPubMedGoogle Scholar
- Dick KJ, Nelson CP, Tsaprouni L, Sandling JK, Aïssi D, Wahl S, Meduri E, Morange PE, Gagnon F, Grallert H, Waldenberger M, Peters A, Erdmann J, Hengstenberg C, Cambien F, Goodall AH, Ouwehand WH, Schunkert H, Thompson JR, Spector TD, Gieger C, Trégouët DA, Deloukas P, Samani NJ: DNA methylation and body-mass index: a genome-wide analysis. The Lancet. 2014, 383 (9933): 1990-1998.View ArticleGoogle Scholar
- Bell JT, Tsai PC, Yang TP, Nisbet J, Glass D, Mangino M, Zhai G, Zhang F, Valdes A, Shin SY, Dempster EL, Murray RM, Grundberg E, Hedman AK, Nica A, Small KS, Consortium M, Dermitzakis ET, McCarthy MI, Mill J, Spector TD, Deloukas P, R RP: Epigenome-wide scans identify differentially methylated regions for age and age-related phenotypes in a healthy ageing population. PLoS Genetics. 2012, 8: e1002629-View ArticlePubMed CentralPubMedGoogle Scholar
- Rönn T, Volkov P, Davegårdh C, Dayeh T, Hall E, Olsson AH, Nilsson E, Tornberg A, Dekker Nitert M, Eriksson KF, Jones HA, Groop L, Ling C: A six months exercise intervention influences the genome-wide DNA methylation pattern in human adipose tissue. PLoS Genetics. 2013, 9: e1003572-View ArticlePubMed CentralPubMedGoogle Scholar
- Philibert RA, Plume JM, Gibbons FX, Brody GH, Beach SR: The impact of recent alcohol use on genome wide DNA methylation signatures. Front Genet. 2012, 3: 54-View ArticlePubMed CentralPubMedGoogle Scholar
- Xu X, Su S, Barnes VA, Miguel CD, Pollock J, Ownby D, Shi H, Zhu H, Snieder H, Wang X: A genome-wide methylation study on obesity: differential variability and differential methylation. Epigenetics. 2013, 8: 522-533.View ArticlePubMed CentralPubMedGoogle Scholar
- Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R: High density DNA methylation array with single CpG site resolution. Genomics. 2011, 98: 288-295.View ArticlePubMedGoogle Scholar
- Smithson M, Verkuilen J: A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychol Methods. 2006, 11: 54-71.View ArticlePubMedGoogle Scholar
- Kieschnick R, McCullough B: Regression analysis of variates observed on (0, 1): percentages, proportions and fractions. Statistical Modelling. 2003, 3: 193-213.View ArticleGoogle Scholar
- Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM: Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010, 11: 587-View ArticlePubMed CentralPubMedGoogle Scholar
- Warton DI, Hui FKC: The arcsine is asinine: the analysis of proportions in ecology. Ecology. 2011, 92: 3-10.View ArticlePubMedGoogle Scholar
- Binder AM, Michels KB: The causal effect of red blood cell folate on genome-wide methylation in cord blood: a Mendelian randomization approach. BMC Bioinformatics. 2013, 14: 353-View ArticlePubMed CentralPubMedGoogle Scholar
- Ferrari SL, Cribari-Neto F: Beta regression for modelling rates and proportions. J Appl Stat. 2004, 31: 799-815.View ArticleGoogle Scholar
- Breton CV, Salam MT, Wang X, Byun HM, Siegmund KD, Gilliland FD: Particulate matter, DNA methylation in nitric oxide synthase, and childhood respiratory disease. Environ Health Perspect. 2012, 120: 1320-1326.View ArticlePubMed CentralPubMedGoogle Scholar
- Seow WJ, Pesatori AC, Dimont E, Farmer PB, Albetti B, Ettinger AS, Bollati V, Bolognesi C, Roggieri P, Panev TI, Georgieva T, Merlo DF, Bertazzi PA, Baccarelli AA: Urinary benzene biomarkers and DNA methylation in Bulgarian petrochemical workers: study findings and comparison of linear and beta regression models. PLoS ONE. 2012, 7: e50471-View ArticlePubMed CentralPubMedGoogle Scholar
- Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT: Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics. 2008, 9: 365-View ArticlePubMed CentralPubMedGoogle Scholar
- Hebestreit K, Dugas M, Klein HU: Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics. 2013, 29 (13): 1647-1653.View ArticlePubMedGoogle Scholar
- Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry RA, Feinberg AP: Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011, 43 (8): 768-775.View ArticlePubMed CentralPubMedGoogle Scholar
- Teschendorff AE, Jones A, Fiegl H, Sargent A, Zhuang JJ, Kitchener HC, Widschwendter M: Epigenetic variability in cells of normal cytology is associated with the risk of future morphological transformation. Genome Med. 2012, 4 (3): 24-View ArticlePubMed CentralPubMedGoogle Scholar
- Jaffe AE, Feinberg AP, Irizarry RA, Leek JT: Significance analysis and statistical dissection of variably methylated regions. Biostatistics. 2012, 13: 166-178.View ArticlePubMed CentralPubMedGoogle Scholar
- Schoofs T, Rohde C, Hebestreit K, Klein HU, Göllner S, Schulze I, Lerdrup M, Dietrich N, Agrawal-Singh S, Witten A, Stoll M, Lengfelder E, Hofmann WK, Schlenke P, Büchner T, Hansen K, Berdel WE, Rosenbauer F, Dugas M, Müller-Tidow C: DNA methylation changes are a late event in acute promyelocytic leukemia and coincide with loss of transcription factor binding. Blood. 2013, 121: 178-187.View ArticlePubMedGoogle Scholar
- Ahn S, Wang T: A powerful statistical method for identifying differentially methylated markers in complex diseases. Pac Symp Biocomput. 2013, 69-79.Google Scholar
- Ong ML, Holbrook JD: Novel region discovery method for Infinium 450K DNA methylation data reveals changes associated with aging in muscle and neuronal pathways. Aging Cell. 2014, 13: 142-155.View ArticlePubMed CentralPubMedGoogle Scholar
- Jacoby M, Gohrbandt S, Clausse V, Brons NH, Muller CP: Interindividual variability and co-regulation of DNA methylation differ among blood cell populations. Epigenetics. 2012, 7 (12): 1421-1434.View ArticlePubMed CentralPubMedGoogle Scholar
- Rigby RA, Stasinopoulos DM: Generalized additive models for location, scale and shape (with discussion). Appl Stat. 2005, 54: 507-554.Google Scholar
- Holle R, Happich M, Löwel H, Wichmann HE: KORA - a research platform for population based health research. Gesundheitswesen. 2005, 67: S19—S25-View ArticlePubMedGoogle Scholar
- Rathmann W, Strassburger K, Heier M, Holle R, Thorand B, Giani G, Meisinger C: Incidence of Type 2 diabetes in the elderly German population and the effect of clinical and lifestyle risk factors: KORA S4/F4 cohort study. Diabet Med. 2009, 26 (12): 1212-1219.View ArticlePubMedGoogle Scholar
- Suhre K, Shin SY, Petersen AK, Mohney RP, Meredith D, Wägele B, Altmaier E, Deloukas P, Erdmann J, Grundberg E, Hammond CJ, de Angelis MH, Kastenmüller G, Köttgen A, Kronenberg F, Mangino M, Meisinger C, Meitinger T, Mewes HW, Milburn MV, Prehn C, Raffler J, Ried JS, Römisch-Margl W, Samani NJ, Small KS, Wichmann HE, Zhai G, CARDIoGRAM, et al: Human metabolic individuality in biomedical and pharmaceutical research. Nature. 2011, 477 (7362): 54-60.View ArticlePubMedGoogle Scholar
- Cook DO, Kieschnick R, McCullough BD: Regression analysis of proportions in insurance with self selection. J Empirical Finance. 2008, 15: 860-867.View ArticleGoogle Scholar
- Zhuang J, Widschwendter M, Teschendorff AE: A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform. BMC Bioinformatics. 2012, 13: 59-View ArticlePubMed CentralPubMedGoogle Scholar
- Hunger M, Döring A, Holle R: Longitudinal beta regression models for analyzing health-related quality of life scores over time. BMC Med Res Methodol. 2012, 12: 144-View ArticlePubMed CentralPubMedGoogle Scholar
- Schmid M, Wickler F, Maloney KO, Mitchell R, Fenske N, Mayr A: Boosted beta regression. PLoS ONE. 2013, 8: e61623-View ArticlePubMed CentralPubMedGoogle Scholar
- Rigby RA, Stasinopoulos DM: A flexible regression approach using GAMLSS in R. Technical Report. 2010, http://www.gamlss.org/wp-content/uploads/2013/01/Lancaster-booklet.pdf.,Google Scholar
- Wood SN: Generalized additive models: an introduction with R. 2006, Boca Raton: Chapman & Hall/CRCGoogle Scholar
- Liu J, Morgan M, Hutchison K, Calhoun VD: A study of the influence of sex on genome wide methylation. PLoS ONE. 2010, 5: e10028-View ArticlePubMed CentralPubMedGoogle Scholar
- Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT: DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012, 13: 86-View ArticlePubMed CentralPubMedGoogle Scholar
- Maddala GS: Limited-Dependent and Qualitative Variables in Econometrics. 1983, Cambridge: Cambridge University PressView ArticleGoogle Scholar
- Cox DR, Snell JE: Analysis of binary data. 1989, London: Chapman & HallGoogle Scholar
- Davison A, Hinkley D: Bootstrap methods and their application. 1997, Cambridge: Cambridge University PressView ArticleGoogle Scholar
- Houseman EA, Molitor J, Marsit CJ: Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014, [http://dx.doi.org/10.1093/bioinformatics/btu029],Google Scholar
- Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK: DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 2011, 12: R10-View ArticlePubMed CentralPubMedGoogle Scholar
- Nordlund J, Bäcklin CL, Wahlberg P, Busche S, Berglund EC, Eloranta ML, Flaegstad T, Forestier E, Frost BM, Harila-Saari A, Heyman M, Jónsson OG, Larsson R, Palle J, Rnnblom L, Schmiegelow K, Sinnett D, Söderhäll S, Pastinen T, Gustafsson MG, Lönnerholm G, Syvänen AC: Genome-wide signatures of differential DNA methylation in pediatric acute lymphoblastic leukemia. Genome Biol. 2013, 14 (9): r105-View ArticlePubMed CentralPubMedGoogle Scholar
- Feinberg AP, Irizarry RA, Fradin D, Aryee MJ, Murakami P, Aspelund T, Eiriksdottir G, Harris TB, Launer L, Gudnason V, Fallin MD: Personalized epigenomic signatures that are stable over time and covary with body mass index. Sci Transl Med. 2010, 2 (49): 49ra67-PubMed CentralPubMedGoogle Scholar
- Feinberg AP, Irizarry RA: Evolution in health and medicine Sackler colloquium: Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci U S A. 2010, 107 Suppl 1: 1757-1764.View ArticlePubMedGoogle Scholar
- Jaffe AE, Irizarry RA: Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 2014, 15 (2): R31-View ArticlePubMed CentralPubMedGoogle Scholar
- Mayr A, Fenske N, Hofner B, Kneib T, Schmid M: Generalized additive models for location, scale and shape for high dimensional data - A flexible approach based on boosting. J R Stat Soc Series C. 2012, 61: 403-427.View ArticleGoogle Scholar
- Koenker R, Hallock K: Quantile regression. J Econom Perspect. 2001, 15: 143-156.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.