Age-adjusted nonparametric detection of differential DNA methylation with case-control designs
- Hanwen Huang^{1},
- Zhongxue Chen^{2}Email author and
- Xudong Huang^{3}
DOI: 10.1186/1471-2105-14-86
© Huang et al.; licensee BioMed Central Ltd. 2013
Received: 13 August 2012
Accepted: 20 February 2013
Published: 6 March 2013
Abstract
Background
DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, large-scale whole genome DNA methylation data offer tremendous potential in understanding the role of DNA methylation in normal development and function. However, due to the unique feature of the methylation data, powerful and robust statistical methods are very limited in this area.
Results
In this paper, we proposed and examined a new statistical method to detect differentially methylated loci for case control designs that is fully nonparametric and does not depend on any assumption for the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect that has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we have demonstrated the advantages of our method over existing commonly used methods.
Conclusions
Compared to existing methods, our method improved the detection power for differentially methylated loci for case control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case-control studies.
Keywords
Nonparametric method One-sided test Combining p-valueBackground
Essential for normal development and an influence on a variety of processes related to DNA integrity and function, DNA methylation plays an important role for epigenetic gene regulation in both development and disease [1]. It is associated with processes including genomic imprinting, X-chromosome inactivation, suppression of repetitive elements, and carcinogenesis [2-4]. Aberrant DNA methylation, such as hypomethylation of oncogenes and hypermethylation of tumor suppressor genes, leads to genomic instability and tumorigenesis [5-10].
With the development of high-throughput platforms, genome-wide analysis of large-scale DNA methylation patterns and their impacts on gene regulation have received a significant amount of attention. We are proposing an age-adjusted nonparametric method to detect differentially methylated loci that can account for age effects that has advantages over existing methods the limitations of which we explain next. Typically, methylation levels in Illumina methylation assays are quantified in terms of the β-value calculated from the intensity ratio of methylated (M) to unmethylated (U) alleles. Specifically, $\beta =\frac{\mathit{\text{max}}\left(M,0\right)}{\mathit{\text{max}}\left(M,0\right)+\mathit{\text{max}}\left(U,0\right)+100},$ where M and U are the intensities of red and green dyes, respectively, for the GoldenGate and VeraCode Methylation assays, or the signals A and B, respectively, for the Illumina assay. The striking feature of the β-values is that they are continuous and range from 0 (totally unmethylated) to 1 (fully methylated).
With more and more DNA methylation data generated from the high-throughput DNA methylation platforms, powerful and efficient statistical methods to handle these complex data are becoming highly demanded. One of the important research topics in this area is to detect differentially methylated loci in case and control studies. The commonly used methods for this purpose are the Student’s t-test and linear regression. Recently, a number of model-based approaches have been proposed in the literature. Siegmund [11] introduced a Bernoulli-lognormal mixture model to perform clustering analysis on methylation data generated using MethyLight. Houseman [12] proposed a β-mixture model to classify different tissue types using a recursive-partitioning algorithm for high-dimensional data from Illumina arrays. Wang [13] developed a likelihood based uniform-normal-mixture model to select differentially methylated loci between case and control groups using the Illumina array. The basic idea of Wang [13] is to describe the data using a three-component mixture model and treat completed methylated, unmethylated, and partially methylated loci differently. Wang [13] showed that, compared to the Student’s t-test under some situations, their method increases detection power [13]. However, the aforementioned methods assume that the methylation profiles follow some special distributions that are known in terms of a finite number of parameters. Obviously if the underlying true distribution is far from the proposed ones, such assumptions will lead to biased results in practice.
Another complexity of the methylation study comes from the presence of other potential confounders such as age effects. As shown in [14-17], age is by far the strongest demographic risk factor for cancer, and there is substantial evidence that aging affects DNA methylation of specific loci, including cancer-related genes. Based on these observations, it is necessary to adjust age effects in the analysis of detecting differentially methylated loci. If the relationship between the methylation and age is more complex than a linear one, a traditional linear regression leads to inaccurate results. To solve this problem, Chen [16] proposed a method by first dividing the samples into several age groups and then combined the p-values obtained from each individual group together to form a new test. This method has been shown to increase the power in contrast to the traditional t-test without age adjustment or regression model with age adjusted linearly. However, within each group, a p-value is obtained from a simple t-test that is less robust and thus leaves room for improvement.
In this paper, we propose and examine a novel means to detecting differentially methylated loci and, that is, an age-adjusted nonparametric method that can account for age effects, given that substantial evidence exists that aging affects DNA methylation of specific loci, including cancer-related genes. Our method stems from the rank-based nonparametric methods that focus on a comparison of two entire empirical distribution functions rather than only two means. More specifically, we first divide the subjects into several age groups; then for each group, a nonparametric test is performed on each locus, and an individual p-value is reported. An overall p-value for each locus is estimated through combining all the individual p-values computed previously for that locus in different age groups. Our method does not depend on any distribution assumption but rather adjusts for age effects in an efficient way. We demonstrate the powerfulness of our method using both simulated and real datasets.
Methods
Where a_{ j } and b_{ k } are regression coefficients and ε_{ijk} follows a i.i.d normal distribution. The normality assumption is clearly invalid for the β-values which have limited range between 0 and 1. Moreover, the relationship between the β-value and age is likely to be more complicated than a linear one. Therefore more powerful and robust nonparametric methods are desirable.
Here G_{ i }, i = 1,2,…, n_{ 1k } and H_{ j },j = 1, 2,…,n_{ 2k } are the ranks of the samples from the k^{th} case and control groups, respectively. Due to the intractable asymptotic distribution for the test statistics B, it is hard to find a close form for the relationship between p-value and B. We will use numerical fit to approximate this relationship. We first obtain the empirical distribution of B based on 10^{7} permutations and then fit the distribution function piecewise exponentially to obtain the empirical relationship. The p-value for the k^{th} age group can be calculated according to this empirical formula.
More details can be found in [21-23]. For large α, we will fit a formula empirically through permutation. We call the above proposed method “combined Baumgartner-Weiß-Schindler (BWS) test”.
Results and discussion
Empirical fit of the p-value formula for one-sided BWS test
The node points we used here are 1.5 and 9. We find that the choice of node points has very little influence on the final analysis results for both simulated and real data. Note that the sample size we used for deriving this formula is 30. We have also tried some other choices and found that the results are quite similar. Thus, we recommend the above formula to be used in practice for problems with a sample size larger than 10.
Simulation results
Estimated Type I error rates at significant level 0.05 based on the four methods under different parameter settings of the three-component mixture distributions with τ_{ 1 } = 0.1, τ_{ 2 } = 0.9 and δ_{ μ } = 0.05
Parameter Setting | π _{ 1 } | π _{ 2 } | π _{ 3 } | μ | σ | t-test | Linear regression | Wilcox | BWS |
---|---|---|---|---|---|---|---|---|---|
1 | 0.3 | 0.5 | 0.2 | 0.3 | 0.1 | 0.0513 | 0.0521 | 0.0514 | 0.0458 |
2 | 0.4 | 0.5 | 0.1 | 0.2 | 0.1 | 0.0495 | 0.0494 | 0.0519 | 0.0492 |
3 | 0.4 | 0.5 | 0.1 | 0.3 | 0.2 | 0.0511 | 0.0519 | 0.0503 | 0.0495 |
4 | 0.5 | 0.1 | 0.4 | 0.3 | 0.2 | 0.0528 | 0.0521 | 0.0544 | 0.0511 |
5 | 0.4 | 0.2 | 0.4 | 0.2 | 0.1 | 0.0509 | 0.0510 | 0.0472 | 0.0464 |
Estimated powers of the four methods at significant level 0.05 under different parameter settings for the three-component mixture distributions with τ_{ 1 } = 0.1, τ_{ 2 } = 0.9
Parameter Setting | π _{ 1 } | π _{ 2 } | π _{ 3 } | μ | σ | t-test | Linear regression | Wilcox | BWS |
---|---|---|---|---|---|---|---|---|---|
δ_{ μ } = 0.03 | 0.3 | 0.5 | 0.2 | 0.3 | 0.1 | 0.475 | 0.479 | 0.749 | 0.836 |
δ_{ μ } = 0.05 | 0.3 | 0.5 | 0.2 | 0.3 | 0.1 | 0.889 | 0.892 | 0.951 | 0.988 |
δ_{ τ } = 0.03 | 0.45 | 0.1 | 0.45 | 0.5 | 0.05 | 0.048 | 0.047 | 0.078 | 0.727 |
δ_{ τ } = 0.06 | 0.45 | 0.1 | 0.45 | 0.5 | 0.05 | 0.048 | 0.047 | 0.092 | 0.877 |
The third settings assume that the β-values for both the case and control subjects follow the beta-distributions. Let s_{1} = s_{2} = 4. For the case group, the β-values are sampled from a beta-distribution Beta(s_{1} + δ, s_{2} − δ). For the control group, the β-values are sampled from a beta-distribution Beta(5s_{1} + δ, 5s_{2} − δ). Here δ takes six different values, -3λ, -2λ, -λ, 0.5λ, λ, and 1.5λ, one for each age group.
Change of the power with λ for four different methods when the distributions are Beta ( s_{ 1 } + δ , s_{ 2 } − δ ) and Beta (5 s_{ 1 } − δ , 5 s_{ 2 } + δ ) for the case and control groups; δ takes the values of -3λ, -2λ, -λ, 0.5λ, λ , and 1.5λ for the six age groups and s_{ 1 } = s_{ 2 } = 4
λ | Combined t-test | Linear regression | Combined Wilcoxon | Combined BWS |
---|---|---|---|---|
0 | 0.056 | 0.053 | 0.073 | 0.669 |
0.05 | 0.075 | 0.057 | 0.094 | 0.703 |
0.1 | 0.182 | 0.087 | 0.200 | 0.832 |
0.15 | 0.402 | 0.141 | 0.412 | 0.939 |
0.2 | 0.701 | 0.212 | 0.687 | 0.988 |
0.25 | 0.911 | 0.298 | 0.893 | 0.999 |
Real data results
We also applied our proposed method to the United Kingdom Ovarian Cancer Population Study (UKOPS) [15] to select differentially methylated loci between ovarian cancer cases and healthy controls. The data were created using Illumina Infinium Human Methylation27 Beadchip and downloaded from the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under the accession number GSE19711. There were 274 control samples and 131 pre-treatment case samples, and the number of loci was 27578. For the data quality control, we removed 29 patients (15 controls and 14 treatments) with BS conversion efficiency value < 4000 or coverage rate < 95% [15]. For each treatment group, the samples were divided into 6 age groups (50-55, 55-60, 60-65, 65-70, 70-75, and 75 and over). We further removed 12 patients in the pre-treatment group whose ages exceeded 78 since there were no such patients in the control group. The final sample size for each individual group is the same as the one listed in Table 2 of Chen [16] except that the “75 and over” group has 13 pre-treatment samples instead of 25. This dataset was analyzed by both Wang [13] and Chen [16] in their papers. Wang [13] did not consider the age effect, and only 96 cases and 136 controls were included in their analysis after further screening; while Chen [16] included the 12 patients with ages exceeding 78; thus their results are different from ours even though the same method was used.
Number of loci with p-values less than the given cutoff significance levels from different methods
Cutoff p-value | Linear regression (I) | Combined t-test (II) | Combined Wilcoxon test (III) | Combined BWS test (IV) | From both I and IV | From both II and IV | From both III and IV |
---|---|---|---|---|---|---|---|
10^{-3} | 2038 | 2754 | 3143 | 3387 | 1884 | 2659 | 3081 |
10^{-4} | 1438 | 1879 | 2152 | 2321 | 1352 | 1795 | 2117 |
10^{-5} | 1120 | 1343 | 1495 | 1653 | 1059 | 1286 | 1479 |
10^{-6} | 894 | 982 | 1109 | 1222 | 844 | 931 | 1099 |
Discussion
In this paper we chose different cutoff p-values to compare the performance of the proposed test with others. We did not consider the multiple testing issue, which is an important but difficult task for this area [4] and other areas where a large number of correlated variables are tested simultaneously [24-28]. The traditional correction methods for multiple comparisons, such as Bonferroni correction, are inappropriate here as they are too conservative due to the fact that many loci from the same gene are highly positively correlated. To account for the correlations among loci, methods based on the concept of “effective number” may be adopted [29].
There are many ways to combine p-values from independent tests [20, 30-32]. In this paper, we chose to use Fisher test due to its robustness. It is possible, however, that other methods may be more powerful under some certain conditions. The combined p-value method used in this paper is based on the assumption that the test for every individual age group is independent from each other. However, if this assumption is questionable, the current combined p-value method needs to be modified such that it can handle the correlations among the individual tests as well. This is another research topic we will pursue in a future paper.
Conclusions
In case-control methylation studies, the underlying distribution of the β-values is rarely known in advance, and clearly the normality assumption is not valid. Parametric models can be powerful if the assumptions for the models are valid, but they can also lead to biased results if the underlying true distribution is far deviated from the imposed ones. Thus, it is desirable to choose a powerful yet robust statistical test that does not depend on any underlying distribution assumption. In this article we proposed and examined a rank-based nonparametric method to detect differentially methylated loci. Through simulation, we showed that our proposed method is as powerful as the linear regression, t-test and Wilcoxon rank sum test methods if the mean differences between the two treatment groups are large. However, our method substantially outperformed the others in situations where the mean differences between the two groups were small while the variance differences were large.
Note that the age effects are strongly associated with methylation, and the ignoring age effect will cause a loss of power or a large number of false positives. Another advantage of the proposed method over many existed ones is that it combined the nonparametric test together with age adjustment. Our next goal was to generalize our method to adjust for more confounders other than the age such that it can have a much wider application in methylation research.
Declarations
Acknowledgements
The authors would like to thank Ms. Kim Lawson for her professional editorial assistance, which highly improved the presentation of this paper. ZC also thanks the support from the faculty research funds awarded by the School of Public Health at the Indiana University Bloomington.
Authors’ Affiliations
References
- Baylin SB, Ohm JE: Epigenetic gene silencing in cancer-a mechanism for early oncogenic pathway addiction? Nat Rev Cancer 2006,6(2):107-116.View ArticlePubMedGoogle Scholar
- Kuan PF, Wang S, Zhou X, Chu H: A statistical framework for Illumina DNA methylation arrays. Bioinformatics 2010,26(22):2849. 10.1093/bioinformatics/btq553PubMed CentralView ArticlePubMedGoogle Scholar
- Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Gräf S, Tomazou EM, Bäckdahl L, Johnson N, Herberth M: An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Res 2008,18(9):1518-1529. 10.1101/gr.077479.108PubMed CentralView ArticlePubMedGoogle Scholar
- Kuan PF, Chiang DY: Integrating prior knowledge in multiple testing under dependence with applications to detecting differential DNA methylation. Biometrics 2012. 10.1111/j.1541-0420.2011.01730.xGoogle Scholar
- Feinberg AP, Tycko B: The history of cancer epigenetics. Nat Rev Cancer 2004,4(2):143-153.View ArticlePubMedGoogle Scholar
- Jabbari K, Bernardi G: Cytosine methylation and CpG, TpG (CpA) and TpA frequencies. Gene 2004, 333: 143-149.View ArticlePubMedGoogle Scholar
- Jones PA, Baylin SB: The fundamental role of epigenetic events in cancer. Nat Rev Genet 2002,3(6):415-428.PubMedGoogle Scholar
- Kulis M, Esteller M: DNA methylation and cancer. Adv Genet 2010, 70: 27-56.View ArticlePubMedGoogle Scholar
- Laird PW: Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet 2010,11(3):191-203.View ArticlePubMedGoogle Scholar
- Xu GL, Bestor TH, Bourc'his D, Hsieh CL, Tommerup N, Bugge M, Hulten M, Qu X, Russo JJ, Viegas-Péquignot E: Chromosome instability and immunodeficiency syndrome caused by mutations in a DNA methyltransferase gene. Nature 1999,402(6758):187-191. 10.1038/46052View ArticlePubMedGoogle Scholar
- Siegmund KD, Laird PW, Laird-Offringa IA: A comparison of cluster analysis methods using DNA methylation data. Bioinformatics 2004,20(12):1896-1904. 10.1093/bioinformatics/bth176View ArticlePubMedGoogle Scholar
- Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK: Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics 2008,9(1):365. 10.1186/1471-2105-9-365PubMed CentralView ArticlePubMedGoogle Scholar
- Wang S: Method to detect differentially methylated loci with case-control designs using Illumina arrays. Genet Epidemiol 2011,35(7):686-694. 10.1002/gepi.20619PubMed CentralView ArticlePubMedGoogle Scholar
- Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, Nelson HH, Karagas MR, Padbury JF, Bueno R: Aging and environmental exposures alter tissue-specific DNA methylation dependent upon CpG island context. PLoS Genet 2009,5(8):e1000602. 10.1371/journal.pgen.1000602PubMed CentralView ArticlePubMedGoogle Scholar
- Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Weisenberger DJ, Shen H, Campan M, Noushmehr H, Bell CG, Maxwell AP: Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res 2010,20(4):440-446. 10.1101/gr.103606.109PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z, Liu Q, Nadarajah S: A new statistical approach to detecting differentially methylated loci for case control Illumina array methylation data. Bioinformatics 2012,28(8):1109-1113. 10.1093/bioinformatics/bts093PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z, Huang H, Liu J, Ng HKT, Nadarajah S, Huang X, Deng Y: Detecting differentially methylated loci for Illumina Array methylation data based on human ovarian cancer data. BMC Med Genomics 2013,6(Suppl 1):S9.PubMed CentralPubMedGoogle Scholar
- Baumgartner W, Weiß P, Schindler H: A nonparametric test for the general two-sample problem. Biometrics 1998, 54: 1129-1135. 10.2307/2533862View ArticleGoogle Scholar
- Neuhäuser M: Exact tests for the analysis of case-control studies of genetic markers. Hum Hered 2002,54(3):151-156. 10.1159/000068838View ArticlePubMedGoogle Scholar
- Fisher RA: Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1932.Google Scholar
- Owen AB: Karl Pearson's meta-analysis revisited. Ann Statist 2009,37(6B):3867-3892. 10.1214/09-AOS697View ArticleGoogle Scholar
- Chen Z, Ng HKT: A robust method for testing association in genome-wide association studies. Hum Hered 2012,73(1):26-34. 10.1159/000334719PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z: A new association test based on Chi‐square partition for case‐control GWA studies. Genet Epidemiol 2011,35(7):658-663. 10.1002/gepi.20615View ArticlePubMedGoogle Scholar
- Chen Z, Huang H, Ng HKT: Testing for association in case-control genome-wide association studies with shared controls. Stat Methods Med Res 2013. First published on February 1, 2013 First published on February 1, 2013 10.1177/0962280212474061Google Scholar
- Chen Z, Huang H, Ng HKT: Design and analysis of multiple diseases genome-wide association studies without controls. Gene 2012,510(1):87-92. 10.1016/j.gene.2012.07.089PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y: Statistical methods on detecting differentially expressed genes for RNA-seq data. BMC Syst Biol 2011,5(Suppl 3):S1. 10.1186/1752-0509-5-S3-S1View ArticleGoogle Scholar
- Chen Z, McGee M, Liu Q, Kong YM, Huang X, Yang JY, Scheuermann RH: Identifying differentially expressed genes based on probe level data for GeneChip arrays. Int J Comput Biol Drug Des 2010,3(3):237-257. 10.1504/IJCBDD.2010.038028View ArticlePubMedGoogle Scholar
- Chen Z, Liu Q, McGee M, Kong M, Huang X, Deng Y, Scheuermann RH: A gene selection method for GeneChip array data with small sample sizes. BMC Genomics 2011,12(Suppl 5):S7. 10.1186/1471-2164-12-S5-S7View ArticleGoogle Scholar
- Chen Z, Liu Q: A New approach to account for the correlations among single nucleotide polymorphisms in genome-wide association studies. Hum Hered 2011,72(1):1-9. 10.1159/000330135PubMed CentralView ArticlePubMedGoogle Scholar
- Chen Z: Is the weighted z‐test the best method for combining probabilities from independent tests? J Evol Biol 2011,24(4):926-930. 10.1111/j.1420-9101.2010.02226.xView ArticlePubMedGoogle Scholar
- Chen Z, Nadarajah S: Comments on ‘Choosing an optimal method to combine p‐values’ by Sungho Won, Nathan Morris, Qing Lu and Robert C. Elston, Statistics in Medicine 2009; 28: 1537-1553. Stat Med 2011,30(24):2959-2961. 10.1002/sim.4222View ArticlePubMedGoogle Scholar
- Lancaster H: The combination of probabilities: an application of orthonormal functions. Austral J Statist 1961, 3: 20-33. 10.1111/j.1467-842X.1961.tb00058.xView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.