Screening genome-wide DNA methylation CpG sites via training and testing data utilizing surrogate variables


Screening cytosine-phosphate-guanine dinucleotide (CpG) DNA methylation sites for association with single-nucleotide polymorphisms (SNPs), covariates of interest, and/or their interactions is desirable before performing more complicated analyses, because genome-wide methylation data are high dimensional. The variation in methylation may not be fully explained by the SNPs and covariates of interest, so it is important to account for variation introduced by other, unknown factors. Furthermore, CpG sites screened from one data set may be inconsistent with those screened from another, so it is equally important to improve the reproducibility of the selected CpG sites.

Materials and methods

A user-friendly R package implementing a training-testing screening method (ttScreening) was developed to achieve these goals. It gives users the flexibility of choosing among three screening methods: the proposed training and testing method, a method controlling the false discovery rate (FDR), and a method controlling the significance level with the Bonferroni correction.
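The two comparison methods can be illustrated with a minimal Python sketch. It is not code from the ttScreening package: the function names `bonferroni_select` and `bh_select` are hypothetical, and the Benjamini-Hochberg step-up procedure is assumed here as one common way to control the FDR, since the abstract does not name the specific FDR procedure used.

```python
import numpy as np

def bonferroni_select(pvals, alpha=0.05):
    """Indices of tests with p < alpha / m (Bonferroni correction)."""
    pvals = np.asarray(pvals, dtype=float)
    return np.flatnonzero(pvals < alpha / pvals.size)

def bh_select(pvals, q=0.05):
    """Indices selected by the Benjamini-Hochberg step-up procedure.

    Assumption: BH stands in for "a method controlling FDR".
    """
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                  # ranks p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m   # q * k / m for rank k
    passed = np.flatnonzero(pvals[order] <= thresholds)
    if passed.size == 0:
        return np.array([], dtype=int)
    k = passed[-1]                             # largest rank meeting its threshold
    return np.sort(order[: k + 1])             # keep every test up to that rank

# Made-up p-values, one per CpG site, for illustration only.
pvals = [0.001, 0.012, 0.02, 0.035, 0.6]
print(bonferroni_select(pvals))
print(bh_select(pvals))
```

On these made-up p-values, the Bonferroni cutoff is 0.05 / 5 = 0.01, while the Benjamini-Hochberg thresholds grow with rank, so the FDR-based rule is the less stringent of the two.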


Linear regressions were applied in the screening process, with the methylation of a CpG site as the dependent variable and a single SNP, a covariate, and possibly their interaction as independent variables. Surrogate variable analysis was included to adjust for the effects of unknown factors. Randomly chosen training and testing samples were used to estimate and test the effects, respectively. Simulations under different scenarios were run to assess the robustness and sensitivity of the method and to compare it with the other two screening methods.

In almost all simulation scenarios, the training and testing method outperformed the other two methods in correctly identifying important CpG sites; in the remaining scenarios, it performed equally well. We applied ttScreening to 40,000 CpG sites, screening on their association with smoking and forced vital capacity. The ttScreening method selected 9 CpG sites, whereas the other two methods selected none.
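The training and testing idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the ttScreening implementation: it uses a single predictor, omits surrogate variables and interaction terms, and approximates p-values with the normal distribution. The names `slope_pvalue` and `tt_screen`, the number of iterations, the training fraction, the significance levels, and the rule of keeping sites that pass in at least half of the random splits are all hypothetical choices made for this example.

```python
import math
import numpy as np

def slope_pvalue(x, y):
    """Two-sided p-value for the slope in y ~ 1 + x (normal approximation)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    z = beta[1] / math.sqrt(cov[1, 1])
    return math.erfc(abs(z) / math.sqrt(2))

def tt_screen(meth, x, n_iter=100, train_frac=2 / 3,
              alpha_train=0.05, alpha_test=0.05, min_pass=0.5, seed=0):
    """Keep sites significant in both training and testing samples
    in at least `min_pass` of the random splits (assumed rule)."""
    rng = np.random.default_rng(seed)
    n, m = meth.shape
    n_train = int(round(train_frac * n))
    passes = np.zeros(m)
    for _ in range(n_iter):
        perm = rng.permutation(n)              # fresh random split each iteration
        tr, te = perm[:n_train], perm[n_train:]
        for j in range(m):
            if (slope_pvalue(x[tr], meth[tr, j]) < alpha_train
                    and slope_pvalue(x[te], meth[te, j]) < alpha_test):
                passes[j] += 1
    return np.flatnonzero(passes / n_iter >= min_pass)

# Simulated data: 120 subjects, 20 CpG sites, only site 0 depends on x.
rng = np.random.default_rng(1)
x = rng.normal(size=120)
meth = rng.normal(size=(120, 20))
meth[:, 0] += 2.0 * x
selected = tt_screen(meth, x, n_iter=20)
print(selected)
```

Requiring a site to be significant in both halves of many random splits is what targets reproducibility in this sketch: an association that appears only in a particular subset of subjects is unlikely to pass repeatedly.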


Our simulation results indicate that ttScreening performs better than FDR-based screening and at least as well as the Bonferroni-based method in correctly identifying CpG sites that are associated with other variables. The package is computationally efficient and user-friendly, which makes it suitable for dimension reduction in high-dimensional data and broadly applicable beyond epigenetic studies. The package can be downloaded at [1].

Author information


Corresponding author

Correspondence to Hongmei Zhang.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ray, M., Tong, X., Zhang, H. et al. Screening genome-wide DNA methylation CpG sites via training and testing data utilizing surrogate variables. BMC Bioinformatics 15 (Suppl 10), P4 (2014).
