Identification of markers associated with global changes in DNA methylation regulation in cancers
© Qiu and Zhang; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
DNA methylation exhibits different patterns in different cancers. DNA methylation rates at different genomic loci appear to be highly correlated in some samples but not in others. We call such phenomena conditional concordant relationships (CCRs). In this study, we explored DNA methylation patterns in 12 common cancers using data of 2434 patient samples collected by The Cancer Genome Atlas project. We developed an exploratory method to characterize CCRs in the methylation data and identified the 200 gene markers whose on-and-off statuses in DNA methylation are most significantly associated with drastic changes in CCRs throughout the genome. Clustering analysis of the methylation data of the 200 markers showed that they are tightly associated with cancer subtypes. We also generated a library of the significant CCRs that may be of interest to future studies of the regulation network of DNA methylation in cancer.
DNA methylation plays an important role in carcinogenesis and cancer progression: hypermethylation turns off the expression of tumor suppressors, and hypomethylation activates the expression of oncogenes. Genomic analyses of DNA methylation using microarrays and next-generation sequencing technologies have shown that various forms of neoplasia and cancers are associated with massive changes in DNA methylation [2, 3]. Such changes are often distinctive depending on the subtype of cancer [4, 5]. DNA methylation in cells is apparently regulated by a large, intricate network. However, although a large number of genomic network studies have focused on gene expression, protein-protein interactions, and protein-DNA/RNA interactions [6, 7], little has been done to incorporate DNA methylation data into the study of the underlying regulatory network.
In general, relationships that link different genes at the DNA, RNA, protein, and metabolite levels depend strongly on the specific context, such as cell type, subcellular location, and the timing of the biological processes. A number of methods have been developed to uncover context-dependent relationships using gene expression data. For example, the liquid association model was developed to identify mediator genes that can modulate the coexpression of other pairs of genes. A few similar models have been proposed to describe three-way relationships among genes [9–11]. Coexpression patterns that depend on cancer type have been reported previously [12, 13]. The MINDy algorithm used conditional mutual information to identify modulators that strongly affect the concerted activities of transcription factors and their targets, and found novel modulators of MYC function in B cells.
In this study, we focused on the dynamic nature of concordant relationships between the methylation status of genes, using a large DNA methylation dataset of 2434 samples across 12 cancer types generated by The Cancer Genome Atlas (TCGA) project. We observed that many gene pairs showed dramatic changes in methylation patterns in different cancers. We call such phenomena conditional concordant relationships (CCRs). CCRs are commonly observed in cancer. For example, Hess et al. found that methylation of the ESR1 promoter is strongly predictive of the concurrent methylation of a group of tumor suppressors in acute myeloid leukemia, and is associated with clinical outcome. Carvalho et al. found that the concurrent methylation of a group of cancer-related genes is associated with the microsatellite instability phenotype. We are particularly interested in finding marker genes with the following property: depending on the methylation status of the marker, the patient samples can be dichotomized into two groups, and the gene-gene correlation matrices derived from the methylation data of the two groups are drastically different. Such markers are likely to be associated with global changes in methylation correlation patterns. This concept of a methylation marker resembles the modulator in three-way gene expression studies. We have developed a method to identify such markers. We demonstrate the utility of our approach for studying CCRs, classifying cancer subtypes, and exploring the patterns of DNA methylation in cancer.
Cancer type and sample size of TCGA methylation data
GBM - Glioblastoma multiforme
LAML - Acute myeloid leukemia
KIRC - Kidney renal clear cell carcinoma
KIRP - Kidney renal papillary cell carcinoma
LUAD - Lung adenocarcinoma
LUSC - Lung squamous cell carcinoma
STAD - Stomach adenocarcinoma
READ - Rectum adenocarcinoma
COAD - Colon adenocarcinoma
BRCA - Breast invasive carcinoma
UCEC - Uterine corpus endometrioid carcinoma
OV - Ovarian serous cystadenocarcinoma
For most cancer types where normal and cancerous samples were both available, the cancerous and corresponding normal samples were clustered close to each other. This observation suggests that the difference in methylation across different tissue types is larger than cancer-induced methylation changes. The only exception in this dataset was STAD. In Figure 5, we observed that the STAD normal samples were more similar to the lung samples, whereas the STAD cancer samples were more similar to the COAD and READ samples. This observation indicates that methylation might play a major role in stomach adenocarcinoma.
We have described an approach to explore complex patterns observed in DNA methylation data. We identified CCRs and markers associated with global changes in methylation correlation in different cancers. As expected, when the identified markers were used for clustering analysis, the clustering diagram largely coincided with cancer types, since distinct methylation patterns exist in different tissue types. We demonstrated that our approach can be used to uncover tissue types and subtypes of cancer. In this sense, our method is similar to feature selection and unsupervised clustering.
However, there are also important distinctions. In clustering methods, the common approach is to divide samples into groups so that within-group variation is small and between-group variation is large. In contrast, our method seeks markers that define two sample groups whose within-group correlation patterns differ. We do not require within-group variation to be small.
It should be noted that the associations between the markers and CCRs shown in this study are statistical associations identified from the data. The markers are not necessarily the causative agents that drive the changes in the correlations. Nevertheless, the markers provide candidates and useful information to identify the underlying causative agents. We believe that the main utility of our approach is to facilitate a systematic assessment of CCRs, which could be useful toward a better understanding of DNA methylation regulation in cancer.
The current study was limited to methylation data only. However, data from multiple platforms measuring gene expression, microRNA expression, DNA copy number, and somatic mutations can all be evaluated as candidate markers that affect CCRs in DNA methylation. Integrating data from multiple platforms will be increasingly powerful as more data are accumulated in the TCGA project.
In this study, we focused on the DNA methylation data provided by TCGA (http://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp). Genome-wide methylation measurements of 2434 samples were available, spanning 12 cancer types. The data were generated using the Illumina Infinium Human DNA Methylation27 array platform, which interrogates the methylation status of 27,578 CpG sites for each sample. The data are available at http://odin.mdacc.tmc.edu/~pqiu/projects/TCGAMethData/index.htm.
We used the level 3 methylation data defined by TCGA, which is the ratio $M_i/(U_i + M_i)$ for each CpG site $i$, where $M_i$ is the methylated probe intensity of CpG site $i$ and $U_i$ is the unmethylated probe intensity. The values therefore range from 0 (unmethylated) to 1 (completely methylated). The data contain null entries, which correspond to probes that overlap with known single nucleotide polymorphisms (SNPs) or other genomic variations, and to probes whose signal intensities are lower than the background.
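As an illustration only, the ratio described above can be computed from probe intensities as in the following minimal sketch. The function name `beta_values` and the choice to return NaN when both intensities are zero are assumptions made for this example; they are not part of the TCGA level 3 pipeline.

```python
import numpy as np

def beta_values(methylated, unmethylated):
    """Return M / (U + M) for each CpG site.

    Sites where U + M is zero are returned as NaN (an assumption of
    this sketch, standing in for a null entry).
    """
    m = np.asarray(methylated, dtype=float)
    u = np.asarray(unmethylated, dtype=float)
    total = m + u
    beta = np.full(total.shape, np.nan)
    # Divide only where the denominator is positive; other entries stay NaN.
    np.divide(m, total, out=beta, where=total > 0)
    return beta

# Hypothetical intensities for three CpG sites.
example = beta_values([0.0, 300.0, 900.0], [1000.0, 300.0, 100.0])
```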
In our analysis, we filtered out probes with many null entries (number of nulls more than 1% of the sample size) and probes with small standard deviation (SD < 0.1). Roughly 9000 probes survived these two filtering criteria and were considered in the analysis of CCRs.
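The two filtering criteria can be sketched as follows. This is a hedged example, not the code used in the study: it assumes the data are held in a probes-by-samples NumPy array with NaN marking null entries, and the names `filter_probes`, `max_null_frac`, and `min_sd` are hypothetical. The use of the sample standard deviation (`ddof=1`) is also an assumption.

```python
import numpy as np

def filter_probes(beta, max_null_frac=0.01, min_sd=0.1):
    """Return a boolean mask of probes that pass both filters.

    beta: probes x samples array, NaN marks a null entry.
    A probe is kept if its number of nulls is at most max_null_frac
    of the sample size and its standard deviation is at least min_sd.
    """
    beta = np.asarray(beta, dtype=float)
    n_samples = beta.shape[1]
    null_counts = np.isnan(beta).sum(axis=1)
    few_nulls = null_counts <= max_null_frac * n_samples
    sd = np.nanstd(beta, axis=1, ddof=1)  # ignore nulls when computing SD
    return few_nulls & (sd >= min_sd)

# Toy data: 4 probes, 200 samples, bimodal by construction.
demo = np.tile(np.r_[np.full(100, 0.05), np.full(100, 0.9)], (4, 1))
demo[1, :] = 0.5        # probe 1: no variation
demo[2, :5] = np.nan    # probe 2: 2.5% nulls
demo[3, :1] = np.nan    # probe 3: 0.5% nulls
keep = filter_probes(demo)
```

In this toy example, probes 0 and 3 pass, while probe 1 fails the standard deviation filter and probe 2 fails the null-entry filter.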
Although DNA methylation is a reversible process and methylated CpG sites may not be completely methylated, methylation data appear to be bimodal in general. By thresholding the ratio $M_i/(U_i + M_i)$ (using a nominal threshold of 0.2), we can use probe $i$ to divide samples into two groups. CpG site $i$ is unmethylated in one group and methylated in the other. If the methylation correlation patterns in the two sample groups are quite different, CpG site $i$ is likely to be related to global changes in methylation regulation.
Before calculating the changes in methylation correlation, clustering is performed to find modules of highly correlated probes. The purpose of this step is to reduce computational complexity. The pairwise correlations between modules can be used as surrogates of the pairwise correlations between individual probes.
We use a variation of the agglomerative clustering algorithm [24, 25]. This algorithm requires a user-specified threshold for cluster coherence, defined as the average Pearson correlation between each probe in the cluster and the cluster mean. This parameter determines the quality of the resulting clusters (the default setting is 0.7). At the beginning of the first iteration of the agglomerative algorithm, each probe forms its own cluster. One probe is randomly chosen and merged with its nearest neighbor as defined by Pearson correlation and average linkage, and these two probes become unavailable in the current iteration. Then, another probe is randomly chosen from the remaining ones and merged with its nearest neighbor, if the nearest neighbor is still available. Again, the chosen probe and its nearest neighbor become unavailable in the current iteration. If a merge results in a cluster whose coherence is below the user-specified threshold, the merge is rejected. After all the probes become unavailable, the first iteration ends and the number of clusters is reduced by approximately half. The same procedure is repeated in the second iteration to further reduce the number of clusters. The iterative process continues until all merges in a particular iteration are rejected.
This algorithm guarantees that the quality of all the resulting probe clusters is higher than the user-specified threshold. The average of each cluster can be viewed as a meta-probe that summarizes the average methylation status of the cluster of correlated probes.
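The iterative merging procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: nearest neighbors are computed against the clusters as they stood at the start of each iteration, a chosen cluster whose nearest neighbor is already unavailable is simply skipped for that iteration, and the input is assumed to contain no null entries and no constant probes. The names `coherence` and `cluster_probes` are hypothetical.

```python
import numpy as np

def coherence(data, members):
    """Average Pearson correlation between each member probe and the cluster mean."""
    if len(members) == 1:
        return 1.0
    block = data[members]
    mean = block.mean(axis=0)
    return float(np.mean([np.corrcoef(row, mean)[0, 1] for row in block]))

def cluster_probes(data, threshold=0.7, seed=0):
    """Cluster probes (rows of data) into lists of probe indices."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(data)                     # probe-by-probe Pearson correlation
    clusters = [[i] for i in range(data.shape[0])]
    while len(clusters) > 1:
        available = set(range(len(clusters)))
        merged, consumed = [], set()
        for a in rng.permutation(len(clusters)):
            a = int(a)
            if a not in available:
                continue
            available.discard(a)
            # Nearest neighbor by average linkage on Pearson correlation.
            sims = [corr[np.ix_(clusters[a], clusters[b])].mean() if b != a else -np.inf
                    for b in range(len(clusters))]
            b = int(np.argmax(sims))
            if b not in available:
                continue                         # neighbor already used in this iteration
            available.discard(b)
            candidate = clusters[a] + clusters[b]
            if coherence(data, candidate) >= threshold:
                merged.append(candidate)         # accept the merge
                consumed.update((a, b))
        if not merged:                           # every merge was rejected or skipped
            break
        clusters = merged + [c for k, c in enumerate(clusters) if k not in consumed]
    return clusters

# Toy data: two groups of four probes with opposite profiles.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
signs = np.array([1, 1, 1, 1, -1, -1, -1, -1])
demo = signs[:, None] * base + 0.1 * rng.normal(size=(8, 50))
modules = cluster_probes(demo, threshold=0.7)
```

Because a merge is accepted only when the merged cluster meets the coherence threshold, every cluster returned by this sketch satisfies it. The mean profile of each cluster (for example, `demo[c].mean(axis=0)` for a cluster `c`) plays the role of the meta-probe.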
To identify CCR-associated probes, we used the training samples to filter for roughly 9000 probes that had a small number of null entries and a high standard deviation. These probes were considered as candidates to be evaluated. We also applied the above agglomerative algorithm to the training set to cluster probes into modules of highly correlated probes, and we represented each module by the mean methylation profile of the probes in that module.
For each candidate probe, we evaluated whether its on-off status affected methylation correlation globally. We dichotomized the training samples into two groups (threshold = 0.2), computed the module-module correlation matrices for the two sample groups separately, applied the z-transform to each, and summarized the difference between the two transformed matrices as one scalar score, $s = \sum_{i,j} |z_1(i,j) - z_2(i,j)|$, where $z_k(i,j)$ is the z-transformed correlation between modules $i$ and $j$ in sample group $k$. If a candidate probe resulted in an extremely unbalanced split (i.e., the smaller sample group contained fewer than 15% of the samples), the probe was not scored, because correlations computed from a small number of samples may not be accurate or reliable. The candidate probes were rank-ordered according to their scores, so that the methylation status of the top-ranking probes was associated with large changes in methylation correlation.
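The scoring step can be sketched as follows. Several choices here are assumptions rather than details given in the text: the z-transform is taken to be the Fisher transformation (arctanh of the correlation), correlations are clipped slightly inside (-1, 1) so that the transform stays finite, the sum runs over module pairs with i < j (which halves the score but does not change the ranking), and samples with a ratio above the threshold are treated as methylated. The name `ccr_score` and its parameters are hypothetical.

```python
import numpy as np

def ccr_score(modules, marker, threshold=0.2, min_frac=0.15):
    """Score one candidate probe.

    modules: modules x samples array of meta-probe profiles.
    marker:  per-sample methylation ratio of the candidate probe (no nulls).
    Returns the summed absolute difference of z-transformed correlations,
    or None if the split is too unbalanced to score.
    """
    modules = np.asarray(modules, dtype=float)
    marker = np.asarray(marker, dtype=float)
    on = marker > threshold                       # methylated group
    smaller = min(on.sum(), (~on).sum())
    if smaller < min_frac * marker.size:
        return None                               # extremely unbalanced split

    def fisher_z(x):
        r = np.clip(np.corrcoef(x), -0.999999, 0.999999)
        return np.arctanh(r)

    z1 = fisher_z(modules[:, on])
    z2 = fisher_z(modules[:, ~on])
    upper = np.triu_indices(modules.shape[0], k=1)  # each module pair once
    return float(np.abs(z1[upper] - z2[upper]).sum())

# Toy data: two modules whose correlation flips sign with the marker status.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
marker = np.r_[np.full(50, 0.05), np.full(50, 0.9)]
flip = np.where(marker > 0.2, -1.0, 1.0)
same = ccr_score(np.vstack([x, x + 0.1 * rng.normal(size=100)]), marker)
changed = ccr_score(np.vstack([x, flip * x + 0.1 * rng.normal(size=100)]), marker)
unbalanced = ccr_score(np.vstack([x, x]), np.r_[np.full(95, 0.05), np.full(5, 0.9)])
```

In the toy data, the marker whose status flips the module-module correlation receives a larger score than the one that leaves it unchanged, and the 95/5 split is not scored.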
The authors thank Dr. Keith A. Baggerly, who assembled the TCGA methylation dataset used in this study and made insightful suggestions and comments. This work is partially supported by a TCGA Genome Data Analysis Center (GDAC) grant and the Cancer Center Support Grant at the University of Texas MD Anderson Cancer Center (U24 CA143883 02 S1 and P30 CA016672).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1