Prediction of epigenetically regulated genes in breast cancer cell lines
© Loss et al; licensee BioMed Central Ltd. 2010
Received: 22 August 2009
Accepted: 4 June 2010
Published: 4 June 2010
Methylation of CpG islands within the DNA promoter regions is one mechanism that leads to aberrant gene expression in cancer. In particular, the abnormal methylation of CpG islands may silence associated genes. Therefore, using high-throughput microarrays to measure CpG island methylation will lead to better understanding of tumor pathobiology and progression, while revealing potentially new biomarkers. We have examined a recently developed high-throughput technology for measuring genome-wide methylation patterns called mTACL. Here, we propose a computational pipeline for integrating gene expression and CpG island methylation profiles to identify epigenetically regulated genes for a panel of 45 breast cancer cell lines, which is widely used in the Integrative Cancer Biology Program (ICBP). The pipeline (i) reduces the dimensionality of the methylation data, (ii) associates the reduced methylation data with gene expression data, and (iii) ranks methylation-expression associations according to their epigenetic regulation. Dimensionality reduction is performed in two steps: (i) methylation sites are grouped across the genome to identify regions of interest, and (ii) methylation profiles are clustered within each region. Associations between the clustered methylation and the gene expression data sets generate candidate matches within a fixed neighborhood around each gene. Finally, the methylation-expression associations are ranked through a logistic regression, and their significance is quantified through permutation analysis.
Our two-step dimensionality reduction compressed the original data by about 90%, reducing 137,688 methylation sites to 14,505 clusters. Methylation-expression associations produced 18,312 correspondences, which were used to further analyze epigenetic regulation. Logistic regression was used to identify 58 genes from these correspondences that showed a statistically significant negative correlation between methylation profiles and gene expression in the panel of breast cancer cell lines. Subnetwork enrichment of these genes has identified 35 common regulators with 6 or more predicted markers. In addition to identifying epigenetically regulated genes, we show evidence of differentially expressed methylation patterns between the basal and luminal subtypes.
Our results indicate that the proposed computational protocol is a viable platform for identifying epigenetically regulated genes. Our protocol has generated a list of predictors including COL1A2, TOP2A, TFF1, and VAV3, genes whose key roles in epigenetic regulation are documented in the literature. Subnetwork enrichment of these predicted markers further suggests that epigenetic regulation of individual genes occurs in a coordinated fashion and through common regulators.
Epigenetic regulation and methylation-expression associations
Epigenetics refers to the study of heritable changes that cannot be explained by changes in the DNA sequence [1–4]. One mechanism of epigenetic regulation involves DNA methylation of CG dinucleotides, commonly represented as CpG. Around 50% of the protein-coding genes are near CpG-rich sequences, known as CpG islands. Patterns of methylation in the CpG islands play an important role in regulating gene expression during both normal cellular development and disease processes. Increased methylation of CpG islands (hypermethylation) in tumor suppressor genes has been observed during tumor progression and metastasis as a result of aberrant methylation patterns [5, 6]. At the same time, aberrations leading to decreased methylation of CpG islands (hypomethylation) of oncogenes are known to occur [7]. A review of epigenetics in cancer and the role of DNA methylation markers can be found in [8]. Since hyper- and hypomethylation of the genome are considered widespread attributes of tumors, predicting the regulation of gene expression through CpG island methylation at an epigenome level will provide a better understanding of tumor pathobiology and progression.
To measure genome-wide methylation, we used Target Amplification by Capture and Ligation (mTACL), a high-throughput technique developed by Affymetrix Inc., which has been used to measure the methylation of 145,148 CpGs in the promoters of 5,472 genes for 221 samples [9]. In the mTACL approach, regions of the genome to be analyzed (the targets) are first captured using dU probes. Such probes contain segments of DNA complementary to the targets with all the thymidines (T) substituted by uridines (U), and two common primers flanking the target sequences. mTACL has about 19,250 dU probes within the vicinity of transcriptional start sites of 5,472 genes, with 170,000 CpGs that are potentially relevant in tumorigenesis. Moreover, the dU probes were designed so that they hybridize specifically to target genomic DNA digested with restriction enzymes MspI and HpyF3I, along with adaptor oligonucleotides complementary to the common primers of dU probes. All cytosines (C) of the adaptor oligonucleotides were substituted with 5-methylcytosine (5-mC). dU probes, adaptor oligonucleotides and the target genomic regions were hybridized using the "touchdown annealing" protocol followed by ligation of oligonucleotides to the ends of the target genomic DNA. After ligation, the dU probes were removed by digestion using uracil DNA-glycosylase, leaving only the target genomic DNA ligated to common primers. Later, the target DNA was treated with bisulfite followed by amplification using common primers and hybridization to a microarray containing 21-mer probes that span the CpGs in the target DNA. The extent of CpG methylation is measured using the relative signal of two probes (a probe set) for each CpG: one corresponding to the case in which the CpG(s) covered by the probe are methylated, and the other to the sequence in which the CpG(s) covered by the probe are unmethylated. At least 3 different probe sets cover the same CpG.
The resulting hybridization signals were translated into methylation values using logistic regression by fitting models of the relative probe signal to percentage methylation for each CpG. The regression used artificial samples of known CpG methylation (i.e., 0, 10, 25, 50, 75 and 100%), and the quality of fit was assessed with r².
Identifying epigenetically regulated genes
This paper discusses how a novel computational protocol can be used to integrate CpG methylation and gene expression data sets to systematically identify epigenetically regulated genes. Our assumption is that the effect of DNA methylation on gene expression is local and limited to the promoter region. A computational protocol for the exploratory analysis of epigenetic regulation using coupled methylation and expression data was proposed by Sjahputera et al. [10]. Their work investigated differential methylation hybridization and associated gene expression data to build a relational data space for non-Hodgkin's lymphoma. Fuzzy set theory was used to identify epigenetically regulated genes from the relational data space. In this process, methylation-expression associations were transformed into a logarithmic map, which was divided into four discriminative quadrants. Each quadrant represented one of four gene regulation behaviors (i.e., hypermethylation and up-regulation; hypomethylation and up-regulation; hypermethylation and down-regulation; and hypomethylation and down-regulation). Clustering was applied to sets of associations, and the epigenetic regulation was determined from each cluster's location and quadrant membership. A measure of confidence was then computed from the probabilities involved in the determination of the clusters. This computational framework suffers from a number of limitations in the context of the high-dimensional mTACL technology: (i) processing time of the high-volume relational data may be prohibitive; (ii) fuzzy clustering approaches are iterative and sensitive to the initial conditions, which may lead to unstable solutions; (iii) the division of quadrants is arbitrary and too rigid to incorporate the natural scale of data; and (iv) confidence in the solution is not established in terms of statistical significance (i.e., p-value).
To overcome the issues described above, we first reduced the dimensionality of the methylation data to alleviate the computational load. This enables efficient correlative analysis and assignment of p-values through permutation analysis, which would otherwise be unmanageable in the original space. To this end, we used the following two-step clustering approach: (i) grouping along the genome to reveal regions with a high concentration of assayed methylation sites, and (ii) clustering of methylation profiles within each region to identify similar methylation patterns. For the latter, we used spectral clustering, as it offers a number of advantages. For instance, it is noniterative; it can identify clusters along nonlinear boundaries; and it has been reported to outperform other techniques [11, 12]. Its improved performance is attributed to the transformation of data into a higher-dimensional space, which requires less complex problem solving than in the original data space . Here, a K-Spectral Clustering (KSC) is employed, and optimal input parameters are determined automatically. Second, associations between clustered methylation and gene expression data sets are produced by setting a fixed constraint of 20,000 base pairs in the vicinity of either the 5' or 3' end to match methylation sites to their genes. Third, prediction and ranking of epigenetically regulated genes is performed based on logistic regressions of the methylation-expression associations onto an exponential curve. This logistic approach is flexible enough to incorporate any data scale and distribution, and does not contain rigid and arbitrary definitions that could limit its application. Finally, the significance of each logistic regression is verified by permutation analysis, which yields a p-value.
Clustering on the basis of proximity: In this step, regions of concentration are identified by the proximity of CpG methylation sites along the genome. In each chromosome, neighboring methylation sites separated by at most 2,000 base pairs are aggregated into the same region, whereas a gap of more than 2,000 base pairs between neighboring sites starts a new region. Such regions provide a spatial context for methylation sites, grouping nearby sites and isolating distant chromosomal regions. This is an important step for subsequent clustering based on the similarity of methylation profiles.
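The proximity grouping can be illustrated with a minimal sketch. It assumes that sites on one chromosome are given as integer base-pair coordinates and that a gap of exactly 2,000 base pairs still joins two sites; the function name `group_by_proximity` and the coordinates are hypothetical and not taken from the original implementation.

```python
def group_by_proximity(positions, max_gap=2000):
    """Group methylation-site coordinates on one chromosome into regions.

    Consecutive sites (after sorting) that lie at most `max_gap` base pairs
    apart share a region; a larger gap starts a new region.
    """
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][-1] <= max_gap:
            regions[-1].append(pos)   # close enough: extend the current region
        else:
            regions.append([pos])     # gap too large (or first site): new region
    return regions


# Hypothetical coordinates for illustration only.
sites = [1000, 1500, 3400, 9000, 9100, 30000]
print(group_by_proximity(sites))
```

With these toy coordinates, the first three sites form one region, the next two a second region, and the last site a region of its own.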
Clustering on the basis of similarity: In this step, methylation profiles are clustered to identify cross-similarities within each region. Prior to the clustering, however, methylation profiles are pre-processed and represented by their largest principal components, which together capture 99% of the variance in the data. This is a standard approach and well documented in the machine learning literature. Clustering high-dimensional data in their principal component space results in lower computational complexity and lower risk from the curse of dimensionality . The clustering method used here is unsupervised, and based on K-Spectral Clustering (KSC) , as discussed below.
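As a small illustration of the 99% criterion, the sketch below picks the number of leading principal components whose cumulative variance reaches a target fraction. It assumes the per-component variances (the eigenvalues of the covariance matrix) have already been computed elsewhere; the function name and the example variances are hypothetical.

```python
def components_for_variance(variances, target=0.99):
    """Return how many leading components are needed to reach `target`
    of the total variance, given per-component variances."""
    ordered = sorted(variances, reverse=True)   # largest components first
    total = sum(ordered)
    cumulative = 0.0
    for k, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical variance spectrum for illustration only.
spectrum = [60.0, 25.0, 10.0, 4.5, 0.3, 0.2]
print(components_for_variance(spectrum))
```

For this toy spectrum, three components capture 95% of the variance and four capture 99.5%, so four components are retained.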
where D is a diagonal matrix whose elements D_{i,i} are the sums of the i-th rows of A. Let X be the n × k matrix that is formed by the k largest principal components of L. K-means clustering  is then applied to the normalized matrix Y, whose elements are represented by . Finally, the methylation profiles S_i receive the same clustering assignment proposed for Y by k-means, i.e., a profile S_i is assigned to cluster j if and only if row Y_i is assigned to cluster j.
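For orientation, the normalized spectral clustering algorithm of Ng, Jordan and Weiss, which the description above closely follows, can be written as below. This is a sketch under the assumption that KSC uses the same Gaussian affinity and row normalization; the scale parameter \sigma is an assumed symbol that is not defined in the text, and the authors' exact definitions may differ.

```latex
\begin{align}
A_{ij} &= \exp\!\left(-\frac{\lVert S_i - S_j \rVert^{2}}{2\sigma^{2}}\right)
          \quad (i \neq j), \qquad A_{ii} = 0, \\
D_{ii} &= \sum_{j=1}^{n} A_{ij}, \\
L      &= D^{-1/2}\, A\, D^{-1/2}, \\
Y_{ij} &= \frac{X_{ij}}{\left(\sum_{j'=1}^{k} X_{ij'}^{2}\right)^{1/2}} .
\end{align}
```

Under this formulation, each row of Y has unit length, so k-means operates on points lying on the unit sphere in the k-dimensional space.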
Each representative methylation site, averaged over members of the same cluster, is associated with a gene or a set of genes. The association uses only the position of the methylation site and the base range of the gene's probe set. A gene may have multiple probe sets in the expression data, which cover different portions of a chromosome. An association is created when a representative methylation site lies (i) within a gene probe set, or (ii) within a 20,000-base-pair window adjacent to the gene probe set. The latter accounts for the natural uncertainty in locating a potential CpG island along the DNA.
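The association rule can be sketched as a simple interval test. The sketch assumes that each representative site is described by a chromosome and a coordinate, that each probe set is described by a gene name, a chromosome, and a start/end range, and that the 20,000-base-pair window applies on both sides of the probe set. All identifiers and coordinates below are hypothetical.

```python
def associate(sites, probe_sets, window=20000):
    """Pair representative methylation sites with gene probe sets.

    sites:      dict mapping site id -> (chromosome, position)
    probe_sets: list of (gene, chromosome, start, end) tuples
    A pair is produced when the site lies inside the probe-set range
    or within `window` base pairs of either end.
    """
    pairs = []
    for site_id, (chrom, pos) in sites.items():
        for gene, p_chrom, start, end in probe_sets:
            if chrom == p_chrom and start - window <= pos <= end + window:
                pairs.append((site_id, gene))
    return pairs


# Hypothetical sites and probe sets for illustration only.
sites = {"c1": ("chr7", 105000), "c2": ("chr7", 200000), "c3": ("chr17", 35000)}
probe_sets = [
    ("GENE_A", "chr7", 110000, 112000),
    ("GENE_B", "chr7", 120000, 125000),
    ("GENE_C", "chr17", 10000, 20000),
]
print(associate(sites, probe_sets))
```

In this toy example, site c1 is matched to two genes, which mirrors the statement that a site may be associated with a set of genes, while c2 lies too far from every probe set to be matched.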
Logistic regression and assignment of p-value
The p-value is computed by comparing the value of R resulting from the curve regression with the values R_m, m = 1, 2, ..., M, resulting from M attempts at fitting the same curve after permuting the methylation measurements of each association. In our implementation, M is set at 10,000.
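A permutation test of this kind can be sketched as follows. Because the definition of R is not reproduced here, the sketch substitutes the squared Pearson correlation as a stand-in fit statistic; this substitution, the function names, the add-one p-value estimate, and the example values are all assumptions made for illustration rather than the authors' implementation.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation, used here only as a stand-in for R."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def permutation_p_value(methylation, expression, statistic=pearson_r2,
                        n_perm=10000, seed=0):
    """Estimate how often a permuted association fits at least as well
    as the observed one (add-one estimate, so p is never exactly 0)."""
    rng = random.Random(seed)
    observed = statistic(methylation, expression)
    shuffled = list(methylation)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                      # permute methylation only
        if statistic(shuffled, expression) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical measurements across ten cell lines, for illustration only.
methylation = [0.90, 0.80, 0.85, 0.10, 0.20, 0.15, 0.70, 0.30, 0.95, 0.05]
expression = [1.0, 1.5, 1.2, 8.0, 7.5, 7.8, 2.0, 6.5, 0.8, 8.5]
print(permutation_p_value(methylation, expression))
```

Any other goodness-of-fit measure can be passed through the `statistic` argument, so the same loop would apply to the R of the curve regression described above.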
Results and Discussion
Panel of cell lines.
The first clustering step grouped the 137,688 methylation sites on the Affymetrix array into 5,785 distinct clusters (regions of concentration) across 23 chromosomes. From these 5,785 regions, the second clustering step generated 14,505 clusters, and produced representative methylation patterns by averaging each cluster's members. Note that this result represents a reduction of around 90% from the original raw data. Furthermore, 99% of the variance across cell lines was found to be concentrated in 12 to 14 principal components, which reveals a high correlation between cell lines. Subsequent associations between the reduced methylation data and the gene expression data generated 18,312 associations.
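The reported reduction can be checked directly from the counts above; the average number of clusters per region is an additional derived figure that follows from the same counts.

```latex
\begin{align}
1 - \frac{14{,}505}{137{,}688} &\approx 1 - 0.105 = 0.895 \quad (\text{about } 90\%), \\
\frac{14{,}505}{5{,}785} &\approx 2.5 \ \text{clusters per region on average}.
\end{align}
```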
Logistic regression and assignment of p-value
We compared our selected markers with two cancer-specific gene sets: (i) 5,900 genes that The Cancer Genome Atlas Project (TCGA) is targeting for sequencing , and (ii) genes that were selected using Prediction Analysis of Microarrays (PAM) as described in . The TCGA gene set represents genes that are widely expressed in cancer, whereas the PAM gene set represents breast cancer subtypes. We found that 66% and 22% of our gene list are also in the TCGA and the PAM data sets, respectively. This analysis is promising since (i) the TCGA gene list is not specific to breast tissue, and (ii) the PAM data set does not incorporate methylation data; thus, by incorporating methylation data, a reduced number of biomarkers can be hypothesized.
Epigenetically regulated genes
COL1A2 plays an important role in collagen production and tumor development , and is hypermethylated and down-regulated in about 40% of the ICBP cell lines. Let us assume that hypermethylation and up-regulation account for measurements above the 50% threshold. It is interesting to note that our method has identified epigenetic regulation of COL1A2 even in the presence of only 3 up-regulated cell lines. These 3 lines are not outliers as the computational protocol has generated a hypothesis for further bioinformatics analysis. Epigenetic regulation of COL1A2 is consistent with the published literature, which suggests that its down-regulation correlates with hypermethylation, and is a frequent event in breast cancer cell lines such as MCF7 and HS578T . Furthermore, aberrant methylation of COL1A2 has been identified in medulloblastoma and hepatoma [20, 21], where biallelic methylation of COL1A2 was observed in 77% of medulloblastomas and was shown to distinguish histological subtypes of tumors . TOP2A is an enzyme involved in controlling the topological state of DNA. TOP2A is hypomethylated and up-regulated in approximately 50% of the ICBP cell lines. TOP2A is (i) a good prognostic marker in breast cancer and of response to therapy , (ii) a prognostic factor for ER-positive breast cancer , and (iii) epigenetically regulated for cellular assembly and organization in lymphoblastoid cell lines .
The function of TFF1 is not well known to date. However, it has been widely studied because of its presence in human tumors. For example, a recent study has identified and validated over-expression of TFF1 in breast carcinoma . Another study has reported decreased methylation levels in breast tumor cells , while a much older study states that TFF1 expression is regulated by DNA methylation in breast cancer . VAV3 is a nucleotide exchange factor that activates rearrangement of actin filaments, and its association shows that only 4 cell lines are hypermethylated. Epigenetic regulation of VAV3 is consistent with a recent report showing that 83% of breast tumors overexpress VAV3 .
CDKN2A is part of the cell cycle machinery and is an important tumor suppressor gene. Our analysis indicates that CDKN2A is hypermethylated and down-regulated in only about 30% of the samples, whereas the majority of the samples are hypomethylated and down-regulated. This discrepancy can be explained by DNA copy number loss or CDKN2A mutation, which is frequently associated with pathophysiology of certain types of cancers, including breast cancer [29–32].
Subnetwork enrichment analysis
CYP1B1, CTGF, ESR1, IGFBP5, TFF1, KRT18, INHBA, IGFBP2
Subtype-specific epigenetic regulation
In this paper, we proposed a computational pipeline for identifying epigenetically regulated genes for a panel of breast cancer cell lines. The protocol avoids excessive computational complexity through a step-wise reduction of methylation data for the required expression data associations. To this end, a twofold clustering approach explored both the proximity of methylation sites and similarities among methylation profiles across cell lines. K-Spectral Clustering was presented and used in the latter step. As a result of data clustering, a number of representative methylation profiles were generated for direct association with candidate genes. Epigenetic regulation was estimated from logistic regressions of the methylation-expression associations and its significance verified through the computed p-value. The computational pipeline was applied to a panel of 45 breast cancer cell lines, and the protocol identified a list of 58 genes, including COL1A2, TOP2A, TFF1, and VAV3, whose key roles in epigenetic regulation are consistent with known literature. Subnetwork enrichment of these markers identified 35 common regulators of the type "Pathway" with 6 or more predicted genes, further suggesting that epigenetic regulation of individual genes occurs in a coordinated fashion and through common regulators. Our current efforts focus on associating methylation data with the therapeutic responses and other biological data derived from the same panel of cell lines.
This work was supported by the Director, Office of Science, Office of Biological & Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and by the National Institutes of Health, National Cancer Institute grants P50 CA 58207 and U54 CA 112970.
- Russo VEA, Martienssen RA, Riggs AD: Epigenetic mechanism of gene regulation. Cold Spring Harbor Laboratory Press; 1996.
- Dillon N: Gene regulation and large scale chromatin organization in the nucleus. Chromosome Research 2006, 14: 117–126. 10.1007/s10577-006-1027-8
- Esteller M: Cancer epigenomics: DNA methylomes and histone modification maps. Nature Review Genetics 2007, 286–298. 10.1038/nrg2005
- Bock C, Lengauer T: Computational epigenetics. Bioinformatics 2008, 24: 1–10. 10.1093/bioinformatics/btm546
- Jones PA, Baylin SB: The fundamental role of epigenetic events in cancer. Nature Rev Genetics 2002, 3: 415–428.
- Esteller M: CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future. Oncogene 2002, 21: 5427–5440. 10.1038/sj.onc.1205600
- Widschwendter M, Jiang G, Woods C, Müller HM, Fiegl H, Goebel G, Marth C, Müller-Holzner E, Zeimet AG, Laird PW, Ehrlich M: DNA Hypomethylation and Ovarian Cancer Biology. Cancer Research 2004, 64: 4472–4480. 10.1158/0008-5472.CAN-04-0238
- Laird PW: The power and the promise of DNA methylation markers. Nature Rev Cancer 2003, 3(4):253–266. 10.1038/nrc1045
- Nautiyal S, Carlton V, Lu Y, Ireland J, Flaucher D, Moorhead M, Gray J, Spellman P, Mindrinos M, Berg P, Faham M: A High-Throughput Method for Analyzing Methylation of CpGs in Targeted Genomic Regions. PNAS, in press.
- Sjahputera O, Keller JM, Davis JW, Taylor KH, Rahmatpanah F, Shi H, Anderson DT, Blisard SN, Luke RH III, Popescu M, Arthur GC, Caldwell CW: Relational Analysis of CpG Islands Methylation and Gene Expression in Human Lymphomas Using Possibilistic C-Means Clustering and Modified Cluster Fuzzy Density. IEEE Transactions on Computational Biology and Bioinformatics 2007, 4(2):176–189. 10.1109/TCBB.2007.070205
- Chung F: Spectral graph theory. Volume 92. CBMS Regional Conference Series in Mathematics, American Mathematical Society; 1997.
- Shi J, Malik J: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22: 888–905. 10.1109/34.868688
- Ng AY, Jordan MI, Weiss Y: On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 2001, 14: 849–856.
- Bellman R: Adaptive Control Processes: A Guided Tour. Princeton University Press; 1961.
- Hartigan JA, Wong MA: A K-Means Clustering Algorithm. Applied Statistics 1979, 28: 100–108. 10.2307/2346830
- Jain A, Dubes R: Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice Hall; 1988.
- Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F, Speed T, Spellman PT, DeVries S, Lapuk A, Wang NJ, Kuo WL, Stilwell JL, Pinkel D, Albertson DG, Waldman FM, McCormick F, Dickson RB, Johnson MD, Lippman M, Ethier S, Gazdar A, Gray JW: A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 2006, 10: 515–527. 10.1016/j.ccr.2006.10.008
- Sengupta PK, Smith EM, Kim K, Murnane MJ, Smith BD: DNA hypermethylation near the transcription start site of collagen alpha2(I) gene occurs in both cancer cell lines and primary colorectal cancers. Cancer Research 2003, 63: 1789–1797.
- Anderton JA, Lindsey JC, Lusher ME, Gilbertson RJ, Bailey S, Ellison DW, Clifford SC: Global analysis of the medulloblastoma epigenome identifies disease-subgroup-specific inactivation of COL1A2. Neuro-oncology 2008, 10: 981–994. 10.1215/15228517-2008-048
- Chiba T, Yokosuka O, Fukai K, Hirasawa Y, Tada M, Mikata R, Imazeki F, Taniguchi H, Iwama A, Miyazaki M, Ochiai T, Saisho H: Identification and investigation of methylated genes in hepatoma. Eur J Cancer 2005, 41: 1185–1194. 10.1016/j.ejca.2005.02.014
- Brasel JC, Schmidt M, Fischbach T, Sültmann H, Bojar H, Koelbl H, Hellwig B, Rahnenführer J, Hengstler JG, Gehrmann MC: ERBB2 and TOP2A in Breast Cancer: A Comprehensive Analysis of Gene Amplification, RNA Levels, and Protein Expression and Their Influence on Prognosis and Prediction. Clinical Cancer Research 2010, 16: 2391.
- Rody A, Karn T, Ruckhaberle E, Muller V, Gehrmann M, Solbach C, Ahr A, Gatje R, Holtrich U, Kaufmann M: Gene expression of topoisomerase II alpha (TOP2A) by microarray analysis is highly prognostic in estrogen receptor (ER) positive breast cancer. Breast Cancer Research and Treatment 2009, 113: 457–466. 10.1007/s10549-008-9964-x
- Nguyen A, Rauch TA, Pfeifer GP, Hu VW: Global methylation profiling of lymphoblastoid cell lines reveals epigenetic contributions to autism spectrum disorders and a novel autism candidate gene, RORA, whose protein product is reduced in autistic brain. FJ Express 10.1096/fj.fj.10–154484 2010.
- Davidson B, Stavnes HT, Holth A, Chen X, Yang Y, Shih IM, L WT: Gene expression signatures differentiate ovarian/peritoneal serous carcinoma from breast carcinoma in effusions. J Cell Mol Med 2010.
- Dietrich D, Lesche R, Tetzner R, Krispin M, Dietrich J, Haedicke W, Schuster M, Kristiansen G: Analysis of DNA methylation of multiple genes in microdissected cells from formalin-fixed and paraffin-embedded tissues. J Histochem Cytochem 2009, 57(5):477–489. 10.1369/jhc.2009.953026
- Martin V, Ribieras S, Song-Wang XG, Lasne Y, Frappart L, Rio MC, Dante R: Involvement of DNA methylation in the control of the expression of an estrogen-induced breast-cancer-associated protein (pS2) in human breast cancers. Journal of Cellular Biochemistry 1997, 65: 95–106. 10.1002/(SICI)1097-4644(199704)65:1<95::AID-JCB10>3.0.CO;2-G
- Lee K, Liu Y, Mo JQ, Zhang J, Dong Z, Lu S: Vav3 oncogene activates estrogen receptor and its overexpression may be involved in human breast cancer. BMC Cancer 2008, 8: 158. 10.1186/1471-2407-8-158
- Borg A, Sandberg T, Nilsson K, Johannsson O, Klinker M, Masback A, Westerdahl J, Olsson H, Ingvar C: High frequency of multiple melanomas and breast and pancreas carcinomas in CDKN2A mutation-positive melanoma families. Journal of the National Cancer Institute 2000, 92: 1260–1265. 10.1093/jnci/92.15.1260
- Johnson N, Speirs V, Curtin NJ, Hall AG: A comparative study of genome-wide SNP, CGH microarray and protein expression analysis to explore genotypic and phenotypic mechanisms of acquired antiestrogen resistance in breast cancer. Breast Cancer Research and Treatment 2008, 111: 55–63. 10.1007/s10549-007-9758-6
- Hui AM, Shi YZ, Li X, Takayama T, Makuuchi M: Loss of p16(INK4) protein, alone and together with loss of retinoblastoma protein, correlate with hepatocellular carcinoma progression. Cancer Letters 2000, 154: 93–99. 10.1016/S0304-3835(00)00385-2
- Bardeesy N, Aguirre AJ, Chu GC, Cheng KH, Lopez L, Hezel AF, Feng B, Brennan C, Weissleder R, Mahmood U, Hanahan D, Redston MS, et al.: Both p16(Ink4a) and the p19(Arf)-p53 pathway constrain progression of pancreatic adenocarcinoma in the mouse. Proceedings of the National Academy of Sciences of the United States of America 2006, 103: 5947–5952. 10.1073/pnas.0601273103
- Wicki R, Franz C, Scholl FA, Heizmann CW, Schafer BW: Repression of the candidate tumor suppressor gene S100A2 in breast cancer is mediated by site-specific hypermethylation. Cell Calcium 1997, 22(4):243–254. 10.1016/S0143-4160(97)90063-4
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.