Epigenetic regulation and methylation-expression associations
Epigenetics refers to the study of heritable changes that cannot be explained by changes in the DNA sequence [1–4]. One mechanism of epigenetic regulation involves DNA methylation of CG dinucleotides, commonly represented as CpG. It is known that around 50% of the protein-coding genes are near CpG-rich sequences, known as CpG islands. Patterns of methylation in the CpG islands play an important role in regulating gene expression during both normal cellular development and disease processes. Increased methylation of CpG islands (hypermethylation) in tumor suppressor genes have been observed during tumor progression and metastasis as a result of aberrant methylation patterns [5, 6]. At the same time, aberrations leading to decreased methylation of CpG islands (hypomethylation) of oncogenes are known to occur [7]. A review of epigenetics in cancer and the role of DNA methylation markers can be found in [8]. Since hyper and hypomethylation of the genome are considered widespread attributes of tumors, predicting the regulation of gene expression through CpG island methylation at an epigenome level will provide a better understanding of the tumor pathobiology and progression.
To measure genome-wide methylation, we used Target Amplification by Capture and Ligation (mTACL), a high-throughput technique developed by Affymetrix Inc., which has been used to measure the methylation of 145,148 CpGs in the promoters of 5,472 genes for 221 samples [9]. In the mTACL approach, regions of the genome to be analyzed (the targets) are first captured using dU probes. Such probes contain segments of DNA complementary to the targets with all the thymidines (T) substituted by uridines (U), and two common primers flanking the target sequences. mTACL has about 19,250 dU probes within the vicinity of transcriptional start sites of 5,472 genes, with 170,000 CpGs that are potentially relevant in tumorigenesis. Moreover, the dU probes were designed so that they hybridize specifically to target genomic DNA digested with restriction enzymes MspI and HpyF3I, along with adaptor oligonucleotides complementary to the common primers of dU probes. All cytosines (C) of the adaptor oligonucleotides were substituted with 5'-methyl cytosine (5-mC). dU probes, adaptor oligonucleotides and the target genomic regions were hybridized using the "touchdown annealing" protocol followed by ligation of oligonucleotides to the ends of the target genomic DNA. After ligation, the dU probes were removed by digestion using uracil DNA-glycosylase, leaving only the target genomic DNA ligated to common primers. Later, the target DNA was treated with bisulfite followed by amplification using common primers and hybridization to microarray containing 21-mer probes that span across the CpGs in the target DNA. The extent of CpG methylation is measured using relative signal of two probes (probsets) for each CpG: one corresponding to the case in which CpG(s) covered by the probe are methylated, and the other one to the sequence in which CpG(s) covered by the probe are unmethylated. There are at least 3 different probe sets that cover the same CpG. The resulting hybridization signals were translated into methylation values using logistic regression by fitting models of the relative probe signal to percentage methylation for each CpG. The regression used artificial samples of known CpG methylation (i.e. 0, 10, 25, 50, 75 and 100%) and the quality of fit was assessed with r2.
Identifying epigenetically regulated genes
This paper discusses how a novel computational protocol can be used to integrate CpG methylation and gene expression data sets to systematically identify epigenetically regulated genes. Our assumption is that the effect of DNA methylation on gene expression is local and limited to the promoter region. A computational protocol on the exploratory analysis of epigenetic regulation using coupled methylation and expression data was proposed by Sjahputera et al. [10]. Their work investigated differential methylation hybridization and associated gene expression data to build a relational data space for non-Hodgkin's lymphoma. Fuzzy set theory is used to identify epigenetically regulated genes from the relational data space. In this process, methylation-expression associations were transformed into a logarithmic map, which was divided into four discriminative quadrants. Each quadrant represented one out of four gene regulation behaviors (i.e., hypermethylation and up-regulation; hypomethylation and up-regulation; hypermethylation and down-regulation; and hypomethylation and down-regulation). Clustering was applied to sets of associations, and the epigenetic regulation was determined from the cluster's location and quadrant's membership. A measurement of confidence is then computed from the probabilities involved in the determination of the clusters. This computational framework suffers from a number of limitations in the context of the high-dimensionality mTACL technology: (i) processing time of the high-volume relational data may be prohibitive; (ii) fuzzy clustering approaches are iterative and sensitive to the initial conditions, which may lead to unstable solutions; (iii) the division of quadrants is arbitrary and too rigid to incorporate the natural scale of data; and (iv) confidence in the solution is not established in terms of statistical significance (i.e., p-value).
To overcome the issues described above, we first reduced the dimensionality of the methylation data to alleviate the computational load resulting from the data. Consequently, this enables the efficient correlative analysis and assignment of p-values through permutation analysis that otherwise would be unmanageable in the original space. To this end, we used the following two-step clustering approach: (i) grouping along the genome to reveal regions with high concentration of assayed methylation sites, and (ii) clustering of methylation profiles within each region to identify similar methylation patterns. For the latter, we used spectral clustering, as it offers a number of advantages. For instance, it is noniterative; it can identify clusters along nonlinear boundaries; and it has been proven to outperform other techniques [11, 12]. Its improved performance is attributed to the transformation of data into a higher-dimensional space, which requires less complex problem solving than in the original data space [13]. Here, a K-Spectral Clustering (KSC) is employed, and optimal input parameters are determined automatically. Secondly, associations between clustered methylation and gene expression data sets are produced by setting a fixed constraint of 20,000 base pairs in the vicinity of either 5' or 3' ends to match methylation sites to their genes. Finally, prediction and ranking of epigenetic regulated genes is performed based on logistic regressions of the methylation-expression associations onto an exponential curve. This logistic approach is flexible enough to incorporate any data scale and distribution, and does not contain rigid and arbitrary definitions that could limit its application. Finally, the significance of the logistic regression is verified by permutation analysis and computing the p-value.
Computational protocol
The proposed computational protocol for identifying epigenetically regulated genes consists of three steps (Figure 1): (i) dimensionality reduction of the methylation profile, which is comprised of two sub-steps: (i.i) clustering of methylation profiles on the basis of proximity, and (i.ii) clustering within methylation sub-regions on the basis of similarity; (ii) association of the clustered methylation data to gene expression data; and (iii) logistic regression and ranking of the methylation-expression associations. Software and data can be downloaded at http://vision.lbl.gov/Software.
Dimensionality reduction
Genome-wide methylation measurements in CpG islands produce a high-volume data that make it computationally unmanageable for association, ranking, and required permutation analyses. Therefore, dimensionality reduction is a necessity and is implemented by us with two steps: (i) clustering on the basis of proximity of methylation sites within each chromosome along the genome; and (ii) clustering on the basis of similarity among methylation profiles across cell lines.
-
Clustering on the basis of proximity: In this step, regions of concentration are identified by the proximity of CpG methylation sites along the genome. In each chromosome, methylation sites adjoining within 2,000 base pairs are aggregated and form distinct regions from methylation sites adjoined by more than 2,000 base pairs. Such regions provide a spatial context for methylation sites, grouping and isolating distant chromosomal regions. This is an important step for subsequent clustering based on the similarity of methylation profiles.
-
Clustering on the basis of similarity: In this step, methylation profiles are clustered to identify cross-similarities within each region. Prior to the clustering, however, methylation profiles are pre-processed and represented by the largest principal components, which embed 99% of the data underlying variance. This is a standard approach and well documented in the machine learning literature. Clustering high-dimensional data in their principal component space results in lower computational complexity and lower risk from the curse of dimensionality [14]. The clustering method used here is unsupervised, and based on K-Spectral Clustering (KSC) [13], as discussed below.
Given a set of methylation profiles (s
1
,...s
n
) across l cell line principal components, the algorithm starts by computing an affinity matrix A whose diagonal elements A
ii
= 0 and off-diagonal elements are
Next, the k largest principal components are computed from the matrix
where D is a diagonal matrix whose D
i,i
elements are the sum of A's i-th row. Let X be the n × k matrix that is formed by the k largest principal components of L. K-means clustering [15] is then applied to the normalized matrix Y, whose elements are represented by
. Finally, the methylation profiles S
i
receive the same clustering assignment proposed for Y by k-means, i.e., a profile is assigned to cluster j if and only if row Y
i
is assigned to cluster j.
The above formulation of KSC requires parameter setting for the variables σ and k. σ determines the magnitude of the exponential decay in the computation of the affinity matrix A. Its value plays a role on the determination of boundaries between adjacent clusters. k specifies the number of clusters, which controls the amount of data quantization, but is often difficult to be determined in practice. A simple yet effective strategy to infer these parameters involves clustering with different combinations of the parameters and estimating the compactness of the inferred clusters. One way to characterize cluster compactness is to measure the cluster's internal homogeneity over external heterogeneity [16]. This relation can be mathematically defined by the ratio of
, where W is the maximum distance between a point within a cluster and its center, and B is the minimum distance between two cluster centers. In our implementation, we partition the space of k and σ into fixed intervals, perform KSC for each enumerated pair of variables, and select the pair that produces the minimum measure of compactness. A representative methylation profile for each cluster is then computed by averaging all methylation measurements across cell lines. We tested this approach on synthetic data with linear and nonlinear boundaries to predict the validity of the results. Figure 2 demonstrates the selection of KSC's optimal parameters σ and k for synthetic data equally distributed into three concentric circles. The compactness measures, produced by each pair of parameters, are then normalized and shown as a heat map. The minimum value is marked by the blue box.
Methylation-Expression Association
Each representative methylation site, averaged over members of the same cluster, is associated with a gene or a set of genes. The association uses only the methylation site and the gene's probe set base range. A gene may have multiple probe sets in the expression data, which cover different portions of a chromosome. These associations are created for representative methylation sites being (i) within a gene probe set, or (ii) within a 20,000-base-pair window adjacent to the gene probe set. The latter accounts for natural uncertainties for locating a potential CpG island along the DNA.
Logistic regression and assignment of p-value
In order to characterize a negative correlation between the methylation and expression data (i.e., hypermethylation and down-regulation; and hypomethylation and up-regulation), we perform logistic regression of the methylation-expression associations. To this end, we experimented with several functional representations and propose a generic model, with flexible degrees of freedom, corresponding to an exponential curve of the form:
where E and M are respectively the expression and methylation measurements for each cell line, and a, b, and c are the free variables of the logistic model. Evaluation of the logistic regression on synthetic data reveals that expected inverse relationship between expression and methylation can be correctly ranked (Figure 3). Note that R is the correlation coefficient by which the associations are ranked. It reflects the quality of the logistic regression and, consequently, the method's confidence in an epigenetic regulation.
where SSR = ∑(E
fit
- mean(E))2 and SSE = ∑(E
fit
- E)2, and R = 1 indicates a perfect fit to the model. The corresponding p-value is estimated for each association by computing:
The p-value is computed by comparing the value of R resulting from the curve regression, and the values of R
m
, m = 1, 2,...,M, resulting from M attempts for fitting the same curve after permuting the methylation measurements of each association. In our implementation, M is set at 10,000.