Skip to main content
  • Methodology article
  • Open access
  • Published:

Methylation Linear Discriminant Analysis (MLDA) for identifying differentially methylated CpG islands

Abstract

Background

Hypermethylation of promoter CpG islands is strongly correlated to transcriptional gene silencing and epigenetic maintenance of the silenced state. As well as its role in tumor development, CpG island methylation contributes to the acquisition of resistance to chemotherapy. Differential Methylation Hybridisation (DMH) is one technique used for genome-wide DNA methylation analysis. The study of such microarray data sets should ideally account for the specific biological features of DNA methylation and the non-symmetrical distribution of the ratios of unmethylated and methylated sequences hybridised on the array. We have therefore developed a novel algorithm tailored to this type of data, Methylation Linear Discriminant Analysis (MLDA).

Results

MLDA was programmed in R (version 2.7.0) and the package is available at CRAN [1]. This approach utilizes linear regression models of non-normalised hybridisation data to define methylation status. Log-transformed signal intensities of unmethylated controls on the microarray are used as a reference. The signal intensities of DNA samples digested with methylation sensitive restriction enzymes and mock digested are then transformed to the likelihood of a locus being methylated using this reference. We tested the ability of MLDA to identify loci differentially methylated as analysed by DMH between cisplatin sensitive and resistant ovarian cancer cell lines. MLDA identified 115 differentially methylated loci and 23 out of 26 of these loci have been independently validated by Methylation Specific PCR and/or bisulphite pyrosequencing.

Conclusion

MLDA has advantages for analyzing methylation data from CpG island microarrays, since there is a clear rational for the definition of methylation status, it uses DMH data without between-group normalisation and is less influenced by cross-hybridisation of loci. The MLDA algorithm successfully identified differentially methylated loci between two classes of samples analysed by DMH using CpG island microarrays.

Background

DNA methylation frequently occurs in mammalian DNA at the 5 position of cytosine in CpG dinucleotides. It has been estimated that over 70% of cytosines of CpG dinucleotides are methylated in the human genome. CpG dinucleotides are under-represented in the genome and methylated CpG dinucleotides predominantly occur within repetitive elements [2]. However, there are CpG rich regions of the genome which generally remain unmethylated [3]. These CpG rich regions are known as CpG islands and are frequently located in the promoter or the first exon regions of approximately 60% of all genes [4]. The unmethylated status of CpG islands is thought to be a prerequisite state to maintain the linked gene in an active transcribed and transcriptional permissive state.

Differential Methylation Hybridisation (DMH) is one of several techniques for examining CpG island methylation at a genome-wide scale that has been applied to the identification of aberrantly methylated gene promoters in various cancers [5–12]. Nouzova et al[13] modified the original method by using digestion with a methylation-dependent enzyme, Mcr BC. This enzyme only cleaves methylated CpG DNA sequences. Within-sample comparison is applied after competitive hybridisation with Mcr BC digested DNA and undigested (mock digested) DNA labelled with Cy3 and Cy5. If a locus is unmethylated the signal intensities of Cy3 and Cy5 are equivalent, while if methylated the Cy5/Cy3 (undigested/digested) ratio is greater than one. However, no common reference is generally used in the modified DMH method, and the unequal representation of methylated and unmethylated sequences due to competitive hybridisation may reduce sensitivity and specificity to detect differential methylation.

Currently, Significance Analysis of Microarrays (SAM) [14] and Prediction Analysis for Microarrays (PAM) [15] are commonly applied in DNA methylation analysis. Based on the change in hybridisation relative to the standard deviation of repeated measurements, SAM assigns each gene a score that is an extension of the t-statistic. For significant genes with a score over a certain threshold, SAM uses permutations to estimate the false discovery rate (FDR). It has been implemented in many studies of gene expression data [16–21] as well as DMH data, e.g. Wei et al. [22] applied SAM to find the differential methylation of CpG island loci between ovarian caner patient groups with short and long progression-free survival (PFS). However, SAM assumes that the microarray data conform to approximate normality and symmetry, leading to the loss of power in the analysis of DMH data that are inherently skewed due to the biological features of DNA methylation in cancer and competitive hybridisation on DMH arrays (Figure 1).

Figure 1
figure 1

Distribution of log-transformed ratio of gene expression data in breast cancer and DMH data in A2780 cell line. The left histogram shows the distribution of log-transformed ratios (cy3/cy5) in gene expression profiling data from a previous study of breast cancer 36] which is symmetric, while the right histogram shows the log-transformed ratios (undigested/digested) of DMH data from the present study which is skewed.

In the modified DMH method, the ratios of raw signal intensities (undigested/digested) greater than 1 reflect the various methylation levels [13]. A ratio cut-off is generally used to identify the hypermethylated loci [7]. However, this is an arbitrary value and does not necessarily accurately reflect the various sources of variation in the experiment. It is therefore desirable to develop an algorithm to more objectively assess the methylation status of loci from DMH data.

PAM is a nearest centroid shrinkage method that identifies those genes that discriminate best between classes. This technique shrinks the class gene centroid towards the overall centroid by a "threshold" amount after standardizing each gene by its within class standard deviation. The "threshold" is identified by cross-validation. This approach was applied in the study by Wei et al. [22] and showed certain power in the identification of differentially methylated loci, but PAM is designed for class prediction rather than class comparison. Although the class predictor used in PAM can reflect the difference between classes, a large number of loci actually differentially methylated between the classes are excluded to improve the accuracy of prediction.

Although normalisation has become a standard procedure for the study of microarray data and is necessary for SAM and PAM analysis, unbalanced shifts in methylation status between class samples in DMH limit the use of between-class normalisation which assumes the changes are roughly symmetric. Thus, the differential methylation can be masked by the over-correction of normalisation and it would be preferable to use a method of analysis that does not require normalisation of the data.

Since PAM and SAM may have limitations for analysing DMH data, we have developed an alternative approach based on the specific features and known biological properties of the arrays used for DMH analysis. The algorithm is named as Methylation Linear Discriminant Analysis (MLDA) and has been applied to identify a set of loci differentially methylated between ovarian cisplatin sensitive and resistant cancer cell lines.

Results

Outline of MLDA

In this study, we have developed a novel approach, named MLDA, for analysing CpG island microarray hybridisation data that allows the identification of differentially methylated loci. MLDA was programmed in R (version 2.7.0) and the package is available at CRAN [1]. This approach uses three relatively simple linear regression models. The first one is constructed by the log-transformed signal intensities of unmethylated features and used as the reference for unmethylation (Figure 2b). The second one is the intermediate model constructed through the point corresponding to the 97.5-quantiles residual below the first linear regression line (Figure 2c). The features with a standardised residual less than 2 from this intermediate model are used to generate the third model which is used as the reference for methylation (Figure 2d). The log likelihood ratio of a locus being methylated is then proportional to the difference between the squared standardised residual from the methylated line and that from the unmethylated line. The log likelihood threshold of zero then provides a more rational basis for distinguishing between methylated and unmethylated loci than a robust undigested/digested ratio of 1.5, as it takes into account the observed variability in the experiment.

Figure 2
figure 2

An illustration of unmethylated and methylated model construction in MLDA in A2780 cell line. a: Three patterns can be observed on the scatter plot of log-transformed Cy3 (undigested) against log-transformed Cy5 (digested) intensities. b: The unmethylated model constructed using 94 mitochondrial sequences as a unmethylation reference. c: The intermediate model constructed through the 97.5 quantile residual. The point X is the 97.5 quantile residual. The microarray probes colored in blue (standardised residual to the intermediate model is less than 2) are selected to construct the methylated model. d: Methylated (in blue) and unmethylated (in red) models in A2780 cell line.

In our approach the consistency and inconsistency rates of log likelihood ratios on dye-swapped/duplicate arrays are used to determine methylation and unmethylation cut-offs, which keep the consistency rate (CR) relatively high (about 140%) and the inconsistency rate (IR) low (about 1%). Each loci is assigned a score based on the cut-offs using the weighted methylation scoring scheme. The feature consistently identified as methylated candidates on dye-swapped/duplicate arrays are scored as 1; similarly unmethylated features are scored as -1; the rest of the feature are assigned a weighted score corresponding to their location on the plot of log-likelihood ratios (Figure 3).

Figure 3
figure 3

Weighted scoring scheme. The microarray probes consistently identified as methylated candidates on dye-swap arrays were scored 1; similarly unmethylated microarray probes were scored -1. The rest of the microarray probes were assigned a weighted score based on their location on the plot. LRmeth: log likelihood ratio cut-off for methylated loci; LRunmeth: log likelihood ratio cut-off for unmethylated loci. LR: log likelihood ratio on dye-swapped arrays.

The averaged score for each locus is calculated in each sample class (e.g. resistant or sensitive) and plotted against each other. A robust regression model is then fitted to these data. The standardised residuals from the robust regression model are assumed to follow a normal distribution N(μ, σ2). The outliers of the standardised residuals are identified as the differentially methylated loci between the class samples.

DMH dataset

MLDA was applied to identify the CGIs differentially methylated from DMH data derived from sensitive A2780 derivatives (A2780, A2780p3, A2780p5, A2780p6, A2780p13, A2780p14) and isogenically matched, resistant lines [23] derived by multiple exposures to cytotoxic levels of cisplatin and which are 2–5 fold resistant to cisplatin in clonogenic assays (A2780cp70, A2780/MCP1, A2780/MCP2, A2780/MCP3, A2780/MCP4, A2780/MCP5, A2780/MCP6, A2780/MCP7, A2780/MCP8, A2780/MCP9). After background correction, the log-transformed digested and undigested intensities of the 13056 microarray probes show three approximately parallel linear patterns (Figure 2a). The first pattern (digested/undigested is close to 1) represents the unmethylated sequences. The second pattern represents either hemi-methylated sequences or the unmethylated sequences cross-hybridised with the methylated ones on the panel. The third pattern represents the methylated sequences in target DNA. The methylated and unmethylated loci in target DNA can be characterised by a linear regression model for each pattern. As previously mentioned, normalisation may not be appropriate for DMH data, so the log ratios of signal intensities in two classes of samples are not at the same level (Figure 4). Normalisation is not required for MLDA as the determination of the methylation score is based on the data within each experiment.

Figure 4
figure 4

Box plot of log ratios of undigested signal intensities against digested signal intensities in 16 cell lines (dye-swapped arrays). The boxes colored in red are the A2780 sensitive cell lines; in blue are the A2780 resistant cell lines. As normalisation is not applied, the center and scale of log ratios for the 16 cell lines are not at the same level.

Mitochondrial DNA is unmethylated [24], therefore, the signal intensities of both channels of microarray probes for mitochondrial sequences are expected to be equal. However, a bi-modal distribution is observed in the log-transformed fluorescence ratios (digested/undigested) of 121 mitochondrial sequences. The first peak represents the unmethylated mitochondrial sequences and the second lower peak is assumed to be the mitochondrial sequences cross-hybridised with other methylated sequences on the panel. Thus, we selected 94 of 121 mitochondrial sequences that were consistently unmethylated through all the cell lines and used them as the unmethylation reference in target DNA.

The parameters of those two models in all 16 cell lines were estimated (Table 1). The slope of the unmethylated regression line constructed by 94 mitochondrial sequences is indeed close to 1. After computing the log-likelihood ratios, the methylation and unmethylation cut-offs and associated IRs and CRs were determined from the dye-swapped array pairs (details in Method section). As shown in Figure 5, IR tends to rise with the increase of CR slowly, but starts to increase dramatically when the CR goes above 140%, at which point IR is generally about 1%. We have therefore used CR > 140% and IR < 1% as the criteria for determining the methylation and unmethylation cut-offs. Each locus was scored using the weighted scoring scheme based on those cut-offs. The averaged scores in 6 cisplatin-sensitive cell lines and 10 cisplatin-resistant cell lines were used to construct a robust regression model. Figure 6a shows that the standardised residuals (residual/σ) from the robust regression model roughly follow a normal distribution. The positive and negative outliers are determined as described in Method section.

Figure 5
figure 5

CR against IR in 16 cell lines. X axis is the consistency rate (CR) and y axis is the inconsistency rate (IR). IR tends to rise with the increase of CR slowly, but starts to increase dramatically when the CR goes above 140%, at which point the inconsistency rate is generally about 1%. Not all cell lines could reach this point e.g. MCP3.

Figure 6
figure 6

Outliers identifications. a: Distribution of the observed (histogram) standardised residuals and the theoretical distribution based on the fitted model (dashed smooth line in red). The red and blue solid line are the positive and negative cut-offs, respectively. b: Scatter plot of sensitive scores against resistant scores in A2780 series cell lines. The hypermethylated loci are colored in red and hypomethylated loci are in blue. The robust regression model is Y = 0.9956X + 0.0019.

Table 1 Parameters of linear models in MLDA for 16 cell lines in DMH dataset I

Finally, 115 loci were identified as candidates differentially methylated between A2780 sensitive and these resistant cell lines (additional file 1). Noticeably, 113 of 115 loci (p = 8.8 × 10-3, outlier detection test [25]) were hypermethylated, but only 2 loci (p < 0.001, outlier detection test) lost methylation in the resistant cell lines (Figure 6b). This is consistent with the unbalanced shift in DMH data and indicates cisplatin treatment of cells selects preferentially for hypermethylation of loci, rather than hypomethylation in these tumor cells.

Validation of differential methylation

To confirm the differential methylation of loci identified in this study, we experimentally tested the methylation of 26 loci by methylation-specific PCR (MSP) and/or pyrosequencing of bisulphite modified DNA [26] in sensitive A2780 derivatives and cisplatin resistant derivatives. Twenty-three out of the 26 loci have been confirmed as differentially methylated (additional file 1). It should be noted that MSP and pyrosequencing only examine methylation at a limited number of CpG sites of the sequence present on the DMH analysis. It is possible that the loci which were not confirmed as differentially methylated are methylated at other CpG sites which are detected by DMH but not targeted by MSP and/or pyrosequencing primers and so 23 out of 26 loci confirmed as differentially methylated is a minimum estimation.

To compare the results from MLDA, SAM and PAM, we analysed the DMH dataset by all three methods. MLDA identified 115 loci (113 hypermethylated and 2 hypomethylated loci, misclassification error < 0.001), SAM identified 152 loci (149 hypermethylated and 3 hypomethylated loci, misclassification error = 0.227, FDR = 6.17 × 10-3), and PAM found 24 hypermethylated loci (misclassification error = 0.084, FDR < 0.001) in the resistant cell lines. Twenty-four loci identified by all three methods are listed in Table 2.

Table 2 24 loci identified by MLDA, PAM and SAM as differentially methylated candidates in the comparison between A2780 cisplatin sensitive and cisplatin multiple-selected resistant cell lines.

Discussion

Hypermethylation of promoter CpG islands is strongly correlated to transcriptional gene silencing and epigenetic maintenance of the silenced state and is a potential rich source of biomarkers of cancer. Differential Methylation Hybridisation (DMH) is one technique used for genome-wide DNA methylation analysis. The study of such microarray data sets should ideally account for the specific biological features of DNA methylation and the non-symmetrical distribution of the ratios of unmethylated and methylated sequences hybridised on the array. We have therefore developed a novel algorithm tailored to this type of data, Methylation Linear Discriminant Analysis (MLDA). MLDA utilises log likelihood ratios representing the relative probability that loci are methylated instead of log ratios of signal intensities used in previous studies [6–10, 27]. Validation of 23/26 identified loci using independent methods of methylation analysis shows that MLDA can robustly identify differential methylated loci between ovarian cancer sensitive and resistant cell lines without requiring the data to be normalised.

Although a log likelihood ratio above zero means that the locus tends to be methylated, we did not use zero as the cut-off to determine the number of methylated and unmethylated sequences, as the existence of cross-hybridisation and measurement errors in the DMH assay makes this unreliable. To increase the precision of the methylation classification, we used the inconsistency (IR) and consistency (CR) rates between the dye-swap arrays to determine likelihood ratio cut-offs for methylation and unmethylation and assigned each locus a methylation score based on the position relative to these cut-offs. As shown in Figure 5, not all cell lines can reach the point that CR is around 140% and IR is about 1%. IR and CR need to be carefully selected as the methylation scores of loci are consequently influenced by the change of IR and CR. We also observed a lower CR (about 120%) and a higher IR (about 2%) in another CpG island array using DMH (data not shown), therefore, further examination of what factors influence the achievable CR and IR rates may improve the utility of the MLDA approach.

Data on methylation status for 121 mitochondrial derived sequences were available in this study. Mitochondrial sequences would be expected to be unmethylated. We used 94 mitochondrial sequences to construct unmethylated linear model at the beginning of the study, and indeed, 93 of 121 mitochondrial loci were defined as unmethylated and 25 loci being of uncertain methylation status by MLDA. However, three mitochondrial loci were identified as hypermethylated candidates in the resistant ovarian carcinoma cell lines by both MLDA and SAM. One explanation of this discrepancy is that all these three loci have more than one BLAT hit indicating the existence of homology with nuclear DNA sequences, raising the possibility of hybridisation with these nuclear DNA sequences which may be differentially methylated. As shown in Figure 2a, the loci in the middle pattern represent either hemi-methylated sequences or the unmethylated sequences cross-hybridised with the methylated ones on the panel. No specific allowance is made for these intermediate points in analysis by SAM and PAM, whereas MLDA attempts specifically to down-weight these points in the identification of the methylation regression line. By giving a lower weighted score (close to 0) (Figure 3) to those loci, MLDA reduces the influence of cross-hybridisation among this group of sequences. Of course cross-hybridisation may also occur in the loci in the other two patterns (methylated and unmethylated patterns), but it is not possible for any mathematical approach to identify this.

The misclassification error of MLDA based on the methylation score is much lower than that for either SAM or PAM based on the log ratios, indicating the potential of MLDA methylation scores to be used as a reliable discriminator between classes of samples.

Conclusion

We have developed a novel method, named MLDA, for genome-wide DNA methylation studies. MLDA can transform the signal intensities to log-likelihood ratios through three linear regression models. Using this approach MLDA allows determination of the methylation status of a locus based on dye-swapped/duplicate arrays. The method has been applied to assess the methylation status of each locus and identified 115 loci that exhibit differential methylation between A2780 sensitive and resistant cell lines. A minimum of 23 out of 26 loci have been confirmed by independent methods as differentially methylated.

Methods

First, all intensity values were log transformed. A multiplicative background correction was applied to correct signal intensities for the background noise in each array. After background correction, the log-transformed digested and undigested intensities show three approximately parallel linear patterns (Figure 2a). The first pattern (digested/undigested is close to 1) represents the unmethylated sequences. The second pattern represents either hemi-methylated sequences or the unmethylated sequences cross-hybridised with the methylated ones on the panel. The third pattern represents the methylated sequences in target DNA. The methylated and unmethylated loci in target DNA can be characterised by a linear regression model for each pattern. The distance of each spot to the methylated and unmethylated lines respectively can then be estimated by standardised residuals. The log likelihood ratio of a locus being methylated is then proportional to the difference between the squared standardised residual from the methylated line and that from the unmethylated one. The algorithm based around this regression approach is named Methylation Linear Discriminant Analysis (MLDA) and was programmed in R version 2.7.0.

Log-likelihood ratio transformation

a An univariate linear regression model was constructed for the unmethylated probes (e.g. mitochondrial derived features) using formula (1) where α is the intercept, β is the slope of the model, and ξ is the error representing the unpredicted or unexplained variation in the model (Figure 2b). The parameters of regression line were estimated by the method of least squares (formula 2 and 3).G i = α + βR i + ξ i i = 1,2,3.......k

β ^ = ∑ ( R i − R ¯ ) ( G i − G ¯ ) ∑ ( R i − R ¯ ) 2 i = 1 , 2 , 3....... k MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiqbek7aIzaajaGaeyypa0tcfa4aaSaaaeaadaaeabqaaiabcIcaOiabdkfasnaaBaaabaGaemyAaKgabeaacqGHsislcuWGsbGugaqeaiabcMcaPiabcIcaOiabdEeahnaaBaaabaGaemyAaKgabeaacqGHsislcuWGhbWrgaqeaiabcMcaPaqabeqacqGHris5aaqaamaaqaeabaGaeiikaGIaemOuai1aaSbaaeaacqWGPbqAaeqaaiabgkHiTiqbdkfaszaaraGaeiykaKYaaWbaaeqabaGaeGOmaidaaaqabeqacqGHris5aaaaaOqaaiabdMgaPjabg2da9iabigdaXiabcYcaSiabikdaYiabcYcaSiabiodaZiabc6caUiabc6caUiabc6caUiabc6caUiabc6caUiabc6caUiabc6caUiabdUgaRbaaaaa@56A8@
(2)
α ^ = β ^ R ¯ − G ¯ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqySdeMbaKaacqGH9aqpcuaHYoGygaqcaiqbdkfaszaaraGaeyOeI0Iafm4raCKbaebaaaa@33EA@
(3)

k is the number of unmethylated controls on DMH array. G i and R i are the logarithmic-transformed digested and undigested intensities of microarray probes for mitochondrial sequences, respectively. G ¯ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafm4raCKbaebaaaa@2D04@ and R ¯ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOuaiLbaebaaaa@2D1A@ are the averaged logarithmic-transformed undigested and digested intensities of the k unmethylated controls.

  1. b.

    The scale estimate σmito associated with the error term in the linear regression model was estimated from the residuals from the observed k points to the fitted line. The most extreme 10% of residuals was omitted from either end of the distribution to minimise the impact of extreme residuals on this estimate.

  2. c.

    The standardised residuals of all the microarray probes to the unmethylation regression line were calculated as formula (4).

    S R m i t o = r e s i d u a l s m i t o σ m i t o MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4uamLaemOuai1aaSbaaSqaaiabd2gaTjabdMgaPjabdsha0jabd+gaVbqabaGccqGH9aqpjuaGdaWcaaqaaiabdkhaYjabdwgaLjabdohaZjabdMgaPjabdsgaKjabdwha1jabdggaHjabdYgaSjabdohaZnaaBaaabaGaemyBa0MaemyAaKMaemiDaqNaem4Ba8gabeaaaeaacqaHdpWCdaWgaaqaaiabd2gaTjabdMgaPjabdsha0jabd+gaVbqabaaaaaaa@4F89@
    (4)
  3. d.

    The point corresponding to the 97.5-quantiles residual below the unmethylation line is represented as X (R.975, G.975). The intermediate linear model (Figure 2c) was constructed through point X with a slope assumed to be 1 and the intercept estimated as formula (5).

    α ^ = G .975 − R .975 + 1.96 σ m i t o MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqySdeMbaKaacqGH9aqpcqWGhbWrdaWgaaWcbaGaeiOla4IaeGyoaKJaeG4naCJaeGynaudabeaakiabgkHiTiabdkfasnaaBaaaleaacqGGUaGlcqaI5aqocqaI3aWncqaI1aqnaeqaaOGaey4kaSIaeGymaeJaeiOla4IaeGyoaKJaeGOnayJaeq4Wdm3aaSbaaSqaaiabd2gaTjabdMgaPjabdsha0jabd+gaVbqabaaaaa@465A@
    (5)
  4. e.

    The standardised residuals of all the microarray probes to the line with slope 1 and intercept estimated from (5) were calculated as formula (6). The variance of the residuals to the intermediate model was assumed to be similar as that in the mitochondrial model.

    S R .975 = r e s i d u a l s .975 σ m i t o MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4uamLaemOuai1aaSbaaSqaaiabc6caUiabiMda5iabiEda3iabiwda1aqabaGccqGH9aqpjuaGdaWcaaqaaiabdkhaYjabdwgaLjabdohaZjabdMgaPjabdsgaKjabdwha1jabdggaHjabdYgaSjabdohaZnaaBaaabaGaeiOla4IaeGyoaKJaeG4naCJaeGynaudabeaaaeaacqaHdpWCdaWgaaqaaiabd2gaTjabdMgaPjabdsha0jabd+gaVbqabaaaaaaa@4C0D@
    (6)
  5. f.

    The microarray probes with standardised residuals less than 2 were included for later robust regression analysis. The line estimated from this regression analysis represents the methylation regression line (Figure 2d).

  6. g.

    The scale estimate σmeth of the methylation regression line was estimated using only those microarray probes below the line, with the most extreme 5% removed.

  7. h.

    The standardised residuals of all the microarray probes to the methylated regression line were calculated as formula (7). The log likelihood ratio (LR) of all the microarray probes was estimated by formula (8) for further analysis.

    S R m e t h = r e s i d u a l s m e t h σ m e t h MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4uamLaemOuai1aaSbaaSqaaiabd2gaTjabdwgaLjabdsha0jabdIgaObqabaGccqGH9aqpjuaGdaWcaaqaaiabdkhaYjabdwgaLjabdohaZjabdMgaPjabdsgaKjabdwha1jabdggaHjabdYgaSjabdohaZnaaBaaabaGaemyBa0MaemyzauMaemiDaqNaemiAaGgabeaaaeaacqaHdpWCdaWgaaqaaiabd2gaTjabdwgaLjabdsha0jabdIgaObqabaaaaaaa@4F47@
    (7)

    LR = SR2 mito - SR2 meth

Determination of log likelihood ratio cut-offs

Two inconsistency rates (IRmeth and IRunmeth) and two consistency rates (CRmeth and CRunmeth) between dye-swap arrays were used to determine the log like likelihood ratio threshold. IRmeth (formula 9) represents the rate of the microarray probes identified as methylated in one array but as unmethylated in the other one, while IRunmeth (formula 10) is the rate of the microarray probes identified as unmethylated in one array but as methylated in the other one. CRmeth (formula 11) and CRunmeth (formula 12) are the rates for the spots identified as methylated (CRmeth) and unmethylated (CRunmeth) in both dye-swap arrays (Figure 7).

Figure 7
figure 7

Determination of methylation and unmethylation cut-offs of likelihood ratios on dye-swapped arrays. LRmeth: log likelihood ratio cut-off for methylated spots; LRunmeth: log likelihood ratio cut-off for unmethylated spots.

Table 3 Classification of loci
I R m e t h = c c + f + i + g g + h + i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemysaKKaemOuai1aaSbaaSqaaiabd2gaTjabdwgaLjabdsha0jabdIgaObqabaGccqGH9aqpjuaGdaWcaaqaaiabdogaJbqaaiabdogaJjabgUcaRiabdAgaMjabgUcaRiabdMgaPbaakiabgUcaRKqbaoaalaaabaGaem4zaCgabaGaem4zaCMaey4kaSIaemiAaGMaey4kaSIaemyAaKgaaaaa@4587@
(9)
I R u n m e t h = g a + d + g + c a + b + c MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemysaKKaemOuai1aaSbaaSqaaiabdwha1jabd6gaUjaaykW7cqWGTbqBcqWGLbqzcqWG0baDcqWGObaAaeqaaOGaeyypa0tcfa4aaSaaaeaacqWGNbWzaeaacqWGHbqycqGHRaWkcqWGKbazcqGHRaWkcqWGNbWzaaGccqGHRaWkjuaGdaWcaaqaaiabdogaJbqaaiabdggaHjabgUcaRiabdkgaIjabgUcaRiabdogaJbaaaaa@49BA@
(10)
C R m e t h = i c + f + i + i g + h + i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qamKaemOuai1aaSbaaSqaaiabd2gaTjabdwgaLjabdsha0jabdIgaObqabaGccqGH9aqpjuaGdaWcaaqaaiabdMgaPbqaaiabdogaJjabgUcaRiabdAgaMjabgUcaRiabdMgaPbaakiabgUcaRKqbaoaalaaabaGaemyAaKgabaGaem4zaCMaey4kaSIaemiAaGMaey4kaSIaemyAaKgaaaaa@458B@
(11)
C R m e t h = a a + b + c + a a + d + g MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4qamKaemOuai1aaSbaaSqaaiabd2gaTjabdwgaLjabdsha0jabdIgaObqabaGccqGH9aqpjuaGdaWcaaqaaiabdggaHbqaaiabdggaHjabgUcaRiabdkgaIjabgUcaRiabdogaJbaakiabgUcaRKqbaoaalaaabaGaemyyaegabaGaemyyaeMaey4kaSIaemizaqMaey4kaSIaem4zaCgaaaaa@453B@
(12)

b The log likelihood ratio thresholds (LRmeth and LRunmeth) for methylated and unmethylated microarray probes, which kept the IR rates low (at or close to 1%) and the CR rates high (at or close to 140%), were used as the cut-offs for methylated and unmethylated loci. IR tends to rise with the increase of CR slowly, but starts to increase dramatically when the CR goes above 140%, at which point the inconsistency rate is generally about 1%. We have therefore used CR > 140% and IR < 1% as the criteria for determining the methylation cut-offs.

Identification of robust regression outliers

Each microarray probe was scored based on the cut-offs of likelihood ratios for methylation and unmethylation on dye-swap arrays using the weighted methylation scoring scheme shown in Figure 3. The microarray probes consistently identified as methylated candidates on dye-swap arrays were scored of 1; similarly unmethylated microarray probes were scored of -1. The rest of the microarray probes were assigned a weighted score based on their location on the plot.

A robust regression model [28] was constructed with the averaged scores in one class of samples as the explanatory variable, and the corresponding scores in the other class of samples as the dependent variable. The degree of trimming was determined according to Barnett et al. [29] when estimating the variance of residuals to the robust linear regression model.

It was assumed that the standardised residuals (SRs) from the robust regression line followed a normal distribution N(μ, σ2). μ and σ were estimated excluding outliers using the MAD-Median Rule [30]. The p value for each SR cut-off was calculated as described by Simon et al [25]. This p-value reflects the probability of observing a group of more extreme residuals from the fitted normal distribution. Microarray probes were identified as outliers if their SRs were larger than the cut-off for which the p-value was less than 0.01.

Estimation of misclassification rate

The misclassification rate was estimated by drawing bootstrap samples 500 times with replacement from the two classes (sensitive and resistant) and carrying out hierarchical clustering based on the loci identified as differentially methylated using weighted scores for MLDA and log ratios without between-group normalisation for SAM and PAM, respectively. Clustering was carried out using Euclidean distance as the distance metric, and clusters were agglomerated using the average linkage criterion. The clustering tree was cut into two groups and the number of misclassified cell lines was counted. The misclassification rate was obtained from the averaged number of misclassified samples in 500 bootstraps divided by the total number of samples.

SAM and PAM analysis

The raw signal intensities of each channel were subtracted by the median signal intensities of corresponding channel of controls on HCGI12K array. After this correction, SAM in samr package and PAM in pamr package were applied using log ratios (digested/undigested) in R version 2.7.0. Between-group normalisation was not used in SAM and PAM to avoid over-correction masking the differential methylation.

References

  1. [http://cran.r-project.org/]

  2. Cooper DN, Krawczak M: Cytosine methylation and the fate of CpG dinucleotides in vertebrate genomes. Hum Genet 1989, 83(2):181–188. 10.1007/BF00286715

    Article  CAS  PubMed  Google Scholar 

  3. Bird AP: CpG-rich islands and the function of DNA methylation. Nature 1986, 321(6067):209–213. 10.1038/321209a0

    Article  CAS  PubMed  Google Scholar 

  4. Antequera F, Bird A: Number of CpG islands and genes in human and mouse. Proc Natl Acad Sci U S A 1993, 90(24):11995–11999. 10.1073/pnas.90.24.11995

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Huang TH, Perry MR, Laux DE: Methylation profiling of CpG islands in human breast cancer cells. Hum Mol Genet 1999, 8(3):459–470. 10.1093/hmg/8.3.459

    Article  CAS  PubMed  Google Scholar 

  6. Yan PS, Chen CM, Shi H, Rahmatpanah F, Wei SH, Caldwell CW, Huang TH: Dissecting complex epigenetic alterations in breast cancer using CpG island microarrays. Cancer Res 2001, 61(23):8375–8380.

    CAS  PubMed  Google Scholar 

  7. Wei SH, Chen CM, Strathdee G, Harnsomburana J, Shyu CR, Rahmatpanah F, Shi H, Ng SW, Yan PS, Nephew KP, Brown R, Huang TH: Methylation microarray analysis of late-stage ovarian carcinomas distinguishes progression-free survival in patients and identifies candidate epigenetic markers. Clin Cancer Res 2002, 8(7):2246–2252.

    CAS  PubMed  Google Scholar 

  8. Yan PS, Chen CM, Shi H, Rahmatpanah F, Wei SH, Huang TH: Applications of CpG island microarrays for high-throughput analysis of DNA methylation. J Nutr 2002, 132(8 Suppl):2430S-2434S.

    CAS  PubMed  Google Scholar 

  9. Yan PS, Efferth T, Chen HL, Lin J, Rodel F, Fuzesi L, Huang TH: Use of CpG island microarrays to identify colorectal tumors with a high degree of concurrent methylation. Methods 2002, 27(2):162–169. 10.1016/S1046-2023(02)00070-1

    Article  CAS  PubMed  Google Scholar 

  10. Yan PS, Wei SH, Huang TH: Differential methylation hybridization using CpG island arrays. Methods Mol Biol 2002, 200: 87–100.

    CAS  PubMed  Google Scholar 

  11. Rahmatpanah FB, Carstens S, Guo J, Sjahputera O, Taylor KH, Duff D, Shi H, Davis JW, Hooshmand SI, Chitma-Matsiga R, Caldwell CW: Differential DNA methylation patterns of small B-cell lymphoma subclasses with different clinical behavior. Leukemia 2006, 20(10):1855–1862. 10.1038/sj.leu.2404345

    Article  CAS  PubMed  Google Scholar 

  12. Zighelboim I, Goodfellow PJ, Schmidt AP, Walls KC, Mallon MA, Mutch DG, Yan PS, Huang TH, Powell MA: Differential methylation hybridization array of endometrial cancers reveals two novel cancer-specific methylation markers. Clin Cancer Res 2007, 13(10):2882–2889. 10.1158/1078-0432.CCR-06-2367

    Article  CAS  PubMed  Google Scholar 

  13. Nouzova M, Holtan N, Oshiro MM, Isett RB, Munoz-Rodriguez JL, List AF, Narro ML, Miller SJ, Merchant NC, Futscher BW: Epigenomic changes during leukemia cell differentiation: analysis of histone acetylation and cytosine methylation using CpG island microarrays. J Pharmacol Exp Ther 2004, 311(3):968–981. 10.1124/jpet.104.072488

    Article  CAS  PubMed  Google Scholar 

  14. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116–5121. 10.1073/pnas.091062498

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A 2002, 99(10):6567–6572. 10.1073/pnas.082099299

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  16. Leamey CA, Glendining KA, Kreiman G, Kang ND, Wang KH, Fassler R, Sawatari A, Tonegawa S, Sur M: Differential gene expression between sensory neocortical areas: potential roles for Ten_m3 and Bcl6 in patterning visual and somatosensory pathways. Cereb Cortex 2008, 18(1):53–66. 10.1093/cercor/bhm031

    Article  PubMed  Google Scholar 

  17. Flavell JR, Baumforth KR, Wood VH, Davies GL, Wei W, Reynolds GM, Morgan S, Boyce A, Kelly GL, Young LS, Murray PG: Down-regulation of the TGF-beta target gene, PTPRK, by the Epstein-Barr virus encoded EBNA1 contributes to the growth and survival of Hodgkin lymphoma cells. Blood 2008, 111(1):292–301. 10.1182/blood-2006-11-059881

    Article  CAS  PubMed  Google Scholar 

  18. Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR, Darling G, Keshavjee S, Waddell TK, Liu N, Lau D, Penn LZ, Shepherd FA, Jurisica I, Der SD, Tsao MS: Three-gene prognostic classifier for early-stage non small-cell lung cancer. J Clin Oncol 2007, 25(35):5562–5569. 10.1200/JCO.2007.12.0352

    Article  PubMed  Google Scholar 

  19. Lapointe J, Li C, Giacomini CP, Salari K, Huang S, Wang P, Ferrari M, Hernandez-Boussard T, Brooks JD, Pollack JR: Genomic profiling reveals alternative genetic pathways of prostate tumorigenesis. Cancer Res 2007, 67(18):8504–8510. 10.1158/0008-5472.CAN-07-0673

    Article  CAS  PubMed  Google Scholar 

  20. Goold R, Hubank M, Hunt A, Holton J, Menon RP, Revesz T, Pandolfo M, Matilla-Duenas A: Down-regulation of the dopamine receptor D2 in mice lacking ataxin 1. Hum Mol Genet 2007, 16(17):2122–2134. 10.1093/hmg/ddm162

    Article  CAS  PubMed  Google Scholar 

  21. Cheishvili D, Maayan C, Smith Y, Ast G, Razin A: IKAP/hELP1 deficiency in the cerebrum of familial dysautonomia patients results in down regulation of genes involved in oligodendrocyte differentiation and in myelination. Hum Mol Genet 2007, 16(17):2097–2104. 10.1093/hmg/ddm157

    Article  CAS  PubMed  Google Scholar 

  22. Wei SH, Balch C, Paik HH, Kim YS, Baldwin RL, Liyanarachchi S, Li L, Wang Z, Wan JC, Davuluri RV, Karlan BY, Gifford G, Brown R, Kim S, Huang TH, Nephew KP: Prognostic DNA methylation biomarkers in ovarian cancer. Clin Cancer Res 2006, 12(9):2788–2794. 10.1158/1078-0432.CCR-05-1551

    Article  CAS  PubMed  Google Scholar 

  23. Anthoney DA, McIlwrath AJ, Gallagher WM, Edlin AR, Brown R: Microsatellite instability, apoptosis, and loss of p53 function in drug-resistant tumor cells. Cancer Res 1996, 56(6):1374–1381.

    CAS  PubMed  Google Scholar 

  24. Maekawa M, Taniguchi T, Higashi H, Sugimura H, Sugano K, Kanno T: Methylation of mitochondrial DNA is not a useful marker for cancer detection. Clin Chem 2004, 50(8):1480–1481. 10.1373/clinchem.2004.035139

    Article  CAS  PubMed  Google Scholar 

  25. Simon RM, Korn EL, McShane LM, Radmacher MD, Wright GW, Zhao Y: Design and Analysis of DNA Microarray Investigations. In Statistics for Biology and Health. New York , Springer; 2003.

    Google Scholar 

  26. Herman JG, Graff JR, Myohanen S, Nelkin BD, Baylin SB: Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci U S A 1996, 93(18):9821–9826. 10.1073/pnas.93.18.9821

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  27. Wei SH, Brown R, Huang TH: Aberrant DNA methylation in ovarian cancer: is there an epigenetic predisposition to drug response? Ann N Y Acad Sci 2003, 983: 243–250.

    Article  CAS  PubMed  Google Scholar 

  28. Tukey JW: Exploratory Data Analysis. Rading Massachusetts , Addison-Wesley; 1977.

    Google Scholar 

  29. Wilcox RR: Robust Regression. In Robust Esitmation and Hypothesis Testing. Edited by: Holland BA. London , Elsevier Academic Press; 2005.

    Google Scholar 

  30. Wilcox RR: Estimating Measures of Location and Scale. In Robust Esitmation and Hypothesis Testing. Edited by: Holland BA. London , Elsevier Academic Press; 2005.

    Google Scholar 

  31. Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol 1987, 196(2):261–282. 10.1016/0022-2836(87)90689-9

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by a CR-UK programme grant to RB, NIH R21 grant CA110475 to RB and TH, and an Ovarian Cancer Action project grant to RB, JMT and JG.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Robert Brown.

Additional information

Authors' contributions

WD conducted the statistical analysis and algorithm development supervised by JP and RB. The DMH data was produced by JMT in collaboration with TH and PY. RB, JP and KV conceived the study. JMT, JG and CZ conducted validation by MSP or pyrosequencing in RB's lab. WD, JP and RB prepared the manuscript with review by all authors. Funding was obtained by RB and TH.

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Dai, W., Teodoridis, J.M., Graham, J. et al. Methylation Linear Discriminant Analysis (MLDA) for identifying differentially methylated CpG islands. BMC Bioinformatics 9, 337 (2008). https://doi.org/10.1186/1471-2105-9-337

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-9-337

Keywords