We use two DMH microarray datasets, generated from 40 breast cancer cell lines and 26 ovarian cancer patients. Both use 2-color 244K Agilent arrays, on which the test sample (e.g., a breast cancer cell line) is labeled with Cy5 (red) and a common normal reference is labeled with Cy3 (green). The base-2 log ratio of red to green intensity, log_{2}(Cy5/Cy3), is used as the observed methylation signal at each probe. For each array, dye effects are corrected using the standard within-array LOESS normalization in the Bioconductor package "limma" [23]. We explored several normalization methods and found that the standard LOESS normalization produces more consistent and reliable results than the others (data not shown).
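As a small illustration of the signal definition above, the following sketch computes log_{2}(Cy5/Cy3) for a few probes. The intensities and the function name `log_ratio` are hypothetical, and the sketch does not perform the LOESS normalization, which the text delegates to "limma".

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(Cy5 / Cy3) for one probe; both intensities must be positive."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

# Hypothetical (red, green) intensities for three probes.
probes = [(1200.0, 300.0), (500.0, 500.0), (150.0, 600.0)]
signals = [log_ratio(red, green) for red, green in probes]
print(signals)
```

A positive value means the test sample is brighter than the reference at that probe, and a negative value means the opposite.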

In a typical DMH experiment, it is desirable to identify CGIs that are hypermethylated in a large percentage of the N samples (e.g., N cancer patients or N cancer cell lines). One important goal of our DMH microarray study is therefore to identify the CGIs that are commonly methylated across the N samples (N = 40 for the breast cancer data and N = 26 for the ovarian cancer data). To control the noise from measured and unmeasured factors that may affect the signals, such as GC content, scanner effects, and PCR effects, we apply the following quantile regression model to each CGI:

Q_{Ysp}(\tau \mid sample_{s}, probe_{p}) = sample_{s}(\tau) + probe_{p}(\tau)

where Q_{Ysp}(τ|*sample*_{s}, *probe*_{p}) is the τ-th conditional quantile of the observed probe log ratio of sample *s* at probe *p*, *sample*_{s} represents the expected signal from the sample, and *probe*_{p} denotes the probe effect. In this quantile regression model, the error terms are assumed to be independent, and no particular distribution is assumed for them. The regression coefficients *sample*_{s} and *probe*_{p} are estimated by formulating the quantile regression problem as a linear program [22]. Both parameter estimation and inference are conducted using the R package "quantreg" [22]. An example of using this package to fit a quantile regression model for one CpG island is provided (see Additional file 1).
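To make the quantile regression objective concrete, here is a minimal sketch of the check (pinball) loss that the linear program minimizes. It covers only the simplest case, a single constant fitted to a handful of hypothetical log ratios, and is not the "quantreg" implementation or the full sample-plus-probe model. The function names and data are assumptions made for illustration.

```python
def check_loss(u, tau):
    """Check loss: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(values, q, tau):
    """Sum of check losses of the residuals (y - q)."""
    return sum(check_loss(y - q, tau) for y in values)

def best_constant(values, tau):
    """Return the observed value that minimizes the total check loss."""
    return min(sorted(set(values)), key=lambda q: total_loss(values, q, tau))

# Hypothetical log ratios for one CGI (11 values, so there are no ties).
log_ratios = [-0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.9, 1.4, 2.0, 3.1, 3.5]

q80 = best_constant(log_ratios, 0.80)  # 9th smallest value
q50 = best_constant(log_ratios, 0.50)  # the median
print(q80, q50)
```

Raising τ penalizes under-prediction more heavily than over-prediction, so the fitted value moves toward the upper tail of the data. This is why quantile levels above 50% emphasize high methylation signals.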

In the above regression model, we let τ = 95%, 90%, 85%, 80%, 75%, 70%, 65% and 60%. We choose quantile levels above 50% because we are interested in identifying hypermethylated regions. For each sample (or cell line) effect in the quantile regression output, there is a p-value for the null hypothesis that *sample*_{s}(τ) = 0; a small p-value indicates that the sample (or cell line) shows a significant methylation signal at that CGI. The methylation level at each CGI is taken as the number of samples whose p-values are less than a cutoff value p_{0}, where we let p_{0} = 0.05, 0.04, 0.03, 0.02, and 0.01. For example, if a CGI has p-values less than 0.01 in 38 out of 40 breast cancer arrays, this CGI may have very strong methylation signals across many samples.
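The counting rule above can be sketched as follows. The p-values are hypothetical and `methylation_score` is an illustrative name; in the actual analysis the p-values would come from the sample effects in the quantile regression output.

```python
def methylation_score(p_values, cutoff):
    """Count the samples whose sample-effect p-value is below the cutoff."""
    return sum(1 for p in p_values if p < cutoff)

# Hypothetical p-values for one CGI across six samples.
p_values = [0.001, 0.012, 0.2, 0.03, 0.0004, 0.049]

# The five cutoff values p_0 used in the text.
cutoffs = [0.05, 0.04, 0.03, 0.02, 0.01]
scores = {c: methylation_score(p_values, c) for c in cutoffs}
print(scores)
```

Because the comparison is strict, a p-value exactly equal to the cutoff is not counted, and the score can only stay the same or fall as the cutoff becomes more stringent.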

To verify that our quantile regression model can identify real methylation signals, and to compare the results of our regression models at different quantile levels, we use known methylated genes and housekeeping genes as "positive" and "negative" controls, respectively. Thirty known hypermethylated genes [24–27] have been reported for breast cancer, and 32 known hypermethylated genes have been reported for ovarian cancer [28]. For both breast and ovarian cancer, 47 housekeeping genes [29] are selected as "negative" controls (i.e., known unmethylated genes) because of their low methylation signals. Recall that the methylation score given to each CGI is the count of samples with a p-value less than a cutoff point, so at each p-value cutoff point p_{0} we have a methylation score for each CGI. This gives N_{m} and N_{HK} methylation scores, with N_{m} = 30 for the breast cancer data, N_{m} = 32 for the ovarian cancer data, and N_{HK} = 47 for the unmethylated housekeeping genes. We choose these N_{m} and N_{HK} genes because each of them is associated with at least one CGI. Therefore, this paper refers to these genes as the N_{m} methylated and N_{HK} unmethylated CGIs.

To determine whether known methylated and unmethylated CGIs are identified correctly, we use three statistical measurements. The first measurement is the area under a Receiver Operating Characteristic (ROC) curve, which we call "AUC" (Area Under Curve). An ROC curve is a plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold varies; equivalently, it plots the true positive rate (TPR) against the false positive rate (FPR). In this paper, the TPR is the fraction of known methylated CGIs that are correctly classified as methylated at a specific methylation score level C_{0} (0 ≤ C_{0} ≤ N). The FPR is the fraction of known unmethylated housekeeping CGIs that are incorrectly classified as methylated at the same level C_{0}. The second measurement is the difference between the mean methylation scores of the two groups. We call this measurement "mean.diff", that is, {\overline{x}}_{m}-{\overline{x}}_{HK}, where {\overline{x}}_{m} and {\overline{x}}_{HK} are the mean methylation scores for the known methylated and unmethylated housekeeping CGIs. The third measurement, which we call "T.stat", is the mean difference divided by its standard error. That is, \frac{{\overline{x}}_{m}-{\overline{x}}_{HK}}{\sqrt{{s}_{m}^{2}/{N}_{m}+{s}_{HK}^{2}/{N}_{HK}}}, where {\overline{x}}_{m} and {\overline{x}}_{HK} are the means, and {s}_{m}^{2} and {s}_{HK}^{2} the variances, of the methylation scores for the known methylated and housekeeping CGIs, respectively. At each quantile level τ, the larger a measurement is, the stronger the evidence that this quantile level separates methylated from unmethylated CGIs.
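The three measurements can be sketched as follows for two small groups of hypothetical methylation scores. The AUC is computed with the pairwise-comparison (Mann-Whitney) form, which equals the area under the ROC curve when ties count one half; this is an illustrative choice, not necessarily the routine used for the reported results. All function names and data are assumptions.

```python
import math
from statistics import mean, variance

def auc(pos, neg):
    """AUC as the fraction of (pos, neg) pairs ranked correctly; ties count 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diff(pos, neg):
    """mean.diff: difference between the group mean methylation scores."""
    return mean(pos) - mean(neg)

def t_stat(pos, neg):
    """T.stat: mean difference divided by its standard error."""
    se = math.sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return mean_diff(pos, neg) / se

# Hypothetical methylation scores at one quantile level and one cutoff.
methylated = [38, 35, 30, 22, 12]    # known methylated CGIs
housekeeping = [2, 0, 5, 12, 1]      # known unmethylated CGIs

print(auc(methylated, housekeeping))
print(mean_diff(methylated, housekeeping))
print(t_stat(methylated, housekeeping))
```

Repeating this calculation for each quantile level τ and each cutoff p_{0} gives the values that are compared across quantile levels.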